NobleBlocks

Intel (India)

Company · Bengaluru, Karnataka, India

Research output, citation impact, and the most-cited recent papers from Intel (India). Aggregated across the NobleBlocks index of 300M+ scholarly works.

Total works: 1.4K
Citations: 33.2K
h-index: 75
i10-index: 586
Also known as: Intel (India)
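
The h-index and i10-index figures above follow standard bibliometric definitions. The sketch below shows how such values are conventionally computed from a list of per-paper citation counts. The sample counts are hypothetical, and the exact counting rules used by NobleBlocks are not stated on this page.

```python
def h_index(citations):
    """Largest h such that at least h works have h or more citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def i10_index(citations):
    """Number of works with at least 10 citations."""
    return sum(1 for count in citations if count >= 10)


# Hypothetical citation counts, for illustration only (not Intel (India) data).
sample = [25, 12, 10, 8, 3, 0]
print(h_index(sample), i10_index(sample))
```

An h-index of 75 therefore means that at least 75 indexed works have 75 or more citations each.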

Top-cited papers from Intel (India)

Efficient Architecture-Aware Acceleration of BWA-MEM for Multicore Systems
Md. Vasimuddin, Sanchit Misra, Heng Li, Srinivas Aluru
2019 · 2.7K citations · doi:10.1109/ipdps.2019.00041

Innovations in Next-Generation Sequencing are enabling generation of DNA sequence data at ever faster rates and at very low cost. For example, the Illumina NovaSeq 6000 sequencer can generate 6 Terabases of data in less than two days, sequencing nearly 20 Billion short DNA fragments called reads at the low cost of $1000 per human genome. Large sequencing centers typically employ hundreds of such systems. Such high-throughput and low-cost generation of data underscores the need for commensurate acceleration in downstream computational analysis of the sequencing data. A fundamental step in downstream analysis is mapping of the reads to a long reference DNA sequence, such as a reference human genome. Sequence mapping is a compute-intensive step that accounts for more than 30% of the overall time of the GATK (Genome Analysis ToolKit) best practices workflow. BWA-MEM is one of the most widely used tools for sequence mapping and has tens of thousands of users. In this work, we focus on accelerating BWA-MEM through an efficient architecture-aware implementation, while maintaining identical output. The volume of data requires distributed computing and is usually processed on clusters or cloud deployments with multicore processors usually being the platform of choice. Since the application can be easily parallelized across multiple sockets (even across distributed memory systems) by simply distributing the reads equally, we focus on performance improvements on a single socket multicore processor. BWA-MEM run time is dominated by three kernels, collectively responsible for more than 85% of the overall compute time.
We improved the performance of the three kernels by 1) using techniques to improve cache reuse, 2) simplifying the algorithms, 3) replacing many small memory allocations with a few large contiguous ones to improve hardware prefetching of data, 4) software prefetching of data, and 5) utilization of SIMD wherever applicable and massive reorganization of the source code to enable these improvements. As a result, we achieved nearly 2x, 183x, and 8x speedups on the three kernels, respectively, resulting in up to 3.5x and 2.4x speedups on end-to-end compute time over the original BWA-MEM on single thread and single socket of Intel Xeon Skylake processor. To the best of our knowledge, this is the highest reported speedup over BWA-MEM (running on a single CPU) while using a single CPU or a single CPU-single GPGPU/FPGA combination.
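
The gap between the per-kernel speedups (nearly 2x, 183x, and 8x) and the end-to-end speedups (up to 3.5x and 2.4x) is the kind of effect Amdahl's law describes. The sketch below is a minimal illustration, not the authors' performance model: the abstract only states that the three kernels together account for more than 85% of compute time, so the per-kernel split used here is a hypothetical assumption.

```python
def overall_speedup(fractions, speedups):
    """Amdahl's law for several accelerated parts of a program.

    fractions: share of original run time spent in each accelerated part.
    speedups:  speedup factor achieved on each of those parts.
    The remaining share of run time is assumed to be unchanged.
    """
    if len(fractions) != len(speedups):
        raise ValueError("fractions and speedups must have the same length")
    accelerated = sum(fractions)
    if not 0.0 <= accelerated <= 1.0:
        raise ValueError("fractions must sum to a value between 0 and 1")
    remaining = 1.0 - accelerated
    new_time = remaining + sum(f / s for f, s in zip(fractions, speedups))
    return 1.0 / new_time


# Hypothetical split of the 85% across the three kernels (assumption,
# not taken from the paper); speedups are the rounded figures quoted above.
hypothetical_fractions = [0.30, 0.30, 0.25]
kernel_speedups = [2.0, 183.0, 8.0]
print(f"{overall_speedup(hypothetical_fractions, kernel_speedups):.2f}x")
```

Under any split, a kernel that speeds up by only about 2x and the untouched remainder of the run time bound the end-to-end gain, which is why a 183x kernel speedup does not translate into a comparable overall figure.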

An 80-Tile Sub-100-W TeraFLOPS Processor in 65-nm CMOS
Sriram Vangal, Jason Howard, Greg Ruhl, Saurabh Dighe +4 more
2008 · IEEE Journal of Solid-State Circuits · 627 citations · doi:10.1109/jssc.2007.910957

This paper describes an integrated network-on-chip architecture containing 80 tiles arranged as an 8x10 2-D array of floating-point cores and packet-switched routers, both designed to operate at 4 GHz. Each tile has two pipelined single-precision floating-point multiply accumulators (FPMAC) which feature a single-cycle accumulation loop for high throughput. The on-chip 2-D mesh network provides a bisection bandwidth of 2 Terabits/s. The 15-FO4 design employs mesochronous clocking, fine-grained clock gating, dynamic sleep transistors, and body-bias techniques. In a 65-nm eight-metal CMOS process, the 275 mm² custom design contains 100 M transistors. The fully functional first silicon achieves over 1.0 TFLOPS of performance on a range of benchmarks while dissipating 97 W at 4.27 GHz and 1.07 V supply.
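
The headline numbers in this abstract can be related with simple arithmetic. The sketch below estimates the theoretical peak from the tile count, the FPMAC count, and the clock, then derives a lower bound on energy efficiency from the measured figures. It assumes each FPMAC retires one multiply and one add per cycle (2 FLOPs), which is a conventional counting rule and is not stated in the abstract.

```python
TILES = 80
FPMACS_PER_TILE = 2
FLOPS_PER_FPMAC_PER_CYCLE = 2  # assumption: multiply + accumulate = 2 FLOPs
CLOCK_HZ = 4.27e9              # measured operating point from the abstract
POWER_W = 97.0                 # power at that operating point
MEASURED_TFLOPS = 1.0          # "over 1.0 TFLOPS", so this is a lower bound

# Theoretical peak under the counting assumption above.
peak_flops = TILES * FPMACS_PER_TILE * FLOPS_PER_FPMAC_PER_CYCLE * CLOCK_HZ

# Lower bound on efficiency, since the measured figure is "over" 1.0 TFLOPS.
gflops_per_watt = MEASURED_TFLOPS * 1000.0 / POWER_W

print(f"peak: {peak_flops / 1e12:.2f} TFLOPS")
print(f"efficiency: at least {gflops_per_watt:.1f} GFLOPS/W")
```

Under that assumption the peak at 4.27 GHz is about 1.37 TFLOPS, so the reported "over 1.0 TFLOPS" corresponds to a substantial fraction of peak on the benchmarks measured.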

IDD: A Dataset for Exploring Problems of Autonomous Navigation in Unconstrained Environments
Girish Varma, Anbumani Subramanian, Anoop Namboodiri, Manmohan Chandraker +1 more
2019 · 331 citations · doi:10.1109/wacv.2019.00190

While several datasets for autonomous navigation have become available in recent years, they have tended to focus on structured driving environments. This usually corresponds to well-delineated infrastructure such as lanes, a small number of well-defined categories for traffic participants, low variation in object or background appearance and strong adherence to traffic rules. We propose IDD, a novel dataset for road scene understanding in unstructured environments where the above assumptions are largely not satisfied. It consists of 10,004 images, finely annotated with 34 classes collected from 182 drive sequences on Indian roads. The label set is expanded in comparison to popular benchmarks such as Cityscapes, to account for new classes. It also reflects label distributions of road scenes significantly different from existing datasets, with most classes displaying greater within-class diversity. Consistent with real driving behaviors, it also identifies new classes such as drivable areas besides the road. We propose a new four-level label hierarchy, which allows varying degrees of complexity and opens up possibilities for new training methods. Our empirical study provides an in-depth analysis of the label characteristics. State-of-the-art methods for semantic segmentation achieve much lower accuracies on our dataset, demonstrating its distinction compared to Cityscapes. Finally, we propose that our dataset is an ideal opportunity for new problems such as domain adaptation, few-shot learning and behavior prediction in road scenes.

A 4.6Tbits/s 3.6GHz single-cycle NoC router with a novel switch allocator in 65nm CMOS
Amit Kumar, Partha Kundu, A. Singh, Li-Shiuan Peh +1 more
2007 · 271 citations · doi:10.1109/iccd.2007.4601881

As chip multiprocessors (CMPs) become the only viable way to scale up and utilize the abundant transistors made available in current microprocessors, the design of on-chip networks is becoming critically important. These networks face unique design constraints and are required to provide extremely fast and high bandwidth communication, yet meet tight power and area budgets. In this paper, we present a detailed design of our on-chip network router targeted at a 36-core shared-memory CMP system in 65nm technology. Our design targets an aggressive clock frequency of 3.6GHz, thus posing tough design challenges that led to several unique circuit and microarchitectural innovations and design choices, including a novel high throughput and low latency switch allocation mechanism, a non-speculative single-cycle router pipeline which uses advanced bundles to remove control setup overhead, a low-complexity virtual channel allocator and a dynamically-managed shared buffer design which uses prefetching to minimize critical path delay. Our router takes up 1.19 mm² area and expends 551 mW power at 10% activity, delivering a single-cycle no-load latency at 3.6GHz clock frequency while achieving a peak switching data rate in excess of 4.6Tbits/s per router node.

A 280mV-to-1.2V wide-operating-range IA-32 processor in 32nm CMOS
Shailendra Jain, Surhud Khare, Satish Yada, V Ambili +4 more
2012 · 265 citations · doi:10.1109/isscc.2012.6176932

Near-threshold computing brings the promise of an order of magnitude improvement in energy efficiency over the current generation of microprocessors [1]. However, frequency degradation due to aggressive voltage scaling may not be acceptable across all single-threaded or performance-constrained applications. Enabling the processor to operate over a wide voltage range helps to achieve best possible energy efficiency while satisfying varying performance demands of the applications. This paper describes an IA-32 processor fabricated in 32nm CMOS technology [2], demonstrating a reliable ultra-low voltage operation and energy efficient performance across the wide voltage range from 280mV to 1.2V.

Privacy in mobile technology for personal healthcare
Sasikanth Avancha, Amit Baxi, David Kotz
2012 · ACM Computing Surveys · 235 citations · doi:10.1145/2379776.2379779

Information technology can improve the quality, efficiency, and cost of healthcare. In this survey, we examine the privacy requirements of mobile computing technologies that have the potential to transform healthcare. Such mHealth technology enables physicians to remotely monitor patients' health and enables individuals to manage their own health more easily. Despite these advantages, privacy is essential for any personal monitoring technology. Through an extensive survey of the literature, we develop a conceptual privacy framework for mHealth, itemize the privacy properties needed in mHealth systems, and discuss the technologies that could support privacy-sensitive mHealth systems. We end with a list of open research questions.

A multimode 76-to-81GHz automotive radar transceiver with autonomous monitoring
Brian Ginsburg, Karthik Subburaj, Sreekiran Samala, Karthik Ramasubramanian +4 more
2018 · 188 citations · doi:10.1109/isscc.2018.8310232

Radar is a key sensing technology for advanced driver assistance systems and autonomous vehicles due to its strong detection capability, long range, and robustness to environmental variations such as inclement weather and lighting extremes. As these radars demonstrate increased levels of integration and performance [1,3], it is desired to have a single multimode radar transceiver that can address the stringent form factor constraints of corner radars and the wide and narrow field-of-view requirements of front radars used in urban and highway driving, respectively. This paper presents a single-chip 76-to-81 GHz radar transceiver, which utilizes frequency-modulated continuous wave (FMCW) synthesis, 3 transmitters, and 4 receivers with integrated ADCs, built in a 45nm CMOS technology. It achieves high resolution and flexible multimode operation to address all classes of short, medium, and long range. The design also features autonomous fault monitoring of the RF chain to support system-level functional safety.

A Single-Inductor Multiple-Output Switcher With Simultaneous Buck, Boost, and Inverted Outputs
Pradipta Patra, Amit Patra, Neeraj Kumar Misra
2011 · IEEE Transactions on Power Electronics · 175 citations · doi:10.1109/tpel.2011.2169813

Portable applications require multiple supplies with different output levels and some applications also require negative outputs. Single-inductor multiple-output (SIMO) switchers are a good replacement for existing parallel output configurations. This study presents an SIMO dc–dc converter capable of generating buck, boost, and inverted outputs simultaneously. Because the operation of this class of converter is driven by the ripple in the inductor current, the conventional averaging method does not work well. An inductor current ripple-based modeling approach has been proposed to accurately model and analyze the converter. The control, cross-coupling, and cross-regulation transfer functions, generated through the model, accurately represent the performance of the converter. The proof of concept has been carried out with discrete components on an in-house built PCB and the experimental results validating the steady state and ac responses of the converter are presented.

Technology comparison for large last-level caches (L³Cs): Low-leakage SRAM, low write-energy STT-RAM, and refresh-optimized eDRAM
Mu-Tien Chang, Paul Rosenfeld, Shih‐Lien Lu, Bruce Jacob
2013 · 170 citations · doi:10.1109/hpca.2013.6522314

Large last-level caches (L³Cs) are frequently used to bridge the performance and power gap between processor and memory. Although traditional processors implement caches as SRAMs, technologies such as STT-RAM (MRAM), and eDRAM have been used and/or considered for the implementation of L³Cs. Each of these technologies has inherent weaknesses: SRAM is relatively low density and has high leakage current; STT-RAM has high write latency and write energy consumption; and eDRAM requires refresh operations. As future processors are expected to have larger last-level caches, the goal of this paper is to study the trade-offs associated with using each of these technologies to implement L³Cs. In order to make useful comparisons between SRAM, STT-RAM, and eDRAM L³Cs, we model them in detail and apply low power techniques to each of these technologies to address their respective weaknesses. We optimize SRAM for low leakage and optimize STT-RAM for low write energy. Moreover, we classify eDRAM refresh-reduction schemes into two categories and demonstrate the effectiveness of using dead-line prediction to eliminate unnecessary refreshes. A comparison of these technologies through full-system simulation shows that the proposed refresh-reduction method makes eDRAM a viable, energy-efficient technology for implementing L³Cs.

Relyzer
Siva Kumar Sastry Hari, Sarita V. Adve, Helia Naeimi, Pradeep Ramachandran
2012 · 146 citations · doi:10.1145/2150976.2150990

Future microprocessors need low-cost solutions for reliable operation in the presence of failure-prone devices. A promising approach is to detect hardware faults by deploying low-cost monitors of software-level symptoms of such faults. Recently, researchers have shown these mechanisms work well, but there remains a non-negligible risk that several faults may escape the symptom detectors and result in silent data corruptions (SDCs). Most prior evaluations of symptom-based detectors perform fault injection campaigns on application benchmarks, where each run simulates the impact of a fault injected at a hardware site at a certain point in the application's execution (application fault site). Since the total number of application fault sites is very large (trillions for standard benchmark suites), it is not feasible to study all possible faults. Previous work therefore typically studies a randomly selected sample of faults. Such studies do not provide any feedback on the portions of the application where faults were not injected. Some of those instructions may be vulnerable to SDCs, and identifying them could allow protecting them through other means if needed.

Practical Machine Learning with Python
Dipanjan Sarkar, Raghav Bali, Tushar Sharma
2017 · Apress eBooks · 143 citations · doi:10.1007/978-1-4842-3207-1

This book is your perfect companion for learning the art and science of machine learning to become a successful practitioner. The concepts, techniques, tools, frameworks, and methodologies used in this book will teach you how to think, design, build, and execute machine learning systems.

Petascale High Order Dynamic Rupture Earthquake Simulations on Heterogeneous Supercomputers
Alexander Heinecke, Alexander Breuer, Sebastian Rettenberger, Michael Bäder +4 more
2014 · 139 citations · doi:10.1109/sc.2014.6

We present an end-to-end optimization of the innovative Arbitrary high-order DERivative Discontinuous Galerkin (ADER-DG) software SeisSol targeting Intel® Xeon Phi coprocessor platforms, achieving unprecedented earthquake model complexity through coupled simulation of full frictional sliding and seismic wave propagation. SeisSol exploits unstructured meshes to flexibly adapt for complicated geometries in realistic geological models. Seismic wave propagation is solved simultaneously with earthquake faulting in a multiphysical manner leading to a heterogeneous solver structure. Our architecture-aware optimizations deliver up to 50% of peak performance, and introduce an efficient compute-communication overlapping scheme shadowing the multiphysics computations. SeisSol delivers near-optimal weak scaling, reaching 8.6 DP-PFLOPS on 8,192 nodes of the Tianhe-2 supercomputer. Our performance model projects reaching 18 to 20 DP-PFLOPS on the full Tianhe-2 machine. Of special relevance to modern civil engineering needs, our pioneering simulation of the 1992 Landers earthquake shows highly detailed rupture evolution and ground motion at frequencies up to 10 Hz.

Control Scheme for Reduced Cross-Regulation in Single-Inductor Multiple-Output DC–DC Converters
Pradipta Patra, Jyotirmoy Ghosh, Amit Patra
2012 · IEEE Transactions on Industrial Electronics · 135 citations · doi:10.1109/tie.2012.2227895

Single-inductor multiple-output (SIMO) dc-dc switching regulators are potentially a very good replacement for multiple parallel converters in today's power management units for portable applications where multiple supplies are required. Because the outputs in these converters are coupled, cross-regulation among the outputs plays a major role in deciding the performance of the system. This paper proposes a control scheme that ensures good load and line regulation and stable system dynamics and reduces the cross-regulation effect significantly. In designing a control scheme, proper analysis of the system is an important factor, and because the SIMO class of converters is driven by a ripple in the inductor current, conventional modeling does not hold. Consequently, a ripple-based modeling approach that accurately judges the system performance is adopted. A cross-derivative state feedback control methodology has been proposed so as to completely decouple the outputs. Finally, a single-inductor dual-output SIMO converter has been built on a printed circuit board using discrete components, and the test results presented validate the modeling technique proposed. The simulation and experimental results show that the proposed control scheme significantly reduces cross-regulation at the outputs.

Within-Die Variation-Aware Dynamic-Voltage-Frequency-Scaling With Optimal Core Allocation and Thread Hopping for the 80-Core TeraFLOPS Processor
Saurabh Dighe, Sriram Vangal, Paolo Aseron, Sachin Sharma Ashok Kumar +4 more
2010 · IEEE Journal of Solid-State Circuits · 135 citations · doi:10.1109/jssc.2010.2080550

In this paper, we present measured within-die core-to-core Fmax and leakage variation data for an 80-core processor in 65 nm CMOS and 1) populate a parameterized energy/performance model to determine the most energy-efficient operating point for a workload; 2) examine impacts of per-core clock and power gating on optimal dynamic voltage-frequency-core scaling (DVFCS) operating points; and 3) compare improvements in energy efficiency achievable by variation-aware DVFCS and core mapping on Single-Voltage/Multiple-Frequency (SVMF), Multiple-Voltage/Single-Frequency (MVSF) and Multiple-Voltage/Multiple-Frequency (MVMF) designs. Variation-aware DVFS with optimal core mapping is shown to improve energy efficiency 6%-35% across a range of compute/communication activity workloads. A new dynamic thread hopping scheme boosts performance by 5%-10% or energy efficiency by 20%-60%.

Design and Implementation of the Linpack Benchmark for Single and Multi-node Systems Based on Intel® Xeon Phi Coprocessor
Alexander Heinecke, Karthikeyan Vaidyanathan, Mikhail Smelyanskiy, Alexander Kobotov +4 more
2013 · 128 citations · doi:10.1109/ipdps.2013.113

Dense linear algebra has been traditionally used to evaluate the performance and efficiency of new architectures. This trend has continued for the past half decade with the advent of multi-core processors and hardware accelerators. In this paper we describe how several flavors of the Linpack benchmark are accelerated on Intel's recently released Intel® Xeon Phi™ co-processor (code-named Knights Corner) in both native and hybrid configurations. Our native DGEMM implementation takes full advantage of Knights Corner's salient architectural features and successfully utilizes close to 90% of its peak compute capability. Our native Linpack implementation running entirely on Knights Corner employs novel dynamic scheduling and achieves close to 80% efficiency - the highest published co-processor efficiency. Similarly to native, our single-node hybrid implementation of Linpack also achieves nearly 80% efficiency. Using dynamic scheduling and an enhanced look-ahead scheme, this implementation scales well to a 100-node cluster, on which it achieves over 76% efficiency while delivering the total performance of 107 TFLOPS.
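
Linpack efficiency is the ratio of delivered to theoretical peak performance, so the cluster figures in this abstract imply an approximate aggregate peak. The sketch below works that out. Because the abstract says "over 76%", the results are upper bounds rather than exact values, and the per-node figure is an average over hybrid nodes, not a hardware specification.

```python
DELIVERED_TFLOPS = 107.0  # total delivered performance from the abstract
EFFICIENCY = 0.76         # "over 76%", so implied peaks are upper bounds
NODES = 100

# efficiency = delivered / peak, so peak = delivered / efficiency
implied_peak_tflops = DELIVERED_TFLOPS / EFFICIENCY
implied_peak_per_node = implied_peak_tflops / NODES
delivered_per_node = DELIVERED_TFLOPS / NODES

print(f"implied cluster peak: at most {implied_peak_tflops:.1f} TFLOPS")
print(f"implied peak per node: at most {implied_peak_per_node:.2f} TFLOPS")
print(f"delivered per node: {delivered_per_node:.2f} TFLOPS")
```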

A privacy framework for mobile health and home-care systems
David Kotz, Sasikanth Avancha, Amit Baxi
2009 · 126 citations · doi:10.1145/1655084.1655086

In this paper, we consider the challenge of preserving patient privacy in the context of mobile healthcare and home-care systems, that is, the use of mobile computing and communications technologies in the delivery of healthcare or the provision of at-home medical care and assisted living. This paper makes three primary contributions. First, we compare existing privacy frameworks, identifying key differences and shortcomings. Second, we identify a privacy framework for mobile healthcare and home-care systems. Third, we extract a set of privacy properties intended for use by those who design systems and applications for mobile healthcare and home-care systems, linking them back to the privacy principles. Finally, we list several important research questions that the community should address. We hope that the privacy framework in this paper can help to guide the researchers and developers in this community, and that the privacy properties provide a concrete foundation for privacy-sensitive systems and applications for mobile healthcare and home-care systems.

Banana Plant Disease Classification Using Hybrid Convolutional Neural Network
K. Lakshmi Narayanan, R. Santhana Krishnan, Harold Robinson, E. Golden Julie +3 more
2022 · Computational Intelligence and Neuroscience · 125 citations · doi:10.1155/2022/9153699

Banana cultivation is one of the main agricultural activities in India, but the crop is affected by several diseases, and signs of pests and infection need to be detected early to avoid financial loss to farmers. The problem affects overall banana productivity and directly affects the economy of the country. To overcome these issues, a hybrid convolutional neural network (CNN) for banana disease detection and classification is proposed, which guides farmers on the fertilizers to use to prevent the disease in its initial stages. The proposed technique shows 99% accuracy when compared with related deep learning techniques.

Physical Insight Toward Heat Transport and an Improved Electrothermal Modeling Framework for FinFET Architectures
Mayank Shrivastava, Manish Agrawal, Sunny Mahajan, Harald Goßner +3 more
2012 · IEEE Transactions on Electron Devices · 105 citations · doi:10.1109/ted.2012.2188296

We report on the thermal failure of fin-shaped field-effect transistor (FinFET) devices under the normal operating condition. Pre- and post-failure characteristics are investigated. A detailed physical insight on the lattice heating and heat flux in a 3-D front end of the line and complex back end of line (of a logic circuit network) is given for bulk/silicon-on-insulator (SOI) FinFET and extremely thin SOI devices using 3-D TCAD. Moreover, the self-heating behavior of both the planar and nonplanar devices is compared. Even bulk FinFET shows critical self-heating. Layout, device, and technology design guidelines (based on complex 3-D TCAD) are given for a robust on-chip thermal management. Finally, an improved framework is proposed for an accurate electrothermal modeling of various FinFET device architectures by taking into account all major heat flux paths.

Application-to-core mapping policies to reduce memory system interference in multi-core systems
Reetuparna Das, Rachata Ausavarungnirun, Onur Mutlu, Akhilesh Kumar +1 more
2013 · 103 citations · doi:10.1109/hpca.2013.6522311

Future many-core processors are likely to concurrently execute a large number of diverse applications. How these applications are mapped to cores largely determines the interference between these applications in critical shared hardware resources. This paper proposes new application-to-core mapping policies to improve system performance by reducing inter-application interference in the on-chip network and memory controllers. The major new ideas of our policies are to: 1) map network-latency-sensitive applications to separate parts of the network from network-bandwidth-intensive applications such that the former can make fast progress without heavy interference from the latter, 2) map those applications that benefit more from being closer to the memory controllers close to these resources. Our evaluations show that, averaged over 128 multiprogrammed workloads of 35 different benchmarks running on a 64-core system, our final application-to-core mapping policy improves system throughput by 16.7% over a state-of-the-art baseline, while also reducing system unfairness by 22.4% and average interconnect power consumption by 52.3%.

A 2 Tb/s 6 × 4 Mesh Network for a Single-Chip Cloud Computer With DVFS in 45 nm CMOS
Praveen Salihundam, Shailendra Jain, Tiju Jacob, Sunil Kumar +4 more
2011 · IEEE Journal of Solid-State Circuits · 102 citations · doi:10.1109/jssc.2011.2108121

A packet-switched 6 × 4 2-D mesh network providing 2 Tb/s of bisectional bandwidth with a per-hop latency of 4 cycles forms the high performance communication fabric for a Single-Chip Cloud Computer (SCC) with 48 Pentium™ class IA-32 cores. The fabric operates on an independent power supply and frequency domain. The router micro-architecture achieves over 90% network utilization by effective use of a single-cycle Wrapped Wave-Front Allocator (WWFA) and virtual channel (VC) flow control. A router transit latency of 2 ns is achieved through early buffer write, route pre-computation and a single-cycle WWFA implementation. This 640 K transistor, 1.32 mm² router operates at 2 GHz at 1.1 V while dissipating 550 mW. The 24-node mesh network with 1.28 Tb/s router and 16B, 5.4 mm wide links consumes only 5% of the chip area, 1.2% of the transistors and 10% of total chip power at 1.1 V in a 45 nm nine-metal CMOS process. The router energy efficiency scales from 1.3 Tb/s/W to 7.2 Tb/s/W over a dynamic voltage range from 0.7 V to 1.25 V.
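
The bandwidth and latency figures in this abstract are mutually consistent, which a short calculation makes visible. The sketch below rests on two assumptions that the abstract does not state: each router has five ports (four mesh neighbours plus one local port), and bisection bandwidth counts both directions of every link cut when the mesh is split across its longer dimension.

```python
LINK_BYTES = 16        # "16B" links from the abstract
CLOCK_HZ = 2e9         # router clock at 1.1 V
ROUTER_PORTS = 5       # assumption: 4 mesh neighbours + 1 local port
MESH_ROWS = 4          # 6 x 4 mesh: 6 columns, 4 rows
HOP_CYCLES = 4         # per-hop latency in cycles
ROUTER_POWER_W = 0.55  # 550 mW at 2 GHz, 1.1 V

# Bandwidth of one link in one direction.
link_bps = LINK_BYTES * 8 * CLOCK_HZ

# Aggregate router bandwidth across all assumed ports.
router_bps = ROUTER_PORTS * link_bps

# Splitting the 6 columns into 3 + 3 cuts one link per row; count both
# directions of each cut link (assumption about how bisection is counted).
bisection_bps = MESH_ROWS * 2 * link_bps

hop_latency_ns = HOP_CYCLES / CLOCK_HZ * 1e9
router_tbps_per_watt = router_bps / 1e12 / ROUTER_POWER_W

print(f"link: {link_bps / 1e9:.0f} Gb/s per direction")
print(f"router: {router_bps / 1e12:.2f} Tb/s")
print(f"bisection: {bisection_bps / 1e12:.2f} Tb/s")
print(f"per-hop latency: {hop_latency_ns:.1f} ns")
print(f"router efficiency at 1.1 V: {router_tbps_per_watt:.2f} Tb/s/W")
```

Under these assumptions the router figure comes out at 1.28 Tb/s, the bisection at about 2 Tb/s, and the per-hop latency at 2 ns, matching the abstract. The derived efficiency at 1.1 V falls inside the reported 1.3 to 7.2 Tb/s/W range.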