IBM Research - Austin
facilityAustin, Texas, United States
Research output, citation impact, and the most-cited recent papers from IBM Research - Austin (United States). Aggregated across the NobleBlocks index of 300M+ scholarly works.
Top-cited papers from IBM Research - Austin
FFTW is an implementation of the discrete Fourier transform (DFT) that adapts to the hardware in order to maximize performance. This paper shows that such an approach can yield an implementation that is competitive with hand-optimized libraries, and describes the software structure that makes our current FFTW3 version flexible and adaptive. We further discuss a new algorithm for real-data DFTs of prime size, a new way of implementing DFTs by means of machine-specific single-instruction, multiple-data (SIMD) instructions, and how a special-purpose compiler can derive optimized implementations of the discrete cosine and sine transforms automatically from a DFT algorithm.
Inspired by the brain's structure, we have developed an efficient, scalable, and flexible non-von Neumann architecture that leverages contemporary silicon technology. To demonstrate, we built a 5.4-billion-transistor chip with 4096 neurosynaptic cores interconnected via an intrachip network that integrates 1 million programmable spiking neurons and 256 million configurable synapses. Chips can be tiled in two dimensions via an interchip communication interface, seamlessly scaling the architecture to a cortexlike sheet of arbitrary size. The architecture is well suited to many applications that use complex neural networks in real time, for example, multiobject detection and classification. With 400-pixel-by-240-pixel video input at 30 frames per second, the chip consumes 63 milliwatts.
This survey covers rollback-recovery techniques that do not require special language constructs. In the first part of the survey we classify rollback-recovery protocols into checkpoint-based and log-based. Checkpoint-based protocols rely solely on checkpointing for system state restoration. Checkpointing can be coordinated, uncoordinated, or communication-induced. Log-based protocols combine checkpointing with logging of nondeterministic events, encoded in tuples called determinants . Depending on how determinants are logged, log-based protocols can be pessimistic, optimistic, or causal. Throughout the survey, we highlight the research issues that are at the core of rollback-recovery and present the solutions that currently address them. We also compare the performance of different rollback-recovery protocols with respect to a series of desirable properties and discuss the issues that arise in the practical implementations of these protocols.
The new era of cognitive computing brings forth the grand challenge of developing systems capable of processing massive amounts of noisy multisensory data. This type of intelligent computing poses a set of constraints, including real-time operation, low-power consumption and scalability, which require a radical departure from conventional system design. Brain-inspired architectures offer tremendous promise in this area. To this end, we developed TrueNorth, a 65 mW real-time neurosynaptic processor that implements a non-von Neumann, low-power, highly-parallel, scalable, and defect-tolerant architecture. With 4096 neurosynaptic cores, the TrueNorth chip contains 1 million digital neurons and 256 million synapses tightly interconnected by an event-driven routing infrastructure. The fully digital 5.4 billion transistor implementation leverages existing CMOS scaling trends, while ensuring one-to-one correspondence between hardware and software. With such aggressive design metrics and the TrueNorth architecture breaking path with prevailing architectures, it is clear that conventional computer-aided design (CAD) tools could not be used for the design. As a result, we developed a novel design methodology that includes mixed asynchronous-synchronous circuits and a complete tool flow for building an event-driven, low-power neurosynaptic chip. The TrueNorth chip is fully configurable in terms of connectivity and neural parameters to allow custom configurations for a wide range of cognitive and sensory perception applications. To reduce the system's communication energy, we have adapted existing application-agnostic very large-scale integration CAD placement tools for mapping logical neural networks to the physical neurosynaptic core locations on the TrueNorth chips. With that, we have successfully demonstrated the use of TrueNorth-based systems in multiple applications, including visual object recognition, with higher performance and orders of magnitude lower power consumption than the same algorithms run on von Neumann architectures. The TrueNorth chip and its tool flow serve as building blocks for future cognitive systems, and give designers an opportunity to develop novel brain-inspired architectures and systems based on the knowledge obtained from this paper.
This paper examines the effect of technology scaling and microarchitectural trends on the rate of soft errors in CMOS memory and logic circuits. We describe and validate an end-to-end model that enables us to compute the soft error rates (SER) for existing and future microprocessor-style designs. The model captures the effects of two important masking phenomena, electrical masking and latching-window masking, which inhibit soft errors in combinational logic. We quantify the SER due to high-energy neutrons in SRAM cells, latches, and logic circuits for feature sizes from 600 nm to 50 nm and clock periods from 16 to 6 fan-out-of-4 inverter delays. Our model predicts that the SER per chip of logic circuits will increase nine orders of magnitude from 1992 to 2011 and at that point will be comparable to the SER per chip of unprotected memory elements. Our result emphasizes that computer system designers must address the risks of soft errors in logic circuits for future designs.
Recently, learning based hashing techniques have attracted broad research interests because they can support efficient storage and retrieval for high-dimensional data such as images, videos, documents, etc. However, a major difficulty of learning to hash lies in handling the discrete constraints imposed on the pursued hash codes, which typically makes hash optimizations very challenging (NP-hard in general). In this work, we propose a new supervised hashing framework, where the learning objective is to generate the optimal binary hash codes for linear classification. By introducing an auxiliary variable, we reformulate the objective such that it can be solved substantially efficiently by employing a regularization algorithm. One of the key steps in this algorithm is to solve a regularization sub-problem associated with the NP-hard binary optimization. We show that the sub-problem admits an analytical solution via cyclic coordinate descent. As such, a high-quality discrete solution can eventually be obtained in an efficient computing manner, therefore enabling to tackle massive datasets. We evaluate the proposed approach, dubbed Supervised Discrete Hashing (SDH), on four large image datasets and demonstrate its superiority to the state-of-the-art hashing methods in large-scale image retrieval.
Cloud computing makes extensive use of virtual machines because they permit workloads to be isolated from one another and for the resource usage to be somewhat controlled. In this paper, we explore the performance of traditional virtual machine (VM) deployments, and contrast them with the use of Linux containers. We use KVM as a representative hypervisor and Docker as a container manager. Our results show that containers result in equal or better performance than VMs in almost all cases. Both VMs and containers require tuning to support I/Ointensive applications. We also discuss the implications of our performance results for future cloud architectures.
To instill greater confidence in computations outsourced to the cloud, clients should be able to verify the correctness of the results returned. To this end, we introduce Pinocchio, a built system for efficiently verifying general computations while relying only on cryptographic assumptions. With Pinocchio, the client creates a public evaluation key to describe her computation; this setup is proportional to evaluating the computation once. The worker then evaluates the computation on a particular input and uses the evaluation key to produce a proof of correctness. The proof is only 288 bytes, regardless of the computation performed or the size of the inputs and outputs. Anyone can use a public verification key to check the proof. Crucially, our evaluation on seven applications demonstrates that Pinocchio is efficient in practice too. Pinocchio's verification time is typically 10ms: 5-7 orders of magnitude less than previous work; indeed Pinocchio is the first general-purpose system to demonstrate verification cheaper than native execution (for some apps). Pinocchio also reduces the worker's proof effort by an additional 19-60x. As an additional feature, Pinocchio generalizes to zero-knowledge proofs at a negligible cost over the base protocol. Finally, to aid development, Pinocchio provides an end-to-end toolchain that compiles a subset of C into programs that implement the verifiable computation protocol.
The IBM POWER4 is a new microprocessor organized in a system structure that includes new technology to form systems. The name POWER4 as used in this context refers not only to a chip, but also to the structure used to interconnect chips to form systems. In this paper we describe the processor microarchitecture as well as the interconnection architecture employed to form systems up to a 32-way symmetric multiprocessor.
Recent changes in CMOS device structures and materials motivated by impending atomistic and quantum-mechanical limitations have profoundly influenced the nature of delay and power variability. Variations in process, temperature, power supply, wear-out, and use history continue to strongly influence delay. The manner in which tolerance is specified and accommodated in high-performance design changes dramatically as CMOS technologies scale beyond a 90-nm minimum lithographic linewidth. In this paper, predominant contributors to variability in new CMOS devices are surveyed, and preferred approaches to mitigate their sources of variability are proposed. Process-, device-, and circuit-level responses to systematic and random components of tolerance are considered. Exploratory, novel structures emerging as evolutionary CMOS replacements are likely to change the nature of variability in the coming generations.
A CELL processor is a multi-core chip consisting of a 64b power architecture processor, multiple streaming processors, a flexible IO interface, and a memory interface controller. This SoC is implemented in 90nm SOI technology. The chip is designed with a high degree of modularity and reuse to maximize the custom circuit content and achieve a high-frequency clock-rate.
Condition monitoring of dynamic systems based on vibration signatures has generally relied upon Fourier-based analysis as a means of translating vibration signals in the time domain into the frequency domain. However, Fourier analysis provided a poor representation of signals well localized in time. In this case, it is difficult to detect and identify the signal pattern from the expansion coefficients because the information is diluted across the whole basis. The wavelet packet transform (WPT) is introduced as an alternative means of extracting time-frequency information from vibration signatures. The resulting WPT coefficients provide one with arbitrary time-frequency resolution of a signal. With the aid of statistical-based feature selection criteria, many of the feature components containing little discriminant information could be discarded, resulting in a feature subset having a reduced number of parameters without compromising the classification performance. The extracted reduced dimensional feature vector is then used as input to a neural network classifier. This significantly reduces the long training time that is often associated with the neural network classifier and improves its generalization capability.
The CLP( ℛ ) programming language is defined, its underlying philosophy and programming methodology are discussed, important implementation issues are explored in detail, and finally, a prototype interpreter is described. CLP( ℛ ) is designed to be an instance of the Constraint Logic Programming Scheme, a family of rule-based constraint programming languages defined by Jaffar and Lassez. The domain of computation ℛ of this particular instance is the algebraic structure consisting of uninterpreted functors over real numbers. An important property of CLP( ℛ )is that the constraints are treated uniformly in the sense that they are used to specify the input parameters to a program, they are the only primitives used in the execution of a program, and they are used to describe the output of a program. Implementation of a CLP language, and of CLP( ℛ ) in particular, raises new problems in the design of a constraint-solver. For example, the constraint solver must be incremental in the sense that solving additional constraints must not entail the resolving of old constraints. In our system, constraints are filtered through an inference engine, an engine/solver interface, an equation solver and an inequality solver. This sequence of modules reflects a classification and prioritization of the classes of constraints. Modules solving higher priority constraints are isolated from the complexities of modules solving lower priority constraints. This multiple-phase solving of constraints, together with a set of associated algorithms, gives rise to a practical system.
Three-dimensional (3D) silicon integration of active devices with through-silicon vias (TSVs), thinned silicon, and silicon-to-silicon fine-pitch interconnections offers many product benefits. Advantages of these emerging 3D silicon integration technologies can include the following: power efficiency, performance enhancements, significant product miniaturization, cost reduction, and modular design for improved time to market. IBM research activities are aimed at providing design rules, structures, and processes that make 3D technology manufacturable for chips used in actual products on the basis of data from test-vehicle (i.e., prototype) design, fabrication, and characterization demonstrations. Three-dimensional integration can be applied to a wide range of interconnection densities (<10/cm <sup xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">2</sup> to 10 <sup xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">8</sup> /cm <sup xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">2</sup> ), requiring new architectures for product optimization and multiple options for fabrication. Demonstration test structures, which are designed, fabricated, and characterized, are used to generate experimental data, establish models and design guidelines, and help define processes for future product consideration. This paper 1) reviews technology integration from a historical perspective, 2) describes industry-wide progress in 3D technology with examples of TSV and silicon-silicon interconnection advancement over the last 10 years, 3) highlights 3D technology from IBM, including demonstration test vehicles used to develop ground rules, collect data, and evaluate reliability, and 4) provides examples of 3D emerging industry product applications that could create marketable systems.
This monograph presents our recent results on the proportional-integr- derivative (PID) controller and its design, analysis, and synthesis. The fo cus is on linear time-invariant plants that may conta
Servers: high-end, multiprocessor systems running commercial workloads, have typically included extensive cooling systems and resided in custom-built rooms for high-power delivery. Recently, as transistor density and demand for computing resources have rapidly increased, even these high-end systems face energy-use constraints. Commercial-server energy management now focuses on conserving power in the memory and microprocessor subsystems. Because their workloads are typically structured as multiple application programs, system-wide approaches are more applicable to multiprocessor environments in commercial servers than techniques that primarily apply to single-application environments, such as those based on compiler optimizations.
Server farms today consume more than 1.5% of the total electricity in the U.S. at a cost of nearly $4.5 billion. Given the rising cost of energy, many industries are now seeking solutions for how to best make use of their available power. An important question which arises in this context is how to distribute available power among servers in a server farm so as to get maximum performance.
This paper considers the problem of stabilizing a first-order plant with dead time using a proportional-integral-derivative (PID) controller. Using a version of the Hermite-Biehler theorem that is applicable to quasi-polynomials, the complete set of stabilizing PID parameters is determined for both open-loop stable and unstable plants. The range of admissible proportional gains is first determined in closed form. For each proportional gain in this range, the stabilizing set in the space of the integral and derivative gains is shown to be either a trapezoid, a triangle or a quadrilateral. For the case of an open-loop unstable plant, a necessary and sufficient condition on the time delay is determined for the existence of stabilizing PID controllers.
This survey shows that heterogeneous server clusters can be made more efficient by conserving power and energy while exploiting information from the service level, such as request priorities established by service-level agreements.
Magnetic random access memory (MRAM) is a promising memory technology, which has fast read access, high density, and non-volatility. Using 3D heterogeneous integrations, it becomes feasible and cost-efficient to stack MRAM atop conventional chip multiprocessors (CMPs). However, one disadvantage of MRAM is its long write latency and its high write energy. In this paper, we first stack MRAM-based L2 caches directly atop CMPs and compare it against SRAM counterparts in terms of performance and energy. We observe that the direct MRAM stacking might harm the chip performance due to the aforementioned long write latency and high write energy. To solve this problem, we then propose two architectural techniques: read-preemptive write buffer and SRAM-MRAM hybrid L2 cache. The simulation result shows that our optimized MRAM L2 cache improves performance by 4.91% and reduces power by 73.5%compared to the conventional SRAM L2 cache with the similar area.