An Annotated Bibliography on Compute-in-Memory and Hyperdimensional Computing for Low-Power Machine Learning
Context and Motivation
This annotated bibliography surveys recent work at the intersection of software systems and analog compute-in-memory (CiM) hardware, emphasizing resistive and phase-change crossbars for low-power machine learning.
The selected papers demonstrate that CiM architectures can drastically reduce energy by eliminating data movement between memory and compute units, yet they leave open problems in compiler design, calibration, and training for non-ideal devices. In parallel, hyperdimensional computing (HDC) offers an algorithmic framework inherently tolerant to noise and low precision, making it a natural fit for analog substrates.
Collectively, the works summarized below define key challenges—non-ideality modeling, energy accounting, and algorithm–hardware co-design—and prefigure where new software abstractions can make a substantive systems contribution.
References
[1] Shafiee et al. (2016)
Shafiee, A., Nag, A., Muralimanohar, N., Balasubramonian, R., Strachan, J. P., Hu, M., Williams, R. S., & Srikumar, V. (2016). ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars. Proceedings of the 43rd Annual International Symposium on Computer Architecture (ISCA), 14–26. ACM. DOI
This paper introduces ISAAC, a complete system architecture that integrates memristor crossbars as in-situ analog vector–matrix multipliers within a pipelined accelerator for convolutional neural networks (CNNs) and fully connected deep neural networks (DNNs). The authors demonstrate substantial energy and throughput gains over prior digital accelerators by eliminating data movement between memory and compute and by using bit-serial encoding matched to on-chip ADC precision.
ISAAC has since become the canonical baseline for analog compute-in-memory research: its hierarchical array organization, tile-level dataflow, and device-interface assumptions underpin most later simulators and energy models. The design also incorporates per-layer tile mapping, pipeline balancing through weight replication, and multi-bit encoding strategies that reduce ADC/DAC conversion cost while maintaining throughput. On CNN and DNN workloads, ISAAC reports 14.8× higher throughput, 5.5× lower energy, and 7.5× higher computational density than the digital DaDianNao architecture.
These design principles point to extensions of ISAAC’s mapping strategies into a more programmable, analog-aware compilation flow. Relevant reported metrics and derived approximations include 128 × 128 crossbar arrays, 16-bit operands processed bit-serially with 8-bit ADCs, an energy efficiency of roughly 1 pJ per MAC, and a peak power efficiency of about 2.3 TOPS/W. Future work could validate ISAAC-style mappings under modern transformer workloads and tighter ADC/DAC budgets to confirm energy gains at current scales.
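To make the bit-serial encoding concrete, the sketch below (a simplified model, not ISAAC's implementation) applies one input bit per cycle to an ideal 128 × 128 crossbar and accumulates shifted partial sums in the digital periphery; device non-idealities and ADC quantization are omitted.

```python
# A minimal sketch (not ISAAC's actual implementation) of bit-serial
# vector-matrix multiplication on an ideal 128x128 crossbar, assuming
# unsigned fixed-point inputs streamed one bit per cycle and ideal readout.
import numpy as np

def bit_serial_vmm(x_int, G, n_bits=16):
    """x_int: integer input vector (len 128); G: 128x128 conductance matrix."""
    acc = np.zeros(G.shape[1])
    for b in range(n_bits):                  # one crossbar read per input bit
        x_bit = (x_int >> b) & 1             # 1-bit "DAC": apply bit b of each input
        partial = x_bit @ G                  # analog column currents (ideal model)
        acc += partial * (1 << b)            # shift-and-add in the digital periphery
    return acc

rng = np.random.default_rng(0)
x = rng.integers(0, 2**16, size=128)
G = rng.integers(0, 4, size=(128, 128)).astype(float)   # e.g., 2-bit cells
assert np.allclose(bit_serial_vmm(x, G), x @ G)          # matches the exact product
```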
[2] Ankit et al. (2019)
Ankit, A., El Hajj, I., Chalamalasetti, S. R., Ndu, G., Foltin, M., Williams, R. S., Faraboschi, P., Hwu, W.-M. W., Strachan, J. P., Roy, K., & Milojicic, D. S. (2019). PUMA: A programmable ultra-efficient memristor-based accelerator for machine learning inference. Proceedings of the 24th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’19), 715–731. DOI
This paper presents PUMA, a programmable memristor-based accelerator that extends the fixed-function ISAAC architecture into a general-purpose analog compute-in-memory (CiM) platform. The authors introduce an instruction set architecture (ISA) and a compiler that together enable diverse machine-learning workloads—such as CNNs, LSTMs, and fully connected networks—to execute efficiently across large crossbar arrays. On benchmarked workloads, PUMA achieved up to 2,446× higher energy efficiency and 66× lower latency compared with then-state-of-the-art GPUs (circa 2019), underscoring the substantial performance and sustainability potential of analog CiM accelerators relative to contemporary digital hardware.
PUMA demonstrates how integrating digital control units and analog crossbar tiles can preserve flexibility while maintaining the energy and area benefits of in-memory computation. The work’s key contribution lies in elevating CiM from a model-specific accelerator to a programmable substrate with a unified software stack. Its compiler performs graph partitioning, instruction scheduling, and register allocation across hundreds of tiles, providing the blueprint for an analog-aware compilation layer.
The architecture also illustrates the trade-offs between programmability and efficiency, showing that a modest digital overhead can enable orders-of-magnitude broader applicability. Relevant reported metrics and inferred estimates include 128 × 128 crossbars, 8-bit ADCs, a peak throughput of roughly 10 TOPS, and an energy efficiency near 0.5 pJ per MAC. A remaining gap is end-to-end tooling that co-optimizes the PUMA ISA with analog-aware training and calibration for diverse, evolving models.
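As an illustration of the mapping problem such a compiler solves, the hypothetical sketch below partitions a layer's weight matrix across fixed 128 × 128 crossbar tiles and reduces the partial products digitally; it is not the PUMA compiler's actual algorithm or API.

```python
# A hypothetical sketch of weight-matrix tiling across fixed-size crossbars,
# illustrating the kind of partitioning pass a PUMA-style compiler performs.
import numpy as np

TILE = 128  # crossbar dimension assumed throughout this bibliography

def partition_weights(W):
    """Split an (out, in) weight matrix into TILE x TILE crossbar tiles."""
    n_out, n_in = W.shape
    tiles = {}
    for r in range(0, n_out, TILE):
        for c in range(0, n_in, TILE):
            tiles[(r // TILE, c // TILE)] = W[r:r + TILE, c:c + TILE]
    return tiles

def tiled_matvec(tiles, x):
    """Sum partial products from tiles that share the same output rows."""
    n_row_tiles = max(r for r, _ in tiles) + 1
    y = [np.zeros(tiles[(r, 0)].shape[0]) for r in range(n_row_tiles)]
    for (r, c), W_tile in tiles.items():
        x_slice = x[c * TILE: c * TILE + W_tile.shape[1]]
        y[r] += W_tile @ x_slice        # each tile's analog MVM, reduced digitally
    return np.concatenate(y)

W = np.random.default_rng(0).standard_normal((300, 500))  # layer larger than one crossbar
x = np.random.default_rng(1).standard_normal(500)
assert np.allclose(tiled_matvec(partition_weights(W), x), W @ x)
```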
[3] Rasch et al. (2023)
Rasch, M. J., Mackin, C., Le Gallo, M., Chen, A., Fasoli, A., Odermatt, F., Li, N., Nandakumar, S. R., Narayanan, P., Tsai, H., Burr, G. W., Sebastian, A., & Narayanan, V. (2023). Hardware-aware training for large-scale and diverse deep learning inference workloads using in-memory computing-based accelerators. Nature Communications, 14, 5282. DOI
This paper proposes a hardware-aware training methodology for deep neural networks deployed on analog in-memory computing (AIMC) accelerators. The authors construct a detailed, hardware-realistic model of crossbar non-idealities—including conductance drift, asymmetric nonlinearity, finite precision, and device-to-device variability—and incorporate these effects directly into the forward and backward passes during training.
This closed-loop approach allows networks to learn weights that are robust to hardware imperfections without modifying the underlying inference architecture. By retraining models with the AIMC device model in the loop, the study demonstrates that inference accuracy can be recovered to within about one percentage point of floating-point digital baselines across CNN, RNN, and Transformer workloads, while requiring only a modest increase in training epochs. It would be valuable to corroborate the training-in-the-loop results on silicon across device types to quantify hardware–model mismatch in practice.
The work provides a principled software framework for analog-aware calibration and retraining, establishing a foundation for compiler or runtime passes that systematically inject and compensate for realistic noise sources. Typical parameterizations in the AIMC model include device-to-device variability on the order of 5–8%, conductance-drift exponents near 0.05–0.08, and asymmetric nonlinearity up to 15%—values that can guide the noise models used in a simulator.
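A minimal sketch of this training-in-the-loop idea, assuming a simple multiplicative Gaussian variability term and a power-law drift factor drawn from the ranges above, is shown below; the paper's full AIMC model (and open-source toolkits such as IBM's aihwkit) is considerably richer.

```python
# A minimal PyTorch sketch of injecting an AIMC-style noise model into the
# forward pass during training. The ~6% variability and ~0.06 drift exponent
# follow the ranges quoted above; everything else is an illustrative choice.
import torch

class NoisyAnalogLinear(torch.nn.Linear):
    def __init__(self, *args, variability=0.06, drift_nu=0.06, t_eval=3600.0, **kw):
        super().__init__(*args, **kw)
        self.variability, self.drift_nu, self.t_eval = variability, drift_nu, t_eval

    def forward(self, x):
        # Multiplicative device-to-device variability, resampled every pass so the
        # network learns weights that average out programming noise.
        noise = 1.0 + self.variability * torch.randn_like(self.weight)
        # Deterministic conductance-drift factor, modeled as (t / t0)^(-nu)
        # with t0 = 1 s (an assumed normalization).
        drift = (self.t_eval / 1.0) ** (-self.drift_nu)
        w_eff = self.weight * noise * drift
        return torch.nn.functional.linear(x, w_eff, self.bias)

layer = NoisyAnalogLinear(256, 64)
out = layer(torch.randn(8, 256))   # gradients flow through the noisy weights
out.sum().backward()
```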
[4] Kleyko et al. (2022)
Kleyko, D., Davies, M., Frady, E. P., Kanerva, P., Kent, S. J., Olshausen, B. A., Osipov, E., Rabaey, J. M., Rachkovskij, D. A., Rahimi, A., & Sommer, F. T. (2022). Vector symbolic architectures as a computing framework for emerging hardware. Proceedings of the IEEE, 110(10), 1538–1571. DOI
This paper offers a comprehensive survey of vector symbolic architectures (VSAs)—also known as hyperdimensional computing (HDC)—and their realization across emerging hardware substrates. The authors trace the evolution of high-dimensional computing from cognitive models to hardware-efficient implementations, unifying diverse formulations such as binary, bipolar, and real-valued representations under a single algebraic framework. The article also reviews encoding schemes, similarity measures, and compositional operators, emphasizing how simple arithmetic primitives (addition, binding, permutation) can form a universal computing substrate.
A key contribution of this work is its detailed mapping between HDC operations and device-level primitives available in neuromorphic, analog, and resistive-memory hardware. By analyzing the trade-offs between vector dimensionality, precision, and noise tolerance, the survey highlights why HDC naturally aligns with hardware that favors approximate arithmetic and high parallelism. It also catalogues implementations on platforms ranging from digital CMOS to memristive crossbars and carbon-nanotube FET arrays, providing a taxonomy that links algorithmic robustness to physical design parameters.
This reference defines the algorithmic design space from which hardware-compatible HDC variants can be selected. It clarifies which encoding methods and bundling strategies best tolerate quantization and analog noise, directly informing the workloads to simulate on a compute-in-memory platform. Relevant reported metrics include vector dimensionalities between 10^3 and 10^5, tolerance to bit-flip rates exceeding 10%, and projected reductions in multiply–accumulate energy by more than one order of magnitude compared with conventional digital inference. A useful extension would be standardized benchmarks that compare VSA/HDC variants under common precision/noise regimes on shared hardware.
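For readers unfamiliar with the primitives, the toy sketch below implements binding, bundling, and permutation on bipolar hypervectors and checks that a bundled key-value record can be queried by unbinding; the dimensionality and example are illustrative choices, not taken from the survey.

```python
# A minimal sketch of the three core VSA/HDC primitives on bipolar {-1, +1}
# hypervectors: binding (elementwise multiply), bundling (elementwise
# majority), and permutation (cyclic shift).
import numpy as np

D = 10_000
rng = np.random.default_rng(0)

def rand_hv():
    return rng.choice([-1, 1], size=D)

def bind(a, b):            # invertible: bind(bind(a, b), b) == a for bipolar vectors
    return a * b

def bundle(*vs):           # majority vote with a random tie-break; stays similar to inputs
    return np.sign(np.sum(vs, axis=0) + 0.5 * rng.choice([-1, 1], size=D))

def permute(a, k=1):       # protects sequence / position information
    return np.roll(a, k)

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

key, val, other = rand_hv(), rand_hv(), rand_hv()
record = bundle(bind(key, val), other)          # store a key-value pair plus a distractor
print(cosine(bind(record, key), val))           # well above 0: unbinding recovers the value
print(cosine(bind(record, rand_hv()), val))     # ~0: an unrelated key yields noise
```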
[5] Karunaratne et al. (2020)
Karunaratne, G., Le Gallo, M., Cherubini, G., Benini, L., Rahimi, A., & Sebastian, A. (2020). In-memory hyperdimensional computing. Nature Electronics, 3, 327–337. DOI
This paper demonstrates an end-to-end implementation of hyperdimensional computing (HDC) using memristive crossbars for analog in-memory matrix–vector multiplications combined with CMOS logic for elementwise operations. The authors map key HDC primitives—binding, bundling, and similarity search—directly onto resistive memory arrays, showing how these high-dimensional vector operations can be executed efficiently in situ. The work experimentally validates this mapping on a hardware prototype, highlighting that the inherently distributed and noise-tolerant nature of HDC is well matched to analog compute substrates.
Through a series of classification benchmarks, the study shows that accuracy degradation under realistic analog noise remains marginal, even when device variability and limited precision are introduced. The hardware achieves competitive inference accuracy on tasks such as language recognition and image classification while operating at far lower precision than digital counterparts. These findings provide empirical confirmation that HDC can gracefully tolerate non-idealities that would cripple conventional neural architectures.
The results directly support the research hypothesis that hyperdimensional computing forms a natural algorithmic complement to compute-in-memory systems. By demonstrating that HDC maintains robustness with limited bit precision and analog noise, this work strengthens the case for exploring HDC workloads in a simulator and compiler. Relevant reported metrics include crossbar arrays of 256 × 256 cells, classification accuracies above 90% under up to 10% device noise, and projected energy savings exceeding an order of magnitude compared with digital vector operations. Future work could benchmark HDC against state-of-the-art neural baselines on larger datasets to bound accuracy–energy trade-offs at scale.
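A hypothetical sketch of the crossbar-based similarity search at the heart of such systems is given below: prototype hypervectors are stored as conductance columns, a query is applied as inputs, and the largest column current selects the class. The 10% device-noise figure echoes the tolerance reported above; the array size and synthetic classes are illustrative only.

```python
# A hypothetical sketch of HDC similarity search on a noisy analog crossbar.
import numpy as np

rng = np.random.default_rng(1)
D, n_classes, device_noise = 2048, 8, 0.10

prototypes = rng.choice([0.0, 1.0], size=(D, n_classes))       # binary class hypervectors
G = prototypes * (1.0 + device_noise * rng.standard_normal(prototypes.shape))

def classify(query_hv):
    currents = query_hv @ G            # analog dot products, one per class column
    return int(np.argmax(currents))

# Queries are noisy copies of the stored prototypes (5% of bits flipped).
correct = 0
for c in range(n_classes):
    q = prototypes[:, c].copy()
    flips = rng.choice(D, size=D // 20, replace=False)
    q[flips] = 1.0 - q[flips]
    correct += (classify(q) == c)
print(f"recovered {correct}/{n_classes} classes under device and query noise")
```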
[6] Leroux et al. (2025)
Leroux, N., Manea, P.-P., Sudarshan, C., Finkbeiner, J., Siegel, S., Strachan, J. P., & Neftci, E. (2025). Analog in-memory computing attention mechanism for fast and energy-efficient large language models. Nature Computational Science, 5, 813–824. DOI
This paper introduces an analog in-memory computing (AIMC) architecture that implements the attention mechanism central to transformer-based large language models (LLMs). The design employs hybrid analog–digital gain-cell arrays that perform the key, query, and value matrix–vector multiplications directly in memory, followed by lightweight digital normalization and activation. To address the high cost of pretraining, the authors propose an initialization algorithm that achieves text-generation performance comparable to GPT-2 without training the model from scratch, enabling near-instant deployment of large generative transformers.
Experimental evaluations demonstrate that the proposed AIMC attention engine achieves orders-of-magnitude improvements in both energy efficiency and latency compared with GPU implementations. Specifically, the architecture reduces attention latency by up to two orders of magnitude and energy consumption by up to four orders of magnitude while maintaining GPT-2–level text perplexity. These gains arise from exploiting in-memory analog multiplication for the dominant attention operations, minimizing data movement between compute and memory and capitalizing on the intrinsic parallelism of the crossbar arrays.
This work exemplifies how compute-in-memory techniques can scale to modern, high-dimensional workloads such as transformers—the same family of models that motivate energy concerns in contemporary AI. It also bridges our focus on low-power machine learning with the hyperdimensional computing paradigm, since both leverage high-dimensional vector representations and similarity-based computation. The demonstrated performance gains offer concrete benchmarks for evaluating future compiler and simulation optimizations within an analog-in-memory framework. Relevant reported metrics include effective 6-bit analog precision per multiply–accumulate, latency reductions up to 10^2×, and energy reductions up to 10^4× relative to GPU baselines. Future work could examine attention-layer bottlenecks in larger LLMs with full system I/O to confirm end-to-end latency/energy across contexts.
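The sketch below illustrates, under an assumed 6-bit uniform quantization of the two dominant matrix products, where low-precision analog multiplications slot into the attention computation; the gain-cell circuits and the paper's initialization algorithm are not modeled.

```python
# A minimal sketch of the attention score path with the two dominant matrix
# products quantized to an assumed 6-bit "analog" precision; the softmax
# remains digital, as in hybrid analog-digital designs.
import numpy as np

def fake_analog_quant(x, bits=6):
    """Uniformly quantize to 2**bits levels over the tensor's own range."""
    lo, hi = x.min(), x.max()
    levels = 2 ** bits - 1
    return np.round((x - lo) / (hi - lo) * levels) / levels * (hi - lo) + lo

def analog_attention(Q, K, V, bits=6):
    scores = fake_analog_quant(Q @ K.T, bits) / np.sqrt(Q.shape[-1])
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)          # digital softmax
    return fake_analog_quant(probs @ V, bits)

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((16, 64)) for _ in range(3))
print(analog_attention(Q, K, V).shape)   # (16, 64)
```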
[7] Rasch et al. (2024)
Rasch, M. J., Carta, F., Fagbohungbe, O., & Gokmen, T. (2024). Fast and robust analog in-memory deep neural network training. Nature Communications, 15, 7133. DOI
This paper introduces two algorithms for analog in-memory training of deep neural networks that preserve the fast runtime characteristics of prior analog training methods while eliminating the need for a precisely calibrated reference (zero-point) conductance. The authors analyze how these schemes interact with device and circuit non-idealities and show that training convergence and accuracy can be maintained without relying on tight offset calibration.
Through simulation-based studies, the paper characterizes regimes of device variation—including write noise, asymmetry, retention loss, and endurance limits—under which the proposed algorithms remain effective, thereby relaxing device requirements that have previously hindered practical analog training. The study reports that stable learning is preserved for noise levels up to about 10% and asymmetry up to 30%, while training speed improves roughly threefold relative to earlier analog methods.
These results supply actionable software-level levers (offset-free or zero-point-independent training schemes) that enable deploying lower-precision analog inference with fewer calibration demands. Although the paper does not quantify energy directly, demonstrating reliable convergence under relaxed reference constraints is a prerequisite for lowering ADC/DAC precision and refresh overheads in compute-in-memory systems. An interesting question may be how zero-point-independent training interacts with drift/retention over time when weights are periodically refreshed on-chip.
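A small numerical illustration (not the paper's algorithm) of why the reference conductance matters is given below: when weights are encoded as a deviation from a reference level, a miscalibrated reference shifts every weight by the same offset, which offset-independent training schemes must tolerate.

```python
# A minimal numerical illustration of reference-conductance (zero-point) bias.
# The 3% miscalibration is an arbitrary illustrative value.
import numpy as np

rng = np.random.default_rng(0)
w_target = 0.1 * rng.standard_normal(1000)
g_ref_true, g_ref_used = 1.00, 1.03          # 3% miscalibrated reference
g_cells = g_ref_true + w_target              # devices programmed against the true reference
w_read = g_cells - g_ref_used                # but read out against the wrong one
print(f"systematic bias on every weight: {np.mean(w_read - w_target):+.3f}")
```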
[8] Haensch et al. (2023)
Haensch, W., Raghunathan, A., Roy, K., Chakrabarti, B., Phatak, C. M., Wang, C., & Guha, S. (2023). Compute-in-memory with non-volatile elements for neural networks: A review from a co-design perspective. Advanced Materials, 35(37), e2204944. DOI
This review provides a comprehensive analysis of compute-in-memory (CiM) architectures that integrate non-volatile memory devices—such as resistive RAM (RRAM), phase change memory, and ferroelectric FETs—into neural network accelerators. The authors advocate a co-design approach spanning devices, circuits, architectures, and workloads, arguing that meaningful progress in CiM efficiency requires simultaneous optimization across all four layers. The paper organizes prior research into a structured taxonomy, highlighting how material properties and device physics shape computational precision, latency, and endurance at higher abstraction levels.
A central contribution of this work is the introduction of a unified metric framework that explicitly links device-level characteristics to system-level outcomes in energy, delay, and accuracy. By comparing technology nodes and device types under equivalent neural workloads, the study exposes dominant trade-offs among conductance linearity, retention, and ADC/DAC overhead. The review also quantifies how variability, peripheral power, and limited precision cumulatively constrain achievable gains, establishing a realistic upper bound on CiM energy efficiency.
This reference offers a rigorous blueprint for evaluating analog accelerators and structuring simulator outputs. Its co-design methodology and cross-layer metrics can directly inform how to present energy, latency, and accuracy trade-offs in future results. Relevant reported metrics include typical operating energies between 0.1–1 pJ per MAC, latency reductions up to 10× compared with digital baselines, and endurance spanning 10^5–10^8 cycles across different non-volatile technologies, providing essential context for benchmarking the sustainability of emerging CiM platforms. As a potential complement, a reference methodology that unifies device-level reporting (e.g., variability, IR-drop) with system-level SOTA baselines would enable apples-to-apples comparisons.
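In the spirit of that cross-layer metric framework, the sketch below combines per-MAC array energy with amortized ADC cost and converts the result to TOPS/W; all constants are placeholders chosen within the ranges quoted above.

```python
# A hypothetical cross-layer energy-accounting sketch: total energy per MAC =
# array energy + ADC energy amortized over the MACs sharing one conversion.
def energy_per_mac_pj(array_pj_per_mac=0.3, adc_pj_per_conversion=2.0,
                      rows_per_conversion=128, bit_serial_cycles=8):
    # Each ADC conversion digitizes one column current produced by
    # `rows_per_conversion` simultaneous MACs; bit-serial inputs repeat the
    # conversion once per input bit.
    adc_pj_per_mac = adc_pj_per_conversion * bit_serial_cycles / rows_per_conversion
    return array_pj_per_mac + adc_pj_per_mac

def tops_per_watt(pj_per_mac):
    return 2.0 / pj_per_mac           # 1 MAC = 2 ops, so TOPS/W = 2 / (pJ per MAC)

e = energy_per_mac_pj()
print(f"{e:.3f} pJ/MAC  ->  {tops_per_watt(e):.1f} TOPS/W")
```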
[9] Soliman et al. (2023)
Soliman, T., Chatterjee, S., Laleni, N., Müller, F., Kirchner, T., Wehn, N., Kämpfe, T., Chauhan, Y. S., & Amrouch, H. (2023). First demonstration of in-memory computing crossbar using multi-level Cell FeFET. Nature Communications, 14, 6348. DOI
This paper reports the first experimental demonstration of an in-memory computing (IMC) crossbar based on multi-level ferroelectric field-effect transistors (FeFETs). The authors fabricate a CMOS-compatible array in which each FeFET cell stores multiple analog conductance states through controlled partial polarization of the ferroelectric layer. The array performs vector–matrix multiplications directly in memory, validating FeFETs as an emerging platform for analog multiply–accumulate operations.
Electrical characterization confirms linear and symmetric conductance tuning, stable multi-level behavior, and excellent endurance and retention. The devices exhibit up to sixteen programmable levels per cell, read energies below 1 pJ per MAC, endurance beyond 10^6 cycles, and retention exceeding 10^5 s without significant drift. These results demonstrate that FeFETs can deliver non-volatile, energy-efficient computation while remaining compatible with standard silicon processes.
This study provides concrete device-level parameters to extend a simulator beyond RRAM- and PCM-based assumptions. Incorporating FeFET characteristics such as multi-level precision, retention stability, and write endurance may allow evaluation of how compiler and calibration strategies generalize across technologies. The reported metrics—multi-level storage (4–16 states per cell), sub-picojoule read energy, and high endurance—offer realistic baselines for cross-technology energy and variability modeling in analog compute-in-memory systems. Future studies could evaluate multi-kilobyte to megabyte-scale FeFET arrays under workload-realistic read/write traffic and temperature variation.
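As a sketch of how such device parameters could enter a simulator, the code below quantizes floating-point weights onto an assumed 16-level cell with a small programming error; real arrays typically encode signed weights with differential pairs, which is omitted here for brevity.

```python
# A minimal sketch of mapping floating-point weights onto a multi-level cell.
# The 16-level figure follows the summary above; the uniform level spacing and
# 2% write noise are illustrative assumptions.
import numpy as np

def program_weights(W, n_levels=16, write_noise=0.02, rng=None):
    rng = rng or np.random.default_rng(0)
    w_max = np.abs(W).max()
    # Quantize to the nearest of n_levels states spanning [-w_max, w_max].
    step = 2 * w_max / (n_levels - 1)
    W_q = np.round((W + w_max) / step) * step - w_max
    # Each programmed state lands slightly off target (write/programming noise).
    return W_q + write_noise * w_max * rng.standard_normal(W.shape)

W = np.random.default_rng(1).standard_normal((128, 128))
W_prog = program_weights(W)
print(f"mean |programming error| = {np.abs(W_prog - W).mean():.3f}")
```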
[10] Xu et al. (2024)
Xu, J., Liu, H., Duan, Z., Liao, X., Jin, H., Yang, X., Li, H., Liu, C., Mao, F., & Zhang, Y. (2024). ReHarvest: An ADC resource-harvesting crossbar architecture for ReRAM-based DNN accelerators. ACM Transactions on Architecture and Code Optimization (TACO), 21(3), Article 63, 1–26. DOI
This paper addresses the analog-to-digital conversion bottleneck in ReRAM-based processing-in-memory accelerators, where tightly coupled and sparsely utilized ADCs limit throughput and dominate power consumption. ReHarvest decouples ADCs from individual crossbars and pools them at the tile level so that crossbars can dynamically harvest conversion resources on demand, improving utilization and scalability.
The proposed architecture incorporates a many-to-many mapping between crossbars and ADCs, a multi-tile matrix mapping (MTMM) scheme to enhance data parallelism, and a bus-based multicast interconnect for efficient vector distribution. Evaluation on DNN workloads shows that resource pooling and concurrency, rather than higher ADC precision, are the primary levers for improving system efficiency under realistic matrix–vector scheduling.
ReHarvest provides concrete guidance for compiler and simulator models of ADC sharing and scheduling. Its results suggest how a dynamic converter-allocation policy could be captured in energy-accounting and hardware-partitioning modules. Reported gains include up to 3.2× higher ADC utilization, 3.5× throughput speedup, and 3.1× lower ReRAM resource usage relative to the FORMS baseline, offering grounded parameters for tuning ADC/DAC efficiency models in analog compute-in-memory simulations. Co-designing mapping/scheduling with dynamic ADC pooling in a full runtime to verify throughput and QoS in multi-tenant scenarios could be a promising direction.
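A toy discrete-time sketch of the pooling idea, with illustrative timing constants rather than the paper's design, is shown below: crossbars that finish an analog MVM request a converter from a shared pool instead of owning a dedicated, mostly idle ADC.

```python
# A hypothetical cycle-level sketch of tile-level ADC pooling; the queueing
# policy and timing constants are illustrative, not ReHarvest's design.
from collections import deque

def simulate_adc_pool(n_crossbars=16, n_adcs=4, mvm_cycles=6, adc_cycles=2, sim_cycles=400):
    computing = [(mvm_cycles, x) for x in range(n_crossbars)]  # (finish_cycle, crossbar)
    waiting = deque()                                          # ready, needs an ADC
    converting = []                                            # (finish_cycle, crossbar)
    conversions = 0
    for cycle in range(1, sim_cycles + 1):
        # Crossbars finishing an analog MVM queue up for a pooled ADC.
        for item in [i for i in computing if i[0] <= cycle]:
            computing.remove(item)
            waiting.append(item[1])
        # ADCs finishing a conversion free up; the crossbar starts its next MVM.
        for item in [i for i in converting if i[0] <= cycle]:
            converting.remove(item)
            computing.append((cycle + mvm_cycles, item[1]))
            conversions += 1
        # Grant free ADCs to waiting crossbars, first come first served.
        while len(converting) < n_adcs and waiting:
            converting.append((cycle + adc_cycles, waiting.popleft()))
    busy_frac = conversions * adc_cycles / (n_adcs * sim_cycles)
    print(f"{conversions} conversions; pooled-ADC busy fraction ≈ {busy_frac:.0%}")

simulate_adc_pool()
```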
[11] Lammie et al. (2025)
Lammie, C., Büchel, J., Vasilopoulos, A., Le Gallo, M., & Sebastian, A. (2025). The inherent adversarial robustness of analog in-memory computing. Nature Communications, 16, 1756. DOI
This paper demonstrates that the stochastic and nonlinear behavior of analog in-memory computing (AIMC) hardware can inherently improve neural network resilience to adversarial perturbations. Using a phase-change-memory AIMC chip, the authors show that device noise and variability act as implicit regularizers that smooth decision boundaries and weaken gradient-based attacks.
Experiments on image-classification benchmarks reveal that moderate hardware noise (5–10%) reduces adversarial attack success rates by about 25% with negligible loss in clean accuracy. The analysis identifies both recurrent and non-recurrent noise sources as contributors to this built-in robustness.
These findings highlight how analog noise—normally viewed as a limitation—can be modeled as a beneficial design parameter. They motivate incorporating tunable stochastic noise models into a simulator, reinforcing the broader view that robustness and energy efficiency can emerge jointly in noisy analog systems. It remains to test robustness–accuracy trade-offs under stronger adaptive attacks and to characterize how hardware noise budgets interact with calibration.
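A minimal sketch of treating weight noise as an inference-time knob is shown below; the noise is resampled on every forward pass, which is one reason gradient-based attacks lose traction. The 5–10% range follows the figures above, but the toy wrapper is not a model of the paper's PCM chip.

```python
# A minimal PyTorch sketch of stochastic inference with tunable weight noise,
# resampled on every forward pass.
import torch
from torch.func import functional_call

class StochasticInference(torch.nn.Module):
    def __init__(self, model, weight_noise=0.08):
        super().__init__()
        self.model, self.weight_noise = model, weight_noise

    def forward(self, x):
        # Run the wrapped model with multiplicatively perturbed copies of its
        # parameters, leaving the stored weights untouched.
        noisy = {name: p * (1.0 + self.weight_noise * torch.randn_like(p))
                 for name, p in self.model.named_parameters()}
        return functional_call(self.model, noisy, (x,))

net = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))
noisy_net = StochasticInference(net, weight_noise=0.08)
x = torch.randn(4, 32)
print(noisy_net(x).shape, torch.allclose(noisy_net(x), noisy_net(x)))  # (4, 10); False
```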
[12a] Kleyko et al. (2022)
Kleyko, D., Rachkovskij, D. A., Osipov, E., & Rahimi, A. (2022). A survey on hyperdimensional computing aka vector symbolic architectures, part I: Models and data transformations. ACM Computing Surveys, 55(6), Article 130, 1–40. DOI
The first part of this two-part survey formalizes the mathematical foundations of hyperdimensional computing (HDC), also known as vector symbolic architectures (VSAs). It unifies binary, bipolar, and real-valued representations into a common algebraic framework and details how binding, bundling, and permutation operations implement symbolic reasoning and associative memory. The review also analyzes encoding strategies for symbolic, sensory, and structured data, outlining how information capacity and similarity preservation scale with vector dimensionality.
By consolidating diverse formulations under a single theoretical lens, the authors clarify how high-dimensional operations achieve robustness to noise and quantization. The survey quantifies representational properties such as orthogonality, capacity, and correlation decay across vector dimensions ranging from 10^3 to 10^5.
Part I provides the algorithmic foundation for selecting and parameterizing HDC operations to deploy on analog compute-in-memory hardware. Its analysis of binding and encoding precision directly informs how to represent data robustly under analog noise and limited-bit quantization. Formal bounds relating dimensionality, correlation, and task accuracy under fixed precision would sharpen design rules for hardware mapping.
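The short numerical check below illustrates the concentration property that underlies this robustness: independent random bipolar hypervectors are nearly orthogonal, with cosine similarity spread on the order of 1/sqrt(D), for dimensions in the 10^3 to 10^5 range discussed above.

```python
# A quick numerical check of near-orthogonality of random bipolar hypervectors.
import numpy as np

rng = np.random.default_rng(0)
for D in (1_000, 10_000, 100_000):
    a, b = rng.choice([-1, 1], size=(2, D))
    # Cosine similarity of bipolar vectors is the normalized dot product.
    print(f"D={D:>7}: cosine={a @ b / D:+.4f}   1/sqrt(D)={D ** -0.5:.4f}")
```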
[12b] Kleyko et al. (2023)
Kleyko, D., Rachkovskij, D. A., Osipov, E., & Rahimi, A. (2023). A survey on hyperdimensional computing aka vector symbolic architectures, part II: Applications, cognitive models, and challenges. ACM Computing Surveys, 55(9), Article 175, 1–52. DOI
The second part of this two-part survey extends the discussion to applications and physical realizations of hyperdimensional computing (HDC). It reviews implementations on digital CMOS, FPGA, and emerging analog substrates such as memristive crossbars, spintronic arrays, and photonic circuits. The authors summarize how hardware characteristics—precision, stochasticity, and connectivity—affect algorithmic choices and learning dynamics across domains including natural language processing, robotics, and biosignal processing.
The review identifies open challenges in dimensionality scaling, training efficiency, and integration with conventional neural networks, emphasizing that HDC’s distributed representations naturally tolerate low precision and device variability. Citing results from recent analog prototypes, the paper highlights empirical evidence of substantial multiply–accumulate energy reductions when mapping HDC operations to in-memory crossbars.
Part II serves as a hardware–algorithm bridge, defining how HDC primitives can be realized efficiently on analog compute-in-memory architectures. It guides the design of workloads and simulator benchmarks that exploit HDC’s noise tolerance and vector-level parallelism for sustainable low-power inference. A consolidated hardware taxonomy with reproducible, cross-platform benchmarks would clarify which substrates best support HDC primitives.