Opportunistic Memory Architecture and Resiliency

Current student: Irina Alam

Memories are one of the key bottlenecks in the performance, reliability, and energy efficiency of most computing systems. As computing systems have scaled over the decades, the need for memory systems that can store and retrieve large amounts of data efficiently has also risen rapidly. To meet this need, memory systems have been scaled for maximum information density.
Moore’s Law has been the primary driver behind the phenomenal advances in computing capability of the past several decades. However, with technology scaling having reached the nanoscale era, integrated circuits, and memory systems in particular, are becoming increasingly sensitive to process variations, leading to reliability and yield concerns. Because memories are primarily designed to maximize bit storage density, they are particularly sensitive to manufacturing process variation, environmental operating conditions, and aging-induced wearout.

On-chip Memory:

Addressing reliability concerns is particularly challenging in on-chip caches and in embedded memories such as scratchpads in IoT devices, because the area, power, and latency overheads of reliability techniques must be kept as low as possible. In addition, when the supply voltage of these SRAM-based memories is reduced to save energy, the hard fault rate increases exponentially and the memories become more susceptible to radiation-induced soft faults. We work on innovative low-overhead solutions that handle both hard and soft faults in lightweight on-chip memories. FaultLink is one such approach; it deals with known hard faults in software-managed memories. It tolerates up to a 250x increase in bit error rate, or allows a reduction of about 250mV-300mV in supply voltage relative to nominal. For soft errors at runtime, we have proposed novel error detection and correction solutions such as Parity++ and Software Defined Error Localizing Codes. These error-correcting codes have 3-6x lower storage overhead than today’s standard codes while providing opportunistic single-bit error correction.

Off-chip Main Memory:

Memory reliability is a significant problem not just in on-chip memories, but also in off-chip main memory systems. In warehouse-scale computers, memory errors have become expensive culprits that cause machine crashes, corrupted data, security vulnerabilities, service disruption, and costly repairs and hardware servicing. To provide stronger protection at lower overhead, we work on Software Defined Error Correcting Codes (SDECC): software-defined heuristic recovery that uses side-channel information to recover from double-bit/double-symbol errors while having the same overhead as a single-bit/single-symbol error-correcting code. SDECC can successfully correct up to 99.9% of double-bit/double-chip errors.
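
The idea of heuristic recovery from a detected-but-uncorrectable error can be illustrated with a toy example. The sketch below is not the published SDECC implementation: it assumes a tiny extended Hamming (8,4) SECDED code, and the "side information" is a hypothetical stand-in (a value the word is expected to resemble, such as a neighboring word). All function names are made up for illustration. On a double-bit error, it enumerates every codeword at distance 2 from the received word and picks the candidate closest to the side information.

```python
from itertools import product


def encode(msg):
    """Encode 4 data bits with an extended Hamming (8,4) SECDED code."""
    d1, d2, d3, d4 = msg
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    word = (p1, p2, d1, p3, d2, d3, d4)
    return word + (sum(word) % 2,)  # append overall parity bit


# All 16 codewords, mapped back to their messages.
CODEBOOK = {encode(m): m for m in product((0, 1), repeat=4)}


def distance(a, b):
    """Hamming distance between two equal-length bit tuples."""
    return sum(x != y for x, y in zip(a, b))


def candidate_messages(received):
    """Messages whose codewords are exactly two bit flips from `received`."""
    return [m for cw, m in CODEBOOK.items() if distance(cw, received) == 2]


def recover(received, side_info):
    """Pick the candidate closest to `side_info` (an assumed heuristic)."""
    cands = candidate_messages(received)
    if not cands:
        raise ValueError("received word is not a double-bit error pattern")
    return min(cands, key=lambda m: distance(m, side_info))


def flip(word, *positions):
    """Return a copy of `word` with the given bit positions inverted."""
    return tuple(b ^ 1 if i in positions else b for i, b in enumerate(word))


original = (1, 0, 1, 1)
received = flip(encode(original), 0, 5)    # inject a double-bit error
print(candidate_messages(received))        # the ambiguous candidate set
print(recover(received, side_info=original))
```

In this toy code every double-bit error leaves four equally distant candidates, so the code alone cannot decide; the choice among them rests entirely on the side information, and a poor heuristic can pick the wrong one.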

Main memory plays a pivotal role, sitting between the processor cores and slow storage devices. There is therefore an ever-increasing demand for main memory capacity, both to exploit the processing power of multicore and manycore systems and to sustain performance growth. Though DRAM is still the main memory workhorse, several application contexts need different properties from main memory (higher density, non-volatility, higher performance, etc.). It is thus becoming increasingly important to consider alternative technologies that can potentially avoid the problems faced by DRAM and enable new opportunities. Several emerging non-volatile memory (NVM) technologies are now being considered as potential replacements for, or enhancements to, DRAM. However, the biggest problem these emerging technologies face is their high stochastic bit error rate. In fact, the reliability challenges can offset the density and energy advantages that NVMs offer. Because the bit errors are random in nature, these memory technologies require stronger in-field error-correcting codes (ECC). We work on novel solutions for NVMs such as Compression with Multi-ECC (CME) for Magnetic Memories. CME first compresses the cachelines stored in memory and then opportunistically uses the freed space to increase protection by accommodating a stronger error-correcting code. We are also working on novel solutions to deal with the limited endurance of these non-volatile memories.
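
The compress-then-protect idea behind CME can be sketched as a simple placement decision. The following is a minimal illustration under stated assumptions, not the CME design itself: the 64-byte line, the 8-byte baseline ECC budget, the tier names, and the check-byte counts are all hypothetical, and `zlib` stands in for whatever cacheline compression scheme a real memory controller would use.

```python
import zlib

LINE_BYTES = 64       # cacheline payload size (assumed)
BASE_ECC_BYTES = 8    # check bytes stored beside an uncompressed line (assumed)
SLOT_BYTES = LINE_BYTES + BASE_ECC_BYTES

# Hypothetical stronger ECC tiers, strongest first: (name, check bytes needed).
STRONGER_TIERS = [("strong", 24), ("medium", 16)]


def place_line(line):
    """Return (ecc_tier, stored_payload) for one cacheline.

    If the compressed line leaves enough room in the fixed-size slot,
    the freed space is given to a stronger ECC tier. Otherwise the line
    is stored uncompressed with the baseline code.
    """
    if len(line) != LINE_BYTES:
        raise ValueError("expected a full cacheline")
    compressed = zlib.compress(line, 9)
    for name, check_bytes in STRONGER_TIERS:
        if len(compressed) + check_bytes <= SLOT_BYTES:
            return name, compressed
    return "baseline", line


tier, payload = place_line(bytes(LINE_BYTES))   # an all-zero, highly compressible line
print(tier, len(payload))
```

The slot size never changes, so protection is upgraded only when the data happens to compress well; incompressible lines keep the baseline protection they would have had anyway.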

Collaborators: Lara Dolecek (UCLA)

Publications

    [1] [PDF] I. Alam and P. Gupta, “SAME-Infer: Software Assisted Memory Resilience for Efficient Inference at the Edge,” in Proceedings of the International Symposium on Memory Systems, New York, NY, USA, 2020
    [Bibtex]
    @inproceedings{C110,
    address = {New York, NY, USA},
    author = {Alam, Irina and Gupta, Puneet},
    booktitle = {{Proceedings of the International Symposium on Memory Systems}},
    keywords = {memres},
    location = {Washington, District of Columbia},
    month = {September},
    numpages = {13},
    publisher = {ACM},
    series = {MEMSYS '20},
    title = {{SAME-Infer: Software Assisted Memory Resilience for Efficient Inference at the Edge}},
    year = {2020}
    }

    [2] [PDF] C. Schoeny, F. Sala, M. Gottscho, I. Alam, P. Gupta, and L. Dolecek, “Context-Aware Resiliency: Unequal Message Protection for Random-Access Memories,” IEEE Transactions on Information Theory, 2019.
    [Bibtex]
    @article{J62,
    author = {Schoeny, Clayton and Sala, Frederic and Gottscho, Mark and Alam, Irina and Gupta, Puneet and Dolecek, Lara},
    journal = {{IEEE Transactions on Information Theory}},
    keywords = {memres},
    month = {October},
    title = {{Context-Aware Resiliency: Unequal Message Protection for Random-Access Memories}},
    year = {2019}
    }

    [3] [PDF] [DOI] I. Alam, S. Pal, and P. Gupta, “Compression with multi-ECC: Enhanced Error Resiliency for Magnetic Memories,” in Proceedings of the International Symposium on Memory Systems, New York, NY, USA, 2019, pp. 85–100
    [Bibtex]
    @inproceedings{C108,
    acmid = {3357533},
    address = {New York, NY, USA},
    author = {Alam, Irina and Pal, Saptadeep and Gupta, Puneet},
    booktitle = {{Proceedings of the International Symposium on Memory Systems}},
    doi = {10.1145/3357526.3357533},
    isbn = {978-1-4503-7206-0},
    keywords = {memres},
    location = {Washington, District of Columbia},
    month = {September},
    numpages = {16},
    pages = {85--100},
    publisher = {ACM},
    series = {MEMSYS '19},
    title = {{Compression with multi-ECC: Enhanced Error Resiliency for Magnetic Memories}},
    url = {http://doi.acm.org/10.1145/3357526.3357533},
    year = {2019}
    }

    [4] [PDF] C. Schoeny, I. Alam, M. Gottscho, P. Gupta, and L. Dolecek, “Error Correction and Detection for Computing Memories Using System Side Information,” in IEEE Information Theory Workshop (ITW), 2018
    [Bibtex]
    @inproceedings{C106,
    author = {Schoeny, Clayton and Alam, Irina and Gottscho, Mark and Gupta, Puneet and Dolecek, Lara},
    booktitle = {{IEEE Information Theory Workshop (ITW)}},
    keywords = {memres},
    month = {November},
    paperurl = {https://nanocad.ee.ucla.edu/pub/Main/Publications/C108_paper.pdf},
    title = {{Error Correction and Detection for Computing Memories Using System Side Information}},
    year = {2018}
    }

    [5] [PDF] I. Alam, C. Schoeny, L. Dolecek, and P. Gupta, “Parity++: Lightweight Error Correction for Last Level Caches,” in IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2018
    [Bibtex]
    @inproceedings{C104,
    author = {Alam, Irina and Schoeny, Clayton and Dolecek, Lara and Gupta, Puneet},
    booktitle = {{IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)}},
    keywords = {memres},
    month = {June},
    paperurl = {https://nanocad.ee.ucla.edu/pub/Main/Publications/C104_paper.pdf},
    title = {{Parity++: Lightweight Error Correction for Last Level Caches}},
    year = {2018}
    }

    [6] [PDF] I. Alam, C. Schoeny, L. Dolecek, and P. Gupta, “Parity++: Lightweight Error Correction for Last Level Caches,” in IEEE Workshop on Silicon Errors in Logic – System Effects (SELSE), 2018 – Best of SELSE
    [Bibtex]
    @conference{W14,
    author = {Alam, Irina and Schoeny, Clayton and Dolecek, Lara and Gupta, Puneet},
    booktitle = {{IEEE Workshop on Silicon Errors in Logic - System Effects (SELSE)}},
    keywords = {memres,parity,ecc,memory,reliability,architecture,coding,systems,caches},
    note = {Best of SELSE},
    paperurl = {https://nanocad.ee.ucla.edu/pub/Main/Publications/W14_paper.pdf},
    title = {{Parity++: Lightweight Error Correction for Last Level Caches}},
    year = {2018}
    }

    [7] [PDF] I. Alam, S. Pal, and P. Gupta, “Compression with Multi-ECC: Enhanced Error Resiliency for Magnetic Memories,” in IEEE Workshop on Silicon Errors in Logic – System Effects (SELSE), 2018
    [Bibtex]
    @conference{W15,
    author = {Alam, Irina and Pal, Saptadeep and Gupta, Puneet},
    booktitle = {{IEEE Workshop on Silicon Errors in Logic - System Effects (SELSE)}},
    keywords = {memres,compression,ecc,memory,reliability,architecture,coding,systems,stt_ram},
    paperurl = {https://nanocad.ee.ucla.edu/pub/Main/Publications/W15_paper.pdf},
    title = {{Compression with Multi-ECC: Enhanced Error Resiliency for Magnetic Memories}},
    year = {2018}
    }

    [8] [PDF] C. Schoeny, F. Sala, M. Gottscho, I. Alam, P. Gupta, and L. Dolecek, “Context-Aware Resiliency: Unequal Message Protection for Random-Access Memories,” in IEEE Information Theory Workshop (ITW), 2017
    [Bibtex]
    @inproceedings{C101,
    author = {Schoeny, Clayton and Sala, Frederic and Gottscho, Mark and Alam, Irina and Gupta, Puneet and Dolecek, Lara},
    booktitle = {{IEEE Information Theory Workshop (ITW)}},
    keywords = {memres},
    month = {November},
    paperurl = {https://nanocad.ee.ucla.edu/pub/Main/Publications/C101_paper.pdf},
    title = {{Context-Aware Resiliency: Unequal Message Protection for Random-Access Memories}},
    year = {2017}
    }

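    [9] M. Gottscho, I. Alam, C. Schoeny, L. Dolecek, and P. Gupta, “Low-Cost Memory Fault Tolerance for IoT Devices,” in ACM/IEEE International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES), published in ESWEEK special issue of the ACM Transactions on Embedded Computing Systems (TECS), 2017 – Best paper award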
    [Bibtex]
    @inproceedings{C98,
    author = {Gottscho, Mark and Alam, Irina and Schoeny, Clayton and Dolecek, Lara and Gupta, Puneet},
    booktitle = {{ACM/IEEE International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES), published in ESWEEK special issue of the ACM Transactions on Embedded Computing Systems (TECS)}},
    keywords = {memres},
    month = {October},
    note = {Best paper award},
    number = {},
    pages = {},
    paperurl = {https://nanocad.ee.ucla.edu/pub/Main/Publications/J54_paper.pdf},
    slideurl = {https://nanocad.ee.ucla.edu/pub/Main/Publications/C98_slides.pptx},
    title = {{Low-Cost Memory Fault Tolerance for IoT Devices}},
    volume = {},
    year = {2017}
    }

    [10] [PDF] M. Gottscho, I. Alam, C. Schoeny, L. Dolecek, and P. Gupta, “Low-Cost Memory Fault Tolerance for IoT Devices,” ACM/IEEE International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES), published in ESWEEK special issue of the ACM Transactions on Embedded Computing Systems (TECS), 2017.
    [Bibtex]
    @article{J54,
    author = {Gottscho, Mark and Alam, Irina and Schoeny, Clayton and Dolecek, Lara and Gupta, Puneet},
    journal = {{ACM/IEEE International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES), published in ESWEEK special issue of the ACM Transactions on Embedded Computing Systems (TECS)}},
    keywords = {memres},
    month = {October},
    number = {},
    pages = {},
    paperurl = {https://nanocad.ee.ucla.edu/pub/Main/Publications/J54_paper.pdf},
    title = {{Low-Cost Memory Fault Tolerance for IoT Devices}},
    volume = {},
    year = {2017}
    }

    [11] [PDF] M. Gottscho, M. Shoaib, S. Govindan, B. Sharma, D. Wang, and P. Gupta, “Measuring the Impact of Memory Errors on Application Performance,” IEEE Computer Architecture Letters (CAL), 2016.
    [Bibtex]
    @article{J46,
    author = {Gottscho, Mark and Shoaib, Mohammed and Govindan, Sriram and Sharma, Bikash and Wang, Di and Gupta, Puneet},
    issue = {},
    journal = {{IEEE Computer Architecture Letters (CAL)}},
    keywords = {memres},
    month = {August},
    paperurl = {https://nanocad.ee.ucla.edu/pub/Main/Publications/J46_paper.pdf},
    title = {{Measuring the Impact of Memory Errors on Application Performance}},
    volume = {},
    year = {2016}
    }

    [12] M. Gottscho, C. Schoeny, L. Dolecek, and P. Gupta, “Software-Defined Error-Correcting Codes,” in IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2016
    [Bibtex]
    @conference{C93,
    author = {Gottscho, Mark and Schoeny, Clayton and Dolecek, Lara and Gupta, Puneet},
    booktitle = {{IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)}},
    keywords = {ecc,memory,reliability,architecture,coding,systems,dram,caches,memres},
    month = {June},
    paperurl = {https://nanocad.ee.ucla.edu/pub/Main/Publications/W13_paper.pdf},
    slideurl = {https://nanocad.ee.ucla.edu/pub/Main/Publications/W13_slides.pptx},
    title = {{Software-Defined Error-Correcting Codes}},
    year = {2016}
    }

    [13] [PDF] M. Gottscho, C. Schoeny, L. Dolecek, and P. Gupta, “Software-Defined Error-Correcting Codes,” in IEEE Workshop on Silicon Errors in Logic – System Effects (SELSE), 2016 – Best Paper Award
    [Bibtex]
    @conference{W13,
    author = {Gottscho, Mark and Schoeny, Clayton and Dolecek, Lara and Gupta, Puneet},
    booktitle = {{IEEE Workshop on Silicon Errors in Logic - System Effects (SELSE)}},
    keywords = {memres,hsi,ecc,memory,reliability,architecture,coding,systems,dram,caches},
    note = {Best Paper Award},
    paperurl = {https://nanocad.ee.ucla.edu/pub/Main/Publications/W13_paper.pdf},
    slideurl = {https://nanocad.ee.ucla.edu/pub/Main/Publications/W13_slides.pptx},
    title = {{Software-Defined Error-Correcting Codes}},
    year = {2016}
    }

    [14] [PDF] M. Gottscho, A. Banaiyan Mofrad, N. Dutt, A. Nicolau, and P. Gupta, “Power / Capacity Scaling: Energy Savings With Simple Fault-Tolerant Caches,” in Proc. ACM/IEEE Design Automation Conference (DAC), 2014
    [Bibtex]
    @inproceedings{C79,
    author = {Gottscho, Mark and Banaiyan Mofrad, Abbas and Dutt, Nikil and Nicolau, Alex and Gupta, Puneet},
    booktitle = {{Proc. ACM/IEEE Design Automation Conference (DAC)}},
    keywords = {memres},
    month = {June},
    paperurl = {https://nanocad.ee.ucla.edu/pub/Main/Publications/C79_paper.pdf},
    slideurl = {https://nanocad.ee.ucla.edu/pub/Main/Publications/C79_slides.pdf},
    title = {{P}ower / {C}apacity {S}caling: {E}nergy {S}avings {W}ith {S}imple {F}ault-{T}olerant {C}aches},
    year = {2014}
    }