Ayuda
Ir al contenido

Dialnet


Memory systems for high-performance computing: the capacity and reliability implications

  • Autores: Darko Zivanovic
  • Directores de la Tesis: Petar Radojkovic (dir. tes.), Eduard Ayguadé Parra (tut. tes.)
  • Lectura: En la Universitat Politècnica de Catalunya (UPC) ( España ) en 2018
  • Idioma: español
  • Tribunal Calificador de la Tesis: Harald Servat Gelabert (presid.), Jaime Abella Ferrer (secret.), Gulay Yalcin (voc.)
  • Programa de doctorado: Programa de Doctorado en Arquitectura de Computadores por la Universidad Politécnica de Catalunya
  • Materias:
  • Texto completo no disponible (Saber más ...)
  • Resumen
    • Memory systems are signicant contributors to the overall power requirements, energy consumption, and the operational cost of large high-performance computing systems (HPC). Limitations of main memory systems in terms of latency, bandwidth and capacity, can signicantly affect the performance of HPC applications, and can have strong negative impact on system scalability. In addition, errors in the main memory system can have a strong impact on the reliability, accessibility and serviceability of large-scale clusters. This thesis studies capacity and reliability issues in modern memory systems for high-performance computing. The choice of main memory capacity is an important aspect of high-performance computing memory system design. This choice becomes in- creasingly important now that 3D-stacked memories are entering the market. Compared with conventional DIMMs, 3D memory chiplets provide better performance and energy efficiency but lower memory capacities. Therefore the adoption of 3D-stacked memories in the HPC domain depends on whether we can find use cases that require much less memory than is available now. We analyze memory capacity requirements of important HPC benchmarks and applications. The study identifies the HPC applications and use cases with memory footprints that could be provided by 3D-stacked memory chiplets, making a first step towards the adoption of this novel technology in the HPC domain. For HPC domains where large memory capacities are required, we propose scaling-in of HPC applications to reduce energy consumption and the running time of a batch of jobs. We also propose upgrading the per-node memory capacity, which enables greater degree of scaling-in and additional energy savings. Memory system is one of the main causes of hardware failures. In each generation, the DRAM chip density and the amount of the memory in systems increase, while the DRAM technology process is constantly shrinking. Therefore, we could expect that the DRAM failures could have a serious impact on the future-systems reliability. This thesis studies DRAM errors observed on a production HPC system during a period of two years. We clearly distinguish between two different approaches for the DRAM error analysis: categorical analysis and the analysis of error rates. The first approach compares the errors at the DIMM level and partitions the DIMMs into various categories, e.g. based on whether they did or did not experience an error. The second approach is to analyze the error rates, i.e., to present the total number of errors relative to other statistics, typically the number of MB-hours or the duration of the observation period. We show that although DRAM error analysis may be performed with both approaches, they are not interchangeable and can lead to completely different conclusions. We further demonstrate the importance of providing statistical significance and presenting results that have practical value and real-life use. We show that various widely-accepted approaches for DRAM error analysis may provide data that appear to support an interesting conclusion, but are not statistically signifcant, meaning that they could merely be the result of chance. We hope the study of methods for DRAM error analysis presented in this thesis will become a standard for any future analysis of DRAM errors in the field.


Fundación Dialnet

Dialnet Plus

  • Más información sobre Dialnet Plus

Opciones de compartir

Opciones de entorno