Scheduling and resource management solutions for the scalable and efficient design of today's and tomorrow's HPC machines

  • Authors: Marco D'Amico
  • Thesis supervisors: Julita Corbalán González (supervisor), Ana Jokanovic (co-supervisor)
  • Defence: Universitat Politècnica de Catalunya (UPC) (Spain), 2021
  • Language: Spanish
  • Full text not available
  • Abstract
    • In recent years, high-performance computing research has become essential in pushing the boundaries of what we can know, predict, achieve, and understand about the reality we observe and experiment with.

      HPC workloads grow in size and complexity hand in hand with the machines that support them, accommodating big data, data analytics, and machine learning applications alongside classical compute-intensive workloads. At the same time, power demand is increasing sharply and has become a constraint in the design of these machines.

      The increasing diversification of processors and accelerators, new special-purpose devices, and new memory layers allow better management of these workloads. In parallel, libraries and tools are being developed to support and make the most of the hardware while offering standardized, straightforward interfaces to users and developers. Several scheduling and resource management layers are fundamental in organizing the work and the access to resources.

      This thesis focuses on the job scheduling and resource management layer. We claim that this layer needs research in three directions: awareness, dynamicity, and automatization. First, awareness of hardware and application characteristics would improve the configuration, scheduling, and placement of tasks. Second, dynamic systems are more responsive: they react quickly to changes in the hardware, e.g., in failure cases, and they adapt to changes in application requirements. Finally, automatization is the third direction. In our opinion, future systems need to act autonomously; a system that keeps relying on user guidance is prone to errors and requires unnecessary user expertise.

      This thesis presents three main contributions that address those gaps. First, we developed DROM, a transparent library that allows parallel applications to shrink and expand dynamically within compute nodes. DROM enables efficient utilization of the available resources with no effort from developers or users. We enabled vertical malleability, i.e., malleability inside compute nodes, by including in DROM an API and data structures that allow resource managers to control the number of threads and their pinning to compute cores at runtime; a conceptual sketch of this mechanism is given after the abstract. We measured negligible overhead and integrated DROM with OpenMPI, OmpSs, and MPI.

      As a second contribution, we developed a system-wide malleable scheduling and resource management policy that uses slowdown predictions to optimize the scheduling of malleable jobs, which we call the Slowdown-Driven policy (SD-policy). SD-policy uses malleability and node sharing to start new jobs by shrinking running jobs with a lower slowdown, but only if the new job's predicted end time is reduced compared to static scheduling; the decision rule is sketched after the abstract. We obtained very promising reductions in slowdown, makespan, and consumed energy with workloads combining compute-bound and memory-bound jobs.

      Finally, we used and extended an energy and runtime model to predict a job's runtime and dissipated energy on multiple hardware architectures. We implemented an energy-aware multi-cluster policy, EAMC-policy, that uses these predictions to select optimal core frequencies and to filter and prioritize job submissions onto the most efficient hardware in the presence of heterogeneity; a sketch of the selection step is given after the abstract. This is done automatically, reducing user intervention and the expertise required. Simulations based on real-world hardware and workloads show that high energy savings and reduced response times are achieved compared to non-energy-aware and non-heterogeneity-aware scheduling.
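
The following is a minimal conceptual sketch of the vertical (in-node) malleability described in the first contribution. It uses only standard Linux and OpenMP interfaces (sched_setaffinity, omp_set_num_threads) and does not reproduce the actual DROM API or data structures; all function and variable names are illustrative.

/* malleability_sketch.c
 * A hypothetical sketch of in-node malleability: a process re-pins itself and
 * resizes its OpenMP team when an (imaginary) resource manager changes its
 * core allocation.  Compile with:  gcc -fopenmp malleability_sketch.c
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <omp.h>

/* Apply a new core allocation: pin the calling thread to the granted cores
 * and resize the OpenMP team used by subsequent parallel regions.  Threads
 * created afterwards inherit this CPU mask; a real malleability layer such
 * as DROM would also re-pin worker threads that already exist. */
static void apply_allocation(const int *cores, int ncores)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    for (int i = 0; i < ncores; i++)
        CPU_SET(cores[i], &mask);

    if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
        perror("sched_setaffinity");

    omp_set_num_threads(ncores);
}

int main(void)
{
    const int grow[]   = {0, 1, 2, 3};  /* initial grant: 4 cores */
    const int shrink[] = {0, 1};        /* later reclaim: 2 cores */

    apply_allocation(grow, 4);
    #pragma omp parallel
    #pragma omp single
    printf("expanded: %d threads\n", omp_get_num_threads());

    apply_allocation(shrink, 2);
    #pragma omp parallel
    #pragma omp single
    printf("shrunk:   %d threads\n", omp_get_num_threads());

    return 0;
}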
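
The decision rule of SD-policy described in the second contribution can be summarized by the sketch below. All names (job_t, end_time_shared, end_time_static) are hypothetical, slowdown is taken here in its usual form (wait time + runtime) / runtime, and the real policy implemented in the thesis is more involved.

/* sd_policy_sketch.c -- simplified SD-policy decision, not the real code. */
#include <stdbool.h>

typedef struct {
    double wait_time;         /* seconds the job has waited so far           */
    double predicted_runtime; /* predicted runtime on its current allocation */
} job_t;

/* Slowdown as commonly defined: (wait + runtime) / runtime. */
static double slowdown(const job_t *j)
{
    return (j->wait_time + j->predicted_runtime) / j->predicted_runtime;
}

/* Decide whether to start `new_job` now by shrinking `victim` and sharing its
 * nodes, instead of keeping the baseline static (exclusive) schedule.
 * - end_time_shared: predicted end time of new_job if started now on the
 *                    shared, shrunken resources
 * - end_time_static: predicted end time of new_job under static scheduling */
static bool start_by_shrinking(const job_t *new_job, const job_t *victim,
                               double end_time_shared, double end_time_static)
{
    /* Only shrink a running job whose slowdown is lower than the new job's,
     * and only if sharing actually finishes the new job earlier. */
    return slowdown(victim) < slowdown(new_job) &&
           end_time_shared < end_time_static;
}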
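
The selection step of EAMC-policy described in the third contribution can be illustrated as follows. The option_t structure, the candidate values, and the ranking by predicted energy (power times runtime) are assumptions made for illustration; the model and policy developed in the thesis are more detailed.

/* eamc_sketch.c -- hypothetical sketch of prediction-driven placement:
 * for each (architecture, frequency) candidate, use model predictions of
 * runtime and power to estimate energy, then pick the most efficient one. */
#include <float.h>
#include <stdio.h>

typedef struct {
    const char *arch;    /* e.g. "cluster-A", "cluster-B" (illustrative names) */
    double freq_ghz;     /* candidate core frequency                           */
    double pred_runtime; /* model-predicted runtime at this frequency (s)      */
    double pred_power;   /* model-predicted average power at this frequency (W)*/
} option_t;

/* Return the index of the candidate with the lowest predicted energy. */
static int pick_most_efficient(const option_t *opts, int n)
{
    int best = -1;
    double best_energy = DBL_MAX;
    for (int i = 0; i < n; i++) {
        double energy = opts[i].pred_power * opts[i].pred_runtime;
        if (energy < best_energy) {
            best_energy = energy;
            best = i;
        }
    }
    return best;
}

int main(void)
{
    /* Illustrative candidates for one job across two architectures. */
    option_t opts[] = {
        { "cluster-A", 2.0, 1200.0, 180.0 },
        { "cluster-A", 2.6,  980.0, 240.0 },
        { "cluster-B", 2.4, 1100.0, 150.0 },
    };
    int best = pick_most_efficient(opts, 3);
    printf("submit to %s at %.1f GHz (predicted %.0f J)\n",
           opts[best].arch, opts[best].freq_ghz,
           opts[best].pred_power * opts[best].pred_runtime);
    return 0;
}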

