Convergence of deep learning and high performance computing: challenges and solutions

  • Author: Albert Njoroge Kahira
  • Thesis supervisors: Rosa M Badia (supervisor), Leonardo Arturo Bautista Gomez (co-supervisor)
  • Defended at: Universitat Politècnica de Catalunya (UPC), Spain, in 2021
  • Language: Spanish
  • Full text not available
  • Abstract
    • Deep Learning has achieved outstanding results in many fields and led to groundbreaking discoveries. With the steady increase in dataset and model sizes, there has been a recent surge of Machine Learning applications in High-Performance Computing (HPC) to speed up training. Deep Neural Network (DNN) frameworks use distributed training to enable faster convergence and to alleviate memory capacity limitations when training large models or using high-dimensional inputs. However, training DNNs on HPC infrastructures presents a unique set of challenges: scalability, I/O contention, network congestion and fault tolerance. Solving these problems is particularly challenging because of the nature of DL applications and the history of DL adoption in HPC. This thesis addresses the scalability and resilience challenges by looking at different parts of the Machine Learning workflow.

      We first address hyper-parameter optimisation (HPO), one of the most time-consuming and resource-intensive parts of a Machine Learning workflow. We present an HPO scheme built on top of PyCOMPSs, a programming model and runtime that aims to ease the development of parallel applications for distributed infrastructures. We show that PyCOMPSs is a robust framework that can accelerate hyper-parameter optimisation across multiple devices and computing units. We perform a detailed performance analysis over different configurations to demonstrate the effectiveness of our approach.
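      As a rough illustration, and not the scheme evaluated in the thesis, the sketch below uses PyCOMPSs' task decorator and compss_wait_on primitive to fan out hyper-parameter trials as parallel tasks; the search grid, the train_and_evaluate placeholder and its parameters are illustrative assumptions.

          # Hedged sketch of task-based HPO on PyCOMPSs; the objective function,
          # hyper-parameters and grid are assumed for illustration only.
          from itertools import product

          from pycompss.api.task import task            # PyCOMPSs task decorator
          from pycompss.api.api import compss_wait_on   # synchronisation primitive

          @task(returns=1)
          def train_and_evaluate(learning_rate, batch_size):
              # Placeholder objective: a real trial would train a DNN here and
              # return its validation accuracy.
              return 1.0 / (1.0 + abs(learning_rate - 0.01) + batch_size / 1024.0)

          def grid_search(learning_rates, batch_sizes):
              # Each call returns a future immediately; the PyCOMPSs runtime
              # schedules the trials across the available nodes of the cluster.
              futures = {(lr, bs): train_and_evaluate(lr, bs)
                         for lr, bs in product(learning_rates, batch_sizes)}
              scores = {cfg: compss_wait_on(f) for cfg, f in futures.items()}
              return max(scores, key=scores.get)

          if __name__ == "__main__":
              best = grid_search([0.1, 0.01, 0.001], [32, 64, 128])
              print("best configuration:", best)

      Launched with runcompss, the independent trials can run concurrently on as many workers as the runtime has available.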

      We then analyse the compute, communication and memory requirements of DNNs to understand the trade-offs that different parallelism approaches impose on performance and scalability. We use this model-driven analysis as the basis for an oracle utility that can help detect the limitations and bottlenecks of different parallelism approaches at scale.
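      The following is a minimal sketch of this kind of model-driven reasoning, assuming a ring all-reduce for gradient exchange; the parameter count, FLOP rates and link bandwidth are illustrative values, not figures from the thesis.

          # Hedged sketch: back-of-the-envelope cost model for data parallelism;
          # all constants below are assumptions chosen only for illustration.
          def data_parallel_step_time(params, flops_per_sample, batch_per_gpu,
                                      gpus, gpu_flops, link_bw, bytes_per_param=4):
              """Rough compute and communication time for one training step."""
              compute = batch_per_gpu * flops_per_sample / gpu_flops
              # A ring all-reduce moves about 2*(N-1)/N of the gradient buffer
              # over each link, independently of the per-GPU batch size.
              gradient_bytes = params * bytes_per_param
              comm = 2.0 * (gpus - 1) / gpus * gradient_bytes / link_bw
              return compute, comm

          if __name__ == "__main__":
              for gpus in (2, 8, 32, 128):
                  compute, comm = data_parallel_step_time(
                      params=345e6,            # assumed model size (parameters)
                      flops_per_sample=2.1e12,
                      batch_per_gpu=8,
                      gpus=gpus,
                      gpu_flops=15e12,         # assumed sustained FLOP/s per GPU
                      link_bw=12.5e9)          # assumed bytes/s per link
                  print(f"{gpus:4d} GPUs: compute {compute * 1e3:6.1f} ms, "
                        f"all-reduce {comm * 1e3:6.1f} ms")

      Even in this crude form, the model shows the all-reduce term approaching a constant floor while per-GPU compute stays flat, the kind of bottleneck an oracle utility would flag.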

      While significant effort has been put into facilitating distributed training in DL frameworks, fault tolerance has been largely ignored. We examine the checkpointing implementations of popular DL platforms. We evaluate the computational cost of checkpointing, file formats and file sizes, the impact of scale, and deterministic checkpointing. We provide discussion points that can aid users in selecting a fault-tolerant framework to use in HPC, as well as take-away points that framework developers can use to facilitate better checkpointing of DL workloads in HPC.
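      For illustration only, the sketch below shows the kind of periodic checkpointing such an evaluation would measure, using PyTorch's standard torch.save of model and optimiser state; the model, the checkpoint interval and the timing are assumptions rather than the evaluation setup of the thesis.

          # Hedged sketch: periodic checkpointing with PyTorch; model, optimiser
          # and interval are assumed for illustration only.
          import time

          import torch
          import torch.nn as nn

          model = nn.Linear(1024, 10)
          optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

          def save_checkpoint(epoch, path="checkpoint.pt"):
              # The serialised file bundles weights, optimiser state and progress,
              # which is what determines the file size and the write cost.
              start = time.time()
              torch.save({"epoch": epoch,
                          "model_state": model.state_dict(),
                          "optimizer_state": optimizer.state_dict()}, path)
              return time.time() - start

          for epoch in range(10):
              # ... one training epoch would run here ...
              if epoch % 5 == 0:               # assumed interval: every 5 epochs
                  overhead = save_checkpoint(epoch)
                  print(f"epoch {epoch}: checkpoint written in {overhead:.3f} s")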

