Exploring strategies to improve FPGA design with higher levels of abstraction

  • Author: Tobías Alonso Pugliese
  • Thesis supervisors: Gustavo Sutter Capristo (supervisor), Jorge Enrique López de Vergara Méndez (co-supervisor)
  • Defended at: Universidad Autónoma de Madrid (Spain), 2022
  • Language: English
  • Number of pages: 252
  • Parallel titles:
    • Exploración de estrategias para mejorar el diseño para FPGAs con mayores niveles de abstracción
  • Thesis committee: Luca Valcarenghi (chair), Alberto Sánchez González (secretary), Fernando Rincon Calle (member), María Liz Crespo (member), Raúl Mateos Gil (member)
  • Doctoral programme: Programa de Doctorado en Ingeniería Informática y de Telecomunicación, Universidad Autónoma de Madrid
  • Abstract
    • While our data processing needs continue to grow at a fast pace, the ability of general-purpose hardware to increase its compute power is shrinking with each successive technology generation. It therefore becomes increasingly important to resort to other means of improving performance. Custom-designed hardware, in particular hardware implemented in FPGAs, has been a vital tool in many application areas and, under the current circumstances, may be essential to push performance forward. In this context, this thesis explores strategies that can be applied at the different hardware design stages to enhance the performance of custom FPGA-based solutions and to improve the ability of hardware designers to tackle larger design problems. The work focuses on three application areas: image processing, network packet processing and artificial intelligence. In each of them, system requirements and design challenges are analyzed and then addressed with proposed solutions.

      In the area of image processing, we tackle compression under stringent constraints, such as those faced in satellite and drone sensing. Given that available image compression algorithms do not properly address the problems arising in these scenarios, an algorithm-hardware co-design approach was followed. We proposed a series of enhancements to the JPEG-LS standard, aimed at improving its coding efficiency at a low computational overhead, and then developed an FPGA implementation of the encoder. As a result, a low-latency near-lossless compressor was achieved that outperforms existing implementations in both pixel rate and compression ratio.

      In the area of network packet processing, we focus on the problem of offloading the flow metering process for 100 Gbit Ethernet. Although a high-level synthesis implementation can provide a solution for slower link rates, this route proved insufficient for the target link rate and for relevant applications that require managing large memories. For this reason, an alternative architecture was designed to implement high-throughput, complex read-update processes in the presence of the significant propagation delays associated with the memory system. This made it possible to implement in an FPGA a TCP flow metering system that supports the 100GbE packet rate and offloads over 50% of the computational load. We also demonstrate that using arrays of the developed cores enables larger flow tables, increasing the offloading capabilities while still supporting the maximum packet rate.

      In addition, we studied a complementary throughput optimization for read-update processes, the conditional stalling technique (a minimal illustrative sketch and a simple throughput model are given after the abstract). We concluded that, with efficient implementations, its use incurs no frequency penalty for most designs and requires few extra resources. We also examined the performance of the technique as a function of the input data and architecture characteristics, showing that it can significantly enhance performance even in adverse cases. However, we demonstrated that optimizing throughput requires taking into account both the memory address statistics and the evolution of frequency as the pipeline is deepened.

      In the area of artificial intelligence, we deal with a set of related implementation problems that arise when scaling up convolutional neural network dataflow accelerators, in particular on non-monolithic (multi-SLR) FPGAs. To tackle them, we developed a partitioning and resource balancing optimization tool. This tool addresses the connection of control signals in large designs on multi-SLR FPGAs and balances multiple resources across FPGA regions and/or chips while minimizing the communication cost among them. The tool natively maps a system to a multi-node implementation if it does not fit in a single FPGA, and supports different multi-node paradigms. Applying this optimization significantly enhanced the performance of the accelerators, and targeting multi-node platforms led to further improvements in compute density, latency and power.

      Finally, from the experience gained in the development of these applications, we identified key methodological aspects that led to successful hardware implementations using high-level synthesis (HLS). In particular, we observed that HLS can achieve a high quality of results by leveraging broader algorithmic exploration and function specialization. We also identified the benefits of modular partitioning and refinement, and of a hardware-oriented development mentality.
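
Illustrative sketch of conditional stalling (not taken from the thesis). The C++ model below captures the read-update hazard discussed in the abstract: a flow table sits behind a multi-cycle memory, and an incoming packet is stalled only when its table address collides with an update that is still in flight, instead of stalling on every access. The table size, the latency value, and the names (kMemLatency, FlowEntry, cycle) are assumptions for illustration, and for simplicity the model does not drain in-flight updates while stalled.

  #include <array>
  #include <cstddef>
  #include <cstdint>
  #include <deque>
  #include <utility>

  // Illustrative parameters; real values are design-dependent.
  constexpr std::size_t kMemLatency = 8;       // cycles between reading an entry and writing it back
  constexpr std::size_t kTableSize  = 1 << 16; // number of flow-table entries

  struct FlowEntry { std::uint64_t packets = 0, bytes = 0; };

  std::array<FlowEntry, kTableSize> table;                         // flow table (external memory in hardware)
  std::deque<std::pair<std::uint32_t, std::uint32_t>> in_flight;   // (address, packet length) of pending updates

  // Models one pipeline cycle. Returns false when the incoming packet must be
  // stalled because its address matches an update still in flight; otherwise
  // its read is issued and the oldest pending update is written back.
  bool cycle(std::uint32_t addr, std::uint32_t len) {
      addr %= kTableSize;                            // addr is assumed to be a hashed table index
      for (const auto& p : in_flight)
          if (p.first == addr) return false;         // conditional stall: only on a real hazard

      in_flight.emplace_back(addr, len);             // issue the read for this packet
      if (in_flight.size() > kMemLatency) {          // oldest access has traversed the memory latency
          auto [a, l] = in_flight.front();
          in_flight.pop_front();
          table[a].packets += 1;                     // read-modify-write completes
          table[a].bytes   += l;
      }
      return true;
  }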
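
As a rough back-of-the-envelope model (our illustration, not an analysis taken from the thesis) of why throughput depends on both the address statistics and on how frequency evolves with pipeline depth: if a conditionally stalled read-update pipeline of depth D runs at frequency f(D), an input collides with one of the up to D-1 in-flight updates with probability p_stall(D), and a collision costs on average s(D) stall cycles, then

  T(D) \approx \frac{f(D)}{1 + p_{\mathrm{stall}}(D)\, s(D)},
  \qquad
  p_{\mathrm{stall}}(D) \approx 1 - \left(1 - \tfrac{1}{N}\right)^{D-1} \ \text{for uniformly random addresses over } N \text{ entries.}

Deepening the pipeline typically raises f(D) but also raises p_stall(D), so under this simple model throughput peaks at an intermediate depth that shifts with the address distribution, in line with the abstract's conclusion.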

