Reconfigurable Architectures for Accelerating Distributed Applications, "A Graph Processing Application Case Study" / Amin Sahebi. - (2022).

Reconfigurable Architectures for Accelerating Distributed Applications, "A Graph Processing Application Case Study"

Amin Sahebi
2022

Abstract

This thesis addresses current challenges in distributed execution models and investigates system support for artificial intelligence (AI) and high-performance computing (HPC) applications. In this context, we study in detail the co-design of the Dataflow-Threads (DF-Threads) execution model. To support the development and debugging of this model, we introduce DRT, a lightweight dataflow runtime. DRT is open-source, written in portable C (tested with the GNU C compiler), and runs on real machines based on architectures such as x86, AArch, and RISC-V. We then consider demanding applications in the AI and HPC domains and address their main challenges and bottlenecks in order to extend our dataflow runtime. To this end, we use widely known benchmarks to stress the capabilities of the DF-Threads execution model and to evaluate it against other parallel programming models. We chose Blocked Matrix Multiplication and Recursive Fibonacci: matrix multiplication is one of the main kernels of AI and HPC applications, while Recursive Fibonacci is a simple benchmark that creates a high number of threads and processes and stresses the entire execution model. In this thesis, we are mainly interested in heterogeneous platforms, that is, hardware that combines a range of computing components such as multicore CPUs, GPUs, or FPGAs. These capabilities have led many researchers to adopt such platforms in state-of-the-art work. Heterogeneous systems are flexible, cost-efficient, and well supported by their communities. Our work focuses mainly on CPU+FPGA heterogeneous systems, typically a general-purpose CPU (x86 or ARM) running a Unix-based operating system alongside an FPGA accelerator.
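To make the two benchmarks concrete, here is a minimal sequential sketch in C. It is not taken from DRT or from the thesis: the names `fib`, `fib_count_calls`, `fib_calls`, and `matmul_blocked` are hypothetical, and the call counter only stands in for the threads that a dataflow runtime might spawn, assuming one thread per recursive call or per tile product.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical counter: one increment per call, standing in for one
 * spawned thread in a dataflow runtime (an assumption, not DRT's API). */
static uint64_t fib_calls = 0;

/* Naive recursive Fibonacci: every call with n >= 2 makes two more calls,
 * so the number of calls grows exponentially with n. */
uint64_t fib(unsigned n)
{
    fib_calls++;
    if (n < 2)
        return n;
    return fib(n - 1) + fib(n - 2);
}

/* Runs fib(n) from a clean counter; returns the number of calls made
 * and stores the Fibonacci value in *result. */
uint64_t fib_count_calls(unsigned n, uint64_t *result)
{
    assert(result != NULL);
    fib_calls = 0;
    *result = fib(n);
    return fib_calls;
}

/* Blocked matrix multiplication: C += A * B for n x n row-major matrices,
 * processed in tiles of at most bs x bs elements. Each (ii, jj, kk) tile
 * product is an independent unit of work that a runtime could schedule. */
void matmul_blocked(size_t n, size_t bs,
                    const double *a, const double *b, double *c)
{
    assert(bs > 0);
    for (size_t ii = 0; ii < n; ii += bs) {
        for (size_t jj = 0; jj < n; jj += bs) {
            for (size_t kk = 0; kk < n; kk += bs) {
                /* Clamp tile bounds so n need not be a multiple of bs. */
                size_t i_end = (ii + bs < n) ? ii + bs : n;
                size_t j_end = (jj + bs < n) ? jj + bs : n;
                size_t k_end = (kk + bs < n) ? kk + bs : n;
                for (size_t i = ii; i < i_end; i++) {
                    for (size_t k = kk; k < k_end; k++) {
                        double aik = a[i * n + k];
                        for (size_t j = jj; j < j_end; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
                }
            }
        }
    }
}
```

In this sketch, `fib_count_calls(10, &r)` yields 55 in `r` after 177 calls, which illustrates how quickly the amount of fine-grained work grows, while the tile size `bs` in `matmul_blocked` controls the granularity of the work units.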
Subsequently, to meet a need of our hardware platform, we designed and fabricated the Gluon board, which exposes the GTH serial transceivers of Xilinx UltraScale+ heterogeneous accelerators for high-rate data transfer applications. Gluon boards are modular and can carry up to 18 Gbps on each lane for specific data types and payload sizes. The end-user cost of manufacturing a Gluon board is less than 400 euros. Finally, a real application, distributed graph processing, demonstrates the distributed execution model and extends it to real-world, large-scale workloads. As a first step, we provide a comprehensive baseline, design and propose a large-scale distributed graph processing application, and evaluate it with the PageRank algorithm on well-known datasets. We show how graph partitioning combined with a multi-FPGA architecture leads to higher performance without limiting the size of the graph, even when the graph has trillions of vertices. Our performance analysis for PageRank forecasts a performance improvement of up to 20 times and a cost-normalized improvement of up to 12 times when comparing the proposed approach on one Xilinx Alveo U250 FPGA accelerator against a state-of-the-art baseline graph processing software implementation on an Intel Xeon server CPU with 40 cores at 2.50 GHz.
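The idea of combining graph partitioning with PageRank can be sketched in plain C as follows. This is an illustrative, software-only sketch and not the thesis's FPGA design: it assumes an edge-list graph and a simple partitioning by contiguous ranges of destination vertices, and the names `edge_t`, `pagerank_partition_step`, and `pagerank` are hypothetical. Each partition touches only the ranks it owns, so the partitions of one iteration could in principle be processed by separate devices.

```c
#include <assert.h>
#include <stddef.h>

/* Directed edge src -> dst (hypothetical type for this sketch). */
typedef struct {
    size_t src;
    size_t dst;
} edge_t;

/* One PageRank iteration for the partition that owns the destination
 * vertices in [lo, hi). Only next[lo..hi) is written. Vertices with no
 * outgoing edges simply contribute nothing (a simplification). */
void pagerank_partition_step(size_t n, const edge_t *edges, size_t m,
                             const size_t *out_deg, const double *rank,
                             double *next, size_t lo, size_t hi, double d)
{
    assert(lo <= hi && hi <= n);
    for (size_t v = lo; v < hi; v++)
        next[v] = (1.0 - d) / (double)n;
    for (size_t e = 0; e < m; e++) {
        size_t u = edges[e].src;
        size_t v = edges[e].dst;
        if (v >= lo && v < hi && out_deg[u] > 0)
            next[v] += d * rank[u] / (double)out_deg[u];
    }
}

/* Runs `iters` PageRank iterations with damping factor d, splitting the
 * vertex set into `parts` contiguous partitions. `next` is scratch space
 * of n doubles; the result is left in `rank`. */
void pagerank(size_t n, const edge_t *edges, size_t m,
              const size_t *out_deg, double *rank, double *next,
              size_t parts, size_t iters, double d)
{
    assert(n > 0 && parts > 0);
    for (size_t v = 0; v < n; v++)
        rank[v] = 1.0 / (double)n;

    size_t chunk = (n + parts - 1) / parts; /* vertices per partition */
    for (size_t it = 0; it < iters; it++) {
        /* Partitions are independent within one iteration. */
        for (size_t p = 0; p < parts; p++) {
            size_t lo = p * chunk;
            if (lo > n)
                lo = n;
            size_t hi = (lo + chunk < n) ? lo + chunk : n;
            pagerank_partition_step(n, edges, m, out_deg, rank, next,
                                    lo, hi, d);
        }
        for (size_t v = 0; v < n; v++)
            rank[v] = next[v];
    }
}
```

Because every partition writes a disjoint slice of `next`, the computed ranks in this sketch do not depend on the number of partitions. Only the amount of state each worker must hold changes, which is the property that lets a partitioned design scale with graph size.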
Professor Roberto Giorgi
IRAN
Amin Sahebi
Files in this record:
File: Amin-Sahebi-thesis-2022.pdf
Embargo: until 01/07/2022
Type: Publisher's PDF (Version of record)
License: Open Access
Size: 3.55 MB
Format: Adobe PDF

Use this identifier to cite or link to this document: http://hdl.handle.net/2158/1271765