Reconfigurable Architectures for Accelerating Distributed Applications, "A Graph Processing Application Case Study"

Sahebi, Amin

This thesis addresses current challenges in distributed execution models and investigates system support for artificial intelligence (AI) and high-performance computing (HPC) applications. In this context, we investigate in detail the co-design of the Dataflow-Threads (DF-Threads) execution model. To facilitate the support, development, and debugging of the DF-Threads execution model, we introduce DRT, a lightweight dataflow runtime. DRT is written in portable C (tested with the GNU C compiler) and is open source. It runs on real machines based on architectures such as x86, AArch, and RISC-V. Furthermore, we consider demanding applications in the AI and HPC domains and address their main challenges and bottlenecks in order to extend our dataflow runtime. To do this, we use widely known benchmarks to stress the capabilities of the DF-Threads execution model and to evaluate it against other parallel programming models. We choose Blocked Matrix Multiplication and Recursive Fibonacci. Matrix multiplication is one of the main kernels of AI and HPC applications, while Recursive Fibonacci is a simple benchmark that creates a large number of threads and processes and stresses the entire execution model. In this thesis, we are mainly interested in heterogeneous platforms. A heterogeneous platform is a hardware device that contains a range of computing components, such as multicore CPUs, GPUs, or FPGAs. These capabilities have led many researchers to adopt such platforms in their state-of-the-art work. Heterogeneous systems are flexible, cost-efficient, and well supported by their communities. Our work focuses mainly on CPU+FPGA heterogeneous systems, typically a general-purpose CPU (x86 or ARM) running a Unix-based operating system alongside an FPGA accelerator.
Subsequently, to meet a need of our hardware platform, we designed and fabricated the Gluon board, which uses the serial transceivers of the Xilinx UltraScale+ heterogeneous accelerator and makes its GTH transceivers usable in high-rate data transfer applications. Gluon boards are modular and can carry up to 18 Gbps on each lane with specific data types and payload sizes. The Gluon board offers these capabilities at an end-user manufacturing cost of less than 400 euros. Moreover, we present a real distributed graph processing application to exercise the distributed execution model and to extend it to real-world, large-scale workloads such as graph processing. As a first step, we provide a comprehensive baseline, design and propose a large-scale distributed graph processing application, and evaluate it with the PageRank algorithm on well-known datasets. We show how graph partitioning combined with a multi-FPGA architecture leads to higher performance without limiting the size of the graph, even when the graph has trillions of vertices. Our performance analysis, in the case of PageRank, forecasts a performance improvement of up to 20 times and a cost-normalized improvement of up to 12 times when comparing the proposed approach on one Xilinx Alveo U250 FPGA accelerator against a state-of-the-art baseline software graph processing implementation on an Intel Xeon server with a 40-core CPU at 2.50 GHz.
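The idea of combining graph partitioning with PageRank can be illustrated with a minimal software sketch. It is not the thesis implementation. It assumes, for illustration only, that edges are grouped by destination-vertex range, so that each partition updates a disjoint slice of the rank vector and could in principle be assigned to a separate accelerator. The function names `partition_edges` and `pagerank`, the damping factor of 0.85, and the fixed iteration count are all hypothetical choices.

```python
def partition_edges(edges, num_vertices, num_parts):
    """Group edges by the destination-vertex range they fall into (assumed scheme)."""
    size = -(-num_vertices // num_parts)  # ceiling division: vertices per partition
    parts = [[] for _ in range(num_parts)]
    for src, dst in edges:
        parts[dst // size].append((src, dst))
    return parts


def pagerank(edges, num_vertices, num_parts=2, damping=0.85, iterations=50):
    """Power-iteration PageRank that walks the edge list one partition at a time."""
    out_degree = [0] * num_vertices
    for src, _ in edges:
        out_degree[src] += 1

    parts = partition_edges(edges, num_vertices, num_parts)
    rank = [1.0 / num_vertices] * num_vertices

    for _ in range(iterations):
        # Rank held by vertices without outgoing edges is spread over all vertices.
        dangling = sum(r for r, d in zip(rank, out_degree) if d == 0)
        base = (1.0 - damping + damping * dangling) / num_vertices
        new_rank = [base] * num_vertices

        # Each partition writes only to its own destination range, so the
        # partitions are independent within one iteration.
        for part in parts:
            for src, dst in part:
                new_rank[dst] += damping * rank[src] / out_degree[src]
        rank = new_rank
    return rank


if __name__ == "__main__":
    toy_edges = [(0, 1), (1, 2), (2, 0), (2, 1)]
    print(pagerank(toy_edges, num_vertices=3, num_parts=2))
```

Because every edge with a given destination lands in the same partition, the number of partitions changes only how the work is grouped, not the computed ranks.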

2022
