Thesis "Processor Array synthesis for Loop Nests with Non-Rectangular Iteration Spaces Using the Polytope Model"

**Student:** José Roberto Pérez Andrade
**Advisor:** Dr. César Torres Huitzil
**Committee members:** Dr. René Armando Cumplido Parra, Dr. Arturo Díaz Pérez, Dr. José Juan García Hernández, Dr. José Luis Tecpanecatl Xihuitl, Dr. Javier Díaz Carmona

High-level synthesis methods are concerned with the translation of algorithmic specifications into representations at register transfer level or into a hardware description language. One of the representations used for high-level synthesis is the polytope model, which provides an abstraction to represent loop computations of an algorithmic specification as integer points inside a polyhedron. As a result, the polytope model is able to derive dedicated parallel hardware architectures in the form of processor arrays. Processor arrays consist of a set of processing elements connected in a regular and local way, and are able to exploit several levels of parallelism. Deriving fully functional processor arrays requires, besides the processor array data-path, control schemes that generate the activation signals of the processing elements and select the required operations.
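To make the idea of "integer points inside a polyhedron" concrete, the following minimal sketch enumerates the iteration space of a two-level loop nest described by affine inequalities. The function name `iteration_points`, the constraint encoding, and the problem size `N = 4` are illustrative assumptions, not taken from the thesis.

```python
from itertools import product


def iteration_points(constraints, bounds):
    """Enumerate the integer points of a polytope (hypothetical helper).

    constraints: list of (a, b) pairs, each encoding the affine
                 inequality a[0]*x[0] + a[1]*x[1] + ... <= b.
    bounds:      list of inclusive (lo, hi) pairs forming a bounding box.
    """
    ranges = [range(lo, hi + 1) for lo, hi in bounds]
    return [
        point
        for point in product(*ranges)
        if all(
            sum(a_k * x_k for a_k, x_k in zip(a, point)) <= b
            for a, b in constraints
        )
    ]


N = 4  # assumed problem size, for illustration only
box = [(0, N - 1), (0, N - 1)]

# Rectangular iteration space: 0 <= i, j <= N-1 (the box itself).
rectangular = iteration_points([], box)

# Non-rectangular (triangular) space: additionally j <= i,
# written as -1*i + 1*j <= 0.
triangular = iteration_points([((-1, 1), 0)], box)

print(len(rectangular), len(triangular))
```

A triangular space of this kind is what a loop such as `for i in range(N): for j in range(i + 1)` produces; each point corresponds to one execution of the loop body.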

Also, external memory systems capable of providing data from an external source to the array and of extracting data produced by the array are required. Previous research has focused on generating processor arrays that deal with a single problem size, for algorithms whose loop bounds form a rectangular shape. This dissertation proposes a control scheme that generates the control signals for algorithms with rectangular and non-rectangular loop bounds, and whose problem size is larger than a maximum value provided at synthesis time. This control scheme uses local and distributed modules to orchestrate the computations of the processor array. Previous works also assume that the input data are available whenever the processor array requires them, and that the output data are extracted as soon as they are produced.

To address this, the dissertation also proposes an external memory system built on four architectural cases that can occur when using the polytope model. These architectural cases are based on a multi-clock domain approach and on dual-port memories. The proposed control scheme and memory system are integrated into a hardware architecture framework, which was validated by generating functional processor arrays for two case studies: matrix-matrix multiplication and Cholesky decomposition. The generated processor arrays have different design parameters and different array sizes, and they target different FPGA devices. Experimental results show that increasing the size of the control has a greater impact on the operating frequency than increasing the problem size that the processor array can solve.

Moreover, the external memory results show that the peak I/O bandwidth provided by each of the four architectural cases exceeds the processor array I/O requirements. The results also demonstrate that one limitation for implementing the processor arrays (including data-path, control and memory) is the amount of memory available inside the FPGAs. Furthermore, the results suggest that solving larger problem sizes comes at the price of dedicating more FPGA silicon to storing data than to computing. Finally, these processor arrays were evaluated with traditional metrics such as acceleration, efficiency and relative load imbalance, showing that algorithms with rectangular loop bounds and low-latency operations achieve greater acceleration, higher efficiency, and lower load imbalance than algorithms with non-rectangular loop bounds and high-latency hardware operations.
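The three metrics named above can be sketched as follows. The abstract does not give their formulas, so the definitions below are common textbook forms assumed for illustration (acceleration as sequential time over parallel time, efficiency as acceleration per processing element, and relative load imbalance as the gap between the busiest element and the mean, normalized by the busiest). The function names and the per-element workloads are hypothetical and are not results from the thesis.

```python
def acceleration(t_sequential, t_parallel):
    """Speedup: sequential execution time over parallel execution time."""
    return t_sequential / t_parallel


def efficiency(t_sequential, t_parallel, n_elements):
    """Speedup divided by the number of processing elements."""
    return acceleration(t_sequential, t_parallel) / n_elements


def relative_load_imbalance(loads):
    """(max - mean) / max over per-element workloads (assumed definition)."""
    busiest = max(loads)
    mean = sum(loads) / len(loads)
    return (busiest - mean) / busiest


# Hypothetical workloads (iterations per processing element) for a
# 4-element linear array.
rectangular_loads = [4, 4, 4, 4]  # every element does the same work
triangular_loads = [1, 2, 3, 4]   # triangular space: work grows with the index

print(relative_load_imbalance(rectangular_loads))
print(relative_load_imbalance(triangular_loads))
```

Under these assumed definitions, an evenly loaded array has zero imbalance, while a triangular iteration space leaves some elements idle for part of the run, which is consistent with the qualitative trend reported above.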