Técnicas de optimización de rutinas paralelas de álgebra lineal en sistemas heterogéneos

  1. Camara Moreno, Jesus
Supervised by:
  1. Domingo Giménez Cánovas Director
  2. Antonio Javier Cuenca Muñoz Director

Defence university: Universidad de Murcia

Fecha de defensa: 23 July 2020

Committee:
  1. Leonel Augusto Pires Seabra De Sousa Chair
  2. Pedro Enrique López de Teruel Alcolea Secretary
  3. Victor Manuel García Molla Committee member

Type: Thesis

Abstract

The aim of this Thesis is the development of a hierarchical auto-tuning methodology for linear algebra routines in heterogeneous computational environments. This methodology has been designed to allow obtaining routines executed eficiently from the lowest level of the hierarchy, both hardware and software, to the highest level in each case. In terms of hardware, the components considered at the lowest level are the basic computational elements: multicore CPU, GPU and MIC (Xeon Phi). The parameters to take into account to optimize the routines are diferent in each of them, but the methodology must be valid in all cases. These computational elements can be grouped into higher level components (nodes), which generally consists of a multicore CPU and one or more coprocessor, in some cases of a diferent type (GPU or MIC) or with diferent computational power even if they are of the same type. Therefore, the methodology must consider how to combine the auto-tuning information from the basic computational elements to guide the optimization at node level. The nodes, in turn, can be grouped into clusters, either homogeneous or heterogeneous, so the hierarchical methodology must be able to obtain the optimization information at node level to guide the optimization routines at cluster level. As for the software (linear algebra routines), the process starts with basic routines, mainly the matrix multiplication, since it is the basic computational component used for the implementation of other eficient higher-level linear algebra routines. The methodology is then extended to higher level routines, such as the Strassen multiplication or the LU factorization, which will use the previously optimized matrix multiplication routine. Similarly, in the next level of the hierarchy, it could be considered the execution of scientific codes that call lower level routines, such as simulations in which the resolution of several systems of equations is performed simultaneously. In addition, since there are multiple optimized linear algebra libraries and, in some cases, parallelized for diferent systems, these routines are used and the application of the hierarchical auto-tuning methodology is analyzed considering diferent libraries. In this way, the efectiveness of the methodology can be shown independently of the routines and libraries used. The main goal of the auto-tuning process is to obtain eficient routines in the computational systems for which they are designed. Therefore, it is necessary to select the values of the parameters of a set of adjustable parameters that lead to execution times close to the experimental optima. The choice of the set of parameters depend on both the routine and the computational system. Thus, the auto-tuning methodology must ensure its validity for multiple routines, libraries and computing systems, so that it can be easily extended in any dimension. To achieve the main goal established, the working methodology used will begin by analysing the behaviour of the hierarchical auto-tuning methodology by performing a study by levels (both hardware and software), considering diferent scenarios (systems and routines) and parameters at each level until reaching the highest level considered in each case, so that the conclusions are general and do not depend on the routine or a specific system. The experiments will be carried out on heterogeneous platforms made up of nodes with diferent number and type of parallel processing elements (CPU, GPU, MIC). Several configurations of computing units will be considered, e.g. multicores with diferent number of cores or groups of nodes at cluster level, in order to obtain diferent degrees of heterogeneity. Once the methodology is analyzed at the first level of hardware, the process moves to the next level of computing elements, analyzing the information from the previous level that can be used to accelerate the decisión taken at the current level while maintaining the quality of the optimization process. Similarly, after analyzing the routines at one level, the process will continue to the next level of the software hierarchy, moving through the levels of the hardware hierarchy incrementally. Thus, while at the level of basic routines it is enough to work with the matrix multiplication, in the next level it could consider routines that internally call that routine and use diferent parameters, such as the recursion level in the Strassen multiplication or the block size in matrix factorizations, such as LU. The research work carried out in this Thesis has culminated with the publication of an article in an international journal. This article describes the main ideas developed in the Thesis and shows the main results obtained with the proposed hierarchical auto-tuning methodology. The results are satisfactory and show the efectiveness of applying a hierarchical approach both in the hardware and software levels. In this way, the validity of the proposed methodology is confirmed, due to its capacity to be adapted to any type of computing system and to be extended with other routines, libraries and techniques for the selection of the values of the algorithmic and system parameters. The goal is to continue advancing in the same direction, extending the methodology with new features and considering its inclusión within higher level software that can make use of it to achieve an optimized execution. It is expected that these advances allow the eficient application of the hierarchical auto-tuning methodology in solving scientific and engineering problems.