Par4All: From Convex Array Regions to Heterogeneous Computing

Mehdi Amini, Béatrice Creusillet, Stéphanie Even, Ronan Keryell, Onig Goubier, Serge Guelton, Janice Onanian Mcmahon, François-Xavier Pasquier, Grégoire Péan, Pierre Villalon


HAL Id: hal-00744733
https://hal-mines-paristech.archives-ouvertes.fr/hal-00744733
Submitted on 2 Nov 2012

Mehdi Amini (1,2), Béatrice Creusillet (2), Onig Goubier (2), Serge Guelton (2), François-Xavier Pasquier (2), Stéphanie Even (2), Janice Onanian McMahon (2), Grégoire Péan (2), Pierre Villalon (2)
(1) MINES ParisTech, firstname.lastname@mines-paristech.fr
(2) HPC Project, firstname.lastname@hpc-project.com

Keywords
Heterogeneous computing, convex array regions, source-to-source compilation, polyhedral model, GPU, CUDA, OpenCL.

ABSTRACT
Recent compilers offer an incremental way of porting software to accelerators. For instance, the PGI Accelerator [14] and HMPP [3] require the use of directives. The programmer must select the pieces of source code that are to be executed on the accelerator, and can provide optional directives that act as hints for data allocations and transfers. The compiler then generates all the code automatically.

JCUDA [15] offers a simpler interface to target CUDA from Java. Data transfers are generated automatically for each call. Arguments can be declared as IN, OUT, or INOUT to avoid useless transfers, but no data can be kept in GPU memory between two kernel launches. There have also been several initiatives to automatically transform OpenMP-annotated source code to CUDA [10, 11]. The GPU programming model and the host-accelerator paradigm greatly restrict the potential of this approach, since OpenMP is designed for shared-memory computers. Recent work [6, 9] adds extensions to OpenMP that account for CUDA specifics. These make programs easier to write, but the developer is still responsible for designing and writing the communication code, and usually has to specialize the source code for a particular architecture.

Unlike these approaches, Par4All [13] is an automatic parallelizing and optimizing compiler for sequential C and Fortran programs, funded by the HPC Project startup. The purpose of this source-to-source compiler is to integrate several compilation tools into an easy-to-use yet powerful compiler that automatically transforms existing programs to target various hardware platforms. Heterogeneity is everywhere nowadays, from supercomputers to the mobile world, and the future promises still more of it. Automatically adapting programs to targets such as multicore systems, embedded systems, high-performance computers and GPUs is therefore a critical challenge.

Par4All is mainly based on the PIPS [7, 1] source-to-source compiler infrastructure and benefits from its interprocedural capabilities, such as memory effects, reduction detection and parallelism detection, as well as from polyhedral-based analyses such as convex array regions [4] and preconditions.

The source-to-source nature of Par4All makes it easy to integrate third-party tools into the compilation flow. For instance, we use PIPS to identify the parts of a whole program that are of interest, and we rely on the PoCC [12] polyhedral loop optimizer to optimize memory accesses in these parts, for example to expose locality.

Par4All automates the combination of PIPS analyses and the insertion of other optimizers in the middle of the compilation flow. It uses a programmable pass manager [5] to perform whole-program analysis, spot parallel loops and generate mostly OpenMP, CUDA or OpenCL code.
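As a rough illustration of the kind of output such a flow aims for, the sketch below pairs a sequential loop and a sequential sum with hand-annotated OpenMP counterparts. It is not actual Par4All output: the function names (`saxpy_seq`, `saxpy_omp`, `sum_seq`, `sum_omp`) are hypothetical, and the pragmas were placed by hand to show what a detected parallel loop and a detected reduction typically look like once annotated.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sequential input: each iteration touches only index i,
 * so the iterations are independent of one another. */
void saxpy_seq(int n, float a, const float *x, float *y) {
    assert(x != NULL && y != NULL);
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Hypothetical sequential input: s is only ever updated with "+=",
 * which is the pattern a reduction detection pass looks for. */
double sum_seq(int n, const double *v) {
    assert(v != NULL);
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Hand-written stand-in for generated code: the independent loop
 * becomes a plain parallel loop. */
void saxpy_omp(int n, float a, const float *x, float *y) {
    assert(x != NULL && y != NULL);
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Hand-written stand-in for generated code: the accumulation is
 * expressed with an OpenMP reduction clause instead of a shared update. */
double sum_omp(int n, const double *v) {
    assert(v != NULL);
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)
    for (int i = 0; i < n; i++)
        s += v[i];
    return s;
}
```

With GCC or Clang, OpenMP is enabled with `-fopenmp`. Without that flag the pragmas are ignored and the annotated versions run sequentially, which is one reason directive-based output stays close to the original source.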

To that end, we mainly face two challenges: parallelism detection and data transfer generation. The generation of OpenMP directives relies on coarse-grain parallelization and semantic-based reduction detection [8]. The CUDA and OpenCL targets add the difficulty of data transfer management. We tackle it using convex array regions, which are translated into optimized, interprocedural data transfers between host and accelerator, as described in [2].
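The idea of deriving transfers from array regions can be sketched in a few lines of C. This is a simplified, one-dimensional illustration under stated assumptions, not the scheme of [2]: the accelerator is simulated with `malloc` and `memcpy`, and the `region1d` type and all function names are hypothetical. The point is only that, once the region of an array read and written by a kernel is known as a set of bounds, the copies in each direction can be restricted to that region rather than covering the whole array.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical 1-D convex region: { a[i] | lo <= i <= hi }. */
typedef struct { int lo, hi; } region1d;

/* Number of array elements covered by the region. */
static size_t region_count(region1d r) {
    return (size_t)(r.hi - r.lo + 1);
}

/* Stand-in for an accelerator kernel: buf[0] corresponds to a[r.lo]. */
static void scale_kernel(double *buf, size_t count, double f) {
    for (size_t i = 0; i < count; i++)
        buf[i] *= f;
}

/* Scales a[lo..hi] by f on a simulated accelerator.
 * Returns the number of bytes copied in each direction,
 * or 0 if the simulated device allocation fails. */
size_t scale_on_accelerator(double *a, int lo, int hi, double f) {
    assert(a != NULL && lo <= hi);
    region1d r = { lo, hi };      /* read region and write region of a coincide */
    size_t bytes = region_count(r) * sizeof(double);

    double *dev = malloc(bytes);  /* stands in for a device allocation */
    if (dev == NULL)
        return 0;

    memcpy(dev, a + r.lo, bytes); /* host to accelerator: read region only */
    scale_kernel(dev, region_count(r), f);
    memcpy(a + r.lo, dev, bytes); /* accelerator to host: written region only */

    free(dev);
    return bytes;
}
```

For a ten-element array and the region `2 <= i <= 5`, only four elements travel in each direction and the remaining six are never touched. Real regions are multi-dimensional polyhedra, and real transfers use the target's own allocation and copy calls.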

The demonstration will give the audience a global understanding of the internal compilation flow of Par4All, going through the interprocedural results of PIPS analyses, parallelism detection, data transfer generation and execution of the resulting code. Several benchmark examples and some real-world scientific applications will be used as a showcase.

Figure 1: Speedup relative to naive sequential version for an OpenMP version, a version with basic PGI and HMPP directives, a naive CUDA version, and an optimized CUDA version, all automatically generated from the naive sequential code.

1. REFERENCES
PIPS Is not (only) Polyhedral Software. In First International Workshop on Polyhedral Compilation Techniques, IMPACT, April 2011.


