E. Agullo, O. Aumage, B. Bramas, O. Coulaud, and S. Pitoiset, Bridging the gap between OpenMP and task-based runtime systems for the fast multipole method, IEEE Trans. Parallel Distrib. Syst, vol.28, issue.10, pp.2794-2807, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01517153

E. Agullo, B. Bramas, O. Coulaud, E. Darve, M. Messner et al., Task-based FMM for multicore architectures, SIAM J. Sci. Comput, vol.36, issue.1, pp.66-93, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00807368

E. Anderson, LAPACK Users' Guide, 1999.

P. Atkinson, S. McIntosh-Smith, B. R. de Supinski, S. L. Olivier, C. Terboven et al., On the performance of parallel tasking runtimes for an irregular fast multipole method application, IWOMP 2017, vol.10468, pp.92-106, 2017.

C. Augonnet, S. Thibault, R. Namyst, and P. Wacrenier, StarPU: a unified platform for task scheduling on heterogeneous multicore architectures, Concurr. Comput.: Pract. Exper, vol.23, issue.2, pp.187-198, 2011.
URL : https://hal.archives-ouvertes.fr/inria-00384363

A. R. Benson, J. Poulson, K. Tran, B. Engquist, and L. Ying, A parallel directional fast multipole method, SIAM J. Sci. Comput, vol.36, issue.4, pp.335-352, 2014.

C. Bordage, Parallelization on heterogeneous multicore and multi-GPU systems of the fast multipole method for the Helmholtz equation using a runtime system, Proceedings of the Sixth International Conference on Advanced Engineering Computing and Applications in Sciences, pp.90-95, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00773114

G. Bosilca, A. Bouteiller, A. Danalis, M. Faverge, T. Hérault et al., PaRSEC: exploiting heterogeneity to enhance scalability, Comput. Sci. Eng, vol.15, issue.6, pp.36-45, 2013.

Z. Budimlić, A. Chandramowlishwaran, K. Knobe, G. Lowney, V. Sarkar et al., Multicore implementations of the Concurrent Collections programming model, 14th Workshop on Compilers for Parallel Computing, 2009.

A. Buttari, J. Langou, J. Kurzak, and J. Dongarra, A class of parallel tiled linear algebra algorithms for multicore architectures, Parallel Comput, vol.35, issue.1, pp.38-53, 2009.

A. Chandramowlishwaran, K. Knobe, and R. Vuduc, Performance evaluation of Concurrent Collections on high-performance multicore computing systems, 24th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2010), pp.1-12, 2010.

F. A. Cruz, M. G. Knepley, and L. A. Barba, PetFMM: a dynamically load-balancing parallel fast multipole library, Int. J. Numer. Methods Eng, vol.85, issue.4, pp.403-428, 2011.

E. Darve, C. Cecka, and T. Takahashi, The fast multipole method on parallel clusters, multicore processors, and graphics processing units, Comptes Rendus Mécanique, vol.339, issue.2, pp.185-193, 2011.

U. Dastgeer, C. Kessler, and S. Thibault, Flexible runtime support for efficient skeleton programming on hybrid systems, Proceedings of the ParCo-2011 International Conference on Parallel Computing, vol.22, pp.159-166, 2011.
URL : https://hal.archives-ouvertes.fr/inria-00606200

A. Duran, OmpSs: a proposal for programming heterogeneous multi-core architectures, Parallel Process. Lett, vol.21, issue.02, pp.173-193, 2011.

Efield®.

J. Enmyren and C. Kessler, SkePU: a multi-backend skeleton programming library for multi-GPU systems, Proceedings of the 4th International Workshop on High-Level Parallel Programming and Applications (HLPP-2010), ACM, 2010.

A. Ernstsson, L. Li, and C. Kessler, SkePU 2: flexible and type-safe skeleton programming for heterogeneous parallel systems, Int. J. Parallel Program, vol.46, issue.1, 2018.

J. Filipovic and S. Benkner, OpenCL kernel fusion for GPU, Xeon Phi and CPU, Proceedings of the 27th International Symposium on Computer Architecture and High-Performance Computing (SBAC-PAD 2015), pp.98-105, 2015.

J. Filipovic, M. Madzin, J. Fousek, and L. Matyska, Optimizing CUDA code by kernel fusion: application on BLAS, J. Supercomput, vol.71, pp.3934-3957, 2015.

K. Fukuda, M. Matsuda, N. Maruyama, R. Yokota, K. Taura et al., Tapas: an implicitly parallel programming framework for hierarchical n-body algorithms, 2016 IEEE 22nd International Conference on Parallel and Distributed Systems (ICPADS), pp.1100-1109, 2016.

D. Gelernter and N. Carriero, Coordination languages and their significance, Commun. ACM, vol.35, issue.2, pp.97-107, 1992.

B. Gijsbers and C. Grelck, An efficient scalable runtime system for macro data flow processing using S-Net, Int. J. Parallel Program, vol.42, issue.6, pp.988-1011, 2014.

F. Gouin, Methodology for image processing algorithms mapping on massively parallel architectures, MINES ParisTech, 2018.

F. Gouin, C. Ancourt, and C. Guettier, An up to date mapping methodology for GPUs, 20th Workshop on Compilers for Parallel Computing (CPC 2018), 2018.
URL : https://hal.archives-ouvertes.fr/hal-01759238

C. Grelck, J. Julku, and F. Penczek, Distributed S-Net: cluster and grid computing without the hassle, 12th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, 2012.

C. Grelck, S. Scholz, and A. Shafarenko, Asynchronous stream processing with S-Net, Int. J. Parallel Program, vol.38, issue.1, pp.38-67, 2010.

C. Grelck, S. B. Scholz, and A. Shafarenko, Coordinating data parallel SAC programs with S-Net, Proceedings of the 21st IEEE International Parallel and Distributed Processing Symposium (IPDPS 2007), 2007.

K. Gupta, J. A. Stuart, and J. D. Owens, A study of persistent threads style GPU programming for GPGPU workloads, Innovative Parallel Computing - Foundations and Applications of GPU, Manycore, and Heterogeneous Systems (INPAR 2012), pp.1-14, 2012.

L. Gürel and O. Ergül, Hierarchical parallelization of the multilevel fast multipole algorithm (MLFMA), Proc. IEEE, vol.101, pp.332-341, 2013.

M. Holm, S. Engblom, A. Goude, and S. Holmgren, Dynamic autotuning of adaptive fast multipole methods on hybrid multicore CPU and GPU systems, SIAM J. Sci. Comput, vol.36, issue.4, 2014.

C. Kessler, Programmability and performance portability aspects of heterogeneous multi-/manycore systems, Proceedings of the DATE-2012 Conference on Design, Automation and Test in Europe, pp.1403-1408, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00776610

K. Knobe, Ease of use with Concurrent Collections (CnC), USENIX Workshop on Hot Topics in Parallelism, 2009.

J. Kurzak and B. M. Pettitt, Massively parallel implementation of a fast multipole method for distributed memory machines, J. Parallel Distrib. Comput, vol.65, issue.7, pp.870-881, 2005.

I. Lashuk, A massively parallel adaptive fast multipole method on heterogeneous architectures, Commun. ACM, vol.55, issue.5, pp.101-109, 2012.

L. Li and C. Kessler, Lazy allocation and transfer fusion optimization for GPU-based heterogeneous systems, Proceedings of the Euromicro PDP-2018 International Conference on Parallel, Distributed, and Network-Based Processing, pp.311-315, 2018.

M. Li, M. Francavilla, F. Vipiana, G. Vecchi, and R. Chen, Nested equivalent source approximation for the modeling of multiscale structures, IEEE Trans. Antennas Propag, vol.62, issue.7, pp.3664-3678, 2014.

M. Li, M. Francavilla, F. Vipiana, G. Vecchi, Z. Fan et al., A doubly hierarchical MoM for high-fidelity modeling of multiscale structures, IEEE Trans. Electromagn. Compat, vol.56, issue.5, pp.1103-1111, 2014.

M. Li, M. A. Francavilla, R. Chen, and G. Vecchi, Wideband fast kernel-independent modeling of large multiscale structures via nested equivalent source approximation, IEEE Trans. Antennas Propag, vol.63, issue.5, pp.2122-2134, 2015.

H. Ltaief and R. Yokota, Data-driven execution of fast multipole methods, Concurr. Comput.: Pract. Exper, vol.26, issue.11, pp.1935-1946, 2014.

A. Maghazeh, U. D. Bordoloi, U. Dastgeer, A. Andrei, P. Eles et al., Latency-aware packet processing on CPU-GPU heterogeneous systems, Proceedings of the Design Automation Conference (DAC), 2017.

J. R. Mautz and R. F. Harrington, Electromagnetic scattering from homogeneous material body of revolution, Arch. Electron. Übertragungstech, vol.33, pp.71-80, 1979.

M. Nilsson, Fast numerical techniques for electromagnetic problems in frequency domain, 2003.

F. Penczek, W. Cheng, C. Grelck, R. Kirner, B. Scheuermann et al., A data-flow based coordination approach to concurrent software engineering, 2nd Workshop on Data-Flow Execution Models for Extreme Scale Computing (DFM 2012), 2012.

F. Penczek, Parallel signal processing with S-Net, Procedia Comput. Sci, vol.1, issue.1, pp.2079-2088, 2010.

J. M. Pérez, R. M. Badia, and J. Labarta, A dependency-aware task-based programming environment for multi-core architectures, Proceedings of the 2008 IEEE International Conference on Cluster Computing, pp.142-151, 2008.

-. Puma,

B. Qiao, O. Reiche, F. Hannig, and J. Teich, Automatic kernel fusion for image processing DSLs, Proceedings of the 21st International Workshop on Software and Compilers for Embedded Systems, 2018.

S. Rao, D. Wilton, and A. Glisson, Electromagnetic scattering by surfaces of arbitrary shape, IEEE Trans. Antennas Propag, vol.30, issue.3, pp.409-418, 1982.

S. M. Seo and J. F. Lee, A fast IE-FFT algorithm for solving PEC scattering problems, IEEE Trans. Magn, vol.41, issue.5, pp.1476-1479, 2005.

J. Song, C. C. Lu, and W. C. Chew, Multilevel fast multipole algorithm for electromagnetic scattering by large complex objects, IEEE Trans. Antennas Propag, vol.45, issue.10, pp.1488-1493, 1997.

S. Thibault, On Runtime Systems for Task-based Programming on Heterogeneous Platforms, Habilitation à diriger des recherches, L'Université, 2018.
URL : https://hal.archives-ouvertes.fr/tel-01959127

P. Thoman, H. Jordan, and T. Fahringer, Adaptive granularity control in task parallel programs using multiversioning, Euro-Par 2013, vol.8097, pp.164-177, 2013.

M. Tillenius, SuperGlue: a shared memory framework using data versioning for dependency-aware task-based parallelization, SIAM J. Sci. Comput, vol.37, issue.6, 2015.
DOI : 10.1137/140989716

URL : http://uu.diva-portal.org/smash/get/diva2:868641/FULLTEXT01

M. Tillenius, E. Larsson, R. M. Badia, and X. Martorell, Resource-aware task scheduling, ACM Trans. Embedded Comput. Syst, vol.14, issue.1, p.25, 2015.
DOI : 10.1145/2638554

URL : http://dl.acm.org/ft_gateway.cfm?id=2638554&type=pdf

S. Velamparambil and W. C. Chew, Analysis and performance of a distributed memory multilevel fast multipole algorithm, IEEE Trans. Antennas Propag, vol.53, issue.8, pp.2719-2727, 2005.
DOI : 10.1109/tap.2005.851859

F. Vipiana, M. Francavilla, and G. Vecchi, EFIE modeling of high-definition multiscale structures, IEEE Trans. Antennas Propag, vol.58, issue.7, pp.2362-2374, 2010.

M. Wahib and N. Maruyama, Scalable kernel fusion for memory-bound GPU applications, Proceedings of the International Conference for High-Performance Computing, Networking, Storage and Analysis (SC 2014), pp.191-202, 2014.
DOI : 10.1109/sc.2014.21

G. Wang, Y. Lin, and W. Yi, Kernel fusion: an effective method for better power efficiency on multithreaded GPU, Proceedings of the IEEE/ACM International Conference on Green Computing and Communications and International Conference on Cyber, pp.344-350, 2010.
DOI : 10.1109/greencom-cpscom.2010.102

Y. Wen, M. F. O'Boyle, and C. Fensch, MaxPair: enhance OpenCL concurrent kernel execution by weighted maximum matching, Proceedings of the GPGPU-11, 2018.

A. Yarkhan, J. Kurzak, and J. Dongarra, Quark users' guide: queueing and runtime for kernels, 2011.

A. Zafari, R. Wyrzykowski, J. Dongarra, and E. Deelman, TaskUniVerse: a task-based unified interface for versatile parallel execution, PPAM 2017, vol.10777, pp.169-184, 2018.
DOI : 10.1007/978-3-319-78024-5_16

URL : http://arxiv.org/pdf/1705.02970

A. Zafari, Task parallel implementation of a solver for electromagnetic scattering problems, 2018.

A. Zafari, E. Larsson, and M. Tillenius, DuctTeip: an efficient programming model for distributed task-based parallel computing, 2019.

P. Zaichenkov, B. Gijsbers, C. Grelck, O. Tveretina, and A. Shafarenko, The cost and benefits of coordination programming: two case studies in Concurrent Collections (CnC) and S-Net, Parallel Process. Lett, vol.26, issue.3, 2016.

B. Zhang, Asynchronous task scheduling of the fast multipole method using various runtime systems, 2014 Fourth Workshop on Data-Flow Execution Models for Extreme Scale Computing, pp.9-16, 2014.

K. Zhao, M. N. Vouvakis, and J. F. Lee, The adaptive cross approximation algorithm for accelerated method of moments computations of EMC problems, IEEE Trans. Electromagn. Compat, vol.47, issue.4, pp.763-773, 2005.
