A Smooth Introduction to SYCL for C++20 aficionados - Joel Falcou - Meeting C++ 2023

Science and Technology

From HPC to mobile development, the prevalence of accelerators and other performance-driven architectures is a fact you can't argue with anymore. What if you want to tap into those sources of performance without sacrificing the elegance of your C++20? What if you want to explore some of those architectures now but change your mind later without dropping all the code you already wrote?
That's where SYCL comes in. SYCL is an open standard aiming to provide a cross-platform programming model, tools and compilers to target accelerators at large. Made to be interoperable with C++, it simplifies the design, debugging and deployment of applications over a large selection of accelerators: multi-core CPUs, GPGPUs or even FPGAs.
In this talk, we give a short tour of SYCL's salient points, how to get started with it, and how it can be smoothly integrated into your C++ code without major changes to your programming habits. We conclude with some results from such an integration in a High Energy Physics application.

Comments: 4

  • @wolpumba4099 · 4 months ago

    *Abstract* In the talk "A Smooth Introduction to SYCL for C++20 aficionados" at Meeting C++ 2023, Joel Falcou, an associate professor and co-founder of a C++ and HPC training company, delves into the challenges faced by developers in the era of performance-driven architectures such as GPUs and FPGAs. Addressing the need for efficient programming without compromising the elegance of C++20, Falcou introduces SYCL, an open standard for cross-platform programming that enables developers to target a variety of accelerators. The presentation provides insight into SYCL's compatibility with C++ syntax, its programming model, and how it facilitates device connection and management through 'queue' objects and memory management techniques. Additionally, Falcou outlines SYCL's support for hierarchical parallelism and its implementation in scientific computing, specifically particle physics. Falcou discusses the use of a custom C++20 library called Kiwaku, which is designed for multidimensional data storage and processing. The library leverages C++20 features and concepts to provide efficient execution contexts, data views, and algorithmic execution. The talk also touches on OneAPI's role as the implementation of SYCL used for the presentation and alternatives to Intel's implementation. Furthermore, Falcou emphasizes SYCL's ease of use, deployability, and forward-thinking approach to concepts and type handling. He concludes with remarks on SYCL's documentation, hardware support updates, and encourages the use of SYCL for working with accelerators. The talk ends with a discussion on device selection, fallback implementations, and CPU support, along with implementation recommendations for those getting started with SYCL. 
    *Chapter Titles*

    *Chapter 1: Introduction to SYCL and Computing Paradigms*
    - 0:00 Introduction and Background
    - 1:24 Overview of Computing Challenges and Tools
    - 3:39 Introduction to SYCL (Pronounced 'sickle')
    - 6:06 OneAPI and Supporting Companies

    *Chapter 2: Understanding the SYCL Programming Model*
    - 7:48 Programming Model and Device Connection
    - 10:38 Device Selection and Queue Management
    - 14:22 Introduction to Shared Memory and Parallel Operations
    - 15:46 Synchronization and Memory Management

    *Chapter 3: Optimizing Memory Operations in SYCL*
    - 17:35 Improving Memory Operations with Buffers
    - 19:57 Automatic Data Transfer with Buffer Destruction

    *Chapter 4: Advanced Parallelism Techniques in SYCL*
    - 21:11 Leveraging Hierarchical Parallelism
    - 22:53 Implementing Algorithms with Hierarchical Parallelism
    - 24:51 Utilizing SYCL-based Parallel STL Implementation

    *Chapter 5: SYCL Applications in High-Performance Computing*
    - 26:00 Application in Scientific Computing
    - 28:16 Overview of Accelerated Computation in Particle Physics

    *Chapter 6: The Role of Custom C++ Libraries in Computational Efficiency*
    - 29:43 Advantages of a Custom C++ Library
    - 31:09 Utilizing C++20 Features and Concepts
    - 31:57 Data Views and Parametric Concepts
    - 33:40 Algorithmic Execution and Slicing Techniques

    *Chapter 7: Performance Considerations Across Different Hardware*
    - 35:38 Contexts and Performance on Different Hardware
    - 37:31 Flexibility and Extensibility of the Library
    - 39:00 Concluding Remarks on Acceleration APIs and the Future

    *Chapter 8: Compilation Aspects and Device Handling*
    - 42:22 Compilation Timing and Options
    - 45:08 Device Selection and Ranking
    - 48:04 Fallback Implementations and Targeting Specific Compilers
    - 49:53 CPU Support and SIMD Vectorization
    - 53:16 Decoding the Secret String and Implementation Recommendations

  • @wolpumba4099 · 4 months ago

    *Summary*

    *Introduction and Background*
    - 00:00 Speaker introduces himself as an associate professor at a computer science lab near Paris and co-founder of Codon, a company focused on C++ and HPC training.
    - 00:30 His research includes parallel computing and creating interfaces and abstractions in C++.

    *Overview of Computing Challenges and Tools*
    - 01:24 Discussion on how increasing computer complexity and core counts create challenges for developers writing efficient code for various hardware systems.
    - 02:04 The difficulty lies in handling threads, vectorizing code, and now, thousands of cores in GPUs and reconfigurable systems.

    *Introduction to SYCL (Pronounced 'sickle')*
    - 03:29 SYCL is an open standard for writing C++ code that targets various computing systems like CPUs, GPUs, and FPGAs.
    - 04:14 It maintains proximity to regular C++ syntax and allows for more accessible reasoning about code and building around it.

    *OneAPI and Supporting Companies*
    - 06:06 OneAPI by Intel is mentioned as the implementation of SYCL used for the presentation, including various Intel-specific libraries and compilers.
    - 06:54 Alternatives to Intel's implementation, like Clang, which supports SYCL starting from version 50, are mentioned.

    *Programming Model and Device Connection*
    - 07:48 The SYCL programming model is compared to other GPU programming models, with differences highlighted in terms of verbosity and explicitness.
    - 09:30 Connection to a device using the 'queue' object is discussed, which acts as an intermediary for data and operation transfer between host and device.

    *Device Selection and Queue Management*
    - 10:38 Describes how developers can select devices based on properties or write custom logic to rank devices based on specific criteria.
    - 12:25 Explains the explicit nature of building queues and the flexibility to manage multiple queues and devices, all operating asynchronously.
    *Introduction to Shared Memory and Parallel Operations*
    - 14:22 Explains the concept of a shared memory block that is not the same as CUDA's shared memory: it is shared between CPU and device, not within the device.
    - 14:48 Describes initiating data transfer to the shared memory block and starting parallel operations using a queue.
    - 15:05 Discusses how C++ lambdas or callable objects can be used as kernel functions for parallel operations.

    *Synchronization and Memory Management*
    - 15:46 Details the process of waiting for the completion of parallel operations and the option to wait on the queue or an event object.
    - 16:12 Once operations are complete, the result is already in the shared memory, and it can be sent back to the system.

    *Improving Memory Operations with Buffers*
    - 17:35 Introduces the use of buffers and accessors to create a relationship between host and device memory for more efficient operations.
    - 18:49 Discusses the use of host accessors to read data back from the device to the CPU.
    - 19:07 Highlights the significance of accessor modifiers to infer task graph dependencies.

    *Automatic Data Transfer with Buffer Destruction*
    - 19:57 Describes how transferring data back to the host is handled automatically by scoping buffers and destroying them after computation is complete.

    *Leveraging Hierarchical Parallelism*
    - 21:11 Explores the concept of work groups and subgroups to exploit different levels of parallelism, which can improve performance on different hardware architectures.

    *Implementing Algorithms with Hierarchical Parallelism*
    - 22:53 Illustrates writing an algorithm using work groups and subgroups to perform parallel computations on a dataset.
    - 24:18 Describes how a regular C++ lambda function is transferred and executed on the device without additional complexity.

    *Utilizing SYCL-based Parallel STL Implementation*
    - 24:51 Examines the Parallel STL implementation by Khronos, which uses a SYCL-based execution policy to run algorithms on the GPU.
    *Application in Scientific Computing*
    - 26:00 Describes the use of parallel computing techniques in scientific research, such as analyzing data from the Large Hadron Collider's ATLAS experiment.

    *Overview of Accelerated Computation in Particle Physics*
    - 28:16 Scientists looking to accelerate computations in particle physics use GPUs and FPGAs to handle hundreds of gigabytes per second during collisions.
    - 28:50 Multiple implementations, such as Nvidia machines with CUDA and SYCL, show significant speedup over CPU versions.

    *Advantages of a Custom C++ Library*
    - 29:43 Discusses a C++20 library called Kiwaku designed for multidimensional data storage, highlighting its flexibility and efficiency.
    - 30:04 Kiwaku differentiates itself from other libraries by combining owning and non-owning data structures and offering an API for algorithm and interface definitions.
    - 30:53 The library focuses on data storage and processing in a configurable way, avoiding linear algebra and expression templates.

    *Utilizing C++20 Features and Concepts*
    - 31:09 Emphasizes the use of C++20 features like template metaprogramming and concepts to handle data processing efficiently.
    - 31:33 Describes creating execution contexts that users can define themselves, differing from execution policies.

    *Data Views and Parametric Concepts*
    - 31:57 Explains the creation and definition of data views with named parameter interfaces and complex deduction guides for ease of use.
    - 32:44 Discusses the importance of parametric concepts in handling complicated types without relying on specific implementations like zip.

    *Algorithmic Execution and Slicing Techniques*
    - 33:40 Provides examples of using algorithms to handle data, including transforming views and creating subranges similar to practices in MATLAB or NumPy.
    - 34:50 Describes more complex slicing and the transformation of data using custom algorithmic contexts for specific hardware.
    *Contexts and Performance on Different Hardware*
    - 35:38 Showcases how they offer a variety of contexts, including CPU and GPU contexts, to run algorithms effectively on diverse hardware.
    - 36:10 Discusses the performance results of complex computations using the CPU and GPU, demonstrating the efficiency of the library's design.

    *Flexibility and Extensibility of the Library*
    - 37:31 The library aims to provide proper implementations for a wide range of algorithms and support complex operations by leveraging simple base operations.
    - 38:15 Shares the experience of integrating SYCL into their C++20 codebase, which took about two weeks to wrap elements correctly and achieve good performance.

    *Concluding Remarks on Acceleration APIs and the Future*
    - 39:00 Mentions the thorough documentation provided by the Khronos Group on using their APIs and the support for updating hardware support in SYCL.
    - 40:31 Encourages the use of tools like SYCL for working with accelerators to combine knowledge of business algorithms and machine-specific expertise.
    - 41:01 Praises SYCL's simplicity, deployability, and compatibility with C++20 and its forward-thinking approach to concepts and type handling.
    - 41:34 Acknowledges the openness of the Khronos Group for feedback and concludes with a special thanks to their PhD student who contributed to the graphs and explanations.

    *Compilation Timing and Options*
    - 42:22 Compiling SYCL code for devices can happen either ahead of time, like PTX for CUDA, or at runtime, adapting to available hardware.
    - 42:45 You can pre-select the target device (CPU or GPU) and the compiler will compile for that device ahead of time.
    - 43:31 There is also a just-in-time compilation option.
    - 44:03 Partial compilation was previously supported, but the current status is unclear.

    *Device Selection and Ranking*
    - 45:08 You can set multiple conditions for device selection to ensure the best match for execution.
    - 45:29 A device can be chosen based on whether it fulfills a certain condition or set of conditions.
    - 45:37 Device aspects and deny lists allow for a fine-grained selection of required or unwanted features.
    - 46:12 Custom ranking functions enable prioritization of devices based on scores for their properties.

    *Fallback Implementations and Targeting Specific Compilers*
    - 48:04 For writing open-source libraries, it's possible to provide a fallback implementation for users with different compilers.
    - 48:32 There's an option to use a CPU scheduling library implementation if the user's compiler lacks specific support.
    - 48:45 TriSYCL is suggested for users who want to ensure compatibility across different systems.

    *CPU Support and SIMD Vectorization*
    - 49:53 A question about how SYCL can get close to handwritten SIMD and threading code targeting CPUs.
    - 50:20 CPU support for threading uses platforms like OpenMP or TBB in the backend, providing solid multithreading.
    - 50:33 SIMD support quality depends on the compiler backend and its auto-vectorization capabilities.
    - 51:54 A specialized platform backend for SYCL could potentially utilize a more advanced vectorization system.

    *Decoding the Secret String and Implementation Recommendations*
    - 53:16 The secret string is a hello world message from SYCL or oneAPI.
    - 53:59 Clang can be used directly for SYCL implementation and is recommended for getting started with SYCL.
    - 54:11 The oneAPI Docker image is suggested for those who don't want to install compilers locally.
    - 55:03 For advanced users needing CUDA support, oneAPI provides detailed setup documentation for Linux distributions.

    Disclaimer: I used gpt4-1106 to summarize the video transcript. This method may make mistakes in recognizing words and it can't distinguish between speakers.

  • @miroslavhoudek7085 · 4 months ago

    I remember that my colleague was tasked to choose between CUDA and OpenCL some 10 years ago. He liked OpenCL much more, because it allowed him to run code on both cards. But then it turned out that AMD did not support running OpenCL headless, without an X server running and a screen connected. Also, CUDA was slightly more consistent. And that's how OpenCL was removed from the feasible standards. It's hard to also get the usability and reliability right, even if the idea is great, I guess.
