Implementation of the Lattice Boltzmann Method on Multiple Devices Using OpenCL
The scientific computing community has long been closely tied to high-performance computing (HPC), which for many years was the privilege of a limited group of scientists. Recently, with the rapid development of graphics processing units (GPUs), parallel processing power comparable to that of high-performance computers has become available on common desktop computers, thereby reducing the cost of scientific calculations. In this paper, we develop a general-purpose lattice Boltzmann code that runs on a standard computer with multiple heterogeneous devices supporting the OpenCL specification. Different approaches to implementing the lattice Boltzmann code on a standard computer with multiple devices were explored. Simulation results for the different multiple-device code implementations were compared to each other, to the results obtained with the single-device implementation, and to results from the literature. On commodity computing hardware platforms, the multiple-device implementations showed a significant speed improvement compared to the simulation implemented on a single device.

The computer processor industry reached a turning point a few years ago, when improvements in processor performance hit a serious frequency wall. Major CPU vendors started manufacturing multi-core CPUs, and all major GPU vendors shifted to designing many-core GPUs. With the development of multi-core and many-core hardware architectures, numerical computer simulations have spread to almost all fields of science and engineering. Recently, the lattice Boltzmann method (LBM) has emerged as an alternative method for computational fluid dynamics (CFD) and has proven its ability to simulate a wide variety of fluid flows. LBM is computationally expensive and memory intensive, but because the scheme is explicit and the governing equations are local (only nearest-neighbor information is required), the method is very well suited to parallel computing on multi-core and many-core hardware. The graphics processing unit (GPU) is a massively multithreaded architecture and is now widely used for both graphics and non-graphics computations. The main advantage of GPUs is their ability to perform many more floating-point operations (FLOPs) per unit of time than CPUs. To unify software development across different hardware devices (primarily GPUs), the OpenCL standard for programming heterogeneous platforms has been established. There is, however, a considerable cost associated with exploiting the full potential of modern multi-core CPUs and many-core GPUs: sequential code must be (re)written to explicitly expose algorithmic parallelism, and the available programming models are often vendor specific.

The main goal of the present work is to implement the lattice Boltzmann method according to the OpenCL specification, with the most computationally intensive parts of the algorithm running on multiple heterogeneous devices, resulting in a simulation speed-up compared to a single-device implementation. A further goal is to show that, using the Java programming language together with OpenCL, all devices available on standard computer hardware can be used to speed up scientific simulations. Additionally, two different implementations for a standard computer with several heterogeneous devices are created and their performances are compared.
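To illustrate this goal, the following is a minimal host-side sketch, assuming the JOCL bindings (org.jocl) used in this work, of how a Java program can enumerate every OpenCL platform and device available on a standard computer; the class name ListDevices is purely illustrative and is not part of the code described in this paper.

import org.jocl.CL;
import org.jocl.cl_device_id;
import org.jocl.cl_platform_id;

// Minimal sketch: list every OpenCL platform and device visible to the host.
public class ListDevices {
    public static void main(String[] args) {
        CL.setExceptionsEnabled(true);

        // Query the number of available OpenCL platforms, then fetch them.
        int[] numPlatforms = new int[1];
        CL.clGetPlatformIDs(0, null, numPlatforms);
        cl_platform_id[] platforms = new cl_platform_id[numPlatforms[0]];
        CL.clGetPlatformIDs(platforms.length, platforms, null);

        for (cl_platform_id platform : platforms) {
            // Query all devices (CPUs, GPUs, accelerators) exposed by this platform.
            int[] numDevices = new int[1];
            CL.clGetDeviceIDs(platform, CL.CL_DEVICE_TYPE_ALL, 0, null, numDevices);
            cl_device_id[] devices = new cl_device_id[numDevices[0]];
            CL.clGetDeviceIDs(platform, CL.CL_DEVICE_TYPE_ALL, devices.length, devices, null);

            System.out.println("Found platform with " + devices.length + " OpenCL device(s).");
        }
    }
}

Each device found this way can later receive its own command queue and its own subdomain of the lattice.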
Implementations are developed using the Java programming language for the host (control) program and the OpenCL specification for the kernels, which are written to parallelize parts of the algorithm across two or more heterogeneous devices. The connection between the host (Java) and kernel (OpenCL) programs is provided by the JOCL Java library. The simulation was run on three different commodity hardware platforms. The performance of the implementations is compared, and it is concluded that implementations running on two or more OpenCL devices perform better than the implementation on a single device.

Multi-GPU implementations of LBM using CUDA have been widely discussed in the literature. One study presents an implementation of cavity flow using the D3Q19 lattice model, the multiple-relaxation-time (MRT) approximation, and CUDA; the simulation was tested on a node consisting of six Tesla C1060 devices, and POSIX threads were used to implement the parallelism. Another work describes cavity flow for different depth-to-width aspect ratios using the D3Q19 model and the MRT approximation; the simulation is parallelized using OpenMP and tested on a single-node multi-GPU system consisting of three nVIDIA M2070 devices or three nVIDIA GTX560 devices. A further study presented an LBM implementation for fluid flow through porous media on multiple GPUs, also using CUDA together with MPI, and proposed some optimization strategies based on data structure and layout; that implementation was tested on a single-node cluster equipped with four Tesla C1060 devices. Other authors adopted the message passing interface (MPI) technique for GPU management on a GPU cluster and explored accelerating a cavity flow implementation by overlapping communication and calculation; in that reference the D3Q19 model and the MRT approximation were also used. Xian described a CUDA implementation of flow around a sphere using the D3Q19 model and the MRT approximation; the code parallelism is based on the MPI library, the communication time is reduced by a solution-domain partitioning method or by overlapping calculation and communication using multiple streams, and the calculations were carried out on a supercomputer equipped with 170 Tesla S1070 nodes (680 GPUs). Single-phase, multi-phase, and multi-component LBM have also been implemented on multi-GPU clusters using CUDA and OpenMP. So far, very few OpenCL implementations of LB codes have been described in the literature. One comparison of CUDA and OpenCL LBM implementations on a single compute device shows that properly structured OpenCL code achieves performance levels close to those achieved with the CUDA architecture. To the best of the author's knowledge, no papers have been published regarding the implementation of LBM using Java and OpenCL on several common computer devices.
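To make the host/kernel split described above concrete, the following is a rough, hypothetical JOCL sketch of how a Java host compiles an OpenCL kernel from source and launches it on one device. The kernel name collide_stream, its trivial body, and the lattice size are illustrative assumptions, not the actual code of this work; only the JOCL/OpenCL calls themselves follow the standard API.

import static org.jocl.CL.*;
import org.jocl.*;

// Hypothetical sketch: build an OpenCL kernel from source and launch it via JOCL.
public class HostSketch {
    public static void main(String[] args) {
        setExceptionsEnabled(true);

        // Pick the first platform and its first device (for illustration only).
        cl_platform_id[] platforms = new cl_platform_id[1];
        clGetPlatformIDs(1, platforms, null);
        cl_device_id[] devices = new cl_device_id[1];
        clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_ALL, 1, devices, null);

        cl_context context = clCreateContext(null, 1, devices, null, null, null);
        cl_command_queue queue = clCreateCommandQueue(context, devices[0], 0, null);

        // Placeholder kernel: one work-item per lattice node (real collision and
        // streaming logic would go here).
        String source =
            "__kernel void collide_stream(__global float* f, __global float* fNew) {\n" +
            "    int gid = get_global_id(0);\n" +
            "    fNew[gid] = f[gid];\n" +
            "}\n";

        cl_program program = clCreateProgramWithSource(context, 1, new String[]{ source }, null, null);
        clBuildProgram(program, 0, null, null, null, null);
        cl_kernel kernel = clCreateKernel(program, "collide_stream", null);

        // One distribution array and its shadow copy (two-lattice algorithm).
        int n = 256 * 256;  // illustrative lattice size
        float[] f = new float[n];
        cl_mem fBuf = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                (long) Sizeof.cl_float * n, Pointer.to(f), null);
        cl_mem fNewBuf = clCreateBuffer(context, CL_MEM_READ_WRITE,
                (long) Sizeof.cl_float * n, null, null);

        clSetKernelArg(kernel, 0, Sizeof.cl_mem, Pointer.to(fBuf));
        clSetKernelArg(kernel, 1, Sizeof.cl_mem, Pointer.to(fNewBuf));

        // Launch one work-item per node, then read the updated values back.
        clEnqueueNDRangeKernel(queue, kernel, 1, null, new long[]{ n }, null, 0, null, null);
        clEnqueueReadBuffer(queue, fNewBuf, true, 0, (long) Sizeof.cl_float * n, Pointer.to(f), 0, null, null);
    }
}

In a multi-device setup, a queue, program, kernel, and pair of buffers of this kind would be created per device, and the global work size would correspond to the size of that device's subdomain.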
A. Lattice Boltzmann equation

In the lattice Boltzmann method, the motion of the fluid is simulated by the movement and collision of particles on a uniform lattice, and the fluid is modeled by a single-particle distribution function. The evolution of the distribution function is governed by the lattice Boltzmann equation

f_i(\mathbf{x} + \mathbf{e}_i \Delta t,\; t + \Delta t) = f_i(\mathbf{x}, t) + \Omega_i,   (1)

where f_i is the distribution function for the particle with velocity \mathbf{e}_i at position \mathbf{x} and time t, \Delta t is the time increment, and \Omega_i is the collision operator. The equation states that the distribution function of the particles streamed to the neighboring node at the next time step equals the distribution function of the current particle plus the collision operator.

The streaming of a particle distribution function occurs over the time step \Delta t across the distance \Delta x between neighboring lattice sites. The collision operator models the rate of change of the distribution function due to molecular collisions. A collision model proposed by Bhatnagar, Gross, and Krook (BGK) simplifies the analysis of the lattice Boltzmann equation. Using the LB-BGK approximation, equation (1) can be written as

f_i(\mathbf{x} + \mathbf{e}_i \Delta t,\; t + \Delta t) = f_i(\mathbf{x}, t) - \frac{1}{\tau}\left[ f_i(\mathbf{x}, t) - f_i^{eq}(\mathbf{x}, t) \right].   (2)

The above equation is the well-known LBGK model, and it is consistent with the Navier-Stokes equations in the limit of small Mach number and incompressible flow. In equation (2), f_i^{eq} is the local equilibrium distribution and \tau is a single relaxation parameter associated with collisional relaxation towards local equilibrium.

For a given application, a lattice Boltzmann model must be chosen. Most research articles use the D2Q9 model, and it was also used in this work. The name implies that the model is two-dimensional and that at each lattice point there are nine velocities (N = 9) along which the particles can move. The equilibrium particle distribution function for the D2Q9 model is given by

f_i^{eq} = w_i \rho \left[ 1 + 3\,\frac{\mathbf{e}_i \cdot \mathbf{u}}{c^2} + \frac{9}{2}\,\frac{(\mathbf{e}_i \cdot \mathbf{u})^2}{c^4} - \frac{3}{2}\,\frac{\mathbf{u}^2}{c^2} \right],   (3)

where \mathbf{u} and \rho are the macroscopic velocity and density, respectively, the lattice speed c = \Delta x / \Delta t has a magnitude of one in this model, and the weights w_i are given by w_0 = 4/9, w_i = 1/9 for i = 1, \dots, 4, and w_i = 1/36 for i = 5, \dots, 8. The discrete velocities for D2Q9 are given by

\mathbf{e}_0 = (0, 0), \quad \mathbf{e}_i = c\,(\cos[(i-1)\pi/2],\, \sin[(i-1)\pi/2]) \text{ for } i = 1, \dots, 4, \quad \mathbf{e}_i = \sqrt{2}\,c\,(\cos[(i-5)\pi/2 + \pi/4],\, \sin[(i-5)\pi/2 + \pi/4]) \text{ for } i = 5, \dots, 8.

The macroscopic quantities can be evaluated as

\rho = \sum_i f_i, \qquad \rho\,\mathbf{u} = \sum_i f_i\,\mathbf{e}_i,

and the macroscopic kinematic viscosity is given by \nu = c_s^2 (\tau - 1/2)\,\Delta t, where c_s = c/\sqrt{3} is the lattice speed of sound. Equation (2) is generally solved in the following two steps:

collision: \tilde{f}_i(\mathbf{x}, t) = f_i(\mathbf{x}, t) - \frac{1}{\tau}\left[ f_i(\mathbf{x}, t) - f_i^{eq}(\mathbf{x}, t) \right],
streaming: f_i(\mathbf{x} + \mathbf{e}_i \Delta t,\; t + \Delta t) = \tilde{f}_i(\mathbf{x}, t),

where \tilde{f}_i denotes the distribution function after the collision, and f_i(\mathbf{x} + \mathbf{e}_i \Delta t,\; t + \Delta t) is the value of the distribution function after the streaming and collision operations are completed. The third step in implementing LBM is imposing the boundary conditions. In the present work, the bounce-back boundary condition was applied at the walls because it is easy to implement and gives reasonable results in a simple bounded domain. For the moving lid, the equilibrium scheme was used.

Implementations of the lattice Boltzmann method for several heterogeneous devices are presented in this section. The main difference between these implementations is the transfer of data to and from the heterogeneous OpenCL devices; both implementations use the same OpenCL kernels. The D2Q9 model is used for the data representation, so the particle distribution functions are stored in nine arrays. Since OpenCL does not support two-dimensional arrays, the data are mapped from a two-dimensional array to a one-dimensional array. The two-lattice algorithm is used for both implementations of the lattice Boltzmann method. Because this algorithm handles the data dependency of the streaming phase by storing distribution values in duplicate lattices, a shadow copy of the particle distribution function arrays is created. The arrays are divided into subdomains along the X direction, one for each device (multi-core CPU or many-core GPU), and the size of each subdomain depends on the characteristics of that device. Since the edge information produced by the streaming phase must be exchanged between solver iterations, an additional ghost layer is created. This layer is used to exchange particle distribution function data between devices and contains only the edge information that needs to be exchanged; this minimizes the amount of data copied from device to host and from host to device.
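As a rough illustration of this ghost-layer exchange, the sketch below copies the last interior column of the left subdomain into the ghost column of the right subdomain through a small host-side buffer, using blocking JOCL transfers. The method name, the assumed column-major memory layout (index = x * height + y), and the buffer handles are hypothetical and serve only to show the minimal amount of data that crosses the host.

import static org.jocl.CL.*;
import org.jocl.Pointer;
import org.jocl.Sizeof;
import org.jocl.cl_command_queue;
import org.jocl.cl_mem;

// Hypothetical sketch of the ghost-layer exchange between two neighbouring subdomains.
public class GhostExchangeSketch {

    // Copy the rightmost interior column of the left device's distribution array
    // into the left ghost column of the right device's array (one distribution
    // function; the exchange is repeated for every function crossing the boundary).
    static void exchangeRightEdge(cl_command_queue leftQueue, cl_mem leftBuf,
                                  cl_command_queue rightQueue, cl_mem rightBuf,
                                  int width, int height) {
        float[] edge = new float[height];                    // host-side staging buffer
        long columnBytes = (long) Sizeof.cl_float * height;

        // Assumed layout: column-major (index = x * height + y), with one ghost
        // column on each side, so column 0 and column width - 1 are ghost columns.
        long srcOffset = (long) Sizeof.cl_float * (width - 2) * height; // left's last interior column
        long dstOffset = 0L;                                            // right's ghost column (column 0)

        // Blocking read from the left device, then blocking write to the right device.
        clEnqueueReadBuffer(leftQueue, leftBuf, true, srcOffset, columnBytes,
                Pointer.to(edge), 0, null, null);
        clEnqueueWriteBuffer(rightQueue, rightBuf, true, dstOffset, columnBytes,
                Pointer.to(edge), 0, null, null);
    }
}

Only these edge columns travel through the host; the bulk of each distribution array stays resident on its device between iterations.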