Model-based Parallelization for Simulink Models on Multicore CPUs and GPUs

In this paper, we propose a model-based approach that parallelizes Simulink models on multicore CPUs and NVIDIA GPUs at the block level and generates CUDA C code for parallel execution. In the proposed approach, a Simulink model is converted to a directed acyclic graph (DAG) based on its block diagram, in which nodes represent tasks of grouped blocks in the model and edges represent the communication between blocks. Next, a path analysis is conducted on the DAG to extract all execution paths and calculate the length of each path, which comprises the execution times of the tasks and the communication times of the edges on the path. An integer linear programming (ILP) formulation is then used to minimize the length of the critical path of the DAG, which represents the execution time of the Simulink model. The ILP formulation also balances the workload on each CPU core for efficient hardware utilization. We evaluate the effectiveness of the proposed approach by parallelizing an image processing model on a platform of two homogeneous CPU cores and two GPUs.


I. INTRODUCTION
Model-based development (MBD) with platforms such as Simulink [1] is widely used to model and simulate complex systems. In many MBD applications, a Simulink model (or a part of one) processes a large amount of data, as in image processing and scientific computation, and the execution of such models can be accelerated by running their data-parallel blocks on graphics processing units (GPUs) rather than on central processing unit (CPU) cores. To implement these Simulink models on a platform with both CPUs and GPUs, it is critical both to extract the data-parallel blocks for execution on the GPUs and to parallelize the remaining Simulink blocks with a proper workload balance across the CPU cores. To the best of our knowledge, no algorithm has been proposed to achieve this goal.
In a previous study [2], an integer linear programming (ILP)-based approach was proposed to parallelize Simulink models on single-ISA heterogeneous multicore processors. That approach assigns blocks to heterogeneous cores by minimizing the total inter-core communication cost of the assignment. A platform of CPUs and GPUs is also a heterogeneous architecture, but one in which data communication between the CPU and the GPU is heavy, while algorithms that process large amounts of data in parallel may still run much faster on GPUs than on CPUs despite that cost. Thus, minimizing the communication cost alone cannot solve the parallelization problem when GPUs are involved.
In this paper, a model-based parallelization approach is proposed to parallelize data-parallel Simulink models on homogeneous multicore CPUs and NVIDIA GPUs [3]. The target architecture is a platform of homogeneous CPU cores and GPUs in which the numbers of CPU cores and GPUs are equal, so that multiple CUDA kernels can be executed concurrently on different devices. We parallelized and executed an image processing model on a platform of GPUs and a homogeneous multicore CPU to evaluate the approach and observed a reasonable speedup. Fig. 1 provides an overview of the proposed approach, which finds a block-level parallelization solution for a given Simulink model and generates CUDA C code for execution on the target architecture.

ISOCC 2019

II. PROPOSED APPROACH
The proposed approach targets single-rate Simulink models in which algorithms such as image processing are implemented with basic blocks or MATLAB Function blocks. First, we group contiguous data-processing blocks into subsystems with the clustering method given in [2]; these grouped blocks may then be assigned to GPU execution according to the block assignment determined by the ILP formulation. Next, empty Atomic Subsystems are added on the input and output signal lines of these subsystems for directed acyclic graph (DAG) generation and code generation. To convert the model to a DAG, we use these empty subsystems to partition the blocks into tasks and build the DAG, in which nodes represent the tasks and edges represent the signal lines between blocks. For code generation, CUDA code is added in place of the code of these empty subsystems to convert the grouped subsystems into CUDA kernels. We generate the DAG from the block diagram after adding the empty subsystems and generate sequential code for the model with Simulink Coder. The generated code is then executed on the target platform to estimate the execution times of the tasks in the DAG on the CPU and of the data-processing tasks on the GPU. Because a data-processing task may run faster on a CPU than on a GPU once the heavy GPU-CPU communication cost is included, we also profile the target platform to estimate both the CPU inter-core and the GPU-CPU communication overhead.
Tasks of the generated DAG are assigned to the GPUs and CPU cores of the target platform with an ILP formulation. We use depth-first search on the DAG to find all execution paths; the length of each path is the sum of the estimated execution times of its tasks and the communication times of its edges. The longest of these paths is the critical path of the DAG, whose length represents the total execution time of the model on the target platform. The objective of the ILP formulation is therefore to minimize the length of the critical path, and thereby the execution time of the model, by assigning tasks to CPU cores or GPUs. To reduce idle time on the CPU cores, the ILP formulation also balances the workload on each core, which consists of the execution times of the tasks assigned to that core, the execution times and launch overhead of the CUDA kernels called by that core, and the communication overhead.
Finally, we expand the task assignment solution to the Simulink blocks and generate parallel code based on the block assignment. We use the code generation method in [2] to generate POSIX threads for the CPU cores. The data generated by Simulink Coder is copied to the GPU at the start of each thread, and the processed data is copied back to the CPU at the end of each thread. For blocks assigned to GPUs, the for-loops of these blocks are converted to CUDA kernels, and the necessary code is added in place of the adjacent empty Atomic Subsystems. Each thread uses one dedicated GPU to execute all of its CUDA kernels, so that kernels called from different CPU cores can execute concurrently.

III. EXPERIMENTS
To evaluate our approach, we implemented the Sobel edge detection algorithm in a Simulink model (Fig. 2) and parallelized it with the proposed approach. We used two cores of an Intel i7-6700 CPU at 4.00 GHz and two TITAN X (Pascal) GPUs as the target platform. Fig. 3 shows the generated DAG of the model, where the nodes are colored according to the assignment solution produced by the proposed approach. Fig. 4 shows the speedup achieved by the proposed parallelization, compared with the execution time of the sequential C code generated from the original input model with Simulink Coder and executed on a single CPU core, and with the execution time of the sequential C code in which the for-loops of the image processing blocks are converted to CUDA kernels and executed on one CPU core and one GPU. The results show that the proposed approach parallelizes such data-processing Simulink models on GPUs and CPUs with a reasonable speedup, where concurrent CUDA kernels never have to wait for an available GPU. In future work, we plan to extend our approach to platforms in which threads on different CPU cores share the same GPU resources.

IV. CONCLUSION
In this paper, we proposed an ILP-based, model-based parallelization approach for data-parallel Simulink models on CPU cores and GPUs. We parallelized an image processing model with the proposed approach and achieved a 6.75x speedup.