Abstract | Over the past decade, microprocessor design has shifted to a model
in which the microprocessor has multiple homogeneous processing units, known as cores, as
a result of heat dissipation and energy consumption issues. Meanwhile, the need
for high-performance computing has increased the demand for heterogeneity in computing
systems. The current trend in gaining high computing power is to incorporate
specialized processing resources, such as manycore Graphics Processing Units (GPUs),
into multicore systems, thus making the computing system heterogeneous.
Maximum performance of data-parallel scientific applications on heterogeneous platforms
can be achieved by balancing the load between heterogeneous processing elements.
Data-parallel applications can be load balanced by partitioning data
according to the performance of the platform’s computing devices. However, load
balancing on such platforms is complicated by several factors, such as contention for
shared system resources, non-uniform memory access (NUMA), limited GPU memory, and the low
bandwidth of the PCIe bus that connects the host processor and the GPU.
In this thesis, we present methods of performance modeling and performance measurement
on dedicated multicore and multi-GPU systems. We model a multicore and
multi-GPU system as a set of heterogeneous abstract processors determined by the configuration
of the parallel application. Each abstract processor represents a processing
unit consisting of one processing element, or a group of processing elements, that executes one computational kernel
of the application. We group processing units by shared resources and measure the performance
of processing units in each group simultaneously, thereby taking into account
the influence of resource contention. We investigate how resource contention and,
on NUMA systems, process mapping affect the performance
of processing units. Using the proposed measurement method, we
build functional performance models of abstract processors and use these models to partition the data of
data-parallel applications so that the workload is balanced.
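
To make the idea of model-based data partitioning concrete, the following is a minimal sketch and not the partitioning algorithm developed in this thesis. It assumes that each abstract processor is described by a non-decreasing function giving its execution time for a given number of data units, and it searches by bisection for a common time budget under which the processors together absorb the whole workload. The names `largest_load`, `partition`, `cpu_time` and `gpu_time`, as well as all speeds and the 1000-unit GPU memory limit, are hypothetical values chosen only for illustration.

```python
from typing import Callable, List

TimeFn = Callable[[int], float]  # data units -> execution time in seconds


def largest_load(time_fn: TimeFn, budget: float, n: int) -> int:
    """Largest integer x in [0, n] with time_fn(x) <= budget.

    Assumes time_fn is non-decreasing and time_fn(0) == 0.
    """
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if time_fn(mid) <= budget:
            lo = mid
        else:
            hi = mid - 1
    return lo


def partition(time_fns: List[TimeFn], n: int, iters: int = 60) -> List[int]:
    """Split n data units so that all processors finish at about the same time."""
    # The fastest processor alone can finish everything within t_hi.
    t_lo, t_hi = 0.0, min(f(n) for f in time_fns)
    for _ in range(iters):
        t_mid = (t_lo + t_hi) / 2
        if sum(largest_load(f, t_mid, n) for f in time_fns) >= n:
            t_hi = t_mid  # budget is sufficient, try a smaller one
        else:
            t_lo = t_mid  # budget is too small
    parts = [largest_load(f, t_hi, n) for f in time_fns]
    # Integer rounding may leave a small surplus; take it from the slowest finisher.
    surplus = sum(parts) - n
    while surplus > 0:
        i = max(range(len(parts)), key=lambda k: time_fns[k](parts[k]))
        parts[i] -= 1
        surplus -= 1
    return parts


def cpu_time(x: int) -> float:
    # Hypothetical CPU core group: constant speed of 100 units/s.
    return x / 100.0


def gpu_time(x: int) -> float:
    # Hypothetical GPU: 400 units/s while data fits in device memory
    # (1000 units), 150 units/s for the out-of-core remainder.
    if x <= 1000:
        return x / 400.0
    return 1000 / 400.0 + (x - 1000) / 150.0


parts = partition([cpu_time, gpu_time], 3000)
print(parts, cpu_time(parts[0]), gpu_time(parts[1]))
```

Because the hypothetical GPU slows down once its memory limit is exceeded, a single constant speed per device would misjudge its share for large problems; describing performance as a function of problem size is what lets the partitioner account for such effects.
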
We evaluate the proposed methods with two typical data-parallel applications:
parallel matrix multiplication and numerical simulation of lid-driven cavity flow. Experimental
results demonstrate that data partitioning algorithms based on functional performance
models built using the proposed methods are able to balance the workload of
data-parallel applications on heterogeneous multicore and multi-GPU platforms. |