Multithreaded LU Decomposition Analysis

Programming Language:C++, CUDA

Repository: https://github.com/TiagoJoseMagalhaes/lu_decomp

Description

This work was developed as a part of a parallel computing class and its objective was to understand different ways to implement the same algorithm in a parallel way. The 3 ways we had to implement it were single-threaded (just to understand how the algorithm works in its simplest form), using OpenMP, and using CUDA. We implemented matrix LU decomposition using Doolittle’s method. The OpenMP implementation achieved quite nice results, however, the most interesting one was the CUDA implementation. This is the project where I learned that usually iterative algorithms don’t adapt themselves well to highly parallelized environments, in our case we hit the max thread limit per group of Cuda very quickly, and could not find a way to split computations over different groups, this yielded not only a very small maximum matrix size but when accounting for data transfer time over the PCI bus made the usage of Doolittle’s method in CUDA make no sense at all. This being said using another algorithm to do this operation is an option as proven by CUBLAS, which can do this operation very efficiently. Overall it was an interesting experiment, however, we did wish our teacher bothered to explain to us what algorithm we were using, to begin with since most of us were not acquainted with the vast myriad of matrix LU decomposition methods.