Biological Sequence Alignment Using Graphics Processing Units
My thesis subject was the implementation and evaluation of a GPU accelerated protein database search tool, using the Smith-Waterman optimal sequence alignment algorithm.
Update: a paper has been published on my thesis material.
Alignment algorithms are used to find similarity between biological sequences, such as DNA and proteins. By aligning a sequence with a database, similar sequences can be found. These can be used to identify the source of a query sequence, to find commonalities between organisms, or to infer an ancestral relation. Various methods of performing biological sequence alignment exist, including dynamic programming and heuristic methods. Dynamic programming methods are guaranteed to find all optimal alignments, but are relatively slow; heuristic methods are faster but less precise. This thesis investigates the acceleration of one such optimal algorithm, the Smith-Waterman local sequence alignment algorithm, by using graphics processing units (GPUs). A fully functioning GPU-based protein database search tool was designed, implemented and optimized. The optimizations mostly concern the elimination of memory bottlenecks and the conversion of the database to a format well suited for GPU use. The final implementation offers the same features its CPU-based counterparts do, such as user configurable scoring and substitution matrix settings, and includes a web interface for convenient and remote usage. The performance of the GPU accelerated implementation was evaluated and compared to other solutions. It was found to attain a performance of more than 21 GCUPS (giga cell-updates of the Smith-Waterman score matrix per second) when searching the October 2010 release of Swiss-Prot on an NVIDIA Geforce GTX 275 GPU. With this performance, it is the fastest known GPU implementation on comparable hardware. It is also faster than the BLAST heuristic. However, the cost of purchasing a GPU, its power consumption, and the relative difficulty of maintaining a GPU software product are disadvantages of GPU acceleration.
The implementation has been named GASW, for 'GPU Accelerated Smith-Waterman'. Features:
- CUDA-based implementation of the Smith-Waterman sequence alignment algorithm for proteins, optimized for NVIDIA GT200-class GPUs.
- Aligns FASTA-format sequences with FASTA-format databases such as Swiss-Prot.
- Only returns alignment scores, not the actual alignments. However, top scoring sequences can be exported for alignment by tools that do offer this feature (FASTA SSearch).
- Supports all FASTA-format substitution matrices and user configurable affine gap penalties.
- Lenient limits on database and query sequence lengths.
- Comes with an easy to use web interface that allows the tool to be used conveniently and remotely. Can automatically process top scoring sequences with FASTA SSearch to produce the actual alignments.
Thesis - Introduction to biological sequence alignment, GPU acceleration thereof, the implementation and how it was optimized, benchmark results.
User's guide - Explains how to use the tool, limitations, setting up the web interface, and building everything from source.
Program distribution - Binaries and source code (GPLv3). Includes sample data files.
Benchmark results on a machine running a 2.4 GHz Intel Q6600 CPU, 4 GB RAM, NVIDIA Geforce GTX 275 with 896 MB of memory and Windows 7.
The command-line tool.