Week of July 18

This week, I finished a working, multithreaded draft of the custom aligner and was able to run it on the test files with a few different parameters.

Pairwise sequence alignment is well-suited to multithreading because of the nature of the problem; no alignment relies on any other, so performing multiple alignments at once is simple. Once I read in the FASTQ file of query sequences, I know how much memory to allocate for the metadata about each alignment, so that can be set aside in advance to avoid memory collisions (a common side-effect in multi-threaded processes).

After running the aligner with a few different scoring schemes, I settled on using an affine gap penalty going forward. An optimal alignment between two sequences should minimize mismatches and gaps, so these operations are assigned penalty scores. A scoring scheme with an affine gap penalty has two different gap penalties: one for initializing a gap, and one for extending a gap. If the gap initialization penalty is harsher than the gap extension penalty, this scheme will favor alignments with fewer, longer gaps.

On Wednesday and Thursday, I attended a DISC (Data Intensive Studies Center) workshop on high-throughput sequence alignment on the Tufts high-performance cluster (HPC). The first day was a primer on Linux commands and accessing the cluster, which was mostly unnecessary – however, I did learn a few cluster-related tips. Thursday’s workshop was more useful; I learned to use BWA-mem, a freely-available sequence alignment tool, and to visualize alignments in IGV (Integrative Genomics Viewer).

On Friday, I met with Lenore, Mitch, and Nick about the status of the aligner. Our next step is to find the set of parameters that optimizes our results.

Written on July 22, 2022