Week of June 27

This week was my first week back to work after a 10-day trip to the west coast. The view from my desk isn’t quite the same as the one from my tent in Olympic National Park, but I came home to the news that a paper I contributed to last year had been accepted to the 2022 ACM-BCB conference. I spent Monday helping prep the final “camera-ready” submission for the conference.

Olympic National Park

With the paper turned in, I turned my attention to developing our custom aligner. I am modifying the Needleman-Wunsch global alignment algorithm to solve our pairwise alignment problem and hopefully better identify sequencing errors in the data. Most alignment tools are developed with the goal of identifying functional, structural, or evolutionary relationships between a query sequence (DNA or protein) and the set of known genes or proteins belonging to the organism of interest. Identifying the best match for a query is a well-studied computational problem.

In our project, there is no need to search a database for a match; we’re simply trying to identify mutations that occurred during repair. The data contains ~1,100 DNA constructs, or unique sequences, each of which has been cloned, broken, and repaired thousands of times. For each of these 1,100, there is a ground truth/original sequence to compare the repair products to, and the goal is simply to align the repair product to the original sequence so we can analyze any mutations that occurred. Thus a certain amount of mystery (and computational complexity) is removed from the process.

The Needleman-Wunsch algorithm is a classic global sequence alignment algorithm in the computational biology field. Developed in 1970 by Saul Needleman and Christian Wunsch, it is a dynamic programming algorithm that assigns a score s to all sub-alignments of two sequences n and m. This is best conceptualized as an n by m matrix, where each cell at position (i, j) contains the optimal score of the subalignment up to and including the bases at (i, j). Two bases will either align (as a match or a mismatch), or we will insert a gap in one of the two sequences; each of these operations is associated with a numeric score that is combined with that of each possible preceding sub-alignment. We compare each of these sub-alignment scores, pick the best one, and proceeed. When the entire matrix is complete, the score in the bottom right-most cell represents the overall alignment score, and we can retrieve the alignment itself by retracing our steps from that final cell.

My adaptation will leverage additional information about base call quality as well as biological intuition about the position of each base to identify sequencing errors during the alignment step.

DREAM Mentorship Program

During my west-coast trip, I had my first group mentoring meeting (over Zoom) with Dr. Craig Miller at DePaul University and a few members of my DREAM cohort. We introduced ourselves and practiced giving short descriptions of our projects. This week, I had a one-on-one meeting with Dr. Miller and further discussed my project and the direction the mentoring program will go over the summer.

What I’m Reading

Y. Liu, R.S. Zou, S. He, Y. Nihongaki, X. Li, S. Razavi, B. Wu, T. Ha, “Very fast CRISPR on demand,” Science 368, 1265-1269, June 2020. doi: 10.1126/science.aay8204

Written on July 1, 2022