Week of July 11
I wrapped up the bulk of the writing for the aligner this week and tested it on toy examples. The next challenge will be formatting the output and further speeding up the program. The output will need to be in SAM format to support a downstream analysis tool. SAM is an acronym for “Sequence Alignment/Map.” I’m working with three common (to the bioinformatics field) file formats this summer: FASTA, FASTQ, and SAM.
The reference sequences are stored in FASTA format. FASTA files record DNA, RNA, or protein sequence information. A single line beginning with a ‘>’ contains descriptive metadata about the sequence, followed by a line containing the sequence itself. In our case, each file only contains one sequence.
The query sequences, or repair products, are in FASTQ format. FASTQ is similar to FASTA, but contains more information. As in FASTA files, each sequence is recorded first with a single line of description (this time, beginning with the ‘@’ character). The next line is, as in FASTA format, the sequence itself. A line of separation is marked with a ‘+’ , and the final line in a sequence record contains the quality scores for each base in the sequence. These quality scores use the Phred scale, which, for each base call, calculates a quality Q = -10logE, where E is the probability of error in that base call. For each base, the score is reported as a single ASCII character. The FASTQ files for my data contain up to 2.5 million reads, but I’m currently working with a more reasonably-sized test file (~168,000).
Finally, my output needs to be in SAM format. This is the most complicated of the three; it contains all of the same information as the FASTQ file, as well as some additional information about the alignment (many alignment tools produce SAM files). SAM files are tab-delimited; they begin with a header containing tags denoted by the ‘@’ character. These tags reference metadata about the entire file. The body of the file contains 10 mandatory fields, including the query sequence, Phred/quality sequence, and CIGAR string. As of the end of this week, I have basic SAM output working, with a few placeholder values for fields that may not be relevant to our work.
What I’m Reading The SAM file specification