Information Theoretic Approach to Bioinformatics

DNA sequencing, referring to the process of determining the precise order of nucleotides within a DNA molecule, is the basic workhorse of modern day biology and medicine. The DNA is only observed through a large number of reads, which are short segments sampled from random locations of the sequence, and corrupted by noise. The main task in de novo assembly is to merge the reads, and determine the underlying sequence. 

Comparison of the finished DNAs and study of the process of the underlying dynamics show that DNAs of different individuals are significantly similar, and hence, an already assembled DNA can be used as a representative example to facilitate sequencing. Reference genomes are typically used as a guide on which new genomes are built, enabling them to be assembled much more quickly and cheaply than de novo  assembly. Reference-based shotgun sequencing refers to the reconstruction of DNA based on both a reference and set of reads. This is equivalent is to figure out the variations between the target DNA and the one used as a reference. 

A common theme in all variations of this problem is to extract a desired information from a set of observations (measurements) with stochastic behavior, where a certain cost has to be paid per unit of observation. We are interested in characterizing the fundamental limits of the problem, such as the minimum number of observation from which the desired information can be reliably determined. Moreover, understanding the trade-off between the cost of observations and the complexity of information extraction is a very interesting problem, which lies in the scope of this work.

Information Theoretic Approach to Bioinformatics