Annotation and Clustering of Environmental Shotgun Sequencing Data

Doug Rusch, S. Yooseph, S. Granger, A.L. Halpern, K. Remington, S. Collins, J.C. Venter
Venter Institute

The oceans are teeming with life, especially the microscopic kind. These marine microbes play a crucial role in harvesting and utilizing light energy to recycle nutrients; in the process, the microbes maintain the chemical composition of the water and air that other organisms, including humanity, depend on. Despite their importance, the techniques and computational resources necessary to catalog and study these microbes are only beginning to become available. Our group has used a technique called Environmental Shotgun Sequencing to randomly collect microbes directly from the environment and analyze their DNA. The result is bits and pieces of the genomes of thousands, perhaps even millions, of different organisms. One of the challenges of this dataset is identifying the genes and determining their function. Traditional machine learning techniques for identifying genes are not applicable given the fragmentary nature of the data, so we have taken a different approach. By using the abundant signals that indicate the beginning and end of genes, we have identified all the putative genes, traditionally called open reading frames. Presumably, genes serve an important biological function and will therefore be maintained by selective forces much more than an arbitrary open reading frame. By performing a massive all-against-all sequence similarity search using the predicted open reading frames and all the known proteins, we can build a graph describing the relatedness of the different open reading frames and thus identify likely genes based on their connectedness in protein space. Here we describe the approach we took to analyze 29 million putative genes and provide a summary of the results to date.
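
The two steps described above, calling open reading frames from start and stop signals and then grouping them by similarity, can be illustrated with a small Python sketch. It is not the pipeline used for the 29 million putative genes. It assumes that an open reading frame runs from an ATG start codon to the next in-frame stop codon, that all six reading frames are scanned, and that the similarity search has already been reduced to a list of (sequence, sequence) pairs judged to be related. The function names, the `min_codons` cutoff, and the `min_cluster_size` cutoff are hypothetical choices made for illustration, and the sketch ignores open reading frames that run off the end of a fragment.

```python
from collections import defaultdict

STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Collect ATG-to-stop open reading frames from all six frames.

    Assumption: an ORF starts at the first ATG after the previous stop
    and ends just before the next in-frame stop codon.
    """
    orfs = []
    for strand in (seq.upper(), reverse_complement(seq.upper())):
        for frame in range(3):
            start = None
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append(strand[start:i])
                    start = None
    return orfs


def connected_components(nodes, edges):
    """Group nodes into connected components using union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return list(groups.values())


def well_connected(nodes, edges, min_cluster_size=2):
    """Return the nodes that fall in a component of at least the given size.

    Assumption: membership in a multi-sequence cluster is used here as a
    crude stand-in for "connectedness in protein space".
    """
    kept = set()
    for component in connected_components(nodes, edges):
        if len(component) >= min_cluster_size:
            kept |= component
    return kept
```

In this toy version, an open reading frame with no similarity edges stays in a cluster of size one and is dropped, which mirrors the idea that an arbitrary open reading frame is unlikely to have relatives. A real analysis at this scale would need a far more careful definition of an edge and of a cluster than a simple connected component.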