
Background Metagenomics has great potential to reveal previously unattainable information about microbial communities. […] need more attention in the study of the oral cavity and Crohn's disease.

Conclusions By taking account of similarity in the genomic sequence, TAEC outperforms other available tools in estimating taxonomic composition at a very low rank, especially when closely related species or strains exist in a metagenomic sample.

Electronic supplementary material The online version of this article (doi:10.1186/1471-2105-15-242) contains supplementary material, which is available to authorized users.

[…] is generated and aligned against the reference database. If a read is aligned to multiple genomes, it is assigned to the genomes whose alignment scores are greater than or equal to a cutoff defined from the read's maximum alignment score and a parameter in [0,1]. The value of this parameter depends on the length of reads and the complexity of the sample data: shorter reads and more complex datasets require a higher value to distinguish highly similar genomes. The ratio between the numbers of reads assigned to two genomes can represent the probability that reads originating from one of them are assigned to the other, […], where […] denotes the number of reads assigned to […], for all genomes in the reference database.

Elimination stage Many genomes share more or less similarity in the genomic sequence, but each genome has its unique regions, which differentiate it from other genomes. Therefore, if a genome is truly present in a sample, there must be some reads uniquely assigned to it, as long as the depth of coverage is high enough. We use this uniqueness to identify genomes whose presence in the result of an alignment tool is most likely due to sequence similarity to the true genomes in a sample.
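The assignment rule and the uniqueness idea can be illustrated with a minimal Python sketch. It is not TAEC's implementation and it does not reproduce Eq. (1). It assumes that the cutoff is the read's best alignment score multiplied by a fraction in [0,1], and it simply flags genomes that receive no uniquely assigned reads. The function names, the parameters `frac` and `min_unique`, and the toy scores are all hypothetical.

```python
from collections import defaultdict


def assign_reads(scores, frac):
    """Assign each read to every genome scoring >= frac * (best score of that read).

    scores: dict mapping read id -> dict mapping genome id -> alignment score
            (positive scores assumed, e.g. bit scores).
    frac:   assumed cutoff fraction in [0, 1]; 1.0 keeps only the top-scoring genome(s).
    """
    assignments = {}
    for read_id, per_genome in scores.items():
        best = max(per_genome.values())
        assignments[read_id] = {g for g, s in per_genome.items() if s >= frac * best}
    return assignments


def count_reads(assignments):
    """Return (total, unique) read counts per genome.

    A read counts as unique for a genome when that genome is its only assignment.
    """
    total = defaultdict(int)
    unique = defaultdict(int)
    for genomes in assignments.values():
        for g in genomes:
            total[g] += 1
        if len(genomes) == 1:
            unique[next(iter(genomes))] += 1
    return dict(total), dict(unique)


def genomes_without_unique_reads(assignments, min_unique=1):
    """List genomes with fewer than min_unique uniquely assigned reads."""
    total, unique = count_reads(assignments)
    return sorted(g for g in total if unique.get(g, 0) < min_unique)


# Hypothetical toy data: six reads scored against three genomes.
toy_scores = {
    "r1": {"gA": 100, "gB": 98},
    "r2": {"gA": 100},
    "r3": {"gA": 100, "gB": 100},
    "r4": {"gC": 90, "gB": 60},
    "r5": {"gC": 95},
    "r6": {"gB": 80, "gC": 80},
}

toy_assignments = assign_reads(toy_scores, frac=0.95)
print(count_reads(toy_assignments))
print(genomes_without_unique_reads(toy_assignments))
```

In this toy input, genome `gB` only ever appears alongside another genome, so it is the one flagged as a candidate false genome. Raising `frac` toward 1 makes the assignment stricter, which is the direction the text recommends for short reads and complex samples.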
To this end, each read is assigned to the genome(s) with the highest alignment score, and a binary matrix is created whose entries record whether a given read is assigned to a given genome; its dimensions are the number of reads and the number of genomes present in the result of an alignment tool. For example, consider the BLAST output for a small set of six reads: […]. We inductively solve the following equation (a simple example of how Eq. (1) works and an equivalent iterative algorithm are provided in Additional file 1): […] (1) until […], where the equation involves a permutation matrix that permutes columns and a matrix that subtracts columns […]. For the example, the matrix becomes […], i.e., only G3 and G4 are possible true genomes. In practice, reads can be assigned to some random genomes due to sequencing and alignment errors, so the stopping criterion for Eq. (1) can be relaxed such that […], where […] is a user-defined minimum number of reads for a genome to be included in the subsequent analysis. The whole elimination procedure can be iterated using the non-parametric bootstrap [18]. In the bootstrap, the number of occurrences of […] is used as a criterion to decide whether the genome is a false genome: if it exceeds a user-defined number, the genome is considered false and removed.

Correction stage In the elimination stage, the uniqueness of genomes is used to remove false genomes, disregarding accuracy in the number of reads assigned to each genome. In the example data, genomes G1 and G2 are removed. In the correction stage, the number of reads assigned to each.