Optimal reference sequence selection for genome assembly using minimum description length principle Sequence and Genome Analysis

Research output: Contribution to journal, Article

3 Citations (Scopus)

Abstract

Reference-assisted assembly uses a reference sequence as a model to assist in the assembly of a novel genome. The standard method for identifying the best reference sequence counts the number of reads that align to each candidate reference sequence and chooses the candidate with the highest count. This article explores the use of the minimum description length (MDL) principle and two of its variants, two-part MDL and Sophisticated MDL, in identifying the optimal reference sequence for genome assembly. Comparing the proposed MDL-based scheme with the standard method, the article concludes that "counting the number of reads of the novel genome present in the reference sequence" is not a sufficient condition. The proposed MDL scheme therefore incorporates the standard method of "counting the number of reads that align to the reference sequence" and additionally evaluates the model, that is, the reference sequence itself, when identifying the optimal reference sequence. The proposed MDL-based scheme not only provides a sufficient criterion for identifying the optimal reference sequence for genome assembly but also improves the reference sequence so that it becomes more suitable for the assembly of the novel genome.
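The contrast drawn in the abstract between read counting and a two-part MDL criterion can be illustrated with a minimal sketch. This is not the article's scheme: the exact-substring test standing in for read alignment, the coding costs (2 bits per base, one flag bit per read, a log2 position code), and all function names are assumptions made purely for illustration.

```python
import math

def count_aligned(reads, ref):
    """Standard criterion: number of reads found in the reference.
    Assumption: exact substring match stands in for read alignment."""
    return sum(1 for r in reads if r in ref)

def two_part_mdl_bits(reads, ref):
    """Toy two-part code length L(model) + L(data | model), in bits.
    Assumption: the coding costs below are illustrative, not the article's."""
    model_bits = 2 * len(ref)  # model cost: 2 bits per base (A, C, G, T)
    pos_bits = math.log2(len(ref)) if len(ref) > 1 else 0.0
    data_bits = 0.0
    for r in reads:
        if r in ref:
            data_bits += 1 + pos_bits    # flag bit + position in the reference
        else:
            data_bits += 1 + 2 * len(r)  # flag bit + read spelled out base by base
    return model_bits + data_bits

def best_by_mdl(reads, refs):
    """Pick the candidate reference with the shortest total description."""
    return min(refs, key=lambda name: two_part_mdl_bits(reads, refs[name]))

# Hypothetical data: both candidates contain every read, so counting cannot
# separate them, while the model cost penalises the longer candidate.
reads = ["ACGT", "CGTA", "GTAC"]
refs = {
    "short": "ACGTAC",
    "long": "ACGTAC" + "T" * 20,
}
```

In this toy setup both candidates receive the same read count, yet the description length differs because the cost of the model itself enters the score. That is the sense in which counting alone does not discriminate between the two candidates here.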

Original language: English
Article number: 18
Journal: Eurasip Journal on Bioinformatics and Systems Biology
Volume: 2012
Issue number: 1
DOI: 10.1186/1687-4153-2012-18
Publication status: Published - 2012
Externally published: Yes

Fingerprint

Sequence Analysis
Genome
Genes
Counting
Sufficient
Sufficient Conditions

ASJC Scopus subject areas

  • Signal Processing
  • Statistics and Probability
  • Computer Science(all)
  • Medicine(all)
  • General

Cite this

@article{03ac5dc6f72045e69f4d1c3c6bde4455,
title = "Optimal reference sequence selection for genome assembly using minimum description length principle Sequence and Genome Analysis",
abstract = "Reference assisted assembly requires the use of a reference sequence, as a model, to assist in the assembly of the novel genome. The standard method for identifying the best reference sequence for the assembly of a novel genome aims at counting the number of reads that align to the reference sequence, and then choosing the reference sequence which has the highest number of reads aligning to it. This article explores the use of minimum description length (MDL) principle and its two variants, the two-part MDL and Sophisticated MDL, in identifying the optimal reference sequence for genome assembly. The article compares the MDL based proposed scheme with the standard method coming to the conclusion that {"}counting the number of reads of the novel genome present in the reference sequence{"} is not a sufficient condition. Therefore, the proposed MDL scheme includes within itself the standard method of {"}counting the number of reads that align to the reference sequence{"} and also moves forward towards looking at the model, the reference sequence, as well, in identifying the optimal reference sequence. The proposed MDL based scheme not only becomes the sufficient criterion for identifying the optimal reference sequence for genome assembly but also improves the reference sequence so that it becomes more suitable for the assembly of the novel genome.",
author = "Bilal Wajid and Erchin Serpedin and Mohamed Nounou and Hazem Nounou",
year = "2012",
doi = "10.1186/1687-4153-2012-18",
language = "English",
volume = "2012",
journal = "Eurasip Journal on Bioinformatics and Systems Biology",
issn = "1687-4145",
publisher = "Springer Publishing Company",
number = "1",

}

TY - JOUR

T1 - Optimal reference sequence selection for genome assembly using minimum description length principle Sequence and Genome Analysis

AU - Wajid, Bilal

AU - Serpedin, Erchin

AU - Nounou, Mohamed

AU - Nounou, Hazem

PY - 2012

Y1 - 2012

AB - Reference assisted assembly requires the use of a reference sequence, as a model, to assist in the assembly of the novel genome. The standard method for identifying the best reference sequence for the assembly of a novel genome aims at counting the number of reads that align to the reference sequence, and then choosing the reference sequence which has the highest number of reads aligning to it. This article explores the use of minimum description length (MDL) principle and its two variants, the two-part MDL and Sophisticated MDL, in identifying the optimal reference sequence for genome assembly. The article compares the MDL based proposed scheme with the standard method coming to the conclusion that "counting the number of reads of the novel genome present in the reference sequence" is not a sufficient condition. Therefore, the proposed MDL scheme includes within itself the standard method of "counting the number of reads that align to the reference sequence" and also moves forward towards looking at the model, the reference sequence, as well, in identifying the optimal reference sequence. The proposed MDL based scheme not only becomes the sufficient criterion for identifying the optimal reference sequence for genome assembly but also improves the reference sequence so that it becomes more suitable for the assembly of the novel genome.

UR - http://www.scopus.com/inward/record.url?scp=84887046249&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84887046249&partnerID=8YFLogxK

U2 - 10.1186/1687-4153-2012-18

DO - 10.1186/1687-4153-2012-18

M3 - Article

VL - 2012

JO - Eurasip Journal on Bioinformatics and Systems Biology

JF - Eurasip Journal on Bioinformatics and Systems Biology

SN - 1687-4145

IS - 1

M1 - 18

ER -