Minimum description length based selection of reference sequences for comparative assemblers

Bilal Wajid, Erchin Serpedin

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

5 Citations (Scopus)

Abstract

Genome sequences are the most basic, yet most essential, pieces of data in all biological analysis. A genome sequence is the solution to the genome assembly problem, which reconstructs the entire sequence from a set of reads that are unordered and very short. The genome assembly problem is therefore quite complex and is broadly divided into de novo and comparative assembly. Comparative assembly uses a reference sequence, closely related to the unassembled genome, to determine the relative order of the reads with respect to one another, and then joins them together to form the sequence. This paper explores all variants of Minimum Description Length (MDL) to find the best reference sequence for comparative assembly. It examines two-part MDL, Sophisticated MDL and MiniMax Regret, and finds that Sophisticated MDL performs better than two-part MDL, whereas MiniMax Regret is unsuitable owing to the nature of the problem. The proposed scheme is prior-free and can be incorporated into the data preprocessing stage of any comparative assembler, allowing the assembly process to use the best reference sequence available.
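As a rough illustration of the two-part MDL idea named in the abstract, the sketch below scores each candidate reference by the bits needed to describe the reference itself plus the bits needed to describe the reads given that reference, and picks the candidate with the smallest total. It is not the authors' published procedure: the coding scheme (2 bits per base, ungapped placement, a per-mismatch cost) and all function names are assumptions made for illustration only.

```python
import math

def read_cost_bits(read, reference):
    """Bits to encode one read given a reference (toy, ungapped code; an assumption)."""
    if not read:
        return 1.0
    literal = 2.0 * len(read)  # fallback: spell the read out at 2 bits per base
    n_positions = len(reference) - len(read) + 1
    if n_positions <= 0:
        return 1.0 + literal
    # Best ungapped placement of the read, measured by Hamming distance.
    mismatches = min(
        sum(a != b for a, b in zip(read, reference[i:i + len(read)]))
        for i in range(n_positions)
    )
    # Start position + mismatch count + (offset, substituted base) per mismatch.
    aligned = (
        math.log2(n_positions)
        + math.log2(len(read) + 1)
        + mismatches * (math.log2(len(read)) + math.log2(3))
    )
    # One flag bit says whether the read is coded as aligned or literal.
    return 1.0 + min(aligned, literal)

def two_part_mdl_bits(reads, reference):
    """Two-part description length: L(model) + L(data | model)."""
    model_bits = 2.0 * len(reference)  # the reference plays the role of the model
    data_bits = sum(read_cost_bits(r, reference) for r in reads)
    return model_bits + data_bits

def select_reference(reads, candidates):
    """Return the name of the candidate with the shortest total description."""
    return min(candidates, key=lambda name: two_part_mdl_bits(reads, candidates[name]))

# Hypothetical toy data: reads drawn from the "close" candidate.
reads = ["ACGTAC", "GTTGCA", "TACGTT"]
candidates = {"close": "ACGTACGTTGCA", "distant": "TTTTTTTTTTTT"}
print(select_reference(reads, candidates))
```

Because both toy candidates have the same length, the model term cancels and the choice is driven by how cheaply the reads can be described against each reference. With candidates of different lengths the model term would also penalise longer references.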

Original language: English
Title of host publication: Proceedings 2011 IEEE International Workshop on Genomic Signal Processing and Statistics, GENSIPS'11
Pages: 230-233
Number of pages: 4
Publication status: Published - 2011
Externally published: Yes
Event: 2011 IEEE International Workshop on Genomic Signal Processing and Statistics, GENSIPS'11 - San Antonio, TX, United States
Duration: 4 Dec 2011 to 6 Dec 2011

Other

Other: 2011 IEEE International Workshop on Genomic Signal Processing and Statistics, GENSIPS'11
Country: United States
City: San Antonio, TX
Period: 4/12/11 to 6/12/11

Fingerprint

Genome
Genes
Emotions

Keywords

  • Comparative assembly
  • Genome assembly
  • MiniMax regret
  • Minimum description length
  • Sophisticated MDL
  • Two-part MDL

ASJC Scopus subject areas

  • Biochemistry, Genetics and Molecular Biology (miscellaneous)
  • Computational Theory and Mathematics
  • Signal Processing
  • Biomedical Engineering

Cite this

Wajid, B., & Serpedin, E. (2011). Minimum description length based selection of reference sequences for comparative assemblers. In Proceedings 2011 IEEE International Workshop on Genomic Signal Processing and Statistics, GENSIPS'11 (pp. 230-233). [6169487]

Minimum description length based selection of reference sequences for comparative assemblers. / Wajid, Bilal; Serpedin, Erchin.

Proceedings 2011 IEEE International Workshop on Genomic Signal Processing and Statistics, GENSIPS'11. 2011. p. 230-233 6169487.

Wajid, B & Serpedin, E 2011, Minimum description length based selection of reference sequences for comparative assemblers. in Proceedings 2011 IEEE International Workshop on Genomic Signal Processing and Statistics, GENSIPS'11., 6169487, pp. 230-233, 2011 IEEE International Workshop on Genomic Signal Processing and Statistics, GENSIPS'11, San Antonio, TX, United States, 4/12/11.
Wajid B, Serpedin E. Minimum description length based selection of reference sequences for comparative assemblers. In Proceedings 2011 IEEE International Workshop on Genomic Signal Processing and Statistics, GENSIPS'11. 2011. p. 230-233. 6169487
Wajid, Bilal ; Serpedin, Erchin. / Minimum description length based selection of reference sequences for comparative assemblers. Proceedings 2011 IEEE International Workshop on Genomic Signal Processing and Statistics, GENSIPS'11. 2011. pp. 230-233
@inproceedings{68f8034e1eb14db9a55c4c4c5847e0ae,
title = "Minimum description length based selection of reference sequences for comparative assemblers",
keywords = "Comparative assembly, Genome assembly, MiniMax regret, Minimum description length, Sophisticated MDL, Two-part MDL",
author = "Bilal Wajid and Erchin Serpedin",
year = "2011",
language = "English",
isbn = "9781467304900",
pages = "230--233",
booktitle = "Proceedings 2011 IEEE International Workshop on Genomic Signal Processing and Statistics, GENSIPS'11",

}

TY - GEN

T1 - Minimum description length based selection of reference sequences for comparative assemblers

AU - Wajid, Bilal

AU - Serpedin, Erchin

PY - 2011

Y1 - 2011

KW - Comparative assembly

KW - Genome assembly

KW - MiniMax regret

KW - Minimum description length

KW - Sophisticated MDL

KW - Two-part MDL

UR - http://www.scopus.com/inward/record.url?scp=84863707709&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84863707709&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:84863707709

SN - 9781467304900

SP - 230

EP - 233

BT - Proceedings 2011 IEEE International Workshop on Genomic Signal Processing and Statistics, GENSIPS'11

ER -