Genome-wide detection of segmental duplications and potential assembly errors in the human genome sequence.

Joseph Cheung, Xavier Estivill, Razi Khaja, Jeffrey R. MacDonald, Ken Lau, Lap Chee Tsui, Stephen W. Scherer

Research output: Contribution to journal (Article)

175 Citations (Scopus)

Abstract

BACKGROUND: Previous studies have suggested that recent segmental duplications, which are often involved in the chromosome rearrangements underlying genomic disease, account for some 5% of the human genome. We have developed rapid computational heuristics based on BLAST analysis to detect segmental duplications, as well as regions containing potential sequence misassignments, in the human genome assemblies.

RESULTS: Our analysis of the June 2002 public human genome assembly revealed that 107.4 of 3,043.1 megabases (Mb) (3.53%) of sequence contained segmental duplications, each at least 5 kb in size and with at least 90% identity. We also found that 38.9 Mb (1.28%) of sequence within this assembly is likely to be involved in sequence misassignment errors. Furthermore, we identified a significant subset (199,965 of 2,327,473, or 8.6%) of single-nucleotide polymorphisms (SNPs) in the public databases that are not true SNPs but potential paralogous sequence variants.

CONCLUSION: Using two distinct computational approaches, we have identified most of the sequences in the human genome that have undergone recent segmental duplication. Near-identical segmental duplications present a major challenge to the completion of the human genome sequence. The potential sequence misassignments detected in this study will require additional effort to resolve.
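The size and identity thresholds quoted in the abstract can be illustrated with a minimal Python sketch. This is not the authors' pipeline: it assumes tab-separated BLAST output whose first four columns are query, subject, percent identity, and alignment length, and the names `Hit`, `parse_hits`, `candidate_duplications`, and `percent` are hypothetical. It does not remove self-alignments or merge overlapping hits, both of which a real analysis would need.

```python
from typing import Iterable, Iterator, NamedTuple

# Thresholds taken from the abstract; everything else here is an assumption.
MIN_LENGTH_BP = 5_000      # "equal or more than 5 kb"
MIN_IDENTITY_PCT = 90.0    # "90% identity"


class Hit(NamedTuple):
    """One pairwise alignment (hypothetical container)."""
    query: str
    subject: str
    identity_pct: float
    length_bp: int


def parse_hits(lines: Iterable[str]) -> Iterator[Hit]:
    """Parse tab-separated rows; assumed columns: query, subject,
    percent identity, alignment length, then any further columns."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        yield Hit(fields[0], fields[1], float(fields[2]), int(fields[3]))


def candidate_duplications(hits: Iterable[Hit]) -> Iterator[Hit]:
    """Keep alignments that meet both the length and identity thresholds."""
    for hit in hits:
        if hit.length_bp >= MIN_LENGTH_BP and hit.identity_pct >= MIN_IDENTITY_PCT:
            yield hit


def percent(part: float, whole: float) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100.0 * part / whole, 2)
```

Calling `percent(107.4, 3043.1)` and `percent(38.9, 3043.1)` reproduces the 3.53% and 1.28% figures reported above, and `percent(199965, 2327473)` gives the SNP fraction that the abstract rounds to 8.6%.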

Original language: English
Journal: Genome Biology
Volume: 4
Issue number: 4
Publication status: Published - 2003
Externally published: Yes

Fingerprint

Genomic Segmental Duplications
Human Genome
genome
genome assembly
single nucleotide polymorphism
polymorphism
heuristics
chromosome
genomics
detection
Databases

ASJC Scopus subject areas

  • Ecology, Evolution, Behavior and Systematics
  • Genetics
  • Cell Biology

Cite this

Cheung, J., Estivill, X., Khaja, R., MacDonald, J. R., Lau, K., Tsui, L. C., & Scherer, S. W. (2003). Genome-wide detection of segmental duplications and potential assembly errors in the human genome sequence. Genome Biology, 4(4).

@article{7becdbffefd14109859472b745533303,
title = "Genome-wide detection of segmental duplications and potential assembly errors in the human genome sequence.",
abstract = "BACKGROUND: Previous studies have suggested that recent segmental duplications, which are often involved in chromosome rearrangements underlying genomic disease, account for some 5{\%} of the human genome. We have developed rapid computational heuristics based on BLAST analysis to detect segmental duplications, as well as regions containing potential sequence misassignments in the human genome assemblies. RESULTS: Our analysis of the June 2002 public human genome assembly revealed that 107.4 of 3,043.1 megabases (Mb) (3.53{\%}) of sequence contained segmental duplications, each with size equal or more than 5 kb and 90{\%} identity. We have also detected that 38.9 Mb (1.28{\%}) of sequence within this assembly is likely to be involved in sequence misassignment errors. Furthermore, we have identified a significant subset (199,965 of 2,327,473 or 8.6{\%}) of single-nucleotide polymorphisms (SNPs) in the public databases that are not true SNPs but are potential paralogous sequence variants. CONCLUSION: Using two distinct computational approaches, we have identified most of the sequences in the human genome that have undergone recent segmental duplications. Near-identical segmental duplications present a major challenge to the completion of the human genome sequence. Potential sequence misassignments detected in this study would require additional efforts to resolve.",
author = "Joseph Cheung and Xavier Estivill and Razi Khaja and MacDonald, {Jeffrey R.} and Ken Lau and Tsui, {Lap Chee} and Scherer, {Stephen W.}",
year = "2003",
language = "English",
volume = "4",
journal = "Genome Biology",
issn = "1474-7596",
publisher = "BioMed Central",
number = "4",
}

TY - JOUR
T1 - Genome-wide detection of segmental duplications and potential assembly errors in the human genome sequence.
AU - Cheung, Joseph
AU - Estivill, Xavier
AU - Khaja, Razi
AU - MacDonald, Jeffrey R.
AU - Lau, Ken
AU - Tsui, Lap Chee
AU - Scherer, Stephen W.
PY - 2003
Y1 - 2003
AB - BACKGROUND: Previous studies have suggested that recent segmental duplications, which are often involved in chromosome rearrangements underlying genomic disease, account for some 5% of the human genome. We have developed rapid computational heuristics based on BLAST analysis to detect segmental duplications, as well as regions containing potential sequence misassignments in the human genome assemblies. RESULTS: Our analysis of the June 2002 public human genome assembly revealed that 107.4 of 3,043.1 megabases (Mb) (3.53%) of sequence contained segmental duplications, each with size equal or more than 5 kb and 90% identity. We have also detected that 38.9 Mb (1.28%) of sequence within this assembly is likely to be involved in sequence misassignment errors. Furthermore, we have identified a significant subset (199,965 of 2,327,473 or 8.6%) of single-nucleotide polymorphisms (SNPs) in the public databases that are not true SNPs but are potential paralogous sequence variants. CONCLUSION: Using two distinct computational approaches, we have identified most of the sequences in the human genome that have undergone recent segmental duplications. Near-identical segmental duplications present a major challenge to the completion of the human genome sequence. Potential sequence misassignments detected in this study would require additional efforts to resolve.

UR - http://www.scopus.com/inward/record.url?scp=0037837485&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0037837485&partnerID=8YFLogxK
M3 - Article
C2 - 12702206
AN - SCOPUS:0037837485
VL - 4
JO - Genome Biology
JF - Genome Biology
SN - 1474-7596
IS - 4
ER -