Genome assembly comparison identifies structural variants in the human genome

Razi Khaja, Junjun Zhang, Jeffrey R. MacDonald, Yongshu He, Ann M. Joseph-George, John Wei, Muhammad A. Rafiq, Cheng Qian, Mary Shago, Lorena Pantano, Hiroyuki Aburatani, Keith Jones, Richard Redon, Matthew Hurles, Lluis Armengol, Xavier P. Estivill, Richard J. Mural, Charles Lee, Stephen W. Scherer, Lars Feuk

Research output: Contribution to journal (Article)

130 Citations (Scopus)

Abstract

Numerous types of DNA variation exist, ranging from SNPs to larger structural alterations such as copy number variants (CNVs) and inversions. Alignment of DNA sequence from different sources has been used to identify SNPs and intermediate-sized variants (ISVs). However, only a small proportion of total heterogeneity is characterized, and little is known of the characteristics of most smaller-sized (<50 kb) variants. Here we show that genome assembly comparison is a robust approach for identification of all classes of genetic variation. Through comparison of two human assemblies (Celera's R27c compilation and the Build 35 reference sequence), we identified megabases of sequence (in the form of 13,534 putative non-SNP events) that were absent, inverted or polymorphic in one assembly. Database comparison and laboratory experimentation further demonstrated overlap or validation for 240 variable regions and confirmed >1.5 million SNPs. Some differences were simple insertions and deletions, but in regions containing CNVs, segmental duplication and repetitive DNA, they were more complex. Our results uncover substantial undescribed variation in humans, highlighting the need for comprehensive annotation strategies to fully interpret genome scanning and personalized sequencing projects.

Original language: English
Pages (from-to): 1413-1418
Number of pages: 6
Journal: Nature Genetics
Volume: 38
Issue number: 12
DOI: https://doi.org/10.1038/ng1921
Publication status: Published - 5 Dec 2006
Externally published: Yes

Fingerprint

Human Genome
Single Nucleotide Polymorphism
Genome
Genomic Segmental Duplications
DNA

ASJC Scopus subject areas

  • Genetics (clinical)
  • Genetics

Cite this

Khaja, R., Zhang, J., MacDonald, J. R., He, Y., Joseph-George, A. M., Wei, J., ... Feuk, L. (2006). Genome assembly comparison identifies structural variants in the human genome. Nature Genetics, 38(12), 1413-1418. https://doi.org/10.1038/ng1921

@article{d814c201e1634ee3bdcd9abe12fdd1d0,
title = "Genome assembly comparison identifies structural variants in the human genome",
author = "Razi Khaja and Junjun Zhang and MacDonald, {Jeffrey R.} and Yongshu He and Joseph-George, {Ann M.} and John Wei and Rafiq, {Muhammad A.} and Cheng Qian and Mary Shago and Lorena Pantano and Hiroyuki Aburatani and Keith Jones and Richard Redon and Matthew Hurles and Lluis Armengol and Estivill, {Xavier P.} and Mural, {Richard J.} and Charles Lee and Scherer, {Stephen W.} and Lars Feuk",
year = "2006",
month = "12",
day = "5",
doi = "10.1038/ng1921",
language = "English",
volume = "38",
pages = "1413--1418",
journal = "Nature Genetics",
issn = "1061-4036",
publisher = "Nature Publishing Group",
number = "12",

}
