BFT: Bit filtration technique for approximate string join in biological databases

S. Alireza Aghili, Divyakant Agrawal, Amr El Abbadi

Research output: Contribution to journal, Article

4 Citations (Scopus)

Abstract

Joining massive tables in relational databases has received substantial attention in the past decade. Numerous filtration and indexing techniques have been proposed to reduce the curse of dimensionality. This paper proposes a novel approach to map the problem of pairwise whole-genome comparison into an approximate join operation in the well-established relational database context. We propose a novel Bit Filtration Technique (BFT) based on vector transformation, and furthermore apply the DFT (Discrete Fourier Transformation) and DWT (Discrete Wavelet Transformation, Haar) dimensionality reduction techniques as a pre-processing filtration step, which effectively reduces the search space and running time of the join operation. Our empirical results on a number of Prokaryote and Eukaryote DNA contig datasets demonstrate very efficient filtration that effectively prunes non-relevant portions of the database, incurring no false negatives, with up to 50 times faster running time compared with traditional dynamic programming and q-gram approaches. BFT may easily be incorporated as a pre-processing step for any of the well-known sequence search heuristics, such as BLAST, QUASAR and FastA, for the purpose of pairwise whole-genome comparison. We analyze the precision of applying BFT and other transformation-based dimensionality reduction techniques, and finally discuss the imposed trade-offs.
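The abstract does not describe how BFT itself works, so the following is only a minimal sketch of the general idea behind a transform-based filtration step with no false negatives, not the authors' method. It assumes each sequence has already been mapped to a numeric vector (that mapping, and its relation to edit distance, is the paper's contribution and is not reproduced here). The names `haar`, `dist`, and `may_match` are hypothetical. The sketch relies on one well-known fact: an orthonormal Haar transform preserves Euclidean distance, so the distance computed on only the first `k` coefficients can never exceed the true distance, and a pair whose reduced distance already exceeds the threshold can be pruned safely.

```python
import math


def haar(vec):
    """Orthonormal Haar transform of a vector whose length is a power of two."""
    out = [float(x) for x in vec]
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    while n > 1:
        half = n // 2
        # Scaled pairwise sums and differences keep the transform orthonormal.
        avg = [(out[2 * i] + out[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        diff = [(out[2 * i] - out[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        out[:n] = avg + diff
        n = half
    return out


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def may_match(u, v, k, threshold):
    """Filter step on the first k Haar coefficients.

    Returns False only when the pair is guaranteed to be farther apart
    than `threshold` in the original vector space, so it can be pruned.
    """
    return dist(haar(u)[:k], haar(v)[:k]) <= threshold
```

Pairs that pass `may_match` still have to be verified with the exact comparison; the filter only shrinks the candidate set. A smaller `k` makes the filter cheaper but lets more false positives through.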

Original language: English
Pages (from-to): 326-340
Number of pages: 15
Journal: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 2857
Publication status: Published - 1 Dec 2003
Externally published: Yes

Fingerprint

Filtration
Join
Strings
Databases
Dimensionality Reduction
Relational Database
Preprocessing
Pairwise
Genome
Genes
Wavelet Transformation
Fourier Transformation
Curse of Dimensionality
Heuristic Search
Processing
Eukaryota
Joining
Dynamic programming
Indexing
Search Space

ASJC Scopus subject areas

  • Computer Science (all)
  • Biochemistry, Genetics and Molecular Biology (all)
  • Theoretical Computer Science
  • Engineering (all)

Cite this

BFT: Bit filtration technique for approximate string join in biological databases. / Aghili, S. Alireza; Agrawal, Divyakant; El Abbadi, Amr. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 2857, 01.12.2003, p. 326-340.

@article{83f5b1da754446b795ed270568d2a35b,
title = "BFT: Bit filtration technique for approximate string join in biological databases",
abstract = "Joining massive tables in relational databases has received substantial attention in the past decade. Numerous filtration and indexing techniques have been proposed to reduce the curse of dimensionality. This paper proposes a novel approach to map the problem of pairwise whole-genome comparison into an approximate join operation in the well-established relational database context. We propose a novel Bit Filtration Technique (BFT) based on vector transformation, and furthermore apply the DFT (Discrete Fourier Transformation) and DWT (Discrete Wavelet Transformation, Haar) dimensionality reduction techniques as a pre-processing filtration step, which effectively reduces the search space and running time of the join operation. Our empirical results on a number of Prokaryote and Eukaryote DNA contig datasets demonstrate very efficient filtration that effectively prunes non-relevant portions of the database, incurring no false negatives, with up to 50 times faster running time compared with traditional dynamic programming and q-gram approaches. BFT may easily be incorporated as a pre-processing step for any of the well-known sequence search heuristics, such as BLAST, QUASAR and FastA, for the purpose of pairwise whole-genome comparison. We analyze the precision of applying BFT and other transformation-based dimensionality reduction techniques, and finally discuss the imposed trade-offs.",
author = "Aghili, {S. Alireza} and Agrawal, Divyakant and {El Abbadi}, Amr",
year = "2003",
month = "12",
day = "1",
language = "English",
volume = "2857",
pages = "326--340",
journal = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
issn = "0302-9743",
publisher = "Springer Verlag",

}

TY - JOUR

T1 - BFT

T2 - Bit filtration technique for approximate string join in biological databases

AU - Aghili, S. Alireza

AU - Agrawal, Divyakant

AU - El Abbadi, Amr

PY - 2003/12/1

Y1 - 2003/12/1

N2 - Joining massive tables in relational databases has received substantial attention in the past decade. Numerous filtration and indexing techniques have been proposed to reduce the curse of dimensionality. This paper proposes a novel approach to map the problem of pairwise whole-genome comparison into an approximate join operation in the well-established relational database context. We propose a novel Bit Filtration Technique (BFT) based on vector transformation, and furthermore apply the DFT (Discrete Fourier Transformation) and DWT (Discrete Wavelet Transformation, Haar) dimensionality reduction techniques as a pre-processing filtration step, which effectively reduces the search space and running time of the join operation. Our empirical results on a number of Prokaryote and Eukaryote DNA contig datasets demonstrate very efficient filtration that effectively prunes non-relevant portions of the database, incurring no false negatives, with up to 50 times faster running time compared with traditional dynamic programming and q-gram approaches. BFT may easily be incorporated as a pre-processing step for any of the well-known sequence search heuristics, such as BLAST, QUASAR and FastA, for the purpose of pairwise whole-genome comparison. We analyze the precision of applying BFT and other transformation-based dimensionality reduction techniques, and finally discuss the imposed trade-offs.

UR - http://www.scopus.com/inward/record.url?scp=0142187768&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0142187768&partnerID=8YFLogxK

M3 - Article

VL - 2857

SP - 326

EP - 340

JO - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

JF - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SN - 0302-9743

ER -