Joining massive tables in relational databases has received substantial attention in the past decade. Numerous filtration and indexing techniques have been proposed to mitigate the curse of dimensionality. This paper proposes a novel approach that maps the problem of pairwise whole-genome comparison onto an approximate join operation in the well-established relational database context. We propose a novel Bit Filtration Technique (BFT) based on vector transformation, and we further apply the DFT (Discrete Fourier Transformation) and DWT (Discrete Wavelet Transformation, Haar) dimensionality reduction techniques as a pre-processing filtration step that effectively reduces the search space and running time of the join operation. Our empirical results on a number of Prokaryote and Eukaryote DNA contig datasets demonstrate very efficient filtration that prunes non-relevant portions of the database while incurring no false negatives, running up to 50 times faster than traditional dynamic programming and q-gram approaches. BFT may easily be incorporated as a pre-processing step for any of the well-known sequence search heuristics, such as BLAST, QUASAR and FastA, for the purpose of pairwise whole-genome comparison. We analyze the precision of applying BFT and other transformation-based dimensionality reduction techniques, and finally discuss the imposed trade-offs.
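The idea of a transformation-based filter that incurs no false negatives can be illustrated with a minimal sketch. It is not the paper's BFT, whose details the abstract does not give. The sketch assumes each sequence is mapped to a vector of 2-mer counts (an illustrative choice), that similarity means Euclidean distance at most `eps` between those vectors, and that the filter keeps only the first `k` coefficients of an orthonormal Haar transform. All function names and parameters are hypothetical. Because an orthonormal transform preserves Euclidean distance, the distance computed on a prefix of the coefficients can never exceed the true distance, so a pair within `eps` is never pruned.

```python
import math
from itertools import product

# Assumption: 2-mer counts over ACGT give a 16-dimensional vector (a power of two).
KMERS = ["".join(p) for p in product("ACGT", repeat=2)]

def kmer_vector(seq):
    """Map a DNA string to its vector of 2-mer counts (illustrative mapping)."""
    counts = dict.fromkeys(KMERS, 0)
    for i in range(len(seq) - 1):
        kmer = seq[i:i + 2]
        if kmer in counts:  # skip 2-mers containing symbols other than A, C, G, T
            counts[kmer] += 1
    return [float(counts[k]) for k in KMERS]

def haar(vec):
    """Orthonormal Haar transform; len(vec) must be a power of two."""
    out = list(vec)
    n = len(out)
    while n > 1:
        half = n // 2
        avg = [(out[2 * i] + out[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        diff = [(out[2 * i] - out[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        out[:n] = avg + diff  # coarse coefficients move to the front
        n = half
    return out

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def candidate_pairs(left, right, eps, k=4):
    """Filter step: keep pairs whose distance on the first k Haar coefficients is <= eps."""
    lt = [haar(kmer_vector(s))[:k] for s in left]
    rt = [haar(kmer_vector(s))[:k] for s in right]
    tol = 1e-9  # guards against floating-point rounding exactly at the threshold
    return [(i, j) for i in range(len(lt)) for j in range(len(rt))
            if dist(lt[i], rt[j]) <= eps + tol]

def exact_pairs(left, right, eps):
    """Reference join: compare the full, untransformed vectors."""
    lv = [kmer_vector(s) for s in left]
    rv = [kmer_vector(s) for s in right]
    return [(i, j) for i in range(len(lv)) for j in range(len(rv))
            if dist(lv[i], rv[j]) <= eps]
```

In a join pipeline of this shape, only the pairs returned by `candidate_pairs` would be passed to the expensive comparison step. The filter may admit false positives, which that later step discards.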
|Number of pages||15|
|Journal||Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)|
|Publication status||Published - 1 Dec 2003|
ASJC Scopus subject areas
- Computer Science(all)
- Biochemistry, Genetics and Molecular Biology(all)
- Theoretical Computer Science