Parallel motif extraction from very long sequences

Majed Sahli, Essam Mansour, Panos Kalnis

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

9 Citations (Scopus)

Abstract

Motifs are frequent patterns used to identify biological functionality in genomic sequences, periodicity in time series, or user trends in web logs. Whereas much existing work focuses on collections of many short sequences, modern applications require mining motifs in a single very long sequence (i.e., on the order of several gigabytes). For this case, there are statistical approaches that are fast but inaccurate, and combinatorial methods that are sound and complete. Unfortunately, existing combinatorial methods are serial and very slow. Consequently, they are limited to very short sequences (i.e., a few megabytes), small alphabets (typically 4 symbols for DNA sequences), and restricted types of motifs. This paper presents ACME, a combinatorial method for extracting motifs from a single very long sequence. ACME arranges the search space in contiguous blocks that take advantage of the cache hierarchy in modern architectures, and achieves almost an order of magnitude performance gain in serial execution. It also decomposes the search space in a way that allows scalability to thousands of processors with more than 90% speedup. ACME is the only method that: (i) scales to gigabyte-long sequences; (ii) handles large alphabets; (iii) supports interesting types of motifs with minimal additional cost; and (iv) is optimized for a variety of architectures such as multi-core systems, clusters in the cloud, and supercomputers. ACME reduces the extraction time for an exact-length query from 4 hours to 7 minutes on a typical workstation; handles sequences 3 orders of magnitude longer; and scales up to 16,384 cores on a supercomputer. Copyright is held by the owner/author(s).
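To make the notions of an exact-length motif query and a decomposed search space concrete, here is a minimal brute-force sketch. It is not ACME's algorithm: it uses no suffix tree and no cache-aware layout, and it counts exact matches only, which is a simplifying assumption since the abstract does not specify the matching model. The function names, the `min_freq` threshold, and the prefix-based partitioning are all hypothetical choices made for illustration.

```python
from collections import Counter
from itertools import product


def count_kmers_with_prefix(sequence, k, prefix):
    """Count length-k substrings of `sequence` that start with `prefix`."""
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        if window.startswith(prefix):
            counts[window] += 1
    return counts


def extract_motifs(sequence, k, min_freq, alphabet="ACGT", prefix_len=1):
    """Return {motif: frequency} for length-k substrings seen at least min_freq times.

    The candidate space is split by prefix. Each prefix defines an
    independent sub-task, so the loop body could be handed to separate
    workers (assumption: this only illustrates the idea of decomposition,
    not the partitioning scheme used in the paper).
    """
    motifs = {}
    for prefix in map("".join, product(alphabet, repeat=prefix_len)):
        for kmer, freq in count_kmers_with_prefix(sequence, k, prefix).items():
            if freq >= min_freq:
                motifs[kmer] = freq
    return motifs


if __name__ == "__main__":
    print(extract_motifs("ACGTACGAAA", k=3, min_freq=2))
```

Because the sub-tasks share no state, their results can be merged in any order. The sketch rescans the whole sequence once per prefix, which is exactly the kind of cost an indexed method would avoid on gigabyte-long inputs.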

Original language: English
Title of host publication: International Conference on Information and Knowledge Management, Proceedings
Pages: 549-558
Number of pages: 10
DOIs: https://doi.org/10.1145/2505515.2505575
Publication status: Published - 11 Dec 2013
Event: 22nd ACM International Conference on Information and Knowledge Management, CIKM 2013 - San Francisco, CA, United States
Duration: 27 Oct 2013 to 1 Nov 2013

Other

Other: 22nd ACM International Conference on Information and Knowledge Management, CIKM 2013
Country: United States
City: San Francisco, CA
Period: 27/10/13 to 1/11/13

Keywords

  • Cache efficiency
  • In-memory
  • Motif
  • Parallel
  • Suffix tree

ASJC Scopus subject areas

  • Business, Management and Accounting(all)
  • Decision Sciences(all)

Cite this

Sahli, M., Mansour, E., & Kalnis, P. (2013). Parallel motif extraction from very long sequences. In International Conference on Information and Knowledge Management, Proceedings (pp. 549-558) https://doi.org/10.1145/2505515.2505575

@inproceedings{56d93d3df0b74f8e999df7294266a1a8,
title = "Parallel motif extraction from very long sequences",
abstract = "Motifs are frequent patterns used to identify biological functionality in genomic sequences, periodicity in time series, or user trends in web logs. In contrast to a lot of existing work that focuses on collections of many short sequences, modern applications require mining of motifs in one very long sequence (i.e., in the order of several gigabytes). For this case, there exist statistical approaches that are fast but inaccurate; or combinatorial methods that are sound and complete. Unfortunately, existing combinatorial methods are serial and very slow. Consequently, they are limited to very short sequences (i.e., a few megabytes), small alphabets (typically 4 symbols for DNA sequences), and restricted types of motifs. This paper presents ACME, a combinatorial method for extracting motifs from a single very long sequence. ACME arranges the search space in contiguous blocks that take advantage of the cache hierarchy in modern architectures, and achieves almost an order of magnitude performance gain in serial execution. It also decomposes the search space in a smart way that allows scalability to thousands of processors with more than 90{\%} speedup. ACME is the only method that: (i) scales to gigabyte-long sequences; (ii) handles large alphabets; (iii) supports interesting types of motifs with minimal additional cost; and (iv) is optimized for a variety of architectures such as multi-core systems, clusters in the cloud, and supercomputers. ACME reduces the extraction time for an exact-length query from 4 hours to 7 minutes on a typical workstation; handles 3 orders of magnitude longer sequences; and scales up to 16,384 cores on a supercomputer. Copyright is held by the owner/author(s).",
keywords = "Cache efficiency, In-memory, Motif, Parallel, Suffix tree",
author = "Majed Sahli and Essam Mansour and Panos Kalnis",
year = "2013",
month = "12",
day = "11",
doi = "10.1145/2505515.2505575",
language = "English",
isbn = "9781450322638",
pages = "549--558",
booktitle = "International Conference on Information and Knowledge Management, Proceedings",

}

TY - GEN

T1 - Parallel motif extraction from very long sequences

AU - Sahli, Majed

AU - Mansour, Essam

AU - Kalnis, Panos

PY - 2013/12/11

Y1 - 2013/12/11

N2 - Motifs are frequent patterns used to identify biological functionality in genomic sequences, periodicity in time series, or user trends in web logs. In contrast to a lot of existing work that focuses on collections of many short sequences, modern applications require mining of motifs in one very long sequence (i.e., in the order of several gigabytes). For this case, there exist statistical approaches that are fast but inaccurate; or combinatorial methods that are sound and complete. Unfortunately, existing combinatorial methods are serial and very slow. Consequently, they are limited to very short sequences (i.e., a few megabytes), small alphabets (typically 4 symbols for DNA sequences), and restricted types of motifs. This paper presents ACME, a combinatorial method for extracting motifs from a single very long sequence. ACME arranges the search space in contiguous blocks that take advantage of the cache hierarchy in modern architectures, and achieves almost an order of magnitude performance gain in serial execution. It also decomposes the search space in a smart way that allows scalability to thousands of processors with more than 90% speedup. ACME is the only method that: (i) scales to gigabyte-long sequences; (ii) handles large alphabets; (iii) supports interesting types of motifs with minimal additional cost; and (iv) is optimized for a variety of architectures such as multi-core systems, clusters in the cloud, and supercomputers. ACME reduces the extraction time for an exact-length query from 4 hours to 7 minutes on a typical workstation; handles 3 orders of magnitude longer sequences; and scales up to 16,384 cores on a supercomputer. Copyright is held by the owner/author(s).


KW - Cache efficiency

KW - In-memory

KW - Motif

KW - Parallel

KW - Suffix tree

UR - http://www.scopus.com/inward/record.url?scp=84889586473&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84889586473&partnerID=8YFLogxK

U2 - 10.1145/2505515.2505575

DO - 10.1145/2505515.2505575

M3 - Conference contribution

SN - 9781450322638

SP - 549

EP - 558

BT - International Conference on Information and Knowledge Management, Proceedings

ER -