Race

A scalable and elastic parallel system for discovering repeats in very long sequences

Essam Mansour, Ahmed El-Roby, Panos Kalnis, Aron Ahmadia, Ashraf Aboulnaga

Research output: Chapter in Book/Report/Conference proceeding > Chapter

Abstract

A wide range of applications, including bioinformatics, time series, and log analysis, depend on the identification of repetitions in very long sequences. The problem of finding maximal pairs subsumes most important types of repetition-finding tasks. Existing solutions require both the input sequence and its index (typically an order of magnitude larger than the input) to fit in memory. Moreover, they are serial algorithms with long execution times. Therefore, they are limited to small datasets, even though modern applications demand sequences that are orders of magnitude longer. In this paper, we present RACE, a parallel system for finding maximal pairs in very long sequences. RACE supports parallel execution on stand-alone multicore systems, in addition to scaling to thousands of nodes on clusters or supercomputers. RACE does not require the input or the index to fit in memory; therefore, it supports very long sequences with limited memory. Moreover, it uses a novel array representation that allows for a cache-efficient implementation. RACE is particularly suitable for the cloud (e.g., Amazon EC2) because, based on availability, it can scale elastically to more or fewer machines during its execution. Since scaling out introduces overheads, mainly due to load imbalance, we propose a cost model to estimate the expected speedup, based on statistics gathered through sampling. The model allows the user to select the appropriate combination of cloud resources based on the provider's prices and the required deadline. We conducted an extensive experimental evaluation with large real datasets and large computing infrastructures. In contrast to existing methods, RACE can handle the entire human genome on a typical desktop computer with 16 GB RAM. Moreover, for a problem that takes 10 hours of serial execution, RACE finishes in 28 seconds using 2,048 nodes on an IBM BlueGene/P supercomputer.
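The abstract does not define a maximal pair, so the sketch below assumes the standard textbook definition: two occurrences of the same substring, starting at positions i < j, that cannot be extended by one character to the left or to the right while still matching. It is a brute-force illustration of the problem only, not RACE's index-based parallel method. The function name `maximal_pairs` and the `min_len` parameter are hypothetical.

```python
def maximal_pairs(s, min_len=1):
    """Enumerate maximal pairs as (i, j, length) tuples with i < j.

    Brute force, cubic time in the worst case; meant only to make the
    definition concrete on short strings. Overlapping occurrences are
    allowed in this sketch.
    """
    n = len(s)
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            # Left-maximality: skip if the match could be extended one
            # character to the left (start of string counts as maximal).
            if i > 0 and s[i - 1] == s[j - 1]:
                continue
            # Right-maximality: extend the match as far right as possible.
            length = 0
            while j + length < n and s[i + length] == s[j + length]:
                length += 1
            if length >= min_len:
                pairs.append((i, j, length))
    return pairs


if __name__ == "__main__":
    print(maximal_pairs("xabcyabcz", min_len=2))
```

On a toy input such as `"xabcyabcz"`, the only maximal pair of length at least 2 is the repeat `abc` at positions 1 and 5. The quadratic number of candidate position pairs is one way to see why a naive approach of this kind would not be practical for genome-scale inputs.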

Original language: English
Title of host publication: Proceedings of the VLDB Endowment
Pages: 865-876
Number of pages: 12
Volume: 6
Edition: 10
Publication status: Published - Aug 2013

Fingerprint

Supercomputers
Data storage equipment
Random access storage
Bioinformatics
Personal computers
Time series
Genes
Statistics
Availability
Sampling
Costs

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Computer Science (all)

Cite this

Mansour, E., El-Roby, A., Kalnis, P., Ahmadia, A., & Aboulnaga, A. (2013). Race: A scalable and elastic parallel system for discovering repeats in very long sequences. In Proceedings of the VLDB Endowment (10 ed., Vol. 6, pp. 865-876).
