COSAC

A framework for combinatorial statistical analysis on cloud

Zhengkui Wang, Divyakant Agrawal, Kian Lee Tan

Research output: Contribution to journal › Article

3 Citations (Scopus)

Abstract

In many scientific applications, it is critical to determine if there is a relationship among a combination of objects. The strength of such an association is typically computed using some statistical measures. In order not to miss any important associations, it is not uncommon to exhaustively enumerate all possible combinations of a certain size. However, discovering significant associations among hundreds of thousands or even millions of objects is a computationally intensive job that typically takes days, if not weeks, to complete. We are, therefore, motivated to provide efficient and practical techniques to speed up the processing by exploiting parallelism. In this paper, we propose a framework, COSAC, for such combinatorial statistical analysis for large-scale data sets over a MapReduce-based cloud computing platform. COSAC operates in two key phases: 1) In the distribution phase, a novel load balancing scheme distributes the combination enumeration tasks across the processing units; 2) In the statistical analysis phase, each unit optimizes the processing of the allocated combinations by salvaging computations that can be reused. COSAC also supports a more practical scenario, where only a selected subset of objects needs to be analyzed against all the objects. As a representative application, we developed COSAC to find combinations of Single Nucleotide Polymorphisms (SNPs) that may interact to cause diseases. We have evaluated our framework on a cluster of more than 40 nodes. The experimental results show that our framework is computationally practical, efficient, scalable, and flexible.
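To make the distribution phase concrete, the following is a minimal sketch of one simple way to split the enumeration of all object pairs evenly across workers. It is an illustration only and not COSAC's actual load balancing scheme, which the abstract does not specify. The function names `pair_ranges` and `pairs_in_range` are hypothetical, combinations are restricted to size 2, and each worker is assumed to receive a contiguous range of lexicographic pair ranks.

```python
from itertools import combinations, islice
from math import comb


def pair_ranges(n_objects, n_workers):
    """Split the comb(n_objects, 2) pair ranks into contiguous ranges
    whose sizes differ by at most one (hypothetical helper)."""
    total = comb(n_objects, 2)
    base, extra = divmod(total, n_workers)
    ranges = []
    start = 0
    for worker in range(n_workers):
        # The first `extra` workers take one additional pair each.
        size = base + (1 if worker < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def pairs_in_range(n_objects, start, stop):
    """Yield the pairs whose lexicographic rank lies in [start, stop).

    islice skips the first `start` pairs one by one, which is fine for a
    sketch; a large-scale system would compute the starting pair directly.
    """
    return islice(combinations(range(n_objects), 2), start, stop)


if __name__ == "__main__":
    n_objects, n_workers = 6, 4
    for worker, (start, stop) in enumerate(pair_ranges(n_objects, n_workers)):
        assigned = list(pairs_in_range(n_objects, start, stop))
        print(f"worker {worker}: {len(assigned)} pairs -> {assigned}")
```

Each worker could then compute its statistical measure over only the pairs it was assigned. Contiguous rank ranges keep the per-worker pair counts within one of each other, although they do not by themselves capture the computation reuse described for the statistical analysis phase.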

Original language: English
Article number: 6205755
Pages (from-to): 2010-2023
Number of pages: 14
Journal: IEEE Transactions on Knowledge and Data Engineering
Volume: 25
Issue number: 9
DOIs: 10.1109/TKDE.2012.113
Publication status: Published - 8 Aug 2013
Externally published: Yes

Fingerprint

Statistical methods
Processing
Salvaging
Cloud computing
Nucleotides
Polymorphism
Resource allocation

Keywords

  • Association mining
  • Combinatorial statistical analysis
  • MapReduce
  • Parallel object combination enumeration

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Information Systems
  • Computer Science Applications

Cite this

COSAC: A framework for combinatorial statistical analysis on cloud. / Wang, Zhengkui; Agrawal, Divyakant; Tan, Kian Lee.

In: IEEE Transactions on Knowledge and Data Engineering, Vol. 25, No. 9, 6205755, 08.08.2013, p. 2010-2023.

@article{5e5dd0fa3fd24bf39e7c6a7eaf8c6b69,
title = "COSAC: A framework for combinatorial statistical analysis on cloud",
keywords = "Association mining, Combinatorial statistical analysis, MapReduce, Parallel object combination enumeration",
author = "Zhengkui Wang and Divyakant Agrawal and Tan, {Kian Lee}",
year = "2013",
month = "8",
day = "8",
doi = "10.1109/TKDE.2012.113",
language = "English",
volume = "25",
pages = "2010--2023",
journal = "IEEE Transactions on Knowledge and Data Engineering",
issn = "1041-4347",
publisher = "IEEE Computer Society",
number = "9",

}
