sPCA

Scalable principal component analysis for big data on distributed platforms

Tarek Elgamal, Maysam Yabandeh, Ashraf Aboulnaga, Waleed Mustafa, Mohamed Hefeeda

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

9 Citations (Scopus)

Abstract

Web sites, social networks, sensors, and scientific experiments currently generate massive amounts of data. Owners of this data strive to obtain insights from it, often by applying machine learning algorithms. Many machine learning algorithms, however, do not scale well to cope with the ever-increasing volumes of data. To address this problem, we identify several optimizations that are crucial for scaling various machine learning algorithms in distributed settings. We apply these optimizations to the popular Principal Component Analysis (PCA) algorithm. PCA is an important tool in many areas including image processing, data visualization, information retrieval, and dimensionality reduction. We refer to the proposed optimized PCA algorithm as scalable PCA, or sPCA. sPCA achieves scalability by employing efficient large-matrix operations, effectively leveraging matrix sparsity, and minimizing intermediate data. We implement sPCA on the widely used MapReduce platform and on the memory-based Spark platform. We compare sPCA against the closest PCA implementations, namely those in Mahout/MapReduce and MLlib/Spark. Our experiments show that sPCA outperforms both Mahout-PCA and MLlib-PCA by wide margins in terms of accuracy, running time, and volume of intermediate data generated during the computation.
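The abstract names three optimizations behind sPCA: efficient large-matrix operations, exploiting matrix sparsity, and minimizing intermediate data. The record does not describe the algorithm itself, so the following is only a minimal single-machine sketch of the sparsity and intermediate-data ideas, not the authors' MapReduce/Spark implementation; the function name, the NumPy/SciPy usage, and the implicit-centering trick shown are illustrative assumptions.

```python
# Minimal sketch (assumption, not the sPCA code): PCA on a sparse matrix
# without ever materializing the dense, mean-centered copy of the data.
import numpy as np
from scipy import sparse

def pca_sparse(X, k):
    """Top-k principal components of a sparse n x d matrix X.

    The d x d covariance is assembled as (X^T X)/n - mean * mean^T, so the
    input stays sparse and the only dense intermediate is d x d, not n x d.
    """
    n, d = X.shape
    mean = np.asarray(X.mean(axis=0)).ravel()   # 1 x d column means
    gram = (X.T @ X).toarray() / n              # sparse-sparse product
    cov = gram - np.outer(mean, mean)           # centering done implicitly
    eigvals, eigvecs = np.linalg.eigh(cov)      # symmetric eigensolver
    order = np.argsort(eigvals)[::-1][:k]       # k largest eigenvalues
    return eigvecs[:, order], eigvals[order]

if __name__ == "__main__":
    # Synthetic sparse data purely for illustration.
    X = sparse.random(10_000, 200, density=0.01, format="csr", random_state=0)
    components, variances = pca_sparse(X, k=5)
    print(components.shape, variances)
```

In a distributed setting, the X^T X term would typically be accumulated as partial d x d sums over row blocks (e.g., per map task or Spark partition), so the data shuffled between workers is a small Gramian rather than a dense centered copy of the input; this is one generic way to keep intermediate data small, in the spirit of what the abstract describes.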

Original language: English
Title of host publication: Proceedings of the ACM SIGMOD International Conference on Management of Data
Publisher: Association for Computing Machinery
Pages: 79-91
Number of pages: 13
Volume: 2015-May
ISBN (Print): 9781450327589
DOIs: 10.1145/2723372.2751520
Publication status: Published - 27 May 2015
Event: ACM SIGMOD International Conference on Management of Data, SIGMOD 2015 - Melbourne, Australia
Duration: 31 May 2015 - 4 Jun 2015

Other

Other: ACM SIGMOD International Conference on Management of Data, SIGMOD 2015
Country: Australia
City: Melbourne
Period: 31/5/15 - 4/6/15

ASJC Scopus subject areas

  • Software
  • Information Systems

Cite this

Elgamal, T., Yabandeh, M., Aboulnaga, A., Mustafa, W., & Hefeeda, M. (2015). SPCA: Scalable principal component analysis for big data on distributed platforms. In Proceedings of the ACM SIGMOD International Conference on Management of Data (Vol. 2015-May, pp. 79-91). Association for Computing Machinery. https://doi.org/10.1145/2723372.2751520
