In-memory distributed matrix computation processing & optimization

Yongyang Yu, Mingjie Tang, Walid G. Aref, Qutaibah M. Malluhi, Mostafa Abbas, Mourad Ouzzani

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

9 Citations (Scopus)

Abstract

The use of large-scale machine learning and data mining methods is becoming ubiquitous in many application domains ranging from business intelligence and bioinformatics to self-driving cars. These methods heavily rely on matrix computations, and it is hence critical to make these computations scalable and efficient. These matrix computations are often complex and involve multiple steps that need to be optimized and sequenced properly for efficient execution. This paper presents new efficient and scalable matrix processing and optimization techniques for in-memory distributed clusters. The proposed techniques estimate the sparsity of intermediate matrix-computation results and optimize communication costs. An evaluation plan generator for complex matrix computations is introduced as well as a distributed plan optimizer that exploits dynamic cost-based analysis and rule-based heuristics to optimize the cost of matrix computations in an in-memory distributed environment. The result of a matrix operation will often serve as an input to another matrix operation, thus defining the matrix data dependencies within a matrix program. The matrix query plan generator produces query execution plans that minimize memory usage and communication overhead by partitioning the matrix based on the data dependencies in the execution plan. We implemented the proposed matrix processing and optimization techniques in Spark, a distributed in-memory computing platform. Experiments on both real and synthetic data demonstrate that our proposed techniques achieve up to an order-of-magnitude performance improvement over state-of-the-art distributed matrix computation systems on a wide range of applications.
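The sparsity-estimation idea described in the abstract can be illustrated with a small sketch. This is a hypothetical example under a simple independence assumption about where nonzeros fall, not the paper's actual estimator; the function names are invented for illustration.

```python
def estimate_product_density(p_a: float, p_b: float, k: int) -> float:
    """Estimate the nonzero density of C = A @ B, where A is m x k with
    nonzero density p_a and B is k x n with nonzero density p_b.

    C[i][j] is nonzero if at least one of the k products A[i][t] * B[t][j]
    is nonzero. Assuming nonzeros are placed independently and uniformly,
    each product is nonzero with probability p_a * p_b, so:
        P(C[i][j] != 0) = 1 - (1 - p_a * p_b) ** k
    """
    return 1.0 - (1.0 - p_a * p_b) ** k


def choose_format(density: float, threshold: float = 0.5) -> str:
    """A cost-based optimizer could use the estimate to decide whether an
    intermediate result should be materialized sparse or dense."""
    return "dense" if density >= threshold else "sparse"
```

For example, multiplying two matrices that are each 1% dense along a shared dimension of 1000 yields an intermediate that is roughly 10% dense, so an optimizer would keep it in a sparse representation; knowing this before execution lets the planner also bound memory and shuffle costs.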

Original language: English
Title of host publication: Proceedings - 2017 IEEE 33rd International Conference on Data Engineering, ICDE 2017
Publisher: IEEE Computer Society
Pages: 1047-1058
Number of pages: 12
ISBN (Electronic): 9781509065431
DOI: 10.1109/ICDE.2017.150
Publication status: Published - 16 May 2017
Event: 33rd IEEE International Conference on Data Engineering, ICDE 2017 - San Diego, United States
Duration: 19 Apr 2017 - 22 Apr 2017

Other

Other: 33rd IEEE International Conference on Data Engineering, ICDE 2017
Country: United States
City: San Diego
Period: 19/4/17 - 22/4/17

Keywords

  • Distributed computing
  • Matrix computation
  • Query optimization

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Information Systems

Cite this

Yu, Y., Tang, M., Aref, W. G., Malluhi, Q. M., Abbas, M., & Ouzzani, M. (2017). In-memory distributed matrix computation processing & optimization. In Proceedings - 2017 IEEE 33rd International Conference on Data Engineering, ICDE 2017 (pp. 1047-1058). [7930046] IEEE Computer Society. https://doi.org/10.1109/ICDE.2017.150

@inproceedings{3fd5bbbe9d50412fa4dfdb1137a7010c,
title = "In-memory distributed matrix computation processing & optimization",
abstract = "The use of large-scale machine learning and data mining methods is becoming ubiquitous in many application domains ranging from business intelligence and bioinformatics to self-driving cars. These methods heavily rely on matrix computations, and it is hence critical to make these computations scalable and efficient. These matrix computations are often complex and involve multiple steps that need to be optimized and sequenced properly for efficient execution. This paper presents new efficient and scalable matrix processing and optimization techniques for in-memory distributed clusters. The proposed techniques estimate the sparsity of intermediate matrix-computation results and optimize communication costs. An evaluation plan generator for complex matrix computations is introduced as well as a distributed plan optimizer that exploits dynamic cost-based analysis and rule-based heuristics to optimize the cost of matrix computations in an in-memory distributed environment. The result of a matrix operation will often serve as an input to another matrix operation, thus defining the matrix data dependencies within a matrix program. The matrix query plan generator produces query execution plans that minimize memory usage and communication overhead by partitioning the matrix based on the data dependencies in the execution plan. We implemented the proposed matrix processing and optimization techniques in Spark, a distributed in-memory computing platform. Experiments on both real and synthetic data demonstrate that our proposed techniques achieve up to an order-of-magnitude performance improvement over state-of-the-art distributed matrix computation systems on a wide range of applications.",
keywords = "Distributed computing, Matrix computation, Query optimization",
author = "Yongyang Yu and Mingjie Tang and Aref, {Walid G.} and Malluhi, {Qutaibah M.} and Mostafa Abbas and Mourad Ouzzani",
year = "2017",
month = "5",
day = "16",
doi = "10.1109/ICDE.2017.150",
language = "English",
pages = "1047--1058",
booktitle = "Proceedings - 2017 IEEE 33rd International Conference on Data Engineering, ICDE 2017",
publisher = "IEEE Computer Society",

}
