A distributed approach for graph mining in massive networks

N. Talukder, M. J. Zaki

Research output: Contribution to journalArticle

12 Citations (Scopus)

Abstract

We propose a novel distributed algorithm for mining frequent subgraphs from a single, very large, labeled network. Our approach is the first distributed method to mine a massive input graph that is too large to fit in the memory of any individual compute node. The input graph thus has to be partitioned among the nodes, which can lead to potential false negatives. Furthermore, for scalable performance it is crucial to minimize the communication among the compute nodes. Our algorithm, DistGraph, ensures that there are no false negatives, and uses a set of optimizations and efficient collective communication operations to minimize information exchange. To our knowledge DistGraph is the first approach demonstrated to scale to graphs with over a billion vertices and edges. Scalability results on up to 2048 IBM Blue Gene/Q compute nodes, with 16 cores each, show very good speedup.

Original languageEnglish
Pages (from-to)1024-1052
Number of pages29
JournalData Mining and Knowledge Discovery
Volume30
Issue number5
DOIs
Publication statusPublished - 1 Sep 2016
Externally publishedYes

Fingerprint

Communication
Parallel algorithms
Scalability
Genes
Data storage equipment

Keywords

  • Distributed graph mining
  • Frequent subgraph mining
  • High performance computing
  • Parallel graph mining
  • Single large graph

ASJC Scopus subject areas

  • Information Systems
  • Computer Science Applications
  • Computer Networks and Communications

Cite this

A distributed approach for graph mining in massive networks. / Talukder, N.; Zaki, M. J.

In: Data Mining and Knowledge Discovery, Vol. 30, No. 5, 01.09.2016, p. 1024-1052.

Research output: Contribution to journalArticle

Talukder, N. ; Zaki, M. J. / A distributed approach for graph mining in massive networks. In: Data Mining and Knowledge Discovery. 2016 ; Vol. 30, No. 5. pp. 1024-1052.
@article{ae36d64df38148aaba7bb9d9bd6d889a,
title = "A distributed approach for graph mining in massive networks",
abstract = "We propose a novel distributed algorithm for mining frequent subgraphs from a single, very large, labeled network. Our approach is the first distributed method to mine a massive input graph that is too large to fit in the memory of any individual compute node. The input graph thus has to be partitioned among the nodes, which can lead to potential false negatives. Furthermore, for scalable performance it is crucial to minimize the communication among the compute nodes. Our algorithm, DistGraph, ensures that there are no false negatives, and uses a set of optimizations and efficient collective communication operations to minimize information exchange. To our knowledge DistGraph is the first approach demonstrated to scale to graphs with over a billion vertices and edges. Scalability results on up to 2048 IBM Blue Gene/Q compute nodes, with 16 cores each, show very good speedup.",
keywords = "Distributed graph mining, Frequent subgraph mining, High performance computing, Parallel graph mining, Single large graph",
author = "N. Talukder and Zaki, {M. J.}",
year = "2016",
month = "9",
day = "1",
doi = "10.1007/s10618-016-0466-x",
language = "English",
volume = "30",
pages = "1024--1052",
journal = "Data Mining and Knowledge Discovery",
issn = "1384-5810",
publisher = "Springer Netherlands",
number = "5",

}

TY - JOUR

T1 - A distributed approach for graph mining in massive networks

AU - Talukder, N.

AU - Zaki, M. J.

PY - 2016/9/1

Y1 - 2016/9/1

N2 - We propose a novel distributed algorithm for mining frequent subgraphs from a single, very large, labeled network. Our approach is the first distributed method to mine a massive input graph that is too large to fit in the memory of any individual compute node. The input graph thus has to be partitioned among the nodes, which can lead to potential false negatives. Furthermore, for scalable performance it is crucial to minimize the communication among the compute nodes. Our algorithm, DistGraph, ensures that there are no false negatives, and uses a set of optimizations and efficient collective communication operations to minimize information exchange. To our knowledge DistGraph is the first approach demonstrated to scale to graphs with over a billion vertices and edges. Scalability results on up to 2048 IBM Blue Gene/Q compute nodes, with 16 cores each, show very good speedup.

AB - We propose a novel distributed algorithm for mining frequent subgraphs from a single, very large, labeled network. Our approach is the first distributed method to mine a massive input graph that is too large to fit in the memory of any individual compute node. The input graph thus has to be partitioned among the nodes, which can lead to potential false negatives. Furthermore, for scalable performance it is crucial to minimize the communication among the compute nodes. Our algorithm, DistGraph, ensures that there are no false negatives, and uses a set of optimizations and efficient collective communication operations to minimize information exchange. To our knowledge DistGraph is the first approach demonstrated to scale to graphs with over a billion vertices and edges. Scalability results on up to 2048 IBM Blue Gene/Q compute nodes, with 16 cores each, show very good speedup.

KW - Distributed graph mining

KW - Frequent subgraph mining

KW - High performance computing

KW - Parallel graph mining

KW - Single large graph

UR - http://www.scopus.com/inward/record.url?scp=84973659265&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84973659265&partnerID=8YFLogxK

U2 - 10.1007/s10618-016-0466-x

DO - 10.1007/s10618-016-0466-x

M3 - Article

VL - 30

SP - 1024

EP - 1052

JO - Data Mining and Knowledge Discovery

JF - Data Mining and Knowledge Discovery

SN - 1384-5810

IS - 5

ER -