Xproj

A framework for projected structural clustering of xml documents

Charu C. Aggarwal, Na Ta, Jianyong Wang, Jianhua Feng, Mohammed Zaki

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

82 Citations (Scopus)

Abstract

XML has become a popular method of data representation both on the web and in databases in recent years. One of the reasons for the popularity of XML has been its ability to encode structural information about data records. However, this structural characteristic of data sets also makes it a challenging problem for a variety of data mining problems. One such problem is that of clustering, in which the structural aspects of the data result in a high implicit dimensionality of the data representation. As a result, it becomes more difficult to cluster the data in a meaningful way. In this paper, we propose an effective clustering algorithm for XML data which uses substructures of the documents in order to gain insights about the important underlying structures. We propose new ways of using multiple sub-structural information in XML documents to evaluate the quality of intermediate cluster solutions, and guide the algorithms to a final solution which reflects the true structural behavior in individual partitions. We test the algorithm on a variety of real and synthetic data sets.
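
As a rough illustration of the general idea of using document substructures as a clustering signal, the sketch below represents each XML document by its set of parent-child tag pairs and assigns it to the most similar representative. This is not the algorithm proposed in the paper: the choice of tag pairs as "substructures", the Jaccard score, and the names `tag_edges`, `jaccard`, and `assign` are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def tag_edges(xml_text):
    """Return the set of (parent_tag, child_tag) pairs in one document.

    Assumption: parent-child tag pairs stand in for "substructures".
    """
    root = ET.fromstring(xml_text)
    return {(parent.tag, child.tag) for parent in root.iter() for child in parent}


def jaccard(a, b):
    """Overlap of two edge sets; defined as 1.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def assign(docs, representatives):
    """Assign each document to the index of its most similar representative."""
    rep_edges = [tag_edges(r) for r in representatives]
    labels = []
    for doc in docs:
        edges = tag_edges(doc)
        scores = [jaccard(edges, r) for r in rep_edges]
        labels.append(scores.index(max(scores)))
    return labels


# Hypothetical toy documents: two book-like records and one order-like record.
docs = [
    "<book><title/><author><name/></author></book>",
    "<book><title/><author/></book>",
    "<order><item><price/></item><customer/></order>",
]
representatives = [docs[0], docs[2]]
labels = assign(docs, representatives)
```

Here the two book-like documents share most of their tag pairs and land with the first representative, while the order-like document lands with the second.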

Original language: English
Title of host publication: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Pages: 46-55
Number of pages: 10
DOIs: https://doi.org/10.1145/1281192.1281201
Publication status: Published - 14 Dec 2007
Externally published: Yes
Event: KDD-2007: 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - San Jose, CA, United States
Duration: 12 Aug 2007 to 15 Aug 2007

Other

Other: KDD-2007: 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Country: United States
City: San Jose, CA
Period: 12/8/07 to 15/8/07

Fingerprint

XML
Clustering algorithms
Data mining

Keywords

  • Clustering
  • XML

ASJC Scopus subject areas

  • Information Systems

Cite this

Aggarwal, C. C., Ta, N., Wang, J., Feng, J., & Zaki, M. (2007). Xproj: A framework for projected structural clustering of xml documents. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 46-55) https://doi.org/10.1145/1281192.1281201

Xproj : A framework for projected structural clustering of xml documents. / Aggarwal, Charu C.; Ta, Na; Wang, Jianyong; Feng, Jianhua; Zaki, Mohammed.

Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2007. p. 46-55.

Aggarwal, CC, Ta, N, Wang, J, Feng, J & Zaki, M 2007, Xproj: A framework for projected structural clustering of xml documents. in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 46-55, KDD-2007: 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, CA, United States, 12/8/07. https://doi.org/10.1145/1281192.1281201
Aggarwal CC, Ta N, Wang J, Feng J, Zaki M. Xproj: A framework for projected structural clustering of xml documents. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2007. p. 46-55 https://doi.org/10.1145/1281192.1281201
Aggarwal, Charu C. ; Ta, Na ; Wang, Jianyong ; Feng, Jianhua ; Zaki, Mohammed. / Xproj : A framework for projected structural clustering of xml documents. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2007. pp. 46-55
@inproceedings{86a1edfb27a54462ab5327bafc3e38f8,
title = "Xproj: A framework for projected structural clustering of xml documents",
abstract = "XML has become a popular method of data representation both on the web and in databases in recent years. One of the reasons for the popularity of XML has been its ability to encode structural information about data records. However, this structural characteristic of data sets also makes it a challenging problem for a variety of data mining problems. One such problem is that of clustering, in which the structural aspects of the data result in a high implicit dimensionality of the data representation. As a result, it becomes more difficult to cluster the data in a meaningful way. In this paper, we propose an effective clustering algorithm for XML data which uses substructures of the documents in order to gain insights about the important underlying structures. We propose new ways of using multiple sub-structural information in XML documents to evaluate the quality of intermediate cluster solutions, and guide the algorithms to a final solution which reflects the true structural behavior in individual partitions. We test the algorithm on a variety of real and synthetic data sets.",
keywords = "Clustering, XML",
author = "Aggarwal, {Charu C.} and Na Ta and Jianyong Wang and Jianhua Feng and Mohammed Zaki",
year = "2007",
month = "12",
day = "14",
doi = "10.1145/1281192.1281201",
language = "English",
isbn = "1595936092",
pages = "46--55",
booktitle = "Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining",

}

TY - GEN

T1 - Xproj

T2 - A framework for projected structural clustering of xml documents

AU - Aggarwal, Charu C.

AU - Ta, Na

AU - Wang, Jianyong

AU - Feng, Jianhua

AU - Zaki, Mohammed

PY - 2007/12/14

Y1 - 2007/12/14

N2 - XML has become a popular method of data representation both on the web and in databases in recent years. One of the reasons for the popularity of XML has been its ability to encode structural information about data records. However, this structural characteristic of data sets also makes it a challenging problem for a variety of data mining problems. One such problem is that of clustering, in which the structural aspects of the data result in a high implicit dimensionality of the data representation. As a result, it becomes more difficult to cluster the data in a meaningful way. In this paper, we propose an effective clustering algorithm for XML data which uses substructures of the documents in order to gain insights about the important underlying structures. We propose new ways of using multiple sub-structural information in XML documents to evaluate the quality of intermediate cluster solutions, and guide the algorithms to a final solution which reflects the true structural behavior in individual partitions. We test the algorithm on a variety of real and synthetic data sets.

KW - Clustering

KW - XML

UR - http://www.scopus.com/inward/record.url?scp=36849071950&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=36849071950&partnerID=8YFLogxK

U2 - 10.1145/1281192.1281201

DO - 10.1145/1281192.1281201

M3 - Conference contribution

SN - 1595936092

SN - 9781595936097

SP - 46

EP - 55

BT - Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

ER -