Data mining over multiple data sources has emerged as an important practical problem with applications in areas such as data streams, data warehouses, and bioinformatics. Although the data sources are willing to run data mining algorithms in these cases, they do not want to reveal any extra information about their data to other sources because of legal or competitive concerns. One possible solution to this problem is to use cryptographic methods. However, the computation and communication complexity of such solutions renders them impractical when a large number of data sources are involved. In this paper, we consider a scenario where multiple data sources are willing to run data mining algorithms over the union of their data, as long as each data source is guaranteed that any of its information that does not pertain to another data source will not be revealed. We focus on the classification problem in particular and present an efficient algorithm, based on ID3, for building a decision tree over an arbitrary number of distributed sources in a privacy-preserving manner.
- Data mining
- Data privacy and security
ASJC Scopus subject areas
- Information Systems and Management