Privacy-preserving analysis of distributed biomedical data

Designing efficient and secure multiparty computations using distributed statistical learning theory

Fida K. Dankar, Nisha Madathil, Samar K. Dankar, Sabri Boughorbel

Research output: Contribution to journal › Article

Abstract

Background: Biomedical research often requires large cohorts and necessitates the sharing of biomedical data with researchers around the world, which raises many privacy, ethical, and legal concerns. In the face of these concerns, privacy experts are trying to explore approaches to analyzing the distributed data while protecting its privacy. Many of these approaches are based on secure multiparty computations (SMCs). SMC is an attractive approach allowing multiple parties to collectively carry out calculations on their datasets without having to reveal their own raw data; however, it incurs heavy computation time and requires extensive communication between the involved parties.

Objective: This study aimed to develop usable and efficient SMC applications that meet the needs of the potential end-users and to raise general awareness about SMC as a tool that supports data sharing.

Methods: We have introduced distributed statistical computing (DSC) into the design of secure multiparty protocols, which allows us to conduct computations on each of the parties' sites independently and then combine these computations to form 1 estimator for the collective dataset, thus limiting communication to the final step and reducing complexity. The effectiveness of our privacy-preserving model is demonstrated through a linear regression application.

Results: Our secure linear regression algorithm was tested for accuracy and performance using real and synthetic datasets. The results showed no loss of accuracy (over nonsecure regression) and very good performance (20 min for 100 million records).

Conclusions: We used DSC to securely calculate a linear regression model over multiple datasets. Our experiments showed very good performance (in terms of the number of records it can handle). We plan to extend our method to other estimators such as logistic regression.
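
The split-and-combine idea described under Methods can be illustrated with a minimal sketch. It is not the authors' protocol: it assumes a single predictor, assumes each site shares only five aggregate sums, and performs the final addition in the clear, whereas a real deployment would carry out that step under SMC. The names `LocalStats`, `local_stats`, and `combine`, and the toy data, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LocalStats:
    """Aggregate sums a site would share instead of its raw records."""
    n: int
    sx: float
    sy: float
    sxx: float
    sxy: float


def local_stats(xs, ys):
    """Computed independently at each site, on that site's data only."""
    return LocalStats(
        n=len(xs),
        sx=sum(xs),
        sy=sum(ys),
        sxx=sum(x * x for x in xs),
        sxy=sum(x * y for x, y in zip(xs, ys)),
    )


def combine(stats):
    """Final step: add the per-site sums, then solve for slope and intercept.

    In a secure protocol this addition is the only part that needs
    communication; here it is done in plain form for illustration.
    """
    n = sum(s.n for s in stats)
    sx = sum(s.sx for s in stats)
    sy = sum(s.sy for s in stats)
    sxx = sum(s.sxx for s in stats)
    sxy = sum(s.sxy for s in stats)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Hypothetical toy data held by two sites (generated from y = 2x + 1).
site_a = ([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])
site_b = ([3.0, 4.0], [7.0, 9.0])

slope, intercept = combine([local_stats(*site_a), local_stats(*site_b)])
```

Because the least-squares solution depends on the data only through these sums, the combined estimate in this sketch equals the one obtained by pooling all records, which is consistent with the "no loss of accuracy" result reported above.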

Original language: English
Article number: e12702
Journal: Journal of Medical Internet Research
Volume: 21
Issue number: 4
DOI: 10.2196/12702
Publication status: Published - 1 Apr 2019

Fingerprint

Privacy
Linear Models
Mathematical Computing
Learning
Information Dissemination
Communication
Biomedical Research
Logistic Models
Research Personnel
Datasets

Keywords

  • Data aggregation
  • Data analytics
  • Patient data privacy
  • Personal genetic information

ASJC Scopus subject areas

  • Health Informatics

Cite this

Privacy-preserving analysis of distributed biomedical data: Designing efficient and secure multiparty computations using distributed statistical learning theory. / Dankar, Fida K.; Madathil, Nisha; Dankar, Samar K.; Boughorbel, Sabri.

In: Journal of Medical Internet Research, Vol. 21, No. 4, e12702, 01.04.2019.

@article{6c16790eb66e4a86b158fe4cf2a39018,
title = "Privacy-preserving analysis of distributed biomedical data: Designing efficient and secure multiparty computations using distributed statistical learning theory",
abstract = "Background: Biomedical research often requires large cohorts and necessitates the sharing of biomedical data with researchers around the world, which raises many privacy, ethical, and legal concerns. In the face of these concerns, privacy experts are trying to explore approaches to analyzing the distributed data while protecting its privacy. Many of these approaches are based on secure multiparty computations (SMCs). SMC is an attractive approach allowing multiple parties to collectively carry out calculations on their datasets without having to reveal their own raw data; however, it incurs heavy computation time and requires extensive communication between the involved parties. Objective: This study aimed to develop usable and efficient SMC applications that meet the needs of the potential end-users and to raise general awareness about SMC as a tool that supports data sharing. Methods: We have introduced distributed statistical computing (DSC) into the design of secure multiparty protocols, which allows us to conduct computations on each of the parties' sites independently and then combine these computations to form 1 estimator for the collective dataset, thus limiting communication to the final step and reducing complexity. The effectiveness of our privacy-preserving model is demonstrated through a linear regression application. Results: Our secure linear regression algorithm was tested for accuracy and performance using real and synthetic datasets. The results showed no loss of accuracy (over nonsecure regression) and very good performance (20 min for 100 million records). Conclusions: We used DSC to securely calculate a linear regression model over multiple datasets. Our experiments showed very good performance (in terms of the number of records it can handle). We plan to extend our method to other estimators such as logistic regression.",
keywords = "Data aggregation, Data analytics, Patient data privacy, Personal genetic information",
author = "Dankar, Fida K. and Madathil, Nisha and Dankar, Samar K. and Boughorbel, Sabri",
year = "2019",
month = "4",
day = "1",
doi = "10.2196/12702",
language = "English",
volume = "21",
journal = "Journal of Medical Internet Research",
issn = "1438-8871",
publisher = "Journal of Medical Internet Research",
number = "4",

}

TY - JOUR

T1 - Privacy-preserving analysis of distributed biomedical data

T2 - Designing efficient and secure multiparty computations using distributed statistical learning theory

AU - Dankar, Fida K.

AU - Madathil, Nisha

AU - Dankar, Samar K.

AU - Boughorbel, Sabri

PY - 2019/4/1

Y1 - 2019/4/1

AB - Background: Biomedical research often requires large cohorts and necessitates the sharing of biomedical data with researchers around the world, which raises many privacy, ethical, and legal concerns. In the face of these concerns, privacy experts are trying to explore approaches to analyzing the distributed data while protecting its privacy. Many of these approaches are based on secure multiparty computations (SMCs). SMC is an attractive approach allowing multiple parties to collectively carry out calculations on their datasets without having to reveal their own raw data; however, it incurs heavy computation time and requires extensive communication between the involved parties. Objective: This study aimed to develop usable and efficient SMC applications that meet the needs of the potential end-users and to raise general awareness about SMC as a tool that supports data sharing. Methods: We have introduced distributed statistical computing (DSC) into the design of secure multiparty protocols, which allows us to conduct computations on each of the parties' sites independently and then combine these computations to form 1 estimator for the collective dataset, thus limiting communication to the final step and reducing complexity. The effectiveness of our privacy-preserving model is demonstrated through a linear regression application. Results: Our secure linear regression algorithm was tested for accuracy and performance using real and synthetic datasets. The results showed no loss of accuracy (over nonsecure regression) and very good performance (20 min for 100 million records). Conclusions: We used DSC to securely calculate a linear regression model over multiple datasets. Our experiments showed very good performance (in terms of the number of records it can handle). We plan to extend our method to other estimators such as logistic regression.

KW - Data aggregation

KW - Data analytics

KW - Patient data privacy

KW - Personal genetic information

UR - http://www.scopus.com/inward/record.url?scp=85067288868&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85067288868&partnerID=8YFLogxK

U2 - 10.2196/12702

DO - 10.2196/12702

M3 - Article

VL - 21

JO - Journal of Medical Internet Research

JF - Journal of Medical Internet Research

SN - 1438-8871

IS - 4

M1 - e12702

ER -