Learning to identify relevant studies for systematic reviews using random forest and external information

Madian Khabsa, Ahmed Elmagarmid, Ihab Ilyas, Hossam Hammady, Mourad Ouzzani

Research output: Contribution to journal › Article

17 Citations (Scopus)

Abstract

We tackle the problem of automatically filtering studies during the preparation of systematic reviews (SRs), a task that normally entails manually inspecting thousands of studies to identify the few to be included. We model the problem as an imbalanced classification task in which misclassifying the minority class (the included studies) costs more than misclassifying the majority class. This work introduces a novel method for representing systematic reviews, based not only on lexical features but also on word clustering and citation features. This representation is shown to outperform previously used features, regardless of the classifier. We use a random forest classifier with the new features to predict included studies with high recall. The parameters of the random forest are configured automatically using heuristic methods, which allows us to provide a product that is usable in real scenarios. Experiments on a dataset of 15 systematic reviews prepared by health care professionals show that our approach can achieve high recall while saving the SR author time.
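The cost asymmetry described in the abstract can be illustrated with a small sketch. It is not the authors' method: it assumes a classifier (for instance a random forest) has already produced an inclusion probability for each study, and it applies the standard expected-cost decision rule. The cost values, scores, labels, and function names below are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def cost_sensitive_threshold(cost_fn: float, cost_fp: float) -> float:
    """Probability at or above which predicting 'include' has the lower expected cost.

    Assumes correct decisions cost nothing. Including a study with inclusion
    probability p costs (1 - p) * cost_fp in expectation; excluding it costs
    p * cost_fn. Including is cheaper when p >= cost_fp / (cost_fp + cost_fn).
    """
    return cost_fp / (cost_fp + cost_fn)


def screen(scores: List[float], threshold: float) -> List[bool]:
    """Mark a study for manual review when its score reaches the threshold."""
    return [s >= threshold for s in scores]


def recall_and_saved(labels: List[bool], predictions: List[bool]) -> Tuple[float, float]:
    """Return (recall on included studies, fraction of studies not reviewed)."""
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    excluded = sum(1 for p in predictions if not p)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    saved = excluded / len(predictions) if predictions else 0.0
    return recall, saved


# Hypothetical classifier scores and true labels (True = study belongs in the review).
scores = [0.90, 0.40, 0.15, 0.05, 0.02, 0.30, 0.01, 0.08, 0.03, 0.12]
labels = [True, True, True, False, False, False, False, False, False, False]

# Assumed costs: missing an included study is 9 times worse than reviewing an irrelevant one.
threshold = cost_sensitive_threshold(cost_fn=9.0, cost_fp=1.0)
predictions = screen(scores, threshold)
recall, saved = recall_and_saved(labels, predictions)

# For comparison, a symmetric-cost rule corresponds to a 0.5 threshold.
naive_recall, naive_saved = recall_and_saved(labels, screen(scores, 0.5))

print(f"cost-sensitive: threshold={threshold:.2f} recall={recall:.2f} saved={saved:.2f}")
print(f"symmetric:      threshold=0.50 recall={naive_recall:.2f} saved={naive_saved:.2f}")
```

On this toy data the lower, cost-sensitive threshold keeps every included study at the price of reviewing more candidates, while the symmetric threshold skips more studies but misses included ones. This is the trade-off between recall and saved screening time that the abstract refers to.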

Original language: English
Journal: Machine Learning
DOIs: 10.1007/s10994-015-5535-7
Publication status: Accepted/In press - 23 Oct 2015

Fingerprint

Classifiers
Heuristic methods
Health care
Costs
Experiments

Keywords

  • Classification
  • Inclusion prediction
  • Systematic review

ASJC Scopus subject areas

  • Artificial Intelligence
  • Software

Cite this

@article{162ac1f807564ad5bbdfa342eeb05965,
title = "Learning to identify relevant studies for systematic reviews using random forest and external information",
abstract = "We tackle the problem of automatically filtering studies during the preparation of systematic reviews (SRs), a task that normally entails manually inspecting thousands of studies to identify the few to be included. We model the problem as an imbalanced classification task in which misclassifying the minority class (the included studies) costs more than misclassifying the majority class. This work introduces a novel method for representing systematic reviews, based not only on lexical features but also on word clustering and citation features. This representation is shown to outperform previously used features, regardless of the classifier. We use a random forest classifier with the new features to predict included studies with high recall. The parameters of the random forest are configured automatically using heuristic methods, which allows us to provide a product that is usable in real scenarios. Experiments on a dataset of 15 systematic reviews prepared by health care professionals show that our approach can achieve high recall while saving the SR author time.",
keywords = "Classification, Inclusion prediction, Systematic review",
author = "Madian Khabsa and Ahmed Elmagarmid and Ihab Ilyas and Hossam Hammady and Mourad Ouzzani",
year = "2015",
month = "10",
day = "23",
doi = "10.1007/s10994-015-5535-7",
language = "English",
journal = "Machine Learning",
issn = "0885-6125",
publisher = "Springer Netherlands",
}

TY - JOUR

T1 - Learning to identify relevant studies for systematic reviews using random forest and external information

AU - Khabsa, Madian

AU - Elmagarmid, Ahmed

AU - Ilyas, Ihab

AU - Hammady, Hossam

AU - Ouzzani, Mourad

PY - 2015/10/23

Y1 - 2015/10/23

AB - We tackle the problem of automatically filtering studies during the preparation of systematic reviews (SRs), a task that normally entails manually inspecting thousands of studies to identify the few to be included. We model the problem as an imbalanced classification task in which misclassifying the minority class (the included studies) costs more than misclassifying the majority class. This work introduces a novel method for representing systematic reviews, based not only on lexical features but also on word clustering and citation features. This representation is shown to outperform previously used features, regardless of the classifier. We use a random forest classifier with the new features to predict included studies with high recall. The parameters of the random forest are configured automatically using heuristic methods, which allows us to provide a product that is usable in real scenarios. Experiments on a dataset of 15 systematic reviews prepared by health care professionals show that our approach can achieve high recall while saving the SR author time.

KW - Classification

KW - Inclusion prediction

KW - Systematic review

UR - http://www.scopus.com/inward/record.url?scp=84945162526&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84945162526&partnerID=8YFLogxK

U2 - 10.1007/s10994-015-5535-7

DO - 10.1007/s10994-015-5535-7

M3 - Article

AN - SCOPUS:84959537001

JO - Machine Learning

JF - Machine Learning

SN - 0885-6125

ER -