Demographic research with non-representative internet data

Emilio Zagheni, Ingmar Weber

Research output: Contribution to journal › Article

25 Citations (Scopus)

Abstract

Purpose - Internet data hold many promises for demographic research, but come with severe drawbacks due to several types of bias. The purpose of this paper is to review the literature that uses internet data for demographic studies and to present a general framework for addressing the problem of selection bias in non-representative samples.

Design/methodology/approach - The authors propose two main approaches to reduce bias. When ground truth data are available, the authors suggest a method that relies on calibration of the online data against reliable official statistics. When no ground truth data are available, the authors propose a difference-in-differences approach to evaluate relative trends.

Findings - The authors offer a generalization of existing techniques. Although there is no definite answer to the question of whether statistical inference can be made from non-representative samples, the authors show that, when certain assumptions are met, signal can be extracted from noisy and biased data.

Research limitations/implications - The methods are sensitive to a number of assumptions. These include some regularities in the way the bias changes across different locations, across different demographic groups and between time steps. The assumptions discussed might not always hold. In particular, the scenario where bias varies in an unpredictable manner and, at the same time, no “ground truth” is available to continuously calibrate the model remains challenging and is beyond the scope of this paper.

Originality/value - The paper combines a critical review of existing substantive and methodological literature with a generalization of prior techniques. It intends to provide a fresh perspective on the issue and to stimulate the methodological discussion among social scientists.
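The two approaches named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' formulation: the function names, the single multiplicative calibration factor, the log-scale difference-in-differences, and all numbers are assumptions chosen only to show the general idea. The sketch assumes the online count equals the true count times a bias factor that stays fixed within a group over time.

```python
import math


def calibration_factor(official_count, online_count):
    """Ratio that rescales an online count to match a trusted official figure.

    Assumption (hypothetical): the bias is a single multiplicative factor.
    """
    if online_count <= 0:
        raise ValueError("online_count must be positive")
    return official_count / online_count


def calibrate(online_count, factor):
    """Apply a previously estimated calibration factor to a new online count."""
    return online_count * factor


def log_diff_in_diff(a_before, a_after, b_before, b_after):
    """Difference between two groups' log growth rates.

    If each group's bias factor is constant over time, it cancels inside
    each ratio, so the result reflects relative trends, not levels.
    """
    change_a = math.log(a_after / a_before)
    change_b = math.log(b_after / b_before)
    return change_a - change_b


# Hypothetical numbers, for illustration only.
# Ground truth available: 1000 people officially, 250 seen online.
factor = calibration_factor(1000, 250)
estimate = calibrate(300, factor)  # rescale a later online count

# No ground truth: group A's online count grows 50 -> 60, group B stays at 20.
relative_trend = log_diff_in_diff(50, 60, 20, 20)
```

The calibration step only carries over to new data if the bias factor is stable, and the difference-in-differences step recovers relative trends only, which mirrors the sensitivity to assumptions that the abstract itself flags.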

Original language: English
Pages (from-to): 13-25
Number of pages: 13
Journal: International Journal of Manpower
Volume: 36
Issue number: 1
DOI: 10.1108/IJM-12-2014-0261
Publication status: Published - 7 Apr 2015

Fingerprint

Internet
Statistics
Calibration
World Wide Web
Demographics
Regularity
Statistical inference
Official statistics
Selection bias
Design methodology
Difference-in-differences
Scenarios
Internet use

Keywords

  • Demography
  • Digital breadcrumbs
  • Internet data
  • Non-representative samples
  • Selection bias

Paper type: Research paper

ASJC Scopus subject areas

  • Management of Technology and Innovation
  • Strategy and Management
  • Organizational Behavior and Human Resource Management

Cite this

Demographic research with non-representative internet data. / Zagheni, Emilio; Weber, Ingmar.

In: International Journal of Manpower, Vol. 36, No. 1, 07.04.2015, p. 13-25.


@article{758a1d849f4646b4b200dca8fd8acd38,
title = "Demographic research with non-representative internet data",
abstract = "Purpose - Internet data hold many promises for demographic research, but come with severe drawbacks due to several types of bias. The purpose of this paper is to review the literature that uses internet data for demographic studies and to present a general framework for addressing the problem of selection bias in non-representative samples. Design/methodology/approach - The authors propose two main approaches to reduce bias. When ground truth data are available, the authors suggest a method that relies on calibration of the online data against reliable official statistics. When no ground truth data are available, the authors propose a difference-in-differences approach to evaluate relative trends. Findings - The authors offer a generalization of existing techniques. Although there is no definite answer to the question of whether statistical inference can be made from non-representative samples, the authors show that, when certain assumptions are met, signal can be extracted from noisy and biased data. Research limitations/implications - The methods are sensitive to a number of assumptions. These include some regularities in the way the bias changes across different locations, across different demographic groups and between time steps. The assumptions discussed might not always hold. In particular, the scenario where bias varies in an unpredictable manner and, at the same time, no “ground truth” is available to continuously calibrate the model remains challenging and is beyond the scope of this paper. Originality/value - The paper combines a critical review of existing substantive and methodological literature with a generalization of prior techniques. It intends to provide a fresh perspective on the issue and to stimulate the methodological discussion among social scientists.",
keywords = "Demography, Digital breadcrumbs, Internet data, Non-representative samples, Selection bias",
author = "Emilio Zagheni and Ingmar Weber",
year = "2015",
month = "4",
day = "7",
doi = "10.1108/IJM-12-2014-0261",
language = "English",
volume = "36",
pages = "13--25",
journal = "International Journal of Manpower",
issn = "0143-7720",
publisher = "Emerald Group Publishing Ltd.",
number = "1",

}

TY - JOUR

T1 - Demographic research with non-representative internet data

AU - Zagheni, Emilio

AU - Weber, Ingmar

PY - 2015/4/7

Y1 - 2015/4/7

AB - Purpose - Internet data hold many promises for demographic research, but come with severe drawbacks due to several types of bias. The purpose of this paper is to review the literature that uses internet data for demographic studies and to present a general framework for addressing the problem of selection bias in non-representative samples. Design/methodology/approach - The authors propose two main approaches to reduce bias. When ground truth data are available, the authors suggest a method that relies on calibration of the online data against reliable official statistics. When no ground truth data are available, the authors propose a difference-in-differences approach to evaluate relative trends. Findings - The authors offer a generalization of existing techniques. Although there is no definite answer to the question of whether statistical inference can be made from non-representative samples, the authors show that, when certain assumptions are met, signal can be extracted from noisy and biased data. Research limitations/implications - The methods are sensitive to a number of assumptions. These include some regularities in the way the bias changes across different locations, across different demographic groups and between time steps. The assumptions discussed might not always hold. In particular, the scenario where bias varies in an unpredictable manner and, at the same time, no “ground truth” is available to continuously calibrate the model remains challenging and is beyond the scope of this paper. Originality/value - The paper combines a critical review of existing substantive and methodological literature with a generalization of prior techniques. It intends to provide a fresh perspective on the issue and to stimulate the methodological discussion among social scientists.

KW - Demography

KW - Digital breadcrumbs

KW - Internet data

KW - Non-representative samples

KW - Selection bias

UR - http://www.scopus.com/inward/record.url?scp=84925011891&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84925011891&partnerID=8YFLogxK

U2 - 10.1108/IJM-12-2014-0261

DO - 10.1108/IJM-12-2014-0261

M3 - Article

AN - SCOPUS:84925011891

VL - 36

SP - 13

EP - 25

JO - International Journal of Manpower

JF - International Journal of Manpower

SN - 0143-7720

IS - 1

ER -