Characterization of missing values in untargeted MS-based metabolomics data and evaluation of missing data handling strategies

Kieu Trinh Do, Simone Wahl, Johannes Raffler, Sophie Molnos, Michael Laimighofer, Jerzy Adamski, Karsten Suhre, Konstantin Strauch, Annette Peters, Christian Gieger, Claudia Langenberg, Isobel D. Stewart, Fabian J. Theis, Harald Grallert, Gabi Kastenmüller, Jan Krumsiek

Research output: Contribution to journal (Article)

8 Citations (Scopus)

Abstract

Background: Untargeted mass spectrometry (MS)-based metabolomics data often contain missing values that reduce statistical power and can introduce bias in biomedical studies. However, the various sources of missing values and the strategies to handle them have received little systematic assessment. Missing values can arise systematically, e.g. from run day-dependent effects due to limits of detection (LOD), or randomly, for instance as a consequence of sample preparation.

Methods: We investigated patterns of missing data in an MS-based metabolomics experiment of serum samples from the German KORA F4 cohort (n = 1750). We then evaluated 31 imputation methods in a simulation framework and biologically validated the results by applying all imputation approaches to real metabolomics data. We examined the ability of each method to reconstruct biochemical pathways from data-driven correlation networks, and its ability to increase statistical power while preserving the strength of established metabolic quantitative trait loci.

Results: Run day-dependent LOD-based missing data account for most missing values in the metabolomics dataset. Although multiple imputation by chained equations performed well in many scenarios, it is computationally and statistically challenging. K-nearest neighbors (KNN) imputation on observations with variable pre-selection showed robust performance across all evaluation schemes and is computationally more tractable.

Conclusion: Missing data in untargeted MS-based metabolomics data occur for various reasons. Based on our results, we recommend KNN-based imputation on observations with variable pre-selection, since it showed robust results in all evaluation schemes.
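The recommended strategy, KNN imputation on observations with variable pre-selection, can be illustrated with a minimal sketch. The abstract does not give the implementation details, so everything specific here is an assumption made for illustration: the function name knn_impute_obs, the default k = 10, the correlation cutoff min_abs_corr = 0.2 used for pre-selecting variables, the scaled Euclidean distance, and the unweighted neighbor mean. It should not be read as the authors' code.

```python
import numpy as np


def knn_impute_obs(X, k=10, min_abs_corr=0.2):
    """Impute NaNs in X (samples x metabolites) from the k nearest samples.

    Hypothetical sketch: for each metabolite with missing values, distances
    between samples are computed only on pre-selected variables, i.e. those
    whose absolute correlation with that metabolite is >= min_abs_corr.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape

    # Standardize each column on its observed values so distances are comparable.
    mu = np.nanmean(X, axis=0)
    sd = np.nanstd(X, axis=0)
    sd[~(sd > 0)] = 1.0  # guard against zero or undefined spread
    Z = (X - mu) / sd

    # Approximate pairwise-complete correlations from the standardized data.
    obs = ~np.isnan(Z)
    Z0 = np.where(obs, Z, 0.0)
    n_pair = obs.T.astype(float) @ obs.astype(float)
    corr = np.divide(Z0.T @ Z0, n_pair, out=np.zeros((p, p)), where=n_pair > 0)

    out = X.copy()
    for j in range(p):
        miss = np.isnan(X[:, j])
        donors = np.flatnonzero(~miss)  # samples with metabolite j observed
        if not miss.any() or donors.size == 0:
            continue

        # Variable pre-selection: keep only variables correlated with j.
        sel = np.flatnonzero(np.abs(corr[j]) >= min_abs_corr)
        sel = sel[sel != j]
        col_mean = X[donors, j].mean()  # fallback if no usable neighbor exists

        for i in np.flatnonzero(miss):
            diff = Z[np.ix_(donors, sel)] - Z[i, sel]
            shared = (~np.isnan(diff)).sum(axis=1)  # variables observed in both
            dist = np.full(donors.size, np.inf)
            ok = shared > 0
            # Root mean squared difference over the shared variables.
            dist[ok] = np.sqrt(np.nansum(diff[ok] ** 2, axis=1) / shared[ok])
            nearest = np.argsort(dist)[:k]
            nearest = nearest[np.isfinite(dist[nearest])]
            out[i, j] = X[donors[nearest], j].mean() if nearest.size else col_mean
    return out


# Toy data: five correlated "metabolites" plus three unrelated noise variables.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
X_true = np.hstack([latent + 0.1 * rng.normal(size=(200, 5)),
                    rng.normal(size=(200, 3))])
X_miss = X_true.copy()
X_miss[rng.choice(200, size=20, replace=False), 0] = np.nan  # random missingness
X_imp = knn_impute_obs(X_miss, k=5)
```

Restricting the distance to correlated variables keeps unrelated metabolites from dominating the neighbor search. The toy data use random missingness only; LOD-based, run day-dependent missingness as described in the abstract would need a separate simulation.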

Original language: English
Article number: 128
Journal: Metabolomics
Volume: 14
Issue number: 10
DOI: https://doi.org/10.1007/s11306-018-1420-2
Publication status: Published - 1 Oct 2018

Fingerprint

Metabolomics
Data handling
Mass spectrometry
Limit of Detection
Quantitative Trait Loci
Serum
Experiments

Keywords

  • Batch effects
  • K-nearest neighbor
  • Limit of detection
  • Mass spectrometry
  • MICE
  • Missing values imputation
  • Untargeted metabolomics

ASJC Scopus subject areas

  • Endocrinology, Diabetes and Metabolism
  • Biochemistry
  • Clinical Biochemistry

Cite this

Do, K. T., Wahl, S., Raffler, J., Molnos, S., Laimighofer, M., Adamski, J., ... Krumsiek, J. (2018). Characterization of missing values in untargeted MS-based metabolomics data and evaluation of missing data handling strategies. Metabolomics, 14(10), [128]. https://doi.org/10.1007/s11306-018-1420-2

@article{b483820404064c789ae2a7bcdf1769d1,
title = "Characterization of missing values in untargeted MS-based metabolomics data and evaluation of missing data handling strategies",
keywords = "Batch effects, K-nearest neighbor, Limit of detection, Mass spectrometry, MICE, Missing values imputation, Untargeted metabolomics",
author = "Do, {Kieu Trinh} and Simone Wahl and Johannes Raffler and Sophie Molnos and Michael Laimighofer and Jerzy Adamski and Karsten Suhre and Konstantin Strauch and Annette Peters and Christian Gieger and Claudia Langenberg and Stewart, {Isobel D.} and Theis, {Fabian J.} and Harald Grallert and Gabi Kastenm{\"u}ller and Jan Krumsiek",
year = "2018",
month = "10",
day = "1",
doi = "10.1007/s11306-018-1420-2",
language = "English",
volume = "14",
journal = "Metabolomics",
issn = "1573-3882",
publisher = "Springer New York",
number = "10",

}

TY - JOUR
T1 - Characterization of missing values in untargeted MS-based metabolomics data and evaluation of missing data handling strategies
AU - Do, Kieu Trinh
AU - Wahl, Simone
AU - Raffler, Johannes
AU - Molnos, Sophie
AU - Laimighofer, Michael
AU - Adamski, Jerzy
AU - Suhre, Karsten
AU - Strauch, Konstantin
AU - Peters, Annette
AU - Gieger, Christian
AU - Langenberg, Claudia
AU - Stewart, Isobel D.
AU - Theis, Fabian J.
AU - Grallert, Harald
AU - Kastenmüller, Gabi
AU - Krumsiek, Jan
PY - 2018/10/1
Y1 - 2018/10/1
KW - Batch effects
KW - K-nearest neighbor
KW - Limit of detection
KW - Mass spectrometry
KW - MICE
KW - Missing values imputation
KW - Untargeted metabolomics
UR - http://www.scopus.com/inward/record.url?scp=85053638868&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85053638868&partnerID=8YFLogxK
U2 - 10.1007/s11306-018-1420-2
DO - 10.1007/s11306-018-1420-2
M3 - Article
C2 - 30830398
AN - SCOPUS:85053638868
VL - 14
JO - Metabolomics
JF - Metabolomics
SN - 1573-3882
IS - 10
M1 - 128
ER -