DeepCrystal

A deep learning framework for sequence-based protein crystallization prediction

Research output: Contribution to journal › Article

1 Citation (Scopus)

Abstract

Motivation: Protein structure determination has primarily been performed using X-ray crystallography. To overcome the high cost, high attrition rate and series of trial-and-error settings, many in silico methods have been developed to predict crystallization propensities of proteins based on their sequences. However, the majority of these methods build their predictors by extracting features from protein sequences, which is computationally expensive and can explode the feature space. We propose DeepCrystal, a deep learning framework for sequence-based protein crystallization prediction. It uses deep learning to identify proteins that can produce diffraction-quality crystals without the need to manually engineer additional biochemical and structural features from sequence. Our model is based on convolutional neural networks, which can exploit frequently occurring k-mers and sets of k-mers from the protein sequences to distinguish proteins that will result in diffraction-quality crystals from those that will not.

Results: Our model surpasses previous sequence-based protein crystallization predictors in terms of recall, F-score, accuracy and Matthews correlation coefficient (MCC) on three independent test sets. DeepCrystal achieves an average improvement of 1.4% and 12.1% in recall when compared to its closest competitors, Crysalis II and Crysf, respectively. In addition, DeepCrystal attains an average improvement of 2.1% and 6.0% for F-score, 1.9% and 3.9% for accuracy, and 3.8% and 7.0% for MCC with respect to Crysalis II and Crysf, respectively, on the independent test sets.
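The four evaluation metrics named in the abstract can be computed from the counts of a binary confusion matrix. The following is a minimal sketch using the standard textbook definitions; it is not the authors' evaluation code. The function name `binary_metrics` and the example counts are hypothetical and chosen only for illustration.

```python
import math


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Recall, F-score, accuracy and MCC from confusion-matrix counts.

    tp/fn: crystallizable proteins predicted correctly/incorrectly.
    tn/fp: non-crystallizable proteins predicted correctly/incorrectly.
    Any ratio with a zero denominator is reported as 0.0.
    """
    total = tp + fp + tn + fn
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # F-score (F1) is the harmonic mean of precision and recall.
    f_score = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = (tp + tn) / total if total else 0.0
    # Matthews correlation coefficient ranges from -1 to +1.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "recall": recall,
        "f_score": f_score,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts for illustration only; these are not results from the paper.
scores = binary_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which makes it less sensitive to class imbalance between crystallizable and non-crystallizable proteins.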

Original language: English
Pages (from-to): 2216-2225
Number of pages: 10
Journal: Bioinformatics
Volume: 35
Issue number: 13
DOI: 10.1093/bioinformatics/bty953
Publication status: Published - 1 Jul 2019

Fingerprint

Crystallization
Learning
Proteins
Protein
Prediction
Protein Sequence
Test Set
Independent Set
Correlation coefficient
Diffraction
Predictors
Crystal
Attrition
Trial and error
Protein Structure
Feature Space
Crystals
Framework
Deep learning
Neural Networks

ASJC Scopus subject areas

  • Statistics and Probability
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Computational Theory and Mathematics
  • Computational Mathematics

Cite this

@article{dabed577157d479e834e99bd4d77c77a,
title = "{DeepCrystal}: A deep learning framework for sequence-based protein crystallization prediction",
abstract = "Motivation: Protein structure determination has primarily been performed using X-ray crystallography. To overcome the high cost, high attrition rate and series of trial-and-error settings, many in silico methods have been developed to predict crystallization propensities of proteins based on their sequences. However, the majority of these methods build their predictors by extracting features from protein sequences, which is computationally expensive and can explode the feature space. We propose DeepCrystal, a deep learning framework for sequence-based protein crystallization prediction. It uses deep learning to identify proteins that can produce diffraction-quality crystals without the need to manually engineer additional biochemical and structural features from sequence. Our model is based on convolutional neural networks, which can exploit frequently occurring k-mers and sets of k-mers from the protein sequences to distinguish proteins that will result in diffraction-quality crystals from those that will not. Results: Our model surpasses previous sequence-based protein crystallization predictors in terms of recall, F-score, accuracy and Matthews correlation coefficient (MCC) on three independent test sets. DeepCrystal achieves an average improvement of 1.4{\%} and 12.1{\%} in recall when compared to its closest competitors, Crysalis II and Crysf, respectively. In addition, DeepCrystal attains an average improvement of 2.1{\%} and 6.0{\%} for F-score, 1.9{\%} and 3.9{\%} for accuracy, and 3.8{\%} and 7.0{\%} for MCC with respect to Crysalis II and Crysf, respectively, on the independent test sets.",
author = "Abdurrahman Elbasir and Balasubramanian Moovarkumudalvan and Khalid Kunji and Prasanna Kolatkar and Raghvendra Mall and Halima Bensmail",
year = "2019",
month = "7",
day = "1",
doi = "10.1093/bioinformatics/bty953",
language = "English",
volume = "35",
pages = "2216--2225",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",
number = "13",
}

TY - JOUR

T1 - DeepCrystal

T2 - A deep learning framework for sequence-based protein crystallization prediction

AU - Elbasir, Abdurrahman

AU - Moovarkumudalvan, Balasubramanian

AU - Kunji, Khalid

AU - Kolatkar, Prasanna

AU - Mall, Raghvendra

AU - Bensmail, Halima

PY - 2019/7/1

Y1 - 2019/7/1

AB - Motivation: Protein structure determination has primarily been performed using X-ray crystallography. To overcome the high cost, high attrition rate and series of trial-and-error settings, many in silico methods have been developed to predict crystallization propensities of proteins based on their sequences. However, the majority of these methods build their predictors by extracting features from protein sequences, which is computationally expensive and can explode the feature space. We propose DeepCrystal, a deep learning framework for sequence-based protein crystallization prediction. It uses deep learning to identify proteins that can produce diffraction-quality crystals without the need to manually engineer additional biochemical and structural features from sequence. Our model is based on convolutional neural networks, which can exploit frequently occurring k-mers and sets of k-mers from the protein sequences to distinguish proteins that will result in diffraction-quality crystals from those that will not. Results: Our model surpasses previous sequence-based protein crystallization predictors in terms of recall, F-score, accuracy and Matthews correlation coefficient (MCC) on three independent test sets. DeepCrystal achieves an average improvement of 1.4% and 12.1% in recall when compared to its closest competitors, Crysalis II and Crysf, respectively. In addition, DeepCrystal attains an average improvement of 2.1% and 6.0% for F-score, 1.9% and 3.9% for accuracy, and 3.8% and 7.0% for MCC with respect to Crysalis II and Crysf, respectively, on the independent test sets.

UR - http://www.scopus.com/inward/record.url?scp=85069784137&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85069784137&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/bty953

DO - 10.1093/bioinformatics/bty953

M3 - Article

VL - 35

SP - 2216

EP - 2225

JO - Bioinformatics

JF - Bioinformatics

SN - 1367-4803

IS - 13

ER -