Toward an efficient and scalable feature selection approach for internet traffic classification

Adil Fahad, Zahir Tari, Ibrahim Khalil, Ibrahim Habib, Hussein Alnuweiri

Research output: Contribution to journal › Article

48 Citations (Scopus)

Abstract

There is significant interest in the network management and industrial security communities in identifying the "best" and most relevant features of network traffic in order to properly characterize user behaviour and predict future traffic. Eliminating redundant features is an important Machine Learning (ML) task because it helps to identify the best features, which improves classification accuracy and reduces the computational complexity of constructing the classifier. In practice, feature selection (FS) techniques can be used as a preprocessing step to eliminate irrelevant features and as a knowledge discovery tool to reveal the "best" features in many soft computing applications. In this paper, we investigate the advantages and disadvantages of such FS techniques using newly proposed metrics (namely goodness, stability and similarity). We continue our efforts toward developing an integrated FS technique that builds on the key strengths of existing FS techniques. We propose a novel way to identify the "best" features efficiently and accurately: first combining the results of some well-known FS techniques to find consistent features, and then using the proposed concept of support to select the smallest set of features that covers the data optimally. The empirical study over ten high-dimensional network traffic data sets demonstrates a significant gain in accuracy and improved run-time performance of a classifier compared with the individual results produced by some well-known FS techniques.
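The two-step idea in the abstract (combine the outputs of several FS techniques, then keep the features with enough support) can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not define "support" or "similarity", so the sketch assumes support is the fraction of techniques that selected a feature and uses Jaccard overlap as a stand-in similarity measure. The technique names, feature names, and the `min_support` threshold are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set


def feature_support(selections: Dict[str, Set[str]]) -> Dict[str, float]:
    """Fraction of FS techniques that selected each feature (assumed definition)."""
    votes = Counter(f for subset in selections.values() for f in subset)
    n = len(selections)
    return {f: count / n for f, count in votes.items()}


def consistent_features(selections: Dict[str, Set[str]],
                        min_support: float = 0.5) -> List[str]:
    """Features whose support reaches min_support, highest support first."""
    support = feature_support(selections)
    kept = [f for f, s in support.items() if s >= min_support]
    # Sort by descending support, then by name so the order is deterministic.
    return sorted(kept, key=lambda f: (-support[f], f))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap between two feature subsets (stand-in similarity)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical outputs of three FS techniques on the same traffic data set.
selections = {
    "info_gain": {"pkt_size_mean", "dst_port", "duration"},
    "chi_square": {"pkt_size_mean", "dst_port", "iat_mean"},
    "correlation": {"pkt_size_mean", "duration", "iat_var"},
}

if __name__ == "__main__":
    print(consistent_features(selections, min_support=0.6))
    print(jaccard(selections["info_gain"], selections["chi_square"]))
```

Raising `min_support` shrinks the selected set toward the features on which every technique agrees; the paper's own support criterion may behave differently.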

Original language: English
Pages (from-to): 2040-2057
Number of pages: 18
Journal: Computer Networks
Volume: 57
Issue number: 9
DOIs: 10.1016/j.comnet.2013.04.005
Publication status: Published - 19 Jun 2013
Externally published: Yes

Fingerprint

Feature extraction
Internet
Classifiers
Soft computing
Network management
Data mining
Learning systems
Computational complexity

Keywords

  • Feature selection
  • Metrics
  • Traffic classification

ASJC Scopus subject areas

  • Computer Networks and Communications

Cite this

Toward an efficient and scalable feature selection approach for internet traffic classification. / Fahad, Adil; Tari, Zahir; Khalil, Ibrahim; Habib, Ibrahim; Alnuweiri, Hussein.

In: Computer Networks, Vol. 57, No. 9, 19.06.2013, p. 2040-2057.

Fahad, Adil ; Tari, Zahir ; Khalil, Ibrahim ; Habib, Ibrahim ; Alnuweiri, Hussein. / Toward an efficient and scalable feature selection approach for internet traffic classification. In: Computer Networks. 2013 ; Vol. 57, No. 9. pp. 2040-2057.
@article{e60ad391ee5f4def91533530e2bc87b6,
title = "Toward an efficient and scalable feature selection approach for internet traffic classification",
keywords = "Feature selection, Metrics, Traffic classification",
author = "Adil Fahad and Zahir Tari and Ibrahim Khalil and Ibrahim Habib and Hussein Alnuweiri",
year = "2013",
month = "6",
day = "19",
doi = "10.1016/j.comnet.2013.04.005",
language = "English",
volume = "57",
pages = "2040--2057",
journal = "Computer Networks",
issn = "1389-1286",
publisher = "Elsevier",
number = "9",

}

TY - JOUR

T1 - Toward an efficient and scalable feature selection approach for internet traffic classification

AU - Fahad, Adil

AU - Tari, Zahir

AU - Khalil, Ibrahim

AU - Habib, Ibrahim

AU - Alnuweiri, Hussein

PY - 2013/6/19

Y1 - 2013/6/19

KW - Feature selection

KW - Metrics

KW - Traffic classification

UR - http://www.scopus.com/inward/record.url?scp=84878323159&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84878323159&partnerID=8YFLogxK

U2 - 10.1016/j.comnet.2013.04.005

DO - 10.1016/j.comnet.2013.04.005

M3 - Article

VL - 57

SP - 2040

EP - 2057

JO - Computer Networks

JF - Computer Networks

SN - 1389-1286

IS - 9

ER -