Extensive experimental evidence is required to study the impact of text categorization approaches on real data and to assess their performance in operational scenarios. This paper discusses a wide set of profile-based classification models (a class of very efficient classifiers) that are sensitive to the syntactic information extracted from source texts. Several classifiers are tested, ranging from traditional approaches (e.g., variants of the vector space model, like SMART, or linear regression models) to original methods. All the experiments aim to evaluate some newly introduced feature weighting and inference models and to characterize the role of different kinds of linguistic information. The final purpose is thus to give insight into the effective and efficient use of linguistic information for text categorization. The results suggest that an optimal exploitation of linguistic features can be obtained by suitably selecting among feature weighting and inference methods. The empirical evidence collected in this paper over a wide range of corpora and languages is offered as a useful basis for the systematic design of operational statistical NLP-driven text classifiers.
ASJC Scopus subject areas
- Control and Systems Engineering
- Electrical and Electronic Engineering
- Artificial Intelligence