Automating Survey Coding by Multiclass Text Categorization Techniques

Daniela Giorgetti, Fabrizio Sebastiani

Research output: Contribution to journalReview article

13 Citations (Scopus)


Survey coding is the task of assigning a symbolic code from a predefined set of such codes to the answer given in response to an open-ended question in a questionnaire (aka survey). This task is usually carried out to group respondents according to a predefined scheme based on their answers. Survey coding has several applications, especially in the social sciences, ranging from the simple classification of respondents to the extraction of statistics on political opinions, health and lifestyle habits, customer satisfaction, brand fidelity, and patient satisfaction. Survey coding is a difficult task, because the code that should be attributed to a respondent based on the answer she has given is a matter of subjective judgment, and thus requires expertise. It is thus unsurprising that this task has traditionally been performed manually, by trained coders. Some attempts have been made at automating this task, most of them based on detecting the similarity between the answer and textual descriptions of the meanings of the candidate codes. We take a radically new stand, and formulate the problem of automated survey coding as a text categorization problem, that is, as the problem of learning, by means of supervised machine learning techniques, a model of the association between answers and codes from a training set of preceded answers, and applying the resulting model to the classification of new answers. In this article we experiment with two different learning techniques: one based on naive Bayesian classification, and the other one based on multiclass support vector machines, and test the resulting framework on a corpus of social surveys. The results we have obtained significantly outperform the results achieved by previous automated survey coding approaches.

Original languageEnglish
Pages (from-to)1269-1277
Number of pages9
JournalJournal of the American Society for Information Science and Technology
Issue number14
Publication statusPublished - 1 Dec 2003


ASJC Scopus subject areas

  • Software
  • Information Systems
  • Human-Computer Interaction
  • Computer Networks and Communications
  • Artificial Intelligence

Cite this