(Mostly) Plain language research results

Semi-automatic coding of open-ended questions in surveys

Survey methods play a central role in much social science research. Most methodological survey research has focused on closed-ended questions (numeric answers, multiple choice, choose all that apply). Open-ended questions are also important: they do not constrain respondents’ answers, and they do not force respondents to choose the least awkward of a fixed set of options. Responses to open-ended questions have historically been under-utilized, in part because collecting and analyzing text answers is labour intensive. The increasing use of web surveys now makes such data particularly easy to collect, because answers to open-ended questions no longer need to be transcribed.

Text data from open-ended survey questions are more difficult to analyze than categorical or numeric responses, so data from open-ended questions continue to be frequently ignored in the analysis of survey data. When categorization of open-ended answers is essential, multiple human coders usually code the answers into categories manually. Recent advances in text mining have enabled automation of such coding, but automated algorithms are not accurate enough to entirely replace humans. While not perfect, text mining algorithms can distinguish between text answers that are almost certainly categorized correctly and those where there is considerable uncertainty about the assigned category. We have proposed semi-automatic coding as a solution that is as accurate as fully manual human coding but requires less human involvement.

We use automatic coding for easy-to-categorize answers and involve human coders for hard-to-code answers, so that overall accuracy is not sacrificed. The basic steps are as follows:

  1. Train a statistical machine learning algorithm on a set of answers that have been categorized by humans (typically about 500 answer texts).
  2. Apply the algorithm to predict the categorization of the answers not yet categorized by humans. For each answer, the algorithm chooses the category with the highest predicted probability of being correct.
  3. If the probability of the predicted category exceeds a threshold, the automated classification is accepted; otherwise, the answer is given to a human coder.

Threshold   Fraction automatically categorized   Accuracy
   0.9                    0.15                     0.95
   0.8                    0.31                     0.90
   0.7                    0.46                     0.87
   0.6                    0.58                     0.82
   0.0                    1.00                     0.65
Disclosure data: threshold for automatic classification, fraction of data above the threshold, and achieved accuracy.
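The three steps above can be sketched in Python. The specific learner here (TF-IDF features with logistic regression from scikit-learn) is an assumption for illustration; the procedure only requires a classifier that produces per-category probabilities.

```python
# Sketch of the semi-automatic coding procedure. The choice of
# TF-IDF + logistic regression is an assumption; any probabilistic
# classifier would fit the same three-step scheme.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def semi_automatic_code(train_texts, train_labels, new_texts, threshold=0.8):
    """Return (auto_coded, needs_human): (index, label) pairs accepted
    automatically, and indices of answers routed to a human coder."""
    # Step 1: train on the human-coded answers.
    vec = TfidfVectorizer()
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vec.fit_transform(train_texts), train_labels)

    # Step 2: predict category probabilities for the uncoded answers.
    probs = clf.predict_proba(vec.transform(new_texts))
    best = probs.argmax(axis=1)      # most probable category per answer
    confidence = probs.max(axis=1)   # its predicted probability

    # Step 3: accept confident predictions, route the rest to a human.
    auto_coded = [(i, clf.classes_[best[i]])
                  for i in range(len(new_texts)) if confidence[i] >= threshold]
    needs_human = [i for i in range(len(new_texts))
                   if confidence[i] < threshold]
    return auto_coded, needs_human
```

Raising the threshold sends more answers to human coders; lowering it codes more answers automatically at the cost of accuracy, which is the trade-off shown in the table.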
In our paper we tested this approach with two open-ended questions: respondents’ advice to a patient in a hypothetical dilemma, and a follow-up probe about respondents’ perception of disclosure/privacy risk. When targeting 80% accuracy, we found that 47%-58% of the data could be categorized automatically. We also found that semi-automatic prediction does not distort the distribution of outcome classes. Because of the overhead of setting up the statistical machine learning, semi-automatic coding makes most sense when there are roughly 1,500-2,000 or more open-ended answers to code.
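Given a held-out set of answers with known categories, the coverage/accuracy trade-off reported in the table can be tabulated directly. A minimal sketch, where `confidences` holds the classifier's highest predicted probability for each answer (the function name and arguments are illustrative, not from the paper):

```python
def threshold_tradeoff(confidences, predicted, actual, thresholds):
    """For each candidate threshold, report the fraction of answers that
    would be coded automatically and the accuracy among those answers."""
    rows = []
    for t in thresholds:
        # Answers whose top predicted probability clears the threshold.
        kept = [i for i, c in enumerate(confidences) if c >= t]
        fraction = len(kept) / len(confidences)
        accuracy = (sum(predicted[i] == actual[i] for i in kept) / len(kept)
                    if kept else float("nan"))
        rows.append((t, round(fraction, 2), round(accuracy, 2)))
    return rows
```

Sweeping `thresholds` over a grid and picking the lowest threshold that still meets a target accuracy (e.g. 80%) reproduces the kind of analysis behind the table above.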

