Topic Modelling and Classification of Open-Ended Survey Responses
We're trying to avoid hand-coding dozens (hundreds? please, no) of open-ended survey responses just to train a classification model. Our recent project was, basically, figuring out how to categorize those open-ended responses automatically.
What we envision is unsupervised topic modeling followed by a classification model. Up front in the machine learning pipeline, a topic modeling technique automates the identification of topics. Those topics then label a training set of responses, which is used to train a classification model. As new customer responses stream in, the trained classifier assigns each response to the appropriate category. Every so often, the topic modeling is re-run on recent responses to catch any new topics that emerge; if new topics appear, we retrain the classification model. Voila! We have auto-categorization of open-ended customer responses that goes beyond binary positive-or-negative sentiment.
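In code, the envisioned pipeline might look roughly like this. This is only a minimal sketch, assuming scikit-learn, with invented responses and a hand-picked topic count standing in for real data and tuning:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

# Historical open-ended responses (invented; a real corpus is far larger).
historical = [
    "nobody returned my calls for over a week",
    "could not get hold of anyone on the phone",
    "the underwriter kept asking for the same documents",
    "closing was fast and the rate was exactly as quoted",
    "great rate and a smooth, easy closing",
    "please respond to emails faster",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(historical)

# Unsupervised step: LDA assigns each response a dominant topic.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
pseudo_labels = lda.fit_transform(X).argmax(axis=1)

# Supervised step: the topic assignments become training labels for a
# classifier (guarded because a toy corpus can collapse into one topic).
if len(np.unique(pseudo_labels)) > 1:
    clf = LogisticRegression(max_iter=1000).fit(X, pseudo_labels)

    # New responses are categorized as they stream in.
    new = ["still waiting for the underwriter to call me back"]
    print(clf.predict(vectorizer.transform(new)))

# Periodically: re-run LDA on recent responses and retrain the classifier
# if new topics emerge.
```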
Or do we?
In #MachineLearning and #NaturalLanguageProcessing, #TopicModelling is an active area of research and there are great approaches (#LatentDirichletAllocation for the win!). The rub is that the automatically identified topics are often (usually?) not intuitive or explainable.
For example, in Figure 1, Topic 2 is defined as a ranked list of words starting with “call”, “file”, “underwriter”, “like”, “point”, …. OK, but what is that category?
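To make that concrete, here is a minimal sketch (assuming scikit-learn and toy, invented responses) of what LDA actually hands back: ranked word lists, not labels anyone can act on.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy responses, invented for illustration.
responses = [
    "called twice and nobody called me back",
    "the underwriter asked for my file three times",
    "quick and easy process, thank you",
    "the rate was higher than quoted at the start",
    "could not get hold of anyone on the phone",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(responses)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Each "topic" comes back as a ranked word list, not a named category.
terms = vectorizer.get_feature_names_out()
for idx, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"Topic {idx}: {', '.join(top)}")
```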
We imagine there's a lot of work going on to bridge unsupervised and supervised learning for topic classification. Nonetheless, we couldn’t find a workable solution that provides a one- or two-word, intuitive, actionable label for each automatically identified category (contact us if you know of one!).
So what to do?
Well, we’re not a research institution, and the stream of open-ended customer survey responses isn’t going to pause for us. This is one of those times when we make an eyes-wide-open decision to incur the tech and process debt and just get a workable solution into production (the 80% rule, don’t let the perfect be the enemy of the good, etc.).
So we cobbled together a hybrid approach that we’re reasonably satisfied with:
1. Implement standard pre-processing (i.e., tokenization, stop-word filtering, lemmatization, etc.)
2. Run simple word and n-gram frequency counts to identify high-frequency words and phrases (see the first sketch after this list)
3. Run an LDA model to identify high-saliency words
4. (Warning: beginning of the ugly part …) Manually assign the highest-frequency and highest-saliency words to explainable categories (e.g., "call", "respond", "get hold" ... = "Communication"); that is, create a bag of words for each category
5. Use the bags of words to categorize responses based on the presence of any of a category's words; a response may belong to more than one category (see the second sketch after this list)
6. Validate the categorization manually on a subset of responses; modify the categories' bags of words and repeat Steps 4 and 5 until X% of responses (we use 90%) belong to at least one category
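For Steps 1 and 2, a minimal sketch assuming NLTK; the responses are invented, and the standard NLTK data downloads are assumed to have already been run:

```python
import nltk
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.util import bigrams

# Invented responses for illustration.
responses = [
    "They never called me back about my file.",
    "Called three times and could not get hold of the underwriter.",
]

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text: str) -> list[str]:
    """Step 1: tokenize, drop stop words and punctuation, lemmatize."""
    tokens = nltk.word_tokenize(text.lower())
    return [lemmatizer.lemmatize(t) for t in tokens
            if t.isalpha() and t not in stop_words]

processed = [preprocess(r) for r in responses]

# Step 2: simple unigram and bigram frequency counts to surface
# high-frequency words and phrases.
unigram_counts = Counter(tok for toks in processed for tok in toks)
bigram_counts = Counter(bg for toks in processed for bg in bigrams(toks))

print(unigram_counts.most_common(5))
print(bigram_counts.most_common(5))
```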
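And for Steps 4 through 6, a simplified sketch in plain Python. The category bags here are invented examples of the kind we curate by hand, and the 90% figure is the coverage target from Step 6:

```python
import re

# Step 4: manually curated bags of words for explainable categories.
# These particular bags are invented examples of what we build by hand
# from the high-frequency and high-saliency terms.
CATEGORY_BAGS = {
    "Communication": {"call", "respond", "get hold", "contact"},
    "Documentation": {"file", "document", "paperwork", "underwriter"},
}

def categorize(response: str) -> set[str]:
    """Step 5: assign every category whose bag matches the response;
    a response may belong to more than one category."""
    text = response.lower()  # in practice, the Step 1 pre-processing runs first
    return {
        category
        for category, bag in CATEGORY_BAGS.items()
        if any(re.search(r"\b" + re.escape(term), text) for term in bag)
    }

def coverage(responses: list[str]) -> float:
    """Step 6: fraction of responses landing in at least one category.
    We revise the bags and repeat until this clears our 90% target."""
    if not responses:
        return 0.0
    return sum(1 for r in responses if categorize(r)) / len(responses)

sample = [
    "Nobody would call me back.",
    "The underwriter lost my file twice.",
    "Rates were fine, no complaints.",
]
print(categorize(sample[0]))       # {'Communication'}
print(round(coverage(sample), 2))  # 0.67 -- below 90%, so revise the bags
```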
This approach is inelegant, requires manual intervention and subjective judgment, and its accuracy probably depends on the data set. It is ugly. I almost can’t stand it. But it's working for us. And we're not locked in refactor paralysis; we're pressing ahead and making data-driven decisions on how to deliver an even more awesome customer experience.