Just KNIME It, Season 2 / Challenge 20 reference
Challenge question
Challenge 20: Topics in Hotel Reviews
Level: Hard
Description: You work for a travel agency and want to better understand how hotels are reviewed online. What topics are common in the reviews as a whole, and what terms are most relevant in each topic? How about when you separate the reviews per rating? A colleague has already crawled and preprocessed the reviews for you, so your job now is to identify relevant topics in the reviews, and explore their key terms. What do the reviews uncover? Hint: Topic Extraction can be very helpful in tackling this challenge. Hint 2: Coherence and perplexity are metrics that can help you pick a meaningful number of topics.
Author: Aline Bessa
Here are some of the key concepts involved in this challenge.
LDA
In text mining and topic modeling, LDA refers to Latent Dirichlet Allocation (not to be confused with Linear Discriminant Analysis, which shares the abbreviation). It is a generative statistical model commonly used in natural language processing and machine learning.
LDA is used to discover the topics present in a collection of documents without requiring labeled data. It estimates two sets of distributions, a topic distribution for each document and a word distribution for each topic, and is formulated as a probabilistic graphical model.
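As a rough illustration of how these two sets of distributions can be estimated, here is a minimal sketch of collapsed Gibbs sampling, one common inference method for LDA. It is not necessarily what KNIME's Topic Extractor node does internally. The function name fit_lda_gibbs, the toy reviews, and the hyperparameter values alpha and beta are all assumptions made up for this example.

```python
import random

def fit_lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Estimate per-document topic and per-topic word distributions.

    docs is a list of token lists (already preprocessed).
    alpha and beta are Dirichlet smoothing values chosen for illustration.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_vocab = len(vocab)

    # Count tables updated during sampling.
    doc_topic = [[0] * n_topics for _ in docs]            # topic counts per document
    topic_word = [[0] * n_vocab for _ in range(n_topics)]  # word counts per topic
    topic_total = [0] * n_topics                           # total words per topic

    # Start from a random topic assignment for every word occurrence.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Resample the topic given all other assignments.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_vocab * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed estimates: theta is topics per document, phi is words per topic.
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    phi = [
        [(c + beta) / (topic_total[k] + n_vocab * beta) for c in topic_word[k]]
        for k in range(n_topics)
    ]
    return vocab, theta, phi

# Hypothetical preprocessed reviews, invented for this sketch.
reviews = [
    ["room", "clean", "bed", "comfortable", "room"],
    ["breakfast", "coffee", "buffet", "breakfast"],
    ["bed", "clean", "room", "quiet"],
    ["coffee", "buffet", "fresh", "breakfast"],
]
vocab, theta, phi = fit_lda_gibbs(reviews, n_topics=2)

# Show the three highest-probability terms of each topic.
for k, dist in enumerate(phi):
    top = sorted(zip(dist, vocab), reverse=True)[:3]
    print(k, [term for _, term in top])
```

Each row of theta is one review's mixture over topics, and each row of phi is one topic's distribution over the vocabulary. On a corpus this small the result depends heavily on the random seed, so the printed terms should be read as illustrative only.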
The basic idea of LDA is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. It assumes that the order of the words in a document does not matter (the bag-of-words assumption).
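The generative idea can be sketched in a few lines of Python. This is only an illustration under assumed inputs: the vocabulary, the two topic-word distributions, and the Dirichlet parameter alpha are invented for the example, and sample_dirichlet and generate_document are hypothetical helper names.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, alpha, n_words, rng):
    """Generate one bag-of-words document following the LDA generative story."""
    # 1. Draw this document's mixture over topics.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word from the document's mixture.
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        # 3. Pick a word from that topic's distribution over the vocabulary.
        word = rng.choices(vocab, weights=topic_word[topic])[0]
        words.append(word)
    return theta, words

# Assumed toy vocabulary and two hand-written topics (rooms vs. breakfast).
vocab = ["room", "bed", "clean", "breakfast", "coffee", "buffet"]
topic_word = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],
]

rng = random.Random(0)
theta, words = generate_document(
    topic_word, vocab, alpha=[0.5, 0.5], n_words=8, rng=rng
)
print(theta)
print(words)
```

Because each word is drawn independently given the document's topic mixture, shuffling the generated words would not change their probability, which is exactly the bag-of-words assumption. Fitting LDA amounts to running this story in reverse: inferring the mixtures and topic distributions from observed documents.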
LDA can be used in tasks such as automated tagging, content recommendation, and text classification, among others.