Topic modelling is a method for extracting hidden topics from large volumes of raw text. Latent Dirichlet Allocation (LDA) is a popular algorithm for topic modelling, with an excellent implementation in Python's Gensim package. Our goal is to find topics that are clear, well separated and meaningful. The result depends heavily on the quality of the text preprocessing and on the strategy used to find the ideal number of topics.
Introduction
One of the primary applications of NLP is extracting topics from large volumes of raw text. Common examples of such text are social media posts, customer reviews of movies, products and hotels, user feedback, news articles and customer complaint emails. Understanding what people are discussing, along with their issues and sentiments, is extremely valuable to businesses, administrators and political campaigns. At the same time, it is very hard to manually read through such enormous volumes and summarize the topics.
We will use the Gensim Python library's Latent Dirichlet Allocation implementation for our topic extraction.
Prerequisites
We will need the following libraries for this task, with Python 3.6: gensim, spacy, nltk, pyLDAvis, pandas, numpy and matplotlib.
Install these libraries with pip.
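The original does not show the install commands; assuming the packages imported below, the pip step might look like this (versions not pinned, adjust as needed):

```shell
# Install the libraries used in this tutorial
pip install gensim spacy nltk pyLDAvis pandas numpy matplotlib
```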
Let's import the libraries.
# Run in python console
import nltk; nltk.download('stopwords')

# Run in terminal or command prompt
# python3 -m spacy download en

import re
import numpy as np
import pandas as pd
from pprint import pprint

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# spacy for lemmatization
import spacy

# Plotting tools
import pyLDAvis
import pyLDAvis.gensim
import matplotlib.pyplot as plt
%matplotlib inline
Prepare Stopwords
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
Import DataSet
Note: we are using the Amazon Mobile Dataset, which you can download from here. I have unzipped and renamed the file to 'mobile.csv' for easier handling.
df = pd.read_csv('mobile.csv')
df.head()
Remove unwanted characters, numbers and symbols.
df['Reviews'] = df['Reviews'].str.replace("[^a-zA-Z#]", " ", regex=True)
Remove Stopwords
def remove_stopwords(docs):
    docs_new = " ".join([i for i in docs if i not in stop_words])
    return docs_new
reviews = [remove_stopwords(docs.split()) for docs in df['Reviews']]
Change reviews to lowercase.
reviews = [revs.lower() for revs in reviews]
Tokenize the reviews.
tokenized_reviews = pd.Series(reviews).apply(lambda x: x.split())
Lemmatize the tokenized reviews.
# Load the spaCy English model downloaded earlier (parser and NER disabled for speed)
nlp = spacy.load('en', disable=['parser', 'ner'])

def lemmatize(docs, tags=['NOUN', 'ADJ']):
    # Keep only the lemmas of nouns and adjectives
    lemmatized = []
    for text in docs:
        joined_doc = nlp(" ".join(text))
        lemmatized.append([token.lemma_ for token in joined_doc if token.pos_ in tags])
    return lemmatized

lemmatized_reviews = lemmatize(tokenized_reviews)
Building the LDA model.
dictionary = corpora.Dictionary(lemmatized_reviews)
doc_term_matrix = [dictionary.doc2bow(lemma) for lemma in lemmatized_reviews]

LDA = gensim.models.ldamodel.LdaModel

# Build LDA model
lda = LDA(corpus=doc_term_matrix,
          id2word=dictionary,
          num_topics=5,
          random_state=42,
          chunksize=500,
          passes=20)
View the output of the Model.
print(lda.print_topics())
Model Output:
Here we can see that the 2nd topic has the keywords 'charge', 'battery' and 'charger', so we can say that this topic is about mobile charging.
Visualize the Topics from the Model.
pyLDAvis.enable_notebook()
topicModel = pyLDAvis.gensim.prepare(lda, doc_term_matrix, dictionary)
topicModel
That's all for now. See you in the next one. Thanks for reading.