Posted by: Saharsh Gaurav | 27th February 2020

Topic modelling is a method for extracting hidden topics from large volumes of raw text. Latent Dirichlet Allocation (LDA) is a widely used topic modelling algorithm with an excellent implementation in Python's Gensim package. Our main goal is to find topics that are clear, well separated and meaningful. How well this works depends heavily on the quality of the text preprocessing and on the strategy used to find the optimal number of topics.


One of the primary applications of NLP is extracting topics from large volumes of raw text. Common examples of such text include social media posts, customer reviews of movies, products and hotels, user feedback, news articles and customer complaint emails. Understanding what people are discussing, along with their problems and opinions, is highly valuable to businesses, administrators and political campaigns. It is also very difficult to read such large volumes manually and compile the topics by hand.

We will use the Gensim Python library's implementation of the Latent Dirichlet Allocation algorithm for our topic extraction.


We will need the following libraries for this task, using Python 3.6:

  1. NLTK
  2. spaCy
  3. Gensim
  4. re (part of the Python standard library)
  5. pyLDAvis
  6. Matplotlib
  7. NumPy
  8. pandas

Install the third-party libraries with pip; re ships with Python and needs no installation.

Let's import the libraries.

# Run in python console
import nltk; nltk.download('stopwords')

# Run in terminal or command prompt
# (on spaCy 3 and later the 'en' shortcut is gone; download en_core_web_sm instead)
python3 -m spacy download en

import re
import numpy as np
import pandas as pd
from pprint import pprint

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# spacy for lemmatization
import spacy

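# Plotting tools
import pyLDAvis
import pyLDAvis.gensim  # newer pyLDAvis releases rename this module to pyLDAvis.gensim_models
import matplotlib.pyplot as plt
# Jupyter notebook magic; omit when running as a plain script
%matplotlib inline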

Prepare Stopwords

from nltk.corpus import stopwords
stop_words = stopwords.words('english')

Import Dataset

Note: We are using the Amazon Mobile Dataset, which you can download from here. I have unzipped the file and renamed it to 'mobile.csv' for easier handling.

df = pd.read_csv('mobile.csv')  # the file is a CSV, so use read_csv rather than read_json

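Remove unwanted characters, numbers and symbols.

# regex=True is required on recent pandas versions, where str.replace defaults to literal matching
df['Reviews'] = df['Reviews'].str.replace("[^a-zA-Z#]", " ", regex=True)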
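To see what this pattern does to a single review, here is a minimal sketch using the standard re module. The sample sentence and the clean_review helper are made up for illustration and are not part of the dataset or the original code.

```python
import re

def clean_review(text):
    """Replace every character that is not an ASCII letter or '#' with a space."""
    return re.sub(r"[^a-zA-Z#]", " ", text)

# Hypothetical sample review, not taken from the dataset
sample = "Great phone!! Battery lasts 2 days :) #happy"
cleaned = clean_review(sample)

# Punctuation and digits become spaces, so splitting on whitespace leaves only words
print(cleaned.split())  # ['Great', 'phone', 'Battery', 'lasts', 'days', '#happy']
```

Each removed character is replaced by a space rather than deleted, so neighbouring words never get glued together.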

Remove Stopwords

def remove_stopwords(docs):
    # The NLTK stopword list is lowercase, so compare lowercased tokens
    docs_new = " ".join([i for i in docs if i.lower() not in stop_words])
    return docs_new

reviews = [remove_stopwords(docs.split()) for docs in df['Reviews']]

Change reviews to lowercase.

reviews = [revs.lower() for revs in reviews]
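NLTK's English stopword list is all lowercase, so capitalised words such as "The" are only caught if tokens are lowercased before the comparison. The sketch below shows the difference. It assumes a tiny hand-picked stopword set standing in for the NLTK list, and both function names are hypothetical.

```python
# Hypothetical stand-in for stopwords.words('english'); the real list is much longer
STOP_WORDS = {"the", "is", "a", "and", "it"}

def remove_stopwords_case_sensitive(tokens):
    """Filter tokens exactly as written; capitalised stopwords slip through."""
    return [t for t in tokens if t not in STOP_WORDS]

def remove_stopwords_case_insensitive(tokens):
    """Lowercase each token before the membership test."""
    return [t.lower() for t in tokens if t.lower() not in STOP_WORDS]

tokens = "The battery is great and It charges fast".split()
naive = remove_stopwords_case_sensitive(tokens)
fixed = remove_stopwords_case_insensitive(tokens)

print(naive)  # ['The', 'battery', 'great', 'It', 'charges', 'fast']
print(fixed)  # ['battery', 'great', 'charges', 'fast']
```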

Tokenize the reviews.

tokenized_reviews = pd.Series(reviews).apply(lambda x: x.split())

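Lemmatize the tokenized reviews.

# Load the spaCy English model; the parser and NER are not needed for POS-based filtering
# (on spaCy 3 and later, load 'en_core_web_sm' instead of 'en')
nlp = spacy.load('en', disable=['parser', 'ner'])

def lemmatize(docs, tags=('NOUN', 'ADJ')):
    lemmatized = []
    for text in docs:
        joined_doc = nlp(" ".join(text))
        # Keep only the lemmas of nouns and adjectives
        lemmatized.append([token.lemma_ for token in joined_doc if token.pos_ in tags])
    return lemmatized

lemmatized_reviews = lemmatize(tokenized_reviews)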

Building the LDA model.

dictionary = corpora.Dictionary(lemmatized_reviews)

doc_term_matrix = [dictionary.doc2bow(lemma) for lemma in lemmatized_reviews]

LDA = gensim.models.ldamodel.LdaModel

# Build LDA model
lda = LDA(corpus=doc_term_matrix, id2word=dictionary, num_topics=5, random_state=42,
          chunksize=500, passes=20)
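The doc2bow call above turns each review into a bag of words: a list of (token id, count) pairs. The pure-Python sketch below mimics that shape for illustration only. The helpers build_vocab and to_bow are hypothetical, the two documents are made up, and Gensim's real Dictionary may assign ids in a different order.

```python
from collections import Counter

def build_vocab(docs):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for the tokens found in the vocabulary."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

# Made-up lemmatized reviews
docs = [["battery", "charge", "battery"], ["screen", "battery"]]
vocab = build_vocab(docs)
bows = [to_bow(d, vocab) for d in docs]

print(vocab)  # {'battery': 0, 'charge': 1, 'screen': 2}
print(bows)   # [[(0, 2), (1, 1)], [(0, 1), (2, 1)]]
```

Word order is discarded at this point; LDA only sees how often each word occurs in each document.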

View the output of the Model.


Model Output: 

Here we can see that the second topic has the keywords 'charge', 'battery' and 'charger', so we can say that this topic is about charging the phone.

Visualize the topics from the model.

topicModel = pyLDAvis.gensim.prepare(lda, doc_term_matrix, dictionary)
pyLDAvis.display(topicModel)  # render the interactive view in a Jupyter notebook

That's all for now. See you next time. Thanks for reading.

