Topic models

  1. Introduction

AI is one of the hottest topics at the moment, thanks to recent technological breakthroughs and promising results in several industries.

In retail and e-commerce, AI can be used to adjust the price of goods dynamically based on supply and demand (dynamic pricing), to build a recommendation system, or even to run a chatbot that answers customer requests.

In finance, it can be used to define the optimal portfolio for an investor, reducing risk or increasing potential gains (portfolio management), and to detect fraud. In healthcare, it could be used to analyse pathologies based on medical imaging.


The following study by McKinsey shows the impact of AI adoption across different sectors.


  2. NLP: the next big thing?

Among all the possible applications of AI, NLP is one of the most widely used to extract knowledge from data and put it to work towards a specific goal. NLP stands for natural language processing and refers to all the techniques that enable a computer to understand and process human language.

Every company is a legal entity made up of an association of people, be they natural, legal, or a mixture of both, for carrying on a commercial or industrial enterprise. Because human beings are social, the people within a company interact constantly, and these interactions often carry knowledge. AI can use this knowledge to optimize current processes, detect flaws, or automate a subset of repetitive tasks.

Hence, in an era of big data where storage is cheap, companies try to keep as much data as they can in order to leverage it with AI and gain a competitive advantage in their industry.


Now enough conceptual/philosophical talk… Let’s see one real example of NLP and how it can be used to create business value in different fields!

  3. Topic models:

Have you ever been in a situation where you have huge amounts of textual data and want to extract the main topics or main concepts from it? Topic modeling allows you to achieve this goal.

In a nutshell, it is a discovery method that allows you to get a glimpse of the most important topics within a collection of documents.

Many techniques can be used for this purpose; the most popular ones are LSA, pLSA, LDA and LDA2VEC.

In this blog post we will focus on LDA (latent Dirichlet allocation) to understand the underlying concepts behind such methods.


Imagine that we have N documents and that the size of our vocabulary, i.e. the number of unique terms within our collection of documents, is m. It is then possible to represent all the documents by a matrix of dimension N*m (figure 1), for example with entry (i, j) counting how often term j appears in document i.


Figure 1: matrix representation of a collection of documents
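
To make this representation concrete, here is a minimal sketch that builds such an N*m count matrix in plain Python. The three toy documents, the whitespace tokenization and the function name `build_doc_term_matrix` are assumptions made for illustration; they are not part of the original study.

```python
from collections import Counter

def build_doc_term_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] counts term j in document i."""
    tokenized = [doc.lower().split() for doc in docs]
    # The vocabulary is the sorted set of unique terms in the collection.
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {term: j for j, term in enumerate(vocab)}
    matrix = []
    for toks in tokenized:
        row = [0] * len(vocab)
        for term, count in Counter(toks).items():
            row[index[term]] = count
        matrix.append(row)
    return vocab, matrix

# Toy collection (invented for this example).
docs = [
    "the game was a great game",
    "the team won the game",
    "the market fell",
]
vocab, matrix = build_doc_term_matrix(docs)
print(len(matrix), "documents x", len(vocab), "terms")
for row in matrix:
    print(row)
```

Each row is one document and each column one vocabulary term, so the matrix has N rows and m columns.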


The idea is to decompose this matrix, using linear algebra, into 2 smaller matrices of dimensions N*T and T*m, where T is the number of topics within the collection. By doing so, it is possible to associate a set of predominant topics with each document and a set of predominant keywords with each topic. Strictly speaking, this is a simplified view of how probabilistic latent semantic analysis (pLSA) works; LDA follows the same idea but additionally assumes that the topic distribution of each document has a sparse Dirichlet prior.


Figure 2: Matrix Factorization – interpretation of LDA and pLSA
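
The sketch below illustrates this factorization view with a toy LDA fitted by collapsed Gibbs sampling. This is an assumption on my side: it is one common way to fit LDA, chosen because it fits in a few lines, and not necessarily the algorithm used for the results in this post. The function name `lda_gibbs`, the hyperparameter values and the toy corpus are all hypothetical. A small `alpha` (below 1) plays the role of the sparse Dirichlet prior on the topics of each document.

```python
import random

def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA. docs is a list of lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # N x T counts
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # T x m counts
    topic_total = [0] * num_topics
    assignments = []
    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Unnormalized conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    # Smoothed estimates: theta is N x T, phi is T x m, and every row sums to 1.
    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in row]
        for row, doc in zip(doc_topic, docs)
    ]
    phi = [
        [(c + beta) / (topic_total[t] + vocab_size * beta) for c in row]
        for t, row in enumerate(topic_word)
    ]
    return theta, phi

# Toy corpus (invented): two sport-like and two finance-like documents.
vocab = ["game", "team", "score", "market", "stock", "price"]
docs = [[0, 1, 2, 0, 1], [1, 2, 0, 2], [3, 4, 5, 3], [4, 5, 3, 5, 4]]
theta, phi = lda_gibbs(docs, num_topics=2, vocab_size=len(vocab))
for row in theta:
    print([round(p, 2) for p in row])
```

Here `theta` is the N*T document-topic matrix and `phi` the T*m topic-word matrix, which are the two smaller matrices of the factorization.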


Let’s apply this method to two open-source message collections: the 20newsgroup collection and the Enron email dataset.


The 20newsgroup dataset consists of around 18000 newsgroup posts on 20 topics, grouped into 6 subject matters (Table 1). It is split into two subsets: one for training (or development) and one for testing (or performance evaluation). One of the main advantages of this dataset is that every post is mapped to a predefined class or label, which is important for assessing the predictive power of NLP models.


Table 1: classes of the 20newsgroup dataset


The Enron dataset contains data from about 150 users, mostly senior management of Enron, organized into folders.

This dataset is related to the Enron Corporation –  an American energy, commodities, and services company based in Houston, Texas –  that went bankrupt in December 2001 due to fraudulent business practices.

The corpus contains a total of about 0.5M messages. The large number of emails in this dataset makes it a good opportunity to assess how algorithms perform at an industrial scale. However, no class labels are available for this dataset. This has no impact on exploratory approaches such as topic models, but it makes the dataset hard to use for predictive purposes. You can find a processed version of the Enron dataset in CSV format at this link (I couldn’t find one easily on the web).

In addition, an initiative by researchers at SIMS, UC Berkeley provides a subset of 1700 email messages labelled with topics and sentiment.



  1. The first step is to define the fields that we will use. In our case these are the subject and the body of the email.

  2. The next step is to clean and process this data by:

    1. removing stop words (words like “the” that do not carry a meaningful concept).

    2. making n-grams (bigrams and trigrams in our case, which means that expressions like “San Francisco” or “Golden State Warriors” will be considered as one entity).

    3. applying lemmatization and creating our dictionary.

  3. After that, we can apply the LDA topic model with the necessary parameters, such as the number of topics.
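
The cleaning steps above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the pipeline used for the results below: the stop-word list is a tiny hand-written one, bigrams are detected with a simple frequency threshold (`min_count` is a hypothetical parameter), the two emails are invented, and lemmatization is left out because it needs a linguistic resource (an NLP library would normally provide it).

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "for", "on", "from"}

def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def merge_bigrams(token_lists, min_count=2):
    """Join adjacent word pairs seen at least min_count times into one token."""
    pair_counts = Counter(
        pair for toks in token_lists for pair in zip(toks, toks[1:])
    )
    frequent = {pair for pair, n in pair_counts.items() if n >= min_count}
    merged_lists = []
    for toks in token_lists:
        merged, i = [], 0
        while i < len(toks):
            if i + 1 < len(toks) and (toks[i], toks[i + 1]) in frequent:
                merged.append(toks[i] + "_" + toks[i + 1])
                i += 2
            else:
                merged.append(toks[i])
                i += 1
        merged_lists.append(merged)
    return merged_lists

def build_dictionary(token_lists):
    """Map each distinct token to an integer id."""
    vocab = sorted({t for toks in token_lists for t in toks})
    return {term: idx for idx, term in enumerate(vocab)}

# Step 1: the fields we use are the subject and the body (invented emails).
emails = [
    {"subject": "Meeting in San Francisco", "body": "The meeting is on Monday."},
    {"subject": "Trip report", "body": "Notes from the San Francisco trip."},
]
texts = [e["subject"] + " " + e["body"] for e in emails]
# Step 2: stop-word removal, bigrams, dictionary.
tokens = merge_bigrams([tokenize(t) for t in texts])
dictionary = build_dictionary(tokens)
corpus = [[dictionary[t] for t in toks] for toks in tokens]
print(tokens)
```

Calling `merge_bigrams` a second time on its own output would let a frequent bigram absorb a third word, which is one simple way to obtain trigrams. The resulting `corpus` of word ids is the kind of input a topic model expects in step 3.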



For the 20newsgroup dataset, 2 models were created, one with 20 topics and another with 6 topics, to see to what extent this method can accurately extract the predominant ideas within a collection in an unsupervised fashion. The results are compared with the labelled classes/topics that we already have.


The 6-topic model managed to extract 5 of the 6 labelled topics: sport (rec), computer (comp), politics (talk.politics), science (sci) and religion (.religion). It failed to extract the remaining topic and instead created a topic made of short terms such as ‘ax’, ‘eof’, ‘ff’, ‘ck’, ‘eg’, ‘cx’, ‘mq’, etc.

The figure below shows the most predominant words in topic 5, which corresponds to the label ‘religion’. You can use the following link to explore the other topics that were extracted by this model.

The left part shows the topics and their prevalence in the collection of documents. Topics are plotted as circles; the circles with the largest area correspond to the most prevalent topics.

The right part shows the most predominant words within the selected topic.

It is also possible to focus on the terms that are most specific to the selected topic using their “lift” (the ratio of a term's probability within the topic to its probability across the whole collection): sliding the bar towards λ = 0 puts more emphasis on terms that appear solely or mainly in this topic.
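
To see what the λ slider does, here is a small sketch of the relevance score as defined for LDAvis-style visualizations: λ times the log probability of the term within the topic, plus (1 - λ) times the log of its lift. I am assuming the figure was produced with a tool of this kind, since the post does not name it. The two terms and their probabilities are hypothetical values chosen for illustration.

```python
import math

def relevance(p_word_given_topic, p_word, lam):
    """Relevance of a term to a topic for a weight lam in [0, 1]."""
    lift = p_word_given_topic / p_word
    return lam * math.log(p_word_given_topic) + (1 - lam) * math.log(lift)

# Hypothetical values: (p(word | topic), p(word) in the whole collection).
terms = {
    "people": (0.030, 0.0250),  # frequent in the topic, but frequent everywhere
    "church": (0.010, 0.0012),  # rarer, but concentrated in this topic
}

def rank(lam):
    """Order the terms from most to least relevant for a given lam."""
    return sorted(terms, key=lambda w: relevance(*terms[w], lam), reverse=True)

print(rank(1.0))  # ranking by probability within the topic
print(rank(0.0))  # ranking by lift only
```

With λ = 1 the generic but frequent term comes first; with λ = 0 the term that is concentrated in the topic takes the lead.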


Topic 5: corresponding to the label religion:


The 20-topic model provides more refined clusters with clear subject matter, as shown in the figures below.


Topic 5: corresponding to the label talk.politics.guns:


Topic 14: corresponding to the


You can use the following link to explore the other topics generated by this model.


For the Enron dataset, a 20-topic model was created. It recovered some of the topics that the SIMS, UC Berkeley researchers had defined manually on a subsample of 1700 emails. For instance, topics 3 and 5 could be associated with the class “3.6 california energy crisis / california politics”, topic 6 with “3.8 internal company operations” and in particular financial operations (invoice, credit), topic 9 with “3.10 legal advice”, topics 15 and 19 with “1.4 Logistic Arrangements (meeting scheduling, technical support, etc)”, topic 16 with “3.13 trip reports”, and topics 7-12 with “2.2 Forwarded email(s) including replies”.

Note that once the model is trained, it is possible to assign new, unseen emails to one of the 20 topics it created.
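
As a rough sketch of how a new email can be assigned to a topic, the snippet below keeps the trained topic-word probabilities fixed and estimates the topic mixture of the new document with a few EM "fold-in" iterations. This is a simplification of what an LDA library does at inference time, and everything in it is hypothetical: the function name `infer_topics`, the 4-word vocabulary, the 2-topic matrix `phi` and the sample email. It also assumes the document is non-empty and only contains known words.

```python
def infer_topics(word_ids, phi, iters=50):
    """Estimate the topic mixture of a new document with phi held fixed (EM fold-in)."""
    num_topics = len(phi)
    theta = [1.0 / num_topics] * num_topics
    for _ in range(iters):
        new_theta = [0.0] * num_topics
        for w in word_ids:
            # Share of responsibility of each topic for this word.
            joint = [theta[t] * phi[t][w] for t in range(num_topics)]
            total = sum(joint)
            for t in range(num_topics):
                new_theta[t] += joint[t] / total
        theta = [v / len(word_ids) for v in new_theta]
    return theta

# Hypothetical trained topic-word probabilities for a 4-word vocabulary.
vocab = {"invoice": 0, "credit": 1, "meeting": 2, "schedule": 3}
phi = [
    [0.45, 0.45, 0.05, 0.05],  # topic 0: finance-like
    [0.05, 0.05, 0.45, 0.45],  # topic 1: scheduling-like
]
new_email = ["invoice", "credit", "invoice", "meeting"]
theta = infer_topics([vocab[w] for w in new_email], phi)
best_topic = max(range(len(theta)), key=theta.__getitem__)
print(best_topic, [round(p, 3) for p in theta])
```

The email is assigned to the topic with the largest weight in `theta`, while the full mixture remains available when a document touches several topics.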


Topic 6: corresponding to credit/invoice


You can use the following link to explore the other topics generated by this model.


  4. Business value:

From a business perspective, such algorithms can be used to classify emails or requests in order to redirect them to the relevant agents. Ultimately, this process could reduce the number of agents needed and hence decrease costs. Additionally, an auto-reply algorithm could be used for classes whose requests are always handled with the same procedure.
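
A minimal sketch of such a routing rule could look like this. The topic ids, team names, the set of auto-reply topics and the confidence threshold are all invented for illustration; a real system would derive them from the topics actually found and from business rules.

```python
# Hypothetical mapping from topic ids to teams, and topics eligible for auto-reply.
TOPIC_TO_TEAM = {0: "finance", 1: "scheduling"}
AUTO_REPLY_TOPICS = {1}

def route(theta, min_confidence=0.6):
    """Return (team, auto_reply) for a topic mixture, or a manual queue when unsure."""
    best = max(range(len(theta)), key=theta.__getitem__)
    if theta[best] < min_confidence or best not in TOPIC_TO_TEAM:
        return "manual_triage", False
    return TOPIC_TO_TEAM[best], best in AUTO_REPLY_TOPICS

print(route([0.81, 0.19]))  # clear finance request
print(route([0.15, 0.85]))  # clear scheduling request, eligible for auto-reply
print(route([0.55, 0.45]))  # ambiguous, left to a human
```

Keeping a manual queue for low-confidence cases is a design choice that limits the cost of misrouted emails.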

This approach could also be used to extract the main topics of a collection of articles on LinkedIn in order to explore new trends, or be applied to text scraped from the web on a particular field of interest such as “blockchain”.