Topic modeling - function, techniques and application

Wherever large amounts of relevant text data appear, the question arises as to what it contains. Of course, this can be solved not only by particularly diligent readers, but also automatically. One machine learning method often used for this is topic modeling. Based on the frequency of co-occurring or related words, the topics contained in the text can be estimated. This provides important preliminary work that humans or AI systems can use to make well-founded decisions.

What is topic modeling?

A topic model is an unsupervised mathematical model that takes documents as input and outputs a set of topics that statistically represent the content of the text. Topic modeling is the process of producing this result. Today, this usually requires knowledge of Python and machine learning as well as libraries such as scikit-learn or special software. However, the technology originated back in the early 1990s in semantic methods such as Latent Semantic Indexing (LSI). At that time, the initial intention was to analyze historical newspapers and literature. Since then, the growth of digital data in conjunction with machine learning has driven a steady development that continues to this day.

Although topic modeling is a rather specialized approach, it can now be used to solve a whole range of natural language processing tasks:

  • Text classification - Depending on the modeled topics that the text contains, certain labels or categorizations can be created.
  • Summaries - The most frequent topics can also be aggregated into an overview of the relevant content.
  • Recommendations - Based on an input document, topic modeling can suggest relevant documents that contain similar text.
  • Text clustering - Groupings of documents with related content can be created using the same principle.
  • Text search - Last but not least, the quality and relevance of search functions can also be optimized.
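The recommendation and clustering ideas in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the topic shares per document are made-up numbers standing in for the output of a fitted topic model, and the names `cosine_similarity`, `recommend` and `doc_topics` are hypothetical helpers, not part of any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_a * norm_b)

def recommend(query_id, doc_topics, top_n=2):
    """Rank all other documents by topic similarity to the query document."""
    query = doc_topics[query_id]
    scored = [
        (other_id, cosine_similarity(query, vec))
        for other_id, vec in doc_topics.items()
        if other_id != query_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Hypothetical topic shares per document: [finance, sports, health]
doc_topics = {
    "report_a": [0.8, 0.1, 0.1],
    "report_b": [0.7, 0.2, 0.1],
    "match_recap": [0.1, 0.8, 0.1],
    "diet_guide": [0.1, 0.1, 0.8],
}

recommendations = recommend("report_a", doc_topics)
print(recommendations)
```

Documents whose topic shares point in a similar direction end up at the top of the ranking, regardless of whether they share exact wording.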

How does topic modeling work?

Topic modeling comprises a variety of statistical and graphical methods that extract and structure certain word combinations from text. A common basic assumption is that certain topics are more likely to be associated with similar words. These correlations can be identified in very different ways, for example through matrices, semantic analyses or vectorization, so-called word embeddings. The latter play a particularly important role in more recent methods such as Word2vec. In addition to the available technical resources, the type of text is also decisive when selecting a method. The following techniques are still very important today.
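As a rough illustration of the co-occurrence idea, the following Python sketch counts term frequencies per document and how often two words appear in the same document. The four example sentences, the tiny stop word list and the whitespace tokenizer are assumptions made for this sketch; real pipelines usually use more careful preprocessing.

```python
from collections import Counter
from itertools import combinations

# Toy corpus (invented for this example)
documents = [
    "the bank raised interest rates",
    "interest rates affect the loan market",
    "the team won the football match",
    "the football team signed a new player",
]

STOPWORDS = {"the", "a"}

def tokenize(text):
    """Lowercase, split on whitespace and drop stop words."""
    return [w for w in text.lower().split() if w not in STOPWORDS]

# Term frequencies: one Counter per document
term_counts = [Counter(tokenize(doc)) for doc in documents]

# Co-occurrence: number of documents in which two words appear together
cooccurrence = Counter()
for doc in documents:
    vocabulary = sorted(set(tokenize(doc)))
    for pair in combinations(vocabulary, 2):
        cooccurrence[pair] += 1

print(cooccurrence.most_common(3))
```

Pairs such as ("interest", "rates") or ("football", "team") occur together in more than one document, which is the kind of signal a topic model builds on.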

Latent Semantic Indexing (LSI)

In contrast to many newer methods, LSI primarily examines the semantics of words in order to identify corresponding relationships. After all, these are based not only on common usage, but also on meaning. To uncover this, a singular value decomposition (SVD) is applied to the term-frequency matrix. The semantic space, in which terms are represented by their relative distances, is thus reduced in dimension until only the most important singular vectors remain. This simplifies the calculations in the retrieval process, i.e. the measurement of vector distances, which makes a latent semantic indexing model particularly suitable for very extensive text.
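The following is a minimal sketch of this idea using NumPy, assuming a small hand-made term-document matrix. The counts, the choice of `k = 2` latent dimensions and the helper `cosine` are assumptions for illustration only; production systems typically use sparse matrices and a truncated SVD routine.

```python
import numpy as np

# Rows are terms, columns are documents (raw term frequencies, invented).
term_doc = np.array([
    [2, 1, 0, 0],  # interest
    [1, 2, 0, 0],  # rates
    [1, 1, 0, 0],  # loan
    [0, 0, 3, 1],  # football
    [0, 0, 1, 2],  # team
    [0, 0, 1, 1],  # match
], dtype=float)

# Singular value decomposition: term_doc = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(term_doc, full_matrices=False)

k = 2  # number of latent dimensions to keep (arbitrary choice for this sketch)
doc_vectors = (np.diag(S[:k]) @ Vt[:k, :]).T  # one k-dimensional row per document

def cosine(a, b):
    """Cosine similarity between two document vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_same = cosine(doc_vectors[0], doc_vectors[1])  # two finance documents
sim_diff = cosine(doc_vectors[0], doc_vectors[2])  # finance vs. football
print(round(sim_same, 3), round(sim_diff, 3))
```

In the reduced space, the two finance documents lie close together while the finance and football documents do not, even though the comparison now only involves two numbers per document.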

Latent Dirichlet Allocation (LDA)

As a Bayesian network, this method belongs to the generative probabilistic models and has been applied to documents since 2003. The nodes are to be understood as random variables, while the edges correspond to conditional dependencies. Text is treated as an unordered collection of the words it contains (a bag of words), which are assigned to latent topics. Semantics is not taken into account, only the probability distribution, which amounts to a basic question of Bayesian statistics. The number of topics to be output is set by the user or data scientist. It corresponds to the number of multinomial distributions over words, while the topic mixture of each document is drawn from a Dirichlet distribution. The output of topics is produced on the basis of these principles.

Figure: Unigram topic model for text data with LDA. LDA can be used to display unigram distributions - here for three words and four topics.
Source: Latent Dirichlet Allocation (2003)
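The generative assumption behind LDA can be sketched with NumPy as follows. Note that this only simulates how a document is assumed to be produced; it does not perform the inference step that recovers topics from real text. The vocabulary, the prior values `alpha` and `beta`, the document length and the function name `generate_document` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

vocabulary = ["rates", "loan", "bank", "team", "match", "goal"]
num_topics = 2    # set by the user, as in LDA
doc_length = 20   # words per synthetic document (assumed)
alpha = 0.5       # Dirichlet prior on per-document topic shares (assumed)
beta = 0.5        # Dirichlet prior on per-topic word shares (assumed)

# One multinomial distribution over the vocabulary per topic.
topic_word = rng.dirichlet([beta] * len(vocabulary), size=num_topics)

def generate_document():
    """Sample one bag-of-words document from the LDA generative process."""
    # Per-document topic mixture, drawn from a Dirichlet distribution.
    doc_topic = rng.dirichlet([alpha] * num_topics)
    words = []
    for _ in range(doc_length):
        topic = rng.choice(num_topics, p=doc_topic)                 # latent topic
        word_id = rng.choice(len(vocabulary), p=topic_word[topic])  # word from topic
        words.append(vocabulary[word_id])
    return doc_topic, words

doc_topic, words = generate_document()
print(doc_topic, words)
```

Fitting an LDA model runs this story in reverse: given only the words, it estimates the topic-word distributions and the per-document mixtures that most plausibly produced them.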

Non-negative Matrix Factorization (NMF)

NMF is another well-established method for topic modeling that approximates documents as a linear combination of topics, which in turn are viewed as linear combinations of words. Both are again represented as vectors. What is special, however, is that each is optimized with weights taken into account. This method also aims at dimensionality reduction, whereby the matrix used contains only non-negative entries. It is broken down into two smaller non-negative matrices: one describing the topics in terms of words and one holding the weight of each topic in each document. By interpreting these using various evaluation metrics, the aim is to find the most appropriate topic assignments for the individual documents.
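One common way to compute such a factorization is the multiplicative update rule, sketched below with NumPy. This is a simplified illustration rather than a reference implementation: the document-term counts are invented, the function name `nmf` and its parameters are hypothetical, and libraries such as scikit-learn offer more robust solvers.

```python
import numpy as np

def nmf(V, num_topics, iterations=200, seed=0, eps=1e-9):
    """Factorize non-negative V (docs x terms) into W (docs x topics) @ H (topics x terms)."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = V.shape
    W = rng.random((n_docs, num_topics)) + eps  # topic weights per document
    H = rng.random((num_topics, n_terms)) + eps  # word weights per topic
    for _ in range(iterations):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Rows are documents, columns are terms (invented counts).
V = np.array([
    [2, 1, 1, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [0, 0, 0, 3, 1, 1],
    [0, 0, 0, 1, 2, 1],
], dtype=float)

W, H = nmf(V, num_topics=2)
error = np.linalg.norm(V - W @ H)   # reconstruction error (Frobenius norm)
dominant_topic = W.argmax(axis=1)   # strongest topic per document
print(error, dominant_topic)
```

Each row of `H` can be read as a topic (which words carry weight), and each row of `W` shows how strongly a document draws on each topic.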

The evolution of topic modeling

Interestingly, topic modeling is still a relevant method even after decades, which is rather atypical in the field of natural language processing. In the paper "The Evolution of Topic Modeling" (2022), Rob Churchill and Lisa Singh analyzed the entire development in more detail. The oldest technology is LSI, and many further developments appear to be motivated by the advance of the Internet. The Hierarchical Dirichlet Process, a modification of LDA, eliminated the need to enter a fixed number of topics, making it easier to use. From 2010, Online LDA made it possible for the first time to deal appropriately with exponentially growing online data. Various specializations of topic modeling for social media followed in 2011.

Figure: Evolution of topic models (part 1)
Source: The Evolution of Topic Modeling

A decisive turning point was the introduction of Word2Vec, a particularly powerful embedding method that was implemented, for example, for word suggestions in Google search. This was followed by several attempts to combine different techniques in order to solve more complex use cases. Embeddings in conjunction with topic models stand out here in particular. The advent of the Transformer did not make topic modeling obsolete, but led to combined use - for example with BERT.

Figure: Evolution of topic models (part 2)
Source: The Evolution of Topic Modeling

The reason for the continued use of classic techniques is that innovations were primarily geared towards new, unstructured formats and use cases. The first application scenarios such as literature analysis or the processing of simple documents still exist. Successful approaches such as LDA and LSI are still comparatively easy to use and at the same time combine modern techniques with the classic virtues of semantic and matrix analysis.

Areas of application

Use cases of topic modeling

In line with the evolution described above, a differentiation of possible use cases for topic modeling has taken place. Analyzing documents with regard to the topic they contain can open up impressive opportunities in almost any industry, but some of them are particularly striking:

Research

Scientific methods such as content analysis are still very much in demand in university research, for example in communication or other social sciences. Here, topic models can be used to evaluate the media discourse on a specific topic, for example, by identifying other related topics. This approach is also useful in the medical field: Yale researchers Porturas and Taylor (2021) analyzed over 47,000 articles from 40 years of emergency medicine using topic models. This enabled them to determine that the topic of risk factors, for example, has appeared significantly more frequently over time - basic research, on the other hand, has decreased.

Customer communication

Successful companies are committed to customer relationship management, take suggestions and feedback seriously and respond promptly. However, it can be challenging to sift through and organize the flood of incoming messages. Classification is a typical use case of topic modeling: based on the topics contained, it distinguishes, for example, error reports, data changes and general questions. On this basis, the messages can be routed to and processed by the relevant departments. Another use case is the evaluation of customer surveys.
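How such routing might look once topics are available is sketched below in Python. Everything here is hypothetical: the topic names, the word weights in `TOPIC_WORDS` and the functions `score_topics` and `route` are invented for illustration, standing in for the output of a fitted topic model and a real ticketing workflow.

```python
from collections import Counter

# Hypothetical topic-word weights, as a fitted topic model might produce them.
TOPIC_WORDS = {
    "error_report": {"error": 0.4, "crash": 0.3, "bug": 0.2, "login": 0.1},
    "data_change": {"address": 0.4, "update": 0.3, "name": 0.2, "email": 0.1},
    "general_question": {"price": 0.4, "offer": 0.3, "hours": 0.2, "contact": 0.1},
}

def score_topics(message):
    """Sum the topic weights of all words that occur in the message."""
    cleaned = message.lower().replace(",", " ").replace(".", " ")
    counts = Counter(cleaned.split())
    return {
        topic: sum(weight * counts[word] for word, weight in words.items())
        for topic, words in TOPIC_WORDS.items()
    }

def route(message, fallback="manual_review"):
    """Return the best-matching topic, or a fallback if no topic word occurs."""
    scores = score_topics(message)
    best_topic = max(scores, key=scores.get)
    return best_topic if scores[best_topic] > 0 else fallback

print(route("The app shows an error after login and then a crash."))
print(route("Please update my address"))
print(route("Hello there"))
```

Messages without any recognizable topic word fall back to manual review, which keeps the sketch from silently misrouting unclear requests.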

Business intelligence

In many other business areas, text also has a special value as a data format - in transactional and narrative documents, including reports, presentations, contracts and offers. Every frequently recurring topic and related words can have a certain trend function that is relevant for forecasts and business decisions. In this sense, topic models form an important basis for data analysis by identifying and classifying relevant text corpora. For detailed investigations, however, the extended use of artificial intelligence is indispensable today given the high process complexity and the mixing of many unstructured formats.

Advanced AI approaches

In order to be able to process text fully and automatically, further steps are required in the process chain before and after the possible use of topic models. This primarily concerns the generation and further processing of data. Konfuzio is the name of the AI-based document software that provides a remedy here.

Optical character recognition (OCR)

The digitization of previously analogue processes means that text is often available only in image-based formats such as scanned PDFs. In this form, the words it contains cannot be captured by a topic model. Konfuzio uses high-precision OCR to convert the content of documents into machine-readable formats. A web-based interface then passes the data on to the desired location for further processing - for example, a development environment for topic modeling.

Natural Language Processing (NLP)

If you want to process identified topics in detail, you can no longer avoid advanced NLP and modern models based on the Transformer architecture. These allow Konfuzio to index and understand even complex content, which in turn enables sophisticated analyses with a high level of data control within a multicloud infrastructure. In many cases, the use of topic models can in principle also be completely replaced by Konfuzio's NLP approaches. For individual extraction pipelines based on Python, a Software Development Kit is also available.

Document chat

Users often also contribute their own ideas and search for suitable answers in their documents. An integrated chat interface enables user queries within familiar working environments (e.g. Office), which are answered by a language model based on all uploaded information. This even reveals implicit connections that are not based on explicit word combinations. In this way, topics are picked up that neither the person nor the topic model previously knew exactly how to name.

Conclusion

Topic modeling is still a relevant method of machine learning today because automated topic output can solve a variety of language processing tasks. Since in most cases only semantics or the co-occurrence of words is taken into account, it is a comparatively simple and effective technique. Classic forms such as Latent Semantic Indexing (LSI) are still relevant today for suitable use cases. Due to advancing digitalization and the increase in online formats, various further developments have taken place. Combinations with modern language models are now also possible. However, topic models quickly reach their limits in complex business environments. Here, it is advisable to rely on more powerful AI software as a supplement or alternative.

Do you deal with the processing of extensive text data? Write us a message directly. Our experts will be happy to show you what possibilities artificial intelligence has in store for you.

Tim Filzinger
