The Power of BERT: NLP Topic Modelling and Analyzing Podcast Transcripts

Richard Gao
Sep 6, 2022


Recently I was scrolling through many podcast episodes and I had the idea to run topic modelling on podcast transcripts to help me determine if there were topics interesting enough for me to listen to the episode. This article is how I did that and what the results looked like.

BERT: Bidirectional Encoder Representations from Transformers
BERT's architecture is built on self-attention mechanisms. If you want to learn more about BERT, I recommend reading this book here:

BERT outputs word embeddings, which can be used for a variety of tasks such as text summarization and topic modelling. Unlike traditional topic modelling / NLP techniques, BERT doesn't require preprocessing steps such as stemming, lemmatization, or stop-word removal. In this case I am going to use BERT's topic-modelling abilities.
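Because no stemming or stop-word removal is needed, the only preprocessing left is splitting the raw transcript into sentence-sized documents. A minimal sketch (the splitting rule here is my own assumption, not the exact one used in this project):

```python
import re

def transcript_to_sentences(text: str) -> list[str]:
    # No stemming, lemmatizing, or stop-word removal is needed for BERT;
    # we only split the raw transcript into sentence-sized "documents"
    # on sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if s]

# Each sentence becomes one document for the topic model.
docs = transcript_to_sentences("Meaning is hard. What is wisdom? Let's talk!")
```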

If you wish to use BERT for topic modelling yourself, the BERTopic package is available here: https://pypi.org/project/bertopic/
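A sketch of how the BERTopic API is typically used on sentence documents (class and method names are from the library; the heavy import and model download happen only when the function is called):

```python
def model_topics(docs: list[str]):
    # Requires: pip install bertopic (pulls in sentence-transformers).
    from bertopic import BERTopic
    topic_model = BERTopic(language="english")
    # fit_transform returns one topic id per document, plus probabilities.
    topics, probs = topic_model.fit_transform(docs)
    return topic_model, topics

def describe_topic(top_words: list[str]) -> str:
    # BERTopic labels each topic by joining its top c-TF-IDF words with
    # underscores, which is why labels look like "meaning_life_love".
    return "_".join(top_words)
```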

The Podcast Episode:

I picked the most recent podcast episode here. Honestly, the title alone is a good enough description, but let's see if BERT can pick out the topics mentioned in the title (and perhaps other side conversations). I used Selenium to scrape the transcript from the site; if you want to learn how to web scrape, see: https://www.scrapingbee.com/blog/selenium-python/
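The scraping step can be sketched roughly as follows. The CSS selector is a hypothetical placeholder (it will differ per site), and a ChromeDriver must be installed for the Selenium part to run:

```python
def clean_transcript(raw: str) -> str:
    # Collapse the irregular whitespace that scraped HTML tends to contain.
    return " ".join(raw.split())

def scrape_transcript(url: str) -> str:
    # Hypothetical scraper; ".transcript" is an assumed selector.
    # Requires: pip install selenium, plus a matching ChromeDriver.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        element = driver.find_element(By.CSS_SELECTOR, ".transcript")
        return clean_transcript(element.text)
    finally:
        driver.quit()
```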

There are many articles on how to use BERT; here's one that may help if you want to get started with topic modelling with BERT: https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6

Evidently the BERT model picks up relevant topics from the transcript, very similar to the title of the podcast (albeit naming them slightly cryptically):

  1. Meaning_life_love
  2. religion_religious_theism_religions
  3. wisdom_rationality_puzzle_solving
  4. myths_patterns_stories_mythos
  5. sin_immoral_evil_immortality

Other interesting subjects:

  1. consciousness_unconcious_do_concious
  2. cognition_congitive_distributed_science
  3. flow_state_induction_need
  4. bullshit_deception_truth
  5. illusion_reality_we_math
  6. data_neural_networks_overfitting
  7. Death_mortality_problematic_die
  8. Video_games_world_game

It also happens to pick up the times when the two speakers agree or disagree, under the category no_yes_yeah_very. This is interesting, as it indicates that during these time frames the two speakers shared similar (or dissimilar) views on whatever subject they were discussing, or simply had a misunderstanding followed by clarification:
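Since BERTopic assigns one topic id per document, recovering the transcript segments behind a label like no_yes_yeah_very is just a matter of filtering the parallel lists. A small sketch (the function name is mine):

```python
def docs_for_topic(docs: list[str], topics: list[int], topic_id: int) -> list[str]:
    # topics[i] is the topic id BERTopic assigned to docs[i], so zipping the
    # two lists recovers which transcript sentences fall under a given topic.
    return [doc for doc, t in zip(docs, topics) if t == topic_id]
```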

Another interesting topic is also picked up: shampoo

I used the Streamlit package to create these visuals. You can try it out here: https://docs.streamlit.io/library/get-started
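A hypothetical Streamlit page along these lines could look as follows; BERTopic's `visualize_topics` and `visualize_barchart` return Plotly figures that Streamlit can embed directly (the page title and layout are my assumptions):

```python
def render_topic_app(topic_model):
    # Hypothetical Streamlit page; requires: pip install streamlit,
    # then `streamlit run app.py` to serve it.
    import streamlit as st
    st.title("Podcast topic explorer")
    # BERTopic ships Plotly figures that st.plotly_chart can render.
    st.plotly_chart(topic_model.visualize_topics())
    st.plotly_chart(topic_model.visualize_barchart())
```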

Conclusion: Overall, BERT is really good at identifying relevant topics. At the same time it does generate some garbage topics, which can likely be solved by tuning hyperparameters. Training took less than 2 minutes on roughly 1000 sentences, and the whole project was up and running in less than 3 hours. However, BERT has some issues when creating the topic names: they often come out awkward and reordered. Why this happens, I'm not sure; maybe you can explain if you know!
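For the garbage-topic problem, two BERTopic knobs are worth trying (the specific values below are assumptions, not tuned results): a larger `min_topic_size` merges tiny clusters, and `nr_topics="auto"` lets the library merge similar topics after fitting.

```python
def tuned_model():
    # Requires: pip install bertopic. Assumed starting values:
    # min_topic_size merges clusters smaller than 15 documents, and
    # nr_topics="auto" reduces the topic count by merging similar topics.
    from bertopic import BERTopic
    return BERTopic(min_topic_size=15, nr_topics="auto")
```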



Written by Richard Gao

Computer Science and Data Enthusiast | Linkedin: https://www.linkedin.com/in/richard-gao-csecon/ | Shovelling data into the AI engine
