The Power of BERT: NLP Topic Modelling and Analyzing Podcast Transcripts
Recently, while scrolling through podcast episodes, I had the idea to run topic modelling on podcast transcripts to help me decide whether an episode covered topics interesting enough to listen to. This article covers how I did that and what the results looked like.
BERT: Bidirectional Encoder Representations from Transformers
BERT's architecture is built on self-attention mechanisms. If you want to learn more about BERT, I recommend reading this book:
BERT outputs word embeddings, which can be used for a variety of tasks such as text summarization and topic modelling. Unlike traditional topic modelling / NLP techniques, BERT doesn't require preprocessing steps such as stemming, lemmatizing, or stop-word removal. In this case I am going to use BERT's topic modelling abilities through the BERTopic library.
If you wish to try it yourself, you can find BERTopic here: https://pypi.org/project/bertopic/
The Podcast Episode:
I picked the most recent podcast episode here. Honestly, the title alone is a good enough description, but let's see whether BERT can pick out the topics mentioned in the title (and perhaps other side conversations). I used Selenium to scrape the transcript from this site; if you want to learn how to web scrape, see: https://www.scrapingbee.com/blog/selenium-python/
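The scraping step can be sketched roughly as below. The URL and CSS selector are placeholders (the actual transcript page will use different markup), and the sentence splitter is a naive regex I use here for illustration; Selenium's imports are kept inside the function so the splitter can be reused on its own.

```python
import re


def split_sentences(text):
    """Naively split a transcript into sentences to feed BERTopic one per document."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def scrape_transcript(url, selector="div.transcript"):
    """Load a page with Selenium and return its transcript as a list of sentences.

    `selector` is a hypothetical placeholder; inspect the real page to find
    the element that actually holds the transcript text.
    """
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()  # assumes chromedriver is on PATH
    try:
        driver.get(url)
        text = driver.find_element(By.CSS_SELECTOR, selector).text
    finally:
        driver.quit()
    return split_sentences(text)
```

For the episode I used, this kind of scrape-and-split yielded roughly 1000 sentences to model.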
There are many articles on how to use BERT; here's one that may help if you want to get started with topic modelling: https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6
Evidently the BERT model picks up relevant topics from the transcript, very similar to the title of the podcast (albeit with slightly cryptic names):
- Meaning_life_love
- religion_religious_theism_religions
- wisdom_rationality_puzzle_solving
- myths_patterns_stories_mythos
- sin_immoral_evil_immortality
Other interesting subjects:
- consciousness_unconcious_do_concious
- cognition_congitive_distributed_science
- flow_state_induction_need
- bullshit_deception_truth
- illusion_reality_we_math
- data_neural_networks_overfitting
- Death_mortality_problematic_die
- Video_games_world_game
It also picks up moments when the two speakers agree or disagree, under the category no_yes_yeah_very. This is interesting, as it indicates that during these time frames the two speakers share similar or dissimilar views (or simply a misunderstanding followed by a clarification) on whatever subject they were discussing:
Another interesting topic is also picked up: shampoo.
To create these visuals I used the Streamlit package. You can try it out here: https://docs.streamlit.io/library/get-started
Conclusion: Overall, BERT is really good at identifying relevant topics. At the same time, it does generate some garbage topics, which can likely be addressed by tuning hyperparameters. Training took less than 2 minutes on roughly 1000 sentences, and the whole project was up and running in less than 3 hours. However, BERT has some issues when creating topic names: they often come out awkward and reordered. If you know why this happens, I'd love to hear your explanation!