In Part 1 of analysing the „Star Trek: The Next Generation“ transcripts, I performed some statistical analysis of the characters and episodes: who has the most lines, who appears in which episodes, etc.
In my new IPython notebook I concentrate on the text itself. This was actually my original motivation for working with the Star Trek: The Next Generation transcripts from chakoteya.net: I wanted to try out some machine learning algorithms. I came up with the idea of predicting which STNG character is the speaker of a line of text, or even of a single word.
Predicting Who Said What
I would say the results are pretty convincing if you look at some phrases:
- „My calculations are correct“ is ascribed to Data with 78% probability.
- Who would have doubted that it is Troi who utters a sentence like „Captain, I’m sensing a powerful mind.“? The model ascribes it to her with 73% probability.
- And who would use the word „Mom“? Obviously Wesley with 88% probability.
- „Mother“, by contrast, is a word used by Deanna Troi with 60% probability.
- „Deanna!“ is attributed to Riker, though not that exclusively (just 48% probability).
- And he is called „Number One“ by none other than Picard, with almost 100% probability.
Some more examples:
Also, the characters' most-used words are very descriptive of the characters as we know them:
But have a look for yourself.
How to get there
Getting there involved several steps, which were really good practice in Python, NumPy, pandas and sklearn.
- I had to download and clean the data, which was good practice in itself (startrekng-episodes-analysis_01.ipynb).
- I did some statistical analysis of the dataset with Python, NumPy and pandas in startrekng-episodes-analysis_02.ipynb.
- Finally we arrive at startrekng-episodes-analysis_03.ipynb, where I concentrate on predicting the speakers using two techniques: „Term Frequency – Inverse Document Frequency“ weighting and a „Multinomial Naive Bayes“ classifier („sklearn.feature_extraction.text.TfidfVectorizer“ and „sklearn.naive_bayes.MultinomialNB“).
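The last step can be sketched roughly as follows. This is not the notebook's code: the handful of training lines and the helper name predict_speaker are invented here for illustration, and chaining the two sklearn classes with make_pipeline is just one convenient way to combine them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data, invented for this sketch (NOT taken from the transcripts).
lines = [
    "my calculations are correct",
    "i am an android",
    "captain i am sensing a powerful mind",
    "i sense great fear",
    "make it so number one",
    "engage number one",
]
speakers = ["DATA", "DATA", "TROI", "TROI", "PICARD", "PICARD"]

# TF-IDF turns each line into a weighted word vector;
# Multinomial Naive Bayes learns which words go with which speaker.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(lines, speakers)


def predict_speaker(text):
    """Return a {speaker: probability} dict for one piece of text.

    Hypothetical helper, named here only for illustration.
    """
    probabilities = model.predict_proba([text])[0]
    return dict(zip(model.classes_, probabilities))


for phrase in ["number one", "my calculations are correct"]:
    probs = predict_speaker(phrase)
    best = max(probs, key=probs.get)
    print(f"{phrase!r}: {best} ({probs[best]:.0%})")
```

With only six made-up lines the printed probabilities mean very little; the percentages quoted above come from training on the full transcripts.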
If anyone is interested in the foundations of the algorithms, I recommend the Coursera MOOC „Probabilistic Graphical Models“ by Stanford professor Daphne Koller and the „Natural Language Processing“ course by