
In Part 1 of analysing the „Star Trek: The Next Generation“ transcripts, I ran some statistics on the characters and episodes: who has the most lines, who appears in which episodes, and so on.

In my new IPython notebook I concentrate on the text itself. This was actually my original motivation for working with the Star Trek: The Next Generation transcripts from chakoteya.net: I wanted to try out some machine learning algorithms. I came up with the idea of predicting which STNG character is the speaker of a given line of text, or even of a single word.

Predicting Who Said What

I would say the results are pretty convincing if you look at some phrases:

  • „My calculations are correct“ is ascribed to Data with 78% probability.
  • Who would have doubted that it is Troi who utters a sentence like „Captain, I’m sensing a powerful mind.“ (73% probability)?
  • And who would use the word „Mom“? Obviously Wesley, with 88% probability.
  • „Mother“, by contrast, is a word used by Deanna Troi with 60% probability.
  • „Deanna!“ is attributed to Riker, though not that exclusively (just 48% probability).
  • And he is called „Number One“ by none other than Picard, with almost 100% probability.

Some more examples:


Also, the characters' most-used words are very descriptive of the characters as we know them:


But have a look for yourself.
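If you want to build such a per-character word list yourself, here is a minimal sketch. It is not taken from the notebook: the function name `top_words`, the simple regex tokenizer and the toy lines are all assumptions made for illustration, and the notebook may well count words differently.

```python
import re
from collections import Counter, defaultdict


def top_words(lines, speakers, n=3, stop_words=frozenset()):
    """Return the n most frequent words per speaker (hypothetical helper)."""
    counts = defaultdict(Counter)
    for text, speaker in zip(lines, speakers):
        # Lowercase and keep only runs of letters and apostrophes.
        tokens = re.findall(r"[a-z']+", text.lower())
        counts[speaker].update(t for t in tokens if t not in stop_words)
    return {s: [w for w, _ in c.most_common(n)] for s, c in counts.items()}


# Invented toy lines, not real transcript data.
toy_lines = ["Captain, Captain, I sense fear", "Engage"]
toy_speakers = ["TROI", "PICARD"]
print(top_words(toy_lines, toy_speakers, n=1))
```

Passing a set of common English words as `stop_words` keeps words like „the“ or „and“ from dominating every character's list.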

How to get there

Getting there involved several steps, which were really good practice in Python, NumPy, pandas and sklearn.

  1. I had to download and clean the data, which happens in startrekng-episodes-analysis_01.ipynb.
  2. I did some statistical analysis of the dataset with Python, NumPy and pandas in startrekng-episodes-analysis_02.ipynb.
  3. Finally we arrive at startrekng-episodes-analysis_03.ipynb, where I concentrate on predicting the speakers using two building blocks: „Term Frequency – Inverse Document Frequency“ weighting to turn the text into features, and the „Multinomial Naive Bayes“ classifier („sklearn.feature_extraction.text.TfidfVectorizer“ and „sklearn.naive_bayes.MultinomialNB“).
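To give an idea of how the third step fits together, here is a minimal sketch that chains the two sklearn classes named above. It is only an illustration under assumptions: the six training lines are invented toy data rather than the real transcripts, the helper `speaker_probabilities` is a hypothetical name, and all parameters are left at their defaults, which may differ from what the notebook uses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented toy lines for illustration; the notebook trains on the full transcripts.
lines = [
    "My calculations are correct, sir",
    "I am an android and do not feel emotion",
    "Captain, I am sensing great fear",
    "I sense a powerful presence, Captain",
    "Make it so, Number One",
    "Engage, Number One",
]
speakers = ["DATA", "DATA", "TROI", "TROI", "PICARD", "PICARD"]

# TF-IDF turns each line into a weighted word vector,
# Multinomial Naive Bayes learns which words go with which speaker.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])
model.fit(lines, speakers)


def speaker_probabilities(text):
    """Return (speaker, probability) pairs, most likely speaker first."""
    probs = model.predict_proba([text])[0]
    classes = model.named_steps["nb"].classes_
    return sorted(zip(classes, probs), key=lambda pair: pair[1], reverse=True)


for speaker, prob in speaker_probabilities("Number One"):
    print(f"{speaker}: {prob:.2f}")
```

With so little training data the probabilities are not meaningful in themselves; the point is only that `predict_proba` yields one probability per speaker, which is where percentages like the ones quoted earlier come from.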

Practical Background

To do all this, I learned a lot from the „pandas“ book and from the scikit-learn examples such as the MLComp text classification example. And obviously almost nothing could have been done without Stack Overflow.

Theoretical Background

If anyone is interested in the foundations of the algorithms, I recommend the Coursera MOOC „Probabilistic Graphical Models“ by Stanford Professor Daphne Koller and the „Natural Language Processing“ course by Dan Jurafsky and Christopher Manning. The Udacity course on Machine Learning is also pretty helpful.