**Tags**

AI, artificial intelligence, machine learning, multinomial naive bayes, naive bayes, nlp, picard, star trek, stng, tfidf

In Part 1 of analysing the "Star Trek: The Next Generation" transcripts, I performed some statistical analysis of the characters and episodes: who has the most lines, who appears in which episodes, and so on.

In my new IPython notebook I concentrate on the text itself. This was actually my original motivation for working with the **Star Trek: The Next Generation** transcripts from chakoteya.net: I wanted to try out some machine learning algorithms. The idea I came up with was to predict which STNG character is the speaker of a given line of dialogue, or even of a single word.

## Predicting Who Said What

I would say the results are pretty convincing if you look at some phrases:

- **"My calculations are correct"** is ascribed to **Data** with 78% probability.
- Who would *not* have thought that it was **Troi** who uttered a sentence like **"Captain, I'm sensing a powerful mind."**? (73% probability)
- And who would use the word **"Mom"**? Obviously **Wesley**, with 88% probability.
- "**Mother**", on the other hand, is a word used by Deanna **Troi** with 60% probability.
- But **"Deanna!"** is used by **Riker**, though not all that exclusively (just 48% probability).
- And he is called "**Number One**" by none other than **Picard**, with almost 100% probability.
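Per-speaker probabilities like these can be read off the fitted classifier with `predict_proba`. Here is a minimal sketch of the idea, using a few toy stand-in training lines instead of the real transcript data (the actual training happens in the third notebook, described below):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in for the real (line, speaker) training pairs
# extracted from the transcripts.
lines = ["My calculations are correct, sir.",
         "Number One, you have the bridge.",
         "Captain, I'm sensing a powerful mind."]
speakers = ["DATA", "PICARD", "TROI"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(lines, speakers)

# One probability per known speaker for a new phrase:
for speaker, p in zip(model.classes_, model.predict_proba(["Number One"])[0]):
    print(speaker, round(p, 2))
```

With the full transcript data, the same `fit`/`predict_proba` calls would yield speaker probabilities like the percentages quoted above.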

Some more examples can be found in the notebook.

Also, each character's most frequently used words are very descriptive of the character as we know them.
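Such per-character word rankings can be computed with plain pandas; below is a minimal sketch, assuming a DataFrame with one row per spoken line (the tiny frame here is a toy stand-in for the real transcript data):

```python
import re
from collections import Counter
import pandas as pd

# Toy stand-in for the transcript DataFrame built in the earlier
# notebooks: one row per spoken line, speaker in "who", line in "text".
df = pd.DataFrame({
    "who":  ["PICARD", "PICARD", "DATA"],
    "text": ["Make it so.", "Engage.", "My calculations are correct."],
})

# Rank each character's words by raw frequency.
for who, group in df.groupby("who"):
    words = re.findall(r"[a-z']+", " ".join(group["text"]).lower())
    print(who, Counter(words).most_common(5))
```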

But have a look for yourself.

## How to Get There

Doing all this involved several steps, which were really good practice in Python, NumPy, pandas and scikit-learn.

- I had to download and clean the data, which I do in startrekng-episodes-analysis_01.ipynb.
- I did some statistical analysis of the dataset with Python, NumPy and pandas in startrekng-episodes-analysis_02.ipynb.
- Finally, in startrekng-episodes-analysis_03.ipynb, I concentrate on predicting the speakers using two algorithms, "Term Frequency – Inverse Document Frequency" and "Multinomial Naive Bayes" (`sklearn.feature_extraction.text.TfidfVectorizer` and `sklearn.naive_bayes.MultinomialNB`); a minimal sketch of the whole workflow follows after this list.
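To make the workflow concrete, here is a minimal end-to-end sketch under the assumption that the cleaned transcripts yield two parallel lists, `lines` (one entry per line of dialogue) and `speakers` (the matching character names); the short lists below are mere placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder data; the real lists hold many thousands of lines.
lines = ["Make it so.", "Engage.", "My calculations are correct.",
         "Captain, I'm sensing a powerful mind.", "Mom!",
         "Number One, you have the bridge."]
speakers = ["PICARD", "PICARD", "DATA", "TROI", "WESLEY", "PICARD"]

# Hold out part of the data to check the classifier on unseen lines.
X_train, X_test, y_train, y_test = train_test_split(
    lines, speakers, test_size=0.2, random_state=42)

# TF-IDF weighs each word by how specific it is to a line;
# Multinomial Naive Bayes then learns per-speaker word distributions.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```

Chaining the vectorizer and the classifier into one pipeline ensures the vocabulary is learned from the training split only, so no test data leaks into the features.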

## Practical Background

To do all this, I learned a lot from the "pandas" book and from the scikit-learn examples, such as the MLComp text classification example. And obviously, almost nothing could have been done without Stack Overflow.

## Theoretical Background

If anyone is interested in the foundations of the algorithms, I recommend the Coursera MOOC "Probabilistic Graphical Models" by Stanford professor Daphne Koller and the Coursera "Natural Language Processing" course.