Over the last couple of months I have been getting my head around NumPy, Python and Pandas. Before I get into the technical challenges and the steep learning curve in a following blog post (it is frustrating at first, but then enlightening :-) ), I need to show some results first!
I thought that, rather than becoming the millionth person working on a Kaggle project, I would try to get my own data set to play with…
So I came up with the idea of analyzing the transcripts of Star Trek: The Next Generation (STNG). I did not have to google for very long before I found some nice-looking transcripts at chakoteya.net. I did some web scraping to download all the text files and put them into a Pandas DataFrame.
Thanks to the author of the transcripts, I only had to do a little data cleaning: fixing some misspellings here, deleting some line breaks there… Not much was necessary. The quality is pretty good!
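To give an idea of what turning a transcript into rows can look like, here is a minimal sketch. It is not the code from my notebook. It assumes that each spoken line looks like `SPEAKER: text` (optionally `SPEAKER [OC]: text`) and that locations and stage directions sit in brackets or parentheses; the function name `parse_transcript` and the sample text are made up for illustration.

```python
import re

# Assumed format: "SPEAKER: text" or "SPEAKER [OC]: text" (an assumption, not a spec).
SPEAKER_LINE = re.compile(r"^([A-Z][A-Z0-9' .\-]*?)(?:\s*\[[^\]]*\])?:\s*(.+)$")


def parse_transcript(text, episode):
    """Return one dict per spoken line found in a transcript string."""
    rows = []
    for raw in text.splitlines():
        line = " ".join(raw.split())  # collapse stray whitespace
        match = SPEAKER_LINE.match(line)
        if match:  # locations and stage directions do not match and are skipped
            rows.append({
                "episode": episode,
                "character": match.group(1).strip(),
                "line": match.group(2),
            })
    return rows


# Made-up sample text in the assumed format
sample = """[Bridge]
PICARD: Engage.
(The ship goes to warp)
DATA [OC]: Captain,   sensors are picking up a vessel.
K'EHLEYR: Hello, Worf.
"""
rows = parse_transcript(sample, episode=1)
```

A list of dicts like this can be handed straight to `pd.DataFrame(rows)` to get one row per spoken line.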
Long story short: Have a look! Here are some examples:
The "line pie": the distribution of spoken lines among the 25 characters with the most spoken lines in STNG:
Picard obviously has a lot to say…
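The counting behind such a pie is short in Pandas. A minimal sketch, assuming a DataFrame with one row per spoken line and a `character` column (the column name and the toy data are my assumptions for this example):

```python
import pandas as pd

# Made-up toy data; the real DataFrame has one row per spoken line.
lines = pd.DataFrame({
    "character": ["PICARD", "DATA", "PICARD", "RIKER", "PICARD", "DATA"],
})

top_n = 2  # the chart in this post uses 25

# Number of spoken lines per character, largest first, cut to the top N
counts = lines["character"].value_counts().head(top_n)

# Share of the pie for each of the top-N characters
shares = counts / counts.sum()
```

With matplotlib installed, `counts.plot.pie()` would draw the chart from the same Series.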
The number of episodes a character had the most lines in:
Not surprisingly, Picard dominated 76 episodes. But who was K'EHLEYR again?
PICARD: 76 episodes
DATA: 20
RIKER: 16
LAFORGE: 10
CRUSHER: 9
WORF: 8
TROI: 4
LWAXANA: 3
BARCLAY: 2
WESLEY: 2
K'EHLEYR: 2 (who was that again??)
1 episode each: CLARA, CLEMENS, CONOR, JEV, ARMUS, DURKEN, FAJO, AMANDA, JAMESON, JELLICO, MADRED, MARR, OKONA, PICARD JR, Q, RAL, RASMUSSEN, RIKER 2, RO, SALIA, SCOTT, SITO, SPOCK, ALKAR
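A ranking like this can be computed in two steps: find the character with the most lines in each episode, then count how often each character comes out on top. A minimal sketch, again assuming `episode` and `character` columns and using made-up toy data:

```python
import pandas as pd

# Made-up toy data: one row per spoken line
lines = pd.DataFrame({
    "episode":   [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "character": ["PICARD", "PICARD", "DATA",
                  "DATA", "DATA", "RIKER",
                  "PICARD", "PICARD", "RIKER"],
})

# Rows: episodes, columns: characters, values: number of spoken lines
table = pd.crosstab(lines["episode"], lines["character"])

# Character with the most lines in each episode
# (on a tie, idxmax keeps the first column it finds)
leader = table.idxmax(axis=1)

# Number of episodes each character "dominated"
episodes_led = leader.value_counts()
```

Note that ties are resolved silently here, so a real analysis may want to check for them first.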
Picard lost his words
In the last 50 episodes, Picard had more episodes in which he spoke far fewer lines than his average.
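What counts as "far fewer than average" needs a threshold. A minimal sketch of one possible choice, flagging episodes that fall more than one standard deviation below the character's mean; the threshold and the numbers are assumptions for illustration, not the values behind the chart:

```python
import pandas as pd

# Made-up line counts per episode for one character (episodes 1 to 7)
picard = pd.Series([60, 55, 70, 65, 20, 58, 15], index=range(1, 8), name="PICARD")

mean = picard.mean()
std = picard.std()  # sample standard deviation (ddof=1)

# Episodes more than one standard deviation below the mean
quiet = picard[picard < mean - std]
```

Here `quiet` keeps the episode numbers as its index, so it can be laid over a plot of all episodes.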
The "Crusher-Pulaski gap" (episodes 26 to 47):
And she was never seen afterwards?
The one where three main characters all speak far more than average at once:
What you always wanted to know about Wesley
And more, and more and more diagrams and insights
in the IPython notebook github.com/…/startrekng-episodes-analysis_02.ipynb
Have fun! Any feedback is more than welcome!