It would seem that I have far too much time on my hands. After the post about a Star Trek “test”, I started wondering if there could be any data to back it up and… well here we go:

The Next Generation

| Name             | Percentage of Lines |
| ---------------- | ------------------- |
| PICARD           | 20.16 |
| RIKER            | 11.64 |
| DATA             | 10.1 |
| LAFORGE          | 6.93 |
| WORF             | 6.14 |
| TROI             | 5.4 |
| CRUSHER          | 5.11 |
| WESLEY           | 2.32 |

DS9

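| Name             | Percentage of Lines |
| ---------------- | ------------------- |
| SISKO            | 13.0 |
| KIRA             | 8.23 |
| BASHIR           | 7.79 |
| O’BRIEN          | 7.31 |
| ODO              | 7.26 |
| QUARK            | 6.98 |
| DAX              | 5.73 |
| WORF             | 3.18 |
| JAKE             | 2.31 |
| GARAK            | 2.29 |
| NOG              | 2.01 |
| ROM              | 1.89 |
| DUKAT            | 1.76 |
| EZRI             | 1.53 |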

Voyager

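| Name             | Percentage of Lines |
| ---------------- | ------------------- |
| JANEWAY          | 17.7 |
| CHAKOTAY         | 8.76 |
| EMH              | 8.34 |
| PARIS            | 7.63 |
| TUVOK            | 6.9 |
| KIM              | 6.57 |
| TORRES           | 6.45 |
| SEVEN            | 6.1 |
| NEELIX           | 4.99 |
| KES              | 2.06 |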

Enterprise

| Name             | Percentage of Lines |
| ---------------- | ------------------- |
| ARCHER           | 24.52 |
| T’POL            | 13.09 |
| TUCKER           | 12.72 |
| REED             | 7.34 |
| PHLOX            | 5.71 |
| HOSHI            | 4.63 |
| TRAVIS           | 3.83 |
| SHRAN            | 1.26 |

Discovery

Note: This is a limited dataset, as the source site only has transcripts for seasons 1, 2, and 4.

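| Name             | Percentage of Lines |
| ---------------- | ------------------- |
| BURNHAM          | 22.92 |
| SARU             | 8.2 |
| BOOK             | 6.21 |
| STAMETS          | 5.44 |
| TILLY            | 5.17 |
| LORCA            | 4.99 |
| TARKA            | 3.32 |
| TYLER            | 3.18 |
| GEORGIOU         | 2.96 |
| CULBER           | 2.83 |
| RILLAK           | 2.17 |
| DETMER           | 1.97 |
| OWOSEKUN         | 1.79 |
| ADIRA            | 1.63 |
| COMPUTER         | 1.61 |
| ZORA             | 1.6 |
| VANCE            | 1.07 |
| CORNWELL         | 1.07 |
| SAREK            | 1.06 |
| T’RINA           | 1.02 |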

If anyone is interested, here’s the (rather hurried) Python used:

#!/usr/bin/env python3

#
# This script assumes that you've already downloaded all the episode lines from
# the fantastic chakoteya.net:
#
# wget --accept=html,htm --relative --wait=2 --include-directories=/STDisco17/ http://www.chakoteya.net/STDisco17/episodes.html -m
# wget --accept=html,htm --relative --wait=2 --include-directories=/Enterprise/ http://www.chakoteya.net/Enterprise/episodes.htm -m
# wget --accept=html,htm --relative --wait=2 --include-directories=/Voyager/ http://www.chakoteya.net/Voyager/episode_listing.htm -m
# wget --accept=html,htm --relative --wait=2 --include-directories=/DS9/ http://www.chakoteya.net/DS9/episodes.htm -m
# wget --accept=html,htm --relative --wait=2 --include-directories=/NextGen/ http://www.chakoteya.net/NextGen/episodes.htm -m
#
# Then you'll probably have to convert the following files to UTF-8 as they
# differ from the rest:
#
# * Voyager/709.htm
# * Voyager/515.htm
# * Voyager/416.htm
# * Enterprise/41.htm
#

import re
from collections import defaultdict
from pathlib import Path

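# Episode pages are named by number, e.g. 709.htm or 41.htm.
EPISODE_REGEX = re.compile(r"^\d+\.html?$")
# A spoken line starts with the speaker's name in capitals, e.g. "PICARD: ".
LINE_REGEX = re.compile(r"^(?P<name>[A-Z']+): ")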

EPISODES = Path("www.chakoteya.net")
DISCO = EPISODES / "STDisco17"
ENT = EPISODES / "Enterprise"
TNG = EPISODES / "NextGen"
DS9 = EPISODES / "DS9"
VOY = EPISODES / "Voyager"


class CharacterLines:
    def __init__(self, path: Path) -> None:
        self.path = path
        self.line_count = defaultdict(int)

    def collect(self) -> None:
        for episode in self.path.glob("*.htm*"):
            if EPISODE_REGEX.match(episode.name):
                for line in episode.read_text(encoding="utf-8").split("\n"):
                    if m := LINE_REGEX.match(line):
                        self.line_count[m.group("name")] += 1

    @property
    def as_percentages(self) -> dict[str, float]:
        total = sum(self.line_count.values())
        r = {}
        for k, v in self.line_count.items():
            percentage = round(v * 100 / total, 2)
            # Only keep characters with more than 1% of the lines.
            if percentage > 1:
                r[k] = percentage
        return dict(sorted(r.items(), key=lambda item: item[1], reverse=True))

    def render(self) -> None:
        print(self.path.name)
        print("| Name             | Percentage of Lines |")
        print("| ---------------- | ------------------- |")
        for character, pct in self.as_percentages.items():
            print(f"| {character:16} | {pct} |")


if __name__ == "__main__":
    for series in (TNG, DS9, VOY, ENT, DISCO):
        counter = CharacterLines(series)
        counter.collect()
        counter.render()
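
If you'd rather not convert those few odd files by hand, a fallback read might do the job. This is only a sketch: `read_transcript` is a hypothetical helper that isn't in the script above, and the `cp1252` fallback is a guess, since I haven't checked which encoding those files actually use.

```python
from pathlib import Path


def read_transcript(path: Path, fallback: str = "cp1252") -> str:
    """Read a transcript as UTF-8, falling back to another encoding.

    The fallback encoding is an assumption; adjust it to whatever the
    non-UTF-8 files turn out to be.
    """
    raw = path.read_bytes()
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # errors="replace" keeps the read from failing if the guess is wrong.
        return raw.decode(fallback, errors="replace")
```

In `collect`, `episode.read_text()` could then be swapped for `read_transcript(episode)`.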
  • milkisklim@lemm.ee
    3 days ago

    This is really cool stuff! Thanks for posting the code!

    This definitely goes to show why people felt Discovery was the Michael Burnham show. It's not that she had an unusual share of the lines, but that no one else spoke even half as much as she did, with the remaining lines split among more characters than in the other series.

    Also, does GEORGIOU count both the prime and mirror versions of the character?
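
The "half as much" point can be checked against the posted figures. A rough sketch, with the top two speakers per series copied from the tables in the post; `top_two` and `runner_up_ratio` are names made up for this example:

```python
# Top two speakers per series, copied from the tables in the post.
top_two = {
    "TNG": (("PICARD", 20.16), ("RIKER", 11.64)),
    "DS9": (("SISKO", 13.0), ("KIRA", 8.23)),
    "Voyager": (("JANEWAY", 17.7), ("CHAKOTAY", 8.76)),
    "Enterprise": (("ARCHER", 24.52), ("T'POL", 13.09)),
    "Discovery": (("BURNHAM", 22.92), ("SARU", 8.2)),
}


def runner_up_ratio(series: str) -> float:
    """Second-place share of lines divided by first-place share."""
    (_, first), (_, second) = top_two[series]
    return second / first


for series in top_two:
    print(f"{series:10} {runner_up_ratio(series):.2f}")
```

On these numbers Discovery has the lowest ratio of the five, at roughly 0.36.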

    • Daniel Quinn@lemmy.ca (OP)
      3 days ago

      That was my takeaway as well. I just wish I had data for the other seasons. It’d be interesting to see how that would change the percentages.

      As for GEORGIOU, I’m reasonably sure that this refers to both versions of her.
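
The reason is that the script keys purely on the speaker label at the start of each line, so any line labelled GEORGIOU lands in the same bucket. A small sketch using the same pattern as `LINE_REGEX`; the sample lines are made up for illustration, and I haven't verified how the transcripts label the two versions:

```python
import re
from collections import Counter

# Same pattern as LINE_REGEX in the script.
LINE_REGEX = re.compile(r"^(?P<name>[A-Z']+): ")

# Made-up sample lines, not real transcript text.
sample = [
    "GEORGIOU: Set a course.",
    "BURNHAM: Aye, Captain.",
    "GEORGIOU: Kneel.",
]

# Count lines per speaker label; nothing distinguishes one GEORGIOU from another.
counts = Counter(
    m.group("name") for line in sample if (m := LINE_REGEX.match(line))
)
print(counts)
```

Both GEORGIOU lines are counted under one name, whichever version of the character is speaking.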