October 5, 2016

A Syntactic Analysis of the Presidential Debate of September 26, 2016


I analyzed the syntax of the Presidential Debate of September 26, 2016 between Hillary Clinton and Donald Trump. In this post, I outline what I did and a summary of my findings. TL;DR: Hillary Clinton talked less than Donald Trump, but she said more.

I have posted the results and process in a public Github repository, so anyone interested in the data and process of analysis may look there. I make use of the syntactic parsing packages that I have worked on in recent years -- in particular the mccawley-api repository which parses English sentences as webservice calls, and the mccawley-bulk repository which uses the mccawley-api for handling bulk amounts of written English language and which I adapted for analyzing the debate transcript.

I began by taking an unannotated copy of the debate transcript, cleaning it up (removing blank lines and ensuring that notes about speakers, when noted, always began a line of dialogue) into a raw but consistent text format. I then used a Python script to do two things: break up the text into sentences (one sentence per line), and denote a speaker for each sentence.

The results of that process became the input that was fed into a new sentences module in mccawley-bulk which did the parsing and analysis. It took 5 minutes, 18 seconds for all 1,363 sentences transcribed in the debate. I output that into a vanilla text file and copied that into an Excel spreadsheet.

At this point, let's pause and look more closely at the fields of the analysis and what they mean. The analysis delivered back eight columns:

  1. Speaker: Who said the sentence.

  2. Transcript: The transcribed sentence itself.

  3. # Words: The number of words in the sentence -- computed simply by dividing the sentence by spaces and counting the result.

  4. # Nodes: The number of syntactic nodes in the sentence, when parsed out. In the terminology of graphs, where you have lines and dots and lines connecting dots, a "node" is a synonym for a dot. For syntax trees, a node represents a word, or an abstract "dot" representing the combination of one or more words or one or more other "dots".

  5. # Tokens: The number of tokens in the sentence. A token in this context refers to an indivisible unit of language. For example, the sentence "She's here" has two words but three tokens: "she", "'s" (a synonym for "is") and "here".

  6. # Propositions: The number of propositions in the sentence. A proposition (not to be confused by _prep_osition with an "e") refers to a word or term that can bear some truth value. (Whether or not it actually is true is another story). Such propositions can correspond to types of speech -- typically a verb, adjective, adverb, or prepositional phrase -- so this amounts to a count of the number of verbs, adjectives, adverbs, and prepositional phrases in the sentence.

  7. Tree depth: The number of levels deep in the tree drawn by the sentence at hand, starting from one level beneath the ROOT node.

  8. S-Expression: A representation of the tree itself as a nested list of parentheses. Each set of parentheses wraps around a consituent or part of speech, and their nestings mirror the tree structure of the sentence. Constitutents and parts of speech follow the Penn Treebank Project.

I then used a great many command-line tools, some of which I archived to compute statistics for each speaker. The results of that are stored in a separate Excel file. I reproduce some of those stats here with commentary:

Number of sentences433752
Total number of distinct words13751291
Total of distinct non-stopwords11831101

Trump said more sentences than Clinton, to be sure. A stopword is a common word in a language, typically words that carry some universal function. In the case of English, these include articles (a, an, the), prepositions (in, of, for, at, etc.), pronouns (he, she, it, etc.), and so on. It's common in natural language processing to remove stopwords from analysis, given their preponderance and the fact that they don't necessarily carry a lot of insightful information. I provide a count of the total number of distinct words and distinct stopwords spoken by each candidate in the debate. I used this list of stop words and this Clojure script for determining those words and counts.

Total number of words63708679
Largest number of words6460
Average number of words per sentence14.711.5
Total number of nodes1434120121
Largest number of nodes147136
Average number of nodes per sentence33.126.7
Total number of tokens740810496
Largest number of tokens7368
Average number of tokens per sentence17.113.9
Total number of propositions30544200
Largest number of propositions3434
Average number of propositions per sentence75.5
Longest tree depth3134
Average tree depth per sentence8.66.7

In all of these metrics, we see a consistent pattern: Trump clearly wins by volume: more words, more tokens, more of nearly everything. But Clinton wins by average: her averages per sentence are greater than Trump's regardless where you look. That's what I mean when I said that Hillary Clinton talked less than Donald Trump, but she said more.

There is one more metric I'd like to discuss:

Number of ROOT-level FRAGments1027

When is a sentence not a sentence? When it's a fragment. That is, a sentence can be distilled down into components indicative of sentences -- for example, a noun phrase and a verb phrase in the case of a declarative sentence. When it can't be so distilled, then it's classified as a sentence fragment (or FRAG as it's noted in the engine).

Fragments are inevitable in spoken language, but in the case of a debate, fragments are like golf scores: the lower the number, the better. Here Trump said more "sentences" classified as FRAG (by a ratio of nearly three to one), but that's not a good thing. What's more, a great many of those fragments occurred in the course of his attempted interruptions of Clinton.

I would advise two things, in particular to Donald Trump, in the subsequent debates: Don't interrupt so much, and be more selective in your words. Sometimes, and especially here, less is more.