Predicting power dynamics with Machine Learning on emails (PART 1)

Target Audience

PART 1 targets the general public, while PART 2 is for a more technical audience that has at least heard of “Machine Learning classifiers”.


We either affect or are affected by people. You influence your family at home and/or your colleagues at work. At home, it’s perhaps the way you roll your eyes when your sibling says something you absolutely dislike; stare your kids down when they do something you disapprove of; or scream in excitement when your team scores a touchdown in the Super Bowl. At work, it’s probably the way you pack your bags at the end of a workday; attach sticky notes to your monitor as task reminders; or even slowly exit a boring conversation to return to your desk. In all of these, your actions either affect people or you are affected by their actions. Both can happen at once. There are countless fascinating examples of daily influence, but let’s focus on one: written words.

Consider for a moment all your written conversations: text messaging, Facebook chat, WhatsApp, Viber, email, you get the point. Which words appear the most? If you picked your top 3 words across all messages, what would they be? Perhaps your word choice changes with the platform and even with the person on the other end. Clearly, the way you write to your boss is different from how you write to a friend in a distant country. Let’s focus on your writing at work since, for the average person, more time is spent at work than at home. One easy way to capture your work writing is through email. Does your work email differ depending on which colleague you are writing to?

By analyzing your email, you could discover the emotions expressed through your words and see how they change over time; you may discover consistent boot-licking every time you email your boss; or you may find that you always use “please” to force a polite tone. In this work, we investigate what kinds of words are used in a set of emails at one company, Enron. Then we predict whether an email was sent by an employee higher or lower in the organizational hierarchy.

Big Question:

Given a work environment, what can we infer about power dynamics in written conversations based on usage of minor and major words?

In Natural Language Processing (NLP), minor words are stop words; these words carry little meaning when considered alone and are usually prepositions, conjunctions, and auxiliary verbs. Major words, on the other hand, have standalone meanings and include nouns, adjectives, and verbs.

For instance:
Jane is awesome and loves to go shopping.
Minor words: “is, and, to”;
Major words: “Jane, awesome, loves, go, shopping”.
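
To make the split concrete, here is a minimal sketch in Python. The stop-word list is a tiny hand-picked set chosen only for this example (an assumption); real stop-word lists are much longer, and the function name split_words is hypothetical.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}


def split_words(sentence):
    """Return (minor, major) word lists, preserving order of appearance."""
    words = re.findall(r"[A-Za-z']+", sentence)
    minor = [w for w in words if w.lower() in STOP_WORDS]
    major = [w for w in words if w.lower() not in STOP_WORDS]
    return minor, major


minor, major = split_words("Jane is awesome and loves to go shopping.")
print(minor)  # ['is', 'and', 'to']
print(major)  # ['Jane', 'awesome', 'loves', 'go', 'shopping']
```

Whatever list you pick decides what counts as “minor”, so the split is only as good as the stop-word list behind it.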

Small Questions:

  • How do major-word frequencies compare between the upspeak and downspeak corpora?
  • Using major words, how accurately can we predict the power dynamics of an outgoing email, i.e. is the email from a higher-ranking employee to a lower-ranking one or vice versa?

Upspeak email: from a lower-ranking employee to a higher-ranking one.
Downspeak email: from a higher-ranking employee to a lower-ranking one.

Power dynamics can be defined in multiple ways, but our focus is on organizational hierarchy. So if a Vice President sends an email to a secretary, it is downspeak; when the secretary emails the Vice President, it is upspeak. In reality there are more than two kinds of email, but for simplicity every email in the dataset has been labelled as either downspeak or upspeak.

Where is your dataset coming from?

  • Enron email dataset
 While there are multiple data sources that could answer the small questions, this dataset was chosen simply for convenience and availability. The data has already been cleaned and labelled, in the following format:

<label> **START** <email words> **EOM**

where <label> is either upspeak or downspeak and <email words> is the actual email sent. **START** and **EOM** (end of message) mark the beginning and end of the email words. This pattern is repeated on every line of the document.
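
If you want to read such a file yourself, a rough sketch could look like the one below. It assumes each line puts the label first, followed by **START**, the email words, and **EOM**; that exact ordering is an assumption on my part, and both parse_line and the sample line are made up for illustration.

```python
import re

# Assumed layout of one line (not confirmed by the dataset itself):
#   <label> **START** <email words> **EOM**
LINE_PATTERN = re.compile(
    r"^\s*(upspeak|downspeak)\s*\*\*START\*\*(.*?)\*\*EOM\*\*\s*$",
    re.IGNORECASE | re.DOTALL,
)


def parse_line(line):
    """Return (label, email_words), or None if the line does not match."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    return match.group(1).lower(), match.group(2).strip()


# Made-up sample line, not taken from the Enron data.
sample = "upspeak **START** please let me know if you have any questions **EOM**"
print(parse_line(sample))
# ('upspeak', 'please let me know if you have any questions')
```

Returning None for lines that do not match makes it easy to count how many lines break the assumed pattern before trusting the rest.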

PART 1: major words comparison for upspeak and downspeak




In the word clouds above, a word is drawn larger the more frequently it appears. That means for upspeak, “Please” is used more than “new”, while for downspeak, “meeting” is used more than “see”.

But why do the majority of emails, whether upspeak or downspeak, contain the word “please” or “Please”? Since Wordle is case-sensitive, we can infer that “Please” starts a sentence while “please” appears mid-sentence.
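
The counting behind a word cloud can be sketched in a few lines of Python. This is not how Wordle works internally, just a minimal case-sensitive frequency count; the stop-word list and the two sentences are toy examples made up for illustration, and major_word_counts is a hypothetical name.

```python
from collections import Counter
import re

# Toy stop-word list (an assumption, chosen only for this example).
STOP_WORDS = {"a", "any", "at", "be", "have", "if", "me", "the", "this", "you"}


def major_word_counts(emails):
    """Count major words across emails, keeping upper and lower case apart."""
    counts = Counter()
    for text in emails:
        for word in re.findall(r"[A-Za-z']+", text):
            if word.lower() not in STOP_WORDS:
                counts[word] += 1  # case-sensitive: "Please" != "please"
    return counts


# Made-up sentences, not taken from the Enron data.
toy_upspeak = [
    "Please let me know if you have any questions",
    "If you have any questions please give me a call",
]
counts = major_word_counts(toy_upspeak)
print(counts["Please"], counts["please"], counts["questions"])  # 1 1 2
```

Running the same function on the upspeak and the downspeak emails separately gives two frequency tables that can be compared side by side.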

What do you think is happening?

It appears Enron employees try to be polite in their emails, whether upspeak or downspeak. Interestingly, upspeak has more “call”, “information” and “questions”, while downspeak has “Enron”. We might conclude that when people email others of higher status, they are often seeking a meeting or have questions that need answering, so maybe a “call” would help. Perhaps managers emailing their department keep stressing the goals of Enron, so the word appears only in downspeak. Or maybe when people higher in the hierarchy send emails, their signature carries their title and the company name “Enron”. But who or what is “Shirley”? Maybe it’s just “noise” in the text?

Let’s jump into the actual emails.

What is actually happening

In upspeak, “please” shows up in “please let me know if you have any questions” and “if you have any questions give me a call”, written in multiple variations. This is not surprising, as most companies want employees to be as polite as possible, especially when communicating with clients. So we weren’t entirely off.

In downspeak, “please” appears in “please let …. know about this information”, “please be at this meeting… I have not received many responses, if you have any questions let me know”. Same word, slightly different usage.
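
One way to look at a word inside the actual emails is to pull out the few words around each occurrence. Here is a minimal sketch, assuming a window of three words on each side; keyword_in_context is a hypothetical helper, and the sentence is made up from the phrases quoted above rather than copied from a real email.

```python
import re


def keyword_in_context(text, keyword, window=3):
    """Return snippets with up to `window` words on each side of `keyword`."""
    words = re.findall(r"[A-Za-z']+", text)
    snippets = []
    for i, word in enumerate(words):
        if word.lower() == keyword.lower():
            start = max(0, i - window)
            snippets.append(" ".join(words[start:i + window + 1]))
    return snippets


# Made-up sentence for illustration.
toy_email = "I have not received many responses, please be at this meeting"
print(keyword_in_context(toy_email, "please"))
# ['received many responses please be at this']
```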

It turns out Shirley is the point of contact (most probably the head of HR) when a new employee has been interviewed and is ready to join “Enron”. The same contact is used when compiling event expenses, inviting people to Enron, passing information across teams, etc. Since there are a lot of these emails, what you get is an enlarged “Shirley” and the word “Enron”.

Now we know which major words were used in the Enron dataset. So if you saw a random email and tried to guess whether it was upspeak or downspeak, a simple rule is to check whether “Shirley” or “Enron” appears. If either does, it’s downspeak; otherwise it’s upspeak. That simple.
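
Written as code, the rule might look like the sketch below. The four labelled emails are invented toy examples, not Enron data, and the names rule_predict and accuracy are hypothetical.

```python
import re

# Cue words taken from the rule described in the post.
DOWNSPEAK_CUES = {"Shirley", "Enron"}


def rule_predict(email_text):
    """Guess 'downspeak' if any cue word appears, otherwise 'upspeak'."""
    words = set(re.findall(r"[A-Za-z]+", email_text))
    return "downspeak" if words & DOWNSPEAK_CUES else "upspeak"


def accuracy(labelled_emails):
    """Fraction of (label, text) pairs the rule gets right."""
    correct = sum(1 for label, text in labelled_emails
                  if rule_predict(text) == label)
    return correct / len(labelled_emails)


# Invented toy emails for illustration only.
toy_data = [
    ("downspeak", "Please contact Shirley about this information"),
    ("downspeak", "Please be at this meeting"),
    ("upspeak", "Please let me know if you have any questions"),
    ("upspeak", "If you have any questions give me a call"),
]
print(accuracy(toy_data))  # 0.75
```

On this toy set the rule misses the second email, a downspeak message that mentions neither cue word.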
But what if it is a downspeak email and neither “Shirley” nor “Enron” is present? If there are many emails like this, we’d make too many errors. Can we predict better? Yes, and this is where Machine Learning takes the dance floor. The next part has more technical details to aid replication or extension to other applications. If you are not interested in the technical explanation, skim through it to get the overall goals. See you in PART 2.
