Predicting power dynamics with Machine Learning on emails (PART 2)

PART 2: predicting power dynamics with Machine Learning


Cont’d from PART 1.

Now we know which major words were used in the Enron dataset. So if you saw a random email and tried to guess whether it was upspeak or downspeak, a simple rule would be to check whether “Shirley” or “Enron” appears. If either does, it’s downspeak; otherwise it’s upspeak. That simple.

But what about downspeak emails without “Shirley” or “Enron”?
For instance, one downspeak email talks about interviewing someone, but neither “Shirley” nor “Enron” appears in it. We definitely need something better. This is where Machine Learning algorithms come in.
Process Summary
  • Given a dataset of sent emails labelled “upspeak” or “downspeak”, I used the words in each email as features and “upspeak/downspeak” as binary labels, then trained 3 different ML classifiers that appeal to me: Naive Bayes, SVM, and decision tree. There was no particular intuition behind selecting these 3 classifiers; they were chosen to get a feel for machine learning predictions.

Process Details & Python Programming

  • The dataset consists of Enron emails labelled as upspeak or downspeak. Upspeak means the email was sent by someone lower in the company hierarchy to someone of higher rank; downspeak means the email was sent by a higher-ranked person to a lower-ranked person. There is no detail about how far apart the two parties’ ranks are, only that one is higher and the other is lower.
  • Reading the file involved stripping “\n” from every line. While reading every line, I aggregated all upspeak messages to one python list and all downspeak messages to another list.
  • Using the bag of words approach from this online tutorial, I created features from the upspeak and downspeak Python lists. Bag of words simply builds a dictionary recording the presence of each word. For instance, in the default case “You are awesome” becomes {“You”: True, “are”: True, “awesome”: True}. Beyond this, I created a variation of bag of words using the same approach but eliminating stop words. Using the same example, the result is {“awesome”: True}. Then a few more variations, described below, were made.
  • From the labelled data, training set (90%) and test set (10%) were made. Using “all words” as features, I got 82% accuracy with NaiveBayes classifier. Inspired by Danco’s iterative ML process in the last class, I decided to tweak for possible improved accuracy.
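
The feature-building and splitting steps above can be sketched roughly as follows. This is a minimal sketch rather than the code used for the post: the stopword set is a tiny hypothetical stand-in for NLTK's English stopword list, the helper names label_featuresets and split_train_test are made up for illustration, and the two example messages are invented.

```python
import random

# Tiny stand-in for a real stopword list (assumption; NLTK ships a fuller one).
STOPWORDS = {"you", "are", "the", "i", "to", "a", "is"}

def bag_of_words(words):
    """Map every word to True, recording only its presence."""
    return {word: True for word in words}

def bag_of_non_stopwords(words, stopwords=STOPWORDS):
    """Same as bag_of_words, but skip stopwords (compared in lowercase)."""
    return bag_of_words(w for w in words if w.lower() not in stopwords)

def label_featuresets(messages, label, extractor=bag_of_words):
    """Turn raw messages into (feature_dict, label) tuples (hypothetical helper)."""
    return [(extractor(msg.split()), label) for msg in messages]

def split_train_test(featuresets, train_fraction=0.9, seed=None):
    """Shuffle a copy, then cut it into training and test portions."""
    shuffled = list(featuresets)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Invented example messages, one per class.
upspeak = ["You are awesome"]
downspeak = ["Please send the report"]
featuresets = (label_featuresets(upspeak, "upspeak")
               + label_featuresets(downspeak, "downspeak"))
```

Passing extractor=bag_of_non_stopwords to label_featuresets switches the whole feature set to the stopword-free variation without touching the rest of the pipeline.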

Let’s improve accuracy by removing stopwords

  • Digging more into NLTK classification tutorials, I imported this file to create better features. First I started with “bag_of_non_stopwords”, which eliminated all English stopwords i.e. high frequency words such as the, I, to, etc. The performance improved by about 4%. Neat!
  • Next, I ran the algorithm 5 times to see the overall performance, but the result was disappointing: the performance fluctuated, sometimes better and other times worse than “bag_of_words”. Looking deeper into my code, it became evident that the performance would fluctuate because the features are shuffled every time a classifier is trained, so each run gets a different training/test split.
  • One quick fix is to average the accuracy over n runs. I chose n = 10.
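
Averaging over several shuffled runs might look like the sketch below. It assumes the classifier is built by a train callable that returns an object with a classify method; MajorityClassifier is a made-up stand-in so the example runs without any library, and the data is invented.

```python
import random

def accuracy(classifier, test_set):
    """Fraction of test examples whose predicted label matches the true one."""
    correct = sum(1 for feats, label in test_set
                  if classifier.classify(feats) == label)
    return correct / len(test_set)

def average_accuracy(featuresets, train, n_runs=10, train_fraction=0.9, seed=0):
    """Reshuffle, split, train and score n_runs times; return the mean accuracy."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_runs):
        shuffled = list(featuresets)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        classifier = train(shuffled[:cut])
        scores.append(accuracy(classifier, shuffled[cut:]))
    return sum(scores) / len(scores)

class MajorityClassifier:
    """Stand-in classifier (assumption): always predicts the most common training label."""

    def __init__(self, label):
        self.label = label

    @classmethod
    def train(cls, train_set):
        labels = [label for _, label in train_set]
        return cls(max(set(labels), key=labels.count))

    def classify(self, feats):
        return self.label

# Invented data: 8 downspeak and 2 upspeak examples with empty feature dicts.
data = [({}, "downspeak")] * 8 + [({}, "upspeak")] * 2
mean_acc = average_accuracy(data, MajorityClassifier.train)
```

With NLTK installed, nltk.NaiveBayesClassifier.train should be usable as the train argument, since it also takes a list of (features, label) pairs and returns an object with a classify method.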

Let’s tweak the features and add other classifiers

  • I added more variations of features such as “bag_of_bigrams”, “bag_of_trigrams” while including more classifiers such as DecisionTree, and SVM. All of these were tested over 10 iterations.
  • The performance of Naive Bayes was similar across the different feature sets, with bigrams giving the best output. I expected non-stopwords to be the best because stopwords seemed like ‘text noise’, and I’m still not sure why bigrams outperformed the others; perhaps overfitting is at work in the other algorithms.
  • Investigating other classifiers was rather disappointing. The first shocking result was that my program was stuck — the accuracy results wouldn’t print upon using SVM or Decision Tree.
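
For reference, n-gram presence features can be sketched as below. The names bag_of_bigrams and bag_of_trigrams mirror the variation names mentioned above, but the bodies are a guess: this version simply pairs adjacent words and keeps the single words as well, which may differ from what the imported file does.

```python
def ngrams(words, n):
    """Return all runs of n adjacent words as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def bag_of_ngrams(words, n):
    """Presence features for the single words plus every adjacent n-gram."""
    features = {word: True for word in words}
    features.update({gram: True for gram in ngrams(words, n)})
    return features

def bag_of_bigrams(words):
    """Single words plus adjacent word pairs (assumed behaviour)."""
    return bag_of_ngrams(words, 2)

def bag_of_trigrams(words):
    """Single words plus adjacent word triples (assumed behaviour)."""
    return bag_of_ngrams(words, 3)

example = bag_of_bigrams(["you", "are", "awesome"])
```

Each n-gram becomes its own dictionary key, so the feature space grows quickly with n, which is worth keeping in mind when comparing training times.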

Decision Tree/SVM: takes longer to compute and gives poor accuracy

  • Naive Bayes ran in 30 seconds, but the other classifiers (DecisionTree and SVM) were still running after 5 minutes, even though the training set had been reduced from 90% to 10% of the featureset. Looking more at the theory of these classifiers, it makes sense that DecisionTree takes a long time: the “best feature” at each node has to be selected by computing information gain, and that computation is repeated at every node going down the tree. One solution is to reduce the training set size, but this might lead to underfitting as the classifier wouldn’t have been trained enough.
  • SVM finds the hyperplane that maximizes the margin to the support vectors (the points closest to the hyperplane). However, I’m still missing the connection to why it takes so long.
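
To make the decision tree cost concrete, here is a small sketch of information gain for presence features. It is a textbook-style illustration, not NLTK's DecisionTreeClassifier, and the four examples are invented. The point is that best_feature scans the whole vocabulary, and a tree learner repeats a scan like this at every node.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(featuresets, feature):
    """Entropy reduction from splitting on presence/absence of one feature."""
    labels = [label for _, label in featuresets]
    present = [label for feats, label in featuresets if feats.get(feature)]
    absent = [label for feats, label in featuresets if not feats.get(feature)]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (present, absent) if part)
    return entropy(labels) - remainder

def best_feature(featuresets):
    """Score every distinct feature; sorted() only makes ties deterministic."""
    vocabulary = {name for feats, _ in featuresets for name in feats}
    return max(sorted(vocabulary),
               key=lambda name: information_gain(featuresets, name))

# Invented examples: "please" separates the classes, "report" does not.
data = [
    ({"please": True, "report": True}, "downspeak"),
    ({"please": True, "meeting": True}, "downspeak"),
    ({"thanks": True, "report": True}, "upspeak"),
    ({"thanks": True, "meeting": True}, "upspeak"),
]
top = best_feature(data)
```

With thousands of distinct words (and many more bigrams or trigrams), each scan touches every feature and every remaining training example, which is one plausible reason the decision tree runs were so slow.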


Feature set / Classifier   Naive Bayes   SVM*     Decision Tree*
all_words                  0.8021        0.5121   0.6477
non_stopwords              0.8181        0.5225   0.6405
bigram_words               0.8307        0.5092   0.6375**
trigram_words              0.8213        0.5026   0.6412**

*training set was reduced from 90% to 10%

**still running even after 15 minutes so had to be terminated

From the data above, bigram_words with the Naive Bayes classifier performs best (0.8307) while trigram_words with SVM performs worst (0.5026). The results also suggest that, for text classification, Naive Bayes is fast and performs reasonably well.
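
One way to see why Naive Bayes trains quickly: training amounts to a single pass of counting. The class below is a minimal illustrative sketch over presence features with add-one smoothing; it is not NLTK's NaiveBayesClassifier, the name TinyNaiveBayes is made up, and the training examples are invented.

```python
import math
from collections import Counter, defaultdict

class TinyNaiveBayes:
    """Minimal Naive Bayes over presence features (illustrative only)."""

    def __init__(self, label_counts, feature_counts, vocabulary):
        self.label_counts = label_counts
        self.feature_counts = feature_counts
        self.vocabulary = vocabulary

    @classmethod
    def train(cls, featuresets):
        """One pass over the data, counting labels and (label, feature) pairs."""
        label_counts = Counter()
        feature_counts = defaultdict(Counter)
        vocabulary = set()
        for feats, label in featuresets:
            label_counts[label] += 1
            for name in feats:
                feature_counts[label][name] += 1
                vocabulary.add(name)
        return cls(label_counts, feature_counts, vocabulary)

    def classify(self, feats):
        """Pick the label with the highest log prior plus log likelihoods."""
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            score = math.log(count / total)
            for name in feats:
                if name in self.vocabulary:  # words never seen in training are ignored
                    # Add-one smoothing keeps unseen (label, word) pairs above zero.
                    seen = self.feature_counts[label][name]
                    score += math.log((seen + 1) / (count + 2))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Invented training examples in the (feature_dict, label) format.
train_set = [
    ({"please": True, "send": True}, "downspeak"),
    ({"please": True, "review": True}, "downspeak"),
    ({"thanks": True, "sir": True}, "upspeak"),
    ({"thanks": True, "review": True}, "upspeak"),
]
model = TinyNaiveBayes.train(train_set)
prediction = model.classify({"please": True})
```

Nothing here is iterative: there is no search over splits or margins, only counts and a few logarithms at prediction time.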


  • This NLTK tutorial was helpful until I got to the part of making my own corpus with two labels: pos and neg. After some good time spent, I eventually figured how to make a corpus but not how to assign “categories”. Googling around didn’t seem too helpful either.
  • This tutorial conflicted with the initial tutorial in the creation of feature set that a classifier expects. Even this bitbucket code where I got the bag_of_words functions, made things worse when trying to use their labelled feature set function. The different sources seemed to contradict each other so I burned several hours trying to figure out why I had bugs.
Ethical concerns
  • The findings in this study will be retracted if the original email senders feel that their privacy has been breached, although this is highly unlikely as the Enron email dataset has undergone multiple privacy protection processes.
  • Extending the methods applied here to personal emails should be done with care. Just because you can analyze your email does not mean your findings can be published. For instance, analysis of incoming emails could reveal word choices that a frequent sender would rather keep between you and them.
  • One simple rule of thumb is to fill out an Institutional Review Board (IRB) form. If that is not applicable to you, then discuss with the parties involved what data you want to collect, the forms of analysis involved, and what they allow you to share, if anything.
This is a combination of lessons learned from implementing code, utilizing libraries and reporting results:
  • When available, use qualitative methods to better understand results of quantitative analysis. This is useful for sanity checks and it is necessary regardless of your results. Qualitative methods could be asking your participants what they think about your research procedure or the results obtained, whether it agrees with your intuition and why.
  • Machine Learning is often common sense, even though the buzz can make it seem complicated. Okay, not entirely true: the mathematical theory still seems complicated for now.
  • Implementing algorithms helps to better understand the theory behind the work.
  • In making a classifier with NLTK, your training data should be a Python list of tuples, i.e. [(feature, label), (feature, label), (feature, label)]. Each tuple should be of the format (feature, label), with the feature as a Python dictionary: {feature_name: feature_value}.
  • Using python ML classifiers is easy once you get past formatting your dataset into features.
Limitations & Future Work
  • Extend the definition of “power dynamics” beyond organizational hierarchy to employee requests, especially in decentralized workplaces. For instance, if in your email you ask a favor from a fellow employee at a startup, then that email is upspeak. It doesn’t matter whether, hierarchically, you’re above the person, on the same level, or below.
  • Use minor words as features for training Machine Learning classifiers.
  • The distribution of the data clearly affects the Machine Learning performance. In this work, the dataset collected was sampled from a bigger email collection. However, there are no details about the sampling process. As such, the machine learning classifiers could be prone to overfitting and consequently perform poorly in prediction when new data is encountered.
  • k-fold cross-validation should be used to get a more reliable estimate of accuracy than a single train/test split.
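
A rough sketch of k-fold cross-validation is shown below. It assumes a train_fn callable that returns an object with a classify method (the same shape as NLTK classifiers), and it needs at least k examples so that no test fold is empty. AlwaysDownspeak is a made-up stand-in used only to exercise the loop, and the data is invented.

```python
def k_fold_splits(featuresets, k):
    """Yield (train, test) pairs; each example lands in exactly one test fold."""
    folds = [featuresets[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        yield train, test

def cross_validate(featuresets, train_fn, k=10):
    """Mean accuracy over k folds; requires len(featuresets) >= k."""
    scores = []
    for train, test in k_fold_splits(featuresets, k):
        classifier = train_fn(train)
        hits = sum(1 for feats, label in test
                   if classifier.classify(feats) == label)
        scores.append(hits / len(test))
    return sum(scores) / len(scores)

class AlwaysDownspeak:
    """Stand-in classifier (assumption) that ignores its input."""

    def classify(self, feats):
        return "downspeak"

# Invented data: 6 downspeak examples followed by 4 upspeak examples.
data = [({}, "downspeak")] * 6 + [({}, "upspeak")] * 4
score = cross_validate(data, lambda train: AlwaysDownspeak(), k=5)
```

Shuffling the data once before calling cross_validate is usually a good idea, so that the folds do not inherit whatever order the file happened to be in.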
Okay, I’ve read/skimmed all these, so what? How does this apply to me?
  • From a particular timeline (maybe 6 months), copy a set of your outgoing emails into a file and as you deem fit, label them upspeak or downspeak (same format as training.txt).
  • Save your file as “training.txt” in order to replace the current file.
  • python
  • If you are a little ambitious, tweak the code so that you have a different test file (instead of the current way of splitting training data into training and test).
  • If you did the previous step, then your prediction could give you simple insights about how much upspeak/downspeak you perform. You may start linking this to actual events in your life i.e. how you felt during the period of sending those labelled emails.
  • Hooray, you have built a hypothesis function that can classify any new outgoing email as upspeak or downspeak. Go forth and predict all over the world!

Tools used

  • R
  • Python
  • NLTK
  • Wordle


