Analysing WhatsApp Group Messages using Python

Benji Knights Johnson
Towards Data Science
6 min read · Oct 27, 2020

Who sends the most messages? What are the most common words used? What are the most common emojis used? When do people message most often? If you’ve ever wondered any of this about your WhatsApp groups with your mates, then this is the article for you. Find out with some (relatively) simple Python!

I am relatively new to Python, but this project was a really good introduction to data wrangling: I had to use a lot of different techniques and work out which one suited each step best.

Photo by Caspar Camille Rubin on Unsplash

Getting the Data

This was simpler than I expected; I never realised WhatsApp let you do this. On your phone, open the WhatsApp group and tap the three dots in the top right. Then go to ‘More’ and then ‘Export chat’. You can then choose where to save the export. I chose Google Drive and put it in the folder where I would be creating my Jupyter Notebook.

Setup

Just a little comment on the setup I went for. I used a Jupyter Notebook because this kind of analysis is done in small steps, and running it cell by cell works really well. Here are a couple of tips on settings that help in Jupyter.

I have explained what these do in the comments, but to explain why: by default, Jupyter only displays the last output of each cell. So if you want to output a graph of some data as well as a supporting table in one cell, you’ll need to run the first command. I use the other two to control how much is output. I don’t like that pandas defaults to so few rows and columns, especially when you can limit what is displayed with functions like .head() anyway, so I’ve just gone unlimited. But do be careful, when having a look at your data, to limit it when you need to.
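A minimal sketch of what those three settings might look like. The try/except wrapper is my own addition so the snippet also runs outside a notebook; it is not necessarily how the original cell was written.

```python
import pandas as pd

# 1) Show the output of every expression in a cell, not just the last one.
#    Wrapped in try/except (an assumption) so this also runs without IPython.
try:
    from IPython.core.interactiveshell import InteractiveShell
    InteractiveShell.ast_node_interactivity = "all"
except ImportError:
    pass

# 2) and 3) Remove the default caps on displayed rows and columns.
#    None means "no limit", so remember to use .head() on large frames.
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)
```

If the unlimited output gets out of hand, `pd.reset_option("display.max_rows")` restores the default.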

Wrangling the Data

Some notes on what I’ve done here. I like to have two sets of the data, leaving one as the raw data that you can compare to the wrangled dataset, hence the raw_data_csv and data.

Unfortunately, this data export does not come out perfectly from WhatsApp: when a message contains line breaks, the extra lines carry no date or sender, so it is very hard to attribute them to the correct message. I played around with some ways to do this, but couldn’t find anything simple. For the purpose of this analysis I’ve just removed those lines, as we can still get good enough insights without them. That is what the dropna on the datetime column is doing.

The second dropna on text_message is to get rid of some of the system messages like when people are added to the group.

I actually ran each of these sections in separate cells in Jupyter, and ran head, tail and info after each one to understand what the data looked like at that point.
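The wrangling steps described above can be sketched roughly as follows. The line layout (`DD/MM/YYYY, HH:MM - Sender: message`), the sample lines and the regular expression are assumptions for illustration; the export format varies by phone and locale, so the pattern will likely need adjusting. Only the names raw_data_csv, data, datetime and text_message come from the notes above.

```python
import io
import pandas as pd

# Hypothetical sample in one common export layout (an assumption).
sample_export = io.StringIO(
    "26/10/2020, 18:02 - Alex S: Training at 7 tonight\n"
    "26/10/2020, 18:05 - Benji: I'm in\n"
    "but might be late\n"                      # continuation of a multi-line message
    "26/10/2020, 18:09 - Alex S added Tom\n"   # system message, no 'Sender: text'
)

# Keep an untouched copy of the raw lines to compare against later.
raw_data_csv = pd.DataFrame({"raw": sample_export.read().splitlines()})

# Timestamp is required; 'Sender: text' is optional so system lines still match.
pattern = (
    r"^(?P<datetime>\d{1,2}/\d{1,2}/\d{4}, \d{1,2}:\d{2}) - "
    r"(?:(?P<sender>[^:]+): (?P<text_message>.*))?"
)
data = raw_data_csv["raw"].str.extract(pattern)
data["datetime"] = pd.to_datetime(
    data["datetime"], format="%d/%m/%Y, %H:%M", errors="coerce"
)

# First dropna: lines with no timestamp (continuation lines of multi-line messages).
data = data.dropna(subset=["datetime"])
# Second dropna: system messages, which have a timestamp but no message text.
data = data.dropna(subset=["text_message"]).reset_index(drop=True)
```

Calling `data.head()`, `data.tail()` and `data.info()` after each step is a quick way to check that the right rows are being dropped.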

Analysing the Data

Some WhatsApp Group Context

Some context for the next few sections: the WhatsApp group I used is for my (field) hockey team, so we are mainly talking about training, matches and the nights out we had. But you’ll see this shortly!

Most Popular Message Times

I thought a heatmap of Hour of Day against Day of Week would display this best. That was what I was most interested in, as other patterns were obvious, like month of year following when the hockey season runs. I constructed this by creating new columns for those two data points, and then creating a new data frame with the grouped counts of those dimensions.
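A minimal sketch of that construction, assuming a `datetime` column like the one from the wrangling step. The sample timestamps and the column names day_of_week, hour_of_day and message_count are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the wrangled data frame.
data = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2020-10-24 10:15", "2020-10-24 10:40", "2020-10-24 18:05",  # Saturday
        "2020-10-26 18:30",                                          # Monday
        "2020-10-28 19:10",                                          # Wednesday
    ]),
})

# New columns for the two dimensions of the heatmap.
data["day_of_week"] = data["datetime"].dt.day_name()
data["hour_of_day"] = data["datetime"].dt.hour

# New data frame with the grouped message counts.
grouped = (
    data.groupby(["day_of_week", "hour_of_day"])
    .size()
    .reset_index(name="message_count")
)

# Reshape to days x hours, in calendar order, with 0 where nobody messaged.
day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
heatmap_table = (
    grouped.pivot(index="day_of_week", columns="hour_of_day", values="message_count")
    .reindex(day_order)
    .fillna(0)
    .astype(int)
)
```

A table in this shape can be passed straight to a plotting function such as seaborn’s `heatmap`, which is one option for drawing the chart.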

Image by author

Can you guess which nights we have training on? Monday and Wednesday nights, yep. This will be messages saying people are in, out, or late for the session. Our matches are obviously on Saturday, which is where the bulk of our messages come in and will be mainly talking about traveling to the matches, and then post-game antics.

Messages by Sender

Next up is sender message counts, which I’ve chosen to display with a bar chart, like this:

Image by author

Displaying as a bar chart allows you to visually see the difference in numbers, as opposed to just a table of numbers. Alex S is the captain and so sends a lot of admin messages, and I am the social sec, trying to organise a bunch of guys to get out the house and do fun things, when they clearly don’t want to. Even I was surprised just how much more Alex and I message compared to the others, but we often don’t get any responses, so it does make sense.
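The counts behind a chart like this take very little code. A minimal sketch, assuming a `sender` column and using made-up sample rows:

```python
import pandas as pd

# Hypothetical sample; in practice this is the wrangled data frame.
data = pd.DataFrame({
    "sender": ["Alex S", "Benji", "Alex S", "Tom", "Benji", "Alex S"],
})

# value_counts() returns messages per sender, sorted from most to fewest.
sender_counts = data["sender"].value_counts()
```

In a notebook with matplotlib installed, `sender_counts.plot(kind="bar")` draws the bar chart from this series.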

Most Popular Words

Now on to popular words, which does take quite a bit of manipulation and time to run. Hopefully the comments explain most of what’s happening.
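A minimal sketch of one way to build such a word count, using only the standard library. The sample messages, the contents of non_words and the simple letters-only tokenisation are all assumptions; in the notebook the messages would come from the text_message column.

```python
import re
from collections import Counter

# Hypothetical sample messages.
messages = [
    "Training at 7 tonight",
    "Beers after training?",
    "Fitness session is on, beers to follow",
    "Who is in for training",
]

# Words to ignore; this list is illustrative and deliberately short.
non_words = {"at", "is", "on", "to", "in", "for", "who", "after"}

def count_words(messages, non_words):
    """Count lowercase words across all messages, skipping non_words."""
    counts = Counter()
    for message in messages:
        # Keep runs of letters and apostrophes; digits and punctuation are dropped.
        words = re.findall(r"[a-z']+", message.lower())
        counts.update(word for word in words if word not in non_words)
    return counts

word_counts = count_words(messages, non_words)
top_words = word_counts.most_common(2)
```

With a large chat this loop is the slow part, so it is worth running it once and reusing `word_counts` in later cells.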

The hardest part, and maybe the biggest caveat, is how to define the non_words. Ultimately I just went with words that I didn’t care about at all when they came out in the analysis. I probably could have gone further here, hence the quite uninteresting top words we have as a total group:

Image by author

So I thought what might be more interesting is picking out words and seeing how high they rank, like this:

fitness tying with beers is an entertaining “coincidence”… Image by author
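Looking up where chosen words rank could be sketched like this. The counts below are invented purely to show the mechanics, including how tied words share a rank; they are not the group’s real numbers.

```python
from collections import Counter

# Hypothetical word counts (made up for illustration).
word_counts = Counter({
    "training": 40, "match": 31, "beers": 12, "fitness": 12, "late": 9,
})

def competition_ranks(counts):
    """Rank words by count; words with equal counts share the same rank."""
    ranks = {}
    previous_count = None
    rank = 0
    for position, (word, count) in enumerate(counts.most_common(), start=1):
        if count != previous_count:
            # New, lower count: the rank jumps to the current position.
            rank = position
            previous_count = count
        ranks[word] = rank
    return ranks

ranks = competition_ranks(word_counts)

# Pick out the words of interest; .get() returns None for words never used.
picked = {word: ranks.get(word) for word in ["fitness", "beers", "late", "yoga"]}
```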

Another thing I thought would be interesting is word counts by sender. Unfortunately, we don’t have a huge volume of messages/words, so segmenting down to anyone other than Alex S and me gives very little data, but here we go anyway:

Image by author
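Splitting the word counts by sender only needs one counter per person. A minimal sketch with hypothetical (sender, message) pairs and a hypothetical non_words set:

```python
import re
from collections import Counter, defaultdict

# Hypothetical (sender, message) pairs.
messages = [
    ("Alex S", "Training at 7 tonight"),
    ("Benji", "Beers after training?"),
    ("Alex S", "Match fee due, training as normal"),
]
non_words = {"at", "as", "after"}

# One Counter per sender, created on first use.
words_by_sender = defaultdict(Counter)
for sender, text in messages:
    words = re.findall(r"[a-z']+", text.lower())
    words_by_sender[sender].update(w for w in words if w not in non_words)

# Top word for each sender.
top_word_by_sender = {
    sender: counts.most_common(1)[0][0]
    for sender, counts in words_by_sender.items()
}
```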

Most Popular Emojis

The final element I wanted to add is emoji use. We are not the craziest of emoji users, so again there is not too much data. It is also very hard to define exactly what counts as an emoji, so there are some odd ones in there that I believe correspond to the gender and skin tone modifiers attached to an emoji, but here we go:

Image by author
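A minimal sketch of counting emojis with the standard library. The code point ranges are a rough assumption, not a complete definition of an emoji. Counting one code point at a time also reproduces the oddity mentioned above: skin tone modifiers (and gender signs) show up as entries of their own.

```python
import re
from collections import Counter

# Rough, assumed ranges; real emoji detection is more involved than this.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, skin tone modifiers
    "\u2600-\u27BF"          # misc symbols and dingbats, incl. gender signs
    "]"
)

# Hypothetical sample messages, written with escapes for portability.
messages = [
    "Great win \U0001F44D\U0001F3FD",        # thumbs up + medium skin tone
    "\U0001F602\U0001F602 see you at training",  # two 'tears of joy' faces
    "In \U0001F44D",                         # plain thumbs up
]

emoji_counts = Counter()
for message in messages:
    # findall returns each matching code point separately.
    emoji_counts.update(EMOJI_PATTERN.findall(message))

top_emojis = emoji_counts.most_common()
```

Here the thumbs up is counted twice and the skin tone modifier once, as a separate “emoji”.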

Well, that ties it up. I couldn’t think of much more to analyse for this dataset, but I would love to hear suggestions. Also, if anyone has any suggestions for the code, do let me know, as I am still quite new to Python and feel there are definitely better ways of doing some parts.

Enjoy.
