15 minute read

I’ve been having some days! Between sick kids, myself getting sick, deadlines and development promises, I haven’t had a lot of time to make any kind of blog post. I have been following along with my #tidytuesday analysis and visualization though, so this week you get a double feature!

First up is last weeks Australian tax data. It was interesting to see other approaches to visualizations and what people presented, and a lot of people represented the data as straight up job salary and the inequity between men and women. In reality it’s a bit more murky than that, as that data was actually taxable income (which includes investments, independent work, etc.) so some the scale of differences in certain jobs are greatly magnified. This doesn’t mean that inequity doesn’t exist. It DEFINITELY does. What this does mean is that there needs to be further investigation, digging and cleaning to get to an analysis that is highly accurate. It also means that visualizations using coarse data are useful at highlighting the existence of a problem or pattern.

The first thing I did was explore the data to get a sense of how big the equity gap was. I tidied up the data, and classified jobs by dividing male taxable income by female taxable income and visualized that distribution as a waffle plot. This shows the jobs that men report more income in blue, and the jobs that women report more income in pink.

I decided to show the income difference in positions through a simple scatter plot, plotting each job as income earned by women vs income earned by men. I highlighted the jobs where women have greater income than men to illustrate the relative size of this equity problem. I also wanted people to able to explore the data, and provide them the underlying information in each job. Providing that level of detail, even simply labeling the job would put far too much clutter on the plot and obscure any patterns. This is usually a problem that can easily be handled by adding interactivity to the plot, in this case creating a custom tooltip that provides the underlying information on demand. There are packages to add this in R quite readily, mainly from the htmlwidgets family, and my favourite of those is ggiraph which provides interactive versions of ggplot geoms that can be used by all means of javascript. If you’re interested in this, go check out the package on github. There has been a pile written on the use of interactivity and interactive visualizations, but my favourite overview has to be in Andy Kirk’s (@vvisualisingdata) Data Visualization: A Handbook for Data Driven Design

The week 5 data is census data from the US government. This is another charged dataset that raised considerable discussion on twitter about perpetuating stereotypes, which sparked a discussion in the #tidytuesday channel in the R4DS slack. This has lead to the people behind #tidytuesday to take a good hard look about the datasets that are used, and try to pick ones that are relatively benign, yet interesting and meaningful enough. It also raises the learning opportunity that data carries bias, and it can either be something that comes from things like design (in case of the census), methodology, or the personal biases of who was working on the project. There’s a lot to learn about this from the social sciences, which I hope won’t neglected in data science education.

My take on week 5 was influenced by four things:

  1. My daily commute
  2. Popular music overplayed on the radio
  3. How I Met Your Mother
  4. The Martian by Andy Weir

My commute is about 25-30 minutes and is on a route that isn’t that bad. I listen to music (like nearly everyone else) during my commute. A few days ago, while skipping through stations, I must have heard the song “Despacito” 5 times that day. That song is popular, and very very very overplayed.

Enter How I Met Your Mother.

Marshall’s Fiero in the show is his beloved car, inherited from his brothers. Only the tape player worked in the car, and it had the single for the Proclaimers “I’m Gonna Be (500 Miles)” stuck in it. It played it on a loop. Over and over again. ad infinitum.

This raised the question in my mind. “How much would I like my commute if I had to listen to Despacito on repeat for the entire time”. Then when exploring the census data, I saw that it had a measure of “Average Commute Time”. I decided to normalize the commuting times of the citizens of the US, and plotting how many Despactios long commutes were across the country as a choropleth map. Yes, this is silly. Yes this is just scaling actual commute time to something arbitrary. This is where The Martian came in. There’s a point in the book when he’s trying to figure out something and the units get so complicated he just makes up his own, “A pirate ninja”.

Here’s my submission for week 5. It’s going to be legen……wait for it…


comments powered by Disqus