depression, school, and life: exploratory data analysis with personal data


I started this as a simple data-exploration exercise, and as a giddy excuse to explore a dataset relevant to my life: not trees, climate, insects, or fire, but a personal story. In doing so, I felt it necessary to provide an introduction, which got a bit out of hand:

I graduated with a BSc. in Physical Geography from Texas State University in 2012. It took me 5 ½ years to get that degree — partly a mix of generational malaise, partly depression, partly uncertainty about my future.

(I’m still uncertain, and probably always will be.)

A year into my degree, my mom was diagnosed with ALS. An English major at the time, I was just dipping my toes into the upper-level literature classes when it happened, and everything got really fucked up from there in ways too complex to cover briefly in a post about data visualization.

From 2008 to 2010, I coasted through classes, changed majors multiple times, and took on the classic “Cs get degrees” mentality, all while not really wanting a degree in anything in particular, or really even wanting to do anything in particular.

There’s a scene in the 2016 documentary Gleason, about the eponymous athlete who was diagnosed with ALS, that really stuck out to me, that had me, someone used to hiding my emotions from absolutely everyone at every moment, sobbing uncontrollably so hard we had to pause the movie: Steve and Michel are lying in bed, the only sound is the oxygen machine keeping Steve alive, and Steve starts asking with that robotic voice if Michel hates him.

My parents and siblings, taken Oct. 2008 not long after my mom’s diagnosis.

For two years, this conversation was every day for me and my family. Oh, that’s not to say we had this conversation every day (though we certainly had it many times), but we went through every day knowing we were behaving as Michel was behaving, and knowing the question would eventually, painfully, and slowly come around once again: We lived in dread and shame.

If it’s not already obvious, I’m not comfortable talking about it, and would rather move on now to the data exercises. I apologize in advance for the glaring tonal shift.

working with personal data

I work with a lot of data these days. As I’m pretty new to being comfortable with data analysis, I’m always hunting for small, personal projects to add to my experiences. Having recently ordered my college transcripts — also knowing and understanding my academic performance took a tremendous hit from 2008 to 2010 — I realized I had an easy, small dataset to explore with data visualization.

When translated from raw transcript paperwork to a .csv, the head() and tail() of my academic career look something like this when pulled into R:

year  season  studentYear  department  courseID  grade  semester  program
2007  Fall    Freshman     ANTH        1312      B      1         Core Curriculum
2007  Fall    Freshman     ENG         1310      C      1         Core Curriculum
2007  Fall    Freshman     HIST        1310      A      1         Core Curriculum
2015  Spring  Graduate II  ESCI        597       A      19        Master’s Geography
2016  Winter  Graduate II  ENVS        690       S      20        Master’s Geography
2016  Spring  Graduate II  ENVS        690       S      21        Master’s Geography

There are a number of other variables here, but I’m interested in how my grades were ultimately affected by years-long stress and depression, namely:

  • Did my performance change over time?
  • Does the season tend to affect my grades?
  • Does the course complexity — i.e., the first number of the courseID variable — relate to my grades?
  • Does the program I was pursuing — Core Curriculum, English, Geography, Geology, and Master's Geography — affect anything?
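The course-complexity question relies on pulling the leading digit out of courseID. A minimal sketch in base R, using the six course numbers shown above; course_level() is a hypothetical helper name, not something from the original script:

```r
# Course numbers from the six rows shown above, kept as text so the
# leading digit is easy to slice off.
course_ids <- c("1312", "1310", "1310", "597", "690", "690")

# course_level() is a hypothetical helper (an assumption for illustration):
# it returns the first digit of each course number as an integer.
course_level <- function(course_id) {
  as.integer(substr(as.character(course_id), 1, 1))
}

levels_found <- course_level(course_ids)
print(levels_found)
```

Going through as.character() first means the helper works whether courseID was read in as text or as a number.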

Before I get too far, there is some behind-the-scenes tidying I’ve done to this dataset, like removing Ws, replacing Ss with As, &c.
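That tidying might look roughly like the sketch below. It is not the original script: the W row is a made-up example added only to show the filter, and the letter-to-number mapping (A = 4 down to F = 0) is an assumption about how the gpa column used in the plots was built.

```r
# Grades and semesters from the six rows shown earlier, plus one
# hypothetical withdrawal ("W") so the filter has something to remove.
transcripts_raw <- data.frame(
  semester = c(1, 1, 1, 19, 20, 21, 21),
  grade    = c("B", "C", "A", "A", "S", "S", "W"),
  stringsAsFactors = FALSE
)

# Drop withdrawals: they carry no grade information.
tidy <- transcripts_raw[transcripts_raw$grade != "W", ]

# Treat a satisfactory (S) mark as an A.
tidy$grade[tidy$grade == "S"] <- "A"

# Assumed 4-point scale; the real mapping may differ.
grade_points <- c(A = 4, B = 3, C = 2, D = 1, F = 0)
tidy$gpa <- unname(grade_points[tidy$grade])

print(tidy)
```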

Now, since we don’t have a lot of observations per group (e.g., studentYear), or much range between observations using a 1-4 GPA scale, boxplots probably won’t tell us much:

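library(ggplot2)

# One boxplot of every GPA value, with the individual courses jittered on top
ggplot(transcripts, aes(x = 1, y = gpa)) +
  geom_boxplot(fill = "grey70", alpha = 0.7) +
  theme_bw() +
  ylim(0, 4) +
  ylab("GPA") + xlab("") +
  scale_x_continuous(labels = NULL) +  # the x position carries no meaning here
  geom_jitter(aes(color = studentYear), alpha = 0.3) +
  scale_color_discrete(name = "School year")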


Dividing into different factor levels makes this borderline unreadable, so it might be better to view a quick bar chart. As the most significant pattern is readable from a count of As grouped by studentYear, I’ve filtered out everything else. (All grades can be found here.)

library(dplyr)    # filter()
library(ggplot2)

# Count of A grades per school year
ggplot(filter(transcripts, grade == "A"), aes(x = studentYear, fill = studentYear)) +
  geom_bar(stat = "count") +
  theme_bw() + xlab("School year") + ylab("Number of aced courses") +
  ggtitle("A+ University Grades!!") +
  scale_fill_discrete(guide = "none") +  # the fill legend would only repeat the x-axis
  scale_x_discrete(expand = c(0.01, 0)) +
  scale_y_continuous(expand = c(0.01, 0), breaks = c(2, 4, 6, 8, 10))


It’s still a little hard to read, but we can see a definite pattern nestled in those As re: Sophomore and Junior years. Faceting by studentYear is another option, but I rather like viewing the proportional relationship, which we get by simply adding position = "fill" to the geom_bar() piece of ggplot2 code.

Color combination brought to you by the wesanderson library. 🙂

The precipitous drop in grades while I helped take care of my mom is unmistakable here, as well as the slow return.
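A proportional version of the bar chart might be sketched like this. It is a guess at the approach rather than the original code: plot_grade_proportions() is a hypothetical helper, and it assumes all grades are kept with fill mapped to grade, since a chart filtered to As alone would fill every bar to 100%. It uses ggplot2's default palette, not the wesanderson colors.

```r
library(ggplot2)

# plot_grade_proportions() is a hypothetical helper (an assumption for
# illustration). It expects a data frame with the studentYear and grade
# columns shown in the table earlier.
plot_grade_proportions <- function(transcripts) {
  ggplot(transcripts, aes(x = studentYear, fill = grade)) +
    # position = "fill" stacks the grades and rescales each bar to a height
    # of 1, so bars show each grade's share of courses, not raw counts.
    geom_bar(position = "fill") +
    theme_bw() +
    xlab("School year") + ylab("Proportion of courses") +
    scale_fill_discrete(name = "Grade")
}
```

Calling plot_grade_proportions(transcripts) returns a ggplot object that can be printed or extended with further layers.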

This isn’t a dataset that I can take beyond EDA in any meaningful way: It’s too small and missing too many possible explanatory variables. However, inching towards that with some basic scatterplots is another way to visualize underlying temporal patterns: A simple linear relationship shows a steady increase in grades over time, but showing the linear relationship for each individual program I took part in reveals a series of clashing patterns.
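The two scatterplot views described above (one overall linear fit, and one fit per program) could be built along these lines. This is only a sketch: plot_linear_trends() is a hypothetical helper, and it assumes the data frame has the semester, gpa, and program columns used elsewhere in this post.

```r
library(ggplot2)

# plot_linear_trends() is a hypothetical helper (an assumption for
# illustration). It returns two plots built on the same scatterplot.
plot_linear_trends <- function(transcripts) {
  base <- ggplot(transcripts, aes(x = semester, y = gpa)) +
    geom_jitter(alpha = 0.5) +
    theme_bw() + ylim(0, 4.5) +
    xlab("Semester") + ylab("GPA")

  list(
    # A single straight-line fit through every course
    overall = base +
      geom_smooth(method = "lm", se = FALSE, color = "grey50"),
    # A separate straight-line fit for each program
    by_program = base +
      geom_smooth(aes(color = program), method = "lm", se = FALSE)
  )
}
```

Printing the overall and by_program elements one after the other makes the contrast between the single upward trend and the per-program fits easy to see.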

Since the non-linear temporal relationship should be sticking out like a sore thumb by now, we can use ggplot2’s local regression options to illustrate this relationship:

ggplot(transcripts, aes(x = semester, y = gpa, color = program)) +
  geom_jitter() +
  geom_smooth(se = FALSE, method = "loess") +  # one loess curve per program
  theme_bw() + ylab("GPA") + xlab("Semester") + ggtitle("TSU & WWU Grades") +
  scale_color_discrete(name = "Program") +
  ylim(0, 4.5) +
  # faint grey loess trend line
  geom_line(method = "loess", stat = "smooth", color = "grey50", alpha = 0.2, size = 1.2)

The additional Program details tell an underlying story. While there aren’t a lot of observations within the English degree, I still performed better in these classes, in general, during 2008 – 2010, because they offered an escape. (Regardless, I became disillusioned with the degree, and a hike in the woods in 2009 convinced me to dig into the geology and biogeography classes offered by the Physical Geography program.)

I’ve about hit a wall in the type of data exploration I’m interested in. Offscreen, I’ve tried a few more combinations such as the influence of season, but the effects were not strong enough to warrant including here.

I haven’t explored personal data much before, but it’s been fun revisiting school memories — even if the whole purpose of this post is using a proxy measurement for the effect of prolonged stress and depression on a life.
