Cohort analysis with R

Cohort analysis is a very powerful tool when it comes to analyzing different groups of users over time. I won’t talk much about theory in this post but rather show two real-world examples and its implementation with R.

Retention curves

The most prevalent usage of cohort analysis in the startup ecosystem is retention curves. Those guys show whether or not the product has traction (aka product-market fit).

Here is an example of one of SaaS products I used to work with

Even a year after signing up, a bunch of very first users (brownish ones) are still using it. That’s a good indicator of traction.

Retention curves might also be useful when analyzing longer timeframes. Here is an example of the above-mentioned SaaS product with a few extra months of data

There is a lot going on, but the trend is still clear.

Plotting retention curves with R

First, you need to transform the data into long-form — a table where each row represents one cohort in one month. I explain how to do that in the last section of this post.

When you first try to plot retention curves from your long-form data, more likely than not you’ll get a wired graph like this

To fix it, add aditional rows with zeros representing each cohort a month before it was born

for (cohort in unique(df$Cohort)) {
  df <- df %>%
    rbind(list(as_date(cohort), 0, as_date(cohort) - months(1), 0))

So that you have

It will smooth sharp angels. Now you’re ready to plot it with geom_area()

df %>%
  ggplot(aes(Month, MAU, group = Cohort, fill = Cohort)) +
  geom_area(position = position_stack(reverse = T)) +
  labs(title = 'Retention curves', y = 'Cumulative MAU') +
  scale_fill_brewer(type = 'div') +
  theme(text = element_text(family="Segoe UI"), axis.text.x = element_text(angle = -45))

Tile plots

Sometimes you want to focus less on the trend and more on the numbers itself. Especially, when it comes to percentages. In those cases, a tile plot would generally be a better choice.

Here is an example from my previous post showing Return on Equity of a loan company over time

Plotting tile plot with R

Again, make sure the data is in long-form. Than, use geom_tile() and a text layer geom_text() to make a tile plot

ggplot(filter(result, product == 'Product2'), aes(factor(month), factor(format(cohort, '%Y-%m')), fill = roe)) + 
  geom_tile() +
  geom_text(aes(label = ifelse(, '', sprintf('%.0f%%', roe * 100)))) +
  scale_fill_gradient2(low = 'red', high = 'white', midpoint = 0, labels = scales::percent) +
  labs(title = 'Return on equity by cohort, Product2', y = 'cohort', x = 'months after loan opening', fill = 'ROE') +
  theme(text=element_text(family="Segoe UI"))

How to convert data into long-form

Before making any cohort analysis, you need to transform your data into long-form — a table where each row represents one cohort in one month.

In simple case when data is already aggregated and joined properly, it can be done using gather() function. Otherwise, you should use loops.

In the previous example, the data wasn’t aggregated and stored in different dataframes, that look like this

So I used for loop to aggregate and transform it into long-form:

I iterated through unique cohorts, products and months calculating cumulative lending and payment volume, and then put it together in a dataframe:

cohorts <- double()
products <- character()
months <- integer()
lendings <- double()
payments_v <- double()

for (cohort in unique(floor_date(total_transactions_agg$open_date, 'month'))) {
  for (curr_product in c('Product1', 'Product2')) {
    i <- 0
    while(as_datetime(cohort) + months(i) <= max(floor_date(total_transactions_agg$open_date, 'month'))) {
      curr_lendings <- total_transactions_agg %>%
        filter(floor_date(open_date, 'month') == as_datetime(cohort) & product == curr_product)
      curr_payments <- total_transactions %>%
        filter(floor_date(open_date, 'month') == as_datetime(cohort) & product == curr_product & 
                 floor_date(payment_date, 'month') <= as_datetime(cohort) + months(i))
      cohorts <- c(cohorts, cohort)
      products <- c(products, curr_product)
      months <- c(months, i)
      lendings <- c(lendings, sum(curr_lendings$amount_lent))
      payments_v <- c(payments_v, sum(curr_payments$amount_paid, na.rm = T))
      i <- i + 1

result <- tibble(cohort = as_datetime(cohorts), product = products, month = months, total_lent = lendings, total_paid = payments_v) %>%
  mutate(roe = (total_paid - total_lent) / total_lent)

Once done, you’re ready to plot it.

 1892   2019   R