# Cohort analysis with R

*Cohort analysis is a very powerful tool when it comes to analyzing different groups of users over time. I won’t talk much about theory in this post but rather show two real-world examples and its implementation with R.*

## Retention curves

The most prevalent usage of cohort analysis in the startup ecosystem is retention curves. Those guys show whether or not the product has traction (aka product-market fit).

Here is an example of one of SaaS products I used to work with

Even a year after signing up, a bunch of very first users (brownish ones) are still using it. That’s a good indicator of traction.

Retention curves might also be useful when analyzing longer timeframes. Here is an example of the above-mentioned SaaS product with a few extra months of data

There is a lot going on, but the trend is still clear.

### Plotting retention curves with R

First, you need to transform the data into long-form — a table where each row represents one cohort in one month. I explain how to do that in the last section of this post.

When you first try to plot retention curves from your long-form data, more likely than not you’ll get a wired graph like this

To fix it, add aditional rows with zeros representing each cohort a month before it was born

```
for (cohort in unique(df$Cohort)) {
df <- df %>%
rbind(list(as_date(cohort), 0, as_date(cohort) - months(1), 0))
}
```

So that you have

It will smooth sharp angels. Now you’re ready to plot it with *geom_area()*

```
df %>%
ggplot(aes(Month, MAU, group = Cohort, fill = Cohort)) +
geom_area(position = position_stack(reverse = T)) +
labs(title = 'Retention curves', y = 'Cumulative MAU') +
scale_fill_brewer(type = 'div') +
theme(text = element_text(family="Segoe UI"), axis.text.x = element_text(angle = -45))
```

## Tile plots

Sometimes you want to focus less on the trend and more on the numbers itself. Especially, when it comes to percentages. In those cases, a tile plot would generally be a better choice.

Here is an example from my previous post showing Return on Equity of a loan company over time

### Plotting tile plot with R

Again, make sure the data is in long-form. Than, use *geom_tile()* and a text layer *geom_text()* to make a tile plot

```
ggplot(filter(result, product == 'Product2'), aes(factor(month), factor(format(cohort, '%Y-%m')), fill = roe)) +
geom_tile() +
geom_text(aes(label = ifelse(is.na(roe), '', sprintf('%.0f%%', roe * 100)))) +
scale_fill_gradient2(low = 'red', high = 'white', midpoint = 0, labels = scales::percent) +
labs(title = 'Return on equity by cohort, Product2', y = 'cohort', x = 'months after loan opening', fill = 'ROE') +
theme(text=element_text(family="Segoe UI"))
```

## How to convert data into long-form

Before making any cohort analysis, you need to transform your data into long-form — a table where each row represents one cohort in one month.

In simple case when data is already aggregated and joined properly, it can be done using *gather()* function. Otherwise, you should use loops.

In the previous example, the data wasn’t aggregated and stored in different dataframes, that look like this

So I used *for* loop to aggregate and transform it into long-form:

I iterated through unique cohorts, products and months calculating cumulative lending and payment volume, and then put it together in a dataframe:

```
cohorts <- double()
products <- character()
months <- integer()
lendings <- double()
payments_v <- double()
for (cohort in unique(floor_date(total_transactions_agg$open_date, 'month'))) {
for (curr_product in c('Product1', 'Product2')) {
i <- 0
while(as_datetime(cohort) + months(i) <= max(floor_date(total_transactions_agg$open_date, 'month'))) {
curr_lendings <- total_transactions_agg %>%
filter(floor_date(open_date, 'month') == as_datetime(cohort) & product == curr_product)
curr_payments <- total_transactions %>%
filter(floor_date(open_date, 'month') == as_datetime(cohort) & product == curr_product &
floor_date(payment_date, 'month') <= as_datetime(cohort) + months(i))
cohorts <- c(cohorts, cohort)
products <- c(products, curr_product)
months <- c(months, i)
lendings <- c(lendings, sum(curr_lendings$amount_lent))
payments_v <- c(payments_v, sum(curr_payments$amount_paid, na.rm = T))
i <- i + 1
}
}
}
result <- tibble(cohort = as_datetime(cohorts), product = products, month = months, total_lent = lendings, total_paid = payments_v) %>%
mutate(roe = (total_paid - total_lent) / total_lent)
```

Once done, you’re ready to plot it.