4 posts tagged R

How to get daily Google Trends data for any period with R

Recently, I needed some seven years of daily Google Trends data. It turned out that, by default, it’s not possible to get it either through the web interface or via the API. So I wrote a tiny script that pulls daily Google Trends data for any period using the gtrendsR package.

What’s the problem with Google Trends?

Google Trends returns data at daily granularity only if the timeframe is shorter than 9 months. If the timeframe is between 9 months and 5 years, you’ll get weekly data, and if it’s longer than 5 years – monthly data.
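You can check this behaviour yourself (a quick sketch, assuming the gtrendsR package; the exact dates are arbitrary):

library(gtrendsR)

# timeframe under 9 months: daily points
d <- gtrends('Taylor Swift', time = '2019-01-01 2019-06-30')$interest_over_time
# timeframe over 5 years: monthly points
m <- gtrends('Taylor Swift', time = '2013-01-01 2019-06-30')$interest_over_time

diff(head(d$date, 2))  # ~1 day apart
diff(head(m$date, 2))  # ~1 month apart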

A trivial solution like querying the data month by month and then stitching it together won’t work here, because Google Trends reports interest in relative values within the given time period. For a given keyword, every month is scaled identically – with a local minimum of 0 and a local maximum of 100 – even if one month had twice as many searches as the other.
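To illustrate with made-up numbers, suppose January really had twice February’s search volume:

# hypothetical raw daily search volumes for two months
jan <- c(20, 100, 200)
feb <- c(10, 50, 100)
rescale <- function(x) round(100 * x / max(x))
rescale(jan)  # 10  50 100
rescale(feb)  # 10  50 100 – identical, although feb had half the volume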

Querying Google Trends daily data properly

To get proper daily estimates, I do the following:

  1. Query daily estimates for each month in the specified timeframe;
  2. Query monthly data for the whole timeframe;
  3. Multiply the daily estimates for each month from step 1 by that month’s weight from step 2.

Here is the R code:

library(gtrendsR)
library(tidyverse)
library(lubridate)

get_daily_gtrend <- function(keyword = 'Taylor Swift', geo = 'UA', from = '2013-01-01', to = '2019-08-15') {
  if (ymd(to) >= floor_date(Sys.Date(), 'month')) {
    to <- floor_date(ymd(to), 'month') - days(1)
    
    if (to < ymd(from)) {
      stop("Specifying 'to' date in the current month is not allowed")
    }
  }

  # monthly data for the whole timeframe gives each month's relative weight
  mult_m <- gtrends(keyword = keyword, geo = geo, time = paste(from, to))$interest_over_time %>%
    group_by(month = floor_date(date, 'month')) %>%
    summarise(hits = sum(hits)) %>%
    mutate(ym = format(month, '%Y-%m'),
           mult = hits / max(hits)) %>%
    select(month, ym, mult) %>%
    as_tibble()
  
  # start (s) and end (e) dates of every month in the timeframe
  pm <- tibble(s = seq(ymd(from), ymd(to), by = 'month'),
               e = seq(ymd(from), ymd(to), by = 'month') + months(1) - days(1))
  
  raw_trends_m <- tibble()
  
  # query daily data month by month
  for (i in seq_len(nrow(pm))) {
    curr <- gtrends(keyword, geo = geo, time = paste(pm$s[i], pm$e[i]))
    print(paste('for', pm$s[i], pm$e[i], 'retrieved', nrow(curr$interest_over_time), 'days of data'))
    raw_trends_m <- rbind(raw_trends_m, curr$interest_over_time)
  }
  
  trend_m <- raw_trends_m %>%
    select(date, hits) %>%
    mutate(ym = format(date, '%Y-%m')) %>%
    as_tibble()
  
  # scale each month's daily estimates by that month's weight
  trend_res <- trend_m %>%
    left_join(mult_m, by = 'ym') %>%
    mutate(est_hits = hits * mult) %>%
    select(date, est_hits) %>%
    as_tibble() %>%
    mutate(date = as.Date(date))
  
  return(trend_res)
}

get_daily_gtrend(keyword = 'Taylor Swift', geo = 'UA', from = '2013-01-01', to = '2019-08-15')

The get_daily_gtrend function returns a tibble with the daily trend. Now you can plot it nicely or use it in further analysis.
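For example, a minimal plotting sketch (assuming the libraries loaded above):

daily_trend <- get_daily_gtrend(keyword = 'Taylor Swift', geo = 'UA', from = '2013-01-01', to = '2019-08-15')

ggplot(daily_trend, aes(date, est_hits)) +
  geom_line() +
  labs(title = 'Daily Google Trends estimates', x = NULL, y = 'estimated hits')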


Recursive functions in R

Besides product management and growth, sometimes I also write short technical posts (particularly about R), where I share solutions to non-trivial tasks I’ve encountered or just useful pieces of code.

In this post:

  1. How to write a recursive function in R, and
  2. How to apply this function group-wise to a data frame.

The context

Our orders table is designed so that a single order may be split into several entities with numerical suffixes and character prefixes. It’s done this way because (1) different products have different manufacturing and shipping times, and (2) customers may add or change items shortly after placing an order.

It turned out, though, that not all salespeople were using it as expected: when a customer came back months later, instead of creating a new order, they just added an incremental suffix to the existing order.

From an analytical standpoint, this made no sense: a single order could be spread over months (or even years). So before doing any analysis, I first had to group orders by order id and then, within those groups, gather together orders that happened within 7 days of each other.

While the first part of the task was trivial with a regex, the second one required iterating through groups of orders to properly match them by order date. That’s where a recursive function comes in handy.

Writing a recursive function in R

Let’s say we have a group of orders with the same order id as input. Now we have to gather the orders that happened within 7 days of each other. Here is the algorithm:

  1. Find the order with the minimal order date;
  2. Find all orders whose order date falls within the 7-day period after that minimal order date;
  3. Mark them as a single order;
  4. Repeat steps 1–3 with the rest of the orders in the given group.

And here is the R function that does it using recursion:

library(dplyr)
library(lubridate)

# x is a DF containing orders with the same order id
group_orders <- function(x) {
  curr_min_date <- min(x$date)
  curr_order_num <- filter(x, date == curr_min_date)$order_number[1]
  
  # DFs with orders before and after the current min_order_date + 7 days
  before_min_date <- mutate(filter(x, date <= curr_min_date + days(7)), g_order_number = curr_order_num)
  after_min_date <- filter(x, date > curr_min_date + days(7))
  
  # recursive call
  if (nrow(after_min_date) == 0)
    return(before_min_date)
  else
    return(rbind(before_min_date, group_orders(after_min_date)))
}

The function above takes a data frame of orders with the same order id, groups them by order date, and returns the initial data frame plus a g_order_number column, which represents the order number after grouping. Now we need to apply it to the initial data frame group-wise.
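To see it in action on a toy data frame (hypothetical data, assuming dplyr and lubridate are loaded as above):

x <- tibble(order_number = c('A-1', 'A-2', 'A-3'),
            date = ymd(c('2019-01-01', '2019-01-05', '2019-03-01')))
group_orders(x)
# the two January orders get g_order_number 'A-1';
# the March order starts a new group ('A-3')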

Applying a function by groups

Now the only thing left is to group orders by order id (I extract it into the order_number_gen column using a regex) and apply the function above to each group:

orders_grouped <- orders %>%
  filter(op_sum > 10) %>%
  mutate(order_number_gen = str_extract(order_number, '(?<=\\-)\\d+')) %>%
  group_by(order_number_gen) %>%
  group_modify(~ group_orders(.x)) %>%  # returns a data frame, unlike current group_map()
  ungroup() %>%
  group_by(customer_id, g_order_number) %>%
  summarise(order_date = min(date),
            order_amount = sum(op_sum))

I also filtered out orders with an amount less than 10 (those orders represent free gifts to our customers).

To apply a function group-wise, I used group_modify, which was recently added to the dplyr library and makes the process above pretty straightforward. (This behaviour originally shipped in dplyr 0.8.0 under the name group_map; in current dplyr, group_map returns a list of data frames instead.)
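The difference matters; a quick illustration on the built-in mtcars data:

mtcars %>% group_by(cyl) %>% group_map(~ head(.x, 1))     # a list of tibbles
mtcars %>% group_by(cyl) %>% group_modify(~ head(.x, 1))  # one grouped data frame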


What are the most popular couch fabrics? Visualizing product matrix with R

Pufetto, a Ukrainian furniture brand, has an online customization tool where you can build your dream couch by changing its size and fabric. But which fabrics are the most attractive? I picked a bunch of popular couches and looked at which fabrics people clicked the most. Below I show how to visualize it with R.

Building a product heatmap

The idea was to build a heatmap showing the most clicked fabrics for each couch. To make the plot less messy, I included only fabrics that received a minimum required number of clicks. Here is what I got:
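Code-wise, that first iteration boiled down to something like this (a minimal sketch; the threshold of 30 clicks is a made-up example, as the original cut-off isn’t given):

fabric_clicks %>%
  count(couch_name, fabric) %>%
  filter(n >= 30) %>%  # hypothetical minimum-clicks threshold
  ggplot(aes(couch_name, fabric, fill = n)) +
  geom_tile()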

Looks interesting, but not readable, and definitely not actionable. I decided to make the following improvements:

  1. Limit the number of fabrics to the top 5 per couch;
  2. Use relative ranking instead of the absolute number of clicks, to make fabrics comparable across couches;
  3. Sort fabrics and couches by popularity;
  4. Add meaningful axis labels.

Here is the final iteration

Looks much better!

Here is the R code behind it:

# fabric_clicks is a DF of clicks: couch_name, fabric, and date
fabric_clicks %>%
  filter(date >= ymd('2019-01-01')) %>%
  count(couch_name, fabric) %>%
  group_by(couch_name) %>%
  mutate(rank = rank(desc(n)), total_couch_clicks = sum(n)) %>%
  filter(rank <= 5) %>% arrange(couch_name, rank) %>%
  group_by(fabric) %>%
  mutate(avg_fabric_rank = mean(rank)) %>%
  ggplot(aes(fct_reorder(couch_name, desc(total_couch_clicks)), fct_reorder(fabric, desc(avg_fabric_rank)), fill = rank)) +
  geom_tile() +
  geom_text(aes(label = ifelse(is.na(rank), '', rank)), color = 'white') +
  scale_fill_continuous(guide = guide_colourbar(reverse = TRUE), low = "#5EB7F8", high = "#1A334B") +
  labs(title = 'Top 5 clickable fabrics by couches', subtitle = "2019's average",
       fill = 'rank', x = 'couch\n ← more popular    less popular →', y = 'fabric by avg rank') +
  theme(axis.text.x = element_text(angle = -45))

Conclusion

This was supposed to be a purely technical, R-devoted post, so don’t expect interpretation of the results here (although it seems pretty straightforward – all the most-clicked fabrics are both the cheapest and the most visible ones). Remember that often (1) less is more and (2) order matters.


Cohort analysis with R

Cohort analysis is a very powerful tool when it comes to analyzing different groups of users over time. I won’t talk much about theory in this post but rather show two real-world examples and their implementation in R.

Retention curves

The most prevalent use of cohort analysis in the startup ecosystem is retention curves. They show whether or not a product has traction (aka product-market fit).

Here is an example from one of the SaaS products I used to work with:

Even a year after signing up, a bunch of the very first users (the brownish ones) are still using the product. That’s a good indicator of traction.

Retention curves might also be useful when analyzing longer timeframes. Here is an example of the above-mentioned SaaS product with a few extra months of data

There is a lot going on, but the trend is still clear.

Plotting retention curves with R

First, you need to transform the data into long-form — a table where each row represents one cohort in one month. I explain how to do that in the last section of this post.

When you first try to plot retention curves from your long-form data, more likely than not you’ll get a weird graph like this:

To fix it, add additional rows with zeros, placing each cohort one month before it was born:

# add a zero row dated one month before each cohort's birth
# (the list elements must follow df's column order)
for (cohort in unique(df$Cohort)) {
  df <- df %>%
    rbind(list(as_date(cohort), 0, as_date(cohort) - months(1), 0))
}

So that you have

This smooths the sharp angles. Now you’re ready to plot it with geom_area():

df %>%
  ggplot(aes(Month, MAU, group = Cohort, fill = Cohort)) +
  geom_area(position = position_stack(reverse = T)) +
  labs(title = 'Retention curves', y = 'Cumulative MAU') +
  scale_fill_brewer(type = 'div') +
  theme(text = element_text(family="Segoe UI"), axis.text.x = element_text(angle = -45))

Tile plots

Sometimes you want to focus less on the trend and more on the numbers themselves, especially when it comes to percentages. In those cases, a tile plot is generally a better choice.

Here is an example from my previous post showing Return on Equity of a loan company over time

Plotting a tile plot with R

Again, make sure the data is in long-form. Then use geom_tile() plus a text layer, geom_text(), to make a tile plot:

ggplot(filter(result, product == 'Product2'), aes(factor(month), factor(format(cohort, '%Y-%m')), fill = roe)) + 
  geom_tile() +
  geom_text(aes(label = ifelse(is.na(roe), '', sprintf('%.0f%%', roe * 100)))) +
  scale_fill_gradient2(low = 'red', high = 'white', midpoint = 0, labels = scales::percent) +
  labs(title = 'Return on equity by cohort, Product2', y = 'cohort', x = 'months after loan opening', fill = 'ROE') +
  theme(text=element_text(family="Segoe UI"))

How to convert data into long-form

Before making any cohort analysis, you need to transform your data into long-form — a table where each row represents one cohort in one month.

In a simple case, when the data is already aggregated and joined properly, it can be done with the gather() function (or its successor, pivot_longer()). Otherwise, you’ll need loops.
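For the simple case, a minimal sketch (the wide table here is hypothetical, assuming the tidyverse and lubridate are loaded):

# hypothetical wide table: one row per cohort, one column per month
wide <- tibble(Cohort = ymd(c('2019-01-01', '2019-02-01')),
               m1 = c(100, 120),
               m2 = c(80, 90))
wide %>%
  gather(month, MAU, -Cohort)  # one row per cohort per month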

In the previous example, the data wasn’t aggregated and was stored in separate data frames that looked like this:

So I used a for loop to aggregate and transform it into long-form: iterating through unique cohorts, products, and months, calculating cumulative lending and payment volumes, and then putting everything together in a data frame:

# accumulators for the long-form table
cohorts <- double()
products <- character()
months <- integer()
lendings <- double()
payments_v <- double()

# note: for() strips the Date class, hence the as_datetime(cohort) calls below
for (cohort in unique(floor_date(total_transactions_agg$open_date, 'month'))) {
  for (curr_product in c('Product1', 'Product2')) {
    i <- 0
    while(as_datetime(cohort) + months(i) <= max(floor_date(total_transactions_agg$open_date, 'month'))) {
      
      curr_lendings <- total_transactions_agg %>%
        filter(floor_date(open_date, 'month') == as_datetime(cohort) & product == curr_product)
      
      curr_payments <- total_transactions %>%
        filter(floor_date(open_date, 'month') == as_datetime(cohort) & product == curr_product & 
                 floor_date(payment_date, 'month') <= as_datetime(cohort) + months(i))
      
      cohorts <- c(cohorts, cohort)
      products <- c(products, curr_product)
      months <- c(months, i)
      lendings <- c(lendings, sum(curr_lendings$amount_lent))
      payments_v <- c(payments_v, sum(curr_payments$amount_paid, na.rm = T))
      
      i <- i + 1
    }
  }
}

result <- tibble(cohort = as_datetime(cohorts), product = products, month = months, total_lent = lendings, total_paid = payments_v) %>%
  mutate(roe = (total_paid - total_lent) / total_lent)

Once done, you’re ready to plot it.
