Alex Dyachenko

I help to create better digital products using data and analytics. Currently I’m a Product Analyst at LetyShops. Prior to that, I worked at SimilarWeb and Bitrix24.

How to get daily Google Trends data for any period with R

Recently, I needed some seven years of Google Trends daily data. It turned out that by default it’s not possible to get it neither through the web interface nor via API. So I wrote a tiny script that pulls daily Google Trends data for any period using gtrendsR package

What’s the problem with Google Trends?

Google Trends returns data in daily granularity only if the timeframe is less than 9 months. If the timeframe is between 9 months and 5 years, you’ll get weekly data, and if it’s longer than 5 years – you’ll get monthly data.

A trivial solution like querying the data month by month and then tieing it together won’t work in this case, because Google Trends assess interest in relative values within the given time period. It means that for a given keyword and month, Google Trend will estimate interest identically – with a local minimum of 0 and a local maximum of 100 – event in one month it had twice as many searches than in the other.

Querying Google Trend daily data properly

To get proper daily estimates, I do the following:

  1. Query daily estimates for each month in the specified timeframe;
  2. Queries monthly data for the whole timeframe;
  3. Multiply daily estimates for each month from step 1 by its weight from step 2.

Here is the R code:

library(gtrendsR)
library(tidyverse)
library(lubridate)

get_daily_gtrend <- function(keyword = 'Taylor Swift', geo = 'UA', from = '2013-01-01', to = '2019-08-15') {
  if (ymd(to) >= floor_date(Sys.Date(), 'month')) {
    to <- floor_date(ymd(to), 'month') - days(1)
    
    if (to < from) {
      stop("Specifying \'to\' date in the current month is not allowed")
    }
  }

  mult_m <- gtrends(keyword = keyword, geo = geo, time = paste(from, to))$interest_over_time %>%
    group_by(month = floor_date(date, 'month')) %>%
    summarise(hits = sum(hits)) %>%
    mutate(ym = format(month, '%Y-%m'),
           mult = hits / max(hits)) %>%
    select(month, ym, mult) %>%
    as_tibble()
  
  pm <- tibble(s = seq(ymd(from), ymd(to), by = 'month'), 
               e = seq(ymd(from), ymd(to), by = 'month') + months(1) - days(1))
  
  raw_trends_m <- tibble()
  
  for (i in seq(1, nrow(pm), 1)) {
    curr <- gtrends(keyword, geo = geo, time = paste(pm$s[i], pm$e[i]))
    print(paste('for', pm$s[i], pm$e[i], 'retrieved', count(curr$interest_over_time), 'days of data'))
    raw_trends_m<- rbind(raw_trends_m,
                         curr$interest_over_time)
  }
  
  trend_m <- raw_trends_m %>%
    select(date, hits) %>%
    mutate(ym = format(date, '%Y-%m')) %>%
    as_tibble()
  
  trend_res <- trend_m %>%
    left_join(mult_m, by = 'ym') %>%
    mutate(est_hits = hits * mult) %>%
    select(date, est_hits) %>%
    as_tibble() %>%
    mutate(date = as.Date(date))
  
  return(trend_res)
}

get_daily_gtrend(keyword = 'Taylor Swift', geo = 'UA', from = '2013-01-01', to = '2019-08-15')

get_daily_gtrend function should return a tibble with daily trend. Now you can plot it nicely or use in some analysis

 No comments    196   2019   R

Recursive functions in R

Besides product management and growth, sometimes I also write short technical posts (particularly, about R), where I share solutions to non-trivial tasks I’ve encountered or just useful pieces of code.

In this post:

  1. How to write a recursive function in R, and
  2. How to apply this function group-wise to a data frame.

The context

Our orders table is designed in a way that a single order may be split into different entities with numerical suffixes and character prefixes. It’s done so because different products have different manufacturing and shipping time (1) and customers may add or change items shortly after placing an order (2).

It turned out though, that not all salespeople have been using it as expected: when a customer was coming back months later, instead of creating a new order, they just added an incremental suffix to the existing order.

From the analytical standpoint, it made no sense: a single order may be spread over months (or even years). So before doing any analysis, I first had to group orders by the order id and then, within those groups, gather orders that happened within 7 days together

While the first part of the task was trivial using regex, the second one required iterating through groups of orders to properly match them by order date. That’s where a recursive function becomes handy.

Writing a recursive function in R

Let’s say we have a group of orders with a similar order id as an input. Now we have to gather orders, that happened during 7 days from each other. Here is the algorithm:

  1. Find an order with the minimal order date;
  2. Find all orders with order date that falls within 7 days period since the minimal order date;
  3. Mark them as a single order;
  4. Repeat steps 1 – 3 with the rest of orders within the given group.

And here is the R function that does it using recursion:

# x is a DF containing orders with the similar order id
group_orders <- function(x) {
  curr_min_date <- min(x$date)
  curr_order_num <- filter(x, date == curr_min_date)$order_number[1]
  
  # DFs with orders before and after the current min_order_date + 7 days
  before_min_date <- mutate(filter(x, date <= curr_min_date + days(7)), g_order_number = curr_order_num)
  after_min_date <- filter(x, date > curr_min_date + days(7))
  
  # recursive call
  if(count(after_min_date) == 0)
    return(before_min_date)
  else
    return(rbind(before_min_date, group_orders(after_min_date)))
}

The function above takes a data frame of orders with the similar order id, groupes them by order date, and return the initial data frame + a g_order_number column which represents order number after grouping. Now we need to apply it to the initial data frame group-wise.

Applying a function by groups

Now the only thing left is to group orders by order id (I extract it to order_number_gen column using regexp) and apply the function above to each group

orders_grouped <- orders %>%
  filter(op_sum > 10) %>%
  mutate(order_number_gen = str_extract(order_number, '(?<=\\-)\\d+')) %>%
  group_by(order_number_gen) %>%
  group_map(~ group_orders(.x)) %>%
  ungroup() %>%
  group_by(customer_id, g_order_number) %>%
  summarise(order_date = min(date),
            order_amount = sum(op_sum))

I also filtered out orders that have amount less than 10 (those orders represent free gifts to our customers).

To apply a function group-wise, I used group_map, that was recently added to the dplyr library and makes the process above pretty straightforward.

 No comments    241   2019   R

Why metrics like Facebook’s “7 friend in 10 days” can be tricky

Say, you’ve discovered that your customers, on average, spend on the website 2 minutes longer than noncustomers. What’s your first thought? “Let’s make everybody else spend 2 minutes more and we’ll increase the number of customers”, right? Well, not necessarily.

Correlation vs causation

Here is another example from Intercom’s handbook (which I highly recommend):

“Let’s imagine a ride sharing app, where all customers who complete two trips in the first 30 days after signup are 40% more likely to retain in month two, compared with the customers who take just one or no trips. [...] In this case you’d get laser focused on encouraging customers to take 2 rides in their first 30 days.”
The Growth Handbook, Intercom, page 24

Or a Facebook’s early growth metric:

“After all the testing, all the iterating, you know what the single biggest thing we realized? Get any individual to 7 friends in 10 days. That was it.”
Chamath Palihapitiya, Facebook’s first VP of Growth

But here is a question: why is that 2 extra minutes on the websites caused users to buy, and not the opposite — users, that were already interested in purchase just spent 2 extra minutes shopping?

Or that 2 completed rides caused users to become active — and not that active users completed 2 rides just because they like the app?

It’s common to confuse correlation with causation — think that A causes B whereas A and B just happen simultaneously, regardless of each other. Or maybe because some C causes both A and B, who knows.

There is even a website showcasing funny correlation, like divorce rate in Maine vs per capita consumption of margarine:

Back to our example: generally speaking, we don’t know what caused what, so we can’t say that making users spent 2 more minutes on the website will turn them into customers.

Recall vs precision

Following up on the ride sharing app example:

“...all customers who complete two trips in the first 30 days after signup are 40% more likely to retain in month two, compared with the customers who take just one or no trips...”

Let’s say I’m trying to find out what differentiate customers from noncustomers during week one on my website.

I assumed that two metrics could be the differentiators: number of 5-minute-long sessions (x-axis) and number of product page views (y-axis). Here is how customers and noncustomers are distributed

There are clearly much more customers (blue dots) than noncustomers with at least one 5-minute-long sessions during week one.

After doing the math, I concluded the following:

“Users that had at least one 5-minute-long session during week one are 2.5 times more likely to become customers, that those who didn’t”

Sounds like a useful insight except for the fact that 50% of the customers have had exactly zero 5-minute-long sessions — and yet they’ve eventually become customers.

How many users who haven’t completed 2 rides during the first 30 days returned next month? How many users who haven’t added 7 friends in 10 days eventually became active? Nobody ever heard about it.

Statements like that only tell how good the metric identifies customers within users who’ve completed it (it’s called precision), but they never mention how many customers were cut off (it’s called recall).

It’s easy to make a metric sound better than it actually is if you don’t look at both precision and recall.

What metric to look at

The framework above is helpful to think about and compare different metrics. In the real world, however, it’s rarely possible to find a metric that perfectly causes the result while having high recall and precision. The good news is: the real world rarely needs it.

Think about Facebook’s “7 friends in 10 days”, for example. Having an individual added 7 friends is a proxy for experiencing the core value of Facebook — the network of people.

Or Pinterest’s “Save” button. Casey Winters — ex. Lead of growth team at Pinterest — explains why they’ve chosen it as a key metric:

“People derive value from Pinterest in different ways, from browsing lots of images to saving images to clicking through to the source of content. Eventually, we settled on saving (pinning an image to your board), because, while people can get value from browsing or clicking through on something, we weren’t sure if it was satisfying. You only save things if you like them.”
Casey Winters, Why Onboarding is the Most Crucial Part of Your Growth Strategy

The best metrics always incorporate fundamental value of the product.

 250   2019  

Spot customers early: aha moments, predictive models, and beyond

Usually, marketers want to know how their acquisition channels are performing as soon as possible. But if it takes a couple of months for a user to convert, it might be a good idea to come up with a model predicting potential customers early on.

In this post, I share my experience building a predictive model, explain why it didn’t work as expected, and how we’re going to bypass it in the future.

Some numbers were changed or hidden due to privacy concerns.

Background

To begin with, Renetti is an online-to-offline (O2O) business, which means that a customer can buy either online or in-store.

If a customer buys in-store (as the majority do), we try to tie their offline purchase to the online behavior using a bunch of matching point.

Sometimes, when a customer uses different devices before and after the purchase, we may have a weird situation when a customer’s first visit date is greater than the order date (which is usually not true)

Obviously, I included only properly tracked customers.

Hypothesis

As shown in the histogram above, the purchase decision may take months. Since we can’t wait that long to see how our ads are doing, I decided to find a quick way to identify potential customers early on — an insight like “If during week 1 a user has viewed 3+ products, he’ll become a customer with 70% probability”.

Aha moment

The initial idea was to find a micro-conversion (aka “Aha moment”) — a single event, that would differentiate potential customers from noncustomers. Some kind of Facebook’s “7 friends in 10 days”.

Here is what I did:

  1. Picked a list of website events indicating level of engagement;
  2. Segmented customers by the delay between first visit and first order. Most likely, a person that is going to buy 3 months from now will behave differently than the one that is going to buy tomorrow;
  3. Checked week 1 behaviour across those segments.

It turned out, they behaved differently: the closer a person to purchase, the more active she is on the website. Here are some examples (swipe to see more)

However, when I tried to separate customers from noncustomers by a certain threshold using only one event, I got, well, not super accurate result.

The best filter was able to identify 60% of the customers with 1% precision — meaning, only 1 out of 100 converted users would eventually become a customer.

The difference in the behavior during week 1 wasn’t big enough given the disproportion between those segments: customers constituted only a tiny fraction of all the website visitors.

But I decided to go further and try some machine learning.

Predictive model

First, I picked the most relevant features to train the model on.

The quick way to look at how different website events correlate with each other is to use a pair plot. Here is a piece of mine (I added more features later on):

I won’t go deep on the ML things here (I’m not a data scientist, after all), just say that I tried a bunch of different algorithms and dataset’s makeups (as I said, the initial dataset was very imbalanced).

After a quick and dirty trial and error process, I ended up with the following result for different algorithms

For example, Random Forest (rf) was able to detect 80% of customers with 20% precision.

Unfortunately, this level of precision was possible only on the balanced dataset (customers = noncustomers). When I tried it on the real, highly disproportional data, the accuracy dropped to just 2% — meaning, 98 of 100 conversions would be False Positive.

Here is a good way to visualize it

Just to make a point, if the number of customers was equal to noncustomers, the model would be much more accurate

Conclusion

In our case, it wasn’t possible to accurately predict customers during the first week, mainly because of two reasons:

  1. The difference in behavior wasn’t big enough;
  2. Segments were too disproportional.

So what’s next? Well, if we can’t find a micro-conversion — we’ll create one. It will be an event, that falls in the middle of the interest-friction spectrum

But it would be a subject for another post.

 No comments    184   2019  

What you should know about bounce rate in Google Analytics

A lot of people think that a bounce in Google Analytics is a one-page-long session. It’s not quite true. A bounce is a one-hit-long session, which means that if a user triggers any event during his single-page session, it won’t be considered as a bounce. Below I show why it’s good to keep it in mind.

Bounce rate inflation

A couple of days ago we noticed a huge surge of bounce rate on the website — 15% to 85% overnight. All the bounced traffic was landing on the product pages, which haven’t been changed for a while. Neither were our marketing campaigns

After checking all the links, buttons, pictures, and other elements on the product page, we made a guess that it had to do with the page load time.

However, it wasn’t the case — there was no correlation with the bounce rate surge

Eventually, it turned out that just a couple of days ago the JS code on the product page stopped sending user interaction events to Google Analytics. You can see how red and yellow lines suddenly turned to zero on May 7th

Although users behavior hasn’t actually changed, from Google Analytics point of view, it has. I call it bounce rate inflation.

An alternative way to prove that the product page’s UX hasn’t changed is to look at the exit rate

As you can see, it’s still the same.

Conclusion

Bounce rate ≠ percentage of single-page sessions. It is determined by the events your site is sending to Google Analytics. Which means that you can make it ridiculously low by sending frequent events, like page scrolls. It is called bounce rate inflation. Be aware of that when comparing bounce rates of different websites.

 No comments    147   2019  

What are the most popular couch fabrics? Visualizing product matrix with R

Pufetto — a Ukrainian furniture brand — has an online customization tool where you can build your dream couch changing its size and fabric. But what fabrics are the most attractive? I picked a bunch of popular couches and looked at which fabrics people clicked the most. Below I show how to visualize it with R.

Building a product heatmap

The idea was to build a heatmap showing most clickable fabrics by couches. To make the plot less messy, I included only fabrics that received a minimum required number of clicks. Here is what I got

Looks interesting, but not readable, and definitely not actionable. I decided to improve the following:

  1. Limit number of fabrics to top 5 per couch
  2. Use relative ranking instead of absolute number of clicks to make fabrics differentiate across couches
  3. Sort fabrics and couches by popularity
  4. Add meaningful axis labels

Here is the final iteration

Looks much better!

Here is the R code behind it:

# fabric_clicks  is a DF of couch names, fabrics, and dates
fabric_clicks %>%
  filter(date >= ymd('2019-01-01')) %>%
  count(couch_name, fabric) %>%
  group_by(couch_name) %>%
  mutate(rank = rank(desc(n)), total_couch_clicks = sum(n)) %>%
  filter(rank <= 5) %>% arrange(couch_name, rank) %>%
  group_by(fabric) %>%
  mutate(avg_fabric_rank = mean(rank)) %>%
  ggplot(aes(fct_reorder(couch_name, desc(total_couch_clicks)), fct_reorder(fabric, desc(avg_fabric_rank)), fill = rank)) +
  geom_tile() +
  geom_text(aes(label = ifelse(is.na(rank), '', rank)), color = 'white') +
  scale_fill_continuous(guide=guide_colourbar(reverse = T), low="#5EB7F8", high="#1A334B") + 
  labs(title = 'Top 5 clickable fabrics by couches', subtitle = '2019`s average', 
       fill = 'rank', x = 'couch\n ← more popular    less popular →', y = 'fabric by avg rank') +
  theme(axis.text.x = element_text(angle = -45))

Conclusion

It supposed to be a solely technical R-devoted post, so don’t expect result interpretation here (although it seems pretty straightforward — all bestclicking fabrics are both the cheapest and the most visible ones). Remember, that often (1) less is more and (2) order matters.

 No comments    117   2019   R

Cohort analysis with R

Cohort analysis is a very powerful tool when it comes to analyzing different groups of users over time. I won’t talk much about theory in this post but rather show two real-world examples and its implementation with R.

Retention curves

The most prevalent usage of cohort analysis in the startup ecosystem is retention curves. Those guys show whether or not the product has traction (aka product-market fit).

Here is an example of one of SaaS products I used to work with

Even a year after signing up, a bunch of very first users (brownish ones) are still using it. That’s a good indicator of traction.

Retention curves might also be useful when analyzing longer timeframes. Here is an example of the above-mentioned SaaS product with a few extra months of data

There is a lot going on, but the trend is still clear.

Plotting retention curves with R

First, you need to transform the data into long-form — a table where each row represents one cohort in one month. I explain how to do that in the last section of this post.

When you first try to plot retention curves from your long-form data, more likely than not you’ll get a wired graph like this

To fix it, add aditional rows with zeros representing each cohort a month before it was born

for (cohort in unique(df$Cohort)) {
  df <- df %>%
    rbind(list(as_date(cohort), 0, as_date(cohort) - months(1), 0))
}

So that you have

It will smooth sharp angels. Now you’re ready to plot it with geom_area()

df %>%
  ggplot(aes(Month, MAU, group = Cohort, fill = Cohort)) +
  geom_area(position = position_stack(reverse = T)) +
  labs(title = 'Retention curves', y = 'Cumulative MAU') +
  scale_fill_brewer(type = 'div') +
  theme(text = element_text(family="Segoe UI"), axis.text.x = element_text(angle = -45))

Tile plots

Sometimes you want to focus less on the trend and more on the numbers itself. Especially, when it comes to percentages. In those cases, a tile plot would generally be a better choice.

Here is an example from my previous post showing Return on Equity of a loan company over time

Plotting tile plot with R

Again, make sure the data is in long-form. Than, use geom_tile() and a text layer geom_text() to make a tile plot

ggplot(filter(result, product == 'Product2'), aes(factor(month), factor(format(cohort, '%Y-%m')), fill = roe)) + 
  geom_tile() +
  geom_text(aes(label = ifelse(is.na(roe), '', sprintf('%.0f%%', roe * 100)))) +
  scale_fill_gradient2(low = 'red', high = 'white', midpoint = 0, labels = scales::percent) +
  labs(title = 'Return on equity by cohort, Product2', y = 'cohort', x = 'months after loan opening', fill = 'ROE') +
  theme(text=element_text(family="Segoe UI"))

How to convert data into long-form

Before making any cohort analysis, you need to transform your data into long-form — a table where each row represents one cohort in one month.

In simple case when data is already aggregated and joined properly, it can be done using gather() function. Otherwise, you should use loops.

In the previous example, the data wasn’t aggregated and stored in different dataframes, that look like this

So I used for loop to aggregate and transform it into long-form:

I iterated through unique cohorts, products and months calculating cumulative lending and payment volume, and then put it together in a dataframe:

cohorts <- double()
products <- character()
months <- integer()
lendings <- double()
payments_v <- double()

for (cohort in unique(floor_date(total_transactions_agg$open_date, 'month'))) {
  for (curr_product in c('Product1', 'Product2')) {
    i <- 0
    while(as_datetime(cohort) + months(i) <= max(floor_date(total_transactions_agg$open_date, 'month'))) {
      
      curr_lendings <- total_transactions_agg %>%
        filter(floor_date(open_date, 'month') == as_datetime(cohort) & product == curr_product)
      
      curr_payments <- total_transactions %>%
        filter(floor_date(open_date, 'month') == as_datetime(cohort) & product == curr_product & 
                 floor_date(payment_date, 'month') <= as_datetime(cohort) + months(i))
      
      cohorts <- c(cohorts, cohort)
      products <- c(products, curr_product)
      months <- c(months, i)
      lendings <- c(lendings, sum(curr_lendings$amount_lent))
      payments_v <- c(payments_v, sum(curr_payments$amount_paid, na.rm = T))
      
      i <- i + 1
    }
  }
}

result <- tibble(cohort = as_datetime(cohorts), product = products, month = months, total_lent = lendings, total_paid = payments_v) %>%
  mutate(roe = (total_paid - total_lent) / total_lent)

Once done, you’re ready to plot it.

 No comments    519   2019   R

Analyzing short-term loan company performance

This is a test assignment I’ve done a while ago. The company provides short-term consumer loan products through a mobile app. I was asked to analyze the data and show performance of the company in terms of volumes, delinquency, and profitability.

Loans dynamics and the most frequent lean term

The company made its first loan at the end of October 2017 and since then has been constantly increasing its numbers

Note, that there is only a half of June 2018 data (last data point is 2018-06-15), so seemingly the trend will keep climbing.

The most popular loan term is 30 days, following by 20, 15, and 10 days, respectively

Usually, it takes only one payment to repay the whole loan

Volumes and profitability

Now, let’s get to the nitty-gritty and see how well the company performs in terms of loan and payment volume

Two things to notice on this graph:

  1. Client selection is getting better: in November 2017, only around half of loan volume has been repaid, while in April 2018 payments volume has exceeded loan volume leaving a profit margin of 7% for Product 2.
  2. Product 1 is still unprofitable, even though the statement above.

Another way to look at profitability is to analyze Return On Equity (ROE) by cohort (swipe right to see Product2’s ROE)

It’s clear that the biggest jump happens during the first two months, and then ROE curve flattens.

“But what about May and June 2018?” — one may ask. As I mentioned before, the last data point in this dataset is 2018-06-15. Given that the most frequent loan term is 30 days, all those folks who borrowed money after 15th of May are yet to repay it.

Delinquency

As mentioned above, usually it takes only one payment to repay the whole loan. So it’s reasonable to report a loan as in delinquency if a borrower missed the first payment (long-term lenders, for instance, typically do not report a loan as in delinquency until the borrower has missed two consecutive payments).

To get a grasp of how many loans are yet to be repaid and how many are already delinquent (or default), take a look at the graph below

Here, unknown loans are those with the expected close date later than 2018-06-15, meaning we don’t know yet whether it will be repaid on time or delinquent.

Finally, let’s take a look at delinquency rate itself

As expected, premium Product 2 beats Product 1, although both have decent downwards trend.

 No comments    117   2019  

Analyzing Facebook campaigns performance with R

This is a test assignment I’ve done a while ago. The client is a media company producing healthcare and fitness apps. During summer 2018, they were running Facebook campaigns to increase their apps installs worldwide.

Given the 3-month historical data, and approximate financial metrics (ARPU = $4 and ARPPU = $50), I was asked to analyze campaigns’ performance and give some recommendation on which campaigns / ads / countries should be scaled and which should be stopped to increase ROI.

Helicopter view on the campaigns’ performance

Summer marketing activities were successful: for $3.4M we got 1.1M installs which led to $4.5M in revenue, which is a 30% return on investment.

To analyze campaigns’ performance, let’s look at them in two dimensions: efficiency and scalability:

The closer a campaign to the upper-right corner, the more scalable it is and the better return on investment it yields. Conversely, the closer a campaign to the bottom-right corner, the more money it loses.

We can see that two campaigns — vik-vid-23 and vik-vid-WW — are clearly standing out: with a total spend of over $1.1M they were able to yield over 80% return on investment.

If we take a look at the vik-vid-WW’ ads performance, we’ll see the same 80-20 pattern: there are 3 best-performing ads (one of them is paused for some reason) generating majority of revenue:

The second category of campaigns are those with total spend between 50K and 300K:

Here we can see a bunch of campaigns that are losing money. For example, rud – vid – tests old (the rightest one), has spent almost 300K with ROI of -6.5%.

Among rud – vid – tests old’s ads, there are two barely profitable ones, and four that constantly losing over 25% of its cost:

Those ads should be stopped.

Finally, there are smaller campaigns which spend less than $50K:

Campaigns with negative ROI should be further investigated.

CPA vs ARPPU analysis

Another way to look at campaigns’ efficiency is to compare its CPA to ARPPU:

Campaigns with CPA > ARPPU should be further investigated.

Campaigns’ performance by country

Finally, we can take a look at performance by countries:

We can see that the top 4 countries by marketing budget — US, GB, CA, and AU — have slightly negative ROI. 88% of the countries are profitable.

It doesn’t mean we should stop advertising to those top 4 countries though.

Different countries have different CPI and ARPU. In the calculation, we consider different CPIs, but imply the same ARPU for all countries (which is not true).

That’s why countries with low customer acquisition cost seem so profitable here, and conversely, countries with high acquisition cost seem to be losing money.

 No comments    114   2019