<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Blogboard Journal]]></title><description><![CDATA[Lessons from top engineering blogs]]></description><link>https://blogboard.io/blog/</link><image><url>https://blogboard.io/blog/favicon.png</url><title>Blogboard Journal</title><link>https://blogboard.io/blog/</link></image><generator>Ghost 4.48</generator><lastBuildDate>Thu, 16 Apr 2026 18:00:56 GMT</lastBuildDate><atom:link href="https://blogboard.io/blog/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[A/B Testing for Practical Significance]]></title><description><![CDATA[When doing statistical hypothesis testing, the math behind it gives us the toolset to determine the statistical significance of our observations. But if we're doing a two-sample test on a simple hypothesis, rejecting it won't tell us anything about the magnitude of the difference.]]></description><link>https://blogboard.io/blog/ab-testing-practical-significance/</link><guid isPermaLink="false">64d77ed00ff01149e74e1d39</guid><category><![CDATA[probability]]></category><category><![CDATA[statistics]]></category><dc:creator><![CDATA[Drazen Zaric]]></dc:creator><pubDate>Wed, 07 Apr 2021 13:30:00 GMT</pubDate><content:encoded><![CDATA[<p>When doing statistical hypothesis testing, the math behind it gives us the toolset to determine the <em>statistical significance</em> of our observations. But if we&apos;re doing a two-sample test on a simple hypothesis, e.g. $H_{0}:\mu_{1}=\mu_{2}$ vs. 
$H_{1}:\mu_{1}\ne\mu_{2}$, rejecting it won&apos;t tell us anything about the magnitude of the difference.</p><p>Usually, aside from making sure that the difference in measurements you observed is statistically significant, you want your observed differences to be <em>practically</em> significant as well. That means you don&apos;t want to test simply whether your statistics (e.g. averages) differ; you want them to differ by some margin, $\epsilon$. The size of the margin depends on the application - if you want to increase a click-through rate, you probably have a clearly specified business goal, something like <em>Increase CTR by 10 percentage points</em>.</p><p>So how do we test for a margin of difference? We&apos;ll need to run what&apos;s called a <em>composite hypothesis test</em>.</p><p>It&apos;s a fun exercise in the fundamentals of hypothesis testing to derive the formulas behind it, so let&apos;s first derive the test for the case of a single sample, comparing its mean to a constant. Then we&apos;ll modify it into a two-sample test, enabling us to compare the averages of two samples.</p><h2 id="one-sample-test">One-sample test</h2><p>Let&apos;s assume our data points $X_{1},X_{2},...,X_{n}$ come as iid observations from some unknown distribution, with mean $\mu$ and variance $\sigma^2$. 
We want to test with the following null and alternative hypotheses</p><p>\[ \begin{eqnarray*} H_{0}:\left|\mu-\mu_{0}\right| &amp; \le &amp; \epsilon\\ H_{1}:\left|\mu-\mu_{0}\right| &amp; &gt; &amp; \epsilon \end{eqnarray*} \]</p><p>where $\epsilon &gt; 0$ is a constant quantifying our desired <em>practical significance</em>.</p><p>Unrolling this, we get</p><p>$H_{0}:(\mu\ge\mu_{0}-\epsilon)\ and\ (\mu\le\mu_{0}+\epsilon)$<br>$H_{1}:(\mu&lt;\mu_{0}-\epsilon)\ or\ (\mu&gt;\mu_{0}+\epsilon)$</p><p>which should make it obvious why it&apos;s called a composite hypothesis test.</p><p>As usual, we&apos;ll partition the parameter space $\Theta$ into two subspaces, corresponding to parameter spaces for $H_0$ and $H_1$:</p><p>$\Theta_{0}=\left\{ \mu\in\mathbb{R}\mid(\mu\ge\mu_{0}-\epsilon)\ and\ (\mu\le\mu_{0}+\epsilon)\right\} $<br>$\Theta_{1}=\left\{ \mu\in\mathbb{R}\mid(\mu&lt;\mu_{0}-\epsilon)\ or\ (\mu&gt;\mu_{0}+\epsilon)\right\} $</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blogboard.io/blog/content/images/2024/04/plot1.png" class="kg-image" alt loading="lazy" width="1033" height="620" srcset="https://blogboard.io/blog/content/images/size/w600/2024/04/plot1.png 600w, https://blogboard.io/blog/content/images/size/w1000/2024/04/plot1.png 1000w, https://blogboard.io/blog/content/images/2024/04/plot1.png 1033w" sizes="(min-width: 720px) 720px"><figcaption>Parameter space partitioning into accept and reject regions</figcaption></figure><p>Since we want to test the mean, we&apos;ll use the sample average as the estimate for $\mu$:</p><p>\[<br>\hat{\mu}=\bar{X}_{n}=\frac{1}{n}\sum_{i=1}^{n}X_{i}<br>\]</p><p>The central limit theorem tells us that as $n$ grows, the standardized sample average converges in distribution to a normal random variable:<br></p><p>\[<br> n\rightarrow\infty:\ \sqrt{n}(\bar{X}_{n}-\mu)\overset{i.d}{\rightarrow}\mathcal{N}\left(0,\sigma^{2}\right)<br> \]</p><p>and the continuous mapping theorem allows us to go back and 
forth and get that:</p><p>\[n\rightarrow\infty:\ \bar{X}_{n}\overset{i.d}{\rightarrow}\mathcal{N}\left(\mu,\frac{\sigma^{2}}{n}\right)\]</p><p>So under $H_0$, $\bar{X}_n$ can be asymptotically distributed as any of the $\mathcal{N}\left(\mu,\frac{\sigma^{2}}{n}\right)$ for $\mu\in\Theta_{0}$.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blogboard.io/blog/content/images/2024/04/plot2.png" class="kg-image" alt loading="lazy" width="1033" height="620" srcset="https://blogboard.io/blog/content/images/size/w600/2024/04/plot2.png 600w, https://blogboard.io/blog/content/images/size/w1000/2024/04/plot2.png 1000w, https://blogboard.io/blog/content/images/2024/04/plot2.png 1033w" sizes="(min-width: 720px) 720px"><figcaption>Possible asymptotic distributions of the sample statistic under $H_0$</figcaption></figure><p>We want to design a test with significance level $\alpha$, limiting the <em>Type I error</em>. Let&apos;s consider the following test for some $z \ge \epsilon$:</p><p>\[\psi=\begin{cases}1\ (H_{0}\ rejected) &amp; if\ (\hat{\mu}\le\mu_{0}-z)\ or\ (\hat{\mu}\ge\mu_{0}+z)\\0\ (H_{0}\ not\ rejected) &amp; otherwise \end{cases}\]</p><figure class="kg-card kg-image-card"><img src="https://blogboard.io/blog/content/images/2024/04/plot3.png" class="kg-image" alt loading="lazy" width="1033" height="620" srcset="https://blogboard.io/blog/content/images/size/w600/2024/04/plot3.png 600w, https://blogboard.io/blog/content/images/size/w1000/2024/04/plot3.png 1000w, https://blogboard.io/blog/content/images/2024/04/plot3.png 1033w" sizes="(min-width: 720px) 720px"></figure><p>We denote as <em>Type I error</em> the error of falsely rejecting the null hypothesis. 
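</p><p>To make this concrete, we can estimate the rejection rate of $\psi$ by simulation. The sketch below is our illustration, not part of the original derivation, and the parameter values are made up; it draws data with $\mu=\mu_0+\epsilon$, i.e. at an edge of $\Theta_0$ where $H_0$ still holds:</p>

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Made-up parameters, for illustration only
mu0, eps, z, sigma, n = 0.0, 0.1, 0.2, 1.0, 500
trials = 20_000

# Draw samples with mu = mu0 + eps, an edge of Theta_0 (H_0 still holds)
samples = rng.normal(mu0 + eps, sigma, size=(trials, n))
mu_hat = samples.mean(axis=1)

# psi rejects when the sample average falls outside [mu0 - z, mu0 + z]
rejected = (mu_hat <= mu0 - z) | (mu_hat >= mu0 + z)
rate = rejected.mean()

# Normal approximation of the same rejection probability
approx = (norm.cdf(np.sqrt(n) * (-z - eps) / sigma)
          + 1 - norm.cdf(np.sqrt(n) * (z - eps) / sigma))
```

<p>With these numbers, both the simulated rate and the normal approximation come out around 0.013; as the derivation below shows, the edges of $\Theta_0$ are exactly where this rate is largest.</p><p>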
Formally, we define the Type I error rate $\alpha_\psi$ as the probability of rejecting the null hypothesis when it is in fact true.</p><p>$\alpha_\psi=P_\mu(\psi=1); \mu\in\Theta_0$</p><figure class="kg-card kg-image-card"><img src="https://blogboard.io/blog/content/images/2024/04/plot4-1.png" class="kg-image" alt loading="lazy" width="1033" height="620" srcset="https://blogboard.io/blog/content/images/size/w600/2024/04/plot4-1.png 600w, https://blogboard.io/blog/content/images/size/w1000/2024/04/plot4-1.png 1000w, https://blogboard.io/blog/content/images/2024/04/plot4-1.png 1033w" sizes="(min-width: 720px) 720px"></figure><p>The <strong><em>level</em></strong> $\alpha$ of a test is the largest <em>Type I error</em> that we&apos;ll get for any $\mu\in\Theta_0$. Formally, a statistical test has <em>level</em> $\alpha$ if:</p><p>$\alpha_\psi=P_\mu(\psi=1)\le\alpha, \forall\mu\in\Theta_0$</p><p>Thus, for our test $\psi$, we have the level</p><p>\[ \begin{eqnarray*} \alpha &amp; = &amp; \underset{\mu\in\Theta_{0}}{sup}P_{\mu}\left(\psi=1\right)\\ &amp; = &amp; \underset{\mu\in\Theta_{0}}{sup}P_{\mu}\left((\hat{\mu}&lt;\mu_{0}-z)\ or\ (\hat{\mu}&gt;\mu_{0}+z)\right)\\ &amp; = &amp; \underset{\mu\in\Theta_{0}}{sup}P_{\mu}\left((\hat{\mu}-\mu_{0}&lt;-z)\ or\ (\hat{\mu}-\mu_{0}&gt;z)\right)\\ &amp; \overset{by\ CLT}{\sim} &amp; \underset{\mu\in\Theta_{0}}{sup}\left\{ P\left(\mathcal{N}\left(\mu-\mu_{0},\frac{\sigma^{2}}{n}\right)&lt;-z\right)\ +\ P\left(\mathcal{N}\left(\mu-\mu_{0},\frac{\sigma^{2}}{n}\right)&gt;z\right)\right\} \\ &amp; = &amp; \underset{\mu\in\Theta_{0}}{sup}\left\{ P\left(\mathcal{N}\left(0,1\right)&lt;\sqrt{n}\frac{-z-(\mu-\mu_{0})}{\sigma}\right)\ +\ P\left(\mathcal{N}\left(0,1\right)&gt;\sqrt{n}\frac{z-(\mu-\mu_{0})}{\sigma}\right)\right\} \\ &amp; = &amp; \underset{\mu\in\Theta_{0}}{sup}\left\{ \Phi\left(\sqrt{n}\frac{-z-(\mu-\mu_{0})}{\sigma}\right)+1-\Phi\left(\sqrt{n}\frac{z-(\mu-\mu_{0})}{\sigma}\right)\right\} \\ \end{eqnarray*} \]</p><p>Where 
$\Phi$ is the cumulative distribution function of a standard Gaussian.</p><p>Let&apos;s see how $\alpha_\psi$ behaves as we move $\mu$ over $\Theta_0$, that is, from $\mu_0-\epsilon$ to $\mu_0+\epsilon$. We&apos;ll take the derivative of $\alpha_\psi$ with respect to $\mu$:</p><p>\[ \begin{eqnarray*} \frac{\partial}{\partial\mu}\alpha_{\psi}(\mu) &amp; = &amp; \frac{\partial}{\partial\mu}\left(P_{\mu}\left(\psi=1\right)\right)\\ &amp; = &amp; -\phi\left(\sqrt{n}\frac{-z-(\mu-\mu_{0})}{\sigma}\right)\cdot\frac{\sqrt{n}}{\sigma}+\phi\left(\sqrt{n}\frac{z-(\mu-\mu_{0})}{\sigma}\right)\cdot\frac{\sqrt{n}}{\sigma}\\ &amp; = &amp; \frac{\sqrt{n}}{\sigma}\left(\phi\left(\sqrt{n}\frac{z-(\mu-\mu_{0})}{\sigma}\right)-\phi\left(\sqrt{n}\frac{-z-(\mu-\mu_{0})}{\sigma}\right)\right) \end{eqnarray*} \]</p><p>Where $\phi(x)=\frac{\partial}{\partial{x}}\Phi(x)$ is the pdf of a standard Gaussian.</p><p>We can show by symmetry of $\phi$ that $\frac{\partial}{\partial\mu}\alpha_{\psi}(\mu)=0$ when $\mu=\mu_0$, and, using properties of the Gaussian pdf, show that regardless of $n$ and $\sigma$:</p><p>$\frac{\partial}{\partial\mu}\alpha_{\psi}(\mu)&lt;0\ for\ \mu&lt;\mu_0$ <br>$\frac{\partial}{\partial\mu}\alpha_{\psi}(\mu)&gt;0\ for\ \mu&gt;\mu_0$</p><p>Concretely, this means that our $\alpha_\psi$ is the smallest at $\mu=\mu_0$ and grows as we move away from $\mu_0$. It&apos;s easily shown that $\alpha_\psi(\mu_0-\epsilon)=\alpha_\psi(\mu_0+\epsilon)$, i.e. it takes the same (largest) value at both edges of $\Theta_0$, 
when $\mu=\mu_0\pm\epsilon$, or formally:</p><figure class="kg-card kg-image-card"><img src="https://blogboard.io/blog/content/images/2024/04/plot5.png" class="kg-image" alt loading="lazy" width="1033" height="620" srcset="https://blogboard.io/blog/content/images/size/w600/2024/04/plot5.png 600w, https://blogboard.io/blog/content/images/size/w1000/2024/04/plot5.png 1000w, https://blogboard.io/blog/content/images/2024/04/plot5.png 1033w" sizes="(min-width: 720px) 720px"></figure><p>\[ \begin{eqnarray*} \underset{\mu\in\Theta_{0}}{argsup}P_{\mu}\left(\psi=1\right) &amp; = &amp; \mu_{0}\pm\epsilon\\ \underset{\mu\in\Theta_{0}}{sup}P_{\mu}\left(\psi=1\right) &amp; = &amp; \Phi\left(\sqrt{n}\frac{-z-\epsilon}{\sigma}\right)+1-\Phi\left(\sqrt{n}\frac{z-\epsilon}{\sigma}\right) \end{eqnarray*} \]</p><p>So far, we&apos;ve shown that with our test defined as:</p><p>\[\psi=\begin{cases}1\ (H_{0}\ rejected) &amp; if\ (\hat{\mu}\le\mu_{0}-z)\ or\ (\hat{\mu}\ge\mu_{0}+z)\\0\ (H_{0}\ not\ rejected) &amp; otherwise \end{cases}\]</p><p>and for an arbitrary $z\ge\epsilon$, our test has level $\alpha=\Phi\left(\sqrt{n}\frac{-z-\epsilon}{\sigma}\right)+1-\Phi\left(\sqrt{n}\frac{z-\epsilon}{\sigma}\right)$.</p><p>Let&apos;s now go the other way and pick $z$ for a desired level $\alpha$. Looking at the equation for $\alpha$ above, it isn&apos;t obvious how to find $z$. To avoid solving this equation, we&apos;ll note the following: as we move $z$ farther from $\epsilon$, our $\alpha$ gets smaller. That means we can find $z_\alpha$ numerically by the bisection method.</p><p>However, since getting the <em>p-value</em> is enough for a test, we don&apos;t actually need to solve for $z_\alpha$. Note that as $z$ increases, $\alpha$ decreases, and from the definition of our test we have that the largest $z$ at which we&apos;ll ever reject is $z=\left|\hat{\mu}-\mu_{0}\right|$. 
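</p><p>If we do want an explicit critical value, the root-finding idea above takes only a few lines of code. This is our sketch (function names and numbers are illustrative, not from the original post), using SciPy&apos;s bracketing root finder in place of hand-rolled bisection:</p>

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def level(z, n, sigma, eps):
    """Level of the test psi for a given cutoff z >= eps."""
    return (norm.cdf(np.sqrt(n) * (-z - eps) / sigma)
            + 1 - norm.cdf(np.sqrt(n) * (z - eps) / sigma))

def critical_value(alpha, n, sigma, eps):
    """Find z_alpha with level(z_alpha) == alpha; level decreases in z."""
    upper = eps + 10 * sigma / np.sqrt(n)  # level is essentially 0 out here
    return brentq(lambda z: level(z, n, sigma, eps) - alpha, eps, upper)

z_alpha = critical_value(0.05, n=1000, sigma=1.0, eps=0.1)
```

<p>Evaluating the same <code>level</code> expression at $z=\left|\hat{\mu}-\mu_{0}\right|$ gives the p-value discussed next.</p><p>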
This means that the smallest level at which we can reject is given by:</p><p>\[ \text{p-value}=\underset{z\in[\epsilon,\infty)}{min}\alpha=\Phi\left(\sqrt{n}\frac{-\left|\hat{\mu}-\mu_{0}\right|-\epsilon}{\sigma}\right)+1-\Phi\left(\sqrt{n}\frac{\left|\hat{\mu}-\mu_{0}\right|-\epsilon}{\sigma}\right) \]</p><p>As a special case, consider the test with $\epsilon=0$. We&apos;ll have:</p><p>\[ \begin{eqnarray*} H_{0}:\left|\mu-\mu_{0}\right| &amp; = &amp; 0\Leftrightarrow\mu=\mu_{0}\\ H_{1}:\left|\mu-\mu_{0}\right| &amp; &gt; &amp; 0\Leftrightarrow\mu\ne\mu_{0} \end{eqnarray*} \]</p><p>and:</p><p>\[ \begin{eqnarray*} \text{p-value} &amp; = &amp; \underset{z\in[\epsilon,\infty)}{min}\alpha=\Phi\left(\sqrt{n}\frac{-\left|\hat{\mu}-\mu_{0}\right|}{\sigma}\right)+1-\Phi\left(\sqrt{n}\frac{\left|\hat{\mu}-\mu_{0}\right|}{\sigma}\right)\\ &amp; = &amp; 2\cdot\Phi\left(\sqrt{n}\frac{-\left|\hat{\mu}-\mu_{0}\right|}{\sigma}\right) \end{eqnarray*} \]</p><p>which is exactly the simple hypothesis two-sided test.</p><h2 id="two-sample-test">Two-sample test</h2><p>So far we&apos;ve just derived a one-sample test, so we need to modify it a bit to test the difference between means of two samples.</p><p>Given $n$ observations $X_1...X_n$ from a distribution with mean $\mu_X$ and variance $\sigma_X^2$, and $m$ observations $Y_1...Y_m$ from a distribution with mean $\mu_Y$ and variance $\sigma_Y^2$, we formulate the following null and alternative hypotheses:</p><p>\[ \begin{eqnarray*} H_{0} &amp; : &amp; \left|\mu_{X}-\mu_{Y}\right|\le\epsilon\\ H_{1} &amp; : &amp; \left|\mu_{X}-\mu_{Y}\right|&gt;\epsilon \end{eqnarray*} \]</p><p>The alternative says that $\mu_X$ differs from $\mu_Y$ by more than a margin of $\epsilon$. We use $\epsilon$ here to state our desired <em>practical significance</em>.</p><p>Let&apos;s define $d$ as $d=\mu_X-\mu_Y$. 
Then we can rewrite our hypotheses as:</p><p>\[ \begin{eqnarray*} H_{0} &amp; : &amp; \left|d-0\right|\le\epsilon\\ H_{1} &amp; : &amp; \left|d-0\right|&gt;\epsilon \end{eqnarray*} \]</p><p>We&apos;ll use $\hat{d}=\hat{\mu}_{X}-\hat{\mu}_{Y}=\bar{X}_{n}-\bar{Y}_{m}$ as the estimator for $d$. From the central limit theorem and the multivariate delta method, we get that:</p><p>\[ \hat{d}=\bar{X}_{n}-\bar{Y}_{m}\overset{i.d.}{\rightarrow}\mathcal{N}\left(\mu_{X}-\mu_{Y},\frac{\sigma_{X}^{2}}{n}+\frac{\sigma_{Y}^{2}}{m}\right) \]</p><p>Substituting $\hat{\mu}\rightarrow\hat{d}$, $\mu_0\rightarrow0$, $\frac{\sigma^{2}}{n}\rightarrow\frac{\sigma_{X}^{2}}{n}+\frac{\sigma_{Y}^{2}}{m}$ and plugging into our formula for the p-value, we get:</p><p>\[ \text{p-value}=\Phi\left(\frac{-\left|\hat{d}\right|-\epsilon}{\sqrt{\frac{\sigma_{X}^{2}}{n}+\frac{\sigma_{Y}^{2}}{m}}}\right)+1-\Phi\left(\frac{\left|\hat{d}\right|-\epsilon}{\sqrt{\frac{\sigma_{X}^{2}}{n}+\frac{\sigma_{Y}^{2}}{m}}}\right) \]</p><p>Note once again that by setting $\epsilon=0$ we get the familiar form of a simple two-sample two-sided test for a difference of means.</p><h2 id="conclusion">Conclusion</h2><p>We&apos;ve walked step by step through deriving a two-sample two-sided Z-test with a margin of tolerance. It should be even simpler to derive a one-sided test, as it&apos;s just a modification of a simple hypothesis one-sided test with an extended range for $\Theta_0$.</p><p>Using the formulas above, if we reject $H_0$, we can also find the largest $\epsilon$ at which we can still reject with a given p-value. This gives us a lower bound on the difference between means at a given statistical significance.</p>]]></content:encoded></item><item><title><![CDATA[Kano Model Examples - Build Great Products With a Simple Mental Model]]></title><description><![CDATA[When Steve Jobs first showed the original iPhone in 2007, it seemed out of this world. 
Watch the event video and pay attention to the crowd reaction when Jobs shows pinch-to-zoom or when the phone automatically switches to landscape mode when he flips it in his hand. It was unbelievable.]]></description><link>https://blogboard.io/blog/kano-model-how-to-build-great-products-with-a-simple-mental-model/</link><guid isPermaLink="false">60116fb9c065f548ce6d4b91</guid><dc:creator><![CDATA[Drazen Zaric]]></dc:creator><pubDate>Sun, 07 Feb 2021 23:18:23 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1576733594146-cccc8bef263e?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MXwxMTc3M3wwfDF8c2VhcmNofDUwfHxzaG9wfGVufDB8fHw&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1576733594146-cccc8bef263e?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MXwxMTc3M3wwfDF8c2VhcmNofDUwfHxzaG9wfGVufDB8fHw&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2000" alt="Kano Model Examples - Build Great Products With a Simple Mental Model"><p></p><p>When Steve Jobs first showed the original iPhone in 2007, it seemed out of this world. Watch the event video and pay attention to the crowd reaction when Jobs shows pinch-to-zoom or when the phone automatically switches to landscape mode when he flips it in his hand. It was unbelievable.</p><figure class="kg-card kg-image-card"><img src="https://blogboard.io/blog/content/images/2021/02/jobs_iphone.gif" class="kg-image" alt="Kano Model Examples - Build Great Products With a Simple Mental Model" loading="lazy" width="640" height="360"></figure><p>A few years later, there are <a href="https://qz.com/472767/there-are-now-more-than-24000-different-android-devices/">thousands</a> of smartphone models. Every single one has the pinch-to-zoom feature and switches to landscape mode automatically. 
You would seriously reconsider buying a phone lacking these features - it would feel weird.</p><p>This is a general property of any product or service - over time, amazing new features become basic, expected, boring attributes. Anyone offering a competitive product has to have these features for a satisfactory user experience.</p><p>It turns out there&apos;s a mental model that helps us think about this dynamic. Dating back to the 1980s, it originates from Japan and is known as the <strong><em>Kano Model</em></strong> after its original author, Noriaki Kano.</p><p>If you read on you&apos;ll find a brief description of the Kano model, leaving out the details but more than enough to understand the intuition behind it. We&apos;ll then look at several Kano model examples, namely case studies showcasing how the model can be used in different settings.</p><p>In short, we&apos;ll see an example of a pyramid of customer needs, akin to Maslow&apos;s hierarchy, drawn from the airline industry. Understanding products, services and feature ideas in terms of the <em>Kano hierarchy of customer needs </em>can help us come up with product strategies and roadmap planning.</p><p>Then we&apos;ll look at how Kano thinking can be used to understand Tesla&apos;s strategy to break into the automotive industry, and how we can use these ideas to find product-market fit.</p><p>Finally, we&apos;ll glance at how traditional hotels are fighting back against Airbnb and what they&apos;re doing wrong.</p><h2 id="what-is-the-kano-model">What is the Kano Model?</h2><p></p><p>The Kano model posits that you can put any feature of any product or service into one of the following buckets:</p><ol><li><em><strong>Reverse quality</strong></em> - product attributes that, when present, reduce customer satisfaction, and that you&apos;d be better off removing.</li><li><strong><em>Indifferent </em></strong>- attributes that don&apos;t affect satisfaction regardless of whether they&apos;re present or 
not</li><li><em><strong>Must-be</strong></em> - attributes that customers expect the product to have. Without these you don&apos;t have a satisfactory, competitive product.</li><li><strong><em>Performance </em></strong>(or <em>one-dimensional</em>) - attributes that make customers happy when they&apos;re present, and dissatisfied when absent. These are also the <em>more-is-better</em> attributes, like memory in computers, or horsepower and miles-per-gallon in cars.</li><li><em><strong> Delighters </strong></em>(or <em>attractive</em>)<em><strong> </strong>- </em>when present, these attributes cause delight; when absent, customers won&apos;t care that much. These are usually the features or services that you provide by going out of your way to please customers.</li></ol><p>An important idea coming with the model is the <em><strong>decay of delight </strong>- </em>over time <em>delighters</em> will turn into <em>performance</em> attributes, and <em>performance</em> attributes will shift to <em>must-be</em>. With products, this inevitably happens as new technology becomes commonplace; with services, it will happen once most competitors implement what used to be delightful offerings.</p><p>There&apos;s a commonly used visual representation of these. Imagine evaluating each feature along two axes:</p><ul><li> <em>feature sophistication </em>(from <em>not present </em>to <em>best of breed</em>) </li><li><em>customer satisfaction </em>(from <em>dissatisfied </em>to <em>delighted</em>). 
</li></ul><p>You&apos;ll get the following chart:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blogboard.io/blog/content/images/2021/02/kano_model_png.png" class="kg-image" alt="Kano Model Examples - Build Great Products With a Simple Mental Model" loading="lazy" width="1480" height="1099" srcset="https://blogboard.io/blog/content/images/size/w600/2021/02/kano_model_png.png 600w, https://blogboard.io/blog/content/images/size/w1000/2021/02/kano_model_png.png 1000w, https://blogboard.io/blog/content/images/2021/02/kano_model_png.png 1480w" sizes="(min-width: 720px) 720px"><figcaption>Kano Model - Feature categories</figcaption></figure><p>Note that in practice you don&apos;t actually have a continuous measure for each of the axes. So the chart should be understood to depict qualitative categories and is an intuitive, rough way to capture the main idea of the Kano model.</p><p>There&apos;s theory and history behind the model, but the essence is easy to understand. As an example, let&apos;s say we&apos;re a smartphone company that wants to stop shipping chargers inside the box with new phones. We could ask a sample of our customers <em>how happy they are that their new phone came with a charger inside the box</em>. We would offer the following scale for answers:</p><ul><li>I like it</li><li>I expect it</li><li>I&apos;m neutral</li><li>I can tolerate it</li><li>I dislike it</li></ul><p>Most people would likely respond with a positive answer (<em>I expect it, I like it</em>). Some may have a charger at home that they can use anyway, so they could be <em>Neutral</em> about it. It&apos;s hard to imagine a significant number of buyers being dissatisfied with a free charger.</p><p>Remember, we want to stop shipping chargers inside the box by default. We&apos;d like to know how the market will react. 
We could just ask our customers the obvious question: <em>&quot;If your new phone came without a charger in the box, how would you feel about it?&quot;. </em>We would offer the same answer scale.</p><p>In general, take any feature <em>f </em>of your product and ask your customers the following two questions:</p><ul><li><em>Functional question</em>: &#x201C;How would you feel if <em>f</em> was present/more extensive in the product?&#x201D; </li><li><em>Dysfunctional question</em>: &#x201C;How would you feel if <em>f</em> was absent/less extensive in the product?&#x201D;</li></ul><p>If customers overwhelmingly say they <em>expect</em> the feature <em>f, </em>and would <em>dislike it </em>if <em>f</em> was absent, then you have a <em>Must-be </em>feature. Remove it and your product will be unsatisfactory.</p><p>If they say they don&apos;t mind <em>f</em> being absent, but <em>like it</em> when it&apos;s present, then you have a <em>Delighter. </em>It&apos;s an unexpected feature that gives you a competitive edge on the market. However, satisfaction will not deteriorate if you exclude the feature.</p><p>There are many possible combinations of answers and the <em>Kano evaluation table </em>tells us how to assign features to categories based on survey answers.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blogboard.io/blog/content/images/2021/02/Screenshot-2021-02-03-at-00.11.19.png" class="kg-image" alt="Kano Model Examples - Build Great Products With a Simple Mental Model" loading="lazy" width="1184" height="590" srcset="https://blogboard.io/blog/content/images/size/w600/2021/02/Screenshot-2021-02-03-at-00.11.19.png 600w, https://blogboard.io/blog/content/images/size/w1000/2021/02/Screenshot-2021-02-03-at-00.11.19.png 1000w, https://blogboard.io/blog/content/images/2021/02/Screenshot-2021-02-03-at-00.11.19.png 1184w" sizes="(min-width: 720px) 720px"><figcaption>Kano decision matrix. 
Source and interactive version: <a href="https://kano.plus/about-kano#questionnaire">https://kano.plus/about-kano#questionnaire</a></figcaption></figure><p>That&apos;s the basics, and now that we know what the Kano model is and where it comes from, let&apos;s look at how we can use it to think about products, innovation and market competition.</p><h2 id="kano-model-for-product-strategy-and-feature-prioritization">Kano Model for Product Strategy and Feature Prioritization</h2><p></p><p>In a 2003 paper, <em><a href="https://www.researchgate.net/publication/313250709_Kano_Model_A_Dynamic_Approach_for_Classifying_and_Prioritizing_Travellers%27_Requirements">Kano Model: A Dynamic Approach for Classifying and Prioritizing Travellers&apos; Requirements</a></em>, we learn about the service strategy of Scandinavian Airlines (SAS).</p><p>It&apos;s presented as a pyramid of services, with fundamentals at the bottom (transfer passengers and luggage safely, on time), support services in the middle, and tailoring to individual needs at the top. 
It&apos;s not hard to map these onto Kano categories.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blogboard.io/blog/content/images/2021/02/kano_model_pyramid_sas.png" class="kg-image" alt="Kano Model Examples - Build Great Products With a Simple Mental Model" loading="lazy" width="1762" height="1295" srcset="https://blogboard.io/blog/content/images/size/w600/2021/02/kano_model_pyramid_sas.png 600w, https://blogboard.io/blog/content/images/size/w1000/2021/02/kano_model_pyramid_sas.png 1000w, https://blogboard.io/blog/content/images/size/w1600/2021/02/kano_model_pyramid_sas.png 1600w, https://blogboard.io/blog/content/images/2021/02/kano_model_pyramid_sas.png 1762w" sizes="(min-width: 720px) 720px"><figcaption>Image adapted from <em><a href="https://www.researchgate.net/publication/313250709_Kano_Model_A_Dynamic_Approach_for_Classifying_and_Prioritizing_Travellers%27_Requirements">Kano Model: A Dynamic Approach for Classifying and Prioritizing Travellers&apos; Requirements, Arash Shahin</a></em></figcaption></figure><p>When used to prioritize feature implementation, the pyramid nicely fits the typical prioritization heuristic used with the Kano model: get the basic stuff right (must-be attributes), be competitive on the performance features, add in delighters on top. Miss any of the foundational elements and the entire pyramid crumbles down. 
Leave out the top and you don&apos;t have a complete, shiny, pointy pyramid.</p><figure class="kg-card kg-image-card"><img src="https://blogboard.io/blog/content/images/2021/02/kano_model_priorities_png.png" class="kg-image" alt="Kano Model Examples - Build Great Products With a Simple Mental Model" loading="lazy" width="1380" height="1184" srcset="https://blogboard.io/blog/content/images/size/w600/2021/02/kano_model_priorities_png.png 600w, https://blogboard.io/blog/content/images/size/w1000/2021/02/kano_model_priorities_png.png 1000w, https://blogboard.io/blog/content/images/2021/02/kano_model_priorities_png.png 1380w" sizes="(min-width: 720px) 720px"></figure><p>In <a href="http://www.defmacro.org/2013/09/26/products.html">How to build great products</a>, Slava Akhmechet outlines an intuitive heuristic based on three buckets very similar to the Kano model categories:</p><blockquote>The most important aspect of product management is categorizing features into three buckets: gamechangers, showstoppers, and distractions. When I first started building products, all features looked roughly the same. 
Over time, I formed the three bucket model and now my mind automatically slots every feature into one of these buckets.</blockquote><p>Describing his heuristic using a case of building mobile phones, the author states his three feature buckets:</p><blockquote><strong>A gamechanger.</strong> People will want to buy your product because of this feature.<br><br><strong>A showstopper.</strong> People won&#x2019;t buy your product if you&#x2019;re missing this feature, but adding it won&#x2019;t generate demand.<br><br><strong>A distraction.</strong> This feature will make no measurable impact on adoption.<br><br><strong>Empirically, successful products have one to three gamechanging features, dozens of features that neutralize showstoppers, and very few features that are distractions.</strong> Your job is to build an intuition about your space to be able to tell these categories apart. That&#x2019;s still pretty subtle (is a built-in phone projector a gamechanger or a distraction?), but at least this model gives you a plan of attack.</blockquote><p>As Akhmechet suggests, building intuition around feature categories is an essential skill for anyone wanting to build successful products. On top of this, the Kano model provides a systematic way of assessing and categorizing features. Besides the obvious, it can be a useful way of testing whether your intuition is aligned with that of your customers.</p><p>Removing features is another problem where Kano thinking can help. In software there&apos;s rarely a reason to remove functioning features. In service or manufacturing industries, though, it can be a massive cost-reduction mechanism. When thinking about which features to remove from your product or service, you&apos;ll likely end up going the opposite way from what we described above. You&apos;ll first consider removing delighters, reduce some performance attributes, and think very hard about must-be features.</p><p>The model is often studied as a tool for prioritization. 
However, mapping attributes of a product onto a pyramid like the one above can also help prevent decay in the quality of a product. Focusing too much on adding exciting new stuff and improving performance to stay competitive can cause attention to drift away from the boring fundamentals.</p><h2 id="finding-a-market-for-your-product">Finding a Market for Your Product</h2><p></p><p>Kano categories for a feature can vary depending on the target market and customer segment. A delighter in a budget family car will often be a basic (<em>must-be</em>) feature in an upmarket segment.</p><p>The set of features you&apos;re able to deliver will determine the market segment you can compete in. The market you choose will determine the entry barrier (<em>must-be </em>features), the areas where you can outperform the competition (<em>performance</em> features) and the room you have to leave your customers in awe (<em>delighters</em>).</p><p>At any time, the current state of technology inevitably dictates what features are possible in a product. The introduction of the first Tesla Roadster is a great example of choosing the right market under such constraints. In 2006, Elon Musk shared <a href="https://www.tesla.com/blog/secret-tesla-motors-master-plan-just-between-you-and-me">The Secret Tesla Motors Master Plan</a>. In it, Musk explains that with the new technology (electric motors, batteries) you can&apos;t play in the low-end market:</p><blockquote>Almost any new technology initially has high unit cost before it can be optimized and this is no less true for electric cars. The strategy of Tesla is to enter at the high end of the market, where customers are prepared to pay a premium, and then drive down market as fast as possible to higher unit volume and lower prices with each successive model. 
</blockquote><p>In <a href="https://www.tesla.com/blog/master-plan-part-deux">part two</a>, published in 2016, Musk adds:</p><blockquote>Also, a low volume car means a much smaller, simpler factory, albeit with most things done by hand. Without economies of scale, anything we built would be expensive, whether it was an economy sedan or a sports car. While at least some people would be prepared to pay a high price for a sports car, no one was going to pay $100k for an electric Honda Civic, no matter how cool it looked.</blockquote><p>Translating to Kano terminology, in the economy-car segment price is a <em>reverse-quality</em> attribute, while luxury performance car buyers are mostly indifferent to high prices.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blogboard.io/blog/content/images/2021/02/image-3.png" class="kg-image" alt="Kano Model Examples - Build Great Products With a Simple Mental Model" loading="lazy" width="560" height="338"><figcaption>Tesla Roadster (first generation). Image credits: <a href="https://en.wikipedia.org/wiki/Tesla_Roadster_(first_generation)">Wikipedia</a></figcaption></figure><p>Also, as a sports car buyer you&apos;ll likely care less about practical attributes such as the service network, the availability of charging stations, or battery range. For a budget family car owner, those are very much worth considering.</p><p>With its superior acceleration, an electric vehicle is a perfect competitor in the high-performance car market. Its novelty and exclusivity are <em>delighters</em> that make the first generation of the Tesla Roadster a perfect luxury item.
So, by using technology that was not yet ready to satisfy the basic needs of the mass market, Tesla was able to cleverly craft a product that satisfied all the basics, won on performance, and leveraged novelty to delight buyers of high-performance cars.</p><p>By finding a market where your product can perfectly deliver the <em>must-be</em> attributes, outplay the competition in <em>performance</em>, and deliver enough <em>delighters</em>, you can achieve product-market fit by choosing the market instead of changing the product.</p><h2 id="staying-competitive-in-a-tough-market">Staying Competitive in a Tough Market</h2><p>Airbnb is a famous example of disruption in a centuries-old industry. Offering unique apartments reflecting the local spirit, at cheaper prices, Airbnb stole a large chunk of the market from traditional hotels.</p><p>The difference is well captured in a PostFunnel <a href="https://postfunnel.com/how-airbnb-changed-the-hospitality-industry/">article</a> titled <em>How Airbnb Changed the Hospitality Industry</em>:</p><blockquote>&#x201C;People who choose Airbnb are seeking an intimate, non-manufactured experience. [..] They want to be embedded in the fabric of a community, where they can feel the uniqueness of place. This desire for authenticity is driving today&#x2019;s hospitality design &#x2014; each hotel must tell a story; it must be a place of context, reflective of its neighborhood and community.&#x201D;</blockquote><p>The consistent, <em>manufactured</em> feel of traditional hotels simply doesn&apos;t cut it. It might be an efficiency win for hotels, making them easier to manage, equip and maintain. But for a huge segment of customers, bland hotels are unattractive.
So, in Kano model terminology, this consistency would be categorized as a <em>reverse-quality</em> attribute.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blogboard.io/blog/content/images/2021/08/image-4.jpg" class="kg-image" alt="Kano Model Examples - Build Great Products With a Simple Mental Model" loading="lazy" width="880" height="587" srcset="https://blogboard.io/blog/content/images/size/w600/2021/08/image-4.jpg 600w, https://blogboard.io/blog/content/images/2021/08/image-4.jpg 880w" sizes="(min-width: 720px) 720px"><figcaption><em>An Airbnb room in San Francisco. Photo credit: Airbnb</em></figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blogboard.io/blog/content/images/2021/08/image-6.jpg" class="kg-image" alt="Kano Model Examples - Build Great Products With a Simple Mental Model" loading="lazy" width="1170" height="660" srcset="https://blogboard.io/blog/content/images/size/w600/2021/08/image-6.jpg 600w, https://blogboard.io/blog/content/images/size/w1000/2021/08/image-6.jpg 1000w, https://blogboard.io/blog/content/images/2021/08/image-6.jpg 1170w" sizes="(min-width: 720px) 720px"><figcaption>A hotel room in San Francisco. Image credit: https://www.omnihotels.com</figcaption></figure><p>Since it&apos;s becoming an expectation (<em>must-be</em>) among travellers, large hotel chains are starting to transition to more cozy, residential-style accommodation, connected to local lifestyle and attractions.</p><p>An article titled <a href="https://hospitalitytech.com/4-hotel-strategies-remaining-competitive-age-airbnb"><em>4 Hotel Strategies for Remaining Competitive in the Age of Airbnb</em></a> lists <em>technology, personalization</em> and <em>amenities</em> as the means for hotels to stay competitive. Referring to the Kano model, this would typically mean offering more <em>delighters.
</em></p><p>In <a href="https://www.youtube.com/watch?v=Hr1rN3jibIk"><em>Building a Winning UX Strategy Using the Kano Model</em></a> Jared Spool opens his talk on the Kano model with a story about Hyatt&apos;s random acts of generosity, a tactic intended to improve guest satisfaction and loyalty. Spool points out that we can predict that such an effort will fail by viewing it in the light of the Kano model. Simply put, it is unlikely to help if you go out of your way to offer <em>delighters</em> on top of flawed <em>must-be</em> and subpar <em>performance</em> attributes.</p><hr><p>That concludes our discussion of the Kano model and its uses.</p><p>If you&apos;re interested in more product-related reads, make sure to check out <a href="https://blogboard.io/search?searchQuery=product%20management">Product management</a> on Blogboard search.</p>]]></content:encoded></item><item><title><![CDATA[Marketing Data Science - Case Studies from Airbnb, Lyft, Doordash]]></title><description><![CDATA[In this article we'll look at several data science case studies from marketing optimization efforts at companies like Lyft, Airbnb, Netflix, Doordash, Wolt, Rovio Entertainment.]]></description><link>https://blogboard.io/blog/data-science-in-marketing-optimization/</link><guid isPermaLink="false">5ff10caec065f548ce6d468f</guid><dc:creator><![CDATA[Drazen Zaric]]></dc:creator><pubDate>Thu, 07 Jan 2021 00:20:44 GMT</pubDate><media:content url="https://blogboard.io/blog/content/images/2021/11/possessed-photography-_E1PQXKUkMw-unsplash.jpeg" medium="image"/><content:encoded><![CDATA[<img src="https://blogboard.io/blog/content/images/2021/11/possessed-photography-_E1PQXKUkMw-unsplash.jpeg" alt="Marketing Data Science - Case Studies from Airbnb, Lyft, Doordash"><p>In the first quarter of 2019, Airbnb spent $367 million on sales and marketing.
When you think about this from a technical standpoint, two obvious problems come to mind:</p><ol><li>How do you scale your marketing processes to be able to spend $300+ million per quarter on ads?</li><li>Once you have systems in place to spend huge ad budgets, what&apos;s an optimal way to allocate the money?</li></ol><p>In this article we&apos;ll look at several case studies of data science in marketing, applied to optimize efforts at companies like Lyft, Airbnb, Netflix, Doordash, Wolt, Rovio Entertainment.</p><p>Summarizing articles from the official blogs of these companies, we&apos;ll get a high-level overview of marketing automation and then zoom in on the parts where data science and machine learning play their role.</p><p>If you read on, you&apos;ll find these three sections:</p><ol><li><strong>Marketing automation systems</strong> - what they are, what subsystems they comprise, and where in the process data science is usually applied</li><li><strong>Performance estimation</strong> - why estimating the performance of your campaigns is the fundamental problem in marketing analytics, and what data science tool set is used for it</li><li><strong>Optimizing bidding and budget allocation</strong> - once your marketing efforts are at the scale of hundreds or thousands of concurrent campaigns, it&apos;s impossible to allocate your budget manually in an optimal way. This is where marketing data science shines. We look at two simple algorithms for budget allocation, shared by DoorDash and Lyft engineers.</li></ol><h2 id="marketing-automation-systems">Marketing Automation Systems</h2><p>In large and analytically mature organizations, the <em>optimization</em> piece usually comes as a part of a larger <em>marketing automation</em> system, but as we&apos;ll see it&apos;s not always the case.
Allocating budgets manually, aided by data science, can be hugely profitable and might be a good first step towards a fully automated workflow.</p><p>Before diving into details, let&apos;s look at the high-level architecture of an automated process for online marketing.</p><p>Generally, all advertising platforms involve a common workflow: you set up the ad creative (text, visuals), choose the target audience, and set the bidding budget and strategy. As a result of streamlining this workflow, marketing automation systems are very similar in their high-level architecture. Usually, these systems comprise the following:</p><ol><li>Data tracking system <br>Track conversion events (customer signups, payment events, subscriptions, micro-transactions, etc.).</li><li>Attribution system<br>Connect conversion events with the user acquisition source. That is, for each user we want to know exactly the marketing channel and the campaign that brought them in.</li><li>Performance estimation system<br>Let&apos;s say a campaign brought in 1000 users. We want to know if it paid off. We know how much we spent on it, but how do we know how much revenue the users will bring us over their lifetime? LTV and conversion modelling come into play here.</li><li>Campaign management system<br>Online ads are a very fertile field for variation testing and content generation. But even without testing multiple variations of the same ad, companies typically target different segments in different ways, easily resulting in dozens or hundreds of ads running simultaneously.
Companies like Airbnb and Netflix invest heavily in systems that support ad creation and management (<a href="https://medium.com/airbnb-engineering/growing-our-host-community-with-online-marketing-9b2302299324">Airbnb article</a>, <a href="https://netflixtechblog.com/https-medium-com-netflixtechblog-engineering-to-improve-marketing-effectiveness-part-2-7dd933974f5e">Netflix article</a>).</li><li>Automated bidding and budget optimization<br>The largest ad serving platforms provide you with near real-time feedback on your ad performance. Connect this with the spend and projected LTV and you can get your ROI predictions and adjust budgets accordingly. With dozens or hundreds of campaigns and variations, the benefits of automation and optimization at this step can be huge.</li></ol><p>As we&apos;re interested in the role that data science can play in the overall ad lifecycle, we&apos;ll focus on the two parts that tend to benefit the most from mixing in data science: 1) performance estimation and 2) automated bidding and budgeting.</p><p>Before diving in, it&apos;s important to understand the <em>channel/campaign</em> nomenclature. By <em>channel</em> we mean an advertising platform, such as Google AdWords, Facebook, Youtube, etc. A <em>campaign</em> is a single piece of advertising aimed at a specific audience, according to segments available on the <em>channel</em>, with a preset start and end time.</p><p>When evaluating marketing performance, we might want to look at investment and ROI at the level of a channel, a single campaign or a group of similar campaigns.
We&apos;ll see how these different levels of granularity influence the amount and quality of available data, and in consequence how that determines the approaches that can be taken.</p><h2 id="performance-estimation">Performance Estimation</h2><p>Ideally, for the purpose of marketing optimization we&apos;re interested in LTV and CAC (Customer Acquisition Cost) as the factors in the ROI equation: $$ROI=\frac{LTV}{CAC}$$</p><p>LTV modelling is a fundamental problem in business analytics and it is far from trivial to get it completely right. The exact models depend heavily on the type of business and the intended application. LTV models are generally more valuable if we can give good estimates very early in the user lifetime. However, the earlier we do it the less data we have at our disposal.</p><p>In <a href="https://www.appsflyer.com/blog/overcoming-ltv-modeling-pitfalls/">Pitfalls of Modeling LTV and How to Overcome Them</a>, Dmitry Yudovsky outlines several challenges that make it impossible for a cookie-cutter approach to LTV estimation to exist:</p><ul><li>Machine learning approaches are sometimes completely inadequate.<br>There might be a lack of data necessary for long-term LTV predictions. Also, even if we do have a large business with tons of historical data, there are cases when training models on year-old data doesn&apos;t work well - maybe the product or the entire market is very different than it was a year or two ago.</li><li>Depending on whether we want to use LTV estimates for ad optimization, CRM efforts or corporate financial projections, we might have different requirements for model accuracy and the cohort granularity at which we&apos;re making predictions (e.g. single user, single campaign, group of campaigns, all users, etc.)</li></ul><p>Of course, the problem is not intractable, and there are several common approaches.
We&apos;ll look at a few case studies found in tech blogs from the DoorDash, Airbnb and Lyft engineering teams.</p><p>In <a href="https://doordash.engineering/2020/07/31/optimizing-marketing-spend-with-ml/">Optimizing DoorDash&#x2019;s Marketing Spend with Machine Learning</a>, DoorDash data scientists present their approach, where instead of directly estimating LTV, they model conversion rates as a function of marketing spend. We&apos;ll see later how these cost curves help to neatly optimize budget allocation across channels and campaigns.</p><p>Experience (data) tells us that any marketing channel will reach saturation at some point, so we can model cost curves, i.e. $Conversion=f(Spend)$, using a power function of the form $a\cdot Spend^{b}$.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blogboard.io/blog/content/images/2021/01/image.png" class="kg-image" alt="Marketing Data Science - Case Studies from Airbnb, Lyft, Doordash" loading="lazy" width="548" height="408"><figcaption>Cost curve of the shape $a\cdot Spend^{b}$. Image credit: <a href="https://doordash.engineering/2020/07/31/optimizing-marketing-spend-with-ml/">DoorDash Engineering</a></figcaption></figure><p>We can fit cost curves at any cohort level, and it&apos;s typically done at the granularity of a channel or campaign. Simply put, if for a given campaign we spent $x$ amount of money, and that brought us $y$ users, we have one data point, $(x, y)$.</p><p>However, when allocating budgets at a later stage, we might need to make decisions at the <em>campaign</em> level, which causes problems with insufficient data. In the DoorDash Engineering article, Aman Dhesi explains this problem:</p><blockquote>For some channels like search engine marketing, we have thousands of campaigns that spend a small amount of money every week. This makes the weekly attribution data noisy. Some weeks these campaigns don&#x2019;t spend at all, which makes the data sparse.
Using this data as-is will result in unreliable cost curves and in turn suboptimal (potentially wildly so) allocation.</blockquote><p>At DoorDash they solve this problem by training separate models which use similar campaigns to fill in the gaps in the dataset with synthetic data. This approach brings with it certain tradeoffs, described in the <a href="https://doordash.engineering/2020/07/31/optimizing-marketing-spend-with-ml/">original article</a>.</p><p>In a similar manner, as described in <a href="https://eng.lyft.com/lyft-marketing-automation-b43b7b7537cc">Building Lyft&#x2019;s Marketing Automation Platform</a>, data scientists at Lyft fit an LTV curve of the shape $LTV=a\cdot Spend^{b}$. However, they incorporate an additional degree of randomness by modelling $a$ and $b$ as random variables and estimating their parameters $(\mu_a, \sigma_a)$ and $(\mu_b, \sigma_b)$ from historical data. This helps them implement an explore-exploit approach in the bidding step, by instantiating LTV curves after sampling $a$ and $b$ from their respective distributions. We&apos;ll revisit this approach briefly at the end of the next section.</p><p>As described in <a href="https://medium.com/airbnb-engineering/growing-our-host-community-with-online-marketing-9b2302299324">Growing Our Host Community with Online Marketing</a>, at Airbnb they face a problem stemming from the nature of their product and the market. When predicting LTV for an Airbnb home listing, two major problems are:</p><ol><li>Ad conversions for hosts are a very rare event. This poses problems with building large enough data sets. It also influences data tracking and attribution, where these systems have to be as precise as possible in order not to lose or wrongly attribute any data points.</li><li>Time from ad impression (user seeing an ad) to conversion (home listed on Airbnb) can be very long, sometimes weeks.
This is a problem if you want to optimize and re-budget your campaigns soon after rollout - you simply don&apos;t have enough data yet.</li></ol><p>In the same post, Tao Cui describes the architecture of each part of Airbnb&apos;s marketing platform as well as the motivation for building the entire thing, along with choices of tech stack. </p><p>In another article dating from 2017, <a href="https://medium.com/airbnb-engineering/using-machine-learning-to-predict-value-of-homes-on-airbnb-9272d3d4739d">Using Machine Learning to Predict Value of Homes On Airbnb</a>, Robert Chang describes how they use machine learning (ending up using XGBoost in production) to estimate the LTV of each listing. Framing it as a typical regression problem, they use hundreds of features, such as <em>location data, price with all the partial costs (e.g. cleaning fee, discounts), availability, previous bookings</em>, to predict revenue from a listing after some fixed amount of time (e.g. 1-year revenue). If you&apos;re curious, the post also describes some of the pieces of infrastructure used by the system and gives high-level code examples of training pipeline construction.</p><p>In <a href="https://www.appsflyer.com/resources/gaming/predictive-modeling-app-marketers-guide/pros-and-cons-of-different-ltv-based-predictive-models-insights-from-top-marketers/">Insights on the Pros and Cons of LTV-based Predictive Models</a>, an article from AppsFlyer, we can find a summary of pros and cons of the three common LTV modelling approaches for app-based businesses:</p><ol><li>Retention/ARPDAU model<br>If we have a fairly old and stable product with some historical data, we can leverage the fact that we know the shape of the retention curve and can fit a power curve to several early-retention data points. We also know the <em>Average Revenue Per Daily Active User (ARPDAU)</em>, which tends to be stable over time for most freemium and micro-transaction apps (such as free-to-play games).
With some math we can arrive at an estimate of the expected LTV using these two measures. For example, to estimate LTV by day 90 of a user&apos;s lifetime we would use the following equation: $$LTV_{90}=ARPDAU\cdot\sum_{d=0}^{90}retention[d]$$</li><li>LTV ratio model<br>As a simple example, in order to get $LTV_{90}$ we&apos;ll use historical data to estimate the ratio $\frac{LTV_{90}}{LTV_{7}}$ and use the observed 7-day LTV to predict the 90-day LTV.</li><li>Behavior driven/user-level models<br>We&apos;d use user-level features to train our favorite machine learning model for regression. This is the approach mentioned above in the Airbnb case.</li></ol><p>The article further discusses the pros and cons of each approach in depth, considering the type of business and the intended use cases for the LTV model.</p><p>Now, back to the big picture - we needed LTV estimation in order to predict the performance of our marketing campaigns. Once we have satisfactory models in place we can use them to make decisions concerning ad budgets. </p><h2 id="optimizing-bidding-and-budget-allocation">Optimizing bidding and budget allocation</h2><p>Once we have the estimates of performance (ROI) for each campaign, we want to allocate our marketing budget across campaigns so that we maximize the total return on investment.</p><p>Depending on the degree of automation, we can use data science-backed systems either to aid manual budgeting or to automate real-time bidding decisions in a fully automated system.</p><p>In the first case, we have a static problem where at some point in time we&apos;re looking at a set of channels/campaigns with their predicted ROIs. A set of sortable tables, visualizations and derived metrics can be invaluable in helping campaign managers optimize their efforts.</p><p>On the other hand, in a fully automated system, we can have algorithms bidding and deciding how to spend each dollar in an optimal way.
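Before looking at those algorithms, the retention/ARPDAU model from the previous section is compact enough to sketch end-to-end. Below is a minimal Python sketch; the early-retention points and the ARPDAU value are invented for illustration, and the only technique taken from the source is fitting a power curve to early retention and summing it out to day 90:

```python
import numpy as np

# Hypothetical early-retention observations: (day, fraction of users active).
days_observed = np.array([1.0, 3.0, 7.0, 14.0])
retention_observed = np.array([0.55, 0.40, 0.30, 0.22])

# Fit retention[d] ~= a * d^b by least squares in log-log space:
# log(retention) = log(a) + b * log(d)
b, log_a = np.polyfit(np.log(days_observed), np.log(retention_observed), 1)
a = np.exp(log_a)

# Extrapolate the fitted curve out to day 90; retention can never exceed 1.
d = np.arange(1, 91)
retention = np.clip(a * d ** b, 0.0, 1.0)

# LTV_90 = ARPDAU * sum of retention over days 0..90 (day-0 retention is 1).
arpdau = 0.12  # hypothetical average revenue per daily active user, in $
ltv_90 = arpdau * (1.0 + retention.sum())
```

The same log-log fitting trick applies to cost curves of the shape $a\cdot Spend^{b}$, which is what makes the power-function assumption so convenient in practice.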
Looking into articles from <a href="https://doordash.engineering/2020/07/31/optimizing-marketing-spend-with-ml/">DoorDash</a> and <a href="https://eng.lyft.com/lyft-marketing-automation-b43b7b7537cc">Lyft</a> engineering teams, we learn about two variations of an approach that sequentially maximizes the marginal value of each dollar spent.</p><p>In <a href="https://doordash.engineering/2020/07/31/optimizing-marketing-spend-with-ml/">Optimizing DoorDash&#x2019;s Marketing Spend with Machine Learning</a> the proposed approach looks at cost curves for each channel/campaign, representing the function $Conversion=f(Spend)$. We note that the slope of the curve is monotonically decreasing as we increase spend, meaning that for each additional dollar spent our marginal value decreases - we get fewer conversions per $ spent.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blogboard.io/blog/content/images/2021/01/image-3.png" class="kg-image" alt="Marketing Data Science - Case Studies from Airbnb, Lyft, Doordash" loading="lazy" width="674" height="500" srcset="https://blogboard.io/blog/content/images/size/w600/2021/01/image-3.png 600w, https://blogboard.io/blog/content/images/2021/01/image-3.png 674w"><figcaption>Cost curves with monotonically decreasing slopes. Image credit: <a href="https://doordash.engineering/2020/07/31/optimizing-marketing-spend-with-ml/">DoorDash Engineering</a></figcaption></figure><p>With the problem framed this way, in order to optimally allocate a fixed budget we can use a simple greedy algorithm:</p><ol><li>For each channel/campaign $c$ set $spend\left[c\right]:=0$</li><li>For each \$ until the budget is exhausted: <br><br>2.1. Find the channel/campaign $c_{best}$ with the largest marginal return (i.e. the largest slope) at its current spend. More formally: $c_{best}=\underset{c}{argmax}\left\{ \frac{\partial}{\partial spend}Conversion[c](spend[c])\right\} $<br><br>2.2. Assign the next \$ to campaign $c_{best}$, i.e.
$spend\left[c\right]:=spend\left[c\right]+1$</li></ol><p>Of course, models and budget allocations can (and should) be periodically updated using performance data obtained from the advertising platform APIs. That brings us to the approach relying on continuously experimenting and updating the model in an explore-exploit fashion.</p><p>In <a href="https://eng.lyft.com/lyft-marketing-automation-b43b7b7537cc">Building Lyft&#x2019;s Marketing Automation Platform</a>, a multi-armed bandit approach is described. Instead of modeling $Conversion$, they fit an LTV curve that essentially has the same power-function properties we described above (monotonically decreasing slope). As mentioned in the previous section, they incorporate an additional degree of randomness by modelling $a$ and $b$ as random variables and estimating their parameters $(\mu_a, \sigma_a)$ and $(\mu_b, \sigma_b)$. </p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blogboard.io/blog/content/images/2021/01/image-4.png" class="kg-image" alt="Marketing Data Science - Case Studies from Airbnb, Lyft, Doordash" loading="lazy" width="700" height="917" srcset="https://blogboard.io/blog/content/images/size/w600/2021/01/image-4.png 600w, https://blogboard.io/blog/content/images/2021/01/image-4.png 700w"><figcaption>Cost curves with model parameters modelled as random variables, enabling a Thompson sampling approach to optimization. Image credit: <a href="https://eng.lyft.com/lyft-marketing-automation-b43b7b7537cc">Lyft Engineering</a></figcaption></figure><p>Then they use <em>Thompson sampling</em>, a simple algorithm for the multi-armed bandit problem with a Bayesian model.
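One allocation round combining the two ideas - sample each campaign's curve parameters (the Thompson step), then hand out the budget greedily by marginal value - can be sketched as follows. This is only an illustration, not Lyft's implementation; the campaign names, parameter values, and the choice of normal distributions are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical posteriors over the parameters of LTV(spend) = a * spend^b,
# one per campaign: (mu_a, sigma_a, mu_b, sigma_b). All values invented.
campaigns = {
    "search_us": (12.0, 1.0, 0.55, 0.03),
    "social_us": (9.0, 1.5, 0.60, 0.05),
    "display_eu": (15.0, 2.0, 0.45, 0.04),
}

# Thompson step: instantiate one concrete curve per campaign by sampling
# a and b from their distributions.
curves = {
    name: (rng.normal(mu_a, s_a), rng.normal(mu_b, s_b))
    for name, (mu_a, s_a, mu_b, s_b) in campaigns.items()
}

def marginal_return(name: str, spend: float) -> float:
    """LTV gained from the next dollar on this campaign's sampled curve."""
    a, b = curves[name]
    return a * ((spend + 1.0) ** b - spend ** b)

# Greedy allocation: each successive dollar goes to the campaign whose
# sampled curve currently has the steepest slope.
budget = 1000
spend = {name: 0.0 for name in campaigns}
for _ in range(budget):
    best = max(spend, key=lambda n: marginal_return(n, spend[n]))
    spend[best] += 1.0
```

Because campaigns with uncertain parameters occasionally sample optimistic curves and win dollars they wouldn't win on mean estimates alone, repeating this round after each batch of fresh performance data naturally balances exploration and exploitation.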
An excellent introduction to Bayesian bandits and Thompson sampling can be found in Chris Stucchio&apos;s article from 2013 - <a href="https://www.chrisstucchio.com/blog/2013/bayesian_bandit.html">Bayesian Bandits - optimizing click throughs with statistics</a>.</p><hr><p>In this article we&apos;ve covered several case studies in using marketing data science to optimize <em>online marketing</em>, with a number of different approaches. Sources vary in their depth and detail, but it&apos;s nevertheless inspiring to learn about all the different ways to solve common problems. </p><p>If you&apos;re curious about more case studies, make sure to check out <a href="https://blogboard.io/more-like-this?articleId=19981143660456">articles similar to <em>Optimizing DoorDash&#x2019;s Marketing Spend with Machine Learning</em></a></p><!--kg-card-begin: html--><div style="background-color: rgb(203 234 251 / 40%); padding: 2em 1em 1em; margin: 10px; width: 70%;">
<p style="text-align: center">Discover the best <a href="https://blogboard.io/topic/Machine%20Learning">Machine Learning</a> and <a href="https://blogboard.io/topic/Data%20Science">Data Science</a> articles from leading tech companies</p><p style="text-align: center"> <strong><a href="https://blogboard.io/search?searchQuery=data%20science">blogboard.io</a></strong></p>
</div>

    <!--kg-card-end: html--><p></p><p></p><p><strong><strong>&#x1F44B; </strong></strong>Liked the article? Let&apos;s get in touch - follow me on Twitter <strong><strong><a href="https://twitter.com/drazenxyz">@drazenxyz</a></strong></strong></p><p><br></p><p></p>]]></content:encoded></item><item><title><![CDATA[Designing Your Engineering Interview Process? Here's What You Need to Consider]]></title><description><![CDATA[If you simply take the standard format for the interviews, you could soon find yourself thinking about how to improve the process. You might find it slow and tedious; you could start seeing evaluations for the same candidate that widely differ or get negative feedback from candidates.]]></description><link>https://blogboard.io/blog/rethinking-engineering-interviews/</link><guid isPermaLink="false">5fe0a4c35e65844e7bbe85b5</guid><dc:creator><![CDATA[Drazen Zaric]]></dc:creator><pubDate>Wed, 23 Dec 2020 00:49:37 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1590945213328-ab2d0bbd1e88?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MXwxMTc3M3wwfDF8c2VhcmNofDM3fHxlbXB0eSUyMGNoYWlyfGVufDB8fHw&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1590945213328-ab2d0bbd1e88?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MXwxMTc3M3wwfDF8c2VhcmNofDM3fHxlbXB0eSUyMGNoYWlyfGVufDB8fHw&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2000" alt="Designing Your Engineering Interview Process? Here&apos;s What You Need to Consider"><p>On the surface, software engineering interviews look pretty much the same in any company. You usually have a phone call with a recruiter, followed by either a short technical round or a set of take home exercises. 
If you pass that, you&apos;re invited for several rounds of on-site interviews, focusing on your technical abilities and behavioral cues that could predict your job performance and how you&apos;ll get along with the team.</p><p>However, if you simply take this format, pick a set of technical questions for the specific role and go with it, you could soon find yourself thinking about how to improve the process. You might find it slow and tedious for the interviewers; you could start seeing evaluations for the same candidate that differ widely depending on the interviewer; you could get feedback from candidates that the process requires too much of a time investment.</p><p>In this post we&apos;ll summarize lessons shared on the official engineering blogs of several companies. The advice falls into three sections:</p><ol><li>Know what you are looking for</li><li>Be ready to iterate</li><li>Techniques to avoid bias</li></ol><p>If you&apos;re curious and want to find more stories about engineering interviews and hiring processes, you should try <a href="https://blogboard.io/search?searchQuery=engineering%20interviews">Blogboard search</a>.</p><h2 id="know-what-you-are-looking-for">Know what you are looking for</h2><p>You want to hire an engineer, and you want to hire a great one. But the exact definition of &quot;great&quot; can depend on many factors, so a one-size-fits-all list of traits is unlikely to exist. </p><p>Are you a big company or a startup, what&apos;s the current state of the project and the team, are you hiring for a junior or a senior role? These are some of the obvious things to consider when thinking about the best fit for the role. Once you&apos;re clear on this you can work out the details and the exact qualities you&apos;re looking for in a candidate.
Finally, all this will determine what you screen for and what your interview scorecards need to look like.</p><p>In a Lyft Engineering blog post <a href="https://eng.lyft.com/how-lyft-designs-the-machine-learning-software-engineering-interview-bbbb9fc8bb28">How Lyft Designs the Machine Learning Software Engineering Interview</a>, Hao Yi Ong states the following three questions to ask in order to understand what you actually want from a role you&apos;re hiring for:</p><blockquote>1. What are Lyft&#x2019;s challenges (and can a specific role help)?<br>2. What should the role be with respect to the organization&#x2019;s goals?<br>3. What are the desired skills, knowledge, and talents given the expectations for the role?</blockquote><p>Companies are more or less transparent about detailed requirements for the roles. The above-mentioned article briefly describes them as follows:</p><blockquote>Our desired talents are <em><em>recurring patterns of thought, feeling, and behavior that can be productively applied</em></em> in the context of Lyft&#x2019;s ML SWE role. What we&#x2019;re looking for here is a bit more complicated than simply work done in the past by a candidate. Faced with the same stimuli, people react and behave differently. When we look for role and values fit, we do mean just that. Beyond skills and knowledge, will a candidate&#x2019;s unique way of responding to the problems thrown up in Lyft&#x2019;s business context help that candidate succeed? So while conventional wisdom might suggest it, we&#x2019;re not always looking for the Michael Jordans of machine learning (be it I. or J.). 
The narrow sort of talents associated with celebrated excellence can be important but in most cases the interviewers are listening for predictive clues of how a candidate will react when posed Lyft-specific problems on the job.</blockquote><p>Speaking about high level qualities of successful engineers, here&apos;s what Ted Tomlinson of Databricks shares in his article <a href="https://databricks.com/blog/2020/01/22/engineering-intervews-a-hiring-managers-guide.html">Engineering Interviews &#x2014; A Hiring Manager&#x2019;s Guide to Standing Out</a>:</p><blockquote>At a startup like Databricks, the most important quality I&#x2019;ve seen in successful engineers is ownership. We are growing quickly, which brings a lot of new challenges every week, but it&#x2019;s not always clear how responsibilities divide across teams and priorities get determined. Great engineers handle this ambiguity by surfacing the most impactful problems to work on, not just those limited to their current team&#x2019;s responsibilities. Sometimes this means directly helping to build the solution, but often it&#x2019;s motivating others to prioritize the work.<br><br>The second quality we focus on, particularly for those earlier in their career, is the ability to learn and grow. The derivative of knowledge is often more important than a candidate&#x2019;s current technical skills. Many of the engineering problems we are solving don&#x2019;t have existing templates to follow. 
That means continually breaking through layers of abstraction to consider the larger system &#x2013; from the lowest level of cpu instructions, up to how visualizations are rendered in the browser.</blockquote><p>Going back to tailoring the requirements to the needs of your company, notice that Ted points out that <em>ownership</em> is <em>the most important quality at a startup like Databricks, </em>suggesting that it wouldn&apos;t necessarily be as important elsewhere.</p><p>Once you&apos;ve figured out the qualities you&apos;re looking for, you can go deeper and break them down into fine-grained categories that you can actually screen for in an interview. Medium Engineering shares a great example of how this can be done. In <a href="https://medium.engineering/engineering-interviews-what-we-screen-for-af2d84122417">Engineering interviews: what we screen for</a>, Jamie Talbot explains how at Medium they&apos;re looking for three things: (1) Can they build software? (2) Can they learn and teach? (3) Are they aligned with our values?</p><p>Each of these high-level requirements is then broken down into six sub-categories. For example, <em>ability to build software </em>covers the following categories:</p><ul><li>Problem solving</li><li>Code fluency</li><li>Autonomy</li><li>Basic computer science knowledge</li><li>System design</li><li>Resoluteness</li></ul><p>If you&apos;re curious, each capability is described in detail in the <a href="https://medium.engineering/engineering-interviews-what-we-screen-for-af2d84122417">blog post</a>. In addition to that, the team at Medium devised a detailed grading guideline for each category, helping interviewers decide on the scale of <em>Strong No &gt; No &gt; Mixed &gt; Yes &gt; Strong Yes. 
</em>They describe it in another article <a href="https://medium.engineering/engineering-interviews-grading-rubric-8b409bec021f">here</a>.</p><p>With any problem, hiring being no exception, it helps to know why you&apos;re doing it and what your environment and constraints are. Only then can you come up with creative and more effective solutions.</p><p>Speaking of constraints, they&apos;re likely to change over time, rendering your existing interview process inefficient.</p><h2 id="be-ready-to-iterate">Be ready to iterate</h2><p>There are several reasons for being flexible and ready to adapt your technical interview process. </p><p>In the previous part we discussed how you need to tailor your interviews to fit your company, project and team needs. Inevitably, these things will change over time, and with that your interview process will likely need to change as well.</p><p>On that note, when reading through company blogs a recurring theme is <em>redesign</em> of the interview process. It usually happens because circumstances change, but often (as with any process) you can find room for improvement even when the environment hasn&apos;t changed much. </p><p>At Medium they noticed more than a few things they thought needed improvement. As elaborated in <a href="https://medium.engineering/engineering-interviews-refining-our-process-52fbc9510e91">Engineering interviews: refining our process</a>, they wanted to address a lack of clarity in capability requirements, inconsistencies in candidate evaluation, uncertainty about which traits are considered important and which less so, and their approach to personality traits and non-technical qualities.</p><p>At The New York Times, they recognized the need to standardize the hiring process across the company, seeing it as a key determining factor for the culture. One area where a consolidated interview process helped is increased trust for internal mobility. 
Because teams didn&apos;t fully trust each other&apos;s evaluations, an engineer could be required to pass a technical interview just to change teams. You can read about this on NYT Open, the official behind-the-scenes blog of The New York Times (namely these two articles: <a href="https://open.nytimes.com/how-we-designed-our-front-end-engineer-hiring-process-9b8f20cc31fb">How We Designed Our Front-End Engineer Hiring Process</a>, <a href="https://open.nytimes.com/how-we-hire-front-end-engineers-at-the-new-york-times-e1294ea8e3f8">How We Hire Front-End Engineers at The New York Times</a>).</p><p>In <a href="https://slack.engineering/refactoring-backend-engineering-hiring-at-slack/">Refactoring Backend Engineering Hiring at Slack</a>, Slack engineers share the story of why and how they optimized their take-home exercise. Although it had many points in its favor, the exercise was a bottleneck in their hiring process. Candidates, wanting to show off the best of their skills, would take too much time to complete the exercise. Slack, on the other hand, was in a state of rapid growth, and the projected time for staffing all the necessary positions was simply too long:</p><blockquote>The end result was that, by our estimates, it would have taken a year to fill our existing open headcount, future growth aside. This timeframe clearly would not allow us to grow at the speed we needed. However, we were also unwilling to sacrifice quality. We needed an approach that would give us good signal and help us hire great engineers, but at a reduced time cost to the candidate and to us. <br><br>To satisfy these needs, we decided to create two new take-home exercises: an API design exercise and a code review exercise. In creating these exercises, we sought to create a problem that was not an onerous time investment on the part of the candidate. 
We wanted something that would give us good signal on the attributes we cared about while taking at most two hours to complete.</blockquote><p>Finally, the team at Slack came up with a new format for the challenge as well as internal apps and GitHub automations to streamline the process, resulting in significant measurable improvement:</p><blockquote>In the end, we saw tangible improvements against our goals. We saw a decrease in our time-to-hire&#x200A;&#x2014;&#x200A;the time from when a recruiter first reaches out, to the candidate&#x2019;s first day in the office. The time-to-hire metric decreased from an average of 200 days to below 83 days&#x200A;&#x2014;&#x200A;and it continues to drop. We&#x2019;ve seen positive feedback from candidates and employees in all parts of the process.</blockquote><p>At SoundCloud, the team attacked the same part of the interview funnel - the take-home exercise - since they noticed it often took far too long to complete. As they point out <a href="https://developers.soundcloud.com/blog/rethinking-the-backend-code-challenge#fnref-1">in this article</a>, there&apos;s a very subtle reason to be careful about the time investment on the candidate&apos;s side:</p><blockquote>Many great candidates have good jobs and busy personal lives. We want to talk to as many qualified candidates as possible, but to do that, we need to minimize the chances that our interview process itself gets in the way.</blockquote><p>So not only does a tedious process slow down your hiring, it might cause the best candidates to simply give up because they&apos;re already too busy.</p><h2 id="how-to-avoid-bias-and-variance">How to avoid bias and variance?</h2><p>It&apos;s no secret that interviewers can easily fall victim to all sorts of biases. 
A good first impression can make you give a candidate&apos;s technical abilities a better score than you otherwise would.</p><p>In <a href="https://hbr.org/2017/06/7-practical-ways-to-reduce-bias-in-your-hiring-process">7 Practical Ways to Reduce Bias in Your Hiring Process</a>, Rebecca Knight outlines the ways in which bias can hurt your hiring and offers several ways to overcome this inherently human problem:</p><blockquote>Unconscious biases have a critical and &#x201C;problematic&#x201D; effect on our judgment, says Francesca Gino, professor at Harvard Business School. &#x201C;They cause us to make decisions in favor of one person or group to the detriment of others.&#x201D; In the workplace, this &#x201C;can stymie diversity, recruiting, promotion, and retention efforts.&#x201D;</blockquote><p>In more concrete terms, the Medium Engineering team <a href="https://medium.engineering/engineering-interviews-what-we-dont-screen-for-4381cfdfa703">shares their approach</a> to fighting bias by standardizing the hiring process. As described in the first section, they&apos;ve laid out in detail all the qualities they&apos;re seeking, as well as those they don&apos;t find predictive of work performance, such as school, GPA, previous employers, or open source contributions. Not only are these poor predictors of performance, they&apos;re also the usual suspects for causing unconscious bias. Having identified these categories, interviewers then avoid penalizing anyone based on them. You can read about this in <a href="https://medium.engineering/engineering-interviews-what-we-dont-screen-for-4381cfdfa703">Engineering interviews: what we don&#x2019;t screen for</a>. </p><p>Having <a href="https://medium.engineering/engineering-interviews-grading-rubric-8b409bec021f">clearly defined qualities and grading rubrics</a> helps standardize evaluation and decisions across interviewers and candidates. 
This helps eliminate both bias and variance in the process. Simply put, eliminating bias ensures that a single interviewer will evaluate two candidates with the same skillset equally. You also want to eliminate the variance among interviewers, so that a candidate is evaluated the same regardless of who interviews them.</p><p>Writing about their take-home test in <a href="https://www.intercom.com/blog/engineer-interview-assignments/">How to prepare for engineering interview assignments</a>, Intercom engineers Lorcan Coyle and Alex Mooney point out that at this first stage of the technical interview they don&apos;t care about anything but the solution itself:</p><blockquote>Unconscious bias is a well-researched problem in our field, and it&#x2019;s important for reviewers to eliminate as many potential sources of bias as possible. It&#x2019;s crucial to be clear about what we&#x2019;re looking for when reviewing an interview assignment, and it&#x2019;s just as important to know what we&#x2019;re not looking for. When assessing a technical submission, we don&#x2019;t care about:<br><br>- The candidate&#x2019;s experience level.<br>- The position they are applying for.<br>- Their CV or professional history.<br><br>None of these details are relevant at this stage. All we assess is the take-home test itself &#x2013; we only care about your code!</blockquote><p>At Slack, <a href="https://slack.engineering/refactoring-backend-engineering-hiring-at-slack/">they automate parts of the take-home task review</a> so that a script converts a GitHub pull request into an anonymized markdown file, ensuring that graders are unaware of the candidate&apos;s identity on GitHub.</p><p>Finally, at The New York Times, the engineering team makes sure that at each step of the process a candidate is evaluated by multiple interviewers in order to prevent blind spots. 
They&apos;ve tried anonymized resume reviews, where you look at a resume with some fields removed, such as the identity of the candidate and names of their previous employers. They share their thoughts on this in <a href="https://open.nytimes.com/how-we-designed-our-front-end-engineer-hiring-process-9b8f20cc31fb">How We Designed Our Front-End Engineer Hiring Process</a>.</p><hr><p>Dig deeper with Blogboard search:</p><blockquote><a href="https://blogboard.io/search?searchQuery=engineering%20interviews">Engineering interviews</a></blockquote><blockquote><a href="https://blogboard.io/search?searchQuery=technical%20interviews">Technical interviews</a></blockquote><blockquote><a href="https://blogboard.io/search?searchQuery=hiring%20engineers">Hiring engineers</a></blockquote><!--kg-card-begin: html--><p style="background-color: rgb(203 234 251 / 40%); padding: 2.5em 1em;"><strong><a href="https://blogboard.io">blogboard.io</a> </strong>-<strong> </strong>Engineering blogs from top tech companies. Search, discover, follow.</p>

    <!--kg-card-end: html--><p></p>]]></content:encoded></item><item><title><![CDATA[Code Review Best Practices - Lessons from the Trenches]]></title><description><![CDATA[In this article we'll go over learnings shared by engineers from companies such as PayPal, Palantir, Medium, and Shopify. Why do code reviews, and how do they help your team develop?]]></description><link>https://blogboard.io/blog/code-review-best-practices/</link><guid isPermaLink="false">5fce4c02d0e7864c1fa21f29</guid><dc:creator><![CDATA[Drazen Zaric]]></dc:creator><pubDate>Mon, 07 Dec 2020 16:22:29 GMT</pubDate><media:content url="https://blogboard.io/blog/content/images/2021/08/markus-spiske-cvBBO4PzWPg-unsplash.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://blogboard.io/blog/content/images/2021/08/markus-spiske-cvBBO4PzWPg-unsplash.jpg" alt="Code Review Best Practices - Lessons from the Trenches"><p>There&apos;s a ton of resources scattered around the web dealing with code review fundamentals, best practices, tools, etc. In this article we&apos;ll summarize the lessons from a dozen official company engineering blogs. You can find links to the original articles in <a href="https://blogboard.io/search?searchQuery=code%20review">this Blogboard search</a>.</p><h2 id="what-s-in-this-article">What&apos;s in this article?</h2><p>We&apos;ll cover several topics:</p><ol><li>Why do code reviews?<br>Besides the obvious one, quality assurance, there are other benefits to code reviews</li><li>Code reviews as quality assurance<br>We&apos;ll cover the general recommendations on what to look for in a code review, why having a review checklist is beneficial, and you&apos;ll get a fairly long checklist that you can use as a base for your own list</li><li>Code reviews as a team improvement tool<br>If you&apos;ve done more than a few code reviews, you know they&apos;re useful for more than just preventing bugs. 
We&apos;ll summarize common views on how reviews are beneficial as a learning and team bonding tool</li><li>Preparing a pull request for review<br>Lessons for pull request authors. There are rules of thumb, consistently pointed out, that help you prepare a PR for a smooth review</li><li>Reviewing code - Be human!<br>Lessons for reviewers on how the wording and tone of your comments can make a huge difference in the effectiveness of the whole review effort.</li></ol><p>The topics are covered fairly independently, so if you&apos;re curious about a particular topic feel free to skip ahead.</p><h2 id="why-do-code-reviews">Why do code reviews?</h2><p>It should be obvious that the primary purpose of code review is to assess the quality of the changes being introduced. I mean, the dictionary definition of <em>review </em>says precisely that:</p><blockquote><strong>review</strong> <em>(noun) - </em>a formal assessment of something with the intention of instituting change if necessary.</blockquote><p>Of course, code being code, there&apos;s a lot that can be checked and tested automatically, so there&apos;s nuance to what actually needs attention in a code review. We cover that in the next section.</p><p>On the other hand, code review is a form of communication between the <em><strong>author</strong> </em>of the change (these days usually <em>a pull request</em>) and one or several <em><strong>reviewers</strong>. </em>So it has side effects that go beyond preventing bugs from slipping in or keeping the codebase consistent in terms of style and architecture. </p><p>When done well, code reviews help accelerate learning across the team, create psychological safety for all team members, help establish and communicate best practices, teach proper communication and improve team dynamics. 
When done poorly, they can erode all of the above.</p><h2 id="code-reviews-as-quality-assurance">Code reviews as quality assurance </h2><p>There are a bunch of ways in which code reviews help maintain the quality bar for the codebase and the product. In the end it comes down to catching mistakes at a level that can hardly be tested automatically, such as architectural inconsistencies. Also, the code for automated tests should itself be reviewed, so there&apos;s a meta level at which reviews help with QA. </p><p>In <a href="https://engineering.gusto.com/high-leverage-code-reviews/">Giving High Leverage Code Reviews</a>, Casey Rollins advocates for having a checklist with all the usual things that need attention. </p><blockquote>When I&#x2019;m reviewing a pull request, I often do multiple &#x201C;passes&#x201D; where I focus on one attribute at a time. I start at the beginning and review the pull request with a single attribute in mind before moving on to the next. When I&#x2019;ve worked through the checklist, I submit the review.<br><br>This checklist moves from general to specific checks because it&#x2019;s important to focus on the high-level attributes first. It doesn&#x2019;t make sense to offer a variable name suggestion if you&#x2019;re also suggesting that an entire class or function be refactored.</blockquote><p>You can have your own checklist or make it a shared list for the team or a project. There&apos;s a ton of material written on the usefulness of checklists. In <em><a href="https://en.wikipedia.org/wiki/Getting_Things_Done">Getting Things Done</a>, </em>David Allen puts forward a simple idea -<em> </em>our minds are great at processing information, but terrible at storing and recalling it. 
That&apos;s why checklists are a great way of externally storing and breaking down a planned or repetitive task.</p><p>Compiled from several articles (<a href="https://medium.com/paypal-engineering/effective-code-reviews-53d62a203b2f">1</a>, <a href="https://engineering.gusto.com/high-leverage-code-reviews/">2</a>, <a href="https://medium.com/palantir/code-review-best-practices-19e02780015f">3</a>), here&apos;s a high-level list of things to be concerned about when reviewing a code change:</p><ul><li>Story alignment - does the change meet the requirements of the task at all; i.e., does the code implement any and all of the specified functionalities?</li><li>Consistency across the codebase</li><li>Architectural considerations - how does the new piece of code fit the existing architecture? Can the new feature&apos;s architecture be improved? Is it too generic or not extensible enough?</li><li>Simplicity/over-engineering</li><li>Performance concerns - are there specific cases (e.g. peak load times) when the code will break? Do the queries pull more data than necessary? Could new queries benefit from adding new indexes to the database?</li><li>Accidental errors such as typos or errors in math formulas - these can be either obvious or really tricky to notice, especially with math-heavy code</li><li>Compliance with laws and regulations - depending on the business this might be the most important thing</li><li>Security concerns - are there any exploitable pieces of code being introduced? Are any secrets being shared or stored unsafely?</li><li>Readability and style - a seemingly perfect piece of code might not be immediately understandable and readable to a different pair of eyes. Is it possible to understand the changes without the author explaining them?</li><li>Best practices - programming languages usually have their best practices - are they met in the pull request? 
Also, with time any project, team and company will evolve their own set of best practices - code reviews are a way to enforce and spread knowledge about them</li><li>Localization - are all language-dependent resources localized properly?</li><li>Dependencies - are there external libraries or APIs being introduced? Are there other simpler/faster/better ways to do this with different dependencies or without any?</li><li>Interactions and side effects - how does the new piece of code interact with the rest of the codebase? Does the new function implementation break any existing functionality? Are all relevant unit tests updated/added?</li><li>Logging - it&apos;s practically impossible to debug server code properly without good logging. Is everything logged/traced correctly?</li><li>Error handling - how are errors handled on the backend? How are they communicated to the user? Are fallbacks activated where possible?</li><li>Testability/Test coverage - is the new piece of code covered with automated tests? Have all the suspicious test cases been checked either automatically or manually? Is the code written in a way that&apos;s suitable for unit testing?</li><li>External documentation - in case it&apos;s necessary, is the external documentation updated to reflect the change?</li></ul><p>It&apos;s a pretty long list. In addition, a recurring piece of advice is not to use code reviews in place of static code analysis tools. 
If your review is mostly about code formatting, variable naming and alphabetical ordering, it might be a good time to add an automated code analysis tool to your development workflow.</p><h2 id="code-reviews-as-a-team-improvement-tool">Code reviews as a team improvement tool</h2><p>In <em><a href="https://medium.com/paypal-engineering/effective-code-reviews-53d62a203b2f">Effective Code Reviews: Bettering Products, Teams, and Engineers</a> </em>from PayPal engineering<em>, </em>Gabriel McAdams points out several important benefits of code reviews related to team dynamics:</p><ul><li>Team cohesion - by making everyone&apos;s code subject to peer review, the code review process promotes <em>individual accountability, healthy conflict</em> and the idea that everyone&apos;s<em> working together</em> to make the product better. As said in <a href="https://medium.com/palantir/code-review-best-practices-19e02780015f">Code Review Best Practices</a>: <em>Code reviews are classless: being the most senior person on the team does not imply that your code does not need review.</em><br>In summary, McAdams puts it nicely: <em>Trust + healthy conflict + individual accountability + working together to better the team = team cohesion.</em></li><li>Free career improvement training - simply by virtue of reviewing other people&apos;s code you become more skilled at reading and understanding new code. I&apos;ve heard it said that one of the foremost traits of great engineers is the ability to dive into and dissect a completely unfamiliar piece of code. 
Over time you learn how to spot common practices, little tricks, pieces of syntactic sugar, architectural abstractions and how to appreciate different mental models used to solve the same problem.</li></ul><p>In <a href="https://medium.com/palantir/code-review-best-practices-19e02780015f">Code Review Best Practices</a> from the Palantir Blog, Robert Fink lists several ways in which knowledge sharing and social side effects happen via code reviews:</p><ul><li>Authors are motivated by the peer review process to do all the necessary pre-checks, tie up the loose ends and generally tidy up the code before sending it to review</li><li>A code review explicitly communicates changes made to product functionality to team members</li><li>The author may have used a technique, abstraction or algorithm that reviewers are unfamiliar with. The opposite can also be the case - reviewers might be aware of a more appropriate way to solve a given problem</li><li>Positive communication strengthens social bonds within the team (this might be especially true for remote teams)</li></ul><h2 id="preparing-a-pull-request-for-review-help-the-reviewer">Preparing a pull request for review - help the reviewer</h2><p>Code reviews should be seen as a team effort. 
Once you view them that way it becomes clear that both sides - the author and the reviewers - have their distinct sets of responsibilities.</p><p>In <a href="https://medium.engineering/the-code-review-mindset-3280a4af0a89">this short post</a> on Medium Engineering blog, Xiao Ma describes how a different perspective changes the way code reviews are done, how feedback is taken and how people on each side benefit by adopting a <em>positive mindset</em> about code reviews.</p><p>When we talk about the responsibilities of the pull request author, there are several key things recurring in all code review guides.</p><ol><li><strong>Make pull requests as atomic as possible</strong><br><a href="https://shopify.engineering/great-code-reviews">At Shopify</a> they advise to keep <em>your pull requests small </em>- it helps the reviewer dive into it and finish it as an atomic piece of work in their workday. In practice this can mean keeping your pull requests limited to <em>a single concern. </em>A single concern here means a single bug fix, a feature, an API change etc. Don&apos;t mix refactoring that doesn&apos;t alter behavior with bug fixes or new features. This is beneficial both for the ease of doing the code review but also helps keep the codebase maintainable (for example, atomic pull requests are easier to rollback). <br>You can find practically the same advice in posts from <a href="https://kickstarter.engineering/a-guide-to-mindful-communication-in-code-reviews-48aab5282e5e">Kickstarter Engineering</a>, <a href="https://engineering.gusto.com/high-leverage-code-reviews/">Gusto Engineering</a> and <a href="https://medium.com/palantir/code-review-best-practices-19e02780015f">Palantir</a>.</li><li><strong>Provide a helpful pull request description</strong><br><em>&quot;Give your reviewers a map&quot;.</em> &#xA0;It&apos;s true that you should pick the teammates that are the most familiar with the part of code you&apos;ve changed. 
But even a few sentences describing why/what/where of the pull request can greatly help the reviewer to navigate your pull request. </li><li><strong>Test before review</strong><br>Make sure you&apos;ve reviewed and tested the pull request before submitting for review. You want to make sure that all relevant files are included, that the PR passes the build and automated tests, that all suggestions from automated review tools are addressed.</li></ol><h2 id="reviewing-code-be-human-">Reviewing code - Be Human!</h2><p>The most frequently recurring piece of advice, and perhaps the least obvious, is the importance of the tone of communication in code reviews.</p><p>In a Kickstarter Engineering article <a href="https://kickstarter.engineering/a-guide-to-mindful-communication-in-code-reviews-48aab5282e5e">A Guide to Mindful Communication in Code Reviews</a>, Amy Ciavolino lists many tips for improving communication on both sides of a code review. In Amy&apos;s words: <em>&quot;Technical skills are needed to review code thoroughly and quickly. But beyond that, reviewing code is also a form of communication, teaching, and learning. Either as the author or the reviewer of code, being mindful in your communications can make code reviews more valuable for everyone.&quot;</em></p><p>The article contains tips on how to be mindful of the author and the purpose of the process when doing the review:</p><ul><li>Don&apos;t jump to conclusions, ask questions - assume the author knew what they were doing even when it seems completely wrong at first sight</li><li>No nitpicking - the fact that you are noticing tiny things like formatting inconsistencies is likely a sign you should consider using a linter on your project. 
In <a href="https://engineering.gusto.com/high-leverage-code-reviews/">Giving High Leverage Code Reviews</a>, Casey Rollins links nitpicking to the phenomenon of <em><a href="http://bikeshed.com/">bikeshedding</a> (</em>or<em> <a href="https://en.wikipedia.org/wiki/Law_of_triviality">Parkinson&apos;s Law of triviality</a>). </em>Long story short - just because it&apos;s easy to spot tiny mistakes doesn&apos;t mean that you have to insist on them being fixed. Be mindful and pragmatic.</li><li>Be biased towards approving; make it clear if something can be fixed later - as a reviewer you&apos;re not necessarily a gatekeeper with the power to block any pull request. Maybe an architectural concern or a far-in-future problem can be addressed in the next sprint, while pushing the fix to production as soon as possible.</li><li>Include example code or documentation - especially if you&apos;ve looked it up anyway. An important point: acknowledging that you needed to look something up can help junior members with <a href="https://en.wikipedia.org/wiki/Impostor_syndrome"><em>impostor syndrome</em></a><em>.</em></li></ul><h3 id="wording-makes-a-world-of-difference">Wording makes a world of difference</h3><p>A bug is a bug, a typo is a typo and there&apos;s no way around it. But even if it&apos;s an obvious mistake, there are often multiple ways to deliver the message. A code review riddled with comments like <em>This is duplicate; Fix this...; Feels slow. Make it faster; Read the style guidelines </em>can come across as too harsh no matter who the author is.</p><p>This is nicely pointed out in <a href="https://engineering.gusto.com/high-leverage-code-reviews/">Giving High Leverage Code Reviews</a>:</p><blockquote><em>At the core of a code review, you&#x2019;re providing feedback to your peers, which might be hard. But receiving feedback is harder. Everyone on your team is trying to do their best work, so take care in delivering your message. 
For example, if you&#x2019;re pointing out an error or asking a question, make it a team effort, not their fault. This might look like: &#x201C;Can we remove some of the duplication in this file?&#x201D; instead of &#x201C;You missed an edge case&#x201D;.</em></blockquote><p>Alejandro Lujan Toro <a href="https://shopify.engineering/great-code-reviews">offers several practical examples</a> of harsh comments that you can easily change to a more constructive tone:</p><!--kg-card-begin: html--><table width="100%" style="box-sizing: border-box; border-collapse: collapse; color: rgb(33, 35, 38); "><tbody style="box-sizing: border-box;"><tr style="box-sizing: border-box;"><td style="box-sizing: border-box; border: 1px solid rgb(66, 71, 76); padding: 16px; width: 223px;"><strong style="box-sizing: border-box;">Less of These</strong></td><td style="box-sizing: border-box; border: 1px solid rgb(66, 71, 76); padding: 16px; width: 458px;"><strong style="box-sizing: border-box;">&#xA0;More of These</strong></td></tr><tr style="box-sizing: border-box;"><td style="box-sizing: border-box; border: 1px solid rgb(66, 71, 76); padding: 16px; width: 223px;"><div style="box-sizing: border-box;"></div>Move this to Markdown</td><td style="box-sizing: border-box; border: 1px solid rgb(66, 71, 76); padding: 16px; width: 458px;"><div style="box-sizing: border-box;"></div>How about moving this documentation into our Markdown README file? That way we can more easily share with other users.<strong style="box-sizing: border-box;"></strong></td></tr><tr style="box-sizing: border-box;"><td style="box-sizing: border-box; border: 1px solid rgb(66, 71, 76); padding: 16px; width: 223px;"><div style="box-sizing: border-box;"></div>Read the Google Python style guidelines</td><td style="box-sizing: border-box; border: 1px solid rgb(66, 71, 76); padding: 16px; width: 458px;"><div style="box-sizing: border-box;"></div>We should avoid single-character variables. 
How about board_size or size instead?</td></tr><tr style="box-sizing: border-box;"><td style="box-sizing: border-box; border: 1px solid rgb(66, 71, 76); padding: 16px; width: 223px;">This feels too slow. Make it faster. Lightning fast.</td><td style="box-sizing: border-box; border: 1px solid rgb(66, 71, 76); padding: 16px; width: 458px;">This algorithm is very easy to read but I&#x2019;m concerned about performance. Let&#x2019;s test this with a large dataset to gauge its efficiency.</td></tr><tr style="box-sizing: border-box;"><td style="box-sizing: border-box; border: 1px solid rgb(66, 71, 76); padding: 16px; width: 223px;">Bool or int?</td><td style="box-sizing: border-box; border: 1px solid rgb(66, 71, 76); padding: 16px; width: 458px;">Why did you choose a list of bool values instead of integers?</td></tr></tbody></table><!--kg-card-end: html--><p>The trick is to approach code reviews as a team effort. Try to use <em>we</em> rather than <em>you</em> when suggesting changes. <a href="https://kickstarter.engineering/a-guide-to-mindful-communication-in-code-reviews-48aab5282e5e">Amy Ciavolino suggests</a> that you shouldn&apos;t even start reviewing if you&apos;re not in the right mood to give considerate feedback:</p><blockquote><em>When you&#x2019;re checking in, also consider how you&#x2019;re feeling in general. Giving kind and considered feedback takes time and energy. If you&#x2019;re hungry, tired, in a hurry, have a lot of meetings, etc., don&#x2019;t review code or send out a review. You need to fix those things first. If you don&#x2019;t care for yourself, you can&#x2019;t take care of others.</em></blockquote><p><strong>Don&apos;t forget to give praise!</strong></p><p>Once you realize that code reviews are not simply about finding bugs, this should come naturally. 
Maybe you&apos;ve learned something from the pull request, or the author has invested great effort and shown impressive attention to detail. Let them know that.</p><p>Giving praise in code reviews is especially important with newcomers. In <a href="https://stackoverflow.blog/2019/09/30/how-to-make-good-code-reviews-better/">How to Make Good Code Reviews Better</a>, Gergely Orosz suggests that code reviews need to be a <em>positive experience</em> for a newcomer:</p><blockquote><em><strong>Better code reviews</strong> pay additional attention to making the first few reviews for new joiners a great experience. Reviewers are empathetic to the fact that the recent joiner might not be aware of all the coding guidelines and might be unfamiliar with parts of the code. These reviews put additional effort into explaining alternative approaches and pointing to guides. They are also very positive in tone, celebrating the first few changes to the codebase that the author is suggesting.</em></blockquote><hr><!--kg-card-begin: html--><p style="background-color: rgb(203 234 251 / 40%); padding: 2.5em 1em;"><strong><a href="https://blogboard.io">blogboard.io</a> </strong>-<strong> </strong>Engineering blogs from top tech companies. Search, discover, follow.</p>

    <!--kg-card-end: html--><p></p><p><strong>&#x1F44B; </strong>Liked the article? Let&apos;s get in touch - follow me on Twitter <strong><a href="https://twitter.com/drazenxyz">@drazenxyz</a> </strong></p>]]></content:encoded></item></channel></rss>