
Behavioral Data Analysis with R and Python

by Diligejy 2024. 1. 8.

 

 

p.xii

we’ll spend a lot of time learning to make sense of data. In my role as a data science interviewer, I have seen many candidates who can use sophisticated machine learning algorithms but haven’t developed a strong sense for data: they have little intuition for what’s going on in their data apart from what their algorithms tell them.

 

p.xiii

If you’re in academia or a field that requires you to follow academic norms (e.g., pharmaceutical trials), this book might still be of interest to you— but the recipes I’m describing might get you in trouble with your advisor/editor/manager. This book is not an overview of conventional behavioral data analysis methods, such as T-test or ANOVA. I have yet to encounter a situation where regression was less effective than these methods for providing an answer to a business question, which is why I’m deliberately restricting this book to linear and logistic regression.

 

p.xvi

One of the steps of going from beginner to intermediate level as a programmer is to stop writing scripts in which your code is just a long succession of instructions and to structure your code into functions instead.

 

p.4

Before getting to that, readers familiar with predictive analytics may wonder why I’m advocating for causal analytics instead. The answer is that even though predictive analytics have been (and will remain) very successful in business settings, they can fall short when your analyses pertain to human behaviors. In particular, adopting a causal approach can help us identify and resolve “confounding,” a very common problem with behavioral data.

 

p.4 

Descriptive analytics is the simplest form of analytics, but it is also underappreciated. Many organizations actually struggle to get a clear and unified view of their operations. To see the extent of that problem in an organization, just ask the same question of the finance department and the operations department and measure how different the answers are.

 

p.8

A regression appropriate for predictive analytics would often make a terrible regression for causal analytics purposes, and vice versa.

 

p.8

Therefore our variable mix has to be crafted not to create the most accurate prediction but to create the most accurate coefficients. 

 

p.16~17

The technical term for this phenomenon is the Berkson paradox, but Judea Pearl and Dana Mackenzie call it by a more intuitive name: the "explain-away effect." If one of your customers has a strong taste for vanilla, this completely explains why they are shopping at your stand, and they don't "need" to have a strong taste for chocolate. On the other hand, if one of your customers has a weak taste for vanilla, this can't explain why they are shopping at your stand, and they must have a stronger than average taste for chocolate.

 

The Berkson paradox is counterintuitive and hard to understand at first. It can cause biases in your data, depending on how it was collected, even before you start any analysis. A classic example of how this situation can create artificial correlations is that some diseases show a higher degree of correlation when looking at the population of hospital patients compared to the general population. In reality, of course, what happens is that either disease alone is often not enough to degrade someone's health to the point of justifying hospitalization; it is only when both are present that hospitalization becomes likely.
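The hospital example can be reproduced in a few lines of simulation (a sketch with assumed numbers, not data from the book): two disease severities that are independent in the general population become negatively correlated once we condition on hospitalization, because hospitalization is a joint effect of both.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Two independent "disease severity" scores in the general population
disease_a = rng.normal(size=n)
disease_b = rng.normal(size=n)

# Hospitalization happens only when the combined burden is high enough
# (the threshold of 2 is an illustrative assumption)
hospitalized = (disease_a + disease_b) > 2

corr_all = np.corrcoef(disease_a, disease_b)[0, 1]
corr_hosp = np.corrcoef(disease_a[hospitalized], disease_b[hospitalized])[0, 1]

print(f"Correlation in general population: {corr_all:.3f}")
print(f"Correlation among hospitalized:    {corr_hosp:.3f}")
```

Selecting on the joint effect "explains away": among hospitalized patients, a low severity of one disease implies a high severity of the other, producing an artificial negative correlation.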

 

p.21

For our purposes, we’ll define as personal characteristics all the information we have about a person that changes only rarely or very gradually over the relevant time frame for our analysis.

 

p.22

I would argue that we can resolve these issues by defining a cause as a “contributing factor” in a probabilistic sense. Reaching one’s 40s is neither a necessary nor a sufficient reason to have a midlife crisis, and having a midlife crisis is neither a necessary nor a sufficient reason to buy a red Corvette. In fact, both causal relationships are very heavily intertwined with other contributing factors: the contribution of age to existential qualms depends a lot on social patterns such as professional and family trajectories (e.g., the age at which someone enters the labor market or has their first child, assuming they do either one), and resolving such qualms through consumption is possible only if one has enough available income. It is also dependent on the degree to which one is influenced by advertising.

 

As a behavioral science mantra puts it, "behavior is a function of the person and the environment," and social factors often have arguably more weight than demographic variables. From a causal modeling and data analysis perspective, this interplay between social phenomena and personal characteristics can be captured through the use of moderation and mediation.

 

p.23

Cognition and emotions encompass all of that, as well as more nebulous business buzzwords and phrases such as customer satisfaction (CSAT) and customer experience (CX). CX has become a business mantra: many companies have a CX team, and there are conferences, consultants, and books devoted to the topic. But what exactly is it? Can you measure its causes and its effects? Yes, you can, but it requires intellectual humility and the willingness to spend time on sleuthing work.

 

p.23

This brings us to one of the biggest differences between UX or human-centered design and behavioral science: UX begins with the presumption that human beings know what they want, how they feel about something, and why, whereas behavioral science begins with the presumption that we are unaware of a lot of things going on in our own heads. To use a legal metaphor, a behavioral scientist will often treat what someone says as suspect until proven trustworthy, whereas a UX researcher will treat it as honest until proven misleading.

 

p.25

The rule of thumb I often give people is that an action or behavior is something you should be able to observe if you were in the room at that moment without having to ask the person. “Buying something on Amazon” is an action. So is “reading a review of a product on Amazon.” But “knowing something” or “deciding to buy something on Amazon” is not. You can’t know that someone has made a decision unless you either ask them or see them acting on that decision (which is a consequence but not the same thing).

 

p.27

On the other hand, from a data collection and analysis perspective, business behaviors can be an analyst’s worst nightmare: like water to fish, they can be invisible to an organization, and their effects on individual behaviors then become intractable noise. This happens for two reasons. 

 

First, many organizations, if they track business behaviors at all, simply don’t track them at the same level of detail as customer behaviors. 

 

Let’s say that C-Mart experimented with reducing its hours of business in the summer of 2018, leading to a temporary reduction in sales. Good luck figuring that out from the data alone! Many business rules, even when they are implemented in software, are simply not logged anywhere in a machine-readable format. If the corresponding data has indeed been recorded, it is often stored only in a departmental database (or worse, an Excel file) instead of the enterprise data lake.

 

Second, business behaviors can affect the interpretation of variables for customer behaviors. The clearest example of that would be sludges—intentional frictions and misleading communication introduced to confuse customers. Imagine a form on a website which, when you enter your email address, automatically checks the box “I want to receive marketing emails” that you had unchecked at the beginning of the form. Would that checked box really indicate that the customer wants to receive marketing emails? Beyond such obvious examples, business behaviors can be found lurking behind many customer behaviors, especially in the realm of sales. Many propensity-to-buy models should have as a caveat “among the people our sales team decided to call.” Paradoxically, while the compensation structure of sales representatives is often one of the levers that business leaders obsess most about, it is rarely included in models of customer purchasing behaviors.

 

Ultimately, getting reliable data about business behaviors, especially over time, can be a formidable challenge for behavioral data analysts—but that means that it’s also one way they can create value for their organization before running any analysis.

 

p.28~29

The same mindset applies to your data. Unless you happen to be among the very first employees of a startup, you'll be dealing with existing data and legacy processes.

 

Don't panic, and don't start going through your table list in alphabetical order. Start with a specific business problem and identify the variables that are most likely to be inaccurate, in decreasing order of their importance for the business problem:

 

1. Causes and effects of interest

2. Mediators and moderators, if relevant

3. Any potential confounder

4. Other nonconfounding independent variables (a.k.a. covariates)

 

You'll have to make judgement calls along the way: for example, should you include a certain variable in your analysis, or is it so poorly defined that you're better off without it? Unfortunately, there is no clear-cut criterion to make these calls correctly;

 

You'll have to rely on your business sense and expertise. There is, however, a clear-cut way to make these calls incorrectly: pretend that they don't exist.

 

A variable will or will not be included in your analysis, and there is no way around that fact. If your instinct leans toward inclusion, as is likely, then document why, describe potential sources of error, and indicate how the results would be different if the variable were omitted. As a UX researcher once put it in a friendly chat with me, being a researcher in business means constantly figuring out "what you can get away with."

 

p.29~30

Unfortunately, in many circumstances, the way data is recorded is driven by business and financial rules and is transaction-centric rather than customer-centric. This means that you should consider variables suspicious until proven innocent: in other words, do not automatically assume that the variable CustomerDidX means that the customer did X. It may mean something entirely different. For example:

 

- The customer checked a box without reading the fine print that mentioned that they were agreeing to X.

- The customer didn't say anything, so we defaulted them to X.

- The customer stated that they did X, but we can't verify.

- We bought data from a vendor indicating that the customer regularly did X at some point in their life.

 

Even if the customer actually did X, we can't assume their intent. They may have done this:

 

- Because we sent them a reminder email

- Four times in a row because the page was not refreshing

- Mistakenly, when they really wanted to do Y

- A week ago, but due to regulatory constraints we recorded it only today.

 

In other words, to paraphrase a popular line from The Princess Bride: "You keep using that variable. I do not think it means what you think it means."

 

p.30~31

In my experience, many business analytics projects fail or deliver underwhelming results because the analyst has not clarified what the project is about. Organizations always have an overarching target metric—profit for companies, client outcomes for nonprofits, etc. At a lower level, departments often have their own target metrics, such as Net Promoter Score for the customer experience team, downtime percent for IT, etc. If a business partner asks you to measure or improve a variable that seems unrelated to one of these target metrics, it generally means that they have in mind an implicit and possibly faulty behavioral theory connecting the two.

 

“Customer engagement,” another buzzword concept that behavioral scientists are often asked to improve, is a good example of that phenomenon. It’s not clear cut where it belongs, because it could really refer to two different things: 

 

• A behavior, namely the broad pattern of interactions with the business: customer A is deemed more engaged than customer B if customer A logs on to the website more often and spends more time navigating it. 

• A cognition or emotion, as when an audience is “engaged” with a movie or a course because they are engrossed in the flow and eager to know what comes next. 

 

Indeed, I strongly believe that the confusion between these two things explains a large part of the appeal of engagement metrics for startups and the broader digital world, even though they may be misleading. For example, in the first sense of the word, I’m more engaged with my washing machine when it stops working; that doesn’t translate into enjoyment and eagerness in the second sense of the word. Organizations that try to increase engagement as a behavior are often disappointed with the results. When engagement as behavior doesn’t translate into engagement as emotion, it doesn’t lead to desirable outcomes such as higher loyalty and retention.

 

As a personal example, a business partner once asked me for help to get employees to do a certain training. After some discussion, it became clear that what she really wanted was for employees to comply with a business rule; she believed that they didn’t comply because they were not sufficiently informed about the rule. We pivoted the project toward understanding why employees didn’t comply and how to encourage them to do so. In short: beware the self-diagnosing patients!

 

p.32

As I mentioned earlier, a variable being "about a behavior" is not the same thing as being a behavioral variable.

 

p.32

Aggregate metrics can offer useful snapshots for reporting purposes, but they can fall prey to biases and confounding factors such as changes in population composition (a.k.a. customer mix), especially when they are calculated based on time intervals.

 

For example, let’s imagine that a successful marketing campaign brings a lot of new users to AirCnC’s website. Let’s also assume that in that line of business, a significant share of new customers cancel their account in their first month. Thus AirCnC’s daily cancel rate may spike alarmingly during the month following the campaign, even though nothing went wrong. A good rule of thumb is that sound aggregate variables are based on sound individual variables. If a variable makes sense only in the aggregate and doesn’t have a meaningful interpretation at the individual level, that’s a red flag. In our example, a meaningful individual counterpart to the cancellation rate would be the cancellation probability. When controlling for individual characteristics and tenure with the company, this metric would remain stable despite the influx of new customers.
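The cancel-rate spike described above is easy to reproduce in a quick simulation (the cancellation probabilities and customer counts below are illustrative assumptions, not AirCnC figures):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed individual-level probabilities: new customers cancel in
# their first month with p=0.20, tenured customers with p=0.02
P_NEW, P_TENURED = 0.20, 0.02

def monthly_cancel_rate(n_new, n_tenured):
    cancels = rng.binomial(n_new, P_NEW) + rng.binomial(n_tenured, P_TENURED)
    return cancels / (n_new + n_tenured)

# Before the campaign: mostly tenured customers
rate_before = monthly_cancel_rate(n_new=1_000, n_tenured=9_000)
# After a successful campaign: a large influx of new customers
rate_after = monthly_cancel_rate(n_new=6_000, n_tenured=9_000)

print(f"Cancel rate before campaign: {rate_before:.3f}")
print(f"Cancel rate after campaign:  {rate_after:.3f}")
# The aggregate rate spikes even though each individual's cancellation
# probability (the sound individual-level variable) is unchanged.
```

The change in customer mix alone moves the aggregate metric; conditioning on tenure, nothing went wrong at the individual level.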

 

p.32

businesses often aggregate together a variety of different behaviors sharing a common intent. For example, there might be three different ways for a customer to change their billing address with AirCnC: by going to their account settings, by editing the information when finalizing a booking, and by contacting the call center. These would look different to someone watching the customer in the moment, but may be logged similarly in the database.

 

p.33

In many circumstances, identifying or creating a satisfying behavioral variable involves “getting your hands dirty.” Databases for analytical or research purposes often offer a “cleaned up” version of the truth as it would appear in the transaction databases, listing only the most up-to-date, vetted-out information. This makes perfect sense in most circumstances: if a customer made a booking and then canceled and was refunded, we wouldn’t want that amount to count toward the AmountSpent variable. After all, from a business perspective, AirCnC didn’t get to keep that money.

However, from a behavioral perspective, that customer is different from a customer who didn’t make any booking over the same time period, and there are analyses for which it would be relevant to take it into account. Don’t go and learn an ancient programming language like COBOL just to access the lowest level databases, but it’s worth digging around a bit beyond your usual pretty tables.

 

p.33

Timestamps are gold nuggets for behavioral analysts because they provide intuitive and often readily actionable insights.

 

p.34

Duration also offers a natural way to measure decaying effects. Things that you did or that happened a long time ago tend to have smaller effects than more recent occurrences. This often makes duration a good predictive variable. If a customer hasn’t left AirCnC after a bad experience five years ago, it probably doesn’t impact their decisions much anymore, and it would be better to weight the CSAT of past trips by how long ago they happened rather than just use an average.
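One simple way to implement such recency weighting is an exponential decay on the age of each score. This is a sketch, not the book's method; the half-life value is an illustrative assumption:

```python
import numpy as np

def recency_weighted_csat(scores, months_ago, half_life=12.0):
    """Weight each past CSAT score by an exponential decay in its age.

    half_life: months after which a score's weight is halved
    (an assumed value for illustration).
    """
    scores = np.asarray(scores, dtype=float)
    months_ago = np.asarray(months_ago, dtype=float)
    weights = 0.5 ** (months_ago / half_life)
    return np.average(scores, weights=weights)

# A bad trip 60 months ago barely moves the weighted score,
# while a plain average is pulled down much more.
scores = [5, 5, 1, 4]
months_ago = [1, 3, 60, 2]
print(f"Plain average:    {np.mean(scores):.2f}")
print(f"Recency-weighted: {recency_weighted_csat(scores, months_ago):.2f}")
```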

 

p.34

A customer calling AirCnC’s call center to change their billing information after having tried to change it online exhibits a different behavior compared to a customer who calls directly. 

 

p.34

One of the best ways to aggregate behavioral data is to create variables for “doing Z after doing X.”
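As a sketch, such a "did Z after doing X" flag can be derived from an event log with pandas. The column names and event labels below are hypothetical, not from the book:

```python
import pandas as pd

# Hypothetical event log
events = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3],
    "event":       ["edit_online", "call_center", "call_center",
                    "edit_online", "edit_online"],
    "timestamp":   pd.to_datetime(["2024-01-01 10:00", "2024-01-01 10:30",
                                   "2024-01-02 09:00", "2024-01-02 09:15",
                                   "2024-01-03 12:00"]),
})

def did_z_after_x(group, x="edit_online", z="call_center"):
    """True if the customer did z at least once after their first x."""
    x_times = group.loc[group["event"] == x, "timestamp"]
    if x_times.empty:
        return False
    first_x = x_times.min()
    return ((group["event"] == z) & (group["timestamp"] > first_x)).any()

flags = events.groupby("customer_id")[["event", "timestamp"]].apply(did_z_after_x)
print(flags)
# Customer 1 called after editing online; customers 2 and 3 did not.
```

Note that the ordering matters: customer 2 also both edited online and called, but in the opposite order, so the flag stays False.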

 

p.34

Modern life has its rhythms and schedules that are common knowledge. Because of their granularity, it is often better to start with an “hour of the week” variable instead of having separate “hour of the day” and “day of the week” variables (in local time, of course). Depending on your line of business, you may be able to aggregate things further into variables like “weekday evenings,” etc.
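Computing an hour-of-week variable from a local-time timestamp is a one-liner in pandas (the encoding below, 0 = Monday 00:00 through 167 = Sunday 23:00, is one reasonable convention, not the book's):

```python
import pandas as pd

# Hypothetical local-time timestamps
ts = pd.Series(pd.to_datetime([
    "2024-01-08 09:30",  # a Monday morning
    "2024-01-13 20:15",  # a Saturday evening
]))

# Hour of the week: 0 = Monday 00:00 ... 167 = Sunday 23:00
hour_of_week = ts.dt.dayofweek * 24 + ts.dt.hour
print(hour_of_week.tolist())  # [9, 140]
```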

 

p.35

What people don’t do can often be as interesting as what they do.

 

p.39

A more accurate sound bite for introductory statistics would be that a simple correlation implies an unresolved causal structure.

 

p.40

A Causal Diagram is a visual representation of variables, shown as boxes, and their relationships to each other shown as arrows going from one box to another.

 

p.41

An analogy from physical sciences would be a magnet, a bar of iron, and the magnetic field around the magnet. You can't see the magnetic field but it exists nonetheless, and it affects the iron bar. You may not have any data on the magnetic field and maybe you've never seen the equations describing it, but you can sense it as you move the bar, and you can develop intuitions as to what it does.

 

The same perspective applies when we want to understand what drives behaviors. We intuitively understand that human beings have habits, preferences, and emotions, and we treat these as causes even though we often don’t have any numeric data about them. When we say, “Joe bought peanuts because he was hungry,” we are relying on our knowledge, experience, and beliefs about humans in general and Joe in particular. We treat hunger as a real thing, even if we’re not measuring Joe’s blood sugar or brain activation.

 

p.46

If you have a quantitative background such as data science, you may be tempted to focus on the connection between CDs and data at the expense of the connection with behaviors. It is certainly a viable path, and it has given birth to an entire category of statistical models called probabilistic graphical models. For instance, algorithms have been and are still developed to identify causal relationships in data without relying on human expertise or judgment. However, this field is still in its infancy, and when applied to real life data, these algorithms are often unable to select between several possible CDs that lead to vastly different business implications. Business and common sense can frequently do a better job of selecting the most reasonable one. Therefore I strongly believe that you are better off using the mixed approach shown in this book’s framework and accepting the idea that you’ll need to use your judgment. The back and forth that CDs enable between your intuitions and your data is—literally, in many cases—where the money is.

 


p.49

In many cases, looking at the variable in the middle of a chain, namely the mediator, will allow you to make better decisions.

 

p.50

Because chains can be collapsed or expanded at will, in general we do not explicitly indicate when it has been done. It’s always assumed that any arrow could potentially be expanded to highlight an intermediary variable along the way. 

 

This also implies that the definition of “direct” and “indirect” relationships mentioned earlier relates to a specific representation of a CD: when you collapse a chain, two variables that had an indirect relationship now have a direct relationship. 

 

Forks 

-> When a variable causes two or more effects, the relationship creates a fork.

p.52

Very few things in the world have only one cause. When two or more variables cause the same outcome, the relationship creates a collider. Since C-Mart’s concession stand sells only two flavors of ice cream, chocolate and vanilla, a causal diagram representing taste and ice cream purchasing behavior would show that appetite for either flavor would cause past purchases of ice cream at the stand (Figure 3-17).

 

 

Figure 3-17. CD of a collider

 

Colliders are a common occurrence, and they can also be an issue in data analysis. A collider is in a sense the opposite of a fork, and the problems with them are also symmetric: a fork is problematic if we don’t control for the joint cause whereas a collider is a problem if we do control for the joint effect. We’ll explore these issues further in Chapter 5.

 

p.53

Chains, forks, and colliders take the variables in a CD as given. But in the same way that a chain can be collapsed or expanded, variables can themselves be sliced or aggregated to “zoom” in and out of specific behaviors and categories. We may also decide to modify the arrows—for example, when we’re faced with otherwise intractable cycles.

 

p.56

As we’ll see later, randomization can allow us to control for demographic factors so that we won’t have to include them in our analysis, but we might want to include them in our CD of the situation without randomization. If need be, we can always expand our diagram to accurately represent the demographic variables involved. Remember, however, that any variable can be split, but only variables that have the same direct and indirect relationships can be aggregated.

 

p.57

One thing to note is that the direction of the arrows shows the direction of causality (what is the cause and what is the effect), not the sign of the effect. In all of the CDs we looked at previously, the variables had a positive relationship where an increase in one caused an increase in the other. In this case, the relationships are negative, where an increase in one variable will cause a decrease in the other. The sign of the effect does not matter for causal diagrams, and a regression will be able to sort out the sign for the coefficient correctly as long as you correctly identify the relevant causal relationships.
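A small simulation illustrates this point: the diagram only records that x causes y, yet a regression recovers both the sign and the magnitude of the (here negative) effect. The numbers are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Simulated causal relationship with a NEGATIVE effect:
# a one-unit increase in x causes a 2-unit decrease in y
x = rng.normal(size=n)
y = -2.0 * x + rng.normal(size=n)

# A simple linear regression sorts out the sign of the coefficient
slope, intercept = np.polyfit(x, y, 1)
print(f"Estimated coefficient: {slope:.2f}")  # close to -2.0
```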

 

p.63

At this point, you may be wondering where the [causal diagram] comes from. It’s an excellent question. It may be the question. A [CD] is supposed to be a theoretical representation of the state-of-the-art knowledge about the phenomena you’re studying. It’s what an expert would say is the thing itself, and that expertise comes from a variety of sources. Examples include economic theory, other scientific models, conversations with experts, your own observations and experiences, literature reviews, as well as your own intuition and hypotheses.

 

—Scott Cunningham, Causal Inference: The Mixtape (2021)

 

p.63

Once you have drawn that relationship, what comes next? How can you know what other variables you should include or not? Many authors say you should rely on expert knowledge, which is fine if you work in an established field like economics or epidemiology. But my perspective in this book is that you’re likely “behavioral scientist number one” in your organization and therefore you need to be able to start from a blank slate.

 

p.64

In addition, the recipe I’ll outline is not a mechanical algorithm that you could follow blindly to get to the right CD. On the contrary, business sense, common sense, and data insights will be crucial. We’ll go back and forth between our qualitative understanding of the causal situation at hand and the quantitative relationships present in the data, cross-checking one with the other until we feel that we have a satisfactory result. “Satisfactory” is an important word here: in applied settings, you usually can’t tell your manager that you’ll give them the right answer in three years. You need to give them the least bad answer possible in the short term, while planning the data collection work that will improve your answer over the years.

 

p.67

A nice aspect of using CDs for behavioral data analysis is that they are a great collaboration tool. Anyone in your organization with minimal knowledge of CDs can look at Figure 4-3 and say, “Well yeah, we require nonrefundable deposits for holiday bookings and these often get canceled because of weather,” or any other tidbit of behavioral knowledge that you couldn’t get otherwise.

 

At this point, the best next step would be a randomized experiment: assign refundable or nonrefundable deposits to a random sample of customers and you’ll be able to confirm or disprove your behavioral hypothesis. However, you may not be able to do so, or not yet. In the meantime, we’ll try to deconfound the relationship by identifying relevant variables to include.

 

p.72

As mentioned in Chapter 2, demographic variables are often valuable not so much for themselves but as proxies for other personal characteristics such as personality traits. The challenge at this step is therefore to resist the pull of whatever demographic variables are present in our data, and stick with our causal-behavioral mindset. A good way to do so is to think about traits first, before looking at demographic variables.

 

p.84

Measuring correlations between numeric and categorical variables is a more cumbersome process than measuring correlations within a homogeneous category. 

 

Saying that there is a correlation between a numeric and a categorical variable is equivalent to saying that the values of the numeric variable are different on average across the categories of the categorical variable. We can check if this is the case by comparing the mean of the numeric variable across the categories of the categorical variable.
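In pandas, this comparison is a simple group-by; the column names and values below are hypothetical, not from the book:

```python
import pandas as pd

# Hypothetical booking data
df = pd.DataFrame({
    "customer_type": ["business", "leisure", "business", "leisure", "business"],
    "amount_spent":  [250.0, 120.0, 310.0, 95.0, 280.0],
})

# "Correlation" between a categorical and a numeric variable:
# compare the numeric variable's mean across the categories
means = df.groupby("customer_type")["amount_spent"].mean()
print(means)
```

If the group means differ substantially, the two variables are "correlated" in the sense described above.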

 

p.84

If you’re unsure whether the variations are truly substantial or if they only reflect random sampling errors, you can build confidence intervals for them using the Bootstrap, as explained later in Chapter 7.
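A minimal percentile-bootstrap sketch for the difference in means between two categories might look like this (sample sizes and distributions are assumptions for illustration, and the book's own Chapter 7 implementation may differ):

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical samples of a numeric variable in two categories
group_a = rng.normal(loc=100, scale=15, size=200)
group_b = rng.normal(loc=110, scale=15, size=200)

def bootstrap_ci_diff(a, b, n_boot=2_000, alpha=0.05):
    """Percentile bootstrap CI for the difference in means (b - a)."""
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each group with replacement and record the difference
        ra = rng.choice(a, size=len(a), replace=True)
        rb = rng.choice(b, size=len(b), replace=True)
        diffs[i] = rb.mean() - ra.mean()
    return np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])

lo, hi = bootstrap_ci_diff(group_a, group_b)
print(f"95% bootstrap CI for the difference in means: [{lo:.1f}, {hi:.1f}]")
# If the interval excludes 0, the variation is unlikely to be
# pure random sampling error.
```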

 

p.85

Let’s imagine for instance that business customers are more likely to be repeat guests. They may then also appear to have a higher rate of previous cancellation than leisure customers even though among repeat guests, business and leisure customers have the exact same rate of previous cancellations. 

 

You can think of these causality assumptions as white lies: they’re not true, but it’s OK, because we’re not trying to build the true, complete CD, we’re trying to deconfound the relationship between NRD and cancellation rate. From that perspective, it is much more important to get the direction of arrows right than to have unconfounded relationships between variables outside of our variables of interest. If you’re still skeptical, one of the exercises in the next chapter explores this question further.

 

 
