Convert your Salesforce data to retention insights
The first four articles I published focused on the basic charts at your disposal that make up the vast majority of data viz use cases:
- Pie
- Histogram
- Scatter
- Bar
For this article, I’d like to introduce a more complex visualization, the cohort chart. A cohort chart shows the evolution of a defined segment and is an offshoot of the more general heatmap.
Typically, the segments are defined by time, and the evolution is shown over time. For example, you could show the cohort of users who signed up for your product in January 2023 and if they are still active month-over-month. Another example shows the cohort of users who signed up for your subscription service in February 2023 and if they are still active after 3, 4, 5, etc., orders.
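To make the idea concrete, here is a hedged toy sketch (entirely made-up users and months, not the article's data) of the underlying table a time-based cohort chart summarizes:

```python
import pandas as pd

# Toy activity log: one row per user per month they were active
activity = pd.DataFrame({
    "user": ["a", "a", "a", "b", "b", "c"],
    "signup_month": ["2023-01"] * 3 + ["2023-02"] * 3,
    "months_since_signup": [0, 1, 2, 0, 1, 0],
})

# Rows are cohorts, columns are months since signup,
# cells are the count of still-active users
cohorts = activity.pivot_table(
    index="signup_month",
    columns="months_since_signup",
    values="user",
    aggfunc="nunique",
)
```

The lower-right triangle of this table is empty by construction: newer cohorts simply haven't existed long enough to have later periods.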
Today, I will show you how to go from your Salesforce opportunity data to a cohort chart showing how many dollars you have retained month-over-month. However, you will also see that adjusting to fit your needs is just a matter of transforming and defining your x, y, and z data.
Why is this important to understand? It’s important because retention is the backbone of any subscription business, SaaS or otherwise. If your retention is low, then you won’t be able to grow your business because all of your efforts are focused on signing new customers to help you tread water. Low retention is a leaky bucket, and the cohort chart is a lens into where those leaks are.
The Data
A bit of a warning to you, the reader: this article will spend a lot of time explaining how to transform raw Salesforce opportunity data into a structure Plotly can use to generate a cohort chart. I typically forgo this step, but it is too important (and, admittedly, too complex) not to include. Before we get into it, a couple of disclaimers below:
I want to mention that not all Salesforce implementations are the same. You may not use opportunities to track new business and renewals. You may use opportunities for new business and then use contracts for renewals or maybe invoices instead of contracts. My company uses opportunities only. Later in the article, I will call out when your data should look like mine so you can focus on getting to that point before following along.
Lastly, the data I’m using for this article is not actual data from my company’s Salesforce instance. That would be incredibly dumb of me to do, and I’m sure my boss wouldn’t be thrilled about it 😅. What is the same is the code I use and the field names. All of the actual data is synthetic.
The Salesforce API
To extract data from Salesforce, we can use python-salesforce-api, which is a wrapper around the Salesforce REST and SOAP API.
I won’t go deep into all the features but rather focus on the minimum functionality needed to get the data we need. First, some boilerplate code after installing the library:
import os
from salesforce_api import Salesforce
# Create the client object with your credentials
client = Salesforce(
username=os.getenv("SFDC_USER"),
password=os.getenv("SFDC_PW"),
security_token=os.getenv("SFDC_TOKEN")
)
The first step is to create the client object with your credentials. The value for the security_token parameter is generated in your Salesforce profile on the “Reset My Security Token” page. When you click the reset button, the token is emailed to you.
Step two is to query the data you want. The API we’re working with exposes SOQL, a friendly SQL-like interface for querying objects that many analysts will be comfortable with.
opportunities_raw = client.sobjects.query(
f"""
SELECT *
FROM Opportunity WHERE StageName IN ('Closed Won','Renewal')
"""
)
Whoops! That didn’t work. SOQL does not allow the * wildcard, so we will have to specify the exact columns we want. Thankfully, we can find all the available fields very easily:
columns = [
col.get("name") for col in
client.sobjects.Opportunity.describe().get("fields")
]
client.sobjects.Opportunity points to the opportunity object, and .describe() gives us access to a dictionary of metadata about the object.
We want to look at the fields key, which gives us a list of dictionaries that contain metadata for each field, including the name. You can certainly look through the list of fields manually and pick out the ones you want, but in my experience, companies play fast and loose with their Salesforce account and create numerous custom fields that end up going unused. I don’t want to look through 500 fields that may or may not have descriptive names, so just add them all if you’re feeling a little lazy.
We can then convert this list to a comma-separated string to insert into the original query.
opportunities_raw = client.sobjects.query(
f"""
SELECT {', '.join(columns)}
FROM Opportunity WHERE StageName IN ('Closed Won','Renewal')
AND Contract_Start_Date__c >= 2020-01-01
"""
)
This could take some time, depending on how much data you’re pulling; specifying only the fields you actually need can help with speed if that becomes an issue. You can see that I limited the data I was pulling to only contracts that began in January 2020 or later.
What you end up getting is a list of opportunities, with each element being a dictionary of fields that you specified in the query. From here, it’s straightforward to get it into a pandas (or polars) data frame.
Quick aside: I’ve found the Polars library pretty intuitive once you let go of your Pandas habits. I haven’t committed to it yet in the interest of getting things done quickly, but eventually I’ll learn enough to make an informed decision about which to use as my go-to.
import re
import pandas as pd

keep_cols = [
    "Id", "AccountId", "CurrencyIsoCode", "Contract_Start_Date__c",
    "Contract_Expiration_Date__c", "Contract_Term_of_mo__c",
    "Subscription_Amount__c", "StageName", "CloseDate", "CreatedDate"
]
opportunities = pd.DataFrame(opportunities_raw)[keep_cols]
# Lowercase the names and strip the "__c" custom-field suffix
opportunities.columns = [
    re.sub("__c", "", col.lower()) for col in opportunities.columns
]
You can see that I figured out which fields were important for this exercise, and I’m subsetting the data frame accordingly. If you do this periodically, it would be better to put these directly into the query.
You should try to marry your data to mine at this point. You need the following:
- Opportunity ID
- Account ID
- Contract start date (or, more generally, the start of the period where the customer is paying you)
- Contract end date, or the length of the contract so you can calculate the end date (you can see I have both)
- The amount of money you are being paid
You don’t need the currency code if you only operate in one country, or if the amount is automatically converted to a single currency before it lands in Salesforce. My company operates in several countries and doesn’t convert before the data goes into Salesforce, so I wanted to include the idea here, as it adds another small wrinkle in the data transformation that you may have to contend with.
Also, I included the stage name, but it’s not required.
Again, this data is synthetic and doesn’t reflect my company’s reality, but I took inspiration from what I do for my company to write this article.
Transformation
Let’s go through each transformation step to build a cohort chart with Plotly.
1. Make sure we have the correct data for the cohorts
This means we only want opportunities that generate actual revenue and have a non-zero contract length (this may seem obvious, but many Salesforce instances are run by non-data folks, so weird things happen). We also want to take this opportunity (no pun intended) to convert the amounts to USD.
# These are relatively old exchange rates in case anyone was confused
exchange_rates = pd.DataFrame(
(
["EUR", "GBP", "INR", "MXN", "MGA", "AUS", "CAN", "CAD", "USD", "ARS"],
[1.1344, 1.3514, .0135, .0489, .0003, .7263, .7857, .7857, 1, .0091]
)
).transpose().rename(columns={
0: "currencyisocode",
1: "exchange_rate"
})
opportunities = opportunities.loc[
(opportunities["contract_term_of_mo"] > 0) &
(opportunities["subscription_amount"] > 0), :
].merge(exchange_rates, on="currencyisocode", how="left").assign(
converted_contract_value=lambda row: (
row["subscription_amount"]*row["exchange_rate"]),
subscription_mrr=lambda row: row["converted_contract_value"] /
row["contract_term_of_mo"]
)
I probably could have used the map function instead of creating a separate data frame and joining it in, but whatever, quibble with performance on your own time.
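For what it's worth, the map approach would look something like this sketch (toy rates and a hypothetical slice of the frame, not the article's data):

```python
import pandas as pd

# Toy subset of exchange rates as a plain dict
rates = {"EUR": 1.1344, "GBP": 1.3514, "USD": 1.0}

# Hypothetical slice of the opportunities frame
opportunities = pd.DataFrame({
    "currencyisocode": ["EUR", "USD", "GBP"],
    "subscription_amount": [1000.0, 500.0, 200.0],
})

# map() looks each code up in the dict; unknown codes become NaN
opportunities["exchange_rate"] = opportunities["currencyisocode"].map(rates)
opportunities["converted_contract_value"] = (
    opportunities["subscription_amount"] * opportunities["exchange_rate"]
)
```

One practical upside of the merge version in the article: a left join makes missing currency codes easy to spot with a single isna() check on the joined column.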
2. Create date fields for comparison
We will use these to compare to the current period to assign a subscription status.
from datetime import timedelta
from pandas.tseries.offsets import MonthEnd

# Ensure datetime format for expiration date and add 1 day
opportunities["contract_expiration_date"] = pd.to_datetime(
    opportunities["contract_expiration_date"],
    format="%Y-%m-%d"
) + timedelta(days=1)
# Create field containing the first day of the starting month
opportunities["contract_start_month_bom"] = pd.to_datetime(
    opportunities["contract_start_date"],
    format="%Y-%m-%d"
).apply(lambda x: x.replace(day=1))
# Create field containing the last day of the starting month
opportunities["contract_start_month_eom"] = pd.to_datetime(
    opportunities["contract_start_date"],
    format="%Y-%m-%d"
) + MonthEnd(1)
# Create field containing the first day of the ending month
opportunities["contract_end_month_bom"] = pd.to_datetime(
    opportunities["contract_expiration_date"],
    format="%Y-%m-%d"
).apply(lambda x: x.replace(day=1))
# Create field containing the last day of the ending month
opportunities["contract_end_month_eom"] = pd.to_datetime(
    opportunities["contract_expiration_date"],
    format="%Y-%m-%d"
) + MonthEnd(1)
To recap so far, we did the following:
- Removed bad data
- Converted the currency to USD
- Created a subscription_mrr field that will serve as the monthly revenue for our cohorts.
3. Create a data frame with the start and end dates for all possible months
months = pd.DataFrame(
[
list(pd.date_range(
start=opportunities["contract_start_month_bom"].min(),
end=opportunities["contract_start_month_eom"].max(),
freq='MS'
)),
list(pd.date_range(
start=opportunities["contract_start_month_bom"].min(),
end=opportunities["contract_start_month_eom"].max(),
freq='M'
)),
]
).transpose()
months.columns = ["period_start_date", "period_end_date"]
Just bear with me. Everything will make sense in time.
4. Create a cartesian join with our opportunities data frame and months data frame we just created in step 3
Pandas (before version 1.2) doesn’t have a cross-join option, so we can manufacture one by adding a dummy key column to both data frames and joining on it.
opportunity_months = opportunities.assign(key=1).merge(
months.assign(key=1),
on="key"
).drop("key", axis=1)
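Newer pandas (1.2+) actually added a built-in cross merge, which removes the need for the dummy key (a sketch with toy frames):

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2]})
right = pd.DataFrame({"month": ["2023-01", "2023-02", "2023-03"]})

# Every row of left paired with every row of right (pandas >= 1.2)
crossed = left.merge(right, how="cross")
```

With 2 rows on the left and 3 on the right, the result has 2 × 3 = 6 rows, exactly like the key trick above.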
You can see in the image that the data is all repeated except for the period start and end dates, which are incrementing by one month. This is the desired outcome, as we can now use date comparisons to understand active and churned customers.
5. Create a subscription status field
import numpy as np

opportunity_months = opportunity_months.assign(
    months_after_contract_start=lambda row: np.round(
        (row["period_start_date"] - row["contract_start_month_bom"])
        / np.timedelta64(1, 'M')),
    months_after_contract_end=lambda row: np.round(
        (row["period_start_date"] - row["contract_end_month_bom"])
        / np.timedelta64(1, 'M')),
    subscription_status=lambda row: np.where(
        row["period_start_date"] < row["contract_start_month_bom"],
        "Pre-Active",
        np.where(
            (row["period_start_date"] >= row["contract_end_month_bom"])
            & (row["months_after_contract_end"] == 0),
            "Churn",
            np.where(
                row["period_start_date"] > row["contract_end_month_bom"],
                "Post-Churn",
                "Active"
            )
        )
    )
)
Breaking down what is happening in the previous code block:
- Create months_after_contract_start and months_after_contract_end fields
- Create a subscription_status field to identify pre-active, active, churn, and post-churn months for each opportunity. The general logic is as follows:
IF period_start_date < contract_start_month_bom THEN
"Pre-Active"
ELSE
IF period_start_date >= contract_end_month_bom AND months_after_contract_end = 0 THEN
"Churn"
ELSE
IF period_start_date > contract_end_month_bom THEN
"Post-Churn"
ELSE "Active"
ENDIF
ENDIF
ENDIF
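As an aside, the same nested logic can be flattened with np.select, which takes the first matching condition in order, shown here on toy dates rather than the real frame:

```python
import numpy as np
import pandas as pd

# Toy frame: one contract running Jan-Feb 2023, observed over four months
df = pd.DataFrame({
    "period_start_date": pd.to_datetime(
        ["2022-12-01", "2023-01-01", "2023-03-01", "2023-04-01"]),
    "contract_start_month_bom": pd.to_datetime(["2023-01-01"] * 4),
    "contract_end_month_bom": pd.to_datetime(["2023-03-01"] * 4),
    "months_after_contract_end": [-3, -2, 0, 1],
})

conditions = [
    df["period_start_date"] < df["contract_start_month_bom"],
    (df["period_start_date"] >= df["contract_end_month_bom"])
    & (df["months_after_contract_end"] == 0),
    df["period_start_date"] > df["contract_end_month_bom"],
]
choices = ["Pre-Active", "Churn", "Post-Churn"]

# The first matching condition wins, mirroring the nested IF/ELSE above
df["subscription_status"] = np.select(conditions, choices, default="Active")
```

Each condition maps one-to-one onto a branch of the pseudocode, so adding a new status later is a matter of appending a condition/choice pair rather than nesting another np.where.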
6. Zero out revenue for non-active periods
If a contract isn’t active, you aren’t receiving revenue (duh).
opportunity_months = opportunity_months[
[
"id", "accountid", "period_end_date", "subscription_status", "subscription_mrr"
]
].assign(
subscription_mrr=lambda row: np.where(
row["subscription_status"].isin(["Pre-Active", "Churn", "Post-Churn"]),
0,
row["subscription_mrr"]
)
)
Take a peek at what our data looks like before moving forward:
We have positive dollars when active and zero dollars when not. *success kid meme*
7. Group to the account-period level
The goal of a cohort chart is to show the accounts (or users) in a cohort and how they evolve, not individual contract periods. You may be asking yourself why we didn’t go directly to this level — great question! It is because sometimes accounts (or users) may churn and then come back. For example:
- Customer X buys your product in January 2020
- Customer X churns after the first 12 months in January 2021
- Customer X realizes they can’t possibly go on without your product, so they return in June 2021 and have been a customer ever since.
How do you treat Customer X? Are they only in the January 2020 cohort? Reassigned to the June 2021 cohort? Are they in both?
I think they are in both cohorts because they churned for a reason. You want to know that the January 2020 cohort possibly had a poor experience. If you reassign that account, you lose the data point. If you only keep them in that cohort, you lose a potential data point in the June 2021 cohort. What if your product improved (after 18 months, I hope it improved)? You could have fixed the issues that plagued the January 2020 cohort, and now with your superior functionality, they came back.
This step, together with step 8, will allow us to reassign cohorts.
from datetime import datetime as dt

cohort_df = opportunity_months.loc[
    pd.to_datetime(opportunity_months["period_end_date"]) <= dt.today()
].groupby(
    ["accountid", "period_end_date"]
).agg({
    "subscription_mrr": "sum"
}).reset_index().astype({"subscription_mrr": "float"})
This is straightforward. We are grouping the data on accountid and period_end_date, summing the subscription_mrr. We are also removing any partial or future periods.
8. Assign a “Group Index” to identify cohorts
This step is complex and involves looping over a data frame. I understand this is generally frowned upon, but I’d be very interested in a way to do this without iteration.
# Get the cumulative sum for each account
cohort_df["cume_sum"] = cohort_df.groupby(["accountid"])["subscription_mrr"].cumsum()
# Remove any account that doesn't have revenue
cohort_df = cohort_df.loc[cohort_df["cume_sum"] > 0, :]
account_list = cohort_df["accountid"].unique()
for account in account_list:
    # Filter to one account
    values = cohort_df.loc[
        cohort_df["accountid"] == account, :
    ].values.tolist()
    group_list = list()
    # Group index to assign that denotes a cohort
    idx = 0
    # Loop over each account period
    for i, row in enumerate(values):
        # If it is the first row then append the index to group_list
        if i == 0:
            group_list.append(idx)
        else:
            # If the mrr (row[2]) is positive then append the index
            if row[2] > 0:
                group_list.append(idx)
            # If the mrr is zero then append the index and increment by 1
            else:
                group_list.append(idx)
                idx += 1
    # Assign the list of indices to the account
    cohort_df.loc[
        cohort_df["accountid"] == account, "group_index"
    ] = group_list
So, what are we actually doing here?
- Removing any accounts that don’t have revenue. Maybe you’re giving someone a freebie? You don’t want them in a revenue retention chart. Once they start paying, it will give the impression that you had fantastic net retention when in reality someone went from a free trial to a paid subscription.
- Loop over each account and create an index variable as our “group index.” The group index will allow us to assign cohorts. We also create an empty list to append the index to.
- Loop over each period within an account.
- The first period gets assigned the current index value. After that, we’re looking for positive revenue, where we’ll also assign the current index value. Once we get to the month of churn, we assign the current index and increment it by one.
- Assign the list of indices to the account
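On the vectorization question from step 8: one possible loop-free equivalent observes that a row's group index equals the number of zero-MRR periods before it within the account, i.e. a shifted cumulative count (a sketch on toy data, assuming rows are already sorted by period within each account):

```python
import pandas as pd

# Toy account: active, active, churned, churned, returned
cohort_df = pd.DataFrame({
    "accountid": ["A"] * 5,
    "subscription_mrr": [100.0, 100.0, 0.0, 0.0, 100.0],
})

# A row's group index = count of zero-MRR periods before it in the account:
# flag zeros, cumulatively count them, then shift so the churn month itself
# stays in the cohort it is churning out of
cohort_df["group_index"] = (
    cohort_df.groupby("accountid")["subscription_mrr"]
    .transform(lambda s: s.eq(0).cumsum().shift(fill_value=0))
)
```

On this toy account the result is 0, 0, 0, 1, 2, matching what the loop produces: the index holds through the churn month and increments for each subsequent zero period.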
Here is an example of an account that churned and then returned (churn-and-return?):
You can see the group_index is 0 starting January 2021 and then starts to increment a month after churning in January 2022. They come back in January 2023, and the group_index stops incrementing at 12. You’ll see what effect this has in the next step.
9. Assign a cohort and the starting revenue
We now join the data frame to an aggregated version of itself.
# Aggregate on account and group_index to find cohorts and starting revenue
agg_cohort_df = cohort_df.groupby(
["accountid", "group_index"]
).agg({
"period_end_date": "first",
"subscription_mrr": "first"
}).reset_index().rename(
columns={
"period_end_date": "cohort_period_end_date",
"subscription_mrr": "starting_mrr"
})
cohort_df = cohort_df.merge(
agg_cohort_df,
on=["accountid", "group_index"]
)
We aggregate on accountid and group_index to get the first period and revenue values. Those are our cohorts and starting revenue, respectively. Looking at the same account, we can see that it belongs to two cohorts over the life of the relationship.
We’re almost done transforming the data! I promise 😅.
10. Calculate the number of periods since joining the cohort
This will serve as our x-axis in the cohort chart.
cohort_df["periods_since_joining_cohort"] = cohort_df.groupby(
["accountid", "group_index"]
).cumcount()
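As a quick illustration (toy values echoing the churn-and-return account from earlier), cumcount restarts at zero whenever the group index changes:

```python
import pandas as pd

# Toy account with two cohort memberships (group_index 0 and 1)
cohort_df = pd.DataFrame({
    "accountid": ["A"] * 5,
    "group_index": [0, 0, 0, 1, 1],
})

# cumcount numbers the rows 0, 1, 2, ... within each (account, cohort) pair
cohort_df["periods_since_joining_cohort"] = cohort_df.groupby(
    ["accountid", "group_index"]
).cumcount()
```

The counter resets when the account re-enters as a new cohort, which is exactly what we want for the x-axis.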
11. Pivot the data
The final step is to pivot the data, which will resemble the cohort structure we want to visualize.
cohort_pivot_mrr = cohort_df.loc[
cohort_df["subscription_mrr"] > 0, :
].pivot_table(
index="cohort_period_end_date",
columns="periods_since_joining_cohort",
values="subscription_mrr",
aggfunc="sum"
).reset_index().sort_values("cohort_period_end_date")
This is exactly what we want to see. The final cohort was last month (as of the writing of this article), and there’s no data after that. Each month prior has exactly one more month of data.
From here, you don’t have to visualize it if you don’t want to. This data is valuable as is. Paste it into an Excel file and go nuts.
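If Excel is the destination, a minimal hand-off sketch (the frame and filename here are hypothetical stand-ins; a CSV opens directly in Excel):

```python
import pandas as pd

# Hypothetical stand-in for the pivoted cohort frame
cohort_pivot_mrr = pd.DataFrame({
    "cohort_period_end_date": ["2023-01-31", "2023-02-28"],
    0: [1000.0, 1500.0],
    1: [900.0, None],
})

# Write the pivot so it can be opened straight in Excel
cohort_pivot_mrr.to_csv("cohort_pivot.csv", index=False)
```

If you want a native .xlsx file instead, pandas' to_excel works the same way, though it needs an engine like openpyxl installed.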
Adapting to your needs
At the beginning of this article, I mentioned that there are multiple flavors of cohort charts, and to adapt to your needs, you need to adjust your x, y, and z values. Let’s reflect on what the data looks like at this point:
For the cohort chart we are making together, we need the following:
- x — periods_since_joining_cohort
- y — period_end_date
- z — subscription_mrr
These can be adjusted to create a cohort chart for your specific use case. For example:
- x — # of actions taken (clicks, logins, etc.) since joining
- y — Week, month, quarter, year
- z — Revenue, # of customers/users
This is just a matter of adjusting what you want to measure. Maybe you want to know your customer attrition after 3, 4, 5, etc., orders of your product? Knowing this by cohort (your y value) will help you pinpoint when issues in your service came about. Your three data points would be:
- x — Orders since joining
- y — Month
- z — # of customers
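Under those assumptions, a hedged sketch of the order-based flavor (the column names here are hypothetical, not from the article's data):

```python
import pandas as pd

# Hypothetical order-level data: one row per customer order
orders = pd.DataFrame({
    "customer_id":  ["a", "a", "a", "b", "b", "c"],
    "cohort_month": ["2023-01"] * 3 + ["2023-02"] * 3,
    "order_number": [1, 2, 3, 1, 2, 1],
})

# x = orders since joining, y = cohort month, z = # of customers
order_cohorts = orders.pivot_table(
    index="cohort_month",
    columns="order_number",
    values="customer_id",
    aggfunc="nunique",
)
```

Reading across a row shows how many customers from that cohort made it to their 2nd, 3rd, etc. order, which pinpoints the order number where attrition bites.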
Visualizing Your Cohorts
There are two ways to put together a cohort chart. The first is the easy way, and the second is the hard way. The hard way gives you more flexibility to treat each row independently (e.g., set different min/max ranges for the color scale).
First, I will set some variables that we’ll use for both.
date_periods = 25
x = np.array(list(range(date_periods)))
y = cohort_pivot_mrr.tail(date_periods)["cohort_period_end_date"].values
We’re going to look at the most recent 25 months. Our x, y, and z values will use that number to subset the data.
The easy way
I will paste the entire code for this version, leaving out one critical step needed to show the correct colors. I’ll go through every step when describing the hard way to do this.
z_mrr = cohort_pivot_mrr.tail(date_periods).loc[:, x].values
# The first value of each row is that cohort's starting MRR
z_starting = [z[0] for z in z_mrr]
# Divide each row by its starting value to get retention percentages
z_ret = [z[0] / z[1] for z in zip(z_mrr, z_starting)]
fig = go.Figure()
fig.add_trace(
go.Heatmap(
x=x,
y=y,
z=z_ret,
hoverongaps=False,
texttemplate="%{z:.0%}",
colorscale="greens",
showscale=False,
showlegend=False
)
)
fig.update_traces(
showscale=False
)
fig.update_coloraxes(
showscale=False
)
fig.update_layout(
plot_bgcolor="white",
uniformtext_minsize=6,
uniformtext_mode='hide',
margin_pad=5,
font={
"size": 10
},
title={
"text": "Last 25 Periods MRR",
"font": {"size": 20},
"x": 0.5
}
)
fig.update_yaxes(
type="category",
tickfont={"size": 10}
)
fig.update_xaxes(
type="category",
tickfont={"size": 10}
)
Do you see what the issue is here?
…
…
…
The issue is that the color scale is based on all values being passed into the figure. If the range of values is large like above (31% — 334%), you end up with only very subtle color changes across most of the chart. A 10% increase or decrease in revenue retention could enormously impact your business, so you want to see it easily. When doing it this way, you absolutely need to set zmin, zmid, and zmax. These three variables guide the color scale to work within the set bounds.
fig.update_traces(
zmin=0.5,
zmid=1,
zmax=1.5
)
Adjust the values to fit your needs.
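If you'd rather derive those bounds from the data instead of hard-coding them, one possible sketch uses quantiles; np.nanquantile ignores the NaN lower triangle:

```python
import numpy as np

# Hypothetical retention matrix with a NaN lower triangle
z_ret = np.array([
    [1.0, 0.9, 0.8],
    [1.0, 1.2, np.nan],
    [1.0, np.nan, np.nan],
])

# Clip the color scale to the 5th/95th percentile of observed retention
zmin, zmax = np.nanquantile(z_ret, [0.05, 0.95])
```

Quantile bounds keep a few extreme cohorts from washing out the color contrast for everyone else, at the cost of saturating the outliers.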
The Hard Way
Why even discuss different ways? Well, sometimes you want more control over your life, I mean, your work. We achieve this by stacking a bunch of single-row heatmaps on top of each other. Doing it this way will allow us to isolate one or more cohorts if something is interesting about them.
To accomplish this, we’ll make use of the make_subplots method.
from plotly.subplots import make_subplots
You can look at the docs to get a complete understanding of this method. I will only focus on the parameters that I’m using.
fig = make_subplots(
rows=len(y), cols=1,
row_heights=[1]*len(y),
vertical_spacing=0,
shared_xaxes=True,
shared_yaxes=True,
column_titles=x.tolist(),
x_title="Periods Since Joining Cohort"
)
Stepping through each parameter:
- rows: # of rows to create — Set to the length of y
- cols: # of columns to create — Only need one column
- row_heights: Height of each row — Default is 1, so we create an array of 1s the length of y
- vertical_spacing: Space between rows — Set to 0 so there is no space
- shared_xaxes & shared_yaxes: Rows and columns share the same axis — Set to true so that each subplot doesn’t have its own axis labels
- column_titles: List of column names — Set to x
- x_title: x-axis title
What this has done is create the layout for the figure. Now we need to actually assign data to each subplot.
z_mrr = cohort_pivot_mrr.tail(date_periods).loc[:, x].values
for i, cohort in enumerate(y):
    fig.add_trace(
        go.Heatmap(
            x=x,
            y=[y[i]] * len(x),
            z=z_mrr[i] / z_mrr[i][0],
            # customdata pairs each cell's raw MRR with the cohort's
            # starting MRR (here assumed to hold the raw MRR rows)
            customdata=np.dstack(
                (customdata[i], [customdata[i][0]] * len(customdata[i]))),
            hoverongaps=False,
            texttemplate="%{z:.0%}",
            hovertemplate="Cohort End Date: <b>%{y}</b> <br>"
            "Months Since Joining Cohort: <b>%{x}</b> <br>"
            "Retained MRR: <b>%{z:.2%}</b> <br>"
            "MRR: <b>$%{customdata[0][0]:,.2f}</b> <br>"
            "Starting MRR: <b>$%{customdata[1][1]:,.2f}</b> <br>"
            "<extra></extra>",
            colorscale="greens",
            zmin=0.5,
            zmid=1,
            zmax=1.5,
            showscale=False,
            showlegend=False
        ),
        row=i + 1,
        col=1
    )
Let’s step through this:
- First, we’re looping over all of the cohorts. This is the stacking I mentioned.
- x and y are straightforward — x is the list of periods since joining the cohort. y is the cohort itself. You can see that we are forcing y to have the same dimensions as x. If we don’t do this, the only thing that gets returned is the first column of the cohort chart.
- z — The value you want to show in each cell. In our case, it is the retention %.
- hoverongaps — When there is no value, should you show a hover box? For us, the answer is no. If you set this to true, then the lower triangle of NaN values will produce a hover, which is an obvious issue if this is going to be embedded in a web app. If you’re putting this into a presentation, it doesn’t matter.
- texttemplate and hovertemplate — I’ve touched on these in a previous article. We don’t need hovertemplate if this is for a presentation, but we do need texttemplate to format how the z value shows up in each cell. I’m applying a percentage format with no decimal places.
- colorscale — Plotly has a number of them. For showing retention, you could either use sequential or diverging. I’m using the “greens” sequential color scale in this case, but if you wanted to go with a diverging scale to show that below 100% retention is bad, that also makes sense.
- zmin, zmid, and zmax — See above.
- showscale & showlegend — No need to clutter the chart with a visual color scale and legend. It is straightforward to understand as-is.
- row — Specify which row to put the subplot in (1-index).
- col — We only have one column, so this is one for all subplots.
At this point, we have the layout constructed and the data added to each subplot. However, if we looked at the chart, it would look like trash.
Nootttt good. We need to give this thing some TLC. Let’s first fix the axes. What is going on here is Plotly thinks the y-axis is a continuous date-time, but we want it to be a categorical string. Similarly, for the x-axis, Plotly thinks it’s a number (which it is), but we want it to be a categorical string as well.
fig.update_yaxes(
type="category",
tickfont={"size": 10}
)
fig.update_xaxes(
type="category",
tickfont={"size": 10}
)
Wayyy better. The next step is to clean up the cell fonts and hide the lower half of the triangle. We can add a generic title as well.
fig.update_layout(
plot_bgcolor="white",
uniformtext_minsize=6,
uniformtext_mode='hide',
margin_pad=5,
font={
"size": 10
},
title={
"text": "Last 25 Periods MRR",
"font": {"size": 20},
"x": 0.5
}
)
This is the exact same chart as the other version. But what if we wanted to highlight the best and worst cohorts? How might we do that? I’m thinking of two different approaches:
- Make the row heights larger
- Create a border
To adjust the row heights, we need to change the value in the initial make_subplots call. First, I will create a dictionary containing the two cohorts I want to highlight.
annotation_dict = {
"2021-05-31": {"color": "purple", "adj_y_start": 0, "adj_y_end": 1.8},
"2021-10-31": {"color": "red", "adj_y_start": 0.5, "adj_y_end": 2.3}
}
fig = make_subplots(
rows=len(y), cols=1,
row_heights=[
2 if el in annotation_dict.keys() else 1 for el in y
],
vertical_spacing=0,
shared_xaxes=True,
shared_yaxes=True,
column_titles=x.tolist(),
x_title="Periods Since Joining Cohort"
)
Instead of setting all row heights to 1, I’m adjusting it so that the height is 2 for those specific cohorts and 1 otherwise. This can only be done if you create the cohort chart by stacking heat maps. If you know how to do this using a single heatmap, please let me know (I’ll be slightly upset by how much effort it took to figure this out, c’est la vie).
Creating a border does not require the stacking method, but it can still be helpful for what we want to do. This requires a little math and some manual adjustments (you can see those adjustments in the dictionary above).
for k, v in annotation_dict.items():
    x_start = -0.09
    x_end = len(x) - np.where(y == k)[0][0]
    y_start = (len(x) - np.where(y == k)[0][0]
               - v.get("adj_y_start"))
    y_end = (len(x) - np.where(y == k)[0][0]
             - v.get("adj_y_end"))
    fig.add_shape(
        type="rect",
        xref="paper",
        yref="paper",
        x0=x_start,
        x1=x_end / len(x),
        y0=y_start / len(x),
        y1=y_end / len(x),
        line=dict(
            color=v.get("color"),
            width=2,
        ),
    )
The idea here is to figure out where to start drawing the rectangle and where to finish. The start and end variables in the code block above tell us which row. The adjustments draw the rectangle exactly where we want it. Because we made these rows twice as high as the others, the adjustments aren’t entirely intuitive. Play around with it to find the right numbers.
Finally, let’s add an annotation to support what we’ve done.
fig.add_annotation(
text="""
The <b><span style='color: purple'>May 2021</span></b> cohort has continued to show<br>
strong retention as those customers reach<br>
the 2 year mark.<br>
<br>
Conversely, the <b><span style='color: red'>October 2021</span></b> cohort saw a large reduction in<br>
revenue in month 1. Efforts have been made to upsell remaining<br>
customers but so far only half of initial revenue has been recovered.
""",
xref="paper", yref="paper",
x=1.05, y=0.1,
font_size=12,
align="left",
showarrow=False
)
They’re a little wordy, but cohort charts can tell you a lot, so sometimes, what you have to say reflects that. And now, the finished product…
We’ve come a long way from the raw Salesforce data. Hopefully, you were able to follow along, but feel free to reach out if you have questions or run into issues!
For the full code, check out my git repo.
What’s Next
I’m honestly not sure yet. There are a few that I’m thinking of:
- Gantt chart
- Choropleth Maps
- Waterfall Charts
Maybe I’ll do them in this order? Any opinions? Regardless, thank you for reading, and I hope you have a new tool in your toolbelt!
This is the fifth article in a series. If you want to read the others, you can find their links below:
Adventures in Plotly: Cohort Charts was originally published in Better Programming on Medium.