Chipotle Item Analysis

8 minute read

This is short exploratory walk through of an extract of transaction data (with no customer id but just order id) of orders at Chipotle. Main components are about how are different items priced at this Mexican Grill Cuisine and what are the common add-ons that orders include while a food lover customizes the order.

from collections import Counter  # for generating frequency count using dictionary

import matplotlib.pyplot as plt
import pandas as pd

from wordcloud import (  # for generating word cloud
    STOPWORDS,
    ImageColorGenerator,
    WordCloud,
)
data = pd.read_csv("chipotle.tsv", sep="\t")

Data wrangling

# replace $ and convert item_price to float
data["item_price"] = pd.Series(data["item_price"].str.replace("$", ""), dtype="float")
data.sample(5)
order_id quantity item_name choice_description item_price
3264 1306 1 Canned Soft Drink [Sprite] 1.25
1655 669 1 Chips NaN 2.15
2022 815 1 Canned Soft Drink [Diet Coke] 1.25
3306 1325 1 Chips NaN 2.15
82 36 1 Chicken Burrito [Fresh Tomato Salsa, [Rice, Black Beans, Chees... 8.75

Summary stats

print(
    "There are {} observations and {} features in this dataset. They are {}. \n".format(
        data.shape[0], data.shape[1], ", ".join(list(data.columns))
    )
)

print(
    "There are {} different items in this dataset such as {}... \n".format(
        len(data.item_name.unique()), ", ".join(data.item_name.unique()[0:5])
    )
)

print(
    "There are {} unique orders in this dataset. \n".format(
        len(data["order_id"].unique())
    )
)

print(
    "The total revenue generated from the orders within is about ${} USD. \n".format(
        round(sum(data["item_price"]), 0)
    )
)
There are 4622 observations and 5 features in this dataset. They are order_id, quantity, item_name, choice_description, item_price. 

There are 50 different items in this dataset such as Chips and Fresh Tomato Salsa, Izze, Nantucket Nectar, Chips and Tomatillo-Green Chili Salsa, Chicken Bowl... 

There are 1834 unique orders in this dataset. 

The total revenue generated from the orders within is about $34500.0 USD. 

Avg. Price across all orders

by_order = pd.DataFrame(data.groupby(by="order_id").agg("sum").reset_index())
by_order = by_order.rename(columns={"item_price": "order_amount"})

# avg. order price
print(
    "The avg. order price is $",
    round(sum(by_order["order_amount"]) / len(by_order["order_amount"]), 2),
)
The avg. order price is $ 18.81

Avg. price paid by each order

by_order = pd.DataFrame(
    data.groupby(by="order_id")
    .agg({"item_price": "mean", "item_name": "count"})
    .reset_index()
)
by_order = by_order.rename(
    columns={"item_price": "avg_order_amount", "item_name": "num_items"}
)

# Distribution of avg. order amount across all orders
plt.hist(by_order["avg_order_amount"])
plt.xlabel("Avg. Order Amount in USD")
plt.ylabel("Frequency count")
plt.show()

png

It’s reasonable to see most of the orders paying under $15 for a fast-casual market player. Customers tend to buy shorter orders more often than ordering in bulk which is a rarity.

Distribution of items per order

plt.hist(by_order["num_items"], bins=25)
plt.xlabel("Items per order")
plt.ylabel("Frequency count")
plt.show()

png

by_item_numbers = (
    by_order.groupby(by="num_items").agg({"order_id": "count"}).reset_index()
)
by_item_numbers = by_item_numbers.rename(columns={"order_id": "num_orders"})
by_item_numbers
num_items num_orders
0 1 128
1 2 1012
2 3 484
3 4 134
4 5 44
5 6 14
6 7 4
7 8 5
8 9 2
9 10 1
10 11 3
11 12 1
12 14 1
13 23 1

Close to about 55% of the orders are 2-item orders analogous to people ordering a main item with either a side or a beverage. Appently, the next in number of orders is those with 3 items (main item + side + beverage).

Avg. prices per item ordered

As the each item ordered in any order is customized based on choice description, let’s see what are the avg. prices per item ordered.

# avg. price per item ordered
by_item = pd.DataFrame(
    data.groupby(by="item_name")
    .agg({"item_price": "mean", "order_id": "count"})
    .reset_index()
)
by_item = by_item.rename(
    columns={"item_price": "avg_price_paid", "order_id": "times_ordered"}
)

by_item["revenue"] = by_item["avg_price_paid"] * by_item["times_ordered"]
plt.figure(figsize=(10, 10))
by_item = by_item.sort_values(by="revenue", ascending=True)
plt.barh(by_item["item_name"], by_item["revenue"])
plt.xlabel("Revenue (USD)")
plt.ylabel("Food Item")
plt.title("Items ranked based on revenue generated in USD")
plt.show()

png

By most ordered

# Top 5items by 'times_ordered'
by_item.sort_values(by=["times_ordered"], ascending=False)[:10]
item_name avg_price_paid times_ordered revenue
17 Chicken Bowl 10.113953 726 7342.73
18 Chicken Burrito 10.082857 553 5575.82
25 Chips and Guacamole 4.595073 479 2201.04
39 Steak Burrito 10.465842 368 3851.43
10 Canned Soft Drink 1.457641 301 438.75
38 Steak Bowl 10.711801 211 2260.19
23 Chips 2.342844 211 494.34
6 Bottled Water 1.867654 162 302.56
22 Chicken Soft Tacos 9.635565 115 1108.09
21 Chicken Salad Bowl 11.170455 110 1228.75
  • Canned Soft Drink is the most ordered beverage.

By revenue

by_item.sort_values(by=["revenue"], ascending=False)[:10]
item_name avg_price_paid times_ordered revenue
17 Chicken Bowl 10.113953 726 7342.73
18 Chicken Burrito 10.082857 553 5575.82
39 Steak Burrito 10.465842 368 3851.43
38 Steak Bowl 10.711801 211 2260.19
25 Chips and Guacamole 4.595073 479 2201.04
21 Chicken Salad Bowl 11.170455 110 1228.75
22 Chicken Soft Tacos 9.635565 115 1108.09
45 Veggie Burrito 9.839684 95 934.77
2 Barbacoa Burrito 9.832418 91 894.75
44 Veggie Bowl 10.211647 85 867.99
  • Chicken Bowl, Chicken Burrito, Steak Burrito, Steak Bowl get us the most revenue.
  • Chips and Guacamole is the most ordered as well as revenue-generating sides

List of all the items where customization was possible

df_without_na = pd.DataFrame(data.dropna())

# Items for choice_description is possible
pd.Series(df_without_na["item_name"].unique())
0                      Izze
1          Nantucket Nectar
2              Chicken Bowl
3             Steak Burrito
4          Steak Soft Tacos
5      Chicken Crispy Tacos
6        Chicken Soft Tacos
7           Chicken Burrito
8               Canned Soda
9          Barbacoa Burrito
10         Carnitas Burrito
11            Carnitas Bowl
12            Barbacoa Bowl
13       Chicken Salad Bowl
14               Steak Bowl
15      Barbacoa Soft Tacos
16           Veggie Burrito
17              Veggie Bowl
18       Steak Crispy Tacos
19    Barbacoa Crispy Tacos
20        Veggie Salad Bowl
21      Carnitas Soft Tacos
22            Chicken Salad
23        Canned Soft Drink
24         Steak Salad Bowl
25        6 Pack Soft Drink
26                     Bowl
27                  Burrito
28             Crispy Tacos
29    Carnitas Crispy Tacos
30              Steak Salad
31        Veggie Soft Tacos
32      Carnitas Salad Bowl
33      Barbacoa Salad Bowl
34                    Salad
35      Veggie Crispy Tacos
36             Veggie Salad
37           Carnitas Salad
dtype: object
# Calling list_to_string function
%run list_to_string.py

This is a neat way of getting a paricular task done by calling a function added in a .py file. It makes code easy to read for future reference and our team members.

# data transformation to string without any [] in the choice_description column
df_without_na["choice_description"] = df_without_na["choice_description"].apply(
    list_to_string
)
text = ", ".join(cd for cd in df_without_na.choice_description)
print("There are {} choice additions in the choice descriptions of all the items ordered.".format(len(text)))
my_list = text.split(", ")
word_count_dict = Counter(my_list)
There are 210889 choice additions in the choice descriptions of all the items ordered.

What goes in the add-ons/choice description mostly?

wordcloud = WordCloud(
    max_words = 100, background_color="white", width=1600, height=1200, scale=1.5
).generate_from_frequencies(word_count_dict)

# Display the generated image
plt.figure(figsize=(10, 15))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

png

Just by looking at the wordcloud, a couple of insights are clearly visible:

  • In all the orders pertaining to this dataset, a bowl or a burrito has Rice, Lettuce, Cheese, Sour cream as add-ons. While this maybe trivial to some of you, especially those from the Mexican descent, for someone who don’t know anything about the Mexican cuisine like me, this is valuable information to form hypothesis. This applies to any data set one works on. In some cases, we as data analysts/scientists are pre-informed but in some cases, data itself tells us the new information we do not know till that point. It’s about asking the right questions.
  • Among Salsa’s, Fresh Tomato Salsa looks to be the top added add-on and then Roasted Chili Corn Salsa.
  • As beans are common in Mexican food, both Black and Pinto beans are visible in the top 100 add-ons. But as a matter of fact, Black beans has lower carbs than Pinto beans which could be a reason as all the food lovers like us in this data are preferring Black Beans.

What are the items which doesn’t have any choice_description (no customization)?

set(data["item_name"].unique()) - set(df_without_na["item_name"].unique())
{'Bottled Water',
 'Chips',
 'Chips and Fresh Tomato Salsa',
 'Chips and Guacamole',
 'Chips and Mild Fresh Tomato Salsa',
 'Chips and Roasted Chili Corn Salsa',
 'Chips and Roasted Chili-Corn Salsa',
 'Chips and Tomatillo Green Chili Salsa',
 'Chips and Tomatillo Red Chili Salsa',
 'Chips and Tomatillo-Green Chili Salsa',
 'Chips and Tomatillo-Red Chili Salsa',
 'Side of Chips'}

A good check that the sides are mostly not customizable - data sanity check. It’s good to have these sanity checks once in a while all through the data analysis and modelling to ensure reliable results. But that’s not the end as there can be variation depending on the region, season, etc. That’s where as a data analyst/scientist, one has to use communicate with a team member or someone in a senior role to clarify whether all the listed elements are not customizable. More generally, it’s important to understand the holistic view of what the business is trying to achieve. Ultimately, data tells a story partially, to make it whole, business knowledge (as simple as the above example) and team play is necessary.

This brings to the end of this simple walk-through (fully accessible via this repo) and my plan from here on is to extend this analysis for a market basket analysis or to recommend what can a customer add as an add-on for a future order.

Off to pondering whether this data is sufficient to do what struck me. If you have any ideas or loves to try different food cuisines, ping me - we can discuss about food or plan to implement the ideas for a great customer experience…

Trivia

What is the avg. number of orders in a week at Chipotle (any location)?

Let’s guesstimate…

Equation: No. of restaurants of Chipotle x No. of weeks in a year (wk) x No. of orders per week (orders/wk) x Revenue per order (USD/orders) = Total annual revenue (USD). Check if the units are balanced on both sides.

We get the Revenue per order from the above dataset. (remember this is an avg. figure) = $(34500/1834) USD. The other data points like revenue and no. of locations are openly available.

Assumptions

Inevitably, valid assumptions are key to calculating estimates:

  • A core assumption to this particular calculation is - the revenue generated in this data set is only for a single location/restaurant.
  • Chipotle’s annual revenue amounts to what it is just because of food orders
  • The total revenue amounts to both in-store and online sales.

No. of orders per week = 6B (USD) / (2500 * 52 * No. of orders per week * (34500/1834))

Turns out to be ~ 2453 orders per week.

This is only an approximation based on the assumptions made. Actual numbers might vary a lot depending on the location, season, things like COVID-19. There might also be other sources of revenue other than food ordering which hasn’t been accounted for in this calculation.

Leave a comment