Chipotle Item Analysis

8 minute read

This is short exploratory walk through of an extract of transaction data (with no customer id but just order id) of orders at Chipotle. Main components are about how are different items priced at this Mexican Grill Cuisine and what are the common add-ons that orders include while a food lover customizes the order.

from collections import Counter  # for generating frequency count using dictionary

import matplotlib.pyplot as plt
import pandas as pd

from wordcloud import (  # for generating word cloud
    STOPWORDS,
    ImageColorGenerator,
    WordCloud,
)

data = pd.read_csv("chipotle.tsv", sep="\t")

Data wrangling

# replace $ and convert item_price to float
data["item_price"] = pd.Series(data["item_price"].str.replace("$", ""), dtype="float")

data.sample(5)

	order_id	quantity	item_name	choice_description	item_price
3264	1306	1	Canned Soft Drink	[Sprite]	1.25
1655	669	1	Chips	NaN	2.15
2022	815	1	Canned Soft Drink	[Diet Coke]	1.25
3306	1325	1	Chips	NaN	2.15
82	36	1	Chicken Burrito	[Fresh Tomato Salsa, [Rice, Black Beans, Chees...	8.75

Summary stats

print(
    "There are {} observations and {} features in this dataset. They are {}. \n".format(
        data.shape[0], data.shape[1], ", ".join(list(data.columns))
    )
)

print(
    "There are {} different items in this dataset such as {}... \n".format(
        len(data.item_name.unique()), ", ".join(data.item_name.unique()[0:5])
    )
)

print(
    "There are {} unique orders in this dataset. \n".format(
        len(data["order_id"].unique())
    )
)

print(
    "The total revenue generated from the orders within is about ${} USD. \n".format(
        round(sum(data["item_price"]), 0)
    )
)

There are 4622 observations and 5 features in this dataset. They are order_id, quantity, item_name, choice_description, item_price. 

There are 50 different items in this dataset such as Chips and Fresh Tomato Salsa, Izze, Nantucket Nectar, Chips and Tomatillo-Green Chili Salsa, Chicken Bowl... 

There are 1834 unique orders in this dataset. 

The total revenue generated from the orders within is about $34500.0 USD.

Avg. Price across all orders

by_order = pd.DataFrame(data.groupby(by="order_id").agg("sum").reset_index())
by_order = by_order.rename(columns={"item_price": "order_amount"})

# avg. order price
print(
    "The avg. order price is $",
    round(sum(by_order["order_amount"]) / len(by_order["order_amount"]), 2),
)

The avg. order price is $ 18.81

Avg. price paid by each order

by_order = pd.DataFrame(
    data.groupby(by="order_id")
    .agg({"item_price": "mean", "item_name": "count"})
    .reset_index()
)
by_order = by_order.rename(
    columns={"item_price": "avg_order_amount", "item_name": "num_items"}
)

# Distribution of avg. order amount across all orders
plt.hist(by_order["avg_order_amount"])
plt.xlabel("Avg. Order Amount in USD")
plt.ylabel("Frequency count")
plt.show()

png

It’s reasonable to see most of the orders paying under $15 for a fast-casual market player. Customers tend to buy shorter orders more often than ordering in bulk which is a rarity.

Distribution of items per order

plt.hist(by_order["num_items"], bins=25)
plt.xlabel("Items per order")
plt.ylabel("Frequency count")
plt.show()

png

by_item_numbers = (
    by_order.groupby(by="num_items").agg({"order_id": "count"}).reset_index()
)
by_item_numbers = by_item_numbers.rename(columns={"order_id": "num_orders"})
by_item_numbers

	num_items	num_orders
0	1	128
1	2	1012
2	3	484
3	4	134
4	5	44
5	6	14
6	7	4
7	8	5
8	9	2
9	10	1
10	11	3
11	12	1
12	14	1
13	23	1

Close to about 55% of the orders are 2-item orders analogous to people ordering a main item with either a side or a beverage. Appently, the next in number of orders is those with 3 items (main item + side + beverage).

Avg. prices per item ordered

As the each item ordered in any order is customized based on choice description, let’s see what are the avg. prices per item ordered.

# avg. price per item ordered
by_item = pd.DataFrame(
    data.groupby(by="item_name")
    .agg({"item_price": "mean", "order_id": "count"})
    .reset_index()
)
by_item = by_item.rename(
    columns={"item_price": "avg_price_paid", "order_id": "times_ordered"}
)

by_item["revenue"] = by_item["avg_price_paid"] * by_item["times_ordered"]

plt.figure(figsize=(10, 10))
by_item = by_item.sort_values(by="revenue", ascending=True)
plt.barh(by_item["item_name"], by_item["revenue"])
plt.xlabel("Revenue (USD)")
plt.ylabel("Food Item")
plt.title("Items ranked based on revenue generated in USD")
plt.show()

png

By most ordered

# Top 5items by 'times_ordered'
by_item.sort_values(by=["times_ordered"], ascending=False)[:10]

	item_name	avg_price_paid	times_ordered	revenue
17	Chicken Bowl	10.113953	726	7342.73
18	Chicken Burrito	10.082857	553	5575.82
25	Chips and Guacamole	4.595073	479	2201.04
39	Steak Burrito	10.465842	368	3851.43
10	Canned Soft Drink	1.457641	301	438.75
38	Steak Bowl	10.711801	211	2260.19
23	Chips	2.342844	211	494.34
6	Bottled Water	1.867654	162	302.56
22	Chicken Soft Tacos	9.635565	115	1108.09
21	Chicken Salad Bowl	11.170455	110	1228.75

Canned Soft Drink is the most ordered beverage.

By revenue

by_item.sort_values(by=["revenue"], ascending=False)[:10]

	item_name	avg_price_paid	times_ordered	revenue
17	Chicken Bowl	10.113953	726	7342.73
18	Chicken Burrito	10.082857	553	5575.82
39	Steak Burrito	10.465842	368	3851.43
38	Steak Bowl	10.711801	211	2260.19
25	Chips and Guacamole	4.595073	479	2201.04
21	Chicken Salad Bowl	11.170455	110	1228.75
22	Chicken Soft Tacos	9.635565	115	1108.09
45	Veggie Burrito	9.839684	95	934.77
2	Barbacoa Burrito	9.832418	91	894.75
44	Veggie Bowl	10.211647	85	867.99

Chicken Bowl, Chicken Burrito, Steak Burrito, Steak Bowl get us the most revenue.
Chips and Guacamole is the most ordered as well as revenue-generating sides

List of all the items where customization was possible

df_without_na = pd.DataFrame(data.dropna())

# Items for choice_description is possible
pd.Series(df_without_na["item_name"].unique())

                    Izze
        Nantucket Nectar
            Chicken Bowl
           Steak Burrito
        Steak Soft Tacos
    Chicken Crispy Tacos
      Chicken Soft Tacos
         Chicken Burrito
             Canned Soda
        Barbacoa Burrito
       Carnitas Burrito
          Carnitas Bowl
          Barbacoa Bowl
     Chicken Salad Bowl
             Steak Bowl
    Barbacoa Soft Tacos
         Veggie Burrito
            Veggie Bowl
     Steak Crispy Tacos
  Barbacoa Crispy Tacos
      Veggie Salad Bowl
    Carnitas Soft Tacos
          Chicken Salad
      Canned Soft Drink
       Steak Salad Bowl
      6 Pack Soft Drink
                   Bowl
                Burrito
           Crispy Tacos
  Carnitas Crispy Tacos
            Steak Salad
      Veggie Soft Tacos
    Carnitas Salad Bowl
    Barbacoa Salad Bowl
                  Salad
    Veggie Crispy Tacos
           Veggie Salad
         Carnitas Salad
dtype: object

# Calling list_to_string function
%run list_to_string.py

This is a neat way of getting a paricular task done by calling a function added in a .py file. It makes code easy to read for future reference and our team members.

# data transformation to string without any [] in the choice_description column
df_without_na["choice_description"] = df_without_na["choice_description"].apply(
    list_to_string
)

text = ", ".join(cd for cd in df_without_na.choice_description)
print("There are {} choice additions in the choice descriptions of all the items ordered.".format(len(text)))
my_list = text.split(", ")
word_count_dict = Counter(my_list)

There are 210889 choice additions in the choice descriptions of all the items ordered.

What goes in the add-ons/choice description mostly?

wordcloud = WordCloud(
    max_words = 100, background_color="white", width=1600, height=1200, scale=1.5
).generate_from_frequencies(word_count_dict)

# Display the generated image
plt.figure(figsize=(10, 15))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

png

Just by looking at the wordcloud, a couple of insights are clearly visible:

In all the orders pertaining to this dataset, a bowl or a burrito has Rice, Lettuce, Cheese, Sour cream as add-ons. While this maybe trivial to some of you, especially those from the Mexican descent, for someone who don’t know anything about the Mexican cuisine like me, this is valuable information to form hypothesis. This applies to any data set one works on. In some cases, we as data analysts/scientists are pre-informed but in some cases, data itself tells us the new information we do not know till that point. It’s about asking the right questions.
Among Salsa’s, Fresh Tomato Salsa looks to be the top added add-on and then Roasted Chili Corn Salsa.
As beans are common in Mexican food, both Black and Pinto beans are visible in the top 100 add-ons. But as a matter of fact, Black beans has lower carbs than Pinto beans which could be a reason as all the food lovers like us in this data are preferring Black Beans.

What are the items which doesn’t have any choice_description (no customization)?

set(data["item_name"].unique()) - set(df_without_na["item_name"].unique())

{'Bottled Water',
 'Chips',
 'Chips and Fresh Tomato Salsa',
 'Chips and Guacamole',
 'Chips and Mild Fresh Tomato Salsa',
 'Chips and Roasted Chili Corn Salsa',
 'Chips and Roasted Chili-Corn Salsa',
 'Chips and Tomatillo Green Chili Salsa',
 'Chips and Tomatillo Red Chili Salsa',
 'Chips and Tomatillo-Green Chili Salsa',
 'Chips and Tomatillo-Red Chili Salsa',
 'Side of Chips'}

A good check that the sides are mostly not customizable - data sanity check. It’s good to have these sanity checks once in a while all through the data analysis and modelling to ensure reliable results. But that’s not the end as there can be variation depending on the region, season, etc. That’s where as a data analyst/scientist, one has to use communicate with a team member or someone in a senior role to clarify whether all the listed elements are not customizable. More generally, it’s important to understand the holistic view of what the business is trying to achieve. Ultimately, data tells a story partially, to make it whole, business knowledge (as simple as the above example) and team play is necessary.

This brings to the end of this simple walk-through (fully accessible via this repo) and my plan from here on is to extend this analysis for a market basket analysis or to recommend what can a customer add as an add-on for a future order.

Off to pondering whether this data is sufficient to do what struck me. If you have any ideas or loves to try different food cuisines, ping me - we can discuss about food or plan to implement the ideas for a great customer experience…

Trivia

What is the avg. number of orders in a week at Chipotle (any location)?

Let’s guesstimate…

Equation: No. of restaurants of Chipotle x No. of weeks in a year (wk) x No. of orders per week (orders/wk) x Revenue per order (USD/orders) = Total annual revenue (USD). Check if the units are balanced on both sides.

We get the Revenue per order from the above dataset. (remember this is an avg. figure) = $(34500/1834) USD. The other data points like revenue and no. of locations are openly available.

Assumptions

Inevitably, valid assumptions are key to calculating estimates:

A core assumption to this particular calculation is - the revenue generated in this data set is only for a single location/restaurant.
Chipotle’s annual revenue amounts to what it is just because of food orders
The total revenue amounts to both in-store and online sales.

No. of orders per week = 6B (USD) / (2500 * 52 * No. of orders per week * (34500/1834))

Turns out to be ~ 2453 orders per week.

This is only an approximation based on the assumptions made. Actual numbers might vary a lot depending on the location, season, things like COVID-19. There might also be other sources of revenue other than food ordering which hasn’t been accounted for in this calculation.

Share on

Twitter Facebook LinkedIn

Akshay Kotha