Brazilian E-Commerce Public Dataset by Olist
Introduction
This is my personal project, use dataset from Kaggle.com to practice Python skills, data visualization and business knowledge.
Link Kaggle: https://www.kaggle.com/code/trmnguynhi/brazilian-e-commerce-public-eda
Business case
This dataset was generously provided by Olist, the largest department store in Brazilian marketplaces. Olist connects small businesses from all over Brazil to channels without hassle and with a single contract. Those merchants are able to sell their products through the Olist Store and ship them directly to the customers using Olist logistics partners.
After a customer purchases the product from Olist Store a seller gets notified to fulfill that order. Once the customer receives the product, or the estimated delivery date is due, the customer gets a satisfaction survey by email where he can give a note for the purchase experience and write down some comments.
The dataset has 8 CSV files, with the relationship as capture below

This analysis will contain four steps as below:
- Define business task
- Data Preparation
- Data Processing
- Data Analysis
In this analysis, I will focus on three below factors to have overall overview of the business and see which issue need to be solved:
- Delivery Performance
- Sales Performance
- Customer Experience Indicator
Data Preparation
Import the libraries & datasets
import pandas as pd import numpy as npimport matplotlib.pyplot as pltimport seaborn as sns from pandas.tseries.offsetsimport DateOffset from dateutil.relativedeltaimport relativedelta%matplotlib inlinecustomers = pd.read_csv("/kaggle/input/brazilian-ecommerce/olist_customers_dataset.csv") geolocation= pd.read_csv("/kaggle/input/brazilian-ecommerce/olist_geolocation_dataset.csv") order_items = pd.read_csv("/kaggle/input/brazilian-ecommerce/olist_order_items_dataset.csv") order_payments = pd.read_csv("/kaggle/input/brazilian-ecommerce/olist_order_payments_dataset.csv") order_reviews = pd.read_csv("/kaggle/input/brazilian-ecommerce/olist_order_reviews_dataset.csv") date_cols = ['order_purchase_timestamp','order_approved_at','order_delivered_carrier_date','order_delivered_customer_date','order_estimated_delivery_date'] orders = pd.read_csv("/kaggle/input/brazilian-ecommerce/olist_orders_dataset.csv",parse_dates=date_cols) products = pd.read_csv("/kaggle/input/brazilian-ecommerce/olist_products_dataset.csv") sellers = pd.read_csv("/kaggle/input/brazilian-ecommerce/olist_sellers_dataset.csv")
Data Processing
Explore datasets
def overview_data(dataset_name):
# , null_amount, name_null_cols, num_null_cols):
nrows=dataset_name.shape[0]
ncols=dataset_name.shape[1]
null_values = dataset_name.isnull().sum()
name_nul_cols = []
for i in range(len(null_values)):
if null_values[i] >0:
name_nul_cols.append(null_values.index[i])
num_nul_cols = len(name_nul_cols)
return nrows, ncols , name_nul_cols,num_nul_cols
detailed_db = pd.DataFrame() list_dataset= [customers,geolocation,order_items,order_payments,order_reviews,orders,products,sellers] list_dataset_name= ['customers','geolocation','order_items','order_payments','order_reviews','orders','products','sellers'] detailed_db['dataset'] = [i for i in list_dataset_name] detailed_db['nrows'] = [overview_data(i)[0] for i in list_dataset] detailed_db['ncols'] = [overview_data(i)[1] for i in list_dataset] detailed_db['names_of_null_cols'] = [', '.join(overview_data(i)[2]) for i in list_dataset] detailed_db['num_null_cols'] = [overview_data(i)[3] for i in list_dataset]
orders.info() order_items.info()
Data Analysis
1. Delivery Performance
1.1 Ontime Delivery
orders_info = orders.merge(customers, on = 'customer_id', how = 'left' )
orders_info['order_delivered_customer_month'] = orders_info['order_delivered_customer_date'].dt.strftime('%Y-%m')
orders_info['order_purchase_month'] = orders_info['order_purchase_timestamp'].dt.strftime('%Y-%m')
orders_info['order_purchase_year'] = orders_info['order_purchase_timestamp'].dt.strftime('%Y')
orders_info['Delivery Time'] = ((orders_info['order_delivered_customer_date'] - orders_info['order_delivered_carrier_date']))/24
orders_info['is_deliver_ontime']=np.where( orders_info['order_delivered_customer_date'] > orders_info['order_estimated_delivery_date'],0,1)
delivery_ontime_pivot = orders_info[orders_info['order_delivered_customer_date'].isna() == False].groupby('order_delivered_customer_month')['is_deliver_ontime'].agg(TotalDelivered='count',TotalOntime='sum').reset_index()
delivery_ontime_pivot['Ontime Delivery Rate'] = (delivery_ontime_pivot['TotalOntime'] / delivery_ontime_pivot['TotalDelivered'])*100
fig, ax1 = plt.subplots(figsize=(12,6))
sns.lineplot(data = delivery_ontime_pivot[delivery_ontime_pivot['Ontime Delivery Rate' ]>0], x = 'order_delivered_customer_month', y='Ontime Delivery Rate', marker='o', ax=ax1)
ax2 = ax1.twinx()
sns.barplot(data = delivery_ontime_pivot[delivery_ontime_pivot['Ontime Delivery Rate' ]>0], x= 'order_delivered_customer_month', y='TotalDelivered', alpha=0.5, ax=ax2)
ax1.tick_params(axis='x', labelrotation = 45)
This is shown that the delivery ontime is decreased from 2016 to 2028 with the increasing of orders volumn ( not count for undelivered orders)
delivery_ontime_by_state = orders_info[orders_info['order_delivered_customer_date'].isna() == False].groupby(['customer_state'])['is_deliver_ontime'].agg(TotalDelivered = 'count',TotalOntime='sum').reset_index() delivery_ontime_by_state['Ontime Delivery Rate'] = (delivery_ontime_by_state['TotalOntime'] / delivery_ontime_by_state['TotalDelivered'])*100 delivery_ontime_by_state=delivery_ontime_by_state.sort_values(by='Ontime Delivery Rate') plt.figure(figsize=(12, 7)) sns.barplot(data =delivery_ontime_by_state, x = 'customer_state', y= 'Ontime Delivery Rate',color='skyblue')
Top 10 States have the lowest ontime rate need to be reviewed and enhanced
1.2 Average Delivery Time
avg_delivery_time = orders_info[orders_info['order_delivered_customer_date'].isna() == False].groupby(['order_delivered_customer_month'])['Delivery Time'].agg('mean').reset_index()
delivery_time_and_ontime_rate = avg_delivery_time.merge(delivery_ontime_pivot[delivery_ontime_pivot['Ontime Delivery Rate' ]>0],on='order_delivered_customer_month',how='right')
fig, ax1 = plt.subplots(figsize=(12,6))
ax1=sns.lineplot(data= delivery_time_and_ontime_rate,x='order_delivered_customer_month',y='Delivery Time', ax=ax1, color = 'Red',marker='o')
ax2=ax1.twinx()
ax2=sns.lineplot(data = delivery_time_and_ontime_rate, x = 'order_delivered_customer_month', y='Ontime Delivery Rate', marker='o',ax=ax2)
ax1.tick_params(axis='x', labelrotation = 45)
The relationship between delivery time and ontime delivery rate
2. Sales Performance
orders_info['purchase_month'] = orders_info['order_purchase_timestamp'].dt.strftime('%m')
sales_qty_by_month=orders_info.groupby('purchase_month')['order_id'].agg(TotalSales='count').reset_index()
sales_qty_by_year=orders_info.groupby('order_purchase_year')['order_id'].agg(TotalSales='count')
# sales_qty_by_month.head()
sns.set_style('whitegrid')
fig, axes = plt.subplots(2, 1, figsize=(15, 10))
axes1 = sns.lineplot(data=sales_qty_by_month, x= 'purchase_month',y='TotalSales',ax=axes[0])
axes2=sns.lineplot(data=sales_qty_by_year, x= 'order_purchase_year',y='TotalSales',ax=axes[1])
orders_payment_info=orders_info.merge(order_payments, how ='left', on ='order_id')
orders_payment_info_pivot=orders_payment_info.groupby('order_purchase_month')['payment_value'].agg(TotalOrder='count',TotalAmount='sum').reset_index()
orders_payment_info_pivot['Avg Value per Order'] = orders_payment_info_pivot['TotalAmount']/orders_payment_info_pivot['TotalOrder']
orders_payment_info_pivot
sns.lineplot(data=orders_payment_info_pivot,x='order_purchase_month',y='Avg Value per Order',marker = 'o')
plt.tick_params(axis='x', labelrotation =45)
Avg Value Per Order stay around 150 USD from 2016 to 2018
3. Customer Experience
3.1 Customer Reviews
order_reviews_info = order_reviews.groupby('review_score').agg('count').reset_index()
order_reviews_info.rename(columns ={'review_id':'total_review','review_comment_message':'no_review_have_cmt'},inplace=True)
order_reviews_info=order_reviews_info[['review_score','total_review','no_review_have_cmt']]
order_reviews_info['comment_review_rate']=(order_reviews_info['no_review_have_cmt']/order_reviews_info['total_review'])*100
order_reviews_info
plt.title("Distribution of Review Score")
plt.pie(order_reviews_info['total_review'],labels = order_reviews_info['review_score'], autopct='%.0f%%')
plt.show()
Review Score 5 has highest rate compared with other scores
print("Rate of no message review:",(order_reviews_info['no_review_have_cmt'].sum()/order_reviews_info['total_review'].sum())*100,"%")
order_reviews_info.head()

Customers have tend to give message for orders that they are not satisfied
There is nearly a half of review that customers didnot leave a message. Let’s find why they chosse score 1 or 5 (mad or happy)
orders_info_2=orders_info.merge(order_reviews, on ='order_id',how = 'left')
orders_info_2_no_message= orders_info_2[orders_info_2['review_comment_message'].isna()]
orders_info_2_no_message.head()
orders_info_2_no_message_pivot = orders_info_2_no_message.groupby('review_score').agg({'Delivery Time':'mean','is_deliver_ontime':'sum','order_id':'count'}).reset_index()
orders_info_2_no_message_pivot.rename(columns={'order_id':'total_orders'},inplace=True)
orders_info_2_no_message_pivot['ontime_delivery_rate']=orders_info_2_no_message_pivot['is_deliver_ontime']/orders_info_2_no_message_pivot['total_orders']
orders_info_2_no_message_pivot

As we can see, the lower the review score is, the longer the delivery time is so Delivery Time is one of the factor that affects satisfactory of customers
3.2. Customer Churn Rate
data_churn=orders_info.groupby(['customer_unique_id','order_purchase_year'])['order_id'].agg('count').reset_index()
data_churn['filter']=np.where(data_churn['order_id']==1,1,0)
data_churn_2=data_churn.groupby('order_purchase_year')['filter'].agg(NewCustomers='sum',TotalCustomers='count').reset_index()
data_churn_2['churn rate']=data_churn_2['NewCustomers']/data_churn_2['TotalCustomers']*100
data_churn_2['Existing Customer']=data_churn_2['TotalCustomers']-data_churn_2['NewCustomers']
data_churn_2=data_churn_2[data_churn_2['order_purchase_year'].isin(['2017','2018'])]
data_churn_2.index=data_churn_2['order_purchase_year']
data_churn_2[['NewCustomers','Existing Customer']].plot( stacked = True, kind = 'bar')
plt.tick_params(axis='x', labelrotation =360)

There are nearly 3% old customers come back to buy products in this website in 2017 & 2018 ( in my understanding, customer unique id point to an actual customer in reality). Not count for 2016 because 2016 is considered as the first year in data







