Brazilian E-Commerce Public Dataset by Olist

Introduction

This is my personal project, use dataset from Kaggle.com to practice Python skills, data visualization and business knowledge.

Link Kaggle: https://www.kaggle.com/code/trmnguynhi/brazilian-e-commerce-public-eda

Business case

This dataset was generously provided by Olist, the largest department store in Brazilian marketplaces. Olist connects small businesses from all over Brazil to channels without hassle and with a single contract. Those merchants are able to sell their products through the Olist Store and ship them directly to the customers using Olist logistics partners.

After a customer purchases the product from Olist Store a seller gets notified to fulfill that order. Once the customer receives the product, or the estimated delivery date is due, the customer gets a satisfaction survey by email where he can give a note for the purchase experience and write down some comments.

The dataset has 8 CSV files, with the relationship as capture below

This analysis will contain four steps as below:

Define business task
Data Preparation
Data Processing
Data Analysis

In this analysis, I will focus on three below factors to have overall overview of the business and see which issue need to be solved:

Delivery Performance
Sales Performance
Customer Experience Indicator

Data Preparation

Import the libraries & datasets

import pandas as pd
import numpy as np import matplotlib.pyplot as plt import seaborn as sns
from pandas.tseries.offsets import DateOffset
from dateutil.relativedelta import relativedelta %matplotlib inline customers = pd.read_csv("/kaggle/input/brazilian-ecommerce/olist_customers_dataset.csv")
geolocation= pd.read_csv("/kaggle/input/brazilian-ecommerce/olist_geolocation_dataset.csv")
order_items = pd.read_csv("/kaggle/input/brazilian-ecommerce/olist_order_items_dataset.csv")
order_payments = pd.read_csv("/kaggle/input/brazilian-ecommerce/olist_order_payments_dataset.csv")
order_reviews = pd.read_csv("/kaggle/input/brazilian-ecommerce/olist_order_reviews_dataset.csv")
date_cols = ['order_purchase_timestamp','order_approved_at','order_delivered_carrier_date','order_delivered_customer_date','order_estimated_delivery_date']
orders = pd.read_csv("/kaggle/input/brazilian-ecommerce/olist_orders_dataset.csv",parse_dates=date_cols)
products = pd.read_csv("/kaggle/input/brazilian-ecommerce/olist_products_dataset.csv")
sellers = pd.read_csv("/kaggle/input/brazilian-ecommerce/olist_sellers_dataset.csv")

Data Processing

Explore datasets

def overview_data(dataset_name):
                #   , null_amount, name_null_cols, num_null_cols):
    nrows=dataset_name.shape[0]
    ncols=dataset_name.shape[1]
    null_values = dataset_name.isnull().sum()    
    name_nul_cols = []
    for i in range(len(null_values)):
        if null_values[i] >0:
            name_nul_cols.append(null_values.index[i])    
    num_nul_cols = len(name_nul_cols)
    return nrows, ncols , name_nul_cols,num_nul_cols

detailed_db = pd.DataFrame()
list_dataset= [customers,geolocation,order_items,order_payments,order_reviews,orders,products,sellers]
list_dataset_name= ['customers','geolocation','order_items','order_payments','order_reviews','orders','products','sellers']
detailed_db['dataset'] = [i for i in list_dataset_name]
detailed_db['nrows'] = [overview_data(i)[0] for i in list_dataset]
detailed_db['ncols'] = [overview_data(i)[1] for i in list_dataset]
detailed_db['names_of_null_cols'] = [', '.join(overview_data(i)[2]) for i in list_dataset]
detailed_db['num_null_cols'] = [overview_data(i)[3] for i in list_dataset]

orders.info()
order_items.info()

Data Analysis

1. Delivery Performance

1.1 Ontime Delivery

orders_info = orders.merge(customers, on = 'customer_id', how = 'left' )
orders_info['order_delivered_customer_month'] = orders_info['order_delivered_customer_date'].dt.strftime('%Y-%m')
orders_info['order_purchase_month'] = orders_info['order_purchase_timestamp'].dt.strftime('%Y-%m')
orders_info['order_purchase_year'] = orders_info['order_purchase_timestamp'].dt.strftime('%Y')
orders_info['Delivery Time'] = ((orders_info['order_delivered_customer_date'] - orders_info['order_delivered_carrier_date']))/24

orders_info['is_deliver_ontime']=np.where( orders_info['order_delivered_customer_date'] > orders_info['order_estimated_delivery_date'],0,1)
delivery_ontime_pivot = orders_info[orders_info['order_delivered_customer_date'].isna() == False].groupby('order_delivered_customer_month')['is_deliver_ontime'].agg(TotalDelivered='count',TotalOntime='sum').reset_index()
delivery_ontime_pivot['Ontime Delivery Rate'] = (delivery_ontime_pivot['TotalOntime'] / delivery_ontime_pivot['TotalDelivered'])*100

fig, ax1 = plt.subplots(figsize=(12,6))
sns.lineplot(data = delivery_ontime_pivot[delivery_ontime_pivot['Ontime Delivery Rate' ]>0], x = 'order_delivered_customer_month', y='Ontime Delivery Rate', marker='o', ax=ax1)
ax2 = ax1.twinx()
sns.barplot(data = delivery_ontime_pivot[delivery_ontime_pivot['Ontime Delivery Rate' ]>0], x= 'order_delivered_customer_month', y='TotalDelivered', alpha=0.5, ax=ax2)
ax1.tick_params(axis='x', labelrotation = 45)

This is shown that the delivery ontime is decreased from 2016 to 2028 with the increasing of orders volumn ( not count for undelivered orders)

delivery_ontime_by_state = orders_info[orders_info['order_delivered_customer_date'].isna() == False].groupby(['customer_state'])['is_deliver_ontime'].agg(TotalDelivered = 'count',TotalOntime='sum').reset_index()
delivery_ontime_by_state['Ontime Delivery Rate'] =  (delivery_ontime_by_state['TotalOntime'] / delivery_ontime_by_state['TotalDelivered'])*100
delivery_ontime_by_state=delivery_ontime_by_state.sort_values(by='Ontime Delivery Rate')
plt.figure(figsize=(12, 7))
sns.barplot(data =delivery_ontime_by_state, x = 'customer_state', y= 'Ontime Delivery Rate',color='skyblue')

Top 10 States have the lowest ontime rate need to be reviewed and enhanced

1.2 Average Delivery Time

avg_delivery_time = orders_info[orders_info['order_delivered_customer_date'].isna() == False].groupby(['order_delivered_customer_month'])['Delivery Time'].agg('mean').reset_index()
delivery_time_and_ontime_rate = avg_delivery_time.merge(delivery_ontime_pivot[delivery_ontime_pivot['Ontime Delivery Rate' ]>0],on='order_delivered_customer_month',how='right')

fig, ax1 = plt.subplots(figsize=(12,6))
ax1=sns.lineplot(data= delivery_time_and_ontime_rate,x='order_delivered_customer_month',y='Delivery Time', ax=ax1, color = 'Red',marker='o')
ax2=ax1.twinx()
ax2=sns.lineplot(data = delivery_time_and_ontime_rate, x = 'order_delivered_customer_month', y='Ontime Delivery Rate', marker='o',ax=ax2)
ax1.tick_params(axis='x', labelrotation = 45)

The relationship between delivery time and ontime delivery rate

2. Sales Performance

orders_info['purchase_month'] = orders_info['order_purchase_timestamp'].dt.strftime('%m')
sales_qty_by_month=orders_info.groupby('purchase_month')['order_id'].agg(TotalSales='count').reset_index()
sales_qty_by_year=orders_info.groupby('order_purchase_year')['order_id'].agg(TotalSales='count')
# sales_qty_by_month.head()
sns.set_style('whitegrid')
fig, axes = plt.subplots(2, 1, figsize=(15, 10))
axes1 = sns.lineplot(data=sales_qty_by_month, x= 'purchase_month',y='TotalSales',ax=axes[0])
axes2=sns.lineplot(data=sales_qty_by_year, x= 'order_purchase_year',y='TotalSales',ax=axes[1])

orders_payment_info=orders_info.merge(order_payments, how ='left', on ='order_id')
orders_payment_info_pivot=orders_payment_info.groupby('order_purchase_month')['payment_value'].agg(TotalOrder='count',TotalAmount='sum').reset_index()
orders_payment_info_pivot['Avg Value per Order'] = orders_payment_info_pivot['TotalAmount']/orders_payment_info_pivot['TotalOrder']
orders_payment_info_pivot
sns.lineplot(data=orders_payment_info_pivot,x='order_purchase_month',y='Avg Value per Order',marker = 'o')
plt.tick_params(axis='x', labelrotation =45)

Avg Value Per Order stay around 150 USD from 2016 to 2018

3. Customer Experience

3.1 Customer Reviews

order_reviews_info = order_reviews.groupby('review_score').agg('count').reset_index()
order_reviews_info.rename(columns ={'review_id':'total_review','review_comment_message':'no_review_have_cmt'},inplace=True)
order_reviews_info=order_reviews_info[['review_score','total_review','no_review_have_cmt']]
order_reviews_info['comment_review_rate']=(order_reviews_info['no_review_have_cmt']/order_reviews_info['total_review'])*100
order_reviews_info
plt.title("Distribution of Review Score")
plt.pie(order_reviews_info['total_review'],labels = order_reviews_info['review_score'], autopct='%.0f%%')
plt.show()

Review Score 5 has highest rate compared with other scores

print("Rate of no message review:",(order_reviews_info['no_review_have_cmt'].sum()/order_reviews_info['total_review'].sum())*100,"%")
order_reviews_info.head()

Customers have tend to give message for orders that they are not satisfied

There is nearly a half of review that customers didnot leave a message. Let’s find why they chosse score 1 or 5 (mad or happy)

orders_info_2=orders_info.merge(order_reviews, on ='order_id',how = 'left')
orders_info_2_no_message= orders_info_2[orders_info_2['review_comment_message'].isna()]
orders_info_2_no_message.head()
orders_info_2_no_message_pivot = orders_info_2_no_message.groupby('review_score').agg({'Delivery Time':'mean','is_deliver_ontime':'sum','order_id':'count'}).reset_index()
orders_info_2_no_message_pivot.rename(columns={'order_id':'total_orders'},inplace=True)
orders_info_2_no_message_pivot['ontime_delivery_rate']=orders_info_2_no_message_pivot['is_deliver_ontime']/orders_info_2_no_message_pivot['total_orders']
orders_info_2_no_message_pivot

As we can see, the lower the review score is, the longer the delivery time is so Delivery Time is one of the factor that affects satisfactory of customers

3.2. Customer Churn Rate

data_churn=orders_info.groupby(['customer_unique_id','order_purchase_year'])['order_id'].agg('count').reset_index()
data_churn['filter']=np.where(data_churn['order_id']==1,1,0)
data_churn_2=data_churn.groupby('order_purchase_year')['filter'].agg(NewCustomers='sum',TotalCustomers='count').reset_index()
data_churn_2['churn rate']=data_churn_2['NewCustomers']/data_churn_2['TotalCustomers']*100
data_churn_2['Existing Customer']=data_churn_2['TotalCustomers']-data_churn_2['NewCustomers']
data_churn_2=data_churn_2[data_churn_2['order_purchase_year'].isin(['2017','2018'])]
data_churn_2.index=data_churn_2['order_purchase_year']
data_churn_2[['NewCustomers','Existing Customer']].plot( stacked = True, kind = 'bar')
plt.tick_params(axis='x', labelrotation =360)

There are nearly 3% old customers come back to buy products in this website in 2017 & 2018 ( in my understanding, customer unique id point to an actual customer in reality). Not count for 2016 because 2016 is considered as the first year in data

My Blog