Introduction

The “Vehicle Sales and Market Trends Dataset” provides a comprehensive collection of information pertaining to the sales transactions of various vehicles. This dataset encompasses details such as the year, make, model, trim, body type, transmission type, VIN (Vehicle Identification Number), state of registration, condition rating, odometer reading, exterior and interior colors, seller information, Manheim Market Report (MMR) values, selling prices, and sale dates.

My Kaggle project link: https://www.kaggle.com/code/trmnguynhi/car-prices-analysis

This analysis will contain four steps as below:

Define business task
Data Preparation
Data Processing
Data Analysis

Business Task

In this analysis, I will focus on two below factors to have overall overview of the business:

1. Sales Performance

Sales by Month-Year
Sales by Make
Sales by Color
Sales by Body Type
Sales by State ( Map Distribution)

2. Pricing Analyst

Average selling price per body
Pricing and condition relationship
Pricing and odometer relationship
Model regression of price, condition and odometer

Data Preparation

Import the libraries

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import datetime as dt
import seaborn as sns
%matplotlib inline

Import datasets

car_prices = pd.read_csv("/kaggle/input/vehicle-sales-data/car_prices.csv")

Data Process

Explore datasets

car_prices.info()

car_prices.head(5)

Remove row with wrong format of saledate

car_prices['is_digit']= car_prices['saledate'].str.isdigit().fillna(False)
car_prices=car_prices[car_prices['is_digit']==False]
car_prices=car_prices.drop('is_digit', axis=1)

car_prices['body']=[str(i).capitalize() for i in car_prices['body']]

Data Analysis

1. Sales Performance

car_prices['saledate']=pd.to_datetime(car_prices['saledate'])
car_prices['salemonth_str']=car_prices['saledate'].astype('str').str[:10]
car_prices['salemonth']=pd.to_datetime(car_prices['salemonth_str']).dt.strftime('%Y-%m')
car_prices['saleyear']=pd.to_datetime(car_prices['salemonth_str']).dt.strftime('%Y')
car_prices=car_prices.drop('salemonth_str', axis=1)

sale_by_month=car_prices.groupby('salemonth')['sellingprice'].agg(SaleQty='sum').reset_index()
sale_by_month['SaleQty']=sale_by_month['SaleQty']/1000
sale_by_month.rename(columns={'SaleQty':'SaleQty (k$)'},inplace=True)

sns.set_style('whitegrid')
sns.lineplot(sale_by_month, x='salemonth', y='SaleQty (k$)', marker = 'o', color='red')
plt.xticks(rotation=45)
plt.title('Sales Revenue by Month')

sale_by_make=car_prices.groupby('make')['sellingprice'].agg(ToTalSalebyMake='count').reset_index()
top_10_sale_by_make=sale_by_make.sort_values(by='ToTalSalebyMake', ascending=False)[:10]
top_10_sale_by_make

sns.barplot(top_10_sale_by_make, x='ToTalSalebyMake',y='make' )
plt.title("Top 10 Highest Sales Qty by Make")

sale_by_color=car_prices.groupby('color')['sellingprice'].agg(TotalSalebyColor='count').reset_index()
sale_by_color.sort_values(by='TotalSalebyColor', ascending=False, inplace = True)
sale_by_color=sale_by_color.reset_index()
sale_by_color['color'][6:]='Others'
sale_by_color=sale_by_color.groupby('color')['TotalSalebyColor'].sum().reset_index(0)
sale_by_color.sort_values(by='TotalSalebyColor', ascending=False, inplace = True)
sale_by_color

plt.figure(facecolor='silver')
colors = ['black', 'white', 'silver', 'grey','green','blue','red']
plt.pie(sale_by_color['TotalSalebyColor'], labels = sale_by_color['color'], autopct='%.0f%%', colors=colors)

Customer prefer to buy Black, White, Silver than other colors

sale_by_body = car_prices.groupby('body')['sellingprice'].count().reset_index()
sale_by_body.sort_values(by='sellingprice', ascending=False,inplace=True)
sale_by_body.head(20)

SUV & Sedan is the most popular for customer to buy

sale_by_location = car_prices.groupby('state')['sellingprice'].agg(SaleQty='count',AvgSellingPrice='mean').reset_index()
sale_by_location['state']=sale_by_location['state'].str.upper()

import plotly.graph_objs as go 
from plotly.offline import init_notebook_mode,iplot,plot
init_notebook_mode(connected=True)  data = dict(type='choropleth',
            colorscale = 'Viridis',
            reversescale = True,
            locations = sale_by_location['state'],
            z = sale_by_location['SaleQty'],
            locationmode = 'USA-states',
            text = car_prices['state'],
            marker = dict(line = dict(color = 'rgb(255,255,255)',width = 1)),
            colorbar = {'title':"Sales Qty by State"}
            )

layout = dict(title = 'Sales Qty Distribution by State',
              geo = dict(scope='usa',
                         showlakes = True,
                         lakecolor = 'rgb(85,173,240)')
             )
choromap = go.Figure(data = [data],layout = layout)
iplot(choromap)

2. Pricing Analyst

avg_selling_price = car_prices.groupby('body')['sellingprice'].agg(AvgPrice='mean').reset_index()
avg_selling_price.sort_values(by='AvgPrice', ascending=False, inplace = True)
avg_selling_price=avg_selling_price.reset_index()

fig, axes = plt.subplots(1, 2, figsize=(40, 15))
axes1=sns.barplot(avg_selling_price[:10],y='body',x='AvgPrice', ax=axes[0] )
axes1.set_title("Top 10 Highest Price by Body Type")
axes2=sns.barplot(avg_selling_price[-10:],y='body',x='AvgPrice', ax=axes[1] )
axes2.set_title("Top 10 Lowest Price by Body Type")

Difference between MMR & Selling Price

car_prices['differ']=-car_prices['mmr'] + car_prices['sellingprice']
car_prices['cluster']=['Selling Price Higher Than MMR' if i>0 else 'Selling Price Lower Than MMR' for i in car_prices['differ'] ]
car_prices['difference rate']= car_prices['differ'] / car_prices['mmr']*100
difference_mmr_and_sellingprice = car_prices.groupby(['salemonth','cluster'])['difference rate'].agg(differencerate='mean', count='count').reset_index()

fig ,ax1 = plt.subplots(figsize = (15,12))
ax1=sns.lineplot(difference_mmr_and_sellingprice, x='salemonth', y='count', hue = 'cluster', ax = ax1)

SellingPrice & Condition/Odometer Relationship

sns.jointplot(x='sellingprice',y='odometer',data=car_prices)

sns.lmplot(x='sellingprice',y='condition',data=car_prices)

Model Linear Regression

Train model

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

train_data = car_prices[car_prices['sellingprice'].notnull() & car_prices['condition'].notnull()  & car_prices['odometer'].notnull() ]

sns.heatmap(train_data[['sellingprice','condition','odometer']].corr(),annot=True)

y = train_data['sellingprice']
X = train_data[['condition','odometer']]
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=101)
lm = LinearRegression()
lm.fit(X_train,y_train)

Print out the coefficients of the model

print('Coefficients: \n', lm.coef_)

Predicting Test Data

Now that we have fit our model, let’s evaluate its performance by predicting off the test values!

Use lm.predict() to predict off the X_test set of the data.

predictions = lm.predict(X_test)
plt.scatter(y_test,predictions)
plt.xlabel('Y Test')
plt.ylabel('Predicted Y')

Evaluating the Model

from sklearn import metrics

print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))

Plot a histogram of the residuals and make sure it looks normally distributed.

sns.distplot((y_test-predictions),bins=50);

coeffecients = pd.DataFrame(lm.coef_,X.columns)
coeffecients.columns = ['Coeffecient']
coeffecients

Nguyen Hai Tram

Car Prices Analysis