Car Prices Analysis
Introduction
The “Vehicle Sales and Market Trends Dataset” provides a comprehensive collection of information pertaining to the sales transactions of various vehicles. This dataset encompasses details such as the year, make, model, trim, body type, transmission type, VIN (Vehicle Identification Number), state of registration, condition rating, odometer reading, exterior and interior colors, seller information, Manheim Market Report (MMR) values, selling prices, and sale dates.
My Kaggle project link: https://www.kaggle.com/code/trmnguynhi/car-prices-analysis
This analysis will contain four steps as below:
- Define business task
- Data Preparation
- Data Processing
- Data Analysis
Business Task
In this analysis, I will focus on two below factors to have overall overview of the business:
1. Sales Performance
- Sales by Month-Year
- Sales by Make
- Sales by Color
- Sales by Body Type
- Sales by State ( Map Distribution)
2. Pricing Analyst
- Average selling price per body
- Pricing and condition relationship
- Pricing and odometer relationship
- Model regression of price, condition and odometer
Data Preparation
Import the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime as dt
import seaborn as sns
%matplotlib inline
Import datasets
car_prices = pd.read_csv("/kaggle/input/vehicle-sales-data/car_prices.csv")
Data Process
Explore datasets
car_prices.info()
car_prices.head(5)

Remove row with wrong format of saledate
car_prices['is_digit']= car_prices['saledate'].str.isdigit().fillna(False)
car_prices=car_prices[car_prices['is_digit']==False]
car_prices=car_prices.drop('is_digit', axis=1)
car_prices['body']=[str(i).capitalize() for i in car_prices['body']]
Data Analysis
1. Sales Performance
car_prices['saledate']=pd.to_datetime(car_prices['saledate'])
car_prices['salemonth_str']=car_prices['saledate'].astype('str').str[:10]
car_prices['salemonth']=pd.to_datetime(car_prices['salemonth_str']).dt.strftime('%Y-%m')
car_prices['saleyear']=pd.to_datetime(car_prices['salemonth_str']).dt.strftime('%Y')
car_prices=car_prices.drop('salemonth_str', axis=1)
sale_by_month=car_prices.groupby('salemonth')['sellingprice'].agg(SaleQty='sum').reset_index()
sale_by_month['SaleQty']=sale_by_month['SaleQty']/1000
sale_by_month.rename(columns={'SaleQty':'SaleQty (k$)'},inplace=True)

sns.set_style('whitegrid')
sns.lineplot(sale_by_month, x='salemonth', y='SaleQty (k$)', marker = 'o', color='red')
plt.xticks(rotation=45)
plt.title('Sales Revenue by Month')

sale_by_make=car_prices.groupby('make')['sellingprice'].agg(ToTalSalebyMake='count').reset_index()
top_10_sale_by_make=sale_by_make.sort_values(by='ToTalSalebyMake', ascending=False)[:10]
top_10_sale_by_make

sns.barplot(top_10_sale_by_make, x='ToTalSalebyMake',y='make' )
plt.title("Top 10 Highest Sales Qty by Make")

sale_by_color=car_prices.groupby('color')['sellingprice'].agg(TotalSalebyColor='count').reset_index()
sale_by_color.sort_values(by='TotalSalebyColor', ascending=False, inplace = True)
sale_by_color=sale_by_color.reset_index()
sale_by_color['color'][6:]='Others'
sale_by_color=sale_by_color.groupby('color')['TotalSalebyColor'].sum().reset_index(0)
sale_by_color.sort_values(by='TotalSalebyColor', ascending=False, inplace = True)
sale_by_color

plt.figure(facecolor='silver')
colors = ['black', 'white', 'silver', 'grey','green','blue','red']
plt.pie(sale_by_color['TotalSalebyColor'], labels = sale_by_color['color'], autopct='%.0f%%', colors=colors)

Customer prefer to buy Black, White, Silver than other colors
sale_by_body = car_prices.groupby('body')['sellingprice'].count().reset_index()
sale_by_body.sort_values(by='sellingprice', ascending=False,inplace=True)
sale_by_body.head(20)

SUV & Sedan is the most popular for customer to buy
sale_by_location = car_prices.groupby('state')['sellingprice'].agg(SaleQty='count',AvgSellingPrice='mean').reset_index()
sale_by_location['state']=sale_by_location['state'].str.upper()
import plotly.graph_objs as go from plotly.offline import init_notebook_mode,iplot,plot init_notebook_mode(connected=True)data = dict(type='choropleth', colorscale = 'Viridis', reversescale = True, locations = sale_by_location['state'], z = sale_by_location['SaleQty'], locationmode = 'USA-states', text = car_prices['state'], marker = dict(line = dict(color = 'rgb(255,255,255)',width = 1)), colorbar = {'title':"Sales Qty by State"} )
layout = dict(title = 'Sales Qty Distribution by State',
geo = dict(scope='usa',
showlakes = True,
lakecolor = 'rgb(85,173,240)')
)
choromap = go.Figure(data = [data],layout = layout)
iplot(choromap)

2. Pricing Analyst
avg_selling_price = car_prices.groupby('body')['sellingprice'].agg(AvgPrice='mean').reset_index()
avg_selling_price.sort_values(by='AvgPrice', ascending=False, inplace = True)
avg_selling_price=avg_selling_price.reset_index()
fig, axes = plt.subplots(1, 2, figsize=(40, 15))
axes1=sns.barplot(avg_selling_price[:10],y='body',x='AvgPrice', ax=axes[0] )
axes1.set_title("Top 10 Highest Price by Body Type")
axes2=sns.barplot(avg_selling_price[-10:],y='body',x='AvgPrice', ax=axes[1] )
axes2.set_title("Top 10 Lowest Price by Body Type")

Difference between MMR & Selling Price
car_prices['differ']=-car_prices['mmr'] + car_prices['sellingprice']
car_prices['cluster']=['Selling Price Higher Than MMR' if i>0 else 'Selling Price Lower Than MMR' for i in car_prices['differ'] ]
car_prices['difference rate']= car_prices['differ'] / car_prices['mmr']*100
difference_mmr_and_sellingprice = car_prices.groupby(['salemonth','cluster'])['difference rate'].agg(differencerate='mean', count='count').reset_index()
fig ,ax1 = plt.subplots(figsize = (15,12))
ax1=sns.lineplot(difference_mmr_and_sellingprice, x='salemonth', y='count', hue = 'cluster', ax = ax1)

SellingPrice & Condition/Odometer Relationship
sns.jointplot(x='sellingprice',y='odometer',data=car_prices)

sns.lmplot(x='sellingprice',y='condition',data=car_prices)

Model Linear Regression
Train model
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
train_data = car_prices[car_prices['sellingprice'].notnull() & car_prices['condition'].notnull() & car_prices['odometer'].notnull() ]
sns.heatmap(train_data[['sellingprice','condition','odometer']].corr(),annot=True)

y = train_data['sellingprice']
X = train_data[['condition','odometer']]
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=101)
lm = LinearRegression()
lm.fit(X_train,y_train)
Print out the coefficients of the model
print('Coefficients: \n', lm.coef_)

Predicting Test Data
Now that we have fit our model, let’s evaluate its performance by predicting off the test values!
Use lm.predict() to predict off the X_test set of the data.
predictions = lm.predict(X_test) plt.scatter(y_test,predictions) plt.xlabel('Y Test') plt.ylabel('Predicted Y')
Evaluating the Model
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions))) 
Plot a histogram of the residuals and make sure it looks normally distributed.
sns.distplot((y_test-predictions),bins=50); 
coeffecients = pd.DataFrame(lm.coef_,X.columns)
coeffecients.columns = ['Coeffecient']
coeffecients


