Used Cars Price Prediction

AIM

Predicting the prices of used cars based on their configuration and previous usage.

DATASET LINK

https://www.kaggle.com/datasets/avikasliwal/used-cars-price-prediction

MY NOTEBOOK LINK

https://www.kaggle.com/code/sid4ds/used-cars-price-prediction/

LIBRARIES NEEDED

pandas
numpy
scikit-learn (>=1.5.0 required for Target Encoding)
xgboost
catboost
matplotlib
seaborn

DESCRIPTION

Why is it necessary?

This project aims to predict the prices of used cars listed on an online marketplace based on their features and usage by previous owners. This model can be used by sellers to estimate an approximate price for their cars when they list them on the marketplace. Buyers can use the model to check if the listed price is fair when they decide to buy a used vehicle.

How did you start approaching this project? (Initial thoughts and planning)

Researching previous projects and articles related to the problem.
Data exploration to understand the features.
Identifying different preprocessing strategies for different feature types.
Choosing key metrics for the problem: Root Mean Squared Error (to estimate prediction error) and R2-Score (to measure the proportion of price variance explained by the model).
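
As a rough illustration of the two metrics named above, the following sketch computes RMSE and R2-Score from their standard definitions using only the Python standard library. The price values are made up for the example and do not come from the dataset or the notebook.

```python
import math


def rmse(y_true, y_pred):
    """Root Mean Squared Error: typical prediction error, in target units."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r2_score(y_true, y_pred):
    """R2-Score: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up prices, for illustration only.
actual = [4.5, 12.0, 7.25, 3.0]
predicted = [5.0, 11.0, 7.0, 3.5]

print(f"RMSE: {rmse(actual, predicted):.4f}")
print(f"R2:   {r2_score(actual, predicted):.4f}")
```

RMSE is reported in the same units as the price, while R2-Score is unitless, which is why the two are often read together.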

EXPLANATION

DETAILS OF THE DIFFERENT FEATURES

Feature Name	Description	Type	Values/Range
Name	Car model	Categorical	Names of car models
Location	City where the car is listed for sale	Categorical	Names of cities
Year	Year of original purchase of car	Numerical	Years (e.g., 2010, 2015, etc.)
Kilometers_Driven	Odometer reading of the car	Numerical	Measured in kilometers
Fuel_Type	Fuel type of the car	Categorical	[Petrol, Diesel, CNG, Electric, etc.]
Transmission	Transmission type of the car	Categorical	[Automatic, Manual]
Owner_Type	Number of previous owners of the car	Categorical	Ownership level (e.g., First, Second)
Mileage	Fuel efficiency (mileage) of the car	Numerical	Measured in km/l or equivalent
Engine	Engine capacity of the car	Numerical	Measured in CC (Cubic Centimeters)
Power	Engine power output of the car	Numerical	Measured in BHP (Brake Horsepower)
Seats	Seating capacity of the car	Numerical	Whole numbers
New_Price	Original price of the car at the time of purchase	Numerical	Measured in currency

WHAT I HAVE DONE

Exploratory Data Analysis

Summary statistics
Data visualization for numerical feature distributions
Target splits for categorical features

Data cleaning and Preprocessing

Removing rare categories of brands
Removing outliers for numerical features and target
Categorical feature encoding for low-cardinality features
Target encoding for high-cardinality categorical features (in model pipeline)
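
The idea behind target encoding can be sketched as follows. This is a simplified, standard-library version (smoothed category means learned from training rows only), not scikit-learn's implementation and not necessarily what the notebook does. The function names, the `smoothing` parameter, and the sample rows are all hypothetical.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets, smoothing=10.0):
    """Learn a smoothed mean target per category from training rows only."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    # Shrink each category mean toward the global mean;
    # categories with few rows are shrunk the most.
    encoding = {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }
    return encoding, global_mean


def apply_target_encoding(categories, encoding, global_mean):
    """Encode categories; unseen ones fall back to the training global mean."""
    return [encoding.get(cat, global_mean) for cat in categories]


# Made-up training rows (model name, price), for illustration only.
train_names = ["Swift", "Swift", "City", "City", "City", "Fortuner"]
train_prices = [4.0, 5.0, 7.0, 8.0, 9.0, 25.0]

encoding, global_mean = fit_target_encoding(train_names, train_prices, smoothing=2.0)

# Holdout rows are encoded with statistics learned from training data only.
test_encoded = apply_target_encoding(["City", "Nano"], encoding, global_mean)
print(test_encoded)
```

Fitting the encoding on training rows and only applying it to holdout rows is what keeps target information from leaking into evaluation, which is the reason for placing this step inside the model pipeline.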

Feature engineering and selection

Extracting brand name from model name for a lower-cardinality feature.
Converting categorical Owner_Type to numerical Num_Previous_Owners.
Feature selection based on model-based feature importances and statistical tests.
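
The first two feature-engineering steps above might look roughly like this. The sketch assumes that the brand is the first whitespace-separated token of Name and that Owner_Type uses labels such as "First" and "Second"; both the label set and the numeric mapping are assumptions for illustration, not taken from the notebook.

```python
def extract_brand(name):
    """Assumption: the brand is the first whitespace-separated token of Name."""
    return name.strip().split()[0]


# Hypothetical mapping from ownership label to a numeric value.
OWNER_TYPE_TO_COUNT = {
    "First": 1,
    "Second": 2,
    "Third": 3,
    "Fourth & Above": 4,
}


def add_engineered_features(row):
    """Return a copy of the row with Brand and Num_Previous_Owners added."""
    engineered = dict(row)
    engineered["Brand"] = extract_brand(row["Name"])
    engineered["Num_Previous_Owners"] = OWNER_TYPE_TO_COUNT[row["Owner_Type"]]
    return engineered


# Made-up listing, for illustration only.
row = {"Name": "Maruti Swift VDI", "Owner_Type": "First"}
result = add_engineered_features(row)
print(result["Brand"], result["Num_Previous_Owners"])
```

Taking only the first token would split multi-word brand names, so a real implementation may need a small lookup of exceptions.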

Modeling

Holdout dataset created for model testing
Setting up a framework for easier testing of multiple models.
Models trained: Linear Regression, K-Nearest Neighbors, Decision Tree, Random Forest, AdaBoost, Multi-Layer Perceptron, XGBoost and CatBoost.
Models were ensembled using Simple and Weighted averaging.
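
Simple and weighted averaging of model predictions can be sketched as below. The model names, predictions, and weights are made up for the example; how the weights were actually chosen in the notebook is not stated here.

```python
def simple_average(predictions_by_model):
    """Average the predictions of all models row by row, with equal weight."""
    n_models = len(predictions_by_model)
    return [sum(row) / n_models for row in zip(*predictions_by_model.values())]


def weighted_average(predictions_by_model, weights):
    """Average predictions row by row, weighting each model by `weights`."""
    total_weight = sum(weights[name] for name in predictions_by_model)
    n_rows = len(next(iter(predictions_by_model.values())))
    blended = []
    for i in range(n_rows):
        weighted_sum = sum(
            weights[name] * preds[i] for name, preds in predictions_by_model.items()
        )
        blended.append(weighted_sum / total_weight)
    return blended


# Made-up holdout predictions from three models, for illustration only.
preds = {
    "xgboost": [5.0, 10.0],
    "catboost": [6.0, 12.0],
    "random_forest": [4.0, 11.0],
}
# Hypothetical weights favoring the stronger models.
weights = {"xgboost": 0.5, "catboost": 0.3, "random_forest": 0.2}

simple = simple_average(preds)
weighted = weighted_average(preds, weights)
print(simple, weighted)
```

Dividing by the total weight means the weights do not have to sum to one.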

Result analysis

Predictions made on holdout test set
Models compared based on chosen metrics: RMSE and R2-Score.
Visualized predicted prices vs actual prices to analyze errors.

PROJECT TRADE-OFFS AND SOLUTIONS

Trade Off 1

Training time & Model complexity vs Reducing error

Solution: Limiting depth and number of estimators for tree-based models. Overfitting detection and early stopping mechanism for neural network training.
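
A generic patience-based early-stopping rule, one common way to implement the mechanism mentioned above, can be sketched as follows. The function name, the `patience` and `min_delta` parameters, and the loss values are hypothetical; the notebook may rely on a library's built-in option instead.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once the loss has failed to improve by more than
    `min_delta` for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Never triggered: training ran through every epoch.
    return len(val_losses) - 1, best_epoch


# Made-up validation losses per epoch, for illustration only.
losses = [3.2, 2.9, 2.7, 2.75, 2.72, 2.8, 2.6]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch, best_epoch)
```

A smaller `patience` shortens training but risks stopping before a later improvement, which is the same time-versus-error trade-off described above.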

SCREENSHOTS

Project workflow

  graph LR
    A[Start] --> B{Error?};
    B -->|Yes| C[Hmm...];
    C --> D[Debug];
    D --> B;
    B ---->|No| E[Yay!];

Data Exploration

Feature Selection

Feature Correlation (figure: featselect_corrfeatures)

Target Correlation (figure: featselect_corrtarget)

Mutual Information (figure: featselect_mutualinfo)
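
For readers without access to the figures, the target-correlation view can be approximated with a small sketch like the one below, which ranks features by the absolute Pearson correlation with the price. The feature values are made up, and the sketch covers only linear correlation, not mutual information.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up feature columns and prices, for illustration only.
features = {
    "Power": [70.0, 90.0, 120.0, 180.0],
    "Kilometers_Driven": [80000.0, 60000.0, 40000.0, 20000.0],
    "Seats": [5.0, 5.0, 7.0, 5.0],
}
price = [3.5, 5.0, 8.0, 20.0]

correlations = {
    name: pearson_correlation(values, price) for name, values in features.items()
}
# Rank features by the strength of their linear relationship with price.
ranking = sorted(correlations, key=lambda name: abs(correlations[name]), reverse=True)
for name in ranking:
    print(f"{name}: {correlations[name]:+.3f}")
```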

MODELS USED AND THEIR PERFORMANCE

Model	RMSE	R2-Score
Linear Regression	3.5803	0.7915
K-Nearest Neighbors	2.8261	0.8701
Decision Tree	2.6790	0.8833
Random Forest	2.4619	0.9014
AdaBoost	2.3629	0.9092
Multi-Layer Perceptron	2.6255	0.8879
XGBoost w/o preprocessing	2.1649	0.9238
XGBoost with preprocessing	2.0987	0.9284
CatBoost w/o preprocessing	2.1734	0.9232
Simple average ensemble	2.2804	0.9154
Weighted average ensemble	2.1296	0.9262

CONCLUSION

WHAT YOU HAVE LEARNED

Insights gained from the data

Features related to car configuration such as Power, Engine and Transmission are some of the most informative features. Usage-related features such as Year and current Mileage are also important.
Seating capacity and Number of previous owners had relatively less predictive power. However, none of the features were candidates for removal.

Improvements in understanding machine learning concepts

Implemented target-encoding for high-cardinality categorical features.
Designed pipelines to avoid data leakage.
Ensembling models using prediction averaging.

Challenges faced and how they were overcome

Handling mixed feature types in preprocessing pipelines.
Regularization and overfitting detection to reduce training time while maintaining performance.

USE CASES OF THIS MODEL

Sellers can use the model to estimate an approximate price for their cars when they list them on the marketplace.

Buyers can use the model to check if the listed price is fair when they decide to buy a used vehicle.

FEATURES PLANNED BUT NOT IMPLEMENTED

Feature 1

Complex model-ensembling through stacking or hill-climbing was not implemented due to significantly longer training time.