Used Cars Price Prediction
AIM
Predicting the prices of used cars based on their configuration and previous usage.
DATASET LINK
https://www.kaggle.com/datasets/avikasliwal/used-cars-price-prediction
MY NOTEBOOK LINK
https://www.kaggle.com/code/sid4ds/used-cars-price-prediction/
LIBRARIES NEEDED
- pandas
- numpy
- scikit-learn (>=1.5.0 required for Target Encoding)
- xgboost
- catboost
- matplotlib
- seaborn
DESCRIPTION
Why is it necessary?
- This project aims to predict the prices of used cars listed on an online marketplace based on their features and usage by previous owners. This model can be used by sellers to estimate an approximate price for their cars when they list them on the marketplace. Buyers can use the model to check if the listed price is fair when they decide to buy a used vehicle.
How did you start approaching this project? (Initial thoughts and planning)
- Researching previous projects and articles related to the problem.
- Data exploration to understand the features.
- Identifying different preprocessing strategies for different feature types.
- Choosing key metrics for the problem: Root Mean Squared Error (for error estimation) and R2-Score (for the proportion of price variance explained by the model).
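As a minimal sketch of how the two chosen metrics are defined, the following plain-Python functions compute RMSE and R2-Score from lists of actual and predicted prices. The function names and the sample prices are hypothetical and only serve as illustration; the notebook itself may rely on library implementations instead.

```python
import math


def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the mean squared residual."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r2_score(y_true, y_pred):
    """R2-Score: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical prices, for illustration only (not taken from the dataset)
actual = [4.5, 12.0, 7.25, 3.0]
predicted = [5.0, 11.0, 7.0, 3.5]

error = rmse(actual, predicted)
explained = r2_score(actual, predicted)
print(f"RMSE={error:.4f}, R2={explained:.4f}")
```

RMSE is expressed in the same unit as the target price, so lower is better, while an R2-Score closer to 1 means that more of the variance in price is captured by the predictions.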
EXPLANATION
DETAILS OF THE DIFFERENT FEATURES
Feature Name | Description | Type | Values/Range |
---|---|---|---|
Name | Car model | Categorical | Names of car models |
Location | City where the car is listed for sale | Categorical | Names of cities |
Year | Year of original purchase of car | Numerical | Years (e.g., 2010, 2015, etc.) |
Kilometers_Driven | Odometer reading of the car | Numerical | Measured in kilometers |
Fuel_Type | Fuel type of the car | Categorical | [Petrol, Diesel, CNG, Electric, etc.] |
Transmission | Transmission type of the car | Categorical | [Automatic, Manual] |
Owner_Type | Ownership history of the car (first owner, second owner, etc.) | Categorical | Ordinal labels (e.g., First, Second) |
Mileage | Fuel efficiency of the car | Numerical | Measured in km/l or equivalent |
Engine | Engine capacity of the car | Numerical | Measured in CC (Cubic Centimeters) |
Power | Engine power output of the car | Numerical | Measured in BHP (Brake Horsepower) |
Seats | Seating capacity of the car | Numerical | Whole numbers |
New_Price | Original price of the car at the time of purchase | Numerical | Measured in currency |
WHAT I HAVE DONE
Exploratory Data Analysis
- Summary statistics
- Data visualization for numerical feature distributions
- Target splits for categorical features
Data cleaning and Preprocessing
- Removing rare categories of brands
- Removing outliers for numerical features and target
- Categorical feature encoding for low-cardinality features
- Target encoding for high-cardinality categorical features (in model pipeline)
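The first two cleaning steps above can be sketched roughly as follows. This is an illustrative approximation, not the notebook's actual code: the minimum-count threshold, the 1.5 x IQR outlier rule, the helper names and the sample rows are all assumptions, and the source does not state which outlier criterion was used.

```python
from collections import Counter
import statistics


def drop_rare_categories(rows, key, min_count):
    """Keep only rows whose category appears at least min_count times."""
    counts = Counter(row[key] for row in rows)
    return [row for row in rows if counts[row[key]] >= min_count]


def drop_iqr_outliers(rows, key, k=1.5):
    """Keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR] (assumed rule)."""
    values = [row[key] for row in rows]
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [row for row in rows if low <= row[key] <= high]


# Hypothetical listings, for illustration only
rows = [
    {"Brand": "Maruti", "Kilometers_Driven": 20000},
    {"Brand": "Maruti", "Kilometers_Driven": 30000},
    {"Brand": "Maruti", "Kilometers_Driven": 40000},
    {"Brand": "Hyundai", "Kilometers_Driven": 50000},
    {"Brand": "Hyundai", "Kilometers_Driven": 60000},
    {"Brand": "Hyundai", "Kilometers_Driven": 70000},
    {"Brand": "Bentley", "Kilometers_Driven": 80000},
    {"Brand": "Maruti", "Kilometers_Driven": 6500000},
]

common = drop_rare_categories(rows, "Brand", min_count=2)
cleaned = drop_iqr_outliers(common, "Kilometers_Driven")
print(len(rows), len(common), len(cleaned))
```

Both filters should be derived from the training data only, so that the holdout set does not influence which categories or value ranges are kept.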
Feature engineering and selection
- Extracting brand name from model name for a lower-cardinality feature.
- Converting categorical Owner_Type to numerical Num_Previous_Owners.
- Feature selection based on model-based feature importances and statistical tests.
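A rough sketch of the two engineered features is shown below. It assumes that the brand is the first word of the `Name` field and that `Owner_Type` holds labels such as "First" or "Fourth & Above"; the mapping to counts, the helper names and the sample row are hypothetical and may differ from the notebook.

```python
# Assumed mapping from ownership label to a numeric owner count
OWNER_TYPE_TO_COUNT = {
    "First": 1,
    "Second": 2,
    "Third": 3,
    "Fourth & Above": 4,
}


def extract_brand(name):
    """Take the first whitespace-separated token of the model name as the brand."""
    return name.split()[0]


def add_engineered_features(row):
    """Return a copy of the row with Brand and Num_Previous_Owners added."""
    enriched = dict(row)
    enriched["Brand"] = extract_brand(row["Name"])
    enriched["Num_Previous_Owners"] = OWNER_TYPE_TO_COUNT[row["Owner_Type"]]
    return enriched


# Hypothetical listing, for illustration only
listing = {"Name": "Maruti Wagon R LXI CNG", "Owner_Type": "First"}
enriched_listing = add_engineered_features(listing)
print(enriched_listing["Brand"], enriched_listing["Num_Previous_Owners"])
```

Taking only the first token would split multi-word brands (for example, "Land Rover" would become "Land"), so a real implementation would likely need a small lookup for such cases.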
Modeling
- Holdout dataset created for model testing
- Setting up a framework for easier testing of multiple models.
- Models trained: Linear Regression, K-Nearest Neighbors, Decision Tree, Random Forest, AdaBoost, Multi-Layer Perceptron, XGBoost and CatBoost.
- Models were ensembled using Simple and Weighted averaging.
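The two ensembling strategies mentioned above can be illustrated with a short sketch. The function names, predictions and weights here are hypothetical; the source does not state how the weights were chosen, so the values below are placeholders.

```python
def simple_average(predictions):
    """Average per-sample predictions across models with equal weight.

    predictions: list of per-model prediction lists, all of the same length.
    """
    n_models = len(predictions)
    return [sum(column) / n_models for column in zip(*predictions)]


def weighted_average(predictions, weights):
    """Average per-sample predictions across models using one weight per model."""
    total_weight = sum(weights)
    return [
        sum(w * p for w, p in zip(weights, column)) / total_weight
        for column in zip(*predictions)
    ]


# Hypothetical predictions from two models for two cars
model_a = [4.0, 10.0]
model_b = [6.0, 14.0]

simple = simple_average([model_a, model_b])
weighted = weighted_average([model_a, model_b], weights=[3, 1])
print(simple, weighted)
```

A weighted average lets stronger models contribute more to the final prediction, which is one plausible reason it can outperform a simple average when the individual models differ noticeably in accuracy.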
Result analysis
- Predictions made on holdout test set
- Models compared based on chosen metrics: RMSE and R2-Score.
- Visualized predicted prices vs actual prices to analyze errors.
PROJECT TRADE-OFFS AND SOLUTIONS
Training time & Model complexity vs Reducing error
- Solution: Limiting depth and number of estimators for tree-based models. Overfitting detection and early stopping mechanism for neural network training.
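To illustrate the early stopping idea, here is a minimal patience-based sketch that operates on a sequence of validation losses. The function name, the `patience` and `min_delta` parameters and the loss values are assumptions for illustration; the notebook's actual mechanism may be configured differently.

```python
def find_stopping_point(validation_losses, patience=3, min_delta=0.0):
    """Stop once the validation loss has not improved for `patience` epochs.

    Returns (best_epoch, best_loss, last_epoch_run).
    """
    best_loss = float("inf")
    best_epoch = -1
    last_epoch = -1
    stale_epochs = 0
    for epoch, loss in enumerate(validation_losses):
        last_epoch = epoch
        if loss < best_loss - min_delta:
            # Improvement: remember this epoch and reset the counter
            best_loss, best_epoch = loss, epoch
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break  # no improvement for `patience` consecutive epochs
    return best_epoch, best_loss, last_epoch


# Hypothetical validation losses per epoch
losses = [3.2, 2.8, 2.6, 2.7, 2.65, 2.9, 2.5]
result = find_stopping_point(losses, patience=3)
print(result)
```

The sketch also shows the trade-off involved: training halts before the final epoch, which saves time but can miss a later improvement, so `patience` controls how much extra training is tolerated.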
SCREENSHOTS
Project workflow
graph LR
A[Start] --> B{Error?};
B -->|Yes| C[Hmm...];
C --> D[Debug];
D --> B;
B ---->|No| E[Yay!];
Data Exploration
Feature Selection
MODELS USED AND THEIR PERFORMANCE
Model | RMSE | R2-Score |
---|---|---|
Linear Regression | 3.5803 | 0.7915 |
K-Nearest Neighbors | 2.8261 | 0.8701 |
Decision Tree | 2.6790 | 0.8833 |
Random Forest | 2.4619 | 0.9014 |
AdaBoost | 2.3629 | 0.9092 |
Multi-layer Perceptron | 2.6255 | 0.8879 |
XGBoost w/o preprocessing | 2.1649 | 0.9238 |
XGBoost with preprocessing | 2.0987 | 0.9284 |
CatBoost w/o preprocessing | 2.1734 | 0.9232 |
Simple average ensemble | 2.2804 | 0.9154 |
Weighted average ensemble | 2.1296 | 0.9262 |
CONCLUSION
WHAT YOU HAVE LEARNED
Insights gained from the data
- Features related to car configuration such as Power, Engine and Transmission are some of the most informative features. Usage-related features such as Year and current Mileage are also important.
- Seating capacity and Number of previous owners had relatively less predictive power. However, none of the features were candidates for removal.
Improvements in understanding machine learning concepts
- Implemented target-encoding for high-cardinality categorical features.
- Designed pipelines to avoid data leakage.
- Ensembled models using prediction averaging.
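The link between target encoding and data leakage can be sketched as follows. This is a simplified, hypothetical encoder (smoothed category means) and not scikit-learn's implementation; the class name, the `smoothing` parameter and the sample data are assumptions. The key point it illustrates is that encodings are learned from training rows only and then applied unchanged to holdout rows.

```python
from collections import defaultdict


class SmoothedTargetEncoder:
    """Replace each category with a smoothed mean of the training target."""

    def __init__(self, smoothing=10.0):
        self.smoothing = smoothing
        self.global_mean_ = None
        self.encodings_ = {}

    def fit(self, categories, targets):
        sums = defaultdict(float)
        counts = defaultdict(int)
        for category, target in zip(categories, targets):
            sums[category] += target
            counts[category] += 1
        self.global_mean_ = sum(targets) / len(targets)
        # Shrink each category mean toward the global mean
        self.encodings_ = {
            c: (sums[c] + self.smoothing * self.global_mean_)
            / (counts[c] + self.smoothing)
            for c in counts
        }
        return self

    def transform(self, categories):
        # Unseen categories fall back to the global training mean
        return [self.encodings_.get(c, self.global_mean_) for c in categories]


# Hypothetical training data; the holdout rows are never seen by fit()
train_brands = ["Maruti", "Maruti", "Audi", "Audi"]
train_prices = [3.0, 5.0, 20.0, 24.0]
holdout_brands = ["Audi", "Tesla", "Maruti"]

encoder = SmoothedTargetEncoder(smoothing=2.0).fit(train_brands, train_prices)
encoded_holdout = encoder.transform(holdout_brands)
print(encoded_holdout)
```

Placing such an encoder inside the model pipeline means it is refit on each training split, so target information from validation or holdout rows never enters the encoded features.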
Challenges faced and how they were overcome
- Handling mixed feature types in preprocessing pipelines.
- Regularization and overfitting detection to reduce training time while maintaining performance.
USE CASES OF THIS MODEL
- Sellers can use the model to estimate an approximate price for their cars when they list them on the marketplace.
- Buyers can use the model to check if the listed price is fair when they decide to buy a used vehicle.
FEATURES PLANNED BUT NOT IMPLEMENTED
- Complex model-ensembling through stacking or hill-climbing was not implemented due to significantly longer training time.