🚗 Used Car Price Prediction in Virginia

Predicting the price of used cars with AI and data analysis

“Don’t guess the price — let the data tell you.”

📎 Full Analysis:
👉 View Jupyter Notebook on GitHub

Web App


📌 One-Line Summary

This project predicts the prices of used cars in Virginia using a dataset of over 46,000 listings.
By analyzing details like year, mileage, brand, and fuel type, the AI model can estimate a realistic market price.


1️⃣ How It Was Built

1. Data Collection

  • Collected real car sales data from the web
  • Stored it in a MySQL database hosted on AWS
  • Accessed it with Python and exposed it through a Flask API

2. Data Preparation

  • Filled missing values (e.g., unknown mileage)
  • Removed unrealistic values (e.g., mileage over 1 million km)
  • Converted text data (brand, fuel type) into numbers
  • Applied log transformation to balance skewed data
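
The four preparation steps above might look roughly like the sketch below. It is a minimal illustration with pandas, not the project's actual code: the column names, the tiny sample table, the median fill, the one-hot encoding, and the choice of which columns get log-transformed are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample; column names are assumptions, not the project's schema.
raw = pd.DataFrame({
    "year": [2016, 2012, 2019, 2008, 2015],
    "mileage": [60_000, np.nan, 25_000, 2_500_000, 90_000],
    "brand": ["Honda", "Ford", "Toyota", "Ford", "Honda"],
    "fuel_type": ["Gasoline", "Diesel", "Hybrid", "Gasoline", "Gasoline"],
    "price": [18_500, 9_000, 27_000, 3_000, 14_000],
})


def prepare(df: pd.DataFrame, max_mileage: float = 1_000_000) -> pd.DataFrame:
    """Apply the four preparation steps to a copy of df."""
    out = df.copy()
    # 1. Fill missing mileage with the median (one common choice, assumed here).
    out["mileage"] = out["mileage"].fillna(out["mileage"].median())
    # 2. Drop rows whose mileage is unrealistically high.
    out = out[out["mileage"] <= max_mileage].copy()
    # 3. Convert text columns into numeric indicator columns (one-hot encoding).
    out = pd.get_dummies(out, columns=["brand", "fuel_type"], dtype=int)
    # 4. Log-transform skewed columns; log1p also handles zero values.
    out["log_mileage"] = np.log1p(out["mileage"])
    out["log_price"] = np.log1p(out["price"])
    return out


clean = prepare(raw)
print(clean.head())
```

If a model is trained on `log_price`, its predictions need to be mapped back to dollars with `np.expm1` before they are shown to a user.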

3. Data Analysis (EDA)

  • Visualized the relationship between price and year/mileage
  • Found year and mileage to be the most influential features

4. AI Model Training

Tested several machine learning models:

  • Linear Regression
  • Decision Tree
  • Random Forest
  • Support Vector Regression (SVR)
  • XGBoost (winner)
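
A comparison like this can be sketched as a loop over candidate models scored on a held-out split. The example below is only an illustration: it uses synthetic data rather than the Virginia listings, default hyperparameters, and scikit-learn's `GradientBoostingRegressor` as a stand-in for XGBoost so that it runs without the `xgboost` package. Its scores will not match the project's results.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: price falls with age and mileage, plus noise.
rng = np.random.default_rng(0)
n = 400
age = rng.uniform(0, 20, n)
mileage = rng.uniform(0, 200_000, n)
price = 35_000 * np.exp(-0.08 * age) - 0.05 * mileage + rng.normal(0, 1_000, n)
X = np.column_stack([age, mileage])
y = price

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "SVR": SVR(),
    # Stand-in for XGBoost; xgboost.XGBRegressor could be swapped in if installed.
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}

# Fit every model on the same split and record both metrics.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
    results[name] = {"r2": float(r2_score(y_test, pred)), "rmse": rmse}

# Print models from lowest to highest RMSE.
for name, scores in sorted(results.items(), key=lambda kv: kv[1]["rmse"]):
    print(f"{name:<18} R2={scores['r2']:.2f}  RMSE={scores['rmse']:,.0f}")
```

Scoring every model on the same held-out split keeps the comparison fair. SVR is sensitive to feature scale, so it usually needs scaled inputs to be competitive.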

📊 Best Model: XGBoost

  • R² (share of price variance explained): 0.89
  • RMSE (root mean squared error, a typical prediction error): $5,474
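
For readers unfamiliar with the two metrics, here is a small plain-Python sketch of how they are defined. The function names and the three sample prices are made up for illustration; the project itself most likely relied on library implementations.

```python
import math


def r_squared(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Square root of the mean squared prediction error."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))


# Made-up prices in dollars.
actual = [10_000, 20_000, 30_000]
predicted = [12_000, 19_000, 29_000]

print(f"R2   = {r_squared(actual, predicted):.2f}")   # R2   = 0.97
print(f"RMSE = {rmse(actual, predicted):,.0f}")       # RMSE = 1,414
```

RMSE is expressed in the same unit as the target (dollars here) and penalizes large misses more heavily than a plain average of absolute errors would.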

5. Results

Model               R² Score   RMSE ($)
Linear Regression   0.58       14,085
Decision Tree       0.73        9,022
Random Forest       0.84        7,021
SVR                 0.11       14,328
XGBoost             0.89        5,474

6. Real-World Test

  • 2016 Honda Odyssey → Predicted price: $18,738
    Matched closely with actual market data.

2️⃣ Real-World Use

  • Final model saved as a Pickle file
  • Deployed via Flask API for real-time predictions
  • Created a simplified version (Year, Mileage, Brand, Model) for web app integration
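
The save-and-serve pattern described above could look something like the following sketch. It is not the project's API: the `/predict` route, the JSON field names, and the stand-in linear model trained on four made-up rows are all assumptions, and only year and mileage are used to keep the example short.

```python
import pickle

import numpy as np
from flask import Flask, jsonify, request
from sklearn.linear_model import LinearRegression

# Stand-in model trained on made-up (year, mileage) -> price rows.
X = np.array([[2010, 120_000], [2014, 80_000], [2016, 60_000], [2019, 25_000]])
y = np.array([7_000, 12_500, 18_000, 26_000])

# The project saved its model to a Pickle file; bytes in memory are used here
# so that the sketch does not touch the file system.
model_bytes = pickle.dumps(LinearRegression().fit(X, y))

# The API loads the pickled model once at startup.
model = pickle.loads(model_bytes)

app = Flask(__name__)


@app.route("/predict", methods=["POST"])  # hypothetical route name
def predict():
    payload = request.get_json(silent=True) or {}
    try:
        features = [[float(payload["year"]), float(payload["mileage"])]]
    except (KeyError, TypeError, ValueError):
        return jsonify(error="year and mileage are required numbers"), 400
    price = float(model.predict(features)[0])
    return jsonify(predicted_price=round(price, 2))


if __name__ == "__main__":
    app.run()  # development server only
```

A client would POST JSON such as `{"year": 2016, "mileage": 60000}` and receive a `predicted_price` field in the response. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.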

🛠 Technologies Used

Step           Technology
Data Storage   AWS MySQL
Model Dev      Python, scikit-learn, XGBoost
Deployment     Flask API, Pickle
Environment    AWS EC2 (Ubuntu)

💡 Key Learnings

  • Log transformation improves accuracy for skewed data
  • Tree-based models handle mixed data types effectively
  • Even with only 4 features (year, mileage, brand, model), accurate real-time predictions are possible

🔗 GitHub Repository

📂 View Project on GitHub