House Prices is another Kaggle project. The goal is to build a model that predicts the sale prices of residential homes in Ames, Iowa. The main challenge in this project is handling a dataset with a large number of features.
Colab links
About the data:
For this project, I didn't split the training dataset because Kaggle already provides a separate test dataset. However, that test dataset has no sale prices, so I need to submit my predictions to Kaggle to get the performance of the model.
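As a rough illustration of the submission step, here is a minimal sketch of building a submission table. The `Id` and `SalePrice` column names follow the competition's sample submission; the prediction values are placeholders I made up for the example, not model output.

```python
import pandas as pd

# Ids as they appear in the Kaggle test file (first three rows shown).
test_ids = pd.Series([1461, 1462, 1463], name="Id")

# Placeholder predictions, for illustration only (not real model output).
predictions = [150000.0, 160000.0, 170000.0]

# The submission needs exactly two columns: Id and SalePrice.
submission = pd.DataFrame({"Id": test_ids, "SalePrice": predictions})

# With no path given, to_csv returns the CSV text instead of writing a file.
csv_text = submission.to_csv(index=False)
print(csv_text)
```

In practice the same call with a file name (for example `submission.to_csv("submission.csv", index=False)`, where the file name is my own choice) produces the file to upload.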
EDA:
Even though there are 79 variables, the dataset is still small enough to explore feature by feature. I explored the data by feature category:
- Numerical features
  - Area features
  - Non-area features
- Categorical features
  - Nominal features
  - Ordinal features
- Date features
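The grouping above can be sketched in pandas roughly as follows. This is only an illustration on a tiny synthetic frame: the column names come from the Kaggle dataset, but the grouping rules (matching on `Area`, `Year`, and `Yr` in the name) and the hand-written `ordinal_cols` list are my assumptions here, not necessarily what the notebooks do.

```python
import pandas as pd

# Tiny synthetic stand-in for the training data (values are illustrative).
df = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786],
    "PoolArea": [0, 0, 512],
    "OverallQual": [7, 6, 7],
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr"],
    "ExterQual": ["Gd", "TA", "Gd"],
    "YearBuilt": [2003, 1976, 2001],
})

# Date features: assumed to be recognizable by "Year" or "Yr" in the name.
date_cols = [c for c in df.columns if "Year" in c or "Yr" in c]

# Numerical features: numeric columns that are not dates.
numeric_cols = [
    c for c in df.select_dtypes(include="number").columns if c not in date_cols
]
area_cols = [c for c in numeric_cols if "Area" in c]
non_area_cols = [c for c in numeric_cols if c not in area_cols]

# Categorical features: everything that is not numeric.
categorical_cols = list(df.select_dtypes(exclude="number").columns)

# Ordinal columns have a natural order (e.g. quality ratings), so they
# have to be listed by hand; this list is a hypothetical example.
ordinal_cols = [c for c in categorical_cols if c in {"ExterQual"}]
nominal_cols = [c for c in categorical_cols if c not in ordinal_cols]

print(area_cols, non_area_cols, nominal_cols, ordinal_cols, date_cols)
```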
What I learned through this exploration is:
- Some features are correlated with each other (e.g., ground living area is correlated with first floor area, second floor area, and lot area).
- An area of zero means the house doesn't have that feature. For example, a pool area of zero means the house does not have a pool.
- The dataset contains imbalanced categorical columns, where almost all houses share a single value; these can be safely discarded.
- Some scatter plots may not show the full picture, because many other variables also affect the result.
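The first three findings can be checked with a few lines of pandas. The sketch below uses a small synthetic sample with real column names but illustrative values. The `HasPool` flag and the `DOMINANCE_THRESHOLD` cutoff are hypothetical choices of mine for the example; on the full dataset a stricter cutoff would probably make more sense.

```python
import pandas as pd

# Small synthetic sample; values are illustrative only.
df = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786, 1717, 2198, 1362],
    "1stFlrSF": [856, 1262, 920, 961, 1145, 796],
    "PoolArea": [0, 0, 0, 0, 512, 0],
    "Street": ["Pave", "Pave", "Pave", "Pave", "Pave", "Grvl"],
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr",
                     "Crawfor", "NoRidge", "Mitchel"],
})

# 1. Correlated features: pairwise Pearson correlation of area columns.
area_corr = df[["GrLivArea", "1stFlrSF"]].corr()

# 2. Zero area means the feature is absent, so it can become a 0/1 flag.
df["HasPool"] = (df["PoolArea"] > 0).astype(int)

# 3. Imbalanced categorical columns: the most common value covers
#    at least DOMINANCE_THRESHOLD of the rows (assumed cutoff).
DOMINANCE_THRESHOLD = 0.8
categorical_cols = ["Street", "Neighborhood"]
top_share = {
    c: df[c].value_counts(normalize=True).iloc[0] for c in categorical_cols
}
imbalanced_cols = [
    c for c, share in top_share.items() if share >= DOMINANCE_THRESHOLD
]

# Drop the imbalanced columns before modeling.
reduced = df.drop(columns=imbalanced_cols)

print(area_corr)
print(imbalanced_cols)
```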
About the models:
- Linear models perform slightly better than nonlinear models.
- After testing many models and tuning their hyperparameters, I found that Ridge was the best model for this task.
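For reference, tuning Ridge could look roughly like the sketch below. It is not the exact setup from the notebooks: the synthetic data, the `alpha` grid, the scaling step, and the use of `GridSearchCV` with 5-fold cross-validation are all assumptions made for illustration. Cross-validation is one way to estimate performance locally when there is no labeled hold-out set.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the prepared feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_coef = rng.normal(size=10)
y = X @ true_coef + rng.normal(scale=0.5, size=200)

# Ridge is sensitive to feature scale, so standardize first.
pipeline = make_pipeline(StandardScaler(), Ridge())

# Hypothetical grid over the regularization strength.
alphas = [0.1, 1.0, 10.0, 100.0]
param_grid = {"ridge__alpha": alphas}

search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)

best_alpha = search.best_params_["ridge__alpha"]
cv_rmse = -search.best_score_  # the scorer is negated, so flip the sign

# The fitted search object predicts with the best pipeline it found.
predictions = search.predict(X[:5])

print(best_alpha, cv_rmse)
```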
Note: You can find a more detailed explanation inside the notebooks.