House Prices

House Prices is another Kaggle project. The goal is to build a model that predicts the sale prices of residential homes in Ames, Iowa. The main challenge in this project is handling a dataset with a large number of features.

About the data:

For this project, I didn't split the training dataset because Kaggle already provides a separate test dataset. The test dataset does not include the sale prices, however, so the only way to measure the model's performance on it is to submit the predictions to Kaggle.

EDA:

Even though there are 79 variables, the dataset is still small enough to explore in full. I explored the data by feature category:

  • Numerical features
    • Area features
    • Non-area features
  • Categorical features
    • Nominal features
    • Ordinal features
  • Date features

What I learned from this exploration:

  • Some features are correlated with each other (for example, ground living area is correlated with first floor area, second floor area, and lot area).
  • An area of zero means the house doesn't have that feature. For example, a pool area of zero means the house does not have a pool.
  • The dataset contains heavily imbalanced categorical columns, where almost every house shares the same value; these can be safely discarded.
  • Individual scatter plots may not show the whole picture, because many other variables also affect the sale price.
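
The first three observations can be checked with a few lines of code. The following is only a minimal sketch using the Python standard library, not the code from the notebooks: the rows are made-up toy values, the column names (`GrLivArea`, `FirstFlrSF`, `PoolArea`, `Street`, `HouseStyle`, `HasPool`) are illustrative, and the 0.95 threshold for calling a column imbalanced is an assumption.

```python
from collections import Counter
from math import sqrt

# Toy rows with illustrative column names; these are NOT real Ames records.
rows = [
    {"GrLivArea": 1000, "FirstFlrSF": 600, "PoolArea": 0, "Street": "Pave", "HouseStyle": "1Story"},
    {"GrLivArea": 1500, "FirstFlrSF": 800, "PoolArea": 0, "Street": "Pave", "HouseStyle": "2Story"},
    {"GrLivArea": 2000, "FirstFlrSF": 1100, "PoolArea": 512, "Street": "Pave", "HouseStyle": "1Story"},
    {"GrLivArea": 2500, "FirstFlrSF": 1200, "PoolArea": 0, "Street": "Pave", "HouseStyle": "2Story"},
    {"GrLivArea": 3000, "FirstFlrSF": 1600, "PoolArea": 0, "Street": "Pave", "HouseStyle": "1Story"},
]


def column(name):
    """Return all values of one column as a list."""
    return [row[name] for row in rows]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def dominant_share(values):
    """Fraction of rows taken up by the most frequent category."""
    most_common_count = Counter(values).most_common(1)[0][1]
    return most_common_count / len(values)


# 1. Correlated area features: a value close to 1 means strong linear correlation.
area_correlation = pearson(column("GrLivArea"), column("FirstFlrSF"))

# 2. A zero area means the feature is absent, so it can be turned into a flag.
for row in rows:
    row["HasPool"] = row["PoolArea"] > 0

# 3. Imbalanced categorical columns: one category covers almost every row.
IMBALANCE_THRESHOLD = 0.95  # assumed cut-off, not taken from the notebooks
imbalanced_columns = [
    name
    for name in ("Street", "HouseStyle")
    if dominant_share(column(name)) >= IMBALANCE_THRESHOLD
]
```

On these toy rows, `Street` is flagged as imbalanced because every house has the same value, while `HouseStyle` is kept. The right threshold depends on the dataset and is worth checking column by column.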

About the models:

  • Linear models performed slightly better than nonlinear models.
  • After testing many models and tuning their hyperparameters, I found that Ridge was the best model for this task.
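
To illustrate what Ridge does, here is a minimal sketch of its L2 penalty in the simplest possible setting: one feature and no intercept. It is not the model from the notebooks, which would normally come from a library; the function name `ridge_slope`, the toy data, and the `alpha` values are all assumptions made for illustration. Minimizing `sum((y - w*x)^2) + alpha * w^2` gives the closed-form slope `w = sum(x*y) / (sum(x*x) + alpha)`.

```python
def ridge_slope(x, y, alpha):
    """Closed-form ridge slope for a single feature without an intercept.

    Minimizes sum((y_i - w * x_i)^2) + alpha * w^2 over w.
    With alpha = 0 this is the ordinary least-squares slope.
    """
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi * xi for xi in x)
    return sum_xy / (sum_xx + alpha)


# Toy data lying roughly on the line y = 2x (made-up values).
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-4.1, -1.9, 0.2, 2.1, 3.9]

ols_slope = ridge_slope(x, y, alpha=0.0)      # no penalty
shrunk_slope = ridge_slope(x, y, alpha=10.0)  # penalty pulls the slope toward zero
```

A larger `alpha` shrinks the coefficient further toward zero. With many correlated features, this shrinkage keeps individual coefficients from growing large, and `alpha` is the kind of hyperparameter that gets tuned.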

Note: You can find a more detailed explanation inside the notebooks.

Linear Regression
Regularization
Dimensionality Reduction