Simple Kaggle project to explore and build a binary classification model. The goal is to predict whether or not a patient has diabetes based on diagnostic measurements.
Colab links
Summary of what I did to solve the problem:
As the Kaggle page only provides a single dataset, I manually split it into a training set and a test set. Remember that the test set must not be looked at until the final step of the project, to avoid biasing any modelling decisions.
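A minimal sketch of that split, assuming the Kaggle CSV is saved as `diabetes.csv` with the standard `Outcome` target column:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the single dataset provided by Kaggle (file name assumed).
df = pd.read_csv("diabetes.csv")

# Hold out a test set up front; stratify so both splits keep the same
# Outcome ratio. The test set stays untouched until the very end.
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["Outcome"], random_state=42
)
```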
I performed the usual EDA and discovered a few things:
- Some columns contain a significant number of NaN values (see the check after this list)
- Glucose, BMI and Diabetes Pedigree Function are among the most important features, based on both the data analysis and domain knowledge. Nevertheless, I used all the features to train the model to make sure I didn't miss any important information.
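A quick way to surface those gaps, assuming the training frame from the split above. In this dataset, missing measurements are often encoded as zeros rather than NaN, so it is worth checking both:

```python
# Count explicit NaNs per column.
print(train_df.isna().sum())

# Zeros in these columns are physiologically impossible and
# usually encode missing values in this dataset.
impossible_zero_cols = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
print((train_df[impossible_zero_cols] == 0).sum())
```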
To get a better understanding of the data, I did some research about diabetes. The key findings from this research are:
- The Glucose column is the result of a 2-hour oral glucose tolerance test (OGTT), where a value below 140 mg/dL is considered normal
- Being overweight is a risk factor for diabetes: a higher BMI means a higher probability of developing the disease
- The Diabetes Pedigree Function (DPF) estimates a person's risk of developing diabetes based on their family history and other risk factors. Roughly, a DPF below 0.1 suggests a low risk, 0.1 to 0.3 an intermediate risk, and above 0.3 a high risk.
- Normal diastolic blood pressure is below 80 mm Hg
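These thresholds can be turned into simple binary flags during EDA to see how each risk factor relates to the outcome. A hypothetical sketch using the cut-offs above (the flag names are mine, not from the notebook):

```python
import pandas as pd

# Illustrative risk flags based on the domain cut-offs above.
flags = pd.DataFrame({
    "high_glucose": train_df["Glucose"] >= 140,              # abnormal 2-hour OGTT
    "overweight": train_df["BMI"] >= 25,                     # WHO overweight cut-off
    "high_dpf": train_df["DiabetesPedigreeFunction"] > 0.3,  # high-risk DPF
    "high_dbp": train_df["BloodPressure"] >= 80,             # at/above normal diastolic
})

# Diabetes rate within each flagged group vs. the overall base rate.
print(f"base rate: {train_df['Outcome'].mean():.2f}")
for col in flags:
    print(f"{col}: {train_df.loc[flags[col], 'Outcome'].mean():.2f}")
```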
To find the best model, I first tried basic models (KNeighborsClassifier, SVC, RandomForestClassifier, LogisticRegression), then moved on to ensemble models (VotingClassifier, BaggingClassifier, AdaBoostClassifier, GradientBoostingClassifier, etc.).
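A sketch of that comparison loop, assuming `X_train`/`y_train` come from the training split and missing values were already imputed during EDA (scaling matters for KNN, SVC and LogisticRegression):

```python
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

X_train = train_df.drop(columns="Outcome")
y_train = train_df["Outcome"]

models = {
    "knn": KNeighborsClassifier(),
    "svc": SVC(),
    "rf": RandomForestClassifier(random_state=42),
    "logreg": LogisticRegression(max_iter=1000),
    "gboost": GradientBoostingClassifier(random_state=42),
}

# 5-fold cross-validation on the training split only.
for name, model in models.items():
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipe, X_train, y_train, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```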
Finally, I tried oversampling to compensate for the class imbalance in the data.
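One common way to do this is SMOTE from the imbalanced-learn package (an assumption; the notebook may use a different method). Putting it inside an imblearn Pipeline confines the resampling to the training folds during cross-validation, so the validation folds are never oversampled:

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

pipe = ImbPipeline([
    ("scale", StandardScaler()),
    ("smote", SMOTE(random_state=42)),  # resamples only the training folds
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X_train, y_train, cv=5)
print(f"with SMOTE: {scores.mean():.3f}")
```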
The best model I found was a BaggingClassifier with LogisticRegression as its base estimator.
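A minimal version of that final model (the hyperparameters here are assumptions; the actual notebook may tune them):

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bagging over logistic regression: each estimator is fit on a
# bootstrap sample of the training data, and predictions are averaged.
# (`estimator` in scikit-learn >= 1.2; older versions call it `base_estimator`.)
final_model = make_pipeline(
    StandardScaler(),
    BaggingClassifier(
        estimator=LogisticRegression(max_iter=1000),
        n_estimators=50,
        random_state=42,
    ),
)
final_model.fit(X_train, y_train)

# Only now, at the very last step, touch the held-out test set.
X_test, y_test = test_df.drop(columns="Outcome"), test_df["Outcome"]
print(f"test accuracy: {final_model.score(X_test, y_test):.3f}")
```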
What I learned:
- Oversampling can cause overfitting
- Ensemble models improve performance slightly
- Domain knowledge about diabetes helps with interpreting the features