Cardiovascular Disease Prediction
AIM
To predict the risk of cardiovascular disease based on lifestyle factors.
DATASET LINK
https://www.kaggle.com/datasets/alphiree/cardiovascular-diseases-risk-prediction-dataset
MY NOTEBOOK LINK
https://www.kaggle.com/code/sid4ds/cardiovascular-disease-risk-prediction
LIBRARIES NEEDED
- pandas
- numpy
- scikit-learn (>=1.5.0 for TunedThresholdClassifierCV)
- matplotlib
- seaborn
- joblib
DESCRIPTION
What is the requirement of the project?
- This project aims to predict the risk of cardiovascular disease (CVD) based on data people provide about their lifestyle factors. Predicting the risk in advance can reduce the number of cases that reach a terminal stage.
Why is it necessary?
- CVD is one of the leading causes of death globally. Machine learning models that predict CVD risk can be an important tool for helping the people affected by it.
How is it beneficial and used?
- Doctors can use it as a second opinion to support their diagnosis. It can also act as a fallback in rare cases where the diagnosis is not obvious.
- People (patients in particular) can track their risk of CVD based on their own lifestyle and schedule an appointment with a doctor in advance to mitigate the risk.
How did you start approaching this project? (Initial thoughts and planning)
- Going through previous research and articles related to the problem.
- Data exploration to understand the features. Using data visualization to check their distributions.
- Identifying key metrics for the problem based on ratio of target classes.
- Feature engineering and selection based on EDA.
- Setting up a framework for easier testing of multiple models.
- Analysing results of models using confusion matrix.
EXPLANATION
DETAILS OF THE DIFFERENT FEATURES
Feature Name | Description | Type | Values/Range |
---|---|---|---|
General_Health | "Would you say that in general your health is—" | Categorical | [Poor, Fair, Good, Very Good, Excellent] |
Checkup | "About how long has it been since you last visited a doctor for a routine checkup?" | Categorical | [Never, 5 or more years ago, Within last 5 years, Within last 2 years, Within the last year] |
Exercise | "Did you participate in any physical activities like running, walking, or gardening?" | Categorical | [Yes, No] |
Skin_Cancer | Respondents that reported having skin cancer | Categorical | [Yes, No] |
Other_Cancer | Respondents that reported having any other types of cancer | Categorical | [Yes, No] |
Depression | Respondents that reported having a depressive disorder | Categorical | [Yes, No] |
Diabetes | Respondents that reported having diabetes, including the specific type | Categorical | [Yes; No; No, pre-diabetes or borderline diabetes; Yes, but female told only during pregnancy] |
Arthritis | Respondents that reported having arthritis | Categorical | [Yes, No] |
Sex | Respondent's gender | Categorical | [Male, Female] |
Age_Category | Respondent's age range | Categorical | ['18-24', '25-34', '35-44', '45-54', '55-64', '65-74', '75-80', '80+'] |
Height_(cm) | Respondent's height in cm | Numerical | Measured in cm |
Weight_(kg) | Respondent's weight in kg | Numerical | Measured in kg |
BMI | Respondent's Body Mass Index in kg/m² | Numerical | Measured in kg/m² |
Smoking_History | Respondent's smoking history | Categorical | [Yes, No] |
Alcohol_Consumption | Number of days of alcohol consumption in a month | Numerical | Integer values |
Fruit_Consumption | Number of servings of fruit consumed in a month | Numerical | Integer values |
Green_Vegetables_Consumption | Number of servings of green vegetables consumed in a month | Numerical | Integer values |
FriedPotato_Consumption | Number of servings of fried potato consumed in a month | Numerical | Integer values |
WHAT I HAVE DONE
Exploratory Data Analysis
- Summary statistics
- Data visualization for numerical feature distributions
- Target splits for categorical features (see the EDA sketch below)
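A minimal sketch of these EDA steps, assuming the CSV file name from the Kaggle dataset (`CVD_cleaned.csv`) and its target column `Heart_Disease`:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("CVD_cleaned.csv")

# Summary statistics for all features
print(df.describe(include="all"))

# Distributions of the numerical features
num_cols = ["Height_(cm)", "Weight_(kg)", "BMI", "Alcohol_Consumption",
            "Fruit_Consumption", "Green_Vegetables_Consumption",
            "FriedPotato_Consumption"]
df[num_cols].hist(bins=30, figsize=(12, 8))
plt.tight_layout()
plt.show()

# Target split for one categorical feature
print(pd.crosstab(df["General_Health"], df["Heart_Disease"], normalize="index"))
```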
Data cleaning and Preprocessing
- Regrouping rare categories
- Categorical feature encoding
- Outlier clipping for numerical features (see the preprocessing sketch below)
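A sketch of these preprocessing steps, continuing from the EDA snippet above. The exact regroupings, the category strings (taken from the feature table above) and the 1st/99th-percentile clipping bounds are illustrative assumptions rather than the notebook's exact choices:

```python
# Regroup the rare Diabetes categories into a binary Yes/No
df["Diabetes"] = df["Diabetes"].replace({
    "No, pre-diabetes or borderline diabetes": "No",
    "Yes, but female told only during pregnancy": "Yes",
})

# Ordinal encoding for ordered categories
health_order = ["Poor", "Fair", "Good", "Very Good", "Excellent"]
df["General_Health"] = df["General_Health"].map(
    {cat: rank for rank, cat in enumerate(health_order)})
checkup_order = ["Never", "5 or more years ago", "Within last 5 years",
                 "Within last 2 years", "Within the last year"]
df["Checkup"] = df["Checkup"].map(
    {cat: rank for rank, cat in enumerate(checkup_order)})
df["Age_Category"] = df["Age_Category"].astype("category").cat.codes

# Binary encoding for Yes/No features and Sex
for col in ["Exercise", "Skin_Cancer", "Other_Cancer", "Depression",
            "Arthritis", "Smoking_History", "Diabetes"]:
    df[col] = df[col].map({"No": 0, "Yes": 1})
df["Sex"] = df["Sex"].map({"Female": 0, "Male": 1})

# Clip numerical outliers to the 1st and 99th percentiles
for col in ["Height_(cm)", "Weight_(kg)", "BMI"]:
    low, high = df[col].quantile([0.01, 0.99])
    df[col] = df[col].clip(low, high)
```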
Feature engineering and selection
- Combining original features based on domain knowledge
- Discretizing numerical features (see the sketch below)
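An illustrative sketch of both ideas; `Comorbidity_Count` and `BMI_Category` are hypothetical feature names, not necessarily the exact features engineered in the notebook:

```python
# Combine binary co-morbidity flags into a single count (domain knowledge)
df["Comorbidity_Count"] = df[["Diabetes", "Arthritis", "Depression",
                              "Skin_Cancer", "Other_Cancer"]].sum(axis=1)

# Discretize BMI into the standard WHO weight categories
df["BMI_Category"] = pd.cut(
    df["BMI"],
    bins=[0, 18.5, 25, 30, float("inf")],
    labels=[0, 1, 2, 3],  # underweight, normal, overweight, obese
).astype(int)
```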
Modeling
- Holdout dataset created for model testing
- Models trained: Logistic Regression, Decision Tree, Random Forest, AdaBoost, HistGradient Boosting, Multi-Layer Perceptron
- Class imbalance handled through:
- Class weights, when supported by model architecture
- Threshold tuning using TunedThresholdClassifierCV
- Metric for model tuning: F2-score (weighted harmonic mean of precision and recall, with recall weighted twice as heavily as precision); see the sketch below
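A minimal sketch of this setup, shown with logistic regression and continuing from the preprocessing sketches above; `TunedThresholdClassifierCV` requires scikit-learn >= 1.5, and the target column `Heart_Disease` is an assumption from the dataset:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import TunedThresholdClassifierCV, train_test_split

# Holdout split for final testing, stratified to preserve the class ratio
X = df.drop(columns=["Heart_Disease"])
y = df["Heart_Disease"].map({"No": 0, "Yes": 1})
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Class imbalance handled via class weights where the model supports them
base_model = LogisticRegression(class_weight="balanced", max_iter=1000)

# Tune the decision threshold with cross-validation to maximize the F2-score
f2_scorer = make_scorer(fbeta_score, beta=2)
model = TunedThresholdClassifierCV(base_model, scoring=f2_scorer, cv=5)
model.fit(X_train, y_train)
print(f"Tuned decision threshold: {model.best_threshold_:.3f}")
```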
Result analysis
- Confusion matrix using predictions made on the holdout test set (see the sketch below)
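Continuing from the modeling sketch above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Predictions on the holdout set automatically use the tuned threshold
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)
plt.show()
```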
PROJECT TRADE-OFFS AND SOLUTIONS
Accuracy vs Recall: The data is heavily imbalanced, with only ~8% of samples in the positive class, which makes accuracy unsuitable as a metric for this problem. It is critical to correctly identify positive samples, so recall must be prioritized. However, this lowers overall accuracy, since some negative samples get predicted as positive.
- Solution: The prediction threshold of each model is tuned using the F2-score to balance precision and recall, with more importance given to recall. This keeps overall accuracy at an acceptable level while boosting recall.
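For reference, the general F-beta score with β = 2:

$$F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}, \qquad F_2 = \frac{5 \cdot \text{precision} \cdot \text{recall}}{4 \cdot \text{precision} + \text{recall}}$$

With β = 2, recall is treated as twice as important as precision when choosing the threshold.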
SCREENSHOTS
Project workflow
Numerical feature distributions
Correlations
MODELS USED AND THEIR ACCURACIES
Model + Feature set | Accuracy (%) | Recall (%) |
---|---|---|
Logistic Regression + Original | 76.29 | 74.21 |
Logistic Regression + Extended | 76.27 | 74.41 |
Logistic Regression + Selected | 72.66 | 78.09 |
Decision Tree + Original | 72.76 | 78.61 |
Decision Tree + Extended | 74.09 | 76.69 |
Decision Tree + Selected | 75.52 | 73.61 |
Random Forest + Original | 73.97 | 77.33 |
Random Forest + Extended | 74.10 | 76.61 |
Random Forest + Selected | 74.80 | 74.05 |
AdaBoost + Original | 76.03 | 74.49 |
AdaBoost + Extended | 74.99 | 76.25 |
AdaBoost + Selected | 74.76 | 75.33 |
Multi-Layer Perceptron + Original | 76.91 | 72.81 |
Multi-Layer Perceptron + Extended | 73.26 | 79.01 |
Multi-Layer Perceptron + Selected | 74.86 | 75.05 |
Hist-Gradient Boosting + Original | 75.98 | 73.49 |
Hist-Gradient Boosting + Extended | 75.63 | 74.73 |
Hist-Gradient Boosting + Selected | 74.40 | 75.85 |
MODEL COMPARISON GRAPHS
Logistic Regression
Decision Tree
Random Forest
AdaBoost
Multi-Layer Perceptron
Hist-Gradient Boosting
CONCLUSION
WHAT YOU HAVE LEARNED
Insights gained from the data
- General Health, Age and Co-morbidities (such as Diabetes & Arthritis) are the most indicative features for CVD risk.
Improvements in understanding machine learning concepts
- Learned and implemented probability prediction with threshold tuning, which gives more accurate results than predicting directly with a model's default threshold.
Challenges faced and how they were overcome
- Choosing the right evaluation metric was difficult due to the imbalanced nature of the dataset. Since the positive class is more important, recall was used as the final metric for ranking models.
- The F2-score was used to tune the prediction threshold of each model, maintaining a balance between precision and recall and thereby keeping overall accuracy acceptable.
USE CASES OF THIS MODEL
- Doctors can use it as a second opinion when assessing a new patient. Model trained on cases from previous patients can be used to predict the risk.
- People (patients in particular) can use this tool to track the risk of CVD based on their own lifestyle factors and take preventive measures when the risk is high.
FEATURES PLANNED BUT NOT IMPLEMENTED
- Alternative gradient-boosting implementations such as XGBoost, CatBoost and LightGBM were not tried, since none of the tree-ensemble models already tested (Random Forest, AdaBoost, Hist-Gradient Boosting) were among the best performers. This avoided adding extra dependencies for little expected gain.