
Cardiovascular Disease Prediction

AIM

To predict the risk of cardiovascular disease based on lifestyle factors.

https://www.kaggle.com/datasets/alphiree/cardiovascular-diseases-risk-prediction-dataset

https://www.kaggle.com/code/sid4ds/cardiovascular-disease-risk-prediction

LIBRARIES NEEDED

LIBRARIES USED
  • pandas
  • numpy
  • scikit-learn (>=1.5.0 for TunedThresholdClassifierCV)
  • matplotlib
  • seaborn
  • joblib

DESCRIPTION

What is the requirement of the project?

  • This project aims to predict the risk of cardiovascular diseases (CVD) from data that people provide about their lifestyle factors. Predicting the risk in advance can reduce the number of cases that reach a terminal stage.
Why is it necessary?
  • CVD is one of the leading causes of death globally. Using machine learning models to predict risk of CVD can be an important tool in helping the people affected by it.
How is it beneficial and used?
  • Doctors can use it as a second opinion to support their diagnosis. It also acts as a fallback mechanism in rare cases where the diagnosis is not obvious.
  • People (patients in particular) can track their risk of CVD based on their own lifestyle and schedule an appointment with a doctor in advance to mitigate the risk.
How did you start approaching this project? (Initial thoughts and planning)
  • Going through previous research and articles related to the problem.
  • Data exploration to understand the features. Using data visualization to check their distributions.
  • Identifying key metrics for the problem based on ratio of target classes.
  • Feature engineering and selection based on EDA.
  • Setting up a framework for easier testing of multiple models.
  • Analysing results of models using confusion matrix.

EXPLANATION

DETAILS OF THE DIFFERENT FEATURES

Feature Name | Description | Type | Values/Range
General_Health | "Would you say that in general your health is:" | Categorical | [Poor, Fair, Good, Very Good, Excellent]
Checkup | "About how long has it been since you last visited a doctor for a routine checkup?" | Categorical | [Never, 5 or more years ago, Within last 5 years, Within last 2 years, Within the last year]
Exercise | "Did you participate in any physical activities like running, walking, or gardening?" | Categorical | [Yes, No]
Skin_Cancer | Respondents that reported having skin cancer | Categorical | [Yes, No]
Other_Cancer | Respondents that reported having any other types of cancer | Categorical | [Yes, No]
Depression | Respondents that reported having a depressive disorder | Categorical | [Yes, No]
Diabetes | Respondents that reported having diabetes. If yes, specify the type. | Categorical | [Yes, No, "No, pre-diabetes or borderline diabetes", "Yes, but female told only during pregnancy"]
Arthritis | Respondents that reported having arthritis | Categorical | [Yes, No]
Sex | Respondent's gender | Categorical | [Female, Male]
Age_Category | Respondent's age range | Categorical | ['18-24', '25-34', '35-44', '45-54', '55-64', '65-74', '75-80', '80+']
Height_(cm) | Respondent's height in cm | Numerical | Measured in cm
Weight_(kg) | Respondent's weight in kg | Numerical | Measured in kg
BMI | Respondent's Body Mass Index in kg/m² | Numerical | Measured in kg/m²
Smoking_History | Respondent's smoking history | Categorical | [Yes, No]
Alcohol_Consumption | Number of days of alcohol consumption in a month | Numerical | Integer values
Fruit_Consumption | Number of servings of fruit consumed in a month | Numerical | Integer values
Green_Vegetables_Consumption | Number of servings of green vegetables consumed in a month | Numerical | Integer values
FriedPotato_Consumption | Number of servings of fried potato consumed in a month | Numerical | Integer values

WHAT I HAVE DONE

Exploratory Data Analysis

  • Summary statistics
  • Data visualization for numerical feature distributions
  • Target splits for categorical features

Data cleaning and Preprocessing

  • Regrouping rare categories
  • Categorical feature encoding
  • Outlier clipping for numerical features
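
The three preprocessing steps above could look roughly like the sketch below. It is only an illustration on a tiny made-up frame: the column names come from the feature table, but the specific category mapping and the 1st-99th percentile clipping range are assumptions, not necessarily the choices made in the project notebook.

```python
import pandas as pd

# Tiny synthetic frame; the values are invented purely for illustration.
df = pd.DataFrame({
    "General_Health": ["Poor", "Good", "Excellent", "Fair", "Very Good", "Good"],
    "Diabetes": [
        "No",
        "Yes",
        "No, pre-diabetes or borderline diabetes",
        "Yes, but female told only during pregnancy",
        "No",
        "No",
    ],
    "Exercise": ["Yes", "No", "Yes", "Yes", "No", "Yes"],
    "Weight_(kg)": [70.0, 82.5, 55.0, 310.0, 64.0, 91.0],
})

# 1. Regroup rare categories: fold the two rare Diabetes answers into "No"
#    (this particular mapping is an assumption).
diabetes_map = {
    "No, pre-diabetes or borderline diabetes": "No",
    "Yes, but female told only during pregnancy": "No",
}
df["Diabetes"] = df["Diabetes"].replace(diabetes_map)

# 2. Encode categorical features: ordinal codes for ordered answers,
#    0/1 for binary Yes/No answers.
health_order = ["Poor", "Fair", "Good", "Very Good", "Excellent"]
df["General_Health"] = pd.Categorical(
    df["General_Health"], categories=health_order, ordered=True
).codes
for col in ["Diabetes", "Exercise"]:
    df[col] = (df[col] == "Yes").astype(int)

# 3. Clip numerical outliers to the 1st-99th percentile range
#    (the percentile bounds are an assumption).
low, high = df["Weight_(kg)"].quantile([0.01, 0.99])
df["Weight_(kg)"] = df["Weight_(kg)"].clip(lower=low, upper=high)

print(df)
```

Ordinal codes keep the natural ordering of answers such as General_Health, which linear models can exploit. One-hot encoding would be the usual alternative for unordered categories.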

Feature engineering and selection

  • Combining original features based on domain knowledge
  • Discretizing numerical features
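
As a rough illustration of both ideas, the sketch below builds one combined feature and one discretized feature. `Comorbidity_Count` and `BMI_Category` are hypothetical names, and the BMI bands follow the commonly used WHO cut-offs; the actual engineered features in the project may differ.

```python
import pandas as pd

# Small made-up sample with already-encoded binary condition flags.
df = pd.DataFrame({
    "BMI": [17.9, 22.4, 27.1, 33.8],
    "Diabetes": [0, 1, 0, 1],
    "Arthritis": [0, 0, 1, 1],
    "Depression": [1, 0, 0, 1],
})

# Combine original features: number of reported conditions per respondent
# (hypothetical feature, chosen for illustration).
condition_cols = ["Diabetes", "Arthritis", "Depression"]
df["Comorbidity_Count"] = df[condition_cols].sum(axis=1)

# Discretize a numerical feature: BMI bands with left-closed intervals,
# i.e. [0, 18.5), [18.5, 25), [25, 30), [30, inf).
bmi_bins = [0, 18.5, 25, 30, float("inf")]
bmi_labels = ["Underweight", "Normal", "Overweight", "Obese"]
df["BMI_Category"] = pd.cut(
    df["BMI"], bins=bmi_bins, labels=bmi_labels, right=False
)

print(df)
```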

Modeling

  • Holdout dataset created for model testing
  • Models trained: Logistic Regression, Decision Tree, Random Forest, AdaBoost, Hist-Gradient Boosting, Multi-Layer Perceptron
  • Class imbalance handled through:
    • Class weights, when supported by model architecture
    • Threshold tuning using TunedThresholdClassifierCV
  • Metric for model tuning: F2-score (weighted harmonic mean of precision and recall, with recall weighted twice as heavily as precision)

Result analysis

  • Confusion matrix using predictions made on holdout test set
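
For reference, accuracy and recall can be read off a confusion matrix as in the sketch below. The labels are a hand-made toy example, not predictions from any of the trained models.

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 8 negatives and 2 positives (invented for illustration).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

# For binary labels, ravel() yields the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true, y_pred).ravel())

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```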

PROJECT TRADE-OFFS AND SOLUTIONS

Accuracy vs Recall: The data is extremely imbalanced, with only ~8% of samples in the positive class, which makes accuracy unsuitable as a metric for this problem. Missing a positive case is costly, so the focus must be on recall. However, raising recall lowers the overall accuracy, since more negative samples are predicted as positive.

  • Solution: Prediction threshold for models is tuned using F2-score to create a balance between precision and recall, with more importance given to recall. This maintains overall accuracy at an acceptable level while boosting recall.
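
The trade-off can be made concrete with a small worked example. The counts below are invented for a hypothetical population of 1000 people with 8% positives; they are not results from this project. The F-beta formula used is the standard count-based form, (1 + β²)·TP / ((1 + β²)·TP + β²·FN + FP).

```python
def fbeta_from_counts(tp, fp, fn, beta=2.0):
    """F-beta score computed directly from confusion-matrix counts."""
    b2 = beta ** 2
    denominator = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denominator if denominator else 0.0


def summarize(name, tp, fp, fn, tn):
    """Print and return accuracy, recall and F2 for one set of counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f2 = fbeta_from_counts(tp, fp, fn, beta=2.0)
    print(f"{name}: accuracy={accuracy:.3f} recall={recall:.3f} F2={f2:.3f}")
    return accuracy, recall, f2


# Baseline that predicts "no CVD" for everyone: high accuracy, zero recall.
baseline = summarize("always negative", tp=0, fp=0, fn=80, tn=920)

# Hypothetical recall-oriented model: lower accuracy, far more cases caught.
tuned = summarize("recall-oriented", tp=60, fp=230, fn=20, tn=690)
```

The always-negative baseline reaches 92% accuracy while catching no positive case at all, which is why accuracy alone is misleading here and F2 is the more informative tuning target.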

SCREENSHOTS

Project workflow

Numerical feature distributions
Correlations

MODELS USED AND THEIR ACCURACIES

Model + Feature set | Accuracy (%) | Recall (%)
Logistic Regression + Original | 76.29 | 74.21
Logistic Regression + Extended | 76.27 | 74.41
Logistic Regression + Selected | 72.66 | 78.09
Decision Tree + Original | 72.76 | 78.61
Decision Tree + Extended | 74.09 | 76.69
Decision Tree + Selected | 75.52 | 73.61
Random Forest + Original | 73.97 | 77.33
Random Forest + Extended | 74.10 | 76.61
Random Forest + Selected | 74.80 | 74.05
AdaBoost + Original | 76.03 | 74.49
AdaBoost + Extended | 74.99 | 76.25
AdaBoost + Selected | 74.76 | 75.33
Multi-Layer Perceptron + Original | 76.91 | 72.81
Multi-Layer Perceptron + Extended | 73.26 | 79.01
Multi-Layer Perceptron + Selected | 74.86 | 75.05
Hist-Gradient Boosting + Original | 75.98 | 73.49
Hist-Gradient Boosting + Extended | 75.63 | 74.73
Hist-Gradient Boosting + Selected | 74.40 | 75.85

MODELS COMPARISON GRAPHS

Logistic Regression

Decision Tree
Random Forest
AdaBoost
Multi-Layer Perceptron
Hist-Gradient Boosting

CONCLUSION

WHAT YOU HAVE LEARNED

Insights gained from the data

  • General Health, Age and comorbidities (such as Diabetes and Arthritis) are the most indicative features for CVD risk.
Improvements in understanding machine learning concepts
  • Learned how to predict class probabilities and tune the prediction threshold, which gave better results than predicting directly with each model's default threshold.
Challenges faced and how they were overcome
  • Choosing the right evaluation metric was difficult because of the imbalanced nature of the dataset. Since the positive class is more important, Recall was used as the final metric for ranking models.
  • F2-score was used to tune the threshold for each model, balancing precision and recall so that overall accuracy stays at an acceptable level.

USE CASES OF THIS MODEL

  • Doctors can use it as a second opinion when assessing a new patient. Model trained on cases from previous patients can be used to predict the risk.
  • People (patients in particular) can use this tool to track the risk of CVD based on their own lifestyle factors and take preventive measures when the risk is high.

FEATURES PLANNED BUT NOT IMPLEMENTED

  • Other gradient-boosting implementations such as XGBoost, CatBoost and LightGBM were not tried, since none of the tree ensemble models (Random Forest, AdaBoost, Hist-Gradient Boosting) were among the best performers. Leaving them out also avoids additional dependencies.