Cardiovascular Disease Prediction
AIM
To predict the risk of cardiovascular disease based on lifestyle factors.
DATASET LINK
https://www.kaggle.com/datasets/alphiree/cardiovascular-diseases-risk-prediction-dataset
MY NOTEBOOK LINK
https://www.kaggle.com/code/sid4ds/cardiovascular-disease-risk-prediction
LIBRARIES NEEDED
- pandas
- numpy
- scikit-learn (>=1.5.0 for TunedThresholdClassifierCV)
- matplotlib
- seaborn
- joblib
DESCRIPTION
What is the requirement of the project?
- This project aims to predict the risk of cardiovascular disease (CVD) based on data people provide about their lifestyle factors. Predicting the risk in advance can reduce the number of cases that reach a terminal stage.
Why is it necessary?
- CVD is one of the leading causes of death globally. Machine learning models that predict CVD risk can be an important tool for helping the people affected by it.
How is it beneficial and used?
- Doctors can use it as a second opinion to support their diagnosis. It can also act as a fallback in rare cases where the diagnosis is not obvious.
- People (patients in particular) can track their risk of CVD based on their own lifestyle and schedule an appointment with a doctor in advance to mitigate the risk.
How did you start approaching this project? (Initial thoughts and planning)
- Going through previous research and articles related to the problem.
- Data exploration to understand the features. Using data visualization to check their distributions.
- Identifying key metrics for the problem based on ratio of target classes.
- Feature engineering and selection based on EDA.
- Setting up a framework for easier testing of multiple models.
- Analysing results of models using confusion matrix.
EXPLANATION
DETAILS OF THE DIFFERENT FEATURES
Feature Name | Description | Type | Values/Range |
---|---|---|---|
General_Health | "Would you say that in general your health is—" | Categorical | [Poor, Fair, Good, Very Good, Excellent] |
Checkup | "About how long has it been since you last visited a doctor for a routine checkup?" | Categorical | [Never, 5 or more years ago, Within last 5 years, Within last 2 years, Within the last year] |
Exercise | "Did you participate in any physical activities like running, walking, or gardening?" | Categorical | [Yes, No] |
Skin_Cancer | Respondents that reported having skin cancer | Categorical | [Yes, No] |
Other_Cancer | Respondents that reported having any other types of cancer | Categorical | [Yes, No] |
Depression | Respondents that reported having a depressive disorder | Categorical | [Yes, No] |
Diabetes | Respondents that reported having diabetes, including the specific type | Categorical | [Yes; No; No, pre-diabetes or borderline diabetes; Yes, but female told only during pregnancy] |
Arthritis | Respondents that reported having arthritis | Categorical | [Yes, No] |
Sex | Respondent's gender | Categorical | [Male, Female] |
Age_Category | Respondent's age range | Categorical | ['18-24', '25-34', '35-44', '45-54', '55-64', '65-74', '75-80', '80+'] |
Height_(cm) | Respondent's height in cm | Numerical | Measured in cm |
Weight_(kg) | Respondent's weight in kg | Numerical | Measured in kg |
BMI | Respondent's Body Mass Index in kg/m² | Numerical | Measured in kg/m² |
Smoking_History | Respondent's smoking history | Categorical | [Yes, No] |
Alcohol_Consumption | Number of days of alcohol consumption in a month | Numerical | Integer values |
Fruit_Consumption | Number of servings of fruit consumed in a month | Numerical | Integer values |
Green_Vegetables_Consumption | Number of servings of green vegetables consumed in a month | Numerical | Integer values |
FriedPotato_Consumption | Number of servings of fried potato consumed in a month | Numerical | Integer values |
WHAT I HAVE DONE
Exploratory Data Analysis
- Summary statistics
- Data visualization for numerical feature distributions
- Target splits for categorical features (see the EDA sketch below)
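A minimal sketch of these EDA steps, assuming the CSV file name from the Kaggle dataset (`CVD_cleaned.csv`) and its target column `Heart_Disease`:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("CVD_cleaned.csv")

# Summary statistics for all features
print(df.describe(include="all"))

# Distributions of the numerical features
num_cols = ["Height_(cm)", "Weight_(kg)", "BMI", "Alcohol_Consumption",
            "Fruit_Consumption", "Green_Vegetables_Consumption",
            "FriedPotato_Consumption"]
df[num_cols].hist(bins=30, figsize=(12, 8))
plt.tight_layout()
plt.show()

# Target split for one categorical feature
print(pd.crosstab(df["General_Health"], df["Heart_Disease"], normalize="index"))
```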
Data cleaning and Preprocessing
- Regrouping rare categories
- Categorical feature encoding
- Outlier clipping for numerical features (see the preprocessing sketch below)
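A sketch of these preprocessing steps, continuing from the EDA snippet above. The exact regroupings, the category strings (taken from the feature table above) and the 1st/99th-percentile clipping bounds are illustrative assumptions rather than the notebook's exact choices:

```python
# Regroup the rare Diabetes categories into a binary Yes/No
df["Diabetes"] = df["Diabetes"].replace({
    "No, pre-diabetes or borderline diabetes": "No",
    "Yes, but female told only during pregnancy": "Yes",
})

# Ordinal encoding for ordered categories
health_order = ["Poor", "Fair", "Good", "Very Good", "Excellent"]
df["General_Health"] = df["General_Health"].map(
    {cat: rank for rank, cat in enumerate(health_order)})
checkup_order = ["Never", "5 or more years ago", "Within last 5 years",
                 "Within last 2 years", "Within the last year"]
df["Checkup"] = df["Checkup"].map(
    {cat: rank for rank, cat in enumerate(checkup_order)})
df["Age_Category"] = df["Age_Category"].astype("category").cat.codes

# Binary encoding for Yes/No features and Sex
for col in ["Exercise", "Skin_Cancer", "Other_Cancer", "Depression",
            "Arthritis", "Smoking_History", "Diabetes"]:
    df[col] = df[col].map({"No": 0, "Yes": 1})
df["Sex"] = df["Sex"].map({"Female": 0, "Male": 1})

# Clip numerical outliers to the 1st and 99th percentiles
for col in ["Height_(cm)", "Weight_(kg)", "BMI"]:
    low, high = df[col].quantile([0.01, 0.99])
    df[col] = df[col].clip(low, high)
```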
Feature engineering and selection
- Combining original features based on domain knowledge
- Discretizing numerical features (see the sketch below)
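An illustrative sketch of both ideas; `Comorbidity_Count` and `BMI_Category` are hypothetical feature names, not necessarily the exact features engineered in the notebook:

```python
# Combine binary co-morbidity flags into a single count (domain knowledge)
df["Comorbidity_Count"] = df[["Diabetes", "Arthritis", "Depression",
                              "Skin_Cancer", "Other_Cancer"]].sum(axis=1)

# Discretize BMI into the standard WHO weight categories
df["BMI_Category"] = pd.cut(
    df["BMI"],
    bins=[0, 18.5, 25, 30, float("inf")],
    labels=[0, 1, 2, 3],  # underweight, normal, overweight, obese
).astype(int)
```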
Modeling
- Holdout dataset created for model testing
- Models trained: Logistic Regression, Decision Tree, Random Forest, AdaBoost, HistGradient Boosting, Multi-Layer Perceptron
- Class imbalance handled through:
- Class weights, when supported by model architecture
- Threshold tuning using TunedThresholdClassifierCV
- Metric for model tuning: F2-score (weighted harmonic mean of precision and recall, with recall weighted twice as heavily as precision); see the sketch below
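A minimal sketch of this setup, shown with logistic regression and continuing from the preprocessing sketches above; `TunedThresholdClassifierCV` requires scikit-learn >= 1.5, and the target column `Heart_Disease` is an assumption from the dataset:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import TunedThresholdClassifierCV, train_test_split

# Holdout split for final testing, stratified to preserve the class ratio
X = df.drop(columns=["Heart_Disease"])
y = df["Heart_Disease"].map({"No": 0, "Yes": 1})
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Class imbalance handled via class weights where the model supports them
base_model = LogisticRegression(class_weight="balanced", max_iter=1000)

# Tune the decision threshold with cross-validation to maximize the F2-score
f2_scorer = make_scorer(fbeta_score, beta=2)
model = TunedThresholdClassifierCV(base_model, scoring=f2_scorer, cv=5)
model.fit(X_train, y_train)
print(f"Tuned decision threshold: {model.best_threshold_:.3f}")
```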
Result analysis
- Confusion matrix using predictions made on the holdout test set (see the sketch below)
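Continuing from the modeling sketch above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Predictions on the holdout set automatically use the tuned threshold
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)
plt.show()
```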
PROJECT TRADE-OFFS AND SOLUTIONS
Accuracy vs Recall: The data is heavily imbalanced, with only ~8% of samples in the positive class, which makes accuracy unsuitable as a metric for this problem. It is critical to correctly identify positive samples, so recall must be prioritized. However, this lowers overall accuracy, since some negative samples get predicted as positive.
- Solution: The prediction threshold of each model is tuned using the F2-score to balance precision and recall, with more importance given to recall. This keeps overall accuracy at an acceptable level while boosting recall.
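For reference, the general F-beta score with β = 2:

$$F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}, \qquad F_2 = \frac{5 \cdot \text{precision} \cdot \text{recall}}{4 \cdot \text{precision} + \text{recall}}$$

With β = 2, recall is treated as twice as important as precision when choosing the threshold.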
SCREENSHOTS
Project workflow
Numerical feature distributions
Correlations
MODELS USED AND THEIR ACCURACIES
Model + Feature set | Accuracy (%) | Recall (%) |
---|---|---|
Logistic Regression + Original | 76.29 | 74.21 |
Logistic Regression + Extended | 76.27 | 74.41 |
Logistic Regression + Selected | 72.66 | 78.09 |
Decision Tree + Original | 72.76 | 78.61 |
Decision Tree + Extended | 74.09 | 76.69 |
Decision Tree + Selected | 75.52 | 73.61 |
Random Forest + Original | 73.97 | 77.33 |
Random Forest + Extended | 74.10 | 76.61 |
Random Forest + Selected | 74.80 | 74.05 |
AdaBoost + Original | 76.03 | 74.49 |
AdaBoost + Extended | 74.99 | 76.25 |
AdaBoost + Selected | 74.76 | 75.33 |
Multi-Layer Perceptron + Original | 76.91 | 72.81 |
Multi-Layer Perceptron + Extended | 73.26 | 79.01 |
Multi-Layer Perceptron + Selected | 74.86 | 75.05 |
Hist-Gradient Boosting + Original | 75.98 | 73.49 |
Hist-Gradient Boosting + Extended | 75.63 | 74.73 |
Hist-Gradient Boosting + Selected | 74.40 | 75.85 |
MODEL COMPARISON GRAPHS
Logistic Regression
Decision Tree
Random Forest
AdaBoost
Multi-Layer Perceptron
Hist-Gradient Boosting
CONCLUSION
WHAT YOU HAVE LEARNED
Insights gained from the data
- General Health, Age and Co-morbidities (such as Diabetes & Arthritis) are the most indicative features for CVD risk.
Improvements in understanding machine learning concepts
- Learned and implemented probability prediction with threshold tuning, which gives more accurate results than predicting directly with a model's default threshold.
Challenges faced and how they were overcome
- Choosing the right evaluation metric was difficult due to the imbalanced nature of the dataset. Since the positive class is more important, recall was used as the final metric for ranking models.
- The F2-score was used to tune the prediction threshold of each model, maintaining a balance between precision and recall and thereby keeping overall accuracy acceptable.
USE CASES OF THIS MODEL
- Doctors can use it as a second opinion when assessing a new patient. Model trained on cases from previous patients can be used to predict the risk.
- People (patients in particular) can use this tool to track the risk of CVD based on their own lifestyle factors and take preventive measures when the risk is high.
FEATURES PLANNED BUT NOT IMPLEMENTED
- Alternative gradient-boosting implementations such as XGBoost, CatBoost and LightGBM were not tried, since none of the tree-ensemble models already tested (Random Forest, AdaBoost, Hist-Gradient Boosting) were among the best performers. This avoided adding extra dependencies for little expected gain.