Using Machine Learning to Detect and Predict the Likelihood of a Heart Attack

Introduction

Heart diseases, or cardiovascular diseases (CVDs), are the leading cause of death globally, claiming an estimated 17.9 million lives each year. The World Health Organization estimates that this accounts for 31% of deaths worldwide. In the United States alone, the Centers for Disease Control and Prevention (CDC) estimates that someone has a heart attack every 40 seconds, which translates to about 805,000 Americans each year.

Common symptoms include pain and tightness in the chest, shortness of breath, cold sweats, fatigue, and sudden dizziness, among others. Not everybody who has a heart attack experiences the same type or severity of symptoms: some people experience moderate to intense pain, while others have no symptoms at all. In general, the more symptoms a person experiences, the greater the likelihood of heart disease.

The severity and prevalence of this disease have created a strong need for AI-based predictive models that can help in disease management and risk control.

Our goal is to build ML models for heart disease detection and export the best-performing model to Android devices for everyday use. In addition, we will highlight the ML operations for this project: using Katib for hyperparameter tuning of the model and creating a pipeline with Kubeflow.

Data

The dataset used for this project was taken from Kaggle. It contains observations of 303 patients with 14 features, including demographic features such as age and gender and clinical measurements such as fasting blood sugar, cholesterol level, and resting blood pressure. The chance of a heart attack is recorded as a binary label, which we used as the target in our modeling. The data dictionary for the dataset is shown below.

Figure 1: Data Dictionary

Analysis

A review of the features revealed an imbalance in some categories, primarily gender and the output (target). The average age is 54.37 years, and outliers were observed in three features: Cholesterol, Resting Blood Pressure, and Maximum Heart Rate. Chest pain showed the highest correlation with the target.

Modeling

Data preprocessing tasks included removing outliers, resampling the target variable for a more balanced distribution, and scaling. After preprocessing, we tested six models on the clean data; the best performers were the Logistic Regression and CatBoost models, which tied at 95% accuracy.
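The three preprocessing steps can be sketched in plain Python as follows. The specific techniques shown here (the 1.5 x IQR rule, random oversampling of the minority class, and standard scaling) are assumptions for illustration, since the exact methods are not named above. The column names chol and output follow the Kaggle dataset, and the sample values are made up.

```python
import random
import statistics


def iqr_bounds(values, k=1.5):
    """Return (low, high) limits using the k * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_outliers(rows, column):
    """Keep only rows whose value in `column` lies inside the IQR limits."""
    low, high = iqr_bounds([row[column] for row in rows])
    return [row for row in rows if low <= row[column] <= high]


def oversample(rows, target, seed=0):
    """Randomly duplicate minority-class rows until every class has equal size."""
    rng = random.Random(seed)
    groups = {}
    for row in rows:
        groups.setdefault(row[target], []).append(row)
    largest = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=largest - len(group)))
    return balanced


def standardize(rows, column):
    """Rescale `column` to zero mean and unit variance."""
    values = [row[column] for row in rows]
    mean, std = statistics.fmean(values), statistics.pstdev(values)
    return [{**row, column: (row[column] - mean) / std if std else 0.0} for row in rows]


# Made-up sample: eight patients, one with an implausibly high cholesterol value.
patients = [
    {"chol": c, "output": o}
    for c, o in zip([200, 210, 220, 230, 240, 250, 260, 600], [1, 1, 1, 1, 1, 0, 0, 0])
]
cleaned = drop_outliers(patients, "chol")
balanced = oversample(cleaned, "output")
scaled = standardize(balanced, "chol")
```

In practice, resampling is best applied to the training split only, so that duplicated rows do not leak into the test set.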

The Logistic Regression model showed that the Maximum Heart Rate Achieved (thalachh) feature had the highest influence on the target. The visualization in Figure 6 below highlights the feature importance results.

Converting Machine Learning Model to TensorFlow Lite for Android Devices

At the time the project was conducted, the TensorFlow Lite Converter did not support Logistic Regression or CatBoost Classifier models. The Keras model was therefore the one converted into the TensorFlow Lite format for use on Android devices.

The conversion steps are as follows:

1. Import TensorFlow Lite

2. Convert the Keras model to TensorFlow Lite format

3. Save the model
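The three steps above can be sketched as follows, assuming TensorFlow 2.x and an already trained Keras model. The function name convert_keras_to_tflite and the output file name are illustrative and are not taken from the project.

```python
def convert_keras_to_tflite(model, output_path="heart_model.tflite"):
    """Convert a trained Keras model to TensorFlow Lite and save it to disk."""
    # Step 1: import TensorFlow, which ships the TensorFlow Lite converter.
    # The import sits inside the function so the helper can be defined
    # without loading TensorFlow up front.
    import tensorflow as tf

    # Step 2: convert the Keras model to the TensorFlow Lite flat-buffer format.
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    tflite_bytes = converter.convert()

    # Step 3: save the converted model so it can be bundled with the Android app.
    with open(output_path, "wb") as model_file:
        model_file.write(tflite_bytes)
    return output_path
```

Calling convert_keras_to_tflite(keras_model) with the trained model would write heart_model.tflite to the current directory.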

Model Explainability with Alibi Explain’s ALE (Accumulated Local Effects) Plots

In this section, we use Accumulated Local Effects (ALE) plots to explain the behavior of our best-performing model, the Logistic Regression model, on our dataset.

The ALE plots show that maximum heart rate achieved is the most influential feature in predicting a heart attack. According to heart.org, older age and high cholesterol levels are factors that increase the risk of a heart attack. Our dataset, however, does not reflect that pattern.
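To make the idea behind an ALE curve concrete, here is a from-scratch sketch of a first-order ALE computation for a single feature. It is not Alibi's API and not the code used in this project: the function ale_1d, the quantile binning, and the centering convention are assumptions chosen for illustration, and the "model" is a toy linear function rather than our Logistic Regression model.

```python
import statistics


def ale_1d(predict, rows, feature, n_bins=4):
    """Return (bin_edges, centered ALE values at those edges) for one feature."""
    values = sorted(row[feature] for row in rows)
    # Bin edges: the minimum, the interior quantiles, and the maximum (duplicates removed).
    interior = statistics.quantiles(values, n=n_bins, method="inclusive")
    edges = sorted({values[0], *interior, values[-1]})

    effects, counts = [], []
    for low, high in zip(edges, edges[1:]):
        # Rows whose value falls in (low, high]; the first bin also includes its lower edge.
        members = [
            row for row in rows
            if low < row[feature] <= high or (low == edges[0] and row[feature] == low)
        ]
        # Local effect: change in prediction when only this feature moves across the bin.
        diffs = [predict({**row, feature: high}) - predict({**row, feature: low}) for row in members]
        effects.append(statistics.fmean(diffs) if diffs else 0.0)
        counts.append(len(members))

    # Accumulate the averaged local effects from the lowest edge upwards.
    accumulated = [0.0]
    for effect in effects:
        accumulated.append(accumulated[-1] + effect)

    # Center the curve so that the count-weighted mean effect over the data is zero.
    midpoints = [(a + b) / 2 for a, b in zip(accumulated, accumulated[1:])]
    offset = sum(m * c for m, c in zip(midpoints, counts)) / max(sum(counts), 1)
    return edges, [value - offset for value in accumulated]


# Toy linear "model": the ALE curve for x0 should be a straight line with slope 2.
def toy_model(row):
    return 2 * row["x0"] + 3 * row["x1"]


data = [{"x0": float(i), "x1": float(i % 3)} for i in range(10)]
edges, ale = ale_1d(toy_model, data, "x0")
```

A steep ALE curve for a feature means the prediction changes quickly as that feature changes, which is how a plot can single out one feature as the most influential.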

Implementing Kubeflow ML Operators for Model Learning

An Operator is a method of packaging, deploying, and managing a stateful Kubernetes application, which in this context is a machine learning job.

Operators are software written to encapsulate the operational considerations of a specific Kubernetes application and to ensure that all aspects of its lifecycle, from configuration and deployment to upgrades, monitoring, and failure handling, are integrated into the Kubernetes framework and invoked when needed.

An ML Operator can be made for a range of actions from basic functionalities to specific logic for an ML Job.

TensorFlow Operator

This is one of the operators offered by Kubeflow to make it easy to run and monitor both distributed and non-distributed TensorFlow jobs on Kubernetes. Training TensorFlow models with tf-operator relies on centralized parameter servers for coordination between workers. It supports the TensorFlow framework only.

TensorFlow Training Jobs (TFJob)

TensorFlow Training Job (TFJob) is a Kubernetes custom resource with a YAML representation that you can use to run TensorFlow training tasks on Kubernetes. The Kubeflow implementation of TFJob is in tf-operator.

TensorFlow Operator for Heart Attack Dataset

Here, we go through the process of creating a TensorFlow Operator job with our dataset:

1. Check that the right TensorFlow image is available:

2. To package the trainer in a container image, we need a file (on our cluster) that contains the code, as well as a file with the resource definition of the job for the Kubernetes cluster:

3. Define a helper function that extracts the resource name from cell output captured with %%capture, which looks like "some-resource created":

4. Load and Inspect the Data:

5. Train the Model in the Notebook:

We trained the model in a distributed fashion and put all the code in a single cell. That way we could save the code to a file and include it in a container image. Running the cell saves the code to the file defined by TRAINER_FILE but does not execute it.

6. Create a Docker Image:

The Dockerfile looks as follows:

7. Check if the code is correct by running it from within the notebook:

8. Create a Distributed TFJob:

For large training jobs, we want to run our trainer in distributed mode. Once the notebook server cluster can access the Docker image from the registry, we can launch a distributed TFJob.

The specification for a distributed TFJob is defined using YAML:

9. Deploy the distributed training job:

10. See the job status:

11. See the created pods:

12. Stream logs from the worker-0 pod to check the training progress:

13. Delete the job:

14. Check to see if the pod is still up and running:
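As a rough illustration of steps 3 and 8, the sketch below builds a distributed TFJob specification as a Python dictionary (kubectl accepts JSON manifests as well as YAML) and defines a small helper that pulls the resource name out of kubectl output. The job name, the image name, the worker count, and the apiVersion kubeflow.org/v1 are assumptions for illustration; they are not the values used in this project and may differ on your cluster.

```python
import json
import re


def tfjob_manifest(name, image, workers=2):
    """Build a TFJob custom resource with `workers` identical worker replicas."""
    return {
        "apiVersion": "kubeflow.org/v1",  # assumed API version
        "kind": "TFJob",
        "metadata": {"name": name},
        "spec": {
            "tfReplicaSpecs": {
                "Worker": {
                    "replicas": workers,
                    "restartPolicy": "OnFailure",
                    "template": {
                        "spec": {
                            "containers": [
                                # The TensorFlow operator expects the training
                                # container to be named "tensorflow".
                                {"name": "tensorflow", "image": image}
                            ]
                        }
                    },
                }
            }
        },
    }


def get_resource(kubectl_stdout):
    """Return the resource name from output such as 'some-resource created'."""
    match = re.search(r"^(\S+)\s+created", kubectl_stdout, flags=re.MULTILINE)
    return match.group(1) if match else ""


# Hypothetical job and image names.
manifest = tfjob_manifest("heart-tfjob", "example-registry/heart-trainer:latest")
manifest_json = json.dumps(manifest, indent=2)
```

Saved to a file, the manifest could be deployed with kubectl apply -f. Steps 10 to 14 then correspond to kubectl get tfjob, kubectl get pods, kubectl logs -f on the worker-0 pod, kubectl delete tfjob, and a final kubectl get pods.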

Hyperparameter Tuning with Katib for TensorFlow Model

Hyperparameter tuning is the process of optimizing a model’s hyperparameter values to maximize the predictive quality of the model. Katib automates this process, which eliminates errors that arise from manual intervention and saves much-needed resources. Katib is agnostic to ML frameworks and supports a variety of traditional hyperparameter tuning algorithms. Its core concepts are Experiments, Suggestions, Trials, and Worker Jobs, all of which are implemented as custom resources on Kubernetes.

In a nutshell, an Experiment runs several Trials until an objective is reached. Each Trial evaluates a Suggestion, which is a set of hyperparameter values proposed by the tuning algorithm. The Worker Job runs a Trial and calculates its objective value.
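The relationship between these concepts can be illustrated with a small random search loop in plain Python. This is a conceptual sketch only and does not use Katib: the search space bounds, the goal value, and the toy objective (a made-up stand-in for a real training job) are all assumptions.

```python
import random

# Hypothetical search space over the three hyperparameters tuned in this project.
SEARCH_SPACE = {
    "learning_rate": (0.001, 0.1),
    "batch_size": (16, 64),
    "optimizer": ["sgd", "adam"],
}


def suggest(rng):
    """Propose one set of hyperparameter values (a Suggestion)."""
    low, high = SEARCH_SPACE["learning_rate"]
    return {
        "learning_rate": rng.uniform(low, high),
        "batch_size": rng.randint(*SEARCH_SPACE["batch_size"]),
        "optimizer": rng.choice(SEARCH_SPACE["optimizer"]),
    }


def toy_objective(params):
    """Stand-in for a training job: returns a made-up 'accuracy' in [0, 1]."""
    penalty = abs(params["learning_rate"] - 0.01) * 5
    bonus = 0.05 if params["optimizer"] == "adam" else 0.0
    return max(0.0, min(1.0, 0.9 + bonus - penalty))


def run_experiment(objective, goal, max_trials, seed=0):
    """Run Trials until the goal is reached or the trial budget is spent."""
    rng = random.Random(seed)
    best_params, best_value = None, float("-inf")
    trials_run = 0
    for _ in range(max_trials):
        params = suggest(rng)      # Suggestion: proposed hyperparameter values
        value = objective(params)  # Worker Job: evaluate the Trial
        trials_run += 1
        if value > best_value:
            best_params, best_value = params, value
        if best_value >= goal:     # the Experiment stops once the objective is met
            break
    return best_params, best_value, trials_run


best_params, best_value, trials_run = run_experiment(toy_objective, goal=0.93, max_trials=20)
```

Katib plays the role of run_experiment, except that each Trial is a real training job on the cluster and the suggestions come from the configured tuning algorithm.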

This section shows how to create and configure an Experiment for the TensorFlow training job. In Kubernetes terms, such an Experiment is a custom resource, defined by a Custom Resource Definition (CRD) and managed by the Katib operator.

How to Create Experiments:

1. Set up a few basic definitions that can be reused:

2. TensorFlow: Katib TFJob Experiment:

The TFJob definition for this example is based on the TensorFlow operator notebook shown earlier. For our experiment, we focused on the learning rate, batch size, and optimizer. The following YAML file describes an Experiment object:

3. Run and Monitor Experiments:

You can either execute these commands on your local machine with kubectl or on the notebook server:

The cell magic grabs the output of the kubectl command and stores it in an object named kubectl_output. From there we can use the utility function we defined earlier:

4. See experiment status:

5. Get the list of created experiments:

6. Get the list of created trials:

7. After the experiment is completed, use describe to get the best trial results:

8. Delete Katib job to free up resources:

9. Check to see if the pod is still up and running:
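For orientation, an Experiment over the learning rate, batch size, and optimizer might be laid out as in the sketch below, written as a Python dictionary that can be dumped to JSON for kubectl. The field layout follows the Katib v1beta1 API as we understand it, and older Katib releases use a different layout, so check it against your installed version. The experiment name, image, script path, command-line flags, value ranges, trial counts, and goal are all hypothetical.

```python
import json


def katib_experiment(name, image, max_trials=12, parallel_trials=3):
    """Build a Katib Experiment (v1beta1 layout) whose Trials run as TFJobs."""
    command = [
        "python", "/opt/trainer.py",  # hypothetical script path and flags
        "--learning_rate=${trialParameters.learningRate}",
        "--batch_size=${trialParameters.batchSize}",
        "--optimizer=${trialParameters.optimizer}",
    ]
    worker = {
        "replicas": 2,
        "restartPolicy": "OnFailure",
        "template": {"spec": {"containers": [
            {"name": "tensorflow", "image": image, "command": command}
        ]}},
    }
    return {
        "apiVersion": "kubeflow.org/v1beta1",
        "kind": "Experiment",
        "metadata": {"name": name},
        "spec": {
            "objective": {"type": "maximize", "goal": 0.95, "objectiveMetricName": "accuracy"},
            "algorithm": {"algorithmName": "random"},
            "parallelTrialCount": parallel_trials,
            "maxTrialCount": max_trials,
            "maxFailedTrialCount": 3,
            # Search space: Katib expects numeric bounds as strings.
            "parameters": [
                {"name": "learning_rate", "parameterType": "double",
                 "feasibleSpace": {"min": "0.001", "max": "0.1"}},
                {"name": "batch_size", "parameterType": "int",
                 "feasibleSpace": {"min": "16", "max": "64"}},
                {"name": "optimizer", "parameterType": "categorical",
                 "feasibleSpace": {"list": ["sgd", "adam"]}},
            ],
            "trialTemplate": {
                "primaryContainerName": "tensorflow",
                # Map search-space parameters to placeholders used in the command.
                "trialParameters": [
                    {"name": "learningRate", "reference": "learning_rate"},
                    {"name": "batchSize", "reference": "batch_size"},
                    {"name": "optimizer", "reference": "optimizer"},
                ],
                "trialSpec": {
                    "apiVersion": "kubeflow.org/v1",
                    "kind": "TFJob",
                    "spec": {"tfReplicaSpecs": {"Worker": worker}},
                },
            },
        },
    }


experiment = katib_experiment("heart-katib", "example-registry/heart-trainer:latest")
experiment_json = json.dumps(experiment, indent=2)
```

Each Trial substitutes one suggested set of values into the trialParameters placeholders before launching the TFJob.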

Model Deployment Using Kubeflow

Deployment is a crucial step in the ML process. For our models to be useful to real-life users, they must be hosted on a platform that can receive data from as many users as needed and return predictions. For this use case, we use the Kubeflow platform, which provides services and tools that ease the development, deployment, and management of portable, scalable machine learning projects.

Kubeflow Pipelines offers two ways to build components: Lightweight Components and Reusable Components. The former are easy to build and update, which makes them useful for testing and deployment, while the latter are stable containerized functions that can be shared across multiple projects. For our use case, we adopted the Reusable Components option. The process is broken down as follows.

1. Creating self-contained ML code:

When creating reusable components, the first step is to wrap our ML code in functions that can pass data between themselves, with every package a function needs imported inside that function. Each step of the ML process should be packaged in this way.

2. Create Docker Images:

Using our packaged ML functions, we create Docker images and push them to a repository, where the pipeline can pull them when needed. By splitting each step of our code into its own component, any step can be repeated, scaled, or transformed individually without affecting the other components in the pipeline. Creating Docker images requires the Docker command-line tool and an account with a registry such as DockerHub.

This Dockerfile uses Python 3.8 as the base image, installs the needed packages, and creates a working directory for our Python function.

With both the Python function (logistic.py) and the Dockerfile in the same directory, we can run a few Docker commands to build the image and push it to the repository.

With mavencodevv as the user id and lr_heart as our image tag, we have successfully built the image for our logistic regression component. Each step of our ML pipeline will be built this way before being compiled using Kubeflow’s pipeline functions.

3. Building and Compiling the Pipeline:

Using a Jupyter notebook environment, we use the Kubeflow Python package to build our pipeline from the images we created and then compile it for deployment.

First, we install the packages; then we create component functions built from our images.

The inputs and outputs are stated explicitly to facilitate the passage of data between components. Once all our pipeline components are packaged in this way, we can call them in a final pipeline function that contains all the components created.

Our pipeline loads the data, computes descriptive statistics, validates the data, and processes it for our six models, then evaluates the metrics and exports the best model to cloud storage. Each step started as a self-contained function from which a Docker image was created.

This pipeline function can then be compiled into a YAML file (zip and tar.gz formats are also accepted) and uploaded to the Kubeflow platform.
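A two-component version of such a pipeline might be sketched as follows. It assumes the Kubeflow Pipelines v1 SDK (the kfp package), where containerized steps are declared with dsl.ContainerOp; newer SDK versions have replaced that class. The load-data image, the --data flag, and the file paths are hypothetical, and the logistic regression image name is our reading of the user id and tag mentioned above.

```python
def build_pipeline():
    """Return a pipeline function that chains two containerized components."""
    # Imported here so this module can be loaded without the kfp package installed.
    from kfp import dsl

    def load_data_op():
        # Hypothetical image; the container is assumed to write /app/data.csv.
        return dsl.ContainerOp(
            name="load-data",
            image="mavencodevv/load_data",
            file_outputs={"data": "/app/data.csv"},
        )

    def logistic_regression_op(data):
        # The --data flag and the output path are assumptions.
        return dsl.ContainerOp(
            name="logistic-regression",
            image="mavencodevv/lr_heart",
            arguments=["--data", data],
            file_outputs={"accuracy": "/app/accuracy.txt"},
        )

    @dsl.pipeline(name="heart-attack-pipeline", description="Heart attack prediction sketch")
    def heart_pipeline():
        load = load_data_op()
        # Declared outputs of one step become explicit inputs of the next.
        logistic_regression_op(load.outputs["data"])

    return heart_pipeline


def compile_pipeline(package_path="heart_pipeline.yaml"):
    """Compile the pipeline; a .zip or .tar.gz path selects those formats instead."""
    from kfp import compiler

    compiler.Compiler().compile(build_pipeline(), package_path)
```

Calling compile_pipeline() writes heart_pipeline.yaml, which can then be uploaded through the Kubeflow Pipelines UI.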

Conclusion

AI aims to make computers and other devices more effective at solving difficult healthcare problems, including interpreting the data collected when diagnosing chronic conditions such as cardiovascular (heart) diseases. In that spirit, we applied machine learning tools and techniques to our heart disease use case to help predict the likelihood of a person having a heart attack. We went a step further and made our model available on Android devices for portability and scalability. Early assessment of the likelihood of a heart attack with this approach could be very helpful in minimizing complications of the disease.

References

https://www.who.int/health-topics/cardiovascular-diseases/#tab=tab_1

https://www.cdc.gov/heartdisease/facts.htm

https://www.kaggle.com/rashikrahmanpritom/heart-attack-analysis-prediction-dataset

http://rstudio-pubs-static.s3.amazonaws.com/24341_184a58191486470cab97acdbbfe78ed5.html

https://docs.seldon.io/projects/alibi/en/latest/methods/ALE.html

https://developer.android.com/ml?authuser=1

https://www.tensorflow.org/lite/guide?authuser=1

https://enterprisersproject.com/article/2019/2/kubernetes-operators-plain-english

https://docs.d2iq.com/dkp/kaptain/1.0.1-0.5.0/tutorials/metadata/

https://www.heart.org/en/health-topics/heart-attack/understand-your-risks-to-prevent-a-heart-attack

Website: www.mavencode.com
Twitter: @mavencode
Email: ai@mavencode.com

We build scalable data pipeline infrastructure & deploy machine learning and artificial intelligence models.