How to expose data biases from debugging your model with responsible AI (Part 6)

The traditional method of evaluating the trustworthiness of a model's performance is to look at calculated metrics such as accuracy, recall, precision, root mean squared error (RMSE), mean absolute error (MAE), or R², depending on the type of use case you have (e.g., classification or regression). Data scientists and developers can also measure confidence levels for areas the model correctly predicted or the frequency of making correct predictions. You can also try to isolate your test data in separate cohorts to observe and compare how the model performs with some groups vs. others. However, all of these techniques ignore a major blind spot: the underlying data.
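To see why an aggregate metric on its own can mislead, here is a minimal sketch in plain Python. The labels are made up for illustration (they are not taken from the tutorial's dataset), and the helper name `classification_metrics` is hypothetical.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, and recall for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up labels: 8 negatives and 2 positives (an imbalanced sample).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A model that almost always predicts the majority class.
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this toy sample, accuracy looks strong while recall on the minority class is only one in two, which is the kind of gap the rest of this tutorial traces back to the data.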

Data can be overrepresented in some cases and underrepresented in others. This may lead to data biases, causing the model to have fairness, inclusiveness, safety, and/or reliability issues. In this tutorial, we will use the Data Analysis component of the Azure Responsible AI (RAI) dashboard to discover the root cause of the model's poor performance. In this example, our problem is data distribution. This is a continuation of the Diabetes Hospital Readmission use case we use throughout this series. In the prior tutorial, we used the Model Overview component to compare dataset cohorts and feature cohorts and discover where the model was performing well or poorly.

Prerequisites

Data Analysis

The Azure Responsible AI dashboard includes a Data Analysis component that enables users to explore and understand the dataset distributions and statistics. It provides an interactive user interface (UI) for visualizing datasets based on the predicted and actual outcomes, error groups, and specific features. This helps ML professionals quickly debug and identify issues of data over- and under-representation and see how data is clustered in the dataset. As a result, they can understand the root cause of errors and any fairness issues introduced via data imbalances or lack of representation of a particular data group.

In the previous tutorial, the Model Overview component of the RAI dashboard showed us that the Prior_Inpatient feature played a role in the model performing poorly. In addition, using Error Analysis, we identified that Age was one of the top contributors to the overall model errors. We also found that the model incorrectly predicts that patients will not be readmitted when, in reality, they will be readmitted to a hospital within 30 days. This could be due to the model not having enough data to learn from for cases where patients return within 30 days.

Table view

The Table view pane under Data Analysis helps visualize all the features and rows of the data in a selected cohort. The advantage of using this view is that we get to see records of the raw data where the model made correct vs. incorrect predictions. In addition, for each row of data, the dashboard includes TrueY and PredictedY columns to help users identify common feature attributes among records where the model is incorrect.

Let's take a closer look at the actual data in our cohort that has the highest error rate. To filter the dashboard data to focus on this cohort, we'll click on the “switch cohort” link on top of the dashboard (Note: reference the Global controls

6-da-switch-cohort.png

In our Diabetes Hospital Readmission use case, the Table view confirms what the true vs. predicted outcomes are for our sample data. In addition, you can view details on the incorrect vs. correct predictions from the different data cohorts you've created.

6-da-table-view.png
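The same kind of inspection the Table view offers can be approximated offline. The following is a minimal sketch in plain Python, not the dashboard's implementation: the rows are made up, only the column names (TrueY, PredictedY, age, prior_inpatient) mirror the tutorial, and the helper name `misclassified` is hypothetical. It assumes 1 means “Readmitted” and 0 means “Not Readmitted”.

```python
# Made-up rows for illustration; 1 = readmitted, 0 = not readmitted.
rows = [
    {"age": "Over 60 years", "prior_inpatient": 5, "TrueY": 1, "PredictedY": 0},
    {"age": "Over 60 years", "prior_inpatient": 0, "TrueY": 0, "PredictedY": 0},
    {"age": "30-60 years", "prior_inpatient": 4, "TrueY": 1, "PredictedY": 0},
    {"age": "30-60 years", "prior_inpatient": 6, "TrueY": 1, "PredictedY": 1},
    {"age": "30 years or younger", "prior_inpatient": 0, "TrueY": 0, "PredictedY": 1},
]


def misclassified(records):
    """Keep only the rows where the prediction disagrees with the truth."""
    return [r for r in records if r["TrueY"] != r["PredictedY"]]


errors = misclassified(rows)
# False negatives: patients who were readmitted but predicted as not readmitted.
false_negatives = [r for r in errors if r["TrueY"] == 1]

for r in false_negatives:
    print(r["age"], "prior_inpatient =", r["prior_inpatient"])
```

Scanning the false negatives side by side is one way to spot feature values that incorrect records have in common.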

Data Analysis: Chart view (aggregate)

The chart view of the dashboard is another useful tool to visualize the data representation. First, we'll use the chart to compare the data distribution of the number of patients not readmitted vs. readmitted in our test dataset using True Y and Predicted Y.  Then, we'll examine if there are disparities for sensitive features, patients with prior hospitalization, or socioeconomic groups.

Data imbalance issues with test dataset

We'll use the cohort with all the test data for our analysis by following these steps:

  1. Select the “All data” option from the “Select a dataset cohort to explore” drop-down menu.
  2. On the y-axis, we'll click on the currently selected “race” value, which will launch a pop-up menu.
  3. Under “Select your axis value,” we'll choose “Count.”

7-select-data-chart.png

  4. On the x-axis, we'll click on the currently selected “Index” value, then choose “True Y” under the “Select your axis value” menu.

7-da-trueY.png

We can see that, out of the 994 diabetes patients represented in our test data, 798 patients are not readmitted and 198 are readmitted back to a hospital within 30 days. These are the actual values or “TrueY.”

7-da-predictedY.png

For contrast, let's compare those values with what our model actually predicts. To do that, let's change the “True Y” value on the x-axis to “Predicted Y”. Now, we see that the model predicts 41 patients readmitted back to the hospital and 953 patients not readmitted. So, this exposes an extreme data imbalance issue where the model does not perform well for cases where patients are readmitted.
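The counts reported in this section already bound how well the model can do on readmitted patients. Here is a short worked calculation using only numbers from the text (994 patients, 198 actual readmissions, 41 predicted readmissions); the variable names are illustrative.

```python
total_patients = 994
actual_readmitted = 198     # "True Y" = readmitted
predicted_readmitted = 41   # "Predicted Y" = readmitted

# Even if every predicted readmission were correct, true positives
# cannot exceed the number of positive predictions.
max_true_positives = min(actual_readmitted, predicted_readmitted)
recall_upper_bound = max_true_positives / actual_readmitted
min_missed = actual_readmitted - max_true_positives

print(f"Actual readmission rate:    {actual_readmitted / total_patients:.1%}")
print(f"Predicted readmission rate: {predicted_readmitted / total_patients:.1%}")
print(f"Best possible recall:       {recall_upper_bound:.1%}")
print(f"Readmissions missed (min):  {min_missed}")
```

In other words, even in the best case the model misses at least 157 of the 198 readmitted patients, so recall on the “Readmitted” class cannot exceed roughly 21%.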

Sensitive data representation

When we compare race distribution, we find that there are disparities in “Race” representation. Caucasians represent 77% of patients in the test data. African-Americans make up 19% of the patients. Hispanics represent 2% of the data. There are clear data gaps between the different ethnicities, which can lead to fairness issues. This is an area where ML professionals can intervene and help mitigate data disparities to make sure the model does not codify any racial biases.

7-da-race-count.png
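A representation check like the one above can also be scripted. This is a minimal sketch in plain Python under stated assumptions: the sample column is made up and merely sized to echo the percentages reported above (the “Other” group and its size are invented), and both the helper names and the 5% threshold are hypothetical choices, not something the dashboard prescribes.

```python
from collections import Counter


def representation_shares(values):
    """Return each group's share of the total number of rows."""
    counts = Counter(values)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def underrepresented(shares, threshold=0.05):
    """List groups whose share falls below an (arbitrary) threshold."""
    return sorted(group for group, share in shares.items() if share < threshold)


# Made-up column of 100 rows, sized to echo the percentages in the text.
race_column = (
    ["Caucasian"] * 77
    + ["AfricanAmerican"] * 19
    + ["Hispanic"] * 2
    + ["Other"] * 2
)

shares = representation_shares(race_column)
for group, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{group}: {share:.0%}")
print("Below threshold:", underrepresented(shares))
```

What counts as underrepresented depends on the population the model will serve, so the threshold is something to agree on with domain experts rather than a fixed rule.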

The gender representation among the patients is fairly balanced. So, this is not an area of concern.

7-da-gender-count.png

Age is not proportionately distributed across our data, as seen in our three age groups. Diabetes tends to be more common among older age groups, so this may be an acceptable and expected disparity. However, this is another area for ML professionals to validate with medical specialists to understand if this is a normal representation of individuals with diabetes across age groups.

7-da-age-count.png

Hospital Readmissions

Prior_Inpatient is one of the features from the cohort with the highest model errors, so let's see what impact it has on the model's predictions. To do this, we'll take the following steps:

  1. Click on the y-axis label.
  2. In the pop-up window pane, select the “Dataset” radio button.
  3. Then under “select feature”, select “prior_inpatient” on the drop-down menu.
  4. On the x-axis, keep “Predicted Y” selected.

7-da-prior-inpatient.png

The chart shows that the more prior hospitalizations (prior_inpatient) a diabetic patient has had, the more likely the model is to predict that they will be readmitted to the hospital within 30 days. Patients with fewer prior inpatient stays are more likely to be “not readmitted.”

7-da-inpatient-predictY.png
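The aggregate chart is essentially a cross-tabulation of a feature against the predicted outcome. The sketch below reproduces that idea in plain Python; it is not the dashboard's code, the records are made up, and the helper name `predicted_by_feature` is hypothetical. It assumes 1 means “Readmitted” and 0 means “Not Readmitted”.

```python
from collections import Counter, defaultdict


def predicted_by_feature(records):
    """Count predicted outcomes for each value of a feature."""
    table = defaultdict(Counter)
    for feature_value, predicted in records:
        table[feature_value][predicted] += 1
    return table


# Made-up (prior_inpatient, PredictedY) pairs; 1 = readmitted.
records = [
    (0, 0), (0, 0), (0, 0),
    (1, 0), (1, 0),
    (4, 0), (4, 1),
    (6, 1), (6, 1),
]

table = predicted_by_feature(records)
for value in sorted(table):
    counts = table[value]
    total = sum(counts.values())
    share = counts[1] / total  # Counter returns 0 for a missing outcome
    print(f"prior_inpatient={value}: n={total}, predicted readmitted={share:.0%}")
```

The same function works for any categorical feature, such as race, age group, or gender, by swapping the first element of each pair.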

For race, the chart shows that, due to the data imbalance, the model will not be able to accurately predict if a patient will be readmitted back to the hospital for some ethnicities. As we saw above, the Caucasian patients are overrepresented in this data set. So, even when there was no prediction for the other ethnic groups, we see 31 “Readmitted” occurrences for Caucasian patients since there's an overrepresentation there.

7-da-race-predictY.png

The model prediction is affected by the patients' age groups as well. There's an overrepresentation of data for patients “over 60 years” and data underrepresentation for patients “30 years or younger.” Here, the effects of data imbalance were evident between the model's classification of “Not readmitted” vs. “Readmitted.”

7-da-age-predictY.png

For gender, we see the same impact of data imbalance on model outcome for both male and female patients. However, the sample size of data for each outcome category is almost the same.

7-da-gender-predictY.png

Socioeconomic gaps

Another important way to assess data disparities is to look at fairness when it comes to equal access to life opportunities. In our case, some of the patients depend on Medicare or Medicaid government assistance. Medicare is generally available to people aged 65 and older, or to younger individuals with severe illnesses. Medicaid assists low-income individuals, and eligibility begins before age 65. So, we should examine our data to see if there are any socioeconomic disparities.

Let's evaluate whether race is associated with any significant difference in how likely diabetes patients were to pay their bill using government assistance. We'll use the Age cohorts we created in the Error Analysis tutorial. Since most patients in our test data are over 60 years old, we'll look at Medicare as their form of payment. From the cohorts, select “Age == Over 60 years”. On the x-axis, select “medicare”, and on the y-axis, select “race”.

7-da-race-medicare.png

Caucasian patients who did not pay their hospital bill using Medicare totaled 490, while 261 paid with Medicare. In other words, 35% of Caucasian patients used Medicare to pay their hospital bill. In contrast, 44 African-American patients paid their bill with Medicare, while 158 did not, meaning 22% of African-American patients paid using Medicare. Although there's a difference of 13 percentage points in Medicare usage between Caucasian and African-American patients, it's not significant. This shows that between Caucasian and African-American patients in our dataset, there is a fair balance of Medicare usage.
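The percentages above follow directly from the counts read off the chart. As a worked check in plain Python (the dictionary layout and names are illustrative; only the four counts come from the text):

```python
# Counts for the "Age == Over 60 years" cohort, as reported in the text.
medicare_counts = {
    "Caucasian": {"medicare": 261, "no_medicare": 490},
    "AfricanAmerican": {"medicare": 44, "no_medicare": 158},
}


def usage_rate(counts):
    """Share of a group's patients who paid with Medicare."""
    return counts["medicare"] / (counts["medicare"] + counts["no_medicare"])


rates = {group: usage_rate(c) for group, c in medicare_counts.items()}
gap_points = (rates["Caucasian"] - rates["AfricanAmerican"]) * 100

for group, rate in rates.items():
    print(f"{group}: {rate:.0%} paid with Medicare")
print(f"Gap: {gap_points:.0f} percentage points")
```

Note that the gap is a difference in percentage points between two group rates, and the groups differ greatly in size (751 vs. 202 patients).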

Data Analysis: Chart view (individual datapoints)

When viewing the data in a chart view, you have the option to look at the aggregated presentation of the data. The Responsible AI dashboard provides an individual data point view of the data as well. With this, you can add a third field (e.g. data feature, trueY, predicted, etc.) and see how the field is represented with individual data points if you want to isolate and examine each individual data point.

To achieve this, let's take the following steps:

  1. Under the “Select a dataset cohort to explore” drop-down menu, choose “All data”.
  2. On the y-axis, we'll select “Predicted Y”. Note: enable the option for “Should dither” to display the unique values.
  3. On the x-axis, we'll select “prior_inpatient”. Make sure the “Should dither” radio button is selected.
  4. Under “Chart type” on the right-hand side, select the “Individual datapoints” radio button.
  5. Under “Color value”, select “age” from the Dataset.

7-da-age-predictY-inpatient.png

The chart should display 2 lines for “Predicted Y”:

  • Line 0:  for “Not Readmitted”
  • Line 1:  for “Readmitted”.

In this case, we want to see the data representation of age and the impact “prior_inpatient” has on a patient's hospital readmission. At first glance, we can see that line 0 (Not readmitted patients) has a higher concentration of “Over 60 years” patients (seen in green) compared to “30 years or younger” or “30–60 years” (seen in orange). For analysis, we'll focus on where there is a higher concentration of patients in the different age groups when it comes to whether or not they will be readmitted in 30 days.

On line 0 (representing “Not Readmitted”) we see the following individual datapoints:

  • There is a higher concentration of datapoints with patients “Over 60 years” that are not readmitted back to the hospital when they have a prior history of hospitalization between 0 and 4. The concentration slowly reduces as the number of prior_inpatients increases.
  • The above is also true for patients age “30–60 years.”
  • Patients aged “30 years or younger” only show a high concentration of datapoints when prior_inpatient = 0. This means patients in this age group have no prior history of hospitalization, which drives the model's outcome to be “Not Readmitted.”

On line 1 (representing “Readmitted”) we see the following individual datapoints:

  • The concentration of datapoints for patients “Over 60 years” slowly increases in step with the higher number of prior_inpatients between 1 and 6. This shows that a prior history of hospitalization has a significant impact on whether a senior citizen patient is readmitted back to a hospital within 30 days.
  • The above is also true for patients age “30–60 years.”
  • Patients aged “30 years or younger” have a datapoint at prior_inpatient = 8 and 9. These could be outliers since there are no other datapoints to form a conclusion.

7-da-age-predictY-inpatient.png

An alternative method of analysis is to change from the “All data” cohort to the cohort with the highest error rate. When we change the cohort, we see that datapoints for patients “30 years or younger” no longer exist. However, the data pattern remains the same for patients “Over 60 years” and “30–60 years” in this high-error cohort. This means the likelihood of hospital readmission increases when prior_inpatient >= 4 for diabetic patients aged 30 or older.

7-da-age-predictY-inpatient-err.png

As you can see from all the data analysis we performed in this tutorial, data is a significant blind spot that is often missed when evaluating model performance. After tuning a model, you can increase the performance, but that does not mean you have a model that is fair and inclusive. Prime examples here were the patients' “Race” and “Age.” Although the race feature did not come up during our error analysis or model overview investigation, the Data Analysis component of the Responsible AI dashboard exposed this discrepancy. Because the test data has an overrepresentation of elderly Caucasian diabetic patients, the model will produce inaccurate predictions. In this case, the model was trained using non-inclusive data that could result in an end user overlooking indicators that a patient is at risk of being readmitted to a hospital within 30 days. This model may work well in locations where there's a high population of Caucasians. However, the model will have a high number of inaccurate predictions in locations where there's a large population of African-Americans, Hispanics, or Asians. Hence, our model is unreliable.

This also showed us that, during the data analysis process, there are gray areas where data scientists will need to work very closely with business stakeholders or decision makers to understand if the data represents reality or not. For example, we saw the number of data disparities with diabetic patients in different age groups. These alternative angles that the Responsible AI dashboard provides will enable data scientists, AI developers and decision makers to debug, identify and mitigate issues to improve a model's behavior and reduce harm.

Fantastic work! Next, we are going to learn how to explain and interpret a model. Stay tuned for Part 7.

DISCLAIMER:  Microsoft products and services (1) are not designed, intended or made available as a medical device, and (2) are not designed or intended to be a substitute for professional medical advice, diagnosis, treatment, or judgment and should not be used to replace or as a substitute for professional medical advice, diagnosis, treatment, or judgment. Customers/partners are responsible for ensuring their solutions comply with applicable laws and regulations. Customers/partners also will need to thoroughly test and evaluate whether an AI tool is fit for purpose and identify and mitigate any risks or harms to end users associated with its use. 

 

This article was originally published by Microsoft's AI - Machine Learning Blog. You can find the original article here.