How to find model performance inconsistencies with Responsible AI (Part 5)

An effective approach to evaluating the performance of machine learning models is to build a holistic understanding of their behavior across different scenarios. One way to approach this is to calculate and assess model performance metrics such as accuracy, recall, precision, root mean squared error (RMSE), mean absolute error (MAE), or R2 scores. However, analyzing a single metric, or only the aggregated metrics for the overall model, is insufficient to debug a model and identify the root cause of errors or inaccuracies. In conjunction with measuring performance metrics, data scientists and developers need to conduct comparative analysis to support holistic decision making.
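As a quick refresher on the classification metrics named above, here is a minimal sketch of how they can be computed with scikit-learn. The labels and predictions are tiny synthetic examples (1 = readmitted within 30 days), not the tutorial's dataset:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Synthetic labels for illustration only: 1 = readmitted within 30 days
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]

accuracy = accuracy_score(y_true, y_pred)    # 5 of 8 correct -> 0.625
precision = precision_score(y_true, y_pred)  # 2 TP of 3 positive predictions -> 2/3
recall = recall_score(y_true, y_pred)        # 2 TP of 4 actual positives -> 0.5
```

Each metric answers a different question, which is why no single one is enough on its own: accuracy summarizes all predictions, precision only the positive predictions, and recall only the actual positives.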

Comparative analysis shines a light on how a model performs for one subgroup of the dataset versus another. An advantage of the Model Overview component of the Responsible AI (RAI) dashboard in Azure Machine Learning is that it is not reliant solely on high-level numeric calculations over the dataset; it drills down into the data features as well. This is especially important when one cohort has unique characteristics compared to another. For example, discovering that the model is more erroneous for a cohort defined by sensitive features (e.g., patient race, gender or age) can help expose potential unfairness.

In this tutorial, we will explore how to use the Model Overview component of the Responsible AI (RAI) dashboard to find model performance disparities across cohorts. This is a continuation of the Diabetes Hospital Readmission use case we've been using throughout this blog tutorial series. In the previous tutorial, we showed how to leverage the Error Analysis component of the RAI dashboard to discover the cohorts where the model had the highest error rate, as well as the cohort with the lowest. Now, we'll use the Model Overview component to investigate why the model performs better in one cohort than another.


Model Overview: Dataset cohort analysis

The Model Overview component of the Responsible AI dashboard helps analyze disparities in model performance metrics across the data cohorts the user creates. For our investigation, we'll evaluate the cohorts with the highest and lowest model error rates that we created in the prior tutorial (see tutorial Part 4), analyzing and comparing performance across the two. Since the dashboard knows we are working with a classification model, it has already pre-selected the relevant metrics: accuracy score, false positive rate, false negative rate and selection rate. Next, we'll select the “Dataset Cohorts” pane, which displays the cohorts we created in a table alongside the model metrics. Note: refer to the UI overview to fully understand how to use all the controls on the RAI dashboard.
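To make the four pre-selected metrics concrete, here is a minimal sketch of how they can be derived per cohort from a confusion-matrix tally. The cohort names and label/prediction pairs are synthetic placeholders, not the dashboard's actual cohorts:

```python
def cohort_metrics(y_true, y_pred):
    """Compute the four metrics the dashboard pre-selects for classification."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    n = len(y_true)
    return {
        "accuracy": (tp + tn) / n,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
        "selection_rate": (tp + fp) / n,  # fraction predicted positive
    }

# Hypothetical cohorts: (true labels, predictions); 1 = readmitted in 30 days
cohorts = {
    "high_error": ([1, 1, 1, 0, 1], [0, 0, 1, 0, 0]),  # many missed readmissions
    "low_error":  ([0, 0, 1, 0, 0], [0, 0, 1, 0, 0]),  # all correct
}
results = {name: cohort_metrics(t, p) for name, (t, p) in cohorts.items()}
```

Laying the cohorts side by side like this is exactly the comparison the “Dataset Cohorts” table performs: the high-error cohort here shows a false negative rate of 0.75 despite a false positive rate of 0.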


Comparing the cohort with the most errors, “Err: Prior_Inpatient > 0; Num_Meds > 11 and ≤ 21.50”, against the cohort with the least, we can see that the accuracy score for the erroneous cohort is 0.657, which is not optimal. Its false positive rate is very low, meaning there are few cases where the model inaccurately predicts that a patient will be readmitted to the hospital within 30 days. Conversely, the false negative rate of 0.754 is high: in many cases the model falsely predicts that patients will not be readmitted when the actual outcome is that they are readmitted within 30 days. The cohort with the least errors has an accuracy score of 0.94, which is far better than the overall accuracy score of the model on all the data. This cohort also has a low false positive rate.

Probability distribution

Under the Probability distribution, we see a chart showing the model's predicted probability of patients not being readmitted to the hospital within 30 days. It reveals that for the “All data” cohort, containing the entire test dataset, the model predicted that a majority of the patients will not be readmitted within 30 days, with a median probability of 0.854 and an upper quartile of 0.986, which is good and seems logical: we would not want a high frequency of patients being readmitted to a hospital a few days after being discharged. For our highest-error cohort, however, containing patients who were hospitalized in the past and were administered between 11 and 22 medications, the model also predicts that most will likely not be readmitted within 30 days, with a median of 0.754 and a maximum probability of 0.955. This does not seem correct. This is the same erroneous cohort that has a high number of false negatives, meaning the model falsely predicts patients in this cohort as not readmitted when their true outcome is readmitted.


On the other hand, for the cohort with the least number of errors, containing patients who were not hospitalized in the past, have fewer than 7 diagnoses and a lower number of lab procedures, the model has a median of 0.90 and an upper quartile of 0.986. This is a high probability that patients in this cohort will not be readmitted.
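The median and upper-quartile figures the dashboard reports for a probability distribution can be sketched with NumPy percentiles. The probabilities below are synthetic, not the cohort's actual scores:

```python
import numpy as np

# Synthetic predicted probabilities of "not readmitted" for one cohort
proba_not_readmitted = np.array([0.55, 0.70, 0.80, 0.85, 0.90, 0.95, 0.99])

median = np.percentile(proba_not_readmitted, 50)          # 0.85
upper_quartile = np.percentile(proba_not_readmitted, 75)  # 0.925
```

These are the same summary statistics the dashboard's box plot draws: a high median means the model, on the whole, expects patients in that cohort not to return within 30 days.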

The same analysis can be done for the opposite probability by clicking on the “Label” button in the pop-up pane. Here we can select “Readmit” to see each cohort's predicted probabilities for readmission.


Metric visualization

Now let's get a deeper understanding of the model's performance by switching to the Metric visualizations pane. Since we've already reviewed the accuracy score under the “Dataset Cohorts” pane above, let's explore what other metrics can reveal about the 3 cohorts. To choose another metric, we'll click on “Choose Label” on the x-axis to pick from the list of other available metrics. Since we have a classification model, the RAI dashboard will display only classification metrics.


In this case, we'll select “Precision score” to see how often the model's positive predictions were correct for each cohort. Reviewing the chart, we see that the model is correct ~70% of the time for both the all-test-data cohort and the erroneous cohort. The precision score for the least erroneous cohort, patients with no prior hospitalization and fewer than 7 diagnoses, is 0.94. This is consistent with the accuracy score.


Finally, we'll change the metric to “Recall” to see how well the model correctly predicted that patients in the cohorts will be readmitted to the hospital within 30 days. The recall shows that, across all the cohorts, the model's prediction was correct less than 25% of the time for patients who were readmitted.
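Recall is the fraction of truly readmitted patients the model actually catches, TP / (TP + FN). A minimal sketch with synthetic labels shows how a model can look reasonable overall yet miss most readmissions:

```python
# Synthetic example of a low-recall model: 1 = readmitted within 30 days
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]  # catches only 1 of 4 readmitted patients

tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
recall = tp / (tp + fn)  # 1 / 4 = 0.25
```

This mirrors the situation the chart reveals: an accuracy score can stay acceptable while recall for the readmitted class collapses.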


Confusion matrix

To check the distribution of where the model is making the most accurate predictions, let's click on the “Confusion matrix” tab. It reveals that the model is not learning well for cases where the patient is readmitted to the hospital within 30 days, which explains why the recall score was extremely low.
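The confusion matrix makes the failure mode explicit by tallying every actual/predicted combination. A sketch with scikit-learn, on the same synthetic labels idea as above (1 = readmitted within 30 days):

```python
from sklearn.metrics import confusion_matrix

# Synthetic labels: 1 = readmitted within 30 days
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 1, 0]

# Rows are actual classes, columns are predicted classes (order: 0, 1)
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
# cm[1][0] counts readmitted patients the model predicted as "not readmitted" —
# the false negatives that drag recall down
```

In this sketch, cm[1][0] is 3 versus only 1 true positive in cm[1][1], which is the kind of lopsided bottom row that signals the model is not learning the readmitted class.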


This means we need to look at the individual features in the cohort to see if there are errors in the data causing the model inaccuracies.

Model Overview: Feature cohort analysis

The RAI dashboard also gives us the ability to examine model performance across different cohorts within a given feature. On the Feature cohorts pane, you can investigate a model by comparing performance across user-specified sensitive and non-sensitive features (for example, across patient age, diagnoses or insulin results). Whether for a single feature or a combination of two, the RAI dashboard has built-in intelligence to divide feature values into meaningful cohorts so users can do feature-based analysis and compare where the model is underperforming.
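Under the hood, feature-based cohorts amount to binning a feature's values and computing metrics per bin. A minimal pandas sketch of the idea, with synthetic data and illustrative bin edges (the dashboard chooses its own bins automatically):

```python
import pandas as pd

# Synthetic test records: the feature value and whether the model was correct
df = pd.DataFrame({
    "prior_inpatient": [0, 0, 1, 2, 4, 5, 6, 8],
    "correct":         [1, 1, 1, 0, 1, 0, 1, 1],  # 1 = prediction was correct
})

# Bin the feature into cohorts (edges are illustrative, not the dashboard's)
df["cohort"] = pd.cut(df["prior_inpatient"], bins=[-1, 2, 5, 100],
                      labels=["< 3", "3 to 5", ">= 6"])

# Per-cohort accuracy: mean of the correctness indicator within each bin
accuracy_by_cohort = df.groupby("cohort", observed=True)["correct"].mean()
```

Comparing the resulting per-bin metrics side by side is exactly the feature-cohort comparison the pane automates.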

To look closer at the data, we'll switch to the Feature cohorts tab. Since the cohort where the model had the highest error rate contains patients with Prior_Inpatient > 0 and a number of medications between 11 and 22, we'll take a closer look at the “Prior_Inpatient” and “Num_medications” features.

Under the “Feature(s)” drop-down menu, scroll down the list and select the “Prior_Inpatient” checkbox. This will display 3 different feature cohorts and the model performance metrics.


Here we see a breakdown of the different “prior_inpatient” cohorts generated:

  • prior_inpatient < 3
  • prior_inpatient ≥ 3 and < 6
  • prior_inpatient ≥ 6

The first cohort has a sample size of 943, meaning a majority of patients in the test data were hospitalized less than 3 times in the past; the model's accuracy for this cohort is 0.838, which is good. Only 39 patients from the test data fall in the second cohort, where the model's accuracy of 0.692 is not very accurate. Lastly, just 12 patients from the test data have 6 or more prior hospitalizations; the model accuracy of 0.75 for this cohort is ok.

Probability Distribution

Similar to the Dataset cohorts, we can view the “Probability Distribution.” It shows that the lower the patient's number of prior inpatient hospitalizations, the more likely the diabetes patient was not going to be readmitted within 30 days. This makes sense: a patient who was hospitalized frequently in the past is presumably sicker and more likely to require hospitalization again.
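The trend described above can be sketched by grouping predicted probabilities by the feature value. The records below are synthetic, chosen only to illustrate the shape of the relationship:

```python
import pandas as pd

# Synthetic records: prior hospitalizations vs. predicted P(not readmitted)
df = pd.DataFrame({
    "prior_inpatient": [0, 0, 0, 2, 2, 5, 5, 7],
    "p_not_readmit":   [0.95, 0.90, 0.92, 0.80, 0.78, 0.60, 0.65, 0.50],
})

# Median predicted probability per prior_inpatient value
median_by_visits = df.groupby("prior_inpatient")["p_not_readmit"].median()
```

In this sketch the median falls as prior hospitalizations rise, which is the monotone pattern the probability-distribution view surfaces.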


Metrics visualizations

The precision score for patients with fewer than 3 prior hospitalizations is 0.40, which is very poor. This means that of the patients the model predicted as readmitted in this cohort, only 40% actually were. The precision scores for the other 2 cohorts are good.


The recall score for patients with fewer than 3 prior hospitalizations is 0.013, meaning that for a majority of patients in the test data, the model has difficulty correctly predicting whether the patient will be readmitted within 30 days.


The next feature we will evaluate is “num_medications.” From the “Feature(s)” drop-down list, we'll de-select “prior_inpatient” and select the “num_medications” feature.


As we see from the different cohorts generated, the sample size for the num_medications < 23 cohort is 822, meaning that most of the patients in the test data take fewer than 23 medications. The model's accuracy score for this cohort is 0.827. The false positive rate is a mere 0.01; the problem lies on the other side: in cases where patients are readmitted within 30 days, the model incorrectly predicts them as not readmitted.

The cohort with num_medications ≥ 23 and < 45 also has a good accuracy of 0.844, with a sample size of 160. Similarly, this cohort's false positive rate is only 0.02, showing once again that the model rarely predicts readmission and instead predicts patients who are readmitted to the hospital within 30 days as not readmitted. Lastly, the num_medications ≥ 45 cohort has only 12 records. Its model accuracy of 0.912 is great; however, the errors it does make again consist of predicting readmitted patients as not readmitted.

Overall, the probability distribution chart shows that the number of medications administered to diabetes patients does not strongly affect their probability of not being readmitted within 30 days: all medication ranges show a high probability of not being readmitted, although the fewer the medications, the higher that probability.


The precision score for each of the num_medications cohorts is ~70%, which is ok. However, this means the positive predictions the model makes in these cohorts are incorrect 30% of the time.


The recall score for the num_medications cohorts reveals that the model fails to identify patients who will be readmitted more than 90% of the time. This is consistent with what the very low false positive rates already suggested: the model almost never predicts that a patient will be readmitted within 30 days.


This behavior is consistent with the analysis we did with the dataset cohorts. The Model Overview section gave us the ability to isolate the problem by doing feature-based analyses on each of the features in the data cohort with the highest error rate. Both features had good accuracy, but poor precision and recall. In addition, the error rates consistently revealed that the model was not performing well in cases where patients were readmitted within 30 days. As a result, further investigation is needed to find out why the model is not learning. There may not be enough data for cases where patients return within 30 days. If this is a data imbalance issue, the next tutorial will be able to expose it; we'll need to analyze whether there are disparities in the amount of data.
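A first step in checking the imbalance hypothesis above is simply counting the labels in the training data. A minimal sketch with synthetic counts, not the actual dataset:

```python
from collections import Counter

# Synthetic training labels: 1 = readmitted within 30 days
y_train = [0] * 900 + [1] * 100

counts = Counter(y_train)
minority_share = counts[1] / len(y_train)  # share of readmitted cases
```

If the readmitted class makes up only a small share of the data, as in this 10% sketch, the model has far fewer examples from which to learn that class, which would explain the low recall.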


As you can see, being able to perform comparative analysis between dataset cohorts or feature cohorts is vital for debugging a model and pinpointing where it's having issues. For dataset cohorts, the dashboard lets us compare the cohorts that we create: we are not constrained to analyzing a single test dataset, but can create other subgroups and investigate how the model performs in one group versus another. Though we only used 2 dataset cohorts in this tutorial, you can create more cohorts from any hypothesis you want to explore that could uncover areas where the model is having issues. In addition, the dashboard's built-in intelligence helps divide features into meaningful cohorts for comparison, another great option for looking at cohorts within a feature to uncover root causes of issues in the model or in data representation.

Finally, this tutorial shows that traditional model performance metrics (e.g., accuracy, recall, the confusion matrix) are still critical. By combining RAI insights with traditional performance metrics, the dashboard gives us a holistic view for analyzing and debugging models at both the aggregate and granular level.

Now we are ready to expose any over-representation or under-representation in our data that could affect model performance. Stay tuned for Part 6 of the tutorial to learn more…

DISCLAIMER:  Microsoft products and services (1) are not designed, intended or made available as a medical device, and (2) are not designed or intended to be a substitute for professional medical advice, diagnosis, treatment, or judgment and should not be used to replace or as a substitute for professional medical advice, diagnosis, treatment, or judgment. Customers/partners are responsible for ensuring their solutions comply with applicable laws and regulations. Customers/partners also will need to thoroughly test and evaluate whether an AI tool is fit for purpose and identify and mitigate any risks or harms to end users associated with its use. 


This article was originally published on Microsoft's AI - Machine Learning Blog.