The Case of Azure AD Quota Exhaustion

Hello everyone, my name is Zoheb Shaikh and I'm a Solution Engineer working with the Microsoft Mission Critical team (SfMC). Today I'll share an interesting issue related to quota limits that we came across recently.

I had a customer who exhausted their AAD quota, which put them at significant health risk.

Before I share more details, let's look at what your organization's AAD Quota is and why it matters.

In simple terms, every Azure AD tenant has a defined quota for the number of directory objects/resources that can be created and stored in AAD.

A maximum of 50,000 resources can be created in a single tenant by users of the Free edition of Azure Active Directory by default. If you have at least one verified domain, the default Azure AD service quota for your organization is extended to 300,000 Azure AD resources. The Azure AD service quota for organizations created by self-service sign-up remains 50,000 Azure AD resources even after you perform an internal admin takeover and the organization is converted to a managed tenant with at least one verified domain. This service limit is unrelated to the pricing tier limit of 500,000 resources on the Azure AD pricing page. To go beyond the default quota, you must contact Microsoft Support.

For information about AAD Quotas, see Service limits and restrictions – Azure Active Directory.

Check your AAD Quota limit

Test this in Graph Explorer: https://developer.microsoft.com/en-us/graph/graph-explorer

Sign into Graph Explorer with your account that has access to the directory.

Run beta query (GET) https://graph.microsoft.com/beta/organization

ZohebShaikh_0-1624957696437.png
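The response to this query contains a directorySizeQuota section with used and total values. As a minimal sketch (not part of the original case), the helper below turns such a response into a small summary. The function name Get-DirectoryQuotaSummary and the sample numbers are assumptions made for illustration only.

```powershell
# Hypothetical helper (not from the original post): summarizes the
# directorySizeQuota section of a /beta/organization response.
function Get-DirectoryQuotaSummary {
    param(
        [Parameter(Mandatory)]
        $OrganizationResponse
    )

    # The response wraps the tenant in a "value" array; take the first entry.
    $quota = @($OrganizationResponse.value)[0].directorySizeQuota

    [pscustomobject]@{
        Used        = [int]$quota.used
        Total       = [int]$quota.total
        Remaining   = [int]$quota.total - [int]$quota.used
        PercentUsed = [math]::Round(($quota.used / $quota.total) * 100, 2)
    }
}

# Sample data shaped like the Graph response; the numbers are made up.
$sampleJson = '{"value":[{"directorySizeQuota":{"used":450000,"total":500000}}]}'
$sample = $sampleJson | ConvertFrom-Json
Get-DirectoryQuotaSummary -OrganizationResponse $sample
```

In practice you would pass in the object returned by your Graph call instead of the sample JSON.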

Now that you understand what the AAD Quota is and how to view it, let's get back to the customer scenario to see how the AAD Quota affected them and what you can learn from it.

The customer was an organization of approximately 100k users, using multiple Microsoft cloud services such as Teams, EXO, Azure IaaS, PaaS etc.

As part of their cloud modernization journey, they were doing a massive rollout of Intune across the organization after completing all the testing and PoC work.

Our proactive monitoring and CXP teams did inform the customer that their Azure AD objects were increasing at an unusual speed, but the customer never expected this to go beyond their AAD Quota.

One fine morning I woke up to a call from our SfMC CritSit manager: my customer's AAD Quota was exhausted and their AAD Connect was unable to synchronize any new objects. As part of the reactive arm of SfMC, we joined a meeting with the customer and our Azure Rapid Response team to find the cause of the problem.

We decided on the following approach for the issue:

  1. Confirm AAD Quota exhaustion and what objects are consuming AAD Resources
  2. Remove stale objects from AAD.
  3. Reach out to Product Group asking for a Quota increase for this specific customer. 

Check what objects are consuming the AAD Quota

While I was engaged in this case, we did it the hard way (by exporting all registered objects to Excel and then using Pivot Tables to analyze them), but now there is an easier way, as described below:

  1. Login to Azure AD (https://aad.portal.azure.com)
  2. In Azure AD Click on preview features (Presently)
    ZohebShaikh_1-1624957696449.png
  3. This will give you a nice overview of your Object Status of AAD
    ZohebShaikh_2-1624957696461.png

We created the table below to help us find out what exactly was going on in the environment. In it we measured how many objects existed in total and how many had been created in the last few days.

Object                 Count   New Count in last 24 hours   New Count in last 1 week
Users                  #       #                            #
Groups                 #       #                            #
Devices                #       #                            #
Contacts               #       #                            #
Applications           #       #                            #
Deleted Applications   #       #                            #
Service Principals     #       #                            #
Roles                  #       #                            #
Extensionproperties    #       #                            #
TOTAL                  ##      ##                           ##
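A tally like this can also be produced with a short script instead of Pivot Tables. The sketch below is only an illustration and assumes you already have an export where each object carries a Type and a Created timestamp; those property names, the function name Get-ObjectTally and the sample rows are all hypothetical.

```powershell
# Hypothetical helper: builds the per-type tally from a list of objects that
# each carry a Type and a Created timestamp (both property names are assumptions).
function Get-ObjectTally {
    param(
        [Parameter(Mandatory)] [object[]] $Objects,
        [datetime] $Now = (Get-Date)
    )

    $Objects | Group-Object -Property Type | ForEach-Object {
        [pscustomobject]@{
            Object = $_.Name
            Count  = $_.Count
            # Objects created within the last 24 hours / 7 days of $Now
            New24h = @($_.Group | Where-Object { $_.Created -gt $Now.AddHours(-24) }).Count
            New7d  = @($_.Group | Where-Object { $_.Created -gt $Now.AddDays(-7) }).Count
        }
    }
}

# Made-up sample rows, with a fixed reference time so the result is repeatable.
$now = [datetime]'2021-06-29'
$sample = @(
    [pscustomobject]@{ Type = 'Device'; Created = $now.AddHours(-2) }
    [pscustomobject]@{ Type = 'Device'; Created = $now.AddDays(-3) }
    [pscustomobject]@{ Type = 'Device'; Created = $now.AddDays(-400) }
    [pscustomobject]@{ Type = 'User';   Created = $now.AddDays(-30) }
)
Get-ObjectTally -Objects $sample -Now $now | Format-Table
```

With the sample rows, the Device line shows 3 objects in total, 1 created in the last 24 hours and 2 in the last week.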

I am not sharing the actual numbers here, but we saw that Devices were consuming about 50% of the AAD Object quota and had increased by a few thousand in the last 24 hours.

This output showed us that thousands of devices were being registered every day, which had resulted in the AAD Quota exhaustion.

Based on our analysis, we recommended that the customer delete stale devices that had not been used for more than 1 year. This alone enabled us to delete approximately 50k objects.
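To illustrate the "not used for more than 1 year" rule, here is a minimal sketch that only selects candidates and deletes nothing. It assumes a device export where each entry has an ApproximateLastSignInDateTime property (the name Microsoft Graph uses on the device resource); the function name Select-StaleDevice and the sample devices are hypothetical.

```powershell
# Hypothetical helper: returns devices whose last sign-in is older than the cutoff.
# Devices without a recorded sign-in are skipped so they can be reviewed manually.
function Select-StaleDevice {
    param(
        [Parameter(Mandatory)] [object[]] $Devices,
        [int] $MaxAgeDays = 365,
        [datetime] $Now = (Get-Date)
    )

    $cutoff = $Now.AddDays(-$MaxAgeDays)
    $Devices | Where-Object {
        $_.ApproximateLastSignInDateTime -and
        ([datetime]$_.ApproximateLastSignInDateTime) -lt $cutoff
    }
}

# Made-up sample devices with a fixed reference time.
$now = [datetime]'2021-06-29'
$devices = @(
    [pscustomobject]@{ DisplayName = 'LAPTOP-OLD'; ApproximateLastSignInDateTime = $now.AddDays(-400) }
    [pscustomobject]@{ DisplayName = 'LAPTOP-NEW'; ApproximateLastSignInDateTime = $now.AddDays(-10) }
    [pscustomobject]@{ DisplayName = 'PHONE-UNKNOWN'; ApproximateLastSignInDateTime = $null }
)
Select-StaleDevice -Devices $devices -Now $now | Select-Object DisplayName
```

Reviewing the resulting list with the device owners before deleting anything is a sensible precaution, since a deletion of this size is hard to undo.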

Deleting these 50k objects took them out of the critical situation and gave them some breathing space, preventing the problem from recurring for at least the next couple of weeks while we figured out what exactly was going wrong.

Being part of the Microsoft Solution for Mission Critical team, we always go above and beyond to support our customers. The first step is always to quickly resolve the reactive issue, then to identify the root cause, and finally, through our Proactive Delivery Methodology, to make sure it does not happen again.

We followed the steps below to identify the root cause and ensure the issue would not happen again:

  1. Configuring Alerts for validation and Quota exhaustion
    1. Daily alerts for Azure AD Object count
    2. Alerts in case AAD Object Quota limit is exhausted.
  2. More detailed review on the Root Cause of the issue.
  3. Creating a baseline for AAD Objects needed in the organization.
    1. Baseline to be created based on number of Objects in Organization (Users, computers etc.)
    2. What is the expected count?
  4. Increasing the Object Quota based on the baseline created if needed.

In the next sections we will go through each of the above actions in more detail.

1. Configuring Alerts for validation and Quota exhaustion

Option#1 using Azure Automation.

We wanted to add alerts to ensure the customer is notified when they are nearing the limit. We achieved this with Azure Automation, as described below, with the help of my colleague Eddy Ng from Malaysia.

Below is the step-by-step process for setting up the alerts once you have created an Azure Automation account:

  • Create the credential in the Credential vault. Click on the + sign, add a credential and enter its name and the account information. The credential must have sufficient rights to connect to Azure AD and must not get an MFA prompt. The name you use is important because it is referenced from the script.

ZohebShaikh_3-1624957696465.png

  • Next, install the Microsoft Graph Intune module under Modules. Click on Browse Gallery, search for Microsoft.Graph.Intune, click on the result and Import.

ZohebShaikh_4-1624957696470.png

ZohebShaikh_5-1624957696473.png

ZohebShaikh_6-1624957696474.png

Going forward, we recommend the Microsoft Graph PowerShell SDK.

  • Go to Runbooks.

ZohebShaikh_7-1624957696477.png

Create a new Runbook. Give it a name.

Runbook type: PowerShell

Paste the below code:

# Get the admin credential from the credential vault.
# Change "admin" to match the name of your credential.
$credObject = Get-AutomationPSCredential -Name 'admin'

# Initiate the connection to Microsoft Graph
$connection = Connect-MSGraph -PSCredential $credObject

# Set up the Graph API URL
$graphApiVersion = "beta"
$Resource = 'organization?$select=directorysizequota'
$uri = "https://graph.microsoft.com/$graphApiVersion/$($Resource)"

# Run the query via the Graph API
$data = Invoke-MSGraphRequest -Url $uri

# Get the data and validate it.
# Change the number 50000 to the threshold you want to alert on.
$maxsize = 50000

if ([int]($data.value.directorysizequota.used) -gt $maxsize)
{
    Write-Output "Directory Size : $($data.value.directorysizequota.used) is greater than $maxsize limit"
    Write-Error "Directory Size : $($data.value.directorysizequota.used) is greater than $maxsize limit"
    # The terminating error below marks the job as Failed, which the alert rule picks up
    Write-Error " " -ErrorAction Stop
}
else
{
    Write-Output "Directory Size : $($data.value.directorysizequota.used)"
}

  • Click Save and Publish
  • Click Link to Schedule

ZohebShaikh_8-1624957696479.png

  • Populate the schedule accordingly. For example, run daily at 12pm UTC.
  • Results from each run job can be found under Jobs.

ZohebShaikh_9-1624957696481.png

  • If the directory size is above the threshold, the job status will be Failed as a result of the -ErrorAction Stop in the script.

ZohebShaikh_10-1624957696486.png

  • Setup Alerts to take advantage of this by creating New Alert Rule

ZohebShaikh_11-1624957696489.png

  • Click select condition. Signal name “Total Job”. Follow the below. Amend “MyRunbook” accordingly. When finished, click done

ZohebShaikh_12-1624957696495.png

  • Select Action Group. Create Action Group.
  • Populate info accordingly for Basics.
  • Populate info similar to below for Notifications.

ZohebShaikh_13-1624957696503.png

  • Click Review + Create
  • Once done, scroll down to Alert Rule Details and fill in the details such as Name, Description and Severity.

ZohebShaikh_14-1624957696514.png

  • Create Alert Rule

Results: When the directory size breaches the limit, the admins will get an alert via email.

You can then click on the Runbook and select Jobs. Click on All Logs to see the error output for each individual job run.

If you wish to review the previous results in bulk, go to Logs and run the Kusto query below. Scroll to the right for ResultDescription. Assuming the schedule is set to run daily, a limit of 50 then corresponds to about 50 days of runs.

AzureDiagnostics
| where StreamType_s == "Output"
| limit 50

ZohebShaikh_15-1624957696552.png

Option#2 An alternative way of configuring Alerts for validation and Quota exhaustion

While I was writing this blog, my colleague Alin Stanciu from Romania suggested what is probably a better way to configure alerts for quota exhaustion.

Replace the script in the Azure Automation account with the one below:

# Get the admin credential from the credential vault.
# Change "azalerts" to match the name of your credential.
$credObject = Get-AutomationPSCredential -Name 'azalerts'

# Initiate the connection to Microsoft Graph
$connection = Connect-MSGraph -PSCredential $credObject

# Set up the Graph API URL
$graphApiVersion = "beta"
$Resource = 'organization?$select=directorysizequota'
$uri = "https://graph.microsoft.com/$graphApiVersion/$($Resource)"

# Run the query via the Graph API
$data = Invoke-MSGraphRequest -Url $uri

# Get the data and calculate the percentage of the quota in use
$usedpercentage = (($data.value.directorysizequota.used / $data.value.directorysizequota.total) * 100)

#if ($usedpercentage -gt $maxsize)
#{Write-Output "Directory Size : $($data.value.directorysizequota.used) is greater than 90 percent"}
#else
#{
Write-Output "Directory Size : $($data.value.directorysizequota.used) and percentage used is $($usedpercentage)"
#}

You can then use Azure Log Analytics to alert on this output through Azure Monitor, as below:

AzureDiagnostics
| where Category == "JobStreams"
| where ResourceId == "" // replace with the resource ID of the Automation Account
| where StreamType_s == "Output"
| project TimeGenerated, ResultDescription, JobId_g
| parse ResultDescription with "Directory Size : " ['Actual Size'] " " * "percentage used is " ['Percentage used']
| extend ['Percentage used'] = toreal(['Percentage used'])
| top 1 by TimeGenerated desc
// The value is on a 0-100 scale; change 0.1 to the percentage at which you want to be alerted
| where ['Percentage used'] > (0.1)

This gives you an overview of the percentage used.

ZohebShaikh_16-1624957696646.png

Alert configuration using Log Analytics can be done as shown in the screenshot below.

You can define the threshold at which you want to be alerted.

ZohebShaikh_17-1624957696715.png

2. More detailed review on the Root Cause of the issue

In this step we needed to identify why so many devices were being registered every day.

We exported the list of all registered devices in AAD to Excel and filtered it by registration type.

Thanks to Claudiu Dinisoara & Turgay Sahtiyan for helping create a nice dashboard in Power BI based on these logs, which helped us understand the root cause much better.

This dashboard showed us the types of device registrations and the overall count across the years. We found that there had been a significant increase in AAD device registrations due to the Intune rollout across the organization.

ZohebShaikh_18-1624957696731.png

We checked with the customer's Intune support team, and they confirmed that this increase was expected.

3. Creating a baseline for the number of AAD objects we can have

The trickiest part of this issue was coming up with a baseline number for AAD objects.

The customer had approximately 100k users, and we came up with the table below for the baseline.

Please note that these numbers were specific to this customer's scenario and discussions, and they may differ for your organization.

Object               Count    Why this number
Users                105000   The total number of production users is 100k; the other 5,000 could be used for administration, users pending deletion, or guest users.
Groups               60000    We felt 60k is a high number, but they were using groups extensively for Intune and other policy management tasks. We recommended that they work on reducing this number in future.
Devices              200000   We assumed there will be 2 devices registered per user (mobile and laptop) plus a few stale devices.
Contacts             16000    These objects were already low, so we considered the present values as the baseline.
Applications         1500     These objects were already low, so we considered the present values as the baseline.
Service Principals   3000     These objects were already low, so we considered the present values as the baseline.
Roles                100      These objects were already low, so we considered the present values as the baseline.
TOTAL                385600   Sum of the rows above.

We compared the expected baseline with their Quota limit of 500,000 and came up with a strategy for cleanups and for keeping the object counts within the baseline.
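The comparison itself is simple arithmetic. As a minimal sketch using the baseline figures from the table and the 500,000 limit mentioned above, the snippet below adds up the baseline and reports the headroom left under the quota. The variable names are illustrative only.

```powershell
# Baseline figures taken from the table above (specific to this customer).
$baseline = [ordered]@{
    Users                = 105000
    Groups               = 60000
    Devices              = 200000
    Contacts             = 16000
    Applications         = 1500
    'Service Principals' = 3000
    Roles                = 100
}

# Quota limit of the tenant in this case
$quotaLimit = 500000

# Sum the baseline and compare it with the quota
$total    = ($baseline.Values | Measure-Object -Sum).Sum
$headroom = $quotaLimit - $total
$percent  = [math]::Round($total / $quotaLimit * 100, 2)

"Baseline total: $total; headroom: $headroom; share of quota: $percent%"
```

For these figures the baseline adds up to 385,600 objects, which leaves 114,400 objects of headroom under the 500,000 limit.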

4. Increasing the Object Quota based on the baseline created, if needed

Their AAD Quota limit was 500,000 objects, whereas our baseline indicated that they needed around 400,000 objects.

Hope this helps,

Zoheb

Disclaimer
The sample scripts are not supported under any Microsoft standard support program or service. The sample scripts are provided AS IS without warranty of any kind. Microsoft further disclaims all implied warranties including, without limitation, any implied warranties of merchantability or of fitness for a particular purpose. The entire risk arising out of the use or performance of the sample scripts and documentation remains with you. In no event shall Microsoft, its authors, or anyone else involved in the creation, production, or delivery of the scripts be liable for any damages whatsoever (including, without limitation, damages for loss of business profits, business interruption, loss of business information, or other pecuniary loss) arising out of the use of or inability to use the sample scripts or documentation, even if Microsoft has been advised of the possibility of such damages.

 

This article was originally published by Microsoft's Secure Blog. You can find the original article here.