Recently I was engaged with a customer who ran into the following scenario.
- The customer had forced a query that was inside a stored procedure using the Query Store. The plan was manually forced.
- The customer had a 2 node AlwaysOn Availability Group to provide high availability.
- The customer failed over from one node to another as part of a maintenance window.
When the customer failed over, the forced plan no longer showed that it was being executed. In the sys.query_store_query DMV, for the exact same query_text_id, a new entry had been created with a different query_id. This new query also had a new query_plan associated with it and that plan was not forced because the query store saw it as a new query with a new plan.
For a new entry to be inserted into the sys.query_store_query DMV, at least one of five things must be different about the query:
After the failover, the one thing that was different between the original entry from the old primary and the new entry from the new primary was the batch_sql_handle column. As documented at https://docs.microsoft.com/en-us/sql/relational-databases/system-catalog-views/sys-query-store-query-transact-sql?view=sql-server-2017 the batch_sql_handle is only populated if a temporary table or table variable is used in the query. In this case, there was a temporary table used by the query that had its plan forced, so that did play a role.
Why would the simple act of failing over an AlwaysOn Availability Group create a different value for the batch_sql_handle, which caused a new query to get created therefore essentially undoing the forced plan that was put in place on the original primary?
The answer to that question is that the batch_sql_handle encodes the database id into the handle value. If the database id is different between the primary and secondary replica, and you have a forced plan that involves a temporary table or table variable, then you essentially create a new query that has a new plan. The original plan remains forced, but it will never be used until you fail back over to the original primary.
The following screenshots demonstrate this situation.
You can see we have a query in the query store that is query_id 2. This query contains the values for the 5 critical components that compose a unique query in the sys.query_store_query DMV. At this point we failed over and reran the stored procedure.
Notice there is a slight difference in the batch_sql_handle value. 0x0A vs 0x0C. That is the hex encoding for database ID 10 vs Database ID 12. If these databases had the same database ID on both replicas, then the batch_sql_handle values would have been identical between the replicas and there would not have been a new query entry for this query. This is very important when you force plans that involve temporary tables or table variables as it will not use the original query when you fail over which means that the new query generates a new plan and the forced plan will not be used until you fail back to the original primary.
When you are restoring databases to a secondary replica, we would advise you to restore those databases in the order that they were created on the primary replica to keep the database IDs in sync between the replicas. This has two benefits:
- Forced plans that use temporary tables or table variables will have the same batch_sql_handle value and therefore will continue to be treated as the same query between replicas keeping those forced plans in use.
- A secondary benefit is that the same databases that have parallel redo enabled for them on a primary replica will be consistent with the secondary replica. This is documented at https://blogs.msdn.microsoft.com/sql_server_team/sql-server-20162017-availability-group-secondary-replica-redo-model-and-/
If you are already in a scenario where the database IDs mismatch between the primary and secondary replica and you are encountering the behavior of the forced plan as described in this blog, if the forced plan has been recreated with the new query, you may have to consider forcing both of the equivalent plans for each query. This option is only valid if the equivalent already forced plan from the first replica has been reproduced on the second replica.
Thank you reading the article. I hope this will help explain why forced plan behavior can vary in an AlwaysOn Availability Group scenario and by understanding database ID orders importance you can proactively prevent this scenario from occurring.