Optimizing ETL Workflows: A Guide to Azure Integration and Authentication with Batch and Storage

Introduction

When it comes to building a robust foundation for ETL (Extract, Transform, Load) pipelines, the trio of Azure Data Factory (ADF) or Azure Synapse Analytics, Azure Batch, and Azure Storage is indispensable. Together, these services enable efficient data movement, transformation, and processing across diverse data sources.

This document explains how to authenticate to Azure Batch with the Synapse System-Assigned Managed Identity (SAMI) and to Azure Storage with a User-Assigned Managed Identity (UAMI). This enables user-driven connectivity to Storage, facilitating data extraction. It also allows the use of custom activities, such as High-Performance Computing (HPC) workloads, to process the extracted data.

The key enabler of these functionalities is the Synapse Pipeline. Serving as the primary orchestrator, the Synapse Pipeline integrates various Azure resources in a secure manner. The same capabilities extend to Azure Data Factory (ADF), providing a broader scope of data management and transformation.

Through this guide, you will gain insights into leveraging these powerful Azure services to optimize your data processing workflows.

Services Overview

This procedure uses several services; each is described below.

Azure Synapse Analytics / Data Factory

  • Azure Synapse Analytics is an enterprise analytics service that accelerates time to insight across data warehouses and big data systems. Azure Synapse brings together the best of SQL technologies used in enterprise data warehousing, Spark technologies used for big data, Data Explorer for log and time series analytics, Pipelines for data integration and ETL/ELT, and deep integration with other Azure services such as Power BI, CosmosDB, and AzureML.

Azure Batch

Azure Storage

Managed Identities

  • Azure Managed Identities are a feature of Azure Active Directory that automatically manages credentials for applications to use when connecting to resources that support Azure AD authentication. They eliminate the need for developers to manage secrets, credentials, certificates, and keys.
  • There are two types of managed identities:
    • System-assigned: Tied to the lifecycle of the Azure resource it is enabled on.
    • User-assigned: A standalone Azure resource that can be assigned to your app.
  • Documentation: Managed identities for Azure resources | Microsoft Learn
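
To make the difference between the two identity types concrete, the sketch below builds the token request that code running on an Azure-hosted node would send to the Azure Instance Metadata Service (IMDS). It is illustrative only: the helper name `build_imds_token_request` is hypothetical, and whether a given host exposes IMDS for its identities is an assumption to verify in your environment. When no `client_id` is passed, IMDS uses the resource's default identity (the system-assigned one, when present); passing the client ID of a UAMI selects that identity explicitly.

```python
from urllib.parse import urlencode

# IMDS token endpoint; only reachable from inside an Azure-hosted VM or node.
IMDS_TOKEN_ENDPOINT = "http://169.254.169.254/metadata/identity/oauth2/token"


def build_imds_token_request(resource, client_id=None):
    """Return (url, headers) for a managed identity token request.

    resource:  audience of the token, e.g. "https://storage.azure.com/".
    client_id: client ID of a user-assigned identity (hypothetical value
               supplied by the caller); omit it to use the default identity.
    """
    params = {"api-version": "2018-02-01", "resource": resource}
    if client_id is not None:
        params["client_id"] = client_id
    url = f"{IMDS_TOKEN_ENDPOINT}?{urlencode(params)}"
    # IMDS rejects requests that do not carry this header.
    headers = {"Metadata": "true"}
    return url, headers


# Default (system-assigned) identity versus an explicit user-assigned identity.
sami_url, sami_headers = build_imds_token_request("https://storage.azure.com/")
uami_url, _ = build_imds_token_request(
    "https://storage.azure.com/", client_id="00000000-0000-0000-0000-000000000000"
)
```

The function only constructs the request; sending it (for example with `urllib.request`) works solely from inside Azure, which is why the sketch stops short of the network call.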
 

Scenario

Run an ADF / Synapse Pipeline that pulls a script from a Storage Account and executes it on the Batch nodes, using a User-Assigned Managed Identity (UAMI) to authenticate to Storage and the System-Assigned Managed Identity (SAMI) to authenticate to Batch.

Prerequisites

  • ADF / Synapse Workspace
  • User-Assigned Managed Identity (UAMI)
  • Storage Account
 

Procedure Overview

This procedure walks step by step through the following actions:

  • Create UAMI Credentials
  • Create Linked Services for Storage and Batch Accounts
  • Add UAMI and SAMI to Storage and Batch Accounts
  • Create, Configure and Execute an ADF / Synapse Pipeline
    • To avoid redundancy, the exercise and examples refer to ADF (Portal, Workspace, Pipelines, Jobs, Linked Services) as Synapse throughout.
  • Debugging

Procedure

Create UAMI Credentials

1. In your Synapse Portal, go to Manage -> Credentials -> New, fill in the details, and click Create.
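
For readers who prefer to review the result as code, the sketch below builds a JSON definition for a UAMI credential. The shape shown (`type: ManagedIdentity` with a `resourceId`) is an assumption about what the portal generates; compare it with the code view of your own credential before relying on it. The function name and all argument values are hypothetical.

```python
import json


def build_uami_credential(name, subscription_id, resource_group, identity_name):
    """Build an assumed JSON definition for a user-assigned managed identity credential."""
    # Standard ARM resource ID of a user-assigned managed identity.
    resource_id = (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.ManagedIdentity/userAssignedIdentities/{identity_name}"
    )
    return {
        "name": name,
        "properties": {
            "type": "ManagedIdentity",
            "typeProperties": {"resourceId": resource_id},
        },
    }


# Placeholder values; replace them with your own subscription, group, and identity.
credential = build_uami_credential(
    "uami-credential", "<subscription-id>", "<resource-group>", "<identity-name>"
)
print(json.dumps(credential, indent=2))
```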

Create Linked Services Connections for Storage and Batch

2. In your Synapse Portal, go to Manage -> Linked Services -> New -> Azure Blob Storage -> Continue and complete the form:

a. Authentication Type: UAMI

b. Azure Subscription: Select your subscription

c. Storage Account name: Select the account where the script is stored

d. Credentials: Select the credential created in Step 1

e. Click Create

3. In the Azure Portal, go to your Batch Account -> Keys and copy the Batch Account name and Account Endpoint for use in the next step. Also copy the name of the Pool used for this example.

4. In your Synapse Portal, go to Manage -> Linked Services -> New -> Azure Batch -> Continue and fill in the information:

a. Authentication Method: SAMI (copy the Managed Identity name; it is needed later when assigning RBAC roles)

b. Account Name, Batch URL and Pool Name: Paste the values copied in Step 3

c. Storage linked service name: Select the one created in Step 2

5. Publish all your changes
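
The two linked services configured above can also be read as JSON definitions. The sketch below builds both as Python dictionaries. It rests on assumptions: the property names follow the publicly documented `AzureBlobStorage` and `AzureBatch` linked service types as I understand them, and the Batch definition assumes that SAMI authentication is expressed by omitting both the access key and a credential reference. All names passed in are hypothetical; check the code view in your workspace for the authoritative shape.

```python
import json


def linked_service_ref(name):
    """Reference to another linked service by name."""
    return {"referenceName": name, "type": "LinkedServiceReference"}


def build_storage_linked_service(name, storage_account, credential_name):
    """Blob Storage linked service that authenticates through a UAMI credential."""
    return {
        "name": name,
        "properties": {
            "type": "AzureBlobStorage",
            "typeProperties": {
                "serviceEndpoint": f"https://{storage_account}.blob.core.windows.net/",
                "credential": {
                    "referenceName": credential_name,
                    "type": "CredentialReference",
                },
            },
        },
    }


def build_batch_linked_service(name, account_name, batch_uri, pool_name, storage_ls_name):
    """Batch linked service; no accessKey or credential (assumed to mean SAMI)."""
    return {
        "name": name,
        "properties": {
            "type": "AzureBatch",
            "typeProperties": {
                "accountName": account_name,
                "batchUri": batch_uri,
                "poolName": pool_name,
                "linkedServiceName": linked_service_ref(storage_ls_name),
            },
        },
    }


storage_ls = build_storage_linked_service("StorageLS", "<storage-account>", "uami-credential")
batch_ls = build_batch_linked_service(
    "BatchLS", "<batch-account>", "<account-endpoint-url>", "<pool-name>", "StorageLS"
)
print(json.dumps([storage_ls, batch_ls], indent=2))
```

Note how the Batch definition references the Storage linked service by name: this mirrors the "Storage linked service name" field in the form.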

Adding UAMI RBAC Roles to Storage Account

6. In the Azure Portal, go to your Storage Account -> Access Control (IAM)

a. Click Add, then Add role assignment, search for “Storage Blob Data Contributor”, select it, and click Next.

b. Choose Managed Identity, select your UAMI, click Select, and then click Next, Next, and Review + assign.

Adding SAMI RBAC Roles to Batch Account

7. In the Azure Portal, go to your Batch Account -> Access Control (IAM)

a. Click Add, then Add role assignment.

b. Open the “Privileged administrator roles” tab, choose the Contributor role, and click Next.

c. Choose Managed Identity, filter the Managed Identity list by “Synapse workspace”, select the SAMI noted in Step 4a, click Select, and then click Next, Next, and Review + assign.
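
The two role assignments (Storage Blob Data Contributor for the UAMI on the Storage Account, Contributor for the SAMI on the Batch Account) can also be scripted. The sketch below only assembles the argument lists for `az role assignment create`; it does not run the Azure CLI. The helper names and every ID are hypothetical placeholders, and the object IDs of the two identities must be looked up in your own tenant.

```python
import shlex


def storage_account_scope(subscription_id, resource_group, account):
    """ARM resource ID of a Storage Account, used as the role assignment scope."""
    return (
        f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Storage/storageAccounts/{account}"
    )


def batch_account_scope(subscription_id, resource_group, account):
    """ARM resource ID of a Batch Account, used as the role assignment scope."""
    return (
        f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Batch/batchAccounts/{account}"
    )


def role_assignment_command(principal_object_id, role, scope):
    """Argument list for `az role assignment create` targeting a managed identity."""
    return [
        "az", "role", "assignment", "create",
        "--assignee-object-id", principal_object_id,
        "--assignee-principal-type", "ServicePrincipal",
        "--role", role,
        "--scope", scope,
    ]


commands = [
    # UAMI -> Storage Blob Data Contributor on the Storage Account
    role_assignment_command(
        "<uami-object-id>", "Storage Blob Data Contributor",
        storage_account_scope("<subscription-id>", "<resource-group>", "<storage-account>"),
    ),
    # Synapse SAMI -> Contributor on the Batch Account
    role_assignment_command(
        "<sami-object-id>", "Contributor",
        batch_account_scope("<subscription-id>", "<resource-group>", "<batch-account>"),
    ),
]
for cmd in commands:
    print(shlex.join(cmd))
```

Building the argument list rather than a single string keeps role names that contain spaces intact if the list is later passed to `subprocess.run`.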

 

Adding UAMI to Batch Pool

If you need to create a new Batch Pool, you can follow the following procedure:

8. If you already have a Batch Pool, follow these steps:

a. In the Azure Portal, go to your Batch Account -> Pools -> choose your Pool -> Identity.

b. Click Add, choose the required UAMI, and click Add. (In this example, two identities were selected: the one used by the Synapse linked service for Storage and another one used for other integrations.)

Important: If your Batch Pool uses multiple UAMIs (for example, to connect to Key Vault or other services), first remove the existing identity and then add all of them together.

c. Scale the Pool in and then back out to apply the changes.
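
The note about adding all identities together is easier to see in the pool's identity block, where the full set is stated at once rather than appended to. The sketch below merges existing and new UAMI resource IDs and builds that block in the common ARM `identity` shape. Treat it as an illustration under assumptions: the helper names are hypothetical, and the resource IDs are placeholders.

```python
def merge_identity_ids(existing_ids, new_ids):
    """Combine resource IDs, keeping order and dropping case-insensitive duplicates."""
    seen = set()
    merged = []
    for resource_id in list(existing_ids) + list(new_ids):
        key = resource_id.lower()  # ARM resource IDs compare case-insensitively
        if key not in seen:
            seen.add(key)
            merged.append(resource_id)
    return merged


def build_pool_identity(uami_resource_ids):
    """Identity block listing every user-assigned identity the pool should carry."""
    if not uami_resource_ids:
        raise ValueError("at least one user-assigned identity is required")
    return {
        "type": "UserAssigned",
        "userAssignedIdentities": {rid: {} for rid in uami_resource_ids},
    }


# Placeholder IDs: one identity already on the pool, one being added for Storage.
existing = ["<resource-id-of-key-vault-uami>"]
added = ["<resource-id-of-storage-uami>"]
identity_block = build_pool_identity(merge_identity_ids(existing, added))
```

Because the block is declarative, leaving an identity out of the list is equivalent to removing it from the pool, which is why the full set has to be supplied each time.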

Setting up the Pipeline

9. In your Synapse Portal, go to Integrate -> Add New Resource -> Pipeline

10. In the Activities panel, expand Batch Service and drag the Custom activity onto the canvas.

11. In the Azure Batch tab of the Custom activity, open the Azure Batch linked service list, select the one created in Step 4, and test the connection (if you receive a connection error, see Scenario 1).

12. Go to the Settings tab and add your script. For this example, we will use a PowerShell script previously uploaded to a Storage Blob container and send the output to a text file.

a. Command: your script details

b. Resource linked service: The Storage linked service configured in Step 2

c. Browse Storage: Look up the container where your script was uploaded

d. Publish your changes and run a Debug
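
The Custom activity configured in the Settings tab corresponds to a small JSON fragment inside the pipeline definition. The sketch below builds that fragment as a Python dictionary. The property names (`command`, `resourceLinkedService`, `folderPath`) follow the documented Custom activity type as I understand it; the activity name, linked service names, command, and folder path are hypothetical placeholders, not the values used in this walkthrough.

```python
import json


def linked_service_ref(name):
    """Reference to a linked service by name."""
    return {"referenceName": name, "type": "LinkedServiceReference"}


def build_custom_activity(name, batch_ls_name, storage_ls_name, command, folder_path):
    """Custom activity that runs `command` on the Batch pool behind `batch_ls_name`.

    The files under `folder_path` in the Storage linked service are copied to
    the node before the command runs.
    """
    return {
        "name": name,
        "type": "Custom",
        "linkedServiceName": linked_service_ref(batch_ls_name),
        "typeProperties": {
            "command": command,
            "resourceLinkedService": linked_service_ref(storage_ls_name),
            "folderPath": folder_path,
        },
    }


activity = build_custom_activity(
    name="RunScriptOnBatch",
    batch_ls_name="BatchLS",
    storage_ls_name="StorageLS",
    command="<your script command>",
    folder_path="<container>/<folder-with-script>",
)
print(json.dumps(activity, indent=2))
```

The activity carries two linked service references: one selects where the command runs (Batch) and the other selects where the script is pulled from (Storage).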

Debugging

13. Check the Synapse job logs and outputs

a. Copy the Activity Run ID

b. In the Azure Portal, go to your Storage Account -> Containers -> adfjobs -> select the folder named after the Activity Run ID -> output.

c. Here you will find two files, “stderr.txt” and “stdout.txt”, which contain the errors and the output of the commands executed during the task.

14. Check the Batch logs and outputs. There are two ways to get the Batch logs:

a. Over Nodes: In the Azure Portal, go to your Batch Account -> Pools -> choose your Pool -> Nodes. In the folder details, go to the folder for this Synapse execution -> job-x -> look up the Activity Run ID.

b. Over Jobs: In the Azure Portal, go to your Batch Account -> Jobs -> select the job named adfv2-yourPoolName -> click the Task whose ID matches the Activity Run ID of the Synapse Pipeline from Step 13a.
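
When the same lookup has to be repeated for many runs, it helps to compute the log locations from the Activity Run ID instead of browsing to them. The sketch below derives the blob paths of the two log files and the Batch job name from the naming described above (`adfjobs/<Activity Run ID>/output/` and `adfv2-<pool name>`). The helper names and the sample values are hypothetical, and the layout is assumed to match what you see in your own Storage Account.

```python
def activity_log_blobs(activity_run_id):
    """Paths of the stdout/stderr blobs for one activity run, container included."""
    base = f"adfjobs/{activity_run_id}/output"
    return {
        "stdout": f"{base}/stdout.txt",
        "stderr": f"{base}/stderr.txt",
    }


def blob_url(storage_account, blob_path):
    """Full blob endpoint URL for a path of the form <container>/<blob>."""
    return f"https://{storage_account}.blob.core.windows.net/{blob_path}"


def batch_job_name(pool_name):
    """Name of the Batch job that holds the tasks created for a given pool."""
    return f"adfv2-{pool_name}"


# Placeholder values; use the Activity Run ID copied from the pipeline run.
logs = activity_log_blobs("<activity-run-id>")
stderr_url = blob_url("<storage-account>", logs["stderr"])
job = batch_job_name("<pool-name>")
```

Reading the blobs still requires an identity with access to the Storage Account, such as one holding the Storage Blob Data Contributor role assigned earlier.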

What we have learned

In this walkthrough we covered:

  • Authentication: Utilizing User Assigned Managed Identities (UAMI) and System Assigned Managed Identity (SAMI) for secure connections.
  • Linked Services: Creation and configuration of linked services for Azure Storage and Azure Batch accounts.
  • Pipeline Execution: Steps to create, configure, and execute an ADF/Synapse Pipeline, emphasizing the use of Synapse as a unified term to avoid redundancy.
  • Debugging: Detailed instructions for creating credentials, adding RBAC roles, and setting up pipelines, along with debugging tips.
  • Logs Analysis: How to access and analyze Synapse job logs and Azure Batch logs.
  • Error Handling: Understanding the significance of the “stderr.txt” and “stdout.txt” files in identifying and resolving errors during task execution.

If you have any questions or feedback, please leave a comment below!

 

This article was originally published by Microsoft's Azure Data Factory Blog. You can find the original article here.