Avail the power of Microsoft Fabric from within Azure Machine Learning

Unveiling the Public Preview of Azure OneLake datastore.

Microsoft Fabric, now generally available, is the all-in-one analytics solution for enterprises, offering a comprehensive suite of services, including data engineering and data integration, all in one place. OneLake is where customers can grow their data gravity by unifying their data across clouds, domains, and accounts.

To help customers build custom ML models and LLMs grounded on their data in OneLake, we're building a native integration with Azure Machine Learning (AzureML) to support MLOps at enterprise scale. The integration is not limited to reading data from OneLake during training jobs; it also supports writing back, so that modified or featurized datasets stay together with the input data.

Microsoft Fabric customers have diverse business needs, data, and AI use cases. Previously, customers that needed to connect to data in OneLake from AzureML had to write custom connectors to inject data into their AI solutions. Now, AzureML provides an out-of-the-box data connector in the form of the “OneLake datastore” to help customers directly access data from OneLake.

When to use Data Import or OneLake shortcuts

We see two kinds of customer goals where data versioning and lineage tracking using external data sources with AzureML is key:

  • Goal: Reproducibility and auditability. When this is the primary goal for training, using the same original dataset is critical to reproduce results and debug models. To achieve this, “Data Import” can be used to connect to data sources located outside of Azure and make that data available to AzureML services.

Learn more about data import here.

  • Goal: Versioning data from external data sources without the additional tax of multiple copies of data. Maintaining “one source of truth” and pointing training jobs directly at external sources is the primary objective for these customers. To achieve this, customers can leverage shortcuts in OneLake to virtualize data from sources like Amazon S3, Azure Data Lake Storage Gen2, and Dataverse, with more sources coming soon. With this capability, customers 1) create a shortcut in OneLake pointing to their external data source (e.g., an Amazon S3 bucket) and then 2) in AzureML, create a OneLake datastore pointing to their Lakehouse in Fabric. Once the datastore is created, the shortcuts can be accessed as though they were files in the Lakehouse, just by providing relative paths from the root Lakehouse artifact.

What is a datastore in AzureML?

A datastore in AzureML is an entity that references an underlying storage or data source and contains the connection information and credentials needed to access the referenced store. This provides an abstraction layer over different types of data sources and exposes them through a common, easy-to-use interface. Learn more about datastores here.

What is a OneLake datastore?

OneLake datastore is a datastore entity referencing OneLake artifacts. Right now, only Lakehouse-type artifacts are supported in this type of datastore in AzureML.

Think of this as a “pointer” to the Lakehouse artifact in one's Microsoft Fabric instance. As of this public preview release, OneLake datastore supports both credential-less and service principal-based authentication. The feature is available via CLI and SDK today, with UI support coming soon.

Since OneLake datastore is a pointer to the root folder (or, artifact) for the Lakehouse, you can access any file or folder by referencing it using the relative path. You just need the following information from the Fabric instance in order to create a OneLake datastore:

  • Fabric workspace name or GUID

  • OneLake endpoint

  • Lakehouse artifact name or GUID

Below, we explore how to get the above information and create a datastore using either the CLI or the SDK.

OneLake workspace name

In your Microsoft Fabric instance, you can find the workspace information as shown in this screenshot. You can use either a GUID value, or a “friendly name” to create an Azure Machine Learning OneLake datastore.

[Screenshots: locating the workspace name and GUID in Microsoft Fabric]

OneLake endpoint

In your Microsoft Fabric instance, you can find the endpoint information as shown in this screenshot:

[Screenshots: locating the OneLake endpoint in Microsoft Fabric]

OneLake artifact name

In your Microsoft Fabric instance, you can find the artifact information as shown in this screenshot. You can use either a GUID value or a “friendly name” to create an Azure Machine Learning OneLake datastore:

[Screenshots: locating the Lakehouse artifact name and GUID in Microsoft Fabric]

Once you have the above information, log in to your Azure ML workspace from the CLI and use this YAML, substituting the values obtained above from your Microsoft Fabric instance:

[Image: YAML definition of the OneLake datastore]
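Since the feature is also available via the SDK, the same datastore can be sketched in Python. This is a minimal sketch, not the original YAML: it assumes the `azure-ai-ml` package and its `OneLakeDatastore` and `OneLakeArtifact` entities, and an authenticated `MLClient` passed in by the caller. The helper names and the workspace and Lakehouse names in the usage lines are hypothetical placeholders; the endpoint is the one that appears in the ABFSS example later in this post.

```python
def onelake_datastore_settings(name, workspace, endpoint, artifact):
    """Collect the three Fabric values plus a datastore name, rejecting blanks."""
    settings = {
        "name": name,
        "one_lake_workspace_name": workspace,  # friendly name or GUID
        "endpoint": endpoint,
        "artifact_name": artifact,  # Lakehouse friendly name or GUID
    }
    missing = [key for key, value in settings.items() if not str(value).strip()]
    if missing:
        raise ValueError(f"missing values for: {', '.join(missing)}")
    return settings


def register_onelake_datastore(ml_client, settings):
    """Create or update the datastore in the AzureML workspace.

    Assumption: azure-ai-ml is installed. It is imported lazily so the
    settings helper above can be used without it. No credentials are passed,
    which corresponds to the credential-less option.
    """
    from azure.ai.ml.entities import OneLakeArtifact, OneLakeDatastore

    store = OneLakeDatastore(
        name=settings["name"],
        one_lake_workspace_name=settings["one_lake_workspace_name"],
        endpoint=settings["endpoint"],
        artifact=OneLakeArtifact(name=settings["artifact_name"], type="lake_house"),
    )
    return ml_client.create_or_update(store)


# Hypothetical workspace and Lakehouse names, for illustration only.
settings = onelake_datastore_settings(
    "onelake_example_id",
    "my-fabric-workspace",
    "msit-onelake.dfs.fabric.microsoft.com",
    "my-lakehouse",
)
```

Calling `register_onelake_datastore(ml_client, settings)` with a client for your workspace would then register the datastore under the name “onelake_example_id”.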

The following section shows creation of a data asset pointing to a OneLake datastore.

Below is what your OneLake – Lakehouse instance looks like. You can see that there is a file “iris.csv” and a folder named “for1Lake,” which is actually an Amazon S3 shortcut.

[Screenshot: Lakehouse contents showing the file “iris.csv” and the shortcut folder “for1Lake”]

Now that we have created the datastore with the name “onelake_example_id”, the above file and folder can be referenced as follows:

iris_path = "azureml://datastores/onelake_example_id/paths/Files/iris.csv"

s3_shortcut_path = "azureml://datastores/onelake_example_id/paths/Files/for1Lake/"
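Because every item is addressed by its path relative to the Lakehouse root, these URIs can be composed rather than typed by hand. A small sketch, assuming only the short-form `azureml://datastores/<name>/paths/<relative path>` pattern used above; the helper name `datastore_uri` is hypothetical:

```python
def datastore_uri(datastore, relative_path):
    """Return the short-form azureml:// URI for a path under a datastore root."""
    return f"azureml://datastores/{datastore}/paths/{relative_path.lstrip('/')}"


# A plain file and the Amazon S3 shortcut folder are addressed the same way.
iris_path = datastore_uri("onelake_example_id", "Files/iris.csv")
s3_shortcut_path = datastore_uri("onelake_example_id", "Files/for1Lake/")
```

The shortcut needs no special handling: from AzureML's side it is just another folder under “Files”.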

```yaml
# Onelake_fileds.yaml
$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
type: uri_file
name: "my_onelake_fds"
version: "1"
description: "My Onelake file dataset"
path: "azureml://subscriptions/my-subscription/resourcegroups/my-resourcegroup/workspaces/my-workspace/datastores/onelake_example_id/paths/Files/iris.csv"
```

```cli
az ml data create -f Onelake_fileds.yaml
```

```yaml
# Onelake_shortcut.yaml
$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
type: uri_folder
name: "my_onelake_shrtcutds"
version: "1"
description: "My Onelake folder shortcut dataset"
path: "azureml://subscriptions/my-subscription/resourcegroups/my-resourcegroup/workspaces/my-workspace/datastores/onelake_example_id/paths/Files/for1Lake/"
```

```cli
az ml data create -f Onelake_shortcut.yaml
```
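The same two data assets could also be created from Python. The following is a sketch under assumptions rather than a prescribed method: it assumes the `azure-ai-ml` package, its `Data` entity, and an authenticated `MLClient` supplied by the caller. The helper names `data_asset_spec` and `create_data_asset` are hypothetical.

```python
def data_asset_spec(name, kind, path, version="1", description=""):
    """Mirror the YAML fields above as a plain dict, validating the asset type."""
    if kind not in ("uri_file", "uri_folder"):
        raise ValueError(f"unsupported asset type: {kind}")
    return {
        "name": name,
        "type": kind,
        "path": path,
        "version": version,
        "description": description,
    }


def create_data_asset(ml_client, spec):
    """Register the asset in the workspace (assumes azure-ai-ml is installed)."""
    from azure.ai.ml.entities import Data  # lazy import, see lead-in

    return ml_client.data.create_or_update(Data(**spec))


file_spec = data_asset_spec(
    "my_onelake_fds",
    "uri_file",
    "azureml://datastores/onelake_example_id/paths/Files/iris.csv",
    description="My Onelake file dataset",
)
shortcut_spec = data_asset_spec(
    "my_onelake_shrtcutds",
    "uri_folder",
    "azureml://datastores/onelake_example_id/paths/Files/for1Lake/",
    description="My Onelake folder shortcut dataset",
)
```

Passing either spec to `create_data_asset(ml_client, spec)` would register it. The short-form datastore URI is used here for brevity; the fully qualified form from the YAML should work the same way.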

To summarize, you can create a data asset pointing to a shortcut just as you would create a file or folder data asset in AzureML pointing to your Fabric instance. Once the data asset is created, it is readily available for use in any AzureML training job. Learn more on this here.

Additional Scenarios:

Accessing via Azure Blob File System (ABFSS) URI

[Screenshot: ABFSS path in Microsoft Fabric]

Code snippet:

Substitute the ABFSS URI or path below:

[Image: code snippet using the ABFSS URI]
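The original snippet is not reproduced here. As a stand-in, this sketch only shows how such a URI is composed, assuming the `abfss://<workspace>@<endpoint>/<artifact>/<relative path>` layout seen in the example URI in the Delta section of this post. The function name is hypothetical, and the GUIDs are copied from that example.

```python
def onelake_abfss_uri(workspace, endpoint, artifact, relative_path):
    """Compose an ABFSS URI from the three Fabric values and a relative path."""
    return f"abfss://{workspace}@{endpoint}/{artifact}/{relative_path.lstrip('/')}"


abfss_uri = onelake_abfss_uri(
    "88aa174e-6310-4634-bfcb-5761e1a1190a",  # workspace GUID
    "msit-onelake.dfs.fabric.microsoft.com",  # OneLake endpoint
    "012b70e5-7f37-4174-a807-15a99e0d0392",  # Lakehouse artifact GUID
    "Files/delta-table",
)
```

These are the same three values needed to create the datastore, so an ABFSS URI is effectively the datastore definition and the relative path written out in one string.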

Apache Delta format support

AzureML supports the Delta format as well, and you can load the “delta-table” root folder URL using either of these patterns:

  1. “azureml://” pattern

E.g.: delta_table_url = "azureml://subscriptions/<subscription-id>/resourcegroups/<resource-group>/workspaces/my-ws/datastores/my-onelake-ds/paths/Files/delta-table"

  2. “abfss://” pattern

E.g.: delta_table_url = "abfss://88aa174e-6310-4634-bfcb-5761e1a1190a@msit-onelake.dfs.fabric.microsoft.com/012b70e5-7f37-4174-a807-15a99e0d0392/Files/delta-table"

You can use the delta table URL (any of the above) and substitute in the code snippet below:
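One possible way to load such a table is sketched below. It rests on assumptions not confirmed by this post: that the `mltable` package is installed, that its `from_delta_lake` function accepts these URLs, and that the caller's identity can read the Lakehouse. The wrapper name `load_delta_table` is hypothetical.

```python
def load_delta_table(delta_table_url, timestamp_as_of=None):
    """Read a Delta table into a pandas DataFrame via mltable.

    Assumption: the mltable package is installed; it is imported lazily.
    timestamp_as_of is an optional timestamp string for reading the table
    as of a point in time; when omitted it is not passed to mltable.
    """
    from mltable import from_delta_lake

    kwargs = {"delta_table_uri": delta_table_url}
    if timestamp_as_of is not None:
        kwargs["timestamp_as_of"] = timestamp_as_of
    return from_delta_lake(**kwargs).to_pandas_dataframe()
```

With either `delta_table_url` value from above, `df = load_delta_table(delta_table_url)` would return the table contents, provided the assumptions hold.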

Quick recap

Fabric customers can now seamlessly connect and inject their data into their AI solutions via AzureML without having to write custom connectors.

Customers that have data in external sources can now utilize the power of Fabric Shortcuts for training directly from our platform without data movement.

Get started

To get started with Azure Machine Learning OneLake datastore, please visit the resources below where you can find detailed instructions to set up connections for external sources in your AzureML workspace and train or deploy models with a variety of AzureML examples.

If you want to learn more about Microsoft Fabric, consider:

Learn more about Azure Innovate, our new offering to help you adopt AI

 

This article was originally published by Microsoft's AI - Machine Learning Blog. You can find the original article here.