Summary
Often, when creating reproducible Machine Learning pipelines (see blog article: AML Pipelines), the need arises to transfer data between various data stores. This article shows an architecture for performing data transfer and links to a GitHub repository containing a code walkthrough of the architecture.
Pipeline Architecture for Transferring Data
(Figure: AML Pipeline view of the data transfer architecture)
Code Walkthrough
Code for the above architecture: AML Data Transfer GitHub
| Step | Notebook | Description |
|---|---|---|
| 1 | 01. Transfer Data Configuration.ipynb | Configure the components needed to perform a data transfer in the next notebook (sketched below) |
| 2 | 02. Transfer Data.ipynb | Transfer data from Blob Storage to Azure SQL Database using an existing Azure Data Factory |
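As a rough sketch of what the configuration notebook sets up, the first step registers the source and destination datastores and attaches the existing Data Factory as a compute target. The datastore names, account details, and factory name below are placeholders, not the repository's actual values:

```python
from azureml.core import Workspace
from azureml.core.compute import ComputeTarget, DataFactoryCompute
from azureml.core.datastore import Datastore

ws = Workspace.from_config()

# Register the source Blob Storage container as a datastore
# (account, container, and key are placeholders)
blob_datastore = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name='source_blob',
    container_name='mycontainer',
    account_name='mystorageaccount',
    account_key='<storage-account-key>')

# Register the destination Azure SQL Database as a datastore,
# here authenticating with a service principal (IDs/secret are placeholders)
sql_datastore = Datastore.register_azure_sql_database(
    workspace=ws,
    datastore_name='dest_sql',
    server_name='mysqlserver',
    database_name='mydatabase',
    tenant_id='<tenant-id>',
    client_id='<client-id>',
    client_secret='<client-secret>')

# Attach an existing Azure Data Factory as the compute target
# that will carry out the copy
adf_config = DataFactoryCompute.attach_configuration(
    resource_group='my-resource-group',
    factory_name='my-data-factory')
adf_compute = ComputeTarget.attach(ws, 'adf-compute', adf_config)
adf_compute.wait_for_completion()
```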
DataTransferStep
A Data Transfer Step is used within an Azure ML Pipeline to transfer data between locations using an Azure Data Factory. Currently supported sources and destinations include:
| Data Store | Source | Destination |
|---|---|---|
| Azure Blob Storage | Yes | Yes |
| Azure Data Lake Storage Gen 1 | Yes | Yes |
| Azure Data Lake Storage Gen 2 | Yes | Yes |
| Azure SQL Database | Yes | Yes |
| Azure Database for PostgreSQL | Yes | Yes |
| Azure Database for MySQL | Yes | Yes |
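The step definition below assumes that source and destination data references already exist. A minimal sketch of creating them, assuming the `blob_datastore` and `sql_datastore` registrations from earlier and placeholder path and table names:

```python
from azureml.data.data_reference import DataReference
from azureml.data.sql_data_reference import SqlDataReference

# Source: a file in the registered Blob datastore (path is a placeholder)
blob_data_ref = DataReference(
    datastore=blob_datastore,
    data_reference_name='blob_data_ref',
    path_on_datastore='input/data.csv')

# Destination: a table in the registered Azure SQL datastore
# (table name is a placeholder)
sql_query_data_ref = SqlDataReference(
    datastore=sql_datastore,
    data_reference_name='sql_query_data_ref',
    sql_table='MyTable')
```

With the references in place, the DataTransferStep itself is created: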
```python
from azureml.pipeline.steps import DataTransferStep

datatransferstep_name = 'transfer_blob_to_sql'

data_transfer_step = DataTransferStep(
    name=datatransferstep_name,
    source_data_reference=blob_data_ref,
    destination_data_reference=sql_query_data_ref,
    compute_target=adf_compute,
    source_reference_type='file',       # source is a single file, not a directory
    # destination_reference_type=None,  # inferred from the destination reference
    allow_reuse=False)

print("Data transfer step created")
```
Running a DataTransferStep
A Data Transfer Step consists of a source DataStore and a destination DataStore. For a walkthrough of DataStores, see Dealing with Data in AML. The Data Transfer Step is then added to a Pipeline and executed through an Experiment, as sketched below.
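A minimal sketch of that last part, assuming the `data_transfer_step` defined above and a placeholder experiment name:

```python
from azureml.core import Experiment, Workspace
from azureml.pipeline.core import Pipeline

ws = Workspace.from_config()

# Build a one-step pipeline around the transfer step and submit it
pipeline = Pipeline(workspace=ws, steps=[data_transfer_step])
experiment = Experiment(ws, 'data-transfer-demo')  # placeholder name
pipeline_run = experiment.submit(pipeline)
pipeline_run.wait_for_completion(show_output=True)
```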
Conclusion
Azure Machine Learning service provides Pipelines as a mechanism to automate Machine Learning processes. Within those Pipelines, as this article demonstrated, a DataTransferStep can be used to transfer data between two data stores using an Azure Data Factory.