Important Azure Data Factory Components – An Overview

In my previous blog, we created an Azure Data Factory instance and discussed the benefits of using it. We saw how Azure Data Factory allows users to transform their raw data into actionable insights. In this blog, I shall introduce you to the major Azure Data Factory components involved in that process. So, let us start exploring the Data Factory components.

The following are the Azure Data Factory components that I shall be covering. Together, these components transform your source data and make it consumable by downstream applications.

  • Pipelines
  • Activities
  • Data flows
  • Datasets
  • Linked Services
  • Integration Runtimes
  • Triggers

Please Note: It is essential for you to have created a data factory instance to follow this blog. If you do not know how to create a data factory instance, please refer to the following blog:

Introduction to Azure Data Factory

To understand the Azure Data Factory components, we shall be looking at an example comprising a pipeline, two datasets, and one data flow, as shown below.

[Image: Azure Data Factory resources]

Pipelines

A pipeline is a logical grouping of activities that together perform a unit of work. A data factory can comprise multiple pipelines, with each pipeline containing multiple activities. The activities inside a pipeline can be structured to run sequentially or in parallel, depending on your needs. Pipelines allow users to easily manage and schedule a group of activities together.

[Image: Azure Data Factory resources – pipelines]
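To make this concrete, here is a minimal sketch of a pipeline in Azure Data Factory's JSON authoring format, containing a single Copy activity. The names (`CopySalesData`, `BlobSalesCsv`, `SqlSalesTable`) are hypothetical and chosen for illustration only:

```json
{
    "name": "CopySalesData",
    "properties": {
        "activities": [
            {
                "name": "CopyCsvToSql",
                "type": "Copy",
                "inputs": [
                    { "referenceName": "BlobSalesCsv", "type": "DatasetReference" }
                ],
                "outputs": [
                    { "referenceName": "SqlSalesTable", "type": "DatasetReference" }
                ],
                "typeProperties": {
                    "source": { "type": "DelimitedTextSource" },
                    "sink": { "type": "AzureSqlSink" }
                }
            }
        ]
    }
}
```

Notice that the pipeline itself only groups and orchestrates activities; the details of where the data lives are delegated to the referenced datasets.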

Activities

Activities are the individual processing steps inside a pipeline. Data Factory supports data movement, data transformation, and control activities. Activities can be executed both sequentially and in parallel.

[Image: Data Factory activities]
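Sequential execution is expressed through the `dependsOn` property of an activity. Below is a minimal, hypothetical sketch using two simple Wait activities (all names are invented for illustration); `SecondStep` runs only after `FirstStep` succeeds:

```json
{
    "name": "SequentialDemo",
    "properties": {
        "activities": [
            {
                "name": "FirstStep",
                "type": "Wait",
                "typeProperties": { "waitTimeInSeconds": 5 }
            },
            {
                "name": "SecondStep",
                "type": "Wait",
                "dependsOn": [
                    { "activity": "FirstStep", "dependencyConditions": [ "Succeeded" ] }
                ],
                "typeProperties": { "waitTimeInSeconds": 5 }
            }
        ]
    }
}
```

Activities without a `dependsOn` entry have no ordering constraint between them, which is how parallel execution is achieved.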

Data flows

These are special types of activities that allow data engineers to develop data transformation logic visually, without writing code. They are executed inside an Azure Data Factory pipeline on scaled-out Apache Spark clusters that the service manages for you. Azure Data Factory controls all the data flow execution and code translation, which lets it handle large volumes of data with ease.

[Image: Data Factory data flows]
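A data flow is invoked from a pipeline through an Execute Data Flow activity. A hypothetical activity fragment might look like the sketch below (the data flow name `TransformSalesFlow` is invented for illustration):

```json
{
    "name": "RunSalesTransform",
    "type": "ExecuteDataFlow",
    "typeProperties": {
        "dataFlow": {
            "referenceName": "TransformSalesFlow",
            "type": "DataFlowReference"
        }
    }
}
```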

Datasets

For the movement or transformation of data, you need to specify the configuration of the data itself. A dataset holds that configuration: it points to the data an activity consumes or produces, such as a database table or a file name and folder structure. Moreover, every dataset refers to a linked service.

[Image: SQL dataset]
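As a sketch, a dataset pointing at a CSV file in blob storage could look like the following. The names (`BlobSalesCsv`, `AzureBlobStorageLS`, the container, and the file name) are hypothetical:

```json
{
    "name": "BlobSalesCsv",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "AzureBlobStorageLS",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "raw-data",
                "fileName": "sales.csv"
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": true
        }
    }
}
```

Note the `linkedServiceName` reference: the dataset describes the *shape and location* of the data, while the connection details live in the linked service it points to.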

Linked Services

As a pre-requisite, before connecting to a data source (e.g. SQL Server) we need connection information. Linked services act much like connection strings: they contain the information Data Factory needs to connect to external data sources and services. Data Factory then connects to a data source using this linked service, just as an application would use a connection string.

[Image: Azure linked services]
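A minimal sketch of a linked service for an Azure SQL Database follows. The name `AzureSqlDatabaseLS` is hypothetical, and the angle-bracketed values are placeholders you would fill in (in practice, secrets are better kept in Azure Key Vault than inline):

```json
{
    "name": "AzureSqlDatabaseLS",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": "Server=tcp:<server>.database.windows.net,1433;Database=<database>;User ID=<user>;Password=<password>;"
        }
    }
}
```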

Integration Runtimes

Integration runtimes provide the compute infrastructure on which activities are executed. You can choose from three types:

  • Azure IR

This provides fully managed, serverless compute in Azure. It performs data movement and transformation between cloud data stores.

  • Self-hosted IR

This is needed to run activities between cloud data stores and a data store residing in a private network, such as an on-premises environment.

  • Azure-SSIS IR

This is primarily required to execute native SQL Server Integration Services (SSIS) packages.

[Image: Azure integration runtimes]
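Which runtime an activity uses is typically determined by the linked service it goes through: a linked service can be pinned to a specific integration runtime via the `connectVia` property. A hypothetical sketch for an on-premises SQL Server reached through a self-hosted runtime (all names invented):

```json
{
    "name": "OnPremSqlServerLS",
    "properties": {
        "type": "SqlServer",
        "typeProperties": {
            "connectionString": "Server=myserver;Database=mydb;Integrated Security=True;"
        },
        "connectVia": {
            "referenceName": "MySelfHostedIR",
            "type": "IntegrationRuntimeReference"
        }
    }
}
```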

Triggers

Triggers execute your pipeline on a pre-defined schedule, without any manual intervention. We can set configuration options such as the start/end date, execution frequency, and time of execution.

[Image: Factory resources – triggers]
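For example, a schedule trigger that runs a pipeline once a day at 06:00 UTC could be sketched as follows (the trigger name, pipeline name, and start time are hypothetical):

```json
{
    "name": "DailyMorningTrigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",
                "interval": 1,
                "startTime": "2024-01-01T06:00:00Z",
                "timeZone": "UTC"
            }
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "CopySalesData",
                    "type": "PipelineReference"
                }
            }
        ]
    }
}
```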

Summary

In this blog, we went over the different components of Azure Data Factory. The diagram below represents the high-level relationship between these components, as explained on the official Microsoft website.

[Image: Components of Azure Data Factory]

In conclusion: pipelines are created to execute multiple activities. For activities that move or transform data, datasets define the input and output format of the data. Data sources and services are connected with the help of linked services. Integration runtimes provide the infrastructure and execution location for the activities. Once a pipeline and the activities inside it are complete, we add triggers to execute it automatically at a specific time or event.

Now that we have understood the Azure Data Factory components in theory, we can make things HAPPEN!

Stay tuned for the next blog in which we shall be using these components. Also, feel free to type in your questions and queries in the comment box below!
