Workflow orchestration with AWS Step Functions

Sriyash Kadu
3 min read · Sep 5, 2023


As part of an ETL migration to AWS, the entire data extraction and transformation workflow was moved into the AWS environment. This was achieved with three widely used AWS services: Lambda, S3, and Step Functions.

The Rationale endpoint URLs acted as the source for the data extraction. The XML tags returned by these URLs were parsed, and the extracted data was populated into structured tables in the database (AWS Redshift).

AWS LAMBDA

Salient features:

· Part of the AWS compute domain.

· Serverless: there are no servers to provision or manage.

· Commonly used to build back-end services.

· Scales automatically with usage.

· Billed per request and per unit of compute time consumed.

· Supported languages include Python, C#, Java, and Node.js.

· All function activity is logged in CloudWatch.

The Lambda functions were written in Python, since Lambda provides Python runtimes that run your code in response to events. Batch processing was achieved with Python's multiprocessing package, since the architecture relied on process-based parallelism. The primary motive for choosing Lambda as the compute service was that it lets you run code without provisioning or managing servers. Lambda also supports environment variables, making the code environment-agnostic and deployment-friendly. The Python code captured the extracted data in a data frame, applied a preliminary level of processing, and wrote the result out as CSV files.
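
A minimal sketch of what such a handler might look like. The endpoint URLs, bucket name, and XML layout below are illustrative assumptions, and the standard-library csv module stands in for the pandas data frame the original code used, to keep the example self-contained. One practical detail worth noting: Lambda's execution environment lacks /dev/shm, so multiprocessing.Pool is unavailable there; Process plus Pipe is the usual workaround for in-function parallelism.

```python
import csv
import io
import urllib.request
import xml.etree.ElementTree as ET
from multiprocessing import Pipe, Process

import boto3

s3 = boto3.client("s3")
BUCKET = "my-etl-bucket"  # hypothetical bucket name


def fetch_and_parse(url, conn):
    """Fetch one endpoint and flatten its XML records into dicts."""
    with urllib.request.urlopen(url) as resp:
        root = ET.fromstring(resp.read())
    rows = [{child.tag: child.text for child in record} for record in root]
    conn.send(rows)
    conn.close()


def handler(event, context):
    # Fan out one process per endpoint URL; Pipe carries results back
    # because multiprocessing.Pool does not work inside Lambda.
    procs, pipes = [], []
    for url in event["urls"]:
        parent, child = Pipe()
        p = Process(target=fetch_and_parse, args=(url, child))
        p.start()
        procs.append(p)
        pipes.append(parent)

    rows = [row for pipe in pipes for row in pipe.recv()]
    for p in procs:
        p.join()

    # Write the combined records to a single CSV object in S3.
    fields = sorted({key for row in rows for key in row})
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, restval="")
    writer.writeheader()
    writer.writerows(rows)
    s3.put_object(Bucket=BUCKET, Key=event["output_key"], Body=buf.getvalue())
    return {"rows": len(rows), "key": event["output_key"]}
```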

AWS S3

The CSV files are stored in AWS S3, which offers high performance and easy-to-use management features that in turn help optimize costs. High scalability, data security, and low-latency storage were other contributing factors in choosing S3 as the storage platform. We customized the nomenclature of the S3 object keys, which proved to be a cornerstone of the incremental-load strategy used when loading data from the CSV files into Redshift tables.
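
A sketch of the kind of date-stamped key nomenclature that makes incremental loads straightforward; the bucket, prefix, and table names here are assumptions, not the actual convention used.

```python
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
BUCKET = "my-etl-bucket"       # hypothetical
PREFIX = "rationale/extracts"  # hypothetical


def build_key(table_name):
    """Embed the table name and a UTC extraction timestamp in the key."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{PREFIX}/{table_name}/{table_name}_{stamp}.csv"


def latest_key(table_name):
    """Find the newest extract for a table; keys sort chronologically
    because the timestamp is fixed-width and zero-padded."""
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=f"{PREFIX}/{table_name}/")
    keys = [obj["Key"] for obj in resp.get("Contents", [])]
    return max(keys) if keys else None


# The incremental Redshift load then COPYs only the newest object, e.g.:
# COPY schema.table FROM 's3://my-etl-bucket/<latest_key>'
# IAM_ROLE '<role-arn>' CSV IGNOREHEADER 1;
```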

AWS STEP FUNCTIONS

The entire architecture was event-based, which is where Step Functions comes into the picture. An important use case of Step Functions is dynamic parallelism. Step Functions provides a graphical console that helps developers use AWS services to build distributed applications, automate processes, orchestrate microservices, and create data and machine learning (ML) pipelines. It is a powerful tool with high and customizable parallelization capabilities: its drag-and-drop Workflow Studio lets you add AWS service integrations that can be invoked sequentially or in parallel with the help of the Map state. In Step Functions, a workflow is called a state machine, which is a series of event-driven steps; each step in a workflow is called a state. A task state represents a unit of work that another AWS service, such as AWS Lambda, performs. The entire pipeline was created with a customizable flow, and steps were re-run or branched based on the response of the previous step or microservice.
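
A minimal sketch of such a state machine expressed in Amazon States Language, with a Map state fanning one extraction Lambda out per endpoint URL before a final load step. The state names, ARNs, and role are illustrative placeholders, not the pipeline's actual definition.

```python
import json

import boto3

sfn = boto3.client("stepfunctions")

definition = {
    "Comment": "Fan out one extraction Lambda per endpoint URL",
    "StartAt": "ExtractAll",
    "States": {
        "ExtractAll": {
            "Type": "Map",                 # dynamic parallelism
            "ItemsPath": "$.urls",         # one iteration per input URL
            "MaxConcurrency": 10,
            "Iterator": {
                "StartAt": "Extract",
                "States": {
                    "Extract": {
                        "Type": "Task",
                        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:extract",  # hypothetical
                        "End": True,
                    }
                },
            },
            "Next": "LoadRedshift",
        },
        "LoadRedshift": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:load",  # hypothetical
            "End": True,
        },
    },
}

# Register the workflow once, then start a run with the URL list as input.
sm = sfn.create_state_machine(
    name="etl-pipeline",  # hypothetical
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/sfn-role",  # hypothetical
)
sfn.start_execution(
    stateMachineArn=sm["stateMachineArn"],
    input=json.dumps({"urls": ["https://example.com/feed1", "https://example.com/feed2"]}),
)
```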

The console also offers assorted levels of logging and tracing for each execution. The pipeline executes a complex flow in which diverse services are invoked depending on the preceding event and its response.
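
The same per-step trail the console shows can also be pulled programmatically; a small sketch, with a placeholder execution ARN:

```python
import boto3

sfn = boto3.client("stepfunctions")


def print_execution_trail(execution_arn):
    """List each event (state entered/exited, Lambda succeeded/failed)
    in the order the state machine emitted it."""
    paginator = sfn.get_paginator("get_execution_history")
    for page in paginator.paginate(executionArn=execution_arn):
        for event in page["events"]:
            print(event["id"], event["timestamp"], event["type"])


print_execution_trail("arn:aws:states:us-east-1:123456789012:execution:etl-pipeline:run-1")  # hypothetical
```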

This migration improved end-to-end execution time by roughly 170%; in other words, the legacy pipeline took about 2.7 times as long to run as the new one.
