Serverless ETL Datalake Using Amazon Web Services

This was our first time working with a remote team, but Velotio’s team didn't miss any deadlines despite having a tight schedule and won our trust early in the project. They excelled at reporting and addressing issues quickly. The communication with our on-site team was also extremely smooth. We're extremely happy with the progress we have made with them.

Director of Engineering
Data Engineering
San Francisco
$5 million
4 months
Tech Stack Used
AWS Lambda
AWS Glue
AWS Athena
AWS S3
AWS Step Functions

The customer is a B2B Customer Data Platform (CDP) that provides a unified view of the customer across all platforms. Its clients include leading brands such as Staples, Walmart, and Cisco.


Results

- Reduced data processing and storage cost by 10x.

- 50-60% reduction in ongoing operational costs.

- Scale and process petabytes of data with AWS S3 and Athena.


Business Context:

The customer wanted to set up a multi-tenant serverless data lake with real-time and batch data ingestion and processing. The data ingestion system had to support multiple file formats (CSV, TSV, XLS) and multiple sources, including AWS S3 buckets, FTP, and Dropbox.
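To make the multi-format requirement concrete, here is a minimal sketch of how an ingestion step might pick a parser from the file extension. It is an illustration only, not the customer's implementation: the function name `parse_delimited`, the `DELIMITERS` mapping, and the sample key are hypothetical, and XLS is left out because it would need a third-party reader.

```python
import csv
import io
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to delimiter (an assumption for
# this sketch). XLS is not handled: it needs a third-party reader.
DELIMITERS = {".csv": ",", ".tsv": "\t"}


def parse_delimited(key: str, payload: str) -> list[dict[str, str]]:
    """Parse a CSV or TSV payload into a list of row dictionaries."""
    suffix = PurePosixPath(key).suffix.lower()
    if suffix not in DELIMITERS:
        raise ValueError(f"unsupported file format: {suffix or key}")
    reader = csv.DictReader(io.StringIO(payload), delimiter=DELIMITERS[suffix])
    return [dict(row) for row in reader]


# Example: a tab-separated file uploaded under a tenant prefix.
rows = parse_delimited("tenant-a/orders.tsv", "id\tamount\n1\t9.50\n")
```

Keying the parser on the extension keeps the source (S3, FTP, Dropbox) separate from the format, so a new source only has to deliver a key and a payload.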

Challenges:

  • The existing CDP was built on traditional technologies such as Hadoop, Hive, HDFS, and YARN, which were difficult to manage, scale, and upgrade. The new solution needed minimal infrastructure maintenance and had to remove the undifferentiated heavy lifting of managing infrastructure as demand changes and technologies evolve.
  • As the customer signed on more large enterprises, data storage was expected to grow 10x, from terabytes to petabytes.
  • The existing platform had no cost-effective way to store unprocessed raw data.
  • The data warehouse receives data from a range of services. In the existing data warehouse, any update to those services required manual updates to ETL jobs and tables. Because response times for these data sources are critical, selecting a high-performance architecture called for a data-driven approach.

Solutions:

Having little experience with serverless technologies, the customer approached Velotio, which has deep expertise in setting up serverless data lakes that scale to petabytes of data.

Velotio worked with the customer to understand the existing platform, data characteristics and end goals.

Based on these requirements, Velotio decided to change the data warehouse both operationally and architecturally. From an operational standpoint, we designed a new shared responsibility model for data ingestion. Architecturally, we chose a serverless model over a traditional relational database. These two decisions ended up driving every design and implementation decision that we made in our migration.

  • Velotio built the solution on AWS using serverless technologies: AWS Step Functions, AWS Lambda, AWS Glue, AWS Athena, and AWS S3. Velotio built a proof of concept in one month to demonstrate that the solution addressed all the challenges. The complete solution was built in 4 months.


  • Velotio developed the solution as follows:

    a. Designed the batch processing pipeline with AWS Step Functions, using AWS Lambda for basic data sanitisation and AWS Glue for complex batch operations. AWS Glue handles the ETL job scheduling, and AWS Glue crawlers manage the metadata in the AWS Glue Data Catalog.

    b. Set up AWS Kinesis and Kinesis Firehose to fetch real-time data for processing.

    c. Used AWS S3 to store raw and processed data, with AWS Athena to query it. The platform can re-process raw data when the ETL rules or the data parsing logic change.
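The batch flow in step (a) can be sketched as an AWS Step Functions state machine definition. This is a minimal illustration under stated assumptions, not the definition Velotio deployed: the Lambda ARN, the Glue job name, the state names, and the `sanitised_path` field are all hypothetical placeholders. The Glue task uses the Step Functions service integration `arn:aws:states:::glue:startJobRun.sync`, which waits for the job run to finish.

```python
import json

# Hypothetical placeholders (assumptions for this sketch only).
SANITISE_LAMBDA_ARN = "arn:aws:lambda:us-east-1:123456789012:function:sanitise-file"
GLUE_JOB_NAME = "tenant-batch-etl"


def build_batch_pipeline(lambda_arn: str, glue_job_name: str) -> dict:
    """Return a two-step state machine: Lambda sanitisation, then a Glue job."""
    return {
        "Comment": "Sketch: sanitise a file with Lambda, then run a Glue ETL job",
        "StartAt": "SanitiseFile",
        "States": {
            "SanitiseFile": {
                "Type": "Task",
                "Resource": lambda_arn,
                "Next": "RunGlueJob",
            },
            "RunGlueJob": {
                "Type": "Task",
                # Service integration that starts the job and waits for it.
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {
                    "JobName": glue_job_name,
                    # Assumes the Lambda output carries a "sanitised_path" field.
                    "Arguments": {"--input_path.$": "$.sanitised_path"},
                },
                "End": True,
            },
        },
    }


definition = build_batch_pipeline(SANITISE_LAMBDA_ARN, GLUE_JOB_NAME)
definition_json = json.dumps(definition, indent=2)
```

The resulting JSON string is what would be supplied as the state machine definition. Keeping the light sanitisation in Lambda and the heavy transformation in Glue mirrors the split described in step (a).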

Result:

  • The new serverless data analytics platform reduced the cost of data processing and storage by 10x.
  • AWS S3 with Athena can scale to store and process tens of petabytes of data.
  • Using AWS services and a serverless model reduced ongoing operational costs by 50-60%.
  • The new platform can run TensorFlow-based machine learning models and analytics to understand customer behavior.

Choosing Velotio was a straightforward decision. They came with proven expertise in Data Engineering and had experience building CDP systems before. We were impressed by their expertise, flexibility, and speed of delivery.

Director of Engineering


About Velotio

Velotio Technologies is an offshore product development partner for mission-driven technology startups across the globe. We combine business expertise and cutting-edge technology to drive success for our customers and help them win in their chosen markets.
