
Modern Data Stack: The What, Why and How?

Shivam Anand

Data Engineering

This post will provide you with a comprehensive overview of the modern data stack (MDS), including its benefits, how its components differ from their predecessors, and what its future holds.

“Modern” has the connotation of being up-to-date, of being better. This is true for MDS, but how exactly is MDS better than what was before?

What was the data stack like?...

A few decades back, the MapReduce breakthrough made it possible to efficiently process large amounts of data in parallel across multiple machines.

It provided the backbone of a standard pipeline that looked like:

It was common to see HDFS used for storage, Spark for compute, and Hive to run SQL queries on top.

To run this, teams handled the deployment and maintenance of Hadoop on their own.

This self-managed setup eventually became a pain point, making the stack complex and inefficient in the long run.

Being on-prem while facing ever-growing loads meant scalability became a huge concern.

Hence, unlike today, the process was much more manual. Adding more RAM, increasing storage, and rolling out updates by hand reduced productivity.

Moreover,

  • The pipeline wasn’t modular; components were tightly coupled, causing failures whenever teams tried to shift to something new.
  • Teams committed to specific vendors and found themselves locked in, by design, for years.
  • Setup was complex, and the infrastructure was not resilient. Random surges in data crashed the systems. (This randomness in demand has only increased since the internet’s early decades, due to social media-triggered virality.)
  • Self-service was non-existent. If you wanted to do anything with your data, you needed data engineers.
  • Observability was a myth. Your pipeline would fail without you knowing, and then you wouldn’t know why, where, or how. Your customers became your testers, often knowing more about your system’s issues than you did.
  • Data protection laws weren’t as formalized, and organizations often lacked internal data policies.

These issues made the traditional setup inefficient at solving modern problems.

For an upgraded, modern setup, we needed something that is scalable, has a smaller learning curve, and is feasible for both a seed-stage startup and a Fortune 500 company.

Standing on the shoulders of tech innovations from the 2000s, data engineers started building a blueprint for MDS tooling with three core attributes: 

Cloud Native (or the ocean)

Arguably the definitive change of the MDS era, the cloud removes the hassle of on-prem infrastructure and enables automatic horizontal or vertical scaling in an era where virality and traffic spikes are technical requirements.

Modularity

The M in MDS could stand for modular.

You can integrate any MDS tool into your existing stack, like LEGO blocks.

You can test out multiple tools, whether they’re open source or managed, choose the best fit, and iteratively build out your data infrastructure.

This mindset helps instill a habit of avoiding vendor lock-in and lets you continuously upgrade your architecture with relative ease.

By moving away from the ancient, one-size-fits-all model, MDS recognizes the uniqueness of each company's budget, domain, data types, and maturity—and provides the correct solution for a given use case.

Ease of Use

MDS tools are easier to set up. You can start playing with these tools within a day.

Importantly, the ease of use is not limited to engineers.

Owing to the rise of self-serve and no-code tools like Tableau, data is finally democratized for all kinds of consumers. SQL remains crucial, but for basic metric calculations, PMs, Sales, Marketing, etc., can use a simple drag and drop in the UI (sometimes even simpler than Excel pivot tables).

MDS also enables one to experiment with different architectural frameworks for their use case. For example, ELT vs. ETL (explained under Data Transformation).

But, one might think such improvements mean MDS is the v1.1 of Data Stack, a tech upgrade that ultimately uses data to solve similar problems.

Fortunately, that’s far from the case.

MDS enables data to solve more human problems across the org—problems that employees have long been facing but could never systematically solve for, helping generate much more value from the data.

Beyond these, employees want transparency and visibility into how any metric was calculated and which data source in Snowflake was used to build which specific Tableau dashboard.

Critically, with compliance finally being focused on, orgs need solutions for giving the right people the right access at the right time.

Lastly, as opposed to previous eras, these days even startups have varied data infrastructure components; if you’re a PM tasked with generating insights, how do you know where to start? What data assets does the organization have?

Besides tackling these problem statements, MDS builds a culture of upskilling employees in various data concepts.

Data security, governance, and data lineage are important irrespective of department or persona in the organization.

From designers to support executives, the need for a data-driven culture is a given.

You’re probably bored of hearing how good the MDS is and want to deconstruct it into its components.

Let’s dive in.

SOURCES

In our modern era, every product is inevitably becoming a tech product.

From a smart bulb to an orbiting satellite, each generates data with its own unique flavor: frequency of generation, format, size, etc.

Social media, microservices, IoT devices, smart devices, DBs, CRMs, ERPs, flat files, and a lot more…

INGESTION

Once data is created, how does one “ingest” or take in that data for actual usage (the whole point of investing in data)?

Roughly, there are three categories to help describe the ingestion solutions:

Generic tools allow us to connect various data sources with data stores.

E.g.: we can connect Google Ads or Salesforce to dump data into BigQuery or S3.

These generic tools highlight the modularity and the low-code/no-code aspect of MDS.

Things are as easy as drag and drop, and one doesn't need to be fluent in scripting.

Then we have programmable tools as well, where we get more control over how we ingest data through code.

For example, we can write Apache Airflow DAGs in Python to load data from S3 and dump it to Redshift.
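For instance, here is a minimal sketch of such a DAG (Airflow 2.4+ syntax), assuming the apache-airflow-providers-amazon package and pre-configured AWS and Redshift connections; the bucket, schema, and table names are hypothetical:

```python
# Minimal sketch: copy S3 data into Redshift on a daily schedule.
# Assumes default aws_default / redshift_default connections exist.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.transfers.s3_to_redshift import (
    S3ToRedshiftOperator,
)

with DAG(
    dag_id="s3_to_redshift_example",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    load_events = S3ToRedshiftOperator(
        task_id="load_events",
        s3_bucket="my-raw-data",          # hypothetical bucket
        s3_key="events/{{ ds }}/",        # one prefix per execution date
        schema="analytics",
        table="events",
        copy_options=["FORMAT AS JSON 'auto'"],
    )
```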

Intermediary - these tools cater to a specific use case or are coupled with the source itself.

E.g., Snowpipe, part of Snowflake itself, allows us to load data from files as soon as they’re available at the source.

DATA STORAGE

Where do you ingest data into?

Here, we’ve expanded from HDFS & SQL DBs to a wider variety of formats (NoSQL, document DBs).

Depending on the use case and the way you interact with data, you can choose from a data warehouse (DW), database (DB), data lake (DL), object stores, etc.

You might need a standard relational DB for transactions in finance, or you might be collecting logs. You might be experimenting with your product at an early stage and be fine with noSQL without worrying about prescribing schemas.

One key feature to note is that most are cloud-based. So, no more worrying about scalability, and we pay only for what we use.

PS: Do stick around till the end for the newer concepts of the lakehouse and reverse ETL (already prevalent in the industry).

DATA TRANSFORMATION

The stored raw data must be cleaned and restructured into the shape we deem best for actual usage. This slicing and dicing is different for every kind of data.

For example, we have tools for the E-T-L approach, which can be categorized into SaaS products and frameworks, e.g., Fivetran and Spark respectively.

Interestingly, the cloud era has given storage systems computational capability, such that we sometimes don’t even need an external system for transformation.

With this rise of ELT, we leverage the processing capabilities of cloud data warehouses or lakehouses. Using tools like dbt, we write templated SQL queries to transform our data in the warehouse or lakehouse itself.
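As an illustration, a dbt model is just a templated SQL file that dbt compiles and executes inside the warehouse; here is a minimal sketch in which the stg_orders staging model and its columns are hypothetical:

```sql
-- models/orders_daily.sql: a minimal, hypothetical dbt model.
-- dbt resolves {{ ref(...) }} to the correct relation and materializes
-- the result in-database: the "T" of ELT runs inside the warehouse.
{{ config(materialized='table') }}

select
    order_date,
    count(*)    as order_count,
    sum(amount) as total_revenue
from {{ ref('stg_orders') }}
group by order_date
```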

This enables analysts to take on the heavy lifting of traditional data engineering problems.

We also see stream processing, where applications work on small units of data in real time, analyzing each event as soon as it’s produced rather than in large batches.
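As a toy illustration of that streaming mindset, here is a consumer built with the kafka-python client that analyzes each event the moment it arrives; the broker address, topic, and event fields are hypothetical:

```python
# Each event is processed as soon as it is produced, rather than being
# accumulated into a large batch.
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "page_views",                          # hypothetical topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

views_per_user: dict[str, int] = {}
for message in consumer:       # blocks, yielding events in real time
    event = message.value
    user = event["user_id"]
    views_per_user[user] = views_per_user.get(user, 0) + 1
    print(f"{user} has viewed {views_per_user[user]} pages so far")
```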

DATA VISUALIZATION

The ability to visually learn from data has only improved in the MDS era with advanced design, methodology, and integration.

With Embedded analytics, one can integrate analytical capabilities and data visualizations into the software application itself.

External analytics tools, on the other hand, are standalone and built on top of your processed data. You choose your source, create a chart, and let it run.

DATA SCIENCE, MACHINE LEARNING, MLOps

Source: https://medium.com/vertexventures/thinking-data-the-modern-data-stack-d7d59e81e8c6

In the last decade, we have moved beyond ad-hoc insight generation in Jupyter notebooks to production-ready, real-time ML workflows, like recommendation systems and price predictions. Any startup can and does integrate ML into its products.

Most cloud service providers offer machine learning models and automated model building as a service.

MDS concepts like data observability are used to build tools for ML practitioners, whether it’s feature stores (a feature store is a central repository that provides entity feature values as of a certain point in time) or model monitoring (checking data drift, tracking model performance, and improving model accuracy).
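To make the feature-store idea concrete, here is a toy, in-memory sketch of its core operation, a point-in-time lookup; entities, timestamps, and features are hypothetical, and real feature stores do this at scale:

```python
# Toy point-in-time lookup: return the latest feature values recorded
# at or before `as_of`, never "future" values (avoids training leakage).
history = {
    # entity -> list of (unix_timestamp, features), sorted by timestamp
    "user_42": [
        (1_000, {"avg_order_value": 12.0}),
        (2_000, {"avg_order_value": 15.5}),
    ],
}

def get_features(entity: str, as_of: int) -> dict | None:
    latest = None
    for ts, features in history.get(entity, []):
        if ts <= as_of:
            latest = features   # still in the past: keep walking forward
        else:
            break
    return latest

print(get_features("user_42", 1_500))  # {'avg_order_value': 12.0}
```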

This is extremely important, as statisticians can focus on the business problem, not the infrastructure.

This is an ever-expanding field where concepts such as MLOps (DevOps for ML pipelines: optimizing workflows, efficient transformations) and synthetic media (using AI to generate the content itself) arrive and quickly become mainstream.

ChatGPT is the current buzz, but by the time you’re reading this, I'm sure there’s going to be an updated one—such is the pace of development.

DATA ORCHESTRATION

With a higher number of modularized tools and source systems comes added complexity.

More steps, processes, connections, settings, and synchronization are required.

Data orchestration in MDS needs to be Cron on steroids.

Using a wide variety of products, MDS orchestration tools help bring the right data to the right place for the right purpose, based on complex logic.
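For instance, here is a minimal Airflow sketch of what “Cron on steroids” buys you: declared dependencies, automatic retries, and a schedule, all in one place; the task logic is hypothetical:

```python
# Unlike a bare cron entry, the orchestrator knows the task order,
# retries failures automatically, and can backfill past runs.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract(): ...      # placeholders for real pipeline steps
def transform(): ...
def load(): ...

with DAG(
    dag_id="nightly_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},
    catchup=False,
) as dag:
    extract_t = PythonOperator(task_id="extract", python_callable=extract)
    transform_t = PythonOperator(task_id="transform", python_callable=transform)
    load_t = PythonOperator(task_id="load", python_callable=load)

    extract_t >> transform_t >> load_t   # a dependency graph, not just a timer
```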


DATA OBSERVABILITY

Data observability is the ability to monitor and understand the state and behavior of data as it flows through an organization's systems.

In a traditional data stack, organizations often rely on reactive approaches to data management, only addressing issues as they arise. In contrast, data observability in an MDS involves adopting a proactive mindset, where organizations actively monitor and understand the state of their data pipelines to identify potential issues before they become critical.

Monitoring - a dashboard that provides an operational view of your pipeline or system

Alerting - both for expected events and anomalies 

Tracking - ability to set and track specific events

Analysis - automated issue detection that adapts to your pipeline and data health

Logging - a record of an event in a standardized format for faster resolution

SLA Tracking - measure data quality against predefined standards (cost, performance, reliability); a toy freshness check is sketched after this list

Data Lineage - graph representation of data assets showing upstream/downstream steps.
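As a concrete example of the SLA-tracking idea, here is a toy freshness check that could feed an alert; it assumes a sqlite3-style connection and an updated_at column stored as Unix epoch seconds, both assumptions for this sketch:

```python
# Toy freshness check: is the newest row in `table` recent enough to
# meet its SLA? Assumes `updated_at` holds Unix epoch seconds.
import time

def check_freshness(conn, table: str, max_lag_seconds: int) -> bool:
    (latest,) = conn.execute(f"SELECT MAX(updated_at) FROM {table}").fetchone()
    if latest is None:
        return False                      # empty table counts as stale
    return (time.time() - latest) <= max_lag_seconds

# e.g., alert if `orders` hasn't seen a new row in the last hour:
# if not check_freshness(conn, "orders", 3600): page_the_on_call()
```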

DATA GOVERNANCE & SECURITY

Data security is a critical consideration for organizations of all sizes and industries and needs to be prioritized to protect sensitive information, ensure compliance, and preserve business continuity. 

The introduction of stricter data protection regulations, such as the General Data Protection Regulation (GDPR) and the CCPA, created a huge need in the market for MDS tools that efficiently and painlessly help organizations govern and secure their data.

DATA CATALOG

Now that we have all the components of MDS, from ingestion to BI, we have so many sources, plus dashboards, reports, views, and other metadata, that we need a Google-like search engine just to navigate our components.

This is where a data catalog helps; it allows people to stitch together the metadata (data about your data: the number of rows in a table, the column names, types, etc.) across sources.

This is necessary to help efficiently discover, understand, trust, and collaborate on data assets.
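To make “data about your data” tangible, here is a toy sketch that gathers exactly that kind of metadata from a SQLite database; a real catalog does the same across many sources and stitches the results together, and the table names here are hypothetical:

```python
# Collect the metadata a catalog would index: row count plus column
# names and types. SQLite-specific calls are used purely for illustration.
import sqlite3

def describe_table(conn: sqlite3.Connection, table: str) -> dict:
    (row_count,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    columns = [
        (name, col_type)
        for _cid, name, col_type, *_rest in conn.execute(
            f"PRAGMA table_info({table})"
        )
    ]
    return {"table": table, "row_count": row_count, "columns": columns}
```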

We don’t want PMs & GTM to look at different dashboards for adoption data.

To see how these components evolve at a data-mature company, consider Netflix. Previously, the sole purpose of its original data pipeline was to aggregate and upload events to Hadoop/Hive for batch processing. Chukwa collected events and wrote them to S3 in Hadoop sequence file format. In those days, end-to-end latency was up to 10 minutes, which was sufficient for batch jobs that usually scan data at daily or hourly frequency.

With the emergence of Kafka and Elasticsearch over the last decade, there has been growing demand for real-time analytics at Netflix. By real-time, we mean sub-minute latency. Instead of starting from scratch, Netflix was able to iteratively grow its MDS as market requirements changed.

Source: https://blog.transform.co/data-talks/the-metric-layer-why-you-need-it-examples-and-how-it-fits-into-your-modern-data-stack/


This is a snapshot of the MDS a data-mature company like Netflix ran some years back, where, instead of a few all-in-one tools, each data category was handled by a specialized tool.

FUTURE COMPONENTS OF MDS?

DATA MESH

Source: https://martinfowler.com/articles/data-monolith-to-mesh.html

The top picture shows how teams currently operate: no matter the feature or product on the Y-axis, the data pipeline’s journey remains the same along the X-axis. But in an ideal data-mesh world, those who know the data should own its journey.

As decentralization is the name of the game, data mesh is MDS’s response to this demand for an architectural shift, where domain owners use self-service infrastructure to shape how their data is consumed.

DATA LAKEHOUSE

Source: https://www.altexsoft.com/blog/data-lakehouse/

We have talked about data warehouses and data lakes being used for data storage.

Initially, when we only needed structured data, data warehouses were used. Later, with big data, we started getting all kinds of data, structured and unstructured.

So, we started using Data Lakes, where we just dumped everything.

The lakehouse tries to combine the best of both worlds by adding an intelligent metadata layer on top of the data lake. This layer basically classifies and categorizes data such that it can be interpreted in a structured manner.

Also, all the data in the lakehouse is open, meaning it can be utilized by all kinds of tools. Lakehouses are generally built on top of open data formats like Parquet so that any tool can easily access the data.

End users can simply run SQL as if they’re querying a DWH.
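Here is a minimal sketch of that experience, using DuckDB as the SQL engine directly over Parquet files; the paths and column names are hypothetical:

```python
# Query open-format files on the lake with plain SQL, with no separate
# warehouse load step required.
import duckdb

result = duckdb.sql(
    """
    SELECT region, SUM(amount) AS revenue
    FROM 'lake/orders/*.parquet'          -- DuckDB reads Parquet directly
    GROUP BY region
    ORDER BY revenue DESC
    """
).fetchall()
print(result)
```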

REVERSE ETL

Suppose you’re a salesperson using Salesforce and want to know if a lead you just got is warm or cold (warm indicating a higher chance of conversion).

The attributes of your lead, like salary and age, are fetched from your OLTP system into a DWH, analyzed, and then the flag “warm” is sent back to the Salesforce UI, ready to be used in live operations.
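A minimal sketch of that final write-back step, using the third-party simple-salesforce client; the credentials, custom field name, and record ID are all hypothetical:

```python
# Push the warehouse-computed "warm" flag back into Salesforce so it
# appears in the rep's UI. All identifiers below are hypothetical.
from simple_salesforce import Salesforce

sf = Salesforce(
    username="ops@example.com",
    password="...",                   # elided credentials
    security_token="...",
)

# In practice this list would come from a query against the warehouse.
scored_leads = [{"Id": "00Q000000000001", "Lead_Temperature__c": "warm"}]

for lead in scored_leads:
    sf.Lead.update(lead["Id"], {"Lead_Temperature__c": lead["Lead_Temperature__c"]})
```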

METRICS LAYER

Source: https://blog.transform.co/data-talks/the-metric-layer-why-you-need-it-examples-and-how-it-fits-into-your-modern-data-stack/

The metrics layer will be all about consistency, accessibility, and trust in metric calculations.

Earlier, for metrics, you had v1 and v1.1 Excel files with logic scattered around.

Currently, in the modern data stack world, each team’s calculations are isolated in the tool they’re used to. For example, BI would store metrics in Tableau dashboards while DEs would keep them in code.

A metrics layer would exist to ensure global access to the metrics from every other tool in the data stack.

For example, dbt’s metrics layer helps define these in the warehouse, something accessible to both BI and engineers. Similarly, Looker, Mode, and others have their own unique approach to it.
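To illustrate the idea, here is a toy metric registry: the definition lives in one place, and every consumer compiles the same SQL from it; the metric names, tables, and dimensions are hypothetical:

```python
# One shared definition of "revenue"; BI tools and notebooks alike
# request it by name instead of re-implementing the calculation.
METRICS = {
    "revenue": {
        "expression": "SUM(amount)",
        "table": "orders",
        "dimensions": ["order_date", "region"],
    },
}

def compile_metric(name: str, group_by: str) -> str:
    metric = METRICS[name]
    if group_by not in metric["dimensions"]:
        raise ValueError(f"{group_by!r} is not a dimension of {name!r}")
    return (
        f"SELECT {group_by}, {metric['expression']} AS {name} "
        f"FROM {metric['table']} GROUP BY {group_by}"
    )

# Every tool gets the exact same SQL:
print(compile_metric("revenue", "region"))
```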

In summary, this blog post discussed the modern data stack and its advantages over older approaches. We examined the components of the modern data stack, including data sources, ingestion, transformation, and more, and how they work together to create an efficient and effective system for data management and analysis. We also highlighted the benefits of the modern data stack, including increased efficiency, scalability, and flexibility. 

As technology continues to advance, the modern data stack will evolve and incorporate new components and capabilities.


