ClickHouse - The Newest Data Store in Your Big Data Arsenal

Milind Kulkarni

Data Engineering

Tags:

Data Engineering

ClickHouse

Database

Data WareHouse

SQL

ClickHouse

ClickHouse is an open-source, column-oriented data warehouse for online analytical processing (OLAP) of queries. It is fast, scalable, flexible, cost-efficient, and easy to run. It delivers industry-leading query performance while significantly reducing storage requirements through innovative use of columnar storage and compression.

ClickHouse’s performance exceeds that of comparable column-oriented database management systems available on the market. ClickHouse is a database management system, not a single database. It allows creating tables and databases at runtime, loading data, and running queries without reconfiguring and restarting the server.

ClickHouse can process from hundreds of millions to over a billion rows of data per second and scales to clusters of hundreds of nodes. It utilizes all available hardware to process each query as fast as possible. The peak processing performance for a single query stands at more than two terabytes per second.

What makes ClickHouse unique?

Data Storage & Compression: ClickHouse is designed to work on regular hard drives but uses SSDs and additional RAM if available. Data compression plays a crucial role in achieving excellent performance. ClickHouse provides general-purpose compression codecs as well as specialized codecs for specific kinds of data. These codecs trade CPU consumption against disk space in different ways, and choosing the right one helps ClickHouse outperform other databases.
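To make the idea of a specialized codec concrete, here is a minimal sketch in Python rather than ClickHouse itself. It assumes a hypothetical timestamp column and uses zlib as a stand-in for a general-purpose codec; the delta step only illustrates the principle behind codecs such as Delta and is not ClickHouse's actual implementation.

```python
import struct
import zlib

def pack_int64(values):
    """Serialize integers as little-endian 64-bit values, like a raw column."""
    return struct.pack(f"<{len(values)}q", *values)

def delta_encode(values):
    """Keep the first value, then store each difference from the previous one."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    """Rebuild the original values with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out

# Hypothetical column: one timestamp per second, 10,000 rows.
timestamps = [1_700_000_000 + i for i in range(10_000)]

# zlib stands in for a general-purpose codec (an assumption of this sketch).
raw_size = len(zlib.compress(pack_int64(timestamps)))
delta_size = len(zlib.compress(pack_int64(delta_encode(timestamps))))

print(f"general-purpose only: {raw_size} bytes")
print(f"delta + general-purpose: {delta_size} bytes")
```

The delta step costs a little extra CPU on write and read, but it turns a steadily growing column into a run of identical small numbers that the general-purpose codec can shrink far more effectively.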

High Performance: ClickHouse uses a vectorized computation engine, so data is processed in vectors (chunks of columns), which achieves high CPU efficiency. It supports parallel processing across multiple cores, so large queries are parallelized naturally. ClickHouse also supports distributed query processing: data resides across shards, which are used to execute the query in parallel.
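The distributed part can be pictured with a small Python sketch. It assumes a hypothetical data set split across three in-memory "shards" and uses a thread pool in place of real servers: each shard computes a small partial result, and the partial results are merged into the final answer.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_aggregate(shard):
    """Each shard scans only its own rows and returns a small partial state."""
    return sum(shard), len(shard)

def distributed_avg(shards):
    """Run the partial aggregation on every shard in parallel, then merge."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(partial_aggregate, shards))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

# Hypothetical data set split across three shards.
rows = list(range(1, 1001))
shards = [rows[0::3], rows[1::3], rows[2::3]]
print(distributed_avg(shards))  # prints 500.5
```

Only the tiny (sum, count) pairs travel between the shards and the coordinator, which is why this style of execution scales well as data grows.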

Primary & Secondary Index: Data is physically sorted by the primary key, allowing low-latency extraction of specific values or ranges. Secondary indexes in ClickHouse let the database determine that some parts of the data cannot match the query's filter conditions, so it can skip them entirely. For this reason, they are also called data skipping indexes.
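Both kinds of index can be sketched in a few lines of Python. This is an illustration under stated assumptions, not ClickHouse internals: the table, the column names, and the tiny granule size are hypothetical, and the min-max structure stands in for one possible type of data skipping index.

```python
from bisect import bisect_left, bisect_right

GRANULE = 4  # rows per granule (hypothetical; real granules are much larger)

# Rows are (user_id, latency_ms), stored sorted by the primary key user_id.
rows = sorted((uid, (uid * 37) % 100) for uid in range(1, 21))
granules = [rows[i:i + GRANULE] for i in range(0, len(rows), GRANULE)]

# Sparse primary index: only the first key of each granule is kept.
marks = [g[0][0] for g in granules]

def granules_for_key_range(lo, hi):
    """Return indexes of granules that may contain keys in [lo, hi]."""
    # Start one granule early: it may still end with keys equal to lo.
    first = max(bisect_left(marks, lo) - 1, 0)
    last = bisect_right(marks, hi)
    return list(range(first, last))

# Data-skipping (min-max) index on the non-key column, one entry per granule.
minmax = [(min(v for _, v in g), max(v for _, v in g)) for g in granules]

def granules_for_value(threshold):
    """Return indexes of granules whose max latency can reach the threshold."""
    return [i for i, (_, hi) in enumerate(minmax) if hi >= threshold]

print(granules_for_key_range(6, 10))  # prints [1, 2]
print(granules_for_value(90))         # prints [1, 3]
```

In both cases the index never points at an individual row. It only narrows the scan down to the granules that might match, which is also why point lookups of single rows are comparatively expensive.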

Support for Approximated Calculations: ClickHouse can trade accuracy for performance by using approximated calculations. It provides aggregate functions for approximate estimates of the number of distinct values, medians, and quantiles. It can also run a query on only a part (sample) of the data, retrieving proportionally less data from disk to produce an approximate result.
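The sampling trade-off can be sketched in Python as follows. The column, its size, and the one-in-ten sampling rate are assumptions chosen for illustration; the sketch only shows the principle of reading a fraction of the rows, not how ClickHouse implements sampling or its approximate aggregate functions.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable
values = [rng.uniform(0, 1000) for _ in range(100_000)]  # hypothetical column

def approx_median(column, sample_every=10):
    """Estimate the median from every Nth row instead of the full column."""
    return statistics.median(column[::sample_every])

exact = statistics.median(values)
approx = approx_median(values)
rows_read = len(values[::10])

print(f"exact={exact:.1f} approx={approx:.1f} rows read={rows_read}")
```

The estimate is computed from a tenth of the rows, so it touches a tenth of the data, at the cost of a small error that is often acceptable for dashboards and exploratory analysis.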

Data Replication and Data Integrity Support: After data is written to any available replica, all the remaining replicas retrieve their copies in the background. The system keeps identical data on the different replicas. Most failures are recovered from automatically, or semi-automatically in complex scenarios.

But it can’t be all good, can it? There are some disadvantages to ClickHouse as well:

No full-fledged transactions.
Inability to modify or delete previously inserted data at a high rate and with low latency. Batch deletes and updates are available to clean up or modify data, for example, to comply with GDPR.
ClickHouse is less efficient for point queries that retrieve individual rows by their keys due to the sparse index.

ClickHouse against its contemporaries

So with all these distinctive features, how does ClickHouse compare with other industry-leading data storage tools? ClickHouse is general-purpose, has a variety of use cases, and has its own pros and cons, so here’s a high-level comparison against the best tools in their respective domains. Each tool has traits unique to its use case, and comparing those would not be fair, but what we care about most is performance, scalability, cost, and other key attributes that can be compared irrespective of the domain. So here we go:

ClickHouse vs Snowflake:

With its decoupled storage & compute approach, Snowflake is able to segregate workloads and enhance performance. The search optimization service in Snowflake further enhances the performance of point lookups but comes at an additional cost. ClickHouse, on the other hand, drastically improves query performance through its local runtime and inherent support for multiple forms of indexing.
Regarding scalability, ClickHouse being on-prem makes it slightly more challenging to scale than Snowflake, which is cloud-based. Managing hardware manually by provisioning clusters and migrating is doable but tedious. One way to tackle this is to deploy ClickHouse in the cloud, which is cheaper and, frankly, the most viable option.

ClickHouse vs Redshift:

Redshift is a managed, scalable cloud data warehouse. It offers both provisioned and serverless options. Its RA3 nodes scale compute and cache the necessary data. Even so, Redshift does not isolate different workloads that run on the same data, which puts it at the lower end of the decoupled compute & storage cloud architectures. ClickHouse’s local runtime is one of the fastest.
Both Redshift and ClickHouse are columnar and sort data, which allows them to read only the specific data a query needs. But deploying ClickHouse is cheaper, and although Redshift is tailored to be a ready-to-use tool, ClickHouse is the better choice if you’re not entirely dependent on Redshift’s managed features like configuration, backup & monitoring.

ClickHouse vs InfluxDB:

InfluxDB is an open-source NoSQL database written in Go and one of the most popular choices for time-series data and analysis. Despite being a general-purpose analytical DB, ClickHouse provides competitive write performance.
ClickHouse’s table engines like AggregatingMergeTree allow real-time data to be stored in a pre-aggregated format, which puts it on par with TSDBs in performance. It is significantly faster on heavy queries and comparable on light queries.

ClickHouse vs PostgreSQL:

Postgres is another very versatile DB and, just like ClickHouse, is widely used for various use cases. Postgres, however, is an OLTP DB, so unlike ClickHouse, analytics is not its primary aim, but it's still used for analytics to a certain extent.
For transactional workloads, ClickHouse's columnar nature puts it below Postgres. But when it comes to analytical capabilities, ClickHouse is ahead even after Postgres is tuned to its maximum potential, for example, by using materialized views, indexing, cache size, and buffers.

ClickHouse vs Apache Druid:

Apache Druid is an open-source data store that is primarily used for OLAP. Druid & ClickHouse are very similar in their approaches and use cases but differ in their architecture. Druid is mainly used for real-time analytics with heavy ingestion and high uptime.
Unlike Druid, ClickHouse has a much simpler deployment. ClickHouse can be deployed on a single server, while a Druid setup needs multiple types of nodes (master, broker, ingestion, etc.). With its SQL-like query language, ClickHouse provides better flexibility. It is also more performant when the deployment is small.

To summarize the differences between ClickHouse and other data warehouses:

ClickHouse Engines

Depending on the type of your table (internal or external), ClickHouse provides an array of engines that help us connect to different data stores and determine how data is stored, accessed, and otherwise interacted with.

These engines are mainly categorized into two types:

Database Engines:

These allow us to work with different databases & tables.
By default, ClickHouse uses the Atomic database engine, which provides configurable table engines and an SQL dialect. Other popular database engines include PostgreSQL, MySQL, and so on.

Table Engines:

These determine

how and where data is stored
where to read/write it from/to
which queries it supports
use of indexes
concurrent data access and so on.

These engines are further classified into families based on the above parameters:

MergeTree Engines:

This is the most universal and functional family of table engines for high-load tasks. The engines of this family support quick data insertion with subsequent background data processing. These engines also support data replication, partitioning, secondary data-skipping indexes, and some other features. Following are some of the popular engines in this family:

MergeTree
SummingMergeTree
AggregatingMergeTree

MergeTree engines with indexing and partitioning support allow data to be processed at a tremendous speed. These can also be leveraged to form materialized views that store aggregated data further improving the performance.
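As a rough illustration of "quick insertion with subsequent background processing", the Python sketch below imitates what a SummingMergeTree-style merge does to rows that share a sorting key. The part contents and column meanings are hypothetical, and the function is a simplification for intuition, not the engine's real merge algorithm.

```python
from collections import defaultdict

def merge_parts(parts):
    """Merge parts into one, summing the metric of rows that share a key."""
    totals = defaultdict(int)
    for part in parts:
        for key, value in part:
            totals[key] += value
    # The merged part is again sorted by key.
    return sorted(totals.items())

# Two hypothetical parts written by separate inserts: (page, views).
part_1 = [("/home", 3), ("/pricing", 1)]
part_2 = [("/docs", 2), ("/home", 5)]

print(merge_parts([part_1, part_2]))
# prints [('/docs', 2), ('/home', 8), ('/pricing', 1)]
```

Inserts stay cheap because each one just writes a new sorted part, while the merge later collapses duplicate keys so that queries read fewer, already aggregated rows.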

Log Engines:

These are lightweight engines with minimum functionality. These work the best when the requirement is to quickly write into many small tables and read them later as a whole. This family consists of:

Log
StripeLog
TinyLog

These engines append data to the disk in a sequential fashion and support concurrent reading. They do not support indexing, updating, or deleting and hence are only useful when the data is small, sequential, and immutable.

Integration Engines:

These are used for communicating with other data storage and processing systems. This family supports:

JDBC
MongoDB
HDFS
S3
Kafka and so on.

Using these engines, we can import and export data from external sources. With engines like Kafka, we can ingest data directly from a topic into a table in ClickHouse, and with the S3 engine, we can work directly with S3 objects.

Special Engines:

ClickHouse offers some special engines that are specific to the use case. For example:

MaterializedView
Distributed
Merge
File and so on.

These special engines have their own quirks. For example, with File we can export data to a file, update data in the table by updating the file, and so on.

Summary

We learned that ClickHouse is a very powerful and versatile tool. It has stellar performance and is feature-packed, very cost-efficient, and open-source. We saw a high-level comparison of ClickHouse with some of the best choices across an array of use cases. Although it ultimately comes down to how specific and intense your use case is, ClickHouse and its generic nature measure up pretty well on multiple occasions.

ClickHouse’s applicability in web analytics, network management, log analysis, time series analysis, asset valuation in financial markets, and security threat identification makes it tremendously versatile. By consistently solving business problems with low-latency responses over petabytes of data, ClickHouse is indeed one of the faster data warehouses out there.
