
Data Lakehouse

ClickHouse integrates with open lakehouse table formats, including Apache Iceberg, Delta Lake, Apache Hudi, and Apache Paimon. This allows users to connect ClickHouse to data already stored in these formats across object storage, combining the analytical power of ClickHouse with their existing data lake infrastructure.

Why use ClickHouse with open table formats?

Query existing data in place

ClickHouse can query open table formats directly in object storage without duplicating data. Organizations standardized on Iceberg, Delta Lake, Hudi, or Paimon can point ClickHouse at existing tables and immediately use its SQL dialect, analytical functions, and efficient native Parquet reader. At the same time, tools like clickhouse-local and chDB enable exploratory, ad hoc analysis across more than 70 file formats in remote storage, allowing users to interactively explore lakehouse datasets with no infrastructure setup.
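For instance, an ad hoc query like the following can be run with clickhouse-local or chDB against Parquet files in object storage. The bucket path and column names here are placeholders, not a real dataset:

```sql
-- Exploratory query over Parquet files in S3; no tables or schemas
-- need to be created first. Bucket and columns are hypothetical.
SELECT
    toStartOfMonth(pickup_date) AS month,
    count() AS trips
FROM s3('https://example-bucket.s3.amazonaws.com/taxi/*.parquet')
GROUP BY month
ORDER BY month;
```

ClickHouse infers the schema from the Parquet files, so the same pattern works for quick exploration of any lakehouse dataset you can reach over HTTP(S).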

Users can achieve this either by reading directly, using table functions and table engines, or by connecting to a data catalog.

Real-time analytical workloads with ClickHouse

For workloads that demand high concurrency and low-latency responses, users can load data from open table formats into ClickHouse's MergeTree engine. This provides a real-time analytics layer on top of data that originates in a data lake, supporting dashboards, operational reporting, and other latency-sensitive workloads that benefit from MergeTree's columnar storage and indexing capabilities.

See the getting started guide for accelerating analytics with MergeTree.
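The loading step can be sketched as an INSERT ... SELECT from a lake table function into a MergeTree table. The table name, columns, and Iceberg path below are illustrative assumptions, not from a real deployment:

```sql
-- A MergeTree table to serve low-latency queries (schema is hypothetical).
CREATE TABLE trips_local
(
    pickup_date Date,
    fare Float64
)
ENGINE = MergeTree
ORDER BY pickup_date;

-- Load data from an Iceberg table in object storage into MergeTree.
INSERT INTO trips_local
SELECT pickup_date, fare
FROM iceberg('https://example-bucket.s3.amazonaws.com/warehouse/trips/');
```

Once loaded, dashboards and operational queries hit the MergeTree copy, while the lake remains the system of record.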

Capabilities

Read data directly

ClickHouse provides table functions and engines for reading open table formats directly on object storage. Functions such as iceberg(), deltaLake(), hudi(), and paimon() allow users to query lake format tables from within a SQL statement without any prior configuration. Versions of these functions exist for most common object stores, such as S3, Azure Blob Storage, and GCS. These functions also have equivalent table engines, which can be used to create tables within ClickHouse that reference the underlying lake format tables in object storage, making querying more convenient.
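Both styles are shown below for Iceberg on S3; the bucket path is a placeholder, and the same pattern applies to deltaLake(), hudi(), and paimon():

```sql
-- One-off query via the table function; no setup required.
SELECT count()
FROM iceberg('https://example-bucket.s3.amazonaws.com/warehouse/events/');

-- Or register the same data as a ClickHouse table via the engine,
-- so subsequent queries can reference it by name.
CREATE TABLE events_lake
ENGINE = IcebergS3('https://example-bucket.s3.amazonaws.com/warehouse/events/');

SELECT count() FROM events_lake;
```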

See our getting started guide for querying directly or for connecting to a data catalog.

Expose catalogs as databases

Using the DataLakeCatalog database engine, users can connect ClickHouse to an external catalog and expose it as a database. Tables registered in the catalog appear as tables within ClickHouse, enabling the full range of ClickHouse SQL syntax and analytical functions to be used transparently. This means users can query, join, and aggregate across catalog-managed tables as if they were native ClickHouse tables, benefiting from ClickHouse's query optimization, parallel execution, and reading capabilities.
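As a hedged sketch, connecting to an Iceberg REST catalog might look like the following. The endpoint, warehouse name, and table identifier are placeholders; the exact settings depend on your catalog and are covered in the guides below:

```sql
-- Expose an external catalog as a ClickHouse database
-- (endpoint and settings are hypothetical examples).
CREATE DATABASE lake
ENGINE = DataLakeCatalog('http://rest-catalog:8181/v1')
SETTINGS catalog_type = 'rest', warehouse = 'demo';

-- Tables registered in the catalog now appear as tables in the database.
SHOW TABLES FROM lake;

-- Catalog tables are typically namespaced, so the name may need backticks.
SELECT count() FROM lake.`db.events`;
```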

Supported catalogs include:

Catalog                     Guide
AWS Glue                    Glue Catalog guide
Databricks Unity Catalog    Unity Catalog guide
Iceberg REST Catalog        REST Catalog guide
Lakekeeper                  Lakekeeper Catalog guide
Project Nessie              Nessie Catalog guide
Microsoft OneLake           OneLake Catalog guide

See the getting started guide for connecting to catalogs.

Write back to open table formats

ClickHouse supports writing data back to open table formats, which is relevant in scenarios such as:

  • Real-time to long-term storage - Data transits through ClickHouse as a real-time analytics layer, and users need to offload results to Iceberg or other formats for durable, cost-effective long-term storage.
  • Reverse ETL - Users perform transformations inside ClickHouse using materialized views or scheduled queries and wish to persist the results into open table formats for consumption by other tools in the data ecosystem.
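A write-back can be expressed as an INSERT into a lake table function. This is a sketch only: write support for open table formats is version-dependent and may require enabling an experimental setting, and the path and columns here are hypothetical:

```sql
-- Iceberg writes may be gated behind an experimental setting,
-- depending on the ClickHouse version in use.
SET allow_experimental_insert_into_iceberg = 1;

-- Persist aggregated results from ClickHouse back to an Iceberg table.
INSERT INTO TABLE FUNCTION
    iceberg('https://example-bucket.s3.amazonaws.com/warehouse/daily_summary/')
SELECT
    toDate(ts) AS day,
    count() AS events
FROM events
GROUP BY day;
```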

See the getting started guide for writing to data lakes.

Next steps

Ready to try it out? The Getting Started guide walks through querying open table formats directly, connecting to a catalog, loading data into MergeTree for fast analytics, and writing results back - all in a single end-to-end workflow.