DuckDB
forze[duckdb] implements the analytics query contract with an embedded
DuckDB engine. Where ClickHouse and BigQuery are
standing warehouses, DuckDB is compute without a server: point it at Parquet,
CSV, Iceberg or Delta files on S3/GCS (or local disk) and it runs the scan
in-process. It's the cheap path for "hard analytics" that don't need
ultra-low-latency serving — and, being Docker-free, an excellent analytics
double in tests.
It is query-only (no ingest, no table management): named, parameterized queries returning typed rows, exactly like the other analytics adapters.
Install¶
uv add 'forze[duckdb]'
No server to run — DuckDB is a library. Reading object storage pulls the
httpfs extension (and iceberg / delta for those formats) on first use.
The client¶
from forze_duckdb import DuckDbClient
duck = DuckDbClient()
The engine is synchronous and in-process; the client runs each query on a dedicated bounded executor (DuckDB releases the GIL, so the event loop stays responsive) with its own cursor, and honors per-query timeouts via interrupt.
The lake source¶
Sources are declared as typed descriptors that compile to scan expressions and auto-load the extensions they need — no hand-written SQL strings:
from forze_duckdb import ParquetSource, IcebergSource, DeltaSource
events = ParquetSource("s3://bucket/events/*.parquet") # also CsvSource, JsonSource
Object-storage credentials render to DuckDB secrets, resolved at startup from the
wired secrets backend via a secret_ref (or supplied inline for
local runs):
from forze.application.contracts.secrets import SecretRef
from forze_duckdb import S3Credentials
creds = S3Credentials(name="lake", secret_ref=SecretRef(path="lake/s3"))
Wire it¶
Each analytics route maps query_keys to DuckDB SQL (referencing a registered
source view or an inline read_parquet(...)), keyed by AnalyticsSpec.name. The
lifecycle step opens the connection, loads extensions, registers credentials and
source views:
from forze.application.execution import DepsRegistry, LifecyclePlan
from forze_duckdb import (
DuckDbAnalyticsConfig,
DuckDbClient,
DuckDbDepsModule,
DuckDbQueryConfig,
ParquetSource,
S3Credentials,
duckdb_lifecycle_step,
)
from forze.application.contracts.secrets import SecretRef
duck = DuckDbClient()
events = DuckDbAnalyticsConfig(
queries={"by_day": DuckDbQueryConfig(
sql="SELECT day, sum(total) AS total FROM events WHERE day >= $since GROUP BY day",
)},
)
deps = DepsRegistry.from_modules(DuckDbDepsModule(client=duck, analytics={"events": events}))
lifecycle = LifecyclePlan.from_steps(
duckdb_lifecycle_step(
object_stores=(S3Credentials(name="lake", secret_ref=SecretRef(path="lake/s3")),),
sources={"events": ParquetSource("s3://bucket/events/*.parquet")},
),
)
For a runnable, Docker-free walkthrough, see the analytics recipe.
What it provides¶
| Contract | Keyed by |
|---|---|
Analytics query (run / run_page / run_cursor / run_chunked) |
AnalyticsSpec.name |
Notes¶
- Query-only. Writing/maintaining tables (Iceberg/Delta compaction, ingest) is an ETL/ops concern, not a domain port — keep it out of the adapter.
- Source styles compose. Register named views (
sources=) and reference them in SQL, or inlineread_parquet('s3://…')in a query — both work. - Iceberg/Delta: point
IcebergSourceat the table's current*.metadata.json(the table-directory form needs aversion-hint.textmany writers omit); for in-place tables read withallow_moved_paths=False. - Concurrency: each query gets its own cursor, so parallel reads don't
serialize. Cap parallelism with
DuckDbConfig(max_concurrent_queries=...). - Cursor pagination is offset-based; a full terminal page reports
has_moreuntil the next (short) page confirms the end.