Statistics

Fabric Data Warehouse supports manual statistics that give the query optimizer accurate cardinality estimates. While Fabric creates automatic statistics at query time (using sampling), manual statistics use FULLSCAN by default — reading all rows for exact histograms — and persist across sessions without needing a query to trigger creation.

This adapter lets you manage manual statistics declaratively using the statistics config option.


When to use manual statistics

Fabric creates automatic statistics for columns used in GROUP BY, JOIN, WHERE, and ORDER BY clauses. Manual statistics add value in three situations:

  • After large data loads — automatic statistics may be stale until the next query triggers a refresh
  • On critical query paths — without manual statistics, the first query after a data change pays the cost of synchronous statistics creation
  • When sampling isn't accurate enough — automatic statistics use sampling; FULLSCAN reads every row

Choosing which columns

Focus on columns in WHERE, JOIN, GROUP BY, and ORDER BY clauses — these are where the optimizer needs cardinality estimates. Columns only in the SELECT list don't benefit from statistics.
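As an illustration, consider a typical query shape against an orders model (table and column names here are just examples):

```sql
-- customer_id (JOIN), order_date (WHERE), and region (GROUP BY) all feed
-- the optimizer's cardinality estimates, so they are worth listing.
-- total_amount appears only inside the aggregate in the SELECT list,
-- so statistics on it add no value.
select c.region, sum(o.total_amount) as revenue
from orders o
join customers c on o.customer_id = c.customer_id
where o.order_date >= '2024-01-01'
group by c.region;
```

For a query like this, statistics=['customer_id', 'order_date'] on the orders model covers the columns the optimizer actually uses.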


Usage

Add statistics to your model's config block. It accepts true (all columns), a single column name, or a list of column names:

models/orders.sql
{{ config(
    materialized='table',
    statistics=['customer_id', 'order_date', 'product_id']
) }}

select
    order_id,
    order_date,
    customer_id,
    product_id,
    total_amount
from {{ source('raw', 'orders') }}
models/orders.sql
{{ config(
    materialized='table',
    statistics=true
) }}

select
    order_id,
    order_date,
    customer_id,
    total_amount
from {{ source('raw', 'orders') }}
dbt_project.yml
models:
  my_project:
    marts:
      +statistics: ['customer_id', 'order_date']
    staging:
      +statistics: true

The generated SQL for specific columns:

CREATE STATISTICS [dbt_stats__fcfc231ef98cf4ae86d587235eda4cd7]
ON [schema].[orders] ([customer_id]) WITH FULLSCAN;

CREATE STATISTICS [dbt_stats__62b326a5adbe56da5634fa1bcf7b579e]
ON [schema].[orders] ([order_date]) WITH FULLSCAN;

On subsequent runs, existing statistics are updated instead of recreated:

UPDATE STATISTICS [schema].[orders] [dbt_stats__fcfc231ef98cf4ae86d587235eda4cd7] WITH FULLSCAN;
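To verify what the adapter has created, you can query the standard catalog views — a sketch, assuming sys.stats, sys.stats_columns, and STATS_DATE are available in your Fabric warehouse (the schema and table names are illustrative):

```sql
-- List dbt-managed statistics on a table, the column each covers,
-- and when each was last updated.
select s.name as stats_name,
       c.name as column_name,
       STATS_DATE(s.object_id, s.stats_id) as last_updated
from sys.stats s
join sys.stats_columns sc
  on s.object_id = sc.object_id and s.stats_id = sc.stats_id
join sys.columns c
  on sc.object_id = c.object_id and sc.column_id = c.column_id
where s.object_id = OBJECT_ID('schema.orders')
  and s.name like 'dbt_stats__%';
```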

Sampling

By default, manual statistics use FULLSCAN (exact counts). For large tables where a full scan is too expensive, use statistics_sample_percent to switch to sampling:

models/orders.sql
{{ config(
    materialized='table',
    statistics=['customer_id', 'order_date'],
    statistics_sample_percent=50
) }}

select ...
dbt_project.yml
models:
  my_project:
    staging:
      +statistics: true
      +statistics_sample_percent: 25

This generates:

CREATE STATISTICS [dbt_stats__fcfc231ef98cf4ae86d587235eda4cd7]
ON [schema].[orders] ([customer_id]) WITH SAMPLE 50 PERCENT;

Works with incremental models

Statistics are created or updated on every run — both full refresh and incremental merge:

models/orders_incremental.sql
{{ config(
    materialized='incremental',
    unique_key='order_id',
    statistics=['customer_id', 'order_date']
) }}

select
    order_id,
    order_date,
    customer_id,
    total_amount
from {{ source('raw', 'orders') }}
{% if is_incremental() %}
where order_date > (select max(order_date) from {{ this }})
{% endif %}

Works with snapshots

Statistics can also be configured on snapshots:

snapshots/schema.yml
snapshots:
  - name: orders_snapshot
    config:
      statistics: ['customer_id', 'dbt_valid_from', 'dbt_valid_to']

statistics: true vs explicit column list

statistics: true creates statistics on every column in the table. This is convenient but has trade-offs:

  • Convenience — statistics: true needs no maintenance; an explicit list must be updated when columns change
  • Performance — statistics: true runs a full scan per column, which is slow on wide, large tables; an explicit list only scans the columns you specify
  • Value — statistics: true covers columns that don't affect query plans; an explicit list targets columns the optimizer actually uses

Rule of thumb: use true for small-to-medium tables or when simplicity matters. Use an explicit list for large tables where you want to minimize post-create overhead.


Naming convention

Statistics are named dbt_stats__<md5_hash> where the hash is computed from <table>__<column> (e.g., dbt_stats__fcfc231ef98cf4ae86d587235eda4cd7 for column customer_id on table orders). The dbt_stats__ prefix avoids collisions with Fabric's auto-generated _WA_Sys_* statistics, and hashing guarantees uniqueness regardless of identifier length.


Limitations

  • Fabric Data Warehouse only — FabricSpark (Lakehouse) uses a different statistics mechanism (ANALYZE TABLE) and is not supported by this config option
  • Single-column statistics only — Fabric Data Warehouse does not support multi-column statistics
  • One scan method per model — all columns in a model use the same FULLSCAN or SAMPLE setting; for per-column control, use post-hooks
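The post-hook approach mentioned above could look like this — a sketch with illustrative statistic names, not the adapter's managed mechanism. Note that a plain CREATE STATISTICS fails if the statistic already exists, so this naive form suits full-refresh tables:

```sql
-- models/orders.sql — per-column scan control via post-hooks:
-- FULLSCAN on the join key, sampling on the (cheaper) date filter.
{{ config(
    materialized='table',
    post_hook=[
        "CREATE STATISTICS [stats_customer_id] ON {{ this }} ([customer_id]) WITH FULLSCAN",
        "CREATE STATISTICS [stats_order_date] ON {{ this }} ([order_date]) WITH SAMPLE 10 PERCENT"
    ]
) }}

select ...
```

Unlike the statistics config, these hooks do not switch to UPDATE STATISTICS on subsequent runs; you would manage that yourself.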