Lakehouse (Spark SQL)¶
The `dbt-fabric-samdebruyn` adapter supports Fabric Lakehouse via the `fabricspark` adapter type. This uses Spark SQL as the query language and connects to Fabric through the Livy API -- an HTTP REST interface for submitting Spark statements.
Getting started¶
Installation¶
Install the adapter with the `[spark]` extra:
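With pip, the command is presumably of this shape (package name taken from above; the extra is quoted for shell safety):

```shell
pip install "dbt-fabric-samdebruyn[spark]"
```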
This installs `dbt-spark` as a dependency.
Configuration¶
Add a FabricSpark profile to your `profiles.yml`:
```yaml
default:
  target: dev
  outputs:
    dev:
      type: fabricspark
      workspace: your workspace name
      database: name_of_your_lakehouse
      schema: dbt
```
The `workspace` (or `workspace_id`) is always required for FabricSpark -- the adapter uses it to resolve the Livy API endpoint. The `database` field is the name of your Lakehouse.
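If you prefer to pin the workspace by ID rather than by display name, the profile presumably takes the ID form instead -- a sketch, with a placeholder GUID:

```yaml
dev:
  type: fabricspark
  workspace_id: 00000000-0000-0000-0000-000000000000  # placeholder GUID
  database: name_of_your_lakehouse
  schema: dbt
```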
For all configuration options, see the configuration reference.
Authentication¶
The authentication methods documented in the authentication guide work with both adapter types. When following examples there, substitute `type: fabricspark` where the examples show `type: fabric`. Note that `ActiveDirectoryIntegrated` and `ActiveDirectoryPassword` are Data Warehouse-only methods and do not work with FabricSpark.
The FabricSpark adapter does not use the `host` option -- it always resolves the Livy endpoint from the workspace and lakehouse information.
How it works¶
The FabricSpark adapter executes all SQL through Fabric's high-concurrency Livy API. Each dbt thread gets its own REPL inside a shared underlying Livy session. Here is the execution flow:
```mermaid
sequenceDiagram
    participant dbt
    participant Adapter
    participant HC Livy API
    participant Spark Session
    dbt->>Adapter: Compiled Spark SQL
    Adapter->>HC Livy API: POST /highConcurrencySessions (acquire REPL)
    alt Underlying session exists (warm)
        HC Livy API-->>Adapter: HC session ID + REPL ID
    else No underlying session
        Note over HC Livy API,Spark Session: Spark startup
        HC Livy API-->>Adapter: HC session ID + REPL ID
    end
    Adapter->>HC Livy API: POST /highConcurrencySessions/{id}/repls/{replId}/statements
    HC Livy API->>Spark Session: Execute Spark SQL (in REPL)
    loop Poll every 3 seconds
        Adapter->>HC Livy API: GET /highConcurrencySessions/{id}/repls/{replId}/statements/{stmtId}
        HC Livy API-->>Adapter: Status + results (when done)
    end
    Adapter-->>dbt: Parsed results
```
Key technical details:

- One REPL per thread -- Each dbt thread acquires its own REPL inside a shared underlying Livy session. Statements from different REPLs execute in parallel.
- Deterministic session tag -- The adapter computes a session tag from `(workspace_id, lakehouse_id)`. Fabric packs all REPLs with the same tag onto one underlying Livy session, enabling warm session reuse across dbt invocations.
- Polling interval -- The adapter polls for statement completion every 3 seconds.
- Rate limiting -- The Fabric Livy API enforces rate limits. The adapter handles HTTP 429 responses automatically using the `Retry-After` header.
- DB-API 2.0 cursor -- Results are returned as JSON and parsed into a PEP 249-compatible cursor, so dbt interacts with the Lakehouse the same way it interacts with any other database.
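The submit-then-poll loop from the diagram can be sketched like this. This is an illustration, not the adapter's code: `fetch_state` stands in for the `GET .../statements/{stmtId}` call, and the state names mirror common Livy statement states.

```python
import time

def wait_for_statement(fetch_state, poll_interval=3.0, timeout=600.0):
    """Poll a Livy statement until it completes.

    fetch_state is a callable returning the statement's current state
    string (e.g. "waiting", "running", "available"), standing in for
    the GET .../statements/{stmtId} call the adapter makes.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state = fetch_state()
        if state == "available":  # statement finished; results are ready
            return state
        if state in ("error", "cancelled"):
            raise RuntimeError(f"statement ended in state {state!r}")
        time.sleep(poll_interval)  # the adapter uses a 3-second interval
    raise TimeoutError("statement did not complete in time")
```

The 3-second interval explains the latency floor discussed under "Performance considerations": even an instant query is observed only on the next poll.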
Materializations¶
Default materialization: materialized_view¶
Unlike most dbt adapters, where the default materialization is `view` or `table`, the FabricSpark adapter defaults to `materialized_view`. This creates Fabric lake views, a Fabric-specific concept.
Lake views support:
- `CREATE OR REPLACE` semantics
- `PARTITIONED BY` clauses
- `TBLPROPERTIES` (Spark table properties)
- `CHECK` constraints with `ON MISMATCH` behavior
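Put together, the DDL the adapter emits for a lake view plausibly looks something like the following sketch -- the object and column names are hypothetical, and the exact clause order may differ from what the adapter generates:

```sql
-- Illustrative Fabric lake view combining the features listed above.
CREATE OR REPLACE MATERIALIZED LAKE VIEW dbt.orders_summary
(
    CONSTRAINT valid_amount CHECK (total_amount >= 0) ON MISMATCH DROP
)
PARTITIONED BY (order_date)
TBLPROPERTIES ('delta.autoOptimize.optimizeWrite' = 'true')
AS
SELECT order_id, order_date, total_amount
FROM dbt.stg_orders;
```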
Supported materializations¶
| Materialization | Supported | Notes |
|---|---|---|
| `materialized_view` | Yes | Default. Creates a Fabric lake view. |
| `table` | Yes | Creates a managed Delta table. |
| `view` | No | Fabric Lakehouse with schemas does not support Spark SQL views. |
| `incremental` | Yes | Supports `append` and `insert_overwrite` strategies. `merge` and `delete+insert` are not supported. |
| `ephemeral` | Yes | Standard CTE-based ephemeral models. |
No Spark SQL views
Fabric Lakehouse with schemas enabled does not support Spark SQL views. If a model or package uses `materialized='view'`, you will see the error `'view' is not a valid FabricSparkRelationType`. Change the materialization to `materialized_view` or `table`.
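For models you don't own (e.g. from an installed package), the standard dbt way to force a supported materialization is a project-level override -- a sketch, with a hypothetical package name:

```yaml
# dbt_project.yml
models:
  some_package:          # hypothetical package whose models default to view
    +materialized: table # or materialized_view
```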
Identifier quoting¶
FabricSpark uses backticks (`` ` ``) for identifier quoting, following Spark SQL conventions. This differs from the Data Warehouse adapter, which uses T-SQL brackets (`[]`).
```sql
-- FabricSpark (Spark SQL)
SELECT `my column` FROM `my_schema`.`my_table`

-- Fabric Data Warehouse (T-SQL)
SELECT [my column] FROM [my_schema].[my_table]
```
High-concurrency Livy¶
The adapter uses Fabric's high-concurrency Livy API. Each dbt thread acquires its own HC session -- and therefore its own REPL -- inside a single underlying Livy session shared via a deterministic `sessionTag` derived from `(workspace_id, lakehouse_id)`. Statements from different REPLs execute in parallel inside the same Spark application, so increasing `threads` in your profile directly increases throughput.
Session reuse across runs¶
The session tag is deterministic: every dbt invocation targeting the same workspace + lakehouse produces the same tag. Fabric snap-attaches new REPLs onto the still-warm underlying Livy session, skipping the Spark cold-start entirely on subsequent runs.
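Conceptually, the tag derivation amounts to hashing the identity pair; the adapter's actual derivation may differ, but the property that matters is captured in this sketch: the same inputs always produce the same tag.

```python
import hashlib

def session_tag(workspace_id: str, lakehouse_id: str) -> str:
    """Illustrative deterministic tag from (workspace_id, lakehouse_id).

    Not the adapter's real algorithm -- only a demonstration that a pure
    function of the identity pair yields a stable tag, letting Fabric
    attach new REPLs to the still-warm session from a previous run.
    """
    key = f"{workspace_id}:{lakehouse_id}".encode()
    return "dbt-" + hashlib.sha256(key).hexdigest()[:16]
```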
threads > 5¶
Fabric packs up to 5 REPLs onto one underlying Livy session (see the HC Livy key concepts). With `threads > 5`, dbt still works correctly -- Fabric spins up a second underlying Livy session to host the 6th REPL onwards.
| Property | Shared across underlying sessions? |
|---|---|
| OneLake Delta tables (dbt model outputs) | Yes -- same lakehouse storage |
| Catalog / metastore (`SELECT FROM <other_model>`) | Yes -- same Fabric catalog |
| Temp views (`CREATE TEMPORARY VIEW ...`) | No -- REPL/session-local |
| Session-level Spark configs (`SET spark.sql.X = ...`) | No |
| Cached datasets / UDFs / broadcast vars | No |
Because dbt-fabricspark materializations always write permanent Delta / lake view objects, model-to-model refs resolve correctly regardless of which underlying session produced or consumes the table.
Cost tradeoff
Each additional underlying Livy session is a separate Spark cluster billed for the duration of the run plus the idle timeout. Keep `threads` ≤ 5 for the cheapest profile; raise it only when the extra parallelism beats the extra compute spend.
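In profile terms, the cheapest configuration keeps every REPL on one underlying session -- a sketch, reusing the profile shape from the configuration section:

```yaml
dev:
  type: fabricspark
  workspace: your workspace name
  database: name_of_your_lakehouse
  schema: dbt
  threads: 5  # at the 5-REPL capacity of one underlying Livy session
```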
Performance considerations¶
The Livy API architecture has inherent performance characteristics that are important to understand.
Session startup¶
Creating a new Spark session takes some time. The adapter reuses sessions within a run, so this overhead is paid once per dbt run. Subsequent runs may reuse an existing session if it is still alive. The high-concurrency Livy session tag is deterministic, so subsequent runs can skip startup entirely by reattaching to a warm session.
Statement execution¶
Each SQL statement involves multiple HTTP API calls (submit + poll). This is inherently slower than a direct database connection like the TDS protocol used by the Data Warehouse adapter. Statements from different threads execute in parallel via high-concurrency Livy, significantly improving wall-clock time for multi-model runs.
Polling overhead¶
The adapter polls for statement completion every 3 seconds. Even very fast queries take at least 3 seconds to return.
API rate limiting¶
Fabric applies rate limits to the Livy API. The adapter handles HTTP 429 responses automatically by respecting the `Retry-After` header. During heavy workloads, you may see pauses of 5-30 seconds between statements.
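The retry behavior described above can be sketched as follows; `do_request` is a hypothetical stand-in for the adapter's HTTP layer, returning an object with `status_code` and `headers`:

```python
import time

def send_with_retry(do_request, max_attempts=5):
    """Illustrative 429 handling: back off by the server-provided delay.

    do_request is any callable returning a response-like object with
    .status_code and .headers -- a stand-in, not the adapter's real API.
    """
    for _ in range(max_attempts):
        response = do_request()
        if response.status_code != 429:
            return response
        # Honor the Retry-After header; fall back to a short default wait.
        wait = float(response.headers.get("Retry-After", 5))
        time.sleep(wait)
    raise RuntimeError("rate limited: retries exhausted")
```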
Practical impact¶
A dbt run with many models will be significantly slower on FabricSpark than on Fabric Data Warehouse. This is inherent to the Livy API architecture, not a limitation of the adapter. High-concurrency Livy reduces this gap by running statements in parallel.
Recommendations¶
- Use higher thread counts to parallelize model execution and amortize the per-statement overhead. However, higher parallelism also increases API call volume, which can trigger rate limiting sooner.
- Keep models as consolidated as possible to reduce the total number of statements.
- Monitor the Spark session in the Fabric monitoring hub to understand execution patterns.
Differences from Fabric Data Warehouse¶
| Concept | Data Warehouse (`fabric`) | Lakehouse (`fabricspark`) |
|---|---|---|
| SQL dialect | T-SQL | Spark SQL |
| Connection | mssql-python (TDS protocol) | Livy sessions (HTTP REST) |
| Identifier quoting | `[brackets]` | `` `backticks` `` |
| Default materialization | `table` | `materialized_view` (lake view) |
| Views | Supported | Not supported |
| String type | `varchar(MAX)` | `string` |
| Timestamp type | `datetime2(6)` | `timestamp` |
| Pagination | `SELECT TOP N` | `LIMIT N` |
| Catalog queries | `sys.tables`, `sys.columns` | `SHOW TABLES`, `DESCRIBE` |
| Python models | Via Livy + separate `lakehouse` config | Native (same Livy session) |
| MERGE incremental | Supported | Not supported |
| CLUSTER BY | Supported | Not supported |
| Warehouse snapshots | Supported | Not supported |
| Catalog statistics | Supported | Not supported |
Python models¶
Python models work differently depending on the adapter type:
- `type: fabric` (Data Warehouse): Python models use Livy to execute PySpark code that reads from and writes to the Data Warehouse via the `synapsesql` connector. You need both a `lakehouse` (for the Livy session) and a `database` (the DW target).
- `type: fabricspark` (Lakehouse): Python models run on the same Livy session that handles all SQL models. The `database` field IS the lakehouse. No separate `lakehouse` config is needed.
See the Python models guide for writing and debugging Python models.
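A minimal Python model for `type: fabricspark` follows the standard dbt shape -- a `model(dbt, session)` function returning a DataFrame. Model and column names below are hypothetical:

```python
# models/orders_complete.py -- illustrative dbt Python model sketch.
def model(dbt, session):
    dbt.config(materialized="table")
    # dbt.ref resolves another model; on fabricspark this is a Spark
    # DataFrame running on the same shared Livy session as SQL models.
    orders = dbt.ref("stg_orders")
    return orders.filter(orders.status == "complete")
```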
Limitations¶
- No Spark SQL views -- only tables and materialized lake views (Fabric lake views) are supported.
- No incremental merge strategy -- the Spark SQL `MERGE` syntax in Fabric Lakehouse is not supported by the adapter. Use `append` or `insert_overwrite` instead.
- API rate limiting -- can slow down large runs with many models.
- Session startup time -- creating a new Spark session adds latency to the first statement in a run.
- Data Warehouse-only features -- `CLUSTER BY`, warehouse snapshots, and catalog statistics are not available for Lakehouse.
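Given the unsupported `merge` strategy, an incremental model on FabricSpark uses one of the supported strategies -- a sketch with hypothetical model and column names:

```sql
-- models/events_incremental.sql
{{ config(
    materialized='incremental',
    incremental_strategy='append'
) }}

SELECT event_id, event_time, payload
FROM {{ ref('stg_events') }}
{% if is_incremental() %}
  -- Only load rows newer than what is already in the target table.
  WHERE event_time > (SELECT MAX(event_time) FROM {{ this }})
{% endif %}
```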
Troubleshooting¶
| Symptom | Cause | Fix |
|---|---|---|
| `Livy session did not become idle in time` | Session startup took too long | Increase `spark_session_timeout`, retry, or check Fabric capacity |
| HTTP 429 in logs | API rate limiting | Automatic -- the adapter retries. Reduce `threads` if excessive. |
| `'view' is not a valid FabricSparkRelationType` | Model uses `view` materialization | Change to `materialized_view` or `table` |
| Slow execution | Livy polling overhead | Expected behavior. Use a higher thread count. |
| Statement timeout | Long-running Spark query | Increase `query_timeout` |
| `Either workspace_id or workspace_name must be provided` | Missing workspace configuration | Add `workspace` or `workspace_id` to your profile |
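If timeouts recur, the two options named in the table are raised in the profile -- a sketch, assuming they sit alongside the connection settings in the output block and take values in seconds (both assumptions; check the configuration reference):

```yaml
dev:
  type: fabricspark
  workspace: your workspace name
  database: name_of_your_lakehouse
  schema: dbt
  spark_session_timeout: 900  # assumed placement and unit
  query_timeout: 600          # assumed placement and unit
```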