
From SQL to Lineage Graph

Overview

You write SQL files. clgraph reads them once and builds a complete lineage graph showing how your data flows - at both the table level and the column level.

No configuration. No annotations. Just point to your SQL.

from clgraph import Pipeline

# Point to your SQL files
pipeline = Pipeline.from_sql_files("examples/sql_files/", dialect="bigquery")

# That's it. Complete graph built automatically.
# - Table dependencies: pipeline.table_graph
# - Column lineage: pipeline.columns, pipeline.edges

The Three-Step Process

1. Your SQL Files (Input)

You have SQL files that create tables, transform data, join datasets:

-- queries/01_raw_orders.sql
CREATE TABLE raw.orders AS
SELECT order_id, customer_id, amount, status
FROM external.orders;

-- queries/02_staging_orders.sql
CREATE TABLE staging.orders AS
SELECT
    customer_id,
    SUM(amount) as total_amount,
    COUNT(*) as order_count
FROM raw.orders
WHERE status = 'completed'
GROUP BY customer_id;

-- queries/03_customer_metrics.sql
CREATE TABLE analytics.customer_metrics AS
SELECT
    customer_id,
    total_amount,
    order_count,
    total_amount / order_count as avg_order_value
FROM staging.orders;

View Pipeline Structure:

from clgraph import Pipeline

pipeline = Pipeline.from_sql_files("examples/sql_files/", dialect="bigquery")

# View the pipeline structure with query units
print(pipeline)
# Pipeline(
#   01_raw_orders: CREATE TABLE raw.orders AS SELECT order_id, customer_...
#     main
#   02_staging_orders: CREATE TABLE staging.orders AS SELECT     customer_id,...
#     main
#   03_customer_metrics: CREATE TABLE analytics.customer_metrics AS SELECT   ...
#     main
# )

Each query shows its query units - the internal structure of the query:

  • main - the main SELECT statement
  • cte:name - Common Table Expressions (CTEs)
  • subq:N - Subqueries

Example with CTEs:

from clgraph import Pipeline

sql_with_ctes = """
CREATE TABLE customer_summary AS
WITH monthly_sales AS (
    SELECT customer_id, DATE_TRUNC(order_date, MONTH) as month, SUM(amount) as revenue
    FROM orders
    GROUP BY 1, 2
),
top_customers AS (
    SELECT customer_id, month, revenue
    FROM monthly_sales
    WHERE revenue > 1000
)
SELECT c.name, t.month, t.revenue
FROM top_customers t
JOIN customers c ON t.customer_id = c.id
"""

pipeline = Pipeline.from_sql_string(sql_with_ctes, dialect="bigquery")
print(pipeline)
# Pipeline(
#   select: CREATE TABLE customer_summary AS WITH monthly_sales AS (...
#     cte:monthly_sales
#     cte:top_customers
#     main
# )
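
Subqueries get their own units as well. Here's a minimal sketch (the subq:0 label in the expected output is illustrative; exact unit naming may vary by clgraph version):

from clgraph import Pipeline

sql_with_subquery = """
CREATE TABLE big_spenders AS
SELECT customer_id, total
FROM (
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    GROUP BY customer_id
)
WHERE total > 1000
"""

pipeline = Pipeline.from_sql_string(sql_with_subquery, dialect="bigquery")
print(pipeline)
# Expect a main unit plus a subquery unit, e.g.:
#   subq:0
#   main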

2. SQL → AST (Parsing)

When you call Pipeline.from_sql_files(), clgraph uses sqlglot to parse each SQL statement into an AST (Abstract Syntax Tree).

What's an AST? A structured representation of your SQL that a program can understand.

# Your SQL (what you write)
"SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id"

# AST (what the parser sees)
SELECT
  ├─ customer_id (column reference)
  └─ SUM(amount) (aggregate function)
      └─ amount (column reference)
FROM
  └─ orders (table reference)
GROUP BY
  └─ customer_id (grouping key)

The AST contains everything:

  • Table names (raw.orders, staging.orders)
  • Column names (customer_id, amount, total_amount)
  • Transformations (SUM, COUNT, division)
  • Relationships (WHERE, JOIN, GROUP BY)
  • Expressions (total_amount / order_count)

You don't need to understand ASTs. clgraph handles all of this automatically.
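
If you're curious anyway, you can inspect the AST yourself with sqlglot (the parser clgraph uses); a minimal sketch:

import sqlglot

ast = sqlglot.parse_one(
    "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id",
    dialect="bigquery",
)

# repr() shows the nested expression tree sqlglot builds
print(repr(ast))

# Round-trip the tree back to SQL text
print(ast.sql(dialect="bigquery"))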


3. AST → Lineage Graph (Graph Building)

clgraph walks through each AST and extracts:

Table-Level Graph

graph TD
    A[raw.orders] --> B[staging.orders]
    B --> C[analytics.customer_metrics]

What we extract:

  • Source tables: external.orders
  • Created tables: raw.orders, staging.orders, analytics.customer_metrics
  • Dependencies: staging.orders depends on raw.orders

Column-Level Graph

graph LR
    A1[raw.orders.order_id] -.dropped.-> N1[ ]
    A2[raw.orders.customer_id] --> B2[staging.orders.customer_id] --> C2[analytics.customer_metrics.customer_id]
    A3[raw.orders.amount] -->|SUM| B3[staging.orders.total_amount] --> C3[analytics.customer_metrics.total_amount]
    B3 -->|DIVIDE| C4[analytics.customer_metrics.avg_order_value]
    A3 -->|COUNT| B4[staging.orders.order_count] --> C5[analytics.customer_metrics.order_count]
    B4 -->|DIVIDE| C4
    A5[raw.orders.status] -.filtered.-> N2[ ]

What we extract:

  • Column flows: amount → SUM(amount) → total_amount
  • Transformations: SUM(), COUNT(), DIVIDE()
  • Dropped columns: order_id (not in final output)
  • Filtered columns: status (used in WHERE but not stored)
  • Calculated columns: avg_order_value = total_amount / order_count
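
You can confirm any edge in the diagram above with the tracing API covered below. For instance, tracing staging.orders.total_amount backward to its source (a minimal sketch, reusing the pipeline built in the overview; the printed result is illustrative):

# total_amount should trace back to raw.orders.amount (the SUM input)
sources = pipeline.trace_column_backward('staging.orders', 'total_amount')
print([s.full_name for s in sources])
# Illustrative output: ['raw.orders.amount']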


What's In The Graph

Table Graph

# Access table-level dependencies (pipeline built in the overview above)
table_graph = pipeline.table_graph

# Get all tables
print(table_graph.tables)
# Output: ['raw.orders', 'staging.orders', 'analytics.customer_metrics']

# Get dependencies for a specific table
deps = table_graph.get_dependencies('analytics.customer_metrics')
print(f"Dependencies: {deps}")

# Get execution order (topological sort)
execution_order = table_graph.get_execution_order()
print(f"Execution order: {execution_order}")

Column Lineage Graph

# Access column-level lineage directly from Pipeline (pipeline built in the overview above)

# Get all columns in a table
table_columns = [col for col in pipeline.columns.values()
                 if col.table_name == 'analytics.customer_metrics']
print(f"Columns in analytics.customer_metrics: {[c.column_name for c in table_columns]}")

# Trace column backward to sources
sources = pipeline.trace_column_backward(
    'analytics.customer_metrics',
    'avg_order_value'
)
print(f"Sources for avg_order_value: {[s.full_name for s in sources]}")

# Trace column forward to downstream
affected = pipeline.trace_column_forward(
    'raw.orders',
    'amount'
)
print(f"Affected by raw.orders.amount: {[a.full_name for a in affected]}")

Why The AST Matters

The AST contains everything your SQL already describes:

Example: SUM Aggregation

SELECT customer_id, SUM(amount) as total_amount
FROM raw.orders
GROUP BY customer_id

What the AST tells us:

  • total_amount is a new column (not in source)
  • It's created by SUM(amount) (aggregation function)
  • Source column: raw.orders.amount
  • Transformation type: SUM
  • Grouping key: customer_id

Without parsing: You'd have to read the SQL manually and guess what happens.

With AST parsing: We extract exact relationships automatically.
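
To make this concrete, here's a minimal sketch of that kind of extraction using sqlglot directly (illustrative only, not clgraph's internal code):

import sqlglot
from sqlglot import exp

ast = sqlglot.parse_one(
    "SELECT customer_id, SUM(amount) AS total_amount "
    "FROM raw.orders GROUP BY customer_id",
    dialect="bigquery",
)

# Every alias in the SELECT list defines a new output column
for alias in ast.find_all(exp.Alias):
    print(alias.alias, "comes from", alias.this.sql())
# total_amount comes from SUM(amount)

# The source column(s) feeding the aggregate
sum_node = ast.find(exp.Sum)
print([col.name for col in sum_node.find_all(exp.Column)])
# ['amount']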

Example: Calculated Column

SELECT
    total_amount,
    order_count,
    total_amount / order_count as avg_order_value
FROM staging.orders

What the AST tells us:

  • avg_order_value depends on two columns: total_amount and order_count
  • Transformation: DIVIDE operation
  • Both source columns flow into this calculation

This is why clgraph can trace backward and say:

avg_order_value comes from:
  - total_amount (which comes from SUM(raw.orders.amount))
  - order_count (which comes from COUNT(*) on raw.orders)
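
The same holds for the division: the expression node itself names both inputs. A minimal sqlglot sketch (again illustrative, not clgraph internals):

import sqlglot
from sqlglot import exp

ast = sqlglot.parse_one(
    "SELECT total_amount / order_count AS avg_order_value FROM staging.orders"
)

# The Div node carries both operands, i.e. both source columns
div = ast.find(exp.Div)
print([col.name for col in div.find_all(exp.Column)])
# ['total_amount', 'order_count']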


Supported SQL Dialects

clgraph uses sqlglot, which supports many SQL dialects:

  • BigQuery
  • Snowflake
  • PostgreSQL
  • DuckDB
  • Redshift
  • Spark SQL
  • And many more

Just specify the dialect when creating the pipeline:

# BigQuery
pipeline = Pipeline.from_sql_files("examples/sql_files/", dialect="bigquery")

# Snowflake
pipeline = Pipeline.from_sql_files("examples/sql_files/", dialect="snowflake")

# DuckDB
pipeline = Pipeline.from_sql_files("examples/sql_files/", dialect="duckdb")

What Gets Extracted

From your SQL, clgraph automatically extracts:

Tables

  • Source tables (FROM clause)
  • Created tables (CREATE TABLE statements)
  • Joined tables (JOIN clauses)
  • CTEs (WITH clauses)

Columns

  • Selected columns (SELECT clause)
  • Renamed columns (AS aliases)
  • Calculated columns (expressions)
  • Aggregated columns (SUM, AVG, COUNT, etc.)
  • Joined columns (JOIN conditions)
  • Filtered columns (WHERE clause)

Transformations

  • Aggregate functions: SUM, COUNT, AVG, MAX, MIN
  • Window functions: ROW_NUMBER, RANK, LAG, LEAD
  • Conditional logic: CASE WHEN
  • Mathematical operations: +, -, *, /
  • String operations: CONCAT, SUBSTRING, UPPER, LOWER
  • Type conversions: CAST, SAFE_CAST

Relationships

  • Table dependencies (which table depends on which)
  • Column flows (which column comes from which source)
  • Join relationships (how tables connect)
  • Filter conditions (WHERE clauses)
  • Group by keys (aggregation dimensions)
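
Here's a query that touches several of these categories at once; a minimal sketch (the table and column names are invented for illustration):

from clgraph import Pipeline

sql = """
CREATE TABLE analytics.order_flags AS
SELECT
    customer_id,
    CASE WHEN amount > 100 THEN 'large' ELSE 'small' END AS size_bucket,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS order_rank,
    CAST(amount AS FLOAT64) AS amount_float
FROM raw.orders
WHERE status = 'completed'
"""

pipeline = Pipeline.from_sql_string(sql, dialect="bigquery")

# The conditional, window, and cast columns, plus the WHERE filter on
# status, all land in the graph
print(pipeline)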

Example: Complete Flow

Let's trace one column through the entire pipeline:

SQL

-- Step 1
CREATE TABLE raw.orders AS
SELECT order_id, customer_id, amount, status FROM external.orders;

-- Step 2
CREATE TABLE staging.orders AS
SELECT
    customer_id,
    SUM(amount) as total_amount
FROM raw.orders
WHERE status = 'completed'
GROUP BY customer_id;

-- Step 3
CREATE TABLE analytics.customer_metrics AS
SELECT
    customer_id,
    total_amount,
    total_amount * 1.1 as total_with_tax
FROM staging.orders;

Lineage Trace: analytics.customer_metrics.total_with_tax

from clgraph import Pipeline

# Create pipeline from the SQL above
sql_queries = [
    ("01_raw", """
        CREATE TABLE raw.orders AS
        SELECT order_id, customer_id, amount, status FROM external.orders
    """),
    ("02_staging", """
        CREATE TABLE staging.orders AS
        SELECT
            customer_id,
            SUM(amount) as total_amount
        FROM raw.orders
        WHERE status = 'completed'
        GROUP BY customer_id
    """),
    ("03_analytics", """
        CREATE TABLE analytics.customer_metrics AS
        SELECT
            customer_id,
            total_amount,
            total_amount * 1.1 as total_with_tax
        FROM staging.orders
    """)
]

example_pipeline = Pipeline.from_tuples(sql_queries, dialect="bigquery")

# Trace the column backward to its sources
sources = example_pipeline.trace_column_backward(
    'analytics.customer_metrics',
    'total_with_tax'
)

print(f"total_with_tax has {len(sources)} source columns:")
for source in sources:
    print(f"  - {source.full_name}")

Complete provenance in 0.003 seconds.


Key Takeaways

  1. You write SQL - Just normal SQL files, no special syntax
  2. We parse to AST - Using sqlglot to understand structure
  3. We build the graph - Extracting tables, columns, transformations
  4. You get lineage - Complete table and column-level tracking

No configuration. No annotations. No manual mapping.

Your SQL already describes the lineage. We just extract it.


Next Steps