
Why Column Lineage Should Be Built-In, Not Bolted-On

In my first few data engineering jobs, I spent hours, sometimes days, debugging data quality issues. My workflow was always the same: dig through SQL queries, check the information schema, scan query logs, and try to piece together what went wrong. For impact analysis, I'd do the same thing in reverse: trace forward through the code to see what might break.

I found myself drawing diagrams on paper to visualize the relationships between columns. Sometimes within a single query with complex CTEs, sometimes across multiple queries in a pipeline. These hand-drawn graphs were the only way I could keep track of how data flowed.


The Moment Everything Changed

One day, I learned how Python's AST (Abstract Syntax Tree) works—how it converts code into a tree, which is really just a special kind of graph. When I visualized a Python AST, it looked remarkably similar to the diagrams I'd been drawing on paper for my SQL queries.
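That similarity is easy to see with Python's standard-library ast module; here's a minimal illustration (the expression is just an example):

```python
import ast

# Parse a one-line assignment and dump its tree structure.
tree = ast.parse("total = price * quantity")
print(ast.dump(tree, indent=2))
```

The nested Assign → BinOp → Name nodes are exactly the kind of dependency structure I was sketching by hand.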

I started wondering: is there an AST for SQL?

That question led me to sqlglot. The structure I'd been manually sketching was already there in the SQL itself. I just needed a parser to extract it.

I started feeding my queries into sqlglot to see how the AST looked. It became my secret weapon for SQL-related problems. Instead of manually tracing through CTEs and joins, I could see the structure directly.

As we all know now, static analysis has its own limitation: it won't go to the database to look up information, so the source of each column has to be specified precisely in the SQL itself. Sometimes I needed to tweak my queries so the AST produced a diagram that matched reality. But that was a small price to pay, because making queries work with static analysis also makes them easier to understand in the first place.


From Secret Weapon to Solved Problem

Fast forward a few years. I had a job that required converting thousands of Stata scripts (yes, Stata) into a BigQuery pipeline with hundreds of tables. Let's not debate whether this was a good idea—it was a job requirement.

I started writing SQL in a way that let the AST construct proper lineage graphs. And something clicked: I didn't need to switch between the database, code, and query logs anymore. I could construct the complete column lineage graph from the SQL alone.

I could trace relationships between columns before execution. I could see how changes to a column in one table would affect downstream tables—before deploying anything. Suddenly, column lineage wasn't a mystery to reverse-engineer. It was a solved problem.


Two Approaches to Column Lineage

Approach 1: Runtime Lineage (What Most Teams Do)

Most column lineage is extracted via the following flow:

Write SQL → Deploy to database → Track lineage at runtime → Query lineage

Usually, there are three options.

Option 1: Database Engine Lineage

Major data warehousing vendors provide lineage APIs:

  • BigQuery: INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
  • Snowflake: ACCOUNT_USAGE.ACCESS_HISTORY
  • Databricks: Unity Catalog lineage APIs

Option 2: Third-Party Tools

Deploy SQL and let external tools scan query logs (Monte Carlo, Atlan, Collibra, Alation, etc.).

Option 3: Manually Trace

Manually trace table schemas and query logs (useful if the first two options aren't available, or those tools don't fully solve your problem).

Problems with runtime lineage:

All three approaches share the same issues. Lineage only exists after execution: you have to run the query (sometimes in production) before you can get lineage, so query dependencies cannot be validated before deployment. On top of that, query log scanning is resource-intensive: you may need to justify its cost, the scanning may not run 24/7, and if the retention period is short you end up with broken lineage.

The fundamental issue is that I have to deploy the code first, sometimes to production, and only then discover problems. Runtime lineage always lags behind code changes, so you never have complete information during the development phase.

Approach 2: Static Lineage (What I Do Now)

Parse SQL and compute lineage before deployment:

Write SQL → Lineage computed instantly → Validate → Deploy

I parse my SQL files, build a lineage graph, and query it—all before deploying anything. The lineage is available immediately because it's derived from the SQL structure itself.

I wondered why no one promotes this approach, so I built clgraph, with help from Claude, to share my discovery.

How It Works: SQL to AST to Lineage

Step 1: Parse SQL into an AST

SELECT u.name, COUNT(o.id) as order_count
FROM users u
JOIN orders o ON u.id = o.user_id
GROUP BY u.name

Gets parsed into a tree structure:

  • SELECT clause: [u.name, COUNT(o.id)]
  • FROM clause: users (aliased as u)
  • JOIN clause: orders (aliased as o)
  • GROUP BY clause: [u.name]

Step 2: Trace Columns Through the AST

  • u.name → Find alias u → Resolves to users.name
  • COUNT(o.id) → Trace argument o.id → Resolves to orders.id
  • CTEs become named subqueries, analyzed recursively
  • Subqueries are parsed as child AST nodes

Step 3: Build the Lineage Graph

Output Columns:
  name → users.name
  order_count → COUNT(orders.id)

Since SQL structure defines lineage relationships, AST analysis traces column dependencies through code analysis alone. No database connection needed.

Why I prefer this approach:

  1. Lineage during development—I see impact before deploying
  2. Works offline
  3. Validates in CI/CD before deployment
  4. Single source of truth in code
  5. Fast feedback loop (milliseconds)
  6. Database-agnostic: sqlglot handles dialect differences

The Key Insight

Column lineage is a property of the SQL query itself, not the data it processes.

SELECT
    u.name as customer_name,
    COUNT(o.id) as order_count
FROM users u
JOIN orders o ON u.id = o.user_id
GROUP BY u.id, u.name

Without executing this query, I know order_count depends on orders.id and customer_name depends on users.name (assuming users actually has name and id columns, and orders has a user_id column). This is static analysis, the same way an IDE tells me a variable is undefined without running my code.


The SELECT * Problem

Unfortunately, the most common query pattern, SELECT *, breaks most lineage tools:

WITH base AS (
    SELECT * FROM users
),
enriched AS (
    SELECT
        base.*,
        COUNT(*) as order_count
    FROM base
    JOIN orders ON base.id = orders.user_id
    GROUP BY base.id
)
SELECT * FROM enriched

Most tools report: "Unable to determine - requires schema"

SELECT * expands to different columns based on table schema. Without querying the database, traditional tools cannot determine output columns.

My Solution: Star Notation as First-Class Lineage

I treat * as a valid lineage column:

  • enriched.* → depends on base.* → depends on users.*
  • Plus enriched.order_count → depends on orders.id

This is sufficient for impact analysis:

  • "If users schema changes, what breaks?" → Everything downstream from users.*
  • "What does order_count depend on?" → orders.id

For SELECT * EXCEPT(password, ssn), I track it as: "All columns from users except password and ssn"
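Once * is a node like any other, impact analysis is plain graph traversal. A minimal sketch over a hand-built edge list (the graph below, including the "output" table name, is assumed for illustration; clgraph derives it from the SQL):

```python
# Edges point from a column to the columns that depend on it.
edges = {
    "users.*": ["base.*"],
    "base.*": ["enriched.*"],
    "orders.id": ["enriched.order_count"],
    "enriched.*": ["output.*"],
    "enriched.order_count": ["output.order_count"],
}

def downstream(column, graph):
    """Collect everything reachable from `column` (impact analysis)."""
    seen, stack = set(), [column]
    while stack:
        for nxt in graph.get(stack.pop(), []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

print(sorted(downstream("users.*", edges)))   # ['base.*', 'enriched.*', 'output.*']
print(sorted(downstream("orders.id", edges))) # ['enriched.order_count', 'output.order_count']
```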


Example: Tracing Through Nested CTEs

Here's a query I needed to debug:

WITH monthly_sales AS (
    SELECT
        product_id,
        DATE_TRUNC('month', sale_date) as month,
        SUM(amount) as total_sales
    FROM sales
    GROUP BY product_id, DATE_TRUNC('month', sale_date)
),
product_stats AS (
    SELECT
        product_id,
        AVG(total_sales) as avg_monthly_sales,
        MAX(total_sales) as peak_sales
    FROM monthly_sales
    GROUP BY product_id
)
SELECT
    p.product_name,
    ps.avg_monthly_sales,
    ps.peak_sales,
    ps.avg_monthly_sales / ps.peak_sales as sales_consistency
FROM products p
JOIN product_stats ps ON p.id = ps.product_id

Manually tracing this would take 15+ minutes—reading through each CTE, tracking transformations, drawing diagrams on paper. With static lineage, I just ask "where does sales_consistency come from?" and get the complete lineage tree in milliseconds:

sales_consistency (output)
  └─ avg_monthly_sales / peak_sales
      ├─ avg_monthly_sales (product_stats)
      │   └─ AVG(total_sales)
      │       └─ total_sales (monthly_sales)
      │           └─ SUM(amount)
      │               └─ amount (sales)
      └─ peak_sales (product_stats)
          └─ MAX(total_sales)
              └─ total_sales (monthly_sales)
                  └─ SUM(amount)
                      └─ amount (sales)

Once I can programmatically construct this lineage graph, communication becomes much easier. I can visualize the graph and share it with my team—no more explaining dependencies verbally or hoping everyone interprets my hand-drawn diagrams or mental model the same way. My peers can explore the relationships on their own.


What Changed For Me

No More Context Switching

I used to jump between the database UI, the code editor, and query logs constantly. Now I stay in my editor. The lineage is right there in the code.

Confidence in Refactoring

Before, I was always nervous about changing column names or modifying transformations. Now I trace forward to see exactly what depends on a column before I touch it.

Faster Onboarding

When I join a new codebase with hundreds of SQL files, I don't have to read them all. I load them into a pipeline and explore the lineage graph interactively.

Foundation for More

Once lineage is built-in, other things become possible: auto-generated documentation, LLM-powered SQL generation with full context, automatic PII tracking. These are all just applications of the same underlying graph.


Final Thought

Runtime lineage (from query logs) has its place, but it's fundamentally reactive—I discover problems after deployment. Static lineage from SQL parsing is proactive—I see the graph before anything runs.

The diagrams I used to draw on paper? They're now generated automatically from the SQL itself. Column lineage isn't something I bolt on after the fact. It's built into how I write SQL.