Why Column Lineage Should Be Built-In, Not Bolted-On
In my first few data engineering jobs, I spent hours, sometimes days, debugging data quality issues. My workflow was always the same: dig through SQL queries, check information schema, scan query logs, and try to piece together what went wrong. For impact analysis, I'd do the same thing in reverse: trace forward through the code to see what might break.
I found myself drawing diagrams on paper to visualize the relationships between columns. Sometimes within a single query with complex CTEs, sometimes across multiple queries in a pipeline. These hand-drawn graphs were the only way I could keep track of how data flowed.
The Moment Everything Changed
One day, I learned how Python's AST (Abstract Syntax Tree) works—how it converts code into a tree, which is really just a special kind of graph. When I visualized a Python AST, it looked remarkably similar to the diagrams I'd been drawing on paper for my SQL queries.
I started wondering: is there an AST for SQL?
That question led me to sqlglot. The structure I'd been manually sketching was already there in the SQL itself. I just needed a parser to extract it.
I started feeding my queries into sqlglot to see how the AST looked. It became my secret weapon for SQL-related problems. Instead of manually tracing through CTEs and joins, I could see the structure directly.
As we all know now, static analysis has its own limitation: it won't go to the database to look up information if you don't specify source of columns precisely. Sometimes, I needed to tweak my queries to make the AST produce a diagram that matched reality. But that was a small price to pay because making the queries work with static analysis also makes them easier to understand in the first place.
From Secret Weapon to Solved Problem
Fast forward a few years. I had a job that required converting thousands of Stata scripts (yes, Stata) into a BigQuery pipeline with hundreds of tables. Let's not debate whether this was a good idea—it was a job requirement.
I started writing SQL in a way that the AST could construct proper lineage graphs. And something clicked: I didn't need to switch between the database, code, and query logs anymore. I could construct the complete column lineage graph from the SQL alone.
I could trace relationships between columns before execution. I could see how changes to a column in one table would affect downstream tables—before deploying anything. Suddenly, column lineage wasn't a mystery to reverse-engineer. It was a solved problem.
Two Approaches to Column Lineage
Approach 1: Runtime Lineage (What Most Teams Do)
Most column lineage is extracted in the following flow:
Usually, there are three options.
Option 1: Database Engine Lineage
Major data warehousing vendors provide lineage APIs:
- BigQuery:
INFORMATION_SCHEMA.COLUMN_FIELD_PATHS - Snowflake:
ACCOUNT_USAGE.ACCESS_HISTORY - Databricks: Unity Catalog lineage APIs
Option 2: Third-Party Tools
Deploy SQL and let external tools scan query logs (Monte Carlo, Atlan, Collibra, Alation, etc.).
Option 3: Manually Trace
Manually tracing table schema and query logs (if you don't have the above two options, or these tools don't fully solve your problem).
Problems with runtime lineage:
There are a few issues with these three runtime lineage approaches:
Lineage only exists after execution. We need to run the query (sometimes in production?) to get the lineage. Because we can only get lineage after execution, we cannot validate query dependencies before deployment. On top of that, query log scanning is resource-intensive: we might need to justify the cost, the query scanning may not be 24/7, or the retention period is short and we end up with broken lineage.
The fundamental issue is I need to deploy the code first, sometimes in production, then discover problems. Runtime lineage always lags behind code changes and you cannot have complete information in development phase.
Approach 2: Static Lineage (What I Do Now)
Parse SQL and compute lineage before deployment:
I parse my SQL files, build a lineage graph, and query it—all before deploying anything. The lineage is available immediately because it's derived from the SQL structure itself.
I was wondering why no one promotes this approach, so I decided to build clgraph with the help of Claude to share my discovery.
How It Works: SQL to AST to Lineage
Step 1: Parse SQL into an AST
SELECT u.name, COUNT(o.id) as order_count
FROM users u
JOIN orders o ON u.id = o.user_id
GROUP BY u.name
Gets parsed into a tree structure:
- SELECT clause:
[u.name, COUNT(o.id)] - FROM clause:
users(aliased asu) - JOIN clause:
orders(aliased aso) - GROUP BY clause:
[u.name]
Step 2: Trace Columns Through the AST
u.name→ Find aliasu→ Resolves tousers.nameCOUNT(o.id)→ Trace argumento.id→ Resolves toorders.id- CTEs become named subqueries, analyzed recursively
- Subqueries are parsed as child AST nodes
Step 3: Build the Lineage Graph
Since SQL structure defines lineage relationships, AST analysis traces column dependencies through code analysis alone. No database connection needed.
Why I prefer this approach:
- Lineage during development—I see impact before deploying
- Works offline
- Validates in CI/CD before deployment
- Single source of truth in code
- Fast feedback loop (milliseconds)
- Database-agnostic, supported by sqlglot
The Key Insight
Column lineage is a property of the SQL query itself, not the data it processes.
SELECT
u.name as customer_name,
COUNT(o.id) as order_count
FROM users u
JOIN orders o ON u.id = o.user_id
GROUP BY u.id, u.name
Without executing this query, I know order_count depends on orders.id and customer_name depends on users.name, unless the users table doesn't have name or id columns or the orders table doesn't have a user_id column. This is static analysis—like how an IDE tells me if a variable is undefined without running my code.
The SELECT * Problem
Unfortunately, the most common query SELECT * breaks most lineage tools:
WITH base AS (
SELECT * FROM users
),
enriched AS (
SELECT
base.*,
COUNT(*) as order_count
FROM base
JOIN orders ON base.id = orders.user_id
GROUP BY base.id
)
SELECT * FROM enriched
Most tools report: "Unable to determine - requires schema"
SELECT * expands to different columns based on table schema. Without querying the database, traditional tools cannot determine output columns.
My Solution: Star Notation as First-Class Lineage
I treat * as a valid lineage column:
enriched.*→ depends onbase.*→ depends onusers.*- Plus
enriched.order_count→ depends onorders.id
This is sufficient for impact analysis:
- "If
usersschema changes, what breaks?" → Everything downstream fromusers.* - "What does
order_countdepend on?" →orders.id
For SELECT * EXCEPT(password, ssn), I track it as: "All columns from users except password and ssn"
Example: Tracing Through Nested CTEs
Here's a query I needed to debug:
WITH monthly_sales AS (
SELECT
product_id,
DATE_TRUNC('month', sale_date) as month,
SUM(amount) as total_sales
FROM sales
GROUP BY product_id, DATE_TRUNC('month', sale_date)
),
product_stats AS (
SELECT
product_id,
AVG(total_sales) as avg_monthly_sales,
MAX(total_sales) as peak_sales
FROM monthly_sales
GROUP BY product_id
)
SELECT
p.product_name,
ps.avg_monthly_sales,
ps.peak_sales,
ps.avg_monthly_sales / ps.peak_sales as sales_consistency
FROM products p
JOIN product_stats ps ON p.id = ps.product_id
Manually tracing this would take 15+ minutes—reading through each CTE, tracking transformations, drawing diagrams on paper. With static lineage, I just ask "where does sales_consistency come from?" and get the complete lineage tree in milliseconds:
sales_consistency (output)
└─ avg_monthly_sales / peak_sales
├─ avg_monthly_sales (product_stats)
│ └─ AVG(total_sales)
│ └─ total_sales (monthly_sales)
│ └─ SUM(amount)
│ └─ amount (sales)
└─ peak_sales (product_stats)
└─ MAX(total_sales)
└─ total_sales (monthly_sales)
└─ SUM(amount)
└─ amount (sales)
Once I can programmatically construct this lineage graph, communication becomes much easier. I can visualize the graph and share it with my team—no more explaining dependencies verbally or hoping everyone interprets my hand-drawn diagrams or mental model the same way. My peers can explore the relationships on their own.
What Changed For Me
No More Context Switching
I used to jump between the database UI, the code editor, and query logs constantly. Now I stay in my editor. The lineage is right there in the code.
Confidence in Refactoring
Before, I was always nervous about changing column names or modifying transformations. Now I trace forward to see exactly what depends on a column before I touch it.
Faster Onboarding
When I join a new codebase with hundreds of SQL files, I don't have to read them all. I load them into a pipeline and explore the lineage graph interactively.
Foundation for More
Once lineage is built-in, other things become possible: auto-generated documentation, LLM-powered SQL generation with full context, automatic PII tracking. These are all just applications of the same underlying graph.
Final Thought
Runtime lineage (from query logs) has its place, but it's fundamentally reactive—I discover problems after deployment. Static lineage from SQL parsing is proactive—I see the graph before anything runs.
The diagrams I used to draw on paper? They're now generated automatically from the SQL itself. Column lineage isn't something I bolt on after the fact. It's built into how I write SQL.