How to Write SQL for Perfect Column Lineage
I've learned that column lineage should be obvious from reading the SQL code itself—without checking table schemas or guessing auto-generated column names.
When I write lineage-friendly SQL, anyone (including tools) can trace data flow without database access. Code reviews are faster, debugging is easier, and static analysis becomes possible.
The Problem with Ambiguous SQL
Consider this query:
SELECT
user_id,
COUNT(*),
total,
id + 1
FROM users u
JOIN orders o ON u.id = o.user_id
GROUP BY user_id
What columns are in the output?
user_id- From which table?usersororders?COUNT(*)- What's the column name?count,COUNT(*), or something database-dependent?total- Fromorders? Or computed somewhere?id + 1- What's this column called?
To understand this query, I need to:
- Check if both tables have a
user_idcolumn - Know what my database auto-names
COUNT(*)as - Figure out if
totalis ambiguous - Know how my database names expressions
This is inefficient for code review and impossible for static analysis. I've wasted hours on queries like this.
Rules I Follow
These are rules I follow to make column lineage explicit and traceable without database access.
These are opinionated rules based on my experience building lineage tools. Not all teams will agree with every rule. The goal is to enable lineage analysis and improve code readability.
1: No Duplicate Column Names
Consider the following query:
During code review, I will ask which id is which? Most databases will show an error with this syntax, and even if some databases allow this query, it's ambiguous.
If we rewrite the query in the following way:
The query now clearly shows which id is the user_id and which is the order_id, and we know the output columns precisely.
2: Explicit Aliases for Aggregates
Bad:
Aggregate column names are database-dependent (count, sum, avg, or ?column?).
Good:
SELECT
user_id,
COUNT(*) as order_count,
SUM(total) as total_revenue,
AVG(amount) as avg_order_amount
FROM orders GROUP BY user_id
3: Explicit Aliases for Window Functions
Bad:
SELECT employee_id, salary,
ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC)
FROM employees
Good:
SELECT employee_id, salary,
ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) as dept_rank
FROM employees
4: Qualified Columns in JOINs
Bad:
Source table of each column is ambiguous.
Good:
SELECT u.id as user_id, u.name as user_name, o.total as order_total
FROM users u JOIN orders o ON u.id = o.user_id
Each column traces to its source table without database access.
5: Explicit Aliases for CASE Expressions
Bad:
SELECT product_id,
CASE WHEN price > 100 THEN 'expensive'
WHEN price > 50 THEN 'moderate'
ELSE 'cheap' END
FROM products
Good:
SELECT product_id,
CASE WHEN price > 100 THEN 'expensive'
WHEN price > 50 THEN 'moderate'
ELSE 'cheap' END as price_category
FROM products
6: Explicit Aliases for Scalar Subqueries
Bad:
Good:
7: Explicit Aliases for Literals
Bad:
Good:
8: Explicit Aliases for Arithmetic
Bad:
Good:
SELECT
price * 1.1 as price_with_tax,
quantity * price as line_total,
revenue - cost as profit
FROM products
9: Explicit Aliases for Type Casts (Optional)
Bad:
Good:
10: The PIVOT/UNPIVOT Problem
Dynamic PIVOT breaks static lineage completely.
This is bad:
SELECT *
FROM (SELECT product_id, month, revenue FROM sales)
PIVOT (
SUM(revenue)
FOR month IN (SELECT DISTINCT month FROM sales) -- Dynamic!
)
Output columns depend on data values. Impossible to trace lineage statically.
Better - Explicit PIVOT:
SELECT product_id, jan_revenue, feb_revenue, mar_revenue
FROM (SELECT product_id, month, revenue FROM sales)
PIVOT (
SUM(revenue)
FOR month IN ('Jan', 'Feb', 'Mar')
) AS pivoted (product_id, jan_revenue, feb_revenue, mar_revenue)
Best - Explicit CASE Statements:
SELECT
product_id,
SUM(CASE WHEN month = 'Jan' THEN revenue ELSE 0 END) as jan_revenue,
SUM(CASE WHEN month = 'Feb' THEN revenue ELSE 0 END) as feb_revenue,
SUM(CASE WHEN month = 'Mar' THEN revenue ELSE 0 END) as mar_revenue
FROM sales
GROUP BY product_id
Each output column shows exactly what it computes. Works across all SQL dialects. No special syntax.
When dynamic PIVOT is unavoidable (user-configurable reports), we should isolate it and document that lineage is dynamic for that section.
Why This Matters to Me
I spent years drawing column lineage diagrams on paper while debugging SQL pipelines. Eventually I discovered that SQL parsers can construct these diagrams automatically—but only if the SQL is written clearly. I wrote about the full journey in Why Column Lineage Should Be Built-In, Not Bolted-On.
The rules in this post are the practical foundation for that approach. When every expression has an explicit alias and every column traces to its source, lineage becomes a property of the code itself rather than something reverse-engineered from query logs after deployment.