Skip to content

How to Write SQL for Perfect Column Lineage

I've learned that column lineage should be obvious from reading the SQL code itself—without checking table schemas or guessing auto-generated column names.

When I write lineage-friendly SQL, anyone (including tools) can trace data flow without database access. Code reviews are faster, debugging is easier, and static analysis becomes possible.


The Problem with Ambiguous SQL

Consider this query:

SELECT
    user_id,
    COUNT(*),
    total,
    id + 1
FROM users u
JOIN orders o ON u.id = o.user_id
GROUP BY user_id

What columns are in the output?

  • user_id - From which table? users or orders?
  • COUNT(*) - What's the column name? count, COUNT(*), or something database-dependent?
  • total - From orders? Or computed somewhere?
  • id + 1 - What's this column called?

To understand this query, I need to:

  1. Check if both tables have a user_id column
  2. Know what my database auto-names COUNT(*) as
  3. Figure out if total is ambiguous
  4. Know how my database names expressions

This is inefficient for code review and impossible for static analysis. I've wasted hours on queries like this.


Rules I Follow

These are rules I follow to make column lineage explicit and traceable without database access.

These are opinionated rules based on my experience building lineage tools. Not all teams will agree with every rule. The goal is to enable lineage analysis and improve code readability.

1: No Duplicate Column Names

Consider the following query:

SELECT u.id, o.id
FROM users u JOIN orders o ON u.id = o.user_id

During code review, I will ask which id is which? Most databases will show an error with this syntax, and even if some databases allow this query, it's ambiguous.

If we rewrite the query in the following way:

SELECT u.id as user_id, o.id as order_id
FROM users u JOIN orders o ON u.id = o.user_id

The query now clearly shows which id is the user_id and which is the order_id, and we know the output columns precisely.


2: Explicit Aliases for Aggregates

Bad:

SELECT user_id, COUNT(*), SUM(total), AVG(amount)
FROM orders GROUP BY user_id

Aggregate column names are database-dependent (count, sum, avg, or ?column?).

Good:

SELECT
    user_id,
    COUNT(*) as order_count,
    SUM(total) as total_revenue,
    AVG(amount) as avg_order_amount
FROM orders GROUP BY user_id


3: Explicit Aliases for Window Functions

Bad:

SELECT employee_id, salary,
    ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC)
FROM employees

Good:

SELECT employee_id, salary,
    ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) as dept_rank
FROM employees


4: Qualified Columns in JOINs

Bad:

SELECT id, name, total
FROM users u JOIN orders o ON u.id = o.user_id

Source table of each column is ambiguous.

Good:

SELECT u.id as user_id, u.name as user_name, o.total as order_total
FROM users u JOIN orders o ON u.id = o.user_id

Each column traces to its source table without database access.


5: Explicit Aliases for CASE Expressions

Bad:

SELECT product_id,
    CASE WHEN price > 100 THEN 'expensive'
         WHEN price > 50 THEN 'moderate'
         ELSE 'cheap' END
FROM products

Good:

SELECT product_id,
    CASE WHEN price > 100 THEN 'expensive'
         WHEN price > 50 THEN 'moderate'
         ELSE 'cheap' END as price_category
FROM products


6: Explicit Aliases for Scalar Subqueries

Bad:

SELECT u.id, u.name,
    (SELECT COUNT(*) FROM orders WHERE user_id = u.id)
FROM users u

Good:

SELECT u.id, u.name,
    (SELECT COUNT(*) FROM orders WHERE user_id = u.id) as order_count
FROM users u


7: Explicit Aliases for Literals

Bad:

SELECT 'v1.0', CURRENT_TIMESTAMP(), 1 FROM table1

Good:

SELECT 'v1.0' as version, CURRENT_TIMESTAMP() as query_timestamp, 1 as partition_id
FROM table1


8: Explicit Aliases for Arithmetic

Bad:

SELECT price * 1.1, quantity * price, revenue - cost FROM products

Good:

SELECT
    price * 1.1 as price_with_tax,
    quantity * price as line_total,
    revenue - cost as profit
FROM products


9: Explicit Aliases for Type Casts (Optional)

Bad:

SELECT CAST(created_at AS DATE), SAFE_CAST(amount AS INT64) FROM orders

Good:

SELECT CAST(created_at AS DATE) as order_date, SAFE_CAST(amount AS INT64) as amount_int
FROM orders


10: The PIVOT/UNPIVOT Problem

Dynamic PIVOT breaks static lineage completely.

This is bad:

SELECT *
FROM (SELECT product_id, month, revenue FROM sales)
PIVOT (
    SUM(revenue)
    FOR month IN (SELECT DISTINCT month FROM sales)  -- Dynamic!
)

Output columns depend on data values. Impossible to trace lineage statically.

Better - Explicit PIVOT:

SELECT product_id, jan_revenue, feb_revenue, mar_revenue
FROM (SELECT product_id, month, revenue FROM sales)
PIVOT (
    SUM(revenue)
    FOR month IN ('Jan', 'Feb', 'Mar')
) AS pivoted (product_id, jan_revenue, feb_revenue, mar_revenue)

Best - Explicit CASE Statements:

SELECT
    product_id,
    SUM(CASE WHEN month = 'Jan' THEN revenue ELSE 0 END) as jan_revenue,
    SUM(CASE WHEN month = 'Feb' THEN revenue ELSE 0 END) as feb_revenue,
    SUM(CASE WHEN month = 'Mar' THEN revenue ELSE 0 END) as mar_revenue
FROM sales
GROUP BY product_id

Each output column shows exactly what it computes. Works across all SQL dialects. No special syntax.

When dynamic PIVOT is unavoidable (user-configurable reports), we should isolate it and document that lineage is dynamic for that section.


Why This Matters to Me

I spent years drawing column lineage diagrams on paper while debugging SQL pipelines. Eventually I discovered that SQL parsers can construct these diagrams automatically—but only if the SQL is written clearly. I wrote about the full journey in Why Column Lineage Should Be Built-In, Not Bolted-On.

The rules in this post are the practical foundation for that approach. When every expression has an explicit alias and every column traces to its source, lineage becomes a property of the code itself rather than something reverse-engineered from query logs after deployment.