FAQ

Frequently asked questions about clgraph.

General

Do I need to change my SQL?

No. clgraph works with your existing SQL files. No annotations, no special syntax required.

What databases are supported?

BigQuery, Snowflake, PostgreSQL, DuckDB, Redshift, and many more. clgraph uses sqlglot for parsing, which supports 20+ SQL dialects.

Is it open source?

Yes. clgraph is released under the MIT license, and the source is available on GitHub.

Do I need to migrate my entire codebase at once?

No. You can adopt clgraph incrementally, query by query. There's no big-bang rewrite required.

Start with a few queries, see the lineage, and expand from there. As a bonus, writing lineage-friendly SQL (explicit column names, clear transformations) makes your code easier to review, so adopting clgraph tends to improve code quality along the way.

Performance

Does it work with large pipelines?

Yes. Tested on 1,000+ queries and 10,000+ columns. Parse time is typically under 5 seconds.

How much memory does it use?

Memory usage scales with pipeline size. A pipeline with 1,000 columns typically uses about 50 MB.

Integration

Can I use it with Airflow?

Yes. Generate Airflow DAGs automatically with pipeline.to_airflow_dag(). See Pipeline Orchestration for details.

Can I use it with dbt?

Yes. Use dbt's compiled SQL output with clgraph:

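from clgraph import Pipeline

# Point clgraph at the SQL that dbt has already compiled (Jinja rendered).
pipeline = Pipeline.from_sql_files(
    "target/compiled/my_project/models/",
    dialect="bigquery"
)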

See Template Variables for more details.

Can I export lineage to my data catalog?

Yes. Export to JSON, CSV, or GraphViz formats:

# JSON for data catalogs (keep the result so it can be written or uploaded)
lineage_json = pipeline.to_json()

# CSV for spreadsheets
from clgraph.export import CSVExporter
CSVExporter.export_columns_to_file(pipeline, "columns.csv")

Features

How does column lineage work?

clgraph parses your SQL and builds a graph of column dependencies. It tracks:

Direct column references
Transformations (SUM, JOIN, CASE, etc.)
Star expansion (SELECT *)
CTE and subquery resolution

See From SQL to Lineage Graph for details.
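
As a rough illustration of the idea (not clgraph's internal data model), a column lineage graph can be pictured as a mapping from each column to the columns it is derived from. The table and column names below are hypothetical, and trace_upstream is a helper written for this sketch, not a clgraph function.

```python
from collections import deque

# Hypothetical lineage edges: each column maps to the columns it reads from.
UPSTREAM = {
    "staging.orders.amount": ["raw.orders.amount"],
    "staging.orders.user_id": ["raw.orders.user_id"],
    "reports.daily.total_amount": ["staging.orders.amount"],  # e.g. SUM(amount)
    "reports.daily.user_id": ["staging.orders.user_id"],
}


def trace_upstream(column, upstream=UPSTREAM):
    """Return every column that `column` depends on, directly or indirectly."""
    seen = set()
    queue = deque(upstream.get(column, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        queue.extend(upstream.get(current, []))
    return seen


print(sorted(trace_upstream("reports.daily.total_amount")))
```

Walking the edges in the opposite direction answers the impact question: which downstream columns are affected when a source column changes.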

How does metadata propagation work?

When you mark a column as PII or add tags, clgraph can propagate that metadata through the lineage graph:

pipeline.columns["raw.users.email"].pii = True
pipeline.propagate_all_metadata()
# All downstream columns now marked as PII

See Metadata from Comments for details.
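
Conceptually, propagation is a graph traversal: start at the flagged columns and mark everything reachable downstream. The following is a minimal sketch of that idea under assumed data, not clgraph's implementation. The DOWNSTREAM mapping, the column names, and propagate_pii are all hypothetical.

```python
from collections import deque

# Hypothetical edges: each column maps to the columns derived from it.
DOWNSTREAM = {
    "raw.users.email": ["staging.users.email_hash", "staging.users.email_domain"],
    "staging.users.email_domain": ["reports.daily.top_domain"],
    "raw.users.signup_date": ["reports.daily.signup_day"],
}


def propagate_pii(pii_sources, downstream=DOWNSTREAM):
    """Return the PII sources plus every column reachable downstream of them."""
    flagged = set(pii_sources)
    queue = deque(pii_sources)
    while queue:
        current = queue.popleft()
        for child in downstream.get(current, []):
            if child not in flagged:
                flagged.add(child)
                queue.append(child)
    return flagged


flagged = propagate_pii(["raw.users.email"])
print(sorted(flagged))
```

Columns that do not descend from a flagged source, such as reports.daily.signup_day in this example, stay unflagged.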

Can I split a pipeline into smaller pieces?

Yes. Use pipeline.split() to create sub-pipelines by sink tables:

subpipelines = pipeline.split(
    sinks=[
        ["reports.daily"],
        ["reports.hourly"]
    ]
)

See Pipeline Orchestration for details.
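
One way to think about splitting by sinks is that each sub-pipeline needs its sink tables plus everything upstream of them. The sketch below shows that closure on a hypothetical table dependency map. It is an assumption about the general approach, and it simply repeats shared upstream tables in each group, which may differ from what pipeline.split() actually does.

```python
# Hypothetical table-level dependencies: each table maps to the tables it reads.
TABLE_DEPS = {
    "staging.events": ["raw.events"],
    "reports.daily": ["staging.events"],
    "reports.hourly": ["staging.events", "raw.clock"],
}


def tables_for_sinks(sinks, deps=TABLE_DEPS):
    """Collect the sink tables plus every table upstream of them."""
    needed = set()
    stack = list(sinks)
    while stack:
        table = stack.pop()
        if table in needed:
            continue
        needed.add(table)
        stack.extend(deps.get(table, []))
    return needed


groups = [["reports.daily"], ["reports.hourly"]]
for group in groups:
    print(group, "->", sorted(tables_for_sinks(group)))
```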

Troubleshooting

My SQL isn't parsing correctly

Check that the dialect is correct: Pipeline.from_sql_files("sql/", dialect="bigquery")
Verify that the SQL is valid in your target database
Check for unsupported syntax (some edge cases may not be supported)

Column lineage is missing some columns

This can happen with:

Dynamic SQL or templated queries (use the template_context parameter)
SELECT * from external tables (clgraph can't know the schema)
Unqualified column names in JOINs (qualify each column with its table alias)

I'm getting import errors

Make sure clgraph is installed in your active environment:

pip list | grep clgraph
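
If pip and python point at different environments, pip list can be misleading. A small check run with the same interpreter that raises the import error may help. This sketch uses only the standard library, and is_installed is a helper defined here, not part of clgraph.

```python
import importlib.util
import sys


def is_installed(package_name):
    """Return True if the active interpreter can locate `package_name`."""
    return importlib.util.find_spec(package_name) is not None


# Show which interpreter is running and whether it can see clgraph.
print("Interpreter:", sys.executable)
print("clgraph importable:", is_installed("clgraph"))
```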

For LLM features, install with extras:

pip install "clgraph[llm]"
