Changelog
All notable changes to LineageBridge are documented here. The format follows Keep a Changelog, and this project adheres to Semantic Versioning.
[Unreleased]
See the latest commits for work in progress.
[0.5.0] - 2026-05-02
Modularity refactor: services layer, pluggable storage, watcher as an independent service. Plus a stack of UX + reliability fixes that fell out during validation.
Added
- Service layer (lineage_bridge/services/): single entry point for extraction / enrichment / push (run_extraction, run_enrichment, run_push). UI, API, and watcher all call into it with the same ExtractionRequest / EnrichmentRequest / PushRequest shape. (Phase 1A, ADR-020)
- Pluggable storage layer (lineage_bridge/storage/) with three backends:
  - memory (default): process-local, ephemeral
  - file: JSON files + flock-guarded writes
  - sqlite: single storage.db, WAL mode, versioned-SQL migrations, durable across restarts; recommended for the watcher.
  Selected via LINEAGE_BRIDGE_STORAGE__BACKEND={memory,file,sqlite}. (Phases 1C + 2F, ADR-022)
- Catalog protocol v2: UC_TABLE / GLUE_TABLE / GOOGLE_TABLE collapsed into one CATALOG_TABLE node type with a catalog_type discriminator (UNITY_CATALOG / AWS_GLUE / GOOGLE_DATA_LINEAGE / AWS_DATAZONE). Adding a new catalog = one file in catalogs/. Push surface unified through services.run_push(PushRequest). (Phase 1B, ADR-021: clean break, no migration; old graph JSON raises a ValidationError on load)
- Watcher as an independent peer service: split into WatcherService (pure-logic state machine, no threading), WatcherRunner (asyncio loop + persistence), WatcherRepository (memory + sqlite backends), and api/routers/watcher.py (6 endpoints: start / stop / status / events / history / list / deregister). Two run modes: lineage-bridge-watch daemon (production, survives UI restarts) or in-process API task (development). UI now polls GET /api/v1/watcher/{id}/* via httpx and holds only the watcher_id string. (Phase 2G, ADR-023)
- POST /api/v1/push/{provider} endpoint: the API gains feature parity with the UI's "Push to X" buttons. (Phase 1A)
- Orchestrator phase abstraction: each phase implements Phase.execute(ctx) and is independently testable. (Phase 1D)
- Per-demo credential cache accumulation: switching .env between UC / Glue / BQ / DataZone demos no longer wipes the previous demo's credentials. The ~/.lineage_bridge/cache.json deep-merges per-cluster / per-env credential dicts; the demo provision scripts mirror Databricks workspace + AWS region into the cache too.
- Metrics enrichment for every node type (with --metrics): Telemetry-based for topics / connectors / Flink jobs; consumer-group lag → metrics_total_lag; tableflow inherits from upstream topic; catalog tables get metrics_active from the most recent of last_modified_time / updated_at / update_time / create_time; ksqlDB queries from state. New --metrics and --metrics-lookback-hours CLI flags.
- DLQ topic wiring: sink connectors now expose their internal lcc-XXXXX resource ID via expand=info,status,id. The Connect extractor emits a placeholder dlq-{lcc-id} topic + PRODUCES edge, which merges with the real topic on the kafka_admin pass. DLQ topics no longer appear as orphan nodes; renderer surfaces a red "D" badge variant.
- Sidebar legend variants: per-catalog brand rows (UC / Glue / BigQuery / DataZone) plus "DLQ topic" and "Topic with schema" badge rows when present in the graph.
- Settings.api_url: UI uses this to find the API (default http://127.0.0.1:8000); set when API runs in a separate container.
- AWS partition-aware Glue console deeplinks: _console_host(region) maps GovCloud (us-gov-*) and China (cn-*) regions to the right console hostname.
- Storage conformance suite: same 24 tests run against memory + file + sqlite (72 cases), plus 12 watcher-repository tests × memory + sqlite (24 cases), plus 11 sqlite-specific tests + 2 sqlite API-integration tests.
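For readers unfamiliar with AWS partitions, the partition-aware hostname mapping described for the Glue console deeplinks can be sketched roughly as below. This is an illustration, not the project's _console_host: the function name console_host_sketch is hypothetical, and the hostnames reflect the commonly documented regional console endpoints for each partition, which should be checked against AWS documentation before relying on them.

```python
def console_host_sketch(region: str) -> str:
    """Return a regional AWS console hostname for the partition owning `region`.

    Hypothetical helper for illustration; the real _console_host may differ.
    """
    if region.startswith("us-gov-"):
        # AWS GovCloud (US) partition
        return f"{region}.console.amazonaws-us-gov.com"
    if region.startswith("cn-"):
        # AWS China partition
        return f"{region}.console.amazonaws.cn"
    # Standard commercial partition
    return f"{region}.console.aws.amazon.com"


if __name__ == "__main__":
    for r in ("us-east-1", "us-gov-west-1", "cn-north-1"):
        print(r, "->", console_host_sketch(r))
```

Matching on the region prefix keeps the mapping free of any per-region table, at the cost of having to add a branch if a new partition appears.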
Changed
- UI sidebar decomposed: the 1,077-LOC ui/sidebar.py split into ui/sidebar/{__init__,connection,scope,credentials,actions,filters}.py. CSS extracted from ui/app.py to ui/static/styles.css. Sample graph promoted from imperative builder to bundled ui/static/sample_graph.json. (Phase 2E)
- Per-catalog console deep-link labels: was hardcoded "Open in BigQuery" on every CATALOG_TABLE (visible on UC + Glue tables too); now dispatches per catalog_type (Open in Unity Catalog / Open in AWS Glue / Open in BigQuery). Duplicate Glue button removed.
- Per-catalog brand icons reach the actual graph nodes: render_graph_raw and render_graph now wire icon_for_node(node) so UC / Glue / BigQuery / DataZone tables render with their brand icon (was: every catalog table got the generic database icon).
- DataZone live integration tests gate on iam:SimulatePrincipalPolicy: when the caller lacks datazone:CreateFormType / CreateAssetType, the tests skip cleanly with the IAM diff in the skip reason rather than failing on AccessDeniedException. The graceful-degradation path is still exercised by the manual provider.push_lineage smoke.
- Dataplex live integration test uses deterministic GET-by-name instead of LIST entries → find by FQN to avoid the eventual-consistency flake on Dataplex Catalog list endpoints.
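The per-catalog label dispatch mentioned in the list above amounts to a lookup keyed on catalog_type. A minimal sketch, assuming the three labels map to the discriminator values in the obvious way (the pairing of GOOGLE_DATA_LINEAGE with "Open in BigQuery" is an inference, and the names CONSOLE_LABELS and deep_link_label are hypothetical):

```python
from typing import Optional

# Assumed pairing of catalog_type values with the labels named in the changelog.
# No label is stated for AWS_DATAZONE, so it is deliberately left out here.
CONSOLE_LABELS = {
    "UNITY_CATALOG": "Open in Unity Catalog",
    "AWS_GLUE": "Open in AWS Glue",
    "GOOGLE_DATA_LINEAGE": "Open in BigQuery",
}


def deep_link_label(catalog_type: str) -> Optional[str]:
    """Return the button label for a catalog type, or None if none is known."""
    return CONSOLE_LABELS.get(catalog_type)


if __name__ == "__main__":
    print(deep_link_label("AWS_GLUE"))
    print(deep_link_label("AWS_DATAZONE"))
```

Returning None for an unknown type lets the caller hide the button instead of showing a misleading label, which is the failure mode the entry describes.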
Fixed
- Glue catalog deep-link region: build_url no longer silently falls back to us-east-1 when the node has no aws_region. build_node now stamps aws_region from the configured provider region (or Tableflow CI region) at build time, and build_url returns None if no region is available anywhere (no more wrong-region buttons).
- Google catalog table_name parity: GoogleLineageProvider.build_node was producing different node IDs than connect.py:_build_google_tables for the same logical table (split-on-dot vs replace-on-dot). Fixed to mirror the Connect path.
- ksqlDB extraction 401: when Settings.ksqldb_api_key is unset, the processing phase now falls back to the cached lineage-bridge-ksqldb-{env_id} provisioned SA key. Same pattern applied to the Tableflow phase.
- CSS escape bug: ui/static/styles.css header comment contained the literal text </style>, which the browser parsed as closing the wrapping <style> opened by st.markdown, dumping the rest of the CSS as plain text on the page.
- Streamlit cluster_select widget warning: the multiselect was passing both key= and default=. Pre-seed st.session_state["cluster_select"] instead, drop the default arg, prune stale labels.
- Watcher push path: the old WatcherEngine imported run_glue_push / run_lineage_push (both deleted in Phase 1B); the new service uses services.run_push via PushRequest, the same path the UI + API hit.
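The Google table_name parity fix comes down to two code paths treating dots in a name differently. The sketch below shows one plausible reading of "split-on-dot vs replace-on-dot"; both helper names and the topic name are made up, and the changelog does not say which variant the Connect path uses.

```python
def table_name_by_split(topic: str) -> str:
    """Keep only the last dot-separated segment (one reading of split-on-dot)."""
    return topic.split(".")[-1]


def table_name_by_replace(topic: str) -> str:
    """Replace every dot with an underscore (replace-on-dot)."""
    return topic.replace(".", "_")


if __name__ == "__main__":
    topic = "orders.enriched.v1"  # hypothetical topic name
    print(table_name_by_split(topic))
    print(table_name_by_replace(topic))
```

Because node IDs are derived from the table name, two builders that disagree like this produce two nodes for the same logical table instead of one merged node.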
Removed
- lineage_bridge/watcher/engine.py:WatcherEngine, replaced by services.WatcherService + services.WatcherRunner. The threading wrapper is gone (the daemon owns its asyncio loop directly). The _use_audit_log private property is gone (mode is explicit in WatcherConfig.mode). The CLI no longer calls a private _run_loop() directly.
- run_glue_push / run_lineage_push wrappers in extractors/orchestrator.py: the catalog protocol v2 dispatches via services.run_push(PushRequest) instead. (Phase 1B)
- UC_TABLE / GLUE_TABLE / GOOGLE_TABLE NodeType members: collapsed to CATALOG_TABLE + catalog_type. (Phase 1B; clean break per ADR-021)
Tests
- 813 → 892 baseline pytest, 7 skipped (live cloud integration; opt-in via LINEAGE_BRIDGE_*_INTEGRATION=1).
- New conformance suite covers memory + file + sqlite for graphs / tasks / events; memory + sqlite for watchers.
- Lint clean (ruff, line length 100).
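The opt-in gate for live cloud tests can be pictured as a small environment check. This is a sketch under assumptions: the helper name integration_enabled is hypothetical, and how the project actually wires the check into pytest is not shown in this changelog.

```python
import os
from typing import Mapping, Optional


def integration_enabled(suite: str, env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when LINEAGE_BRIDGE_<suite>_INTEGRATION is set to "1".

    `env` defaults to the process environment; passing a dict makes the
    check easy to exercise without touching real environment variables.
    """
    env = os.environ if env is None else env
    return env.get(f"LINEAGE_BRIDGE_{suite}_INTEGRATION") == "1"


if __name__ == "__main__":
    print(integration_enabled("GCP", {"LINEAGE_BRIDGE_GCP_INTEGRATION": "1"}))
    print(integration_enabled("AWS_DATAZONE", {}))
```

A test module could use such a check in a skip condition so that the live suites count as skipped rather than failed when the variable is absent.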
[0.4.1] - 2026-05-01
New catalog integrations, brand-icon refresh, BigQuery node enhancement, and CI/docs hardening.
Added
- AWS DataZone provider (AWSDataZoneProvider, DataZoneAssetRegistrar): registers Kafka topics as DataZone assets with schema and posts OpenLineage events via post_lineage_event. Mirrors the Google Dataplex / Data Lineage architecture for AWS.
- Dataplex Catalog asset registration (DataplexAssetRegistrar): each Kafka topic becomes a Dataplex entry with the same FQN as the lineage event, so the BigQuery Lineage tab shows column metadata on upstream Confluent nodes (events alone don't carry schema; Google strips facets at storage).
- Rich lineage_bridge.upstream_chain payload in UC TBLPROPERTIES, UC bridge table (chain_json column), Glue Parameters, and Glue Description: full multi-hop chain including Flink/ksqlDB SQL, intermediate topics, source connectors, and per-topic schema fields. Capped per catalog's value-size limit.
- Multi-hop OpenLineage push for Google: every Job-event (source connectors, Flink, ksqlDB, sinks) is pushed so the Lineage tab can walk transitively from a BQ table back to the source topics.
- Live integration tests: tests/integration/test_gcp_dataplex_integration.py (gated by LINEAGE_BRIDGE_GCP_INTEGRATION=1) and tests/integration/test_aws_datazone_integration.py (gated by LINEAGE_BRIDGE_AWS_DATAZONE_INTEGRATION=1). Run via make test-integration-dataplex / make test-integration-datazone.
- "Push to DataZone" button in the Streamlit publish panel, gated on LINEAGE_BRIDGE_AWS_DATAZONE_DOMAIN_ID + LINEAGE_BRIDGE_AWS_DATAZONE_PROJECT_ID.
- Official brand icons for Kafka, Flink, Databricks (Unity Catalog), Google BigQuery, and AWS Glue graph nodes. Other node types keep their geometric icons. New _IconSpec config supports logo and tile render modes plus an optional fill_override for recolouring monochrome marks.
- Shell-based integration test harnesses for all three demo environments: scripts/integration-test-uc.sh (7 tests), scripts/integration-test-glue.sh (8 tests), scripts/integration-test-bigquery.sh (9 tests). Each validates extraction, enrichment, catalog push, API, watcher, and Docker, with --skip-docker for CI.
- Documentation polish: prominent docs-site link on the README, deep-links from catalog feature bullets to per-catalog guides, GCP environment variables documented, lineage-bridge-api entrypoint surfaced.
Changed
- Shared OpenLineage namespace normalizer (api/openlineage/normalize.py): used by both Google and DataZone providers, parametrised by allowlist ({bigquery} for Google, {kafka, aws} and {bigquery, aws} for DataZone).
- Shared upstream-chain builder (catalogs/upstream_chain.py): single source of truth for chain shape, used by all four catalogs.
- google-auth is now a hard dependency (was previously imported lazily and silently failed if missing).
- BigQuery connector lineage: BigQuery sink connectors now synthesise per-topic GOOGLE_TABLE nodes in clients/connect.py so the publish UI surfaces them directly without requiring Tableflow.
Fixed
- Deploy Documentation workflow: 9 broken cross-reference links in docs/how-to/index.md, docs/reference/glossary.md, and docs/demos/index.md retargeted to existing pages so mkdocs build --strict passes.
- CI lint: import ordering + formatter applied to lineage_bridge/api/routers/tasks.py (had been failing ruff check on every PR).
- IAM trust policy corrections and Interactive API Explorer fix (use built-in Swagger UI).
- 7 bugs discovered during integration-test validation (full list in docs/INTEGRATION_TESTS_BUG_FIXES.md).
[0.4.0] - 2024-12-15
Major Features:
- Multi-demo infrastructure support for parallel Confluent environments
- Google Data Lineage API provider integration
- OpenLineage API server with FastAPI backend
- Comprehensive MkDocs Material documentation site
Added
- OpenLineage API server at /api/v1/lineage with Swagger UI at /docs
- Google Data Lineage provider for GCP integration
- Dataset Lineage API documentation and reference
- Interactive API explorer with built-in Swagger UI
- MkDocs Material documentation infrastructure (port 8001)
- Getting Started guide with tab-based examples
- User Guide documentation
- Architecture and Troubleshooting documentation
- Catalog Integration documentation with diagrams
- API Reference documentation with practical examples
- Multi-demo infrastructure for testing multiple catalog providers
- Auto-provisioning of Confluent Cloud API keys via make ui
- Interactive demo credential setup with CLI auto-detection
- Auto-generate .env after demo provisioning
- Auto-launch UI after demo provisioning
- Demo architecture diagram
- Service principal support for Databricks notebook jobs
- IAM role trust policy fixes for AWS integration
Changed
- Refactored API Reference with tabs and practical examples
- Refactored Getting Started with human-friendly text
- Refactored Catalog Integration with improved diagrams
- Switched from custom Scalar to built-in Swagger UI
- Changed docs port from 8000 to 8001 to avoid API server conflict
- Improved CLI prerequisite auto-installation during demo setup
- Enhanced post-provisioning messages with available commands
Fixed
- IAM trust policy for self-assume role in AWS Glue integration
- OAuth secret creation error handling in Databricks setup
- Databricks CLI auth environment variable parsing
- Docker workflow to use correct Dockerfile path
- Broken links in Google Data Lineage documentation
- Import sorting in generate-diagram script
[0.3.0] - 2024-11-20
Major Features:
- Change-detection watcher with REST polling and debounced re-extraction
- AWS Glue catalog provider integration
- Databricks lineage discovery and push-to-UC capabilities
- Interactive demo infrastructure with Terraform
Added
- Change-detection watcher CLI (lineage-bridge-watch)
- REST polling mode for change detection (10s interval, 30s debounce)
- Audit log Kafka consumer mode (retained for future use)
- AWS Glue catalog provider with metadata enrichment
- Format-based catalog matching for Tableflow integration
- Databricks lineage discovery via SQL Statement Execution API
- Push lineage metadata to Unity Catalog as table properties and comments
- PostgreSQL RDS sink connector for enriched orders demo
- Flink compute pool and SQL statements in demo infrastructure
- Self-contained demo infrastructure (Kafka → Tableflow → S3 → Databricks UC)
- Docker multi-stage build with watcher service
- make watch and make docker-watch targets
- Dark mode support for labels and UI components
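The REST polling mode listed above pairs a 10s poll interval with a 30s debounce. A minimal sketch of what such a debounce could look like, assuming trailing-edge semantics (re-extract only after 30 seconds with no further changes); the class name Debouncer and its methods are hypothetical and not taken from the watcher code:

```python
from dataclasses import dataclass
from typing import Optional

POLL_INTERVAL_S = 10.0  # from the changelog: REST polling interval
DEBOUNCE_S = 30.0       # from the changelog: debounce window


@dataclass
class Debouncer:
    """Trailing-edge debounce: fire once the quiet period has elapsed."""

    quiet_period: float = DEBOUNCE_S
    last_change: Optional[float] = None

    def record_change(self, now: float) -> None:
        # Every detected change restarts the quiet period.
        self.last_change = now

    def should_fire(self, now: float) -> bool:
        if self.last_change is None:
            return False
        if now - self.last_change < self.quiet_period:
            return False
        # Quiet period elapsed: fire once and wait for the next change.
        self.last_change = None
        return True


if __name__ == "__main__":
    d = Debouncer()
    d.record_change(0.0)
    for t in (10.0, 20.0, 30.0, 40.0):
        print(t, d.should_fire(t))
```

Taking the clock value as an argument keeps the logic free of sleeps and threads, so it can be driven from any polling loop and tested with plain numbers.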
Changed
- Separated extract and enrich into independent operations
- Redesigned node icons with representative shapes, color-coded by system
- Switched Makefile to use uv package manager
- Moved Docker files to infra/docker/ directory
- Moved scripts to infra/ directory
- Removed key provisioning UI in favor of CLI-based auto-provisioning
- Switched to single-environment selection in UI
Fixed
- Flink SQL parser to preserve dots inside backtick-quoted identifiers
- Unity Catalog integration to handle real Confluent API format
- Dot-to-underscore mapping for UC table names
- Schema definition display in topic detail panel
- Neighbor display in node details
- Demo .env output to match settings.py field names
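The Flink SQL parser fix in the list above concerns dots inside backtick-quoted identifiers. The idea can be illustrated with a small splitter that only treats a dot as a separator outside backticks. This is a sketch, not the project's parser: split_qualified_name is a hypothetical name, and escaped backticks inside an identifier are not preserved.

```python
from typing import List


def split_qualified_name(name: str) -> List[str]:
    """Split a qualified name on dots, ignoring dots inside backticks."""
    parts: List[str] = []
    current: List[str] = []
    in_backticks = False
    for ch in name:
        if ch == "`":
            # Toggle quoting state; the backtick itself is dropped.
            in_backticks = not in_backticks
        elif ch == "." and not in_backticks:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


if __name__ == "__main__":
    print(split_qualified_name("`my-env`.`cluster`.`orders.v1`"))
```

A naive name.split(".") would break the last identifier into two parts, which is the kind of mis-parse the fix addresses.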
Removed
- Unused infra scaffolding (kept infra/demo only)
- Databricks-to-Kafka publish job and topic
[0.2.0] - 2024-10-15
Major Features:
- Catalog provider framework with extensible protocol
- Databricks Unity Catalog provider
- Custom vis.js graph component with interactive features
- Metrics API integration for real-time throughput
Added
- Catalog provider framework (catalogs/protocol.py)
- Databricks Unity Catalog provider with enrichment and lineage push
- Custom vis.js Streamlit component for graph visualization
- Sugiyama-style DAG layout with minimal edge crossings
- Shift+drag region selection in graph UI
- Confluent Metrics API integration for throughput data
- Rich HTML tooltips with node metadata
- Deep links to Confluent Cloud, Databricks, and AWS consoles
- Comprehensive unit tests (208 tests total)
- Per-extractor timeouts and graph validation
- Kafka protocol fallback for cluster connectivity
- CI coverage gate (80% threshold)
- Docker pipeline with healthchecks
Changed
- Decomposed app.py into state, discovery, extraction, and sidebar modules
- Expanded graph renderer tests to 104 test cases
- Unified URL dispatch for catalog providers
- UX redesign with multi-credential support
Fixed
- Dark mode: theme-safe colors for CSS, tooltips, and detail panel
- Tableflow extraction and mapping logic
- Flink CTAS parsing for CREATE TABLE AS SELECT statements
- Connector extraction and external dataset handling
- Consumer group membership edges
- Graph layout edge crossing optimization
Added Documentation
- Architecture decision log (ADR) with Phase 1 tradeoffs
- Agent crew model and master plan in docs/
- CLAUDE.md with project conventions and architecture
- Apache 2.0 license and CONTRIBUTING.md
[0.1.0] - 2024-09-01
Initial Release
- Confluent Cloud lineage extraction via REST v3 APIs
- Streamlit UI with interactive graph visualization
- Support for Kafka topics, connectors, Flink jobs, ksqlDB queries
- Schema Registry integration for Avro/Protobuf/JSON schemas
- Stream Catalog integration for tags and business metadata
- Consumer group membership tracking
- Multi-environment and multi-cluster support
- Encrypted local JSON cache
- Auto-provisioning of cluster-scoped API keys via Confluent CLI
Extraction Clients
- KafkaAdmin: topics and consumer groups
- Connect: source and sink connectors
- Flink: SQL statement parsing
- ksqlDB: persistent query parsing
- SchemaRegistry: schema enrichment
- StreamCatalog: tag and metadata enrichment
UI Features
- Interactive directed graph with drag and zoom
- Click-to-inspect detail panel
- Search by qualified name
- Export graph as JSON
- Connection settings and credential management
Migration Guides
Upgrading to v0.4.0
Breaking Changes:
- Documentation moved from port 8000 to 8001 to avoid API server conflict
- Docker file paths changed to infra/docker/; update any custom compose files
New Features:
- OpenLineage API available at http://localhost:8000/api/v1/lineage
- Interactive API docs at http://localhost:8000/docs
- Google Data Lineage provider enabled via LINEAGE_BRIDGE_GOOGLE_PROJECT_ID
Upgrading to v0.3.0
Breaking Changes:
- Environment variable format changed: CLUSTER_CREDENTIALS now uses JSON map instead of flat KAFKA_API_KEY
- Docker files moved from root to infra/docker/; update references in CI/CD
- Scripts moved to infra/ directory
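To make the move from a flat key to a JSON map concrete, the sketch below parses and sanity-checks a CLUSTER_CREDENTIALS value. The shape shown (cluster ID mapped to an object with api_key and api_secret) is an assumption for illustration only, as is the helper name parse_cluster_credentials; .env.example remains the authoritative reference for the real format.

```python
import json
from typing import Dict


def parse_cluster_credentials(raw: str) -> Dict[str, Dict[str, str]]:
    """Parse a JSON map of per-cluster credentials and check its basic shape.

    Assumed shape (hypothetical): {"<cluster-id>": {"api_key": ..., "api_secret": ...}}
    """
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("CLUSTER_CREDENTIALS must be a JSON object keyed by cluster")
    for cluster_id, creds in data.items():
        if not isinstance(creds, dict):
            raise ValueError(f"credentials for {cluster_id!r} must be a JSON object")
    return data


if __name__ == "__main__":
    # Placeholder values, not real credentials.
    raw = '{"lkc-abc123": {"api_key": "KEY", "api_secret": "SECRET"}}'
    print(parse_cluster_credentials(raw))
```

Validating the value at startup surfaces a leftover flat KAFKA_API_KEY style configuration as a clear error instead of a later authentication failure.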
New Features:
- Change-detection watcher: uv run lineage-bridge-watch
- AWS Glue support: set LINEAGE_BRIDGE_AWS_REGION in .env
- Databricks lineage push: automatically writes to UC table properties
Migration Steps:
- Update .env file format (see .env.example)
- Update Docker Compose paths: docker-compose.yml → infra/docker/docker-compose.yml
- Run make format to apply new code style rules
Upgrading to v0.2.0
Breaking Changes:
- app.py decomposed into multiple modules; custom imports need updates
- Graph component switched from streamlit-agraph to custom vis.js component
New Features:
- Catalog providers: add Databricks or Glue credentials to .env
- Metrics API: enriches graphs with throughput data (opt-in)
Migration Steps:
- Install updated dependencies: uv pip install -e ".[dev]"
- Update any custom UI code to use new module structure
- Run make test to validate changes