AWS Glue Demo

This demo showcases LineageBridge's AWS Glue Data Catalog integration, demonstrating how Kafka topics are materialized as Iceberg tables and registered in AWS Glue. It's a simpler alternative to the Unity Catalog demo, staying entirely within the AWS ecosystem without requiring a separate Databricks workspace.

Architecture

The demo provisions a streaming data pipeline with Confluent Cloud, AWS S3, and AWS Glue Data Catalog:

graph LR
    subgraph "Data Sources"
        DG1[Datagen: Orders]
        DG2[Datagen: Customers]
    end

    subgraph "Confluent Cloud (AWS us-east-1)"
        T1[orders_v2 topic]
        T2[customers_v2 topic]

        subgraph "Flink SQL"
            F1[enriched_orders JOIN]
            F2[order_stats AGG]
        end

        T3[enriched_orders topic]
        T4[order_stats topic]

        subgraph "Tableflow → Iceberg"
            TF1[orders_v2 table]
            TF2[customers_v2 table]
            TF3[order_stats table]
        end
    end

    subgraph "AWS Infrastructure"
        S3[S3 Bucket: BYOB]
        ROLE[IAM Role: Tableflow]

        subgraph "Glue Data Catalog"
            DB[Database: lkc_*]
            GT1[orders_v2 table]
            GT2[customers_v2 table]
            GT3[order_stats table]
        end
    end

    DG1 -->|produces| T1
    DG2 -->|produces| T2
    T1 -->|consumes| F1
    T2 -->|consumes| F1
    F1 -->|produces| T3
    T1 -->|consumes| F2
    F2 -->|produces| T4

    TF1 -.writes Iceberg.-> S3
    TF2 -.writes Iceberg.-> S3
    TF3 -.writes Iceberg.-> S3
    S3 -.accessed via.-> ROLE

    TF1 -->|registers| GT1
    TF2 -->|registers| GT2
    TF3 -->|registers| GT3

    GT1 -.queryable via.-> ATHENA[Amazon Athena]
    GT2 -.queryable via.-> ATHENA
    GT3 -.queryable via.-> ATHENA

    style DB fill:#ff9a00
    style S3 fill:#ff6b35
    style ATHENA fill:#8c4fff

Key Components

Datagen Sources — Generate realistic orders and customers data with Avro schemas
Flink SQL Jobs — Stream JOIN (enriched_orders) and windowed aggregation (order_stats)
Tableflow BYOB — Materializes 3 topics as Iceberg tables in S3
AWS Glue Data Catalog — Registers Iceberg tables in a dedicated Glue database (auto-created by Tableflow)
Amazon Athena — Query Iceberg tables using standard SQL (serverless analytics)

What's Different from Unity Catalog Demo

No Databricks — Stays entirely within AWS ecosystem
No ksqlDB — Simpler pipeline focusing on Kafka → Flink → Tableflow → Glue
No PostgreSQL sink — Fewer external dependencies
Iceberg format — Tables use Apache Iceberg instead of Delta Lake
Lower cost — No Databricks or ksqlDB charges (see Cost Breakdown for the estimated monthly total)

Prerequisites

Before provisioning, ensure you have:

CLI Tools

Terraform >= 1.5
Confluent CLI — logged in: confluent login --save
AWS CLI — configured: aws configure or aws sso login

The demo's make setup command can auto-install these via Homebrew if you're on macOS.

AWS Account & Permissions

You'll need an AWS account with sufficient permissions to create:

IAM roles and policies
S3 buckets
Glue Data Catalog databases and tables
Athena workgroups (optional, for queries)

Recommended: the AdministratorAccess policy. PowerUserAccess on its own is not sufficient, because it does not allow creating IAM roles and policies.

Credentials You'll Need

The setup script will prompt for these if not auto-detected:

Confluent Cloud API Key + Secret — Cloud-scoped credentials (auto-created via CLI if missing)
AWS Account ID — Auto-detected from aws sts get-caller-identity
AWS Region — Defaults to us-east-1

Optional: AWS DataZone

If your account has an AWS DataZone domain in the demo's region, setup-tfvars.sh auto-detects the domain and a project inside it. With a single match, the script picks it automatically; with several matches, it prompts you to choose; with none, it skips DataZone silently. The selected IDs are written to terraform.tfvars and passed through to the generated .env, which makes the Push to DataZone button appear in the Streamlit UI alongside Push to Glue. DataZone is strictly opt-in: without a DataZone domain, the rest of the demo still works. See AWS DataZone Integration for what gets registered.
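The zero / one / many selection rule described above can be sketched in Python. This is only an illustration: choose_id and pick_first are hypothetical names, the IDs are made up, and the real logic lives in setup-tfvars.sh (a shell script) and may differ in detail.

```python
from typing import Callable, Optional, Sequence


def choose_id(
    candidates: Sequence[str],
    prompt: Callable[[Sequence[str]], str],
) -> Optional[str]:
    """Apply the zero / one / many rule to a list of detected IDs.

    Hypothetical helper for illustration; not taken from setup-tfvars.sh.
    """
    if not candidates:
        return None               # nothing found: skip DataZone silently
    if len(candidates) == 1:
        return candidates[0]      # single match: pick it automatically
    return prompt(candidates)     # several matches: ask the user


def pick_first(options: Sequence[str]) -> str:
    """Stand-in for an interactive prompt; always takes the first option."""
    return options[0]


if __name__ == "__main__":
    # The IDs below are invented for the example.
    print(choose_id([], pick_first))
    print(choose_id(["dzd_example1"], pick_first))
    print(choose_id(["dzd_example1", "dzd_example2"], pick_first))
```

Passing the prompt in as a callback keeps the rule itself testable without any terminal interaction.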

Provisioning

Step 1: Credential Setup

Run the interactive setup wizard from the infra/demos/glue directory:

Make:

cd infra/demos/glue
make setup

Direct Script:

cd infra/demos/glue
bash scripts/setup-tfvars.sh

The script will:

Check for required CLI tools (install via Homebrew if missing)
Detect Confluent Cloud credentials from .env or environment variables, or create them via confluent api-key create --resource cloud
Detect the AWS account ID from aws sts get-caller-identity and set the region (defaults to us-east-1)
Generate terraform.tfvars with all detected values
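The credential lookup order in the list above (.env first, then environment variables, then creating a new key) could look roughly like this in Python. The helper names and the variable name CONFLUENT_CLOUD_API_KEY are assumptions for illustration; the actual script is written in shell and its variable names are not shown in this guide.

```python
import os
from typing import Dict, Mapping, Optional, Tuple


def parse_dotenv(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_credential(
    name: str,
    dotenv: Mapping[str, str],
    environ: Mapping[str, str],
) -> Tuple[Optional[str], str]:
    """Return (value, source), preferring .env over the process environment."""
    if dotenv.get(name):
        return dotenv[name], ".env"
    if environ.get(name):
        return environ[name], "environment"
    # Nothing found: the caller would fall back to creating a key via the CLI.
    return None, "missing"


if __name__ == "__main__":
    # Hypothetical variable name and placeholder value.
    dotenv = parse_dotenv("CONFLUENT_CLOUD_API_KEY=abc-12345\n")
    print(resolve_credential("CONFLUENT_CLOUD_API_KEY", dotenv, os.environ))
```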

Example output:

══════════════════════════════════════════════════════════════════
  LineageBridge Glue Demo — Credential Setup
══════════════════════════════════════════════════════════════════

  All required CLIs found: confluent, aws

▸ Confluent Cloud credentials
  Using existing Cloud API key: abc-12345 (from .env)

▸ AWS credentials
  Account ID: 123456789012 (auto-detected)
  Region: us-east-1

✓ terraform.tfvars written successfully

Step 2: Provision Infrastructure

Deploy all resources via Terraform:

Make (Recommended):

make demo-up

Terraform Direct:

terraform init
terraform apply -auto-approve

The demo-up target automatically runs make setup if terraform.tfvars is missing, then executes scripts/provision-demo.sh, which:

Runs terraform init and terraform apply
Creates a Tableflow API key via Confluent CLI (required for BYOB)
Re-runs terraform apply with Tableflow credentials to complete integration
Executes health checks waiting for Tableflow tables to appear in Glue
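The final health-check step amounts to polling until the expected tables show up. A minimal sketch of such a loop follows, assuming a caller-supplied list_tables function (for example one that wraps aws glue get-tables). The function name, timeout, and interval are illustrative and are not taken from provision-demo.sh.

```python
import time
from typing import Callable, Iterable, Set


def wait_for_tables(
    list_tables: Callable[[], Iterable[str]],
    expected: Set[str],
    timeout_s: float = 600.0,
    interval_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Set[str]:
    """Poll until every expected table is listed.

    Returns the set of tables still missing when polling stops; an empty
    set means all tables appeared before the timeout.
    """
    deadline = clock() + timeout_s
    while True:
        missing = set(expected) - set(list_tables())
        if not missing or clock() >= deadline:
            return missing
        sleep(interval_s)


if __name__ == "__main__":
    # Simulated catalog that fills up over three polls (no AWS calls).
    responses = iter([[], ["lineage_bridge_orders_v2"],
                      ["lineage_bridge_orders_v2", "lineage_bridge_order_stats"]])
    wanted = {"lineage_bridge_orders_v2", "lineage_bridge_order_stats"}
    print(wait_for_tables(lambda: next(responses), wanted, sleep=lambda _: None))
```

Injecting sleep and clock keeps the loop easy to exercise without waiting in real time.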

Provisioning takes 10-12 minutes. Terraform will create approximately 30 resources:

Confluent Cloud: 1 environment, 1 Kafka cluster, 1 service account, 4 API keys, 2 topics, 2 datagen connectors, 1 Flink compute pool, 2 Flink statements, 3 Tableflow topics, 1 provider integration, 1 catalog integration
AWS: 1 S3 bucket, 1 IAM role (with S3 + Glue policies), 1 bucket policy

Step 3: Verify Provisioning

Once Terraform completes, verify the environment:

Confluent Cloud Console:

Navigate to Confluent Cloud Environments
Open the environment named lb-glue-{random} (example: lb-glue-a1b2c3d4)
Verify Kafka cluster is RUNNING
Check Topics: lineage_bridge.orders_v2, lineage_bridge.customers_v2, lineage_bridge.enriched_orders, lineage_bridge.order_stats
Inspect Connectors: lb-glue-*-orders-datagen, lb-glue-*-customers-datagen (all RUNNING)
Open Flink SQL workspace: statements lb-glue-*-enrich-orders and lb-glue-*-order-stats should be RUNNING

AWS Console:

S3 Bucket:

Open AWS Console → S3
Find bucket lb-glue-{random}-tableflow
Browse to see Iceberg table directories: lineage_bridge_orders_v2/, lineage_bridge_customers_v2/, lineage_bridge_order_stats/
Within each directory, you'll see Iceberg metadata and data files

Glue Data Catalog:

Navigate to AWS Console → Glue → Data Catalog → Databases
Find database lkc_{cluster_id} (example: lkc_mjnq51)
Click on the database → Tables
Verify 3 tables: lineage_bridge_orders_v2, lineage_bridge_customers_v2, lineage_bridge_order_stats
Click on lineage_bridge_orders_v2 → Schema tab to see Iceberg column definitions

IAM Role:

Navigate to IAM → Roles → lb-glue-{random}-tableflow-role
Verify attached policies: tableflow-s3-access, tableflow-glue-access
Check Trust relationships — should allow Confluent's Tableflow service principal to assume the role
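The names used in this step follow a visible pattern: the cluster ID lkc-mjnq51 appears as Glue database lkc_mjnq51, and the topic lineage_bridge.orders_v2 appears as table lineage_bridge_orders_v2. The sketch below encodes only that observed pattern. It is an assumption drawn from the examples in this guide, not Tableflow's documented naming rule, and the function names are hypothetical.

```python
def glue_database_name(cluster_id: str) -> str:
    """Map a Kafka cluster ID to the Glue database name seen in this demo."""
    return cluster_id.replace("-", "_")


def glue_table_name(topic: str) -> str:
    """Map a topic name to the Glue table name seen in this demo."""
    return topic.replace(".", "_")


def expected_glue_tables(cluster_id: str, topics: list) -> list:
    """Return database-qualified table names to look for in the console."""
    database = glue_database_name(cluster_id)
    return [f"{database}.{glue_table_name(topic)}" for topic in topics]


if __name__ == "__main__":
    topics = [
        "lineage_bridge.orders_v2",
        "lineage_bridge.customers_v2",
        "lineage_bridge.order_stats",
    ]
    for name in expected_glue_tables("lkc-mjnq51", topics):
        print(name)
```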

Step 4: Run LineageBridge Extraction

Extract lineage metadata from the live environment:

cd ../../..  # Return to project root
uv run lineage-bridge-extract

The extractor will:

Auto-configure from the Terraform outputs (stored in .env by terraform output -raw demo_env_file)
Execute the 5-phase extraction pipeline:
- Phase 1: Kafka topics and consumer groups
- Phase 2: Connectors, Flink (parallel)
- Phase 3: Schema Registry and Stream Catalog enrichment (parallel)
- Phase 4: Tableflow tables and AWS Glue integration
- Phase 4b: AWS Glue catalog enrichment (fetch table metadata from Glue Data Catalog API)
- Phase 5: Metrics (throughput for topics)
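The ordering in the list above (phases run one after another, while the steps inside Phase 2 and Phase 3 run in parallel) might be sketched with asyncio as follows. The run_phases and make_step helpers and the step names are hypothetical; this is not LineageBridge's actual extractor API.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence

Step = Callable[[], Awaitable[str]]


async def run_phases(phases: Sequence[Sequence[Step]]) -> List[List[str]]:
    """Run phases sequentially; steps within one phase run concurrently."""
    results: List[List[str]] = []
    for steps in phases:
        # gather() returns results in the order the steps were passed in.
        results.append(list(await asyncio.gather(*(step() for step in steps))))
    return results


def make_step(name: str, delay: float = 0.0) -> Step:
    """Build a dummy step that waits briefly and returns its own name."""
    async def step() -> str:
        await asyncio.sleep(delay)
        return name
    return step


if __name__ == "__main__":
    pipeline = [
        [make_step("kafka_admin")],
        [make_step("connectors", 0.01), make_step("flink")],
        [make_step("schema_registry"), make_step("stream_catalog")],
        [make_step("tableflow")],
        [make_step("glue_enrichment")],
        [make_step("metrics")],
    ]
    print(asyncio.run(run_phases(pipeline)))
```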

Expected output:

▸ Phase 1: Kafka Admin (lkc-mjnq51)
  ✓ 4 topics, 2 consumer groups

▸ Phase 2: Transformations (parallel)
  ✓ 2 connectors (2 source)
  ✓ 2 Flink statements

▸ Phase 3: Enrichment (parallel)
  ✓ 4 schemas from Schema Registry
  ✓ Stream Catalog: 0 tags, 0 business metadata

▸ Phase 4: Tableflow
  ✓ 3 Tableflow topics (ICEBERG)
  ✓ AWS Glue integration: lb-glue-a1b2c3d4-glue

▸ Phase 4b: Catalog Enrichment
  ✓ AWS Glue: 3 tables in database lkc_mjnq51

▸ Phase 5: Metrics
  ✓ Throughput: 4 topics

Graph Summary:
  Nodes: 20 (4 topics, 2 connectors, 2 Flink jobs, 3 Tableflow tables, 3 Glue tables, 4 schemas, 2 consumer groups)
  Edges: 28 (8 PRODUCES, 6 CONSUMES, 4 TRANSFORMS, 3 MATERIALIZES, 4 HAS_SCHEMA, 3 MEMBER_OF)

Step 5: Launch the UI

Open the interactive lineage graph:

uv run streamlit run lineage_bridge/ui/app.py

Your browser will open to http://localhost:8501. The UI displays:

Hierarchical graph layout — Data flows from Datagen sources through Kafka topics, Flink transformations, Tableflow, and into Glue tables
Interactive nodes — Click any node to see metadata panel (schema, owner, throughput, Glue properties)
Deep links — Nodes link directly to Confluent Cloud Console, AWS Glue Console, S3 bucket

Expected Lineage Graph

You should see the following node types connected by lineage edges:

Kafka Topics (4 nodes)

lineage_bridge.orders_v2 — Source topic from datagen
lineage_bridge.customers_v2 — Source topic from datagen
lineage_bridge.enriched_orders — Derived topic from Flink JOIN
lineage_bridge.order_stats — Derived topic from Flink windowed aggregation

Connectors (2 nodes)

lb-glue-*-orders-datagen — Datagen source (PRODUCES → orders_v2)
lb-glue-*-customers-datagen — Datagen source (PRODUCES → customers_v2)

Flink Jobs (2 nodes)

lb-glue-*-enrich-orders — Stream JOIN (CONSUMES ← orders_v2, customers_v2 | PRODUCES → enriched_orders)
lb-glue-*-order-stats — Windowed aggregation (CONSUMES ← orders_v2 | PRODUCES → order_stats)

Tableflow Tables (3 nodes)

lineage_bridge.orders_v2 (ICEBERG) — Tableflow materialization (MATERIALIZES → Glue table)
lineage_bridge.customers_v2 (ICEBERG) — Tableflow materialization (MATERIALIZES → Glue table)
lineage_bridge.order_stats (ICEBERG) — Tableflow materialization (MATERIALIZES → Glue table)

AWS Glue Tables (3 nodes)

lkc_*.lineage_bridge_orders_v2 — Iceberg table registered via Tableflow
lkc_*.lineage_bridge_customers_v2 — Iceberg table registered via Tableflow
lkc_*.lineage_bridge_order_stats — Iceberg table registered via Tableflow

Schemas (4 nodes)

lineage_bridge.orders_v2-value — Avro schema for orders
lineage_bridge.customers_v2-value — Avro schema for customers
lineage_bridge.enriched_orders-value — Avro schema for enriched orders
lineage_bridge.order_stats-value — Avro schema for order stats (includes window_start, window_end)
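To see how these nodes connect, the data-flow edges listed above can be written down as pairs and walked upstream from any node. The sketch below is a simplified model: node labels are shortened, edge types are dropped, and the topic-to-Tableflow edges are assumed from the matching names rather than stated explicitly in this guide.

```python
from collections import defaultdict, deque
from typing import Iterable, Set, Tuple

# (source, target) pairs in the direction data flows.
EDGES = [
    ("connector:orders-datagen", "topic:orders_v2"),
    ("connector:customers-datagen", "topic:customers_v2"),
    ("topic:orders_v2", "flink:enrich-orders"),
    ("topic:customers_v2", "flink:enrich-orders"),
    ("flink:enrich-orders", "topic:enriched_orders"),
    ("topic:orders_v2", "flink:order-stats"),
    ("flink:order-stats", "topic:order_stats"),
    ("topic:orders_v2", "tableflow:orders_v2"),        # assumed edge
    ("topic:customers_v2", "tableflow:customers_v2"),  # assumed edge
    ("topic:order_stats", "tableflow:order_stats"),    # assumed edge
    ("tableflow:orders_v2", "glue:orders_v2"),
    ("tableflow:customers_v2", "glue:customers_v2"),
    ("tableflow:order_stats", "glue:order_stats"),
]


def upstream(node: str, edges: Iterable[Tuple[str, str]]) -> Set[str]:
    """Return every node that feeds, directly or indirectly, into `node`."""
    parents = defaultdict(list)
    for source, target in edges:
        parents[target].append(source)
    seen: Set[str] = set()
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for parent in parents[current]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


if __name__ == "__main__":
    print(sorted(upstream("glue:order_stats", EDGES)))
```

Walking upstream from the order_stats Glue table reaches the orders datagen connector but not the customers topic, which matches the Flink aggregation consuming only orders_v2.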

Querying with Amazon Athena

AWS Glue tables are queryable via Amazon Athena, AWS's serverless SQL query engine.

Set Up Athena Workgroup

Open AWS Console → Athena
If this is your first time using Athena, create a query result location:
- Navigate to Settings tab
- Set query result location: s3://lb-glue-{random}-tableflow/athena-results/
Return to Query editor

Example Queries

Query the Glue-registered Iceberg tables:

-- Count rows in orders table
SELECT COUNT(*) AS total_orders
FROM lkc_mjnq51.lineage_bridge_orders_v2;

-- Count rows in customers table
SELECT COUNT(*) AS total_customers
FROM lkc_mjnq51.lineage_bridge_customers_v2;

-- View sample orders
SELECT *
FROM lkc_mjnq51.lineage_bridge_orders_v2
LIMIT 10;

-- Join orders and customers (replicates Flink enrichment)
SELECT
  o.order_id,
  c.name AS customer_name,
  c.country AS customer_country,
  o.product_name,
  o.price,
  o.order_status,
  o.created_at
FROM lkc_mjnq51.lineage_bridge_orders_v2 o
JOIN lkc_mjnq51.lineage_bridge_customers_v2 c
  ON o.customer_id = c.customer_id
WHERE o.price > 100
ORDER BY o.price DESC
LIMIT 20;

-- Aggregate order stats
SELECT
  order_status,
  SUM(order_count) AS total_orders,
  SUM(total_quantity) AS total_quantity
FROM lkc_mjnq51.lineage_bridge_order_stats
GROUP BY order_status;

Iceberg Time Travel (Advanced)

Iceberg tables support time travel queries. List available snapshots:

-- View table history
SELECT * FROM lkc_mjnq51."lineage_bridge_orders_v2$snapshots";

Query data as of a specific snapshot:

-- Query historical data (replace the snapshot ID with a real value from $snapshots)
-- Athena engine version 3 syntax; engine version 2 used FOR SYSTEM_VERSION AS OF
SELECT COUNT(*) AS rows_in_snapshot
FROM lkc_mjnq51.lineage_bridge_orders_v2
FOR VERSION AS OF 1234567890123456789;

AWS Glue Integration Details

The demo showcases how LineageBridge enriches Glue catalog nodes with AWS-specific metadata.

Glue Table Metadata

Click on a Glue table node (e.g., lineage_bridge_orders_v2) in the graph to see the metadata panel:

{
  "node_id": "AWS:GLUE_TABLE:env-26wn6m:lkc_mjnq51.lineage_bridge_orders_v2",
  "node_type": "GLUE_TABLE",
  "qualified_name": "lkc_mjnq51.lineage_bridge_orders_v2",
  "display_name": "lineage_bridge_orders_v2",
  "database": "lkc_mjnq51",
  "table_type": "EXTERNAL_TABLE",
  "storage_format": "ICEBERG",
  "storage_location": "s3://lb-glue-a1b2c3d4-tableflow/lineage_bridge_orders_v2/",
  "owner": "arn:aws:iam::123456789012:role/lb-glue-a1b2c3d4-tableflow-role",
  "created_at": "2026-04-30T12:34:56.000Z",
  "updated_at": "2026-04-30T12:40:12.000Z",
  "columns": [
    {"name": "order_id", "type": "string"},
    {"name": "customer_id", "type": "string"},
    {"name": "product_name", "type": "string"},
    {"name": "quantity", "type": "bigint"},
    {"name": "price", "type": "double"},
    {"name": "order_status", "type": "string"},
    {"name": "created_at", "type": "string"}
  ],
  "parameters": {
    "table_type": "ICEBERG",
    "metadata_location": "s3://lb-glue-a1b2c3d4-tableflow/lineage_bridge_orders_v2/metadata/00001-abc123.metadata.json"
  },
  "url": "https://console.aws.amazon.com/glue/home?region=us-east-1#/v2/data-catalog/tables/view/lineage_bridge_orders_v2?database=lkc_mjnq51"
}
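Most fields in the panel above have an obvious counterpart in the Table object returned by the Glue GetTable API. The sketch below shows one plausible mapping. The function glue_table_to_node is hypothetical, and the choice to read storage_format from the table_type parameter is an assumption; LineageBridge's own mapping may differ.

```python
from typing import Any, Dict


def glue_table_to_node(table: Dict[str, Any], environment_id: str, region: str) -> Dict[str, Any]:
    """Build a node dict from the "Table" object of a Glue GetTable response."""
    database = table["DatabaseName"]
    name = table["Name"]
    descriptor = table.get("StorageDescriptor", {})
    parameters = table.get("Parameters", {})
    qualified = f"{database}.{name}"
    return {
        "node_id": f"AWS:GLUE_TABLE:{environment_id}:{qualified}",
        "node_type": "GLUE_TABLE",
        "qualified_name": qualified,
        "display_name": name,
        "database": database,
        "table_type": table.get("TableType"),
        # Assumption: the Iceberg marker lives in Parameters["table_type"].
        "storage_format": parameters.get("table_type"),
        "storage_location": descriptor.get("Location"),
        "columns": [
            {"name": column["Name"], "type": column["Type"]}
            for column in descriptor.get("Columns", [])
        ],
        "parameters": dict(parameters),
        "url": (
            f"https://console.aws.amazon.com/glue/home?region={region}"
            f"#/v2/data-catalog/tables/view/{name}?database={database}"
        ),
    }


if __name__ == "__main__":
    sample = {
        "Name": "lineage_bridge_orders_v2",
        "DatabaseName": "lkc_mjnq51",
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"table_type": "ICEBERG"},
        "StorageDescriptor": {
            "Location": "s3://lb-glue-a1b2c3d4-tableflow/lineage_bridge_orders_v2/",
            "Columns": [{"Name": "order_id", "Type": "string"}],
        },
    }
    print(glue_table_to_node(sample, "env-26wn6m", "us-east-1")["node_id"])
```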

Verifying Tableflow Registration

Check that Tableflow successfully registered tables in Glue:

# List Glue databases
aws glue get-databases --region us-east-1

# List tables in the cluster database
aws glue get-tables --database-name lkc_mjnq51 --region us-east-1

# Get specific table metadata
aws glue get-table \
  --database-name lkc_mjnq51 \
  --name lineage_bridge_orders_v2 \
  --region us-east-1

Validation Queries

Run these queries to validate data flow through the pipeline.

Check Kafka Topic Data

confluent kafka topic consume lineage_bridge.orders_v2 \
  --from-beginning --max-messages 5

Verify Flink Transformations

-- Via Confluent Cloud Console → Flink SQL Workspace
-- Backticks keep the dot as part of the table name instead of a database.table separator
SELECT * FROM `lineage_bridge.enriched_orders` LIMIT 10;

Query Glue Tables via Athena

-- Via AWS Athena Query Editor
SELECT COUNT(*) FROM lkc_mjnq51.lineage_bridge_orders_v2;

Inspect S3 Iceberg Files

# List Iceberg metadata files
aws s3 ls s3://lb-glue-{random}-tableflow/lineage_bridge_orders_v2/metadata/ --recursive

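# Download Iceberg metadata (example); aws s3 cp does not expand wildcards in the
# source path, so filter with --exclude/--include instead
aws s3 cp s3://lb-glue-{random}-tableflow/lineage_bridge_orders_v2/metadata/ . \
  --recursive --exclude "*" --include "00001-*.metadata.json"
cat 00001-*.metadata.json | jq .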
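Once a metadata file is downloaded, its snapshot list can be summarized with a short script. The sketch below relies on the current-snapshot-id and snapshots fields defined by the Iceberg table metadata format; the helper names are illustrative, and the file path in the example is a placeholder.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, List


def load_metadata(path: str) -> Dict[str, Any]:
    """Read an Iceberg *.metadata.json file from local disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize_snapshots(metadata: Dict[str, Any]) -> List[Dict[str, Any]]:
    """List snapshot IDs with commit times, flagging the current snapshot."""
    current = metadata.get("current-snapshot-id")
    rows = []
    for snapshot in metadata.get("snapshots", []):
        committed = datetime.fromtimestamp(
            snapshot["timestamp-ms"] / 1000, tz=timezone.utc
        )
        rows.append({
            "snapshot_id": snapshot["snapshot-id"],
            "committed_at": committed.isoformat(),
            "is_current": snapshot["snapshot-id"] == current,
        })
    return rows


if __name__ == "__main__":
    import sys

    # Usage: python snapshots.py <path-to-metadata.json>
    for row in summarize_snapshots(load_metadata(sys.argv[1])):
        print(row)
```

The snapshot IDs printed here are the same values that the time travel query in Athena expects.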

Cost Breakdown

Estimated monthly costs for 24x7 operation:

Resource	Details	Monthly Cost
Confluent Kafka Cluster	Basic, AWS us-east-1, single-zone	~$80
Confluent Flink Compute Pool	5 CFUs (minimum)	~$450
Confluent Tableflow BYOB	3 topics, Iceberg	~$25
Datagen Connectors	2 source connectors	Included
AWS S3 Storage	~15 GB Iceberg data	~$5
AWS Glue Data Catalog	3 tables, minimal API calls	~$1
Amazon Athena Queries	Pay-per-query (manual testing)	~$5
Total		~$566/month

Pause Flink to Save ~80% of Costs

The Flink compute pool accounts for ~$450 of the ~$566 monthly total. When you are not actively using the demo, pause the Flink pool via the Confluent Cloud Console to reduce costs to ~$116/month.
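The totals in the cost table and the savings figure can be checked with a few lines of arithmetic. The values below are copied from the table, with "Included" treated as 0; they are estimates, not billing data.

```python
# Monthly estimates in USD, copied from the cost table above.
MONTHLY_COST_USD = {
    "Confluent Kafka Cluster": 80,
    "Confluent Flink Compute Pool": 450,
    "Confluent Tableflow BYOB": 25,
    "Datagen Connectors": 0,  # listed as "Included"
    "AWS S3 Storage": 5,
    "AWS Glue Data Catalog": 1,
    "Amazon Athena Queries": 5,
}

total = sum(MONTHLY_COST_USD.values())
flink = MONTHLY_COST_USD["Confluent Flink Compute Pool"]
without_flink = total - flink
flink_share = flink / total

print(f"Total:         ~${total}/month")
print(f"Without Flink: ~${without_flink}/month")
print(f"Flink share:   {flink_share:.1%}")
```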

Troubleshooting

Tableflow Tables Not Appearing in Glue

Symptom: Terraform completes successfully, but Glue tables are missing.

Diagnosis:

# Check Tableflow topic status
curl -u "$CONFLUENT_TABLEFLOW_API_KEY:$CONFLUENT_TABLEFLOW_API_SECRET" \
  "https://api.confluent.cloud/tableflow/v1/topics?environment=$ENV_ID&kafka_cluster=$CLUSTER_ID" | jq .

Fix: Wait 3-5 minutes for Tableflow registration to propagate. Re-run:

bash scripts/wait-for-ready.sh

IAM Role Permissions Errors

Symptom: Tableflow fails with AccessDenied errors when writing to S3 or accessing Glue.

Diagnosis: Verify IAM role policies:

aws iam get-role-policy \
  --role-name lb-glue-{random}-tableflow-role \
  --policy-name tableflow-s3-access

aws iam get-role-policy \
  --role-name lb-glue-{random}-tableflow-role \
  --policy-name tableflow-glue-access

Fix: Ensure policies include:

S3: s3:GetObject, s3:PutObject, s3:DeleteObject, s3:ListBucket
Glue: glue:GetTable, glue:CreateTable, glue:UpdateTable, glue:GetDatabase, glue:CreateDatabase
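A policy document returned by aws iam get-role-policy can be checked against this list programmatically. The sketch below is a deliberately simplified check: it looks only at Allow statements and their Action entries (including wildcards) and ignores Deny statements, NotAction, resources, and conditions. The function names are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Any, Dict, List

REQUIRED_ACTIONS = {
    "s3": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:ListBucket"],
    "glue": ["glue:GetTable", "glue:CreateTable", "glue:UpdateTable",
             "glue:GetDatabase", "glue:CreateDatabase"],
}


def allowed_patterns(policy_document: Dict[str, Any]) -> List[str]:
    """Collect Action patterns from all Allow statements."""
    statements = policy_document.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    patterns: List[str] = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        patterns.extend(actions)
    return patterns


def missing_actions(policy_document: Dict[str, Any], required: List[str]) -> List[str]:
    """Return required actions not matched by any allowed pattern."""
    # IAM action names are case-insensitive, so compare in lowercase.
    patterns = [pattern.lower() for pattern in allowed_patterns(policy_document)]
    return [
        action for action in required
        if not any(fnmatchcase(action.lower(), pattern) for pattern in patterns)
    ]


if __name__ == "__main__":
    example = {"Statement": [{"Effect": "Allow", "Action": ["s3:GetObject", "s3:Put*"]}]}
    print(missing_actions(example, REQUIRED_ACTIONS["s3"]))
```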

Athena Query Errors

Symptom: Athena queries fail with HIVE_METASTORE_ERROR or ICEBERG_INVALID_METADATA.

Diagnosis: Check Glue table properties:

aws glue get-table \
  --database-name lkc_mjnq51 \
  --name lineage_bridge_orders_v2 \
  --region us-east-1 | jq .Table.Parameters

Fix: Verify table_type=ICEBERG and metadata_location points to a valid S3 path. If metadata is corrupted, wait for Tableflow to write a new snapshot (happens every few minutes with active data ingestion).
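The two checks in this fix can be applied to the Parameters object printed by the diagnosis command. The sketch below is a hypothetical helper that inspects only the parameter values; it does not confirm that the metadata file actually exists in S3.

```python
from typing import Dict, List


def check_iceberg_parameters(parameters: Dict[str, str]) -> List[str]:
    """Return a list of problems found in a Glue table's Parameters map."""
    problems: List[str] = []
    table_type = parameters.get("table_type", "")
    if table_type.upper() != "ICEBERG":
        problems.append(f"table_type is {table_type!r}, expected 'ICEBERG'")
    location = parameters.get("metadata_location", "")
    if not location.startswith("s3://"):
        problems.append("metadata_location is missing or not an s3:// path")
    elif not location.endswith(".metadata.json"):
        problems.append("metadata_location does not point to a .metadata.json file")
    return problems


if __name__ == "__main__":
    params = {
        "table_type": "ICEBERG",
        "metadata_location": (
            "s3://lb-glue-a1b2c3d4-tableflow/lineage_bridge_orders_v2/"
            "metadata/00001-abc123.metadata.json"
        ),
    }
    print(check_iceberg_parameters(params) or "parameters look consistent")
```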

Cleanup

Tear down all resources to stop incurring costs:

cd infra/demos/glue
make demo-down

This executes:

terraform destroy -auto-approve (destroys all Terraform-managed resources)

Expected duration: 4-6 minutes.

Orphaned Glue Metadata

Tableflow may create additional Glue databases or tables outside Terraform state. After teardown, verify no leftover resources:

aws glue get-databases --region us-east-1 | jq '.DatabaseList[] | select(.Name | startswith("lkc_"))'

If found, manually delete:

aws glue delete-database --name lkc_mjnq51 --region us-east-1

Next Steps

Query with AWS Glue DataBrew — Use DataBrew for visual data profiling and transformations on Iceberg tables
Integrate with AWS Lake Formation — Add fine-grained access control to Glue tables
Push lineage to AWS Glue — Extend lineage_bridge/catalogs/aws_glue.py to implement push_lineage() using Glue's custom table properties
Productionize — Replace Datagen with real connectors (e.g., S3 Source, DynamoDB CDC) and add RBAC policies