AWS Glue Demo
This demo showcases LineageBridge's AWS Glue Data Catalog integration, demonstrating how Kafka topics are materialized as Iceberg tables and registered in AWS Glue. It's a simpler alternative to the Unity Catalog demo, staying entirely within the AWS ecosystem without requiring a separate Databricks workspace.
Architecture
The demo provisions a streaming data pipeline with Confluent Cloud, AWS S3, and AWS Glue Data Catalog:
graph LR
subgraph "Data Sources"
DG1[Datagen: Orders]
DG2[Datagen: Customers]
end
subgraph "Confluent Cloud (AWS us-east-1)"
T1[orders_v2 topic]
T2[customers_v2 topic]
subgraph "Flink SQL"
F1[enriched_orders JOIN]
F2[order_stats AGG]
end
T3[enriched_orders topic]
T4[order_stats topic]
subgraph "Tableflow → Iceberg"
TF1[orders_v2 table]
TF2[customers_v2 table]
TF3[order_stats table]
end
end
subgraph "AWS Infrastructure"
S3[S3 Bucket: BYOB]
ROLE[IAM Role: Tableflow]
subgraph "Glue Data Catalog"
DB[Database: lkc_*]
GT1[orders_v2 table]
GT2[customers_v2 table]
GT3[order_stats table]
end
end
DG1 -->|produces| T1
DG2 -->|produces| T2
T1 -->|consumes| F1
T2 -->|consumes| F1
F1 -->|produces| T3
T1 -->|consumes| F2
F2 -->|produces| T4
TF1 -.writes Iceberg.-> S3
TF2 -.writes Iceberg.-> S3
TF3 -.writes Iceberg.-> S3
S3 -.accessed via.-> ROLE
TF1 -->|registers| GT1
TF2 -->|registers| GT2
TF3 -->|registers| GT3
GT1 -.queryable via.-> ATHENA[Amazon Athena]
GT2 -.queryable via.-> ATHENA
GT3 -.queryable via.-> ATHENA
style DB fill:#ff9a00
style S3 fill:#ff6b35
style ATHENA fill:#8c4fff
Key Components
- Datagen Sources — Generate realistic orders and customers data with Avro schemas
- Flink SQL Jobs — Stream JOIN (enriched_orders) and windowed aggregation (order_stats)
- Tableflow BYOB — Materializes 3 topics as Iceberg tables in S3
- AWS Glue Data Catalog — Registers Iceberg tables in a dedicated Glue database (auto-created by Tableflow)
- Amazon Athena — Query Iceberg tables using standard SQL (serverless analytics)
What's Different from Unity Catalog Demo
- No Databricks — Stays entirely within AWS ecosystem
- No ksqlDB — Simpler pipeline focusing on Kafka → Flink → Tableflow → Glue
- No PostgreSQL sink — Fewer external dependencies
- Iceberg format — Tables use Apache Iceberg instead of Delta Lake
- Lower cost (~$566/month, see Cost Breakdown): no Databricks workspace or ksqlDB cluster to pay for
Prerequisites
Before provisioning, ensure you have:
CLI Tools
- Terraform >= 1.5
- Confluent CLI, logged in: confluent login --save
- AWS CLI, configured: aws configure or aws sso login
The demo's make setup command can auto-install these via Homebrew if you're on macOS.
AWS Account & Permissions
You'll need an AWS account with sufficient permissions to create:
- IAM roles and policies
- S3 buckets
- Glue Data Catalog databases and tables
- Athena workgroups (optional, for queries)
Recommended: AdministratorAccess or PowerUserAccess policy.
Credentials You'll Need
The setup script will prompt for these if not auto-detected:
- Confluent Cloud API Key + Secret: Cloud-scoped credentials (auto-created via CLI if missing)
- AWS Account ID: Auto-detected from aws sts get-caller-identity
- AWS Region: Defaults to us-east-1
Optional: AWS DataZone
If your account has an AWS DataZone domain in the demo's region, setup-tfvars.sh auto-detects the domain and a project inside it: a single match is picked automatically, multiple matches trigger a prompt, and zero matches are skipped silently. The selected IDs are written to terraform.tfvars and passed through to the generated .env, which makes the Push to DataZone button appear in the Streamlit UI alongside Push to Glue. Without a DataZone domain, the rest of the demo still works; DataZone is strictly opt-in. See AWS DataZone Integration for what gets registered.
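The selection rule described above (zero matches skip, one match auto-picks, several matches prompt) can be sketched as a small function. This is only an illustration of the rule, not the code in setup-tfvars.sh; the function name `pick_one` and the sample IDs are hypothetical.

```python
from typing import Callable, Optional, Sequence


def pick_one(
    candidates: Sequence[str],
    prompt: Callable[[Sequence[str]], str],
) -> Optional[str]:
    """Apply the zero / one / many rule to a list of detected IDs."""
    if not candidates:
        # Zero matches: skip silently, the integration stays disabled.
        return None
    if len(candidates) == 1:
        # Single match: pick it without asking.
        return candidates[0]
    # Multiple matches: defer to the caller-supplied prompt.
    return prompt(candidates)


# Hypothetical IDs, for illustration only.
domain = pick_one(["dzd_example1"], prompt=lambda options: options[0])
print(domain)
```

Passing the prompt as a callable keeps the rule testable without an interactive terminal.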
Provisioning
Step 1: Credential Setup
Run the interactive setup wizard with make setup from the infra/demos/glue directory.
The script will:
- Check for required CLI tools (install via Homebrew if missing)
- Detect Confluent Cloud credentials from .env or environment variables, or create them via confluent api-key create --resource cloud
- Detect the AWS account ID from aws sts get-caller-identity and the AWS region (default us-east-1)
- Generate terraform.tfvars with all detected values
Example output:
══════════════════════════════════════════════════════════════════
LineageBridge Glue Demo — Credential Setup
══════════════════════════════════════════════════════════════════
All required CLIs found: confluent, aws
▸ Confluent Cloud credentials
Using existing Cloud API key: abc-12345 (from .env)
▸ AWS credentials
Account ID: 123456789012 (auto-detected)
Region: us-east-1
✓ terraform.tfvars written successfully
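The credential lookup order that the wizard follows (.env first, then environment variables, then creating a new key) could look roughly like the sketch below. The variable name CONFLUENT_CLOUD_API_KEY and both function names are assumptions for illustration; the real script may use different names and a different .env parser.

```python
import os
from typing import Callable, Dict, Mapping, Tuple


def parse_dotenv(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, ignoring blanks and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_credential(
    name: str,
    dotenv: Mapping[str, str],
    environ: Mapping[str, str],
    create: Callable[[], str],
) -> Tuple[str, str]:
    """Return (value, source) using the .env -> environment -> create order."""
    if dotenv.get(name):
        return dotenv[name], ".env"
    if environ.get(name):
        return environ[name], "environment"
    # Last resort: the caller creates a new key (for example via the Confluent CLI).
    return create(), "created"


# Hypothetical variable name and value, for illustration only.
dotenv_values = parse_dotenv('CONFLUENT_CLOUD_API_KEY="abc-12345"\n')
value, source = resolve_credential(
    "CONFLUENT_CLOUD_API_KEY", dotenv_values, os.environ, create=lambda: "new-key"
)
print(f"Using Cloud API key: {value} (from {source})")
```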
Step 2: Provision Infrastructure
Deploy all resources via Terraform with make demo-up.
The demo-up target automatically runs make setup if terraform.tfvars is missing, then executes scripts/provision-demo.sh, which:
- Runs terraform init and terraform apply
- Creates a Tableflow API key via Confluent CLI (required for BYOB)
- Re-runs terraform apply with Tableflow credentials to complete integration
- Executes health checks waiting for Tableflow tables to appear in Glue
Provisioning takes 10-12 minutes. Terraform will create approximately 30 resources:
- Confluent Cloud: 1 environment, 1 Kafka cluster, 1 service account, 4 API keys, 2 topics, 2 datagen connectors, 1 Flink compute pool, 2 Flink statements, 3 Tableflow topics, 1 provider integration, 1 catalog integration
- AWS: 1 S3 bucket, 1 IAM role (with S3 + Glue policies), 1 bucket policy
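The health check that waits for Tableflow tables to appear in Glue is essentially a poll-until-present loop. The sketch below shows one way to write such a loop; it is not the logic in scripts/provision-demo.sh. The `list_tables` callable is a placeholder for whatever actually lists the Glue tables, and the timeout and interval values are arbitrary assumptions.

```python
import time
from typing import Callable, Iterable, Set


def wait_for_tables(
    list_tables: Callable[[], Iterable[str]],
    expected: Set[str],
    timeout_s: float = 600.0,
    interval_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Set[str]:
    """Poll until all expected tables are listed; return the tables still missing."""
    deadline = clock() + timeout_s
    while True:
        missing = set(expected) - set(list_tables())
        if not missing or clock() >= deadline:
            return missing
        sleep(interval_s)


# Simulated listing that gains one table per poll (stand-in for a real Glue call).
responses = iter([[], ["lineage_bridge_orders_v2"],
                  ["lineage_bridge_orders_v2", "lineage_bridge_customers_v2"]])
still_missing = wait_for_tables(
    lambda: next(responses),
    {"lineage_bridge_orders_v2", "lineage_bridge_customers_v2"},
    sleep=lambda seconds: None,  # skip real waiting in this demonstration
)
print("missing:", sorted(still_missing))
```

Returning the missing set rather than a boolean lets the caller report exactly which tables never showed up.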
Step 3: Verify Provisioning
Once Terraform completes, verify the environment:
Confluent Cloud:
- Navigate to Confluent Cloud Environments
- Open the environment named lb-glue-{random} (example: lb-glue-a1b2c3d4)
- Verify Kafka cluster is RUNNING
- Check Topics: lineage_bridge.orders_v2, lineage_bridge.customers_v2, lineage_bridge.enriched_orders, lineage_bridge.order_stats
- Inspect Connectors: lb-glue-*-orders-datagen, lb-glue-*-customers-datagen (both RUNNING)
- Open Flink SQL workspace: statements lb-glue-*-enrich-orders and lb-glue-*-order-stats should be RUNNING
S3 Bucket:
- Open AWS Console → S3
- Find bucket lb-glue-{random}-tableflow
- Browse to see Iceberg table directories: lineage_bridge_orders_v2/, lineage_bridge_customers_v2/, lineage_bridge_order_stats/
- Within each directory, you'll see Iceberg metadata and data files
Glue Data Catalog:
- Navigate to AWS Console → Glue → Data Catalog → Databases
- Find database lkc_{cluster_id} (example: lkc_mjnq51)
- Click on the database → Tables
- Verify 3 tables: lineage_bridge_orders_v2, lineage_bridge_customers_v2, lineage_bridge_order_stats
- Click on lineage_bridge_orders_v2 → Schema tab to see Iceberg column definitions
IAM Role:
- Navigate to IAM → Roles → lb-glue-{random}-tableflow-role
- Verify attached policies: tableflow-s3-access, tableflow-glue-access
- Check Trust relationships: the role should allow Confluent's Tableflow service principal to assume it
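The names used in this verification step follow a visible pattern: the Glue database mirrors the cluster ID with the hyphen replaced by an underscore, and each table name mirrors its topic with dots replaced by underscores. The helper below encodes that pattern as inferred from the examples on this page; it is an assumption, not a documented Tableflow naming rule, and the function names are hypothetical.

```python
from typing import Iterable, List


def glue_database_for(cluster_id: str) -> str:
    """lkc-mjnq51 -> lkc_mjnq51 (pattern inferred from this page's examples)."""
    return cluster_id.replace("-", "_")


def glue_table_for(topic: str) -> str:
    """lineage_bridge.orders_v2 -> lineage_bridge_orders_v2 (inferred pattern)."""
    return topic.replace(".", "_")


def expected_glue_tables(cluster_id: str, topics: Iterable[str]) -> List[str]:
    """Build the database.table names to look for in the Glue console."""
    database = glue_database_for(cluster_id)
    return [f"{database}.{glue_table_for(topic)}" for topic in topics]


for name in expected_glue_tables(
    "lkc-mjnq51",
    ["lineage_bridge.orders_v2", "lineage_bridge.customers_v2", "lineage_bridge.order_stats"],
):
    print(name)
```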
Step 4: Run LineageBridge Extraction
Extract lineage metadata from the live environment:
The extractor will:
- Auto-configure from the Terraform outputs (stored in .env by terraform output -raw demo_env_file)
- Execute the 5-phase extraction pipeline:
- Phase 1: Kafka topics and consumer groups
- Phase 2: Connectors, Flink (parallel)
- Phase 3: Schema Registry and Stream Catalog enrichment (parallel)
- Phase 4: Tableflow tables and AWS Glue integration
- Phase 4b: AWS Glue catalog enrichment (fetch table metadata from Glue Data Catalog API)
- Phase 5: Metrics (throughput for topics)
Expected output:
▸ Phase 1: Kafka Admin (lkc-mjnq51)
✓ 4 topics, 2 consumer groups
▸ Phase 2: Transformations (parallel)
✓ 2 connectors (2 source)
✓ 2 Flink statements
▸ Phase 3: Enrichment (parallel)
✓ 4 schemas from Schema Registry
✓ Stream Catalog: 0 tags, 0 business metadata
▸ Phase 4: Tableflow
✓ 3 Tableflow topics (ICEBERG)
✓ AWS Glue integration: lb-glue-a1b2c3d4-glue
▸ Phase 4b: Catalog Enrichment
✓ AWS Glue: 3 tables in database lkc_mjnq51
▸ Phase 5: Metrics
✓ Throughput: 4 topics
Graph Summary:
Nodes: 20 (4 topics, 2 connectors, 2 Flink jobs, 3 Tableflow tables, 3 Glue tables, 4 schemas, 2 consumer groups)
Edges: 28 (8 PRODUCES, 6 CONSUMES, 4 TRANSFORMS, 3 MATERIALIZES, 4 HAS_SCHEMA, 3 MEMBER_OF)
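The phase ordering in Step 4 (phases run one after another, while the extractors inside Phase 2 and Phase 3 run in parallel) maps naturally onto asyncio. The sketch below illustrates that scheduling shape only; the extractor functions are stand-ins that return a label, and none of these names come from the LineageBridge code base.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple

Extractor = Callable[[], Awaitable[str]]


def fake_extractor(label: str) -> Extractor:
    """Build a stand-in extractor that just yields control and returns its label."""
    async def extract() -> str:
        await asyncio.sleep(0)
        return label
    return extract


# Phase layout taken from the list in Step 4; the extractors are placeholders.
PHASES: List[Tuple[str, List[Extractor]]] = [
    ("Phase 1", [fake_extractor("kafka admin")]),
    ("Phase 2", [fake_extractor("connectors"), fake_extractor("flink")]),
    ("Phase 3", [fake_extractor("schema registry"), fake_extractor("stream catalog")]),
    ("Phase 4", [fake_extractor("tableflow")]),
    ("Phase 4b", [fake_extractor("glue enrichment")]),
    ("Phase 5", [fake_extractor("metrics")]),
]


async def run_pipeline(phases: List[Tuple[str, List[Extractor]]]) -> List[str]:
    log: List[str] = []
    for name, extractors in phases:
        # Phases run in order; extractors within a phase run concurrently.
        results = await asyncio.gather(*(extract() for extract in extractors))
        log.append(f"{name}: {', '.join(results)}")
    return log


for line in asyncio.run(run_pipeline(PHASES)):
    print(line)
```

Because asyncio.gather preserves argument order, the per-phase results stay deterministic even though the extractors overlap in time.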
Step 5: Launch the UI
Open the interactive lineage graph:
Your browser will open to http://localhost:8501. The UI displays:
- Hierarchical graph layout — Data flows from Datagen sources through Kafka topics, Flink transformations, Tableflow, and into Glue tables
- Interactive nodes — Click any node to see metadata panel (schema, owner, throughput, Glue properties)
- Deep links — Nodes link directly to Confluent Cloud Console, AWS Glue Console, S3 bucket
Expected Lineage Graph
You should see the following node types connected by lineage edges:
Kafka Topics (4 nodes)
- lineage_bridge.orders_v2: Source topic from datagen
- lineage_bridge.customers_v2: Source topic from datagen
- lineage_bridge.enriched_orders: Derived topic from Flink JOIN
- lineage_bridge.order_stats: Derived topic from Flink windowed aggregation
Connectors (2 nodes)
- lb-glue-*-orders-datagen: Datagen source (PRODUCES → orders_v2)
- lb-glue-*-customers-datagen: Datagen source (PRODUCES → customers_v2)
Flink Jobs (2 nodes)
- lb-glue-*-enrich-orders: Stream JOIN (CONSUMES ← orders_v2, customers_v2 | PRODUCES → enriched_orders)
- lb-glue-*-order-stats: Windowed aggregation (CONSUMES ← orders_v2 | PRODUCES → order_stats)
Tableflow Tables (3 nodes)
- lineage_bridge.orders_v2 (ICEBERG): Tableflow materialization (MATERIALIZES → Glue table)
- lineage_bridge.customers_v2 (ICEBERG): Tableflow materialization (MATERIALIZES → Glue table)
- lineage_bridge.order_stats (ICEBERG): Tableflow materialization (MATERIALIZES → Glue table)
AWS Glue Tables (3 nodes)
- lkc_*.lineage_bridge_orders_v2: Iceberg table registered via Tableflow
- lkc_*.lineage_bridge_customers_v2: Iceberg table registered via Tableflow
- lkc_*.lineage_bridge_order_stats: Iceberg table registered via Tableflow
Schemas (4 nodes)
- lineage_bridge.orders_v2-value: Avro schema for orders
- lineage_bridge.customers_v2-value: Avro schema for customers
- lineage_bridge.enriched_orders-value: Avro schema for enriched orders
- lineage_bridge.order_stats-value: Avro schema for order stats (includes window_start, window_end)
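To make the expected graph concrete, the sketch below stores part of it as (upstream, downstream) pairs in data-flow order and walks backwards from a Glue table to find everything it depends on. Node names are copied from the lists above; edge labels are omitted, and the link from each topic to its Tableflow table is an assumption made for this example. This is not how LineageBridge stores its graph internally.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

# (upstream, downstream) pairs in data-flow order.
# The topic -> Tableflow table link is assumed for this sketch.
EDGES: List[Tuple[str, str]] = [
    ("lb-glue-*-orders-datagen", "lineage_bridge.orders_v2"),
    ("lb-glue-*-customers-datagen", "lineage_bridge.customers_v2"),
    ("lineage_bridge.orders_v2", "lb-glue-*-enrich-orders"),
    ("lineage_bridge.customers_v2", "lb-glue-*-enrich-orders"),
    ("lb-glue-*-enrich-orders", "lineage_bridge.enriched_orders"),
    ("lineage_bridge.orders_v2", "lb-glue-*-order-stats"),
    ("lb-glue-*-order-stats", "lineage_bridge.order_stats"),
    ("lineage_bridge.order_stats", "lineage_bridge.order_stats (ICEBERG)"),
    ("lineage_bridge.order_stats (ICEBERG)", "lkc_*.lineage_bridge_order_stats"),
]


def upstream_of(node: str, edges: List[Tuple[str, str]] = EDGES) -> Set[str]:
    """Breadth-first walk against the data flow, collecting every ancestor."""
    parents: Dict[str, List[str]] = {}
    for source, target in edges:
        parents.setdefault(target, []).append(source)
    seen: Set[str] = set()
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


for ancestor in sorted(upstream_of("lkc_*.lineage_bridge_order_stats")):
    print(ancestor)
```

For the order_stats Glue table the walk reaches the orders datagen connector but not the customers one, which matches the Flink job definitions listed above.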
Querying with Amazon Athena
AWS Glue tables are queryable via Amazon Athena, AWS's serverless SQL query engine.
Setup Athena Workgroup
- Open AWS Console → Athena
- If this is your first time using Athena, create a query result location:
- Navigate to Settings tab
- Set query result location: s3://lb-glue-{random}-tableflow/athena-results/
- Return to Query editor
Example Queries
Query the Glue-registered Iceberg tables:
-- Count rows in orders table
SELECT COUNT(*) AS total_orders
FROM lkc_mjnq51.lineage_bridge_orders_v2;
-- Count rows in customers table
SELECT COUNT(*) AS total_customers
FROM lkc_mjnq51.lineage_bridge_customers_v2;
-- View sample orders
SELECT *
FROM lkc_mjnq51.lineage_bridge_orders_v2
LIMIT 10;
-- Join orders and customers (replicates Flink enrichment)
SELECT
o.order_id,
c.name AS customer_name,
c.country AS customer_country,
o.product_name,
o.price,
o.order_status,
o.created_at
FROM lkc_mjnq51.lineage_bridge_orders_v2 o
JOIN lkc_mjnq51.lineage_bridge_customers_v2 c
ON o.customer_id = c.customer_id
WHERE o.price > 100
ORDER BY o.price DESC
LIMIT 20;
-- Aggregate order stats
SELECT
order_status,
SUM(order_count) AS total_orders,
SUM(total_quantity) AS total_quantity
FROM lkc_mjnq51.lineage_bridge_order_stats
GROUP BY order_status;
Iceberg Time Travel (Advanced)
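Iceberg tables support time travel queries. List available snapshots by querying the table's $snapshots metadata table (in Athena: SELECT * FROM "lkc_mjnq51"."lineage_bridge_orders_v2$snapshots"), then query data as of a specific snapshot: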
-- Query historical data (replace snapshot_id with actual value from $snapshots)
SELECT COUNT(*) AS rows_in_snapshot
FROM lkc_mjnq51.lineage_bridge_orders_v2
FOR VERSION AS OF 1234567890123456789;
AWS Glue Integration Details
The demo showcases how LineageBridge enriches Glue catalog nodes with AWS-specific metadata.
Glue Table Metadata
Click on a Glue table node (e.g., lineage_bridge_orders_v2) in the graph to see the metadata panel:
{
"node_id": "AWS:GLUE_TABLE:env-26wn6m:lkc_mjnq51.lineage_bridge_orders_v2",
"node_type": "GLUE_TABLE",
"qualified_name": "lkc_mjnq51.lineage_bridge_orders_v2",
"display_name": "lineage_bridge_orders_v2",
"database": "lkc_mjnq51",
"table_type": "EXTERNAL_TABLE",
"storage_format": "ICEBERG",
"storage_location": "s3://lb-glue-a1b2c3d4-tableflow/lineage_bridge_orders_v2/",
"owner": "arn:aws:iam::123456789012:role/lb-glue-a1b2c3d4-tableflow-role",
"created_at": "2026-04-30T12:34:56.000Z",
"updated_at": "2026-04-30T12:40:12.000Z",
"columns": [
{"name": "order_id", "type": "string"},
{"name": "customer_id", "type": "string"},
{"name": "product_name", "type": "string"},
{"name": "quantity", "type": "bigint"},
{"name": "price", "type": "double"},
{"name": "order_status", "type": "string"},
{"name": "created_at", "type": "string"}
],
"parameters": {
"table_type": "ICEBERG",
"metadata_location": "s3://lb-glue-a1b2c3d4-tableflow/lineage_bridge_orders_v2/metadata/00001-abc123.metadata.json"
},
"url": "https://console.aws.amazon.com/glue/home?region=us-east-1#/v2/data-catalog/tables/view/lineage_bridge_orders_v2?database=lkc_mjnq51"
}
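The node_id in the panel appears to be four colon-separated parts (system, node type, environment ID, qualified name), with the qualified name split into database and table at the first dot. The parser below is a sketch based on that single example; the real ID format may have cases it does not cover, and the class and function names are hypothetical.

```python
from typing import NamedTuple


class GlueNodeId(NamedTuple):
    system: str
    node_type: str
    environment_id: str
    database: str
    table: str


def parse_glue_node_id(node_id: str) -> GlueNodeId:
    """Split an ID shaped like SYSTEM:TYPE:ENV:database.table (assumed format)."""
    system, node_type, environment_id, qualified_name = node_id.split(":", 3)
    database, table = qualified_name.split(".", 1)
    return GlueNodeId(system, node_type, environment_id, database, table)


parsed = parse_glue_node_id(
    "AWS:GLUE_TABLE:env-26wn6m:lkc_mjnq51.lineage_bridge_orders_v2"
)
print(parsed.database, parsed.table)
```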
Verifying Tableflow Registration
Check that Tableflow successfully registered tables in Glue:
# List Glue databases
aws glue get-databases --region us-east-1
# List tables in the cluster database
aws glue get-tables --database-name lkc_mjnq51 --region us-east-1
# Get specific table metadata
aws glue get-table \
--database-name lkc_mjnq51 \
--name lineage_bridge_orders_v2 \
--region us-east-1
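The output of aws glue get-tables can also be checked programmatically. The sketch below reads the JSON that the CLI prints (a TableList array whose entries carry a Name field) and reports which expected tables are absent. It assumes the full listing fits in one response and does not follow NextToken pagination; the function name is hypothetical.

```python
import json
from typing import Iterable, List


def missing_tables(get_tables_output: str, expected: Iterable[str]) -> List[str]:
    """Return expected table names that do not appear in a get-tables response."""
    response = json.loads(get_tables_output)
    listed = {table["Name"] for table in response.get("TableList", [])}
    return sorted(set(expected) - listed)


# Trimmed sample shaped like `aws glue get-tables` output, for illustration.
sample_output = json.dumps({
    "TableList": [
        {"Name": "lineage_bridge_orders_v2"},
        {"Name": "lineage_bridge_customers_v2"},
    ]
})
print(missing_tables(
    sample_output,
    ["lineage_bridge_orders_v2", "lineage_bridge_customers_v2", "lineage_bridge_order_stats"],
))
```

In practice the first argument would be the captured CLI output, for example the contents of a file written by redirecting the get-tables command.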
Validation Queries
Run these queries to validate data flow through the pipeline.
Check Kafka Topic Data
Verify Flink Transformations
-- Via Confluent Cloud Console → Flink SQL Workspace
SELECT * FROM lineage_bridge.enriched_orders LIMIT 10;
Query Glue Tables via Athena
Inspect S3 Iceberg Files
# List Iceberg metadata files
aws s3 ls s3://lb-glue-{random}-tableflow/lineage_bridge_orders_v2/metadata/ --recursive
# Download Iceberg metadata (aws s3 cp does not expand wildcards, so filter with --exclude/--include)
aws s3 cp s3://lb-glue-{random}-tableflow/lineage_bridge_orders_v2/metadata/ . \
--recursive --exclude "*" --include "00001-*.metadata.json"
jq . 00001-*.metadata.json
Cost Breakdown
Estimated monthly costs for 24x7 operation:
| Resource | Details | Monthly Cost |
|---|---|---|
| Confluent Kafka Cluster | Basic, AWS us-east-1, single-zone | ~$80 |
| Confluent Flink Compute Pool | 5 CFUs (minimum) | ~$450 |
| Confluent Tableflow BYOB | 3 topics, Iceberg | ~$25 |
| Datagen Connectors | 2 source connectors | Included |
| AWS S3 Storage | ~15 GB Iceberg data | ~$5 |
| AWS Glue Data Catalog | 3 tables, minimal API calls | ~$1 |
| Amazon Athena Queries | Pay-per-query (manual testing) | ~$5 |
| Total | | ~$566/month |
Pause Flink to Save ~80% Costs
The Flink compute pool accounts for ~$450 of the ~$566 monthly total. When not actively using the demo, pause the Flink pool via Confluent Cloud Console to reduce costs to ~$116/month.
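The arithmetic behind the total and the roughly 80% saving can be reproduced from the table's line items. The figures below are copied from the table (with the "Included" connectors counted as zero); they are estimates, not billing data.

```python
# Monthly estimates in USD, copied from the cost table above.
MONTHLY_COST_USD = {
    "Confluent Kafka Cluster": 80,
    "Confluent Flink Compute Pool": 450,
    "Confluent Tableflow BYOB": 25,
    "Datagen Connectors": 0,  # listed as "Included"
    "AWS S3 Storage": 5,
    "AWS Glue Data Catalog": 1,
    "Amazon Athena Queries": 5,
}

total = sum(MONTHLY_COST_USD.values())
flink = MONTHLY_COST_USD["Confluent Flink Compute Pool"]
total_with_flink_paused = total - flink
saving_fraction = flink / total

print(f"Total: ~${total}/month")
print(f"With Flink paused: ~${total_with_flink_paused}/month")
print(f"Saving: ~{saving_fraction:.0%}")
```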
Troubleshooting
Tableflow Tables Not Appearing in Glue
Symptom: Terraform completes successfully, but Glue tables are missing.
Diagnosis:
# Check Tableflow topic status
curl -u "$CONFLUENT_TABLEFLOW_API_KEY:$CONFLUENT_TABLEFLOW_API_SECRET" \
"https://api.confluent.cloud/tableflow/v1/topics?environment=$ENV_ID&kafka_cluster=$CLUSTER_ID" | jq .
Fix: Wait 3-5 minutes for Tableflow registration to propagate. Re-run:
IAM Role Permissions Errors
Symptom: Tableflow fails with AccessDenied errors when writing to S3 or accessing Glue.
Diagnosis: Verify IAM role policies:
aws iam get-role-policy \
--role-name lb-glue-{random}-tableflow-role \
--policy-name tableflow-s3-access
aws iam get-role-policy \
--role-name lb-glue-{random}-tableflow-role \
--policy-name tableflow-glue-access
Fix: Ensure policies include:
- S3: s3:GetObject, s3:PutObject, s3:DeleteObject, s3:ListBucket
- Glue: glue:GetTable, glue:CreateTable, glue:UpdateTable, glue:GetDatabase, glue:CreateDatabase
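Comparing a policy against the required actions by eye is error-prone, so a small check over the PolicyDocument returned by aws iam get-role-policy can help. The sketch below only looks at Allow statements and treats * and ? in actions as wildcards; it ignores Deny statements, resources, and conditions, so treat it as a first-pass check rather than a full policy evaluation. The function names are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Any, Dict, Iterable, List, Set


def _as_list(value: Any) -> List[Any]:
    """IAM allows a single item or a list for Statement and Action."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def allowed_actions(policy_document: Dict[str, Any]) -> List[str]:
    """Collect action patterns from every Allow statement."""
    actions: List[str] = []
    for statement in _as_list(policy_document.get("Statement")):
        if statement.get("Effect") == "Allow":
            actions.extend(_as_list(statement.get("Action")))
    return actions


def missing_actions(policy_document: Dict[str, Any], required: Iterable[str]) -> Set[str]:
    """Return required actions not covered by any allowed pattern (case-insensitive)."""
    patterns = [action.lower() for action in allowed_actions(policy_document)]
    return {
        action
        for action in required
        if not any(fnmatchcase(action.lower(), pattern) for pattern in patterns)
    }


# Sample document shaped like the PolicyDocument field, for illustration.
sample_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["s3:GetObject", "s3:PutObject", "s3:List*"], "Resource": "*"}
    ],
}
required_s3 = {"s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:ListBucket"}
print(sorted(missing_actions(sample_policy, required_s3)))
```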
Athena Query Errors
Symptom: Athena queries fail with HIVE_METASTORE_ERROR or ICEBERG_INVALID_METADATA.
Diagnosis: Check Glue table properties:
aws glue get-table \
--database-name lkc_mjnq51 \
--name lineage_bridge_orders_v2 \
--region us-east-1 | jq .Table.Parameters
Fix: Verify table_type=ICEBERG and metadata_location points to a valid S3 path. If metadata is corrupted, wait for Tableflow to write a new snapshot (happens every few minutes with active data ingestion).
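The two checks in the fix above can be scripted against the Parameters object printed by the diagnosis command. This sketch validates only the shape of the values (table type and an s3:// URI ending in .metadata.json); it does not confirm that the metadata object actually exists in S3. The function name is hypothetical.

```python
from typing import Dict, List
from urllib.parse import urlparse


def check_iceberg_parameters(parameters: Dict[str, str]) -> List[str]:
    """Return a list of problems found in a Glue table's Parameters map."""
    problems: List[str] = []
    if parameters.get("table_type", "").upper() != "ICEBERG":
        problems.append("table_type is not ICEBERG")
    parsed = urlparse(parameters.get("metadata_location", ""))
    if parsed.scheme != "s3" or not parsed.netloc:
        problems.append("metadata_location is not an s3:// URI")
    elif not parsed.path.endswith(".metadata.json"):
        problems.append("metadata_location does not point at a .metadata.json file")
    return problems


# Values copied from the metadata panel example earlier on this page.
sample_parameters = {
    "table_type": "ICEBERG",
    "metadata_location": "s3://lb-glue-a1b2c3d4-tableflow/lineage_bridge_orders_v2/metadata/00001-abc123.metadata.json",
}
print(check_iceberg_parameters(sample_parameters) or "parameters look valid")
```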
Cleanup
Tear down all resources to stop incurring costs:
This executes:
- terraform destroy -auto-approve (destroys all Terraform-managed resources)
Expected duration: 4-6 minutes.
Orphaned Glue Metadata
Tableflow may create additional Glue databases or tables outside Terraform state. After teardown, verify no leftover resources:
aws glue get-databases --region us-east-1 | jq '.DatabaseList[] | select(.Name | startswith("lkc_"))'
If found, delete them manually, for example with aws glue delete-database --name <database_name> --region us-east-1.
Next Steps
- Query with AWS Glue DataBrew: Use DataBrew for visual data profiling and transformations on Iceberg tables
- Integrate with AWS Lake Formation: Add fine-grained access control to Glue tables
- Push lineage to AWS Glue: Extend lineage_bridge/catalogs/aws_glue.py to implement push_lineage() using Glue's custom table properties
- Productionize: Replace Datagen with real connectors (e.g., S3 Source, DynamoDB CDC) and add RBAC policies