In this workshop, we’ll share best practices for developing and deploying ETL pipelines across a range of latencies, from continuous ingestion to hourly incremental loads. You will join the instructor in running hands-on examples that build a medallion architecture in two ways: with Spark Structured Streaming and with Lakeflow Declarative Pipelines. You will be introduced to Spark syntax for incremental data processing, how checkpoints work, how to monitor streaming pipelines, common table configurations, and scheduling with Lakeflow Jobs. You will also learn about the abstractions Lakeflow provides to simplify and unify batch and streaming ETL on the platform.
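To preview the Structured Streaming side, here is a minimal sketch of the incremental-ingestion pattern the workshop covers: Auto Loader picks up only new files, and the checkpoint location records progress so a restarted stream resumes where it left off. The paths and table name are hypothetical placeholders, not the workshop’s actual code.

```python
# Minimal sketch: incremental bronze ingestion with Structured Streaming.
# Paths and table names below are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

bronze_stream = (
    spark.readStream
    .format("cloudFiles")                 # Auto Loader: incremental file ingestion
    .option("cloudFiles.format", "json")
    .load("/Volumes/demo/raw/events")     # hypothetical source path
)

(
    bronze_stream.writeStream
    # The checkpoint records stream progress, enabling exactly-once recovery
    .option("checkpointLocation", "/Volumes/demo/checkpoints/bronze_events")
    .trigger(availableNow=True)           # process all available data, then stop
    .toTable("demo.bronze.events")        # hypothetical target table
)
```

Swapping the trigger (for example, `processingTime="1 minute"` instead of `availableNow=True`) is how the same code moves between hourly incremental loads and near-continuous processing.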
During the hands-on portion, you’ll ingest streaming data, refine it, and serve it for downstream machine learning and business intelligence use cases. You will also apply the techniques described above to ensure production-grade performance, visibility, and fault tolerance. The workshop code can serve as a template you can tailor to your own use cases in the future!
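For contrast, the same bronze-to-silver refinement step can be expressed declaratively. The sketch below uses the `dlt` Python module behind Lakeflow Declarative Pipelines, which manages checkpoints, dependencies, and retries for you; the table and column names are again hypothetical.

```python
# Minimal sketch: a declarative silver table with a data-quality expectation.
# Table and column names are hypothetical placeholders.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Cleaned events for downstream ML and BI")
@dlt.expect_or_drop("valid_event_id", "event_id IS NOT NULL")  # drop malformed rows
def silver_events():
    return (
        dlt.read_stream("bronze_events")  # incremental read of the bronze table
        .withColumn("ingested_at", F.current_timestamp())
    )
```

Note the difference in responsibility: in the Structured Streaming version you manage the checkpoint and trigger yourself, while here the pipeline runtime handles both, which is the unification of batch and streaming ETL the workshop explores.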
Agenda (PT)
- 11:00 AM: Introduction
- 11:05 AM: ETL on Databricks Overview
  - Lakeflow: Connect & Declarative Pipelines
  - Structured Streaming
- 11:25 AM: Hands-on Workshop
- 12:15 PM: Q&A
Duration: 1.5 hrs