Data Lake
- Technology: Apache Iceberg on S3/ADLS, or Delta Lake
- Storage Format: Parquet with schema evolution support
- Organization:
  - Raw zone: Unprocessed data from source systems
  - Curated zone: Cleaned and validated data
  - Analytics zone: Aggregated data for ML training
- Governance: Data lineage tracking and quality monitoring
Data Lake Design and Implementation
This section provides a detailed design for your data lake, a crucial component for storing raw, semi-structured, and processed data at scale. The design leverages Apache Iceberg and Parquet, providing a flexible yet highly organized data foundation.
1. Technology: Apache Iceberg on Object Storage
- Apache Iceberg: This is the core technology for the data lake's table format. It sits on top of object storage like S3 (Amazon Web Services) or ADLS (Azure Data Lake Storage), or an on-premises equivalent like MinIO. Iceberg manages metadata in a way that allows for fast query planning, atomic writes, and schema evolution, a key requirement.
- Parquet: All data within the data lake will be stored in the Apache Parquet format. Parquet is a columnar storage format, which is highly efficient for analytical queries as it only reads the required columns, not the entire row. It also provides excellent compression, saving significant storage costs.
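To make the pairing concrete, here is a minimal PySpark sketch. It assumes the Iceberg Spark runtime is on the classpath, and the catalog name `lake`, the bucket `s3a://data-lake`, and the `step_id`, `measured_at`, and `value` columns are hypothetical placeholders; it creates a Parquet-backed Iceberg table on object storage.

```python
from pyspark.sql import SparkSession

# Sketch only: assumes the Iceberg Spark runtime JAR is available; the catalog
# name "lake" and the s3a://data-lake bucket are placeholders.
spark = (
    SparkSession.builder
    .appName("iceberg-setup-sketch")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://data-lake/warehouse")
    .getOrCreate()
)

# Iceberg data files default to Parquet; the table property makes that explicit.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.curated.wafer_measurements (
        lot_id      STRING,
        wafer_id    STRING,
        step_id     STRING,
        measured_at TIMESTAMP,
        value       DOUBLE
    )
    USING iceberg
    PARTITIONED BY (days(measured_at))
    TBLPROPERTIES ('write.format.default' = 'parquet')
""")
```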
2. Zonal Organization
The data lake will be logically divided into three distinct zones to ensure data integrity, quality, and easy access.
A. Raw Zone (Bronze Layer)
- Purpose: This zone is the landing area for all unprocessed data from the source systems. Data is stored in its original format.
- Implementation: The ETL pipeline service (using Airflow) will land data directly into this zone. The folder structure will be organized by source system and ingestion date (a minimal DAG sketch follows this list):
/raw/mes/2025/08/26/
/raw/fdc/2025/08/26/
- Features:
- No transformations: Data is written as-is to ensure data lineage and traceability back to the original source.
- Schema on Read: The schema is not enforced at the time of writing; it is inferred when the data is read by a processing engine like Spark.
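As one way such a landing job could be orchestrated, here is a minimal Airflow 2.x DAG sketch. The staging directory, landing root, and copy logic are hypothetical stand-ins; a real deployment would write to S3/ADLS through the object-store client rather than the local filesystem, but the as-is, no-transformation principle is the same.

```python
from datetime import datetime
from pathlib import Path
import shutil

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical locations: a real deployment would read from the source extract
# area and write to s3://... or abfss://... via the object-store client.
STAGING_ROOT = "/staging"
RAW_ROOT = "/raw"


def land_source_files(source_system: str, ds_nodash: str, **_):
    """Copy extracted files as-is into /raw/<source>/<yyyy>/<mm>/<dd>/ (no transforms)."""
    target = Path(RAW_ROOT, source_system, ds_nodash[:4], ds_nodash[4:6], ds_nodash[6:8])
    target.mkdir(parents=True, exist_ok=True)
    for extract in Path(STAGING_ROOT, source_system).glob("*"):
        shutil.copy2(extract, target / extract.name)


with DAG(
    dag_id="raw_zone_ingestion",
    start_date=datetime(2025, 8, 1),
    schedule="@daily",
    catchup=False,
):
    # One landing task per source system (MES, FDC).
    for source in ("mes", "fdc"):
        PythonOperator(
            task_id=f"land_{source}",
            python_callable=land_source_files,
            op_kwargs={"source_system": source},
        )
```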
B. Curated Zone (Silver Layer)
- Purpose: This zone contains cleaned, validated, and de-duplicated data from the raw zone. This is where the core data assets for the organization reside.
- Implementation: Apache Spark jobs orchestrated by Airflow will read data from the Raw Zone, apply cleansing rules, and write it to the Curated Zone. The data is converted to Parquet format and tables are managed by Apache Iceberg.
- Features:
- Schema Enforcement: Iceberg's table format ensures that the data conforms to a predefined schema.
- Data Quality: Implements data quality rules, such as outlier detection, missing value imputation, and type conversions.
- Standardization: All timestamps, identifiers (e.g., `lot_id`, `wafer_id`), and categorizations are standardized across all sources (a Spark cleansing sketch follows this list).
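A possible shape for such a cleansing job is sketched below in PySpark, assuming the Iceberg session and `lake.curated.wafer_measurements` table from the earlier sketch; the input path, cleansing rules, and column names are illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # assumes the Iceberg configs from the first sketch

# Read one day's raw MES drop; header CSV is just one possible source format.
raw = spark.read.option("header", True).csv("/raw/mes/2025/08/26/")

curated = (
    raw
    # Standardization: identifiers and timestamps aligned across sources.
    .withColumn("lot_id", F.upper(F.trim(F.col("lot_id"))))
    .withColumn("wafer_id", F.upper(F.trim(F.col("wafer_id"))))
    .withColumn("measured_at", F.to_timestamp("measured_at"))
    .withColumn("value", F.col("value").cast("double"))
    # Data quality: drop rows missing keys, de-duplicate repeated readings.
    .dropna(subset=["lot_id", "wafer_id", "measured_at"])
    .dropDuplicates(["lot_id", "wafer_id", "measured_at"])
    .select("lot_id", "wafer_id", "step_id", "measured_at", "value")
)

# Schema enforcement: the append fails if the frame does not match the Iceberg table schema.
curated.writeTo("lake.curated.wafer_measurements").append()
```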
C. Analytics Zone (Gold Layer)
- Purpose: This zone holds aggregated and transformed data that is ready for specific use cases, such as ML model training or business intelligence dashboards.
- Implementation: Spark jobs read from the Curated Zone and create highly optimized, aggregated datasets. For example, a fact table that aggregates wafer-level data to a lot-level for faster yield analysis, or a dataset specifically prepared for training a predictive maintenance model.
- Features:
- Reduced Granularity: Data is often aggregated to a higher level to reduce query time and data volume for specific tasks.
- Feature Engineering: Features for ML models are created here (e.g., `chamber_pressure_avg_last_5min`; see the windowing sketch after this list).
- Use-Case Specific: Each dataset in this zone is tailored to a specific downstream application, such as a dashboard or a machine learning pipeline.
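A minimal PySpark sketch of this kind of aggregation follows, reusing the Iceberg session from the earlier sketches; the `lake.curated.fdc_sensor_readings` and `lake.analytics.pm_training_features` tables and their columns are hypothetical. It derives `chamber_pressure_avg_last_5min` with a time-range window.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()  # assumes the Iceberg configs from the first sketch

# Hypothetical Silver table of FDC sensor readings.
sensor = spark.table("lake.curated.fdc_sensor_readings")

# Rolling 5-minute average of chamber pressure per tool (time-range window in seconds).
last_5min = (
    Window.partitionBy("tool_id")
    .orderBy(F.col("event_time").cast("long"))
    .rangeBetween(-5 * 60, 0)
)

features = (
    sensor
    .withColumn("chamber_pressure_avg_last_5min",
                F.avg("chamber_pressure").over(last_5min))
    .select("tool_id", "lot_id", "wafer_id", "event_time",
            "chamber_pressure_avg_last_5min")
)

# Persist the use-case-specific feature set as a Gold-layer Iceberg table.
features.writeTo("lake.analytics.pm_training_features").createOrReplace()
```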
3. Data Governance
- Data Lineage: Iceberg's metadata tracks the history of every file, allowing you to trace a specific data point back to its original source in the Raw Zone. This is critical for auditing and debugging. Tools like OpenLineage can be integrated to provide a more comprehensive view of the entire data flow.
- Quality Monitoring: A scheduled Spark job will run data quality checks on the Curated Zone. It will monitor for deviations from expected metrics (e.g., missing values, schema drifts). Alerts are sent via Airflow to notify data engineers of any issues.
- Schema Evolution: Iceberg natively supports non-disruptive schema evolution. You can add, remove, or rename columns without rewriting the entire table, making the data lake agile and adaptable to changing business requirements.
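The sketch below illustrates two of these points against the hypothetical `lake.curated.wafer_measurements` table from the earlier sketches: a metadata-only schema change, followed by the kind of null-rate check a scheduled, Airflow-triggered Spark job might run (the 1% threshold is an arbitrary example).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes the Iceberg configs from the first sketch

# Non-disruptive schema evolution: a metadata-only change, no data files are rewritten.
spark.sql("ALTER TABLE lake.curated.wafer_measurements ADD COLUMNS (chamber_id STRING)")

# One quality metric a scheduled monitoring job might compute: null rate on a key column.
row = spark.sql("""
    SELECT avg(CASE WHEN lot_id IS NULL THEN 1.0 ELSE 0.0 END) AS lot_id_null_rate
    FROM lake.curated.wafer_measurements
""").first()
null_rate = row["lot_id_null_rate"]

# Failing the task surfaces the issue through Airflow's normal alerting.
if null_rate is not None and float(null_rate) > 0.01:  # hypothetical threshold
    raise ValueError(f"lot_id null rate {float(null_rate):.2%} exceeds threshold")
```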
Delta Lake Design and Implementation (Alternative)
Delta Lake is an excellent alternative to Apache Iceberg for your data lake. It offers a similar set of powerful features with a slightly different implementation approach. The design below follows the same zonal organization and data governance principles.
1. Technology: Delta Lake on Object Storage
- Delta Lake: This is an open-source storage layer that provides a transactional layer on top of your existing data lake. It brings ACID (Atomicity, Consistency, Isolation, Durability) properties to big data workloads on object storage like S3, ADLS, or MinIO. This is a key advantage, as it ensures data reliability and integrity, which is vital in a manufacturing environment.
- Apache Spark: Delta Lake is deeply integrated with Apache Spark. All data transformations, from the raw to the curated zones, will be performed using Spark, writing to Delta tables. This provides a unified and powerful processing engine for all your data lake workloads.
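A minimal setup sketch, assuming the open-source `delta-spark` package is installed; the output path is a hypothetical example.

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Assumes the open-source delta-spark package is installed (pip install delta-spark).
builder = (
    SparkSession.builder
    .appName("delta-setup-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Writing in Delta format lays down Parquet files plus a _delta_log directory,
# which is what provides the ACID guarantees on object storage.
df = spark.range(5).withColumnRenamed("id", "sample_id")
df.write.format("delta").mode("overwrite").save("s3a://data-lake/curated/sample_table")  # hypothetical path
```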
2. Zonal Organization with Delta Lake
The three-zone architecture remains the same, but the implementation is now powered by Delta Lake's features.
A. Raw Zone (Bronze Layer)
- Purpose: The landing zone for all raw, untransformed data.
- Implementation: Raw data from your ETL pipelines is ingested and written directly into Delta tables in this zone. This allows you to immediately leverage Delta's transactional features. You can use Spark to write data from CSV, JSON, or other formats into Delta tables. The folder structure remains the same: /raw/mes/2025/08/26/.
- Features:
- Append-Only: Data is appended to the Delta tables. This provides an immutable, auditable log of all ingested data, ensuring a single source of truth.
- Time Travel: Delta Lake's versioning allows you to query historical versions of your data. This is crucial for debugging and re-running pipelines on older data versions if an error is discovered.
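A short sketch of both behaviors, reusing the Delta-enabled session from the setup sketch; the Bronze table path and the `_ingested_at` column are illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # assumes the Delta configs from the setup sketch

raw_path = "/raw/mes/2025/08/26/"   # daily drop from the ETL pipeline
bronze_path = "/delta/bronze/mes"   # hypothetical Bronze table location

# Append-only ingestion: each batch is added to the table's immutable transaction log.
(spark.read.option("header", True).csv(raw_path)
    .withColumn("_ingested_at", F.current_timestamp())
    .write.format("delta").mode("append").save(bronze_path))

# Time travel: re-read the table exactly as it looked at an earlier version.
first_version = spark.read.format("delta").option("versionAsOf", 0).load(bronze_path)
first_version.show(5)
```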
B. Curated Zone (Silver Layer)
- Purpose: To store cleansed, validated, and refined data.
- Implementation: Spark jobs read from the Raw Zone and apply data quality rules and transformations. The cleansed data is then written to new Delta tables in the Curated Zone.
- Features:
- Schema Enforcement: Delta Lake automatically enforces a schema upon write. If a new incoming file doesn't match the table's schema, the write operation will fail, preventing bad data from corrupting the table. You can use schema evolution to safely handle changes.
- Data Quality: Instead of just flagging bad data, you can use the `COPY INTO` command (available on Databricks) or Spark's own capabilities to ingest data into a "quarantine" table for further analysis while allowing clean data to flow to the Curated Zone.
- MERGE INTO: This powerful feature allows you to perform upserts (update or insert) on your Delta tables. This is ideal for handling slowly changing dimensions or updating records without a full table rewrite.
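A minimal upsert sketch using the `delta` Python API, assuming the Bronze table from the previous sketch; the Silver path, join keys, and cleansing rules are illustrative.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes the Delta configs from the setup sketch

bronze_path = "/delta/bronze/mes"                 # from the previous sketch
silver_path = "/delta/silver/wafer_measurements"  # hypothetical Silver table location

# Cleanse the latest Bronze data (rules are illustrative).
updates = (
    spark.read.format("delta").load(bronze_path)
    .dropna(subset=["lot_id", "wafer_id"])
    .dropDuplicates(["lot_id", "wafer_id", "measured_at"])
)

# Create the Silver table on first run so the merge has a target.
if not DeltaTable.isDeltaTable(spark, silver_path):
    updates.limit(0).write.format("delta").save(silver_path)

silver = DeltaTable.forPath(spark, silver_path)

# Upsert: update existing readings, insert new ones, without rewriting the table.
(silver.alias("t")
    .merge(
        updates.alias("s"),
        "t.lot_id = s.lot_id AND t.wafer_id = s.wafer_id AND t.measured_at = s.measured_at",
    )
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```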
C. Analytics Zone (Gold Layer)
- Purpose: To hold aggregated, use-case-specific data for ML and BI.
- Implementation: Spark jobs read from the Curated Zone and perform complex joins, aggregations, and feature engineering to create high-value datasets. These final datasets are also stored as Delta tables.
- Features:
- Optimized for Analytics: Delta tables support Z-ordering and partitioning to co-locate related data, dramatically speeding up queries.
- Simplified Pipelines: You can build a robust, multi-stage pipeline using a series of `MERGE` and `INSERT` statements, simplifying the code and making it more reliable.
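The sketch below builds a hypothetical lot-level Gold table from the Silver table of the previous sketch and then Z-orders it; `OPTIMIZE ... ZORDER BY` is available in open-source Delta Lake 2.0 and later, and the column names are illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # assumes the Delta configs from the setup sketch

silver_path = "/delta/silver/wafer_measurements"  # from the previous sketch
gold_path = "/delta/gold/lot_summary"             # hypothetical Gold table location

# Reduced granularity: aggregate wafer-level readings up to lot level.
(spark.read.format("delta").load(silver_path)
    .groupBy("lot_id")
    .agg(
        F.countDistinct("wafer_id").alias("wafer_count"),
        F.avg(F.col("value").cast("double")).alias("avg_value"),
    )
    .write.format("delta").mode("overwrite").save(gold_path))

# Z-ordering co-locates rows with similar lot_id values, speeding up selective queries.
spark.sql(f"OPTIMIZE delta.`{gold_path}` ZORDER BY (lot_id)")
```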
3. Data Governance with Delta Lake
- ACID Transactions: Delta Lake's transactional log guarantees that multiple write operations on the same table won't conflict, even with concurrent writes. This ensures data consistency and integrity.
- Data Lineage and Auditing: Every transaction is logged in a detailed file. This provides a full audit trail of who changed what, when, and with what data, making it easy to trace data lineage.
- Schema Evolution: Delta Lake provides a simple way to evolve schemas. If you need to add a new column, you can enable the `mergeSchema` option on your Spark write operations, and Delta will automatically handle the change without breaking downstream queries.
- Data Quality Monitoring: You can set up scheduled Spark jobs that query the Delta tables to continuously monitor for data quality metrics, such as null values, value ranges, and unexpected data patterns.
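The sketch below ties these ideas together against the hypothetical Bronze and Silver tables from the earlier sketches: an append with `mergeSchema` that adds a new `chamber_id` column (simulated here), a look at the transaction history for auditing, and a simple null-rate check a scheduled monitoring job might run.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # assumes the Delta configs from the setup sketch

bronze_path = "/delta/bronze/mes"                 # from the earlier sketches
silver_path = "/delta/silver/wafer_measurements"  # from the earlier sketches

# Schema evolution: the new chamber_id column (simulated here) is added to the
# Silver schema automatically because mergeSchema is enabled for this write.
new_batch = (spark.read.format("delta").load(bronze_path)
             .withColumn("chamber_id", F.lit("CH-01")))
(new_batch.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save(silver_path))

# Auditing: the transaction log records every operation, its timestamp, and who ran it.
spark.sql(f"DESCRIBE HISTORY delta.`{silver_path}`").show(truncate=False)

# Quality monitoring: one metric a scheduled job might track is the null rate of a key column.
null_rate = (spark.read.format("delta").load(silver_path)
             .select(F.avg(F.col("lot_id").isNull().cast("double")).alias("rate"))
             .first()["rate"])
print(f"lot_id null rate: {null_rate}")
```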