Data Warehouse
- Technology: Greenplum
- Schema: Star schema optimized for analytical queries
Data Models:
- Fact tables: Process measurements, test results, defect counts
- Dimension tables: Equipment, recipes, lots, wafers, time
- Partitioning: Time-based partitioning for efficient querying
- Retention: Configurable data retention policies
Data Warehouse Design and Implementation
This section provides a detailed design and implementation plan for the data warehouse, built on Greenplum, a massively parallel processing (MPP) database tailored to the analytical needs of semiconductor manufacturing.
1. Technology: Greenplum
Greenplum is a strong choice because of its MPP architecture: it distributes data and query processing across multiple segment nodes, allowing complex analytical queries to execute rapidly over terabytes or even petabytes of data. This is crucial for analyzing historical trends, performing root-cause analysis, and supporting high-concurrency analytical workloads.
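As a quick sketch of what that distribution looks like in practice, every Greenplum table exposes a gp_segment_id system column; roughly equal per-segment row counts indicate a well-chosen distribution key (the fact table name anticipates the schema defined below):

```sql
-- Count the rows each Greenplum segment holds for one fact table.
SELECT gp_segment_id, count(*) AS rows_on_segment
FROM   fact_process_measurement
GROUP  BY gp_segment_id
ORDER  BY gp_segment_id;
```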
2. Schema Design: Star Schema
The data warehouse will use a star schema, a widely accepted model for analytical data warehousing. It's simple, intuitive, and highly optimized for reporting and ad-hoc queries.
- Fact Tables: These tables contain the core, quantitative measurements and metrics. They are large and hold the bulk of the data.
- Dimension Tables: These tables contain descriptive attributes related to the facts. They are smaller and joined to the fact tables via foreign keys.
This design simplifies complex queries. For example, to find the average yield for a specific recipe on a particular tool, you join a small Equipment Dimension table and a small Recipe Dimension table to a large Fact table containing yield data. This is much faster than joining multiple large fact-like tables.
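A sketch of that query shape against the tables defined in the next section. It averages a CD measurement rather than yield, since the process-measurement fact is the one that carries both equipment and recipe keys; the surrogate-key join columns and the literal filter values are assumptions:

```sql
-- Star-schema join: two small dimensions against one large fact.
SELECT e.tool_name,
       r.recipe_name,
       avg(f.CD_measurement) AS avg_cd
FROM   fact_process_measurement f
JOIN   dim_equipment e ON e.equipment_sk = f.equipment_sk  -- assumed surrogate keys
JOIN   dim_recipe    r ON r.recipe_sk    = f.recipe_sk
WHERE  e.tool_name   = 'ETCHER-07'        -- hypothetical tool
  AND  r.recipe_name = 'OXIDE_ETCH_V3'    -- hypothetical recipe
GROUP  BY e.tool_name, r.recipe_name;
```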
3. Data Models
A. Fact Tables
Each fact table will represent a specific manufacturing event or measurement.
- fact_process_measurement: Captures inline process measurements from FDC and metrology systems.
  - Keys: wafer_sk, lot_sk, equipment_sk, recipe_sk, time_sk.
  - Measures: temperature_mean, pressure_std_dev, flow_rate_mean, rf_power_mean, CD_measurement, overlay_measurement.
- fact_test_result: Stores final test results from WAT and CP.
  - Keys: die_sk, wafer_sk, lot_sk, test_program_sk, time_sk.
  - Measures: vth, idsat, leakage_current, yield_bin_code.
- fact_defect_count: Aggregates defect data.
  - Keys: wafer_sk, lot_sk, equipment_sk, process_step_sk, time_sk.
  - Measures: particle_count, scratch_count, residue_count, total_defect_count.
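To make the model concrete, here is a minimal Greenplum DDL sketch for the first fact table. The bigint surrogate keys, numeric measure types, the measured_at timestamp, the distribution key, and the monthly date range are all assumptions for illustration:

```sql
CREATE TABLE fact_process_measurement (
    wafer_sk            bigint,
    lot_sk              bigint,
    equipment_sk        bigint,
    recipe_sk           bigint,
    time_sk             bigint,
    measured_at         timestamp NOT NULL,  -- drives partitioning (section 4)
    temperature_mean    numeric,
    pressure_std_dev    numeric,
    flow_rate_mean      numeric,
    rf_power_mean       numeric,
    CD_measurement      numeric,
    overlay_measurement numeric
)
DISTRIBUTED BY (wafer_sk)          -- hash-distributes rows across segments
PARTITION BY RANGE (measured_at)
(
    START (date '2024-01-01') INCLUSIVE
    END   (date '2026-01-01') EXCLUSIVE
    EVERY (interval '1 month'),
    DEFAULT PARTITION outliers     -- catches rows outside the declared range
);
```

Distributing on wafer_sk spreads the large measurement volume evenly across segments while co-locating each wafer's rows; fact_test_result and fact_defect_count would follow the same pattern.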
B. Dimension Tables
These tables provide context and descriptive attributes.
- dim_equipment: Information about manufacturing tools.
  - Attributes: equipment_id (primary key), tool_name, tool_type, chamber_number, fab_location, supplier.
- dim_recipe: Details about process recipes.
  - Attributes: recipe_id (primary key), recipe_name, process_layer, target_parameters.
- dim_lot: Lot-level information.
  - Attributes: lot_id (primary key), product_id, start_date, finish_date.
- dim_wafer: Wafer-specific attributes.
  - Attributes: wafer_id (primary key), wafer_type, wafer_state (good, rework, scrap).
- dim_time: A pre-generated table with every time granularity needed for analysis.
  - Attributes: time_id (primary key), date, day_of_week, hour_of_day, shift, fiscal_week, fiscal_month.
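Because dim_time is fully deterministic, it can be pre-generated once with generate_series. A sketch at hourly grain, where the YYYYMMDDHH24 encoding of time_id, the 07:00–19:00 day shift, and the date range are assumptions; the fiscal columns are omitted for brevity:

```sql
INSERT INTO dim_time (time_id, date, day_of_week, hour_of_day, shift)
SELECT to_char(ts, 'YYYYMMDDHH24')::bigint AS time_id,      -- e.g. 2025010107
       ts::date                            AS date,
       trim(to_char(ts, 'Day'))            AS day_of_week,  -- 'Monday', ...
       extract(hour FROM ts)::int          AS hour_of_day,
       CASE WHEN extract(hour FROM ts) BETWEEN 7 AND 18
            THEN 'DAY' ELSE 'NIGHT'
       END                                 AS shift         -- assumed 12-hour shifts
FROM generate_series(timestamp '2024-01-01',
                     timestamp '2026-12-31 23:00',
                     interval '1 hour') AS ts;
```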
4. Partitioning and Indexing
- Partitioning: Greenplum's table partitioning will be keyed to the time dimension (time_sk, or an accompanying timestamp column), since time-bounded filters are the most common query pattern. For example, the fact_process_measurement table can be partitioned by month or week, so a query over the last two weeks scans only the one or two most recent partitions, dramatically reducing I/O and improving performance.
- Indexing: Use B-tree indexes on the primary keys of dimension tables to speed up joins. In an MPP system like Greenplum, indexing on fact tables is less critical than proper partitioning, because the engine relies on fast parallel scans across segments.
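A sketch of both points, assuming the fact DDL above. In Greenplum (as in PostgreSQL), declaring a primary key implicitly builds the backing B-tree index; note that a Greenplum dimension must be distributed by that column for the constraint to be allowed:

```sql
-- Primary keys on the dimensions implicitly create B-tree indexes,
-- which serve the dimension side of star-schema joins.
ALTER TABLE dim_equipment ADD PRIMARY KEY (equipment_id);
ALTER TABLE dim_recipe    ADD PRIMARY KEY (recipe_id);

-- Partition elimination: the plan should show that only the partitions
-- overlapping the predicate are scanned, not the whole fact table.
-- (A literal date is used so the planner can prune at plan time.)
EXPLAIN
SELECT avg(temperature_mean)
FROM   fact_process_measurement
WHERE  measured_at >= date '2025-06-01';
```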
5. Data Retention
Data retention is a critical part of the design to manage storage costs and ensure compliance.
- Policy Configuration: Define a clear retention policy. For example, detailed per-wafer data from the last 5 years stays in the primary data warehouse for interactive analysis, while older data is archived to a more cost-effective object store (e.g., MinIO), with only aggregated summaries kept in the warehouse.
- Archiving and Deletion: A scheduled cron job or an Airflow DAG will periodically check for partitions that have aged out of the retention policy, then either drop them or move their data to archival storage first. This automated process keeps the warehouse performant and storage costs under control.
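A sketch of the drop step such a job would issue, assuming the monthly partitions created in the DDL above (the date literal marks the expiring month). The export to MinIO, e.g. through a writable external table or gpbackup, is elided:

```sql
-- Step 1 (elided): copy the expiring partition's rows to the object store.
-- Step 2: drop the partition that has aged out of the retention window.
--         FOR (<value>) addresses the partition whose range contains it.
ALTER TABLE fact_process_measurement
    DROP PARTITION FOR (date '2024-01-01');
```

In practice the job would verify the archive copy before issuing the drop.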