
Y.C Lee

Design Document: Caching

Caching Layer

  • Technology: Redis Cluster
  • Purpose: Cache frequently accessed data and model predictions
  • Cache Types:
    • Query result caching
    • Model prediction caching
    • Session data caching
    • Real-time metrics caching

Caching Layer: Design and Implementation

Here's a detailed design for the caching layer using Redis Cluster, a critical component for reducing latency and improving the performance of your AI ecosystem.


1. Technology: Redis Cluster

Redis Cluster is the ideal technology for this caching layer. It provides high availability and horizontal scalability by sharding your data across multiple Redis nodes. This ensures that even if one node fails, your caching service remains operational, and you can handle large volumes of data and a high number of concurrent users.
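Before getting into the individual cache types, here is a minimal connection sketch using the redis-py client (version 4.1 or later, which ships cluster support); the node address and port are illustrative placeholders, not part of the design:

```python
# Minimal Redis Cluster connection sketch using redis-py (>= 4.1);
# the node address below is an illustrative placeholder.
from redis.cluster import RedisCluster

cache = RedisCluster(host="redis-node-1.internal", port=6379, decode_responses=True)

# Keys are routed to the owning shard by hash slot, so the client can
# connect through any node and still reach the right data.
cache.set("healthcheck", "ok", ex=60)  # value expires after 60 seconds
print(cache.get("healthcheck"))
```

One cluster-specific design note: keys that must co-locate on a single shard (for example, a lot's prediction and its metadata) can share a hash tag such as `{lot:123}`, which forces them into the same hash slot.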


2. Caching Strategy and Types

The caching layer sits between the API Gateway and the backend services and follows the classic cache-aside pattern: a backend service first checks Redis for a cached response before performing a costly operation such as a database query or a complex model inference.

A. Query Result Caching

  • Purpose: To cache the results of frequently executed analytical queries to the Greenplum data warehouse.
  • Implementation (see the sketch after this list):
    • The Data API Gateway or a dedicated caching service will intercept API requests for data.
    • It will generate a unique key based on the query parameters (e.g., query:product_id=X&date_range=Y).
    • Before executing the query against Greenplum, the service checks Redis using this key.
    • Cache Hit: If a result is found in Redis, it is returned immediately, bypassing the expensive database query.
    • Cache Miss: If not found, the query is executed, the result is stored in Redis with a Time-To-Live (TTL), and then returned to the user.
  • TTL Management: Configure a TTL based on data freshness requirements. For example, cache results for daily reports for 24 hours, but cache results for real-time sensor data for only 5 minutes.
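A minimal sketch of the cache-aside flow above. `run_greenplum_query` is a hypothetical helper standing in for the actual warehouse call; the key scheme hashes canonicalized parameters so logically identical queries share one cache entry:

```python
import hashlib
import json

QUERY_TTL_SECONDS = 24 * 60 * 60  # daily reports; use ~300 for real-time sensor data

def make_query_key(params: dict) -> str:
    # Canonicalize parameters so logically identical queries map to one key.
    canonical = json.dumps(params, sort_keys=True)
    return "query:" + hashlib.sha256(canonical.encode()).hexdigest()

def cached_query(cache, params: dict):
    key = make_query_key(params)
    hit = cache.get(key)
    if hit is not None:                   # cache hit: skip the warehouse entirely
        return json.loads(hit)
    result = run_greenplum_query(params)  # hypothetical helper: expensive Greenplum query
    cache.set(key, json.dumps(result), ex=QUERY_TTL_SECONDS)
    return result
```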

B. Model Prediction Caching

  • Purpose: To cache the predictions from the LLM and other machine learning models. This is crucial for real-time applications and common queries.
  • Implementation (sketched in code after this list):
    • After an expensive model inference (e.g., a root cause analysis from the LLM), the system caches the prediction in Redis.
    • The cache key is based on the input parameters to the model (e.g., prediction:lot_id=123).
    • This prevents redundant computations, significantly reducing latency and GPU resource consumption, which can be a major cost factor.
  • Cache Invalidation: Predictions might become stale. The cache can be invalidated in two ways:
    • Time-based Invalidation: Set a short TTL (e.g., 1 hour) for predictions.
    • Event-based Invalidation: Trigger a cache invalidation for a specific key when new source data (e.g., MES, FDC) for that lot or wafer is ingested.
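A sketch covering both invalidation paths, assuming a hypothetical `run_llm_inference` helper for the expensive model call; the ingestion hook name is likewise illustrative:

```python
import json

PREDICTION_TTL_SECONDS = 3600  # time-based invalidation: predictions expire after 1 hour

def cached_prediction(cache, lot_id: str) -> dict:
    key = f"prediction:lot_id={lot_id}"
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)
    result = run_llm_inference(lot_id)  # hypothetical helper: costly GPU inference
    cache.set(key, json.dumps(result), ex=PREDICTION_TTL_SECONDS)
    return result

def on_source_data_ingested(cache, lot_id: str) -> None:
    # Event-based invalidation: call this when new MES/FDC data lands for a lot,
    # so the next request recomputes against fresh inputs instead of a stale cache.
    cache.delete(f"prediction:lot_id={lot_id}")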

C. Session Data Caching

  • Purpose: To store temporary user session information and application state.
  • Implementation (see the sketch after this list):
    • When a user logs in, their session token and a minimal set of user data are stored in Redis.
    • This offloads session state from the application servers, making them stateless and easy to scale horizontally.
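A minimal sketch of stateless session storage. The opaque token here is minted with `secrets.token_urlsafe`, which is just one reasonable choice; TTL and payload fields are assumptions for illustration:

```python
import json
import secrets

SESSION_TTL_SECONDS = 30 * 60  # idle sessions expire after 30 minutes

def create_session(cache, user_id: str, roles: list[str]) -> str:
    token = secrets.token_urlsafe(32)  # opaque token handed back to the client
    payload = json.dumps({"user_id": user_id, "roles": roles})
    cache.set(f"session:{token}", payload, ex=SESSION_TTL_SECONDS)
    return token

def get_session(cache, token: str) -> dict | None:
    key = f"session:{token}"
    raw = cache.get(key)
    if raw is None:
        return None  # expired or unknown token: force re-authentication
    cache.expire(key, SESSION_TTL_SECONDS)  # sliding expiration on each access
    return json.loads(raw)
```

Because any application server can validate a token with a single Redis lookup, requests can be load-balanced freely across instances.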

D. Real-time Metrics Caching

  • Purpose: To store real-time aggregated metrics from the stream processing service (Flink).
  • Implementation (illustrated after this list):
    • The Apache Flink stream processing jobs will continuously write the aggregated real-time metrics (e.g., average chamber pressure over the last minute, tool health scores) directly to Redis.
    • The user interface (Grafana dashboards, custom web app) can then query Redis directly for these metrics, providing a sub-second, real-time view of the factory floor.
    • This bypasses the data warehouse for real-time insights, significantly reducing query latency for dashboards that require up-to-the-minute data.
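The Flink jobs themselves would typically write through a Redis sink in Java or Scala; as a language-consistent illustration, here is how the writer and the dashboard-side reader might look. Key scheme, metric names, and TTL are all assumptions:

```python
import json
import time

def publish_metric(cache, tool_id: str, name: str, value: float) -> None:
    # Written by the stream job (or a thin sink wrapper) once per aggregation window.
    payload = json.dumps({"value": value, "ts": time.time()})
    cache.set(f"metrics:{tool_id}:{name}", payload, ex=120)  # short TTL: stale metrics self-expire

def read_metric(cache, tool_id: str, name: str) -> dict | None:
    # Read by the dashboard backend: a single key lookup, no warehouse query.
    raw = cache.get(f"metrics:{tool_id}:{name}")
    return json.loads(raw) if raw is not None else None

# e.g. publish_metric(cache, "etcher-07", "avg_chamber_pressure_1m", 2.31)
```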

This caching design makes your AI ecosystem more responsive, scalable, and cost-effective by minimizing redundant computation and database load.
