Vibe Coding Forem

Y.C Lee

Code: Implement RAG Engine with Semantic Search (Service Layer)

This code implements a REST API service layer on top of the RAG engine we previously discussed. Here's a detailed explanation of each component:

1. Pydantic Request/Response Models

QueryRequest

Function: Defines the structure for standard RAG query requests
Features:

  • query: The user's question/text to process
  • context: Optional additional context for the query
  • max_tokens: Override for maximum tokens in response
  • temperature: Override for response creativity
  • include_sources: Whether to include source documents in response
  • stream: Whether to stream the response (though streaming implementation isn't shown)
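
As a concrete illustration, here is a minimal Pydantic sketch of how these fields might be declared (field names follow the list above; the defaults and validation bounds are assumptions, since the original source is not reproduced here):

from typing import Any, Dict, Optional
from pydantic import BaseModel, Field

class QueryRequest(BaseModel):
    """Standard RAG query request (sketch; defaults are assumed)."""
    query: str                                              # the user's question
    context: Optional[Dict[str, Any]] = None                # optional extra context
    max_tokens: Optional[int] = Field(default=None, gt=0)   # per-request override
    temperature: Optional[float] = Field(default=None, ge=0.0, le=2.0)
    include_sources: bool = True                             # attach source documents
    stream: bool = False                                     # streaming not implemented yet

The other request models described below follow the same pattern with their own fields.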

ChatMessage & ChatRequest

Function: Support chat-style interactions with conversation history
Features:

  • role: Participant role (system, user, assistant)
  • content: Message content
  • timestamp: When the message was created
  • messages: Complete conversation history for context-aware responses

AnalysisRequest

Function: Specialized request for semiconductor data analysis
Features:

  • data_type: Type of data to analyze (process_data, test_results, etc.)
  • data: The actual data to analyze
  • analysis_type: Type of analysis to perform (trend, correlation, etc.)
  • context: Additional context for the analysis

TroubleshootingRequest

Function: Specialized request for equipment/process troubleshooting
Features:

  • issue_description: Description of the problem
  • symptoms: List of observed symptoms
  • equipment_id: Specific equipment involved
  • process_step: Manufacturing process step where issue occurs
  • recent_changes: Any recent changes that might be relevant
  • data_context: Additional data context for troubleshooting

OptimizationRequest

Function: Specialized request for process optimization
Features:

  • target_metric: What to optimize (yield, throughput, etc.)
  • current_performance: Current performance metrics
  • constraints: Any constraints on the optimization
  • process_parameters: Current process parameters
  • historical_data: Historical data for analysis

HealthResponse

Function: Standardized health check response format
Features:

  • status: Overall service status
  • timestamp: When the health check was performed
  • rag_status: Detailed status of RAG components
  • service_info: General service information

2. Configuration Management

load_config()

Function: Loads service configuration from YAML file or provides defaults
Features:

  • Looks for config file at config/rag_config.yaml
  • Falls back to sensible defaults if file doesn't exist
  • Handles both RAG engine and security configurations
  • Provides robust error handling with logging
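
A minimal sketch of such a loader, assuming PyYAML and the config/rag_config.yaml path mentioned above (the fallback values are placeholders taken from the example configuration later in this post, not the actual defaults):

import logging
from pathlib import Path

import yaml  # PyYAML

logger = logging.getLogger(__name__)

# Assumed fallback values; the real defaults live in the original rag_service.py.
DEFAULT_CONFIG = {
    "rag": {"default_model": "llama2-7b-semiconductor", "max_tokens": 1024, "temperature": 0.3},
    "security": {"enable_auth": False, "api_keys": []},
}

def load_config(path: str = "config/rag_config.yaml") -> dict:
    """Load service configuration from YAML, falling back to defaults."""
    config_path = Path(path)
    if not config_path.exists():
        logger.warning("Config file %s not found, using defaults", config_path)
        return DEFAULT_CONFIG
    try:
        with config_path.open("r", encoding="utf-8") as f:
            return yaml.safe_load(f) or DEFAULT_CONFIG
    except Exception:
        logger.exception("Failed to load %s, using defaults", config_path)
        return DEFAULT_CONFIG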

verify_api_key()

Function: API key authentication middleware
Features:

  • Checks if authentication is enabled in configuration
  • Validates API key against configured list of valid keys
  • Integrates with FastAPI's dependency injection system
  • Returns appropriate HTTP 401 errors for invalid/missing keys
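
A hedged sketch of what this dependency could look like with FastAPI's HTTPBearer scheme (the security.enable_auth and security.api_keys keys come from the post; everything else is illustrative):

from fastapi import Depends, HTTPException, status
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer

security = HTTPBearer(auto_error=False)  # don't auto-reject; decide based on config
config = load_config()                   # loader sketched above

async def verify_api_key(
    credentials: HTTPAuthorizationCredentials = Depends(security),
) -> None:
    """Allow the request if auth is disabled or the bearer token is a valid key."""
    sec = config.get("security", {})
    if not sec.get("enable_auth", False):
        return  # authentication disabled in configuration
    if credentials is None or credentials.credentials not in sec.get("api_keys", []):
        raise HTTPException(
            status_code=status.HTTP_401_UNAUTHORIZED,
            detail="Invalid or missing API key",
        )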

3. Service Lifecycle Management

initialize_services()

Function: Initializes the RAG engine with proper configuration
Features:

  • Loads configuration using load_config()
  • Creates RAGConfig object with appropriate values
  • Initializes the global RAG engine instance using get_rag_engine()
  • Provides comprehensive error handling and logging

startup_event()

Function: FastAPI startup event handler
Features:

  • Calls initialize_services() during application startup
  • Logs successful service initialization
  • Records metrics for service startup
  • Exits application on initialization failure

shutdown_event()

Function: FastAPI shutdown event handler
Features:

  • Gracefully shuts down the RAG engine
  • Logs service shutdown completion
  • Records metrics for service stoppage
  • Handles errors during shutdown process

signal_handler()

Function: Handles OS signals for graceful shutdown
Features:

  • Catches SIGINT and SIGTERM signals
  • Logs the received signal
  • Initiates graceful application shutdown
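
Taken together, the lifecycle pieces might look roughly like this (get_rag_engine, RAGConfig, and an engine shutdown method are referenced in the post, but their exact signatures are assumptions; the app title is illustrative):

import logging
import signal
import sys

from fastapi import FastAPI

logger = logging.getLogger(__name__)
app = FastAPI(title="Semiconductor RAG Service")
rag_engine = None  # global engine instance, set during startup

@app.on_event("startup")
async def startup_event():
    # Initialize the RAG engine from config; exit if the service cannot start.
    global rag_engine
    try:
        cfg = load_config()
        rag_engine = get_rag_engine(RAGConfig(**cfg["rag"]))  # assumed factory signature
        logger.info("RAG service initialized")
    except Exception:
        logger.exception("Service initialization failed")
        sys.exit(1)

@app.on_event("shutdown")
async def shutdown_event():
    # Gracefully release engine resources (e.g. pooled HTTP clients).
    if rag_engine is not None:
        await rag_engine.shutdown()  # assumed method name
    logger.info("RAG service shut down")

def signal_handler(signum, frame):
    # Log the signal and exit; the server then runs its shutdown hooks.
    logger.info("Received signal %s, shutting down", signum)
    sys.exit(0)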

4. API Endpoint Handlers

health_check() - GET /health

Function: Service health monitoring endpoint
Features:

  • Checks if RAG engine is initialized
  • Gets detailed health status from RAG engine
  • Returns standardized health response format
  • Provides service metadata (name, version, etc.)

process_query() - POST /query

Function: Core RAG query processing endpoint
Features:

  • Validates authentication
  • Allows overriding model parameters per request
  • Processes query using RAG engine
  • Formats response with answer and metadata
  • Optionally includes source documents (truncated for efficiency)
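
A sketch of what this handler could look like, reusing the app, rag_engine, QueryRequest, and verify_api_key from the sketches above (the engine's query() call and the result's attribute names are assumptions):

from fastapi import Depends, HTTPException

@app.post("/query")
async def process_query(request: QueryRequest, _: None = Depends(verify_api_key)):
    """Run a RAG query and return the answer plus optional, truncated sources."""
    if rag_engine is None:
        raise HTTPException(status_code=503, detail="RAG engine not initialized")
    result = await rag_engine.query(       # assumed engine API
        request.query,
        context=request.context,
        max_tokens=request.max_tokens,     # per-request parameter overrides
        temperature=request.temperature,
    )
    response = {"answer": result.answer, "confidence": result.confidence}
    if request.include_sources:
        # Truncate source content so the payload stays small.
        response["sources"] = [
            {"source": s.source, "score": s.score, "content": s.content[:500]}
            for s in result.sources
        ]
    return response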

chat_completion() - POST /chat

Function: Chat-completion style endpoint
Features:

  • Extracts the latest user message from conversation history
  • Builds conversation context from message history
  • Processes query with conversation context
  • Returns response in OpenAI-compatible format
  • Includes RAG-specific metadata in response
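
The interesting part is collapsing a message history into a single RAG query and wrapping the answer in an OpenAI-style envelope; a hedged sketch under the same assumptions as above:

import time
import uuid

from fastapi import Depends, HTTPException

@app.post("/chat")
async def chat_completion(request: ChatRequest, _: None = Depends(verify_api_key)):
    """Answer the latest user message, using earlier turns as conversation context."""
    user_messages = [m for m in request.messages if m.role == "user"]
    if not user_messages:
        raise HTTPException(status_code=400, detail="No user message provided")
    latest = user_messages[-1].content
    # Flatten earlier turns into a plain-text conversation context.
    history = "\n".join(f"{m.role}: {m.content}" for m in request.messages[:-1])
    result = await rag_engine.query(latest, context={"conversation": history})  # assumed API
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",
        "object": "chat.completion",
        "created": int(time.time()),
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": result.answer},
            "finish_reason": "stop",
        }],
        "rag_metadata": {"confidence": result.confidence},  # RAG-specific extras
    }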

analyze_data() - POST /analyze

Function: Specialized semiconductor data analysis endpoint
Features:

  • Constructs analysis-specific query from request data
  • Processes with RAG engine
  • Returns analysis results with type-specific metadata

troubleshoot_issue() - POST /troubleshoot

Function: Specialized troubleshooting endpoint
Features:

  • Constructs detailed troubleshooting query from symptoms and context
  • Processes with RAG engine
  • Returns troubleshooting guidance with issue context

optimize_process() - POST /optimize

Function: Specialized process optimization endpoint
Features:

  • Constructs optimization query from performance data and constraints
  • Processes with RAG engine
  • Returns optimization recommendations with technical justification

search_knowledge_base() - GET /knowledge/search

Function: Direct knowledge base access endpoint
Features:

  • Performs semantic search without LLM generation
  • Supports collection filtering
  • Returns raw search results with scores and metadata
  • Includes query intent classification
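
A retrieval-only handler might look like this (the engine's search() and classify_intent() methods and the result fields are assumptions based on the description above):

from typing import Optional

from fastapi import Depends

@app.get("/knowledge/search")
async def search_knowledge_base(
    query: str,
    collection: Optional[str] = None,   # optional collection filter
    limit: int = 10,
    _: None = Depends(verify_api_key),
):
    """Return raw semantic-search hits without invoking the LLM."""
    results = await rag_engine.search(query, collection=collection, top_k=limit)  # assumed API
    return {
        "query": query,
        "intent": rag_engine.classify_intent(query),  # assumed helper
        "results": [
            {"source": r.source, "collection": r.collection,
             "score": r.score, "content": r.content}
            for r in results
        ],
    }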

get_metrics() - GET /metrics

Function: Service metrics endpoint
Features:

  • Returns basic service metrics
  • Designed for integration with monitoring systems
  • Placeholder implementation (would connect to an actual metrics system)

5. Main Application Setup

FastAPI App Configuration

Features:

  • Sets application title, description, and version
  • Adds CORS middleware for cross-origin requests
  • Uses HTTPBearer security scheme for API key authentication

main()

Function: Application entry point
Features:

  • Sets up signal handlers for graceful shutdown
  • Configures server host and port
  • Starts uvicorn server with appropriate settings
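
A sketch of the app wiring and entry point, continuing the earlier sketches (the CORS settings and the 0.0.0.0 bind address are assumptions; port 8093 is the one mentioned in the deployment notes later in the post):

import signal

import uvicorn
from fastapi.middleware.cors import CORSMiddleware

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],    # tighten for production deployments
    allow_methods=["*"],
    allow_headers=["*"],
)

def main():
    # Register signal handlers (sketched earlier), then hand control to uvicorn.
    signal.signal(signal.SIGINT, signal_handler)
    signal.signal(signal.SIGTERM, signal_handler)
    uvicorn.run(app, host="0.0.0.0", port=8093)

if __name__ == "__main__":
    main()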

Key Architectural Features

  1. RESTful API Design: Clean, standardized endpoints with appropriate HTTP methods
  2. Domain Specialization: Semiconductor-specific endpoints with tailored request formats
  3. Authentication: Configurable API key authentication with proper security practices
  4. Error Handling: Comprehensive error handling with appropriate HTTP status codes
  5. Logging and Metrics: Integrated logging and metrics collection throughout
  6. Configuration Management: Flexible configuration with file-based and default options
  7. Health Monitoring: Built-in health checks for service monitoring
  8. Graceful Shutdown: Proper handling of shutdown signals and cleanup
  9. OpenAPI Compliance: Automatic API documentation generation through FastAPI
  10. Extensibility: Modular design makes it easy to add new endpoints or functionality

This service layer provides a robust, production-ready API for interacting with the semiconductor RAG engine, with appropriate security, monitoring, and domain-specific functionality.

RAG Engine Service Architecture with REST API

I'll create a comprehensive Mermaid diagram showing the architecture and information flow for this RAG Engine Service with REST API.

Information Flow for API Requests

API Endpoints Detail

| Endpoint | Method | Purpose | Input | Output |
| --- | --- | --- | --- | --- |
| /query | POST | Process a standard RAG query | QueryRequest with query text | RAG response with answer and sources |
| /chat | POST | Chat completion with conversation history | ChatRequest with message history | OpenAI-compatible chat response |
| /analyze | POST | Analyze semiconductor data | AnalysisRequest with data and analysis type | Analysis results with technical insights |
| /troubleshoot | POST | Troubleshoot process issues | TroubleshootingRequest with symptoms and context | Root cause analysis and solutions |
| /optimize | POST | Optimize process parameters | OptimizationRequest with metrics and constraints | Optimization recommendations |
| /knowledge/search | GET | Direct knowledge base search | Query string and collection filters | Raw search results without LLM generation |
| /health | GET | Service health status | None | Health status of RAG engine and dependencies |
| /metrics | GET | Service performance metrics | None | Operational metrics and statistics |

Security Architecture

Key Architectural Features

  1. RESTful API Design:

    • Standard HTTP methods and status codes
    • JSON request/response format
    • Comprehensive error handling
    • OpenAPI documentation automatically generated by FastAPI
  2. Authentication & Security:

    • Optional API key authentication
    • CORS enabled for cross-origin requests
    • Input validation using Pydantic models
    • Secure dependency injection for auth
  3. Specialized Semiconductor Endpoints:

    • Domain-specific endpoints for analysis, troubleshooting, and optimization
    • Structured input formats for technical queries
    • Context-aware processing for manufacturing scenarios
  4. Service Management:

    • Health checks for all dependencies
    • Metrics endpoint for monitoring
    • Graceful startup and shutdown procedures
    • Configuration management with YAML files
  5. Integration Capabilities:

    • Compatible with OpenAI's chat format
    • Support for streaming responses (SSE)
    • Direct access to knowledge base without generation
    • Flexible context passing for complex workflows

This architecture provides a robust, scalable foundation for deploying RAG capabilities in semiconductor manufacturing environments, with appropriate security, monitoring, and domain-specific functionality.


Excellent! You've shared the REST API service layer that exposes the previously defined RAG engine as a production-ready, semiconductor-focused AI backend using FastAPI.

Below is a comprehensive breakdown of this code, followed by a Mermaid architecture diagram showing the full system with information flow, components, and endpoints.


🧠 Semiconductor RAG Engine API – Full Analysis

This service wraps the SemiconductorRAGEngine from rag_manager.py into a RESTful microservice with:

  • Multiple domain-specific endpoints
  • Authentication support
  • Health checks
  • Configurable behavior
  • Streaming capability (planned)
  • Observability & metrics

🔧 Key Components Overview

| Component | Purpose |
| --- | --- |
| FastAPI | Modern Python web framework for building APIs with automatic OpenAPI/Swagger docs |
| RAGConfig, SemiconductorRAGEngine | Reused from the core engine in rag_manager.py |
| Pydantic models | Define request/response schemas |
| HTTPBearer | Optional API key authentication |
| YAML config loader | External configuration via config/rag_config.yaml |
| Uvicorn | ASGI server for async performance |
| SSE (planned) | Server-Sent Events for streaming responses |

📦 Configuration: load_config() & RAGConfig

🔹 config/rag_config.yaml (Example)

rag:
  vector_db_url: http://vector-db:8091
  llm_service_url: http://llm-service:8092
  max_context_length: 4000
  top_k_documents: 10
  similarity_threshold: 0.7
  default_model: llama2-7b-semiconductor
  max_tokens: 1024
  temperature: 0.3

security:
  enable_auth: true
  api_keys:
    - "semiconductor_api_key_123"

✅ Allows external configuration without code changes.


πŸ” Authentication: verify_api_key()

  • Uses HTTPBearer scheme (Authorization: Bearer <api_key>).
  • If security.enable_auth == True, validates against api_keys list.
  • Disabled by default if not configured.

🔒 Optional but production-ready security model.
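
From a client's point of view, a call using the example key from the YAML above might look like this (the host and port match the deployment notes later in the post):

import requests

resp = requests.post(
    "http://localhost:8093/query",
    headers={"Authorization": "Bearer semiconductor_api_key_123"},  # key from the example config
    json={"query": "Why is CD uniformity poor?", "include_sources": True},
    timeout=30,
)
resp.raise_for_status()  # raises on 401 if auth is enabled and the key is wrong
print(resp.json()["answer"])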


🚀 Lifecycle: Startup & Shutdown

@app.on_event("startup")

  • Loads config
  • Initializes RAGConfig
  • Creates singleton SemiconductorRAGEngine via get_rag_engine()
  • Logs startup success or exits on failure

@app.on_event("shutdown")

  • Safely shuts down the RAG engine (closes HTTP clients)
  • Logs shutdown event

✅ Ensures graceful boot and cleanup.


📑 API Endpoints Summary

| Endpoint | Method | Category | Purpose |
| --- | --- | --- | --- |
| /health | GET | Health check | System status and dependent services |
| /query | POST | Core RAG | General Q&A with optional config override |
| /chat | POST | Chat interface | Conversational AI with history |
| /analyze | POST | Domain-specific | Analyze process/test/yield data |
| /troubleshoot | POST | Domain-specific | Root cause + solutions for issues |
| /optimize | POST | Domain-specific | Process improvement recommendations |
| /knowledge/search | GET | Retrieval-only | Raw semantic search (no LLM) |
| /metrics | GET | Monitoring | Service-level metrics (stubbed for Prometheus) |

📥 Request Models (Pydantic)

1. QueryRequest

{
  "query": "Why is CD uniformity poor?",
  "context": { "step": "lithography" },
  "max_tokens": 512,
  "temperature": 0.5,
  "include_sources": true,
  "stream": false
}

Used for general queries with optional overrides.


2. ChatRequest

{
  "messages": [
    { "role": "user", "content": "What causes overlay drift?" },
    { "role": "assistant", "content": "Thermal expansion..." },
    { "role": "user", "content": "How do I fix it?" }
  ],
  "context": {},
  "stream": false
}

Supports multi-turn conversations; the last user message is extracted as the query.


3. AnalysisRequest

{
  "data_type": "process_data",
  "data": { "temp": 120, "pressure": 50 },
  "analysis_type": "anomaly",
  "context": "After chamber cleaning"
}

Tailored for data-driven analysis (trend, correlation, anomaly, summary).


4. TroubleshootingRequest

{
  "issue_description": "High defect count",
  "symptoms": ["particles", "scratches"],
  "equipment_id": "ETCH-007",
  "process_step": "plasma etch",
  "recent_changes": ["new gas line installed"]
}

Guides structured troubleshooting with root cause analysis.


5. OptimizationRequest

{
  "target_metric": "yield",
  "current_performance": { "current_yield": 92.1 },
  "constraints": { "max_temp": 150 },
  "process_parameters": { "power": 800, "pressure": 45 }
}

For process optimization with constraints and justification.


📤 Response Format Example (/query)

{
  "answer": "Poor CD uniformity can result from...",
  "confidence": 0.87,
  "response_time_ms": 1420,
  "tokens_generated": 234,
  "model_used": "llama2-7b-semiconductor",
  "query_intent": {
    "type": "analysis",
    "confidence": 0.91,
    "entities": ["CD", "nm"],
    "process_modules": ["lithography"],
    "equipment_types": []
  },
  "sources": [
    {
      "source": "litho_handbook_v3.pdf",
      "collection": "technical_reports",
      "score": 0.88,
      "content": "Critical dimension variation..."
    }
  ]
}

✅ Rich, structured, traceable output.


πŸ› οΈ Special Endpoints

GET /knowledge/search

Bypasses the LLM and returns raw search results only.

Useful for:

  • Building search UIs
  • Debugging retrieval
  • Integrating with other systems

GET /metrics

Stubbed for future Prometheus integration.

Could expose:

  • rag_queries_total
  • rag_response_time_seconds_bucket
  • rag_confidence_avg
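
If that integration were done with the prometheus_client library, a sketch could look like this (the metric names come from the list above; the endpoint wiring reuses the app from earlier sketches and is otherwise an assumption):

from fastapi import Response
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Gauge, Histogram, generate_latest

RAG_QUERIES_TOTAL = Counter("rag_queries_total", "Total RAG queries processed", ["endpoint"])
RAG_RESPONSE_TIME = Histogram("rag_response_time_seconds", "RAG response latency in seconds")
RAG_CONFIDENCE_AVG = Gauge("rag_confidence_avg", "Most recent average answer confidence")

@app.get("/metrics")
async def get_metrics():
    # Expose every registered metric in the Prometheus text exposition format.
    return Response(content=generate_latest(), media_type=CONTENT_TYPE_LATEST)

The Histogram automatically emits the rag_response_time_seconds_bucket series mentioned above.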

🧱 Architecture Diagram (Mermaid)

Here's the complete system architecture with all components and data flow.


πŸ” Information Flow Example: /troubleshoot

  1. User → POST /troubleshoot with issue details
  2. FastAPI validates input using TroubleshootingRequest
  3. Auth checks API key
  4. Service builds a natural language query from structured fields
  5. RAG Engine:
    • Classifies intent → troubleshooting
    • Searches failure_analysis, equipment_manuals, etc.
    • Builds context
    • Sends to LLM → generates troubleshooting steps
  6. Response includes:
    • Guidance
    • Sources
    • Confidence
    • Timing
  7. Logs & metrics recorded

🎯 Turns structured inputs into actionable expert advice.
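
Step 4 of this flow, building a natural-language query from the structured fields, might be implemented along these lines (field names come from the TroubleshootingRequest model described earlier; the phrasing is illustrative):

def build_troubleshooting_query(req: TroubleshootingRequest) -> str:
    """Turn structured troubleshooting fields into one natural-language query."""
    parts = [f"Issue: {req.issue_description}"]
    if req.symptoms:
        parts.append("Observed symptoms: " + ", ".join(req.symptoms))
    if req.equipment_id:
        parts.append(f"Equipment: {req.equipment_id}")
    if req.process_step:
        parts.append(f"Process step: {req.process_step}")
    if req.recent_changes:
        parts.append("Recent changes: " + ", ".join(req.recent_changes))
    parts.append("Suggest likely root causes and recommended corrective actions.")
    return "\n".join(parts)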


🌐 Auto-Generated API Docs

FastAPI automatically provides:

  • Swagger UI: http://localhost:8093/docs
  • ReDoc: http://localhost:8093/redoc

Includes:

  • Interactive testing
  • Schema definitions
  • Example payloads
  • Status codes

📚 Instant developer documentation.


πŸ›‘οΈ Error Handling & Resilience

| Failure Mode | Handling |
| --- | --- |
| RAG engine not ready | 503 Service Unavailable |
| Invalid request | 400 Bad Request |
| Missing auth | 401 Unauthorized |
| Internal error | 500 Internal Server Error + log |
| Service down | Health check reflects degraded status |

✅ Robust and observable.


🚀 Deployment Notes

Run with:

python rag_service.py

Exposes:

  • Port 8093: Main API
  • Endpoints:
    • GET /health → Monitoring
    • GET /docs → Swagger UI

Suggested Docker Setup:

FROM python:3.10
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt
CMD ["python", "rag_service.py"]

Deployable in Kubernetes, Docker, or bare metal.


✅ Summary: Key Features

| Feature | Benefit |
| --- | --- |
| Domain-specific endpoints | Tailored for semiconductor engineers |
| Unified RAG backend | Single engine powers chat, analysis, troubleshooting |
| Configurable via YAML | No rebuild needed for endpoint tuning |
| Secure (optional) | API key protection |
| Observable | Logging, metrics, health checks |
| Extensible | Easy to add new endpoints (e.g., /predict_yield) |
| Production-ready | Async, scalable, resilient |

📈 Future Enhancements

| Idea | Description |
| --- | --- |
| Streaming responses | Use StreamingResponse or EventSourceResponse for real-time LLM output |
| Prometheus integration | Export metrics for monitoring dashboards |
| Rate limiting | Prevent abuse using fastapi-limiter |
| Caching | Cache frequent queries with Redis |
| Feedback endpoint | Allow users to rate responses for improvement |
| Multi-tenant support | Isolate knowledge by fab or product line |
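
For the streaming idea specifically, a minimal SSE sketch with FastAPI's StreamingResponse might look like this (the /query/stream path and the engine's stream_query() generator are hypothetical, and the app, QueryRequest, rag_engine, and verify_api_key come from the earlier sketches):

from fastapi import Depends
from fastapi.responses import StreamingResponse

@app.post("/query/stream")
async def stream_query(request: QueryRequest, _: None = Depends(verify_api_key)):
    """Stream generated tokens to the client as Server-Sent Events."""
    async def event_stream():
        async for token in rag_engine.stream_query(request.query):  # hypothetical async generator
            yield f"data: {token}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(event_stream(), media_type="text/event-stream")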

🏁 Conclusion

This rag_service.py file transforms the core RAG engine into a full-fledged AI microservice tailored for semiconductor manufacturing.

It enables:

  • Engineers to ask questions in natural language
  • Systems to analyze data and recommend actions
  • Support teams to troubleshoot faster
  • Process engineers to optimize yield

With clean separation of concerns, strong typing, async support, and excellent observability, this is a model implementation of an enterprise RAG service.
