Y.C Lee

Design Document: LLM Service

LLM Service

  • Models:
    • Primary: Llama 3 70B or Qwen 72B for complex reasoning
    • Secondary: Mistral 7B for faster responses
    • Specialized: CodeLlama for process recipe analysis
  • Deployment:
    • Model serving via TensorRT or vLLM
    • GPU clusters (NVIDIA A100/H100)
    • Auto-scaling based on request volume
  • Fine-tuning: LoRA adapters for semiconductor domain adaptation

Best-solution proposal for Requirement 1: LLM Integration and Customization

  • Core stance: A private, swap-in LLM architecture with policy-enforced data boundaries; RAG-first for fast impact and content freshness; parameter-efficient domain adaptation (LoRA/QLoRA) for focused improvements; and rigorous eval loops to ensure technical accuracy and provenance.

1) Foundation models and runtime

  • Base models (supported, swappable):
    • Llama 3.1 8B Instruct (latency-friendly on-prem), Llama 3.1 70B (higher accuracy), Mistral 7B Instruct, Qwen2-7B Instruct, CodeLlama for code/tooling prompts.
  • Runtime: vLLM or TensorRT-LLM for high-throughput serving; quantization: INT4/FP8 for 7–8B models; BF16 for 70B. Enable paged KV cache and continuous batching.
  • Tokenizer extensions: Add special tokens for SEMI/JEDEC acronyms and fab-specific terms to reduce fragmentation and improve grounding.
  • Model router (optional): Policy to route “code/tools” to CodeLlama, “reasoning/contextual” to Llama/Qwen/Mistral; fallback if latency/accuracy thresholds not met.
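
As a rough illustration of the routing policy above, a minimal sketch follows; the served model names, latency figures, and the keyword heuristic are placeholder assumptions rather than a fixed design.

```python
# Minimal sketch of the optional model router; names, thresholds, and the
# keyword heuristic are placeholders, not a committed design.
from dataclasses import dataclass

@dataclass
class Route:
    model: str               # model name as served by vLLM / TensorRT-LLM
    typical_latency_ms: int  # rough budget the route needs for a full answer

ROUTES = {
    "code":      Route("codellama-13b-instruct", 2000),
    "reasoning": Route("llama-3.1-70b-instruct", 4000),
    "default":   Route("llama-3.1-8b-instruct", 1500),
}

CODE_HINTS = ("secs/gem", "recipe body", "script", "json", "regex")

def classify_intent(prompt: str) -> str:
    """Rough keyword heuristic; a small classifier could replace this later."""
    p = prompt.lower()
    if any(hint in p for hint in CODE_HINTS):
        return "code"
    if len(p.split()) > 120:   # long, contextual questions -> larger model
        return "reasoning"
    return "default"

def pick_model(prompt: str, latency_budget_ms: int) -> str:
    route = ROUTES[classify_intent(prompt)]
    # Fall back to the latency-friendly default if the budget is tighter than the route needs.
    if latency_budget_ms < route.typical_latency_ms:
        return ROUTES["default"].model
    return route.model
```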

2) RAG architecture and domain knowledge

  • Ingestion sources: SEMI E10/E30/E40/E90/E94, JEDEC; process recipes, FA reports, yield learning docs, equipment manuals/specs; multilingual SOPs/BKMs.
  • Parsing & normalization:
    • PDFs/Docs/PPT via OCR + layout-aware parsing; persist canonical text + structure.
    • Metadata schema: {standard_id, revision, section, process_step, equipment_vendor, equipment_family, recipe, parameter, unit, language, confidentiality, effective_date}.
  • Chunking:
    • Hybrid: semantic sections (headings, list items) + sliding windows (256–512 tokens with 20–25% overlap) for specs; small blocks (150–300 tokens) for SOPs/BKMs; table-aware extraction.
  • Embeddings:
    • Multilingual technical embeddings served on‑prem: BGE-M3 or E5-large-v2; optional dual encoders for passages vs. tables. Re-embed on change; store the embedding version.
  • Vector store: Weaviate or pgvector (if you prefer Postgres ops); Chroma for simple on‑prem POC. HNSW index, metadata filters (standard, process, equipment).
  • Retrieval:
    • Hybrid BM25 + dense; rerank with cross-encoder (bge-reranker-large) for precision.
    • Source-attribution with section IDs and page/figure refs.
  • Prompting:
    • Process-aware system prompts, explicit grounding rules, instruction to cite standard section/figure for normative claims.
    • Structured output for checklists, parameters, and units.
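
To make the retrieval path concrete, here is a minimal sketch of retrieve → rerank → citation-enforced prompt. The `vector_store` object and its `search()` signature are hypothetical stand-ins for a Weaviate/pgvector client; the reranker uses the sentence-transformers `CrossEncoder` wrapper, and the metadata keys mirror the schema above.

```python
# Sketch of hybrid retrieval -> cross-encoder rerank -> citation-enforced prompt.
# `vector_store` and its search() signature are hypothetical stand-ins for a
# Weaviate/pgvector client; the reranker call uses the sentence-transformers CrossEncoder API.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-large")

def build_grounded_prompt(query: str, vector_store, top_k: int = 20, keep: int = 5) -> str:
    # 1) Hybrid (BM25 + dense) retrieval with metadata filters -- assumed store API.
    hits = vector_store.search(query, top_k=top_k, filters={"language": "en"})

    # 2) Cross-encoder rerank for precision; keep only the strongest passages.
    scores = reranker.predict([(query, h["text"]) for h in hits])
    ranked = sorted(zip(hits, scores), key=lambda pair: pair[1], reverse=True)[:keep]

    # 3) Citation-enforced prompt: every normative claim must cite standard + section.
    sources = "\n\n".join(
        f"[{h['standard_id']} §{h['section']}] {h['text']}" for h, _ in ranked
    )
    return (
        "Answer using ONLY the sources below. Cite the standard and section "
        "(e.g. [SEMI E40 §5.2]) for every normative claim; reply 'not in sources' "
        "when the sources do not cover the question.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {query}"
    )
```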

3) Domain adaptation (LoRA/QLoRA)

  • Strategy:
    • Phase 1: RAG-only to validate grounding, reduce hallucinations, and cover breadth quickly.
    • Phase 2: PEFT (LoRA/QLoRA) on:
      • Instruction SFT corpus from SOP/BKM/spec-derived QA, failure analysis summaries, and yield RCA exemplars.
      • Conversational repairs where the base model struggles (terminology, units, SPC reasoning).
      • Optional DPO/KTO on preference pairs (prefer correct reasoning with citations over fast-but-shallow answers).
  • Data curation:
    • Generate seed QA from standards with templated queries (what/why/how/limits/safety).
    • Human-in-the-loop red teaming for edge cases: chamber matching, E90 substrate tracking nuances, E94 control job lifecycle, SECS/GEM message semantics.
  • Adapters:
    • Separate adapters per subdomain (standards, equipment, yield/RCA). Compose adapters via PEFT merging at inference if needed.
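
A minimal QLoRA setup along these lines, using Hugging Face transformers + peft, might look like the sketch below; the base checkpoint, rank, and target modules are illustrative assumptions, not tuned values.

```python
# Illustrative QLoRA adapter setup; hyperparameters and the base checkpoint are
# assumptions for the sketch, not recommended training settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

base = "meta-llama/Meta-Llama-3.1-8B-Instruct"

bnb = BitsAndBytesConfig(                      # 4-bit base weights for QLoRA
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base)
model = prepare_model_for_kbit_training(model)

# One adapter per subdomain (standards / equipment / yield-RCA) can reuse this config.
lora = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()             # sanity check: only adapter params are trainable
```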

4) Incremental learning and freshness

  • Continuous updates:
    • Daily doc watcher to re-parse, re-embed, and upsert into the vector DB (SLA <24 h); a minimal sketch follows this list.
    • Weekly or event-driven adapter refresh for new failure modes or recurring RCA patterns.
  • Model registry:
    • Track base model, adapter versions, embedding versions, corpus snapshot; automated A/B with rollback on regressions.
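
For the daily doc watcher, a minimal refresh job could look like this; `doc_store.changed_since()`, `parse_document()`, and `vector_db.upsert()` are hypothetical interfaces, and the embedder is one of the candidates listed in section 2.

```python
# Sketch of the daily freshness job: re-parse changed docs, re-embed, upsert with versions.
# doc_store, parse_document() and vector_db.upsert() are hypothetical stand-ins.
from datetime import datetime, timedelta
from sentence_transformers import SentenceTransformer

EMBEDDING_VERSION = "intfloat/e5-large-v2"      # one of the candidate embedders from section 2
embedder = SentenceTransformer(EMBEDDING_VERSION)

def refresh_corpus(doc_store, vector_db, parse_document, window_hours: int = 24) -> None:
    since = datetime.utcnow() - timedelta(hours=window_hours)
    for doc in doc_store.changed_since(since):                 # assumed document-store API
        chunks = parse_document(doc)                           # OCR + layout-aware parsing
        texts = ["passage: " + c["text"] for c in chunks]      # E5 models expect a passage prefix
        vectors = embedder.encode(texts, normalize_embeddings=True)
        vector_db.upsert(                                      # assumed vector-store API
            ids=[c["chunk_id"] for c in chunks],
            vectors=vectors,
            metadata=[{**c["metadata"], "embedding_version": EMBEDDING_VERSION} for c in chunks],
        )
```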

5) Terminology and semantic competency

  • Glossary + normalization:
    • Central dictionary for process steps, metrology parameters (CD, overlay, thickness, resistivity), defect classes (particles, scratches, residue, bridging), with synonyms and unit constraints.
    • Pre-prompt injection of glossary snippets for disambiguation; RAG filters to prioritize glossary-adjacent passages (see the sketch after this list).
  • SPC/analytics reasoning aids:
    • Reusable prompt blocks for SPC/DOE/correlation analysis; enforce unit consistency and significance reporting.
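
A toy sketch of glossary injection with unit constraints; the dictionary contents below are illustrative, not the real glossary.

```python
# Toy glossary/normalization sketch; entries and units are placeholders.
GLOSSARY = {
    "cd":      {"canonical": "critical dimension (CD)", "unit": "nm"},
    "overlay": {"canonical": "overlay error",           "unit": "nm"},
    "rs":      {"canonical": "sheet resistance",        "unit": "ohm/sq"},
}
SYNONYMS = {"critical dimension": "cd", "ovl": "overlay", "sheet rho": "rs"}

def glossary_snippet(query: str) -> str:
    """Return glossary lines for terms found in the query, ready for pre-prompt injection."""
    q = query.lower()
    hits = {SYNONYMS.get(term, term) for term in list(SYNONYMS) + list(GLOSSARY) if term in q}
    lines = [
        f"- {GLOSSARY[k]['canonical']}: report values in {GLOSSARY[k]['unit']}"
        for k in sorted(hits) if k in GLOSSARY
    ]
    if not lines:
        return ""
    return "Glossary:\n" + "\n".join(lines)

# e.g. glossary_snippet("Why is OVL drifting on chamber A?")
# -> "Glossary:\n- overlay error: report values in nm"
```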

6) Security, privacy, compliance (on‑prem/private cloud)

  • Isolation: Air‑gapped or VPC-isolated serving; outbound egress blocked by default.
  • Encryption: TLS in transit, full-disk or volume-level encryption at rest; KMS-backed keys.
  • PII/IP controls: Content classifiers to prevent leakage of proprietary recipes; allowlist sources; redact before logging (sketched below).
  • Compliance: ITAR/EAR data zoning, admin-only model pull, SBOM for runtimes, signed artifacts, reproducible builds.
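
A toy rule-based redaction pass for the logging path might look like this; the patterns are placeholder assumptions, and production would pair them with the content classifiers mentioned above.

```python
# Toy redaction pass applied before prompts/responses are written to logs.
# The patterns are placeholders; real rules would be fab-specific and classifier-backed.
import re

REDACTION_RULES = [
    (re.compile(r"\bRCP[-_][A-Z0-9]+\b"), "[RECIPE_ID]"),            # assumed recipe-naming pattern
    (re.compile(r"\bLOT[-_]?\d{6,}\b"), "[LOT_ID]"),                  # assumed lot-number pattern
    (re.compile(r"\b\d+(\.\d+)?\s?(sccm|mTorr|kW)\b"), "[PROCESS_PARAM]"),
]

def redact(text: str) -> str:
    """Replace proprietary identifiers and process parameters before logging."""
    for pattern, token in REDACTION_RULES:
        text = pattern.sub(token, text)
    return text
```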

7) Evaluation, safety, and quality gates

  • Automatic eval:
    • Domain QA benchmark: terminology, standards citation accuracy (section-level), tool/equipment procedures, SPC reasoning with numeric correctness.
    • RAG faithfulness checks (answers must cite retrieved sources); a minimal coverage check is sketched after this list.
  • Human review loops:
    • SME review for new adapters and high-risk responses (safety-critical steps).
  • Non-functional:
    • Latency SLOs: <500 ms for grounded retrieval, <2.0 s for a full answer on an 8B model; <100 ms alert-generation paths rely on precomputed templates plus short-context inference.
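
A minimal citation-coverage check for the faithfulness gate, assuming the bracketed citation format from the prompting sketch in section 2 and the metadata schema above:

```python
# Minimal citation-coverage check: every cited (standard, section) pair in the answer
# must appear in the metadata of the retrieved passages. The citation format follows
# the earlier prompt sketch; metadata keys are assumptions from the schema above.
import re

CITATION = re.compile(r"\[([A-Z]+(?:\s+[A-Z0-9.]+)*)\s*§\s*([\d.]+)\]")

def citation_coverage(answer: str, retrieved: list[dict]) -> float:
    """Return the fraction of citations in the answer that map to retrieved sources."""
    cited = set(CITATION.findall(answer))
    if not cited:
        return 0.0   # uncited normative answers are rejected by the upstream gate
    available = {(r["standard_id"], r["section"]) for r in retrieved}
    return len(cited & available) / len(cited)
```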

8) Deployment blueprint

  • Serving: vLLM/Triton on Kubernetes; Horizontal Pod Autoscaler; GPU partitioning (MIG if NVIDIA).
  • Observability: Prompt/response traces (sans secrets), retrieval stats, citation coverage, hallucination flags.
  • Tooling stack: Prefect/Argo for pipelines, MLflow for registry, Grafana/Loki for ops, Weaviate/pgvector for vectors, MinIO/S3 for artifacts.

9) Risks and mitigations

  • Hallucination on standards: Stronger rerank + answer must cite; block non-cited normative statements.
  • Adapter drift: Canary deploy, eval gates, rollback; per-domain adapters avoid global degradation.
  • Latency spikes: Quantization + paged attention + streaming; pre-warm pools; route long answers to background generation.

10) Phased delivery for Requirement 1

  • Phase 1 (2–3 weeks): RAG MVP
    • Stand up vLLM with Llama 3.1 8B; ingest SEMI/JEDEC + SOPs; hybrid retrieval; reranker; citation-enforced prompting; eval harness.
  • Phase 2 (3–4 weeks): Domain LoRA/QLoRA
    • Build SFT dataset; train adapters; A/B vs. Phase 1; improve terminology/standards Q&A and SPC reasoning.
  • Phase 3 (2 weeks, parallelizable): Incremental learning + security hardening
    • Automated ingestion with <24h SLA; model registry; egress blocks, encryption, audit; adapter canary + rollback.
  • Phase 4 (ongoing): Expansion
    • Add multilingual content; equipment vendor manuals; recipe optimization exemplars; preference tuning from UAT.

Why this works for your acceptance criteria

  • 7B+ models with swappable backends: satisfies model breadth and technical reasoning.
  • SEMI/JEDEC incorporation via RAG with structured metadata and source citation.
  • LoRA/QLoRA adapters for efficient domain adaptation and quick iteration.
  • Incremental learning through scheduled re-embedding and adapter refresh.
  • On‑prem/private cloud deployment with full encryption and zero data egress.
  • Explicit terminology handling and evals to ensure correct semiconductor language and SPC logic.
