Y.C Lee

Design Document: LLM Service

LLM Service

  • Models:
    • Primary: Llama 3 70B or Qwen 72B for complex reasoning
    • Secondary: Mistral 7B for faster responses
    • Specialized: CodeLlama for process recipe analysis
  • Deployment:
    • Model serving via TensorRT or vLLM
    • GPU clusters (NVIDIA A100/H100)
    • Auto-scaling based on request volume
  • Fine-tuning: LoRA adapters for semiconductor domain adaptation

Best-solution proposal for Requirement 1: LLM Integration and Customization

  • Core stance: A private, swap-in LLM architecture with policy-enforced data boundaries; RAG-first for fast impact and content freshness; parameter-efficient domain adaptation (LoRA/QLoRA) for focused improvements; and rigorous eval loops to ensure technical accuracy and provenance.

1) Foundation models and runtime

  • Base models (supported, swappable):
    • Llama 3.1 8B Instruct (latency-friendly on-prem), Llama 3.1 70B (higher accuracy), Mistral 7B Instruct, Qwen2-7B Instruct, CodeLlama for code/tooling prompts.
  • Runtime: vLLM or TensorRT-LLM for high-throughput serving; quantization: INT4/FP8 for 7–8B models; BF16 for 70B. Enable paged KV cache and continuous batching.
  • Tokenizer extensions: Add special tokens for SEMI/JEDEC acronyms and fab-specific terms to reduce fragmentation and improve grounding.
  • Model router (optional): Policy to route “code/tools” to CodeLlama, “reasoning/contextual” to Llama/Qwen/Mistral; fallback if latency/accuracy thresholds not met.
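
As a rough illustration of the routing policy above, a minimal sketch follows; the served model names, latency figures, and the keyword heuristic are placeholder assumptions rather than a fixed design.

```python
# Minimal sketch of the optional model router; names, thresholds, and the
# keyword heuristic are placeholders, not a committed design.
from dataclasses import dataclass

@dataclass
class Route:
    model: str               # model name as served by vLLM / TensorRT-LLM
    typical_latency_ms: int  # rough budget the route needs for a full answer

ROUTES = {
    "code":      Route("codellama-13b-instruct", 2000),
    "reasoning": Route("llama-3.1-70b-instruct", 4000),
    "default":   Route("llama-3.1-8b-instruct", 1500),
}

CODE_HINTS = ("secs/gem", "recipe body", "script", "json", "regex")

def classify_intent(prompt: str) -> str:
    """Rough keyword heuristic; a small classifier could replace this later."""
    p = prompt.lower()
    if any(hint in p for hint in CODE_HINTS):
        return "code"
    if len(p.split()) > 120:   # long, contextual questions -> larger model
        return "reasoning"
    return "default"

def pick_model(prompt: str, latency_budget_ms: int) -> str:
    route = ROUTES[classify_intent(prompt)]
    # Fall back to the latency-friendly default if the budget is tighter than the route needs.
    if latency_budget_ms < route.typical_latency_ms:
        return ROUTES["default"].model
    return route.model
```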

2) RAG architecture and domain knowledge

  • Ingestion sources: SEMI E10/E30/E40/E90/E94, JEDEC; process recipes, FA reports, yield learning docs, equipment manuals/specs; multilingual SOPs/BKMs.
  • Parsing & normalization:
    • PDFs/Docs/PPT via OCR + layout-aware parsing; persist canonical text + structure.
    • Metadata schema: {standard_id, revision, section, process_step, equipment_vendor, equipment_family, recipe, parameter, unit, language, confidentiality, effective_date}.
  • Chunking:
    • Hybrid: semantic sections (headings, list items) + sliding windows (256–512 tokens with 20–25% overlap) for specs; small blocks (150–300 tokens) for SOPs/BKMs; table-aware extraction.
  • Embeddings:
    • Multilingual technical embeddings served on‑prem: BGE-M3 or E5-large-v2; optional dual encoders for passages vs. tables. Re-embed on change; store the embedding version.
  • Vector store: Weaviate or pgvector (if you prefer Postgres ops); Chroma for simple on‑prem POC. HNSW index, metadata filters (standard, process, equipment).
  • Retrieval:
    • Hybrid BM25 + dense; rerank with cross-encoder (bge-reranker-large) for precision.
    • Source-attribution with section IDs and page/figure refs.
  • Prompting:
    • Process-aware system prompts, explicit grounding rules, instruction to cite standard section/figure for normative claims.
    • Structured output for checklists, parameters, and units.
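
To make the retrieval path concrete, here is a minimal sketch of retrieve → rerank → citation-enforced prompt. The `vector_store` object and its `search()` signature are hypothetical stand-ins for a Weaviate/pgvector client; the reranker uses the sentence-transformers `CrossEncoder` wrapper, and the metadata keys mirror the schema above.

```python
# Sketch of hybrid retrieval -> cross-encoder rerank -> citation-enforced prompt.
# `vector_store` and its search() signature are hypothetical stand-ins for a
# Weaviate/pgvector client; the reranker call uses the sentence-transformers CrossEncoder API.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-large")

def build_grounded_prompt(query: str, vector_store, top_k: int = 20, keep: int = 5) -> str:
    # 1) Hybrid (BM25 + dense) retrieval with metadata filters -- assumed store API.
    hits = vector_store.search(query, top_k=top_k, filters={"language": "en"})

    # 2) Cross-encoder rerank for precision; keep only the strongest passages.
    scores = reranker.predict([(query, h["text"]) for h in hits])
    ranked = sorted(zip(hits, scores), key=lambda pair: pair[1], reverse=True)[:keep]

    # 3) Citation-enforced prompt: every normative claim must cite standard + section.
    sources = "\n\n".join(
        f"[{h['standard_id']} §{h['section']}] {h['text']}" for h, _ in ranked
    )
    return (
        "Answer using ONLY the sources below. Cite the standard and section "
        "(e.g. [SEMI E40 §5.2]) for every normative claim; reply 'not in sources' "
        "when the sources do not cover the question.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {query}"
    )
```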

3) Domain adaptation (LoRA/QLoRA)

  • Strategy:
    • Phase 1: RAG-only to validate grounding, reduce hallucinations, and cover breadth quickly.
    • Phase 2: PEFT (LoRA/QLoRA) on:
      • Instruction SFT corpus from SOP/BKM/spec-derived QA, failure analysis summaries, and yield RCA exemplars.
      • Conversational repairs where the base model struggles (terminology, units, SPC reasoning).
      • Optional DPO/KTO on preference pairs (prefer correct reasoning with citations over fast-but-shallow answers).
  • Data curation:
    • Generate seed QA from standards with templated queries (what/why/how/limits/safety).
    • Human-in-the-loop red teaming for edge cases: chamber matching, E90 substrate tracking nuances, E94 control job lifecycle, SECS/GEM message semantics.
  • Adapters:
    • Separate adapters per subdomain (standards, equipment, yield/RCA). Compose adapters via PEFT merging at inference if needed.
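
A minimal QLoRA setup along these lines, using Hugging Face transformers + peft, might look like the sketch below; the base checkpoint, rank, and target modules are illustrative assumptions, not tuned values.

```python
# Illustrative QLoRA adapter setup; hyperparameters and the base checkpoint are
# assumptions for the sketch, not recommended training settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

base = "meta-llama/Meta-Llama-3.1-8B-Instruct"

bnb = BitsAndBytesConfig(                      # 4-bit base weights for QLoRA
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base)
model = prepare_model_for_kbit_training(model)

# One adapter per subdomain (standards / equipment / yield-RCA) can reuse this config.
lora = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()             # sanity check: only adapter params are trainable
```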

4) Incremental learning and freshness

  • Continuous updates:
    • Daily doc watcher to re-parse, re-embed, and upsert into the vector DB (SLA <24 h); a minimal sketch follows this list.
    • Weekly or event-driven adapter refresh for new failure modes or recurring RCA patterns.
  • Model registry:
    • Track base model, adapter versions, embedding versions, corpus snapshot; automated A/B with rollback on regressions.
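
For the daily doc watcher, a minimal refresh job could look like this; `doc_store.changed_since()`, `parse_document()`, and `vector_db.upsert()` are hypothetical interfaces, and the embedder is one of the candidates listed in section 2.

```python
# Sketch of the daily freshness job: re-parse changed docs, re-embed, upsert with versions.
# doc_store, parse_document() and vector_db.upsert() are hypothetical stand-ins.
from datetime import datetime, timedelta
from sentence_transformers import SentenceTransformer

EMBEDDING_VERSION = "intfloat/e5-large-v2"      # one of the candidate embedders from section 2
embedder = SentenceTransformer(EMBEDDING_VERSION)

def refresh_corpus(doc_store, vector_db, parse_document, window_hours: int = 24) -> None:
    since = datetime.utcnow() - timedelta(hours=window_hours)
    for doc in doc_store.changed_since(since):                 # assumed document-store API
        chunks = parse_document(doc)                           # OCR + layout-aware parsing
        texts = ["passage: " + c["text"] for c in chunks]      # E5 models expect a passage prefix
        vectors = embedder.encode(texts, normalize_embeddings=True)
        vector_db.upsert(                                      # assumed vector-store API
            ids=[c["chunk_id"] for c in chunks],
            vectors=vectors,
            metadata=[{**c["metadata"], "embedding_version": EMBEDDING_VERSION} for c in chunks],
        )
```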

5) Terminology and semantic competency

  • Glossary + normalization:
    • Central dictionary for process steps, metrology parameters (CD, overlay, thickness, resistivity), defect classes (particles, scratches, residue, bridging), with synonyms and unit constraints.
    • Pre-prompt injection of glossary snippets for disambiguation; RAG filters to prioritize glossary-adjacent passages (see the sketch after this list).
  • SPC/analytics reasoning aids:
    • Reusable prompt blocks for SPC/DOE/correlation analysis; enforce unit consistency and significance reporting.
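
A toy sketch of glossary injection with unit constraints; the dictionary contents below are illustrative, not the real glossary.

```python
# Toy glossary/normalization sketch; entries and units are placeholders.
GLOSSARY = {
    "cd":      {"canonical": "critical dimension (CD)", "unit": "nm"},
    "overlay": {"canonical": "overlay error",           "unit": "nm"},
    "rs":      {"canonical": "sheet resistance",        "unit": "ohm/sq"},
}
SYNONYMS = {"critical dimension": "cd", "ovl": "overlay", "sheet rho": "rs"}

def glossary_snippet(query: str) -> str:
    """Return glossary lines for terms found in the query, ready for pre-prompt injection."""
    q = query.lower()
    hits = {SYNONYMS.get(term, term) for term in list(SYNONYMS) + list(GLOSSARY) if term in q}
    lines = [
        f"- {GLOSSARY[k]['canonical']}: report values in {GLOSSARY[k]['unit']}"
        for k in sorted(hits) if k in GLOSSARY
    ]
    if not lines:
        return ""
    return "Glossary:\n" + "\n".join(lines)

# e.g. glossary_snippet("Why is OVL drifting on chamber A?")
# -> "Glossary:\n- overlay error: report values in nm"
```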

6) Security, privacy, compliance (on‑prem/private cloud)

  • Isolation: Air‑gapped or VPC-isolated serving; outbound egress blocked by default.
  • Encryption: TLS in transit, full-disk or volume-level encryption at rest; KMS-backed keys.
  • PII/IP controls: Content classifiers to prevent leakage of proprietary recipes; allowlist sources; redact before logging (sketched below).
  • Compliance: ITAR/EAR data zoning, admin-only model pull, SBOM for runtimes, signed artifacts, reproducible builds.
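
A toy rule-based redaction pass for the logging path might look like this; the patterns are placeholder assumptions, and production would pair them with the content classifiers mentioned above.

```python
# Toy redaction pass applied before prompts/responses are written to logs.
# The patterns are placeholders; real rules would be fab-specific and classifier-backed.
import re

REDACTION_RULES = [
    (re.compile(r"\bRCP[-_][A-Z0-9]+\b"), "[RECIPE_ID]"),            # assumed recipe-naming pattern
    (re.compile(r"\bLOT[-_]?\d{6,}\b"), "[LOT_ID]"),                  # assumed lot-number pattern
    (re.compile(r"\b\d+(\.\d+)?\s?(sccm|mTorr|kW)\b"), "[PROCESS_PARAM]"),
]

def redact(text: str) -> str:
    """Replace proprietary identifiers and process parameters before logging."""
    for pattern, token in REDACTION_RULES:
        text = pattern.sub(token, text)
    return text
```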

7) Evaluation, safety, and quality gates

  • Automatic eval:
    • Domain QA benchmark: terminology, standards citation accuracy (section-level), tool/equipment procedures, SPC reasoning with numeric correctness.
    • RAG faithfulness checks (answers must cite retrieved sources); a minimal coverage check is sketched after this list.
  • Human review loops:
    • SME review for new adapters and high-risk responses (safety-critical steps).
  • Non-functional:
    • Latency SLOs: <500 ms for grounded retrieval, <2.0 s for a full answer on an 8B model; <100 ms alert-generation paths rely on precomputed templates plus short-context inference.
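
A minimal citation-coverage check for the faithfulness gate, assuming the bracketed citation format from the prompting sketch in section 2 and the metadata schema above:

```python
# Minimal citation-coverage check: every cited (standard, section) pair in the answer
# must appear in the metadata of the retrieved passages. The citation format follows
# the earlier prompt sketch; metadata keys are assumptions from the schema above.
import re

CITATION = re.compile(r"\[([A-Z]+(?:\s+[A-Z0-9.]+)*)\s*§\s*([\d.]+)\]")

def citation_coverage(answer: str, retrieved: list[dict]) -> float:
    """Return the fraction of citations in the answer that map to retrieved sources."""
    cited = set(CITATION.findall(answer))
    if not cited:
        return 0.0   # uncited normative answers are rejected by the upstream gate
    available = {(r["standard_id"], r["section"]) for r in retrieved}
    return len(cited & available) / len(cited)
```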

8) Deployment blueprint

  • Serving: vLLM/Triton on Kubernetes; Horizontal Pod Autoscaler; GPU partitioning (MIG if NVIDIA).
  • Observability: Prompt/response traces (sans secrets), retrieval stats, citation coverage, hallucination flags.
  • Tooling stack: Prefect/Argo for pipelines, MLflow for registry, Grafana/Loki for ops, Weaviate/pgvector for vectors, MinIO/S3 for artifacts.

9) Risks and mitigations

  • Hallucination on standards: Stronger rerank + answer must cite; block non-cited normative statements.
  • Adapter drift: Canary deploy, eval gates, rollback; per-domain adapters avoid global degradation.
  • Latency spikes: Quantization + paged attention + streaming; pre-warm pools; route long answers to background generation.

10) Phased delivery for Requirement 1

  • Phase 1 (2–3 weeks): RAG MVP
    • Stand up vLLM with Llama 3.1 8B; ingest SEMI/JEDEC + SOPs; hybrid retrieval; reranker; citation-enforced prompting; eval harness.
  • Phase 2 (3–4 weeks): Domain LoRA/QLoRA
    • Build SFT dataset; train adapters; A/B vs. Phase 1; improve terminology/standards Q&A and SPC reasoning.
  • Phase 3 (2 weeks, parallelizable): Incremental learning + security hardening
    • Automated ingestion with <24h SLA; model registry; egress blocks, encryption, audit; adapter canary + rollback.
  • Phase 4 (ongoing): Expansion
    • Add multilingual content; equipment vendor manuals; recipe optimization exemplars; preference tuning from UAT.

Why this works for your acceptance criteria

  • 7B+ models with swappable backends: satisfies model breadth and technical reasoning.
  • SEMI/JEDEC incorporation via RAG with structured metadata and source citation.
  • LoRA/QLoRA adapters for efficient domain adaptation and quick iteration.
  • Incremental learning through scheduled re-embedding and adapter refresh.
  • On‑prem/private cloud deployment with full encryption and zero data egress.
  • Explicit terminology handling and evals to ensure correct semiconductor language and SPC logic.
