Case Study: Multilingual Video Platform with Local LLM Pipeline

01 — The Problem

Two AI workloads with incompatible shapes, sharing one piece of hardware.

A streaming platform for sermon video has 15,000+ videos on it today, growing weekly, across three languages. Search that works — meaning semantic understanding of what the preacher actually said, not keyword-matching on the title — is table stakes. So is multilingual captioning and translation. So is content tagging that helps a visitor find a sermon about selective obedience when what they typed into the search box was "why do I keep running from what God asked me to do."

All of that is AI work. All of it has to run against a video corpus that can't reasonably live on an OpenAI API budget at catalog scale, and a substantial portion of which belongs to churches that have strong and legitimate preferences about where their sermon recordings get sent for processing. The platform runs its own inference. That decision drove every architectural decision downstream of it.

The AI work splits into two fundamentally different workloads:

A user just uploaded a sermon through the admin UI. They're watching a progress bar. They want the transcript, translations, and analysis back in minutes — not tomorrow. This is one video, right now, with a human waiting for it.
We just imported a YouTube channel with 3,000 sermons on it. No human is watching any individual progress bar. We want the entire catalog processed overnight, at the highest throughput the GPU can sustain, and we want it to pick up cleanly where it left off if the workstation reboots.

These two workloads share a hardware budget — one RTX 4090 with 24 GB of VRAM, plus a DGX Spark sitting in the same lab for heavier LLM passes. They share an HTTP contract — a FastAPI service called sermon-api that exposes /transcribe and /analyze. They share a three-pass LLM analysis pipeline, a deterministic confidence scorer, and a hand-curated Bible reference corpus. But the execution layer — the code that moves bytes through the GPU — is deliberately different, because the two workloads want different things from the same hardware.

A word on the hardware choice, because it tends to prompt a specific kind of question. The constraints that pushed this project toward local inference — cost predictability at catalog scale and data sensitivity in the source material — aren't universal. The architectural discipline is. The two-path structure, the deterministic layers wrapping every LLM call, the GPU budgeting as a first-class engineering concern — all of it applies equally to cloud deployments. Swap the 4090 for an A10G on AWS or an L4 on GCP, and the same patterns work without modification; the hardware details change, the engineering doesn't. What local gives you here is the freedom to make decisions about quantization, memory budgeting, and batch scheduling that you cannot make when the model is on the other side of an opaque API. That freedom is the point, and it's available on rented hardware too.

Trying to serve both workloads from a single pipeline is how you build a system that's bad at both jobs. Latency optimization and throughput optimization pull in opposite directions, and the honest answer is to stop pretending they're one problem.

This case study is about that decision and its consequences: why two pipelines, how they share a substrate without colliding, how the AI layers are designed so the deterministic parts catch the LLM when it hallucinates, and the operational reality of running all of it on hardware you physically own.

02 — Architecture

Two paths, one substrate.

The platform's backend is a Go service (Echo + sqlc + PostgreSQL on Railway) that owns users, videos, the job queue, presigned R2 URLs, and the frontend. It has zero AI dependencies in its binary. Everything that touches a model goes over HTTP to sermon-api, a Python FastAPI service that runs on a workstation at the edge of the home lab.

sermon-api is the shared substrate. It exposes a small, stable contract:

POST /transcribe — submit a presigned video URL, get back a jobId
GET /transcribe/{jobId} — poll for status and result
POST /analyze — submit a transcript, get back a jobId
GET /status/{jobId} — poll analysis status

Underneath that contract, two execution paths branch based on workload shape:

Two paths into the same FastAPI surface. Each path is optimized for the workload that uses it; the substrate — prompts, scorer, corpus — is identical.

The two paths are not two systems in the "legacy versus new" sense. Both are in production today. Path A runs every time someone uploads a sermon through the admin UI. Path B runs every time the team imports a new YouTube channel or runs a backfill. The deployment is deliberate, because the two workloads genuinely do not want the same execution model.

Why the substrate is shared

The expensive parts of doing LLM work correctly are not the throughput parts — they're the correctness parts. Prompt design for a three-pass analysis. A confidence scorer that doesn't trust the model's self-report. A Bible reference validator that catches "Hezekiah 3:16" before it ships to users. Those three layers account for most of the engineering effort in the AI pipeline, and they're identical whether a transcript arrived via Path A or Path B. Every byte of business logic that could live in the substrate does.

The paths diverge where the workload shape demands it — concurrency primitives, audio handling, GPU scheduling — and converge back at the HTTP boundary. Separation of concerns isn't an aesthetic choice here. It's what makes the substrate worth building once.

03 — Path A · Online

Latency-first, single video, human in the loop.

When a user uploads a sermon through the admin interface, the Go backend creates a row in the jobs table, returns a job ID, and begins polling. The React frontend polls the backend. The backend polls sermon-api. The user sees a progress bar tick from 0 to 100.

That chain's only job is to keep the latency tolerable for the person who is watching it work. Everything about Path A follows from that constraint.

Per-type concurrency pools in the Go worker

The Go backend runs a worker with four concurrency pools, one per job type:

// internal/worker/worker.go:802-810
func (w *Worker) processJobs(ctx context.Context) {
    // Each AI job type gets its own concurrency pool
    w.processPoolJobs(ctx, "transcription", []string{domain.JobTypeAITranscription}, w.transcriptionMax)
    w.processPoolJobs(ctx, "translation",   []string{domain.JobTypeAITranslation},   w.translationMax)
    w.processPoolJobs(ctx, "analysis",      []string{domain.JobTypeAIAnalysis},      w.analysisMax)

    // Non-AI jobs: YouTube import, cleanup, etc.
    w.processGeneralJobs(ctx)
}

The reason for four pools rather than one is that the three AI job types bottleneck on different resources. Transcription is GPU-bound — the 4090 is running faster-whisper. Translation is network-bound — the job is waiting on either an OpenAI API response or a LAN round-trip to a local LLM. Analysis is also GPU-bound but hits a different box (the DGX Spark). If you collapse them into a single "AI pool," a backlog of network-bound translation jobs will happily consume every slot and starve the GPU-bound transcriptions that need them.

Per-type pools mean translation backpressure stays inside its lane. The pool caps are configurable but default to 2 for each AI type — enough that a slow request doesn't block the next, not so many that concurrent requests fight over the GPU.

In-memory job store, on purpose

sermon-api stores in-flight transcription jobs in a plain Python dict. No Redis, no Postgres, no persistent queue:

# sermon_api.py — transcription_store is an in-memory dict keyed by job_id
@app.post("/transcribe", response_model=TranscribeSubmitResponse)
async def submit_transcription(request: TranscribeRequest, background_tasks: BackgroundTasks):
    job_id = request.jobId or str(uuid.uuid4())
    initial_record = TranscribeResponse(
        jobId=job_id,
        status=ProcessingStatus.PENDING,
        whisperModel=Config.WHISPER_MODEL,
        createdAt=datetime.now().isoformat()
    )
    await transcription_store.set(job_id, initial_record)
    background_tasks.add_task(process_transcription, job_id, request.url)

This is a deliberate choice with a failure mode the Go caller has to handle. If sermon-api restarts while a job is in flight, the job ID disappears. The next poll from Go gets a 404 that is technically correct — the job doesn't exist anymore — but indistinguishable from "you mistyped the job ID" or "the job was never created."

The Go transcription client distinguishes "404 because the service forgot" from "404 because you mistyped" with two independent counters:

// internal/transcription/client.go:1117-1124
const MaxConsecutiveNotFound = 3   // 3 × 404 → fail: "job may have been deleted"
const MaxConsecutiveErrors  = 10   // 10 × non-404 errors → fail
const MaxPollCount          = 600  // 600 × 3s = ~30 min budget

// Inside WaitForCompletion: 404s reset on any non-404 response.
// Non-404 errors reset on any success. Orthogonal counters for
// orthogonal failure modes.

The dual-reset is the subtle part. A naive single-counter client trips on a mixed-fault pattern (a 404 followed by a 502 followed by a 200 followed by another 404) and fails when the real state of the world is "intermittent transient errors, job is fine." Orthogonal counters resolve that mix correctly.

Why keep the in-memory store instead of adding Redis? Because latency. Under normal operation, a Path A transcription completes in under 5 minutes — well inside sermon-api's uptime windows. Restarts during a job are rare. The cost of handling the rare case in the client (~20 lines of Go) is lower than the cost of the dependency on a distributed queue we don't otherwise need. When a simpler thing works, the simpler thing is the right thing.

Whisper `base`, not `medium`

The online path uses OpenAI Whisper's base model (~1 GB VRAM, ~74 MB on disk). That's a deliberate latency-first decision. Base loads in seconds, transcribes a 45-minute sermon in about a minute on the 4090, and produces output that the downstream LLM analysis can correct for minor name misspellings in context. If the user uploads a sermon at 9:52 AM and the analysis appears at 9:58 AM, nobody cares about the 3% WER difference between base and medium. They care that the progress bar filled.

The batch path makes the opposite trade, and I'll get to that in a minute.

Presigned URL refresh

Cloudflare R2 presigned URLs carry a TTL. The orchestration code signs them for two hours. For Path A, that's typically fine — the entire pipeline completes in well under two hours. For chained jobs (transcription → translation → analysis, all flowing off the transcription's completion), it's sometimes not:

// internal/aijobs/transcription_processor.go:205 — refreshVideoURL
// Re-fetch the canonical video row from Postgres, extract the R2 key,
// sign fresh. Called every time the processor runs, even for jobs that
// just got enqueued. One extra DB read; elimination of an entire class
// of "403 Forbidden" flake.

The pattern matters beyond this codebase: presigned URLs are a cache, and caches have TTLs. If your job queue latency can exceed your URL TTL, you don't trust the URL — you rebuild it. This kind of thing is easy to skip in the initial design and expensive to add after you've been paged for it.

04 — Path B · Batch

Throughput-first, thousands of videos, nobody watching.

When the team imports a YouTube channel with 3,000 sermons on it, the online path's shape is wrong. It's not that it would fail — it would complete, eventually. It's that two AI transcriptions running concurrently (the online path's default) against a 4090 using ~4 GB of VRAM for base while sitting idle the rest of the time is a criminal waste of the hardware.

The batch path exists to fix that. It lives outside the Go worker, talks directly to Postgres and to sermon-api, and is allowed to make decisions the online path cannot.

The origin story

The first batch implementation was the obvious one: reuse the online path, turn the concurrency knob up, let it run. After a month of running off and on, it had processed fewer than 2,500 videos out of a catalog that was growing faster than that.

The diagnosis had three parts:

Per-worker serial pipeline. Each worker was doing download → ffmpeg → transcribe in sequence. Since download and ffmpeg are I/O-bound and transcribe is GPU-bound, the GPU was idle 40–60% of the time — every worker spent its download minutes waiting on R2 and its ffmpeg minutes waiting on CPU, with the GPU starving the whole time.
Too few workers. The default was 4 on a 64-thread, 125 GB-RAM workstation. Hardware utilization was in the low single digits.
No phase separation. Even if you ran 64 workers, they'd all still fight each other for network and for GPU simultaneously — adding more workers just moves the contention around.

The rebuild fixed all three at once.

Two phases, each given the full machine

The rebuild: separate I/O-bound work from GPU-bound work, give each phase the full machine, measure.

Phase 1 is a ThreadPoolExecutor of 48 threads where each thread runs an ffmpeg child process. The ffmpeg invocation does something specific — it reads the R2 presigned URL directly, decodes the MP4 container over HTTP using range requests, extracts only the audio track, and writes 16 kHz mono PCM to local disk:

# batch_transcribe.py:144-167
def extract_audio_from_url(url: str, audio_path: Path, retries: int = 3) -> None:
    """Stream video from R2, extract audio only. No local video file needed.
    ffmpeg reads the URL directly, demuxes the audio track, writes 16kHz mono WAV.
    Only the audio stream bytes are transferred (~50MB vs ~500MB for a sermon video)."""
    cmd = [
        "ffmpeg",
        "-i", url,
        "-vn",                      # drop video track (this is the savings)
        "-acodec", "pcm_s16le",     # PCM whisper consumes natively
        "-ar", "16000", "-ac", "1", # whisper's native rate, mono
        "-y", str(audio_path),
    ]
    for attempt in range(1, retries + 1):
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=600)
        if result.returncode == 0:
            return
        if attempt < retries:
            audio_path.unlink(missing_ok=True)
            time.sleep(attempt * 3)
        else:
            raise RuntimeError(f"ffmpeg stream failed: {result.stderr[:500]}")

Two optimizations compound here:

~10× per-video data reduction — a 500 MB sermon container becomes a 50 MB audio file because we drop the video track at the source. This is the 90% bandwidth figure on the case summary.
~4.5× concurrency multiplier — 48 parallel ffmpeg streams sustain 85–93 MB/s (~70–75% of the 1 Gbps NIC), versus the ~20 MB/s a single curl can pull from R2 under the same conditions.

Phase 2 is a ProcessPoolExecutor of 14 processes where each process holds its own copy of Whisper medium in VRAM. Spawn, not fork, because CUDA cannot survive a fork:

# batch_transcribe.py
multiprocessing.set_start_method("spawn")  # CUDA forbids fork

_worker_model = None
def _init_worker_model(model_size: str, device: str):
    global _worker_model
    if _worker_model is not None:
        return
    import whisper
    log.info("Worker loading whisper '%s' on %s...", model_size, device)
    _worker_model = whisper.load_model(model_size, device=device)

Load-once-per-worker is important. Each process initializes lazily on the first task it receives and keeps the model resident for the process's lifetime — no per-task VRAM churn, no allocation fragmentation, no GIL contention because there's no GIL crossing.

Whisper `medium`, not `base`

The batch path uses Whisper medium (~1.5 GB VRAM, noticeably better WER than base). This is the opposite trade from Path A. The batch path can afford medium's better accuracy because:

Nobody is waiting on any individual video, so the slightly longer per-video transcription time is invisible.
Model-load time is amortized across hundreds of videos per worker lifetime.
Backfilling the catalog once with better transcripts is worth more than doing it twice with worse ones.

Same hardware, same library, same basic job — different model choice driven by different workload shape. The architecture of the decision is more interesting than the decision itself.

Resumability is free

Each video's audio is written to /mnt/storage/batch_audio/{video_id}/audio.wav. Phase 2 scans that directory at startup and reconstructs its worklist from disk. This means:

python batch_transcribe.py runs end to end
python batch_transcribe.py --phase1-only just extracts audio (useful for scheduling NIC-heavy work during off-hours)
python batch_transcribe.py --phase2-only just transcribes whatever's already on disk (useful for resuming after a crash, or for running Phase 2 on a different box)

If a Phase 2 crashes halfway through 800 videos, you don't lose the 800 audio extractions. The disk is the queue. This is another boring-primitives decision that looks unremarkable until you actually need it.

The overnight translation backfill

The analogous batch path for translation is bulk_translate.py, which chunks each video's segments (25 per chunk) and distributes the chunks to a thread pool. One overnight run processed 5,522 translation jobs with zero permanent failures — the "zero permanent" is load-bearing, because it's not the same as zero retries.

Two design decisions make zero-permanent-failure achievable. First, the chunker works below the video level: a video with 400 segments becomes 16 chunks, and workers steal from a shared chunk queue. Second, and more important, a chunk is only fatal if more than 20% of its segments fail to parse. Below that threshold, the missed segments are kept as source-language text and logged as a warning:

# bulk_translate.py:360
missed = len(chunk.texts) - len(applied_local)
if missed > 0:
    # Fatal only if the model dropped >20% of a chunk — that
    # suggests a real formatting failure, not just a filler word.
    if missed > max(2, len(chunk.texts) // 5):
        job.errors.append(f"chunk@{chunk.start}: only {len(applied_local)}/{len(chunk.texts)} parsed")
    else:
        job.warnings.append(f"chunk@{chunk.start}: {missed} segment(s) kept as source")

The point is honesty about what "zero failures" means. The honest definition is "every video shipped a row that met the quality bar," not "every segment perfectly translated." The LLM will occasionally decide not to translate an English "um" in a Spanish transcript, and the right engineering response to that is to encode the tolerance explicitly, in code, not in a comment.

05 — GPU Budgeting

24 GB is a constraint, not a number to round up from.

An RTX 4090 has 24 GB of VRAM. That's the entire compute budget for two concurrent AI workloads (transcription and LLM analysis), an OS that the workstation also has to serve, a CUDA driver cache, and whatever PyTorch decides to allocate for activation buffers and intermediate state. Budgeting that space carefully isn't a nice-to-have. It's the single most important engineering decision in the system.

Two configurations of the same 24 GB. Every kilobyte is accounted for.

AWQ quantization: where 7.4 GB of weights come from

Qwen2.5-14B has 14.77 billion parameters. At FP16, that's roughly 29.5 GB of weights alone — obviously doesn't fit. The AWQ (Activation-aware Weight Quantization) 4-bit format brings weight storage down to:

14.77B params × 4 bits ÷ 8 bits/byte ≈ 7.4 GB weights
+ small FP16 activation-sensitive layers (~0.2 GB)

AWQ's trick is to keep the activation-sensitive layers in higher precision while quantizing the rest aggressively. In practice, AWQ loses a few tenths of a point on standard benchmarks versus FP16, which is a negligible cost for roughly 4× memory reduction. For a three-pass analysis that cares more about structural JSON output than about the last 0.3% of token quality, it's the right tradeoff.

The KV-cache pool: the part nobody tells you about

Weights are the obvious memory cost. The KV cache — the attention key/value tensors that LLMs cache per token during generation — is the non-obvious one. Back-of-envelope for Qwen2.5-14B at max_model_len=8192:

kv_per_token = 2 (K+V) × 48 layers × 40 heads × 128 dim × 2 bytes (fp16)
             ≈ 0.98 MB / token

full_context = 0.98 MB × 8192 ≈ 8 GB (!)

A single request with a full 8K-token context eats nearly 8 GB of KV cache. This is why you can't naively say "let's just run 14 concurrent requests at full context" — you'd need 112 GB of VRAM for the KV cache alone.

vLLM's PagedAttention solves this by allocating KV cache in 16-token blocks from a shared pool and only materializing blocks as tokens are actually produced. With our three-pass analysis, real prompt sizes are 2,000–4,500 tokens in and completions cap at 4,096, so the average request uses ~5 GB of KV cache, not 8 GB. The shared pool is ~7 GB total, and continuous batching — vLLM's scheduler admitting new requests as blocks free up — means the effective in-flight request count settles at 10–15 under sustained load. Not 14 hard slots; 10–15 concurrent sequences sharing blocks at the page level.

The single-line config that controls all of this

# vllm.service
ExecStart=.../python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-14B-Instruct-AWQ \
    --quantization awq \
    --host 0.0.0.0 --port 8000 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.65

Three flags do the work:

--gpu-memory-utilization 0.65 hard-caps vLLM's pre-allocation at 65% of VRAM (~15.6 GB). Everything else — the OS, CUDA driver cache, optional Whisper co-tenancy, PyTorch activation buffers — has to fit in the other 35%. This is the single most important knob.
--max-model-len 8192 bounds the worst-case per-sequence KV footprint. Without it, a runaway prompt could consume a full context and starve other requests.
--quantization awq tells vLLM to use the AWQ weight loader. Without it, vLLM would try to load FP16 weights and immediately OOM.

No LRU, no manual `empty_cache()`

None of this uses LRU eviction or periodic torch.cuda.empty_cache() calls. The budget is designed to not need them. The discipline is at the configuration boundary — set the caps correctly once, and the system stays inside them. Reaching for manual memory management is a sign the architecture is wrong.

06 — The Three LLM Passes

Same model, three roles, orthogonal failure modes.

A sermon transcript arrives at the analysis layer. One obvious implementation is a single mega-prompt: "extract the metadata AND review it theologically AND generate pastoral tags." I tried that first. The failure modes were maddening: when one section degraded, all of them did, and it was impossible to tell which instruction was at fault. A long prompt also consumes output budget — Qwen spends tokens satisfying the longest instruction first, and the later ones get truncated.

Splitting into three prompts with distinctly different roles — extractor, theological reviewer, pastoral counselor — produced orthogonal failure surfaces, cheaper retries, and the ability to run passes 2 and 3 in parallel because they only depend on pass 1's output.

Three passes, two deterministic gates. The scorer stands between the LLM and the user.

Pass 1 — Extraction

The system prompt is stern and specific:

PASS1_SYSTEM = (
    "You are a sermon analysis assistant. Return ONLY valid JSON — no markdown "
    "fences, no explanation, no preamble. Be concise: summary 2-3 sentences, "
    "not paragraphs. Keep string values brief."
)

The user prompt asks for a structured JSON object with a dozen specific fields — title, summary, main_points, scripture_references, biblical_themes, theological_concepts, tags, a suggested YouTube description, and so on. Temperature 0.3 — deterministic enough that re-running the same transcript yields nearly identical extractions, low enough that structured JSON tends to come out well-formed.

JSON parsing is paranoid because it has to be:

# sermon_api.py:381-398
def parse_llm_json_response(response_text: str) -> dict:
    """Parse LLM response that may contain JSON in markdown code blocks."""
    # Strip ... tags (some models emit these)
    response_text = re.sub(r'.*?', '', response_text, flags=re.DOTALL)

    # Try extracting from markdown code blocks (handles multiple blocks)
    md_match = re.search(r'```(?:json)?\s*\n?(.*?)```', response_text, re.DOTALL)
    if md_match:
        response_text = md_match.group(1)

    try:
        return json.loads(response_text.strip())
    except json.JSONDecodeError as e:
        return {
            "raw_response": response_text[:500],
            "parse_error": str(e),
            "error_type": "json_parse_failed"
        }

Even with strict "ONLY JSON" instructions, models emit leading "Sure, here's the JSON:", wrap payloads in ```json fences, or leak <think> blocks from reasoning modes. The parser handles all of that. When json.loads still fails, we don't crash — we return a structured parse_error envelope and the orchestrator retries with a shorter transcript (15k → 7.5k → 5k chars), on the theory that shorter inputs leave more output budget for well-formed structure.

Pass 2 — Theological review

The reviewer reads its own pass-1 output plus a fresh excerpt from the transcript, and is asked to identify five categories of issue: scripture reference errors, theological concerns, theme accuracy, summary accuracy, denominational bias. The output schema names severities explicitly — low, medium, high — and includes internal-consistency rules in the prompt itself:

IMPORTANT RULES:
- If scripture_issues and theological_concerns are EMPTY, overall_assessment
  MUST be "accurate" and confidence_score should be >= 0.95
- If there are only low-severity issues, use "minor_concerns"
- If there are medium or high severity issues, use "significant_concerns"
- confidence_score reflects how confident you are in the ANALYSIS quality,
  not the sermon itself

Those rules are belt-and-suspenders. Whatever the model returns for confidence_score gets overwritten downstream by the deterministic scorer — but the prompt rules keep the issue lists and the bucketed assessment internally consistent, which matters for auditability and for anyone reviewing the JSON by hand.

Pass 3 — Pastoral inference

This is the most opinionated prompt in the system and also the most novel. The goal is to generate useful discovery tags — the kind that help a visitor find a sermon about selective obedience when what they typed was "why do I keep running from what God asked me to do" — rather than the generic tags ("faith," "love," "hope") that could apply to any sermon and therefore help nobody.

The guardrail is simple and brutal: every tag must be accompanied by a connection field that cites a quote or specific reference from the sermon justifying the tag. No connection → no tag:

For each tag, you MUST provide:
- "tag": lowercase-kebab-case tag
- "connection": A quote or specific reference from the sermon that justifies this tag
- "relevance": "high" (directly addressed), "medium" (strongly implied), or "low" (loosely connected)

RULES:
1. Tags MUST be lowercase-kebab-case
2. Generate 5-10 tags per category, prioritizing quality over quantity
3. The "connection" field must cite SPECIFIC sermon content — a quote, theme, or explicit reference
4. If you cannot write a meaningful "connection", do not include that tag
5. Be SPECIFIC to this sermon — a Jonah sermon should have tags like "selective-obedience",
   "fear-of-confrontation", "knowing-gods-will-but-refusing" — NOT generic tags
6. search_phrases should be real questions people ask at 2am, directly answerable by this sermon

After parsing, the post-filter drops everything marked relevance: low and then strips the connection field before storing the tag:

# sermon_api.py:720-763
def filter_and_flatten_pastoral_results(result: dict) -> dict:
    """Filter pastoral inference to high/medium relevance only,
    then flatten to simple tag lists (removing connection field)."""
    ...
    for category in tag_categories:
        filtered_tags = []
        for item in items:
            if isinstance(item, dict):
                relevance = item.get("relevance", "").lower()
                if relevance in ("high", "medium"):
                    tag = item.get("tag", "")
                    if tag:
                        filtered_tags.append(normalize_tag(tag))
        filtered[category] = filtered_tags

The connection field is the most interesting prompt-engineering trick in the system. It costs nothing on the wire — we discard it before storing anything. But forcing the model to write down why a tag applies, before it's allowed to commit to the tag, substantially improves tag quality. The model has to stake a claim on the transcript before it's permitted to tag.

Tags that survive the filter are the ones the model was willing to quote the sermon to defend. Tags the model couldn't justify never make it past the filter, whether it's because they were low-confidence (marked low) or because the model couldn't even articulate a connection (filtered at parse time). The filter is doing deterministic work that the prompt alone can't be trusted to do.

07 — Confidence Scoring

Why the LLM doesn't get to score itself.

Pass 2 emits a confidence_score field in its JSON output. The number the model writes there is thrown away. In its place, a 25-line pure function recomputes confidence from the structural facts — issue counts, severities, count of invalid scripture references from the rule-based validator:

# sermon_api.py:840-907
def calculate_confidence_score(
    scripture_issues: list,
    theological_concerns: list,
    recommended_corrections: list,
    invalid_scripture_count: int,
    theme_accuracy: str,
    summary_accuracy: str,
) -> float:
    """
    Scoring logic:
    - Base score: 1.0
    - Deductions applied for each issue
    - Hard caps applied for automatic review triggers
    - Auto-approval threshold: 0.85
    """
    score = 1.0
    all_issues = scripture_issues + theological_concerns

    high_count   = sum(1 for i in all_issues if i.get("severity") == "high")
    medium_count = sum(1 for i in all_issues if i.get("severity") == "medium")
    low_count    = sum(1 for i in all_issues if i.get("severity") == "low")
    correction_count = len(recommended_corrections)

    # Deductions
    score -= high_count   * 0.12
    score -= medium_count * 0.06
    score -= low_count    * 0.02
    score -= invalid_scripture_count * 0.06
    score -= correction_count * 0.02

    if theme_accuracy == "partially_accurate":   score -= 0.02
    elif theme_accuracy == "needs_revision":     score -= 0.05
    if summary_accuracy == "partially_accurate": score -= 0.02
    elif summary_accuracy == "needs_revision":   score -= 0.05

    # Hard caps (review triggers)
    cap = 1.0
    if invalid_scripture_count > 0: cap = min(cap, 0.84)  # ANY invalid ref → review
    if high_count > 0:              cap = min(cap, 0.75)  # ANY high-sev → review
    if medium_count >= 2:           cap = min(cap, 0.84)  # 2+ medium → review
    if low_count >= 5:              cap = min(cap, 0.84)  # 5+ low → review

    return round(max(0.10, min(score, cap)), 2)

Then normalize_theological_review does the overwrite:

# sermon_api.py:798-837
review["confidence_score"] = score  # overwrite whatever the LLM said

if score >= 0.85:
    review["overall_assessment"] = "accurate"
elif score >= 0.70:
    review["overall_assessment"] = "minor_concerns"
else:
    review["overall_assessment"] = "significant_concerns"

When an LLM emits both a finding and a confidence in the finding, the finding is the signal and the confidence is the noise.

Derive confidence from structured facts. Never let the model self-assess its own reliability.

Why deterministic beats LLM-reported

Two reasons, one tactical and one strategic.

Auditability. When a sermon lands in the human-review queue, I can point to the exact arithmetic that put it there. "The model felt 0.74 confident" is not an explanation. "1 high-severity issue (−0.12) + 1 invalid scripture reference (−0.06), capped at 0.75 because high_count > 0, therefore bucketed as minor_concerns" is an explanation — one I can hand to a pastor reviewing the flag, one that holds up to scrutiny, and one that changes deterministically if the inputs change.

Calibration drift. Different models — GPT-4o-mini, Qwen3.5-35B-A3B, Qwen2.5-14B-AWQ — produce wildly different self-reported confidence distributions for the same inputs. Some are pessimistic. Some are wildly optimistic. The deterministic function gives me a stable scale across model swaps, which matters because I swap models. When the Spark was down for maintenance last month, the analysis path fell back to vLLM on the 4090, and nothing about the review-queue calibration changed. That only works because confidence isn't a model output.

The LLM's systematic bias

The specific bias the deterministic scorer severs is this: the LLM writes its confidence about the review process, not about the subject matter. A thorough review that finds legitimate issues feels like a successful review to the model. It will happily emit "confidence_score": 0.95 while simultaneously listing a high-severity theological concern. "I caught the problem, so I'm doing well" — but what downstream code cares about is whether the analysis is safe to auto-publish, which is anti-correlated with the model's feeling of having done a good job.

The deterministic scorer breaks that correlation by construction. The score is a function of what was found, not what the model feels about its finding.

Where the thresholds come from

The 0.85 / 0.70 cutoffs and the cap values were tuned against a hand-labeled sample of analyses — combinations of issue counts and severities that a reviewer either approved or flagged. They aren't magic; they're calibrated. The important structural properties are:

Any high-severity finding hard-caps at 0.75. Auto-approval is impossible when anything high-severity is on the list. This is the most aggressive rule and the one I'd least change.
Any invalid scripture reference hard-caps at 0.84. A single hallucinated chapter number (see §8) prevents auto-approval. This cooperates directly with the Bible-corpus validator.
Floor of 0.10. Scores can't go below 10%, which is about semantics rather than math: we distinguish "many issues" from "the analysis failed structurally," and the floor prevents the bucket from collapsing into the block category for reasons other than severity.

Real-world distribution from a 1,109-record spot check on the production catalog:

99.4%Field completeness

100%Pass success rate

5.1%Flagged for review (< 0.5)

0%Job-level failures

5.1% in the review queue is the useful signal-to-noise ratio. It catches real problems — hallucinated scripture, theologically imprecise claims, summaries that don't match the transcript — without burying the reviewers in false positives. If that number started trending above 10%, I'd tune. If it dropped below 2%, I'd worry that the deterministic gate had become too permissive.

08 — Anti-Hallucination

A 200-entry dict beats a vector database for this specific problem.

The LLM will occasionally cite scripture that doesn't exist. "Hezekiah 3:16." "Psalm 200." "1 Romans 4:5." "Jonah 5:6" (Jonah has four chapters). A downstream user sees an authoritative-looking analysis that references a verse that isn't real, and the whole output loses credibility.

The fix is a hand-curated lookup table of every canonical book and its chapter count, plus a regex parser:

# sermon_processor.py:30-99
BIBLE_BOOKS = {
    # Old Testament
    "genesis":   ("Genesis", 50),   "gen": ("Genesis", 50),   "ge":   ("Genesis", 50),
    "exodus":    ("Exodus", 40),    "exod":("Exodus", 40),    "ex":   ("Exodus", 40),
    ...
    "jonah":     ("Jonah", 4),      "jon": ("Jonah", 4),
    ...
    "revelation":("Revelation", 22),"rev": ("Revelation", 22),"re":   ("Revelation", 22),
}

66 canonical books × 2–4 aliases each = roughly 200 lookup keys. Values are (canonical_name, max_chapter). The parser normalizes ordinals ("First Peter" → "1 Peter", Roman numerals to digits), extracts book/chapter/verse with a single regex, then validates the chapter against the max:

# sermon_processor.py:132-180
def parse_scripture_reference(ref: str) -> dict:
    ref = normalize_book_number(ref.strip())
    pattern = r'^(\d?\s*[A-Za-z]+(?:\s+[A-Za-z]+)?)\s+(\d+)(?::(\d+)(?:-(\d+))?)?$'
    match = re.match(pattern, ref)
    if not match:
        return {"valid": False, "error": "Could not parse reference format"}

    book_raw, chapter_str = match.group(1).strip(), match.group(2)
    chapter = int(chapter_str)
    book_key = book_raw.lower()
    if book_key not in BIBLE_BOOKS:
        return {"valid": False, "error": f"Unknown book: {book_raw}"}

    book_name, max_chapters = BIBLE_BOOKS[book_key]
    if chapter < 1 or chapter > max_chapters:
        return {"valid": False, "error": f"{book_name} has {max_chapters} chapters, not {chapter}"}
    ...
    return {"valid": True, "book": book_name, "chapter": chapter, ...}

Every reference extracted by pass 1 runs through this validator. Invalid references go into scripture_validation.invalid, a JSONB field on the analysis record. The count feeds straight into the confidence scorer (invalid_scripture_count * 0.06 deduction, hard cap 0.84). A single hallucinated reference prevents auto-approval.

Why not a vector DB over the full text of the Bible?

You could do it. You'd embed all 31,000 verses, query with a cosine similarity search, and accept a reference if the top match scored above some threshold. People do this; it's a common pattern. For this specific problem it's worse than the dict on three axes:

Determinism. parse_scripture_reference("Jonah 5:6") always returns "Jonah has 4 chapters, not 5." A cosine search against verse embeddings returns approximately-right results — it might surface Jonah 4:6 with high similarity and happily accept the false reference. The reasoner comparing the score to a threshold gives you probabilistic catch-rate; the rule gives you categorical catch-rate.
Explainability. The invalid-references list is a structured JSON object with the original reference, the error message, and the context in which the LLM used it. A reviewer opening a flagged analysis sees exactly why each reference failed, in plain English. Good luck doing that with an embedding distance.
Zero infra cost. The lookup dict is a few hundred lines of Python. The parser is another hundred. No vector DB to maintain, no embedding model to version, no index to rebuild when the corpus changes. The corpus, in this case, is fixed — the Bible has 66 books, and that number has been stable since roughly 397 AD.

Domain-specific problems deserve domain-specific tools. When you know the shape of the failure mode you're trying to catch, a rule that catches it exactly is better than a general-purpose tool that catches it approximately.

Honest limitation

The validator catches structural hallucinations — wrong book names, invalid chapter numbers, references that span book boundaries. It does not catch content hallucinations. The model can cite a real, valid reference like John 3:16 and then paraphrase its content incorrectly, and this validator won't know. That's a separate problem that requires either an actual Bible text lookup (next version) or the theological-reviewer pass catching the content drift, which it sometimes does and sometimes doesn't.

Engineering is a sequence of decisions about which problems to solve first. Structural scripture hallucinations are the ones that cost the most credibility per instance, they're common, and they're cheap to catch. The content drift is rarer, more subjective, and the pass-2 reviewer is a reasonable second line of defense. The dict is the right tool for now.

09 — War Story

A 24× speedup hiding behind a missing CLI flag.

This is the one I tell at meetups. It's also the one that captures, more cleanly than any other moment, what actually separates someone who has deployed an LLM in production from someone who has called an API.

Symptom

batch_analyze.py running against gpt-oss-20b-UD-Q8_K_XL on the Spark's llama.cpp was failing with finish_reason=length and zero characters of content. Some requests succeeded, but each took 700 to 1,100 seconds. I switched the model to Qwen3.5-35B-A3B-UD-Q4_K_XL, reasoning that a bigger, better-tuned model would behave more predictably. Same symptoms.

07:28:09 [WARNING] JSON parse failed (finish_reason=length): raw response (0 chars):
07:28:09 [ERROR] [13891] FAILED after 1108.9s — Pass 1 parse failed: all parse attempts failed
[ERROR] [12012] FAILED after 1801.6s — Request timed out.

Three wrong fixes

First: bump max_tokens from 4,096 to 8,192 in the client. Reasoning: maybe the model was running out of output budget. No effect. Still zero characters.

Second: add reasoning_effort: "low" to the extra_body of the OpenAI-compatible request. Reasoning: Qwen3.5 is a thinking model; maybe its reasoning was eating the output budget. No effect.

Third: explicitly pass enable_thinking: false in extra_body. Reasoning: if the flag form I'd used didn't work, maybe the other form would. No effect.

Three hours in. Still zero characters of useful output. The server was clearly running — /health returned 200, GPU utilization was pegged during requests, nvidia-smi showed the model resident in VRAM. The model was doing something during those 18-minute requests, but whatever it was doing, it wasn't producing output tokens.

The boot log

Out of frustration, I started reading llama-server's startup log line by line. Most of it was noise — version info, model load progress, CUDA context setup. Line 7 had this:

reasoning-budget: activated, budget=2147483647 tokens
srv  params_from_: Chat format: peg-native

peg-native. Not jinja. The chat format was using the default PEG-based template parser rather than Jinja. Without --jinja, llama-server's chat-template layer doesn't understand Qwen's extended parameter set, which means the enable_thinking: false and reasoning_effort: "low" fields I'd been carefully passing through extra_body were being silently dropped at the template boundary.

The model was happily reasoning for 2,147,483,647 tokens (the default "activated" budget when the template couldn't parse the disable signal, effectively infinite), burning its entire context on a <think> block, and then having zero tokens left for the actual output JSON. Every request was a model spending eighteen minutes thinking about whether it could translate a sermon, then running out of room to say anything.

The fix

-./llama-server --model qwen3.5-35b-a3b-q4 --parallel 16 --ctx-size 262144
+./llama-server --jinja --model qwen3.5-35b-a3b-q4 --parallel 16 --ctx-size 262144

One flag. Restart. A single video went from ~900 seconds end-to-end to 36.7 seconds. A 24× speedup. The first overnight run after the fix completed 1,106 analyses, zero failures, ~103 videos per hour, on a model and hardware combination that hours earlier had been producing zero bytes of useful output.

Read the startup logs of every component you don't own. Line 7 of the boot output was telling me exactly what was wrong, for three hours.

The generalizable lesson

The lesson isn't "remember the --jinja flag." Nobody is going to need that specific fact more than once. The lesson is that deploying an LLM in production means being fluent in the configuration surface of the serving layer, not just the model. llama.cpp, vLLM, Ollama, TGI — each of them has dozens of flags that silently shape model behavior. The "same" model served by two different runtimes can behave meaningfully differently. And when something is wrong, the answer is usually in a log file you haven't read yet.

Using an OpenAI-compatible API is not a guarantee of OpenAI-equivalent behavior. extra_body parameters get dropped at template boundaries. Sampling defaults differ. Context-length handling differs. Error codes differ. A staff engineer deploying local inference is an engineer who has internalized this — who boots each component personally, reads each log, and treats "it was working yesterday" as data.

10 — More War Stories

Three more, shorter.

ProcessPool ↔ ThreadPool ↔ ProcessPool

The Phase 2 batch transcribe executor got rewritten three times in one week.

Round 1: ProcessPool. The intuitive choice for true parallelism. Workers died instantly on whisper.load_model(). The throughput log looked like 3,350 completed, 0 failures, 1,600,600 videos/hr, which is the kind of number you get when every "completed" video is actually a swallowed crash.

Round 2: ThreadPool with one shared model. One Whisper instance, lock around transcribe(). Worked, but limited to single-stream throughput. The user reaction, roughly: "we have 125 GB of RAM and a 4090 — why are we locking the concurrency to one?" Fair.

Round 3: ProcessPool, correctly. Back to ProcessPoolExecutor, but this time with multiprocessing.set_start_method("spawn"). The Round 1 crashes had been "Cannot re-initialize CUDA in forked subprocess" — Python's default start method on Linux is fork, and forking a process that has already touched CUDA leaves the child in an unusable state. spawn starts each child with a fresh Python interpreter and a fresh CUDA context. No crashes after that.

The takeaway: Python's multiprocessing defaults are wrong for GPU work. Always spawn. Also: ThreadPool is a perfectly fine temporary answer when ProcessPool is hiding errors — shipping single-threaded-that-works is better than shipping concurrent-that-pretends.

The systemd unit that wouldn't die

During a period where Whisper transcription and vLLM were fighting for the same GPU, I stopped vLLM explicitly with systemctl stop vllm so the Whisper workers could have the card. systemctl stop succeeded. Two seconds later, vLLM was back.

I stopped it again. Back again.

Seven times.

The root cause was in sermon-api.service:

[Unit]
Description=Sermon Processor API
After=network.target vllm.service
Wants=vllm.service       # <— this line

Wants=vllm.service means: whenever sermon-api starts — which it does, because it has Restart=on-failure — systemd ensures vllm.service is also started. Every time sermon-api hiccupped, it dragged vLLM back up. I was fighting a service-dependency graph, not malware.

Fix: drop the Wants= line, daemon-reload, restart sermon-api. Then systemctl stop vllm actually meant something.

Lesson: when a process you stopped keeps coming back, the thing bringing it up is almost always your own configuration, not an attacker.

Hung requests and JSON-shape hallucinations

Around 3 AM on a batch run, the log looked like this:

08:55:54 [ERROR] [12012] FAILED after 1801.6s — Request timed out.
08:55:55 [ERROR] [12025] FAILED after 1801.8s — Request timed out.
08:55:55 [ERROR] [12062] FAILED after 1802.1s — Request timed out.
Progress: 40/2343 (ok=30 fail=10) — 79.8 vid/hr

Two distinct problems entangled in one symptom.

Problem 1: The OpenAI client's default timeout was ~30 minutes. Stuck requests were eating slots for the full 1,800 seconds before the client even noticed. Fix: explicit httpx.Timeout(read=300.0). Stuck requests now fail in ~5 minutes, freeing the worker 6× faster.

Problem 2: Even with valid finish_reason=stop, some responses came back structurally malformed. One specimen had a main_points field that was a newline-separated string instead of an array, with a missing closing bracket and a subsequent field nested inside the unclosed array. The model thought it was done — it just produced broken JSON.

Fix: drop json_repair into the parse path as a fallback between json.loads and the structured failure envelope. json_repair is heuristic — not always correct — but it recovers a meaningful chunk of "almost JSON" responses. Anything it can't repair still fails parse and falls through to the retry path, so it's strictly a best-effort addition.

11 — Lessons

What transfers out of this project.

The implementation details of a sermon video platform don't matter to anyone not building one. The engineering judgment, I think, does.

Match your execution model to your workload, not to fashion

Path A and Path B coexist because latency-first and throughput-first are different problems. Per-type Go worker pools, in-memory job stores, and Whisper base are good answers for latency. Two-phase batch schedulers with ProcessPool concurrency and Whisper medium are good answers for throughput. Neither set of answers is better. They're answers to different questions. Pick the one that matches what you actually have.

Put the deterministic layer between the LLM and the user

Every place the LLM could be wrong, there's code standing between its output and the published result:

The Bible corpus validator catches hallucinated scripture references.
The confidence scorer replaces the LLM's self-reported score with arithmetic over issue counts.
The pastoral-inference filter drops any tag the model couldn't quote the sermon to justify.
The json_repair fallback and the shrinking-transcript retry handle malformed output without failing the job.

None of these are ML. None of them would be improved by being ML. They are the places where domain logic belongs, and they are what let you ship the LLM's output to users without holding your breath.

Budget the GPU like you'd budget any other constrained resource

24 GB of VRAM is not a number to round up from. It's a ceiling. AWQ quantization gives you 4× on weight storage. PagedAttention gives you continuous batching over a shared KV pool. --gpu-memory-utilization sets a hard cap that forces every other component into the remainder. Understanding the math is the difference between a system that runs and a system that OOMs at 2 AM during a bulk run.

Read the boot logs of every component you don't own

Three hours on the --jinja flag is my personal receipt on this one. The serving layer — llama.cpp, vLLM, Ollama — is software with its own configuration surface, its own failure modes, and its own verbose startup output. OpenAI-compatible does not mean OpenAI-equivalent. Every time you boot a new component, read the first 100 lines of its log carefully. The information is there. It's just not in the error message.

Local inference is a capability, not a compromise

There's a persistent framing in which running LLMs on your own hardware is something you do when you can't afford the API, or when you can't trust the vendor, or when you have regulatory constraints that force your hand. That framing misses the point. Local inference lets you make decisions about model selection, quantization, memory budgeting, batch scheduling, and deterministic guardrails that you cannot make when the model is on the other side of an HTTP call. Every decision in this case study — from the two-path architecture to the deterministic scorer to the --jinja flag — is a decision I was only allowed to make because the model ran on hardware I controlled.

If your data is sensitive, your workload is large, or your product's competitive advantage depends on specific model behavior — consider that running the model yourself isn't a fallback. It's the thing that unlocks the rest of the engineering.

Work with me

If this is the kind of AI infrastructure you want to build, I can help.

I work with mid-market teams on local AI deployments — running LLMs on your own hardware, designing the deterministic layers that make model output safe to ship, and building the two-path architectures that actually handle both your latency-critical and your throughput-critical workloads. Especially valuable if you're in a regulated or data-sensitive domain where cloud inference isn't an option.

jeremy@consultjl.com 724.263.3475 ← Back to all work