A deep technical look at the semantic search architecture for a multilingual video catalog. Four-tier typed embeddings, per-language persona-based query expansion, and the small tolerant contract that keeps an LLM tagging pipeline and a pgvector search index from drifting apart.
This case study assumes you've read the multilingual video platform case study, where the three-pass LLM analysis pipeline is described in detail. The search layer is what consumes that pipeline's output — specifically the Pass 3 pastoral-inference tags and the natural-language search phrases it generates.
If you haven't read that piece, read it first. This one picks up on the other side of the fence.
It's 2 a.m., and someone types "why do I keep running from what God asked me to do" into the search box. Keyword search does nothing useful with that. The sermon they're looking for — a teaching on Jonah — doesn't contain the phrase "why do I keep running." It contains "the prophet fled from the presence of the Lord," which matches nothing the user typed. BM25 returns the sermons whose transcripts happen to contain the words "running" and "God" most often, which is a different set of sermons and none of them are what the user wants.
Naive vector RAG over the transcript corpus is better, but not by much. A cosine-similarity search for that query against embedded transcript chunks returns the transcript chunks most similar to the question. That might be the Jonah sermon, or it might be a sermon that once mentioned "God asked" in passing and also had the word "running" somewhere else — embeddings are permissive. More importantly, there's nothing in this approach that distinguishes what the sermon is about from what the sermon speaks to. Those are different things, and they matter differently to the user.
The sermon on Jonah isn't really about running. It's about selective obedience — the specific human pattern of hearing God clearly, understanding the assignment, and doing anything else. A user searching in the middle of the night is not looking for sermons that contain the word "running." They are looking for sermons that speak to their pattern, and they're describing that pattern in their own language.
The sermon isn't about running. It's about selective obedience. Search has to bridge that gap — from what the user said to what the content speaks to — and neither keyword matching nor vanilla vector RAG can do it.
This case study is about how we bridge that gap. It's mostly a story about architecture: how the LLM tagging pipeline described in the Oceans case study produces tags and natural-language search phrases, how those outputs flow through a small tolerant contract into a pgvector index, and how the weighted retrieval ranks them against raw transcript content in a way that the Jonah sermon surfaces when the user asks about selective obedience, by whatever words they happen to use.
There is also a thesis: the most important engineering artifact in a system like this is the contract between the tagger and the searcher. Not the tagger. Not the searcher. The interface between them.
In the Oceans case study, the three-pass LLM analysis generates per-sermon metadata: a summary, biblical themes, scripture references, and — critically for this case study — a pastoral-inference pass that emits tags in several categories (life situations, struggles, emotions, audience) and a set of search_phrases. Those search phrases are natural-language questions the model thinks a user might actually type: "why do I keep running from what God asked me to do," "how do I forgive someone who isn't sorry," "is there hope for my marriage." Each phrase is generated with a justification quote from the sermon that the tag must cite before it's allowed to commit.
These phrases are the single most valuable input the search layer receives. They are already in the register of a real user query. The embedding of "why do I keep running from what God asked me to do" clusters tight to the embedding of a user typing the same phrase, because they are effectively the same utterance — the model predicted the query during tagging.
But the tagger writes JSON to a Postgres JSONB column, and the search service lives in a different repo with a different database and a different deployment surface. If those two sides drift apart — if the Go service doesn't know which fields to read, or reads them with the wrong casing, or forgets about a new category the Python tagger added — the search_phrases don't reach the index, and every downstream advantage evaporates.
An earlier version of this system had that exact problem. Pass 3 was generating pastoral tags and search phrases. The live indexing path was sending only scripture and keyword tags. The bash reconciler was sending the biblical themes plus suggested tags as a single "theme" category. Neither was sending the pastoral tags or the search phrases. Sermons indexed by one path had materially different searchability than sermons indexed by the other, and nothing about Pass 3's work was reaching users at all.
The fix wasn't a bigger model or a better embedding. The fix was a small, tolerant, single-source-of-truth request shape that all three ingest paths construct, that the search service expands into its own internal representation, and that the Python tagger's output maps into without any Python-side changes. Once the contract existed and was enforced at the boundary, every other improvement — the query_phrase embedding tier, the weighted ranking, the low-confidence signal — became easy. Before the contract existed, none of those improvements mattered because the input they depended on wasn't arriving.
The tagger generates the value. The contract delivers it. Without the contract, the tagger is writing into a void.
Everything else in this case study is downstream of that decision.
The search index lives in pgvector, in its own Postgres database, in a service called oceans_semantic_search that runs separately from the main application. The storage schema is intentionally boring — three tables plus a config table:
-- oceans_semantic_search/internal/db/migrations/001_initial_schema.sql
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE sermons (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
external_id VARCHAR(255) UNIQUE NOT NULL, -- oceans2.0 video.id as string
title VARCHAR(500) NOT NULL,
speaker VARCHAR(255),
passage VARCHAR(255),
summary TEXT,
transcript TEXT,
language VARCHAR(10) DEFAULT 'en',
source_url VARCHAR(1000),
duration_seconds INTEGER,
recorded_at TIMESTAMP,
...
);
CREATE TABLE sermon_embeddings (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
sermon_id UUID REFERENCES sermons(id) ON DELETE CASCADE,
embedding_type VARCHAR(50) NOT NULL,
chunk_index INTEGER DEFAULT 0,
content_preview VARCHAR(500),
embedding vector(1536),
UNIQUE(sermon_id, embedding_type, chunk_index)
);
CREATE INDEX sermon_embeddings_embedding_idx
ON sermon_embeddings USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
CREATE TABLE sermon_tags (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
sermon_id UUID REFERENCES sermons(id) ON DELETE CASCADE,
category VARCHAR(100) NOT NULL,
tag VARCHAR(255) NOT NULL,
confidence FLOAT DEFAULT 1.0,
UNIQUE(sermon_id, category, tag)
);
The interesting part is the embedding_type column. A single sermon produces four different kinds of embedding rows, each representing a different facet of that sermon:
Transcript chunks are the predictable baseline. The raw transcript is split into roughly 1,000-token chunks with 100-token overlap, each prefixed with "Sermon: {title} by {speaker}. " so the embedding captures the source context. These are the embeddings a pure-RAG implementation would use exclusively.
Summary is one row per sermon, embedding the 2–3 sentence Pass 1 summary. The summary is already condensed and thematically coherent, so its embedding is clean — but the tradeoff is that summaries are generic and similar summaries produce similar embeddings. Weight 1.0.
Themes is where something mildly clever happens. The source isn't a piece of prose — it's a set of discrete tags (biblical themes, keywords, struggle tags, emotion tags, audience tags). You can't naively embed a list of strings and expect good search behavior; the resulting vector is noisy. So the search service synthesizes a sentence from the tag set before embedding:
// search_service.go — buildThemeText (paraphrased)
// "This sermon addresses depression, anxiety. Key themes include hope,
// gods_sovereignty. It speaks to feelings of despair."
That synthetic sentence lives in the same linguistic register as a user describing a sermon they're looking for, which is the point. Weight 0.6 — lower than summary because the synthesized sentence is more opinionated and can drift.
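The synthesis step is small enough to reconstruct. This sketch is inferred from the paraphrased comment above; the real `buildThemeText` may group categories differently:

```go
package main

import (
	"fmt"
	"strings"
)

// buildThemeText turns discrete tag lists into one sentence in the register
// of a user describing a sermon. The three-clause structure is inferred from
// the paraphrased comment in the article, not copied from the real code.
func buildThemeText(struggles, themes, emotions []string) string {
	var parts []string
	if len(struggles) > 0 {
		parts = append(parts, "This sermon addresses "+strings.Join(struggles, ", ")+".")
	}
	if len(themes) > 0 {
		parts = append(parts, "Key themes include "+strings.Join(themes, ", ")+".")
	}
	if len(emotions) > 0 {
		parts = append(parts, "It speaks to feelings of "+strings.Join(emotions, ", ")+".")
	}
	return strings.Join(parts, " ")
}

func main() {
	fmt.Println(buildThemeText(
		[]string{"depression", "anxiety"},
		[]string{"hope", "gods_sovereignty"},
		[]string{"despair"},
	))
	// prints: This sermon addresses depression, anxiety. Key themes include
	// hope, gods_sovereignty. It speaks to feelings of despair.
}
```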
Query phrase is the interesting one and it gets its own section.
Pass 3 of the LLM analysis generates, among other things, a list of search_phrases: natural-language questions the model thinks a user searching for this sermon might actually type. The prompt asks for "real questions people ask at 2am, directly answerable by this sermon." Each phrase must be accompanied by a quote from the sermon that justifies it, and phrases the model can't justify don't make it into the output.
So for a sermon on Jonah, Pass 3 might generate phrases like "why do I keep running from what God asked me to do".
These phrases are already in the shape of real user queries. When a user types "why do I keep running from what God asked me to do", their embedding lands extremely close to that generated phrase — because the phrase is effectively what they typed, authored by an LLM that was trying to predict exactly this query.
The search service embeds each search_phrase as its own row in sermon_embeddings, with embedding_type = 'query_phrase' and chunk_index equal to the phrase's position in the list:
// internal/service/search_service.go — IngestSermon
// Query-phrase embeddings (Pass 3 search_phrases) — one per phrase.
for i, phrase := range req.SearchPhrases {
phrase = strings.TrimSpace(phrase)
if phrase == "" { continue }
if err := s.generateAndStoreEmbedding(
ctx, sermon.ID, models.EmbeddingTypeQueryPhrase, i, phrase,
); err != nil {
slog.Error("failed to generate query_phrase embedding",
"error", err, "phrase_index", i)
continue
}
}
A 40-minute sermon with 7 search_phrases generates 7 query_phrase rows, 1 summary row, 1 themes row, and ~30 transcript_chunk rows — roughly 39 embeddings in total. The sermons table, the sermon_embeddings table, and the tags table together hold everything the search service needs to answer a query; the source transcripts and analyses live in the main application database and are fetched only at ingest time.
The weights aren't arbitrary. They're an ordering of how closely each embedding tier's source content approximates the register of a real user query:
| Tier | Weight | Why |
|---|---|---|
| query_phrase | 1.3 | Authored by an LLM pretending to be a user typing a search. Same register as the query. |
| transcript_chunk | 1.2 | Actual sermon content. Can match on incidental word overlap but often contains the substance. |
| summary | 1.0 | Written, condensed, thematically coherent, but generic. |
| themes | 0.6 | Synthetic sentence built from discrete tags. Useful as a tiebreaker, noisy in isolation. |
Put another way: query_phrase was generated by an LLM imagining the query. Transcript chunks were recorded by the preacher imagining the message. Summary and themes were LLM-generated after the fact and sit in a register somewhere between the two. For matching a user query, closeness to the query register is what matters, so that's the order.
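The effect of the ordering is easiest to see with numbers. A minimal sketch — `weightedScore` is a hypothetical helper, since the real multiplication happens in SQL at query time:

```go
package main

import "fmt"

// tierWeights mirrors the table above. A raw cosine similarity is multiplied
// by its tier's weight before ranking, exactly what the SQL CASE expression
// does at query time.
var tierWeights = map[string]float64{
	"query_phrase":     1.3,
	"transcript_chunk": 1.2,
	"summary":          1.0,
	"themes":           0.6,
}

func weightedScore(tier string, rawCosine float64) float64 {
	w, ok := tierWeights[tier]
	if !ok {
		w = 1.0 // unknown tiers fall through to a neutral weight
	}
	return rawCosine * w
}

func main() {
	// A query_phrase match at raw 0.70 beats a slightly closer transcript
	// chunk at raw 0.74 once weights apply (~0.91 vs ~0.89).
	fmt.Println(weightedScore("query_phrase", 0.70) > weightedScore("transcript_chunk", 0.74)) // prints: true
}
```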
The weights live in a search_config table:
-- oceans_semantic_search/internal/db/migrations/002_search_config.sql
INSERT INTO search_config (key, value, description) VALUES
('weight_transcript', 1.2, '...'),
('weight_summary', 1.0, '...'),
('weight_themes', 0.6, '...'),
('min_score_threshold', 0.30, '...'),
('low_confidence_threshold', 0.45, '...')
ON CONFLICT (key) DO NOTHING;
-- 003_query_phrase_embeddings.sql
INSERT INTO search_config (key, value, description) VALUES
('weight_query_phrase', 1.3, 'Weight multiplier for search_phrase embeddings')
ON CONFLICT (key) DO NOTHING;
The search service reads the table at every search via getSearchConfig. Changing weight_query_phrase to 1.5 takes one UPDATE and no deploy. Setting it to 0 disables the tier — which is useful for A/B comparison, or for rolling back cleanly if a weight change regresses relevance.
This is not architectural glamour. It's operational hygiene. "Tune a weight without a redeploy" is the kind of capability you use sparingly in practice, but on the days you need it, the alternative — cutting a release to adjust a coefficient — is the kind of friction that makes teams stop tuning at all. Weights in code are weights that never get touched. Weights in config are weights that can be iterated on.
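The defaults-plus-override pattern is worth sketching because it is what makes a missing config row safe. This is an illustration of the pattern only — `mergeConfig` is a hypothetical name, not the service's actual `getSearchConfig`:

```go
package main

import "fmt"

// mergeConfig overlays rows read from the search_config table onto
// compiled-in defaults, so a missing row falls back instead of silently
// zeroing a tier. Key names come from the migrations shown above.
func mergeConfig(fromDB map[string]float64) map[string]float64 {
	cfg := map[string]float64{
		"weight_query_phrase":      1.3,
		"weight_transcript":        1.2,
		"weight_summary":           1.0,
		"weight_themes":            0.6,
		"min_score_threshold":      0.30,
		"low_confidence_threshold": 0.45,
	}
	for k, v := range fromDB {
		cfg[k] = v // a single UPDATE to search_config lands here, no deploy
	}
	return cfg
}

func main() {
	// A/B experiment: disable the query_phrase tier with one row change.
	cfg := mergeConfig(map[string]float64{"weight_query_phrase": 0})
	fmt.Println(cfg["weight_query_phrase"], cfg["weight_summary"]) // prints: 0 1
}
```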
Some queries are too short for semantic search to work well on their own. "Suicide." "Feeling lost." "Addiction." A one- or two-word embedding lands in a wide, noisy region of vector space — there isn't enough signal in the query to pull a specific cluster of content toward it. The result is retrieval that's technically relevant and practically useless.
The standard answer is query expansion: use a language model to expand the query into something longer and more specific. The standard implementation is synonym expansion ("suicide" → "suicide, self-harm, ending my life, wanting to die") or a HyDE-style approach (generate a hypothetical document that would answer the query, then embed that). Both of those improve recall but not necessarily precision, and both tend to drift in a particular direction — toward solutions the user might want, rather than situations the user might be in.
The expansion pattern this system uses is different, and it's the second-most-interesting thing in the architecture after the query_phrase tier. It's a persona prompt: "you are describing the searcher, not the solution."
// internal/expansion/expander.go — buildExpansionPrompt (english variant)
return fmt.Sprintf(
`You are a search query expander for a Christian sermon database.
Expand this short query into a 2–3 sentence description of the searcher's
situation, emotional state, and what they might be going through.
Focus on the person behind the query — what's happening in their life —
not on what sermons they should watch.
Keep the expansion under 100 words.
Query: %s
Expanded description:`,
query,
)
A query of "suicide" expands to something like: "Someone experiencing deep despair and hopelessness, possibly struggling with suicidal thoughts or the aftermath of a loved one's suicide. They may be feeling isolated, overwhelmed, or like there's no way forward. They are searching for hope, for reasons to keep going, for words that acknowledge how dark things feel right now."
That expanded text is what gets embedded and sent through retrieval — not the original word. And here's why it works: the content in the index is phrased in situational and emotional language, not keyword language. The Pass 3 search_phrases ("is there hope when everything feels dark"), the synthesized theme sentences ("this sermon addresses despair, hopelessness... it speaks to feelings of isolation"), the summaries — all of them describe the person the sermon is for. The expansion moves the query into the same register as the content, which is where cosine similarity actually works.
Query expansion that moves the query toward the solution is betting on a lookup table the model might have memorized. Query expansion that moves the query toward the searcher's situation is betting on the actual shape of the content in the index. The second bet is better because the content was designed that way on purpose.
The expander is language-aware. Three prompts, same persona framing, per user language:
// internal/expansion/expander.go — buildExpansionPrompt
// (Go raw strings don't process \n escapes, so the blank lines are literal.)
func buildExpansionPrompt(query, language string) string {
	switch strings.ToLower(language) {
	case "es":
		return fmt.Sprintf(`Eres un expansor de consultas de búsqueda para una
base de datos de sermones cristianos. ... Mantén la expansión por debajo
de 100 palabras.

Consulta: %s

Descripción ampliada:`, query)
	case "fr":
		return fmt.Sprintf(`Vous êtes un expanseur de requêtes de recherche
pour une base de données de sermons chrétiens. ... Gardez l'expansion sous
100 mots.

Requête : %s

Description élargie :`, query)
	default:
		return fmt.Sprintf(`You are a search query expander for a Christian
sermon database. ... Keep the expansion under 100 words.

Query: %s

Expanded description:`, query)
	}
}
The cache key includes the language code so an English "hope" expansion and a Spanish "hope" expansion don't collide. This is a small thing that only matters when it does, which is after a user in another language runs a query that happens to share a word with an English query someone ran earlier:
// CachedExpander.Expand — cache key now language-scoped
lang := strings.ToLower(strings.TrimSpace(language))
if lang == "" { lang = "en" }
normalizedQuery := strings.ToLower(strings.TrimSpace(query))
cacheKey := lang + ":" + normalizedQuery
// "hope"-en and "hope"-es no longer collide in Redis
Expansion is gated by query length. If the query is five words or more, the user has given us enough signal — expansion would dilute, not sharpen. Under that threshold, the persona expansion runs:
// internal/expansion/expander.go
func (e *Expander) ShouldExpand(query string) bool {
trimmed := strings.TrimSpace(query)
if trimmed == "" { return false }
words := strings.Fields(trimmed)
return len(words) < e.threshold // default 5
}
In practice: terse emotional keywords expand; full-sentence queries don't. This is honest about what expansion is for. It isn't a general-purpose improvement on all queries — it's a specific tool for the specific failure mode of short under-specified queries. Using it on longer queries makes them worse.
Everything in this case study so far — the four-tier index, the query_phrase weighting, the expansion prompts — depends on one premise: the pastoral tags and search phrases generated by Pass 3 actually arrive at the search service. That premise was wrong for longer than I'd like to admit.
There are three paths by which a sermon gets indexed, and all three were originally constructing their request payloads independently:
Path A — live post-transcription hook. When a video finishes transcription in the main application, the transcription manager reads the associated video_analyses record via an injected AnalysisLookup interface, constructs a CreateSermonRequest, and posts it to the search service:
// oceans2.0/internal/autotranscript/manager_processing.go
type AnalysisLookup interface {
GetByVideoID(ctx context.Context, videoID int32) (*domain.VideoAnalysis, error)
}
// provider wiring (server/providers.go)
transcriptionManager.SetAnalysisLookup(videoAnalysisService)
// ... later, inside the post-transcription goroutine
if m.analysisLookup != nil {
analysis, err := m.analysisLookup.GetByVideoID(ctx, videoID)
if err != nil {
m.logger(ctx).Debug().Err(err).Int32("video_id", videoID).
Msg("no pastoral analysis available for search indexing (non-fatal)")
} else if analysis != nil {
// ... append themes ...
if analysis.PastoralInference != nil {
p := analysis.PastoralInference
req.PastoralTags = &search.PastoralTags{
LifeSituationTags: p.LifeSituationTags,
StruggleTags: p.StruggleTags,
EmotionalTags: p.EmotionalTags,
AudienceTags: p.AudienceTags,
}
req.SearchPhrases = p.SearchPhrases
}
}
}
result, err := m.searchClient.CreateSermon(ctx, req)
The dependency injection of AnalysisLookup is deliberate — it keeps the transcription manager testable without a live database, and it keeps the circular dependency between packages from forming. The live path is fire-and-forget: if the search service is down, the log records it and the sermon gets re-indexed later by one of the reconciler paths. No DLQ, no retry. That's honest about the failure mode rather than pretending to handle it.
Path B — Go backfill reconciler. A script at oceans2.0/scripts/backfill_search_sermons.go queries for sermons that should be indexed but aren't, including the full pastoral_inference JSONB, and unmarshals it with tolerance for both snake_case and camelCase:
// oceans2.0/scripts/backfill_search_sermons.go
type pastoralJSON struct {
LifeSituationTags []string `json:"life_situation_tags"`
StruggleTags []string `json:"struggle_tags"`
EmotionalTags []string `json:"emotional_tags"`
AudienceTags []string `json:"audience_tags"`
SearchPhrases []string `json:"search_phrases"`
// camelCase fallbacks
LifeSituationTagsCamel []string `json:"lifeSituationTags"`
StruggleTagsCamel []string `json:"struggleTags"`
EmotionalTagsCamel []string `json:"emotionalTags"`
AudienceTagsCamel []string `json:"audienceTags"`
SearchPhrasesCamel []string `json:"searchPhrases"`
}
if len(c.PastoralInference) > 0 {
var pi pastoralJSON
if err := json.Unmarshal(c.PastoralInference, &pi); err == nil {
life := firstNonEmpty(pi.LifeSituationTags, pi.LifeSituationTagsCamel)
// ... same for other categories ...
req.PastoralTags = &search.PastoralTags{LifeSituationTags: life, ...}
req.SearchPhrases = firstNonEmpty(pi.SearchPhrases, pi.SearchPhrasesCamel)
}
}
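The `firstNonEmpty` helper is the entire tolerant-read trick on the Go side. A sketch of how it plausibly reads — the real helper may be shaped differently:

```go
package main

import "fmt"

// firstNonEmpty prefers the snake_case field and falls back to the camelCase
// one, so records from either pipeline era produce the same request payload.
func firstNonEmpty(primary, fallback []string) []string {
	if len(primary) > 0 {
		return primary
	}
	return fallback
}

func main() {
	fmt.Println(firstNonEmpty(nil, []string{"legacy"}))         // camelCase-era record: fallback used
	fmt.Println(firstNonEmpty([]string{"fear"}, []string{"x"})) // snake_case wins when present
}
```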
Path C — bash sync. A shell script that issues a large SQL query with json_build_object, pipes the result to curl, and posts directly to the search service. COALESCE handles both casings at the SQL level:
-- oceans_semantic_search/scripts/sync_missing_sermons.sh
SELECT json_build_object(
'external_id', v.id::text,
'title', v.title,
...
'pastoral_tags', json_build_object(
'life_situation_tags',
COALESCE(va.pastoral_inference->'life_situation_tags',
va.pastoral_inference->'lifeSituationTags',
'[]'::jsonb),
'struggle_tags',
COALESCE(va.pastoral_inference->'struggle_tags',
va.pastoral_inference->'struggleTags',
'[]'::jsonb),
...
),
'search_phrases',
COALESCE(va.pastoral_inference->'search_phrases',
va.pastoral_inference->'searchPhrases',
'[]'::jsonb)
)
The three paths exist because they handle different operational cases. The live hook keeps the index warm in normal operation — most sermons get indexed within seconds of transcription completing. The Go backfill handles new-sermon reconciliation after outages and systemic re-indexing when model weights or embedding dimensions change. The bash sync is a break-glass tool for the cases where a human is watching and wants to see immediate output at every step.
Three paths could have been a nightmare. It wasn't, because all three construct the same request shape. The CreateSermonRequest type is defined once in oceans2.0/internal/search/client.go, mirrored on the search-service side, and consumed by every path. A Pass 3 field that doesn't make it into sermon_tags is now a bug in a specific path, not an architectural fact about which path you used. That distinction is the entire story.
The snake_case / camelCase tolerance in Paths B and C is a small thing with large consequences. The Python sermon_processor uses Pydantic models with camelCase field names. The LLM's raw JSON output uses snake_case because the prompt asks for snake_case. The batch analysis script persists whatever the LLM emits, which is snake_case. But historical records persisted from an older pipeline version may have camelCase fields.
There are three options:

1. Migrate historical records so every row uses one casing. Correct in principle, but it requires a backfill migration and a guarantee that nothing else ever writes the old shape again.
2. Pick one casing in the readers and hope it's the one that survives future pipeline changes.
3. Make every reader tolerant of both: Go uses a firstNonEmpty fallback, and SQL uses COALESCE. Cheap, additive, doesn't require a migration, and new readers just copy the pattern.

Option 3 won. It costs five extra field definitions in the Go struct and adds zero runtime cost. The alternative — picking one and hoping — has bitten this codebase before, and "this is forever the right casing" is a claim that tends to get falsified by future decisions.
A user types "why do I keep running from what God asked me to do" into the search box. The frontend debounces, calls POST /api/search on the main application, which forwards to the search service's POST /api/v1/search. There, the expander checks the query length, the (possibly expanded) query text is embedded, and the weighted pgvector query runs against the index.
The pgvector query is the most important piece of code in the search service. It does three things in one statement: compute weighted cosine similarity across all four embedding tiers, filter by language and minimum threshold, and deduplicate so each sermon appears at most once in the result set:
-- oceans_semantic_search/internal/repository/sermon_repository.go:492-545
WITH scored AS (
SELECT
s.id, s.external_id, s.title, s.speaker, s.passage,
s.summary, s.source_url,
(1 - (se.embedding <=> $1)) * CASE se.embedding_type
WHEN 'transcript_chunk' THEN 1.2
WHEN 'summary' THEN 1.0
WHEN 'themes' THEN 0.6
WHEN 'query_phrase' THEN 1.3
ELSE 1.0
END AS similarity_score,
se.embedding_type, se.chunk_index, se.content_preview
FROM sermon_embeddings se
JOIN sermons s ON se.sermon_id = s.id
WHERE se.embedding_type IN ('summary','transcript_chunk','themes','query_phrase')
AND s.language = $2 -- new: language filter
AND (1 - (se.embedding <=> $1)) * CASE se.embedding_type
WHEN 'transcript_chunk' THEN 1.2
WHEN 'summary' THEN 1.0
WHEN 'themes' THEN 0.6
WHEN 'query_phrase' THEN 1.3
ELSE 1.0
END >= 0.30 -- min score threshold
),
best_per_sermon AS (
SELECT DISTINCT ON (id) *
FROM scored
ORDER BY id, similarity_score DESC
)
SELECT id, external_id, title, speaker, passage, summary, source_url,
similarity_score, embedding_type, chunk_index, content_preview
FROM best_per_sermon
ORDER BY similarity_score DESC
LIMIT $3 OFFSET $4;
DISTINCT ON (id) is the cleanest piece of this. The previous implementation collected all rows from the scored set, then deduplicated in Go — which worked when you excluded transcript chunks (because a sermon could only appear via summary or themes, which are one-per-sermon) but broke when you included them (because a sermon could appear under many chunk embeddings). The Go handler had to know which mode it was in. That's architectural coupling for no reason.
Moving dedup to SQL via the CTE makes it irrelevant to the handler. Each sermon appears once. The row that survives is the highest-scoring embedding for that sermon, and the embedding_type column tells you which tier won — which is exactly the information the frontend needs to label the snippet card ("Matched a natural-language question from this sermon" vs "Matched transcript content"). One SQL change collapses a class of Go-side special-casing.
"why do I keep running from what God asked me to do" is 13 words — above the expansion threshold, so no expansion fires. The query is embedded directly. The SQL runs against the English-language sermons. If the Jonah sermon's Pass 3 generated a search_phrase similar to this query, the query_phrase tier for that sermon wins at weight 1.3, and the content_preview returned is that phrase itself — which the UI then renders with the label "Matched a natural-language question from this sermon." If Pass 3 didn't happen to generate this specific phrase, a transcript chunk containing "fled from the presence of the Lord" might score highly and win at weight 1.2. Either outcome is legible to the user because the snippet tells them which tier matched.
Most RAG demos return results and leave the user to figure out whether any of them are relevant. This is a small but corrosive dishonesty. If the system was able to compute a similarity score and knew the top match was weak, the user should know too. Hiding that information under the guise of "confident presentation" is exactly the pattern that erodes trust in AI systems over time.
The search service distinguishes two thresholds, both stored in search_config:

- min_score_threshold (0.30): rows whose weighted score falls below it are filtered out of the result set entirely, in SQL.
- low_confidence_threshold (0.45): if the top surviving result scores below it, the response carries a low_confidence: true flag.

The flag is plumbed end-to-end — from the SQL threshold comparison through the search service response, through the main application's response envelope, to a banner at the top of the frontend results grid:
// search-results-content.tsx (paraphrased)
{lowConfidence && (
<div className="mb-4 rounded-md border border-amber-800/40 bg-amber-950/30 px-4 py-3">
<p className="text-sm text-amber-200">
We weren't sure any of these sermons directly answer your question.
Here are our closest matches — try rephrasing for better results.
</p>
</div>
)}
The UX impact is out of proportion to the implementation effort. A user who gets five sermons and a banner that says "we weren't sure any of these directly answer your question" is substantially more trusting of the system the next time they search, even though that specific search didn't give them what they wanted. The alternative — five sermons with no signal — trains the user to distrust the whole system the moment the first irrelevant result appears.
A search that admits when it's guessing is a search that users keep using. A search that always performs certainty is a search that gets abandoned the first time it's obviously wrong.
The threshold itself is intuition-tuned, not measured. I don't have click-through data yet — the signal needed to tune low_confidence_threshold against real user behavior is click-through rate on low-confidence result sets vs high-confidence result sets, which requires a click-through logger that hasn't been built. Tuning today is based on spot-checking: queries that obviously have no good answer ("how do I change a tire") land below 0.45; queries with clear answers sit well above it. When a click-through logger lands, the threshold becomes a calibrated number rather than an opinion. For now, opinion-plus-spot-checks is enough.
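A sketch of the two-threshold decision, assuming weighted scores arrive sorted descending; classifyResults is a hypothetical stand-in for the service's handler logic, which sits on the other side of the SQL filter:

```go
package main

import "fmt"

// classifyResults applies both thresholds: rows below minScore never reach
// the result set (in the real system the SQL WHERE clause does this), and a
// top score below lowConfidence flips the flag that drives the banner.
func classifyResults(weightedScores []float64, minScore, lowConfidence float64) (kept []float64, lowConf bool) {
	for _, s := range weightedScores {
		if s >= minScore {
			kept = append(kept, s)
		}
	}
	// Scores arrive sorted descending, so kept[0] is the best match.
	lowConf = len(kept) == 0 || kept[0] < lowConfidence
	return kept, lowConf
}

func main() {
	// "how do I change a tire" against a sermon index: weak matches everywhere.
	kept, low := classifyResults([]float64{0.41, 0.38, 0.25}, 0.30, 0.45)
	fmt.Println(len(kept), low) // prints: 2 true — results returned, banner shown
}
```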
The honest part of a case study is the list of things you haven't shipped. This system works, and it works better than most semantic search I've seen in this domain, but there are real limits. In priority order:
- Chunk-level timestamps are missing. When transcript_chunk wins, we can't yet surface the moment in the video where that chunk occurred. The fix is either to have the transcription service emit time-aligned chunks directly, or to post-hoc align chunks against the word-level timestamps already stored in video_transcripts.content. This is the single highest-leverage feature I haven't shipped — "jump to the moment in this sermon that matched your query" would be the user-facing differentiator.
- Click-through logging doesn't exist. A (query, video_id, position, matched_on, created_at) row per click is a small table, small handler, and immediate input to a real tuning loop.
- The language filter isn't wired through. The filters.Language field exists on the request type in both repos. The SQL honors it. The handler and frontend don't yet thread it through. Multi-language users see cross-language mixing until this is wired, which is plumbing rather than architecture.
- Re-analysis doesn't trigger re-indexing. A video_analyses change-stream consumer would close this.
- The embedding dimension is a deploy hazard. The schema declares vector(1536). The default Ollama model (nomic-embed-text) emits 768-dim vectors. Inserts will fail for naïve deploys. Production overrides EMBEDDING_MODEL or uses OpenAI, but the mismatch is a documentation hazard.

Everything on this list is engineering, not research. None of it requires a better model, a larger index, or a different architecture. What it requires is time — and the willingness to be specific about what's shipped vs what's scaffolded.
The implementation details of a sermon search engine don't matter to anyone not building one. The engineering pattern — a tolerant contract between an LLM pipeline and a retrieval index, enforced at the boundary, consumed by multiple paths — is the part that transfers.
AI systems composed of multiple services drift. The model behind the tagger updates. The prompt gets iterated. The output schema evolves. The consumer's read path, written six months ago against an older schema, keeps working for a while and then quietly stops producing the same quality of output as the indexer version running fresh. The contract — the shared request shape, defined in one place, consumed by every path — is the thing that makes drift detectable. Without it, drift is an emergent property that shows up in search relevance reports three months after anyone could have fixed it.
Write the contract. Mirror it on both sides. Make the consumers tolerant of reasonable schema variants (snake_case vs camelCase, missing vs null, new fields vs old) so the contract can evolve without a big-bang migration. Teach every ingest path to speak it. Then drift becomes a bug in a path rather than an architectural fact about which path indexed which sermon.
Semantic search works when query embeddings and content embeddings live in the same neighborhood of vector space. The standard pattern is to embed the query and hope the content lands nearby. The more reliable pattern is to shape the content so it speaks in the same register as the query. The Pass 3 search_phrases are that shape change, made deliberate. They exist because the tagger was asked to predict the query, not just describe the content. Once that shift is made, the searcher's job gets materially easier — it's matching phrases that were generated to be matched.
If your users search in situational language, your content should include situational phrasing. If your users search in terse keywords, the persona expander brings the query into the same register. The register match is upstream of every ranking improvement.
The LLM generates the valuable parts of this system — the summary, the tags, the search phrases. But the LLM does not make the decisions at the edges. The decisions at the edges — which tier weight wins, whether a query gets expanded, whether a result set gets a low-confidence banner — are made by deterministic code. The scripture reference validator is a 200-entry dict. The confidence scorer is arithmetic over issue counts. The tier weights are numbers in a database.
This is the same lesson as the Oceans case study's deterministic confidence scorer: use the LLM for what only the LLM can do; use deterministic code for everything else. It's the same architectural principle here, applied to search.
The single most important UX decision in this system might be the low-confidence banner. Users know the system has limits. Pretending otherwise is the fastest way to lose their trust. Telling them "we weren't sure any of these sermons directly answer your question" is a concession, and concessions read as honesty. A search that sometimes says "I'm not sure" is a search users believe when it says "this is the right answer."
The search that admits when it's guessing is the search users keep using.
None of the pieces in this architecture are novel on their own. pgvector is boring. Weighted cosine is boring. Persona prompts aren't new. Synthetic theme sentences are an obvious trick once you see them. CTE-based dedup is a SQL idiom. Redis caching with a language-scoped key is a one-line change. The architecture isn't interesting because any of its pieces are interesting. It's interesting because the pieces are in honest relationship to each other — the tagger produces phrases specifically for the index, the index ranks them specifically for the query, the UI surfaces specifically what matched and how confident we are. That relationship is the product.
If you're building search over LLM-generated content and wondering what to invest in: invest in the contract between the producer and the consumer. Everything else gets easier on the other side of that decision.
I work with mid-market teams on AI-integrated systems where the interesting engineering lives at the seams — between the LLM, the index, and the user. Especially valuable when you already have an LLM pipeline producing content and you need search or retrieval over it that actually understands what your users mean.