Case Study · Commercial Data Intelligence at Scale

A technographic pipeline that processes half a million organizations a month, and why its second version threw away its own queue system.

Ravel's technographic pipeline crawls websites, fingerprints technology signals, resolves accounts across enterprise partner datasets, and delivers structured findings on a monthly SLA. This is the architecture and the operational reality — including the parts where we tore down the distributed system we'd carefully built and replaced it with a single bash script.

Role: Co-founder, CTO · Stack: Python · Go · Playwright · PostgreSQL · Scale: 515K+ entities / month · Read: ~22 min
  1. The problem, and why it's harder than it looks
  2. A tale of two pipelines: v1's distributed ambitions
  3. Why we ripped out the queue
  4. The v2 architecture, end to end
  5. Inside the crawler: slot isolation and atomic checkpoints
  6. Inside the processor: fingerprint matching at wire speed
  7. The ML account matcher: four strategies, eighteen features
  8. Operations: one script, five stages, three alert tiers
  9. War stories
  10. What I'd tell you if you were building this

Commercial data intelligence at scale, with an SLA attached to it.

Enterprise SaaS vendors have a specific and expensive question they need answered every month: across our entire addressable market — hundreds of thousands of organizations — which of them are running which technologies right now, and which of them just added a second vendor in a category where they already have us. That second question is the one that matters. Dual-vendor deployments are the leading indicator of customer churn. By the time sales knows a customer is evaluating a competitor, it's too late. By the time the technographic pipeline flags it, you have a quarter to get in front of it.

Ravel's job was to answer that question at enterprise scale and deliver the answer on a monthly cadence, in clean tabular form, to S3 buckets owned by the partners. The output is dull. The input is not.

The addressable market for our primary partners was roughly 515,000 organizations. Each organization has a website — and "has a website" is doing a tremendous amount of work in that sentence. Some are static HTML served out of 2008. Some are React single-page apps with twelve seconds of client-side hydration before anything meaningful renders. Some are WordPress with twenty plugins fighting each other. Some redirect to a Facebook page. Some redirect to themselves in a loop. Some respond 500 with a 24-byte body for reasons nobody will ever explain. All of them need to be crawled, all of them need their content passed through a thousand-plus technology fingerprints, and all of the findings need to land in the right partner's bucket with the right account IDs attached by the tenth of the month.

The crawler wasn't the hard part. The hard part was building a reliable, resumable, multi-stage pipeline that could handle the messy reality of websites that break in every way imaginable, while a partner's data engineering team waits for their S3 drop on the tenth of the month.

I was the co-founder and CTO. I wrote the first version of this pipeline, and I wrote the second version after we threw out the first one. What follows is an honest account of both — because the decision to collapse a distributed queue-driven architecture back into a monolithic batch pipeline is the most interesting engineering decision in the whole system, and nobody writes about deletions.

Version one was distributed. It was also wrong.

V1 looked like what a thoughtful engineer in 2022 would draw on a whiteboard if you gave them this problem. A Python async crawler running in Kubernetes, backed by RabbitMQ with prefetch-based concurrency control and a three-state circuit breaker. A Go signal processor also driven by RabbitMQ, with fsnotify as a fallback for file-based inputs. A separate Go exporter. A data-pipeline fan-out layer that pushed crawled pages to dedicated worker types — basic info, leadership, ministry, media — each on their own queue, each with their own scaling knob. Prometheus, Grafana, health endpoints, an alert system with pluggable Slack and email channels, Kubernetes manifests from namespace down to CronJob. A Makefile that could scale individual worker classes independently.

It worked. It was observable. It had clean seams between services. It was, by every shibboleth of modern distributed system design, correct.

It was also the wrong system for the problem.

The v1 architecture

[Architecture diagram. Flow: dispatch_crawl.py on a schedule reads accounts from the DB → crawl_queue (prefetch=10) → custom-crawler Kubernetes deployment (5–25 pods under HPA; SmartCrawlerService, HTTP-first with browser fallback; 20-context BrowserPool; three-state CircuitBreaker; /healthz on :8080) → processing_queue (fsnotify fallback) → processor_v3 Go service with a RabbitMQ worker pool → data-pipeline fan-out workers (basic_info, leadership, ministry, media: one queue each, independently scaled, Prometheus-scraped) → Go exporter. Observability: Prometheus, Grafana, SLA monitor; alert channels slack / email / webhook with escalation policies.]
V1: distributed, queue-driven, observable. And ultimately wrong for the problem shape.

The v1 crawler had genuinely clever pieces. The browser pool kept one Playwright context per domain for session persistence, with a three-tier cleanup that evicted the oldest third at 15 contexts and did emergency eviction at 20, plus time-based cleanup of anything over ten minutes old. The retry classifier distinguished permanent failures (404, 403, 410, 451 — never retry) from transient ones (502, 503, 504, 429 — always retry) and had learned edge cases like "HTTP 500 with a response body under 100 bytes is a broken server, not a transient error." The SmartCrawlerService tried a cheap HTTP GET first and only fell back to a full browser when it detected a single-page app, which it did by counting <script> tags against visible text and sniffing for __NEXT_DATA__ and similar markers. These were good ideas. They survived into v2.

What didn't survive was the topology.

Why we ripped out the queue system on purpose.

There's a specific temptation in systems design that I want to name carefully, because I've watched smart engineers fall into it more than once and I fell into it myself on this project. The temptation is to reach for distributed primitives — queues, service boundaries, horizontal scaling, independent deployment — before you've established that the workload actually has the shape those primitives are designed for.

Queues solve three kinds of problems well: absorbing bursty or unpredictable arrival rates; buffering between producers and consumers that run at different speeds; and letting heterogeneous worker types scale and deploy independently.

Our workload had none of these properties. The input was a CSV of 500,000 URLs known at the start of the month. The processing rate was bounded by crawler concurrency, not by arrival rate. The downstream processor was a single worker class applying the same fingerprint matching to every page. There was exactly one deployment: the whole pipeline, once a month, end to end. The queue wasn't buffering variance — there was no variance. It was just a place for our state to hide.

The queue wasn't buffering variance — there was no variance. It was just a place for our state to hide.

Every distributed-system problem I was solving was a problem my architecture had invented.

And hiding state in a queue costs you things. Specifically:

  1. Resumability became opaque. When the crawler crashed mid-run, figuring out what had and hadn't been processed meant introspecting the queue plus the database plus the file outputs. A partial run left state smeared across three systems. Our "resume from where we left off" logic had to reconcile all three, and the reconciliation logic had bugs we only discovered in production.
  2. Observability fragmented. To answer "how is this month's run going?" I had to check Grafana for crawler pod metrics, the RabbitMQ management UI for queue depth, the Kubernetes dashboard for pod status, and the database for processor progress. None of these gave me a single answer. I was doing mental JOINs across four dashboards at 2 AM.
  3. Failure modes multiplied. Every queue is another thing that can fail. RabbitMQ connection drops, channel resets, unacked-message limits hit, a consumer gets into a poison-message loop. We wrote start_consuming_with_retries wrappers with exponential backoff around every consumer. These wrappers were themselves a source of bugs — I'll tell that story in the war stories section.
  4. The ops surface was enormous. To run a single monthly pipeline, I needed Kubernetes, RabbitMQ, Prometheus, Grafana, and a Makefile with twenty targets. To diagnose a problem I needed to know all five. To onboard someone to help me I needed to teach them all five. The cognitive load was constantly high and none of it was intrinsic to the actual work.

The v2 design started from a deliberately boring premise: this is a monthly batch job on a known input, with explicit stages, writing to a local filesystem between stages. If you wrote that sentence on a whiteboard in 1998, the right answer would have been a bash script that ran five programs in sequence. In 2026, with twenty-eight years of operational pattern-hoarding between then and now, the right answer is still a bash script that runs five programs in sequence. The only thing that changed is that now we have atomic file renames, SHA256 idempotency ledgers, Pushover for alerts, and Docker to isolate the binaries.

So that's what v2 is.

Five stages, one orchestrator, no queue.

V2 is a single bash script — pipeline.sh — that orchestrates five stages in strict order. Each stage reads from a known directory and writes to a known directory. The run is keyed by a single identifier, RUN_ID=$(date +%Y-%m-%d)-monthly, which becomes the root of every artifact directory. State is files. Progress is contiguous line numbers in checkpoint JSON. Idempotency is SHA256 hashes in a ledger table. Alerting is Pushover with three priority tiers.
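For illustration, the control flow of pipeline.sh reduces to something like this. This is a Python sketch, not the real orchestrator (which is bash), and the ./stages/<name>.sh paths are hypothetical stand-ins:

```python
import subprocess
import sys
from datetime import date

# RUN_ID keys every artifact directory, as described above.
RUN_ID = f"{date.today():%Y-%m-%d}-monthly"
STAGES = ["crawl", "process", "aggregate", "archive", "export"]

def run_pipeline(run_id: str = RUN_ID, runner=subprocess.run) -> bool:
    for stage in STAGES:
        # Strict order: a non-zero exit aborts the run, so later stages
        # never read a partially written directory.
        result = runner([f"./stages/{stage}.sh", run_id])
        if result.returncode != 0:
            print(f"stage {stage} failed for run {run_id}", file=sys.stderr)
            return False
    return True
```

The `runner` parameter exists only so the sequencing is easy to exercise without real stage scripts.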

[Architecture diagram. pipeline.sh orchestrator (RUN_ID=$(date +%Y-%m-%d)-monthly; dry-run, retry-failed, per-stage alerts, status.sh) drives five stages in order: 1. crawl (Python/Playwright, 20 workers × slots, output_*.jsonl, .crawler_state.json) → 2. process (Go, 7 MB scratch image, ~1k fingerprints, batch 600, 3× retry, SHA256 ledger) → 3. aggregate (SQL upsert ON CONFLICT into website_technology_findings, keyed on site/fingerprint/month) → 4. archive (180-day retention keeps the hot table small) → 5. export (Go + libpostal, CSV + JSON under export/${RUN_ID}/: TechnicalFindings_*, AccountDelivery.*). Delivery: upload_to_partner.sh to s3://partner/{YYYY}/{MM}/{DD}/*.gz and upload_to_data_lake.sh to s3://datalake/... with bucket-owner-full-control, atomic rename, gzip, MD5 dedup, cross-account AssumeRole. Operational surface: status.sh (live shard + run-state viewer, history, log tail), notify.sh (Pushover: normal / high / emergency), install-cron.sh (0 2 10 * *, idempotent tagged install).]
V2: single orchestrator, strict stage order, files as state, directories keyed by RUN_ID.

The total operational surface for running a monthly pipeline in v2 is:

  1. pipeline.sh, the orchestrator (dry-run, retry-failed, per-stage alerts)
  2. status.sh, a live shard and run-state viewer with history and log tail
  3. notify.sh, Pushover alerts at normal / high / emergency priority
  4. install-cron.sh, an idempotent cron install (0 2 10 * *)
  5. The five stage binaries in Docker, plus one PostgreSQL database

That's it. There is no dashboard. There is no Kubernetes. There is no message broker. There is a script, five binaries, and a SQL database.

The rest of this case study is about the three stages that carry actual algorithmic weight — the crawler, the processor, and the ML account matcher — and the operational tooling that makes the whole thing go. I'll try to resist the urge to just dump every interesting detail; I'll focus on the ones that exist because a specific thing broke, or because a specific tradeoff mattered.

Slot isolation, atomic checkpoints, and a circuit breaker that knows how to wait.

The crawler's job: take a CSV of (website_id, url) pairs, visit each URL with a real browser, extract text and headers, and write structured JSON. Everything else is defense.

Concurrency model: workers, slots, contexts

A worker is an asyncio coroutine. Each worker hosts N slots. Each slot owns exactly one Playwright BrowserContext and one Page. Effective per-shard concurrency is CRAWL_CONCURRENCY × CRAWL_TABS_PER_WORKER, defaulted to 20 × 1 — deliberately conservative, because the per-slot isolation is what buys us the resilience we actually need.

# pipeline/crawler/crawler/crawler.py:223-234
async def _worker_loop(self, worker_id: int) -> None:
    """Worker coroutine hosting one context and multiple tab slots."""
    worker = Worker(worker_id, self.config, self.browser_manager)
    try:
        slots = max(1, self.config.tabs_per_worker)
        slot_tasks = [
            asyncio.create_task(self._worker_slot_loop(worker, worker_id, slot_id))
            for slot_id in range(slots)
        ]
        await asyncio.gather(*slot_tasks)
    finally:
        await worker.close()

The worker is a nursery for its slot tasks. It doesn't touch pages itself.

Why one Playwright context per slot instead of one per worker? Because of an incident in v1, which I'll cover in §9. The short version: when multiple tabs share a context, a single poisoned page can hold the context's lock forever, and its siblings queue behind it silently. The watchdog sees browser.is_connected() == True and reports healthy. Throughput goes to zero. Nobody notices for an hour.

Isolated contexts mean a hung page can be torn down in 5 seconds — _close_slot_locked wraps page.close() and context.close() in asyncio.wait_for(timeout=5) — without touching siblings. Each slot tracks its parent browser's session_id, so when the browser restarts, stale slots invalidate their contexts automatically:

# pipeline/crawler/crawler/worker.py:141-174
async def _ensure_page(self, slot_id: int):
    """Ensure worker has a valid page/tab tied to current browser session."""
    killed = self.browser_manager.enforce_playwright_memory_limit()
    if killed > 0:
        raise BrowserCrashError(f"killed {killed} oversized playwright process(es)")

    if not self.browser_manager.is_healthy():
        raise BrowserCrashError("Browser is not connected")

    async with self._slot_lock(slot_id):
        if self._slot_session_ids.get(slot_id) != self.browser_manager.session_id:
            await self._close_slot_locked(slot_id)

        page = self._pages.get(slot_id)
        if page is not None:
            try:
                if not page.is_closed():
                    return page
            except Exception:
                self._pages.pop(slot_id, None)

        try:
            context = self._contexts.get(slot_id)
            if context is None:
                context = await self.browser_manager.create_context()
                self._contexts[slot_id] = context
                self._slot_session_ids[slot_id] = self.browser_manager.session_id

            page = await context.new_page()
            self._pages[slot_id] = page
            return page
        except Exception:
            # creation failed mid-way: reset the slot so the next
            # attempt starts from a clean context
            await self._close_slot_locked(slot_id)
            raise
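
The bounded teardown itself is a small pattern worth showing. A minimal sketch, where close_page and close_context stand in for the Playwright page.close() / context.close() calls:

```python
import asyncio

# Each close call gets at most `timeout` seconds; a hung close is
# abandoned, not waited on, so sibling slots are never blocked.
async def close_slot(close_page, close_context, timeout: float = 5.0) -> None:
    for close in (close_page, close_context):
        try:
            await asyncio.wait_for(close(), timeout=timeout)
        except asyncio.TimeoutError:
            pass  # log-and-abandon in the real crawler
```

asyncio.wait_for cancels the awaited close on timeout, which is exactly the "tear it down in 5 seconds" behavior described above.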

Atomic checkpoints

Resume after a crash is only useful if the checkpoint is trustworthy. Ours is. The state file is written via tempfile.mkstemp followed by os.rename — atomic on POSIX — and contains both last_processed_line (the last contiguous CSV line) and completed_lines_pending (out-of-order completions held until the contiguous front advances):

# pipeline/crawler/crawler/checkpoint.py:68-106
async def save(self) -> None:
    """Atomic save: write to temp file, then rename."""
    async with self._lock:
        self.state.last_checkpoint_at = datetime.now(timezone.utc).isoformat()
        self.state.completed_lines_pending = sorted(self._completed_lines_pending)

        state_dict = {
            "last_processed_line": self.state.last_processed_line,
            "completed_lines_pending": self.state.completed_lines_pending,
            "total_processed": self.state.total_processed,
            "total_success": self.state.total_success,
            "total_failed": self.state.total_failed,
            "failed_tasks": self.state.failed_tasks,
            "current_output_file_index": self.state.current_output_file_index,
            "current_output_row_count": self.state.current_output_row_count,
            "started_at": self.state.started_at,
            "last_checkpoint_at": self.state.last_checkpoint_at,
        }
        self.state_file.parent.mkdir(parents=True, exist_ok=True)
        temp_fd, temp_path = tempfile.mkstemp(dir=self.state_file.parent, suffix=".tmp")
        try:
            with os.fdopen(temp_fd, "w", encoding="utf-8") as f:
                json.dump(state_dict, f, indent=2)
            os.rename(temp_path, self.state_file)
        except Exception:
            os.unlink(temp_path)   # don't leave temp debris on a failed save
            raise

The distinction between last_processed_line and completed_lines_pending matters. If worker A finishes URL 100 while worker B is still on URL 99, we cannot advance the checkpoint to 100 — if we crashed, we'd skip 99. So 100 goes into completed_lines_pending, waits for 99 to finish, and the contiguous front advances to 100 afterward. Out-of-order completion, in-order checkpointing.
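The advancement rule can be sketched in a few lines (a simplified model of the checkpoint logic, not the production class):

```python
# Record an out-of-order completion and advance the contiguous front.
def advance_front(last_processed: int, pending: set, finished_line: int):
    pending.add(finished_line)
    # The checkpoint only moves through unbroken runs of finished lines.
    while last_processed + 1 in pending:
        last_processed += 1
        pending.discard(last_processed)
    return last_processed, pending
```

A crash between calls loses nothing: every line at or below last_processed is done, and everything in pending gets redone, which is safe because the outputs are idempotent.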

Backpressure

The CSV producer is a coroutine that reads rows and puts them on a bounded asyncio.Queue of size 1,000. When the queue is full, await queue.put() suspends the producer until a slot frees up. That's the whole backpressure mechanism. The producer doesn't need to know about the workers; the workers don't need to know about the producer; the queue's bounded size does the coordination:

# pipeline/crawler/crawler/queue_manager.py:99-138 (excerpt)
async def produce(self) -> None:
    logger.info("queue_producer_started",
                input_file=str(self.csv_path), start_line=self.start_line)
    with open(self.csv_path, "r", newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        for line_num, row in enumerate(reader, start=1):
            self.total_lines = line_num
            if line_num <= self.start_line:
                continue                                  # resume skip
            if self._shutdown.is_set():
                break
            ...
            if self.queue.full():
                logger.debug("queue_backpressure",
                             queue_size=self.queue.qsize(), current_line=line_num)
            if not await self._put_with_shutdown(task):
                break
            self.current_line = line_num

At shutdown, the producer enqueues concurrency × tabs_per_worker poison-pill Nones — one per slot. Each slot pulls a None, stops pulling, and exits cleanly.
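A stripped-down model of the producer/slot handshake, assuming a trivial task type and ignoring the crawler's real shutdown-event plumbing:

```python
import asyncio

async def produce(queue: asyncio.Queue, items, n_slots: int) -> None:
    for item in items:
        await queue.put(item)        # suspends here when the queue is full
    for _ in range(n_slots):
        await queue.put(None)        # one poison pill per slot

async def slot_loop(queue: asyncio.Queue, out: list) -> None:
    # Pull until we draw a poison pill, then exit cleanly.
    while (task := await queue.get()) is not None:
        out.append(task)

async def run(items, n_slots: int = 3, maxsize: int = 10):
    queue = asyncio.Queue(maxsize=maxsize)
    out: list = []
    slots = [asyncio.create_task(slot_loop(queue, out)) for _ in range(n_slots)]
    await produce(queue, items, n_slots)
    await asyncio.gather(*slots)
    return out
```

Because there is exactly one None per slot, every slot is guaranteed to see one, regardless of how the scheduler interleaves them.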

The circuit breaker

Crashes cluster. A flaky upstream or a batch of pages with hostile JS can trigger a burst of crashes that overwhelms the browser restart cap, and then we're watching the crawler die forever. The breaker trades throughput for stability: after N crashes inside a sliding window, it pauses all slots until a cooldown elapses.

# pipeline/crawler/crawler/crawler.py:66-99
async def _record_crash_and_maybe_trip_breaker(
    self, worker_id: int, slot_id: int, reason: str,
) -> None:
    now = time.monotonic()
    async with self._breaker_lock:
        self._crash_timestamps.append(now)
        cutoff = now - max(1, self.config.circuit_breaker_window_seconds)
        while self._crash_timestamps and self._crash_timestamps[0] < cutoff:
            self._crash_timestamps.popleft()
        crash_count = len(self._crash_timestamps)
        threshold = max(1, self.config.circuit_breaker_crashes)
        if crash_count >= threshold:
            cooldown = max(1, self.config.circuit_breaker_cooldown_seconds)
            self._breaker_pause_until = max(self._breaker_pause_until, now + cooldown)
            logger.warning("circuit_breaker_tripped",
                           worker_id=worker_id, slot_id=slot_id, reason=reason,
                           crash_count=crash_count, threshold=threshold,
                           window_seconds=self.config.circuit_breaker_window_seconds,
                           cooldown_seconds=cooldown)

Defaults: 12 crashes in a 30-second window trips a 20-second cooldown. During cooldown, slot loops call _wait_if_breaker_active() before pulling the next task, so throughput effectively zeros while the browser stabilizes. The alternative — hammering a wedged browser with more tasks until the restart cap is hit and the whole shard dies — is worse than waiting.

Failure mode matrix

Crawlers are failure-mode encyclopedias. Here's what we handle explicitly:

Failure | Classification | Action
Transient network (ERR_CONNECTION_CLOSED, ERR_TUNNEL_CONNECTION_FAILED, frame detached) | retry-eligible | Retry up to 3×
Page timeout (45s default) | slot-level | Close page, invalidate slot, retry
Hard context timeout | slot-level | Cancel slot task, recreate slot
Fatal browser markers (target crashed, browser disconnected) | crash | Raise BrowserCrashError, attempt browser restart (cap 3)
Playwright RSS > 2048 MB | crash | SIGKILL oversized PIDs, raise BrowserCrashError
Hung page.close() / context.close() | leak | 5s wait_for, log, abandon slot
Crash storm | systemic | Circuit breaker pauses all slots
Restart cap exceeded | fatal | Signal shutdown, drain, exit non-zero
Failure rate > 50% | pipeline-level | pipeline.sh aborts, emergency Pushover

Note the last one: the crawler itself doesn't enforce a failure-rate ceiling. The orchestrator does. This is intentional — the crawler should do its job to the extent possible and report honestly; the decision to halt is an operations decision that belongs one layer up.

A thousand fingerprints, every page, at wire speed.

The processor is a Go service that reads crawler JSONL output line by line, matches each page's content against roughly 1,000 technology fingerprints in parallel, and bulk-inserts matches into PostgreSQL. Its most important property is that it is boring. It loads its fingerprints once at startup, it never talks to a queue, it writes files to processed/ when done, and if it crashes you run it again.

What a fingerprint is

A fingerprint is a row in the fingerprints table pointing at a signal (a vendor/feature). Each fingerprint has one of two match modes — matching_text (a substring) or matching_regex (a compiled regex) — plus an optional extraction_regex that pulls a captured value (an account ID, a tracker ID, a version string) out of the match:

// pipeline/models/fingerprint.go:14-27
type Fingerprint struct {
    ID                  uint           `gorm:"primarykey"`
    CreatedAt           time.Time
    UpdatedAt           time.Time
    DeletedAt           gorm.DeletedAt `gorm:"index"`
    MatchingText        string
    MatchingRegexExp    string         `gorm:"column:matching_regex"`
    MatchingRegex       *regexp.Regexp `gorm:"-"`
    ExtractionRegexText string         `gorm:"column:extraction_regex"`
    ExtractionRegex     *regexp.Regexp `gorm:"-"`
    Enabled             bool
    SignalID            int
    Signal              Signal         `gorm:"foreignKey:SignalID"`
}

At boot, the processor queries all enabled, non-deleted fingerprints and splits them into two flat global slices: models.AllText and models.AllRegex. There is no per-domain indexing — every page is matched against every fingerprint. For our corpus size this is fine; when it stops being fine we'll index. Not yet.
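To make the two match modes concrete, here is a hedged Python model of the same logic. The field names mirror the Go struct above, but this is an illustration, not the processor's code:

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class Fingerprint:
    signal_id: int
    matching_text: str = ""                          # substring mode
    matching_regex: Optional[re.Pattern] = None      # regex mode
    extraction_regex: Optional[re.Pattern] = None    # optional captured value

def match_page(contents: str, fingerprints):
    results = []
    for fp in fingerprints:
        hit = bool(fp.matching_text) and fp.matching_text in contents
        if not hit and fp.matching_regex is not None:
            hit = fp.matching_regex.search(contents) is not None
        if not hit:
            continue
        extracted = None
        if fp.extraction_regex is not None:
            m = fp.extraction_regex.search(contents)
            extracted = m.group(1) if m else None
        results.append((fp.signal_id, extracted))
    return results
```

Substring checks are the cheap path; the regex pass only pays its cost when a regex fingerprint exists for the vendor.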

The match loop

Each page runs both passes concurrently. Two goroutines, two unbuffered channels, one fan-in:

// pipeline/processor/processors/processTextAndRegex.go:93-132
func ProcessTextAndRegex(page models.DownloadedPage) (matches []models.ProcessorResult) {
    if len(page.Contents) == 0 {
        return
    }
    start := time.Now()

    textChan := make(chan []models.ProcessorResult, 1)
    regexChan := make(chan []models.ProcessorResult, 1)

    go func() { textChan <- processTextFingerprints(page) }()
    go func() { regexChan <- processRegexFingerprints(page) }()

    textMatches := <-textChan
    regexMatches := <-regexChan

    matches = append(matches, textMatches...)
    matches = append(matches, regexMatches...)

    log.Info().
        Str("url", page.URL).
        Int("text_matches", len(textMatches)).
        Int("regex_matches", len(regexMatches)).
        Int("total_matches", len(matches)).
        Dur("duration", time.Since(start)).
        Msg("Page processing complete")
    return matches
}

The two passes are asymmetric in cost. Text fingerprints are strings.Contains checks — cheap. Regex fingerprints are compiled regexp evaluations — more expensive, sometimes much more. Running them concurrently means total wall time is max(text, regex) rather than text + regex, which for pages with a heavy regex tail is a meaningful win.

Two-tier batching

Matches accumulate in a slice. When the slice hits 600 records, we flush. When we flush, the GORM layer sub-batches to 100 records per INSERT. Two tiers:

// pipeline/processor/services/file_processor.go:355-370
if fatalErr == nil && len(batchMatches) >= batchSize {
    if err := exporters.SaveOnDB(batchMatches); err != nil {
        processingErrors = append(processingErrors,
            fmt.Sprintf("batch save error at line %d: %s", result.lineNum, err.Error()))
        fatalErr = fmt.Errorf("failed to save batch to database: %w", err)
        stopProcessing()
    } else {
        totalMatches += len(batchMatches)
        log.Info().Str("filename", filename).Int("batch_size", len(batchMatches)).
            Int("total_matches", totalMatches).Msg("Batch written to database")
        batchMatches = batchMatches[:0]    // keep capacity, reset length
    }
}

The inner batch size (100) exists because PostgreSQL has a hard parameter limit per prepared statement — 65,535. Nine columns per row times 100 rows is 900 parameters, well under the limit with plenty of headroom. The outer batch size (600) exists because flushing too often wastes round trips and flushing too rarely makes a crash expensive. Six-hundred was arrived at empirically and has not needed adjustment.
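The inner tier's arithmetic can be sketched directly (a simplified stand-in for the GORM sub-batching, with the 9-column assumption from the text):

```python
# Split a flush into INSERT-sized chunks that stay under PostgreSQL's
# 65,535-parameter limit for a prepared statement.
def insert_chunks(rows, cols: int = 9, param_limit: int = 65535,
                  chunk_size: int = 100):
    assert chunk_size * cols <= param_limit   # 900 params per INSERT here
    for i in range(0, len(rows), chunk_size):
        yield rows[i:i + chunk_size]
```

A 600-row outer flush therefore becomes six INSERTs of 100 rows each.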

Every INSERT is wrapped in a retry envelope — exponential backoff, capped at three attempts:

// pipeline/processor/database/circuit_breaker.go:29-70
func WithRetry(ctx context.Context, cfg RetryConfig, operation string, fn func() error) error {
    var lastErr error
    for attempt := 0; attempt <= cfg.MaxRetries; attempt++ {
        if attempt > 0 {
            delay := cfg.BaseDelay * time.Duration(1<<(attempt-1))
            if delay > cfg.MaxDelay {
                delay = cfg.MaxDelay
            }
            log.Warn().Str("operation", operation).
                Int("attempt", attempt).Dur("delay", delay).Err(lastErr).
                Msg("retrying_db_operation")
            select {
            case <-ctx.Done():
                return fmt.Errorf("context cancelled during retry of %s: %w", operation, ctx.Err())
            case <-time.After(delay):
            }
        }
        lastErr = fn()
        if lastErr == nil {
            if attempt > 0 {
                log.Info().Str("operation", operation).
                    Int("attempts", attempt+1).Msg("db_operation_recovered")
            }
            return nil
        }
    }
    return fmt.Errorf("%s failed after %d retries: %w", operation, cfg.MaxRetries, lastErr)
}

Default config: 3 retries at 5s / 10s / 20s backoff, 45s cap. The name circuit_breaker.go is aspirational — it's a retry envelope, not a breaker. Renaming it is on the list.
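The schedule is just capped exponential doubling; as a sanity check (an illustrative helper, not the Go code):

```python
# Capped exponential backoff: base 5s, doubling per attempt, 45s ceiling.
def backoff_delay(attempt: int, base: float = 5.0, cap: float = 45.0) -> float:
    return min(base * (2 ** (attempt - 1)), cap)
```

With only three retries the 45s cap is never reached, but it protects any future bump to the retry count.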

File-level idempotency

The processor must never double-process a file. Before we open a file, we compute its SHA256 and check a ledger table:

// pipeline/processor/services/file_processor.go:133-154
fileHash, err := computeFileHash(filename)
if err != nil {
    return nil, fmt.Errorf("failed to compute file hash: %w", err)
}
alreadyProcessed, err := fp.db.ProcessedFilesRepo().IsProcessed(ctx, fileHash)
if err != nil {
    log.Warn().Err(err).Str("filename", filename).
        Msg("Failed to check processed status, continuing with processing")
} else if alreadyProcessed {
    log.Warn().Str("filename", filename).Str("sha256", fileHash).
        Msg("File already processed, skipping")
    status := &models.FileProcessorStatus{
        Filename:   filename,
        DurationMs: time.Since(startTime).Milliseconds(),
        Errors:     []string{"skipped: already processed"},
    }
    return status, nil
}

Marking is idempotent via ON CONFLICT DO NOTHING. This means we can restart the processor in the middle of a run, point it at the same directory, and it will skip everything that already made it through. Combined with the atomic crawler checkpoint, the entire pipeline has clean resume semantics without any distributed coordination primitives.
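The check reduces to "hash the file, look it up, record it". A minimal Python sketch, with an in-memory set standing in for the ledger table:

```python
import hashlib

def file_sha256(path: str, chunk: int = 1 << 20) -> str:
    # Stream in 1 MB chunks so large JSONL files don't load into memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def should_process(path: str, ledger: set) -> bool:
    digest = file_sha256(path)
    if digest in ledger:
        return False      # already processed: skip, never double-insert
    ledger.add(digest)    # real pipeline: INSERT ... ON CONFLICT DO NOTHING
    return True
```

Hashing contents rather than filenames is what makes the ledger survive renames and re-downloads.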

Why the processor image is 7 MB

V1's equivalent processor image weighed 4.3 GB. The current one is scratch-based and weighs about 7 MB. Three changes made that possible: removing CGO from the dependency tree (no gopostal in the processor — address parsing is the exporter's job, not ours), pinning the GORM Postgres driver to a pure-Go version, and deleting the RabbitMQ and SQS code paths we weren't using anymore. The README is explicit about it: no queue system, processing is file-based only. The binary cold-starts in tens of milliseconds, which matters because we boot a fresh container per run.

Four strategies in confidence order. A Random Forest for the long tail.

Technographic signals are only useful if they can be attached to accounts a customer's sales team actually owns. That means resolving duplicate records across two datasets — a master accounts table and a partner's CRM — where the same organization appears under different names, partial addresses, missing phone numbers, and typos. This is account matching, and it's the kind of problem where off-the-shelf solutions mostly work and then don't work in ways that cost you real dollars.

V2 of our matcher is a sequential cascade of four strategies, each progressively fuzzier than the last. Each strategy only runs on accounts that the previous strategies didn't already claim. The winner for any given account is the highest-confidence match produced by any strategy.

INPUTS master + client accounts PREPROCESS normalize · phonetic phone → 10 digits · name lower-alnum STRATEGY 1 · conf 0.95 ExactPhoneStrategy hash join on normalized_phone O(n+m) STRATEGY 2 · conf 0.90 ExactNameLocation state-block + exact name tiered location (street → city+zip → city) STRATEGY 3 · conf ≤0.85 FuzzyNameAddress state ThreadPool(max=8) RapidFuzz composite STRATEGY 4 · conf ≤0.90 MLMatching (Random Forest) 18 features · n_est=100 · depth=10 predict_proba ≥ 0.7 CONSOLIDATOR highest-confidence wins per-client dedup · no voting matched_pairs · score · method unclaimed accounts forward only
The cascade: cheap and high-confidence first, expensive and fuzzy last. Each stage only sees accounts the previous stages couldn't claim.

The orchestrator is straightforward — iterate strategies in order, each one filters out already-matched clients, pass the remainder forward:

# python/acct-matcher-v2/improved/account_matcher/matcher.py:85-150 (excerpt)
def find_matches(self, master_accounts, client_accounts,
                 strategy_order=None, parallel=True,
                 confidence_threshold=0.7, **kwargs):
    if strategy_order is None:
        strategy_order = [
            MatchMethod.EXACT_PHONE,
            MatchMethod.EXACT_NAME_LOCATION,
            MatchMethod.FUZZY_NAME_ADDRESS,
            MatchMethod.ML_BASED,
        ]
    all_matches = []
    matched_client_ids = set()
    for method in strategy_order:
        remaining_clients = [acc for acc in client_processed
                             if acc.id not in matched_client_ids]
        if not remaining_clients:
            break
        strategy_matches = self.strategies[method].find_matches(
            master_processed, remaining_clients, **kwargs)
        quality_matches = [m for m in strategy_matches
                           if m.confidence >= confidence_threshold]
        all_matches.extend(quality_matches)
        matched_client_ids.update(m.client_account_id for m in quality_matches)
    return self._consolidate_matches(all_matches)

The four strategies

1. Exact phone. A hash join on normalized ten-digit phone. If two records share a phone number, they're the same organization with extremely high probability. Score 100, confidence 0.95. Complexity is O(n+m). This catches a meaningful chunk of the dataset for almost free.
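The hash join itself is a few lines. A sketch over (id, phone) pairs, using the 0.95 confidence from above (the real strategy operates on full account objects):

```python
from collections import defaultdict

# O(n+m): one build pass over master, one probe pass over client.
def exact_phone_matches(master, client):
    by_phone = defaultdict(list)
    for acc_id, phone in master:
        if phone:                       # skip records with no phone
            by_phone[phone].append(acc_id)
    matches = []
    for acc_id, phone in client:
        for master_id in by_phone.get(phone, ()):
            matches.append((master_id, acc_id, 0.95))
    return matches
```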

2. Exact name + location. State-blocked exact match on normalized name, then a tiered location match — street first, then city+zip, then city alone. Confidence 0.90. The tiered fallback matters because partner data quality varies; a record might have a street but no zip, or a zip but no street.
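The tiered fallback can be sketched like this (field names are assumptions, not the real schema):

```python
# Sketch of Strategy 2's location tiers: try the strongest evidence first,
# degrade gracefully when a field is missing from either record.
def location_tier(a, b):
    """Return the strongest location tier two records agree on, or None."""
    if a.get("street") and a.get("street") == b.get("street"):
        return "street"
    if (a.get("city") and a.get("zip")
            and a.get("city") == b.get("city") and a.get("zip") == b.get("zip")):
        return "city+zip"
    if a.get("city") and a.get("city") == b.get("city"):
        return "city"
    return None
```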

3. Fuzzy name + address. State-blocked RapidFuzz composite score. This is where things get interesting and slow, so we thread-pool it by state. Each state runs on its own worker, up to 8 concurrent:

# python/acct-matcher-v2/improved/account_matcher/strategies/fuzzy_name_address.py:82-98
num_workers = min(len(common_states), kwargs.get('max_workers', 8))
with ThreadPoolExecutor(max_workers=num_workers) as executor:
    future_to_state = {
        executor.submit(process_state, state): state
        for state in common_states
    }
    for future in as_completed(future_to_state):
        state_matches = future.result()
        all_matches.extend(state_matches)

The per-state inner loop uses early termination — if name similarity is below 50, we don't bother computing street similarity — and an early exit once the composite score crosses 95:

# python/acct-matcher-v2/improved/account_matcher/strategies/fuzzy_name_address.py:103-158
def _find_best_match_for_client(self, client_acc, master_accounts, min_score):
    best_match = None
    best_score = 0.0
    for master_acc in master_accounts:
        if not master_acc.name:
            continue
        name_score = fuzz.ratio(client_acc.name.lower(), master_acc.name.lower())
        if name_score < 50:               # early kill
            continue
        city_bonus = 15 if (client_acc.city and master_acc.city
                            and client_acc.city.lower() == master_acc.city.lower()) else 0
        zip_bonus  = 15 if (client_acc.zip_prefix and master_acc.zip_prefix
                            and client_acc.zip_prefix == master_acc.zip_prefix) else 0
        street_score = 0
        if client_acc.street and master_acc.street:
            street_score = fuzz.partial_ratio(client_acc.street.lower(),
                                              master_acc.street.lower()) * 0.2
        total_score = name_score + city_bonus + zip_bonus + street_score
        if total_score >= min_score and total_score > best_score:
            best_score = total_score
            best_match = MatchResult(...)
            if total_score >= 95:         # near-perfect, stop scanning
                break
    return best_match

4. Random Forest. For accounts still unclaimed after three rounds of deterministic matching, we extract 18 features per candidate pair, scale them, and run a trained Random Forest classifier. A match is accepted if predict_proba ≥ min_score / 100, capped at 0.90 confidence because we don't want the ML-derived score to outrank a deterministic exact-phone match:

# python/acct-matcher-v2/improved/account_matcher/strategies/ml_matching.py:25-87
def find_matches(self, master_accounts, client_accounts, min_score=70.0, **kwargs):
    if not self.is_trained:
        return self._heuristic_matching(master_accounts, client_accounts, min_score)
    candidate_pairs = self._generate_candidate_pairs(master_accounts, client_accounts)
    features = self._extract_features_batch(candidate_pairs)
    feature_matrix = self.scaler.transform(features)
    probabilities = self.model.predict_proba(feature_matrix)[:, 1]
    matches = []
    for i, (master_acc, client_acc) in enumerate(candidate_pairs):
        prob = probabilities[i]
        if prob * 100 >= min_score:
            matches.append(MatchResult(
                master_account_id=master_acc.id,
                client_account_id=client_acc.id,
                score=prob * 100,
                method=self.method,
                confidence=min(prob, 0.90),
                details={'ml_probability': float(prob)},
            ))
    return matches

The 18 features

Feature design is where account matching lives or dies. Ours has a specific structure: five continuous name-similarity metrics (different algorithms catch different failure modes), seven binary location/phone/phonetic indicators, and six scalar structural agreements. Each one exists because an earlier version of the model was wrong in a specific way.

 #      Feature                              Type     Why it's there
 1      fuzz.ratio on names                  float    Overall string similarity
 2      fuzz.partial_ratio on names          float    Catches substring matches ("First Baptist" vs "First Baptist Church of Memphis")
 3      fuzz.token_sort_ratio                float    Order-invariant ("Memphis Baptist" vs "Baptist Memphis")
 4      fuzz.token_set_ratio                 float    Bag-of-words; robust to missing tokens
 5      Levenshtein.ratio                    float    Character-level edit distance, for typos
 6–8    City / state / zip-prefix exact      binary   Hard location anchors
 9      Street partial_ratio                 float    "123 Main St" vs "123 Main Street, Suite 400"
 10     Phone exact                          binary   Cheap strong signal (survives strategy 1's miss)
 11     Metaphone phonetic match             binary   "Katherine" ≈ "Catherine"
 12     Name-length diff (normalized)        float    Huge disparity is a mismatch signal
 13     Name word-count diff (normalized)    float    Same intent as 12, at token granularity
 14–18  Structural agreement (has_* flags)   binary   Does one record have a website and the other not? Records with very different completeness profiles are suspicious
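Features 12–18 are cheap enough to sketch in pure stdlib Python (function and field names here are illustrative, not the production extractor):

```python
# Sketch of the length-diff and structural-agreement features. Each has_*
# flag is 1.0 when both records agree on having (or lacking) the field.
def structural_features(a, b):
    feats = {}
    la, lb = len(a.get("name", "")), len(b.get("name", ""))
    feats["name_len_diff"] = abs(la - lb) / max(la, lb, 1)
    wa, wb = len(a.get("name", "").split()), len(b.get("name", "").split())
    feats["name_words_diff"] = abs(wa - wb) / max(wa, wb, 1)
    for field in ("website", "phone", "street", "zip", "city"):
        feats[f"has_{field}_agree"] = float(bool(a.get(field)) == bool(b.get(field)))
    return feats
```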

The model itself is small and regularized on purpose:

# python/acct-matcher-v2/improved/account_matcher/strategies/ml_matching.py:213-258
def train_model(self, training_data):
    features, labels = [], []
    for master_acc, client_acc, is_match in training_data:
        features.append(self._extract_pair_features(master_acc, client_acc))
        labels.append(1 if is_match else 0)
    X = np.array(features); y = np.array(labels)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    X_train_scaled = self.scaler.fit_transform(X_train)
    X_test_scaled  = self.scaler.transform(X_test)
    self.model = RandomForestClassifier(
        n_estimators=100, max_depth=10,
        min_samples_split=5, min_samples_leaf=2, random_state=42,
    )
    self.model.fit(X_train_scaled, y_train)

100 trees, depth 10, min-samples-leaf 2. Enough capacity to learn the interactions between our 18 features, not enough to overfit the hand-labeled training set. Test accuracy lands at 92% on an 80/20 split.

Where the 73% speedup and 60% memory drop came from

The public numbers on the site compare v2 to v1 of the matcher specifically. V1 lived at python/acct-matcher/ and had a simpler architecture — fewer strategies, no state blocking, single DB connection, load-everything-into-memory. In rough order of contribution to the speedup:

  1. State blocking instead of global cartesian compare. V1 compared every client account to every master account. V2 bucketizes both sides by state upfront and only compares within a bucket. For a 250k × 250k input distributed across 50 states, this turns one quadratic into fifty small ones. This is the bulk of the 73%.
  2. Sequential strategy chain with early exit. V1 ran every strategy on every account and reconciled after. V2 removes already-matched clients from the input to each subsequent strategy, so the expensive strategies only run on the hard cases.
  3. Threaded connection pool. V1 had one DB connection. V2 uses psycopg2.pool.ThreadedConnectionPool(min=1, max=20), which lets the fuzzy strategy's thread pool actually get parallel I/O.
  4. LRU cache for normalized strings and feature vectors — 10,000-entry cap in database/cache.py. Memory stays bounded; intermediate results get reused. This is most of the 60% memory reduction.
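The idea behind the bounded normalization cache can be sketched with functools.lru_cache (the real cache in database/cache.py may be hand-rolled; only the 10,000-entry cap matches the prose):

```python
import functools
import re

# Hard cap keeps memory bounded; hot strings get normalized exactly once.
@functools.lru_cache(maxsize=10_000)
def normalize_name(raw: str) -> str:
    return re.sub(r"[^a-z0-9]", "", raw.lower())
```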

What "92% accuracy" means specifically, because the number deserves to be pinned down: it is the held-out test-set accuracy logged by model.score(X_test_scaled, y_test) on a deterministic 80/20 split of hand-labeled (master, client, is_match) triples. It is not precision or recall individually — those are traced separately and are close to balanced given the regularization. The labels themselves are a curated set derived from prior resolved matches plus known overlaps; label quality is the ceiling on this number and we know it.
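For concreteness, here is how accuracy, precision, and recall diverge on the same held-out predictions; this is a generic stdlib sketch, not our tracing code:

```python
# Counts TP/FP/FN/TN from parallel label lists, then derives the three
# metrics so the arithmetic behind each number is explicit.
def classification_report(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }
```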

One script, five stages, three alert tiers.

Everything in v2 is designed so that the on-call answer to "how is the monthly run going?" is a single command and the on-call answer to "something is wrong" is a single Pushover notification with the stage, the run ID, and a log path.

The orchestrator

pipeline.sh owns: argument parsing, prerequisite validation, run-ID assignment, per-stage Docker invocation, status-file writing, log rotation, and alert dispatch. The five stages run in fixed order: crawl → process → aggregate → archive_processor_results → export. Each stage writes under a directory keyed by the run ID:

# pipeline/pipeline.sh (excerpt)
RUN_ID="$(date +%Y-%m-%d)-monthly"
CRAWL_DIR="${DATA_DIR}/crawl/${RUN_ID}"
PROCESS_DIR="${DATA_DIR}/processing/${RUN_ID}"
EXPORT_DIR="${DATA_DIR}/export/${RUN_ID}"
LOG_DIR="${DATA_DIR}/logs/${RUN_ID}"
STATUS_DIR="${DATA_DIR}/status"
STATUS_FILE="${STATUS_DIR}/current_run.json"

Everything downstream of pipeline.sh — the status viewer, the retry-failed mode, the cron installer — navigates this directory layout as its contract.

Dry-run mode

Running with --dry-run exercises the entire prerequisite check path — DSN exported, client-id provided, input file exists, all three Docker images present locally, timeout values sane — and exits zero without running any stage. This is what goes into the cron line before the cron fires for real. It catches the "forgot to pull latest images" class of problem before 2 AM.

Retry-failed

The crawler writes a failed_urls.csv at the end of each run containing the URLs it couldn't process. The pipeline supports --retry-failed, which locates the most recent previous run's failed-URLs file and swaps it in as the next run's input:

# pipeline/pipeline.sh:400-408
if [[ "${RETRY_FAILED}" == "true" ]]; then
    local prev_failed="${DATA_DIR}/crawl/$(ls -t "${DATA_DIR}/crawl/" 2>/dev/null | head -1)/failed_urls.csv"
    if [[ -f "${prev_failed}" ]]; then
        actual_input="${prev_failed}"
        log_info "retry_mode" "\"failed_urls_file\":\"${prev_failed}\""
    else
        log_warn "No previous failed_urls.csv found, using original input"
    fi
fi

No queue. No dead-letter exchange. A file. Which is, in every way that matters, what a dead-letter exchange was pretending to be.

Alerting

Three priority tiers, implemented as a thin Pushover wrapper:

# pipeline/notify.sh:1-63 (excerpt)
send_notification() {
    local title="${1:?Title required}"
    local message="${2:?Message required}"
    local priority="${3:-0}"
    local retry="${4:-60}"
    local expire="${5:-3600}"
    if [[ "${NOTIFICATIONS_ENABLED:-true}" != "true" ]]; then
        echo "[NOTIFY] (disabled) ${title}: ${message}"
        return 0
    fi
    ...
    local payload=(
        -F "token=${PUSHOVER_APP_TOKEN}"
        -F "user=${PUSHOVER_USER_KEY}"
        -F "title=${title}"
        -F "message=${message}"
        -F "priority=${priority}"
    )
    if [[ "${priority}" == "2" ]]; then
        payload+=(-F "retry=${retry}" -F "expire=${expire}")
    fi
    curl -s -w "\n%{http_code}" "${payload[@]}" https://api.pushover.net/1/messages.json
}

notify_normal()    { send_notification "$1" "$2" 0; }
notify_high()      { send_notification "$1" "$2" 1; }
notify_emergency() { send_notification "$1" "$2" 2; }

Normal goes on stage transitions. High goes when the crawler's failure rate crosses 15%. Emergency — which retries every 60 seconds until I acknowledge it — goes when the failure rate crosses 50% and the pipeline aborts, or when prereq validation fails. I have been woken up twice in two years by an emergency alert. Both times the alert was correct.

Status

status.sh reads the current run's JSON status file and the per-shard crawl_status.json files, computing per-stage badges (✓ ⟳ ○ ✗), aggregating across shards, and drawing an ASCII summary:

Sample status output

═══════════════════════════════════════════════════════
  Pipeline Run: 2026-04-10-monthly
  Started:     2026-04-10 02:00:04 UTC
  Current:     process (3 of 5)
═══════════════════════════════════════════════════════
  [✓] crawl     1h42m · 515,204 urls · 2.1% fail
  [⟳] process   0h38m · 47 / 63 files · 1,240,891 matches
  [○] aggregate
  [○] archive
  [○] export

Compared to v1's four-dashboard cognitive JOIN, this is a small, concrete thing. Which is the point.

The aggregate upsert

After processing writes individual match rows to processor_results, the aggregate stage folds them into website_technology_findings, which is what the exporter and downstream queries actually use. One upsert, keyed on (website_id, fingerprint_id, found_month):

-- pipeline/refresh_aggregate.sql (excerpt)
INSERT INTO website_technology_findings (...)
SELECT
    pr.website_id,
    pr.fingerprint_matched_id AS fingerprint_id,
    date_trunc('month', COALESCE(pr.found_at, pr.created_at))::date AS found_month,
    ...
    COUNT(*)::int AS match_count,
    MIN(COALESCE(pr.found_at, pr.created_at)) AS first_seen_at,
    ...
FROM processor_results pr
JOIN websites     w ON w.id = pr.website_id
JOIN fingerprints f ON f.id = pr.fingerprint_matched_id
JOIN signals      s ON s.id = f.signal_id
WHERE pr.deleted_at IS NULL
  AND s.type = 'Technology' AND s.enabled = true
  AND f.enabled = true
  AND COALESCE(pr.found_at, pr.created_at) >= :since_date::timestamp
GROUP BY pr.website_id, pr.fingerprint_matched_id, date_trunc('month', ...)
ON CONFLICT (website_id, fingerprint_id, found_month)
DO UPDATE SET
    match_count   = EXCLUDED.match_count,
    first_seen_at = LEAST(website_technology_findings.first_seen_at, EXCLUDED.first_seen_at),
    updated_at    = NOW();

The LEAST() on first_seen_at is deliberate: a technology's "first detected" date should only move backward as we discover earlier evidence, never forward. That column anchors cohort analysis downstream — churn curves and adoption timelines — and a clock that moves the wrong way corrupts everything built on top of it.

Four specific things that broke.

The poisoned-context tabs

Early versions of the crawler used one Playwright BrowserContext per worker and opened multiple Page objects inside it — tabs_per_worker > 1, contexts shared. Synthetic benchmarks loved it. Production did not.

The symptom was a worker silently dropping to zero throughput while reporting healthy. A page in tab 0 was stuck mid-goto(), holding the context's internal lock. Tabs 1–3 in the same context queued behind it, forever. browser.is_connected() returned True. Nothing in the logs. You could stare at ps and see a worker taking no CPU, consuming no bandwidth, doing nothing, and know from aggregate throughput that something was wrong, but never be able to point at it.

The fix was structural: one context per slot, not one context per worker. A hung page now only affects its own slot, and _close_slot_locked's 5-second bound on page.close() and context.close() means the worst case is losing one slot for five seconds, not losing all its siblings forever. Slots carry the parent browser's session_id so a browser restart invalidates exactly the right contexts — no leaked handles.

The other half of the fix was the RSS killer. Even with isolated contexts, some pages pin memory catastrophically — streaming video elements, runaway animation frames, whatever. We poll /proc, find any Playwright child process over 2 GB RSS, and SIGKILL it. The kill raises BrowserCrashError, which trips the same restart path as a real crash. Nothing graceful; the only good answer to a memory leak you don't control is a well-placed SIGKILL.
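A simplified version of that check, parsing the VmRSS line from a /proc/&lt;pid&gt;/status-style text (the real killer also walks the Playwright process tree, which this sketch omits):

```python
import os
import signal

RSS_LIMIT_KB = 2 * 1024 * 1024  # 2 GB, in the kB units /proc reports

def rss_kb(status_text: str) -> int:
    """Parse the resident set size from /proc/<pid>/status contents."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])  # e.g. "VmRSS:  2201234 kB"
    return 0

def should_kill(status_text: str) -> bool:
    return rss_kb(status_text) > RSS_LIMIT_KB

def kill_if_bloated(pid: int) -> bool:
    """Read the live /proc entry and SIGKILL if over the limit."""
    try:
        with open(f"/proc/{pid}/status") as f:
            if should_kill(f.read()):
                os.kill(pid, signal.SIGKILL)  # nothing graceful, on purpose
                return True
    except FileNotFoundError:
        pass  # process already exited between poll and read
    return False
```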

Resume that was actually re-emit

The first checkpoint implementation persisted last_processed_line only on graceful shutdown. Naturally, we didn't always shut down gracefully. After a Docker OOM-kill mid-run, resuming would re-fetch the last couple hundred URLs, which the processor would happily match against its fingerprints again and INSERT a second time. The underlying findings were correct, but every affected row had a match_count inflated by some quiet factor we couldn't easily audit.

Two things had to change. First, record_completion() had to atomically advance counters, write the JSONL output line, decide whether to checkpoint, and persist output-file-rotation state — all under a lock, because a crash in the middle of any of these would otherwise orphan some combination of them. Second, the checkpoint had to track completed_lines_pending — the set of lines that finished out of order and are held until the contiguous front catches up — so we never claim line 100 is done while line 99 is still in flight.

Idempotency at the next stage is the belt to this suspenders: the processor's SHA256 ledger catches any file that does make it through twice. Defense in depth for a boring problem that wasn't boring at 3 AM on run day.
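The contiguous-front bookkeeping is small enough to sketch (class and method names here are illustrative, not the production crawler's):

```python
# Out-of-order completions are held in `pending` until the contiguous
# front catches up, so the checkpoint never claims line N is done while
# line N-1 is still in flight.
class Checkpoint:
    def __init__(self, last_done: int = 0):
        self.last_contiguous = last_done  # safe resume point
        self.pending = set()              # finished, but ahead of the front

    def record_completion(self, line_no: int) -> int:
        self.pending.add(line_no)
        # advance the front across every contiguously finished line
        while self.last_contiguous + 1 in self.pending:
            self.last_contiguous += 1
            self.pending.discard(self.last_contiguous)
        return self.last_contiguous
```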

The 4.3-GB processor image

The pre-monolith processor image was 4.3 GB. It carried glibc, CGO extensions, Debian's kitchen sink, and a base image that had picked up weight at every unrelated dependency bump for two years. Pulls took minutes. Cold-start was measured in seconds. Deploys were an event.

The current processor is a multi-stage Dockerfile producing a scratch-based binary at about 7 MB. Three changes got us there: removing CGO from the dependency tree (gopostal is an address-parsing library we only needed in the exporter — the processor has no business with it), pinning the GORM Postgres driver to a pure-Go version, and deliberately deleting the RabbitMQ/SQS code paths we weren't using anymore. The README is explicit: no queue system, processing is file-based only.

Cold-start now takes tens of milliseconds, and the image is nearly three orders of magnitude smaller than the one it replaced. Both of those matter because we boot a fresh container per run — not because we had to, but because it removes an entire class of "did the last run leave residue" questions from my ops mental model.

AutoMigrate in production

Both the exporter and processor originally called GORM's AutoMigrate at boot. In dev, this is helpful — your model structs and your database schema converge automatically. In production, when the GORM Postgres driver versions drift between services (separate Go modules, separate go.sums), AutoMigrate can issue subtly different DDL depending on which container wins the startup race. We watched a column flip between text and varchar over consecutive runs, then flip back, while two services fought each other over whose idea of the schema was correct.

The fix was to remove AutoMigrate from both services entirely. Schema is now owned by explicit, checked-in migrations applied before any service starts. The driver versions were pinned and aligned across both services. Startup got faster as a side effect — a two-minute boot-time migration probe became a sub-second connection ping — and the "whose migration ran last" question stopped being a production concern.

The lesson here is not that AutoMigrate is bad. It's that a convenience that writes to production state on service boot will eventually bite you, and the right answer is to remove the convenience rather than to work around it. This is a pattern that recurs across the v1-to-v2 transition more broadly: tools designed for development velocity ("Kubernetes deploys your service automatically," "RabbitMQ buffers your work queue automatically," "AutoMigrate fixes your schema automatically") have production costs that are easy to discount and hard to reverse once they've accumulated.

If you were building this, here's what I'd tell you.

I want to end on the things that transferred out of this project. Not the implementation details — those are specific to technographic pipelines and mostly worthless to anyone not building one — but the engineering judgment, which I think is portable.

Match your primitives to your workload

A queue is a tool for a specific shape of problem: variable producer/consumer rates, heterogeneous consumers, cross-service decoupling. If your problem doesn't have that shape, a queue is not architecture. It's a place for your state to hide. Same goes for Kubernetes, same goes for microservices, same goes for observability stacks. Every one of these is the right answer to some problem. The mistake is not choosing them; the mistake is choosing them before you know which problem you have.

Files are a legitimate interface

POSIX gives you atomic renames, hierarchical namespacing, access time metadata, and cp/mv/ls. That's a surprisingly complete set of primitives for coordinating batch work. The directory-per-run-id layout, the SHA256 ledger, the failed-urls file that feeds back into the next run — all of it is files. Boring, inspectable, debuggable with ls and cat. A file on disk in 2026 has the same semantics it had in 1996, which is more than you can say about most of our infrastructure.
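The write-temp-then-rename idiom that paragraph leans on looks like this in Python (a generic sketch, not our pipeline code):

```python
import os
import tempfile

def atomic_write(path: str, data: bytes) -> None:
    """Write to a temp file in the same directory, fsync, then rename.
    Readers see the old file or the new one, never a half-written file."""
    dirname = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname)  # same filesystem, so rename is atomic
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # durable before the rename, not after
        os.replace(tmp, path)     # atomic replace on POSIX
    except BaseException:
        os.unlink(tmp)            # never leave half-written temp files behind
        raise
```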

Idempotency in depth

The pipeline has at least three independent idempotency mechanisms — atomic crawler checkpoints, SHA256 file ledger in the processor, ON CONFLICT on the aggregate upsert. Any one of them, working correctly, would prevent double-processing. All three working correctly means we can crash or rerun at any point without producing duplicate data. This is defense in depth for a problem where the cost of a single duplicate is a silently-inflated metric that takes three months to detect.

The ops surface is part of the design

Most architectural diagrams don't include the operator. They should. If your system requires an expert to run — four dashboards, three control planes, five mental models — it will be run badly, and not by you, because you'll leave. V2's ops surface is pipeline.sh and status.sh. Anyone who can read bash can run it. Anyone who can't read bash has no business running it. Either way, the handoff is clean.

Tear things down

The most effective engineering move I made on this project was deleting 4 GB of Docker image and an entire RabbitMQ-driven deployment topology. Nothing I built was as useful as what I removed. This is embarrassingly easy to forget; deletion doesn't make for good demos, and nobody puts "removed the queue system" on a resume. But the pipeline is reliable because it's simple, and it's simple because I spent six months convincing myself I was allowed to throw things away.

Nothing I built was as useful as what I removed.

If you're mid-rebuild and wondering whether the pattern you're about to adopt is the right one, the test is not whether it's fashionable and not whether it scales. The test is whether the workload has the shape that pattern was designed for. If yes, use it. If no, the boring version is almost certainly correct, and the hard part is believing yourself.

If you're building something like this, I can help.

I work with mid-market teams on production AI systems — auditing what's fragile, rebuilding what's wrong, and building new pipelines with the boring rigor they actually need. Case studies are the easy part; the conversation is where the real work starts.