Ravel's technographic pipeline crawls websites, fingerprints technology signals, resolves accounts across enterprise partner datasets, and delivers structured findings on a monthly SLA. This is the architecture and the operational reality — including the parts where we tore down the distributed system we'd carefully built and replaced it with a single bash script.
Enterprise SaaS vendors have a specific and expensive question they need answered every month: across our entire addressable market — hundreds of thousands of organizations — which of them are running which technologies right now, and which of them just added a second vendor in a category where they already have us. That second question is the one that matters. Dual-vendor deployments are the leading indicator of customer churn. By the time sales knows a customer is evaluating a competitor, it's too late. By the time the technographic pipeline flags it, you have a quarter to get in front of it.
Ravel's job was to answer that question at enterprise scale and deliver the answer on a monthly cadence, in clean tabular form, to S3 buckets owned by the partners. The output is dull. The input is not.
The addressable market for our primary partners was roughly 515,000 organizations. Each organization has a website — and "has a website" is doing a tremendous amount of work in that sentence. Some are static HTML served out of 2008. Some are React single-page apps with twelve seconds of client-side hydration before anything meaningful renders. Some are WordPress with twenty plugins fighting each other. Some redirect to a Facebook page. Some redirect to themselves in a loop. Some respond 500 with a 24-byte body for reasons nobody will ever explain. All of them need to be crawled, all of them need their content passed through a thousand-plus technology fingerprints, and all of the findings need to land in the right partner's bucket with the right account IDs attached by the tenth of the month.
The crawler wasn't the hard part. The hard part was building a reliable, resumable, multi-stage pipeline that could handle the messy reality of websites that break in every way imaginable, while a partner's data engineering team waits for their S3 drop on the tenth of the month.
I was the co-founder and CTO. I wrote the first version of this pipeline, and I wrote the second version after we threw out the first one. What follows is an honest account of both — because the decision to collapse a distributed queue-driven architecture back into a monolithic batch pipeline is the most interesting engineering decision in the whole system, and nobody writes about deletions.
V1 looked like what a thoughtful engineer in 2022 would draw on a whiteboard if you gave them this problem. A Python async crawler running in Kubernetes, backed by RabbitMQ with prefetch-based concurrency control and a three-state circuit breaker. A Go signal processor also driven by RabbitMQ, with fsnotify as a fallback for file-based inputs. A separate Go exporter. A data-pipeline fan-out layer that pushed crawled pages to dedicated worker types — basic info, leadership, ministry, media — each on their own queue, each with their own scaling knob. Prometheus, Grafana, health endpoints, an alert system with pluggable Slack and email channels, Kubernetes manifests from namespace down to CronJob. A Makefile that could scale individual worker classes independently.
It worked. It was observable. It had clean seams between services. It was, by every shibboleth of modern distributed system design, correct.
It was also the wrong system for the problem.
The v1 crawler had genuinely clever pieces. The browser pool kept one Playwright context per domain for session persistence, with a three-tier cleanup that evicted the oldest third at 15 contexts and did emergency eviction at 20, plus time-based cleanup of anything over ten minutes old. The retry classifier distinguished permanent failures (404, 403, 410, 451 — never retry) from transient ones (502, 503, 504, 429 — always retry) and had learned edge cases like "HTTP 500 with a response body under 100 bytes is a broken server, not a transient error." The SmartCrawlerService tried a cheap HTTP GET first and only fell back to a full browser when it detected a single-page app, which it did by counting <script> tags against visible text and sniffing for __NEXT_DATA__ and similar markers. These were good ideas. They survived into v2.
What didn't survive was the topology.
There's a specific temptation in systems design that I want to name carefully, because I've watched smart engineers fall into it more than once and I fell into it myself on this project. The temptation is to reach for distributed primitives — queues, service boundaries, horizontal scaling, independent deployment — before you've established that the workload actually has the shape those primitives are designed for.
Queues solve three kinds of problems well: smoothing bursty or unpredictable arrival rates, fanning work out to heterogeneous consumers that scale independently, and decoupling components that deploy and fail on independent lifecycles.
Our workload had none of these properties. The input was a CSV of 500,000 URLs known at the start of the month. The processing rate was bounded by crawler concurrency, not by arrival rate. The downstream processor was a single worker class applying the same fingerprint matching to every page. There was exactly one deployment: the whole pipeline, once a month, end to end. The queue wasn't buffering variance — there was no variance. It was just a place for our state to hide.
Every distributed-system problem I was solving was a problem my architecture had invented.
And hiding state in a queue costs you things — like the `start_consuming_with_retries` wrappers with exponential backoff around every consumer. These wrappers were themselves a source of bugs; I'll tell that story in the war stories section.

The v2 design started from a deliberately boring premise: this is a monthly batch job on a known input, with explicit stages, writing to a local filesystem between stages. If you wrote that sentence on a whiteboard in 1998, the right answer would have been a bash script that ran five programs in sequence. In 2026, with twenty-eight years of operational pattern-hoarding between then and now, the right answer is still a bash script that runs five programs in sequence. The only thing that changed is that now we have atomic file renames, SHA256 idempotency ledgers, Pushover for alerts, and Docker to isolate the binaries.
So that's what v2 is.
V2 is a single bash script — pipeline.sh — that orchestrates five stages in strict order. Each stage reads from a known directory and writes to a known directory. The run is keyed by a single identifier, RUN_ID=$(date +%Y-%m-%d)-monthly, which becomes the root of every artifact directory. State is files. Progress is contiguous line numbers in checkpoint JSON. Idempotency is SHA256 hashes in a ledger table. Alerting is Pushover with three priority tiers.
The total operational surface for running a monthly pipeline in v2 is:
- Run `pipeline.sh`. Or let the cron fire it on the tenth.
- Watch `status.sh` if you want a live view, or don't, because Pushover will tell you what matters.

That's it. There is no dashboard. There is no Kubernetes. There is no message broker. There is a script, five binaries, and a SQL database.
The rest of this case study is about the three stages that carry actual algorithmic weight — the crawler, the processor, and the ML account matcher — and the operational tooling that makes the whole thing go. I'll try to resist the urge to just dump every interesting detail; I'll focus on the ones that exist because a specific thing broke, or because a specific tradeoff mattered.
The crawler's job: take a CSV of (website_id, url) pairs, visit each URL with a real browser, extract text and headers, and write structured JSON. Everything else is defense.
A worker is an asyncio coroutine. Each worker hosts N slots. Each slot owns exactly one Playwright BrowserContext and one Page. Effective per-shard concurrency is CRAWL_CONCURRENCY × CRAWL_TABS_PER_WORKER, defaulted to 20 × 1 — deliberately conservative, because the per-slot isolation is what buys us the resilience we actually need.
# pipeline/crawler/crawler/crawler.py:223-234
async def _worker_loop(self, worker_id: int) -> None:
"""Worker coroutine hosting one context and multiple tab slots."""
worker = Worker(worker_id, self.config, self.browser_manager)
try:
slots = max(1, self.config.tabs_per_worker)
slot_tasks = [
asyncio.create_task(self._worker_slot_loop(worker, worker_id, slot_id))
for slot_id in range(slots)
]
await asyncio.gather(*slot_tasks)
finally:
await worker.close()
The worker is a nursery for its slot tasks. It doesn't touch pages itself.
Why one Playwright context per slot instead of one per worker? Because of an incident in v1, which I'll cover in §9. The short version: when multiple tabs share a context, a single poisoned page can hold the context's lock forever, and its siblings queue behind it silently. The watchdog sees browser.is_connected() == True and reports healthy. Throughput goes to zero. Nobody notices for an hour.
Isolated contexts mean a hung page can be torn down in 5 seconds — _close_slot_locked wraps page.close() and context.close() in asyncio.wait_for(timeout=5) — without touching siblings. Each slot tracks its parent browser's session_id, so when the browser restarts, stale slots invalidate their contexts automatically:
# pipeline/crawler/crawler/worker.py:141-174
async def _ensure_page(self, slot_id: int):
    """Ensure worker has a valid page/tab tied to current browser session."""
    killed = self.browser_manager.enforce_playwright_memory_limit()
    if killed > 0:
        raise BrowserCrashError(f"killed {killed} oversized playwright process(es)")
    if not self.browser_manager.is_healthy():
        raise BrowserCrashError("Browser is not connected")
    async with self._slot_lock(slot_id):
        if self._slot_session_ids.get(slot_id) != self.browser_manager.session_id:
            await self._close_slot_locked(slot_id)
        page = self._pages.get(slot_id)
        if page is not None:
            try:
                if not page.is_closed():
                    return page
            except Exception:
                self._pages.pop(slot_id, None)
        context = self._contexts.get(slot_id)
        if context is None:
            context = await self.browser_manager.create_context()
            self._contexts[slot_id] = context
            self._slot_session_ids[slot_id] = self.browser_manager.session_id
        page = await context.new_page()
        self._pages[slot_id] = page
        return page
Resume after a crash is only useful if the checkpoint is trustworthy. Ours is. The state file is written via tempfile.mkstemp followed by os.rename — atomic on POSIX — and contains both last_processed_line (the last contiguous CSV line) and completed_lines_pending (out-of-order completions held until the contiguous front advances):
# pipeline/crawler/crawler/checkpoint.py:68-106
async def save(self) -> None:
    """Atomic save: write to temp file, then rename."""
    async with self._lock:
        self.state.last_checkpoint_at = datetime.now(timezone.utc).isoformat()
        self.state.completed_lines_pending = sorted(self._completed_lines_pending)
        state_dict = {
            "last_processed_line": self.state.last_processed_line,
            "completed_lines_pending": self.state.completed_lines_pending,
            "total_processed": self.state.total_processed,
            "total_success": self.state.total_success,
            "total_failed": self.state.total_failed,
            "failed_tasks": self.state.failed_tasks,
            "current_output_file_index": self.state.current_output_file_index,
            "current_output_row_count": self.state.current_output_row_count,
            "started_at": self.state.started_at,
            "last_checkpoint_at": self.state.last_checkpoint_at,
        }
        self.state_file.parent.mkdir(parents=True, exist_ok=True)
        temp_fd, temp_path = tempfile.mkstemp(dir=self.state_file.parent, suffix=".tmp")
        try:
            with os.fdopen(temp_fd, "w", encoding="utf-8") as f:
                json.dump(state_dict, f, indent=2)
            os.rename(temp_path, self.state_file)  # atomic on POSIX
        except BaseException:
            os.unlink(temp_path)  # don't leak the temp file on failure
            raise
The distinction between last_processed_line and completed_lines_pending matters. If worker A finishes URL 100 while worker B is still on URL 99, we cannot advance the checkpoint to 100 — if we crashed, we'd skip 99. So 100 goes into completed_lines_pending, waits for 99 to finish, and the contiguous front advances to 100 afterward. Out-of-order completion, in-order checkpointing.
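The contiguous-front rule is compact enough to sketch directly. This is an illustrative stand-in for the checkpoint logic described above — `ContiguousFront`, `complete`, and `pending` are hypothetical names, not the pipeline's actual API:

```python
class ContiguousFront:
    """Sketch of in-order checkpointing over out-of-order completions."""

    def __init__(self, last_processed_line: int = 0):
        self.last_processed_line = last_processed_line
        self.pending: set[int] = set()  # plays the role of completed_lines_pending

    def complete(self, line: int) -> int:
        """Record a finished line; advance the front only while contiguous."""
        self.pending.add(line)
        while self.last_processed_line + 1 in self.pending:
            self.last_processed_line += 1
            self.pending.discard(self.last_processed_line)
        return self.last_processed_line

front = ContiguousFront()
front.complete(2)                        # worker A finishes line 2 first
assert front.last_processed_line == 0    # can't advance past unfinished line 1
front.complete(1)                        # line 1 lands; front jumps over both
assert front.last_processed_line == 2
assert front.pending == set()
```

A crash between `complete(2)` and `complete(1)` restarts from line 1, re-crawling line 2 at worst — duplicated work, never skipped work.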
The CSV producer is a coroutine that reads rows and puts them on a bounded asyncio.Queue of size 1,000. When the queue is full, queue.put() blocks. That's the whole backpressure mechanism. The producer doesn't need to know about the workers; the workers don't need to know about the producer; the queue's bounded size does the coordination:
# pipeline/crawler/crawler/queue_manager.py:99-138 (excerpt)
async def produce(self) -> None:
logger.info("queue_producer_started",
input_file=str(self.csv_path), start_line=self.start_line)
with open(self.csv_path, "r", newline="", encoding="utf-8") as f:
reader = csv.DictReader(f)
for line_num, row in enumerate(reader, start=1):
self.total_lines = line_num
if line_num <= self.start_line:
continue # resume skip
if self._shutdown.is_set():
break
...
if self.queue.full():
logger.debug("queue_backpressure",
queue_size=self.queue.qsize(), current_line=line_num)
if not await self._put_with_shutdown(task):
break
self.current_line = line_num
At shutdown, the producer enqueues concurrency × tabs_per_worker poison-pill Nones — one per slot. Each slot pulls a None, stops pulling, and exits cleanly.
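The handshake can be sketched in a few lines of asyncio. This is an illustrative toy, not the crawler's code; `run_demo` and `slot_loop` are hypothetical names:

```python
import asyncio

async def run_demo(slots: int) -> list[int]:
    """One poison-pill None per slot; every consumer stops pulling and exits."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=1000)
    done: list[int] = []

    async def slot_loop(slot_id: int) -> None:
        while True:
            task = await queue.get()
            if task is None:          # poison pill: stop pulling, exit cleanly
                done.append(slot_id)
                return
            # ... crawl the task here ...

    consumers = [asyncio.create_task(slot_loop(i)) for i in range(slots)]
    for url in ("a", "b", "c"):
        await queue.put(url)
    for _ in range(slots):            # exactly one pill per slot
        await queue.put(None)
    await asyncio.gather(*consumers)  # returns only when every slot has exited
    return done

exited = asyncio.run(run_demo(slots=4))   # e.g. concurrency=2 × tabs_per_worker=2
assert sorted(exited) == [0, 1, 2, 3]     # every slot saw its pill
```

The pill count must equal the slot count: one pill fewer and a consumer waits forever; one more and a pill rots harmlessly in the queue.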
Crashes cluster. A flaky upstream or a batch of pages with hostile JS can trigger a burst of crashes that overwhelms the browser restart cap, and then we're watching the crawler die forever. The breaker trades throughput for stability: after N crashes inside a sliding window, it pauses all slots until a cooldown elapses.
# pipeline/crawler/crawler/crawler.py:66-99
async def _record_crash_and_maybe_trip_breaker(
self, worker_id: int, slot_id: int, reason: str,
) -> None:
now = time.monotonic()
async with self._breaker_lock:
self._crash_timestamps.append(now)
cutoff = now - max(1, self.config.circuit_breaker_window_seconds)
while self._crash_timestamps and self._crash_timestamps[0] < cutoff:
self._crash_timestamps.popleft()
crash_count = len(self._crash_timestamps)
threshold = max(1, self.config.circuit_breaker_crashes)
if crash_count >= threshold:
cooldown = max(1, self.config.circuit_breaker_cooldown_seconds)
self._breaker_pause_until = max(self._breaker_pause_until, now + cooldown)
logger.warning("circuit_breaker_tripped",
worker_id=worker_id, slot_id=slot_id, reason=reason,
crash_count=crash_count, threshold=threshold,
window_seconds=self.config.circuit_breaker_window_seconds,
cooldown_seconds=cooldown)
Defaults: 12 crashes in a 30-second window trips a 20-second cooldown. During cooldown, slot loops call _wait_if_breaker_active() before pulling the next task, so throughput effectively zeros while the browser stabilizes. The alternative — hammering a wedged browser with more tasks until the restart cap is hit and the whole shard dies — is worse than waiting.
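The breaker's state machine fits in a dozen lines if you make the clock an argument. A minimal sketch with illustrative names (`CrashBreaker`, `record_crash`, `paused`), mirroring the sliding-window-plus-cooldown logic above rather than quoting it:

```python
from collections import deque

class CrashBreaker:
    """Sliding-window crash breaker: N crashes in window => pause until cooldown."""

    def __init__(self, crashes: int = 12, window_s: float = 30.0, cooldown_s: float = 20.0):
        self.crashes, self.window_s, self.cooldown_s = crashes, window_s, cooldown_s
        self.timestamps: deque[float] = deque()
        self.pause_until = 0.0

    def record_crash(self, now: float) -> None:
        self.timestamps.append(now)
        while self.timestamps and self.timestamps[0] < now - self.window_s:
            self.timestamps.popleft()          # age out crashes beyond the window
        if len(self.timestamps) >= self.crashes:
            # extend, never shorten, the cooldown
            self.pause_until = max(self.pause_until, now + self.cooldown_s)

    def paused(self, now: float) -> bool:
        return now < self.pause_until

b = CrashBreaker(crashes=3, window_s=10.0, cooldown_s=5.0)
for t in (1.0, 2.0, 3.0):        # three crashes inside the window
    b.record_crash(t)
assert b.paused(4.0)             # tripped: slots wait before pulling tasks
assert not b.paused(8.5)         # cooldown (3.0 + 5.0) has elapsed
```

Passing `now` explicitly instead of calling `time.monotonic()` inside makes the trip/cooldown behavior trivially testable.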
Crawlers are failure-mode encyclopedias. Here's what we handle explicitly:
| Failure | Classification | Action |
|---|---|---|
| Transient network (ERR_CONNECTION_CLOSED, ERR_TUNNEL_CONNECTION_FAILED, frame detached) | retry-eligible | Retry up to 3× |
| Page timeout (45s default) | slot-level | Close page, invalidate slot, retry |
| Hard context timeout | slot-level | Cancel slot task, recreate slot |
| Fatal browser markers (target crashed, browser disconnected) | crash | Raise BrowserCrashError, attempt browser restart (cap 3) |
| Playwright RSS > 2048 MB | crash | SIGKILL oversized PIDs, raise BrowserCrashError |
| Hung page.close() / context.close() | leak | 5s wait_for, log, abandon slot |
| Crash storm | systemic | Circuit breaker pauses all slots |
| Restart cap exceeded | fatal | Signal shutdown, drain, exit non-zero |
| Failure rate > 50% | pipeline-level | pipeline.sh aborts, emergency Pushover |
Note the last one: the crawler itself doesn't enforce a failure-rate ceiling. The orchestrator does. This is intentional — the crawler should do its job to the extent possible and report honestly; the decision to halt is an operations decision that belongs one layer up.
The processor is a Go service that reads crawler JSONL output line by line, matches each page's content against roughly 1,000 technology fingerprints in parallel, and bulk-inserts matches into PostgreSQL. Its most important property is that it is boring. It loads its fingerprints once at startup, it never talks to a queue, it writes files to processed/ when done, and if it crashes you run it again.
A fingerprint is a row in the fingerprints table pointing at a signal (a vendor/feature). Each fingerprint has one of two match modes — matching_text (a substring) or matching_regex (a compiled regex) — plus an optional extraction_regex that pulls a captured value (an account ID, a tracker ID, a version string) out of the match:
// pipeline/models/fingerprint.go:14-27
type Fingerprint struct {
ID uint `gorm:"primarykey"`
CreatedAt time.Time
UpdatedAt time.Time
DeletedAt gorm.DeletedAt `gorm:"index"`
MatchingText string
MatchingRegexExp string `gorm:"column:matching_regex"`
MatchingRegex *regexp.Regexp `gorm:"-"`
ExtractionRegexText string `gorm:"column:extraction_regex"`
ExtractionRegex *regexp.Regexp `gorm:"-"`
Enabled bool
SignalID int
Signal Signal `gorm:"foreignKey:SignalID"`
}
At boot, the processor queries all enabled, non-deleted fingerprints and splits them into two flat global slices: models.AllText and models.AllRegex. There is no per-domain indexing — every page is matched against every fingerprint. For our corpus size this is fine; when it stops being fine we'll index. Not yet.
Each page runs both passes concurrently. Two goroutines, two buffered channels of capacity one (so neither sender blocks on the fan-in), one receive each:
// pipeline/processor/processors/processTextAndRegex.go:93-132
func ProcessTextAndRegex(page models.DownloadedPage) (matches []models.ProcessorResult) {
if len(page.Contents) == 0 {
return
}
start := time.Now()
textChan := make(chan []models.ProcessorResult, 1)
regexChan := make(chan []models.ProcessorResult, 1)
go func() { textChan <- processTextFingerprints(page) }()
go func() { regexChan <- processRegexFingerprints(page) }()
textMatches := <-textChan
regexMatches := <-regexChan
matches = append(matches, textMatches...)
matches = append(matches, regexMatches...)
log.Info().
Str("url", page.URL).
Int("text_matches", len(textMatches)).
Int("regex_matches", len(regexMatches)).
Int("total_matches", len(matches)).
Dur("duration", time.Since(start)).
Msg("Page processing complete")
return matches
}
The two passes are asymmetric in cost. Text fingerprints are strings.Contains checks — cheap. Regex fingerprints are compiled regexp evaluations — more expensive, sometimes much more. Running them concurrently means total wall time is max(text, regex) rather than text + regex, which for pages with a heavy regex tail is a meaningful win.
Matches accumulate in a slice. When the slice hits 600 records, we flush. When we flush, the GORM layer sub-batches to 100 records per INSERT. Two tiers:
// pipeline/processor/services/file_processor.go:355-370
if fatalErr == nil && len(batchMatches) >= batchSize {
if err := exporters.SaveOnDB(batchMatches); err != nil {
processingErrors = append(processingErrors,
fmt.Sprintf("batch save error at line %d: %s", result.lineNum, err.Error()))
fatalErr = fmt.Errorf("failed to save batch to database: %w", err)
stopProcessing()
} else {
totalMatches += len(batchMatches)
log.Info().Str("filename", filename).Int("batch_size", len(batchMatches)).
Int("total_matches", totalMatches).Msg("Batch written to database")
batchMatches = batchMatches[:0] // keep capacity, reset length
}
}
The inner batch size (100) exists because PostgreSQL has a hard parameter limit per prepared statement — 65,535. Nine columns per row times 100 rows is 900 parameters, well under the limit with plenty of headroom. The outer batch size (600) exists because flushing too often wastes round trips and flushing too rarely makes a crash expensive. Six-hundred was arrived at empirically and has not needed adjustment.
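The arithmetic is small enough to pin down in code. `max_rows_per_insert` is a hypothetical helper for illustration, not something in the pipeline:

```python
PG_MAX_PARAMS = 65_535   # PostgreSQL's per-statement bind-parameter limit

def max_rows_per_insert(columns: int) -> int:
    """Largest multi-row INSERT that fits the bind-parameter budget."""
    return PG_MAX_PARAMS // columns

# Nine columns per row, as in the processor's result table:
assert 100 * 9 == 900                  # the sub-batch uses 900 of 65,535 params
assert max_rows_per_insert(9) == 7281  # the theoretical ceiling is far higher
```

The gap between 100 and 7,281 is deliberate headroom: the sub-batch size is set for comfortable round-trip economics, not to flirt with the protocol limit.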
Every INSERT is wrapped in a retry envelope — exponential backoff, capped at three attempts:
// pipeline/processor/database/circuit_breaker.go:29-70
func WithRetry(ctx context.Context, cfg RetryConfig, operation string, fn func() error) error {
var lastErr error
for attempt := 0; attempt <= cfg.MaxRetries; attempt++ {
if attempt > 0 {
			delay := cfg.BaseDelay * time.Duration(1<<uint(attempt-1))
			if delay > cfg.MaxDelay {
				delay = cfg.MaxDelay
			}
log.Warn().Str("operation", operation).
Int("attempt", attempt).Dur("delay", delay).Err(lastErr).
Msg("retrying_db_operation")
select {
case <-ctx.Done():
return fmt.Errorf("context cancelled during retry of %s: %w", operation, ctx.Err())
case <-time.After(delay):
}
}
lastErr = fn()
if lastErr == nil {
if attempt > 0 {
log.Info().Str("operation", operation).
Int("attempts", attempt+1).Msg("db_operation_recovered")
}
return nil
}
}
return fmt.Errorf("%s failed after %d retries: %w", operation, cfg.MaxRetries, lastErr)
}
Default config: 3 retries at 5s / 10s / 20s backoff, 45s cap. The name circuit_breaker.go is aspirational — it's a retry envelope, not a breaker. Renaming it is on the list.
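The delay schedule follows from the doubling arithmetic. A sketch (in Python for brevity; `backoff_schedule` is an illustrative name, mirroring the Go code's math, not part of the codebase):

```python
def backoff_schedule(base_s: float, max_delay_s: float, retries: int) -> list[float]:
    """Delays for attempts 1..retries: base * 2^(attempt-1), clamped to the cap."""
    return [min(base_s * (1 << (attempt - 1)), max_delay_s)
            for attempt in range(1, retries + 1)]

# Default config: 3 retries, 5s base, 45s cap -> the 5s / 10s / 20s ladder
assert backoff_schedule(5.0, 45.0, 3) == [5.0, 10.0, 20.0]
# A hypothetical fifth retry is where the 45s cap would finally bite:
assert backoff_schedule(5.0, 45.0, 5) == [5.0, 10.0, 20.0, 40.0, 45.0]
```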
The processor must never double-process a file. Before we open a file, we compute its SHA256 and check a ledger table:
// pipeline/processor/services/file_processor.go:133-154
fileHash, err := computeFileHash(filename)
if err != nil {
return nil, fmt.Errorf("failed to compute file hash: %w", err)
}
alreadyProcessed, err := fp.db.ProcessedFilesRepo().IsProcessed(ctx, fileHash)
if err != nil {
log.Warn().Err(err).Str("filename", filename).
Msg("Failed to check processed status, continuing with processing")
} else if alreadyProcessed {
log.Warn().Str("filename", filename).Str("sha256", fileHash).
Msg("File already processed, skipping")
status := &models.FileProcessorStatus{
Filename: filename,
DurationMs: time.Since(startTime).Milliseconds(),
Errors: []string{"skipped: already processed"},
}
return status, nil
}
Marking is idempotent via ON CONFLICT DO NOTHING. This means we can restart the processor in the middle of a run, point it at the same directory, and it will skip everything that already made it through. Combined with the atomic crawler checkpoint, the entire pipeline has clean resume semantics without any distributed coordination primitives.
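The whole ledger pattern can be sketched with SQLite standing in for PostgreSQL — `INSERT OR IGNORE` playing the role of `ON CONFLICT DO NOTHING`. One simplification to flag: the real processor checks the ledger before processing and marks after success, while this toy collapses check-and-mark into one statement. `try_claim` is an illustrative name:

```python
import hashlib
import os
import sqlite3
import tempfile

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed_files (sha256 TEXT PRIMARY KEY)")

def try_claim(path: str) -> bool:
    """True exactly once per distinct file content, across restarts."""
    cur = db.execute("INSERT OR IGNORE INTO processed_files VALUES (?)",
                     (sha256_of(path),))
    return cur.rowcount == 1   # 0 when the hash was already in the ledger

fd, path = tempfile.mkstemp()
os.write(fd, b'{"url": "https://example.org"}\n')
os.close(fd)
assert try_claim(path) is True    # first run: process the file
assert try_claim(path) is False   # restart: same hash, skip it
os.unlink(path)
```

Hashing content rather than filenames is what makes this robust to renamed or re-copied files: the identity of a file is what's in it.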
V1's equivalent processor image weighed 4.3 GB. The current one is scratch-based and weighs about 7 MB. Three changes made that possible: removing CGO from the dependency tree (no gopostal in the processor — address parsing is the exporter's job, not ours), pinning the GORM Postgres driver to a pure-Go version, and deleting the RabbitMQ and SQS code paths we weren't using anymore. The README is explicit about it: no queue system, processing is file-based only. The binary cold-starts in tens of milliseconds, which matters because we boot a fresh container per run.
Technographic signals are only useful if they can be attached to accounts a customer's sales team actually owns. That means resolving duplicate records across two datasets — a master accounts table and a partner's CRM — where the same organization appears under different names, partial addresses, missing phone numbers, and typos. This is account matching, and it's the kind of problem where off-the-shelf solutions mostly work and then don't work in ways that cost you real dollars.
V2 of our matcher is a sequential cascade of four strategies, each progressively fuzzier than the last. Each strategy only runs on accounts that the previous strategies didn't already claim. The winner for any given account is the highest-confidence match produced by any strategy.
The orchestrator is straightforward — iterate strategies in order, each one filters out already-matched clients, pass the remainder forward:
# python/acct-matcher-v2/improved/account_matcher/matcher.py:85-150 (excerpt)
def find_matches(self, master_accounts, client_accounts,
strategy_order=None, parallel=True,
confidence_threshold=0.7, **kwargs):
if strategy_order is None:
strategy_order = [
MatchMethod.EXACT_PHONE,
MatchMethod.EXACT_NAME_LOCATION,
MatchMethod.FUZZY_NAME_ADDRESS,
MatchMethod.ML_BASED,
]
all_matches = []
matched_client_ids = set()
for method in strategy_order:
remaining_clients = [acc for acc in client_processed
if acc.id not in matched_client_ids]
if not remaining_clients:
break
strategy_matches = self.strategies[method].find_matches(
master_processed, remaining_clients, **kwargs)
quality_matches = [m for m in strategy_matches
if m.confidence >= confidence_threshold]
all_matches.extend(quality_matches)
matched_client_ids.update(m.client_account_id for m in quality_matches)
return self._consolidate_matches(all_matches)
1. Exact phone. A hash join on normalized ten-digit phone. If two records share a phone number, they're the same organization with extremely high probability. Score 100, confidence 0.95. Complexity is O(n+m). This catches a meaningful chunk of the dataset for almost free.
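The hash join itself is a few lines. A minimal sketch — the normalization rules shown (strip punctuation, drop a leading US country code) are assumptions about what "normalized ten-digit phone" means, and the field names are illustrative:

```python
import re

def normalize_phone(raw):
    """Ten-digit US normalization sketch."""
    if not raw:
        return None
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                 # drop leading country code
    return digits if len(digits) == 10 else None

def exact_phone_join(master, clients):
    """O(n+m): one pass to build the index, one pass to probe it."""
    by_phone = {}
    for acc in master:
        p = normalize_phone(acc.get("phone"))
        if p:
            by_phone[p] = acc
    return [(by_phone[p]["id"], acc["id"])
            for acc in clients
            if (p := normalize_phone(acc.get("phone"))) and p in by_phone]

master = [{"id": 1, "phone": "(901) 555-0142"}, {"id": 2, "phone": None}]
clients = [{"id": "c7", "phone": "+1 901-555-0142"}, {"id": "c8", "phone": "555-0199"}]
assert exact_phone_join(master, clients) == [(1, "c7")]   # seven digits on c8: no claim
```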
2. Exact name + location. State-blocked exact match on normalized name, then a tiered location match — street first, then city+zip, then city alone. Confidence 0.90. The tiered fallback matters because partner data quality varies; a record might have a street but no zip, or a zip but no street.
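The tiered fallback reads naturally as a cascade of guards. A sketch under stated assumptions — field names are illustrative, and the real strategy normalizes values before comparing:

```python
def location_tier(a: dict, b: dict):
    """Best location agreement available: street, then city+zip, then city."""
    if a.get("street") and a.get("street") == b.get("street"):
        return "street"
    if (a.get("city") and a.get("city") == b.get("city")
            and a.get("zip") and a.get("zip") == b.get("zip")):
        return "city+zip"
    if a.get("city") and a.get("city") == b.get("city"):
        return "city"
    return None   # no location agreement: this strategy makes no claim

full = {"street": "123 main st", "city": "memphis", "zip": "38103"}
no_street = {"city": "memphis", "zip": "38103"}
city_only = {"city": "memphis"}
assert location_tier(full, full) == "street"
assert location_tier(no_street, full) == "city+zip"   # street missing, fall back
assert location_tier(city_only, full) == "city"       # zip missing too
assert location_tier({"city": "austin"}, full) is None
```

The `a.get(...) and` guards matter: two records that both lack a street must not "agree" on `None`.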
3. Fuzzy name + address. State-blocked RapidFuzz composite score. This is where things get interesting and slow, so we thread-pool it by state. Each state runs on its own worker, up to 8 concurrent:
# python/acct-matcher-v2/improved/account_matcher/strategies/fuzzy_name_address.py:82-98
num_workers = min(len(common_states), kwargs.get('max_workers', 8))
with ThreadPoolExecutor(max_workers=num_workers) as executor:
future_to_state = {
executor.submit(process_state, state): state
for state in common_states
}
for future in as_completed(future_to_state):
state_matches = future.result()
all_matches.extend(state_matches)
The per-state inner loop uses early termination — if name similarity is below 50, we don't bother computing street similarity — and an early exit once the composite score reaches 95:
# python/acct-matcher-v2/improved/account_matcher/strategies/fuzzy_name_address.py:103-158
def _find_best_match_for_client(self, client_acc, master_accounts, min_score):
best_match = None
best_score = 0.0
for master_acc in master_accounts:
if not master_acc.name:
continue
name_score = fuzz.ratio(client_acc.name.lower(), master_acc.name.lower())
if name_score < 50: # early kill
continue
city_bonus = 15 if (client_acc.city and master_acc.city
and client_acc.city.lower() == master_acc.city.lower()) else 0
zip_bonus = 15 if (client_acc.zip_prefix and master_acc.zip_prefix
and client_acc.zip_prefix == master_acc.zip_prefix) else 0
street_score = 0
if client_acc.street and master_acc.street:
street_score = fuzz.partial_ratio(client_acc.street.lower(),
master_acc.street.lower()) * 0.2
total_score = name_score + city_bonus + zip_bonus + street_score
if total_score >= min_score and total_score > best_score:
best_score = total_score
best_match = MatchResult(...)
if total_score >= 95: # near-perfect, stop scanning
break
return best_match
4. Random Forest. For accounts still unclaimed after three rounds of deterministic matching, we extract 18 features per candidate pair, scale them, and run a trained Random Forest classifier. A match is accepted if predict_proba ≥ min_score / 100, capped at 0.90 confidence because we don't want the ML-derived score to outrank a deterministic exact-phone match:
# python/acct-matcher-v2/improved/account_matcher/strategies/ml_matching.py:25-87
def find_matches(self, master_accounts, client_accounts, min_score=70.0, **kwargs):
if not self.is_trained:
return self._heuristic_matching(master_accounts, client_accounts, min_score)
candidate_pairs = self._generate_candidate_pairs(master_accounts, client_accounts)
features = self._extract_features_batch(candidate_pairs)
feature_matrix = self.scaler.transform(features)
probabilities = self.model.predict_proba(feature_matrix)[:, 1]
matches = []
for i, (master_acc, client_acc) in enumerate(candidate_pairs):
prob = probabilities[i]
if prob * 100 >= min_score:
matches.append(MatchResult(
master_account_id=master_acc.id,
client_account_id=client_acc.id,
score=prob * 100,
method=self.method,
confidence=min(prob, 0.90),
details={'ml_probability': float(prob)},
))
return matches
Feature design is where account matching lives or dies. Ours has a specific structure: five continuous name-similarity metrics (different algorithms catch different failure modes), seven binary location/phone/phonetic indicators, and six scalar structural agreements. Each one exists because an earlier version of the model was wrong in a specific way.
| # | Feature | Type | Why it's there |
|---|---|---|---|
| 1 | fuzz.ratio names | float | Overall string similarity |
| 2 | fuzz.partial_ratio names | float | Catches substring matches ("First Baptist" vs "First Baptist Church of Memphis") |
| 3 | fuzz.token_sort_ratio | float | Order-invariant ("Memphis Baptist" vs "Baptist Memphis") |
| 4 | fuzz.token_set_ratio | float | Bag-of-words — robust to missing tokens |
| 5 | Levenshtein.ratio | float | Character-level edit distance, for typos |
| 6–8 | City / state / zip-prefix exact | binary | Hard location anchors |
| 9 | Street partial_ratio | float | "123 Main St" vs "123 Main Street, Suite 400" |
| 10 | Phone exact | binary | Cheap strong signal (survives strategy 1's miss) |
| 11 | Metaphone phonetic match | binary | "Katherine" ≈ "Catherine" |
| 12 | Name-length diff (norm) | float | Huge disparity is a mismatch signal |
| 13 | Name word-count diff (norm) | float | Same intent as 12 at token granularity |
| 14–18 | Structural agreement (has_* flags) | binary | Does one record have a website and the other not? Records with very different completeness profiles are suspicious |
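To make the three feature families concrete, here is a trimmed sketch of pair-feature extraction. It uses stdlib `difflib.SequenceMatcher` as a stand-in for RapidFuzz, covers only a handful of the 18 features, and all names are illustrative:

```python
from difflib import SequenceMatcher

def ratio(a: str, b: str) -> float:
    """Stand-in for fuzz.ratio, scaled 0-100 (the matcher uses RapidFuzz)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

def pair_features(m: dict, c: dict) -> list[float]:
    """One continuous name metric, two binary location anchors,
    one scalar structural feature, one completeness-agreement flag."""
    name_sim = ratio(m["name"], c["name"])
    city_eq = float(bool(m.get("city")) and m.get("city") == c.get("city"))
    zip_eq = float(bool(m.get("zip")) and m.get("zip") == c.get("zip"))
    len_diff = abs(len(m["name"]) - len(c["name"])) / max(len(m["name"]), len(c["name"]))
    both_have_site = float(bool(m.get("website")) == bool(c.get("website")))
    return [name_sim, city_eq, zip_eq, len_diff, both_have_site]

m = {"name": "First Baptist Church", "city": "memphis", "zip": "38103", "website": "x.org"}
c = {"name": "First Baptist Church of Memphis", "city": "memphis", "zip": "38103"}
f = pair_features(m, c)
assert f[0] > 70              # high name similarity despite the suffix
assert f[1] == 1.0 and f[2] == 1.0
assert f[4] == 0.0            # one record has a website, the other doesn't
```

The point of the last feature is the table's row 14–18 in miniature: disagreement in *completeness profile*, not just in values, is itself evidence against a match.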
The model itself is small and regularized on purpose:
# python/acct-matcher-v2/improved/account_matcher/strategies/ml_matching.py:213-258
def train_model(self, training_data):
features, labels = [], []
for master_acc, client_acc, is_match in training_data:
features.append(self._extract_pair_features(master_acc, client_acc))
labels.append(1 if is_match else 0)
X = np.array(features); y = np.array(labels)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42)
X_train_scaled = self.scaler.fit_transform(X_train)
X_test_scaled = self.scaler.transform(X_test)
self.model = RandomForestClassifier(
n_estimators=100, max_depth=10,
min_samples_split=5, min_samples_leaf=2, random_state=42,
)
self.model.fit(X_train_scaled, y_train)
100 trees, depth 10, min-samples-leaf 2. Enough capacity to learn the interactions between our 18 features, not enough to overfit the hand-labeled training set. Test accuracy lands at 92% on an 80/20 split.
The public numbers on the site compare v2 to v1 of the matcher specifically. V1 lived at python/acct-matcher/ and had a simpler architecture — fewer strategies, no state blocking, single DB connection, load-everything-into-memory. In rough order of contribution to the speedup:
- A `psycopg2.pool.ThreadedConnectionPool(min=1, max=20)`, which lets the fuzzy strategy's thread pool actually get parallel I/O.
- `database/cache.py`. Memory stays bounded; intermediate results get reused. This is most of the 60% memory reduction.

What "92% accuracy" means specifically, because the number deserves to be pinned down: it is the held-out test-set accuracy logged by model.score(X_test_scaled, y_test) on a deterministic 80/20 split of hand-labeled (master, client, is_match) triples. It is not precision or recall individually — those are tracked separately and are close to balanced given the regularization. The labels themselves are a curated set derived from prior resolved matches plus known overlaps; label quality is the ceiling on this number and we know it.
Everything in v2 is designed so that the on-call answer to "how is the monthly run going?" is a single command and the on-call answer to "something is wrong" is a single Pushover notification with the stage, the run ID, and a log path.
pipeline.sh owns: argument parsing, prerequisite validation, run-ID assignment, per-stage Docker invocation, status-file writing, log rotation, and alert dispatch. The five stages run in fixed order: crawl → process → aggregate → archive_processor_results → export. Each stage writes under a directory keyed by the run ID:
# pipeline/pipeline.sh (excerpt)
RUN_ID="$(date +%Y-%m-%d)-monthly"
CRAWL_DIR="${DATA_DIR}/crawl/${RUN_ID}"
PROCESS_DIR="${DATA_DIR}/processing/${RUN_ID}"
EXPORT_DIR="${DATA_DIR}/export/${RUN_ID}"
LOG_DIR="${DATA_DIR}/logs/${RUN_ID}"
STATUS_DIR="${DATA_DIR}/status"
STATUS_FILE="${STATUS_DIR}/current_run.json"
Everything downstream of pipeline.sh — the status viewer, the retry-failed mode, the cron installer — navigates this directory layout as its contract.
Running with --dry-run exercises the entire prerequisite check path — DSN exported, client-id provided, input file exists, all three Docker images present locally, timeout values sane — and exits zero without running any stage. This is what goes into the cron line before the cron fires for real. It catches the "forgot to pull latest images" class of problem before 2 AM.
The crawler writes a failed_urls.csv at the end of each run containing the URLs it couldn't process. The pipeline supports --retry-failed, which locates the most recent previous run's failed-URLs file and swaps it in as the next run's input:
# pipeline/pipeline.sh:400-408
if [[ "${RETRY_FAILED}" == "true" ]]; then
  local prev_failed="${DATA_DIR}/crawl/$(ls -t "${DATA_DIR}/crawl/" 2>/dev/null | head -1)/failed_urls.csv"
  if [[ -f "${prev_failed}" ]]; then
    actual_input="${prev_failed}"
    log_info "retry_mode" "\"failed_urls_file\":\"${prev_failed}\""
  else
    log_warn "No previous failed_urls.csv found, using original input"
  fi
fi
No queue. No dead-letter exchange. A file. Which is, in every way that matters, what a dead-letter exchange was pretending to be.
Three priority tiers, implemented as a thin Pushover wrapper:
# pipeline/notify.sh:1-63 (excerpt)
send_notification() {
  local title="${1:?Title required}"
  local message="${2:?Message required}"
  local priority="${3:-0}"
  local retry="${4:-60}"
  local expire="${5:-3600}"

  if [[ "${NOTIFICATIONS_ENABLED:-true}" != "true" ]]; then
    echo "[NOTIFY] (disabled) ${title}: ${message}"
    return 0
  fi
  ...
  local payload=(
    -F "token=${PUSHOVER_APP_TOKEN}"
    -F "user=${PUSHOVER_USER_KEY}"
    -F "title=${title}"
    -F "message=${message}"
    -F "priority=${priority}"
  )
  if [[ "${priority}" == "2" ]]; then
    payload+=(-F "retry=${retry}" -F "expire=${expire}")
  fi
  curl -s -w "\n%{http_code}" "${payload[@]}" https://api.pushover.net/1/messages.json
}
notify_normal() { send_notification "$1" "$2" 0; }
notify_high() { send_notification "$1" "$2" 1; }
notify_emergency() { send_notification "$1" "$2" 2; }
Normal goes on stage transitions. High goes when the crawler's failure rate crosses 15%. Emergency — which retries every 60 seconds until I acknowledge it — goes when the failure rate crosses 50% and the pipeline aborts, or when prereq validation fails. I have been woken up twice in two years by an emergency alert. Both times the alert was correct.
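The escalation policy reduces to a pure function of failure rate and abort state. A sketch under the thresholds above; the function name and signature are hypothetical, not from notify.sh:

```python
def alert_priority(fail_rate: float, aborted: bool, prereqs_ok: bool = True) -> int:
    """Map pipeline state to a Pushover priority.

    0 = normal (stage transitions), 1 = high (failure rate over 15%),
    2 = emergency (failure rate over 50% with an abort, or failed
    prerequisite validation; retries every 60s until acknowledged).
    """
    if not prereqs_ok or (fail_rate >= 0.50 and aborted):
        return 2
    if fail_rate >= 0.15:
        return 1
    return 0

assert alert_priority(0.02, aborted=False) == 0   # routine transition
assert alert_priority(0.20, aborted=False) == 1   # elevated failures
assert alert_priority(0.60, aborted=True) == 2    # wake somebody up
```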
status.sh reads the current run's JSON status file and the per-shard crawl_status.json files, computing per-stage badges (✓ ⟳ ○ ✗), aggregating across shards, and drawing an ASCII summary:
═══════════════════════════════════════════════════════
Pipeline Run: 2026-04-10-monthly
Started: 2026-04-10 02:00:04 UTC
Current: process (3 of 5)
═══════════════════════════════════════════════════════
[✓] crawl 1h42m · 515,204 urls · 2.1% fail
[⟳] process 0h38m · 47 / 63 files · 1,240,891 matches
[○] aggregate
[○] archive
[○] export
Compared to v1's four-dashboard cognitive JOIN, this is a small, concrete thing. Which is the point.
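The badge logic in status.sh is shell, but its core is small enough to sketch in Python. The status-file field names below are illustrative assumptions, not the real schema:

```python
import json

BADGES = {"done": "✓", "running": "⟳", "pending": "○", "failed": "✗"}
STAGES = ["crawl", "process", "aggregate", "archive", "export"]

def render(status: dict) -> list:
    """Turn a run-status dict into one badge line per stage.

    Stages absent from the status file are treated as pending.
    """
    lines = []
    for stage in STAGES:
        state = status.get(stage, {}).get("state", "pending")
        lines.append(f"[{BADGES[state]}] {stage}")
    return lines

run = json.loads('{"crawl": {"state": "done"}, "process": {"state": "running"}}')
for line in render(run):
    print(line)
```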
After processing writes individual match rows to processor_results, the aggregate stage folds them into website_technology_findings, which is what the exporter and downstream queries actually use. One upsert, keyed on (website_id, fingerprint_id, found_month):
-- pipeline/refresh_aggregate.sql (excerpt)
INSERT INTO website_technology_findings (...)
SELECT
    pr.website_id,
    pr.fingerprint_matched_id AS fingerprint_id,
    date_trunc('month', COALESCE(pr.found_at, pr.created_at))::date AS found_month,
    ...
    COUNT(*)::int AS match_count,
    MIN(COALESCE(pr.found_at, pr.created_at)) AS first_seen_at,
    ...
FROM processor_results pr
JOIN websites w ON w.id = pr.website_id
JOIN fingerprints f ON f.id = pr.fingerprint_matched_id
JOIN signals s ON s.id = f.signal_id
WHERE pr.deleted_at IS NULL
  AND s.type = 'Technology' AND s.enabled = true
  AND f.enabled = true
  AND COALESCE(pr.found_at, pr.created_at) >= :since_date::timestamp
GROUP BY pr.website_id, pr.fingerprint_matched_id, date_trunc('month', ...)
ON CONFLICT (website_id, fingerprint_id, found_month)
DO UPDATE SET
    match_count = EXCLUDED.match_count,
    first_seen_at = LEAST(website_technology_findings.first_seen_at, EXCLUDED.first_seen_at),
    updated_at = NOW();
The LEAST() on first_seen_at is deliberate: a technology's "first detected" date should only move backward as we discover earlier evidence, never forward. That column anchors cohort analysis downstream — churn curves and adoption timelines — and a clock that moves the wrong way corrupts everything built on top of it.
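The merge semantics are easy to state in isolation: on re-aggregation, first_seen_at is combined with the earlier of the two values and never overwritten. A Python stand-in for the SQL's LEAST():

```python
from datetime import date
from typing import Optional

def merge_first_seen(existing: Optional[date], incoming: date) -> date:
    """first_seen_at only moves backward: keep the earlier date."""
    return incoming if existing is None else min(existing, incoming)

# Re-running aggregation with later evidence must not shift the anchor forward.
assert merge_first_seen(date(2025, 3, 1), date(2025, 6, 1)) == date(2025, 3, 1)
# Earlier evidence discovered later does move it backward.
assert merge_first_seen(date(2025, 3, 1), date(2025, 1, 15)) == date(2025, 1, 15)
```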
Early versions of the crawler used one Playwright BrowserContext per worker and opened multiple Page objects inside it — tabs_per_worker > 1, contexts shared. Synthetic benchmarks loved it. Production did not.
The symptom was a worker silently dropping to zero throughput while reporting healthy. A page in tab 0 was stuck mid-goto(), holding the context's internal lock. Tabs 1–3 in the same context queued behind it, forever. browser.is_connected() returned True. Nothing in the logs. You could stare at ps and see a worker taking no CPU, consuming no bandwidth, doing nothing, and know from aggregate throughput that something was wrong, but never be able to point at it.
The fix was structural: one context per slot, not one context per worker. A hung page now only affects its own slot, and _close_slot_locked's 5-second bound on page.close() and context.close() means the worst case is losing one slot for five seconds, not losing all its siblings forever. Slots carry the parent browser's session_id so a browser restart invalidates exactly the right contexts — no leaked handles.
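The five-second bound is the load-bearing detail. A generic sketch of bounded async cleanup with asyncio.wait_for; close_coro stands in for Playwright's page.close() and context.close(), and this is not the crawler's slot code:

```python
import asyncio

async def bounded_close(close_coro, timeout: float = 5.0) -> bool:
    """Attempt a graceful close; abandon it after `timeout` seconds.

    Returns True if the close finished, False if it was abandoned.
    On False, the caller tears the slot down and lets the browser
    restart path reclaim whatever the hung page was holding.
    """
    try:
        await asyncio.wait_for(close_coro, timeout=timeout)
        return True
    except asyncio.TimeoutError:
        return False

async def demo():
    # A fast close succeeds; a hung close is cut off by the bound.
    ok = await bounded_close(asyncio.sleep(0.01), timeout=1.0)
    hung = await bounded_close(asyncio.sleep(10), timeout=0.05)
    return ok, hung
```

Worst case under this scheme is exactly what the text describes: one slot lost for the duration of the bound, while its siblings keep crawling.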
The other half of the fix was the RSS killer. Even with isolated contexts, some pages pin memory catastrophically — streaming video elements, runaway animation frames, whatever. We poll /proc, find any Playwright child process over 2 GB RSS, and SIGKILL it. The kill raises BrowserCrashError, which trips the same restart path as a real crash. Nothing graceful; the only good answer to a memory leak you don't control is a well-placed SIGKILL.
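The /proc scan is only a few lines. A simplified sketch of the RSS check (helper names are mine; the real killer also filters to Playwright child processes before killing):

```python
import os
import re

RSS_LIMIT_KB = 2 * 1024 * 1024  # 2 GB, in the kB units /proc reports

def rss_kb(status_text: str) -> int:
    """Extract VmRSS (resident set size) from a /proc/<pid>/status body."""
    m = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(m.group(1)) if m else 0

def pids_over_limit(proc_root: str = "/proc", limit_kb: int = RSS_LIMIT_KB):
    """Yield pids whose resident memory exceeds the limit."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                if rss_kb(f.read()) > limit_kb:
                    yield int(entry)
        except OSError:
            continue  # process exited between listdir and open

# After filtering to browser children, the kill is just:
#   os.kill(pid, signal.SIGKILL)
```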
The first checkpoint implementation persisted last_processed_line only on graceful shutdown. Naturally, we didn't always shut down gracefully. After a Docker OOM-kill mid-run, resuming would re-fetch the last couple hundred URLs, which the processor would happily match against its fingerprints again and INSERT a second time. The set of detected technologies was still correct, but every affected row carried a match_count inflated by some quiet factor we couldn't easily audit.
Two things had to change. First, record_completion() had to atomically advance counters, write the JSONL output line, decide whether to checkpoint, and persist output-file-rotation state — all under a lock, because a crash in the middle of any of these would otherwise orphan some combination of them. Second, the checkpoint had to track completed_lines_pending — the set of lines that finished out of order and are held until the contiguous front catches up — so we never claim line 100 is done while line 99 is still in flight.
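The completed_lines_pending bookkeeping is the subtle part. A minimal sketch of the contiguous-front rule; class and method names are illustrative, not the crawler's actual checkpoint code:

```python
import threading

class Checkpoint:
    """Track out-of-order line completions; only advance the resumable
    checkpoint through the contiguous prefix of finished lines."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.last_contiguous = 0  # highest N such that lines 1..N are all done
        self.pending = set()      # finished lines beyond the contiguous front

    def record_completion(self, line_no: int) -> int:
        with self._lock:
            self.pending.add(line_no)
            # Pull the front forward while the next line has finished.
            while self.last_contiguous + 1 in self.pending:
                self.last_contiguous += 1
                self.pending.discard(self.last_contiguous)
            return self.last_contiguous  # the only value safe to persist

cp = Checkpoint()
cp.record_completion(1)               # front advances to 1
cp.record_completion(3)               # 3 held in pending; front stays at 1
assert cp.record_completion(2) == 3   # 2 closes the gap; front jumps to 3
```

Persisting last_contiguous instead of "highest line seen" is what makes the resume safe: line 100 is never claimed done while line 99 is still in flight.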
Idempotency at the next stage is the belt to this suspenders: the processor's SHA256 ledger catches any file that does make it through twice. Defense in depth for a boring problem that wasn't boring at 3 AM on run day.
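The ledger needs nothing more than a set of content hashes consulted before a file is processed. The processor's implementation is Go; this sketch is Python for brevity, with the file layout assumed:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hash file contents in bounded chunks (files can be large)."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def should_process(path: Path, ledger: Path) -> bool:
    """Return True and record the hash iff this content is new."""
    digest = sha256_of(path)
    seen = set(ledger.read_text().split()) if ledger.exists() else set()
    if digest in seen:
        return False  # same bytes already processed: skip, stay idempotent
    with ledger.open("a") as f:
        f.write(digest + "\n")
    return True
```

Keying on content rather than filename means a re-crawled batch that produces byte-identical output is skipped even if it arrives under a new name.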
The pre-monolith processor image was 4.3 GB. It carried glibc, CGO extensions, Debian's kitchen sink, and a base image that had picked up weight at every unrelated dependency bump for two years. Pulls took minutes. Cold-start was measured in seconds. Deploys were an event.
The current processor is a multi-stage Dockerfile producing a scratch-based binary at about 7 MB. Three changes got us there: removing CGO from the dependency tree (gopostal is an address-parsing library we only needed in the exporter — the processor has no business with it), pinning the GORM Postgres driver to a pure-Go version, and deliberately deleting the RabbitMQ/SQS code paths we weren't using anymore. The README is explicit: no queue system, processing is file-based only.
Cold-start now takes tens of milliseconds, and the binary is nearly three orders of magnitude smaller than the image it replaced. Both of those matter because we boot a fresh container per run — not because we had to, but because it removes an entire class of "did the last run leave residue" questions from my ops mental model.
Both the exporter and processor originally called GORM's AutoMigrate at boot. In dev, this is helpful — your model structs and your database schema converge automatically. In production, when the GORM Postgres driver versions drift between services (separate Go modules, separate go.sums), AutoMigrate can issue subtly different DDL depending on which container wins the startup race. We watched a column flip between text and varchar over consecutive runs, then flip back, while two services fought each other over whose idea of the schema was correct.
The fix was to remove AutoMigrate from both services entirely. Schema is now owned by explicit, checked-in migrations applied before any service starts. The driver versions were pinned and aligned across both services. Startup got faster as a side effect — a two-minute boot-time migration probe became a sub-second connection ping — and the "whose migration ran last" question stopped being a production concern.
The lesson here is not that AutoMigrate is bad. It's that a convenience that writes to production state on service boot will eventually bite you, and the right answer is to remove the convenience rather than to work around it. This is a pattern that recurs across the v1-to-v2 transition more broadly: tools designed for development velocity ("Kubernetes deploys your service automatically," "RabbitMQ buffers your work queue automatically," "AutoMigrate fixes your schema automatically") have production costs that are easy to discount and hard to reverse once they've accumulated.
I want to end on the things that transferred out of this project. Not the implementation details — those are specific to technographic pipelines and mostly worthless to anyone not building one — but the engineering judgment, which I think is portable.
A queue is a tool for a specific shape of problem: variable producer/consumer rates, heterogeneous consumers, cross-service decoupling. If your problem doesn't have that shape, a queue is not architecture. It's a place for your state to hide. Same goes for Kubernetes, same goes for microservices, same goes for observability stacks. Every one of these is the right answer to some problem. The mistake is not choosing them; the mistake is choosing them before you know which problem you have.
POSIX gives you atomic renames, hierarchical namespacing, access time metadata, and cp/mv/ls. That's a surprisingly complete set of primitives for coordinating batch work. The directory-per-run-id layout, the SHA256 ledger, the failed-urls file that feeds back into the next run — all of it is files. Boring, inspectable, debuggable with ls and cat. A file on disk in 2026 has the same semantics it had in 1996, which is more than you can say about most of our infrastructure.
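The write-temp-then-rename pattern is the atomic primitive most of this leans on. A generic Python sketch, not pipeline code:

```python
import os
import tempfile

def atomic_write(path: str, data: bytes) -> None:
    """Write a file so readers see either the old content or the new,
    never a partial write: write to a temp file in the same directory,
    fsync, then rename over the target (atomic on POSIX filesystems)."""
    dirname = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname)  # same fs, so rename is atomic
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # the atomic step
    except BaseException:
        os.unlink(tmp)
        raise
```

A status reader that races this write sees a complete JSON document either way, which is exactly the property the status-file contract depends on.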
The pipeline has at least three independent idempotency mechanisms — atomic crawler checkpoints, SHA256 file ledger in the processor, ON CONFLICT on the aggregate upsert. Any one of them, working correctly, would prevent double-processing. All three working correctly means we can crash or rerun at any point without producing duplicate data. This is defense in depth for a problem where the cost of a single duplicate is a silently-inflated metric that takes three months to detect.
Most architectural diagrams don't include the operator. They should. If your system requires an expert to run — four dashboards, three control planes, five mental models — it will be run badly, and not by you, because you'll leave. v2's ops surface is pipeline.sh and status.sh. Anyone who can read bash can run it. Anyone who can't read bash has no business running it. Either way, the handoff is clean.
The most effective engineering move I made on this project was deleting 4 GB of Docker image and an entire RabbitMQ-driven deployment topology. Nothing I built was as useful as what I removed. This is embarrassingly easy to forget; deletion doesn't make for good demos, and nobody puts "removed the queue system" on a resume. But the pipeline is reliable because it's simple, and it's simple because I spent six months convincing myself I was allowed to throw things away.
If you're mid-rebuild and wondering whether the pattern you're about to adopt is the right one, the test is not whether it's fashionable and not whether it scales. The test is whether the workload has the shape that pattern was designed for. If yes, use it. If no, the boring version is almost certainly correct, and the hard part is believing yourself.
I work with mid-market teams on production AI systems — auditing what's fragile, rebuilding what's wrong, and building new pipelines with the boring rigor they actually need. Case studies are the easy part; the conversation is where the real work starts.