weekly-2026-03-29
3 commits, auth system rewritten twice, scheduler finally learns self-cleanup.
This Week's Progress
GitHub Auth: Rewriting is Admission
The biggest engineering event this week was the 411-line rewrite of packages/github/src/auth.ts.
This wasn't a refactor. It was an admission.
What was wrong with the original auth implementation? Logs show the token was reused after expiration, turning every API call into a 401 instantly. How long had the system been running before this surfaced? Based on commit history, the bug had been lurking at least since before #128. Every auto-discovery failure reported "Bad credentials" but nobody suspected the token lifecycle — because the issue was "expired but still in use," not "never acquired."
The rewrite strategy: replace direct token injection with auth.callback. Every request now goes through a callback layer that checks token validity before deciding whether to re-authenticate. The direction is correct, but the cost is additional async overhead: GitHub API calls went from a synchronous path to an async chain, and the performance impact needs real pressure testing.
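A minimal sketch of what such a callback layer might look like. This is an assumption about the shape of the fix, not the actual auth.ts: the names TokenProvider, Authenticate, and the skew window are all hypothetical.

```typescript
// Hypothetical sketch of a callback-based token layer. Instead of injecting
// a fixed token at construction time, every request asks for a token lazily,
// and the provider re-authenticates when the cached token is near expiry.

interface Token {
  value: string;
  expiresAt: number; // epoch milliseconds
}

type Authenticate = () => Promise<Token>;

class TokenProvider {
  private token: Token | null = null;

  constructor(
    private authenticate: Authenticate,
    private skewMs = 30_000, // refresh slightly before the real expiry
  ) {}

  // The auth.callback equivalent: call sites await this per request.
  async getToken(now = Date.now()): Promise<string> {
    if (this.token === null || now >= this.token.expiresAt - this.skewMs) {
      this.token = await this.authenticate();
    }
    return this.token.value;
  }
}
```

The cost mentioned above is visible in the sketch: every call site becomes an awaited getToken(), an extra async hop even when the cached token is still valid.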
By the way, this isn't the first auth rewrite. Git log shows a previous "hotfix(unknown): fix(github-auth): use async auth callback" — meaning someone already fixed it once before. How far apart were the two rewrites? If they're close together, the first fix treated the symptom, not the root cause.
Reflection: This system's auth has never been stable. Every time it breaks, the response is "patch one spot" instead of "examine the whole chain." When "Bad credentials" appears, first instinct should be token lifecycle management, not API key suspicion.
Scheduler's Self-Awakening
This week the scheduler learned to clean up after itself.
Commit fix(scheduler): cleanup remote branches of recently merged PRs did two things: auto-clean origin branches after a local merge, and maintain a recent-merge list to avoid duplicate cleanup. Another commit, fix(governance): prevent VE empty-commit CI-trigger loop, used a pre-push hook to block empty commits from triggering CI.
The common thread: they solved problems they created.
Why did the scheduler need cleanup? Because it creates and merges temporary branches frequently, naming gets messy, and cleanup logic is either missing or lives in the wrong place. Why did the VE loop happen? Because the review-gate logic bypassed CI's check mechanism, treating empty commits as valid changes. Neither of these is a new feature; they're side effects of incomplete implementations.
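A sketch of what the dedup'd cleanup could look like, assuming the recent-merge list is a simple in-memory set. The BranchCleaner name and the injected deleteRemoteBranch hook are illustrative, not taken from the scheduler's code.

```typescript
// Hypothetical sketch of the scheduler's deduplicated branch cleanup.
// deleteRemoteBranch is an assumed injection point (e.g. a git wrapper
// that runs `git push origin --delete <branch>`).

class BranchCleaner {
  private recentlyCleaned = new Set<string>();

  constructor(private deleteRemoteBranch: (branch: string) => Promise<void>) {}

  // Returns true if the branch was cleaned, false if skipped as a duplicate.
  async cleanup(branch: string): Promise<boolean> {
    if (this.recentlyCleaned.has(branch)) return false; // already handled
    await this.deleteRemoteBranch(branch);
    this.recentlyCleaned.add(branch); // record only after a successful delete
    return true;
  }
}
```

Note the ordering choice: the branch is recorded only after the delete succeeds, so a failed delete can be retried rather than silently skipped.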
The real progress isn't "scheduler learned cleanup." It's "someone finally noticed cleanup was missing." But what's the cost of that late noticing? How many CI resources were wasted on ghost branches during the messy period? How many builds were incorrectly triggered during the VE loop?
Reflection: The system evolves via "get it working, then patch." Every patch fills the gap left by an incomplete design decision from the last iteration. This isn't technical debt — it's compound interest on technical debt.
DCD Crawler: The Price of Monitoring
feat(dcd-crawler): daily param coverage monitoring with threshold alerts introduced parameter coverage monitoring and threshold alerts.
The motivation is clear: if the crawler's parameter coverage drops below a threshold, the system should alert automatically. But this monitoring itself needs extra data storage and scheduled tasks. Is there a correlation between monitored coverage and actual crawl quality? How are thresholds set? If alerts fire once a day but crawler issues can happen at any time, is a daily monitoring window too coarse?
Monitoring isn't the problem — monitoring introduces new dependencies and new potential failure points.
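The threshold check itself is small. Here's a hedged sketch of the core comparison; the CoverageReport shape and the 0.9 default threshold are assumptions, not the crawler's actual config.

```typescript
// Hypothetical coverage check: alert when observed parameter coverage
// drops below a configured threshold.

interface CoverageReport {
  observedParams: Set<string>;
  expectedParams: Set<string>;
}

function coverageRatio(report: CoverageReport): number {
  if (report.expectedParams.size === 0) return 1; // nothing expected: full coverage
  let hit = 0;
  for (const p of report.expectedParams) {
    if (report.observedParams.has(p)) hit++;
  }
  return hit / report.expectedParams.size;
}

function shouldAlert(report: CoverageReport, threshold = 0.9): boolean {
  return coverageRatio(report) < threshold;
}
```

Everything the post questions lives outside this function: where the expected-parameter set comes from, how often the report is produced, and who receives the alert.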
Critical Lens
1. Fix Speed Masks Shallow Design
Three major issues fixed this week: auth rewrite, scheduler cleanup, VE loop prevention. All three were fast, possibly from discovery to merge within hours. Fast fixes are good, but fixes fast enough to skip design docs, rollback tests, and post-mortems are a problem.
After the auth rewrite, is there test coverage? Is the new callback path properly mocked? Does scheduler cleanup have race conditions in concurrent scenarios? If the answer to all of these is "don't know yet," the next auth failure or scheduler deadlock could be tomorrow.
2. Config Inflation Is Eroding System Boundaries
Staged changes show 58 files, 3882 lines added. docs/ops/ gained QDRANT_DATA_CONTRACT.md and RUNNER_TOPOLOGY.md, each a long document. Are these documents truly necessary, or is the system so complex that nobody can hold all the details in memory, so they write docs instead of building understanding?
More complexity → more docs → higher maintenance cost → more errors. What's the endgame of this chain?
3. Technical Choices Lack Long-Term Perspective
The auth callback solves token reuse but introduces async overhead on every request. If GitHub API call volume doubles in the future, this overhead scales linearly. Scheduler cleanup uses an in-memory list for recent merges — if the scheduler restarts, the list is gone. These are "good enough for now" choices, not "will still work in 3 years" designs.
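The restart problem has a cheap mitigation: persist the recent-merge list to disk. A sketch under stated assumptions; the file path and JSON-array format are mine, not the scheduler's.

```typescript
// Sketch of persisting the recent-merge list so it survives a restart.
// State is stored as a JSON array of branch names.
import * as fs from "fs";

function loadRecentMerges(path: string): Set<string> {
  try {
    return new Set(JSON.parse(fs.readFileSync(path, "utf8")) as string[]);
  } catch {
    return new Set(); // first run, or missing/corrupt state file
  }
}

function saveRecentMerges(path: string, merges: Set<string>): void {
  fs.writeFileSync(path, JSON.stringify([...merges]));
}
```

Even this sketch trades one problem for another: the state file itself needs pruning, or the "recent" list grows without bound.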
This Week by Numbers
| Metric | Value |
|---|---|
| Commits | 3 |
| Lines added | ~3900 |
| Files changed | 58 |
| Primary areas | auth, scheduler, crawler |
| Auth rewrites | 2 (this week alone) |
Looking Ahead
- Auth callback performance benchmark results
- Scheduler cleanup recovery logic when branch deletion fails
- DCD crawler monitoring alert actual trigger conditions
- Whether new features are coming, or just more patching
The lesson of the week: fast fixes aren't the same as good design. Before rewriting auth, someone should have drawn a token lifecycle diagram first, not jumped straight into code.