Pre-Ship AI-Assisted Review Planning
The Approach: Think Before You Prompt
A project I created is code complete—but as we know, code complete and “ship it” are two different things.
I’m using Codex Web for the actual reviews since it’s connected to my GitHub repo. But a sophisticated product needs more than “do a code review.”
My workflow involves two layers of AI collaboration:
| Stage | Tool | Purpose |
|---|---|---|
| 1. Design the reviews | ChatGPT | Structure what needs to be examined |
| 2. Write the prompts | ChatGPT | Convert structure into Codex-ready instructions |
| 3. Execute | Codex Web | Run against the actual codebase |
This look-before-you-leap approach improves outcome quality by involving AI in planning, not just execution.
A) Engineering Quality Reviews
1. Architecture & Boundaries Review
- Are modules separated cleanly? Any leaky abstractions between agents ↔ LLM ↔ Kanban?
- Is the “single source of truth” clear? (config, roles, prompts, traces)
- Any cyclic dependencies or import-order landmines?
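For the cyclic-dependency question, it helps to hand Codex (or run locally) a concrete probe rather than ask it to eyeball imports. A minimal sketch, assuming a flat `src/` package layout (the real repo layout may differ):

```python
# Quick import-cycle probe: build a module -> imported-local-module graph from
# the source files and report the first cycle found via DFS.
# Assumption: project modules are flat .py files under src/ (adjust ROOT).
import ast
from pathlib import Path

ROOT = Path("src")

def local_imports(path, local_modules):
    tree = ast.parse(path.read_text(), filename=str(path))
    found = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            found |= {alias.name.split(".")[0] for alias in node.names}
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module.split(".")[0])
    return found & local_modules  # keep only edges inside the project

def find_cycle(graph):
    state, stack = {}, []  # state: 0 unvisited, 1 in progress, 2 done

    def dfs(node):
        state[node] = 1
        stack.append(node)
        for nxt in graph.get(node, ()):
            if state.get(nxt, 0) == 1:
                return stack[stack.index(nxt):] + [nxt]
            if state.get(nxt, 0) == 0:
                cycle = dfs(nxt)
                if cycle:
                    return cycle
        stack.pop()
        state[node] = 2
        return None

    for node in graph:
        if state.get(node, 0) == 0:
            cycle = dfs(node)
            if cycle:
                return cycle
    return None

modules = {p.stem: p for p in ROOT.glob("*.py")}
graph = {name: local_imports(p, set(modules)) for name, p in modules.items()}
cycle = find_cycle(graph)
print("cycle:", " -> ".join(cycle) if cycle else "none found")
```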
2. Governance / Ratchet Enforcement Review
- Can approved artifacts be overwritten via edge cases? (delete/recreate, rename, symlink, unstaged edits; see the sketch after this list)
- Does Ralph have any path to mutate tests indirectly?
- Are governance rules enforced consistently across webhook + orchestrator polling paths?
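For the delete/recreate, rename, and symlink edge cases, this is the shape of guard the review should try to defeat. A minimal sketch only; the manifest path, format, and "approved means frozen" policy are my assumptions, not the project's actual ratchet mechanism:

```python
# Sketch of an approved-artifact write guard. Assumptions: a JSON manifest at
# .governance/approved.json keyed by relative path, and a policy that approved
# (ratcheted) files are frozen. The review's job is to try delete/recreate,
# rename, and symlink tricks against whatever the real guard is.
import json
from pathlib import Path

MANIFEST = Path(".governance/approved.json")  # assumed location and format

def is_write_allowed(target, workspace):
    ws = Path(workspace).resolve()
    real = (ws / target).resolve()        # collapses symlinks and ".." segments
    if ws not in real.parents:
        return False                      # path escapes the workspace
    approved = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}
    rel = real.relative_to(ws).as_posix()
    # Delete-then-recreate or rename-over doesn't help: the manifest is keyed
    # by path, so an approved path stays frozen whether or not the file exists.
    return rel not in approved
```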
3. Correctness & Idempotency Review
- Spawner idempotency + flood control: prove no duplicate children under retries/races
- Webhook handler idempotency: duplicate events won’t double-run agents
- “Exactly once” vs “at least once” behavior documented and safe
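A concrete way to frame the idempotency questions: is there a stable key per event, and is claiming it atomic? A minimal sketch, assuming a file-backed marker store (the real spawner/webhook code may use a database or Kanboard metadata instead):

```python
# Sketch of webhook/spawner idempotency: derive a stable key from the event,
# claim it atomically, and skip work if it was already claimed. The marker
# directory and key fields are assumptions for illustration.
import hashlib
import json
from pathlib import Path

SEEN_DIR = Path(".state/processed")  # assumed location for processed-event markers

def idempotency_key(event):
    # Stable key: event id if present, otherwise a hash of the full payload.
    raw = str(event.get("id") or json.dumps(event, sort_keys=True))
    return hashlib.sha256(raw.encode()).hexdigest()

def handle_once(event, handler):
    SEEN_DIR.mkdir(parents=True, exist_ok=True)
    marker = SEEN_DIR / idempotency_key(event)
    try:
        # "x" mode is create-exclusive: exactly one caller wins a race.
        with open(marker, "x") as f:
            f.write(json.dumps({"status": "started"}))
    except FileExistsError:
        return False          # duplicate delivery or concurrent retry: skip
    handler(event)            # work happens only after we won the claim
    return True
```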
4. Error Handling & Recovery Review
- Failure modes: provider outage, JSON repair failure, Kanboard API fail, git fail?
- Are retries bounded and safe?
- Dead-letter / quarantine behaviors clear?
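For bounded retries plus dead-letter behavior, this is the pattern the review should confirm exists. A sketch under assumed names (`DEAD_LETTER_DIR`, the transient exception types); the project's real failure taxonomy will differ:

```python
# Sketch of bounded retries with a dead-letter drop: retry only transient
# failures, back off exponentially, and quarantine the payload when attempts
# run out instead of silently dropping it.
import json
import time
from pathlib import Path

DEAD_LETTER_DIR = Path(".state/dead_letter")  # assumed quarantine location

def call_with_retries(fn, payload, attempts=3, base_delay=1.0):
    last_error = None
    for attempt in range(attempts):
        try:
            return fn(payload)
        except (TimeoutError, ConnectionError) as exc:   # transient only; don't retry logic bugs
            last_error = exc
            time.sleep(base_delay * (2 ** attempt))      # bounded exponential backoff
    DEAD_LETTER_DIR.mkdir(parents=True, exist_ok=True)
    (DEAD_LETTER_DIR / f"{int(time.time())}.json").write_text(
        json.dumps({"payload": payload, "error": repr(last_error)})
    )
    raise RuntimeError("giving up after bounded retries") from last_error
```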
5. Performance & Scaling Review
- Profiling tool sanity: does it measure the right things?
- Hot paths: trace parsing, compression, JSON repair, LLM call loops
- Latency budgets by agent phase, and obvious bottlenecks
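Latency budgets are easiest to review when they are explicit in code. A minimal sketch with made-up phase names and numbers; the review should ask what the real budgets are:

```python
# Sketch of a per-phase latency budget check: wrap hot paths with a timer and
# flag anything over budget. Phase names and thresholds are illustrative.
import time
from contextlib import contextmanager

BUDGETS_SECONDS = {"trace_parse": 0.5, "json_repair": 1.0, "llm_call": 30.0}  # assumed

@contextmanager
def phase_timer(phase):
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        budget = BUDGETS_SECONDS.get(phase)
        if budget is not None and elapsed > budget:
            print(f"[latency] {phase} took {elapsed:.2f}s, over budget of {budget:.2f}s")

# Usage (repair_json is a hypothetical call):
# with phase_timer("json_repair"):
#     repaired = repair_json(raw_output)
```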
6. Test Suite Quality Review
- Are tests meaningful or just mocking the happy path?
- Coverage of “bad paths” (timeouts, malformed JSON, missing env, Kanban rejects)
- Is there any flakiness risk (time, filesystem, subprocess)?
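An example of the "bad path" coverage to look for: malformed LLM output should degrade gracefully rather than crash. `parse_llm_json` here is a hypothetical stand-in for whatever the project's parser/repair function is actually called:

```python
# Sketch of a bad-path test: feed deliberately broken model output and assert
# the parser degrades gracefully instead of raising.
import json
import pytest

def parse_llm_json(raw):
    # Stand-in parser; the real project presumably has its own repair logic.
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return {"error": "unparseable", "raw": raw}

@pytest.mark.parametrize("raw", [
    "",                         # empty response
    "not json at all",          # plain prose
    '{"truncated": tru',        # cut-off stream
    '{"a": 1} trailing junk',   # valid prefix, garbage suffix
])
def test_malformed_output_does_not_crash(raw):
    result = parse_llm_json(raw)
    assert isinstance(result, dict)
    assert "error" in result or "a" in result
```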
B) Operational Readiness Reviews
7. Observability Review
- Logs: are important fields always present, and are correlation IDs consistent? (see the sketch after this list)
- Traces: do they contain enough to reproduce issues? (raw output, prompt pack hash, provider metadata)
- Monitoring dashboards: are the recommended dashboards and alerts sane and not noisy?
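What "correlation IDs consistent" can look like in practice: one ID minted at the entry point and threaded through every structured log line. The field names are assumptions, not the project's logging schema:

```python
# Sketch of correlation-ID logging: mint the ID once per incoming event and
# emit one JSON object per line so logs stay machine-parseable.
import json
import logging
import uuid

logger = logging.getLogger("agent")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_event(event, correlation_id, **fields):
    logger.info(json.dumps({"event": event, "correlation_id": correlation_id, **fields}))

# Usage: the webhook handler mints the ID, then passes it everywhere downstream.
correlation_id = str(uuid.uuid4())
log_event("card_received", correlation_id, card_id=42, column="Backlog")
log_event("agent_spawned", correlation_id, agent="designer")
```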
8. Configuration & Deployability Review
- `config/llm.yaml` schema clarity + validation
- `.env.example` completeness + secure defaults
- "doctor/health" commands give actionable output
- Upgrade path: backward compatibility, migration notes
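For the schema-validation and doctor/health bullets, a sketch of the kind of actionable check I'd want; the required keys and env var names are assumptions about `config/llm.yaml`, not its real schema:

```python
# Sketch of a "doctor"-style check: load config/llm.yaml, verify the keys a run
# actually needs, and print actionable messages rather than a stack trace.
import os
import sys
import yaml  # pyyaml, already listed as a project dependency in the review notes

REQUIRED_KEYS = ["provider", "model", "timeout_seconds"]   # assumed schema
REQUIRED_ENV = ["LLM_API_KEY"]                             # assumed secret name

def doctor(path="config/llm.yaml"):
    problems = []
    try:
        with open(path) as f:
            config = yaml.safe_load(f) or {}
    except FileNotFoundError:
        print(f"FAIL: {path} not found. Copy the example config and edit it.")
        return 1
    for key in REQUIRED_KEYS:
        if key not in config:
            problems.append(f"missing key '{key}' in {path}")
    for var in REQUIRED_ENV:
        if not os.environ.get(var):
            problems.append(f"environment variable {var} is not set")
    for p in problems:
        print(f"FAIL: {p}")
    print("OK" if not problems else f"{len(problems)} problem(s) found")
    return 0 if not problems else 1

if __name__ == "__main__":
    sys.exit(doctor())
```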
9. Release Engineering Review
- Branch / PR hygiene: changelog, versioning, tags/releases
- CI suggestions: minimal workflows to run tests/lint on PR
- Reproducible dev setup instructions
C) Security Reviews
10. Secrets & Data Leakage Review
- Do traces/logs ever leak secrets (keys, headers, env vars)?
- Does config hashing truly exclude secrets everywhere?
- Are prompts/responses potentially sensitive? How is redaction handled (or intentionally not handled)?
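A sketch of the redaction the secrets review should look for (or confirm is intentionally absent): scrub sensitive keys and known secret values before traces hit disk. The key patterns here are my assumptions:

```python
# Sketch of trace redaction: mask values under suspicious keys and strip any
# known secret values that leak into free-text fields.
import os
import re

SENSITIVE_KEY_PATTERN = re.compile(r"(api[_-]?key|token|secret|password|authorization)", re.I)

def redact(obj, secret_values=None):
    secrets = secret_values or {v for k, v in os.environ.items()
                                if SENSITIVE_KEY_PATTERN.search(k) and v}
    if isinstance(obj, dict):
        return {k: "[REDACTED]" if SENSITIVE_KEY_PATTERN.search(str(k))
                else redact(v, secrets) for k, v in obj.items()}
    if isinstance(obj, list):
        return [redact(v, secrets) for v in obj]
    if isinstance(obj, str):
        for s in secrets:
            obj = obj.replace(s, "[REDACTED]")
        return obj
    return obj
```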
11. Supply Chain & Dependency Review
- New deps (`pyyaml`, `requests`, anything else): pinned? minimal?
- Any risky subprocess calls?
- Any shell injection surfaces?
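The subprocess pattern the review should expect: list-form arguments, no `shell=True`, untrusted strings passed only as data. The git example is illustrative; the project's actual calls may differ:

```python
# Sketch of a shell-injection-resistant subprocess call.
import subprocess

def git_commit(workspace, message):
    # List-form arguments mean `message` is passed as data, never parsed by a
    # shell, so card titles containing `;` or `$(...)` cannot inject commands.
    subprocess.run(
        ["git", "-C", workspace, "commit", "-m", message],
        check=True,
        capture_output=True,
        text=True,
        timeout=60,
    )

# Anti-pattern to flag in review (the shell parses the untrusted message):
# subprocess.run(f"git -C {workspace} commit -m '{message}'", shell=True)
```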
12. Threat Model Review (Practical)
| Attacker Vector | Controls to Validate |
|---|---|
| Malicious Kanban card content | syntax_guard, JSON repair, file path guards |
| Compromised provider output | git guards, command validation |
| Local user on box | workspace boundaries, allow-list enforcement |
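For the first row, a sketch of what a syntax-guard-style check might enforce on untrusted card content; the fields, size limit, and JSON assumption are illustrative, not the project's actual guard:

```python
# Sketch of validating untrusted Kanban card content before it reaches an
# agent: parse, enforce a minimal shape, and cap sizes.
import json

MAX_CARD_BYTES = 32_000                          # assumed limit
REQUIRED_FIELDS = {"title": str, "description": str}

def validate_card(raw):
    if len(raw.encode()) > MAX_CARD_BYTES:
        raise ValueError("card content too large")
    data = json.loads(raw)                       # raises on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("card content must be a JSON object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            raise ValueError(f"missing or invalid field: {field}")
    return data
```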
13. LLM Safety-Integration Review
- Prompt injection resistance across phases (design → plan → tests → code)
- Are untrusted inputs ever used to generate commands/paths?
- Are there deny-list or allow-list boundaries (e.g., only write within the workspace)?
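For the allow-list question, a sketch of the pattern: a fixed command set plus workspace-contained paths. The command list is a made-up example of the policy shape, not the project's real one:

```python
# Sketch of an allow-list boundary for model-proposed actions: only a fixed
# set of commands may run, and path-like arguments must stay in the workspace.
import shlex
from pathlib import Path

ALLOWED_COMMANDS = {"pytest", "python", "git"}   # assumed allow-list

def is_action_allowed(command_line, workspace):
    parts = shlex.split(command_line)
    if not parts or parts[0] not in ALLOWED_COMMANDS:
        return False
    ws = Path(workspace).resolve()
    for arg in parts[1:]:
        candidate = (ws / arg).resolve()
        # Reject arguments that look like paths escaping the workspace.
        if arg.startswith(("/", "..", "~")) and ws not in candidate.parents:
            return False
    return True
```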
D) Documentation Reviews
14. User Documentation Review
- Can a new user install + run end-to-end from README alone?
- Missing prerequisites? (Kanboard plugin assumptions like MetaMagik)
- Clear “happy path” + troubleshooting sections?
15. Developer Documentation Review
- Architecture diagram current?
- Where to add a new provider? a new agent? a new phase?
- Prompt-pack strategy (even if not implemented yet) clearly anticipated?
16. Runbook / Ops Docs Review
- What to do when: provider down, Kanboard rejects, spawner dupe risk, health fails
- How to read traces & monitor outputs
- How to safely retry / resume
E) Product / UX Reviews (Lightweight)
17. Workflow UX Review
- Kanban column naming: stable? documented?
- Does the system behave predictably when humans move cards “wrong”?
- Is feedback surfaced back into cards in a usable way?
18. Demo Story Review
- Is there a canonical “showcase epic” that exercises every stage?
- Are there screenshots / commands / expected outputs?
Codex Execution Bundles
For running these as separate review tasks in Codex:
| Bundle | Reviews | Focus |
|---|---|---|
| Bundle 1 | Architecture + Governance + Idempotency | Structural integrity |
| Bundle 2 | Observability + Config/Deployability + Docs | Operational clarity |
| Bundle 3 | Secrets + Threat Model + Injection Surfaces | Security posture |
| Bundle 4 | Tests + Performance/Scaling | Quality & resilience |