PromptForge
A retrieval-grounded recommendation engine that turns a project brief into a paste-ready spec for Claude Code, Cursor, Codex, Windsurf, Copilot, ChatGPT, Antigravity, Replit, or any AI coding tool. Every pick traces to a verifiable source; the engine proposes candidates and defers locked decisions to the runtime AI plus the user. Live at promptforge.uk.
What this is
Most people open Claude Code or Cursor with a one-line idea and figure out the structure as they go. That’s fine for a weekend hobby. For anything you’d want to ship, the planning quietly happens mid-build, the structure drifts between sessions, and the stack ends up being whatever the model named first instead of what actually fits your situation.
PromptForge is the five minutes of planning that should happen before the chat starts. You pick your AI tool, paste a project brief, and get back a meta-prompt where every recommendation points at a verified source you can re-check. The output is one paste away from a working session in Claude Code, Cursor, Codex, Windsurf, Copilot, ChatGPT, Antigravity, Replit, or any AI coding tool.
Why traceable picks are the product
Most AI tooling hands you a build plan and asks you to trust it. PromptForge hands you a build plan where every line ties back to a verifiable source. Each recommendation carries a source URL plus a last-verified date; if you don’t agree with a pick, you know exactly which document to argue with, and if a recommendation goes stale the next refresh pass catches it.
That contract is enforced server-side. A validator strips sentences from the synthesised output if they cite a source the retriever didn’t return, then retries once at temperature zero before falling back to evidence-only rendering. The model can be wrong about what to build for you. It cannot ship a confidently-sourced recommendation that traces to nothing.
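As a rough illustration of the stripping step only (the retry and the evidence-only fallback are left out), a validator along these lines would do. The `[src:...]` marker format, the function name, and the naive sentence splitter are assumptions made for this sketch, not the production code.

```python
import re

# Hypothetical citation marker: [src:some-id]. The real format is not stated here.
CITATION = re.compile(r"\[src:([A-Za-z0-9_-]+)\]")


def strip_unsupported(text: str, retrieved_ids: set[str]) -> tuple[str, list[str]]:
    """Drop every sentence that cites an ID the retriever did not return."""
    # Naive split: sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept, dropped = [], []
    for sentence in sentences:
        cited = set(CITATION.findall(sentence))
        # Sentences with no citation pass through; only unknown sources are stripped.
        if cited <= retrieved_ids:
            kept.append(sentence)
        else:
            dropped.append(sentence)
    return " ".join(kept), dropped
```

A caller could treat a non-empty `dropped` list as the trigger for the single temperature-zero retry.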
Doctrine: candidates, not decisions
The biggest architectural shift in the engine is the doctrine the runtime AI receives as its system prompt. I had to ship two principles explicitly after watching the early outputs:
- PromptForge proposes; the AI verifies; the user confirms. When the brief leaves a layer silent (does this £0 personal app need Stripe or PayPal Donate? does this small SaaS need Postgres or SQLite?) PromptForge surfaces 2-3 candidates with pros and cons against the user’s stated constraints, and a recommended default. It does not bake a vendor pick into the decision anchors before the user has agreed.
- Architectural axes the brief doesn’t determine get delegated, not invented. Instead of guessing whether a build is monolith-vs-services or sync-vs-async, the architecture block instructs the runtime AI to research 2-3 reference architectures and ask the user before locking. PromptForge tells users’ AIs to ask before assuming; the same rule applies to PromptForge.
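One way to picture the "candidates, not decisions" rule is as a data shape in which a recommendation and a confirmed choice are separate fields. The class and field names below are hypothetical and the URLs are placeholders; this is a sketch of the idea, not the engine's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    name: str
    pros: list[str]
    cons: list[str]
    source_url: str  # every candidate still has to trace to a source


@dataclass
class LayerProposal:
    layer: str                       # e.g. "payments" or "database"
    candidates: list[Candidate]
    recommended: str                 # a default suggestion, not a decision
    confirmed: Optional[str] = None  # set only after the user agrees

    def confirm(self, choice: str) -> None:
        """Lock the layer, but only to something that was actually proposed."""
        if choice not in {c.name for c in self.candidates}:
            raise ValueError(f"{choice!r} is not one of the proposed candidates")
        self.confirmed = choice

    @property
    def locked(self) -> bool:
        return self.confirmed is not None
```

Keeping `recommended` and `confirmed` apart means nothing downstream can mistake a default for an agreed pick.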
How it’s built
The pipeline is plain:
The brief gets embedded by Voyage AI’s voyage-3.5-lite (1024
dims, Anthropic’s officially recommended embedding partner).
pgvector inside Supabase Postgres returns the top 40 by cosine
similarity; Postgres full-text search returns the top 40 by BM25.
Reciprocal rank fusion merges the two into a top-20 candidate
list. An optional reranker narrows that to the top six: either a
local CrossEncoder or Voyage’s rerank-2-lite, which comes from the
same vendor as the embedder and so adds no new dependency. Hard-fired
dependency rules (the kind that need to fire regardless of vector
ranking, like “a £0 budget plus a payments feature is a hard
block”) are injected separately. Anthropic’s Claude Haiku 4.5 synthesises
the final meta-prompt against the retrieval set, and the citation
validator above polices the output.
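The fusion step in the pipeline above can be sketched in a few lines. This is a minimal version of reciprocal rank fusion, assuming the conventional smoothing constant k = 60 (the value used in the engine is not stated here) and a tie-break by ID to keep the output deterministic.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60, top_n: int = 20
) -> list[str]:
    """Merge ranked ID lists; each list adds 1 / (k + rank) to an item's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID (an assumption of this sketch).
    ordered = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return ordered[:top_n]


vector_hits = ["a", "b", "c"]   # stand-in for the cosine-similarity ranking
keyword_hits = ["b", "d", "a"]  # stand-in for the full-text ranking
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Because only ranks enter the score, the two retrievers never need their raw scores normalised against each other.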
The knowledge base is ~500 atomic YAML rows across nine corpora:
build blocks, stack options, dependency rules, AI-tool
capabilities, external sources, MCP servers, skills, workflow
patterns, OS/IDE compatibility. Editing a row bumps its version;
the ingest is idempotent on content_sha256; the prior version
is flipped to superseded atomically so retrieval queries only
see live rows. Adding a new option is a YAML edit, not a code
change. I picked this shape because rules a user-facing tool
relies on shouldn’t live where only the original author can change
them safely.
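The versioning behaviour described above can be illustrated with an in-memory store. In the real system the supersede-and-insert would presumably happen inside one Postgres transaction; the function names, the status strings, and the dictionary layout here are assumptions for the sketch.

```python
import hashlib
import json


def content_sha256(row: dict) -> str:
    """Hash the row with stable key order so equal content gives an equal digest."""
    canonical = json.dumps(row, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def ingest(store: dict[str, list[dict]], row_id: str, row: dict) -> bool:
    """Write a new live version unless the content is unchanged. True if written."""
    digest = content_sha256(row)
    versions = store.setdefault(row_id, [])
    live = next((v for v in versions if v["status"] == "live"), None)
    if live is not None and live["content_sha256"] == digest:
        return False  # idempotent: re-ingesting identical content is a no-op
    if live is not None:
        live["status"] = "superseded"
    versions.append(
        {
            "version": len(versions) + 1,
            "status": "live",
            "content_sha256": digest,
            "content": row,
        }
    )
    return True


def live_rows(store: dict[str, list[dict]]) -> dict[str, dict]:
    """What retrieval is allowed to see: only rows whose status is live."""
    return {
        row_id: v["content"]
        for row_id, versions in store.items()
        for v in versions
        if v["status"] == "live"
    }
```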
Frontend is Next.js on Vercel. Backend is FastAPI on Railway. The retrieval, ranking, and synthesis logic is about 700 lines across five files. No LangChain, no LlamaIndex. Removing the framework is half the signal.
Eval as a first-class artefact
An EVAL.md scoreboard sits in the repo. Four deterministic gates
(recall@6, precision@6, citation_accuracy, and
expected_claims_present) run on a fixed-seed CI harness with 15
golden fixtures spanning every supported AI tool. Two soft nightly
gates (faithfulness, context_precision via ragas) bring an LLM
into the loop only when the deterministic gates have passed.
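For readers unfamiliar with the two retrieval gates, here is a minimal sketch of recall@k and precision@k using the standard definitions. How the harness handles an empty relevant set, and what thresholds it applies, are not stated here; the choices below are assumptions.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 6) -> float:
    """Fraction of the relevant IDs that appear in the top k results."""
    if not relevant:
        return 1.0  # assumption: nothing to find counts as a vacuous pass
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 6) -> float:
    """Fraction of the top k slots filled by a relevant ID."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


retrieved = ["a", "b", "c", "d", "e", "f", "g"]
relevant = {"a", "c", "g", "z"}
recall = recall_at_k(retrieved, relevant)        # "a" and "c" found out of 4
precision = precision_at_k(retrieved, relevant)  # 2 relevant in 6 slots
```

Both functions are pure and take no randomness, which is what lets them sit in a deterministic CI gate.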
An ABLATIONS.md table alongside it runs the same fixtures with
one piece of the pipeline pulled out at a time, so the value of
the reranker, the hard-fired rules, and the citation contract is
measured, not asserted. Most undergrad RAG portfolios prove the
pipeline works. Ablations prove the maker understood which parts
earned their cost.
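The "one piece pulled out at a time" idea can be sketched as a set of configurations derived from a full baseline. The flag names below are hypothetical; the harness that actually produces ABLATIONS.md may be organised differently.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class PipelineConfig:
    # Hypothetical switches, one per component named in the ablation table.
    use_reranker: bool = True
    use_hard_rules: bool = True
    use_citation_validator: bool = True


def ablation_variants(base: PipelineConfig) -> dict[str, PipelineConfig]:
    """Return the full config plus one variant per component switched off."""
    variants = {"full": base}
    for f in fields(base):
        label = f"no_{f.name.removeprefix('use_')}"
        variants[label] = replace(base, **{f.name: False})
    return variants


variants = ablation_variants(PipelineConfig())
```

Running the same fixtures against each variant and comparing to `full` gives one table row per component.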
What building this taught me about using AI
Most of the real work was reading the failure modes of giving instructions to a language model and shaping the product around them rather than the happy path. A few patterns stuck.
Empty beats guessed. A lot of AI tooling silently fills in missing context with a plausible default, and the user spots the wrong assumption three weeks in. I built the opening-sentence extractor to do the opposite: if the text doesn’t state the project’s scale, budget, or compliance requirements, the field stays empty, the engine refuses to recommend, and the user sees a “gathering your constraints first” message instead of a half-informed guess. Refusing rather than giving a degraded answer when the input under-determines the output is the cheapest control surface a small product has.
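A toy version of that behaviour, to make the rule concrete: fields are filled only when the text states them, and any gap produces a refusal instead of a default. The regular expressions and names are deliberately simplistic assumptions; the real extractor is not shown here.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Constraints:
    budget: Optional[str] = None
    scale: Optional[str] = None
    compliance: Optional[str] = None

    def missing(self) -> list[str]:
        """Names of the fields the brief left unstated, in declaration order."""
        return [name for name, value in vars(self).items() if value is None]


def extract_constraints(brief: str) -> Constraints:
    """Fill a field only when the brief states it; never substitute a default."""
    budget = re.search(r"£\s?\d[\d,]*", brief)
    scale = re.search(r"\b\d[\d,]*\s+users\b", brief, re.IGNORECASE)
    compliance = re.search(r"\b(GDPR|HIPAA|PCI[- ]DSS|SOC ?2)\b", brief, re.IGNORECASE)
    return Constraints(
        budget=budget.group(0) if budget else None,
        scale=scale.group(0) if scale else None,
        compliance=compliance.group(0) if compliance else None,
    )


def respond(brief: str) -> str:
    """Refuse to recommend while any constraint is still unknown."""
    gaps = extract_constraints(brief).missing()
    if gaps:
        return "Gathering your constraints first: " + ", ".join(gaps)
    return "ready to recommend"
```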
Citations or no shipping. The first version of the synthesis layer was Haiku-with-temperature, no validator, no retry. It sounded great and made things up. The lesson was that “the AI is mostly right” is not the same product as “the AI cannot be confidently wrong”. The citation validator was the line I had to draw between the two; I’d put it in earlier next time.
The catalogue teaches, it doesn’t tell. A specific library
version named in a static prompt is a slow-motion bug; the moment
that library deprecates or pivots, every output ships stale
advice. Every recommendation in PromptForge carries a three-step
contract instead of a static answer: the runtime AI fetches the
primary source, writes the verification date into a repo note,
and flags defaults that have aged. The contract branches per AI
tool because the wrong invocation idiom is another way to fail
silently. Claude Code is pointed at Context7 + WebFetch, Cursor at
@docs, Codex at /research, and the universal fallback at plain bash
+ curl.
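The per-tool branching might look something like the lookup below. The dictionary keys, the function name, and the exact wording of the three steps are assumptions for illustration; only the idioms themselves come from the text above.

```python
# Idioms as named in the text; the lowercase tool keys are hypothetical.
VERIFY_IDIOM = {
    "claude-code": "Context7 + WebFetch",
    "cursor": "@docs",
    "codex": "/research",
}
FALLBACK_IDIOM = "plain bash + curl"


def verification_contract(tool: str, source_url: str) -> list[str]:
    """Render the three-step contract using the idiom that fits the chosen tool."""
    idiom = VERIFY_IDIOM.get(tool.lower(), FALLBACK_IDIOM)
    return [
        f"Fetch the primary source at {source_url} using {idiom}.",
        "Write the verification date into a repo note.",
        "Flag any default that has aged.",
    ]


steps = verification_contract("Cursor", "https://example.com/docs")
```

Any tool without its own entry falls through to the universal fallback rather than borrowing another tool's idiom.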
Quality has to be a test, not a vibe. “The output looked good last time” is not a guarantee for the next user. The deterministic eval gates pin retrieval recall, citation accuracy, and expected claims. The ablation table measures whether each piece of the pipeline earns its cost. Most of the bugs I caught late were the kind you can only see when you write the test that forces the failure to be visible.
Errors are part of the product. “We couldn’t reach the server” reads better than “Failed to fetch”, and “the knowledge base is still being indexed, try the guided wizard instead” reads better than a 503. The structured detail still travels in the payload so the UI can branch on the underlying reason; only the visible text sits on the user’s side of the wall.
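A small sketch of that split between structured detail and visible text. The error codes, the payload keys, and the generic fallback message are hypothetical; only the two user-facing sentences are taken from the paragraph above.

```python
USER_MESSAGES = {
    # Hypothetical machine-readable codes mapped to the visible text.
    "network_unreachable": "We couldn't reach the server.",
    "kb_indexing": (
        "The knowledge base is still being indexed, "
        "try the guided wizard instead."
    ),
}
GENERIC_MESSAGE = "Something went wrong. Please try again."  # assumed fallback


def error_payload(code: str, status: int, detail: str) -> dict:
    """Keep the structured reason for the UI; expose only friendly text to users."""
    return {
        "code": code,      # the UI can branch on this
        "status": status,  # underlying HTTP status, if any
        "detail": detail,  # diagnostic text, not shown to the user
        "message": USER_MESSAGES.get(code, GENERIC_MESSAGE),
    }


payload = error_payload("kb_indexing", 503, "index build in progress")
```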
See it for yourself
A worked example of one brief flowing through the engine end-to-end lives at promptforge.uk/case-study. Static, ninety seconds to read, shows what gets picked up from the brief, the constraints flagged for the build, the suggested stack with the actual source URL for each pick, and a condensed excerpt of the meta-prompt the user receives.
Status
Live at promptforge.uk. The retrieval
demo sits at /rag; the older guided wizard sits at /session/new and
still works.
The roadmap from here: more golden fixtures, wider use of the
Voyage reranker now that it is wired in, and a local-CLI mode so
power users can run against their own keys.