Abdalla Bakr
Live

PromptForge

A retrieval-grounded recommendation engine that turns a project brief into a paste-ready spec for Claude Code, Cursor, Codex, Windsurf, Copilot, ChatGPT, Antigravity, Replit, or any AI coding tool. Every pick traces to a verifiable source; the engine proposes candidates and defers locked decisions to the runtime AI plus the user. Live at promptforge.uk.

Next.js, TypeScript, FastAPI, Python, Supabase, pgvector, Voyage, Anthropic

What this is

Most people open Claude Code or Cursor with a one-line idea and figure out the structure as they go. That’s fine for a weekend hobby. For anything you’d want to ship, the planning quietly happens mid-build, the structure drifts between sessions, and the stack ends up being whatever the model named first instead of what actually fits your situation.

PromptForge is the five minutes of planning that should happen before the chat starts. You pick your AI tool, paste a project brief, and get back a meta-prompt where every recommendation points at a verified source you can re-check. The output is one paste away from a working session in Claude Code, Cursor, Codex, Windsurf, Copilot, ChatGPT, Antigravity, Replit, or any AI coding tool.

Why traceable picks are the product

Most AI tooling hands you a build plan and asks you to trust it. PromptForge hands you a build plan where every line ties back to a verifiable source. Each recommendation carries a source URL plus a last-verified date; if you don’t agree with a pick, you know exactly which document to argue with, and if a recommendation goes stale the next refresh pass catches it.

That contract is enforced server-side. A validator strips sentences from the synthesised output if they cite a source the retriever didn’t return, then retries once at temperature zero before falling back to evidence-only rendering. The model can be wrong about what to build for you. It cannot ship a confidently-sourced recommendation that traces to nothing.
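A minimal sketch of how such a validator could work, not the production code. The bracketed `[src:...]` citation format, the function names, the 0.7 first-attempt temperature, and the `synthesise` callable are all assumptions made for illustration; the source only states the behaviour (strip unsupported sentences, retry once at temperature zero, fall back to evidence-only rendering).

```python
import re
from typing import Callable

# Hypothetical citation format: a bracketed source id such as [src:pg-docs].
CITATION = re.compile(r"\[src:([a-z0-9\-]+)\]")


def strip_unsupported(text: str, retrieved_ids: set[str]) -> tuple[str, list[str]]:
    """Drop every sentence that cites a source id the retriever did not return."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept, dropped = [], []
    for sentence in sentences:
        cited = set(CITATION.findall(sentence))
        if cited - retrieved_ids:  # at least one citation traces to nothing
            dropped.append(sentence)
        else:
            kept.append(sentence)
    return " ".join(kept), dropped


def render(
    synthesise: Callable[[float], str],  # hypothetical: temperature -> draft text
    evidence_lines: list[str],
    retrieved_ids: set[str],
) -> str:
    """Try a normal draft, retry once at temperature zero, else show evidence only."""
    for temperature in (0.7, 0.0):  # 0.7 is an assumed first-attempt value
        clean, dropped = strip_unsupported(synthesise(temperature), retrieved_ids)
        if not dropped:
            return clean
    # Both drafts cited something unretrieved: render the evidence with no synthesis.
    return "\n".join(evidence_lines)
```

The useful property is that the check depends only on the set of ids the retriever returned, so it can run without calling a model.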

Doctrine: candidates, not decisions

The biggest architectural shift in the engine is the doctrine the runtime AI sees as its system prompt. I had to ship two principles explicitly after watching the early outputs:

  • PromptForge proposes; the AI verifies; the user confirms. When the brief leaves a layer silent (does this £0 personal app need Stripe or PayPal Donate? does this small SaaS need Postgres or SQLite?) PromptForge surfaces 2-3 candidates with pros and cons against the user’s stated constraints, and a recommended default. It does not bake a vendor pick into the decision anchors before the user has agreed.
  • Architectural axes the brief doesn’t determine get delegated, not invented. Instead of guessing whether a build is monolith-vs-services or sync-vs-async, the architecture block instructs the runtime AI to research 2-3 reference architectures and ask the user before locking. PromptForge tells users’ AIs to ask before assuming; the same rule applies to PromptForge.

How it’s built

The pipeline is plain:

The brief gets embedded by Voyage AI’s voyage-3.5-lite (1024 dimensions, from Anthropic’s officially recommended embedding partner). pgvector inside Supabase Postgres returns the top 40 by cosine similarity; Postgres full-text search returns the top 40 by BM25. Reciprocal rank fusion merges the two lists into a top-20 candidate list. An optional reranker narrows that to the top six: either a local CrossEncoder or Voyage’s rerank-2-lite, which comes from the same vendor as the embedder and so adds no new dependency. Hard-fired dependency rules, the kind that must fire regardless of vector ranking (for example, “a £0 budget plus a payments feature is a hard block”), are injected separately. Anthropic’s Claude Haiku 4.5 synthesises the final meta-prompt against the retrieval set, and the citation validator described above polices the output.
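The fusion step can be sketched in a few lines. This is an illustrative version of standard reciprocal rank fusion, not the engine's actual code: the function name is hypothetical, and the smoothing constant k = 60 is the value commonly used in the literature, assumed here because the write-up does not state one.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60, top_n: int = 20
) -> list[str]:
    """Merge several ranked id lists; each list adds 1 / (k + rank) per id."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; the id is a deterministic tie-break.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered[:top_n]]


# Toy example: ids from the vector search and from the full-text search.
vector_hits = ["a", "b", "c"]
lexical_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([vector_hits, lexical_hits])
```

Because the fusion uses only ranks, the cosine similarities and the full-text scores never need to be put on a common scale.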

The knowledge base is ~500 atomic YAML rows across nine corpora: build blocks, stack options, dependency rules, AI-tool capabilities, external sources, MCP servers, skills, workflow patterns, OS/IDE compatibility. Editing a row bumps its version; the ingest is idempotent on content_sha256; the prior version is flipped to superseded atomically so retrieval queries only see live rows. Adding a new option is a YAML edit, not a code change. I picked this shape because rules a user-facing tool relies on shouldn’t live where only the original author can change them safely.
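The versioning behaviour described above can be illustrated with an in-memory sketch. Only `content_sha256` and the live/superseded distinction come from the write-up; the `Row` shape, the `ingest` function, and the list-backed store are hypothetical stand-ins for what would be a single database transaction in a real deployment.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass
class Row:
    key: str
    version: int
    content_sha256: str
    status: str  # "live" or "superseded"
    body: dict


def content_hash(body: dict) -> str:
    """Hash a canonical JSON form so key order does not change the digest."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def ingest(store: list[Row], key: str, body: dict) -> Row:
    """Insert a row version; re-ingesting identical content is a no-op."""
    digest = content_hash(body)
    live = next((r for r in store if r.key == key and r.status == "live"), None)
    if live is not None and live.content_sha256 == digest:
        return live  # unchanged content: nothing to write
    next_version = 1
    if live is not None:
        # In a database these two writes would share one transaction.
        live.status = "superseded"
        next_version = live.version + 1
    row = Row(key, next_version, digest, "live", body)
    store.append(row)
    return row
```

Retrieval would then filter on `status == "live"`, so an edited row never appears alongside its own previous version.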

Frontend is Next.js on Vercel. Backend is FastAPI on Railway. The retrieval, ranking, and synthesis logic is about 700 lines across five files. No LangChain, no LlamaIndex. Removing the framework is half the signal.

Eval as a first-class artefact

An EVAL.md scoreboard sits in the repo. Four deterministic gates (recall@6, precision@6, citation_accuracy, and expected_claims_present) run on a fixed-seed CI harness with 15 golden fixtures spanning every supported AI tool. Two soft nightly gates (faithfulness, context_precision via ragas) bring an LLM into the loop only when the deterministic gates have passed.
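For reference, the two retrieval gates follow the textbook definitions, sketched below. The function names and the fixture shape (a ranked list of row ids plus a set of expected ids) are assumptions; the harness itself is not shown in this write-up.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 6) -> float:
    """Fraction of the expected rows that appear in the top k results."""
    if not relevant:
        raise ValueError("a golden fixture needs at least one expected row")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 6) -> float:
    """Fraction of the top k slots filled by expected rows."""
    hits = len(set(retrieved[:k]) & relevant)
    return hits / k


# Toy fixture: seven retrieved ids, four expected ids, two of them in the top six.
retrieved_ids = ["a", "b", "c", "d", "e", "f", "g"]
expected_ids = {"a", "c", "g", "z"}
recall = recall_at_k(retrieved_ids, expected_ids)        # 2 of 4 expected rows found
precision = precision_at_k(retrieved_ids, expected_ids)  # 2 of 6 slots are relevant
```

Both functions are pure and model-free, so they return the same scores on every CI run for the same inputs.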

An ABLATIONS.md table alongside it runs the same fixtures with one piece of the pipeline pulled out at a time, so the value of the reranker, the hard-fired rules, and the citation contract is measured, not asserted. Most undergrad RAG portfolios prove the pipeline works. Ablations prove the maker understood which parts earned their cost.

What building this taught me about using AI

Most of the real work was reading the failure modes of giving instructions to a language model and shaping the product around them rather than the happy path. A few patterns stuck.

Empty beats guessed. A lot of AI tooling silently fills in missing context with a plausible default, and the user spots the wrong assumption three weeks in. I built the opening-sentence extractor to do the opposite: if the brief doesn’t state the project’s scale, budget, or compliance requirements, the field stays empty, the engine refuses to recommend, and the user sees a “gathering your constraints first” message instead of a half-informed guess. Refusal over degraded answering, when the input under-determines the output, is the cheapest control surface a small product has.
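The refusal gate might look something like this. It is a sketch under assumptions: the `Constraints` dataclass, the `respond` function, and the response shape are hypothetical; only the three fields and the user-facing message come from the text above.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Constraints:
    # None means "the brief did not say", never "use a default".
    scale: Optional[str] = None
    budget: Optional[str] = None
    compliance: Optional[str] = None


def missing_fields(constraints: Constraints) -> list[str]:
    """List every constraint the brief left unstated."""
    return [f.name for f in fields(constraints) if getattr(constraints, f.name) is None]


def respond(constraints: Constraints) -> dict:
    """Refuse to recommend while any constraint is still empty."""
    missing = missing_fields(constraints)
    if missing:
        return {
            "status": "needs_constraints",
            "message": "Gathering your constraints first",
            "missing": missing,
        }
    return {"status": "ready", "missing": []}
```

Keeping `None` distinct from any real value is what stops a plausible default from leaking into a recommendation.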

Citations or no shipping. The first version of the synthesis layer was Haiku-with-temperature, no validator, no retry. It sounded great and made things up. The lesson was that “the AI is mostly right” is not the same product as “the AI cannot be confidently wrong”. The citation validator was the line I had to draw between the two; I’d put it in earlier next time.

The catalogue teaches, it doesn’t tell. A specific library version named in a static prompt is a slow-motion bug; the moment that library deprecates or pivots, every output ships stale advice. Every recommendation in PromptForge carries a three-step contract instead of a static answer: the runtime AI fetches the primary source, writes the verification date into a repo note, and flags defaults that have aged. The fetch step branches per AI tool, because the wrong invocation idiom is another way to fail silently. Claude Code is pointed at Context7 + WebFetch, Cursor at @docs, Codex at /research, and the universal fallback at plain bash + curl.
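One way to picture the per-tool branch is a lookup table with a fallback. The idioms are the ones named above; the dictionary keys, the function name, and the wording of the three instructions are illustrative assumptions, not the text PromptForge actually emits.

```python
# Tool ids are hypothetical keys; the idioms are those named in the text.
FETCH_IDIOM = {
    "claude-code": "Context7 + WebFetch",
    "cursor": "@docs",
    "codex": "/research",
}
FALLBACK_IDIOM = "plain bash + curl"


def verification_steps(tool: str, source_url: str) -> list[str]:
    """Render the three-step contract for one recommendation and one AI tool."""
    idiom = FETCH_IDIOM.get(tool, FALLBACK_IDIOM)
    return [
        f"Fetch the primary source at {source_url} using {idiom}.",
        "Write the verification date into a repo note.",
        "Flag any default that has aged.",
    ]
```

An unknown tool falls through to the universal idiom instead of raising, so every recommendation still carries the contract.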

Quality has to be a test, not a vibe. “The output looked good last time” is not a guarantee for the next user. The deterministic eval gates pin retrieval recall, citation accuracy, and expected claims. The ablation table measures whether each piece of the pipeline earns its cost. Most of the bugs I caught late were the kind you can only see when you write the test that forces the failure to be visible.

Errors are part of the product. “We couldn’t reach the server” reads better than “Failed to fetch”, and “the knowledge base is still being indexed, try the guided wizard instead” reads better than a 503. The structured detail still travels in the payload so the UI can branch on the underlying reason; only the visible text sits on the user’s side of the wall.
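A sketch of that split between visible text and structured detail. The error codes, the default message, and the payload keys are hypothetical; the two user-facing sentences are the ones quoted above.

```python
# Hypothetical error codes mapped to the user-facing copy quoted in the text.
USER_MESSAGES = {
    "network_unreachable": "We couldn’t reach the server.",
    "kb_indexing": (
        "The knowledge base is still being indexed, "
        "try the guided wizard instead."
    ),
}
DEFAULT_MESSAGE = "Something went wrong. Please try again."  # assumed fallback copy


def error_payload(code: str, detail: str, http_status: int) -> dict:
    """Pair a friendly message with the structured reason the UI can branch on."""
    return {
        "message": USER_MESSAGES.get(code, DEFAULT_MESSAGE),  # shown to the user
        "code": code,            # stable identifier for UI branching
        "detail": detail,        # raw reason, kept out of the visible text
        "status": http_status,
    }


payload = error_payload("kb_indexing", "ingest job running", 503)
```

The UI can switch on `code` to offer the wizard link while rendering only `message`.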

See it for yourself

A worked example of one brief flowing through the engine end-to-end lives at promptforge.uk/case-study. It is static, takes about ninety seconds to read, and shows what gets picked up from the brief, the constraints flagged for the build, the suggested stack with the actual source URL for each pick, and a condensed excerpt of the meta-prompt the user receives.

Status

Live at promptforge.uk. The retrieval demo sits at /rag; the older guided wizard is at /session/new and still works. The roadmap from here: more golden fixtures, wider use of the Voyage reranker now that it is wired in, and a local-CLI mode so power users can run against their own keys.