The Prompt Factory: How We Turned Prompt Engineering into an Assembly Line

Most companies have no shortage of AI ideas. LLMs open up endless possibilities to automate or reengineer the most repetitive and time-consuming processes.

What’s commonly lacking though is the ability to quickly translate these ideas into reliable tooling. The gap between “we should automate this” and “here’s a prompt that actually consistently works” is where many AI initiatives quietly go to die. That’s because generating AI workflows that add value usually requires a time-consuming process of scoping calls, drafting requirements, generating system prompts, testing and evaluating outputs, then optimizing the prompt until it’s finally production-ready. Some large enterprises have the resourcing to handle this process, but many of our clients don’t. 

So we decided to solve that gap: not with more AI talent, but by constructing a system that helps to automate prompt development and optimization. The result is an automated workflow that moves from scoping conversation to production-ready, battle-tested, AI-powered tooling that can solve most business problems in less than half an hour.

This post documents how we built it, what we learned, and the design decisions that had to be scrapped entirely. 

The Problem With How AI Tools Get Built Today

Custom-built AI workflows still aren’t accessible to most employees today. Despite all of the hype, 64% of employees cite not being adequately trained in AI use (BCG), and only 31% of employees have received prompt engineering training from their employers. Many of the untrained are employees of SMBs, which lack a dedicated AI team that can quickly pull together new tooling or support team members in the field. Even teams within enterprises often have trouble securing the limited resources available from AI centers of excellence. 

But even when resourcing is available, plenty can go wrong when building AI tools across the process from design to development. End users commonly don’t know what to ask for, or what AI can realistically accomplish. They can easily describe symptoms, but can struggle to identify root causes. They request features that sound great, but which don’t map to how LLMs actually behave or reason.

After a workflow is built, teams need to invest in testing efficacy, which is a time-consuming and resource-intensive exercise. How do you know when a prompt is “good enough”? Most teams resort to vibes: they run it a few times, eyeball the outputs, then ship it. The teams building AI tools are often technically knowledgeable, but lack functional knowledge, domain expertise, or the end users’ guidance on what to improve. End users often won’t have the patience to work through the feedback cycle, or won’t have the heart to share that the tool isn’t meeting their needs.

We wanted to eliminate these modes of failure. That meant building a system that could extract real requirements from scoping conversations, enrich conversations with relevant context, evaluate prompts objectively, and optimize with precision rather than guesswork.

From Messy Transcripts to Production-Ready Agents

We started with a simple ambition: What if every good AI idea could become a production-ready tool in under an hour? That’s now possible: scoping conversation transcripts act as the fuel and raw material for a product pipeline.

In scoping calls, I start by asking customers about the problem they’re facing, the constraints they’re up against, and the design solutions that will work best. These conversations contain everything needed to spec a tool – but that information is unstructured, incomplete, and often buried in conversational tangents. Fortunately, LLMs excel at synthesizing unstructured data into organized, detailed outputs. Problem solved!

Our workflow ingests call transcripts and automatically extracts mission statements, user personas, technical or operational constraints, and success criteria – all before tool generation begins. The AI acts as thought partner, context synthesizer, and product manager simultaneously.
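As a rough sketch, that extraction target can be pinned down as a small schema we validate the model’s JSON output against. The `ToolSpec` fields mirror the list above; the names and the `parse_spec` helper are illustrative, not our production code:

```python
import json
from dataclasses import dataclass

@dataclass
class ToolSpec:
    """Structured requirements distilled from a scoping-call transcript."""
    mission: str
    personas: list
    constraints: list
    success_criteria: list

def parse_spec(raw: str) -> ToolSpec:
    """Parse the extraction model's JSON output and fail loudly on gaps."""
    data = json.loads(raw)
    required = ("mission", "personas", "constraints", "success_criteria")
    missing = [k for k in required if k not in data]
    if missing:
        raise ValueError(f"extraction missing fields: {missing}")
    return ToolSpec(**{k: data[k] for k in required})
```

Validating against a fixed schema means a thin or rambling transcript surfaces as a concrete list of missing fields, rather than silently producing a vague prompt downstream.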

But the call alone usually won’t offer all the detail we need. If you’ve worked with clients, you know the challenge: they commonly struggle to articulate their needs. Sometimes they haven’t deconstructed the problem, sharing the frustration rather than the root cause. Sometimes they don’t know what solutions are possible. The information you need is there – but it’s incomplete or scattered across assumptions they haven’t yet examined.

This is where most prompt engineering efforts underinvest. They take what’s given and start building immediately. We do the opposite: we invest heavily in enriching context before writing a single instruction.

Enriching Context Before Building Tools

Our enrichment approach uses two methods: researching business analogs and leaning on LLMs as specialized teammates.

Third Party Research to Fill Scoping Detail Gaps

If customers aren’t offering enough context to build a useful prompt, how should we fill in the gaps? One insight from our build process is that there are business analogs everywhere. Any issue an end user surfaces has similar user roles, business problems, and work environments elsewhere. By using LLMs to mine these areas, we can fill the gaps left in scoping conversations with supplemental details and context that strengthen our final prompt. We leaned on research to enrich three specific areas:

Problem Context. Every tool solves a problem that other companies have already solved. We supplement feedback from clients with secondary research on analogous challenges and proven solutions. This helps us better map constraints, potential solution specs, and supporting details.

Role Context. Details about the end user inform how a tool should be built, but those details are commonly left out of scoping calls. For example, is the user an SDR, a Sales Manager, or a CRO? Each role has different information needs, decision patterns, and pressures. There’s enormous overlap in responsibilities, constraints, and pressures across similar job titles and functions, so additional context about our end user persona improves design decisions in ways that scoping calls alone can’t.

Company Context. We experienced firsthand that context on the type of company you’re building for matters. Company size, industry, business model, competitive positioning, and culture all shape the operating environment and what users need out of a tool. A tool that works beautifully for a 50-person startup may fail completely in a 5,000-person enterprise.

Our goal isn’t comprehensiveness. It’s filling the specific gaps in our scoping call that would otherwise turn into blind spots in our final prompt. We enlisted one of AI’s superpowers (Deep Research) to collect additional content on the problem, role, and company context before formulating our prompt.

But that wasn’t the only enrichment strategy we used…

Models as Teammates

Early on, we realized that no single model excels at everything. Instead of asking one model to be perfect, we let three models (ChatGPT, Claude, and Gemini) be brilliant in their respective niches and designed a sequence that plays to each of their strengths.

ChatGPT serves as our generalist. It’s strongest at connecting disparate ideas into coherent wholes, which makes it ideal for the initial context assembly.

Claude brings user empathy and human sensitivity. Our instructions ask Claude to “focus on the human experience behind each persona – their daily frustrations and emotional responses.” This isn’t soft thinking; it helps us to catch the requirements that users feel but don’t articulate. We know that users won’t use new tooling if it’s not solving their biggest pains in ways that improve their work lives. 

Gemini operates as our logic engine, handling technical feasibility and constraint analysis. It’s best at identifying where proposed solutions might break against real-world limitations – the technical geek of the group, steering our outputs toward usability without requiring additional integrations or middleware. We found that, left to their own devices, LLMs sometimes recommend overly complex solutions that most teams aren’t resourced to apply. Gemini helps us rein in this behavior.

The sequence matters. GPT establishes the frame, Claude humanizes it, and Gemini stress-tests technical feasibility. All three feed into a single context model before any prompt instructions are generated. We treated LLMs like a product team: the analyst, the product manager, and the architect all had a seat at the table. Context generation became a team sport.
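The relay can be sketched as a three-stage pipeline where each model is a pluggable callable. `build_context` and its prompt strings are illustrative assumptions, not our actual orchestration code:

```python
def build_context(transcript: str, generalist, humanizer, critic) -> str:
    """Run the three-model relay: frame, humanize, then stress-test."""
    # ChatGPT's role: assemble disparate transcript details into one coherent frame.
    frame = generalist(f"Synthesize requirements from this transcript:\n{transcript}")
    # Claude's role: surface the human experience behind each persona.
    humanized = humanizer(f"Add the daily frustrations and emotional context behind each persona:\n{frame}")
    # Gemini's role: flag feasibility risks and trim over-engineering.
    return critic(f"Stress-test technical feasibility and trim complexity in:\n{humanized}")
```

In production each callable would wrap an API call to the corresponding model; the point of the structure is that the order is fixed and each stage consumes the previous stage’s output.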

This approach admittedly takes more time than single-model generation (five minutes to be exact). But the prompts that emerge are dramatically more robust because we’ve filled any holes from initial interviews, and brought in three unique collaborators to weigh in. Ultimately, the challenge we’re solving for is changing user behavior, and that can only happen if the tools we build solve problems consistently. 

Building Prompts as Products

Using research interviews and enrichment context as inputs, our system then generates a preliminary prompt. This could be an automated coaching tool for sales, a blog writer for marketing, or a strategy analysis tool for executives. The use cases are excitingly endless!

But speed isn’t the real innovation. The innovation is that we treat prompts like products. 

Every prompt that we generate receives a mission statement, a persona-aware user description, and a technical gap analysis (similar to a PRD). Each prompt then receives a performance review (aka an eval) and is optimized to round out weaknesses.

Read on for details on how we structured evals and how we perform surgical optimizations.

The Evaluation Framework: Teaching AI to Grade Its Own Work

Here’s where most prompt development processes fall apart.

They build a prompt and run it a few times. The outputs look… fine? Maybe good? So they ship it and move on, hoping it holds up in production.

This approach is how you end up with AI tools that work in demos but fail in the field. Without systematic evaluation, you can’t tell if you’re improving employees’ work lives, or just changing them.

We needed a way to evaluate any tool’s instructions objectively: whether it was a sales agent, research assistant, or market strategist. That meant building a detailed rubric that could travel across all use cases without requiring reinvention each time.

The result is our Universal Success Criteria: ten lenses that every prompt should be judged against, organized around three core dimensions.

Quality: How Good Are the Instructions?

Quality measures how well the prompt serves the end goal – examining whether the instructions are clear enough to follow and complete enough to succeed.

Sufficient Context – Does the prompt provide necessary background, constraints, goals, and persona details? Or are there gaping context blind spots that leave models guessing?

Instruction Logic – Is the sequence coherent? And do the instructions build on each other or contradict themselves?

Specific and Relevant Diction – Is the language precise and unambiguous? Or does it rely on the LLM’s intuition to fill gaps?

Word Economy – Does the prompt communicate maximum information using minimal characters? Or is it bloated with redundancy that drives up costly token use?

Consistency: How Reliable Is It?

Consistency measures whether the prompt produces dependable results across multiple runs. One strong output doesn’t cut it for us; we need to produce strong outputs reliably. To assess that, we use the following criteria:

Factuality & Grounding – Do outputs avoid fabrication? And do they distinguish sourced information from inference?

Adherence to Prompt – Does the output follow explicit instructions, or does the model drift or improvise?

Goal Accomplishment and Reliability – Does the prompt consistently achieve the intended outcome, or only sometimes?

Usability: How User-Friendly Is the Output?

Usability measures whether users can actually use what the tool produces. Tools often produce impressive results that require significant editing, reformatting, and reworking before they can be put to use. We wanted outputs that are as close to usable in one shot as possible. So we measured:

Zero-Shot Utility – Can users extract value from the first interaction? Or are heavy revisions or time-consuming refinements required?

Output Usability and Formatting – Are outputs structured and predictable? Or does each run produce a variable format?

Output Quality and Helpfulness – Is the content clear, accurate, and tonally appropriate for the intended audience?

Why These Ten Criteria?

We didn’t arrive at these criteria arbitrarily. We combined external research on other eval processes with reviews of our own prompt output failures, asking: what went wrong?

Almost every failure we ran into traced back to one of these ten dimensions. Either the instructions were unclear (Quality), the outputs were unreliable (Consistency), or the results required too much rework (Usability).

By evaluating against all ten, we catch problems that narrower assessments miss. A prompt can be clear but unreliable. It can be reliable but produce unusable outputs. The ten criteria force a complete picture. Armed with this evaluation model, we built a system where AI judges evaluate every prompt produced against these ten criteria. But that process wasn’t as simple as it sounds.
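For reference, the full rubric reduces to a simple grouped structure that an evaluation harness can iterate over. The criterion names mirror the sections above; the dict layout itself is an illustrative choice, not a prescribed format:

```python
# The ten Universal Success Criteria, grouped by dimension.
UNIVERSAL_CRITERIA = {
    "Quality": [
        "Sufficient Context",
        "Instruction Logic",
        "Specific and Relevant Diction",
        "Word Economy",
    ],
    "Consistency": [
        "Factuality & Grounding",
        "Adherence to Prompt",
        "Goal Accomplishment and Reliability",
    ],
    "Usability": [
        "Zero-Shot Utility",
        "Output Usability and Formatting",
        "Output Quality and Helpfulness",
    ],
}

# Flat list used when scoring a single run across all ten lenses.
ALL_CRITERIA = [c for group in UNIVERSAL_CRITERIA.values() for c in group]
```

Because the criteria are fixed rather than regenerated per tool, every prompt we build gets scored on the same ten dimensions and scores stay comparable across use cases.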

Teaching AI to Be a Ruthless Critic

Here’s a problem we didn’t anticipate: our earliest AI judges were sycophants. Initial grading runs were generous. Scores clustered around 80/100. Everything looked “pretty good.” That made optimization harder. If everything’s a B, we can’t choose which areas to optimize or detect whether results are improving.

Our response was to train the politeness out of the evaluation system, and unleash a Simon Cowell level critique that would give us the tougher feedback our prompts needed to improve.

Our latest instructions now create a “very tough critic” with “Olympian levels of attention to detail.” We explicitly requested: “avoid sycophantic behavior; don’t sugarcoat it.”

Our scoring philosophy shifted too. Conventional scales put “average” at 50 and “failure” at 0. We inverted that. In our system, 20 is where scoring starts – the prompt works and produces something useful. Everything above 20 must be earned by demonstrating explicit control over nuance and quality.

In our design, most scores land in a distribution between 20 and 60. A compressed scale can’t distinguish “slightly better” from “significantly better” – a spread-out one can.

Each of the ten criteria above is measured on a six-band scale: Unusable (0-19), Adequate (20-39), Good (40-59), Exceptional (60-79), Masterful (80-94), and Perfect (95-100).

To ensure we’re getting meaningful data, we run each prompt five times and use AI as a judge to score each output across all ten criteria. Gumloop is the AI orchestration tool that helped us automate this complex task (screenshot below). Executing the prompt five times per optimization run instead of once gives us more accurate evaluations and more precisely identifies the areas that need the most revision.
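The aggregation step is simple once each run yields a dict of per-criterion scores. A sketch, with band cutoffs matching the scale above (function names and dict shapes are our illustrative assumptions):

```python
from statistics import mean

# Floors for each band, highest first.
BANDS = [(95, "Perfect"), (80, "Masterful"), (60, "Exceptional"),
         (40, "Good"), (20, "Adequate"), (0, "Unusable")]

def band(score: float) -> str:
    """Map a 0-100 score to its band label."""
    return next(label for floor, label in BANDS if score >= floor)

def aggregate(runs: list) -> dict:
    """Average per-criterion scores across repeated runs (five in our pipeline)."""
    return {c: mean(r[c] for r in runs) for c in runs[0]}
```

Averaging across five runs is what turns a lucky single output into a reliability signal: a criterion that scores 70 once but 30 the other four times lands squarely in the weak pile.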

Targeted Optimization: Surgical Improvement Without Breaking What Works

Evaluations can tell you where a prompt is weak. But knowing the problem and understanding how to fix it are different challenges.

One core difficulty we needed to manage was delivering surgical improvements without breaking what’s actually working well in the prompt. So we adopted two core features: a change budget and freeze zones.

The Change Budget: Delivering Surgical Changes

Each optimization cycle is constrained to a 700-word change budget. That’s enough to make meaningful improvements but small enough to preserve intent and to surgically repair components that need improvement.

If something improves, we know roughly what caused it. If something breaks, we can revert without losing other progress. The discipline forces focus: you can’t change everything, so you have to prioritize what matters most.

Every optimization cycle for us starts with the same question: “Which five things hurt the most?”

We then target the five lowest-scoring criteria for improvement; aggregated evaluation scores give us a systematic ranking from weak to strong.

Freeze Zones: Protecting What Works

Here’s a risk that’s easy to overlook: the biggest danger in iteration isn’t failing to improve. It’s accidentally destroying what already works.

So we formalized “freeze zones” to protect high-performing criteria during revision. After aggregating scores, we identify the strongest five criteria and mark them as Protected Guardrails. Each edit must avoid weakening these protected areas. Protecting strengths was as important as fixing weaknesses.
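Both mechanisms boil down to a small planning step over the aggregated scores, plus a rough word-level diff to enforce the change budget. A sketch – the function names and the diff approximation are ours for illustration:

```python
import difflib

def plan_cycle(scores: dict, n: int = 5):
    """Return (targets, freeze_zones): the n weakest criteria to fix
    and the n strongest to protect as guardrails."""
    ranked = sorted(scores, key=scores.get)  # weakest first
    return ranked[:n], ranked[-n:]

def words_changed(old: str, new: str) -> int:
    """Approximate the size of a revision in words, for the 700-word budget."""
    sm = difflib.SequenceMatcher(a=old.split(), b=new.split())
    return sum(max(i2 - i1, j2 - j1)
               for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != "equal")
```

A revision where `words_changed` exceeds the budget gets rejected and regenerated, which is what keeps each cycle surgical instead of a wholesale rewrite.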

The Optimization Loop in Practice

Running a few cycles of this optimization process can dramatically improve prompt quality with minimal human time. I simply input the prompt into the optimizer and hit “Go” – the work happens in the background while I do other things. Admittedly, the system took some time to build and costs about $3 to run end-to-end. But executing this process manually would take days of work.

Each cycle, the system dynamically selects new criteria to optimize based on which scores are currently the lowest. The impact of running multiple optimization loops resembles an unevenly inflating balloon: with each run, a few areas improve while most stay stable. Over multiple cycles, the balloon fills out evenly and rounds out the weakest areas.

After three or four cycles, the lowest scores aren’t the lowest scores anymore. The prompt has been systematically strengthened across all dimensions, with each improvement tracked and each strength protected.

Case Study: The Impact from Optimizations

Theory is useful, but great results are much more persuasive. So we tested this system on a sales transcript analysis tool for a client that’s designed to ingest transcripts and output CRM-ready call notes and a follow-up email to the client. 

What started as a serviceable prompt became something genuinely production-ready over three iterations, and the journey reveals a lot about what actually moves the needle in prompt engineering. Below is a screenshot of our scores after a few optimization loops:

The numbers tell a clear story of both broad-based improvements and significantly improved reliability. In our V1 prompt, just two of our criteria scored in the “Exceptional” range. By V3, six of our criteria reached this threshold – including critical outcome measures like Output Quality & Helpfulness, and Output Usability & Formatting. We saw meaningful improvement across 70% of our criteria, with the sharpest improvements taking place across Factuality & Grounding (+20%) and Zero-Shot Utility (+25%). In other words, our outputs became more factually trustworthy and required fewer adjustments by the end user.

Our optimizer’s biggest unlock was replacing fuzzy guidance with hard constraints. “Make it skimmable” sounds reasonable until you realize every LLM interprets that differently. Swap that for “12 words max per bullet” and suddenly you’ve eliminated an entire category of inconsistent outputs. That single pattern (quantifying the qualitative) drove significant gains across the board. 
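Quantified constraints like that can also be checked mechanically before shipping. A sketch of a validator for the “12 words max per bullet” rule (the helper is illustrative, not part of our pipeline):

```python
def overlong_bullets(text: str, max_words: int = 12) -> list:
    """Return the bullets in a draft output that break the hard word limit."""
    bullets = [line.lstrip("-* ").strip()
               for line in text.splitlines()
               if line.strip().startswith(("-", "*"))]
    return [b for b in bullets if len(b.split()) > max_words]
```

This is the payoff of swapping fuzzy guidance for hard numbers: “skimmable” can’t be asserted in code, but a 12-word cap can.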

The optimization system also added new design principles to improve usability, such as “when in doubt, prefer the shortest output that still satisfies these requirements.” And it minimized errors by adding safeguards against hallucinations and instructions for edge cases (such as conflicting information and informal requests that come up during calls).

Implementing all of these optimizations manually would have taken hours of fine-tuning. Instead, it took just a minute to copy/paste the prompt and hit “Go.”

I should also mention that every prompt should still be reviewed by a human before it’s handed back to the end user. This process is designed to fast-cycle the scoping, construction, evaluation, and optimization of new tools. After the process runs, I still add supporting context that AI didn’t prioritize and remove highly detailed instructions that add minimal value. Humans should always be in the loop when working with AI.

The Mistakes We Made (And What We Learned)

The system you just read about isn’t exactly the plan we drew up. It’s what we built after some of our promising ideas failed.

These failures taught us a lot about building with AI. Each one revealed something about the limits of automation and how to design systems that take those limits into account.

The Word Budget Discovery

In our earliest prompts, we constructed sprawling 5,000+ word instruction sets, assuming that more detail meant better results. Top LLMs claim context windows of hundreds of thousands of words… but our experiments revealed a critical threshold: past roughly 3,000 words, prompt performance declined dramatically. The models weren’t just ignoring extra instructions; they were getting worse at following the core ones.

This taught us something important about how LLMs actually process instructions. More detail creates more surface area for the model to misinterpret, contradict itself, or lose track of priorities. Past a certain point, you’re not adding clarity; you’re just adding noise.

So we designed initial prompt generation to cap at 1,500 words, and created room for subsequent optimizations to creep up toward this 3,000-word ceiling. We stopped thinking in pages and started thinking in budgets: every sentence has to earn its tokens.
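The budgets are easy to enforce as a pre-flight check. The thresholds below come from the experiments above; the helper itself is an illustrative sketch:

```python
INITIAL_CAP = 1500    # ceiling for first-draft prompt generation
HARD_CEILING = 3000   # performance degraded sharply past this in our tests

def word_budget_status(prompt: str) -> str:
    """Classify a prompt draft against the word budgets."""
    n = len(prompt.split())
    if n > HARD_CEILING:
        return "over ceiling"           # trim before optimizing further
    if n > INITIAL_CAP:
        return "optimization headroom"  # space reserved for later optimization passes
    return "within initial cap"
```

The gap between the two thresholds is deliberate: initial generation stops at 1,500 words so that optimization cycles have room to add constraints without crossing the 3,000-word cliff.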

The Evaluation Criteria Builder We Had to Kill

Our most ambitious failure was a dynamic evaluation criteria builder.

The vision was elegant: instead of fixed criteria, we’d auto-generate custom evaluation criteria for each tool we build. A sales coaching prompt would be evaluated on coaching-specific dimensions. A research assistant would be evaluated on research-specific dimensions. Perfect customization for every use case.

We built it. It worked (kind of). And then we tore it out.

What we learned is that AI made poor judgment calls about what matters most. It would emphasize governance for scrappy internal tools. Or it would adopt criteria that were binary, which rendered our full optimization system much less useful. AI was good at generating plausible criteria, and excelled at pattern-matching, synthesis, and generation. But when deciding what matters most, human judgment about context and priorities surpassed AI judgment. We learned the hard way which tasks not to delegate to AI: deciding what’s most important in a given context requires a human touch.

So we replaced dynamic eval criteria with fixed Universal Success Criteria. This system doesn’t try to decide what’s important. It takes human-defined criteria and evaluates against them with superhuman consistency and speed.  

It’s less elegant. But it’s also dramatically more effective.

Sometimes the best optimization is the delete key…

Conclusion: From Idea to Impact in Under Half an Hour

From my conversations with dozens of companies, the gap between AI ideas and AI impact is where most organizations get stuck. Not because of technology limitations – the models are often plenty capable. The unspoken barrier is process: building, testing, and refining tools is too slow and too uncertain.

We built a system that closes that gap. Scoping conversations become structured requirements. Requirements become context-enriched through research and LLM teamwork. Prompts get evaluated against rigorously defined criteria. Evaluations drive surgical optimizations. Each step is systematic, measurable, and happens in minutes.

The process of moving from AI idea to production-ready tool now takes under half an hour, from scoping call to prompt deployment.

That speed changes what’s possible. Teams can experiment more freely because failed experiments cost less. They can build tools tailored to their actual workflows rather than waiting months for internal AI resources.

If you have any questions on what we built or how we built it, don’t hesitate to reach out!
