Here's something that took us embarrassingly long to figure out: the quality of an AI feature depends more on how you write the prompt than which model you use. We've seen a well-crafted prompt on Claude Haiku outperform a lazy prompt on GPT-4 — consistently, measurably, in production.
At Pillai Infotech, we've built AI-powered features into over 40 client applications in the past two years. Document classifiers, customer support bots, code review assistants, data extraction pipelines — the works. And in every single project, we spent more time on prompt engineering than on any other part of the AI integration.
This guide is the document we wish we had when we started. No theory-first academic approach — just the techniques that survived contact with real users and production traffic.
Why Prompts Are the Real Product
When a client asks us to "add AI" to their application, they're usually thinking about the model. Should we use Claude? GPT-4? An open-source alternative? And yes, model selection matters. But here's what we've found after dozens of projects:
- Model choice accounts for about 30% of output quality. The remaining 70% is prompt design, context management, and output parsing.
- A prompt rewrite can improve accuracy by 15-40% without changing anything else in the stack.
- Prompt engineering is where domain expertise meets AI capability. Your business knowledge is the competitive advantage, not the API key.
We learned this the hard way. In early 2024, we built a contract analysis tool for a legal tech client. First version: threw the contract at GPT-4 with a generic prompt. Results were... okay. 65% accuracy on clause extraction. After three weeks of prompt refinement — same model, same contracts — we hit 94%. The prompt was the product.
The Prompt Quality Spectrum
50-65% accuracy. "Summarize this document." No structure, no constraints, inconsistent output format.
75-85% accuracy. Role + context + examples + output format. Handles 80% of cases well.
90-97% accuracy. Chain-of-thought + few-shot + guardrails + edge case handling + structured output.
System Prompts That Actually Work in Production
The system prompt is the most underrated part of any AI integration. Most developers treat it as an afterthought — "You are a helpful assistant" — and then wonder why the model hallucinates, breaks character, or gives inconsistent outputs.
A good system prompt is a specification document. It tells the model who it is, what it can do, what it absolutely must not do, and how to format its output. Here's the framework we use at Pillai Infotech for every production system prompt:
The RCEG Framework
Role → Context → Examples → Guardrails. Every system prompt we write follows this structure:
You are a senior contract analyst at a legal technology firm.
You have 15 years of experience reviewing commercial contracts.
// Context: WHAT is the task?
You will receive commercial contracts in plain text format.
Extract all clauses related to: termination, liability, IP ownership, confidentiality.
The contracts are between US-based technology companies.
// Examples: HOW should output look?
Example clause extraction:
Input: "Either party may terminate this agreement with 30 days written notice..."
Output: {"type": "termination", "notice_period": "30 days", "conditions": "written notice", ...}
// Guardrails: What MUST NOT happen?
- Never infer terms that aren't explicitly stated in the contract
- If a clause is ambiguous, flag it as "NEEDS_REVIEW" rather than guessing
- Do not provide legal advice or interpretation beyond extraction
That contract analysis tool I mentioned? The system prompt alone was 847 words. It specified 12 clause types, gave examples for each, defined the exact JSON output format, and listed 8 things the model should never do. That specificity is what took us from 65% to 94% accuracy.
System Prompt Anti-Patterns
- "Be helpful and accurate" — Too vague. The model is already trying to be helpful. Tell it what "helpful" means in your context.
- Contradictory instructions — "Be concise" and "Be thorough" in the same prompt. Pick one and define what it means. We typically say: "Respond in 2-3 sentences for simple questions, up to 2 paragraphs for complex ones."
- No output format specification — If you need JSON, say so. If you need markdown, say so. If you need a specific structure, show an example. Models will match the format you demonstrate.
- Missing edge case handling — What should the model do when it doesn't know the answer? When input is malformed? When the user asks something outside scope? Define these explicitly.
Chain-of-Thought: Making Models Show Their Work
Chain-of-thought (CoT) prompting is the single most impactful technique we use. The idea is simple: instead of asking the model for a direct answer, you ask it to reason through the problem step by step before answering.
Why does this work? Because LLMs generate tokens sequentially. When you force intermediate reasoning steps, each subsequent token has more context to work with. It's like the difference between asking a developer to "fix the bug" versus "read the error log, trace the call stack, identify the root cause, then fix it."
When to Use CoT
| Task Type | CoT Needed? | Why |
|---|---|---|
| Simple classification | Usually no | Sentiment, spam detection — model can handle in one step |
| Multi-step reasoning | Yes | Math, logic, code review — needs intermediate steps |
| Data extraction | Sometimes | Simple fields no, complex relationships yes |
| Content generation | Yes for planning | Outline first, then write — dramatically better structure |
| Decision making | Always | Any decision should show reasoning for auditability |
Our CoT Template
In production, we wrap CoT in XML tags so we can parse the reasoning separately from the answer:
<thinking>
1. What is the customer's primary issue?
2. What product/feature does it relate to?
3. What is the urgency level based on the language used?
4. Has the customer mentioned any prior interactions?
</thinking>
<classification>
{"category": "...", "urgency": "...", "product": "...", "requires_escalation": true/false}
</classification>
The <thinking> block gives us auditability — we can review why the model made a decision. The <classification> block gives us structured output we can parse programmatically. In production, we typically only show the classification to the end user, but we log the thinking for debugging and quality reviews.
Few-Shot Learning: Teaching by Example
Few-shot prompting means including examples of input-output pairs in your prompt. It's the fastest way to align model behavior with your expectations, and we use it in virtually every production prompt.
The key insight that took us a while to internalize: the examples ARE the specification. You can write paragraphs explaining what you want, or you can show three examples and the model will get it. We've found that 3-5 well-chosen examples outperform pages of instructions almost every time.
Choosing the Right Examples
Not all examples are created equal. Here's what we've learned about example selection:
- Cover the edges, not just the middle. If you're building a sentiment classifier, don't show 5 examples of clearly positive and clearly negative text. Show the ambiguous cases — sarcasm, mixed sentiment, neutral-with-context.
- Show the failures you care about. Include an example where the correct answer is "I don't know" or "this doesn't apply." Otherwise the model will force a classification even when none fits.
- Match production data distribution. If 60% of your real inputs are a certain type, your examples should reflect that roughly.
- Include the reasoning. Don't just show input→output. Show input→reasoning→output. The model learns the thinking pattern, not just the mapping.
Dynamic Few-Shot Selection
For applications with diverse input types, static examples aren't enough. We built a pattern we call dynamic few-shot selection — at runtime, we embed the user's input, find the 3-5 most similar examples from a vector database, and inject those into the prompt. This gives us the specificity of few-shot with the coverage of a fine-tuned model.
We used this approach on a customer support classification system for a SaaS client. Static few-shot gave us 82% accuracy. Dynamic few-shot (pulling examples from a database of 2,000 labeled tickets) pushed us to 93%. Same model, same system prompt — just smarter example selection.
Getting Reliable Structured Outputs
If you're building an application (not a chatbot), you almost certainly need the model to return structured data — JSON, XML, a specific format your code can parse. This is where most developers hit their first production wall.
The Reliability Problem
LLMs are text generators. They don't "understand" JSON — they generate characters that usually look like valid JSON. The key word is "usually." In our experience:
- Simple JSON objects: ~98% valid on first try
- Nested objects with arrays: ~92% valid
- Complex schemas with 20+ fields: ~85% valid
85% sounds good until you realize that means 1 in 7 API calls returns garbage that crashes your parser. In production, that's not acceptable.
Techniques That Work
- Use the model's native structured output mode. Claude's tool use, OpenAI's function calling, and similar features constrain the output at the generation level. This is the most reliable approach — near 100% valid JSON.
- JSON Schema in the system prompt. When native modes aren't available, include the exact JSON schema you expect. Models respect schemas remarkably well when you show them the TypeScript interface or JSON Schema definition.
- Validation + retry. Always validate the output against your schema. If it fails, retry with the error message appended: "Your previous output had this error: [error]. Please fix and output valid JSON." One retry catches 95% of failures.
- Constrained decoding. Libraries like Outlines or Instructor force the model to generate tokens that match a grammar. Guaranteed valid output, but adds latency and limits model flexibility.
At Pillai Infotech, we default to option 1 (native tool use) whenever possible, with option 3 as a fallback. For our agentic AI systems, reliable structured output is non-negotiable — agents need to parse each other's outputs to chain actions.
Testing and Iterating Prompts Like Code
This is the part most teams skip, and it's where most AI features fail in production. Prompts need the same rigor as code — version control, testing, CI/CD.
Our Prompt Testing Framework
Every production prompt at Pillai Infotech goes through this process:
- Golden dataset: We create a set of 50-100 input-output pairs that represent the full range of expected inputs. This is our test suite.
- Accuracy baseline: Run the golden dataset through the current prompt. Record accuracy, latency, and cost per call.
- Iterate and measure: Every prompt change is tested against the golden dataset. If accuracy drops on any category, the change doesn't ship.
- A/B testing in production: For high-traffic applications, we run two prompt versions simultaneously and compare real-world performance.
- Regression monitoring: Models update. Prompts that worked last month might not work after a model version change. We run golden dataset checks weekly.
This process adds 2-3 days to development but has saved us from shipping broken AI features more times than I can count.
Prompt Version Control
We store prompts in version-controlled YAML files alongside the application code. Each prompt has a semantic version, a changelog, and a link to the golden dataset results. When debugging a production issue, we can trace exactly which prompt version was running and what changed.
The 7 Prompt Engineering Mistakes We See Most Often
After reviewing hundreds of AI implementations across client projects, these are the patterns that cause the most production pain:
- Prompt length anxiety. Developers worry about token costs and write terse prompts. A 500-token system prompt that works is infinitely cheaper than a 50-token prompt that requires 3 retries and manual cleanup. Spend the tokens.
- No output format specification. "Extract the data" is not a specification. "Return a JSON object with fields: name (string), amount (number), date (ISO 8601)" is. Be explicit.
- Ignoring temperature. Temperature 0 for structured extraction and classification. Temperature 0.7-1.0 for creative content. We've seen teams leave temperature at the default for tasks that need deterministic outputs.
- Testing only the happy path. Your prompt works great with well-formatted input? What about typos, missing fields, extra whitespace, wrong language, empty input, adversarial input? Test the edges.
- Prompt injection vulnerabilities. If your prompt includes user input, that input can contain instructions that override your system prompt. Always sanitize, and use model features designed to separate instructions from user content.
- Not using the latest model features. Models are getting better at following instructions. Features like tool use, structured outputs, and cached prompts exist specifically for production use cases. Don't write workarounds for problems that already have solutions.
- Copy-pasting prompts from blog posts. A prompt is optimized for a specific model, task, and data distribution. Generic prompts from tutorials are starting points, not production solutions. Always iterate on real data.
How Pillai Infotech Approaches Prompt Engineering
When a client comes to us with an AI feature request, our AI development team follows a systematic process:
- Data audit: Before writing a single prompt, we analyze the actual data the model will process. What are the formats, edge cases, volumes, and error rates?
- Model selection: Based on the task complexity, latency requirements, and budget, we pick the right model. We often start with a more capable model, get the prompt right, then downgrade to a cheaper model to see if quality holds.
- Prompt development: Using the RCEG framework (Role, Context, Examples, Guardrails), we write the initial prompt and test against the golden dataset.
- Production hardening: Error handling, retry logic, output validation, fallback models, monitoring, and alerting. The prompt is the brain; this is the body that keeps it alive.
- Ongoing optimization: We monitor prompt performance, collect failure cases, and refine continuously. A prompt is never "done" — it evolves with the model, data, and business requirements.
If you're building AI features and struggling with output quality, consistency, or reliability, reach out for a consultation. We can review your prompts and architecture in a single session and give you a concrete improvement plan.