

Prompt engineering for developers is the difference between an LLM that ships and one that wastes your afternoon. You have seen it: you ask for a function, you get back code that is almost right — wrong format, a hallucinated API method that does not exist, a JSON blob wrapped in three paragraphs of "Here is the code you requested." Reliable output is a discipline, not luck. You get it by being specific about the goal and constraints, feeding the model the real context (the actual file, the actual error, the actual versions), showing one example of the shape you want, and — for anything you parse in code — constraining the output with a JSON schema or tool use instead of regexing free text.

Why does the model return almost-right code?

Almost-right is the default failure mode because you under-specified and the model filled the gap with a plausible guess. Three things cause most of it. First, missing context: you described the bug in English instead of pasting the stack trace and the function. Second, no version pinning: the model wrote code against whatever version of the library was most common in its training data, not yours. Third, no output contract: you asked for "the code" and let the model decide whether to add prose, comments, or an explanation you then have to strip.

The fix is not a magic phrase. It is treating the prompt like an interface: inputs, constraints, and an expected output shape. The more of that interface you specify, the less the model improvises.
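One way to make the "prompt as an interface" idea concrete is to give the prompt a typed shape and render it from that. The sketch below is only an illustration: `PromptSpec`, `renderPrompt`, and the `slugify` request are hypothetical names invented for this example, not part of any SDK.

```typescript
// PromptSpec and renderPrompt are hypothetical names for illustration only.
interface PromptSpec {
  goal: string; // what to produce, in one sentence
  constraints: string[]; // hard rules the result must satisfy
  context: string[]; // pasted files, error text, version strings
  example?: string; // one sample of the desired output shape
  outputContract: string; // what the reply may and may not contain
}

// Render the spec in a fixed order, skipping sections that are empty.
function renderPrompt(spec: PromptSpec): string {
  const sections: string[] = [spec.goal];
  if (spec.constraints.length > 0) {
    sections.push("Constraints:\n" + spec.constraints.map((c) => `- ${c}`).join("\n"));
  }
  if (spec.context.length > 0) {
    sections.push("Context:\n" + spec.context.join("\n\n"));
  }
  if (spec.example !== undefined) {
    sections.push("Example of the expected output:\n" + spec.example);
  }
  sections.push(spec.outputContract);
  return sections.join("\n\n");
}

const renderedPrompt = renderPrompt({
  goal: "Write ONE TypeScript function `slugify(title: string): string`.",
  constraints: ["No external libraries.", "Lowercase the result."],
  context: ["TypeScript 5.6, strict mode."],
  example: 'slugify("Hello World") returns "hello-world"',
  outputContract: "Return ONLY the function. No markdown fences, no prose.",
});
console.log(renderedPrompt);
```

A required `outputContract` field means a prompt without an output contract fails to compile, which is the point: the gaps the model would otherwise fill become visible before the request is sent.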

[Image: lines of source code on a dark screen. Caption: Constrain the output shape and the model stops improvising around the edges.]

What actually changes the result?

Four techniques carry most of the weight. They are not clever — they are just specific.

  • Be explicit about the goal AND the constraints. Not "write a retry wrapper" but "write a retry wrapper for a fetch call: max 3 attempts, exponential backoff starting at 200ms, only retry on 429 and 5xx, throw on everything else."
  • Give real context. Paste the actual file, the actual error message, and the exact versions (for example, the Next.js version installed in node_modules, React 19, TypeScript 5.6). A version string prevents a whole class of hallucinated APIs.
  • Show one example of the output you want (few-shot). A single before/after pair pins down formatting far more reliably than describing the format in words.
  • Specify the output format explicitly. Say "return only the TypeScript function, no markdown fences, no explanation" — or better, enforce it with a schema (below).
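The retry-wrapper spec in the first bullet is precise enough that the code nearly follows from it. The sketch below is one possible reading, not a definitive implementation. `withRetry`, `ResponseLike`, and the option names are hypothetical, and it assumes "throw on everything else" means a non-retryable status throws immediately while network errors from the request simply propagate.

```typescript
// Hypothetical helper: withRetry and its option names are illustrative.
// ResponseLike is the small part of a fetch Response this sketch relies on.
interface ResponseLike {
  ok: boolean;
  status: number;
}

interface RetryOptions {
  maxAttempts?: number; // spec: max 3 attempts
  baseDelayMs?: number; // spec: backoff starts at 200ms
  sleep?: (ms: number) => Promise<void>; // injectable so tests do not wait
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Spec: only retry on 429 and 5xx.
function isRetryable(status: number): boolean {
  return status === 429 || (status >= 500 && status <= 599);
}

async function withRetry<R extends ResponseLike>(
  request: () => Promise<R>,
  options: RetryOptions = {},
): Promise<R> {
  const { maxAttempts = 3, baseDelayMs = 200, sleep = defaultSleep } = options;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const response = await request(); // a rejected request propagates as a throw
    if (response.ok) return response;
    if (!isRetryable(response.status) || attempt === maxAttempts) {
      throw new Error(`Request failed with status ${response.status} on attempt ${attempt}`);
    }
    // Exponential backoff: 200ms, 400ms, ...
    await sleep(baseDelayMs * 2 ** (attempt - 1));
  }
  throw new Error("unreachable: the loop always returns or throws");
}

// Usage with the global fetch (not executed here):
//   const res = await withRetry(() => fetch("https://example.com/api"));
```

Every branch in this code maps to a clause in the bullet. With the vague version of the request, each of those branches would have been a guess.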

Here is one request, before and after. The before is what most people send; the after is what gets a clean, paste-ready result.

prompt-before-after.txt
# BEFORE — vague, no context, no output contract
Write a function to validate an email and return whether it's valid.

# AFTER — goal + constraints + context + output contract
Write ONE TypeScript function `isValidEmail(input: string): boolean`.
Constraints:
- No external libraries (this is a Next.js project, TypeScript 5.6, strict mode).
- Trim whitespace before checking.
- Reject inputs longer than 254 chars.
- Use a single pragmatic regex; do not try to be RFC 5322 complete.
Return ONLY the function as TypeScript. No markdown fences, no prose,
no usage example.
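For reference, a response that satisfies the AFTER prompt could look like the sketch below. It is one plausible result rather than the only correct one. It assumes the 254-character limit applies after trimming, which the prompt leaves open, and the regex is a deliberately loose choice.

```typescript
function isValidEmail(input: string): boolean {
  const email = input.trim();
  // Assumption: the length limit is checked after trimming.
  if (email.length === 0 || email.length > 254) return false;
  // Pragmatic, not RFC 5322 complete: one "@", no whitespace, a dotted domain.
  return /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(email);
}
```

Calling `isValidEmail("  user@example.com ")` returns `true`, while `isValidEmail("a@b")` returns `false` because the domain has no dot.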

How do I get output I can parse in code?

If a script consumes the model's output, do not parse prose. Free text drifts — one day the model wraps the JSON in a code fence, the next day it adds a leading sentence, and your `JSON.parse` throws in production. Constrain the response to a schema instead. With the Claude API you do this with structured outputs: you pass `output_config.format` with a JSON schema, and the response is constrained to valid JSON matching that schema. Note the schema needs `additionalProperties: false` and a `required` array on every object.

extract-with-schema.ts
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const schema = {
  type: "object",
  properties: {
    summary: { type: "string" },
    severity: { type: "string", enum: ["low", "medium", "high"] },
    files_touched: { type: "array", items: { type: "string" } },
  },
  required: ["summary", "severity", "files_touched"],
  additionalProperties: false,
} as const;

const response = await client.messages.create({
  model: "claude-opus-4-8",
  max_tokens: 4096,
  output_config: { format: { type: "json_schema", schema } },
  messages: [
    { role: "user", content: "Review this diff and report severity:\n\n<diff here>" },
  ],
});

// response.content[0] is a text block whose text is valid JSON for `schema`.
const block = response.content[0];
if (block.type === "text") {
  const review = JSON.parse(block.text); // no regex, no fence-stripping
  console.log(review.severity);
}
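`JSON.parse` returns `any`, so the compiler cannot tell whether `review.severity` exists. A small type guard gives the consuming code a typed value and a clear failure if the schema and the code ever drift apart. This is a sketch: `Review`, `isReview`, and `parseReview` are hypothetical names, and the fields simply mirror the schema in the example above.

```typescript
// Hypothetical helper names; the fields mirror the JSON schema shown above.
type Severity = "low" | "medium" | "high";

interface Review {
  summary: string;
  severity: Severity;
  files_touched: string[];
}

// Runtime check that an unknown value has the Review shape.
function isReview(value: unknown): value is Review {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.summary === "string" &&
    (v.severity === "low" || v.severity === "medium" || v.severity === "high") &&
    Array.isArray(v.files_touched) &&
    v.files_touched.every((f) => typeof f === "string")
  );
}

// Parse the model's text block and fail loudly on a mismatch.
function parseReview(text: string): Review {
  const parsed: unknown = JSON.parse(text);
  if (!isReview(parsed)) {
    throw new Error("Response did not match the Review shape");
  }
  return parsed;
}
```

In the example above, `JSON.parse(block.text)` could then become `parseReview(block.text)`, and `review.severity` would be typed as `"low" | "medium" | "high"`.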

For actions — not just data extraction — use tool use. You declare a `tools` array where each tool has a `name`, `description`, and an `input_schema` (JSON Schema). The model replies with `stop_reason: "tool_use"` and a `tool_use` block containing validated arguments; you execute the function and send a `tool_result` back. This is how the model calls your code instead of describing what it would do. I lean on this hard when wiring LLM features into a backend — I wrote up the full pattern in my post on adding AI features to a Laravel app, and it is the same mechanism behind automating code review with an AI agent.
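The local half of that loop can be sketched without calling the API at all. The block below declares one tool in the `name` / `description` / `input_schema` shape and turns a `tool_use` block into a `tool_result` block. The tool `get_file_owner`, its ownership rule, and the helper names are hypothetical. The interfaces are local mirrors of the block shapes described above, not the SDK's own types.

```typescript
// Tool declaration in the shape described above. get_file_owner is a made-up tool.
const tools = [
  {
    name: "get_file_owner",
    description: "Look up the team that owns a file path in the repository.",
    input_schema: {
      type: "object",
      properties: { path: { type: "string" } },
      required: ["path"],
    },
  },
];

// Local mirrors of the content blocks exchanged during tool use.
interface ToolUseBlock {
  type: "tool_use";
  id: string;
  name: string;
  input: unknown;
}

interface ToolResultBlock {
  type: "tool_result";
  tool_use_id: string;
  content: string;
  is_error?: boolean;
}

type ToolHandler = (input: unknown) => Promise<string> | string;

// One handler per declared tool name. The ownership rule here is invented.
const handlers: Record<string, ToolHandler> = {
  get_file_owner: (input) => {
    const { path } = input as { path: string };
    return path.startsWith("billing/") ? "payments-team" : "platform-team";
  },
};

// Execute the requested tool and wrap the outcome as a tool_result block.
async function runTool(block: ToolUseBlock): Promise<ToolResultBlock> {
  const handler = handlers[block.name];
  if (!handler) {
    return { type: "tool_result", tool_use_id: block.id, content: `Unknown tool: ${block.name}`, is_error: true };
  }
  try {
    return { type: "tool_result", tool_use_id: block.id, content: await handler(block.input) };
  } catch (error) {
    return { type: "tool_result", tool_use_id: block.id, content: String(error), is_error: true };
  }
}
```

In a real request the `tools` array goes into the `messages.create` call, and the returned `tool_result` block is sent back as the content of the next `user` message, after the assistant turn that contained the `tool_use` block.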

Does over-prescriptive prompting still work on modern models?

No — and this is the gotcha that bites people who learned prompting on older models. The instinct, after months of fighting reluctant models, is to SHOUT: "CRITICAL: YOU MUST use the search tool. If in doubt, ALWAYS call it." On modern Claude models that backfires. They follow the system prompt closely, so aggressive language causes overtriggering — the model calls the tool when it should not, or pads output to satisfy a rule you did not actually need.

State the rule plainly and let the model follow it. "Use the search tool when the answer depends on current data" beats "CRITICAL: YOU MUST ALWAYS search", because the second one makes the model search when it shouldn't.
Md Raihan Hasan

The replacement for forceful prompting is not more prose — it is the effort parameter. For hard problems, raise reasoning effort instead of writing a longer prompt. On Claude Opus 4.8 you set it inside `output_config`, and you combine it with adaptive thinking so the model decides how much to reason per request.

effort-and-thinking.ts
const response = await client.messages.create({
  model: "claude-opus-4-8",
  max_tokens: 16000,
  // Let the model decide how much to think; raise effort for hard problems.
  thinking: { type: "adaptive" },
  output_config: { effort: "high" }, // low | medium | high | xhigh | max
  messages: [
    {
      role: "user",
      content: "Find the race condition in this queue worker and fix it. <code>",
    },
  ],
});

A note on cost so you pick the right tool: Claude Opus 4.8 sits at the top of the Opus tier, at $5 per million input tokens and $25 per million output tokens, with a 1M-token context window. Use it when correctness matters. When you are running a high-volume extraction loop where the schema does the heavy lifting, drop to Claude Sonnet 4.6 ($3 / $15 per MTok) or Claude Haiku 4.5 ($1 / $5 per MTok). The schema constrains the output the same way regardless of model, so cheaper models stay reliable for structured tasks.
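To see what those prices mean for a batch job, the arithmetic can be sketched as below. The prices are the ones quoted above. The workload (10,000 calls at 2,000 input and 300 output tokens each) is a made-up example, and the keys are display labels, not API model IDs.

```typescript
// USD per million tokens, as quoted in the paragraph above.
// Keys are display labels for this sketch, not API model IDs.
const PRICES_PER_MTOK = {
  "Claude Opus 4.8": { input: 5, output: 25 },
  "Claude Sonnet 4.6": { input: 3, output: 15 },
  "Claude Haiku 4.5": { input: 1, output: 5 },
} as const;

type ModelLabel = keyof typeof PRICES_PER_MTOK;

function estimateCostUSD(model: ModelLabel, inputTokens: number, outputTokens: number): number {
  const price = PRICES_PER_MTOK[model];
  return (inputTokens / 1_000_000) * price.input + (outputTokens / 1_000_000) * price.output;
}

// Hypothetical workload: 10,000 calls, each with 2,000 input and 300 output tokens.
const batchCalls = 10_000;
const batchInputTokens = batchCalls * 2_000; // 20,000,000
const batchOutputTokens = batchCalls * 300; // 3,000,000

for (const model of Object.keys(PRICES_PER_MTOK) as ModelLabel[]) {
  console.log(model, estimateCostUSD(model, batchInputTokens, batchOutputTokens));
}
// Claude Opus 4.8 175
// Claude Sonnet 4.6 105
// Claude Haiku 4.5 35
```

For this workload the same extraction loop costs $175, $105, or $35 depending on the model, which is why the schema-constrained path is worth trying on the cheaper tiers first.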

How do I keep prompts reliable over time?

Treat the prompt like code, because it is code. The single biggest improvement to my workflow was putting prompts in version control and iterating against a small set of example inputs. When a prompt regresses — and they do, when you tweak one line or switch models — a diff and a handful of saved examples tell you exactly what broke.

  • Commit prompts as files in the repo, not buried in a string literal you forget exists.
  • Keep 5 to 10 representative inputs with their expected outputs, and re-run them whenever you edit the prompt or change the model.
  • When output drifts, diff the prompt bytes — a stray instruction or a reordered constraint is usually the culprit.
  • Pin the model ID explicitly (`claude-opus-4-8`, never a guessed date suffix) so an upgrade is a deliberate change, not a surprise.
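The second bullet can be sketched as a tiny regression runner. Everything here is hypothetical scaffolding: `PromptCase`, `runRegression`, and the injected `run` function are illustrative names, and exact string comparison is an assumption that suits short outputs. For schema-constrained output, comparing parsed JSON would be the more natural check.

```typescript
// Hypothetical scaffolding for re-running saved examples against a prompt.
interface PromptCase {
  name: string;
  input: string;
  expected: string;
}

interface CaseResult {
  name: string;
  passed: boolean;
  actual: string;
}

// The model call is injected so the runner works with any client, or a stub.
type RunPrompt = (promptTemplate: string, input: string) => Promise<string>;

async function runRegression(
  promptTemplate: string,
  cases: PromptCase[],
  run: RunPrompt,
): Promise<CaseResult[]> {
  const results: CaseResult[] = [];
  for (const testCase of cases) {
    const actual = (await run(promptTemplate, testCase.input)).trim();
    // Assumption: exact match after trimming is a good enough comparison.
    results.push({ name: testCase.name, passed: actual === testCase.expected.trim(), actual });
  }
  return results;
}

// Names of the cases that regressed, for a CI log or a commit check.
function failedCases(results: CaseResult[]): string[] {
  return results.filter((r) => !r.passed).map((r) => r.name);
}
```

With the prompt and the cases both committed to the repo, `failedCases` after a prompt edit or a model switch tells you which saved examples changed, and the prompt diff tells you why.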

None of this is exotic. Specific goals, real context, one example, an enforced output shape, plain instructions instead of shouting, and prompts under version control. Do those six things and the LLM stops returning almost-right code and starts returning code you can ship. The model was never the unreliable part — the prompt was.