Automating Code Review and QA With an AI Agent

Every reviewer I know burns their first ten minutes on a PR catching the same dull things: the off-by-one, the swallowed exception, the missing null check. By the time you get to the parts that actually need a human, the architecture and the intent, your attention is already spent. AI code review automation fixes that ordering. On every pull request, CI sends the diff to an AI agent, the agent reviews it, and it posts findings as PR comments before a human reviewer ever opens the page, so the human walks in with the mechanical noise already cleared. The catch, and the whole point of this post: it only works if you scope it tightly and keep a human in the loop. The agent is a reviewer, never an approver. Here is exactly how I wire it in.

What does the pipeline actually look like?

The shape is simple. A PR opens or updates, GitHub Actions grabs the diff for the changed files, and a step pipes that diff plus a review prompt to the Claude API. The model returns structured findings, and a small script posts them back as review comments. No magic, just a CI step that happens to call a model instead of a linter. If you already run a CI/CD pipeline you can bolt this on as one more job, the way I describe in my post on building a CI/CD pipeline with GitHub Actions.

Here is the GitHub Actions step that collects the diff and runs the reviewer. I scope the diff to the PR's changed files so I am not paying to re-review the whole repo on every push.

.github/workflows/ai-review.yml

name: AI Code Review
on:
  pull_request:
    types: [opened, synchronize]

permissions:
  contents: read
  pull-requests: write

jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Collect the PR diff
        run: |
          git diff --unified=3 \
            origin/${{ github.base_ref }}...HEAD \
            -- '*.php' '*.ts' '*.tsx' '*.py' > pr.diff

      - name: Run the reviewer
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          PR_NUMBER: ${{ github.event.pull_request.number }}
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: node scripts/review.mjs

If you prefer not to hand-roll the API calls, you can run Claude Code directly in the CI step instead. Install it with npm install -g @anthropic-ai/claude-code, authenticate with an API key in the environment, and let it read the diff and write comments. Same idea, fewer moving parts. I lean on the raw API when I want full control over the prompt and the output shape, which is most of the time.

How do you prompt the agent so it does not drop real bugs?

This is the part people get wrong. The instinct is to write "only report important bugs" so the PR does not drown in noise. Do not do that. When you tell a capable model to self-filter for importance, it investigates just as thoroughly, finds the bug, and then silently declines to report it because it judged the finding below your bar. Your precision goes up and your recall quietly falls off a cliff, which is the opposite of what you want from a reviewer.

"Tell the agent to report every issue it finds with a confidence and a severity attached, then rank and filter downstream. Coverage is the reviewer's job; importance is the filter's job."
Md Raihan Hasan

Split the two concerns. Ask the model for everything, each finding tagged with a confidence level and an estimated severity, and let a downstream step (or just the human) decide what is worth surfacing. The prompt that has worked best for me is blunt about this:

scripts/review.mjs (review prompt)

import Anthropic from "@anthropic-ai/sdk";
import { readFileSync } from "fs";

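// The SDK reads ANTHROPIC_API_KEY from the environment set in the workflow step.
const client = new Anthropic();
// pr.diff is the file written by the "Collect the PR diff" step.
const diff = readFileSync("pr.diff", "utf8");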

const system = `You are a senior code reviewer. Review the diff below.
Report EVERY issue you find, including ones you are uncertain about
or consider low-severity. Do not filter for importance or confidence
at this stage. A separate step ranks findings. Coverage is the goal:
better to surface a finding that gets filtered out than to drop a bug.`;

const response = await client.messages.create({
  model: "claude-opus-4-8",
  max_tokens: 16000,
  thinking: { type: "adaptive" },
  system,
  output_config: {
    format: {
      type: "json_schema",
      schema: {
        type: "object",
        additionalProperties: false,
        properties: {
          findings: {
            type: "array",
            items: {
              type: "object",
              additionalProperties: false,
              properties: {
                file: { type: "string" },
                line: { type: "integer" },
                severity: { type: "string", enum: ["low", "medium", "high"] },
                confidence: { type: "string", enum: ["low", "medium", "high"] },
                message: { type: "string" },
              },
              required: ["file", "line", "severity", "confidence", "message"],
            },
          },
        },
        required: ["findings"],
      },
    },
  },
  messages: [{ role: "user", content: diff }],
});

Structured outputs guarantee the response parses cleanly into the shape your posting script expects, so you are not regexing JSON out of prose. From there, a trivial filter (say, drop low-confidence-and-low-severity, post the rest) keeps the comment volume sane while preserving the agent's full coverage. If you want to go deeper on writing prompts that behave, I wrote up the patterns I use in prompt engineering for developers.
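To make the "trivial filter" concrete, here is a minimal sketch of the downstream step. It assumes the response comes from the call above and follows the findings schema; the helper names extractFindings and filterAndRank, and the RANK table, are hypothetical and not part of the original script.

```javascript
// Sketch only: helper names and the ranking order are assumptions.
const RANK = { low: 0, medium: 1, high: 2 };

// Pull the JSON text block out of a Messages API response and parse it.
// With thinking enabled the content array can hold non-text blocks,
// so look for the block whose type is "text".
function extractFindings(response) {
  const textBlock = response.content.find((block) => block.type === "text");
  if (!textBlock) return [];
  return JSON.parse(textBlock.text).findings;
}

// Drop findings that are both low-severity and low-confidence, then sort
// the rest so the highest severity (then highest confidence) comes first.
function filterAndRank(findings) {
  return findings
    .filter((f) => !(f.severity === "low" && f.confidence === "low"))
    .sort(
      (a, b) =>
        RANK[b.severity] - RANK[a.severity] ||
        RANK[b.confidence] - RANK[a.confidence]
    );
}
```

The filter lives in your code rather than in the prompt, so you can loosen or tighten it without changing what the model is asked to report.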

[Image: lines of source code on a dark editor, the kind of diff an AI reviewer reads.] The agent reads the same diff a human would, but it never gets bored on line 400.

What guardrails keep this from going wrong?

An AI reviewer that can merge code is a liability, not a feature. The guardrails are non-negotiable, and most of them are about respecting the boundary between advisory and authoritative.

Never auto-merge on an AI pass. A green check from the agent is a suggestion, not a sign-off. Branch protection still requires a human approval.
Treat every finding as advisory. The agent flags, the human decides. Wrong findings get dismissed; that is expected and fine.
Keep diffs small. A focused 200-line PR gets a focused review; a 2,000-line PR gets a vague one. Small batches are good QA hygiene whether or not a model is involved.
Scope to changed files only. Do not send the whole repo. Less cost, less noise, sharper findings.
Cap tokens. Set a sane max_tokens so a runaway response cannot rack up a surprise bill. It only bounds output; input cost is bounded by how much diff you send.
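A couple of these guardrails can be enforced in code rather than by convention. The sketch below is one way to do it, under stated assumptions: MAX_DIFF_LINES, changedLineCount, isReviewable, and buildReviewPayload are hypothetical names, the 400-line cutoff is arbitrary, and the payload shape follows GitHub's "create a review for a pull request" REST endpoint, with findings shaped like the schema earlier in this post.

```javascript
// Sketch only: names and the cutoff are assumptions; tune per repo.
const MAX_DIFF_LINES = 400;

// Approximate count of added/removed lines in a unified diff,
// skipping the "+++ " / "--- " file header lines.
function changedLineCount(diff) {
  return diff
    .split("\n")
    .filter((l) => /^[+-]/.test(l) && !/^(\+\+\+ |--- )/.test(l)).length;
}

// Skip the review when the diff is empty or too large to review well.
function isReviewable(diff, maxLines = MAX_DIFF_LINES) {
  const n = changedLineCount(diff);
  return n > 0 && n <= maxLines;
}

// Build the request body for a pull request review.
// event is hard-coded to "COMMENT": this code path never sends
// "APPROVE" or "REQUEST_CHANGES", so the agent stays advisory.
function buildReviewPayload(findings) {
  return {
    event: "COMMENT",
    body: `AI first-pass review: ${findings.length} advisory finding(s).`,
    comments: findings.map((f) => ({
      path: f.file,
      line: f.line,
      side: "RIGHT",
      body: `[${f.severity} severity, ${f.confidence} confidence] ${f.message}`,
    })),
  };
}
```

One caveat worth checking against the GitHub docs: as far as I know, inline comments are rejected when the line is not part of the PR diff, so you may want to validate line numbers before posting.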

The small-diff point matters more than it looks. The same discipline that makes human review effective, breaking work into reviewable chunks, is what makes AI review effective too. I cover the batching workflow in detail in my post on automated website QA with small PR batches, and it pairs directly with this setup.

What does it cost, and which model should you use?

Cost is a function of how much you send and which model reviews it. Scoping to the diff is the single biggest lever: you are sending a few hundred lines, not the codebase. For the model itself, I default to claude-opus-4-8 (roughly $5 per million input tokens and $25 per million output tokens) because catching a real bug before it ships is worth far more than the few cents the review costs. If you run reviews on a high-traffic repo and cost becomes the deciding factor, claude-sonnet-4-6 (around $3 / $15 per million) is a strong middle ground, and claude-haiku-4-5 (around $1 / $5) handles lightweight style passes cheaply.
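The arithmetic is simple enough to sanity-check per review. This is a rough sketch using the approximate prices quoted above; the PRICES table and estimateCostUsd are hypothetical names, and the token counts in the example are made up for illustration.

```javascript
// USD per million tokens, copied from the approximate figures in this post.
// Treat these as assumptions and check current pricing before relying on them.
const PRICES = {
  "claude-opus-4-8": { input: 5, output: 25 },
  "claude-sonnet-4-6": { input: 3, output: 15 },
  "claude-haiku-4-5": { input: 1, output: 5 },
};

// Estimate the cost of one review from its input and output token counts.
function estimateCostUsd(model, inputTokens, outputTokens) {
  const price = PRICES[model];
  if (!price) throw new Error(`No price configured for ${model}`);
  return (inputTokens * price.input + outputTokens * price.output) / 1_000_000;
}
```

For a hypothetical review with 4,000 input tokens and 1,500 output tokens, that works out to (4,000 × 5 + 1,500 × 25) / 1,000,000 = $0.0575 on the default model. In practice you would feed it the input_tokens and output_tokens values from the response's usage field instead of guessing.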

One more architectural note: tool use is what lets the agent do more than read a diff. You pass a tools array, the model replies with stop_reason "tool_use" and a tool_use block, and your code executes the tool and sends a tool_result back. That is how you would let the reviewer pull the full file around a flagged line, check a related test, or query CI history, instead of guessing from the diff alone. Start without tools; add them when the agent keeps asking for context it does not have.
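The tool-use round trip described above can be sketched as a small loop. This is a hedged illustration, not a drop-in implementation: runToolLoop, the read_file tool, and the handlers map are hypothetical, and createMessage stands in for a call such as (params) => client.messages.create(params), which keeps the loop testable without a network call.

```javascript
// Hypothetical tool definition: lets the reviewer ask for a full file.
const tools = [
  {
    name: "read_file",
    description: "Return the full contents of a file in the repository.",
    input_schema: {
      type: "object",
      properties: { path: { type: "string" } },
      required: ["path"],
    },
  },
];

// Sketch only: keep calling the model until it stops asking for tools.
// handlers maps tool names to local functions that return a string.
async function runToolLoop(createMessage, params, handlers, maxTurns = 5) {
  const messages = [...params.messages];
  for (let turn = 0; turn < maxTurns; turn++) {
    const response = await createMessage({ ...params, messages });
    if (response.stop_reason !== "tool_use") return response;

    // Echo the assistant turn back, then answer every tool_use block in it.
    messages.push({ role: "assistant", content: response.content });
    const results = [];
    for (const block of response.content) {
      if (block.type !== "tool_use") continue;
      const handler = handlers[block.name];
      const output = handler
        ? await handler(block.input)
        : `Unknown tool: ${block.name}`;
      results.push({
        type: "tool_result",
        tool_use_id: block.id,
        content: String(output),
        is_error: !handler,
      });
    }
    messages.push({ role: "user", content: results });
  }
  throw new Error(`Tool loop did not finish within ${maxTurns} turns`);
}
```

The maxTurns cap is the same guardrail idea as max_tokens: a reviewer that keeps requesting context should fail loudly rather than loop on your bill.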

Wire it in this way and the agent earns its keep: it clears the mechanical findings off the human reviewer's plate, it never tires on a long diff, and it never has the authority to merge anything. Keep it scoped, keep it advisory, keep a person in the loop, and treat it as exactly what it is, a tireless first-pass reviewer that makes your real reviewers faster. That is the whole bargain, and it is a good one.

Md Raihan Hasan

Mar 28, 2026

7 min read
