How to Become a Claude Architect
Chapter 5 / 76 min read

Part 5: Prompt Engineering & Structured Output

Two words will save you across this entire domain: be explicit. "Be conservative" does not improve precision. "Only report high-confidence findings" does not reduce false positives. What works is defining exactly which issues to report versus skip, with concrete code examples for each severity level.

01 Explicit Criteria

The Core Principle

Specific categorical criteria consistently outperform vague confidence-based instructions.

Approach → Result
"Be conservative" → Inconsistent filtering
"Only report high-confidence findings" → Arbitrary threshold, applied differently each time
"Flag comments only when claimed behaviour contradicts actual code behaviour. Report bugs and security vulnerabilities. Skip minor style preferences." → Consistent, precise output

The False Positive Trust Problem

High false positive rates in one category destroy trust in all categories. Developers stop reading your findings entirely.

Fix: temporarily disable high false-positive categories while improving prompts for those categories. This restores trust while you iterate on the problematic areas.
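A minimal sketch of this pattern, assuming each finding is a dict with a category key (the category names here are illustrative):

```python
# Categories currently producing too many false positives. Findings in
# these categories are suppressed while their prompts are improved;
# everything else passes through untouched.
DISABLED_CATEGORIES = {"comment_accuracy"}  # illustrative name

def filter_findings(findings):
    """Drop findings from temporarily disabled categories."""
    return [f for f in findings if f["category"] not in DISABLED_CATEGORIES]

findings = [
    {"category": "security", "message": "SQL injection risk"},
    {"category": "comment_accuracy", "message": "Comment may be stale"},
]
kept = filter_findings(findings)
```

The disabled set lives in code, not in the prompt, so re-enabling a category after its prompt improves is a one-line change.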

Severity Calibration

Define explicit severity criteria with concrete code examples for each level. Not prose descriptions of severity. Actual code showing what "critical" vs "minor" looks like.

Critical: SQL injection via unsanitised user input
  connection.execute(f"SELECT * FROM users WHERE id = {user_input}")

Minor: Unused import statement
  import os  # never referenced
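One way to encode this is to keep the criteria as data and render them into the prompt, so every level always ships with its code example. A sketch, with illustrative severity names and definitions:

```python
# Each severity level pairs a definition with a concrete code example,
# so the model sees what "critical" vs "minor" actually looks like.
SEVERITY_CRITERIA = {
    "critical": {
        "definition": "Exploitable security flaw or data-corrupting bug",
        "example": 'connection.execute(f"SELECT * FROM users WHERE id = {user_input}")',
    },
    "minor": {
        "definition": "Dead code or unused symbol with no runtime effect",
        "example": "import os  # never referenced",
    },
}

def severity_prompt_section():
    """Render the criteria as a prompt section, one example per level."""
    lines = ["Severity criteria:"]
    for level, spec in SEVERITY_CRITERIA.items():
        lines.append(f"- {level}: {spec['definition']}")
        lines.append(f"  Example: {spec['example']}")
    return "\n".join(lines)

section = severity_prompt_section()
```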

02 Few-Shot Prompting

Few-shot examples are the most effective technique for consistency. Not more instructions. Not confidence thresholds. Examples.

When to Deploy

  • Detailed instructions alone produce inconsistent formatting
  • Model makes inconsistent judgment calls on ambiguous cases
  • Extraction tasks produce empty/null fields for information that exists in the document

How to Construct

Use 2-4 targeted examples for ambiguous scenarios. Each example must show reasoning for why one action was chosen over plausible alternatives. This teaches generalisation to novel patterns, not just pattern-matching pre-specified cases.

Example 1:
Input: "The function returns null on error"
Classification: Bug (not style)
Reasoning: Null returns without error context violate the project's
error-handling contract. This is a correctness issue, not a style preference.

Example 2:
Input: "Variable named 'x' in a 3-line lambda"
Classification: Skip (style preference)
Reasoning: Short variable names in short lambdas are idiomatic.
Flagging this would be noise.
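The examples above can be kept as structured data and assembled into a prompt block, which keeps the input/classification/reasoning shape uniform. A sketch:

```python
# Few-shot examples with explicit reasoning. Showing *why* each call
# was made teaches generalisation, not just pattern-matching.
FEW_SHOT_EXAMPLES = [
    {
        "input": "The function returns null on error",
        "classification": "Bug (not style)",
        "reasoning": "Null returns without error context violate the "
                     "project's error-handling contract.",
    },
    {
        "input": "Variable named 'x' in a 3-line lambda",
        "classification": "Skip (style preference)",
        "reasoning": "Short variable names in short lambdas are idiomatic; "
                     "flagging this would be noise.",
    },
]

def build_few_shot_block(examples):
    """Render examples in a uniform input/classification/reasoning shape."""
    parts = []
    for i, ex in enumerate(examples, 1):
        parts.append(
            f"Example {i}:\n"
            f"Input: {ex['input']}\n"
            f"Classification: {ex['classification']}\n"
            f"Reasoning: {ex['reasoning']}"
        )
    return "\n\n".join(parts)

block = build_few_shot_block(FEW_SHOT_EXAMPLES)
```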

Hallucination Reduction

Few-shot examples showing correct handling of varied document structures (inline citations vs bibliographies, narrative vs structured tables) dramatically improve extraction quality.

03 Structured Output with tool_use

The Reliability Hierarchy

Method → Syntax errors → Semantic errors
tool_use with JSON schemas → Eliminated → Still possible
Prompt-based JSON → Possible → Still possible

tool_use eliminates syntax errors entirely. But it does NOT prevent:

  • Semantic errors: line items that do not sum to stated total
  • Field placement errors: values in wrong fields
  • Fabrication: model invents values for required fields when source lacks information

tool_choice Configuration

Setting → Behaviour → Use case
"auto" → Model may return text instead of a tool call → Default operation
"any" → Must call a tool; model chooses which → Guaranteed structured output with unknown document types
{"type": "tool", "name": "..."} → Must call the named tool → Forcing a mandatory first step
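As a sketch, here are the three settings as they would appear in a Messages API request body. The tool name record_extraction and the model id are illustrative, and the request is only assembled here, not sent:

```python
# The three tool_choice settings as request-body fragments.
choice_auto = {"type": "auto"}    # model may answer in plain text
choice_any = {"type": "any"}      # must call some tool, model picks which
choice_forced = {"type": "tool", "name": "record_extraction"}  # must call this one

def make_request(tool_choice):
    """Assemble a request body sketch around a tool_choice setting."""
    return {
        "model": "claude-sonnet-4-20250514",  # illustrative model id
        "max_tokens": 1024,
        "tool_choice": tool_choice,
        "tools": [{
            "name": "record_extraction",
            "description": "Record fields extracted from a document",
            "input_schema": {"type": "object", "properties": {}},
        }],
        "messages": [{"role": "user", "content": "Extract the fields."}],
    }

request = make_request(choice_forced)
```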

Schema Design to Prevent Fabrication

Technique → Purpose
Optional/nullable fields → Use when the source may not contain the information; prevents fabrication
"unclear" enum value → For ambiguous cases
"other" + freeform detail string → For extensible categorisation
Format normalisation rules → Stated in prompts alongside strict schemas
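These techniques combine into one input_schema. A sketch with illustrative field names, assuming an invoice-extraction tool:

```python
# An input_schema applying the anti-fabrication techniques: nullable
# fields for possibly-absent data, an "unclear" enum value, and
# "other" paired with a freeform detail string.
invoice_schema = {
    "type": "object",
    "properties": {
        "vendor_name": {"type": "string"},
        # Nullable: the source may simply not state a due date.
        "due_date": {
            "type": ["string", "null"],
            "description": "ISO 8601 date, or null if not stated",
        },
        "currency": {
            "type": "string",
            "enum": ["USD", "EUR", "GBP", "other", "unclear"],
        },
        # Freeform detail string backing the "other" enum value.
        "currency_detail": {"type": ["string", "null"]},
    },
    "required": ["vendor_name", "due_date", "currency"],
}
```

Note that due_date stays in required: the model must always address the field, but null is a legal answer, which is what removes the pressure to fabricate.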

04 Validation-Retry Loops

Retry with Error Feedback

Send back: original document + failed extraction + specific validation error. The model uses the error to self-correct.
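A sketch of this loop with the model call stubbed out; the validation rule and function names are illustrative:

```python
def validate(extraction):
    """Return an error string, or None if the extraction is valid."""
    if extraction.get("total") is None:
        return "Field 'total' is missing"
    return None

def extract_with_retry(document, call_model, max_retries=2):
    """On failure, resend document + failed output + specific error."""
    extraction = call_model(document, feedback=None)
    for _ in range(max_retries):
        error = validate(extraction)
        if error is None:
            return extraction
        feedback = (f"Your previous extraction was invalid: {error}.\n"
                    f"Previous output: {extraction}\n"
                    f"Re-read the document and correct it.")
        extraction = call_model(document, feedback=feedback)
    return extraction

# Stub standing in for a real API call: fails once, then
# self-corrects when given the error feedback.
def fake_model(document, feedback):
    return {"total": 42.0} if feedback else {"total": None}

result = extract_with_retry("invoice text...", fake_model)
```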

When Retries Work (and When They Do Not)

Effective for: format mismatches, structural output errors, misplaced values
Ineffective for: information genuinely absent from the source

The exam presents both scenarios. You must identify which is fixable by retry.

Self-Correction Patterns

  • detected_pattern fields: track which code construct triggered the finding. Enables analysis of dismissal patterns. Improves prompts over time.
  • calculated_total alongside stated_total: flag discrepancies automatically.
  • conflict_detected booleans: for inconsistent source data.

Practice Scenario

An extraction tool produces JSON with line items totalling $1,247.83 but the stated_total field says $1,347.83. A validation-retry loop catches this: it sends back the extraction with the specific discrepancy, and the model self-corrects by re-reading the source document.
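The discrepancy check itself is a few lines. A sketch using Decimal to avoid float rounding noise in currency arithmetic (field names are illustrative):

```python
from decimal import Decimal

def check_totals(extraction):
    """Cross-check stated_total against the sum of line items."""
    calculated = sum(Decimal(item["amount"]) for item in extraction["line_items"])
    stated = Decimal(extraction["stated_total"])
    if calculated != stated:
        return (f"Line items sum to {calculated} but stated_total "
                f"is {stated}; re-read the source document.")
    return None  # totals agree

extraction = {
    "line_items": [{"amount": "1000.00"}, {"amount": "247.83"}],
    "stated_total": "1347.83",  # does not match the line-item sum
}
error = check_totals(extraction)
```

The returned error string is exactly what gets sent back in the retry message.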

05 Batch Processing

Message Batches API Constraints

Feature → Detail
Cost savings → 50%
Processing window → Up to 24 hours
Latency SLA → None
Multi-turn tool calling → Not supported
Request correlation → custom_id field
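A sketch of assembling batch requests, one per document, each tagged with a custom_id for later correlation. The model id and prompt are illustrative, and the submission call is shown only as a comment:

```python
documents = {"doc-001": "First report...", "doc-002": "Second report..."}

# One Batches API request per document; custom_id carries the
# document id through to the results.
requests = [
    {
        "custom_id": doc_id,
        "params": {
            "model": "claude-sonnet-4-20250514",  # illustrative
            "max_tokens": 1024,
            "messages": [{"role": "user",
                          "content": f"Summarise this report:\n{text}"}],
        },
    }
    for doc_id, text in documents.items()
]
# With the Python SDK this would be submitted via
# client.messages.batches.create(requests=requests) and polled
# until processing ends, at most 24 hours later.
```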

The Matching Rule

API → Use for
Synchronous → Blocking workflows: pre-merge checks, anything developers wait for
Batch → Latency-tolerant work: overnight reports, weekly audits, nightly test generation

The exam presents a manager proposing batch for everything. The correct answer keeps blocking workflows synchronous.

Batch Failure Handling

  1. Identify failed documents by custom_id
  2. Resubmit only failures with modifications (e.g., chunking oversized documents)
  3. Refine prompts on a sample set BEFORE batch processing to maximise first-pass success
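Steps 1 and 2 can be sketched as follows, with the result records stubbed (a real run would read them from the batch's results stream):

```python
# Stubbed per-request results, keyed by custom_id.
results = [
    {"custom_id": "doc-001", "result": {"type": "succeeded"}},
    {"custom_id": "doc-002", "result": {"type": "errored"}},
    {"custom_id": "doc-003", "result": {"type": "succeeded"}},
]

# Step 1: identify failed documents by custom_id.
failed_ids = [r["custom_id"] for r in results
              if r["result"]["type"] != "succeeded"]

# Step 2: resubmit only the failures, after any fixes such as chunking.
def build_retry_requests(failed_ids, documents):
    return [{"custom_id": doc_id,
             "params": {"messages": [
                 {"role": "user", "content": documents[doc_id]}]}}
            for doc_id in failed_ids]

retry = build_retry_requests(failed_ids, {"doc-002": "oversized doc, now chunked"})
```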

06 Multi-Instance Review

The Self-Review Limitation

A model reviewing its own output in the same session retains reasoning context. It is less likely to question its own decisions. An independent instance without prior context catches more subtle issues.

Multi-Pass Architecture

Pass → Purpose
Per-file local analysis → Consistent depth per file
Cross-file integration pass → Catches data-flow issues across files

This prevents attention dilution and contradictory findings.
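The orchestration can be sketched with the two analysis passes stubbed out; the checks inside each stub are illustrative:

```python
def analyse_file(path, source):
    """Stub per-file pass: local findings for one file at full depth."""
    findings = []
    if "TODO" in source:
        findings.append({"file": path, "issue": "unresolved TODO"})
    return {"path": path, "findings": findings}

def cross_file_pass(summaries):
    """Stub integration pass over the per-file summaries."""
    total = sum(len(s["findings"]) for s in summaries)
    return {"files_reviewed": len(summaries), "local_findings": total}

files = {"a.py": "def f(): pass", "b.py": "# TODO: handle errors"}
# Pass 1: every file gets the same local depth.
summaries = [analyse_file(p, src) for p, src in files.items()]
# Pass 2: one integration pass over the collected summaries.
report = cross_file_pass(summaries)
```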

Confidence-Based Routing

  1. Model self-reports confidence per finding
  2. Route low-confidence findings to human review
  3. Calibrate confidence thresholds using labelled validation sets
  4. Prioritise limited reviewer capacity on highest-uncertainty items
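A sketch of the routing step; the threshold value is illustrative and would in practice come from calibration on a labelled validation set:

```python
THRESHOLD = 0.7  # illustrative; calibrate on labelled data

def route(findings, threshold=THRESHOLD):
    """Split findings by self-reported confidence."""
    auto, human = [], []
    for f in findings:
        (auto if f["confidence"] >= threshold else human).append(f)
    # Highest-uncertainty items first, to match limited reviewer capacity.
    human.sort(key=lambda f: f["confidence"])
    return auto, human

findings = [
    {"id": 1, "confidence": 0.95},
    {"id": 2, "confidence": 0.40},
    {"id": 3, "confidence": 0.65},
]
auto, human = route(findings)
```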

07 What to Build

Create an extraction pipeline:

  • Define a tool with JSON schema (required, optional, nullable fields, enums with "other")
  • Implement a validation-retry loop
  • Process 10 documents with varied formats
  • Add few-shot examples and compare before/after extraction quality
  • Run a batch through the Batches API
  • Handle failures by custom_id