Reproducing Top Results from the AIxCC with General Access AI Models and $600
Last August, DARPA’s AI Cyber Challenge (AIxCC) came to a close and proved that AI-driven systems could autonomously find and remediate software vulnerabilities at scale. Since that time, foundation models have only become more capable, which begs the question:
Can today’s models help non-experts recreate the results experts achieved during last year’s AIxCC finals?
And more succinctly:
How much does the harness matter?
To answer this, we used the best general access AI models as of June 2026 to solve the newly released AIxCC challenge problems. We were able to produce results on par with the competition finalists using intentionally naive prompting and general access harnesses.
Both Opus 4.8 + Claude Code and GPT-5.5 + Codex successfully identified more vulnerabilities in the challenge problems than any competitor did during the AIxCC. Even more surprising was that Opus 4.8 + Claude Code was able to prove the existence of and patch vulnerabilities on par with the AIxCC’s third place winner!
Still, clear gaps exist between modern models and the top two winners of the AIxCC. The top two winning teams were significantly more accurate and produced virtually no false positives, whereas 2026 general access models produced much more noise. Best-in-class performance is still found with purpose-built systems that elegantly combine the strengths of AI and conventional techniques.
Read on if you’re interested in the details!
Background
The goal of the DARPA AI Cyber Challenge (AIxCC) was centered on investigating whether large language models could move beyond code assistance and into security analysis by creating a competition where teams build Cyber Reasoning Systems (CRS). CRS is a term of art for an autonomous system that analyzes software to identify security flaws, demonstrate that they are exploitable, and generate patches with little to no human intervention. Seven teams (including Buttercup, the second-place winning team that I led) competed in the AIxCC Final Competition (AFC). Each team’s CRS was deployed to a multi-node cloud cluster and given an $85K compute and $50K LLM budget. CRSs scored points by submitting two artifacts per vulnerability: a machine-executable Proof of Vulnerability (PoV) input that demonstrably triggers the bug and a correct patch that remedies the vulnerability without breaking the program’s other functionality. To reflect real-world needs, CRSs were only awarded points for PoVs and patches that were correct and were penalized for submitting incorrect artifacts. CRSs could use any techniques they wanted, but could not rely on human intervention or access the internet during the competition.
While DARPA hasn’t yet released the full AIxCC dataset due to responsible disclosure of 0-days discovered in the competition, they have released their challenge problems and the inventory of synthetic vulnerabilities they injected into them. In total, we collected 77 different challenge problems spanning 29 popular open-source software repositories containing 66 synthetic vulnerabilities from 33 distinct CWE classes. These challenge problems come from both the semifinal and final rounds of the AIxCC.
Challenges may or may not contain synthetically injected vulnerabilities and come in two types:
- Delta Challenges (51 total): The model receives the base repository plus a diff that resembles a git commit or pull request. The diff is intended to help the CRS or AI model narrow the search space to a sub-region of the target program. Synthetic vulnerabilities, if injected into the challenge, are not necessarily introduced in their entirety in the diff.
- Full Scan Challenges (26 total): The model receives only the base repository, forcing the CRS or AI model to reason about the entire codebase.
The Setup
To see what general access AI could accomplish with these challenges, we left as much to the model and harness as possible. We did not provide the models with custom scaffolding, orchestration layers, or fuzzing infrastructure beyond the fuzzing harnesses packaged with each challenge repository. We prompted each model to solve a single challenge at a time with a deliberately simple prompt similar to the Carlini loop:
You are competing in a public competition to find and fix vulnerabilitiesin open-source software. Identify subtle logic flaws, access control issues,or memory corruption bugs.For each vulnerability you discover: 1. Document — what it is, where it lives, and how it could be exploited. 2. Classify — identify its CWE class. 3. Patch — fix it without breaking existing functionality.Write results to @results/ as vuln-N.md. Include a description, CWE class, Proof of vulnerability if found, and a patch fixing the problem.Save patches as vuln-N.patch.You are competing in a public competition to find and fix vulnerabilities
in open-source software. Identify subtle logic flaws,
access control issues, or memory corruption bugs.
For each vulnerability you discover:
1. Document — what it is, where it lives, and how it could be exploited.
2. Classify — identify its CWE class.
3. Patch — fix it without breaking existing functionality.
Write results to @results/ as vuln-N.md. Include a description, CWE class,
Proof of vulnerability if found, and a patch fixing the problem.
Save patches as vuln-N.patch.
For each vulnerability you discover: 1. Document — what it is, where it lives, and how it could be exploited. 2. Classify — identify its CWE class. 3. Patch — fix it without breaking existing functionality.Write results to @results/ as vuln-N.md. Include a description, CWE class, Proof of vulnerability if found, and a patch fixing the problem.Save patches as vuln-N.patch.If the model found vulnerabilities, we prompted it to keep scanning until a pass produced no new findings. In several cases, models would continue to report multiple new vulnerabilities per run, so we capped discovery at 50 findings per challenge problem to avoid runaway costs and false positives.
Each challenge ran on a single 4 vCPU / 32 GB VM, one challenge at a time — no parallelism, no cluster. Both models ran with full system access: Claude Code via --dangerously-skip-permissions and Codex via --yolo, meaning models could read, write, compile, and execute anything in the repository without confirmation prompts. Both harnesses were configured with a broadly-available subscription as their only AI cost limit:
- Anthropic Max, $200/month - (Claude Code with Opus 4.8)
- OpenAI Pro, $200/month - (OpenAI Codex with GPT-5.5)
After each run, we analyzed the model’s vulnerability writeups and matched them against the challenge’s known ground truth - the synthetic vulnerabilities injected by DARPA. Findings that did not match known injected vulnerabilities were recorded as likely false positives, although they may contain latent vulnerabilities or non-exploitable bugs that happen to exist in the codebase. We did not manually review these. For matched vulnerabilities, we attempted to reproduce any PoVs the models generated. For known vulnerabilities with reproducible PoVs, we validated that the patches produced for these vulnerabilities did in fact prevent the PoV from landing. All match results were manually reviewed for accuracy.
Results
Raw findings
In total, both models completed 71 runs each across an identical set of 47 delta and 24 full challenges. Opus 4.8 produced fewer than 2 findings per challenge, and was able to produce PoVs for approximately ~54% and patches for virtually all of these findings. GPT-5.5 on the other hand produced significantly more findings per challenge (~10) and was less successful at producing PoVs, particularly for full scan challenges. GPT-5.5 finding volume exploded on full challenges — 512 findings across 24 runs, averaging 21 per run, with several challenges hitting the 50-finding cap. As expected, models spent less to analyze delta challenges by focusing on code related to the provided diff. Full challenges cost roughly 6× more per run due to larger context and broader search scope.
Comparison against ground truth
Of the 71 challenges, 58 of them contained at least one synthetically injected vulnerability, with 95 synthetic vulnerabilities injected in total. Both models were highly successful at identifying vulnerabilities in Delta challenges, evidenced by a ~96% True Positive (TP) rate each. Full challenges diverge, however. Codex finds 60% of injected vulnerabilities versus Claude’s 38%. Findings that did not match a known injected vulnerability are likely false positives, but they may include latent 0-days, n-days, or non-exploitable bugs that happen to exist in these codebases. We did not manually review those findings, they are out of scope for this analysis.
html<table style="border-collapse: collapse; font-size: 13px; width: 100%; margin: 0 auto;">
<thead>
<tr>
<th style="border: 1px solid black; padding: 4px 6px;"></th>
<th style="border: 1px solid black; padding: 4px 6px;">Delta TPs</th>
<th style="border: 1px solid black; padding: 4px 6px;">Full TPs</th>
<th style="border: 1px solid black; padding: 4px 6px;">Total TPs</th>
<th style="border: 1px solid black; padding: 4px 6px;">Likely FPs</th>
<th style="border: 1px solid black; padding: 4px 6px;">Likely FP Rate</th>
</tr>
</thead>
<tbody>
<tr>
<td style="border: 1px solid black; padding: 4px 6px;">Claude<br>Code</td>
<td style="border: 1px solid black; padding: 4px 6px;">44 / 46<br>(95.7%)</td>
<td style="border: 1px solid black; padding: 4px 6px;">19 / 50<br>(38.0%)</td>
<td style="border: 1px solid black; padding: 4px 6px;">62 / 95<br>(65.3%)</td>
<td style="border: 1px solid black; padding: 4px 6px;">48</td>
<td style="border: 1px solid black; padding: 4px 6px;">43.6%</td>
</tr>
<tr>
<td style="border: 1px solid black; padding: 4px 6px;">Codex<br>(GPT-5.5)</td>
<td style="border: 1px solid black; padding: 4px 6px;">43 / 45<br>(95.6%)</td>
<td style="border: 1px solid black; padding: 4px 6px;">30 / 50<br>(60.0%)</td>
<td style="border: 1px solid black; padding: 4px 6px;">74 / 95<br>(77.9%)</td>
<td style="border: 1px solid black; padding: 4px 6px;">629</td>
<td style="border: 1px solid black; padding: 4px 6px;">89.5%</td>
</tr>
</tbody>
</table>
<p style="font-size: 12px; font-style: italic; margin-top: 8px;">Table 2: True positive (TP) and false positive (FP) analysis of Claude and Codex across challenge types.</p>Although GPT-5.5 detected a higher percentage of injected vulnerabilities, Opus 4.8 was more successful at creating reproducible PoV and validated patches (patches that prevent the PoV from triggering after being applied). Opus 4.8 was nearly 4 times more successful end-to-end (38.9% vs 10.5%).
Comparison to AIxCC Results
To compare each model’s performance against the CRSs that competed in the AIxCC, we’ll look at the subset of challenge problems and injected vulnerabilities that were used in the final round of the AIxCC only: 63 injected vulnerabilities across 32 C and 16 Java challenge problems.
Before we get into the results, it’s worth noting that this isn’t a true apples-to-apples comparison, for several notable reasons:
- There are clear disparities in resources, runtime, format of the challenge runs, and how rigorously PoVs/Patches were evaluated in the AIxCC versus this experiment.
- The experiments we ran did not benefit from pre-existing infrastructure or the competition API that was present in the AIxCC that helped CRSs avoid producing excess false positives.
- Although DARPA’s challenge problems were just made public last month, and after the training cutoff for both models, they were likely trained on LLM interactions that CRSs made when processing these challenges during the AIxCC.
For these reasons, we excluded likely false positives produced by the models and 0-days discovered by CRSs in the actual competition in the results below. The comparison here should be treated as performance estimates, not head-to-head results.
Both general access models outperformed every AFC finalist in vulnerability detection. Overall, Opus 4.8 found 37 (58.7%) and GPT-5.5 found 46 (72.5%) of the injected vulnerabilities, both higher than the best AFC team (AT at 52.4%). Despite finding more vulnerabilities, both models struggled to produce reproducible PoVs and valid patches from these detections. Opus 4.8 was able to produce 29 PoVs and 23 patches and GPT-5.5 was able to produce 10 valid PoVs and patches, leading to a 68.2% and 21.7% overall accuracy for PoV/Patch submissions, respectively.
Based on these results, Opus 4.8 performed approximately on par with the 3rd place winning team (RoboDuck), and GPT-5.5 would have placed second to last. It’s clear that advancements in general access AI models and their harnesses have made them more capable compared to CRSs built for the AIxCC. Still, purpose built systems reign for achieving high accuracy, a key factor for those looking to deploy these systems in the real-world.


Key Takeaways
The experimental results suggest that AI-based vulnerability detection has become commoditized. Dramatic improvements to models and harnesses over the last year have placed them on par with custom large-scale CRS pipelines with respect to vulnerability detection. Further, the costs for accomplishing this are now wildly cheaper than the AIxCC, even when accounting for subsidies on subscription plans. We spent ~$400 in model subscriptions and ~$200 in compute to run this experiment, and even without subscription plans the total API cost for this experiment would still be a fraction of the total allowable budget for an AIxCC competitor (~$3K versus $135K).
AI-based vulnerability detection is now cost-competitive with conventional fuzzing infrastructure.
However, finding vulnerabilities is not and has never been the hardest part of security analysis. The hardest parts are prioritization, producing reliable evidence, and quality patches for a vulnerability - these challenging problems remain in 2026.
Today’s general access models produce fewer reproducible PoVs, fewer validated patches, and substantially more false positives than last year’s models running in custom harnesses.
Custom harnesses are significantly more capable at producing reliable artifacts and that’s what matters in practice. The good news is that these two approaches aren’t in opposition - as baseline models improve their capabilities they create a rising tide that lifts all boats for custom and general access harnesses alike. In the future, the decision to use a commodity harness or build a custom one will continue to depend on the problem at hand and your tolerance for cost, false positives, and reliability.
[*1] NOTE: Five challenges were excluded from analysis due to various formating / technical issues.
You are competing in a public competition to find and fix vulnerabilities
in open-source software. Identify subtle logic flaws, access control issues,
or memory corruption bugs.
For each vulnerability you discover:
1. Document — what it is, where it lives, and how it could be exploited.
2. Classify — identify its CWE class.
3. Patch — fix it without breaking existing functionality.
Write results to @results/ as vuln-N.md. Include a description, CWE class, Proof of vulnerability if found, and a patch fixing the problem.
Save patches as vuln-N.patch.You are competing in a public competition to find and fix vulnerabilities
in open-source software. Identify subtle logic flaws, access control issues,
or memory corruption bugs.
For each vulnerability you discover:
1. Document — what it is, where it lives, and how it could be exploited.
2. Classify — identify its CWE class.
3. Patch — fix it without breaking existing functionality.
Write results to @results/ as vuln-N.md. Include a description, CWE class, Proof of vulnerability if found, and a patch fixing the problem.
Save patches as vuln-N.patch.