2026 OSS Code Review Benchmark

2026 Open Source Benchmark Report

This 2026 open-source benchmark is grounded in 50 real pull requests from 5 major open-source projects, each evaluated against a set of human-verified golden comments: the actual issues a thorough code reviewer should catch. Each code review tool is then measured on how well it found those issues and how much irrelevant noise it produced alongside them.

CloudAEye ranks #1 among 21 popular code review agents with a 65% overall F1 score.

The Dataset

Five repositories, five languages, fifty real PRs

The benchmark dataset covers open-source repositories that map to real production workflows rather than toy examples.

Evaluation Analysis
High and Critical Golden Comments

Detection by language

High and critical golden comments grouped by benchmark language, with inferred detection results from each tool's review comments.

Key Findings

CloudAEye Strengths

  • Strong Recall.
  • Detected a critical authorization bypass: && vs || in the team admin/owner check allowing unauthorized access (sketched after this list).
  • Caught OAuth token corruption: hardcoded refresh_token string persisted to DB, breaking all future refresh flows.
  • Identified race condition in backup code validation enabling concurrent single-code reuse attacks.
  • Perfect score on scheduling logic: caught a wrong slot end time in the working-hours check and an always-false dayjs === comparison in the same PR.
  • Good at async/await patterns: unawaited forEach async calls causing silent Promise rejection on booking cancellation.
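
The authorization bypass above is a one-operator class of bug. A minimal Python sketch of the pattern, with purely illustrative names (the actual project is TypeScript, and this is not its code):

    def assert_can_manage_team(is_team_member: bool, is_admin_or_owner: bool) -> None:
        # Intended rule: the caller must be a team member AND hold an admin/owner role.
        # Buggy guard: `or` lets any team member through, regardless of role.
        if not (is_team_member or is_admin_or_owner):   # BUG: should be `and`
            raise PermissionError("unauthorized")

        # Correct guard for comparison:
        # if not (is_team_member and is_admin_or_owner):
        #     raise PermissionError("unauthorized")

Flipping a single && / || (and / or) in a guard like this silently widens who is authorized, which is why findings of this kind are rated critical.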

CloudAEye Weaknesses

  • Tends to be noisy on large booking-lifecycle PRs with many async call sites.
  • Misses unreachable code branches and org-level data-scoping edge cases.

Key Findings

CloudAEye Strengths

  • Caught critical SSRF vulnerability: open(url) without URL validation in TopicEmbed.import_remote.
  • Identified an origin-check bypass: indexOf-based validation allows superstring domains to spoof the origin (sketched after this list).
  • Caught method arity override: duplicate downsize definition silently breaks all existing callers.
  • Detected thread-safety bug in lazy locale loading with unsynchronized @loaded_locales access.
  • Good at nil/NoMethodError patterns: before_validation crash when host is nil and nil controller dereferences.
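
The origin-check bypass is the classic substring-matching pitfall. A minimal Python sketch of the pattern (the project itself is Ruby; the hostname is hypothetical, not the actual code):

    from urllib.parse import urlparse

    ALLOWED_HOST = "forum.example.com"  # hypothetical allowed origin

    def origin_ok_buggy(origin: str) -> bool:
        # indexOf-style containment check: any superstring passes, e.g.
        # "https://forum.example.com.attacker.net" spoofs the origin.
        return ALLOWED_HOST in origin

    def origin_ok(origin: str) -> bool:
        # Compare the parsed hostname exactly instead of searching for a substring.
        return urlparse(origin).hostname == ALLOWED_HOST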

CloudAEye Weaknesses

  • Misses CSS/Sass styling regressions; visual and theme-only changes tend to produce no signal.
  • Tends to be noisy on security PRs touching multiple attack surfaces at once.
  • Missed a data-integrity risk in raw SQL migration.

Key Findings

CloudAEye Strengths

  • Best recall in the benchmark (86%).
  • Caught auth bypass: asymmetric cache trust lets stale permission grants survive after revocation.
  • Identified incomplete double-checked locking allowing duplicate expensive cache population (sketched after this list).
  • Found nil cache overwrite: transient load error silently replaces a valid cache entry with nil.
  • Detected critical SQL logic bug: enableSqlExpressions always returns false, unconditionally disabling the feature.
  • Strong on race conditions: concurrent index builds, TotalDocs map race, and device limit race.
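
The incomplete double-checked locking finding follows a well-known shape. A minimal Python sketch of the pattern (the actual code is Go; names are illustrative):

    import threading

    _lock = threading.Lock()
    _cache = {}

    def get_buggy(key, load):
        # The presence check is never repeated after the lock is acquired, so two
        # workers that both miss will each run the expensive load.
        if key not in _cache:
            with _lock:
                _cache[key] = load(key)   # BUG: no re-check under the lock
        return _cache[key]

    def get(key, load):
        if key not in _cache:
            with _lock:
                if key not in _cache:     # re-check once the lock is held
                    _cache[key] = load(key)
        return _cache[key]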

CloudAEye Weaknesses

  • Tends to be noisy on large middleware refactors touching many nil-deref patterns at once.
  • Misses parameter-passing type mismatches in React alerting components.

Key Findings

CloudAEye Strengths

  • Solid recall on complex Java auth and permission code.
  • Caught a critical NPE: the wrong null check in AccessTokenContext validates grantType twice instead of rawTokenId (sketched after this list).
  • Identified Optional.get() without isPresent(), causing login form rendering crash when user has no credential.
  • Found feature flag mismatch: V1 flag guards V2 permission cleanup, causing orphaned permission data.
  • Caught permission lookup always missing: resource lookup uses wrong owner ID, silently falling back to type-level grants.
  • Detected test defects: a race condition where the reader thread is not joined before the assertion.
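
The AccessTokenContext finding is a copy-pasted null check. A minimal Python sketch of the shape of the bug (the actual code is Java; names here are illustrative, not Keycloak's):

    def build_context(grant_type, raw_token_id):
        if grant_type is None:
            raise ValueError("grantType is required")
        if grant_type is None:        # BUG: re-checks grantType; should check raw_token_id
            raise ValueError("rawTokenId is required")
        return raw_token_id.strip()   # crashes when raw_token_id is None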

CloudAEye Weaknesses

  • Misses translation file locale bugs; wrong-language strings in .properties files produce no signal.
  • Tends to be noisy on large permission refactors where the same flag pattern recurs across many files.

Key Findings

CloudAEye Strengths

  • Highest Precision of all five repos, with the best signal-to-noise ratio.
  • Caught Django QuerySet negative slicing crash introduced by new pagination feature branch.
  • Identified member.has_global_access AttributeError for API-token-authenticated requests where member is None.
  • Found math.floor/math.ceil TypeError when paginator key is a datetime field.
  • Caught non-deterministic hash() usage breaking cross-process cache invalidation (sketched below).
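
The hash() finding is Python-specific and easy to reproduce. A minimal sketch with hypothetical key names (not the actual Sentry code): Python salts string hashes per process, so hash()-derived cache keys differ between workers, while a stable digest does not.

    import hashlib

    def cache_key_buggy(org_id: int, query: str) -> str:
        # hash() of a string is salted per process (PYTHONHASHSEED), so two
        # workers compute different keys for the same query.
        return f"results:{org_id}:{hash(query)}"

    def cache_key(org_id: int, query: str) -> str:
        # A stable digest gives every process the same key.
        digest = hashlib.sha256(query.encode("utf-8")).hexdigest()[:16]
        return f"results:{org_id}:{digest}"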

CloudAEye Weaknesses

  • Tends to be noisy on concurrency test code; test-quality PRs generate more false positives.
  • Misses dataset-scoping bugs where the wrong collection is passed through multiple layers.

Evaluation Pipeline

How the scoring works

Scoring a code review tool is harder than it looks. Tools produce freeform comments that may bundle multiple issues into one, repeat the same finding across a summary and an inline comment, or describe the same bug in completely different words. The benchmark handles this through a four-step automated pipeline:

01. Issue extraction

Each tool’s raw review output is passed through an LLM extractor that breaks freeform comments into individual, standalone issue statements — one problem per entry. A comment like “Found two problems: null input isn’t handled, and the cache key uses the wrong field” becomes two separate candidates. This prevents a tool from being penalized for bundling findings, and prevents inflated candidate counts from padding scores.
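
In practice this step can be pictured as one prompted model call per review comment. A rough Python sketch, where the prompt wording and the call_llm client are placeholders rather than the benchmark's actual implementation:

    import json

    EXTRACTION_PROMPT = """Split the review comment below into standalone issues.
    Return a JSON array of strings, one problem per entry.

    Comment:
    {comment}"""

    def extract_issues(comment: str, call_llm) -> list[str]:
        # call_llm is a placeholder for whatever model client is used; it is
        # expected to return the JSON array described in the prompt.
        raw = call_llm(EXTRACTION_PROMPT.format(comment=comment))
        return json.loads(raw)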

02. Deduplication

Some tools post the same finding twice — once in a PR-level summary and again as an inline comment on the affected line. A second LLM pass groups near-identical candidates before scoring. The rule is strict: two candidates are duplicates only if a single code change would fix both. The same type of bug appearing in two different functions counts as two separate issues. When in doubt, candidates are kept separate.

03. Semantic matching

Each deduplicated candidate is compared against every golden comment in an all-pairs LLM evaluation. The judge determines whether the candidate identifies the same underlying issue as the golden comment — accepting semantic matches regardless of wording. Each match receives a confidence score; when multiple candidates could match the same golden comment, the highest-confidence match wins. A matched candidate’s duplicates (from Step 2) are also marked matched, so they don’t count as false positives.
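
The assignment logic, separate from the LLM judgments themselves, can be sketched in a few lines of Python. The data shapes and the confidence threshold here are assumptions for illustration, not the benchmark's exact implementation:

    def resolve_matches(confidences, duplicate_group, threshold=0.5):
        # confidences: {golden_id: {candidate_id: score}} from the all-pairs judge
        # duplicate_group: {candidate_id: group_id} from the deduplication step
        matched_golden, matched_candidates = set(), set()
        for golden_id, scores in confidences.items():
            if not scores:
                continue
            best, best_score = max(scores.items(), key=lambda kv: kv[1])
            if best_score < threshold:
                continue
            matched_golden.add(golden_id)
            matched_candidates.add(best)
            # Duplicates of the winning candidate are marked matched too, so they
            # are not later counted as false positives.
            group = duplicate_group.get(best)
            if group is not None:
                matched_candidates.update(c for c, g in duplicate_group.items() if g == group)
        return matched_golden, matched_candidates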

04. Scoring

Matched candidates count as true positives, unmatched candidates as false positives, and golden comments with no matching candidate as false negatives. These per-PR tallies feed the Precision, Recall, and F1 calculations described in the next section.

Scoring Mechanism

Precision, Recall, and F1

TP, FP, and FN counts are tallied per PR and micro-averaged across all 50 PRs to produce the final Precision, Recall, and F1 scores reported in the leaderboard.

  • Precision = TP / (TP + FP)

    Of all the issues the tool flagged, what fraction were real?

  • Recall = TP / (TP + FN)

    Of all the golden issues, what fraction did the tool find?

  • F1 = Harmonic mean of Precision and Recall

A tool that posts 100 comments per PR may get high Recall but collapsed Precision. A tool that posts 2 comments may get high Precision but low Recall. F1 rewards the tool that finds the most real issues with the least noise.
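
A minimal sketch of the micro-averaging described above, where the TP/FP/FN tallies are summed across all 50 PRs before the ratios are taken:

    def micro_scores(per_pr_counts):
        # per_pr_counts: list of (tp, fp, fn) tuples, one per PR
        tp = sum(c[0] for c in per_pr_counts)
        fp = sum(c[1] for c in per_pr_counts)
        fn = sum(c[2] for c in per_pr_counts)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1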

Proof View

See the proof

Side-by-side benchmark review comments, with CloudAEye pinned on the left and a selectable comparison provider on the right.

Why CloudAEye?

CloudAEye delivers code review that works from the very first pull request. Our @cloudaeye /inspect command is built for speed and precision. Independently benchmarked against 21 of the world's best code review solutions, it achieves the #1 F1 score overall and ranks first for correctness, security, feature PRs, and cross-file analysis: the scenarios that matter most in real-world development. Whether your team is shipping new features or hardening security, CloudAEye consistently outperforms the field, even when issues span multiple files.


For teams that need deeper coverage, @cloudaeye /review goes further. It re-evaluates updated code for bugs and logical errors, runs a full security pass aligned with OWASP standards across web applications, LLM and generative AI apps, agentic applications, and MCP servers, executes PR checklists and custom rules, and provides final merge recommendations, all in one pass.


CloudAEye does not just spot issues. @cloudaeye /implement goes a step further and fixes them for you by pushing changes directly to your PR through commit suggestions, so your team can keep moving without context switching.


Beyond identifying issues, CloudAEye enforces the standards your team actually cares about. It automatically tracks docstring and unit test coverage, catches outdated README documentation, and ensures dependency files never fall out of sync. These quality gates run on every PR without any manual effort. If you need something more specific, you can write your own rules in plain English and CloudAEye handles the rest. With more than 75 prebuilt checklist items spanning bugs, security, LLM security, agentic risks, and MCP server vulnerabilities, your entire codebase is covered from first commit to final merge.


Unlike rigid tools, CloudAEye adapts to how your team works. You can toggle only the categories you care about, target specific file patterns or subdirectories, and tailor every review to your repository's standards. From day-one inspection to enterprise-grade compliance, CloudAEye grows with your team.

Key Takeaways

#1 Code review tool

The results speak for themselves. CloudAEye ranks #1 overall and leads across every major real-world filter, including correctness, security, feature PRs, cross-file analysis, authentication, and high-risk PRs. Customers who demand precision without noise and recall without blind spots consistently achieve the best outcomes with CloudAEye.

A credible open source benchmark you can trust

The benchmark uses public open-source pull requests and golden comments instead of closed, marketing-only examples. Every score is anchored to real reviewable changes.

Reflects how code is written and reviewed in practice

The repository mix spans scheduling, authentication, observability, forums, and error tracking across TypeScript, Python, Java, Ruby, and Go.

One review system for accuracy, security, and scale

With the best F1 score, OWASP-aligned security review standards, and an extensive PR checklist, CloudAEye is the best fit for production code review.