circle-ir SAST Benchmarks
Static analysis only. No LLM. Reproducible.
circle-ir 3.19.4 · benchmark date April 22, 2026
circle-ir is an MIT-licensed neuro-symbolic static analysis library. These benchmarks measure the static analysis engine only; no LLM verification layer is involved. All benchmark code, harnesses, and raw outputs are available in the circle-ir repository for independent reproduction.
Results
Benchmarks by Language
Java (6 benchmarks)
| Benchmark | Tests | TP | TN | FP | FN | TPR | FPR | Score |
|---|---|---|---|---|---|---|---|---|
| OWASP Benchmark | 1,415 | 708 | 707 | 0 | 0 | 100% | 0% | 100% |
| Juliet Test Suite | 243 | 122 | 121 | 0 | 0 | 100% | 0% | 100% |
| SecuriBench Micro | 123 | 60 | 60 | 1 | 2 | 96.8% | 1.6% | 97.7% |
| CWE-Bench-Java | 120 | 61 | — | — | 59 | 50.8% | — | 50.8% |
| WebGoat | 29 | 26 | — | — | 3 | 89.7% | — | 89.3% |
| DVJA | 7 | 7 | — | — | 0 | 100% | — | 100% |
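The TPR and FPR columns follow from the four confusion-matrix counts. The sketch below recomputes them for the SecuriBench Micro row using the standard definitions (TPR = TP / (TP + FN), FPR = FP / (FP + TN)). The type and function names are illustrative and are not taken from the circle-ir harness. The Score column is left out because its formula is not stated on this page.

```typescript
// Confusion-matrix counts for one benchmark row.
interface Counts {
  tp: number;
  tn: number;
  fp: number;
  fn: number;
}

// True positive rate: share of vulnerable test cases that were flagged.
function truePositiveRate(c: Counts): number {
  return c.tp / (c.tp + c.fn);
}

// False positive rate: share of safe test cases that were flagged.
function falsePositiveRate(c: Counts): number {
  return c.fp / (c.fp + c.tn);
}

// Format a ratio as a percentage with one decimal place.
function asPercent(ratio: number): string {
  return (ratio * 100).toFixed(1) + "%";
}

// Counts copied from the SecuriBench Micro row above.
const securiBenchMicro: Counts = { tp: 60, tn: 60, fp: 1, fn: 2 };

console.log(
  asPercent(truePositiveRate(securiBenchMicro)),  // 96.8%
  asPercent(falsePositiveRate(securiBenchMicro))  // 1.6%
);
```

Rows that show a dash for TN and FP come from benchmarks without safe test cases, so only the TPR can be computed for them.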
Node.js / TypeScript (3 benchmarks)
| Benchmark | Tests | TP | TN | FP | FN | TPR | FPR | Score |
|---|---|---|---|---|---|---|---|---|
| NodeGoat | 14 | 14 | — | — | 0 | 100% | — | 100% |
| Juice Shop | 14 | 14 | — | — | 0 | 100% | — | 100% |
| NodeJS Synthetic | 25 | 23 | — | — | 2 | 92.0% | — | 92.9% |
Python (2 benchmarks)
| Benchmark | Tests | TP | TN | FP | FN | TPR | FPR | Score |
|---|---|---|---|---|---|---|---|---|
| PyGoat | 26 | 23 | — | — | 3 | 88.5% | — | 90.0% |
| DVPWA | 6 | 6 | — | — | 0 | 100% | — | 100% |
Rust (2 benchmarks)
| Benchmark | Tests | TP | TN | FP | FN | TPR | FPR | Score |
|---|---|---|---|---|---|---|---|---|
| Rust Synthetic | 50 | 46 | — | — | 4 | 92.0% | — | 92.3% |
| CWE-Bench-Rust | 30 | 28 | — | — | 2 | 93.3% | — | 94.4% |
Other Languages (3 benchmarks)
| Benchmark | Tests | TP | TN | FP | FN | TPR | FPR | Score |
|---|---|---|---|---|---|---|---|---|
| Bash Synthetic | 31 | 31 | — | — | 0 | 100% | — | 100% |
| HTML/JS Synthetic | 30 | 30 | — | — | 0 | 100% | — | 100% |
| Firing Range | 40 | 35 | — | 2 | 3 | 92.1% | — | 92.1% |
Summary
Results by Language
| Language | Perfect (100%) | Near-perfect (90%+) | Total Benchmarks |
|---|---|---|---|
| Java | 3 | 4 | 6 |
| Node.js / TypeScript | 2 | 3 | 3 |
| Python | 1 | 2 | 2 |
| Rust | 0 | 2 | 2 |
| Bash | 1 | 1 | 1 |
| HTML/JS | 1 | 1 | 1 |
| Total | 8 | 13 | 16 |
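The per-language counts can be re-derived from the Score columns. The sketch below assumes that "Perfect" means a Score of exactly 100% and that "Near-perfect" is cumulative (every Score of 90% or higher, perfect ones included). That reading is inferred from the rows above, not stated by the authors. It covers the four languages that have their own tables, and the names used are illustrative.

```typescript
// Score column values (percent) copied from the per-language tables.
const scoresByLanguage: Record<string, number[]> = {
  "Java": [100, 100, 97.7, 50.8, 89.3, 100],
  "Node.js / TypeScript": [100, 100, 92.9],
  "Python": [90.0, 100],
  "Rust": [92.3, 94.4],
};

interface Tally {
  perfect: number;
  nearPerfect: number;
  total: number;
}

// Assumption: near-perfect is cumulative, so a 100% score counts in both buckets.
function tallyScores(scores: number[]): Tally {
  return {
    perfect: scores.filter((s) => s === 100).length,
    nearPerfect: scores.filter((s) => s >= 90).length,
    total: scores.length,
  };
}

for (const language of Object.keys(scoresByLanguage)) {
  const t = tallyScores(scoresByLanguage[language]);
  console.log(`${language}: ${t.perfect} perfect, ${t.nearPerfect} near-perfect, ${t.total} total`);
}
```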
Deep Dive
CWE-Bench-Java by Category
| CWE | Category | Detected | Missed | Rate |
|---|---|---|---|---|
| CWE-022 | Path Traversal | 37 / 55 | 18 | 67.3% |
| CWE-078 | Command Injection | 6 / 13 | 7 | 46.2% |
| CWE-079 | XSS | 13 / 31 | 18 | 41.9% |
| CWE-094 | Code Injection | 5 / 21 | 16 | 23.8% |
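As a cross-check, the category rows should add up to the CWE-Bench-Java totals in the Java table (61 detected out of 120 projects). A minimal sketch of that arithmetic, with illustrative names:

```typescript
interface CategoryResult {
  cwe: string;
  detected: number;
  total: number;
}

// Detected / total counts copied from the category table above.
const categories: CategoryResult[] = [
  { cwe: "CWE-022", detected: 37, total: 55 },
  { cwe: "CWE-078", detected: 6, total: 13 },
  { cwe: "CWE-079", detected: 13, total: 31 },
  { cwe: "CWE-094", detected: 5, total: 21 },
];

// Detection rate as a percentage string with one decimal place.
function detectionRate(detected: number, total: number): string {
  return ((detected / total) * 100).toFixed(1) + "%";
}

for (const c of categories) {
  console.log(c.cwe, detectionRate(c.detected, c.total));
}

// Sum the categories to recover the overall benchmark row.
const detectedSum = categories.reduce((sum, c) => sum + c.detected, 0);
const totalSum = categories.reduce((sum, c) => sum + c.total, 0);
console.log("Overall", detectedSum, "/", totalSum, detectionRate(detectedSum, totalSum)); // Overall 61 / 120 50.8%
```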
Methodology
How we measured
- circle-ir is a neuro-symbolic static analyzer that combines traditional dataflow analysis with learned patterns
- All results are from static analysis only—no LLM involvement in detection or verification
- Each benchmark's source dataset is linked to its origin: OWASP Benchmark, NIST Juliet Test Suite, CWE-Bench-Java, and others
- CWE-Bench-Java uses per-project binary detection: each project contains one CVE, scored as detected or not
- All benchmark code, harnesses, and raw outputs are in cogniumhq/circle-ir/benchmarks/
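Per-project binary detection, as described for CWE-Bench-Java above, can be sketched as follows. The data shapes and the matching rule (same CWE and same file as the known CVE) are assumptions made for illustration. The actual harness in the repository may match findings differently.

```typescript
// Hypothetical shapes; the real harness's data model may differ.
interface Finding {
  cwe: string;
  file: string;
}

interface Project {
  name: string;
  cveCwe: string;          // CWE of the project's single known CVE
  vulnerableFile: string;  // file containing the known vulnerable sink
  findings: Finding[];     // everything the analyzer reported for this project
}

// Assumed rule: a project counts as detected if at least one finding
// matches the CVE's CWE and file. Unrelated findings do not help.
function isDetected(project: Project): boolean {
  return project.findings.some(
    (f) => f.cwe === project.cveCwe && f.file === project.vulnerableFile
  );
}

// Share of projects whose CVE was detected (0 for an empty list).
function projectDetectionRate(projects: Project[]): number {
  if (projects.length === 0) return 0;
  return projects.filter(isDetected).length / projects.length;
}

const sampleProjects: Project[] = [
  {
    name: "project-a",
    cveCwe: "CWE-022",
    vulnerableFile: "src/FileServlet.java",
    findings: [{ cwe: "CWE-022", file: "src/FileServlet.java" }],
  },
  {
    name: "project-b",
    cveCwe: "CWE-094",
    vulnerableFile: "src/TemplateEngine.java",
    // Findings exist, but none hits the known sink, so this is a full miss.
    findings: [{ cwe: "CWE-079", file: "src/View.java" }],
  },
];

console.log(projectDetectionRate(sampleProjects)); // 0.5
```

Under this scoring there is no partial credit: a project contributes either 1 or 0 to the numerator.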
Limitations
Known gaps
- SSTI (Server-Side Template Injection) is not currently in circle-ir's CWE coverage—this causes the PyGoat false negative
- Firing Range has 2 false positives in the escape/ category (escaped output flagged) and 3 false negatives in cors/ (CORS misconfigurations not detected)
- CWE-Bench-Java uses per-project detection, not per-CVE counts: a single missed sink in a complex project counts as a full miss
- These benchmarks test static analysis only—the full circle-ir + LLM verification pipeline (SAST+LLM) produces different results, published separately
Reproduce
Run it yourself
To reproduce these benchmark results, you need Git and Node.js (18+) installed.
    git clone https://github.com/cogniumhq/circle-ir
    cd circle-ir/benchmarks
    npm install
    npm run benchmark
If you cannot reproduce a result, please open an issue at github.com/cogniumhq/circle-ir/issues.