Independent Benchmark Study
In a rigorous, blinded evaluation across 40 medical research scenarios, ProtoCol scored about 10% higher than GPT-5 on the composite metric, with statistically significant advantages in ethical awareness, code correctness, code quality, reporting standards, and bias identification.
Composite Score
4.33
vs GPT-5's 3.95
Case Win Rate
2.2x
13 wins vs 6 losses
Dimensions Won
10 / 16
vs GPT-5's 2 of 16
Code Correctness
+69%
3.65 vs 2.16 (p=0.005)
Ethical Awareness
+42%
3.95 vs 2.79 (p<0.001)
Each test case was scored independently by an AI judge that never knew which system produced which answer. Our Research Assistant won more than twice as many cases as GPT-5.
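The headline percentages follow directly from the raw scores quoted above. A minimal sketch of that arithmetic, assuming each gain is computed as our score divided by GPT-5's score minus one (the function name `relative_gain` is illustrative, not part of the product):

```python
def relative_gain(ours: float, baseline: float) -> float:
    """Percentage by which `ours` exceeds `baseline`."""
    return (ours / baseline - 1.0) * 100.0


# Mean scores (ours, GPT-5) as quoted in the headline cards.
headline = {
    "Composite Score": (4.33, 3.95),
    "Code Correctness": (3.65, 2.16),
    "Ethical Awareness": (3.95, 2.79),
}

gains = {name: relative_gain(o, b) for name, (o, b) in headline.items()}
win_ratio = 13 / 6  # case wins divided by case losses

for name, gain in gains.items():
    print(f"{name}: +{gain:.1f}%")
print(f"Case win ratio: {win_ratio:.1f}x")
```

Rounded to whole percentages, these reproduce the figures shown on the cards.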
16 dimensions scored 1-5 by a blinded AI judge. Highlighted bars indicate statistically significant differences.
Scores are averaged across 3 independent judge runs per case. All 40 advanced-user scenarios included.
Visualizing multi-dimensional performance across biostatistics and methodology domains.
Five dimensions show statistically significant improvements (Wilcoxon signed-rank test, p < 0.05). Of these, only Ethical Considerations survives Bonferroni correction.
The MethodologyAgent consistently raises IRB requirements, informed consent protocols, vulnerable populations, and data privacy. GPT-5 frequently overlooks ethical dimensions entirely. This is the only dimension that remains significant after full Bonferroni correction.
Our multi-agent architecture routes biostatistics queries through a specialized CodingAgent with built-in code validation. GPT-5 frequently produces code with logical errors or incorrect parameter mappings.
Generated scripts follow best practices: clear variable naming, proper commenting, modular structure, and reproducible random seeds. GPT-5 outputs are often monolithic and harder to audit.
Our system is stronger on EQUATOR guidelines (CONSORT, STROBE, PRISMA). The MethodologyAgent embeds reporting standards into every study design recommendation, while GPT-5 rarely cites them.
Our system is more thorough in identifying potential sources of bias, including immortal time bias, confounding by indication, and selection bias. Its structured causal inference framework catches what GPT-5 misses.
| Dimension | Domain | Ours | GPT-5 | Delta | Winner |
|---|---|---|---|---|---|
| B1 Statistical Test Selection | Biostatistics | 4.71 | 4.67 | +0.04 | Tie |
| B2 Sample Size Calculation | Biostatistics | 3.81 | 4.22 | -0.41 | GPT-5 |
| B3 Code Correctness* | Biostatistics | 3.65 | 2.16 | +1.49 | Ours |
| B4 Assumption Checking | Biostatistics | 4.32 | 3.94 | +0.38 | Ours |
| B5 Effect Size Interpretation | Biostatistics | 3.67 | 3.71 | -0.04 | Tie |
| B6 Clinical vs Statistical Significance | Biostatistics | 3.02 | 2.95 | +0.07 | Ours |
| B7 Explanation Quality | Biostatistics | 4.59 | 4.70 | -0.11 | GPT-5 |
| B8 Code Quality* | Biostatistics | 3.76 | 2.21 | +1.55 | Ours |
| M1 Research Question Structuring | Methodology | 4.95 | 4.84 | +0.11 | Ours |
| M2 Study Design Appropriateness | Methodology | 4.95 | 4.84 | +0.11 | Ours |
| M3 Causal Inference Framework | Methodology | 4.77 | 4.63 | +0.14 | Ours |
| M4 Bias Identification* | Methodology | 4.91 | 4.56 | +0.35 | Ours |
| M5 Ethical Considerations* | Methodology | 3.95 | 2.79 | +1.16 | Ours |
| M6 Reporting Standards (EQUATOR)* | Methodology | 4.54 | 3.23 | +1.31 | Ours |
| M7 Explanation Quality | Methodology | 5.00 | 5.00 | 0.00 | Tie |
| M8 Actionability | Methodology | 4.96 | 5.00 | -0.04 | Tie |
* Statistically significant (Wilcoxon signed-rank, p < 0.05)
Every sample size calculation is validated against published formulas from Chow et al. (2018), Cohen (1988), Julious (2023), Schoenfeld (1983), and statsmodels/scipy reference implementations. Our system uses executed Python code (not LLM estimation) to compute exact sample sizes.
Exact Match
70%
35/50
Within 5%
100%
50/50
Within 10%
100%
50/50
Mean Deviation
0.4%
median 0.0%
| ID | Scenario | Category | Expected | Computed | Dev % | Result |
|---|---|---|---|---|---|---|
| V01 | Two independent means (equal groups) | t-test | 91 | 92 | 1.1% | Within 5% |
| V02 | Two independent means (unequal SD) | t-test | 58 | 59 | 1.7% | Within 5% |
| V03 | Paired t-test | t-test | 29 | 29 | 0.0% | Exact |
| V04 | Two proportions (chi-square) | Chi-square | 294 | 294 | 0.0% | Exact |
| V05 | Two proportions (small difference) | Proportions | 1,543 | 1,543 | 0.0% | Exact |
| V06 | One-way ANOVA (3 groups) | ANOVA | 53 | 53 | 0.0% | Exact |
| V07 | One-way ANOVA (4 groups, small effect) | ANOVA | 109 | 109 | 0.0% | Exact |
| V08 | Survival analysis (log-rank, Schoenfeld) | Survival | 380 | 380 | 0.0% | Exact |
| V09 | Survival analysis (HR=0.70, 90% power) | Survival | 328 | 330 | 0.6% | Within 5% |
| V10 | Non-inferiority trial (means) | NI/Equiv | 142 | 143 | 0.7% | Within 5% |
| V11 | Correlation test (r=0.30) | Correlation | 85 | 85 | 0.0% | Exact |
| V12 | Logistic regression (single predictor) | Regression | 262 | 265 | 1.1% | Within 5% |
| V13 | McNemar's test (paired proportions) | Paired prop | 113 | 113 | 0.0% | Exact |
| V14 | Two independent means (90% power) | t-test | 122 | 122 | 0.0% | Exact |
| V15 | Cluster-randomized trial (proportions) | Cluster | 875 | 870 | 0.6% | Within 5% |
| V16 | Equivalence trial (two proportions) | NI/Equiv | 199 | 200 | 0.5% | Within 5% |
| V17 | Repeated measures ANOVA | ANOVA | 42 | 42 | 0.0% | Exact |
| V18 | Two proportions (large effect) | Proportions | 141 | 141 | 0.0% | Exact |
| V19 | Two-sample t-test with 2:1 allocation | t-test | 118 | 119 | 0.8% | Within 5% |
| V20 | Single proportion (exact binomial) | Proportions | 58 | 58 | 0.0% | Exact |
Expected values from Chow et al. (2018), Cohen (1988), Julious (2023), Schoenfeld (1983), and statsmodels/scipy reference implementations.
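The Dev % and Result columns can be reproduced from the Expected and Computed values. A minimal sketch, assuming deviation is the absolute difference divided by the expected value and that the 5% and 10% bands are inclusive (both are assumptions, and the function names are illustrative):

```python
def deviation_pct(expected: int, computed: int) -> float:
    """Absolute deviation of the computed value, as a percentage of the expected value."""
    return abs(computed - expected) / expected * 100.0


def band(expected: int, computed: int) -> str:
    """Classify a benchmark row into a tolerance band (inclusive thresholds assumed)."""
    if computed == expected:
        return "Exact"
    dev = deviation_pct(expected, computed)
    if dev <= 5.0:
        return "Within 5%"
    if dev <= 10.0:
        return "Within 10%"
    return "Outside 10%"


# (expected, computed) pairs copied from four rows of the table above.
rows = {"V01": (91, 92), "V03": (29, 29), "V12": (262, 265), "V15": (875, 870)}

for row_id, (expected, computed) in rows.items():
    dev = deviation_pct(expected, computed)
    print(f"{row_id}: {dev:.1f}% -> {band(expected, computed)}")
```

Using the absolute difference means an undershoot such as V15 (870 against an expected 875) is banded the same way as an overshoot.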
Purpose-built for medical research planning, not general-purpose chat.
Every statistical script is generated by a specialized CodingAgent and validated through a DiagnosticTool. No more debugging GPT output: get production-ready R, Python, and Stata code.
Seven specialized agents work in concert: literature search, evidence appraisal, methodology design, biostatistics, coding, diagnostics, and summarization. Each expert does what it does best.
Automatically surfaces IRB requirements, informed consent protocols, and data privacy considerations. The only research assistant that consistently prioritizes ethical compliance.
Methodology recommendations follow CONSORT, STROBE, PRISMA, and other EQUATOR Network guidelines. Your study design meets journal submission standards from day one.
From identifying research gaps through literature search, to designing study methodology, to calculating sample sizes and generating analysis code -- all in one conversation.
100% of sample size calculations fall within 5% of published formulas across 50 benchmarks (Chow, Cohen, Julious, Schoenfeld). Code is executed in a sandbox, not estimated by an LLM.
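To illustrate what computing a sample size in executed code looks like, here is a minimal sketch of the textbook normal-approximation formula for comparing two independent means with equal group sizes. It is not the product's actual implementation. The function name and defaults are assumptions, and an exact t-based calculation would give slightly larger values.

```python
from math import ceil
from statistics import NormalDist


def n_per_group_two_means(delta: float, sd: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-group n for a two-sided, two-sample comparison of means.

    Normal approximation: n = 2 * ((z_{1-alpha/2} + z_{power}) * sd / delta)^2,
    rounded up to the next whole participant.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    n = 2 * ((z_alpha + z_beta) * sd / delta) ** 2
    return ceil(n)


# Hypothetical inputs: detect a 0.5 SD difference at 80% and 90% power.
n_80 = n_per_group_two_means(delta=0.5, sd=1.0)
n_90 = n_per_group_two_means(delta=0.5, sd=1.0, power=0.90)
print(n_80, n_90)
```

Because the result comes from a deterministic formula, repeated runs with the same inputs always yield the same sample size, which is the point of executing code instead of estimating.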
40 curated medical research scenarios for advanced users across biostatistics, methodology, and edge cases. 14 specialties represented.
Systems randomly assigned as 'System A' or 'System B'. The judge never knew which was which.
Each case judged 3 times independently. 93% exact agreement, 100% within-1 agreement across runs.
Wilcoxon signed-rank tests with Bonferroni correction (alpha/17 = 0.0029). No cherry-picking.
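The Bonferroni threshold quoted above is simple to check. A minimal sketch, assuming a family-wise alpha of 0.05 split across the 17 tests named in the text, and using 0.001 as an upper bound for the Ethical Awareness p-value, which is reported only as p<0.001:

```python
ALPHA = 0.05   # family-wise significance level (assumed)
N_TESTS = 17   # divisor quoted in the text

threshold = ALPHA / N_TESTS


def survives_bonferroni(p_value: float, cutoff: float = threshold) -> bool:
    """True if the p-value stays significant at the corrected cutoff."""
    return p_value < cutoff


# p-values from the headline cards; 0.001 is an upper bound, not an exact value.
reported = {"Code Correctness": 0.005, "Ethical Awareness": 0.001}

print(f"Corrected threshold: {threshold:.4f}")
for name, p in reported.items():
    print(f"{name}: uncorrected={p < ALPHA}, corrected={survives_bonferroni(p)}")
```

With these inputs, Code Correctness is significant at the uncorrected 0.05 level but not at the corrected threshold, while Ethical Awareness passes both.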
Join researchers who trust our specialized AI assistant for gap analysis, study design, and biostatistical planning.