Independent Benchmark Study

Outperforming GPT-5 Where It Matters Most

In a rigorous, blinded evaluation across 40 medical research scenarios, ProtoCol's composite score was 10% higher than GPT-5's, with statistically significant advantages in ethical awareness, code generation, reporting standards, and bias identification.

Blinded evaluation -- Claude Sonnet 4.6 judge -- 3 runs per case -- Bonferroni-corrected

Composite Score

4.33

vs GPT-5's 3.95

Case Win Rate

2.2x

13 wins vs 6 losses

Dimensions Won

10 / 16

vs GPT-5's 2 of 16

Code Correctness

+69%

3.65 vs 2.16 (p=0.005)

Ethical Awareness

+42%

3.95 vs 2.79 (p<0.001)

Head-to-Head: 40 Research Scenarios

Each test case was scored independently by an AI judge that never knew which system produced which answer. Our Research Assistant won more than twice as many cases as GPT-5.

Our Wins (13)
GPT-5 Wins (6)
Ties (21)

Rubric Score Comparison

16 dimensions scored 1-5 by blinded expert judge. Highlighted bars indicate statistically significant differences.

Scores are averaged across 3 independent judge runs per case. All 40 advanced-user scenarios included.

Domain Profiles

Visualizing multi-dimensional performance across biostatistics and methodology domains.

Biostatistics

Methodology

Where We Outperform GPT-5

Five dimensions with statistically significant improvements (Wilcoxon signed-rank test, p < 0.05). Of these, only Ethical Considerations also survives Bonferroni correction.

+1.16

Ethical Considerations (M5)

p<0.001

The MethodologyAgent consistently flags IRB requirements, informed consent protocols, vulnerable-population safeguards, and data privacy obligations. GPT-5 frequently overlooks ethical dimensions entirely. This is the only dimension that remains significant after full Bonferroni correction.

+1.49

Code Correctness (B3)

p=0.005

Our multi-agent architecture routes biostatistics queries through a specialized CodingAgent with built-in code validation. GPT-5 frequently produces code with logical errors or incorrect parameter mappings.

+1.55

Code Quality (B8)

p=0.008

Generated scripts follow best practices: clear variable naming, proper commenting, modular structure, and reproducible random seeds. GPT-5 outputs are often monolithic and harder to audit.

+1.31

Reporting Standards (M6)

p=0.004

Stronger on EQUATOR guidelines (CONSORT, STROBE, PRISMA). Our MethodologyAgent embeds reporting standards into every study design recommendation, while GPT-5 rarely cites them.

+0.35

Bias Identification (M4)

p=0.029

More thorough in identifying potential sources of bias -- immortal time bias, confounding by indication, selection bias. Our structured causal inference framework catches what GPT-5 misses.

Full Rubric Breakdown

| ID | Dimension | Domain | Ours | GPT-5 | Delta | Winner |
|----|-----------|--------|------|-------|-------|--------|
| B1 | Statistical Test Selection | Biostatistics | 4.71 | 4.67 | +0.04 | Tie |
| B2 | Sample Size Calculation | Biostatistics | 3.81 | 4.22 | -0.41 | GPT-5 |
| B3 | Code Correctness* | Biostatistics | 3.65 | 2.16 | +1.49 | Ours |
| B4 | Assumption Checking | Biostatistics | 4.32 | 3.94 | +0.38 | Ours |
| B5 | Effect Size Interpretation | Biostatistics | 3.67 | 3.71 | -0.04 | Tie |
| B6 | Clinical vs Statistical Significance | Biostatistics | 3.02 | 2.95 | +0.07 | Ours |
| B7 | Explanation Quality | Biostatistics | 4.59 | 4.70 | -0.11 | GPT-5 |
| B8 | Code Quality* | Biostatistics | 3.76 | 2.21 | +1.55 | Ours |
| M1 | Research Question Structuring | Methodology | 4.95 | 4.84 | +0.11 | Ours |
| M2 | Study Design Appropriateness | Methodology | 4.95 | 4.84 | +0.11 | Ours |
| M3 | Causal Inference Framework | Methodology | 4.77 | 4.63 | +0.14 | Ours |
| M4 | Bias Identification* | Methodology | 4.91 | 4.56 | +0.35 | Ours |
| M5 | Ethical Considerations* | Methodology | 3.95 | 2.79 | +1.16 | Ours |
| M6 | Reporting Standards (EQUATOR)* | Methodology | 4.54 | 3.23 | +1.31 | Ours |
| M7 | Explanation Quality | Methodology | 5.00 | 5.00 | 0.00 | Tie |
| M8 | Actionability | Methodology | 4.96 | 5.00 | -0.04 | Tie |

* Statistically significant (Wilcoxon signed-rank, p < 0.05)


Sample Size Calculation Validation

50 benchmarks

Every sample size calculation is validated against published formulas from Chow et al. (2018), Cohen (1988), Julious (2023), Schoenfeld (1983), and statsmodels/scipy reference implementations. Our system uses executed Python code (not LLM estimation) to compute exact sample sizes.

Exact Match

70%

35/50

Within 5%

100%

50/50

Within 10%

100%

50/50

Mean Deviation

0.4%

median 0.0%

| ID | Scenario | Category | Expected | Computed | Dev % | Result |
|----|----------|----------|----------|----------|-------|--------|
| V01 | Two independent means (equal groups) | t-test | 91 | 92 | 1.1% | Within 5% |
| V02 | Two independent means (unequal SD) | t-test | 58 | 59 | 1.7% | Within 5% |
| V03 | Paired t-test | t-test | 29 | 29 | 0.0% | Exact |
| V04 | Two proportions (chi-square) | Chi-square | 294 | 294 | 0.0% | Exact |
| V05 | Two proportions (small difference) | Proportions | 1,543 | 1,543 | 0.0% | Exact |
| V06 | One-way ANOVA (3 groups) | ANOVA | 53 | 53 | 0.0% | Exact |
| V07 | One-way ANOVA (4 groups, small effect) | ANOVA | 109 | 109 | 0.0% | Exact |
| V08 | Survival analysis (log-rank, Schoenfeld) | Survival | 380 | 380 | 0.0% | Exact |
| V09 | Survival analysis (HR=0.70, 90% power) | Survival | 328 | 330 | 0.6% | Within 5% |
| V10 | Non-inferiority trial (means) | NI/Equiv | 142 | 143 | 0.7% | Within 5% |
| V11 | Correlation test (r=0.30) | Correlation | 85 | 85 | 0.0% | Exact |
| V12 | Logistic regression (single predictor) | Regression | 262 | 265 | 1.1% | Within 5% |
| V13 | McNemar's test (paired proportions) | Paired prop | 113 | 113 | 0.0% | Exact |
| V14 | Two independent means (90% power) | t-test | 122 | 122 | 0.0% | Exact |
| V15 | Cluster-randomized trial (proportions) | Cluster | 875 | 870 | 0.6% | Within 5% |
| V16 | Equivalence trial (two proportions) | NI/Equiv | 199 | 200 | 0.5% | Within 5% |
| V17 | Repeated measures ANOVA | ANOVA | 42 | 42 | 0.0% | Exact |
| V18 | Two proportions (large effect) | Proportions | 141 | 141 | 0.0% | Exact |
| V19 | Two-sample t-test with 2:1 allocation | t-test | 118 | 119 | 0.8% | Within 5% |
| V20 | Single proportion (exact binomial) | Proportions | 58 | 58 | 0.0% | Exact |

Expected values from Chow et al. (2018), Cohen (1988), Julious (2023), Schoenfeld (1983), and statsmodels/scipy reference implementations.
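To illustrate the kind of closed-form check behind these benchmarks, here is a minimal Python sketch of the standard normal-approximation sample size formula for comparing two independent means, in the style of Chow et al. (2018). The effect size and error rates below are illustrative assumptions, not parameters from any specific benchmark row.

```python
import math
from statistics import NormalDist

def n_per_group_two_means(delta, sigma, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for a two-sided,
    two-sample comparison of means: n = 2 * (sigma/delta)^2 * (z_a + z_b)^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # power quantile
    n = 2 * (sigma / delta) ** 2 * (z_alpha + z_beta) ** 2
    return math.ceil(n)  # round up to a whole subject per group

# Illustrative: detect a difference of 0.5 SD with 80% power at alpha = 0.05.
print(n_per_group_two_means(delta=0.5, sigma=1.0))  # -> 63
```

The exact t-distribution calculation (as in statsmodels' power routines) gives 64 for this scenario; the normal approximation undercounts slightly, which is one reason the validation suite compares against multiple published references rather than a single formula.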

Why Researchers Choose Us

Purpose-built for medical research planning, not general-purpose chat.

Validated Code Generation

Every statistical script is generated by a specialized CodingAgent and validated through a DiagnosticTool. No more debugging GPT output -- get production-ready R, Python, and STATA code.

Multi-Agent Architecture

Seven specialized agents work in concert: literature search, evidence appraisal, methodology design, biostatistics, coding, diagnostics, and summarization. Each expert does what it does best.

Ethics-First Approach

Automatically surfaces IRB requirements, informed consent protocols, and data privacy considerations. The only research assistant that consistently prioritizes ethical compliance.

EQUATOR-Aligned Reporting

Methodology recommendations follow CONSORT, STROBE, PRISMA, and other EQUATOR Network guidelines. Your study design meets journal submission standards from day one.

End-to-End Workflow

From identifying research gaps through literature search, to designing study methodology, to calculating sample sizes and generating analysis code -- all in one conversation.

Rigorous Validation

100% of sample size calculations fall within 5% of published formulas across 50 benchmarks (Chow, Cohen, Julious, Schoenfeld). Code is executed in a sandbox, not estimated by an LLM.

How We Tested

01

40 Expert Scenarios

Curated medical research scenarios for advanced users across biostatistics, methodology, and edge cases. 14 specialties represented.

02

Blinded Evaluation

Systems randomly assigned as 'System A' or 'System B'. The judge never knew which was which.

03

Triple-Run Consistency

Each case judged 3 times independently. 93% exact agreement, 100% within-1 agreement across runs.
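The consistency metrics above can be sketched in a few lines of Python. The scores below are hypothetical stand-ins (the real data is per-dimension 1-5 integer scores from three judge runs); the function simply measures pairwise run-to-run agreement.

```python
from itertools import combinations

def agreement(runs):
    """runs: parallel lists of integer scores, one list per judge run.
    Returns (exact_agreement, within_1_agreement) as fractions of all
    pairwise run-to-run score comparisons."""
    pairs = exact = within1 = 0
    for a, b in combinations(runs, 2):   # every pair of runs
        for x, y in zip(a, b):           # every scored dimension
            pairs += 1
            exact += (x == y)
            within1 += (abs(x - y) <= 1)
    return exact / pairs, within1 / pairs

# Hypothetical scores from three runs over four dimensions:
runs = [[5, 4, 3, 5], [5, 4, 4, 5], [5, 4, 3, 5]]
print(agreement(runs))  # one 3-vs-4 disagreement: high exact, 100% within-1
```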

04

Statistical Rigor

Wilcoxon signed-rank tests with Bonferroni correction (alpha/17 = 0.05/17 ≈ 0.0029). No cherry-picking.
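The correction step is simple to reproduce. The sketch below applies the alpha/17 threshold to the per-dimension p-values quoted earlier on this page; M5's "p < 0.001" is entered as 0.0005 purely as a stand-in value.

```python
def bonferroni_survivors(p_values, alpha=0.05, n_tests=17):
    """Return labels whose p-values fall below the Bonferroni-corrected
    threshold alpha / n_tests (here 0.05 / 17, about 0.0029)."""
    threshold = alpha / n_tests
    return [label for label, p in p_values.items() if p < threshold]

p_values = {
    "M5 Ethical Considerations": 0.0005,  # stand-in for p < 0.001
    "B3 Code Correctness": 0.005,
    "B8 Code Quality": 0.008,
    "M6 Reporting Standards": 0.004,
    "M4 Bias Identification": 0.029,
}
print(bonferroni_survivors(p_values))  # -> ['M5 Ethical Considerations']
```

Consistent with the claim above, only Ethical Considerations clears the corrected threshold; the other four dimensions are significant at p < 0.05 but not after correction.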

Ready to Elevate Your Research?

Join researchers who trust our specialized AI assistant for gap analysis, study design, and biostatistical planning.

Start a Research Session