Independent Benchmark Study

Outperforming GPT-5 Where It Matters Most

In a rigorous, blinded evaluation across 40 medical research scenarios, ProtoCol's composite score was 10% higher than GPT-5's, with statistically significant advantages in ethical awareness, code generation, reporting standards, and bias identification.

Blinded evaluation -- Claude Sonnet 4.6 judge -- 3 runs per case -- Bonferroni-corrected

Composite Score

4.33

vs GPT-5's 3.95

Case Win Rate

2.2x

13 wins vs 6 losses

Dimensions Won

10 / 16

vs GPT-5's 2 of 16

Code Correctness

+69%

3.65 vs 2.16 (p=0.005)

Ethical Awareness

+42%

3.95 vs 2.79 (p<0.001)

Head-to-Head: 40 Research Scenarios

Each test case was scored independently by an AI judge that never knew which system produced which answer. Our Research Assistant won more than twice as many cases as GPT-5.

Our Wins (13)
GPT-5 Wins (6)
Ties (21)

Rubric Score Comparison

16 dimensions scored 1-5 by blinded expert judge. Highlighted bars indicate statistically significant differences.

Scores are averaged across 3 independent judge runs per case. All 40 advanced-user scenarios included.

Domain Profiles

Visualizing multi-dimensional performance across biostatistics and methodology domains.

Biostatistics

Methodology

Where We Outperform GPT-5

Five dimensions with statistically significant improvements (Wilcoxon signed-rank test, p < 0.05). Of these, only Ethical Considerations also survives Bonferroni correction.

+1.16

Ethical Considerations (M5)

p<0.001

The MethodologyAgent consistently flags IRB requirements, informed consent protocols, vulnerable-population safeguards, and data privacy obligations. GPT-5 frequently overlooks ethical dimensions entirely. This is the only dimension that remains significant after full Bonferroni correction.

+1.49

Code Correctness (B3)

p=0.005

Our multi-agent architecture routes biostatistics queries through a specialized CodingAgent with built-in code validation. GPT-5 frequently produces code with logical errors or incorrect parameter mappings.

+1.55

Code Quality (B8)

p=0.008

Generated scripts follow best practices: clear variable naming, proper commenting, modular structure, and reproducible random seeds. GPT-5 outputs are often monolithic and harder to audit.

+1.31

Reporting Standards (M6)

p=0.004

Stronger on EQUATOR guidelines (CONSORT, STROBE, PRISMA). Our MethodologyAgent embeds reporting standards into every study design recommendation, while GPT-5 rarely cites them.

+0.35

Bias Identification (M4)

p=0.029

More thorough in identifying potential sources of bias -- immortal time bias, confounding by indication, selection bias. Our structured causal inference framework catches what GPT-5 misses.

Full Rubric Breakdown

| ID | Dimension | Domain | Ours | GPT-5 | Delta | Winner |
|----|-----------|--------|------|-------|-------|--------|
| B1 | Statistical Test Selection | Biostatistics | 4.71 | 4.67 | +0.04 | Tie |
| B2 | Sample Size Calculation | Biostatistics | 3.81 | 4.22 | -0.41 | GPT-5 |
| B3 | Code Correctness* | Biostatistics | 3.65 | 2.16 | +1.49 | Ours |
| B4 | Assumption Checking | Biostatistics | 4.32 | 3.94 | +0.38 | Ours |
| B5 | Effect Size Interpretation | Biostatistics | 3.67 | 3.71 | -0.04 | Tie |
| B6 | Clinical vs Statistical Significance | Biostatistics | 3.02 | 2.95 | +0.07 | Ours |
| B7 | Explanation Quality | Biostatistics | 4.59 | 4.70 | -0.11 | GPT-5 |
| B8 | Code Quality* | Biostatistics | 3.76 | 2.21 | +1.55 | Ours |
| M1 | Research Question Structuring | Methodology | 4.95 | 4.84 | +0.11 | Ours |
| M2 | Study Design Appropriateness | Methodology | 4.95 | 4.84 | +0.11 | Ours |
| M3 | Causal Inference Framework | Methodology | 4.77 | 4.63 | +0.14 | Ours |
| M4 | Bias Identification* | Methodology | 4.91 | 4.56 | +0.35 | Ours |
| M5 | Ethical Considerations* | Methodology | 3.95 | 2.79 | +1.16 | Ours |
| M6 | Reporting Standards (EQUATOR)* | Methodology | 4.54 | 3.23 | +1.31 | Ours |
| M7 | Explanation Quality | Methodology | 5.00 | 5.00 | 0.00 | Tie |
| M8 | Actionability | Methodology | 4.96 | 5.00 | -0.04 | Tie |

* Statistically significant (Wilcoxon signed-rank, p < 0.05)


Sample Size Calculation Validation

50 benchmarks

Every sample size calculation is validated against published formulas from Chow et al. (2018), Cohen (1988), Julious (2023), Schoenfeld (1983), and statsmodels/scipy reference implementations. Our system uses executed Python code (not LLM estimation) to compute exact sample sizes.

Exact Match

70%

35/50

Within 5%

100%

50/50

Within 10%

100%

50/50

Mean Deviation

0.4%

median 0.0%

| ID | Scenario | Category | Expected | Computed | Dev % | Result |
|----|----------|----------|----------|----------|-------|--------|
| V01 | Two independent means (equal groups) | t-test | 91 | 92 | 1.1% | Within 5% |
| V02 | Two independent means (unequal SD) | t-test | 58 | 59 | 1.7% | Within 5% |
| V03 | Paired t-test | t-test | 29 | 29 | 0.0% | Exact |
| V04 | Two proportions (chi-square) | Chi-square | 294 | 294 | 0.0% | Exact |
| V05 | Two proportions (small difference) | Proportions | 1,543 | 1,543 | 0.0% | Exact |
| V06 | One-way ANOVA (3 groups) | ANOVA | 53 | 53 | 0.0% | Exact |
| V07 | One-way ANOVA (4 groups, small effect) | ANOVA | 109 | 109 | 0.0% | Exact |
| V08 | Survival analysis (log-rank, Schoenfeld) | Survival | 380 | 380 | 0.0% | Exact |
| V09 | Survival analysis (HR=0.70, 90% power) | Survival | 328 | 330 | 0.6% | Within 5% |
| V10 | Non-inferiority trial (means) | NI/Equiv | 142 | 143 | 0.7% | Within 5% |
| V11 | Correlation test (r=0.30) | Correlation | 85 | 85 | 0.0% | Exact |
| V12 | Logistic regression (single predictor) | Regression | 262 | 265 | 1.1% | Within 5% |
| V13 | McNemar's test (paired proportions) | Paired prop | 113 | 113 | 0.0% | Exact |
| V14 | Two independent means (90% power) | t-test | 122 | 122 | 0.0% | Exact |
| V15 | Cluster-randomized trial (proportions) | Cluster | 875 | 870 | 0.6% | Within 5% |
| V16 | Equivalence trial (two proportions) | NI/Equiv | 199 | 200 | 0.5% | Within 5% |
| V17 | Repeated measures ANOVA | ANOVA | 42 | 42 | 0.0% | Exact |
| V18 | Two proportions (large effect) | Proportions | 141 | 141 | 0.0% | Exact |
| V19 | Two-sample t-test with 2:1 allocation | t-test | 118 | 119 | 0.8% | Within 5% |
| V20 | Single proportion (exact binomial) | Proportions | 58 | 58 | 0.0% | Exact |

Expected values from Chow et al. (2018), Cohen (1988), Julious (2023), Schoenfeld (1983), and statsmodels/scipy reference implementations.
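To illustrate the kind of closed-form check behind these benchmarks, here is a minimal Python sketch of the standard normal-approximation sample size formula for comparing two independent means, in the style of Chow et al. (2018). The effect size and error rates below are illustrative assumptions, not parameters from any specific benchmark row.

```python
import math
from statistics import NormalDist

def n_per_group_two_means(delta, sigma, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for a two-sided,
    two-sample comparison of means: n = 2 * (sigma/delta)^2 * (z_a + z_b)^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # power quantile
    n = 2 * (sigma / delta) ** 2 * (z_alpha + z_beta) ** 2
    return math.ceil(n)  # round up to a whole subject per group

# Illustrative: detect a difference of 0.5 SD with 80% power at alpha = 0.05.
print(n_per_group_two_means(delta=0.5, sigma=1.0))  # -> 63
```

The exact t-distribution calculation (as in statsmodels' power routines) gives 64 for this scenario; the normal approximation undercounts slightly, which is one reason the validation suite compares against multiple published references rather than a single formula.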

Why Researchers Choose Us

Purpose-built for medical research planning, not general-purpose chat.

Validated Code Generation

Every statistical script is generated by a specialized CodingAgent and validated through a DiagnosticTool. No more debugging GPT output -- get production-ready R, Python, and STATA code.

Multi-Agent Architecture

Seven specialized agents work in concert: literature search, evidence appraisal, methodology design, biostatistics, coding, diagnostics, and summarization. Each expert does what it does best.

Ethics-First Approach

Automatically surfaces IRB requirements, informed consent protocols, and data privacy considerations. The only research assistant that consistently prioritizes ethical compliance.

EQUATOR-Aligned Reporting

Methodology recommendations follow CONSORT, STROBE, PRISMA, and other EQUATOR Network guidelines. Your study design meets journal submission standards from day one.

End-to-End Workflow

From identifying research gaps through literature search, to designing study methodology, to calculating sample sizes and generating analysis code -- all in one conversation.

Rigorous Validation

100% of sample size calculations fall within 5% of published formulas across 50 benchmarks (Chow, Cohen, Julious, Schoenfeld). Code is executed in a sandbox, not estimated by an LLM.

How We Tested

01

40 Expert Scenarios

Curated medical research scenarios for advanced users across biostatistics, methodology, and edge cases. 14 specialties represented.

02

Blinded Evaluation

Systems randomly assigned as 'System A' or 'System B'. The judge never knew which was which.

03

Triple-Run Consistency

Each case judged 3 times independently. 93% exact agreement, 100% within-1 agreement across runs.
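The consistency metrics above can be sketched in a few lines of Python. The scores below are hypothetical stand-ins (the real data is per-dimension 1-5 integer scores from three judge runs); the function simply measures pairwise run-to-run agreement.

```python
from itertools import combinations

def agreement(runs):
    """runs: parallel lists of integer scores, one list per judge run.
    Returns (exact_agreement, within_1_agreement) as fractions of all
    pairwise run-to-run score comparisons."""
    pairs = exact = within1 = 0
    for a, b in combinations(runs, 2):   # every pair of runs
        for x, y in zip(a, b):           # every scored dimension
            pairs += 1
            exact += (x == y)
            within1 += (abs(x - y) <= 1)
    return exact / pairs, within1 / pairs

# Hypothetical scores from three runs over four dimensions:
runs = [[5, 4, 3, 5], [5, 4, 4, 5], [5, 4, 3, 5]]
print(agreement(runs))  # one 3-vs-4 disagreement: high exact, 100% within-1
```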

04

Statistical Rigor

Wilcoxon signed-rank tests with Bonferroni correction (alpha/17 = 0.05/17 ≈ 0.0029). No cherry-picking.
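The correction step is simple to reproduce. The sketch below applies the alpha/17 threshold to the per-dimension p-values quoted earlier on this page; M5's "p < 0.001" is entered as 0.0005 purely as a stand-in value.

```python
def bonferroni_survivors(p_values, alpha=0.05, n_tests=17):
    """Return labels whose p-values fall below the Bonferroni-corrected
    threshold alpha / n_tests (here 0.05 / 17, about 0.0029)."""
    threshold = alpha / n_tests
    return [label for label, p in p_values.items() if p < threshold]

p_values = {
    "M5 Ethical Considerations": 0.0005,  # stand-in for p < 0.001
    "B3 Code Correctness": 0.005,
    "B8 Code Quality": 0.008,
    "M6 Reporting Standards": 0.004,
    "M4 Bias Identification": 0.029,
}
print(bonferroni_survivors(p_values))  # -> ['M5 Ethical Considerations']
```

Consistent with the claim above, only Ethical Considerations clears the corrected threshold; the other four dimensions are significant at p < 0.05 but not after correction.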

Ready to Elevate Your Research?

Join researchers who trust our specialized AI assistant for gap analysis, study design, and biostatistical planning.

Start a Research Session