How to Calculate Sample Size for a Clinical Trial (With Free AI Tool)
Learn how to calculate sample size for clinical trials step by step -- including power analysis, effect size, and free Python/R code. No biostatistician required.

Physician-data scientist at King Chulalongkorn Memorial Hospital, Bangkok. Research interests: Machine learning, causal inference, and AI in healthcare.
If your study is underpowered, it doesn't matter how well you run it.
An underpowered study -- one with too few participants to reliably detect the effect you're looking for -- wastes months of work, fails to reach statistical significance, and gets rejected at peer review. Most early-career researchers only discover this after the damage is done.
Sample size calculation is the single most important decision you make before a study begins. This guide walks you through how to do it correctly, what the common mistakes are, and how to get a validated calculation with generated code -- without waiting weeks for a biostatistician consultation. If you want a broader view of where sample size fits, see our guide to the three-phase research planning approach.
What Is Sample Size Calculation?
Sample size calculation is the process of determining the minimum number of participants needed in a study to reliably detect a meaningful difference or effect, if one truly exists.
Get it right and your study has enough statistical power to produce publishable results. Get it wrong in either direction -- too few participants, and your study is underpowered; too many, and you waste resources and potentially expose more patients to experimental treatments than necessary.
The calculation is based on four inputs that you must define before you begin:
- Alpha (α): Your significance threshold -- typically 0.05, meaning you accept a 5% chance of a false positive (Type I error)
- Power (1 − β): Your desired probability of detecting a true effect -- typically 0.80 (80%) or 0.90 (90%)
- Effect size: The minimum clinically meaningful difference you want to detect
- Standard deviation (for continuous outcomes): The expected variability in your outcome measure
Change any one of these and your required sample size changes. This is why two researchers studying similar questions can end up with very different sample size requirements.
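To see that sensitivity concretely, here is a minimal Python sketch using statsmodels (the same library used in Step 5). The effect sizes and power targets are illustrative, not recommendations:

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Baseline: medium effect (d = 0.5), alpha = 0.05, 80% power
baseline = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)

# Raise power to 0.90, holding everything else fixed
higher_power = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.90)

# Halve the effect size, holding everything else fixed
smaller_effect = analysis.solve_power(effect_size=0.25, alpha=0.05, power=0.80)

print(f"d = 0.50, power = 0.80: {baseline:.0f} per group")        # ≈ 64
print(f"d = 0.50, power = 0.90: {higher_power:.0f} per group")    # ≈ 85
print(f"d = 0.25, power = 0.80: {smaller_effect:.0f} per group")  # ≈ 252

Notice that halving the effect size roughly quadruples the required sample, because n scales with 1/d².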
Step 1: Define Your Primary Outcome
Before calculating anything, you need to know what you're measuring. Your primary outcome determines which formula you'll use.
Continuous outcomes (e.g., blood pressure, HbA1c, pain score) use a different formula than binary outcomes (e.g., mortality, readmission, disease occurrence). Getting this wrong means your entire calculation is invalid.
Ask yourself:
- Is my outcome a number on a continuous scale, or a yes/no event?
- What is the clinically meaningful difference I want to detect?
- What does the existing literature suggest about variability in this outcome?
Step 2: Decide on Alpha and Power
For most clinical research, the convention is:
- α = 0.05 (two-tailed)
- Power = 0.80
However, these are not rules -- they are defaults. If your study has serious consequences for a false negative (e.g., missing a harmful drug effect), you should consider increasing power to 0.90 or even 0.95. If you are running a preliminary study where some tolerance for error is acceptable, a lower power target may be justified.
Document your reasoning. Reviewers and ethics boards will ask why you chose your parameters.
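If you want the critical values behind these conventions rather than a lookup table, they come straight from the standard normal distribution; a minimal sketch with scipy:

from scipy.stats import norm

alpha = 0.05  # two-tailed significance level
power = 0.80  # target power

z_alpha = norm.ppf(1 - alpha / 2)  # critical value for two-tailed alpha
z_beta = norm.ppf(power)           # critical value for the power target

print(f"Z_alpha/2 = {z_alpha:.2f}")  # 1.96
print(f"Z_beta    = {z_beta:.2f}")   # 0.84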
Step 3: Determine Your Effect Size
Effect size is where most researchers struggle. You have three main approaches:
- Use published literature. Search for similar studies and use the mean difference and standard deviation they reported. This is the most defensible approach for grant applications and ethics submissions.
- Define a minimum clinically important difference (MCID). For many validated outcome measures (e.g., SF-36, VAS pain scale), consensus MCIDs exist in the literature. Use these rather than inventing a threshold.
- Use Cohen's conventions as a last resort. Cohen's d of 0.2 = small, 0.5 = medium, 0.8 = large (Cohen, 1988). This is acceptable for exploratory studies but will be challenged by reviewers for confirmatory trials.
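Whichever approach you take, most software wants the standardized effect size. Here is a minimal sketch of deriving Cohen's d from published summary statistics; the means and SDs below are invented for illustration:

import math

# Hypothetical values pulled from a published study
mean_treatment = 135.0  # mean systolic BP in treatment arm (mmHg)
mean_control = 140.0    # mean in control arm (mmHg)
sd_treatment = 11.0     # per-arm standard deviations
sd_control = 13.0

# Pooled SD (equal group sizes assumed for simplicity)
sd_pooled = math.sqrt((sd_treatment**2 + sd_control**2) / 2)

d = abs(mean_treatment - mean_control) / sd_pooled
print(f"Cohen's d = {d:.2f}")  # 5 / 12.04 ≈ 0.42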
Step 4: Apply the Formula
For a Two-Sample t-test (Continuous Outcome)
The required sample size per group is:

n = 2 × (Zα/2 + Zβ)² × σ² / δ²

Where:
- Zα/2 = 1.96 for α = 0.05 two-tailed
- Zβ = 0.84 for 80% power; 1.28 for 90% power
- σ = expected standard deviation
- δ = minimum detectable difference
Worked Example: You want to detect a mean difference of 5 mmHg in systolic blood pressure (SD = 12 mmHg) with 80% power and α = 0.05.

n = 2 × (1.96 + 0.84)² × 12² / 5² = 2 × 7.84 × 144 / 25 ≈ 90.3

Round up to 91 participants per group, then add 10--15% for expected dropout: ~100 participants per group, 200 total.
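Here is the same arithmetic in Python, so you can check each step of the worked example:

import math
from scipy.stats import norm

sigma = 12.0  # expected SD of systolic BP (mmHg)
delta = 5.0   # minimum detectable mean difference (mmHg)
alpha, power = 0.05, 0.80

z_alpha = norm.ppf(1 - alpha / 2)  # 1.96
z_beta = norm.ppf(power)           # 0.84

# n per group = 2 * (Z_alpha/2 + Z_beta)^2 * sigma^2 / delta^2
n = 2 * (z_alpha + z_beta) ** 2 * sigma**2 / delta**2
print(f"n per group: {n:.1f}")  # ≈ 90.4 (90.3 if you use the rounded 1.96 and 0.84)
print(f"With 10-15% dropout: {math.ceil(n * 1.10)}-{math.ceil(n * 1.15)} per group")  # 100-104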
For a Two-Proportion z-test (Binary Outcome)
n = (Zα/2 + Zβ)² × [p₁(1 − p₁) + p₂(1 − p₂)] / (p₁ − p₂)²

Where p₁ and p₂ are the expected proportions in each group.
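A minimal Python sketch of this formula, using illustrative proportions (say, an expected drop in readmission from 30% to 20%):

import math
from scipy.stats import norm

p1, p2 = 0.30, 0.20  # hypothetical expected proportions in each group
alpha, power = 0.05, 0.80

z_alpha = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(power)

# n per group = (Z_alpha/2 + Z_beta)^2 * [p1(1-p1) + p2(1-p2)] / (p1 - p2)^2
n = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2
print(f"n per group: {math.ceil(n)}")  # ≈ 291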
Step 5: Generate and Verify Your Code
Performing these calculations by hand introduces errors. The standard practice is to use statistical software -- and critically, to show your code so reviewers can reproduce your result.
Here is validated Python code using statsmodels for the two-sample t-test example above:
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Standardized effect size: delta / sigma = Cohen's d
effect_size = 5 / 12

n_per_group = analysis.solve_power(
    effect_size=effect_size,
    alpha=0.05,
    power=0.80,
    alternative='two-sided'
)

print(f"Required n per group: {n_per_group:.1f}")
# Output: approximately 91 -- statsmodels solves the exact noncentral-t
# problem, so its answer runs slightly above the 90.3 given by the
# normal-approximation formula

And in R using the pwr package:
library(pwr)

pwr.t.test(
  d = 5 / 12,
  sig.level = 0.05,
  power = 0.80,
  type = "two.sample",
  alternative = "two.sided"
)
# n: approximately 91 per group, matching the statsmodels result

Run this code yourself to verify. Never submit a sample size calculation you cannot reproduce independently.
Common Mistakes to Avoid
For a deeper look at these and other errors, see 5 sample size pitfalls that trip up even experienced researchers.
1. Using a one-tailed test when two-tailed is appropriate. A one-tailed test needs a noticeably smaller sample (for α = 0.05 and 80% power, roughly 20% smaller, since Zα = 1.645 replaces Zα/2 = 1.96). Reviewers know this -- unless you have a strong prior reason to test in only one direction, always use two-tailed.
2. Ignoring clustering effects in cluster-randomized trials. If you are randomizing at the group level (e.g., clinics, schools), you must apply a design effect correction based on the intraclass correlation coefficient (ICC); a sketch of the correction appears after this list. Ignoring this can underpower your study by 50% or more.
3. Calculating sample size after data collection. Post-hoc power calculations are widely considered invalid. Sample size must be determined before the study begins.
4. Not accounting for dropout. Always inflate your sample size by your expected attrition rate (the sketch after this list shows one way to do it). For longitudinal studies, a 15--20% dropout allowance is standard.
5. Changing the primary outcome after the fact. Your sample size is calculated for a specific outcome. Switching outcomes after data collection is considered research misconduct and a major red flag for reviewers.
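To make mistakes 2 and 4 concrete, here is a minimal sketch of the design-effect correction and dropout inflation; the ICC, cluster size, and dropout rate are illustrative assumptions, not defaults to copy:

import math

n_individual = 91   # n per group from the worked example above
icc = 0.05          # hypothetical intraclass correlation coefficient
cluster_size = 20   # hypothetical average participants per cluster
dropout = 0.15      # expected attrition rate

# Design effect for cluster randomization: DEFF = 1 + (m - 1) * ICC
deff = 1 + (cluster_size - 1) * icc
n_clustered = n_individual * deff

# Dividing by the retention rate is slightly more conservative than
# multiplying by (1 + dropout)
n_enrol = math.ceil(n_clustered / (1 - dropout))

print(f"Design effect: {deff:.2f}")                         # 1.95
print(f"n per group after DEFF: {math.ceil(n_clustered)}")  # 178
print(f"n per group to enrol:   {n_enrol}")                 # 209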
Frequently Asked Questions
Do I need a biostatistician to calculate sample size?
For complex designs (adaptive trials, cluster-randomized, survival analysis), yes -- a biostatistician review is strongly recommended. For standard two-group comparisons, the formulas above and the code provided are well-validated and sufficient for most ethics submissions.
What if my calculated sample size is not feasible?
You have three options: increase your effect size threshold (but justify it clinically), lower your power target (but acknowledge the limitation), or redesign the study (consider a pilot study first).
How do I report sample size in my methods section?
State your alpha, power, expected effect size with source, formula or software used, and resulting n per group. Example: "Sample size was calculated using a two-sided independent t-test (α = 0.05, power = 0.80). Based on a prior study, we assumed a mean difference of 5 mmHg (SD = 12 mmHg), requiring 91 participants per group. Accounting for 15% dropout, we will enroll 105 participants per group (total n = 210)."
Calculate Sample Size Automatically -- Free
Everything described in this guide is automated in ProtoCol.
Enter your research question, and ProtoCol's AI will determine the appropriate statistical test, ask for your parameters, run the power analysis, and generate Python, R, and Stata code -- all validated in a sandboxed interpreter before delivery.
Benchmarked against GPT-5 across 40 medical research scenarios, ProtoCol scores significantly higher on calculation accuracy (100% within 5% of published formulas) and code correctness (3.65 vs. 2.16).
Free to start. No credit card required.