Beyond conventional P-values: Addressing statistical challenges in big data
Date
2026
Authors
Zhang, Jing
Abstract
Do larger sample sizes lead to higher false positive rates in statistical analysis? ChatGPT 4o answers ‘no’, an opinion shared by many statisticians. However, empirical evidence from analyses of large datasets, such as those from biobanks and single-cell genomics, challenges this conclusion. Common practice assesses both p-values and effect sizes to mitigate the risk of identifying spurious effects in large samples. Nonetheless, the need to adjust p-values in these contexts remains unaddressed, which motivated this investigation.
We find that these common beliefs and practices are incorrect in real-world data analysis, because theoretical assumptions are always violated. Growing sample sizes can amplify the impact of these violations, inflating false positive rates. Using a simulation study, we provide examples to support this claim and illustrate a permutation-based remedy.
This work’s intended contribution is to heighten awareness within our community of the pressing need to reevaluate standard statistical methods when analyzing datasets with huge sample sizes, thereby inspiring further substantial efforts to tackle this emerging challenge of the big data era.
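The abstract's central claim, that a growing sample size can amplify the effect of a violated assumption, can be illustrated with a small sketch. This is not the authors' simulation: the abstract does not say which violations or which permutation scheme were studied. The sketch assumes one specific violation (observations correlated within clusters but tested as if independent) and one specific remedy (an exact permutation test on cluster means). All function names and parameter values are hypothetical.

```python
import itertools
import math
import random
from statistics import NormalDist


def simulate_clusters(rng, n_clusters, cluster_size, cluster_sd):
    """Null data: each cluster shares a random shift; there is no group effect."""
    data = []
    for _ in range(n_clusters):
        shift = rng.gauss(0.0, cluster_sd)
        data.append([shift + rng.gauss(0.0, 1.0) for _ in range(cluster_size)])
    return data


def mean_and_variance(xs):
    """Sample mean and unbiased sample variance."""
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def naive_p(group_a, group_b):
    """Two-sided large-sample z-test that (wrongly) treats every
    observation as independent, ignoring the cluster structure."""
    xa = [x for cluster in group_a for x in cluster]
    xb = [x for cluster in group_b for x in cluster]
    ma, va = mean_and_variance(xa)
    mb, vb = mean_and_variance(xb)
    z = (ma - mb) / math.sqrt(va / len(xa) + vb / len(xb))
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def cluster_permutation_p(clusters, n_a):
    """Exact permutation test: relabel whole clusters, not observations.
    Assumes the first n_a clusters form group A and equal cluster sizes."""
    means = [sum(c) / len(c) for c in clusters]
    total = sum(means)
    n_b = len(means) - n_a

    def stat(idx):
        sum_a = sum(means[i] for i in idx)
        return abs(sum_a / n_a - (total - sum_a) / n_b)

    observed = stat(range(n_a))
    stats = [stat(idx) for idx in itertools.combinations(range(len(means)), n_a)]
    # Small tolerance guards against floating-point ties.
    return sum(s >= observed - 1e-12 for s in stats) / len(stats)


def false_positive_rates(rng, cluster_size, n_reps=300, n_clusters=10,
                         cluster_sd=0.3, alpha=0.05):
    """Fraction of null datasets rejected by each test."""
    n_a = n_clusters // 2
    naive_hits = perm_hits = 0
    for _ in range(n_reps):
        clusters = simulate_clusters(rng, n_clusters, cluster_size, cluster_sd)
        naive_hits += naive_p(clusters[:n_a], clusters[n_a:]) < alpha
        perm_hits += cluster_permutation_p(clusters, n_a) <= alpha
    return naive_hits / n_reps, perm_hits / n_reps


rng = random.Random(0)
# Same number of clusters; only the observations per cluster grow.
naive_small, perm_small = false_positive_rates(rng, cluster_size=5)
naive_large, perm_large = false_positive_rates(rng, cluster_size=200)
print("n per cluster = 5:   naive", naive_small, " permutation", perm_small)
print("n per cluster = 200: naive", naive_large, " permutation", perm_large)
```

Under these assumptions the naive test's standard error shrinks with the total number of observations, while the true uncertainty is bounded by the number of clusters, so its rejection rate under the null should climb well above the nominal 0.05 as clusters grow. The cluster-level permutation test should stay close to the nominal level because cluster means remain exchangeable under the null.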
Keywords
Big data, Hypothesis testing, Inflated type I error, Violated model assumptions