Beyond conventional P-values: Addressing statistical challenges in big data

Date

2026

Authors

Zhang, Jing

Abstract

Do larger sample sizes lead to higher false positive rates in statistical analysis? The answer provided by ChatGPT 4o is "no", an opinion shared by many statisticians. However, empirical evidence from analyses of large datasets, such as those from biobanks and single-cell genomics, challenges this conclusion. Common practice assesses both p-values and effect sizes to mitigate the risk of identifying spurious effects in large samples. Nonetheless, the need to adjust p-values in these contexts remains unaddressed, which motivated this investigation. We found that these common beliefs and practices are incorrect in real-world data analysis, since theoretical assumptions are always violated. Growing sample sizes can amplify the impact of these violations, inflating false positive rates. Using a simulation study, we provide examples to support this claim and illustrate a permutation-based remedy. The intended contribution of this work is to heighten awareness within our community of the pressing need to reevaluate standard statistical methods when analyzing datasets with huge sample sizes, thereby inspiring further substantial efforts to tackle this emerging challenge of the big data era.
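The mechanism described in the abstract can be illustrated with a small simulation. The sketch below is not the author's simulation study; it assumes one particular violation (cells from the same donor are correlated, as can happen in single-cell data) and one particular remedy (an exact permutation test that reshuffles group labels across whole donors). All function names and parameter values (`simulate_null_cells`, `donor_sd=0.3`, four donors per group) are hypothetical choices made for illustration. Under these assumptions there is no true group effect, so every rejection is a false positive.

```python
import itertools
import math
import random
from statistics import NormalDist, fmean


def simulate_null_cells(rng, donors_per_group, cells_per_donor, donor_sd):
    """Null data for two groups: no group effect, but all cells from one
    donor share a random shift, so cells are not independent (assumed violation)."""
    groups = []
    for _ in range(2):
        donors = []
        for _ in range(donors_per_group):
            shift = rng.gauss(0.0, donor_sd)
            donors.append([shift + rng.gauss(0.0, 1.0) for _ in range(cells_per_donor)])
        groups.append(donors)
    return groups  # groups[g][d] is the list of cell values for donor d in group g


def sample_variance(values):
    """Unbiased sample variance."""
    m = fmean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def naive_p_value(groups):
    """Two-sided two-sample z-test (normal approximation to Welch's t-test)
    that wrongly treats every cell as an independent observation."""
    a = [x for donor in groups[0] for x in donor]
    b = [x for donor in groups[1] for x in donor]
    se = math.sqrt(sample_variance(a) / len(a) + sample_variance(b) / len(b))
    z = (fmean(a) - fmean(b)) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def donor_permutation_p_value(groups):
    """Exact permutation test on donor means: enumerate every way of
    assigning whole donors to the first group."""
    donor_means = [fmean(donor) for group in groups for donor in group]
    k = len(groups[0])
    n = len(donor_means)
    total = sum(donor_means)

    def abs_diff(indices):
        s = sum(donor_means[i] for i in indices)
        return abs(s / k - (total - s) / (n - k))

    observed = abs_diff(range(k))
    diffs = [abs_diff(idx) for idx in itertools.combinations(range(n), k)]
    # Small tolerance guards against floating-point ties.
    return sum(d >= observed - 1e-12 for d in diffs) / len(diffs)


def false_positive_rates(cells_per_donor, n_reps=300, donors_per_group=4,
                         donor_sd=0.3, alpha=0.05, seed=0):
    """Estimate the rejection rate of both tests when the null is true."""
    rng = random.Random(seed)
    naive_hits = 0
    perm_hits = 0
    for _ in range(n_reps):
        groups = simulate_null_cells(rng, donors_per_group, cells_per_donor, donor_sd)
        naive_hits += naive_p_value(groups) < alpha
        perm_hits += donor_permutation_p_value(groups) < alpha
    return naive_hits / n_reps, perm_hits / n_reps


if __name__ == "__main__":
    for m in (5, 50, 200):
        naive, perm = false_positive_rates(m)
        print(f"cells per donor = {m:4d}  naive FPR = {naive:.3f}  permutation FPR = {perm:.3f}")
```

In this setup the number of donors is held fixed while the number of cells per donor grows, so the total sample size increases without adding independent information about the group difference. The naive test's false positive rate should rise well above the nominal 0.05 as cells are added, while the donor-level permutation test should stay at or below it, because donors (not cells) are exchangeable under the null. Other violations, such as unmodeled confounding or distributional misfit, would need a different remedy.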

Keywords

Big data, Hypothesis testing, Inflated type I error, Violated model assumptions
