A Bernstein polynomial approach for the estimation of cumulative distribution functions in the presence of missing data

Gharbi, Rihab; Jedidi, Wissem; Khardani, Salah; Ouimet, Frédéric


arXiv:2510.07235 (math)

[Submitted on 8 Oct 2025]


Authors: Rihab Gharbi, Wissem Jedidi, Salah Khardani, Frédéric Ouimet


Abstract: We study nonparametric estimation of univariate cumulative distribution functions (CDFs) when data are missing at random. The proposed estimators smooth the inverse probability weighted (IPW) empirical CDF with the Bernstein operator, yielding monotone, $[0,1]$-valued curves that automatically adapt to bounded supports. We analyze two versions: a pseudo estimator that uses known propensities and a feasible estimator that uses propensities estimated nonparametrically from discrete auxiliary variables, the latter scenario being much more common in practice. For both, we derive pointwise bias and variance expansions, establish the optimal polynomial degree $m$ with respect to the mean integrated squared error, and prove asymptotic normality. A key finding is that the feasible estimator has a smaller variance than the pseudo estimator by an explicit nonnegative correction term. We also develop an efficient degree selection procedure via least-squares cross-validation. Monte Carlo experiments demonstrate that, for moderate to large sample sizes, the Bernstein-smoothed feasible estimator outperforms both its unsmoothed counterpart and an integrated version of the IPW kernel density estimator proposed by Dubnicka (2009) in the same context. A real-data application to fasting plasma glucose from the 2017-2018 NHANES survey illustrates the method in a practical setting. All code needed to reproduce our analyses is available on GitHub.
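The construction described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' GitHub code, and several details are assumptions made here for illustration: the response is taken to lie in $[0,1]$, the IPW weights are normalized to sum to one (the paper may use a different normalization), the propensity for each unit is estimated as the observed fraction within its discrete auxiliary cell, and the function names `estimate_propensities`, `ipw_ecdf`, and `bernstein_ipw_cdf` are hypothetical.

```python
from math import comb


def estimate_propensities(aux, delta):
    """Cell-frequency propensity estimate (assumed form): for each unit,
    the fraction of observed responses among units sharing its discrete
    auxiliary value."""
    counts, observed = {}, {}
    for a, d in zip(aux, delta):
        counts[a] = counts.get(a, 0) + 1
        observed[a] = observed.get(a, 0) + d
    return [observed[a] / counts[a] for a in aux]


def ipw_ecdf(y, delta, pi, t):
    """Normalized IPW empirical CDF at t. Missing responses (delta == 0)
    get weight zero, so their y value is never inspected."""
    weights = [1.0 / p if d == 1 else 0.0 for d, p in zip(delta, pi)]
    total = sum(weights)
    hit = sum(w for yi, d, w in zip(y, delta, weights) if d == 1 and yi <= t)
    return hit / total


def bernstein_ipw_cdf(y, delta, pi, m, x):
    """Bernstein operator of degree m applied to the IPW empirical CDF:
    sum over k of F_n(k/m) * C(m, k) * x^k * (1 - x)^(m - k), x in [0, 1]."""
    grid = [ipw_ecdf(y, delta, pi, k / m) for k in range(m + 1)]
    return sum(
        g * comb(m, k) * x**k * (1 - x) ** (m - k) for k, g in enumerate(grid)
    )


# Toy data: None marks a missing response; aux is a discrete covariate.
y = [0.1, None, 0.5, 0.9]
delta = [1, 0, 1, 1]
aux = ["a", "a", "b", "b"]

pi_hat = estimate_propensities(aux, delta)
estimate = bernstein_ipw_cdf(y, delta, pi_hat, m=2, x=0.5)
```

Because the normalized IPW empirical CDF is nondecreasing with values in $[0,1]$, and the Bernstein operator preserves monotonicity and bounds, the smoothed curve in this sketch inherits both properties. Passing known propensities instead of `pi_hat` would correspond to the pseudo version.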

Comments:	25 pages, 1 figure, 3 tables
Subjects:	Statistics Theory (math.ST); Methodology (stat.ME)
MSC classes:	62G05, 62E20, 62G08, 62G20
Cite as:	arXiv:2510.07235 [math.ST]
	(or arXiv:2510.07235v1 [math.ST] for this version)
	https://doi.org/10.48550/arXiv.2510.07235

Submission history

From: Frédéric Ouimet
[v1] Wed, 8 Oct 2025 17:03:03 UTC (38 KB)
