Nonparametric Identification and Estimation of Ratios of Multi-Category Means under Preferential Sampling

Hopkins, Grant; Teichman, Sarah; Graham, Ellen; Willis, Amy D


arXiv:2510.23920 (stat)

[Submitted on 27 Oct 2025]


Authors: Grant Hopkins, Sarah Teichman, Ellen Graham, Amy D Willis

Abstract: Multi-category data arise in diverse fields including marketing, chemistry, public policy, genomics, political science, and ecology. We consider the problem of estimating ratios of category-specific means in a fully nonparametric setting, allowing for both observational units and categories to be preferentially sampled. We consider covariate-adjusted and unadjusted estimands that are nonparametrically defined and straightforward to interpret. While identifiability for related models has been established through parametric distributions or restrictions on the conditional mean (e.g., log-linearity), we show that identifiability can be obtained through an independence assumption or a category constraint, such as a reference category or a centering function. We develop an efficient, doubly robust targeted minimum loss-based estimator with excellent finite-sample performance, including in the setting of a large number of infrequently observed categories. We contrast the performance of our method with related approaches via simulation, and apply it to identify bacteria that are differentially abundant in diarrheal cases compared to controls. Our work provides a general framework for studying parameter identifiability in compositional data settings without requiring parametric assumptions on the data distribution.
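The role of a category constraint can be illustrated with a small numerical sketch. This is not the paper's targeted minimum loss-based estimator; it is a naive plug-in ratio of sample means on made-up numbers. The assumptions, all hypothetical, are that each category j is observed with its own efficiency `eff[j]` and that cases are sampled three times as intensively as controls. The function names and data are invented for illustration.

```python
import numpy as np

def log_mean_ratios(case, ctrl):
    """Per-category log ratio of sample means, cases over controls."""
    return np.log(case.mean(axis=0)) - np.log(ctrl.mean(axis=0))

def reference_constraint(log_ratios, ref=0):
    """Express every log ratio relative to a chosen reference category."""
    return log_ratios - log_ratios[ref]

def centering_constraint(log_ratios):
    """Center log ratios so that they average to zero across categories."""
    return log_ratios - log_ratios.mean()

# Hypothetical true abundances (rows: units, columns: categories).
true_ctrl = np.array([[10.0, 20.0, 40.0], [30.0, 20.0, 40.0]])   # means 20, 20, 40
true_case = np.array([[20.0, 40.0, 40.0], [20.0, 120.0, 40.0]])  # means 20, 80, 40
true_log_ratios = log_mean_ratios(true_case, true_ctrl)          # log of [1, 4, 1]

# Assumed preferential sampling: per-category efficiencies, plus a
# sampling intensity that is three times higher for cases.
eff = np.array([0.5, 2.0, 1.0])
obs_ctrl = true_ctrl * eff
obs_case = 3.0 * true_case * eff

# Category efficiencies cancel within each category, but the group-level
# factor of 3 shifts every log ratio by the same constant.
raw = log_mean_ratios(obs_case, obs_ctrl)

# Either constraint removes the common shift.
by_reference = reference_constraint(raw, ref=0)
by_centering = centering_constraint(raw)
```

In this toy setup the raw log ratios are off by a common additive constant, so only their differences across categories carry information. Fixing a reference category recovers the true log ratios here because the chosen reference has a true ratio of 1; in general the constrained quantity is a ratio relative to the reference, or relative to the center.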

Subjects:	Methodology (stat.ME)
Cite as:	arXiv:2510.23920 [stat.ME]
	(or arXiv:2510.23920v1 [stat.ME] for this version)
	https://doi.org/10.48550/arXiv.2510.23920

Submission history

From: Amy Willis
[v1] Mon, 27 Oct 2025 23:03:56 UTC (921 KB)
