CrossCat: A Fully Bayesian Nonparametric Method for Analyzing Heterogeneous, High Dimensional Data

Mansinghka, Vikash; Shafto, Patrick; Jonas, Eric; Petschulat, Cap; Gasner, Max; Tenenbaum, Joshua B.


arXiv:1512.01272 (cs)

[Submitted on 3 Dec 2015]


Title: CrossCat: A Fully Bayesian Nonparametric Method for Analyzing Heterogeneous, High Dimensional Data

Authors: Vikash Mansinghka, Patrick Shafto, Eric Jonas, Cap Petschulat, Max Gasner, Joshua B. Tenenbaum


Abstract: There is a widespread need for statistical methods that can analyze high-dimensional datasets without imposing restrictive or opaque modeling assumptions. This paper describes a domain-general data analysis method called CrossCat. CrossCat infers multiple non-overlapping views of the data, each consisting of a subset of the variables, and uses a separate nonparametric mixture to model each view. CrossCat is based on approximately Bayesian inference in a hierarchical, nonparametric model for data tables. This model consists of a Dirichlet process mixture over the columns of a data table in which each mixture component is itself an independent Dirichlet process mixture over the rows; the inner mixture components are simple parametric models whose form depends on the types of data in the table. CrossCat combines strengths of mixture modeling and Bayesian network structure learning. Like mixture modeling, CrossCat can model a broad class of distributions by positing latent variables, and produces representations that can be efficiently conditioned and sampled from for prediction. Like Bayesian networks, CrossCat represents the dependencies and independencies between variables, and thus remains accurate when there are multiple statistical signals. Inference is done via a scalable Gibbs sampling scheme; this paper shows that it works well in practice. This paper also includes empirical results on heterogeneous tabular data of up to 10 million cells, such as hospital cost and quality measures, voting records, unemployment rates, gene expression measurements, and images of handwritten digits. CrossCat infers structure that is consistent with accepted findings and common-sense knowledge in multiple domains and yields predictive accuracy competitive with generative, discriminative, and model-free alternatives.
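
The nested structure described in the abstract (a Dirichlet process mixture over columns, with an independent Dirichlet process mixture over rows inside each view) can be illustrated with a minimal sketch that samples only the latent partitions from the prior. It uses the Chinese restaurant process, the standard representation of the partition induced by a Dirichlet process mixture. The function names, the fixed concentration parameters `alpha_cols` and `alpha_rows`, and the use of Python's `random` module are assumptions made for illustration; this is not the authors' implementation, and it omits the per-cluster parametric data models and the Gibbs sampler entirely.

```python
import random


def crp_partition(n, alpha, rng):
    """Sample cluster labels for n items from a Chinese restaurant process.

    alpha > 0 is the concentration parameter (hypothetical name).
    """
    assignments = []
    counts = []  # counts[k] = number of items already in cluster k
    for _ in range(n):
        # An existing cluster k has weight counts[k]; a new cluster has weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)  # open a new cluster
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments


def sample_crosscat_structure(n_rows, n_cols, alpha_cols=1.0, alpha_rows=1.0, seed=0):
    """Sample a column-to-view partition and, per view, a row-to-cluster partition.

    Illustrative only: concentrations are held fixed here as a simplification.
    """
    rng = random.Random(seed)
    # Outer partition: each column is assigned to exactly one view.
    col_to_view = crp_partition(n_cols, alpha_cols, rng)
    n_views = max(col_to_view) + 1
    # Inner partitions: every view clusters the same rows independently.
    row_to_cluster = [crp_partition(n_rows, alpha_rows, rng) for _ in range(n_views)]
    return col_to_view, row_to_cluster


if __name__ == "__main__":
    views, rows = sample_crosscat_structure(n_rows=10, n_cols=6, seed=1)
    print("column -> view:", views)
    for v, labels in enumerate(rows):
        print(f"view {v} row clusters:", labels)
```

Because each view carries its own row partition, two columns placed in different views are modeled as independent, while columns sharing a view share the same row clustering. This is the sense in which the views are non-overlapping subsets of the variables.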

Subjects:	Artificial Intelligence (cs.AI); Computation (stat.CO); Machine Learning (stat.ML)
Cite as:	arXiv:1512.01272 [cs.AI]
	(or arXiv:1512.01272v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.1512.01272

Submission history

From: Vikash Mansinghka
[v1] Thu, 3 Dec 2015 22:39:37 UTC (8,827 KB)
