Platonic Representations for Poverty Mapping: Unified Vision-Language Codes or Agent-Induced Novelty?

Murugaboopathy, Satiyabooshan; Jerzak, Connor T.; Daoud, Adel


arXiv:2508.01109 (cs)

[Submitted on 1 Aug 2025]


Authors: Satiyabooshan Murugaboopathy, Connor T. Jerzak, Adel Daoud


Abstract: We investigate whether socio-economic indicators like household wealth leave recoverable imprints in satellite imagery (capturing physical features) and Internet-sourced text (reflecting historical/economic narratives). Using Demographic and Health Survey (DHS) data from African neighborhoods, we pair Landsat images with LLM-generated textual descriptions conditioned on location/year and text retrieved by an AI search agent from web sources. We develop a multimodal framework predicting household wealth (International Wealth Index) through five pipelines: (i) vision model on satellite images, (ii) LLM using only location/year, (iii) AI agent searching/synthesizing web text, (iv) joint image-text encoder, (v) ensemble of all signals. Our framework yields three contributions. First, fusing vision and agent/LLM text outperforms vision-only baselines in wealth prediction (e.g., R-squared of 0.77 vs. 0.63 on out-of-sample splits), with LLM-internal knowledge proving more effective than agent-retrieved text, improving robustness to out-of-country and out-of-time generalization. Second, we find partial representational convergence: fused embeddings from vision/language modalities correlate moderately (median cosine similarity of 0.60 after alignment), suggesting a shared latent code of material well-being while retaining complementary details, consistent with the Platonic Representation Hypothesis. Although LLM-only text outperforms agent-retrieved data, challenging our Agent-Induced Novelty Hypothesis, modest gains from combining agent data in some splits weakly support the notion that agent-gathered information introduces unique representational structures not fully captured by static LLM knowledge. Third, we release a large-scale multimodal dataset comprising more than 60,000 DHS clusters linked to satellite images, LLM-generated descriptions, and agent-retrieved texts.
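
The abstract reports a median cosine similarity of 0.60 "after alignment" but does not say how the alignment is performed. The following is a minimal sketch of one way such a statistic could be computed. It assumes an orthogonal Procrustes rotation fitted on one subset of clusters and evaluated on another, and assumes both embedding matrices have the same width. Neither assumption comes from the paper, and all function names and the synthetic data are hypothetical.

```python
import numpy as np


def procrustes_rotation(source, target):
    """Orthogonal matrix R minimizing ||source @ R - target||_F."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt


def median_cosine_after_alignment(vision, text, fit_idx, eval_idx):
    """Rotate vision embeddings onto text embeddings, then score held-out rows.

    ASSUMPTION: both matrices have shape (n_clusters, d) with the same d.
    """
    # Center both modalities using statistics from the fitting subset only.
    v = vision - vision[fit_idx].mean(axis=0)
    t = text - text[fit_idx].mean(axis=0)

    # Fit the rotation on the fitting subset, apply it to the evaluation subset.
    rotation = procrustes_rotation(v[fit_idx], t[fit_idx])
    aligned = v[eval_idx] @ rotation
    target = t[eval_idx]

    # Row-wise cosine similarity between aligned vision and text embeddings.
    dots = np.sum(aligned * target, axis=1)
    norms = np.linalg.norm(aligned, axis=1) * np.linalg.norm(target, axis=1)
    return float(np.median(dots / norms))


def demo_median_cosine(seed=0):
    """Synthetic check: two noisy views of one latent code, one of them rotated."""
    rng = np.random.default_rng(seed)
    n, d = 400, 8
    latent = rng.normal(size=(n, d))
    q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthogonal matrix
    vision = latent + 0.1 * rng.normal(size=(n, d))
    text = latent @ q + 0.1 * rng.normal(size=(n, d))
    fit_idx = np.arange(0, 300)
    eval_idx = np.arange(300, n)
    return median_cosine_after_alignment(vision, text, fit_idx, eval_idx)
```

In this synthetic setting the two views share almost all of their structure, so the aligned similarity is close to 1. A value near 0.60 on real embeddings would indicate a partly shared and partly modality-specific representation, as the abstract describes. Real vision and text encoders usually differ in width, so a projection to a common dimension would be needed first; this sketch leaves that step out.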

Comments:	7 figures
Subjects:	Artificial Intelligence (cs.AI)
MSC classes:	68T07
ACM classes:	I.2; J.4
Cite as:	arXiv:2508.01109 [cs.AI]
	(or arXiv:2508.01109v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2508.01109

Submission history

From: Connor Jerzak
[v1] Fri, 1 Aug 2025 23:07:16 UTC (3,639 KB)
