
Human-Computer Interaction

  • New submissions
  • Cross-lists
  • Replacements


Showing new listings for Wednesday, 6 August 2025

Total of 42 entries

New submissions (showing 18 of 18 entries)

[1] arXiv:2508.02679 [Chinese pdf, pdf, html, other]
Title: LLM Agent-Based Simulation of Student Activities and Mental Health Using Smartphone Sensing Data
Wayupuk Sommuang, Kun Kerdthaisong, Pasin Buakhaw, Aslan B. Wong, Nutchanon Yongsatianchot
Subjects: Human-Computer Interaction (cs.HC)

Students' mental well-being is vital for academic success, with activities such as studying, socializing, and sleeping playing a role. Current mobile sensing data highlight this intricate link using statistical and machine learning analyses. We propose a novel LLM agent-based simulation framework to model student activities and mental health using the StudentLife Dataset. Each LLM agent was initialized with personality questionnaires and guided by smartphone sensing data throughout the simulated semester. These agents predict individual behaviors, provide self-reported mental health data via ecological momentary assessments (EMAs), and complete follow-up personality questionnaires. To ensure accuracy, we investigated various prompting techniques, memory systems, and activity-based mental state management strategies that dynamically update an agent's mental state based on their daily activities. This simulation goes beyond simply replicating existing data. This allows us to explore new scenarios that are not present in the original dataset, such as peer influence through agent-to-agent interactions and the impact of social media. Furthermore, we can conduct intervention studies by manipulating activity patterns via sensing signals and personality traits using questionnaire responses. This provides valuable insights into the behavioral changes that could enhance student well-being. The framework also facilitates hypothetical interviews with LLM agents, offering deeper insights into their mental health. This study showcases the power of LLM-driven behavioral modeling with sensing data, opening new avenues for understanding and supporting student mental health.
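
To make the simulation loop concrete, here is a minimal, hypothetical Python sketch of one simulated day for a single agent. The prompt fields, the sensing feature names, and the call_llm helper are illustrative assumptions, not the authors' implementation; the activity-based mental-state update is reduced to a single toy rule.

```python
from dataclasses import dataclass, field

@dataclass
class StudentAgent:
    persona: str                                   # from the personality questionnaire
    memory: list = field(default_factory=list)     # rolling summary of prior days
    mental_state: str = "neutral"                  # updated from daily activities

    def simulate_day(self, sensing: dict, call_llm) -> dict:
        prompt = (
            f"Persona: {self.persona}\n"
            f"Current mental state: {self.mental_state}\n"
            f"Recent memory: {' | '.join(self.memory[-3:])}\n"
            f"Today's smartphone sensing: sleep={sensing['sleep_hours']}h, "
            f"conversation={sensing['conversation_min']}min, "
            f"activity={sensing['activity_min']}min.\n"
            "Predict today's activities and answer the EMA items "
            "(stress 1-5, mood 1-5) as JSON."
        )
        response = call_llm(prompt)                # any chat-completion wrapper
        self.memory.append(f"day summary: {response}")
        # Toy activity-based mental-state update rule (illustrative only).
        self.mental_state = "strained" if sensing["sleep_hours"] < 6 else "stable"
        return response

if __name__ == "__main__":
    agent = StudentAgent(persona="introverted, high conscientiousness")
    fake_llm = lambda prompt: {"stress": 2, "mood": 4, "activities": ["study", "gym"]}
    print(agent.simulate_day({"sleep_hours": 7.5, "conversation_min": 40,
                              "activity_min": 60}, call_llm=fake_llm))
```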

[2] arXiv:2508.02680 [Chinese pdf, pdf, html, other]
Title: AnnoSense: A Framework for Physiological Emotion Data Collection in Everyday Settings for AI
Pragya Singh, Ankush Gupta, Mohan Kumar, Pushpendra Singh
Comments: To appear in IMWUT, September 2025
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Emotional and mental well-being are vital components of quality of life, and with the rise of smart devices like smartphones, wearables, and artificial intelligence (AI), new opportunities for monitoring emotions in everyday settings have emerged. However, for AI algorithms to be effective, they require high-quality data and accurate annotations. As the focus shifts towards collecting emotion data in real-world environments to capture more authentic emotional experiences, the process of gathering emotion annotations has become increasingly complex. This work explores the challenges of everyday emotion data collection from the perspectives of key stakeholders. We collected 75 survey responses, performed 32 interviews with the public, and 3 focus group discussions (FGDs) with 12 mental health professionals. The insights gained from a total of 119 stakeholders informed the development of our framework, AnnoSense, designed to support everyday emotion data collection for AI. This framework was then evaluated by 25 emotion AI experts for its clarity, usefulness, and adaptability. Lastly, we discuss the potential next steps and implications of AnnoSense for future research in emotion AI, highlighting its potential to enhance the collection and analysis of emotion data in real-world contexts.

[3] arXiv:2508.02817 [Chinese pdf, pdf, html, other]
Title: Real-World Receptivity to Adaptive Mental Health Interventions: Findings from an In-the-Wild Study
Nilesh Kumar Sahu, Aditya Sneh, Snehil Gupta, Haroon R Lone
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Signal Processing (eess.SP)

The rise of mobile health (mHealth) technologies has enabled real-time monitoring and intervention for mental health conditions using passively sensed smartphone data. Building on these capabilities, Just-in-Time Adaptive Interventions (JITAIs) seek to deliver personalized support at opportune moments, adapting to users' evolving contexts and needs. Although prior research has examined how context affects user responses to generic notifications and general mHealth messages, relatively little work has explored its influence on engagement with actual mental health interventions. Furthermore, while much of the existing research has focused on detecting when users might benefit from an intervention, less attention has been paid to understanding receptivity, i.e., users' willingness and ability to engage with and act upon the intervention. In this study, we investigate user receptivity through two components: acceptance (acknowledging or engaging with a prompt) and feasibility (ability to act given situational constraints). We conducted a two-week in-the-wild study with 70 students using a custom Android app, LogMe, which collected passive sensor data and active context reports to prompt mental health interventions. The adaptive intervention module was built using Thompson Sampling, a reinforcement learning algorithm. We address four research questions relating smartphone features and self-reported contexts to acceptance and feasibility, and examine whether an adaptive reinforcement learning approach can optimize intervention delivery by maximizing a combined receptivity reward. Our results show that several types of passively sensed data significantly influenced user receptivity to interventions. Our findings contribute insights into the design of context-aware, adaptive interventions that are not only timely but also actionable in real-world settings.
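
As a rough illustration of the adaptive module, below is a minimal Beta-Bernoulli Thompson Sampling sketch in Python. The three intervention arms, the binary reward (1 only when a prompt is both accepted and feasible), and the simulated receptivity rates are assumptions for the example, not the LogMe study's actual configuration.

```python
import random

class ThompsonSampler:
    """Beta-Bernoulli Thompson Sampling over a small set of intervention options."""
    def __init__(self, n_arms: int):
        self.alpha = [1.0] * n_arms   # prior successes + 1
        self.beta = [1.0] * n_arms    # prior failures + 1

    def select_arm(self) -> int:
        samples = [random.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm: int, reward: int) -> None:
        # reward = 1 if the prompt was both accepted and feasible, else 0
        self.alpha[arm] += reward
        self.beta[arm] += 1 - reward

# Example: three candidate intervention types delivered over a study period.
sampler = ThompsonSampler(n_arms=3)
true_receptivity = [0.2, 0.5, 0.35]           # unknown in practice
for _ in range(500):
    arm = sampler.select_arm()
    reward = int(random.random() < true_receptivity[arm])
    sampler.update(arm, reward)
print("posterior means:",
      [round(a / (a + b), 2) for a, b in zip(sampler.alpha, sampler.beta)])
```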

[4] arXiv:2508.02823 [Chinese pdf, pdf, html, other]
Title: NeuroSync: Intent-Aware Code-Based Problem Solving via Direct LLM Understanding Modification
Wenshuo Zhang, Leixian Shen, Shuchang Xu, Jindu Wang, Jian Zhao, Huamin Qu, Linping Yuan
Comments: Accepted at UIST 2025
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)

Conversational LLMs have been widely adopted by domain users with limited programming experience to solve domain problems. However, these users often face misalignment between their intent and generated code, resulting in frustration and rounds of clarification. This work first investigates the cause of this misalignment, which stems from bidirectional ambiguity: both user intents and coding tasks are inherently nonlinear, yet must be expressed and interpreted through linear prompts and code sequences. To address this, we propose direct intent-task matching, a new human-LLM interaction paradigm that externalizes and enables direct manipulation of the LLM understanding, i.e., the coding tasks and their relationships inferred by the LLM prior to code generation. As a proof-of-concept, this paradigm is then implemented in NeuroSync, which employs a knowledge distillation pipeline to extract LLM understanding, user intents, and their mappings, and enhances the alignment by allowing users to intuitively inspect and edit them via visualizations. We evaluate the algorithmic components of NeuroSync via technical experiments, and assess its overall usability and effectiveness via a user study (N=12). The results show that it enhances intent-task alignment, lowers cognitive effort, and improves coding efficiency.

[5] arXiv:2508.02868 [Chinese pdf, pdf, html, other]
Title: Critical Challenges in Content Moderation for People Who Use Drugs (PWUD): Insights into Online Harm Reduction Practices from Moderators
Kaixuan Wang, Loraine Clarke, Carl-Cyril J Dreue, Guancheng Zhou, Jason T. Jacques
Comments: 22 pages
Subjects: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)

Online communities serve as essential support channels for People Who Use Drugs (PWUD), providing access to peer support and harm reduction information. The moderation of these communities involves consequential decisions affecting member safety, yet existing sociotechnical systems provide insufficient support for moderators. Through interviews with experienced moderators from PWUD forums on Reddit, we analyse the unique nature of this work. We argue that this work constitutes a distinct form of public health intervention characterised by three moderation challenges: the need for specialised, expert risk assessment; time-critical crisis response; and the navigation of a structural conflict between platform policies and community safety goals. We demonstrate how current moderation systems are insufficient in supporting PWUD communities. For example, policies minimising platforms' legal exposure to illicit activities can inadvertently push moderators to implement restrictive rules to protect the community's existence, which can limit such a vulnerable group's ability to share potentially life-saving resources online. We conclude by identifying two necessary shifts in sociotechnical design to support moderators' work: first, moving to automated tools that support human sensemaking in contexts with competing interests; and second, shifting from systems that require moderators to perform low-level rule programming to those that enable high-level, example-based instruction. Further, we highlight how the design of sociotechnical systems in online spaces could impact harm reduction efforts aimed at improving health outcomes for PWUD communities.

[6] arXiv:2508.02958 [Chinese pdf, pdf, html, other]
Title: VRSight: An AI-Driven Scene Description System to Improve Virtual Reality Accessibility for Blind People
Daniel Killough, Justin Feng, Zheng Xue "ZX" Ching, Daniel Wang, Rithvik Dyava, Yapeng Tian, Yuhang Zhao
Comments: 17 pages, 10 figures, 2 tables, LaTeX; to appear at the ACM Symposium on User Interface Software and Technology (UIST 2025)
Subjects: Human-Computer Interaction (cs.HC)

Virtual Reality (VR) is inaccessible to blind people. While research has investigated many techniques to enhance VR accessibility, they require additional developer effort to integrate. As such, most mainstream VR apps remain inaccessible as the industry de-prioritizes accessibility. We present VRSight, an end-to-end system that recognizes VR scenes post hoc through a set of AI models (e.g., object detection, depth estimation, LLM-based atmosphere interpretation) and generates tone-based, spatial audio feedback, empowering blind users to interact in VR without developer intervention. To enable virtual element detection, we further contribute DISCOVR, a VR dataset consisting of 30 virtual object classes from 17 social VR apps, substituting real-world datasets that remain not applicable to VR contexts. Nine participants used VRSight to explore an off-the-shelf VR app (Rec Room), demonstrating its effectiveness in facilitating social tasks like avatar awareness and available seat identification.
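
A hedged sketch of how tone-based spatial audio feedback might be derived from a detection and a depth estimate: the field-of-view values, the gain and repetition rules, and the spatial_tone_params function are hypothetical illustrations, not VRSight's actual mapping.

```python
def spatial_tone_params(bbox_center, depth_m, image_size=(1280, 720),
                        horizontal_fov_deg=90.0, vertical_fov_deg=60.0):
    """Map a detected object's image position and estimated depth to an
    azimuth/elevation (degrees) and a simple tone specification."""
    (cx, cy), (w, h) = bbox_center, image_size
    azimuth = (cx / w - 0.5) * horizontal_fov_deg      # left (-) to right (+)
    elevation = (0.5 - cy / h) * vertical_fov_deg      # down (-) to up (+)
    # Closer objects -> louder and faster-repeating tone (illustrative rule).
    gain = max(0.1, min(1.0, 1.5 / max(depth_m, 0.5)))
    repeat_hz = max(0.5, min(6.0, 4.0 / max(depth_m, 0.5)))
    return {"azimuth_deg": round(azimuth, 1),
            "elevation_deg": round(elevation, 1),
            "gain": round(gain, 2),
            "repeat_hz": round(repeat_hz, 2)}

# A chair detected slightly right of center, estimated 2 m away.
print(spatial_tone_params(bbox_center=(800, 400), depth_m=2.0))
```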

[7] arXiv:2508.03014 [Chinese pdf, pdf, html, other]
Title: Survey of Large Language Models in Extended Reality: Technical Paradigms and Application Frontiers
Jingyan Wang, Yang Zhao, Haotian Mao, Xubo Yang
Comments: 29 pages, 5 tables
Subjects: Human-Computer Interaction (cs.HC)

Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation, and their integration with Extended Reality (XR) is poised to transform how users interact with immersive environments. This survey provides a comprehensive review of recent developments at the intersection of LLMs and XR, offering a structured organization of research along both technical and application dimensions. We propose a taxonomy of LLM-enhanced XR systems centered on key technical paradigms -- such as interactive agent control, XR development toolkits, and generative scene synthesis -- and discuss how these paradigms enable novel capabilities in XR. In parallel, we examine how LLM-driven techniques support practical XR applications across diverse domains, including immersive education, clinical healthcare, and industrial manufacturing. By connecting these technical paradigms with application frontiers, our survey highlights current trends, delineates design considerations, and identifies open challenges in building LLM-augmented XR systems. This work provides insights that can guide researchers and practitioners in advancing the state of the art in intelligent XR experiences.

[8] arXiv:2508.03061 [Chinese pdf, pdf, html, other]
Title: Facilitating Visual Media Exploration for Blind and Low Vision Users through AI-Powered Interactive Storytelling
Shuchang Xu
Subjects: Human-Computer Interaction (cs.HC)

Empowering blind and low vision (BLV) users to explore visual media improves content comprehension, strengthens user agency, and fulfills diverse information needs. However, most existing tools separate exploration from the main narration, which disrupts the narrative flow, increases cognitive load, and limits deep engagement with visual media. To address these challenges, my PhD research introduces the paradigm of AI-powered interactive storytelling, which leverages AI to generate interactive narratives, enabling BLV users to explore visual media within a coherent storytelling experience. I have operationalized this paradigm through three techniques: (1) Hierarchical Narrative, which supports photo-collection exploration at different levels of detail; (2) Parallel Narrative, which provides seamless access to time-synced video comments; and (3) Branching Narrative, which enables immersive navigation of 360° videos. Together, these techniques demonstrate that AI-powered interactive storytelling can effectively balance user agency with narrative coherence across diverse media formats. My future work will advance this paradigm by enabling more personalized and expressive storytelling experiences for BLV audiences.

[9] arXiv:2508.03182 [Chinese pdf, pdf, html, other]
Title: StoryEnsemble: Enabling Dynamic Exploration & Iteration in the Design Process with AI and Forward-Backward Propagation
Sangho Suh, Michael Lai, Kevin Pu, Steven P. Dow, Tovi Grossman
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Design processes involve exploration, iteration, and movement across interconnected stages such as persona creation, problem framing, solution ideation, and prototyping. However, time and resource constraints often hinder designers from exploring broadly, collecting feedback, and revisiting earlier assumptions, making it difficult to uphold core design principles in practice. To better understand these challenges, we conducted a formative study with 15 participants, comprising UX practitioners, students, and instructors. Based on the findings, we developed StoryEnsemble, a tool that integrates AI into a node-link interface and leverages forward and backward propagation to support dynamic exploration and iteration across the design process. A user study with 10 participants showed that StoryEnsemble enables rapid, multi-directional iteration and flexible navigation across design stages. This work advances our understanding of how AI can foster more iterative design practices by introducing novel interactions that make exploration and iteration more fluid, accessible, and engaging.
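
The forward-backward propagation idea can be illustrated with a small, hypothetical sketch over a chain of design stages; the stage names, the regenerate placeholder, and the propagation rules are assumptions for illustration, not StoryEnsemble's implementation.

```python
STAGES = ["persona", "problem", "ideas", "prototype"]

def regenerate(stage: str, upstream_content: str) -> str:
    # Placeholder for an AI call that re-derives a stage from its upstream input.
    return f"{stage} derived from ({upstream_content})"

def forward_propagate(artifacts: dict, edited_stage: str) -> dict:
    """After an upstream edit, regenerate every downstream stage in order."""
    start = STAGES.index(edited_stage)
    for prev, nxt in zip(STAGES[start:], STAGES[start + 1:]):
        artifacts[nxt] = regenerate(nxt, artifacts[prev])
    return artifacts

def backward_propagate(edited_stage: str) -> list:
    """After a downstream edit, list upstream stages whose assumptions may need revisiting."""
    return STAGES[:STAGES.index(edited_stage)]

artifacts = {"persona": "busy commuter", "problem": "", "ideas": "", "prototype": ""}
print(forward_propagate(artifacts, "persona"))
print("revisit:", backward_propagate("prototype"))
```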

[10] arXiv:2508.03216 [Chinese pdf, pdf, html, other]
Title: Navigation Pixie: Implementation and Empirical Study Toward On-demand Navigation Agents in Commercial Metaverse
Hikari Yanagawa, Yuichi Hiroi, Satomi Tokida, Yuji Hatada, Takefumi Hiraki
Comments: 11 pages + 3 pages of supplementary material. To appear at IEEE ISMAR 2025
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

While commercial metaverse platforms offer diverse user-generated content, they lack effective navigation assistance that can dynamically adapt to users' interests and intentions. Although previous research has investigated on-demand agents in controlled environments, implementation in commercial settings with diverse world configurations and platform constraints remains challenging. We present Navigation Pixie, an on-demand navigation agent employing a loosely coupled architecture that integrates structured spatial metadata with LLM-based natural language processing while minimizing platform dependencies, which enables experiments on the extensive user base of commercial metaverse platforms. Our cross-platform experiments on commercial metaverse platform Cluster with 99 PC client and 94 VR-HMD participants demonstrated that Navigation Pixie significantly increased dwell time and free exploration compared to fixed-route and no-agent conditions across both platforms. Subjective evaluations revealed consistent on-demand preferences in PC environments versus context-dependent social perception advantages in VR-HMD. This research contributes to advancing VR interaction design through conversational spatial navigation agents, establishes cross-platform evaluation methodologies revealing environment-dependent effectiveness, and demonstrates empirical experimentation frameworks for commercial metaverse platforms.

[11] arXiv:2508.03281 [Chinese pdf, pdf, html, other]
Title: Quo-Vadis Multi-Agent Automotive Research? Insights from a Participatory Workshop and Questionnaire
Pavlo Bazilinskyy, Francesco Walker, Debargha Dey, Tram Thi Minh Tran, Hyungchai Park, Hyochang Kim, Hyunmin Kang, Patrick Ebel
Journal-ref: 17th International Conference on Automotive User Interfaces and Interactive Vehicular Applications (AutomotiveUI 2025 Adjunct)
Subjects: Human-Computer Interaction (cs.HC)

The transition to mixed-traffic environments that involve automated vehicles, manually operated vehicles, and vulnerable road users presents new challenges for human-centered automotive research. Despite this, most studies in the domain focus on single-agent interactions. This paper reports on a participatory workshop (N = 15) and a questionnaire (N = 19) conducted during the AutomotiveUI '24 conference to explore the state of multi-agent automotive research. The participants discussed methodological challenges and opportunities in real-world settings, simulations, and computational modeling. Key findings reveal that while the value of multi-agent approaches is widely recognized, practical and technical barriers hinder their implementation. The study highlights the need for interdisciplinary methods, better tools, and simulation environments that support scalable, realistic, and ethically informed multi-agent research.

[12] arXiv:2508.03293 [Chinese pdf, pdf, html, other]
Title: Enhancing Joint Human-AI Inference in Robot Missions: A Confidence-Based Approach
Duc-An Nguyen, Clara Colombatto, Steve Fleming, Ingmar Posner, Nick Hawes, Raunak Bhattacharyya
Subjects: Human-Computer Interaction (cs.HC); Robotics (cs.RO)

Joint human-AI inference holds immense potential to improve outcomes in human-supervised robot missions. Current day missions are generally in the AI-assisted setting, where the human operator makes the final inference based on the AI recommendation. However, due to failures in human judgement on when to accept or reject the AI recommendation, complementarity is rarely achieved. We investigate joint human-AI inference where the inference made with higher confidence is selected. Through a user study with N=100 participants on a representative simulated robot teleoperation task, specifically studying the inference of robots' control delays we show that: a) Joint inference accuracy is higher and its extent is regulated by the confidence calibration of the AI agent, and b) Humans change their inferences based on AI recommendations and the extent and direction of this change is also regulated by the confidence calibration of the AI agent. Interestingly, our results show that pairing poorly-calibrated AI-DSS with humans hurts performance instead of helping the team, reiterating the need for AI-based decision support systems with good metacognitive sensitivity. To the best of our knowledge, our study presents the first application of a maximum-confidence-based heuristic for joint human-AI inference within a simulated robot teleoperation task.
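
The maximum-confidence heuristic itself is simple to state; below is a minimal sketch, assuming both parties report a confidence in [0, 1] and that ties default to the human. The estimates and confidence values are illustrative, not data from the study.

```python
def joint_inference(human_estimate, human_confidence, ai_estimate, ai_confidence):
    """Return the estimate whose reported confidence is higher (ties go to the human)."""
    if ai_confidence > human_confidence:
        return ai_estimate, "AI"
    return human_estimate, "human"

# Example: inferring whether a robot's control delay is "high" or "low".
print(joint_inference("low", 0.62, "high", 0.80))   # ('high', 'AI')
print(joint_inference("low", 0.90, "high", 0.80))   # ('low', 'human')
```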

[13] arXiv:2508.03355 [Chinese pdf, pdf, html, other]
Title: Remini: Leveraging Chatbot-Mediated Mutual Reminiscence for Promoting Positive Affect and Feeling of Connectedness among Loved Ones
Zhuoqun Jiang, ShunYi Yeo, Wei Xuan Donovan Seow, Simon Perrault
Comments: Final version submitted to PACM HCI, CSCW 2025
Subjects: Human-Computer Interaction (cs.HC)

Mutual reminiscence, defined as revisiting shared positive memories through reciprocal self-disclosure, strengthens emotional bonds, enhances well-being, and deepens intimacy. However, most technology-mediated reminiscence tools emphasize individual reflection or one-way storytelling, which overlooks the dynamic, interactive dialogue essential for meaningful mutual reminiscence. To address this limitation, we introduce Remini, a chatbot designed to support reciprocal self-disclosure between close partners such as couples, friends, or family members. Grounded in the Social Functions of Autobiographical Memory (SFAM) framework, Remini uses conversational AI to guide emotionally rich exchanges through five narrative phases: rapport building, memory narration, elaboration, reflection, and summary. In a mixed-method, both between- and within-subjects study (N = 48, 24 dyads), we compare Remini to a baseline chatbot that offers minimal memory-trigger prompts. Our findings show that structured guidance from Remini significantly improves positive affect, feeling of connection, and engagement. It also fosters more detailed narrative co-construction and greater reciprocal self-disclosure. Participant feedback highlights the practical value, perceived benefits, and design considerations of chatbot-mediated reminiscence. We contribute empirically grounded design implications for conversational agents that strengthen human connection through mutual reminiscence.

[14] arXiv:2508.03430 [Chinese pdf, pdf, html, other]
Title: The Science Fiction Science Method
Iyad Rahwan, Azim Shariff, Jean-François Bonnefon
Journal-ref: Nature, Volume 643, Issue 8075 (2025)
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Predicting the social and behavioral impact of future technologies, before they are achieved, would allow us to guide their development and regulation before these impacts get entrenched. Traditionally, this prediction has relied on qualitative, narrative methods. Here we describe a method which uses experimental methods to simulate future technologies, and collect quantitative measures of the attitudes and behaviors of participants assigned to controlled variations of the future. We call this method 'science fiction science'. We suggest that the reason why this method has not been fully embraced yet, despite its potential benefits, is that experimental scientists may be reluctant to engage in work facing such serious validity threats as science fiction science. To address these threats, we consider possible constraints on the kind of technology that science fiction science may study, as well as the unconventional, immersive methods that science fiction science may require. We seek to provide perspective on the reasons why this method has been marginalized for so long, what benefits it would bring if it could be built on strong yet unusual methods, and how we can normalize these methods to help the diverse community of science fiction scientists to engage in a virtuous cycle of validity improvement.

[15] arXiv:2508.03547 [Chinese pdf, pdf, html, other]
Title: Guided Reality: Generating Visually-Enriched AR Task Guidance with LLMs and Vision Models
Ada Yi Zhao, Aditya Gunturu, Ellen Yi-Luen Do, Ryo Suzuki
Comments: To appear at UIST 2025
Subjects: Human-Computer Interaction (cs.HC)

Large language models (LLMs) have enabled the automatic generation of step-by-step augmented reality (AR) instructions for a wide range of physical tasks. However, existing LLM-based AR guidance often lacks rich visual augmentations to effectively embed instructions into spatial context for a better user understanding. We present Guided Reality, a fully automated AR system that generates embedded and dynamic visual guidance based on step-by-step instructions. Our system integrates LLMs and vision models to: 1) generate multi-step instructions from user queries, 2) identify appropriate types of visual guidance, 3) extract spatial information about key interaction points in the real world, and 4) embed visual guidance in physical space to support task execution. Drawing from a corpus of user manuals, we define five categories of visual guidance and propose an identification strategy based on the current step. We evaluate the system through a user study (N=16), completing real-world tasks and exploring the system in the wild. Additionally, four instructors shared insights on how Guided Reality could be integrated into their training workflows.

[16] arXiv:2508.03630 [Chinese pdf, pdf, html, other]
Title: SlideAudit: A Dataset and Taxonomy for Automated Evaluation of Presentation Slides
Zhuohao Jerry Zhang, Ruiqi Chen, Mingyuan Zhong, Jacob O. Wobbrock
Comments: User Interface Software and Technology (UIST) 2025
Subjects: Human-Computer Interaction (cs.HC)

Automated evaluation of specific graphic designs like presentation slides is an open problem. We present SlideAudit, a dataset for automated slide evaluation. We collaborated with design experts to develop a thorough taxonomy of slide design flaws. Our dataset comprises 2400 slides collected and synthesized from multiple sources, including a subset intentionally modified with specific design problems. We then fully annotated them using our taxonomy through strictly trained crowdsourcing from Prolific. To evaluate whether AI is capable of identifying design flaws, we compared multiple large language models under different prompting strategies, and with an existing design critique pipeline. We show that AI models struggle to accurately identify slide design flaws, with F1 scores ranging from 0.331 to 0.655. Notably, prompting techniques leveraging our taxonomy achieved the highest performance. We further conducted a remediation study to assess AI's potential for improving slides. Among 82.0% of slides that showed significant improvement, 87.8% of them were improved more with our taxonomy, further demonstrating its utility.
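
For context, flaw identification of this kind is typically scored per category with precision, recall, and F1; the sketch below computes F1 over sets of flagged slide IDs and is only an illustration of the metric, not SlideAudit's exact evaluation protocol. The example slide IDs are made up.

```python
def f1_for_flaw(predicted: set, actual: set) -> float:
    """F1 for one flaw category over sets of slide IDs flagged vs. annotated."""
    tp = len(predicted & actual)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# Slides a model flags for one flaw category vs. the annotated ground truth.
print(round(f1_for_flaw({1, 2, 5, 9}, {2, 5, 7, 9, 11}), 3))  # 0.667
```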

[17] arXiv:2508.03651 [Chinese pdf, pdf, html, other]
Title: Probing the Gaps in ChatGPT Live Video Chat for Real-World Assistance for People who are Blind or Visually Impaired
Ruei-Che Chang, Rosiana Natalie, Wenqian Xu, Jovan Zheng Feng Yap, Anhong Guo
Comments: ACM ASSETS 2025
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Recent advancements in large multimodal models have provided blind or visually impaired (BVI) individuals with new capabilities to interpret and engage with the real world through interactive systems that utilize live video feeds. However, the potential benefits and challenges of such capabilities to support diverse real-world assistive tasks remain unclear. In this paper, we present findings from an exploratory study with eight BVI participants. Participants used ChatGPT's Advanced Voice with Video, a state-of-the-art live video AI released in late 2024, in various real-world scenarios, from locating objects to recognizing visual landmarks, across unfamiliar indoor and outdoor environments. Our findings indicate that current live video AI effectively provides guidance and answers for static visual scenes but falls short in delivering essential live descriptions required in dynamic situations. Despite inaccuracies in spatial and distance information, participants leveraged the provided visual information to supplement their mobility strategies. Although the system was perceived as human-like due to high-quality voice interactions, assumptions about users' visual abilities, hallucinations, generic responses, and a tendency towards sycophancy led to confusion, distrust, and potential risks for BVI users. Based on the results, we discuss implications for assistive video AI agents, including incorporating additional sensing capabilities for real-world use, determining appropriate intervention timing beyond turn-taking interactions, and addressing ecological and safety concerns.

[18] arXiv:2508.03673 [Chinese pdf, pdf, other]
Title: Classifying Epistemic Relationships in Human-AI Interaction: An Exploratory Approach
Shengnan Yang, Rongqian Ma
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

As AI systems become integral to knowledge-intensive work, questions arise not only about their functionality but also their epistemic roles in human-AI interaction. While HCI research has proposed various AI role typologies, it often overlooks how AI reshapes users' roles as knowledge contributors. This study examines how users form epistemic relationships with AI-how they assess, trust, and collaborate with it in research and teaching contexts. Based on 31 interviews with academics across disciplines, we developed a five-part codebook and identified five relationship types: Instrumental Reliance, Contingent Delegation, Co-agency Collaboration, Authority Displacement, and Epistemic Abstention. These reflect variations in trust, assessment modes, tasks, and human epistemic status. Our findings show that epistemic roles are dynamic and context-dependent. We argue for shifting beyond static metaphors of AI toward a more nuanced framework that captures how humans and AI co-construct knowledge, enriching HCI's understanding of the relational and normative dimensions of AI use.

Cross submissions (showing 9 of 9 entries)

[19] arXiv:2508.02733 (cross-list from cs.SE) [Chinese pdf, pdf, html, other]
Title: What's in a Proof? Analyzing Expert Proof-Writing Processes in F* and Verus
Rijul Jain, Shraddha Barke, Gabriel Ebner, Md Rakib Hossain Misu, Shan Lu, Sarah Fakhoury
Subjects: Software Engineering (cs.SE); Human-Computer Interaction (cs.HC)

Proof-oriented programming languages (POPLs) empower developers to write code alongside formal correctness proofs, providing formal guarantees that the code adheres to specified requirements. Despite their powerful capabilities, POPLs present a steep learning curve and have not yet been adopted by the broader software community. The lack of understanding about the proof-development process and how expert proof developers interact with POPLs has hindered the advancement of effective proof engineering and the development of proof-synthesis models/tools. In this work, we conduct a user study, involving the collection and analysis of fine-grained source code telemetry from eight experts working with two languages, F* and Verus. Results reveal interesting trends and patterns about how experts reason about proofs and key challenges encountered during the proof development process. We identify three distinct strategies and multiple informal practices that are not captured in final code snapshots, yet are predictive of task outcomes. We translate these findings into concrete design guidance for AI proof assistants: bias toward early specification drafting, explicit sub-goal decomposition, bounded active errors, and disciplined verifier interaction. We also present a case study of an F* proof agent grounded in these recommendations, and demonstrate improved performance over baseline LLMs.

[20] arXiv:2508.02926 (cross-list from cs.LG) [Chinese pdf, pdf, html, other]
Title: GrandJury: A Collaborative Machine Learning Model Evaluation Protocol for Dynamic Quality Rubrics
Arthur Cho
Comments: 26 pages, 1 table. Open-source implementation available on PyPI (grandjury package) and GitHub. Dataset available on Hugging Face under the CC-BY-4.0 license.
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Generative Machine Learning models have become central to modern systems, powering applications in creative writing, summarization, multi-hop reasoning, and context-aware dialogue. These models underpin large-scale AI assistants, workflow automation, and autonomous decision-making. In such domains, acceptable response is rarely absolute or static, but plural and highly context-dependent. Yet standard evaluation regimes still rely on static, benchmark-style tests, incentivizing optimization toward leaderboard scores rather than alignment with dynamic user needs or evolving realities. GrandJury introduces a formal evaluation protocol combining time-decayed aggregation, complete traceability, with the support of dynamic, transparent task rubric attribution, and multi-rater human judgment. Together, these elements enable pluralistic, accountable evaluation that captures evolving consensus and surfaces disagreement. We provide an open-source implementation (grandjury PyPI package) and a public collection of Large Language Model (LLM) inference outputs to illustrate the need and method. GrandJury provides a new paradigm for AI practitioners when evaluating machine learning outputs without absolute ground truth.
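
As a rough illustration of time-decayed aggregation (one ingredient of the protocol), the sketch below down-weights older rater judgments with an exponential half-life; the half-life value, the [0, 1] scores, and the time_decayed_verdict function are assumptions for the example and do not reflect the grandjury package's actual API.

```python
import time

def time_decayed_verdict(ratings, half_life_days=30.0, now=None):
    """Aggregate (timestamp, score in [0, 1]) judgments with exponential time decay
    so that newer judgments dominate the evolving consensus."""
    now = time.time() if now is None else now
    weighted = total = 0.0
    for ts, score in ratings:
        age_days = (now - ts) / 86400.0
        w = 0.5 ** (age_days / half_life_days)   # half the weight every half_life_days
        weighted += w * score
        total += w
    return weighted / total if total else None

now = time.time()
ratings = [(now - 90 * 86400, 0.2),   # old, low score
           (now - 10 * 86400, 0.8),   # recent, higher score
           (now - 1 * 86400, 0.9)]    # newest, highest score
print(round(time_decayed_verdict(ratings, now=now), 3))
```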

[21] arXiv:2508.03037 (cross-list from cs.CL) [Chinese pdf, pdf, html, other]
Title: When Algorithms Meet Artists: Topic Modeling the AI-Art Debate, 2013-2025
Ariya Mukherjee-Gandhi, Oliver Muellerklein
Comments: 18 pages, 5 figures, 5 tables
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)

As generative AI continues to reshape artistic production and alternate modes of human expression, artists whose livelihoods are most directly affected have raised urgent concerns about consent, transparency, and the future of creative labor. However, the voices of artists are often marginalized in dominant public and scholarly discourse. This study presents a twelve-year analysis, from 2013 to 2025, of English-language discourse surrounding AI-generated art. It draws from 439 curated 500-word excerpts sampled from opinion articles, news reports, blogs, legal filings, and spoken-word transcripts. Through a reproducible methodology, we identify five stable thematic clusters and uncover a misalignment between artists' perceptions and prevailing media narratives. Our findings highlight how the use of technical jargon can function as a subtle form of gatekeeping, often sidelining the very issues artists deem most urgent. Our work provides a BERTopic-based methodology and a multimodal baseline for future research, alongside a clear call for deeper, transparency-driven engagement with artist perspectives in the evolving AI-creative landscape.
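
A minimal BERTopic sketch of the kind of pipeline such a methodology implies is shown below; the stand-in corpus (20 Newsgroups) and the min_topic_size setting are assumptions, since the study's curated excerpts and exact configuration are not reproduced here.

```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Stand-in corpus; in the study each document would be one curated 500-word excerpt.
docs = fetch_20newsgroups(subset="train",
                          remove=("headers", "footers", "quotes")).data[:2000]

topic_model = BERTopic(min_topic_size=15)      # modest minimum for a small corpus
topics, probs = topic_model.fit_transform(docs)

print(topic_model.get_topic_info().head(6))    # overview of discovered clusters
print(topic_model.get_topic(0)[:10])           # top terms of the largest non-outlier topic
```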

[22] arXiv:2508.03274 (cross-list from eess.SP) [Chinese pdf, pdf, html, other]
Title: Investigating the Cognitive Response of Brake Lights in Initiating Braking Action Using EEG
Ramaswamy Palaniappan, Surej Mouli, Howard Bowman, Ian McLoughlin
Comments: arXiv admin note: text overlap with arXiv:2010.10584
Journal-ref: IEEE Transactions on Intelligent Transportation Systems, August 2022
Subjects: Signal Processing (eess.SP); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)

Half of all road accidents result from either lack of driver attention or from maintaining insufficient separation between vehicles. Collision from the rear, in particular, has been identified as the most common class of accident in the UK, and its influencing factors have been widely studied for many years. Rear-mounted stop lamps, illuminated when braking, are the primary mechanism to alert following drivers to the need to reduce speed or brake. This paper develops a novel brain response approach to measuring subject reaction to different brake light designs. A variety of off-the-shelf brake light assemblies are tested in a physical simulated driving environment to assess the cognitive reaction times of 22 subjects. Eight pairs of LED-based and two pairs of incandescent bulb-based brake light assemblies are used and electroencephalogram (EEG) data recorded. Channel Pz is utilised to extract the P3 component evoked during the decision making process that occurs in the brain when a participant decides to lift their foot from the accelerator and depress the brake. EEG analysis shows that both incandescent bulb-based lights are statistically slower to evoke cognitive responses than all tested LED-based lights. Between the LED designs, differences are evident, but not statistically significant, attributed to the significant amount of movement artifact in the EEG signal.
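
Below is a simplified sketch of the kind of ERP analysis described here: epoching the Pz channel around events, baseline-correcting, averaging, and picking the P3 peak in a 250-500 ms window. The epoch bounds, window, sampling rate, and synthetic data are assumptions, not the paper's exact processing pipeline.

```python
import numpy as np

def p3_from_pz(pz_signal, event_samples, fs=256,
               epoch=(-0.2, 0.8), p3_window=(0.25, 0.5)):
    """Average Pz epochs around events and return the P3 peak amplitude and
    latency from the grand-average ERP (illustrative pipeline)."""
    pre, post = int(epoch[0] * fs), int(epoch[1] * fs)
    epochs = []
    for ev in event_samples:
        seg = pz_signal[ev + pre: ev + post].astype(float)
        seg -= seg[: -pre].mean()                 # baseline-correct on the pre-stimulus part
        epochs.append(seg)
    erp = np.mean(epochs, axis=0)
    times = np.arange(pre, post) / fs
    mask = (times >= p3_window[0]) & (times <= p3_window[1])
    peak_idx = np.argmax(erp[mask])
    return erp[mask][peak_idx], times[mask][peak_idx]

# Synthetic demo: 60 s of noise at 256 Hz with events every 2 s.
fs = 256
sig = np.random.randn(60 * fs)
events = np.arange(2 * fs, 58 * fs, 2 * fs)
amp, latency = p3_from_pz(sig, events, fs=fs)
print(f"P3 peak: {amp:.2f} (a.u.) at {latency * 1000:.0f} ms")
```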

[23] arXiv:2508.03410 (cross-list from cs.MM) [Chinese pdf, pdf, html, other]
Title: VisAug: Facilitating Speech-Rich Web Video Navigation and Engagement with Auto-Generated Visual Augmentations
Baoquan Zhao, Xiaofan Ma, Qianshi Pang, Ruomei Wang, Fan Zhou, Shujin Lin
Subjects: Multimedia (cs.MM); Human-Computer Interaction (cs.HC)

The widespread adoption of digital technology has ushered in a new era of digital transformation across all aspects of our lives. Online learning, social, and work activities, such as distance education, videoconferencing, interviews, and talks, have led to a dramatic increase in speech-rich video content. In contrast to other video types, such as surveillance footage, which typically contain abundant visual cues, speech-rich videos convey most of their meaningful information through the audio channel. This poses challenges for improving content consumption using existing visual-based video summarization, navigation, and exploration systems. In this paper, we present VisAug, a novel interactive system designed to enhance speech-rich video navigation and engagement by automatically generating informative and expressive visual augmentations based on the speech content of videos. Our findings suggest that this system has the potential to significantly enhance the consumption and engagement of information in an increasingly video-driven digital landscape.

[24] arXiv:2508.03514 (cross-list from cs.RO) [Chinese pdf, pdf, other]
Title: Theatre in the Loop: A Rehearsal-Based, Collaborative Workflow for Expressive Robotic Behaviours
Pavlos Panagiotidis, Victor Zhi Heung Ngo, Sean Myatt, Roma Patel, Rachel Ramchurn, Alan Chamberlain, Ayse Kucukyilmaz
Comments: Paper accepted for presentation at the International Conference on Social Robotics and AI (https://icsr2025.eu/)
Subjects: Robotics (cs.RO); Human-Computer Interaction (cs.HC)

In this paper, we propose theatre-in-the-loop, a framework for developing expressive robot behaviours tailored to artistic performance through a director-guided puppeteering workflow. Leveraging theatrical methods, we use narrative objectives to direct a puppeteer in generating improvised robotic gestures that convey specific emotions. These improvisations are captured and curated to build a dataset of reusable movement templates for standalone playback in future autonomous performances. Initial trials demonstrate the feasibility of this approach, illustrating how the workflow enables precise sculpting of robotic gestures into coherent emotional arcs while revealing challenges posed by the robot's mechanical constraints. We argue that this practice-led framework provides a model for interdisciplinary teams creating socially expressive robot behaviours, contributing to (1) theatre as an interactive training ground for human-robot interaction and (2) co-creation methodologies between humans and machines.

[25] arXiv:2508.03638 (cross-list from cs.FL) [Chinese pdf, pdf, other]
Title: Design Support for Multitape Turing Machines
Marco T. Morazán (Seton Hall University), Oliwia Kempinski (University of Maryland), Andrés M. Garced (Seton Hall University)
Comments: In Proceedings TFPiE 2025, arXiv:2508.02305
Journal-ref: EPTCS 424, 2025, pp. 1-24
Subjects: Formal Languages and Automata Theory (cs.FL); Human-Computer Interaction (cs.HC); Programming Languages (cs.PL); Software Engineering (cs.SE)

Many Formal Languages and Automata Theory courses introduce students to Turing machine extensions. One of the most widely-used extensions endows Turing machines with multiple tapes. Although multitape Turing machines are an abstraction to simplify Turing machine design, students find them no less challenging. To aid students in understanding these machines, the FSM programming language provides support for their definition and execution. This, however, has proven insufficient for many students to understand the operational semantics of such machines and to understand why such machines accept or reject a word. To address this problem, three visualization tools have been developed. The first is a dynamic visualization tool that simulates machine execution. The second is a static visualization tool that automatically renders a graphic for a multitape Turing machine's transition diagram. The third is a static visualization tool that automatically renders computation graphs for multitape Turing machines. This article presents these tools and illustrates how they are used to help students design and implement multitape Turing machines. In addition, empirical data is presented that suggests these tools are well-received and found useful by students.
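
To illustrate the operational semantics students struggle with, here is a language-agnostic Python sketch of multitape Turing machine execution (each step reads one symbol per tape, then writes and moves each head independently). It is not FSM code; the transition encoding and the two-tape copy example are assumptions made for illustration.

```python
BLANK = "_"

def step(state, tapes, heads, delta):
    """Apply one transition; return the new state, or None if no transition applies."""
    key = (state, tuple(t[h] for t, h in zip(tapes, heads)))
    if key not in delta:
        return None
    new_state, actions = delta[key]
    for i, (write, move) in enumerate(actions):
        tapes[i][heads[i]] = write
        heads[i] += {"L": -1, "R": 1, "S": 0}[move]
        if heads[i] == len(tapes[i]):
            tapes[i].append(BLANK)               # grow tape to the right on demand
        if heads[i] < 0:
            tapes[i].insert(0, BLANK)            # grow tape to the left on demand
            heads[i] = 0
    return new_state

def run(word, delta, start, accept, n_tapes=2, max_steps=10_000):
    tapes = [list(word) or [BLANK]] + [[BLANK] for _ in range(n_tapes - 1)]
    heads = [0] * n_tapes
    state = start
    for _ in range(max_steps):
        if state == accept:
            return True, ["".join(t).strip(BLANK) for t in tapes]
        state = step(state, tapes, heads, delta)
        if state is None:
            return False, ["".join(t).strip(BLANK) for t in tapes]
    return False, None

# Two-tape machine that copies the a/b input from tape 1 to tape 2, then accepts.
delta = {
    ("S", ("a", BLANK)): ("S", (("a", "R"), ("a", "R"))),
    ("S", ("b", BLANK)): ("S", (("b", "R"), ("b", "R"))),
    ("S", (BLANK, BLANK)): ("Y", ((BLANK, "S"), (BLANK, "S"))),
}
print(run("abba", delta, start="S", accept="Y"))  # (True, ['abba', 'abba'])
```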

[26] arXiv:2508.03639 (cross-list from cs.FL) [Chinese pdf, pdf, other]
Title: A Design Recipe and Recipe-Based Errors for Regular Expressions
Marco T. Morazán (Seton Hall University), Shamil Dzhatdoyev (Axoni, USA), Josephine Des Rosiers (Penguin Random House), Tijana Minić (University of Washington), Andrés M. Garced (Seton Hall University), David Anthony K. Fields (Seton Hall University)
Comments: In Proceedings TFPiE 2025, arXiv:2508.02305
Journal-ref: EPTCS 424, 2025, pp. 25-48
Subjects: Formal Languages and Automata Theory (cs.FL); Human-Computer Interaction (cs.HC); Programming Languages (cs.PL); Software Engineering (cs.SE)

This article presents a novel framework to provide Formal Languages and Automata Theory students design support for the development of regular expressions. This framework includes a design recipe for regular expressions and a customized error messaging system. The error messaging system produces recipe-based errors that include the step of the design recipe not successfully completed. Furthermore, the error messages follow the established practices of being concise, succinct, jargon-free, and nonprescriptive. In addition, a shorthand syntax developed for writing unit tests is described. The in-class use of the design recipe is illustrated, two debugging sessions using the described system are discussed, and the implementation of the error messaging system is briefly sketched.

[27] arXiv:2508.03641 (cross-list from cs.FL) [Chinese pdf, pdf, other]
Title: Visual Execution and Validation of Finite-State Machines and Pushdown Automata
Marco T. Morazán (Seton Hall University), David Anthony K. Fields (Seton Hall University), Andrés M. Garced (Seton Hall University), Tijana Minić (University of Washington)
Comments: In Proceedings TFPiE 2025, arXiv:2508.02305
Journal-ref: EPTCS 424, 2025, pp. 87-108
Subjects: Formal Languages and Automata Theory (cs.FL); Human-Computer Interaction (cs.HC); Programming Languages (cs.PL); Software Engineering (cs.SE)

In Formal Languages and Automata Theory courses, students find understanding nondeterministic finite-state and pushdown automata difficult. In many cases, this means that it is challenging for them to comprehend the operational semantics of such machines and, as a consequence, determine why a word is accepted or rejected. This is not entirely surprising, because students are mostly trained to design and implement deterministic programs. Comprehension of pushdown automata is further complicated, because reasoning about the stack is necessary. A common difficulty students face, for example, is understanding that two different computations on the same word may reach the same state with different stack values. To aid student understanding, we present two novel dynamic visualization tools for FSM -- a domain-specific programming language for the Automata Theory classroom -- to support the design of such machines. These tools visualize all computations that may be performed, respectively, by a nondeterministic finite-state machine or by a pushdown automata in a stepwise manner. In addition, these tools aid the machine verification process by allowing users to visually validate whether the properties a state represents hold when a machine transitions into it.

Replacement submissions (showing 15 of 15 entries)

[28] arXiv:2404.17730 (replaced) [Chinese pdf, pdf, html, other]
Title: Aging Up AAC: An Introspection on Augmentative and Alternative Communication Applications for Autistic Adults
Lara J. Martin, Malathy Nagalakshmi
Subjects: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)

High-tech Augmentative and Alternative Communication (AAC) has been rapidly advancing in recent years due to the increased use of large language models (LLMs) like ChatGPT, but many of these techniques are integrated without the inclusion of the users' perspectives. Autistic adults have been particularly neglected in the design of AAC tools. We conducted in-depth interviews with 12 autistic adults to find the pain points of current AAC and determine what technological advances they might find helpful. We found 8 different categories of themes from our interviews: input flexibility, output flexibility, selecting or adapting AAC, contexts for AAC use, benefits, access as an adult, stumbling blocks for continued use, and control of communication. In this paper, we go through these categories in depth -- comparing each to prior work -- and then highlight novel findings to suggest possible research directions.

[29] arXiv:2410.04286 (replaced) [Chinese pdf, pdf, html, other]
Title: Embracing Transparency: A Study of Open Science Practices Among Early Career HCI Researchers
Tatiana Chakravorti, Sanjana Gautam, Sarah M. Rajtmajer
Subjects: Human-Computer Interaction (cs.HC)

Many fields of science, including Human-Computer Interaction (HCI), have heightened introspection in the wake of concerns around reproducibility and replicability of published findings. Notably, in recent years the HCI community has worked to implement policy changes and mainstream open science practices. Our work investigates early-career HCI researchers' perceptions of open science and engagement with best practices through 18 semi-structured interviews. Our findings highlight key barriers to the widespread adoption of data and materials sharing, and preregistration, namely: lack of clear incentives; cultural resistance; limited training; time constraints; concerns about intellectual property; and data privacy issues. We observe that small changes at major conferences like CHI could meaningfully impact community norms. We offer recommendations to address these barriers and to promote transparency and openness in HCI. While these findings provide valuable and interesting insights about the open science practices by early career HCI researchers, their applicability is limited to the USA only. The interview study relies on self-reported data; therefore, it can be subject to biases like recall bias. Future studies will include the scope to expand HCI researchers from different levels of experience and different countries, allowing for more justifiable examples.

[30] arXiv:2501.04543 (替换) [中文pdf, pdf, html, 其他]
标题: 假冒者在我们之中:大型语言模型能否捕捉人类人格的复杂性?
标题: The Impostor is Among Us: Can Large Language Models Capture the Complexity of Human Personas?
Christopher Lazik, Christopher Katins, Charlotte Kauter, Jonas Jakob, Caroline Jay, Lars Grunske, Thomas Kosch
主题: 人机交互 (cs.HC)

大型语言模型(LLMs)为生成人物档案创造了新的机会,有望简化并加速以人类为中心的设计过程。 然而,AI生成的人物档案可能无法准确反映实际的用户体验,因为它们可能会遗漏对理解真实用户需求和行为至关重要的上下文和情感洞察。 这可能会对质量构成潜在威胁,尤其是对于新手而言。 本文研究了用户在可信度方面对由LLMs创建的人物档案与由人类创建的人物档案之间的感知差异。 我们收集了十个人类创建的人物档案,这些档案由人机交互(HCI)专家根据相关研究中确立的相关属性开发。 然后,我们系统地生成了十个由LLMs创建的人物档案,并在一项调查中将其与人工创建的档案进行了比较。 结果表明,参与者能够区分人工创建和AI生成的人物档案,后者被认为更具信息性和一致性。 然而,参与者指出,AI生成的人物档案往往遵循刻板印象,这突显了在利用LLMs进行人物档案创建时需要更加重视多样性。

Large Language Models (LLMs) have created new opportunities for generating personas, which are expected to streamline and accelerate the human-centered design process. Yet AI-generated personas may not accurately represent actual user experiences, as they can miss contextual and emotional insights critical to understanding real users' needs and behaviors. This introduces a potential threat to quality, especially for novices. This paper examines how users perceive the credibility for design of personas created by LLMs compared to those crafted by humans. We gathered ten human-crafted personas developed by HCI experts according to relevant attributes established in related work. We then systematically generated ten personas with an LLM and compared them with the human-crafted ones in a survey. The results showed that participants differentiated between human-created and AI-generated personas, with the latter perceived as more informative and consistent. However, participants noted that the AI-generated personas tended to follow stereotypes, highlighting the need for a greater emphasis on diversity when utilizing LLMs for persona creation.

[31] arXiv:2501.13308 (替换) [中文pdf, pdf, html, 其他]
标题: “试图停止是精神上的痛苦”:现实世界中强迫症患者即时干预的设计机会
标题: "It was Mentally Painful to Try and Stop": Design Opportunities for Just-in-Time Interventions for People with Obsessive-Compulsive Disorder in the Real World
Ru Wang, Kexin Zhang, Yuqing Wang, Keri Brown, Yuhang Zhao
主题: 人机交互 (cs.HC)

强迫症(OCD)是一种严重影响人们生活质量的心理健康状况。 虽然基于证据的疗法,如暴露与反应预防(ERP),可能有效,但由于害怕面对和缺乏适当的支持,在日常生活中管理OCD症状——治疗和独立生活的重要部分——仍然具有挑战性。 为了更好地了解OCD自我管理中的挑战和需求,我们采访了10位不同OCD状况的参与者和7位专门从事OCD治疗的治疗师。 通过这些访谈,我们探讨了参与者触发因素的特征以及它们如何塑造他们的强迫行为,并揭示了OCD发作不同阶段的关键应对策略。 我们的研究结果突显了OCD自我管理需求与目前可用支持之间的关键差距。 基于这些见解,我们提出了针对OCD的即时自我管理技术的设计机会,包括个性化症状跟踪、即时干预以及针对OCD特定隐私和社会需求的支持——通过技术和超越技术的方式。

Obsessive-compulsive disorder (OCD) is a mental health condition that significantly impacts people's quality of life. While evidence-based therapies such as exposure and response prevention (ERP) can be effective, managing OCD symptoms in everyday life -- an essential part of treatment and independent living -- remains challenging due to fear of confrontation and a lack of appropriate support. To better understand the challenges and needs in OCD self-management, we conducted interviews with 10 participants with diverse OCD conditions and 7 therapists specializing in OCD treatment. Through these interviews, we explored the characteristics of participants' triggers and how they shaped their compulsions, and uncovered key coping strategies across different stages of OCD episodes. Our findings highlight critical gaps between OCD self-management needs and currently available support. Building on these insights, we propose design opportunities for just-in-time self-management technologies for OCD, including personalized symptom tracking, just-in-time interventions, and support for OCD-specific privacy and social needs -- through technology and beyond.

[32] arXiv:2501.14327 (替换) [中文pdf, pdf, html, 其他]
标题: 通过眼动追踪表征低视力人群的视觉意图
标题: Characterizing Visual Intents for People with Low Vision through Eye Tracking
Ru Wang, Ruijia Chen, Anqiao Erica Cai, Zhiyuan Li, Sanbrita Mondal, Yuhang Zhao
主题: 人机交互 (cs.HC)

获取视觉信息对于因低视力状况如低视觉清晰度和有限视野的人群来说至关重要,但同时也具有挑战性。 然而,与盲人不同,低视力人群在日常任务中拥有并更倾向于使用他们的功能性视力。 因此,凝视模式成为揭示他们视觉挑战和意图的重要指标,从而激发了更多适应性的视觉支持。 我们旨在深入理解低视力用户在不同图像查看任务中的凝视行为,描述典型的视觉意图以及不同低视力条件下人们表现出的独特凝视模式。 我们对20名低视力参与者和20名视力正常对照组进行了基于眼动追踪的回顾性口头报告研究。 参与者完成了各种图像查看任务,并观看了自己凝视轨迹的回放,以反思他们的视觉体验。 基于研究,我们提出了一种包含五种视觉意图的视觉意图分类体系,这些意图由参与者的凝视行为所表征。 我们展示了低视力参与者和视力正常参与者在凝视行为上的差异,以及视觉能力如何影响低视力参与者在不同视觉意图下的凝视模式。 我们的研究结果强调了在视觉意图识别中结合视觉能力信息、视觉上下文和眼动数据的重要性,为低视力人群的意图感知辅助技术奠定了基础。

Accessing visual information is crucial yet challenging for people with low vision due to visual conditions such as low visual acuity and limited visual fields. However, unlike blind people, people with low vision retain functional vision and prefer to use it in daily tasks. Gaze patterns thus become an important indicator for uncovering their visual challenges and intents, inspiring more adaptive visual support. We seek to deeply understand low vision users' gaze behaviors in different image-viewing tasks, characterizing typical visual intents and the unique gaze patterns exhibited by people with different low vision conditions. We conducted a retrospective think-aloud study using eye tracking with 20 low vision participants and 20 sighted controls. Participants completed various image-viewing tasks and watched the playback of their gaze trajectories to reflect on their visual experiences. Based on the study, we derived a visual intent taxonomy with five visual intents characterized by participants' gaze behaviors. We demonstrated the differences between low vision and sighted participants' gaze behaviors and how visual ability affected low vision participants' gaze patterns across visual intents. Our findings underscore the importance of combining visual ability information, visual context, and eye tracking data in visual intent recognition, setting up a foundation for intent-aware assistive technologies for low vision people.
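
Analyses of gaze behavior like those described above typically start by segmenting raw gaze samples into fixations. The sketch below shows one standard approach, dispersion-threshold identification (I-DT); the thresholds and toy gaze samples are illustrative assumptions, not parameters from the study.

    # Sketch of dispersion-threshold fixation identification (I-DT), a common first step
    # when characterizing gaze behaviors. Thresholds and the toy samples are assumptions.

    def idt_fixations(samples, max_dispersion=1.0, min_samples=4):
        """samples: list of (x, y) gaze points at a fixed sampling rate.
        Returns fixation centroids as (x, y, start_index, end_index)."""
        def dispersion(w):
            xs, ys = zip(*w)
            return (max(xs) - min(xs)) + (max(ys) - min(ys))

        fixations = []
        i = 0
        while i + min_samples <= len(samples):
            j = i + min_samples
            if dispersion(samples[i:j]) <= max_dispersion:
                # grow the window while the points stay tightly clustered
                while j < len(samples) and dispersion(samples[i:j + 1]) <= max_dispersion:
                    j += 1
                xs, ys = zip(*samples[i:j])
                fixations.append((sum(xs) / len(xs), sum(ys) / len(ys), i, j - 1))
                i = j
            else:
                i += 1
        return fixations

    gaze = [(10.0, 10.1), (10.2, 9.9), (10.1, 10.0), (10.0, 10.2),   # fixation near (10, 10)
            (30.0, 5.0),                                              # saccade sample
            (29.8, 5.1), (30.1, 4.9), (30.0, 5.0), (29.9, 5.2)]       # fixation near (30, 5)
    print(idt_fixations(gaze))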

[33] arXiv:2502.13920 (替换) [中文pdf, pdf, html, 其他]
标题: 通过数据驱动、理论指导的大型语言模型探索个性化健康支持:睡眠健康案例研究
标题: Exploring Personalized Health Support through Data-Driven, Theory-Guided LLMs: A Case Study in Sleep Health
Xingbo Wang, Janessa Griffith, Daniel A. Adler, Joey Castillo, Tanzeem Choudhury, Fei Wang
评论: 被计算机系统中的人因素会议(CHI 2025)接受。代码可在 https://github.com/xingbow/sleephealthLLM 获取。
主题: 人机交互 (cs.HC) ; 计算与语言 (cs.CL)

尽管睡眠追踪设备很普遍,但许多人难以将数据转化为改善睡眠健康的可行措施。 当前的方法通常提供数据驱动的建议,但可能无法适应现实生活中的限制和个体情境。 我们提出了HealthGuru,一种新型的由大型语言模型驱动的聊天机器人,通过数据驱动、理论指导和自适应的建议以及对话式行为改变支持来增强睡眠健康。 HealthGuru的多代理框架整合了可穿戴设备数据、上下文信息和上下文多臂老虎机模型,以建议定制化的睡眠增强活动。 该系统在进行自然对话的同时,融入数据驱动的见解和理论上的行为改变技术。 我们对16名参与者进行了为期八周的野外部署研究,将HealthGuru与基线聊天机器人进行比较。 结果表明,使用HealthGuru后,睡眠时长和活动评分等指标有所改善,响应质量更高,并且用户的行为改变动机增加。 我们还识别了在健康聊天机器人中个性化和用户参与的设计挑战和考虑因素。

Despite the prevalence of sleep-tracking devices, many individuals struggle to translate data into actionable improvements in sleep health. Current methods often provide data-driven suggestions but may not be feasible or adaptable to real-life constraints and individual contexts. We present HealthGuru, a novel large language model-powered chatbot that enhances sleep health through data-driven, theory-guided, and adaptive recommendations with conversational behavior change support. HealthGuru's multi-agent framework integrates wearable device data, contextual information, and a contextual multi-armed bandit model to suggest tailored sleep-enhancing activities. The system facilitates natural conversations while incorporating data-driven insights and theoretical behavior change techniques. Our eight-week in-the-wild deployment study with 16 participants compared HealthGuru to a baseline chatbot. Results show improved metrics such as sleep duration and activity scores, higher-quality responses, and increased user motivation for behavior change with HealthGuru. We also identify challenges and design considerations for personalization and user engagement in health chatbots.
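
As a rough illustration of the contextual multi-armed bandit component mentioned above, the sketch below implements a LinUCB-style recommender over a small set of assumed activity arms; the features, arms, and simulated reward are placeholders, not HealthGuru's actual model.

    # Minimal LinUCB-style contextual bandit sketch for picking a sleep-enhancing
    # activity from context features (e.g., recent sleep duration, step count).
    # Arms, features, and the simulated reward are illustrative assumptions.

    import numpy as np

    arms = ["wind-down routine", "evening walk", "limit late caffeine"]
    d = 3                                   # context dimension (assumed features)
    alpha = 1.0                             # exploration strength

    A = {a: np.eye(d) for a in arms}        # per-arm ridge-regression statistics
    b = {a: np.zeros(d) for a in arms}

    def recommend(context):
        """Pick the arm with the highest upper confidence bound for this context."""
        scores = {}
        for a in arms:
            A_inv = np.linalg.inv(A[a])
            theta = A_inv @ b[a]
            scores[a] = context @ theta + alpha * np.sqrt(context @ A_inv @ context)
        return max(scores, key=scores.get)

    def update(arm, context, reward):
        """Fold the observed reward (e.g., change in sleep score) back into the model."""
        A[arm] += np.outer(context, context)
        b[arm] += reward * context

    rng = np.random.default_rng(0)
    for night in range(50):
        ctx = rng.random(d)                 # stand-in for normalized wearable features
        arm = recommend(ctx)
        reward = rng.random()               # stand-in for next-day sleep improvement
        update(arm, ctx, reward)
    print("tonight's suggestion:", recommend(rng.random(d)))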

[34] arXiv:2506.14468 (替换) [中文pdf, pdf, html, 其他]
标题: MERba:用于微表情识别的多感受野MambaVision
标题: MERba: Multi-Receptive Field MambaVision for Micro-Expression Recognition
Xinglong Mao, Shifeng Liu, Sirui Zhao, Tong Xu, Hanchao Wang, Baozhi Jia, Enhong Chen
主题: 人机交互 (cs.HC)

微表情(MEs)是短暂的、不由自主的面部动作,揭示真实情绪,为心理评估和刑事调查提供了有价值的见解。 尽管在自动微表情识别(MER)方面取得了显著进展,现有方法仍然难以同时捕捉局部肌肉激活和全局面部依赖关系,这两者对于解码细微的情感线索都是必不可少的。 为了解决这一挑战,我们提出了MERba,一种专门为MER设计的分层多感受野架构,该架构包含一系列局部-全局特征集成阶段。 在每个阶段中,使用MERba局部提取器捕获详细的窗口内运动模式,这些提取器结合了MambaVision混合器和定制的非对称多扫描策略,以增强局部空间敏感性。 然后通过轻量级自注意力层聚合这些局部特征,明确建模窗口间的关系,从而实现有效的全局上下文构建。 此外,为了缓解负面微表情之间的高类间相似性问题,我们引入了一个双粒度分类模块,将识别任务分解为从粗到细的范式。 在三个基准数据集上的大量实验表明,MERba始终优于现有方法,消融研究证实了每个所提组件的有效性。

Micro-expressions (MEs) are brief, involuntary facial movements that reveal genuine emotions, offering valuable insights for psychological assessment and criminal investigations. Despite significant progress in automatic ME recognition (MER), existing methods still struggle to simultaneously capture localized muscle activations and global facial dependencies, both essential for decoding subtle emotional cues. To address this challenge, we propose MERba, a hierarchical multi-receptive field architecture specially designed for MER, which incorporates a series of Local-Global Feature Integration stages. Within each stage, detailed intra-window motion patterns are captured using MERba Local Extractors, which integrate MambaVision Mixers with a tailored asymmetric multi-scanning strategy to enhance local spatial sensitivity. These localized features are then aggregated through lightweight self-attention layers that explicitly model inter-window relationships, enabling effective global context construction. Furthermore, to mitigate the challenge of high inter-class similarity among negative MEs, we introduce a Dual-Granularity Classification Module that decomposes the recognition task into a coarse-to-fine paradigm. Extensive experiments on three benchmark datasets demonstrate that MERba consistently outperforms existing methods, with ablation studies confirming the effectiveness of each proposed component.

[35] arXiv:2507.14494 (替换) [中文pdf, pdf, html, 其他]
标题: “它看起来很性感,但它是错误的。” 使用生成式人工智能进行生物医学可视化的创造力与准确性之间的紧张关系
标题: "It looks sexy but it's wrong." Tensions in creativity and accuracy using genAI for biomedical visualization
Roxanne Ziman, Shehryar Saharan, Gaël McGill, Laura Garrison
评论: 11页,3张图。被IEEE VIS 2025会议接收
主题: 人机交互 (cs.HC)

我们对生成式人工智能(genAI)在生物医学可视化(BioMedVis)中的使用所产生的工作流程和张力进行了深入分析。 尽管genAI为生物和医学内容提供了美观视觉效果的便捷生产,但这些工具的架构从根本上限制了所呈现信息的准确性和可信度,从想象(或幻想)的分子到外星解剖结构。 通过与多样化从业者和研究人员的17次访谈,我们定性分析了推动genAI(不)使用于空间导向生物医学数据视觉表示的关注点和价值观。 我们发现,BioMedVis专家,无论是作为开发人员还是设计师,在日常工作中会在不同阶段使用genAI工具,并对genAI持有从热情采用者到怀疑避免者的各种态度。 在将我们研究中观察到的当前使用情况和对genAI的观点与之前工作中对可视化流程中genAI的预测进行对比时,我们将genAI对可视化项目的影响讨论重新聚焦于当下,明确了其对未来可视化研究的机遇和陷阱。 在公众对科学的信任受到威胁的时候,我们被提醒首先要做到无害,不仅在生物医学可视化中,而且在更广泛的科学传播中也是如此。 我们的观察结果再次确认了人类干预在共情设计和准确科学视觉评估中的必要性。

We contribute an in-depth analysis of the workflows and tensions arising from generative AI (genAI) use in biomedical visualization (BioMedVis). Although genAI affords facile production of aesthetic visuals for biological and medical content, the architecture of these tools fundamentally limits the accuracy and trustworthiness of the depicted information, from imaginary (or fanciful) molecules to alien anatomy. Through 17 interviews with a diverse group of practitioners and researchers, we qualitatively analyze the concerns and values driving genAI (dis)use for the visual representation of spatially-oriented biomedical data. We find that BioMedVis experts, in their roles as both developers and designers, use genAI tools at different stages of their daily workflows and hold attitudes ranging from enthusiastic adopters to skeptical avoiders of genAI. By contrasting the current use of and perspectives on genAI observed in our study with prior work's predictions about genAI in the visualization pipeline, we refocus the discussion of genAI's effects on visualization projects on the here and now, with its respective opportunities and pitfalls for future visualization research. At a time when public trust in science is in jeopardy, we are reminded to first do no harm, not just in biomedical visualization but in science communication more broadly. Our observations reaffirm the necessity of human intervention for empathetic design and assessment of accurate scientific visuals.

[36] arXiv:2507.21462 (替换) [中文pdf, pdf, html, 其他]
标题: 使用触觉图表支持盲人和低视力个体对复杂可视化的理解和学习
标题: Using Tactile Charts to Support Comprehension and Learning of Complex Visualizations for Blind and Low-Vision Individuals
Tingying He, Maggie McCracken, Daniel Hajas, Sarah Creem-Regehr, Alexander Lex
期刊参考: IEEE 可视化与计算机图形学汇刊,32,2026
主题: 人机交互 (cs.HC)

我们研究触觉图表是否有助于盲人和低视力(BLV)个体理解和学习复杂的可视化内容,并提出了四种触觉图表设计和一项访谈研究。 可视化是传达数据的强大工具,但BLV个体通常只能依靠辅助技术——主要是替代文本——来获取这些信息。 先前的研究表明,图表类型的思维模型对于解释这些描述非常重要,但BLV个体无法基于可视化图像建立这样的思维模型。 触觉图表在填补支持构建思维模型过程中的这一空白方面显示出潜力。 然而,关于触觉数据表示的研究主要集中在简单的图表类型上,尚不清楚它们是否也适用于科学出版物中更复杂的图表。 与两位BLV研究人员合作,我们设计了带有探索说明的3D打印触觉模板图表,用于四种高级图表类型:UpSet图、小提琴图、聚类热图和分面折线图。 然后,我们对12名BLV参与者进行了一项访谈研究,比较使用我们的触觉模板是否能改善对图表的思维模型和理解,以及这种理解是否能转化为通过替代文本体验的新数据集。 主题分析显示,触觉模型有助于理解图表类型,并且是BLV个体偏好的学习方法。 我们还报告了参与者对触觉图表设计的看法及其在BLV教育中的作用。

We investigate whether tactile charts support comprehension and learning of complex visualizations for blind and low-vision (BLV) individuals, and we contribute four tactile chart designs and an interview study. Visualizations are powerful tools for conveying data, yet BLV individuals can typically rely only on assistive technologies -- primarily alternative texts -- to access this information. Prior research shows the importance of mental models of chart types for interpreting these descriptions, yet BLV individuals have no means to build such a mental model from images of visualizations. Tactile charts show promise for filling this gap by supporting the process of building mental models. Yet studies on tactile data representations mostly focus on simple chart types, and it is unclear whether they are also appropriate for more complex charts such as those found in scientific publications. Working with two BLV researchers, we designed 3D-printed tactile template charts with exploration instructions for four advanced chart types: UpSet plots, violin plots, clustered heatmaps, and faceted line charts. We then conducted an interview study with 12 BLV participants to examine whether using our tactile templates improves mental models and understanding of charts, and whether this understanding translates to novel datasets experienced through alt texts. Thematic analysis shows that tactile models support chart type understanding and are the preferred learning method of BLV individuals. We also report participants' opinions on tactile chart design and their role in BLV education.

[37] arXiv:2507.21837 (替换) [中文pdf, pdf, html, 其他]
标题: VeasyGuide:演示视频中指导者动作的低视力学习者的个性化视觉指导
标题: VeasyGuide: Personalized Visual Guidance for Low-vision Learners on Instructor Actions in Presentation Videos
Yotam Sechayk, Ariel Shamir, Amy Pavel, Takeo Igarashi
评论: ASSETS 25,丹佛,科罗拉多州,美国
主题: 人机交互 (cs.HC)

教师常常依赖诸如指向、标记和草图等视觉动作,在教育演示视频中传达信息。 这些细微的视觉提示往往缺乏口头描述,迫使低视力(LV)学习者去寻找视觉指示或仅依赖音频,这可能导致信息遗漏和认知负荷增加。 为了解决这一挑战,我们与三位低视力参与者进行了共同设计研究,并开发了VeasyGuide,这是一种利用运动检测来识别教师动作并动态突出和放大它们的工具。 VeasyGuide生成熟悉的视觉高亮效果,传达空间上下文,并通过广泛的个性化和实时视觉反馈适应不同的学习者和内容。 VeasyGuide通过明确要查看什么和在哪里查看,减少了视觉搜索努力。 在8位低视力参与者的评估中,学习者在检测教师动作方面表现出显著的改进,反应时间更快,认知负荷显著降低。 另一项针对8位视力正常参与者的评估显示,VeasyGuide也增强了参与度和注意力,表明它作为一项普遍有益的工具的潜力。

Instructors often rely on visual actions such as pointing, marking, and sketching to convey information in educational presentation videos. These subtle visual cues often lack verbal descriptions, forcing low-vision (LV) learners to search for visual indicators or rely solely on audio, which can lead to missed information and increased cognitive load. To address this challenge, we conducted a co-design study with three LV participants and developed VeasyGuide, a tool that uses motion detection to identify instructor actions and dynamically highlight and magnify them. VeasyGuide produces familiar visual highlights that convey spatial context and adapt to diverse learners and content through extensive personalization and real-time visual feedback. VeasyGuide reduces visual search effort by clarifying what to look for and where to look. In an evaluation with 8 LV participants, learners demonstrated a significant improvement in detecting instructor actions, with faster response times and significantly reduced cognitive load. A separate evaluation with 8 sighted participants showed that VeasyGuide also enhanced engagement and attentiveness, suggesting its potential as a universally beneficial tool.
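
A minimal sketch of the motion-detection idea described above: difference consecutive frames and box the moving region so it can be highlighted or magnified. The OpenCV calls are standard, but the thresholds and video path are assumptions, and this is not VeasyGuide's actual pipeline.

    # Rough sketch of detecting instructor motion (pointing, marking, sketching) by
    # frame differencing, then boxing the moving region for highlighting/magnification.
    # Threshold values and the video path are illustrative assumptions.

    import cv2

    cap = cv2.VideoCapture("lecture.mp4")          # assumed input video
    ok, prev = cap.read()
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        diff = cv2.absdiff(cv2.GaussianBlur(gray, (5, 5), 0),
                           cv2.GaussianBlur(prev_gray, (5, 5), 0))
        _, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        for c in contours:
            if cv2.contourArea(c) > 500:           # ignore tiny flicker
                x, y, w, h = cv2.boundingRect(c)
                cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 255), 3)
        cv2.imshow("highlighted instructor action", frame)
        if cv2.waitKey(30) & 0xFF == ord("q"):
            break
        prev_gray = gray

    cap.release()
    cv2.destroyAllWindows()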

[38] arXiv:2508.02639 (替换) [中文pdf, pdf, html, 其他]
标题: 重构模式:对复合视觉变量的全面方法
标题: Reframing Pattern: A Comprehensive Approach to a Composite Visual Variable
Tingying He, Jason Dykes, Petra Isenberg, Tobias Isenberg
主题: 人机交互 (cs.HC)

我们提出了一种新的综合性理论,用于解释、探索和利用模式作为可视化中的视觉变量。 尽管模式长期以来一直用于数据编码,并且至今仍然具有价值,但其概念基础却很脆弱:研究文献和实践中使用的概念和术语不一致,这使得有效使用模式以及开展相关研究变得困难。 为了解决这个问题,我们进行了一项跨学科的全面文献综述,澄清了“模式”和“纹理”使用中的模糊之处。 结果,我们提出了一个全新的、一致的模式处理方法,将其视为由结构化的图形原始元素组组成的复合视觉变量,这些元素可以单独和共同用于数据编码。 这种新且广泛适用的表述为视觉变量模式开辟了一个较大的设计空间,我们将其形式化为一个新的系统,包括三组变量:原始元素的空间排列、原始元素之间的外观关系,以及描述单个原始元素的视网膜视觉变量。 我们展示了我们的模式系统如何与现有的可视化理论相关联,并突出了可视化设计的机会。 我们进一步探讨了基于复杂空间排列的模式,展示了其解释力,并将我们的概念化与地图和制图学的更广泛理论联系起来。 作者版本和附加材料可在OSF上获取:osf.io/z7ae2。

We present a new comprehensive theory for explaining, exploring, and using pattern as a visual variable in visualization. Although patterns have long been used for data encoding and continue to be valuable today, their conceptual foundations are precarious: the concepts and terminology used across the research literature and in practice are inconsistent, making it challenging to use patterns effectively and to conduct research to inform their use. To address this problem, we conduct a comprehensive cross-disciplinary literature review that clarifies ambiguities around the use of "pattern" and "texture". As a result, we offer a new consistent treatment of pattern as a composite visual variable composed of structured groups of graphic primitives that can serve as marks for encoding data individually and collectively. This new and widely applicable formulation opens a sizable design space for the visual variable pattern, which we formalize as a new system comprising three sets of variables: the spatial arrangement of primitives, the appearance relationships among primitives, and the retinal visual variables that characterize individual primitives. We show how our pattern system relates to existing visualization theory and highlight opportunities for visualization design. We further explore patterns based on complex spatial arrangements, demonstrating explanatory power and connecting our conceptualization to broader theory on maps and cartography. An author version and additional materials are available on OSF: osf.io/z7ae2.
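
To illustrate the proposed decomposition, the sketch below draws a pattern as a structured group of primitives whose spatial arrangement (a jittered grid), appearance relationship among primitives (alternating marks), and per-primitive retinal variables (size) are controlled independently; the specific matplotlib mapping is our own illustrative choice, not the paper's formal system.

    # Illustrative sketch: a "pattern" as a structured group of primitives, with
    # (1) a spatial arrangement (a jittered grid), (2) an appearance relationship among
    # primitives (alternating shapes), and (3) retinal variables per primitive (size).
    # The specific mapping is our own assumption, not the paper's formal system.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    fig, ax = plt.subplots(figsize=(4, 4))

    rows, cols = 8, 8
    for r in range(rows):
        for c in range(cols):
            x = c + rng.normal(0, 0.08)          # spatial arrangement: grid plus slight jitter
            y = r + rng.normal(0, 0.08)
            marker = "o" if (r + c) % 2 == 0 else "^"   # appearance relationship: alternation
            size = 40 + 10 * r                   # retinal variable: size encodes row value
            ax.scatter(x, y, s=size, marker=marker, c="slateblue", edgecolors="black")

    ax.set_title("Pattern as arrangement + relationships + retinal variables")
    ax.set_xticks([])
    ax.set_yticks([])
    plt.savefig("pattern_sketch.png", dpi=150)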

[39] arXiv:2409.11535 (替换) [中文pdf, pdf, html, 其他]
标题: 平衡最优性与多样性:通过生成整理的人本决策制定
标题: Balancing Optimality and Diversity: Human-Centered Decision Making through Generative Curation
Michael Lingzhi Li, Shixiang Zhu
主题: 机器学习 (cs.LG) ; 人机交互 (cs.HC) ; 优化与控制 (math.OC)

在医疗保健、物流和公共政策中的操作性决策越来越多地涉及推荐候选解决方案的算法,例如治疗方案、配送路线或政策选项,而最终选择权则留给人类决策者。 例如,学区使用算法设计校车路线,但管理员会根据社区反馈做出最终决定。 在这些情况下,决策质量并不取决于单一的算法“最优”,而在于推荐组合中是否至少有一个选项是人类最终认为有吸引力的。 我们提出了生成式筛选,这是一种框架,在可观察目标和未观察到的定性考虑因素共同决定吸引力的情况下,该框架可以最优地生成推荐集。 与固定解决方案不同,生成式筛选学习一个解决方案的分布,旨在最大化在可管理组合中最佳选项的期望吸引力。 我们的分析确定了定量质量和定性多样性之间的权衡,这种权衡通过一个从重新表述的目标中得出的新颖多样性度量来形式化。 我们使用生成神经网络和顺序优化方法实现了该框架,并在合成和现实世界的研究中表明,与现有基准相比,它能持续减少预期遗憾。 我们的框架为决策者提供了一种系统的方法来设计补充而非替代人类判断的算法。 通过生成多样化且高质量的选项组合,决策支持工具可以更好地适应未建模的因素,如利益相关者偏好、政治可行性或社区接受度。 更广泛地说,该框架使组织能够在大规模上实现以人类为中心的决策,确保即使目标不完整或不断变化,算法推荐仍然有用。

Operational decisions in healthcare, logistics, and public policy increasingly involve algorithms that recommend candidate solutions, such as treatment plans, delivery routes, or policy options, while leaving the final choice to human decision-makers. For instance, school districts use algorithms to design bus routes, but administrators make the final call given community feedback. In these settings, decision quality depends not on a single algorithmic "optimum", but on whether the portfolio of recommendations contains at least one option the human ultimately deems desirable. We propose generative curation, a framework that optimally generates recommendation sets when desirability depends on both observable objectives and unobserved qualitative considerations. Instead of a fixed solution, generative curation learns a distribution over solutions designed to maximize the expected desirability of the best option within a manageable portfolio. Our analysis identifies a trade-off between quantitative quality and qualitative diversity, formalized through a novel diversity metric derived from the reformulated objective. We implement the framework using a generative neural network and a sequential optimization method, and show in synthetic and real-world studies that it consistently reduces expected regret compared to existing benchmarks. Our framework provides decision-makers with a principled way to design algorithms that complement, rather than replace, human judgment. By generating portfolios of diverse yet high-quality options, decision-support tools can better accommodate unmodeled factors such as stakeholder preferences, political feasibility, or community acceptance. More broadly, the framework enables organizations to operationalize human-centered decision-making at scale, ensuring that algorithmic recommendations remain useful even when objectives are incomplete or evolving.
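
A toy Monte Carlo sketch of the trade-off described above: expected regret when a human picks the best of k recommended options whose true desirability combines an observable objective with an unobserved qualitative term. The candidate model, weights, and selection rules below are assumptions for illustration only.

    # Toy Monte Carlo sketch: expected regret of "human picks the best of k recommendations"
    # when desirability = observable objective + unobserved qualitative preference.
    # A portfolio spread over the latent attribute tends to hedge the unobserved term
    # better than k near-copies of the single optimum. All modeling choices are assumptions.

    import numpy as np

    rng = np.random.default_rng(42)
    n, k, trials, w = 200, 5, 3000, 0.8

    def expected_regret(select):
        regrets = []
        for _ in range(trials):
            z = rng.random(n)                        # latent attribute of each candidate
            objective = 1.0 - (z - 0.5) ** 2         # observable quality, peaked at z = 0.5
            pref = rng.random()                      # hidden stakeholder preference over z
            desirability = objective - w * np.abs(z - pref)   # true, partly unobserved value
            chosen = select(z, objective)
            regrets.append(desirability.max() - desirability[chosen].max())
        return float(np.mean(regrets))

    def top_k(z, objective):
        return np.argsort(objective)[-k:]            # k near-copies of the single optimum

    def curated_k(z, objective):
        # spread recommendations across the attribute while keeping quality reasonable
        bins = np.linspace(0.0, 1.0, k + 1)
        picks = []
        for lo, hi in zip(bins[:-1], bins[1:]):
            idx = np.where((z >= lo) & (z < hi))[0]
            if len(idx):
                picks.append(idx[np.argmax(objective[idx])])
        return np.array(picks)

    print("expected regret, top-k portfolio:  ", round(expected_regret(top_k), 3))
    print("expected regret, curated portfolio:", round(expected_regret(curated_k), 3))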

[40] arXiv:2501.13836 (替换) [中文pdf, pdf, html, 其他]
标题: 超越数据:低资源语言自动化审核流程中的殖民偏见和系统性问题
标题: Think Outside the Data: Colonial Biases and Systemic Issues in Automated Moderation Pipelines for Low-Resource Languages
Farhana Shahid, Mona Elswah, Aditya Vashistha
评论: 被AIES 2025接受
主题: 计算与语言 (cs.CL) ; 人机交互 (cs.HC)

大多数社交媒体用户来自全球南方,有害内容通常以当地语言出现。 然而,人工智能驱动的审核系统在这些地区的低资源语言上表现不佳。 通过对22位在四种低资源语言(泰米尔语(南亚)、斯瓦希里语(东非)、马格里布阿拉伯语(北非)和克丘亚语(南美洲)中从事有害内容检测的AI专家进行半结构化访谈,我们研究了为这些语言构建自动化审核工具中的系统性问题。 我们的研究发现,除了数据稀缺之外,科技公司对用户数据的垄断以及对低利润全球南方市场的审核缺乏投资,加剧了历史上的不平等。 即使有更多数据可用,语言模型和预处理技术以英语为中心且数据密集的设计,忽视了对形态复杂、语言多样和混合语言的设计需求。 我们认为这些限制不仅仅是由于“数据稀缺”造成的技术缺口,而是根植于对非西方语言的殖民压制所导致的结构性不平等。 我们讨论了多利益相关方的方法,以加强本地研究能力,民主化数据访问,并支持语言敏感的解决方案,以改善对低资源语言的自动化审核。

Most social media users come from the Global South, where harmful content usually appears in local languages. Yet, AI-driven moderation systems struggle with low-resource languages spoken in these regions. Through semi-structured interviews with 22 AI experts working on harmful content detection in four low-resource languages -- Tamil (South Asia), Swahili (East Africa), Maghrebi Arabic (North Africa), and Quechua (South America) -- we examine systemic issues in building automated moderation tools for these languages. Our findings reveal that beyond data scarcity, socio-political factors such as tech companies' monopoly on user data and lack of investment in moderation for low-profit Global South markets exacerbate historic inequities. Even if more data were available, the English-centric and data-intensive design of language models and preprocessing techniques overlooks the need to design for morphologically complex, linguistically diverse, and code-mixed languages. We argue these limitations are not just technical gaps caused by "data scarcity" but reflect structural inequities rooted in colonial suppression of non-Western languages. We discuss multi-stakeholder approaches to strengthen local research capacity, democratize data access, and support language-aware solutions to improve automated moderation for low-resource languages.

[41] arXiv:2504.16273 (替换) [中文pdf, pdf, html, 其他]
标题: 从有希望的能力到普遍的偏见:评估大型语言模型在急诊科分诊中的应用
标题: From Promising Capability to Pervasive Bias: Assessing Large Language Models for Emergency Department Triage
Joseph Lee, Tianqi Shang, Jae Young Baik, Duy Duong-Tran, Shu Yang, Lingyao Li, Li Shen
评论: 发表于GenAI4Health研讨会 @ AAAI 2025(非归档),一篇提交至2026年太平洋生物计算研讨会的论文预印本
主题: 人工智能 (cs.AI) ; 人机交互 (cs.HC)

大型语言模型(LLMs)在临床决策支持中显示出前景,但其在分诊中的应用仍缺乏深入研究。 我们通过两个关键维度系统地研究了LLMs在急诊科分诊中的能力:(1)对分布偏移和缺失数据的鲁棒性,以及(2)对性别和种族交叉偏见的反事实分析。 我们评估了多种基于LLM的方法,从持续预训练到上下文学习,以及机器学习方法。 我们的结果表明,LLMs表现出更强的鲁棒性,并我们研究了导致有希望的LLM方法的关键因素。 此外,在这种情况下,我们发现了在性别和种族特定交叉点上出现的LLM偏好差距。 LLMs通常表现出基于性别的差异,但在某些种族群体中最为明显。 这些发现表明,LLMs编码了可能在特定临床情境或特定特征组合中出现的人口统计偏好。

Large Language Models (LLMs) have shown promise in clinical decision support, yet their application to triage remains underexplored. We systematically investigate the capabilities of LLMs in emergency department triage through two key dimensions: (1) robustness to distribution shifts and missing data, and (2) counterfactual analysis of intersectional biases across sex and race. We assess multiple LLM-based approaches, ranging from continued pre-training to in-context learning, as well as machine learning approaches. Our results indicate that LLMs exhibit superior robustness, and we investigate the key factors contributing to the promising LLM-based approaches. Furthermore, in this setting, we identify gaps in LLM preferences that emerge in particular intersections of sex and race. LLMs generally exhibit sex-based differences, but they are most pronounced in certain racial groups. These findings suggest that LLMs encode demographic preferences that may emerge in specific clinical contexts or particular combinations of characteristics.
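
The counterfactual analysis described above can be sketched as holding a clinical vignette fixed while swapping only the demographic fields and comparing the assigned acuity; triage_model below is a hypothetical stand-in, not the paper's evaluation harness.

    # Sketch of a counterfactual bias probe for triage: hold the clinical vignette fixed,
    # vary only sex and race, and compare the acuity level the model assigns.
    # `triage_model` is a hypothetical stand-in for whatever LLM or ML model is under test.

    from itertools import product

    VIGNETTE = ("{sex} {race} patient, 54 years old, presenting with crushing chest pain "
                "radiating to the left arm, diaphoresis, BP 150/95, HR 110.")

    def triage_model(prompt: str) -> int:
        """Hypothetical model call; replace with a real LLM or classifier returning ESI 1-5."""
        return 2   # placeholder so the sketch runs end to end

    def counterfactual_gaps(sexes, races):
        results = {}
        for sex, race in product(sexes, races):
            prompt = VIGNETTE.format(sex=sex, race=race)
            results[(sex, race)] = triage_model(prompt)
        baseline = min(results.values())
        return {key: level - baseline for key, level in results.items()}

    gaps = counterfactual_gaps(["female", "male"],
                               ["Asian", "Black", "Hispanic", "White"])
    for (sex, race), gap in gaps.items():
        print(f"{sex:>6} / {race:<8} acuity gap vs. most urgent assignment: {gap}")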

[42] arXiv:2507.18905 (替换) [中文pdf, pdf, 其他]
标题: 大型语言模型对患者提出的医学问题提供不安全的答案
标题: Large language models provide unsafe answers to patient-posed medical questions
Rachel L. Draelos, Samina Afreen, Barbara Blasko, Tiffany L. Brazile, Natasha Chase, Dimple Patel Desai, Jessica Evert, Heather L. Gardner, Lauren Herrmann, Aswathy Vaikom House, Stephanie Kass, Marianne Kavan, Kirshma Khemani, Amanda Koire, Lauren M. McDonald, Zahraa Rabeeah, Amy Shah
评论: 20页
主题: 计算与语言 (cs.CL) ; 人机交互 (cs.HC)

数百万患者已经在定期使用大型语言模型(LLM)聊天机器人获取医疗建议,这引发了患者安全方面的担忧。 这项由医生主导的红队测试研究比较了四种公开可用的聊天机器人——Anthropic的Claude、Google的Gemini、OpenAI的GPT-4o和Meta的Llama3-70B——在新的数据集HealthAdvice上的安全性,使用了一种能够进行定量和定性分析的评估框架。 总共对222个患者提出的初级护理主题的医疗建议问题的888条聊天机器人回复进行了评估,这些问题涵盖内科、女性健康和儿科领域。 我们发现聊天机器人之间存在统计学上的显著差异。 有问题的回复率从21.6%(Claude)到43.2%(Llama)不等,不安全的回复率从5%(Claude)到13%(GPT-4o、Llama)不等。 定性结果揭示了可能导致严重患者伤害的聊天机器人回复。 这项研究表明,数百万患者可能正在从公开可用的聊天机器人那里获得不安全的医疗建议,需要进一步的工作来提高这些强大工具的临床安全性。

Millions of patients are already using large language model (LLM) chatbots for medical advice on a regular basis, raising patient safety concerns. This physician-led red-teaming study compares the safety of four publicly available chatbots--Claude by Anthropic, Gemini by Google, GPT-4o by OpenAI, and Llama3-70B by Meta--on a new dataset, HealthAdvice, using an evaluation framework that enables quantitative and qualitative analysis. In total, 888 chatbot responses are evaluated for 222 patient-posed advice-seeking medical questions on primary care topics spanning internal medicine, women's health, and pediatrics. We find statistically significant differences between chatbots. The rate of problematic responses varies from 21.6 percent (Claude) to 43.2 percent (Llama), with unsafe responses varying from 5 percent (Claude) to 13 percent (GPT-4o, Llama). Qualitative results reveal chatbot responses with the potential to lead to serious patient harm. This study suggests that millions of patients could be receiving unsafe medical advice from publicly available chatbots, and further work is needed to improve the clinical safety of these powerful tools.
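
As a minimal sketch of how such between-chatbot differences can be tested, the code below runs a chi-squared test on counts reconstructed from the reported percentages (222 questions per chatbot); the reconstruction from rounded percentages is an assumption.

    # Sketch: testing whether the reported problematic-response rates differ between two
    # chatbots, using counts reconstructed from the abstract's percentages (222 questions
    # each; 21.6% is about 48/222 for Claude, 43.2% about 96/222 for Llama). The
    # reconstruction from rounded percentages is an assumption.

    from scipy.stats import chi2_contingency

    #         problematic, not problematic
    table = [[48, 222 - 48],     # Claude
             [96, 222 - 96]]     # Llama3-70B

    chi2, p, dof, expected = chi2_contingency(table)
    print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.2e}")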
