Electrical Engineering and Systems Science
Showing new listings for Friday, 8 August 2025
- [1] arXiv:2508.04728 [Chinese pdf, pdf, html, other]
Title: Neural Field-Based 3D Surface Reconstruction of Microstructures from Multi-Detector Signals in Scanning Electron Microscopy
Subjects: Image and Video Processing (eess.IV); Instrumentation and Detectors (physics.ins-det)
The scanning electron microscope (SEM) is a widely used imaging device in scientific research and industrial applications. Conventional two-dimensional (2D) SEM images do not directly reveal the three-dimensional (3D) topography of micro samples, motivating the development of SEM 3D surface reconstruction methods. However, reconstruction of complex microstructures remains challenging for existing methods due to the limitations of discrete 3D representations, the need for calibration with reference samples, and shadow-induced gradient errors. Here, we introduce NFH-SEM, a neural field-based hybrid SEM 3D reconstruction method that takes multi-view, multi-detector 2D SEM images as input and fuses geometric and photometric information into a continuous neural field representation. NFH-SEM eliminates the manual calibration procedures through end-to-end self-calibration and automatically disentangles shadows from SEM images during training, enabling accurate reconstruction of intricate microstructures. We validate the effectiveness of NFH-SEM on real and simulated datasets. Our experiments show high-fidelity reconstructions of diverse, challenging samples, including two-photon lithography microstructures, peach pollen, and silicon carbide particle surfaces, demonstrating precise detail and broad applicability.
- [2] arXiv:2508.04729 [Chinese pdf, pdf, html, other]
Title: Super-Resolution of Sentinel-2 Images Using a Geometry-Guided Back-Projection Network with Self-Attention
Authors: Ivan Pereira-Sánchez, Daniel Torres, Francesc Alcover, Bartomeu Garau, Julia Navarro, Catalina Sbert, Joan Duran
Subjects: Image and Video Processing (eess.IV)
The Sentinel-2 mission provides multispectral imagery with 13 bands at resolutions of 10m, 20m, and 60m. In particular, the 10m bands offer fine structural detail, while the 20m bands capture richer spectral information. In this paper, we propose a geometry-guided super-resolution model for fusing the 10m and 20m bands. Our approach introduces a cluster-based learning procedure to generate a geometry-rich guiding image from the 10m bands. This image is integrated into an unfolded back-projection architecture that leverages image self-similarities through a multi-head attention mechanism, which models nonlocal patch-based interactions across spatial and spectral dimensions. We also generate a dataset for evaluation, comprising three testing sets that include urban, rural, and coastal landscapes. Experimental results demonstrate that our method outperforms both classical and deep learning-based super-resolution and fusion techniques.
- [3] arXiv:2508.04790 [Chinese pdf, pdf, other]
Title: Advanced Multi-Architecture Deep Learning Framework for BIRADS-Based Mammographic Image Retrieval: Comprehensive Performance Analysis with Super-Ensemble Optimization
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Content-based mammographic image retrieval systems require exact BIRADS categorical matching across five distinct classes, presenting significantly greater complexity than the binary classification tasks commonly addressed in the literature. Current medical image retrieval studies suffer from methodological limitations, including inadequate sample sizes, improper data splitting, and insufficient statistical validation, that hinder clinical translation. We developed a comprehensive evaluation framework systematically comparing CNN architectures (DenseNet121, ResNet50, VGG16) with advanced training strategies including sophisticated fine-tuning, metric learning, and super-ensemble optimization. Our evaluation employed rigorous stratified data splitting (50%/20%/30% train/validation/test), 602 test queries, and systematic validation using bootstrap confidence intervals with 1,000 samples. Advanced fine-tuning with differential learning rates achieved substantial improvements: DenseNet121 (34.79% precision@10, 19.64% improvement) and ResNet50 (34.54%, 19.58% improvement). Super-ensemble optimization combining complementary architectures achieved 36.33% precision@10 (95% CI: [34.78%, 37.88%]), representing a 24.93% improvement over baseline and providing 3.6 relevant cases per query. Statistical analysis revealed significant performance differences between optimization strategies (p<0.001) with large effect sizes (Cohen's d>0.8), while maintaining practical search efficiency (2.8 milliseconds). Performance significantly exceeds realistic expectations for 5-class medical retrieval tasks, where the literature suggests 20-25% precision@10 represents achievable performance for exact BIRADS matching. Our framework establishes new performance benchmarks while providing evidence-based architecture selection guidelines for clinical deployment in diagnostic support and quality assurance applications.
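The two evaluation quantities named in this abstract, precision@10 and a bootstrap confidence interval, can be illustrated with a minimal sketch. This is not the authors' code: the function names, the percentile-bootstrap variant, and the toy labels are assumptions made for illustration. A precision@10 of 36.33% corresponds to roughly 3.6 exact-category matches among the 10 retrieved cases, which is the "3.6 relevant cases per query" figure quoted above.

```python
import random

def precision_at_k(query_label, retrieved_labels, k=10):
    """Fraction of the top-k retrieved items whose label equals the query's."""
    top = retrieved_labels[:k]
    return sum(1 for lab in top if lab == query_label) / k

def bootstrap_ci(values, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement, record each resample's mean, then sort.
    means = sorted(
        sum(rng.choice(values) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Toy example (hypothetical labels): BIRADS category of the query and of
# the ten nearest neighbours returned by some retrieval model.
p10 = precision_at_k(2, [2, 2, 1, 2, 3, 2, 0, 4, 2, 1])
```

In practice `bootstrap_ci` would be applied to the list of per-query precision@10 values (602 of them in the setup described above) to obtain an interval for the mean.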
- [4] arXiv:2508.04832 [Chinese pdf, pdf, html, other]
Title: Deep Distillation Gradient Preconditioning for Inverse Problems
Comments: 5 pages, 4 figures
Subjects: Image and Video Processing (eess.IV)
Imaging inverse problems are commonly addressed by minimizing measurement consistency and signal prior terms. While huge attention has been paid to developing high-performance priors, even the most advanced signal prior may lose its effectiveness when paired with an ill-conditioned sensing matrix that hinders convergence and degrades reconstruction quality. In optimization theory, preconditioners allow improving the algorithm's convergence by transforming the gradient update. Traditional linear preconditioning techniques enhance convergence, but their performance remains limited due to their dependence on the structure of the sensing matrix. Learning-based linear preconditioners have been proposed, but they are optimized only for data-fidelity optimization, which may lead to solutions in the null-space of the sensing matrix. This paper employs knowledge distillation to design a nonlinear preconditioning operator. In our method, a teacher algorithm using a better-conditioned (synthetic) sensing matrix guides the student algorithm with an ill-conditioned sensing matrix through gradient matching via a preconditioning neural network. We validate our nonlinear preconditioner for plug-and-play FISTA in single-pixel, magnetic resonance, and super-resolution imaging tasks, showing consistent performance improvements and better empirical convergence.
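The effect of preconditioning on an ill-conditioned sensing matrix can be seen in a tiny worked example. The sketch below uses a classical linear, diagonal preconditioner on a two-variable least-squares problem. It is not the nonlinear, distillation-trained operator proposed in the paper, and the matrix, step size, and function names are assumptions chosen only to make the convergence gap visible.

```python
def gradient(a_diag, x, y):
    """Gradient of 0.5 * ||A x - y||^2 for a diagonal sensing matrix A."""
    return [a * (a * xi - yi) for a, xi, yi in zip(a_diag, x, y)]

def descend(a_diag, y, precond, steps, step_size=1.0):
    """Run `steps` iterations of x <- x - step_size * P * grad, with P diagonal."""
    x = [0.0] * len(y)
    for _ in range(steps):
        g = gradient(a_diag, x, y)
        x = [xi - step_size * p * gi for xi, p, gi in zip(x, precond, g)]
    return x

a_diag = [1.0, 0.1]   # ill-conditioned: A^T A has condition number 100
x_true = [1.0, 1.0]
y = [a * xi for a, xi in zip(a_diag, x_true)]

# Plain gradient descent (P = I) versus a diagonal preconditioner P = (A^T A)^-1.
plain = descend(a_diag, y, precond=[1.0, 1.0], steps=10)
scaled = descend(a_diag, y, precond=[1.0 / a**2 for a in a_diag], steps=10)
```

After ten iterations the plain update has barely moved the poorly scaled coordinate, while the preconditioned update has already recovered it. The preconditioner here is built directly from the sensing matrix, which is the dependence on matrix structure the abstract attributes to traditional linear techniques.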
- [5] arXiv:2508.04857 [Chinese pdf, pdf, html, other]
Title: Keyword Spotting with Hyper-Matched Filters for Small Footprint Devices
Comments: Preprint
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
Open-vocabulary keyword spotting (KWS) refers to the task of detecting words or terms within speech recordings, regardless of whether they were included in the training data. This paper introduces an open-vocabulary keyword spotting model with state-of-the-art detection accuracy for small-footprint devices. The model is composed of a speech encoder, a target keyword encoder, and a detection network. The speech encoder is either a tiny Whisper or a tiny Conformer. The target keyword encoder is implemented as a hyper-network that takes the desired keyword as a character string and generates a unique set of weights for a convolutional layer, which can be considered as a keyword-specific matched filter. The detection network uses the matched-filter weights to perform a keyword-specific convolution, which guides the cross-attention mechanism of a Perceiver module in determining whether the target term appears in the recording. The results indicate that our system achieves state-of-the-art detection performance and generalizes effectively to out-of-domain conditions, including second-language (L2) speech. Notably, our smallest model, with just 4.2 million parameters, matches or outperforms models that are several times larger, demonstrating both efficiency and robustness.
- [6] arXiv:2508.04871 [Chinese pdf, pdf, html, other]
Title: Linear Program-Based Stability Conditions for Nonlinear Autonomous Systems
Comments: 6 pages. Submitted to NOLCOS'2025
Subjects: Systems and Control (eess.SY)
This paper introduces a novel approach to evaluating the asymptotic stability of equilibrium points in both continuous-time (CT) and discrete-time (DT) nonlinear autonomous systems. By utilizing indirect Lyapunov methods and linearizing system dynamics through Jacobian matrices, the methodology replaces traditional semi-definite programming (SDP) techniques with computationally efficient linear programming (LP) conditions. This substitution substantially lowers the computational burden, including time and memory usage, particularly for high-dimensional systems. The stability criteria are developed using matrix transformations and leveraging the structural characteristics of the system, improving scalability. Several examples demonstrated the computational efficiency of the proposed approach compared to the existing SDP-based criteria, particularly for high-dimensional systems.
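For context, the indirect Lyapunov method mentioned in this abstract reduces local asymptotic stability of an equilibrium to a condition on the eigenvalues of the Jacobian there: negative real parts in continuous time, moduli below one in discrete time. The sketch below checks those conditions for planar systems using the standard trace/determinant tests. It illustrates the baseline eigenvalue criterion only, not the paper's LP conditions, which the abstract does not spell out. The function names and example Jacobians are assumptions.

```python
def ct_stable_2x2(jac):
    """Continuous time: both eigenvalues have negative real part
    iff trace < 0 and determinant > 0 (2x2 Routh-Hurwitz condition)."""
    (a, b), (c, d) = jac
    trace, det = a + d, a * d - b * c
    return trace < 0 and det > 0

def dt_stable_2x2(jac):
    """Discrete time: both eigenvalues lie strictly inside the unit circle
    iff |trace| < 1 + det < 2 (2x2 Jury condition)."""
    (a, b), (c, d) = jac
    trace, det = a + d, a * d - b * c
    return abs(trace) < 1 + det < 2

# Damped pendulum x1' = x2, x2' = -sin(x1) - 0.5 * x2, linearized at the origin.
pendulum_jac = [[0.0, 1.0], [-1.0, -0.5]]
# A hypothetical discrete-time map x+ = J x with one expanding direction.
expanding_jac = [[1.1, 0.0], [0.0, 0.5]]
```

Here `ct_stable_2x2(pendulum_jac)` holds, while `dt_stable_2x2(expanding_jac)` fails because one eigenvalue (1.1) lies outside the unit circle.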
- [7] arXiv:2508.04874 [Chinese pdf, pdf, html, other]
Title: Sequence Aware SAC Control for Engine Fuel Consumption Optimization in Electrified Powertrain
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
As hybrid electric vehicles (HEVs) gain traction in heavy-duty trucks, adaptive and efficient energy management is critical for reducing fuel consumption while maintaining battery charge for long operation times. We present a new reinforcement learning (RL) framework based on the Soft Actor-Critic (SAC) algorithm to optimize engine control in series HEVs. We reformulate the control task as a sequential decision-making problem and enhance SAC by incorporating Gated Recurrent Units (GRUs) and Decision Transformers (DTs) into both actor and critic networks to capture temporal dependencies and improve planning over time. To evaluate robustness and generalization, we train the models under diverse initial battery states, drive cycle durations, power demands, and input sequence lengths. Experiments show that the SAC agent with a DT-based actor and GRU-based critic was within 1.8% of Dynamic Programming (DP) in fuel savings on the Highway Fuel Economy Test (HFET) cycle, while the SAC agent with GRUs in both actor and critic networks, and FFN actor-critic agent were within 3.16% and 3.43%, respectively. On unseen drive cycles (US06 and Heavy Heavy-Duty Diesel Truck (HHDDT) cruise segment), generalized sequence-aware agents consistently outperformed feedforward network (FFN)-based agents, highlighting their adaptability and robustness in real-world settings.
- [8] arXiv:2508.04887 [Chinese pdf, pdf, html, other]
Title: Closed-Form Successive Relative Transfer Function Vector Estimation based on Blind Oblique Projection Incorporating Noise Whitening
Subjects: Audio and Speech Processing (eess.AS)
Relative transfer functions (RTFs) of sound sources play a crucial role in beamforming, enabling effective noise and interference suppression. This paper addresses the challenge of online estimating the RTF vectors of multiple sound sources in noisy and reverberant environments, for the specific scenario where sources activate successively. While the RTF vector of the first source can be estimated straightforwardly, the main challenge arises in estimating the RTF vectors of subsequent sources during segments where multiple sources are simultaneously active. The blind oblique projection (BOP) method has been proposed to estimate the RTF vector of a newly activating source by optimally blocking this source. However, this method faces several limitations: high computational complexity due to its reliance on iterative gradient descent optimization, the introduction of random additional vectors, which can negatively impact performance, and the assumption of high signal-to-noise ratio (SNR). To overcome these limitations, in this paper we propose three extensions to the BOP method. First, we derive a closed-form solution for optimizing the BOP cost function, significantly reducing computational complexity. Second, we introduce orthogonal additional vectors instead of random vectors, enhancing RTF vector estimation accuracy. Third, we incorporate noise handling techniques inspired by covariance subtraction and whitening, increasing robustness in low SNR conditions. To provide a frame-by-frame estimate of the source activity pattern, required by both the conventional BOP method and the proposed method, we propose a spatial-coherence-based online source counting method. Simulations are performed with real-world reverberant noisy recordings featuring 3 successively activating speakers, with and without a-priori knowledge of the source activity pattern.
- [9] arXiv:2508.04929 [Chinese pdf, pdf, html, other]
Title: CryoGS: Gaussian Splatting for Cryo-EM Homogeneous Reconstruction
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
As a critical modality for structural biology, cryogenic electron microscopy (cryo-EM) facilitates the determination of macromolecular structures at near-atomic resolution. The core computational task in single-particle cryo-EM is to reconstruct the 3D electrostatic potential of a molecule from a large collection of noisy 2D projections acquired at unknown orientations. Gaussian mixture models (GMMs) provide a continuous, compact, and physically interpretable representation for molecular density and have recently gained interest in cryo-EM reconstruction. However, existing methods rely on external consensus maps or atomic models for initialization, limiting their use in self-contained pipelines. Addressing this issue, we introduce cryoGS, a GMM-based method that integrates Gaussian splatting with the physics of cryo-EM image formation. In particular, we develop an orthogonal projection-aware Gaussian splatting, with adaptations such as a normalization term and FFT-aligned coordinate system tailored for cryo-EM imaging. All these innovations enable stable and efficient homogeneous reconstruction directly from raw cryo-EM particle images using random initialization. Experimental results on real datasets validate the effectiveness and robustness of cryoGS over representative baselines. The code will be released upon publication.
- [10] arXiv:2508.04951 [Chinese pdf, pdf, html, other]
Title: Real-Time Doppler and Ionospheric Dispersion Correction Techniques for Arbitrary Waveforms Utilizing GPU Compute
Comments: This is the author's preprint version of a work to be submitted to IEEE Access
Subjects: Signal Processing (eess.SP)
Radar digital signal processing generally requires correction of ionospheric distortion and Doppler dispersion, which has historically required radar-specific hardware to implement in real time. Although analog solutions are computationally efficient, they often come with system design drawbacks that limit waveform flexibility and can increase overall system complexity. With improvements in modern general compute systems, real-time digital signal processing is becoming more realizable using non-radar-specific high-performance compute. In this paper, we present an analysis of general Doppler and ionospheric correction algorithms for arbitrary waveforms for radar digital signal processing. We also include considerations for efficient implementation of these algorithms in software, specifically using GPU hardware. This analysis includes metrics of performance such as execution time and error correction accuracy. We also provide recommendations for application in radar signal processing. We identify two algorithms for dispersion correction: an FFT-based method for ionospheric dispersion and a numerical interpolation method via sinc interpolation for Doppler dispersion. Both algorithms compensate for dispersion with accuracy equivalent to waveform-specific analytical methods and ran in real time on a single NVIDIA H100 GPU. These methods are waveform-agnostic and are applied directly to the samples, improving system flexibility and making them easy to incorporate into existing software-defined radio systems.
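The sinc-interpolation idea named in this abstract can be sketched in a few lines: a uniformly sampled, band-limited signal is evaluated at fractional sample positions, which lets a time-scaled (Doppler-dispersed) waveform be resampled onto a corrected grid. The code below is a naive O(N^2) pure-Python illustration, not the authors' GPU implementation. The function names and the time-scale factor `alpha` are hypothetical.

```python
import math

def sinc(t):
    """Normalized sinc: sin(pi t) / (pi t), with sinc(0) = 1."""
    return 1.0 if t == 0 else math.sin(math.pi * t) / (math.pi * t)

def sinc_interpolate(samples, t):
    """Evaluate the band-limited interpolant of `samples` at fractional index t."""
    return sum(x * sinc(t - n) for n, x in enumerate(samples))

def rescale_time(samples, alpha):
    """Resample the signal at indices (1 + alpha) * m for m = 0..N-1.

    `alpha` is a hypothetical Doppler time-scale factor; alpha = 0 leaves
    the samples unchanged up to floating-point rounding.
    """
    return [sinc_interpolate(samples, (1.0 + alpha) * m)
            for m in range(len(samples))]

# A short test tone sampled well below Nyquist (0.05 cycles per sample).
tone = [math.sin(2 * math.pi * 0.05 * n) for n in range(64)]
resampled = rescale_time(tone, alpha=1e-3)
```

Because the sum is truncated to the available samples, accuracy degrades near the edges of the record. A practical implementation would window the kernel or restrict it to a local neighbourhood of each output sample.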
- [11] arXiv:2508.04964 [Chinese pdf, pdf, html, other]
Title: Anti-Jamming Sensing with Distributed Reconfigurable Intelligent Metasurface Antennas
Subjects: Signal Processing (eess.SP); Information Theory (cs.IT); Machine Learning (cs.LG)
The utilization of radio frequency (RF) signals for wireless sensing has garnered increasing attention. However, the radio environment is unpredictable and often unfavorable, and the sensing accuracy of traditional RF sensing methods is often degraded by adverse effects in the propagation channel from the transmitter to the receiver, such as fading and noise. In this paper, we propose employing distributed Reconfigurable Intelligent Metasurface Antennas (RIMSA) to detect the presence and location of objects, with multiple RIMSA receivers (RIMSA Rxs) deployed at different locations. By programming their beamforming patterns, RIMSA Rxs can enhance the quality of received signals. The RF sensing problem is modeled as a joint optimization of the beamforming patterns and the mapping from received signals to sensing outcomes. To address this challenge, we introduce a deep reinforcement learning (DRL) algorithm that computes the optimal beamforming patterns and a neural network that converts received signals into sensing outcomes. In addition, a malicious attacker may launch a jamming attack to disrupt the sensing process. To enable effective sensing in interference-prone environments, we devise a combined loss function that takes into account the Signal to Interference plus Noise Ratio (SINR) of the received signals. The simulation results show that the proposed distributed RIMSA system achieves more efficient sensing performance and better overcomes environmental influences than a centralized implementation. Furthermore, the introduced method ensures high-accuracy sensing performance even under jamming attack.
- [12] arXiv:2508.04977 [Chinese pdf, pdf, html, other]
Title: Uncovering the Influence Flow Model of Transistor Amplifiers, Its Reconstruction and Application
Subjects: Systems and Control (eess.SY)
Multistage transistor amplifiers can be effectively modeled as a network of dynamic systems in which individual amplifier stages interact through couplings that are dynamic in nature. Using circuit analysis techniques, we show that a large class of transistor amplifiers can be modeled as a Linear Dynamic Influence Model (LDIM), where the interactions between different amplifier stages are modeled as linear dynamic equations. LDIM modeling of transistor circuits enables the application of data-driven network reconstruction techniques to characterize stage interactions and identify faults and critical circuit parameters efficiently. Employing graphical modeling techniques and Wiener filtering, we demonstrate that the network structure can be reconstructed solely from voltage time-series measurements sampled at specified points in the circuit. The efficacy of these network reconstruction methods in multistage amplifiers is demonstrated through extensive simulations involving multiple amplifier circuits in Cadence, as well as experimental results on physical hardware. The ability to infer network structure directly from measurement data offers designers and users efficient tools to design, analyze, and debug amplifier circuits. To demonstrate the utility of network reconstruction in multistage amplifier circuits, a fault diagnosis method leveraging these techniques is presented.
- [13] arXiv:2508.04978 [Chinese pdf, pdf, html, other]
Title: Localized Kernel Methods for Signal Processing
Comments: PhD dissertation
Subjects: Signal Processing (eess.SP)
This dissertation presents two signal processing methods that use specially designed localized kernels for parameter recovery under noisy conditions. The first method addresses the estimation of frequencies and amplitudes in multidimensional exponential models. It utilizes localized trigonometric polynomial kernels to detect the multivariate frequencies, followed by a more detailed parameter estimation. We compare our method with MUSIC and ESPRIT, which are classical subspace-based algorithms widely used for estimating the parameters of exponential signals. In the univariate case, the method outperforms MUSIC and ESPRIT under low signal-to-noise ratios. For the multivariate case, we develop a coordinate-wise projection and registration approach that achieves high recovery accuracy using significantly fewer samples than other methods. The second method focuses on separating linear chirp components from time-localized signal segments. A variant of the Signal Separation Operator (SSO) is constructed using a localized kernel. Instantaneous frequency estimates are obtained via FFT-based filtering, then clustered and fitted with piecewise linear regression. The method operates without prior knowledge of the number of components and is shown to recover intersecting and discontinuous chirps at SNR levels as low as -30 dB. Both methods build on localized kernels and efficient FFT-based implementations, and neither requires subspace decomposition or sparsity regularization. Experimental results confirm the robustness and tractability of the proposed approaches across a range of simulated data conditions. Potential extensions include application to nonlinear chirps, adaptive kernel design, and signal classification using extracted features.
- [14] arXiv:2508.04996 [Chinese pdf, pdf, html, other]
Title: REF-VC: Robust, Expressive and Fast Zero-Shot Voice Conversion with Diffusion Transformers
Authors: Yuepeng Jiang, Ziqian Ning, Shuai Wang, Chengjia Wang, Mengxiao Bi, Pengcheng Zhu, Lei Xie, Zhonghua Fu
Subjects: Audio and Speech Processing (eess.AS)
In real-world voice conversion applications, environmental noise in source speech and user demands for expressive output pose critical challenges. Traditional automatic speech recognition (ASR)-based methods ensure noise robustness but suppress prosody, while self-supervised learning (SSL)-based models improve expressiveness but suffer from timbre leakage and noise sensitivity. This paper proposes REF-VC, a noise-robust expressive voice conversion system. Key innovations include: (1) a random erasing strategy to mitigate the information redundancy inherent in SSL features, enhancing noise robustness and expressiveness; (2) implicit alignment inspired by E2TTS to suppress non-essential feature reconstruction; (3) integration of Shortcut Models to accelerate flow matching inference, reducing it to 4 steps. Experimental results demonstrate that our model outperforms baselines such as Seed-VC in zero-shot scenarios on the noisy set, while also performing comparably to Seed-VC on the clean set. In addition, REF-VC also supports singing voice conversion within the same model.
- [15] arXiv:2508.05049 [Chinese pdf, pdf, html, other]
Title: MedMambaLite: Hardware-Aware Mamba for Medical Image Classification
Comments: 21st IEEE Biomedical Circuits and Systems Conference (BioCAS) 2025
Subjects: Image and Video Processing (eess.IV)
AI-powered medical devices have driven the need for real-time, on-device inference such as biomedical image classification. Deployment of deep learning models at the edge is now used for applications such as anomaly detection and classification in medical images. However, achieving this level of performance on edge devices remains challenging due to limitations in model size and computational capacity. To address this, we present MedMambaLite, a hardware-aware Mamba-based model optimized through knowledge distillation for medical image classification. We start with a powerful MedMamba model, integrating a Mamba structure for efficient feature extraction in medical imaging. We make the model lighter and faster in training and inference by modifying the architecture and reducing its redundancies. We then distill its knowledge into a smaller student model by reducing the embedding dimensions. The optimized model achieves 94.5% overall accuracy on 10 MedMNIST datasets. It also reduces the parameter count by 22.8x compared to MedMamba. Deployed on an NVIDIA Jetson Orin Nano, it achieves 35.6 GOPS/J in energy per inference, a 63% improvement over MedMamba.
- [16] arXiv:2508.05055 [Chinese pdf, pdf, html, other]
Title: MOVER: Combining Multiple Meeting Recognition Systems
Subjects: Audio and Speech Processing (eess.AS)
In this paper, we propose Meeting recognizer Output Voting Error Reduction (MOVER), a novel system combination method for meeting recognition tasks. Although there are methods to combine the output of diarization (e.g., DOVER) or automatic speech recognition (ASR) systems (e.g., ROVER), MOVER is the first approach that can combine the outputs of meeting recognition systems that differ in terms of both diarization and ASR. MOVER combines hypotheses with different time intervals and speaker labels through a five-stage process that includes speaker alignment, segment grouping, and word and timing combination, among other steps. Experimental results on the CHiME-8 DASR task and the multi-channel track of the NOTSOFAR-1 task demonstrate that MOVER can successfully combine multiple meeting recognition systems with diverse diarization and recognition outputs, achieving relative tcpWER improvements of 9.55% and 8.51%, respectively, over the state-of-the-art systems on the two tasks.
- [17] arXiv:2508.05062 [Chinese pdf, pdf, html, other]
Title: Probabilistic Alternating Simulations for Policy Synthesis in Uncertain Stochastic Dynamical Systems
Comments: Presented at CDC 2025
Subjects: Systems and Control (eess.SY); Logic in Computer Science (cs.LO); Optimization and Control (math.OC)
A classical approach to formal policy synthesis in stochastic dynamical systems is to construct a finite-state abstraction, often represented as a Markov decision process (MDP). The correctness of these approaches hinges on a behavioural relation between the dynamical system and its abstraction, such as a probabilistic simulation relation. However, probabilistic simulation relations do not suffice when the system dynamics are, next to being stochastic, also subject to nondeterministic (i.e., set-valued) disturbances. In this work, we extend probabilistic simulation relations to systems with both stochastic and nondeterministic disturbances. Our relation, which is inspired by a notion of alternating simulation, generalises existing relations used in several works for verification and policy synthesis. Intuitively, our relation allows reasoning probabilistically over stochastic uncertainty, while reasoning robustly (i.e., adversarially) over nondeterministic disturbances. We experimentally demonstrate the applicability of our relation for policy synthesis in a 4D-state Dubins vehicle.
- [18] arXiv:2508.05080 [Chinese pdf, pdf, html, other]
Title: Power-Constrained and Quantized MIMO-RSMA Systems with Imperfect CSIT: Joint Precoding, Antenna Selection, and Power Control
Comments: 13 pages, 7 figures
Subjects: Signal Processing (eess.SP)
To utilize the full potential of the available power at a base station (BS), we propose a joint precoding, antenna selection, and transmit power control algorithm for a total power budget at the BS. We formulate a sum spectral efficiency (SE) maximization problem for downlink multi-user multiple-input multiple-output (MIMO) rate-splitting multiple access (RSMA) systems with arbitrary-resolution digital-to-analog converters (DACs). We reformulate the problem by defining the ergodic sum SE using the conditional average rate approach to handle imperfect channel state information at the transmitter (CSIT), and by using approximation techniques to make the problem more tractable. Then, we decompose the problem into precoding direction and power control subproblems. We solve the precoding direction subproblem by identifying a superior Lagrangian stationary point, and the power control subproblem using gradient descent. We also propose a complexity-reduction approach that is more suitable for massive MIMO systems. Simulation results not only validate the proposed algorithm but also reveal that when utilizing the full potential of the power budget at the BS, medium-resolution DACs with 8-11 bits may actually be more power-efficient than low-resolution DACs.
- [19] arXiv:2508.05102 [Chinese pdf, pdf, html, other]
Title: Fairness in Dysarthric Speech Synthesis: Understanding Intrinsic Bias in Dysarthric Speech Cloning using F5-TTS
Comments: Accepted at Interspeech 2025
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
Dysarthric speech poses significant challenges in developing assistive technologies, primarily due to the limited availability of data. Recent advances in neural speech synthesis, especially zero-shot voice cloning, facilitate synthetic speech generation for data augmentation; however, they may introduce biases towards dysarthric speech. In this paper, we investigate the effectiveness of the state-of-the-art F5-TTS in cloning dysarthric speech using the TORGO dataset, focusing on intelligibility, speaker similarity, and prosody preservation. We also analyze potential biases using fairness metrics such as Disparate Impact and Parity Difference to assess disparities across dysarthric severity levels. Results show that F5-TTS exhibits a strong bias toward speech intelligibility over speaker and prosody preservation in dysarthric speech synthesis. Insights from this study can help integrate fairness-aware dysarthric speech synthesis, fostering the advancement of more inclusive speech technologies.
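The two fairness metrics named in the abstract have standard definitions in terms of favorable-outcome rates per group. The sketch below uses those standard definitions; how the paper defines a "favorable outcome" and which severity group serves as the reference are not stated in the abstract, so the groups and data here are hypothetical.

```python
def favorable_rate(outcomes):
    """Fraction of samples with a favorable outcome (1 = favorable, 0 = not)."""
    return sum(outcomes) / len(outcomes)

def disparate_impact(group, reference):
    """Ratio of favorable rates; 1.0 indicates parity between the groups."""
    return favorable_rate(group) / favorable_rate(reference)

def parity_difference(group, reference):
    """Difference of favorable rates; 0.0 indicates parity between the groups."""
    return favorable_rate(group) - favorable_rate(reference)

# Hypothetical binary outcomes, e.g. "cloned utterance judged acceptable".
severe = [1, 0, 0, 1, 0]   # favorable rate 0.4
mild = [1, 1, 1, 0, 1]     # favorable rate 0.8

di = disparate_impact(severe, mild)
spd = parity_difference(severe, mild)
```

With these inputs the ratio is 0.5 and the difference is -0.4, both indicating that the hypothetical "severe" group receives favorable outcomes less often than the reference group.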
- [20] arXiv:2508.05142 [Chinese pdf, pdf, html, other]
Title: Digital Twin Channel-Aided CSI Prediction: A Environment-based Subspace Extraction Approach for Achieving Low Overhead and Robustness
Subjects: Signal Processing (eess.SP)
To meet the robust and high-speed communication requirements of the sixth-generation (6G) mobile communication system in complex scenarios, sensing- and artificial intelligence (AI)-based digital twin channel (DTC) techniques have become a promising approach to reducing system overhead. In this paper, we propose an environment-specific channel subspace basis (EB)-aided partial-to-whole channel state information (CSI) prediction method (EB-P2WCP) for realizing DTC-enabled low-overhead channel prediction. Specifically, the EB, which is extracted from the digital twin map, characterizes the static properties of the electromagnetic environment and serves as prior environmental information for the prediction task. Then, we fuse the EB with real-time estimated local CSI to predict the entire spatial-frequency domain channel for both the present and future time instances. To this end, an EB-based partial-to-whole CSI prediction network (EB-P2WNet) is designed to achieve a robust channel prediction scheme in various complex scenarios. Simulation results indicate that incorporating the EB provides significant benefits under low signal-to-noise ratio and low pilot ratio conditions, achieving a reduction of up to 50% in pilot overhead. Additionally, the proposed method maintains robustness against multi-user interference, tolerating 3-meter localization errors with only a 0.5 dB NMSE increase, and predicts CSI for the next channel coherence time within 1.3 milliseconds.
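The reported 0.5 dB figure refers to the normalized mean squared error (NMSE) between the true and predicted channel. As a small illustration of how that metric is conventionally computed in decibels (the channel values below are made up, and the function name is an assumption):

```python
import math

def nmse_db(h_true, h_pred):
    """NMSE in dB: 10*log10(sum|h - h_hat|^2 / sum|h|^2) over complex channel entries."""
    err = sum(abs(a - b) ** 2 for a, b in zip(h_true, h_pred))
    ref = sum(abs(a) ** 2 for a in h_true)
    return 10 * math.log10(err / ref)

# Hypothetical channel coefficients and a prediction with a uniform 10% amplitude error.
h_true = [1 + 1j, 2 - 1j, -1 + 0.5j]
h_pred = [0.9 * h for h in h_true]

value = nmse_db(h_true, h_pred)  # error energy is 1% of the channel energy, about -20 dB
```

On this scale a 0.5 dB increase corresponds to the error energy growing by a factor of roughly 1.12.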
- [21] arXiv:2508.05149 [Chinese pdf, pdf, html, other]
Title: Speech LLMs in Low-Resource Scenarios: Data Volume Requirements and the Impact of Pretraining on High-Resource Languages
Comments: Accepted at Interspeech 2025. 5 pages, 2 figures, 3 tables
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Large language models (LLMs) have demonstrated potential in handling spoken inputs for high-resource languages, reaching state-of-the-art performance in various tasks. However, their applicability is still less explored in low-resource settings. This work investigates the use of Speech LLMs for low-resource Automatic Speech Recognition using the SLAM-ASR framework, where a trainable lightweight projector connects a speech encoder and an LLM. First, we assess training data volume requirements to match Whisper-only performance, re-emphasizing the challenges of limited data. Second, we show that leveraging mono- or multilingual projectors pretrained on high-resource languages reduces the impact of data scarcity, especially with small training sets. Using multilingual LLMs (EuroLLM, Salamandra) with whisper-large-v3-turbo, we evaluate performance on several public benchmarks, providing insights for future research on optimizing Speech LLMs for low-resource languages and multilinguality.
- [22] arXiv:2508.05163 [Chinese pdf, pdf, html, other]
Title: Preparing for the worst: Long-term and short-term weather extremes in resource adequacy assessment
Subjects: Systems and Control (eess.SY); Optimization and Control (math.OC)
Security of supply is a common and important concern when integrating renewables in net-zero power systems. Extreme weather affects both demand and supply leading to power system stress; in Europe this stress spreads continentally beyond the meteorological root cause. We use an approach based on shadow prices to identify periods of elevated stress called system-defining events and analyse their impact on the power system. By classifying different types of system-defining events, we identify challenges to power system operation and planning. Crucially, we find the need for sufficient resilience back-up (power) capacities whose financial viability is precarious due to weather variability. Furthermore, we disentangle short- and long-term resilience challenges with distinct metrics and stress tests to incorporate both into future energy modelling assessments. Our methodology and implementation in the open model PyPSA-Eur can be re-applied to other systems and help researchers and policymakers in building more resilient and adequate energy systems.
- [23] arXiv:2508.05168 [Chinese pdf, pdf, html, other]
Title: Beyond Pixels: Medical Image Quality Assessment with Implicit Neural Representations
Comments: Accepted at the 16th Machine Learning in Medical Imaging (MLMI 2025) workshop
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Artifacts pose a significant challenge in medical imaging, impacting diagnostic accuracy and downstream analysis. While image-based approaches for detecting artifacts can be effective, they often rely on preprocessing methods that can lead to information loss and on memory-intensive medical images, thereby limiting the scalability of classification models. In this work, we propose the use of implicit neural representations (INRs) for image quality assessment. INRs provide a compact and continuous representation of medical images, naturally handling variations in resolution and image size while reducing memory overhead. We develop deep weight space networks, graph neural networks, and relational attention transformers that operate on INRs to achieve image quality assessment. Our method is evaluated on the ACDC dataset with synthetically generated artifact patterns, demonstrating its effectiveness in assessing image quality while achieving similar performance with fewer parameters.
- [24] arXiv:2508.05177 [Chinese pdf, pdf, html, other]
Title: Overview of Controllability Definitions in Supervisory Control Theory
Comments: 18 pages
Subjects: Systems and Control (eess.SY); Formal Languages and Automata Theory (cs.FL)
In the field of supervisory control theory, the literature often proposes different definitions for the same concept, making it difficult to understand how these definitions are related. This is definitely so for the fundamental notion of controllability of a supervisor w.r.t. a plant. This paper lists definitions of controllability found in the literature and studies their relationships in settings of both deterministic and nondeterministic automata. In the general context, where both the supervisor and the plant are allowed to be nondeterministic, the notions of controllability as described by Flordal and Malik, and uncontrollable event admissibility by Kushi and Takai are equivalent. These are also the only notions that imply the traditional notion of (language) controllability. From a practical perspective, one is often more interested in controllability of a supervised plant w.r.t. a plant. In this context, in addition to the previous two controllability notions, state controllability by Zhou et al. implies language controllability.
- [25] arXiv:2508.05204 [Chinese pdf, pdf, html, other]
Title: Optimization of Liquid Lens-based Imaging Receiver for MIMO VLC Systems
Comments: This paper has been accepted for presentation at the IEEE Global Communications Conference (GLOBECOM 2025), to be held in Taipei, Taiwan, on 8-12 December 2025. arXiv admin note: substantial text overlap with arXiv:2503.10316
Subjects: Signal Processing (eess.SP)
In this paper, a liquid lens-based imaging receiver is proposed for multiple-input multiple-output (MIMO) visible light communication (VLC) systems. By dynamically adjusting the focal length and orientation angles of the liquid lens, the spatial correlation between MIMO channel gains is reduced, leading to enhanced bit-error rate (BER) performance. Unlike static lenses, liquid lenses offer adaptability in dynamic conditions, including user mobility and random receiver orientation. An accurate mathematical framework is developed to model the channel gains of the proposed system, and an optimization problem is formulated to minimize its BER. Due to the complexity of the resulting channel model, two lens adjustment schemes, namely, ($i$) the CLS scheme, and ($ii$) the VULO scheme are introduced. Numerical results demonstrate that the proposed liquid lens-based system offers substantial BER improvements over conventional static lens-based receivers across a wide range of random receiver orientation conditions. Specifically, at a random receiver orientation variance of $10^{\circ}$, the BER is improved from $4\times 10^{-2}$ to $5\times 10^{-4}$ by employing the proposed liquid lens.
- [26] arXiv:2508.05226 [Chinese pdf, pdf, html, other]
Title: Deep Learning Based Dynamic Environment Reconstruction for Vehicular ISAC Scenarios
Subjects: Signal Processing (eess.SP)
Integrated Sensing and Communication (ISAC) technology plays a critical role in future intelligent transportation systems by enabling vehicles to perceive and reconstruct the surrounding environment through the reuse of wireless signals, thereby reducing or even eliminating the need for additional sensors such as LiDAR or radar. However, existing ISAC-based reconstruction methods often lack the ability to track dynamic scenes with sufficient accuracy and temporal consistency, limiting their real-world applicability. To address this limitation, we propose a deep learning-based framework for vehicular environment reconstruction using ISAC channels. We first establish a joint channel-environment dataset based on multi-modal measurements from real-world urban street scenarios. Then, a multistage deep learning network is developed to reconstruct the environment. Specifically, a scene decoder identifies the environmental context, such as buildings and trees; a cluster center decoder predicts coarse spatial layouts by localizing dominant scattering centers; and a point cloud decoder recovers the fine-grained geometry and structure of the surrounding environment. Experimental results demonstrate that the proposed method achieves high-quality dynamic environment reconstruction with a Chamfer Distance of 0.29 and an F-Score@1% of 0.87. In addition, complexity analysis demonstrates the efficiency and practical applicability of the method in real-time scenarios. This work provides a pathway toward low-cost environment reconstruction based on ISAC for future intelligent transportation.
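For readers unfamiliar with the two point-cloud metrics quoted above, the following sketch shows one common way to compute them. Several Chamfer Distance variants exist (squared or unsquared distances, sums or means), and the abstract does not say which one is used or what the 1% threshold is relative to, so the variant, the threshold `tau`, and the toy point clouds here are assumptions.

```python
import math

def nearest(point, cloud):
    """Euclidean distance from a point to its nearest neighbour in a cloud."""
    return min(math.dist(point, q) for q in cloud)

def chamfer_distance(a, b):
    """Symmetric Chamfer distance: mean nearest-neighbour distance in both directions."""
    return (sum(nearest(p, b) for p in a) / len(a)
            + sum(nearest(q, a) for q in b) / len(b))

def f_score(pred, truth, tau):
    """Harmonic mean of precision and recall at distance threshold tau."""
    precision = sum(nearest(p, truth) <= tau for p in pred) / len(pred)
    recall = sum(nearest(q, pred) <= tau for q in truth) / len(truth)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy 3D point clouds: two predicted points are close to the truth, one is far off.
truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
pred = [(0.0, 0.0, 0.1), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]

cd = chamfer_distance(pred, truth)
fs = f_score(pred, truth, tau=0.2)
```

The brute-force nearest-neighbour search is quadratic in the number of points; real evaluations on dense point clouds would typically use a spatial index such as a k-d tree.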
- [27] arXiv:2508.05240 [Chinese pdf, pdf, html, other]
Title: Coarse-to-Fine Joint Registration of MR and Ultrasound Images via Imaging Style Transfer
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
We developed a pipeline for registering pre-surgery Magnetic Resonance (MR) images and post-resection Ultrasound (US) images. Our approach leverages unpaired style transfer using 3D CycleGAN to generate synthetic T1 images, thereby enhancing registration performance. Additionally, our registration process employs both affine and local deformable transformations for a coarse-to-fine registration. The results demonstrate that our approach improves the consistency between MR and US image pairs in most cases.
- [28] arXiv:2508.05250 [Chinese pdf, pdf, html, other]
Title: Privacy Disclosure of Similarity in Speech and Language Processing
Subjects: Audio and Speech Processing (eess.AS)
Speaker, author, and other biometric identification applications often compare a sample's similarity to a database of templates to determine the identity. Given that data may be noisy and similarity measures can be inaccurate, such a comparison may not reliably identify the true identity as the most similar. Still, even the similarity rank based on an inaccurate similarity measure can disclose private information about the true identity. We propose a methodology for quantifying the privacy disclosure of such a similarity rank by estimating its probability distribution. It is based on determining the histogram of the similarity rank of the true speaker or, when data is scarce, modeling the histogram with the beta-binomial distribution. We express the disclosure in terms of entropy (bits), such that the disclosures from independent features are additive. Our experiments demonstrate that all tested speaker and author characterizations contain personally identifying information (PII) that can aid in identification, with embeddings from speaker recognition algorithms containing the most information, followed by phone embeddings, linguistic embeddings, and fundamental frequency. Our initial experiments show that the disclosure of PII increases with the length of test samples, but it is bounded by the length of database templates. The provided metric, similarity rank disclosure, offers a way to compare the disclosure of PII between biometric features and to merge them to aid identification. It can thus aid in the holistic evaluation of threats to privacy in speech and other biometric technologies.
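The abstract does not give the exact formula for similarity rank disclosure. One plausible reading, used here purely as an illustrative assumption, is that the disclosure is the reduction in uncertainty about the true identity's rank relative to a uniform rank over all N templates, computed from the rank histogram:

```python
import math

def rank_entropy_bits(hist):
    """Shannon entropy (bits) of the empirical rank distribution given by a histogram."""
    total = sum(hist)
    return -sum((c / total) * math.log2(c / total) for c in hist if c > 0)

def disclosure_bits(hist):
    """Assumed definition: log2(N) minus the rank entropy, for N possible ranks."""
    return math.log2(len(hist)) - rank_entropy_bits(hist)

# Hypothetical histogram: how often the true speaker landed at rank 1..8.
hist = [8, 4, 2, 2, 0, 0, 0, 0]

bits = disclosure_bits(hist)          # 3 - 1.75 = 1.25 bits under this definition
uninformative = disclosure_bits([1] * 8)  # uniform ranks disclose nothing
```

Under this reading, a feature whose ranks are uniformly distributed discloses 0 bits, and disclosures from statistically independent features can be summed, which matches the additivity property mentioned in the abstract.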
- [29] arXiv:2508.05279 [Chinese pdf, pdf, html, other]
Title: Passive nonlinear FIR filters for data-driven control
Comments: 15 pages, 12 figures
Subjects: Systems and Control (eess.SY)
We propose a new class of passive nonlinear finite impulse response operators. This class is constructed by the action of finite impulse response filters in a lifted space. This allows for efficient control synthesis through constrained optimization. Closed-loop performance is taken into account through least-squares fitting, based on the theory of virtual reference feedback tuning. Passivity is established through efficient linear constraints, based on sampling in the frequency domain. Because of passivity, this class of operators is particularly suited for the control of physical systems, such as electromechanical systems.
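To illustrate why frequency-domain sampling yields linear constraints, consider the simplified case of a single linear FIR filter with real coefficients h_0, ..., h_n (the paper's lifted nonlinear construction is not reproduced here, so this is a sketch under that simplifying assumption). Passivity requires Re H(e^{jω}) = Σ_k h_k cos(ωk) ≥ 0, which is linear in the coefficients at every fixed frequency:

```python
import math

def real_part(h, omega):
    """Re H(e^{j*omega}) for FIR coefficients h; linear in h for fixed omega."""
    return sum(hk * math.cos(omega * k) for k, hk in enumerate(h))

def passive_on_grid(h, n_samples=256):
    """Check Re H >= 0 on a uniform grid over [0, pi] (one linear constraint per sample)."""
    return all(real_part(h, math.pi * i / (n_samples - 1)) >= 0.0
               for i in range(n_samples))

ok = passive_on_grid([1.0, 0.5])    # Re H = 1 + 0.5*cos(w) >= 0.5 everywhere
bad = passive_on_grid([0.5, 1.0])   # Re H = 0.5 + cos(w) is negative near w = pi
```

Because the check only covers the sampled frequencies, a grid test like this is a necessary condition; the grid density needed to guarantee passivity between samples is a separate question not addressed by this sketch.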
- [30] arXiv:2508.05293 [Chinese pdf, pdf, html, other]
Title: Investigation of Speech and Noise Latent Representations in Single-channel VAE-based Speech Enhancement
Comments: 5 pages, 5 figures
Subjects: Audio and Speech Processing (eess.AS)
Recently, a variational autoencoder (VAE)-based single-channel speech enhancement system using Bayesian permutation training has been proposed, which uses two pretrained VAEs to obtain latent representations for speech and noise. Based on these pretrained VAEs, a noisy VAE learns to generate speech and noise latent representations from noisy speech for speech enhancement. Modifying the pretrained VAE loss terms affects the pretrained speech and noise latent representations. In this paper, we investigate how these different representations affect speech enhancement performance. Experiments on the DNS3, WSJ0-QUT, and VoiceBank-DEMAND datasets show that a latent space where speech and noise representations are clearly separated significantly improves performance over standard VAEs, which produce overlapping speech and noise representations.
- [31] arXiv:2508.05380 [Chinese pdf, pdf, html, other]
Title: Unifying Common Signal Analyses with Instantaneous Time-Frequency Atoms
Subjects: Signal Processing (eess.SP)
In previous work, we presented a general framework for instantaneous time-frequency analysis but did not provide any specific details of how to compute a particular instantaneous spectrum (IS). In this work, we use instantaneous time-frequency atoms to obtain an IS associated with common signal analyses: time domain analysis, frequency domain analysis, fractional Fourier transform, synchrosqueezed short-time Fourier transform, and synchrosqueezed short-time fractional Fourier transform. By doing so, we demonstrate how the general framework can be used to unify these analyses, and we develop closed-form expressions for the corresponding ISs. This is accomplished by viewing these analyses as decompositions into AM-FM components and recognizing that each uses a specialized (or limiting) form of a quadratic chirplet as a template during analysis. With a two-parameter quadratic chirplet, we can organize these ISs into a 2D continuum, with points in the plane corresponding to a decomposition related to one of the signal analyses. Finally, using several example signals, we compute the ISs for the various analyses in closed form.
- [32] arXiv:2508.05391 [Chinese pdf, pdf, html, other]
Title: Artificial Intelligence-Based Classification of Spitz Tumors
Ruben T. Lucassen, Marjanna Romers, Chiel F. Ebbelaar, Aia N. Najem, Donal P. Hayes, Antien L. Mooyaart, Sara Roshani, Liliane C. D. Wynaendts, Nikolas Stathonikos, Gerben E. Breimer, Anne M. L. Jansen, Mitko Veta, Willeke A. M. Blokx
Comments: 19 pages, 2 figures, 6 tables, 6 supplementary tables
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Spitz tumors are diagnostically challenging due to overlap in atypical histological features with conventional melanomas. We investigated to what extent AI models, using histological and/or clinical features, can: (1) distinguish Spitz tumors from conventional melanomas; (2) predict the underlying genetic aberration of Spitz tumors; and (3) predict the diagnostic category of Spitz tumors. The AI models were developed and validated using a dataset of 393 Spitz tumors and 379 conventional melanomas. Predictive performance was measured using the AUROC and the accuracy. The performance of the AI models was compared with that of four experienced pathologists in a reader study. Moreover, a simulation experiment was conducted to investigate the impact of implementing AI-based recommendations for ancillary diagnostic testing on the workflow of the pathology department. The best AI model based on UNI features reached an AUROC of 0.95 and an accuracy of 0.86 in differentiating Spitz tumors from conventional melanomas. The genetic aberration was predicted with an accuracy of 0.55, compared to 0.25 for random guessing. The diagnostic category was predicted with an accuracy of 0.51, compared to a chance-level accuracy of 0.33. On all three tasks, the AI models performed better than the four pathologists, although differences were not statistically significant for most individual comparisons. Based on the simulation experiment, implementing AI-based recommendations for ancillary diagnostic testing could reduce material costs, turnaround times, and examinations. In conclusion, the AI models achieved a strong predictive performance in distinguishing between Spitz tumors and conventional melanomas. On the more challenging tasks of predicting the genetic aberration and the diagnostic category of Spitz tumors, the AI models performed better than random chance.
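As a reminder of what the AUROC reported above measures, it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. A minimal sketch with made-up scores (the labels "positive" and "negative" and all values are hypothetical):

```python
def auroc(scores_pos, scores_neg):
    """AUROC via pairwise comparison: P(score_pos > score_neg) + 0.5 * P(tie)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Hypothetical model scores for positive and negative cases.
pos = [0.9, 0.8, 0.4]
neg = [0.7, 0.3, 0.2]

area = auroc(pos, neg)  # 8 of the 9 positive/negative pairs are ranked correctly
acc = accuracy([1, 1, 0, 0], [1, 0, 0, 0])
```

An AUROC of 0.5 corresponds to random ranking, whereas chance-level accuracy depends on the number and balance of classes, which is why the abstract quotes separate chance levels for each classification task.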
- [33] arXiv:2508.05476 [Chinese pdf, pdf, html, other]
Title: MM2CT: MR-to-CT translation for multi-modal image fusion with mamba
Subjects: Image and Video Processing (eess.IV)
Magnetic resonance (MR)-to-computed tomography (CT) translation offers significant advantages, including the elimination of radiation exposure associated with CT scans and the mitigation of imaging artifacts caused by patient motion. Existing approaches are based on single-modality MR-to-CT translation, with limited research exploring multimodal fusion. To address this limitation, we introduce the Multi-modal MR-to-CT (MM2CT) translation method, an innovative Mamba-based framework for multi-modal medical image synthesis that leverages T1- and T2-weighted MRI data. Mamba effectively overcomes the limited local receptive field of CNNs and the high computational complexity of Transformers. MM2CT leverages this advantage to maintain long-range dependency modeling capabilities while achieving multi-modal MR feature integration. Additionally, we incorporate a dynamic local convolution module and a dynamic enhancement module to improve MRI-to-CT synthesis. Experiments on a public pelvis dataset demonstrate that MM2CT achieves state-of-the-art performance in terms of Structural Similarity Index Measure (SSIM) and Peak Signal-to-Noise Ratio (PSNR). Our code is publicly available at https://github.com/Gots-ch/MM2CT.
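Of the two image-quality metrics named above, PSNR has a simple closed form. The sketch below computes it for flattened intensity lists; the `data_range` value and the sample intensities are assumptions for illustration and say nothing about the intensity scaling used in the paper.

```python
import math

def psnr(reference, synthesized, data_range):
    """Peak signal-to-noise ratio in dB: 10*log10(data_range^2 / MSE)."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, synthesized)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(data_range ** 2 / mse)

# Hypothetical flattened intensities of a reference CT patch and a synthesized one.
reference = [0, 100, 200, 255]
synthesized = [0, 110, 190, 255]

value = psnr(reference, synthesized, data_range=255)  # MSE = 50
```

Because PSNR depends on the chosen data range, values are only comparable between methods that use the same intensity normalization.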
- [34] arXiv:2508.05479 [Chinese pdf, pdf, other]
Title: Sub-μW Battery-Less and Oscillator-Less Wi-Fi Backscattering Transmitter Reusing RF Signal for Harvesting, Communications, and Motion Detection
Subjects: Signal Processing (eess.SP); Systems and Control (eess.SY)
In this paper, a sub-μW-power 802.11b backscattering transmitter is presented that reuses the same incident wave for three purposes: RF harvesting, backscattering communications, and position/motion sensing. Removing the battery and any off-chip motion sensor (e.g., MEMS) enables an unprecedented level of miniaturization and ubiquity, unrestricted device lifespan, and low fabrication and maintenance cost. The μW power wall for WiFi transmitters is broken for the first time via local oscillator elimination, achieved by extracting its frequency through second-order intermodulation of a two-tone incident wave. The two-tone scheme also enables a cumulative harvesting/transmission/sensing sensitivity down to Pmin = -19 dBm. Position/motion sensing is enabled by using the harvested voltage as a proxy for the Received Signal Strength (RSS), which allows the chip location to be sensed with respect to the tone generator(s) shared across tags in indoor neighborhoods.
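The frequency extraction mentioned above can be understood from a short trigonometric identity. Assuming an idealized square-law nonlinearity (an assumption for illustration; the actual circuit is not described in the abstract) and a two-tone input with amplitudes $A_1, A_2$ and angular frequencies $\omega_1, \omega_2$:

```latex
\begin{align}
x(t) &= A_1\cos(\omega_1 t) + A_2\cos(\omega_2 t), \\
x^2(t) &= \tfrac{A_1^2}{2}\bigl[1+\cos(2\omega_1 t)\bigr]
        + \tfrac{A_2^2}{2}\bigl[1+\cos(2\omega_2 t)\bigr] \\
       &\quad + A_1 A_2\bigl[\cos\bigl((\omega_1-\omega_2)t\bigr)
        + \cos\bigl((\omega_1+\omega_2)t\bigr)\bigr].
\end{align}
```

The second-order term therefore contains a component at the difference frequency $\omega_1-\omega_2$, which is set entirely by the tone spacing of the incident wave and can be low-pass filtered to serve as a frequency reference without a local oscillator.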
- [35] arXiv:2508.05495 [Chinese pdf, pdf, html, other]
Title: A 20-Year Retrospective on Power and Thermal Modeling and Management
Subjects: Systems and Control (eess.SY)
As processor performance advances, increasing power densities and complex thermal behaviors threaten both energy efficiency and system reliability. This survey covers more than two decades of research on power and thermal modeling and management in modern processors. We start by comparing analytical, regression-based, and neural network-based techniques for power estimation, then review thermal modeling methods, including finite element, finite difference, and data-driven approaches. Next, we categorize dynamic runtime management strategies that balance performance, power consumption, and reliability. Finally, we conclude with a discussion of emerging challenges and promising research directions.
- [36] arXiv:2508.05499 [Chinese pdf, pdf, other]
-
Title: 0.6-V, uW-Power 4-Stage OTA with Minimal Components and 100X Load Range
Subjects: Signal Processing (eess.SP)
A four-stage operational transconductance amplifier (OTA) for ultra-low-power applications is introduced in this paper. The proposed circuit, inclusive of frequency compensation, requires a minimal transistor count and minimal passives, overcoming the traditionally difficult compensation of 4-stage OTAs and bringing it back to the simplicity of 3-stage OTAs. At the same time, the proposed circuit achieves high power efficiency, as evidenced by the >3.7X (>11.3X) improvement in the large-signal (small-signal) power efficiency figure of merit FOML (FOMS) compared to prior 4-stage OTAs (sub-1 V multi-stage OTAs). Thanks to the lower sensitivity of the phase margin to the load capacitance, the proposed OTA remains stable under a wide range of loads (double-sided as in any 3-4-stage OTA), achieving a max/min load capacitance ratio of >100X.
- [37] arXiv:2508.05583 [Chinese pdf, pdf, html, other]
-
Title: Research on integrated intelligent energy management system based on big data analysis and machine learning
Comments: 6 pages, 4 figures, conference
Subjects: Systems and Control (eess.SY)
The application of big data is one of the significant features of integrated smart energy. Applying it to the document management of integrated smart energy projects is of great significance for improving the efficiency of project management and control. This article first discusses the benefits and challenges of implementing big data analysis in the document management and control of integrated smart energy projects. An implementation framework for big data analysis in integrated smart energy project document management is then developed, and a method for optimizing document management efficiency through machine learning is proposed. Using the various types of data and information generated during the project document management process, the efficiency of document control across the entire process is optimized with three different machine learning methods. Fitting a penalized linear regression model shows that, when enough data is available as a training set, the model accuracy can exceed 95%. By using big data analysis and machine learning to analyze the efficiency of integrated smart energy project document management, it is possible to track the entire process of project documents and optimize business processes, thereby strengthening project construction control and improving project construction efficiency.
- [38] arXiv:2508.05620 [Chinese pdf, pdf, html, other]
-
Title: Error Bounds for Radial Network Topology Learning from Quantized Measurements
Authors: Samuel Talkington, Aditya Rangarajan, Pedro A. de Alcântara, Line Roald, Daniel K. Molzahn, Daniel R. Fuhrmann
Comments: 3 pages, 2 figures
Subjects: Systems and Control (eess.SY)
We probabilistically bound the error of a solution to a radial network topology learning problem where both connectivity and line parameters are estimated. In our model, data errors are introduced by the precision of the sensors, i.e., quantization. This produces a nonlinear measurement model that embeds the operation of the sensor communication network into the learning problem, expanding beyond the additive noise models typically seen in power system estimation algorithms. We show that the error of a learned radial network topology is proportional to the quantization bin width and grows sublinearly in the number of nodes, provided that the number of samples per node is logarithmic in the number of nodes.
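To make the quantized measurement model concrete, here is a minimal sketch of a uniform quantizer, assuming a round-to-nearest rule with a fixed bin width. The function names and the sample voltages are hypothetical and are not taken from the paper. It only illustrates that the per-measurement error is at most half the bin width; it does not reproduce the paper's topology error bound.

```python
import random


def quantize(x: float, bin_width: float) -> float:
    """Round x to the nearest multiple of bin_width (uniform quantizer)."""
    return bin_width * round(x / bin_width)


def max_quantization_error(samples, bin_width):
    """Largest absolute error introduced by quantizing the given samples."""
    return max(abs(quantize(x, bin_width) - x) for x in samples)


# Hypothetical per-unit voltage magnitudes near 1.0, for illustration only.
random.seed(0)
samples = [random.uniform(0.9, 1.1) for _ in range(1000)]

for bin_width in (0.1, 0.01, 0.001):
    err = max_quantization_error(samples, bin_width)
    print(f"bin width {bin_width}: max error {err:.5f}")
```

Shrinking the bin width by a factor of ten shrinks the worst-case measurement error by the same factor, which is the per-sample behavior behind an error bound that is proportional to the bin width.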
New submissions (showing 38 of 38 entries)
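- [39] arXiv:2412.14762 (cross-list from cs.RO) [Chinese pdf, pdf, html, other]
-
Title: A General Control Method for Human-Robot Integration
Comments: Submitted to the International Journal of Robotics Research (IJRR), under review since October 2024, 16 pages, 30 figures
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)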
This paper introduces a new generalized control method designed for multi-degrees-of-freedom devices to help people with limited motion capabilities in their daily activities. The challenge lies in finding the most suitable control interface strategy to effectively map the user's motions in a low-dimensional space to complex robotic assistive devices, ranging from prostheses and supernumerary limbs up to remote robotic avatars. The goal is a system that integrates the human and robotic parts into a single system, moving so as to reach the targets decided by the human while autonomously reducing the user's effort and discomfort. We present a framework to control general multi-DoF assistive systems, which translates user-performed compensatory motions into the robot commands necessary for reaching targets while canceling or reducing compensation. The framework extends to prostheses with any number of DoFs up to full robotic avatars, regarded here as a sort of whole-body prosthesis for a person who sees the robot as an artificial extension of their own body, without a physical link but with sensory-motor integration. We have validated and applied this control strategy through tests encompassing simulated scenarios and real-world trials involving a virtual twin of the robotic parts (prosthesis and robot) and a physical humanoid avatar.
- [40] arXiv:2508.02903 (cross-list from cs.CV) [Chinese pdf, pdf, html, other]
-
Title: RDDPM: Robust Denoising Diffusion Probabilistic Model for Unsupervised Anomaly Segmentation
Comments: 10 pages, 5 figures. Accepted to the ICCV 2025 industrial vision inspection workshop (VISION)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Recent advancements in diffusion models have demonstrated significant success in unsupervised anomaly segmentation. For anomaly segmentation, these models are first trained on normal data; then, an anomalous image is noised to an intermediate step, and the normal image is reconstructed through backward diffusion. Unlike traditional statistical methods, diffusion models do not rely on specific assumptions about the data or target anomalies, making them versatile for use across different domains. However, diffusion models typically assume access to normal data for training, limiting their applicability in realistic settings. In this paper, we propose novel robust denoising diffusion models for scenarios where only contaminated (i.e., a mix of normal and anomalous) unlabeled data is available. By casting maximum likelihood estimation of the data as a nonlinear regression problem, we reinterpret the denoising diffusion probabilistic model through a regression lens. Using robust regression, we derive a robust version of denoising diffusion probabilistic models. Our novel framework offers flexibility in constructing various robust diffusion models. Our experiments show that our approach outperforms current state-of-the-art diffusion models for unsupervised anomaly segmentation when only contaminated data is available, achieving up to 8.08% higher AUROC and 10.37% higher AUPRC than existing diffusion-based approaches on MVTec datasets. The implementation code is available at: https://github.com/mehrdadmoradi124/RDDPM
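The regression view can be illustrated with a small sketch. It assumes the Huber loss as the robust regression criterion, which is a stand-in chosen for illustration; the abstract does not say which robust loss the authors use. The residual values and the `delta` threshold below are hypothetical. The point shown is only that a robust loss grows linearly for large residuals, so a few contaminated samples pull the fit less than they do under a squared loss.

```python
def squared_loss(residual: float) -> float:
    """Standard squared-error loss, as in ordinary least-squares regression."""
    return 0.5 * residual * residual


def huber_loss(residual: float, delta: float = 1.0) -> float:
    """Huber loss: quadratic near zero, linear beyond the threshold delta."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


# Hypothetical residuals: mostly small, with one contaminated (anomalous) sample.
residuals = [0.1, -0.2, 0.05, 0.15, 6.0]

mean_squared = sum(squared_loss(r) for r in residuals) / len(residuals)
mean_huber = sum(huber_loss(r) for r in residuals) / len(residuals)
print(f"mean squared loss: {mean_squared:.3f}")
print(f"mean Huber loss:   {mean_huber:.3f}")
```

In a diffusion setting the residual would be the difference between the predicted and the true noise; that mapping is also an assumption of this sketch rather than a detail confirmed by the abstract.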
- [41] arXiv:2508.04714 (cross-list from cs.AI) [Chinese pdf, pdf, html, other]
-
Title: Prescriptive Agents based on Rag for Automated Maintenance (PARAM)
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Signal Processing (eess.SP)
Industrial machinery maintenance requires timely intervention to prevent catastrophic failures and optimize operational efficiency. This paper presents an integrated Large Language Model (LLM)-based intelligent system for prescriptive maintenance that extends beyond traditional anomaly detection to provide actionable maintenance recommendations. Building upon our prior LAMP framework for numerical data analysis, we develop a comprehensive solution that combines bearing vibration frequency analysis with multi-agentic generation for intelligent maintenance planning. Our approach serializes bearing vibration data (BPFO, BPFI, BSF, FTF frequencies) into natural language for LLM processing, enabling few-shot anomaly detection with high accuracy. The system classifies fault types (inner race, outer race, ball/roller, cage faults) and assesses severity levels. A multi-agentic component processes maintenance manuals using vector embeddings and semantic search, while also conducting web searches to retrieve comprehensive procedural knowledge and access up-to-date maintenance practices for more accurate and in-depth recommendations. The Gemini model then generates structured maintenance recommendations that include immediate actions, inspection checklists, corrective measures, parts requirements, and timeline specifications. Experimental validation on bearing vibration datasets demonstrates effective anomaly detection and contextually relevant maintenance guidance. The system successfully bridges the gap between condition monitoring and actionable maintenance planning, providing industrial practitioners with intelligent decision support. This work advances the application of LLMs in industrial maintenance, offering a scalable framework for prescriptive maintenance across machinery components and industrial sectors.
- [42] arXiv:2508.04721 (cross-list from cs.SD) [Chinese pdf, pdf, html, other]
-
Title: Toward Low-Latency End-to-End Voice Agents for Telecommunications Using Streaming ASR, Quantized LLMs, and Real-Time TTS
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
We introduce a low-latency telecom AI voice agent pipeline for real-time, interactive telecommunications use, enabling advanced voice AI for call center automation, intelligent IVR (Interactive Voice Response), and AI-driven customer support. The solution is built for telecom, combining four specialized models by NetoAI: TSLAM, a 4-bit quantized Telecom-Specific Large Language Model (LLM); T-VEC, a Telecom-Specific Embedding Model; TTE, a Telecom-Specific Automatic Speech Recognition (ASR) model; and T-Synth, a Telecom-Specific Text-to-Speech (TTS) model. These models enable highly responsive, domain-adapted voice AI agents supporting knowledge-grounded spoken interactions with low latency. The pipeline integrates streaming ASR (TTE), conversational intelligence (TSLAM), retrieval augmented generation (RAG) over telecom documents, and real-time TTS (T-Synth), setting a new benchmark for telecom voice assistants. To evaluate the system, we built a dataset of 500 human-recorded telecom questions from RFCs, simulating real telecom agent queries. This framework allows analysis of latency, domain relevance, and real-time performance across the stack. Results show that TSLAM, TTE, and T-Synth deliver real-time factors (RTF) below 1.0, supporting enterprise, low-latency telecom deployments. These AI agents -- powered by TSLAM, TTE, and T-Synth -- provide a foundation for next-generation telecom AI, enabling automated customer support, diagnostics, and more.
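The real-time factor quoted above is conventionally the processing time divided by the duration of the audio handled, so a value below 1.0 means the component runs faster than real time. The sketch below assumes that conventional definition; the function name and the timing numbers are hypothetical and are not measurements from the paper.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Ratio of compute time to audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Hypothetical timings for illustration only.
asr_rtf = real_time_factor(processing_seconds=1.2, audio_seconds=4.0)
tts_rtf = real_time_factor(processing_seconds=0.9, audio_seconds=3.0)
print(f"ASR RTF: {asr_rtf:.2f}, TTS RTF: {tts_rtf:.2f}")
```

For ASR the audio duration is that of the input utterance, while for TTS it is the duration of the synthesized speech.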
- [43] arXiv:2508.04723 (cross-list from cs.SD) [Chinese pdf, pdf, html, other]
-
Title: Wearable Music2Emotion : Assessing Emotions Induced by AI-Generated Music through Portable EEG-fNIRS Fusion
Authors: Sha Zhao, Song Yi, Yangxuan Zhou, Jiadong Pan, Jiquan Wang, Jie Xia, Shijian Li, Shurong Dong, Gang Pan
Comments: Accepted by ACM MM 2025
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Emotions critically influence mental health, driving interest in music-based affective computing via neurophysiological signals with Brain-computer Interface techniques. While prior studies leverage music's accessibility for emotion induction, three key limitations persist: (1) Stimulus Constraints: Music stimuli are confined to small corpora due to copyright and curation costs, with selection biases from heuristic emotion-music mappings that ignore individual affective profiles. (2) Modality Specificity: Overreliance on unimodal neural data (e.g., EEG) ignores complementary insights from cross-modal signal fusion. (3) Portability Limitation: Cumbersome setups (e.g., 64+ channel gel-based EEG caps) hinder real-world applicability due to procedural complexity and portability barriers. To address these limitations, we propose MEEtBrain, a portable and multimodal framework for emotion analysis (valence/arousal), integrating AI-generated music stimuli with synchronized EEG-fNIRS acquisition via a wireless headband. With MEEtBrain, the music stimuli can be automatically generated by AI on a large scale, eliminating subjective selection biases while ensuring music diversity. We use a portable device that we developed, designed as a lightweight headband with dry electrodes, to collect EEG and fNIRS recordings simultaneously. A 14-hour dataset from 20 participants was collected in the first recruitment to validate the framework's efficacy, with AI-generated music eliciting target emotions (valence/arousal). We are actively expanding our multimodal dataset (44 participants in the latest dataset) and are making it publicly available to promote further research and practical applications. The dataset is available at https://zju-bmi-lab.github.io/ZBra.
- [44] arXiv:2508.04727 (cross-list from q-bio.TO) [Chinese pdf, pdf, html, other]
-
Title: Adaptive k-space Radial Sampling for Cardiac MRI with Reinforcement Learning
Comments: MICCAI 2025 STACOM workshop
Subjects: Tissues and Organs (q-bio.TO); Image and Video Processing (eess.IV); Quantitative Methods (q-bio.QM)
Accelerated Magnetic Resonance Imaging (MRI) requires careful optimization of k-space sampling patterns to balance acquisition speed and image quality. While recent advances in deep learning have shown promise in optimizing Cartesian sampling, the potential of reinforcement learning (RL) for non-Cartesian trajectory optimization remains largely unexplored. In this work, we present a novel RL framework for optimizing radial sampling trajectories in cardiac MRI. Our approach features a dual-branch architecture that jointly processes k-space and image-domain information, incorporating a cross-attention fusion mechanism to facilitate effective information exchange between domains. The framework employs an anatomically-aware reward design and a golden-ratio sampling strategy to ensure uniform k-space coverage while preserving cardiac structural details. Experimental results demonstrate that our method effectively learns optimal radial sampling strategies across multiple acceleration factors, achieving improved reconstruction quality compared to conventional approaches. Code available: https://github.com/Ruru-Xu/RL-kspace-Radial-Sampling
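As background for the golden-ratio strategy mentioned above, the sketch below generates radial spoke angles with the widely used golden-angle increment of 180 degrees divided by the golden ratio (about 111.25 degrees). This is the standard scheme from the radial MRI literature and is assumed here for illustration; the abstract does not specify how the authors combine it with the learned policy, and the function names are hypothetical.

```python
import math

GOLDEN_RATIO = (1 + math.sqrt(5)) / 2
GOLDEN_ANGLE_DEG = 180.0 / GOLDEN_RATIO  # about 111.246 degrees


def golden_angle_spokes(n_spokes: int, start_deg: float = 0.0) -> list:
    """Spoke angles in [0, 180) obtained by repeatedly adding the golden angle."""
    return [(start_deg + k * GOLDEN_ANGLE_DEG) % 180.0 for k in range(n_spokes)]


def largest_gap_deg(angles: list) -> float:
    """Largest angular gap between neighboring spokes, wrapping around at 180."""
    ordered = sorted(angles)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    gaps.append(180.0 - ordered[-1] + ordered[0])
    return max(gaps)


for n in (8, 13, 34):
    spokes = golden_angle_spokes(n)
    print(f"{n} spokes: largest gap {largest_gap_deg(spokes):.2f} deg, "
          f"uniform spacing would be {180.0 / n:.2f} deg")
```

The appeal of this increment is that any prefix of the sequence covers the angular range fairly evenly, so the number of acquired spokes can change without redesigning the trajectory.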
- [45] arXiv:2508.04734 (cross-list from q-bio.QM) [Chinese pdf, pdf, html, other]
-
Title: Cross-Domain Image Synthesis: Generating H&E from Multiplex Biomarker Imaging
Subjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
While multiplex immunofluorescence (mIF) imaging provides deep, spatially-resolved molecular data, integrating this information with the morphological standard of Hematoxylin & Eosin (H&E) can be very important for obtaining complementary information about the underlying tissue. Generating a virtual H&E stain from mIF data offers a powerful solution, providing immediate morphological context. Crucially, this approach enables the application of the vast ecosystem of H&E-based computer-aided diagnosis (CAD) tools to analyze rich molecular data, bridging the gap between molecular and morphological analysis. In this work, we investigate the use of a multi-level Vector-Quantized Generative Adversarial Network (VQGAN) to create high-fidelity virtual H&E stains from mIF images. We rigorously evaluated our VQGAN against a standard conditional GAN (cGAN) baseline on two publicly available colorectal cancer datasets, assessing performance on both image similarity and functional utility for downstream analysis. Our results show that while both architectures produce visually plausible images, the virtual stains generated by our VQGAN provide a more effective substrate for computer-aided diagnosis. Specifically, downstream nuclei segmentation and semantic preservation in tissue classification tasks performed on VQGAN-generated images demonstrate superior performance and agreement with ground-truth analysis compared to those from the cGAN. This work establishes that a multi-level VQGAN is a robust and superior architecture for generating scientifically useful virtual stains, offering a viable pathway to integrate the rich molecular data of mIF into established and powerful H&E-based analytical workflows.
- [46] arXiv:2508.04795 (cross-list from cs.CL) [Chinese pdf, pdf, html, other]
-
Title: Enhancing Dialogue Annotation with Speaker Characteristics Leveraging a Frozen LLM
Comments: Accepted to the 2025 IEEE Automatic Speech Recognition and Understanding Workshop
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
In dialogue transcription pipelines, Large Language Models (LLMs) are frequently employed in post-processing to improve grammar, punctuation, and readability. We explore a complementary post-processing step: enriching transcribed dialogues by adding metadata tags for speaker characteristics such as age, gender, and emotion. Some of the tags are global to the entire dialogue, while some are time-variant. Our approach couples frozen audio foundation models, such as Whisper or WavLM, with a frozen LLAMA language model to infer these speaker attributes, without requiring task-specific fine-tuning of either model. Using lightweight, efficient connectors to bridge audio and language representations, we achieve competitive performance on speaker profiling tasks while preserving modularity and speed. Additionally, we demonstrate that a frozen LLAMA model can compare x-vectors directly, achieving an Equal Error Rate of 8.8% in some scenarios.
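For readers unfamiliar with the Equal Error Rate reported above, the sketch below computes it from verification scores by sweeping a threshold and finding the point where the false acceptance rate and the false rejection rate are closest. The scores and labels are hypothetical similarity values, not outputs of the paper's LLAMA-based comparison, and the function name is an assumption of this sketch.

```python
def equal_error_rate(scores, labels):
    """Approximate EER. labels: 1 for same-speaker trials, 0 for different-speaker."""
    targets = [s for s, y in zip(scores, labels) if y == 1]
    nontargets = [s for s, y in zip(scores, labels) if y == 0]
    if not targets or not nontargets:
        raise ValueError("need at least one trial of each class")

    best_gap, eer = float("inf"), 1.0
    for threshold in sorted(set(scores)):
        # A trial is accepted when its score is at or above the threshold.
        far = sum(s >= threshold for s in nontargets) / len(nontargets)
        frr = sum(s < threshold for s in targets) / len(targets)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Hypothetical trial scores for illustration only.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
print(f"EER: {equal_error_rate(scores, labels):.3f}")
```

With only a handful of trials the two error rates may never be exactly equal, so the sketch reports the average of the two at the closest threshold.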
- [47] arXiv:2508.04799 (cross-list from cs.NE) [Chinese pdf, pdf, html, other]
-
Title: Optimality Principles and Neural Ordinary Differential Equations-based Process Modeling for Distributed Control
Comments: 27 pages, 7 figures
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Most recent advances in machine learning and analytics for process control pose the question of how to naturally integrate new data-driven methods with classical process models and control. We propose a process modeling framework enabling integration of data-driven algorithms through consistent topological properties and conservation of extensive quantities. Interconnections among process network units are represented through connectivity matrices and network graphs. We derive the system's natural objective function equivalent to the non-equilibrium entropy production in a steady state system as a driving force for the process dynamics. We illustrate how distributed control and optimization can be implemented into process network structures and how control laws and algorithms alter the system's natural equilibrium towards engineered objectives. The basic requirement is that the flow conditions can be expressed in terms of conic sector (passivity) conditions. Our formalism allows integration of fundamental conservation properties from topology with learned dynamic relations from data through sparse deep neural networks. We demonstrate in a practical example of a simple inventory control system how to integrate the basic topology of a process with a neural network ordinary differential equation model. The system specific constitutive equations are left undescribed and learned by the neural ordinary differential equation algorithm using the adjoint method in combination with an adaptive ODE solver from synthetic time-series data. The resulting neural network forms a state space model for use in e.g. a model predictive control algorithm.
- [48] arXiv:2508.04814 (cross-list from cs.CL) [Chinese pdf, pdf, html, other]
-
Title: Pitch Accent Detection improves Pretrained Automatic Speech Recognition
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
We show that the performance of Automatic Speech Recognition (ASR) systems that use semi-supervised speech representations can be boosted by a complementary pitch accent detection module, introduced through a joint ASR and pitch accent detection model. The pitch accent detection component of our model achieves a significant improvement over the state of the art for the task, closing the gap in F1-score by 41%. Additionally, joint training reduces the ASR WER by 28.3% on LibriSpeech under limited-resource fine-tuning. With these results, we show the importance of extending pretrained speech models to retain or re-learn important prosodic cues such as pitch accent.
- [49] arXiv:2508.04818 (cross-list from cs.CV) [Chinese pdf, pdf, html, other]
-
Title: Single-Step Reconstruction-Free Anomaly Detection and Segmentation via Diffusion Models
Comments: 9 pages, 8 figures, 2 tables. Submitted to an IEEE conference
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Machine Learning (stat.ML)
Generative models have demonstrated significant success in anomaly detection and segmentation over the past decade. Recently, diffusion models have emerged as a powerful alternative, outperforming previous approaches such as GANs and VAEs. In typical diffusion-based anomaly detection, a model is trained on normal data, and during inference, anomalous images are perturbed to a predefined intermediate step in the forward diffusion process. The corresponding normal image is then reconstructed through iterative reverse sampling. However, reconstruction-based approaches present three major challenges: (1) the reconstruction process is computationally expensive due to multiple sampling steps, making real-time applications impractical; (2) for complex or subtle patterns, the reconstructed image may correspond to a different normal pattern rather than the original input; and (3) choosing an appropriate intermediate noise level is challenging because it is application-dependent and often assumes prior knowledge of anomalies, an assumption that does not hold in unsupervised settings. We introduce Reconstruction-free Anomaly Detection with Attention-based diffusion models in Real-time (RADAR), which overcomes the limitations of reconstruction-based anomaly detection. Unlike current SOTA methods that reconstruct the input image, RADAR directly produces anomaly maps from the diffusion model, improving both detection accuracy and computational efficiency. We evaluate RADAR on a real-world 3D-printed material dataset and the MVTec-AD dataset. Our approach surpasses state-of-the-art diffusion-based and statistical machine learning models across all key metrics, including accuracy, precision, recall, and F1 score. Specifically, RADAR improves F1 score by 7% on MVTec-AD and 13% on the 3D-printed material dataset compared to the next best model. Code available at: https://github.com/mehrdadmoradi124/RADAR
- [50] arXiv:2508.04946 (cross-list from cs.LG) [Chinese pdf, pdf, html, other]
-
Title: REINA: Regularized Entropy Information-Based Loss for Efficient Simultaneous Speech Translation
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Simultaneous Speech Translation (SimulST) systems stream in audio while simultaneously emitting translated text or speech. Such systems face the significant challenge of balancing translation quality and latency. We introduce a strategy to optimize this tradeoff: wait for more input only if you gain information by doing so. Based on this strategy, we present Regularized Entropy INformation Adaptation (REINA), a novel loss to train an adaptive policy using an existing non-streaming translation model. We derive REINA from information theory principles and show that REINA helps push the reported Pareto frontier of the latency/quality tradeoff over prior works. Utilizing REINA, we train a SimulST model on French, Spanish and German, both from and into English. Training on only open source or synthetically generated data, we achieve state-of-the-art (SOTA) streaming results for models of comparable size. We also introduce a metric for streaming efficiency, quantitatively showing REINA improves the latency/quality trade-off by as much as 21% compared to prior approaches, normalized against non-streaming baseline BLEU scores.
- [51] arXiv:2508.04985 (cross-list from cs.LG) [Chinese pdf, pdf, html, other]
-
Title: RCUKF: Data-Driven Modeling Meets Bayesian Estimation
Comments: 6 pages, 6 figures. Accepted at IFAC MECC 2025 (Modeling, Estimation and Control Conference)
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
Accurate modeling is crucial in many engineering and scientific applications, yet obtaining a reliable process model for complex systems is often challenging. To address this challenge, we propose a novel framework, reservoir computing with unscented Kalman filtering (RCUKF), which integrates data-driven modeling via reservoir computing (RC) with Bayesian estimation through the unscented Kalman filter (UKF). The RC component learns the nonlinear system dynamics directly from data, serving as a surrogate process model in the UKF prediction step to generate state estimates in high-dimensional or chaotic regimes where nominal mathematical models may fail. Meanwhile, the UKF measurement update integrates real-time sensor data to correct potential drift in the data-driven model. We demonstrate RCUKF effectiveness on well-known benchmark problems and a real-time vehicle trajectory estimation task in a high-fidelity simulation environment.
- [52] arXiv:2508.05011 (cross-list from cs.SD) [Chinese pdf, pdf, html, other]
-
Title: Towards Hallucination-Free Music: A Reinforcement Learning Preference Optimization Framework for Reliable Song Generation
Authors: Huaicheng Zhang, Wei Tan, Guangzheng Li, Yixuan Zhang, Hangting Chen, Shun Lei, Chenyu Yang, Zhiyong Wu, Shuai Wang, Qijun Huang, Dong Yu
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Recent advances in audio-based generative language models have accelerated AI-driven lyric-to-song generation. However, these models frequently suffer from content hallucination, producing outputs misaligned with the input lyrics and undermining musical coherence. Current supervised fine-tuning (SFT) approaches, limited by passive label-fitting, exhibit constrained self-improvement and poor hallucination mitigation. To address this core challenge, we propose a novel reinforcement learning (RL) framework leveraging preference optimization for hallucination control. Our key contributions include: (1) Developing a robust hallucination preference dataset constructed via phoneme error rate (PER) computation and rule-based filtering to capture alignment with human expectations; (2) Implementing and evaluating three distinct preference optimization strategies within the RL framework: Direct Preference Optimization (DPO), Proximal Policy Optimization (PPO), and Group Relative Policy Optimization (GRPO). DPO operates off-policy to enhance positive token likelihood, achieving a significant 7.4% PER reduction. PPO and GRPO employ an on-policy approach, training a PER-based reward model to iteratively optimize sequences via reward maximization and KL-regularization, yielding PER reductions of 4.9% and 4.7%, respectively. Comprehensive objective and subjective evaluations confirm that our methods effectively suppress hallucinations while preserving musical quality. Crucially, this work presents a systematic, RL-based solution to hallucination control in lyric-to-song generation. The framework's transferability also unlocks potential for music style adherence and musicality enhancement, opening new avenues for future generative song research.
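The PER-based preference construction described in this abstract can be illustrated with a minimal sketch. It assumes PER is the Levenshtein distance between reference and recognized phoneme sequences divided by the reference length, and that the candidate with the lower PER becomes the "chosen" sample. The function names and the phoneme symbols in the usage note are hypothetical, and the paper's rule-based filtering is not reproduced.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # reference phoneme missing in hyp
                            curr[j - 1] + 1,       # extra phoneme in hyp
                            prev[j - 1] + cost))   # match or substitution
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """PER = edit distance / number of reference phonemes (assumed definition)."""
    if not ref:
        raise ValueError("reference sequence must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def preference_pair(ref, cand_a, cand_b):
    """Return (chosen, rejected) ordered by PER, or None when the PERs tie."""
    per_a = phoneme_error_rate(ref, cand_a)
    per_b = phoneme_error_rate(ref, cand_b)
    if per_a == per_b:
        return None
    return (cand_a, cand_b) if per_a < per_b else (cand_b, cand_a)
```

For example, with the reference `["HH", "AH", "L", "OW"]`, a candidate that drops the final phoneme has a PER of 0.25 and would be preferred over one that substitutes one phoneme and appends another (PER 0.5).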
- [53] arXiv:2508.05016 (cross-list from cs.CV) [Chinese pdf, pdf, html, other]
Title: AU-IQA: A Benchmark Dataset for Perceptual Quality Assessment of AI-Enhanced User-Generated Content
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
AI-based image enhancement techniques have been widely adopted in various visual applications, significantly improving the perceptual quality of user-generated content (UGC). However, the lack of specialized quality assessment models has become a significant limiting factor in this field, constraining user experience and hindering the advancement of enhancement methods. While perceptual quality assessment methods have shown strong performance on UGC and AIGC individually, their effectiveness on AI-enhanced UGC (AI-UGC), which blends features from both, remains largely unexplored. To address this gap, we construct AU-IQA, a benchmark dataset comprising 4,800 AI-UGC images produced by three representative enhancement types: super-resolution, low-light enhancement, and denoising. On this dataset, we further evaluate a range of existing quality assessment models, including traditional IQA methods and large multimodal models. Finally, we provide a comprehensive analysis of how well current approaches perform in assessing the perceptual quality of AI-UGC. AU-IQA is available at https://github.com/WNNGGU/AU-IQA-Dataset.
- [54] arXiv:2508.05026 (cross-list from physics.optics) [Chinese pdf, pdf, other]
Title: Phase Noise Tolerance for Low-Pilot-Overhead OFDM Terahertz Links Beyond 64-QAM
Subjects: Optics (physics.optics); Signal Processing (eess.SP)
THz wireless communications have garnered significant attention due to their unprecedented data rates enabled by the abundant untapped spectrum. However, advanced modulation formats beyond 64-QAM remain largely unexplored, as phase errors introduced during up/down-conversion severely limit system performance. Particularly, OFDM transmission is highly susceptible to aggravated ICI induced by phase noise, undermining the orthogonality of subcarriers. While PLLs and pilot-assisted compensation can mitigate phase errors, excessive pilot overhead compromises spectral efficiency and energy consumption, and white phase noise remains unrecoverable. Therefore, quantifying phase noise tolerance is essential for practical physical layer protocols. Here, we reveal the impact of phase noise in a 64-QAM, 2048-subcarrier OFDM THz transmission system. A $3\sigma$-error estimation is proposed to quantify phase noise tolerance, indicating an intuitive EVM threshold of approximately 5%. This threshold further delineates the trade-offs among phase noise levels, SNR requirements, and pilot overhead. Moreover, by benchmarking representative oscillators with distinct phase noise spectra, microring resonators (MRRs) are identified as indispensable enablers for low-pilot-overhead OFDM THz links operating beyond 64-QAM.
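To make the EVM threshold mentioned in this abstract concrete, the following is a minimal sketch of an error vector magnitude check. It assumes the common convention of normalizing the RMS error by the RMS power of the reference constellation points. The paper's 3σ-error estimation is not reproduced here, and the function names are illustrative only.

```python
import math


def evm_percent(received, reference):
    """RMS error vector magnitude in percent, normalized by reference RMS power."""
    if not reference or len(received) != len(reference):
        raise ValueError("sequences must be non-empty and of equal length")
    error_power = sum(abs(r - s) ** 2 for r, s in zip(received, reference))
    reference_power = sum(abs(s) ** 2 for s in reference)
    return 100.0 * math.sqrt(error_power / reference_power)


def within_evm_threshold(received, reference, threshold_percent=5.0):
    """Check received symbols against an EVM limit (5% taken from the abstract)."""
    return evm_percent(received, reference) <= threshold_percent
```

As a usage example, shifting every point of a unit-amplitude QPSK-like constellation `[1+1j, -1+1j, -1-1j, 1-1j]` by 0.1 along the real axis gives an EVM of about 7.07%, which would fail a 5% limit.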
- [55] arXiv:2508.05033 (cross-list from cs.IT) [Chinese pdf, pdf, html, other]
Title: Energy Efficiency Optimization for Movable Antenna-Aided Communication Systems
Comments: This paper has been accepted by IEEE iWRF&AT 2025
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
This paper investigates the energy efficiency optimization for movable antenna (MA) systems by considering the time delay and energy consumption introduced by MA movement. We first derive the upper bound on energy efficiency for a single-user downlink communication system, where the user is equipped with a single MA. Then, the energy efficiency maximization problem is formulated to optimize the MA position, and an efficient algorithm based on successive convex approximation is proposed to solve this non-convex optimization problem. Simulation results show that, despite the overhead caused by MA movement, the MA system can still improve the energy efficiency compared to the conventional fixed-position antenna (FPA) system.
- [56] arXiv:2508.05036 (cross-list from quant-ph) [Chinese pdf, pdf, other]
Title: Q-DPTS: Quantum Differentially Private Time Series Forecasting via Variational Quantum Circuits
Subjects: Quantum Physics (quant-ph); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Signal Processing (eess.SP)
Time series forecasting is vital in domains where data sensitivity is paramount, such as finance and energy systems. While Differential Privacy (DP) provides theoretical guarantees to protect individual data contributions, its integration, especially via DP-SGD, often impairs model performance due to injected noise. In this paper, we propose Q-DPTS, a hybrid quantum-classical framework for Quantum Differentially Private Time Series Forecasting. Q-DPTS combines Variational Quantum Circuits (VQCs) with per-sample gradient clipping and Gaussian noise injection, ensuring rigorous $(\epsilon, \delta)$-differential privacy. The expressiveness of quantum models enables improved robustness against the utility loss induced by DP mechanisms. We evaluate Q-DPTS on the ETT (Electricity Transformer Temperature) dataset, a standard benchmark for long-term time series forecasting. Our approach is compared against both classical and quantum baselines, including LSTM, QASA, QRWKV, and QLSTM. Results demonstrate that Q-DPTS consistently achieves lower prediction error under the same privacy budget, indicating a favorable privacy-utility trade-off. This work presents one of the first explorations into quantum-enhanced differentially private forecasting, offering promising directions for secure and accurate time series modeling in privacy-critical scenarios.
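The per-sample gradient clipping and Gaussian noise injection named in this abstract follow the general DP-SGD recipe, which can be sketched as below. This is a generic, model-agnostic illustration on plain Python lists, not the Q-DPTS implementation: the function names are hypothetical, the gradients are assumed to be already computed, and the $(\epsilon, \delta)$ privacy accounting is omitted.

```python
import math
import random


def clip_gradient(grad, clip_norm):
    """Rescale one per-sample gradient so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def private_mean_gradient(per_sample_grads, clip_norm, noise_multiplier, rng):
    """Clip each sample's gradient, sum, add Gaussian noise, then average."""
    if not per_sample_grads:
        raise ValueError("need at least one per-sample gradient")
    clipped = [clip_gradient(g, clip_norm) for g in per_sample_grads]
    dim = len(clipped[0])
    # Noise standard deviation scales with the clipping bound (the sensitivity).
    sigma = noise_multiplier * clip_norm
    noisy_sum = [
        sum(g[k] for g in clipped) + rng.gauss(0.0, sigma)
        for k in range(dim)
    ]
    return [v / len(clipped) for v in noisy_sum]
```

A call such as `private_mean_gradient(grads, clip_norm=1.0, noise_multiplier=1.0, rng=random.Random(0))` returns the noisy averaged gradient that an optimizer step would then consume. Larger `noise_multiplier` values give stronger privacy at the cost of noisier updates, which is the utility loss the abstract refers to.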
- [57] arXiv:2508.05068 (cross-list from cs.CV) [Chinese pdf, pdf, html, other]
Title: Automatic Image Colorization with Convolutional Neural Networks and Generative Adversarial Networks
Comments: 5 pages, 4 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Image colorization, the task of adding colors to grayscale images, has been the focus of significant research efforts in computer vision in recent years because of its various application areas, such as color restoration and automatic animation colorization [15, 1]. The colorization problem is challenging as it is highly ill-posed, with two out of three image dimensions lost, resulting in large degrees of freedom. However, the semantics of the scene as well as the surface texture can provide important cues for colors: the sky is typically blue, the clouds are typically white, and the grass is typically green. Moreover, huge amounts of training data are available for learning such priors, since any colored image can serve as a training data point [20]. Colorization was initially formulated as a regression task [5], which ignores the multi-modal nature of color prediction. In this project, we explore automatic image colorization via classification and adversarial learning. We will build our models on prior works, apply modifications for our specific scenario, and make comparisons.
- [58] arXiv:2508.05115 (cross-list from cs.GR) [Chinese pdf, pdf, html, other]
Title: RAP: Real-time Audio-driven Portrait Animation with Video Diffusion Transformer
Authors: Fangyu Du, Taiqing Li, Ziwei Zhang, Qian Qiao, Tan Yu, Dingcheng Zhen, Xu Jia, Yang Yang, Shunshun Yin, Siyuan Liu
Comments: 11 pages, 9 figures
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Audio-driven portrait animation aims to synthesize realistic and natural talking head videos from an input audio signal and a single reference image. While existing methods achieve high-quality results by leveraging high-dimensional intermediate representations and explicitly modeling motion dynamics, their computational complexity renders them unsuitable for real-time deployment. Real-time inference imposes stringent latency and memory constraints, often necessitating the use of highly compressed latent representations. However, operating in such compact spaces hinders the preservation of fine-grained spatiotemporal details, thereby complicating audio-visual synchronization. RAP (Real-time Audio-driven Portrait animation) is a unified framework for generating high-quality talking portraits under real-time constraints. Specifically, RAP introduces a hybrid attention mechanism for fine-grained audio control, and a static-dynamic training-inference paradigm that avoids explicit motion supervision. Through these techniques, RAP achieves precise audio-driven control, mitigates long-term temporal drift, and maintains high visual fidelity. Extensive experiments demonstrate that RAP achieves state-of-the-art performance while operating under real-time constraints.
- [59] arXiv:2508.05130 (cross-list from cs.NI) [Chinese pdf, pdf, html, other]
Title: TeraRIS NOMA-MIMO Communications for 6G and Beyond Industrial Networks
Authors: Ali Raza, Muhammad Farhan Khan, Zeeshan Alam, Muhammad Saad, Ilyas Saleem, Muhammad Ahmed Mohsin, Muhammad Ali Jamshed
Comments: Accepted at PIMRC
Subjects: Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
This paper presents a joint framework that integrates reconfigurable intelligent surfaces (RISs) with Terahertz (THz) communications and non-orthogonal multiple access (NOMA) to enhance smart industrial communications. The proposed system leverages the advantages of RIS and THz bands to improve spectral efficiency, coverage, and reliability, which are key requirements for industrial automation and real-time communications in future 6G networks and beyond. Within this framework, two power allocation strategies are investigated: the first optimally distributes power between near and far industrial nodes, and the second prioritizes network demands to enhance system performance further. A performance evaluation is conducted to compare the sum rate and outage probability against a fixed power allocation (PA) scheme. Our scheme achieves up to a 23% sum rate gain over fixed PA at 30 dBm. Simulation results validate the theoretical analysis, demonstrating the effectiveness and robustness of the RIS-assisted NOMA MIMO framework for THz-enabled industrial communications.
- [60] arXiv:2508.05207 (cross-list from cs.SD) [Chinese pdf, pdf, html, other]
Title: SpectroStream: A Versatile Neural Codec for General Audio
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
We propose SpectroStream, a full-band multi-channel neural audio codec. As the successor to the well-established SoundStream, SpectroStream extends its capability beyond 24 kHz monophonic audio and enables high-quality reconstruction of 48 kHz stereo music at bit rates of 4 to 16 kbps. This is accomplished with a new neural architecture that leverages audio representation in the time-frequency domain, which leads to better audio quality, especially at higher sample rates. The model also uses a delayed-fusion strategy to handle multi-channel audio, which is crucial in balancing per-channel acoustic quality and cross-channel phase consistency.
- [61] arXiv:2508.05210 (cross-list from cs.LG) [Chinese pdf, pdf, other]
Title: Advanced Hybrid Transformer LSTM Technique with Attention and TS Mixer for Drilling Rate of Penetration Prediction
Authors: Saddam Hussain Khan (Artificial Intelligence Lab, Department of Computer Systems Engineering, University of Engineering and Applied Sciences (UEAS), Swat, Pakistan)
Comments: 37 pages, 19 figures, 9 tables
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
The Rate of Penetration (ROP) is crucial for optimizing drilling operations; however, accurately predicting it is hindered by the complex, dynamic, and high-dimensional nature of drilling data. Traditional empirical, physics-based, and basic machine learning models often fail to capture intricate temporal and contextual relationships, resulting in suboptimal predictions and limited real-time utility. To address this gap, we propose a novel hybrid deep learning architecture integrating Long Short-Term Memory (LSTM) networks, Transformer encoders, Time-Series Mixer (TS-Mixer) blocks, and attention mechanisms to synergistically model temporal dependencies, static feature interactions, global context, and dynamic feature importance. Evaluated on a real-world drilling dataset, our model outperformed benchmarks (standalone LSTM, TS-Mixer, and simpler hybrids) with an R-squared score of 0.9988 and a Mean Absolute Percentage Error of 1.447%, as measured by standard regression metrics (R-squared, MAE, RMSE, MAPE). Model interpretability was ensured using SHAP and LIME, while actual vs. predicted curves and bias checks confirmed accuracy and fairness across scenarios. This advanced hybrid approach enables reliable real-time ROP prediction, paving the way for intelligent, cost-effective drilling optimization systems with significant operational impact.
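The four regression metrics cited in this abstract (R-squared, MAE, RMSE, MAPE) have standard textbook definitions, sketched below for reference. The function name and the sample values in the usage note are illustrative and are not taken from the paper's drilling dataset.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute R-squared, MAE, RMSE, and MAPE (in percent) for paired values."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    if any(t == 0 for t in y_true):
        raise ValueError("MAPE is undefined when a true value is zero")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                   # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)    # total sum of squares
    if ss_tot == 0:
        raise ValueError("R-squared is undefined for constant true values")
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        "mape": 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n,
    }
```

For instance, true values `[10, 20, 30, 40]` against predictions `[11, 19, 33, 38]` give an R-squared of 0.97, an MAE of 1.75, and a MAPE of 7.5%.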
- [62] arXiv:2508.05306 (cross-list from cs.SD) [Chinese pdf, pdf, html, other]
Title: Estimating Musical Surprisal from Audio in Autoregressive Diffusion Model Noise Spaces
Comments: 9 pages, 1 figure, 5 tables. Accepted at the 25th International Society for Music Information Retrieval Conference (ISMIR), Daejeon, South Korea, 2025
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Recently, the information content (IC) of predictions from a Generative Infinite-Vocabulary Transformer (GIVT) has been used to model musical expectancy and surprisal in audio. We investigate the effectiveness of such modelling using IC calculated with autoregressive diffusion models (ADMs). We empirically show that IC estimates of models based on two different diffusion ordinary differential equations (ODEs) describe diverse data better, in terms of negative log-likelihood, than a GIVT. We evaluate diffusion model IC's effectiveness in capturing surprisal aspects by examining two tasks: (1) capturing monophonic pitch surprisal, and (2) detecting segment boundaries in multi-track audio. In both tasks, the diffusion models match or exceed the performance of a GIVT. We hypothesize that the surprisal estimated at different diffusion process noise levels corresponds to the surprisal of music and audio features present at different audio granularities. Testing our hypothesis, we find that, for appropriate noise levels, the studied musical surprisal tasks' results improve. Code is provided on github.com/SonyCSLParis/audioic.
- [63] arXiv:2508.05368 (cross-list from cs.RO) [Chinese pdf, pdf, html, other]
Title: A Multi-view Landmark Representation Approach with Application to GNSS-Visual-Inertial Odometry
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
The invariant extended Kalman filter (IEKF) has been a significant technique in vision-aided sensor fusion. However, it usually suffers from a high computational burden when jointly optimizing camera poses and landmarks. To improve its efficiency and applicability for multi-sensor fusion, we present a multi-view pose-only estimation approach and apply it to GNSS-Visual-Inertial Odometry (GVIO). Our main contribution is deriving a visual measurement model which directly associates landmark representation with multiple camera poses and observations. Such a pose-only measurement is proven to be tightly coupled between landmarks and poses, and to maintain a perfect null space that is independent of estimated poses. Finally, we apply the proposed approach to a filter-based GVIO with a novel feature management strategy. Both simulation tests and real-world experiments are conducted to demonstrate the superiority of the proposed method in terms of efficiency and accuracy.
- [64] arXiv:2508.05385 (cross-list from cs.SD) [Chinese pdf, pdf, html, other]
Title: A Scalable Pipeline for Enabling Non-Verbal Speech Generation and Understanding
Authors: Runchuan Ye, Yixuan Zhou, Renjie Yu, Zijian Lin, Kehan Li, Xiang Li, Xin Liu, Guoyang Zeng, Zhiyong Wu
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Human spoken communication involves not only lexical content but also non-verbal vocalizations (NVs) such as laughter, sighs, and coughs, which convey emotions, intentions, and social signals. However, most existing speech systems focus solely on verbal content and lack the ability to understand and generate such non-verbal cues, reducing the emotional intelligence and communicative richness of spoken interfaces. In this work, we introduce NonVerbalSpeech-38K, a large and diverse dataset for non-verbal speech generation and understanding, collected from real-world media and annotated using an automatic pipeline. The dataset contains 38,718 samples (about 131 hours) with 10 categories of non-verbal cues, such as laughter, sniff, and throat clearing. We further validate the dataset by fine-tuning state-of-the-art models, including F5-TTS and Qwen2-Audio, demonstrating its effectiveness in non-verbal speech generation and understanding tasks. Our contributions are threefold: (1) We propose a practical pipeline for building natural and diverse non-verbal speech datasets; (2) We release a large-scale dataset to advance research on non-verbal speech generation and understanding; (3) We validate the dataset's effectiveness by demonstrating improvements in both non-verbal speech synthesis and captioning, thereby facilitating richer human-computer interaction.
- [65] arXiv:2508.05409 (cross-list from cs.CV) [Chinese pdf, pdf, html, other]
Title: From Detection to Correction: Backdoor-Resilient Face Recognition via Vision-Language Trigger Detection and Noise-Based Neutralization
Authors: Farah Wahida, M.A.P. Chamikara, Yashothara Shanmugarasa, Mohan Baruwal Chhetri, Thilina Ranbaduge, Ibrahim Khalil
Comments: 19 pages, 24 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Biometric systems, such as face recognition systems powered by deep neural networks (DNNs), rely on large and highly sensitive datasets. Backdoor attacks can subvert these systems by manipulating the training process. By inserting a small trigger, such as a sticker, make-up, or patterned mask, into a few training images, an adversary can later present the same trigger during authentication to be falsely recognized as another individual, thereby gaining unauthorized access. Existing defense mechanisms against backdoor attacks still face challenges in precisely identifying and mitigating poisoned images without compromising data utility, which undermines the overall reliability of the system. We propose a novel and generalizable approach, TrueBiometric: Trustworthy Biometrics, which accurately detects poisoned images using a majority voting mechanism leveraging multiple state-of-the-art large vision language models. Once identified, poisoned samples are corrected using targeted and calibrated corrective noise. Our extensive empirical results demonstrate that TrueBiometric detects and corrects poisoned images with 100% accuracy without compromising accuracy on clean images. Compared to existing state-of-the-art approaches, TrueBiometric offers a more practical, accurate, and effective solution for mitigating backdoor attacks in face recognition systems.
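The majority voting step described in this abstract can be sketched generically as follows. The sketch assumes each vision language model has already returned a boolean "trigger present" verdict per image, and that an image is flagged only when strictly more than half of the verdicts agree. How TrueBiometric prompts the models or handles ties is not stated in the abstract, so both choices here are assumptions and the function names are hypothetical.

```python
def majority_vote(verdicts):
    """Return True when strictly more than half of the detectors flag the image."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    flagged = sum(1 for v in verdicts if v)
    return flagged * 2 > len(verdicts)


def screen_images(verdicts_by_image):
    """Return the names of images that the detector ensemble marks as poisoned."""
    return [
        name
        for name, verdicts in verdicts_by_image.items()
        if majority_vote(verdicts)
    ]
```

With three detectors, an image with verdicts `[True, True, False]` would be flagged, while `[True, False, False]` would pass. Using an odd number of detectors avoids ties under this rule.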
- [66] arXiv:2508.05465 (cross-list from cs.CV) [Chinese pdf, pdf, html, other]
Title: F2PASeg: Feature Fusion for Pituitary Anatomy Segmentation in Endoscopic Surgery
Authors: Lumin Chen, Zhiying Wu, Tianye Lei, Xuexue Bai, Ming Feng, Yuxi Wang, Gaofeng Meng, Zhen Lei, Hongbin Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Systems and Control (eess.SY)
Pituitary tumors often cause deformation or encapsulation of adjacent vital structures. Anatomical structure segmentation can provide surgeons with early warnings of regions that pose surgical risks, thereby enhancing the safety of pituitary surgery. However, pixel-level annotated video stream datasets for pituitary surgeries are extremely rare. To address this challenge, we introduce a new dataset for Pituitary Anatomy Segmentation (PAS). PAS comprises 7,845 time-coherent images extracted from 120 videos. To mitigate class imbalance, we apply data augmentation techniques that simulate the presence of surgical instruments in the training data. One major challenge in pituitary anatomy segmentation is the inconsistency in feature representation due to occlusions, camera motion, and surgical bleeding. By incorporating a Feature Fusion module, F2PASeg is proposed to refine anatomical structure segmentation by leveraging both high-resolution image features and deep semantic embeddings, enhancing robustness against intraoperative variations. Experimental results demonstrate that F2PASeg consistently segments critical anatomical structures in real time, providing a reliable solution for intraoperative pituitary surgery planning. Code: https://github.com/paulili08/F2PASeg.
- [67] arXiv:2508.05466 (cross-list from math.OC) [Chinese pdf, pdf, html, other]
Title: Distributionally Robust System Level Synthesis With Output Feedback Affine Control Policy
Subjects: Optimization and Control (math.OC); Systems and Control (eess.SY)
This paper studies the finite-horizon robust optimal control of linear systems subject to model mismatch and additive stochastic disturbances. Utilizing the system level synthesis (SLS) parameterization, we propose a novel SLS design using output-feedback affine control policy and extend it to a distributionally robust setting to improve system resilience by minimizing the cost function while ensuring constraint satisfaction against the worst-case uncertainty distribution. The scopes of model mismatch and stochastic disturbances are quantified using the 1-norm and a Wasserstein metric-based ambiguity set, respectively. For the closed-loop dynamics, we analyze the distributional shift between the predicted output-input response (computed using nominal parameters and empirical disturbance samples) and the actual closed-loop distribution, highlighting its dependence on model mismatch and SLS parameterization. Assuming convex and Lipschitz continuous cost functions and constraints, we derive a tractable reformulation of the distributionally robust SLS (DR-SLS) problem by leveraging tools from robust control and distributionally robust optimization (DRO). Numerical experiments validate the performance and robustness of the proposed approach.
- [68] arXiv:2508.05473 (cross-list from cs.MM) [Chinese pdf, pdf, html, other]
Title: Embedding Alignment in Code Generation for Audio
Subjects: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
LLM-powered code generation has the potential to revolutionize creative coding endeavors, such as live-coding, by enabling users to focus on structural motifs over syntactic details. In such domains, when prompting an LLM, users may benefit from considering multiple varied code candidates to better realize their musical intentions. Code generation models, however, struggle to present unique and diverse code candidates, with no direct insight into the code's audio output. To better establish a relationship between code candidates and produced audio, we investigate the topology of the mapping between code and audio embedding spaces. We find that code and audio embeddings do not exhibit a simple linear relationship, but supplement this with a constructed predictive model that shows an embedding alignment map could be learned. Supplementing the aim for musically diverse output, we present a model that given code predicts output audio embedding, constructing a code-audio embedding alignment map.
- [69] arXiv:2508.05489 (cross-list from cs.CV) [Chinese pdf, pdf, html, other]
Title: Keep It Real: Challenges in Attacking Compression-Based Adversarial Purification
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Previous work has suggested that preprocessing images through lossy compression can defend against adversarial perturbations, but comprehensive attack evaluations have been lacking. In this paper, we construct strong white-box and adaptive attacks against various compression models and identify a critical challenge for attackers: high realism in reconstructed images significantly increases attack difficulty. Through rigorous evaluation across multiple attack scenarios, we demonstrate that compression models capable of producing realistic, high-fidelity reconstructions are substantially more resistant to our attacks. In contrast, low-realism compression models can be broken. Our analysis reveals that this is not due to gradient masking. Rather, realistic reconstructions maintaining distributional alignment with natural images seem to offer inherent robustness. This work highlights a significant obstacle for future adversarial attacks and suggests that developing more effective techniques to overcome realism represents an essential challenge for comprehensive security evaluation.
- [70] arXiv:2508.05554 (cross-listed from cs.SD)
Title: SPGISpeech 2.0: Transcribed multi-speaker financial audio for speaker-tagged transcription
Authors: Raymond Grossman, Taejin Park, Kunal Dhawan, Andrew Titus, Sophia Zhi, Yulia Shchadilova, Weiqing Wang, Jagadeesh Balam, Boris Ginsburg
Comments: To be presented at Interspeech 2025
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
We introduce SPGISpeech 2.0, a dataset suitable for speaker-tagged transcription in the financial domain. SPGISpeech 2.0 improves the diversity of applicable modeling tasks while maintaining the core characteristic of the original SPGISpeech dataset: audio snippets and their corresponding fully formatted text transcriptions, usable for end-to-end automatic speech recognition (ASR). SPGISpeech 2.0 consists of 3,780 additional hours of professionally transcribed earnings calls. Furthermore, the dataset contains call and speaker information for each audio snippet, facilitating multi-talker ASR. We validate the utility of SPGISpeech 2.0 through improvements in speaker-tagged ASR performance of popular speech recognition models after fine-tuning on SPGISpeech 2.0. SPGISpeech 2.0 is released free for non-commercial use, and we expect it to foster advancements in speech recognition technologies and inspire a wide range of research applications.
- [71] arXiv:2508.05558 (cross-listed from quant-ph)
Title: Joint parameter estimation and multidimensional reconciliation for CV-QKD
Comments: 18 pages, 5 figures
Subjects: Quantum Physics (quant-ph); Signal Processing (eess.SP)
Accurate quantum channel parameter estimation is essential for effective information reconciliation in continuous-variable quantum key distribution (CV-QKD). However, conventional maximum likelihood (ML) estimators rely on a large amount of discarded data (or pilot symbols), leading to a significant loss in symbol efficiency. Moreover, the separation between the estimation and reconciliation phases can introduce error propagation. In this paper, we propose a novel joint message-passing scheme that unifies channel parameter estimation and information reconciliation within a Bayesian framework. By leveraging the expectation-maximization (EM) algorithm, the proposed method simultaneously estimates unknown parameters during decoding, eliminating the need for separate ML estimation. Furthermore, we introduce a hybrid multidimensional rotation scheme that removes the requirement for norm feedback, significantly reducing classical channel overhead. To the best of our knowledge, this is the first work to unify multidimensional reconciliation and channel parameter estimation in CV-QKD, providing a practical solution for high-efficiency reconciliation with minimal pilots.
- [72] arXiv:2508.05574 (cross-listed from cs.IT)
Title: Latency Minimization for Multi-AAV-Enabled ISCC Systems with Movable Antenna
Comments: 6 pages, 6 figures; this manuscript has been submitted to IEEE
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
This paper investigates an autonomous aerial vehicle (AAV)-enabled integrated sensing, communication, and computation system, with a particular focus on integrating movable antennas (MAs) into the system to enhance overall system performance. Specifically, multiple MA-enabled AAVs perform sensing tasks and simultaneously transmit the generated computational tasks to the base station for processing. To minimize the maximum latency under the sensing and resource constraints, we formulate an optimization problem that jointly coordinates the position of the MAs, the computation resource allocation, and the transmit beamforming. Due to the non-convexity of the objective function and strong coupling among variables, we propose a two-layer iterative algorithm leveraging particle swarm optimization and convex optimization to address it. The simulation results demonstrate that the proposed scheme achieves significant latency improvements compared to the baseline schemes.
- [73] arXiv:2508.05590 (cross-listed from physics.optics)
Title: Design and Analysis of a Vanadium Dioxide-Based Ultra-Broadband Terahertz Metamaterial Absorber
Comments: 6 pages, 3 figures, 2 tables
Subjects: Optics (physics.optics); Systems and Control (eess.SY); Applied Physics (physics.app-ph)
This paper presents a VO2-based metamaterial absorber optimized for ultra-broadband, polarization-insensitive performance in the terahertz (THz) frequency range. The absorber consists of a patterned VO2 metasurface, a low-loss MF2 dielectric spacer, and a gold ground plane. Exploiting the phase transition of VO2, the design enables dynamic control of electromagnetic absorption. Full-wave simulations show an average absorptance of 98.15% across a 5.38 THz bandwidth (5.72-11.11 THz) and over 99% absorption sustained across 3.35 THz. The absorber maintains stable performance for varying polarization angles and both TE and TM modes under oblique incidence. Impedance analysis confirms strong matching to free space, reducing reflection and eliminating transmission. Parametric analysis investigates the influence of VO2 conductivity, MF2 thickness, and unit cell periodicity on performance. Compared to recent THz metamaterial absorbers, the proposed design achieves broader bandwidth, higher efficiency, and simpler implementation. These characteristics make it suitable for THz sensing, imaging, wireless communication, and adaptive photonic systems, and position it as a promising platform for tunable and reconfigurable THz modules.
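As a hedged aside on how absorptance and impedance matching are commonly evaluated for ground-plane-backed absorbers: the relations below use the standard scattering parameters $S_{11}$ and $S_{21}$ and the normalized effective impedance $Z$. This notation is an assumption introduced here for illustration and does not appear in the abstract.

```latex
\begin{align}
A(\omega) &= 1 - R(\omega) - T(\omega) = 1 - |S_{11}(\omega)|^2 - |S_{21}(\omega)|^2, \\
T(\omega) \approx 0 \;&\Rightarrow\; A(\omega) \approx 1 - |S_{11}(\omega)|^2, \\
Z(\omega) &= \sqrt{\frac{\bigl(1 + S_{11}(\omega)\bigr)^2 - S_{21}(\omega)^2}{\bigl(1 - S_{11}(\omega)\bigr)^2 - S_{21}(\omega)^2}} .
\end{align}
```

With $S_{21} = 0$ the impedance reduces to $Z = (1 + S_{11})/(1 - S_{11})$, so a match to free space ($Z \approx 1$) corresponds to $S_{11} \approx 0$ and hence near-unity absorptance.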
- [74] arXiv:2508.05634 (cross-listed from cs.RO)
Title: Towards Generalizable Safety in Crowd Navigation via Conformal Uncertainty Handling
Comments: 9th Conference on Robot Learning (CoRL 2025); project website: https://gen-safe-nav.github.io/. arXiv admin note: text overlap with arXiv:2407.17460
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Systems and Control (eess.SY)
Mobile robots navigating in crowds trained using reinforcement learning are known to suffer performance degradation when faced with out-of-distribution scenarios. We propose that by properly accounting for the uncertainties of pedestrians, a robot can learn safe navigation policies that are robust to distribution shifts. Our method augments agent observations with prediction uncertainty estimates generated by adaptive conformal inference, and it uses these estimates to guide the agent's behavior through constrained reinforcement learning. The system helps regulate the agent's actions and enables it to adapt to distribution shifts. In the in-distribution setting, our approach achieves a 96.93% success rate, which is over 8.80% higher than the previous state-of-the-art baselines with over 3.72 times fewer collisions and 2.43 times fewer intrusions into ground-truth human future trajectories. In three out-of-distribution scenarios, our method shows much stronger robustness when facing distribution shifts in velocity variations, policy changes, and transitions from individual to group dynamics. We deploy our method on a real robot, and experiments show that the robot makes safe and robust decisions when interacting with both sparse and dense crowds. Our code and videos are available on https://gen-safe-nav.github.io/.
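To make the adaptive conformal inference step more concrete, here is a minimal Python sketch of the commonly used online update, in which the working miscoverage level is lowered after a miss (widening later prediction sets) and raised after a hit. The function name, the step size `gamma`, and the default values are illustrative assumptions; the abstract does not specify which variant or hyperparameters the authors use.

```python
def adaptive_conformal_alphas(errors, target_alpha=0.1, gamma=0.05):
    """Track the working miscoverage level alpha_t over time.

    errors[t] is 1 if the true pedestrian position fell outside the
    prediction set at step t, and 0 otherwise (hypothetical input format).
    """
    alpha_t = target_alpha
    history = [alpha_t]
    for err in errors:
        # A miss (err = 1) decreases alpha_t, so later sets become wider;
        # a hit (err = 0) increases it slightly, so later sets shrink.
        alpha_t = alpha_t + gamma * (target_alpha - err)
        history.append(alpha_t)
    return history
```

In use, `1 - alpha_t` would serve as the quantile level applied to past prediction errors to size the uncertainty region around each predicted pedestrian position; because `alpha_t` can drift outside [0, 1], it is usually clipped before being used as a quantile level.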
Cross submissions (showing 36 of 36 entries)
- [75] arXiv:2309.02265 (replaced)
Title: PESTO: Pitch Estimation with Self-supervised Transposition-equivariant Objective
Comments: Best Paper Award, 24th International Society for Music Information Retrieval Conference, ISMIR 2023
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
In this paper, we address the problem of pitch estimation using self-supervised learning (SSL). The SSL paradigm we use is equivariance to pitch transposition, which enables our model to accurately perform pitch estimation on monophonic audio after being trained only on a small unlabeled dataset. We use a lightweight (<30k parameters) Siamese neural network that takes as inputs two different pitch-shifted versions of the same audio represented by its Constant-Q Transform. To prevent the model from collapsing in an encoder-only setting, we propose a novel class-based transposition-equivariant objective which captures pitch information. Furthermore, we design the architecture of our network to be transposition-preserving by introducing learnable Toeplitz matrices. We evaluate our model for the two tasks of singing voice and musical instrument pitch estimation and show that our model is able to generalize across tasks and datasets while being lightweight, hence remaining compatible with low-resource devices and suitable for real-time applications. In particular, our results surpass self-supervised baselines and narrow the performance gap between self-supervised and supervised methods for pitch estimation.
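The role of the Toeplitz matrices can be illustrated with a small sketch. Because Constant-Q Transform bins are spaced logarithmically in frequency, a pitch transposition appears as a shift along the bin axis, and a Toeplitz matrix (whose entries depend only on the difference between row and column indices) maps a shifted input to an equally shifted output away from the boundaries. The helper name, kernel values, and sizes below are assumptions chosen for illustration, not the authors' layer.

```python
import numpy as np

def toeplitz_from_kernel(kernel, size):
    """Build a size x size Toeplitz matrix whose entry (i, j) depends only on i - j."""
    half = len(kernel) // 2
    matrix = np.zeros((size, size))
    for i in range(size):
        for j in range(size):
            d = i - j + half
            if 0 <= d < len(kernel):
                matrix[i, j] = kernel[d]
    return matrix

# Toy "spectrum" with a single active bin, and the same spectrum transposed by 3 bins.
spectrum = np.zeros(32)
spectrum[10] = 1.0
transposed = np.roll(spectrum, 3)

layer = toeplitz_from_kernel([0.2, 0.5, 0.3], size=32)

# Shifting the input shifts the output by the same number of bins.
out_then_shift = np.roll(layer @ spectrum, 3)
shift_then_out = layer @ transposed
```

In a learnable setting the kernel values would be the trainable parameters, which is one way a layer can stay transposition-preserving by construction.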
- [76] arXiv:2401.00740 (replaced)
Title: Beyond Subspace Isolation: Many-to-Many Transformer for Light Field Image Super-resolution
Comments: Accepted by IEEE Transactions on Multimedia
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
The effective extraction of spatial-angular features plays a crucial role in light field image super-resolution (LFSR) tasks, and the introduction of convolution and Transformers has led to significant improvement in this area. Nevertheless, due to the large 4D data volume of light field images, many existing methods opt to decompose the data into a number of lower-dimensional subspaces and apply Transformers in each subspace individually. As a side effect, these methods inadvertently restrict the self-attention mechanisms to a One-to-One scheme accessing only a limited subset of LF data, explicitly preventing comprehensive optimization on all spatial and angular cues. In this paper, we identify this limitation as subspace isolation and introduce a novel Many-to-Many Transformer (M2MT) to address it. M2MT aggregates angular information in the spatial subspace before performing the self-attention mechanism. It enables complete access to all information across all sub-aperture images (SAIs) in a light field image. Consequently, M2MT can comprehensively capture long-range correlation dependencies. With M2MT as the foundational component, we develop a simple yet effective M2MT network for LFSR. Our experimental results demonstrate that M2MT achieves state-of-the-art performance across various public datasets, and it offers a favorable balance between model performance and efficiency, yielding higher-quality LFSR results with substantially lower demand for memory and computation. We further conduct in-depth analysis using local attribution maps (LAM) to obtain visual interpretability, and the results validate that M2MT is empowered with a truly non-local context in both spatial and angular subspaces to mitigate subspace isolation and acquire effective spatial-angular representation.
- [77] arXiv:2404.03253 (replaced)
Title: A dataset of primary nasopharyngeal carcinoma MRI with multi-modalities segmentation
Authors: Yin Li, Qi Chen, Kai Wang, Meige Li, Liping Si, Yingwei Guo, Yu Xiong, Qixing Wang, Yang Qin, Ling Xu, Patrick van der Smagt, Jun Tang, Nutan Chen
Comments: This preprint has been submitted to and accepted in principle for publication in Scientific Data without significant modifications.
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Multi-modality magnetic resonance imaging (MRI) data facilitate early diagnosis, tumor segmentation, and disease staging in the management of nasopharyngeal carcinoma (NPC). The lack of publicly available, comprehensive datasets limits advancements in diagnosis, treatment planning, and the development of machine learning algorithms for NPC. Addressing this critical need, we introduce the first comprehensive NPC MRI dataset, encompassing MR axial imaging of 277 primary NPC patients. This dataset includes T1-weighted, T2-weighted, and contrast-enhanced T1-weighted sequences, totaling 831 scans. In addition to the corresponding clinical data, manually annotated and labeled segmentations by experienced radiologists offer high-quality data resources from untreated primary NPC.
- [78] arXiv:2407.21122 (replaced)
Title: Shadow Area and Degrees of Freedom for Free-Space Communication
Subjects: Signal Processing (eess.SP); Applied Physics (physics.app-ph); Classical Physics (physics.class-ph)
The number of degrees of freedom (NDoF) in a communication channel fundamentally limits the number of independent spatial modes available for transmitting and receiving information. Although the NDoF can be computed numerically for specific configurations using singular value decomposition (SVD) of the channel operator, this approach provides limited physical insight. In this paper, we introduce a simple analytical estimate for the NDoF between arbitrarily shaped transmitter and receiver regions in free space. In the electrically large limit, where the NDoF is high, it is well approximated by the mutual shadow area, measured in units of wavelength squared. This area corresponds to the projected overlap of the regions, integrated over all lines of sight, and captures their effective spatial coupling. The proposed estimate generalizes and unifies several previously established results, including those based on Weyl's law, shadow area, and the paraxial approximation. We analyze several example configurations to illustrate the accuracy of the estimate and validate it through comparisons with numerical SVD computations of the propagation channel. The results provide both practical tools and physical insight for the design and analysis of high-capacity communication and sensing systems.
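The numerical SVD route mentioned above can be sketched in a few lines of Python. The example below samples a scalar free-space Green's function between two parallel, facing line apertures and counts the significant singular values, alongside the classical paraxial estimate for that geometry. The geometry, the quarter-wavelength sampling, the 10% threshold, and all function names are assumptions made for illustration; this is not the paper's shadow-area estimate, which applies to arbitrarily shaped regions.

```python
import numpy as np

def paraxial_ndof(len_tx, len_rx, distance, wavelength):
    """Classical paraxial estimate for two parallel, facing line apertures."""
    return len_tx * len_rx / (wavelength * distance)

def channel_singular_values(len_tx, len_rx, distance, wavelength, spacing=0.25):
    """Singular values of a sampled scalar free-space channel between two line apertures."""
    k = 2 * np.pi / wavelength
    n_tx = int(round(len_tx / (spacing * wavelength))) + 1
    n_rx = int(round(len_rx / (spacing * wavelength))) + 1
    x_tx = np.linspace(-len_tx / 2, len_tx / 2, n_tx)
    x_rx = np.linspace(-len_rx / 2, len_rx / 2, n_rx)
    # Distance between every receiver sample and every transmitter sample.
    r = np.sqrt(distance**2 + (x_rx[:, None] - x_tx[None, :]) ** 2)
    h = np.exp(-1j * k * r) / (4 * np.pi * r)  # scalar Green's function samples
    return np.linalg.svd(h, compute_uv=False)

def count_significant(singular_values, rel_threshold=0.1):
    """Count singular values above a fraction of the largest one (threshold is a choice)."""
    return int(np.sum(singular_values >= rel_threshold * singular_values[0]))

# Two 20-wavelength apertures separated by 100 wavelengths.
sv = channel_singular_values(len_tx=20.0, len_rx=20.0, distance=100.0, wavelength=1.0)
estimate = paraxial_ndof(20.0, 20.0, 100.0, 1.0)
numeric = count_significant(sv)
```

The numerical count depends on the chosen threshold because the singular values decay over a transition region rather than dropping abruptly, which is part of why a closed-form estimate is useful.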
- [79] arXiv:2411.01567 (replaced)
Title: Online Graph Topology Learning via Time-Vertex Adaptive Filters: From Theory to Cardiac Fibrillation
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG); Machine Learning (stat.ML)
Graph Signal Processing (GSP) provides a powerful framework for analysing complex, interconnected systems by modelling data as signals on graphs. While recent advances have enabled graph topology learning from observed signals, existing methods often struggle with time-varying systems and real-time applications. To address this gap, we introduce AdaCGP, a sparsity-aware adaptive algorithm for dynamic graph topology estimation from multivariate time series. AdaCGP estimates the Graph Shift Operator (GSO) through recursive update formulae designed to address sparsity, shift-invariance, and bias. Through comprehensive simulations, we demonstrate that AdaCGP consistently outperforms multiple baselines across diverse graph topologies, achieving improvements exceeding 83% in GSO estimation compared to state-of-the-art methods while maintaining favourable computational scaling properties. Our variable splitting approach enables reliable identification of causal connections with near-zero false alarm rates and minimal missed edges. Applied to cardiac fibrillation recordings, AdaCGP tracks dynamic changes in propagation patterns more effectively than established methods like Granger causality, capturing temporal variations in graph topology that static approaches miss. The algorithm successfully identifies stability characteristics in conduction patterns that may maintain arrhythmias, demonstrating potential for clinical applications in diagnosis and treatment of complex biomedical systems.
- [80] arXiv:2501.00378 (replaced)
Title: STARFormer: A Novel Spatio-Temporal Aggregation Reorganization Transformer of FMRI for Brain Disorder Diagnosis
Journal reference: Neural Networks, 2025
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Many existing methods that use functional magnetic resonance imaging (fMRI) to classify brain disorders, such as autism spectrum disorder (ASD) and attention deficit hyperactivity disorder (ADHD), often overlook the integration of spatial and temporal dependencies of the blood oxygen level-dependent (BOLD) signals, which may lead to inaccurate or imprecise classification results. To solve this problem, we propose a Spatio-Temporal Aggregation Reorganization Transformer (STARFormer) that effectively captures both spatial and temporal features of BOLD signals by incorporating three key modules. The region of interest (ROI) spatial structure analysis module uses eigenvector centrality (EC) to reorganize brain regions based on effective connectivity, highlighting critical spatial relationships relevant to the brain disorder. The temporal feature reorganization module systematically segments the time series into equal-dimensional window tokens and captures multiscale features through variable window and cross-window attention. The spatio-temporal feature fusion module employs a parallel transformer architecture with dedicated temporal and spatial branches to extract integrated features. The proposed STARFormer has been rigorously evaluated on two publicly available datasets for the classification of ASD and ADHD. The experimental results confirm that the STARFormer achieves state-of-the-art performance across multiple evaluation metrics, providing a more accurate and reliable tool for the diagnosis of brain disorders and biomedical research. The code is available at: https://github.com/NZWANG/STARFormer.
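As a rough illustration of the ROI reorganization idea, the Python sketch below computes eigenvector centrality by power iteration and orders regions from most to least central. It assumes a symmetric, non-negative, connected connectivity matrix and uses hypothetical function names; the authors' actual implementation is in the linked repository and may differ, for example in how directed effective connectivity is handled.

```python
import numpy as np

def eigenvector_centrality(connectivity, iters=200, tol=1e-12):
    """Power iteration on (A + I), which has the same eigenvectors as A
    but avoids oscillation on bipartite-like graphs."""
    a = np.asarray(connectivity, dtype=float)
    shifted = a + np.eye(a.shape[0])
    x = np.ones(a.shape[0]) / np.sqrt(a.shape[0])
    for _ in range(iters):
        x_next = shifted @ x
        x_next /= np.linalg.norm(x_next)
        if np.linalg.norm(x_next - x) < tol:
            return x_next
        x = x_next
    return x

def reorder_rois(connectivity):
    """Return ROI indices sorted from most to least central."""
    centrality = eigenvector_centrality(connectivity)
    return np.argsort(-centrality)

# Toy graph: ROIs 0, 1, 2 form a triangle, and ROI 3 connects only to ROI 0.
toy = np.array([
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
])
order = reorder_rois(toy)
```

Here ROI 0 comes first and the weakly connected ROI 3 comes last; the same index permutation could then be applied to the rows of a BOLD time-series matrix.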
- [81] arXiv:2501.03536 (replaced)
Title: Overview of Automatic Speech Analysis and Technologies for Neurodegenerative Disorders: Diagnosis and Assistive Applications
Comments: Published in the IEEE Journal of Selected Topics in Signal Processing
Journal reference: https://ieeexplore.ieee.org/abstract/document/11086511/
Subjects: Audio and Speech Processing (eess.AS)
Advancements in spoken language technologies for neurodegenerative speech disorders are crucial for meeting both clinical and technological needs. This overview paper is vital for advancing the field, as it presents a comprehensive review of state-of-the-art methods in pathological speech detection, automatic speech recognition, pathological speech intelligibility enhancement, intelligibility and severity assessment, and data augmentation approaches for pathological speech. It also highlights key challenges, such as ensuring robustness, privacy, and interpretability. The paper concludes by exploring promising future directions, including the adoption of multimodal approaches and the integration of large language models to further advance speech technologies for neurodegenerative speech disorders.
- [82] arXiv:2502.03681 (replaced)
Title: On the effects of angular acceleration in orientation estimation using inertial measurement units
Subjects: Systems and Control (eess.SY)
In this paper, we analyze the orientation estimation problem using inertial measurement units. Many estimation algorithms suffer degraded performance when accelerations other than gravity affect the accelerometer. We show that linear accelerations resulting from rotational accelerations cannot be treated as an external disturbance to be attenuated; rather, they change the dynamic behavior of the filter itself. In particular, this results in the introduction of additional zeros in the linearized transfer functions. These zeros lead to nonminimum phase behavior, which is known to be challenging for control. We validate these findings experimentally. Further, we demonstrate that Mahony and Madgwick filters can attenuate the acceleration at the expense of reduced bandwidth. In addition, we show that validation schemes based on precollected data fail to capture these closed-loop effects accurately.
- [83] arXiv:2503.05797 (replaced)
Title: GNN-Enhanced Fault Diagnosis Method for Parallel Cyber-physical Attacks in Power Grids
Comments: 10 pages, 3 figures, 5 tables, journal
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Parallel cyber-physical attacks (PCPA) simultaneously damage physical transmission lines and block measurement data transmission in power grids, impairing or delaying system protection and recovery. This paper investigates the fault diagnosis problem for a linearized (DC) power flow model under PCPA. The physical attack mechanism includes not only line disconnection but also admittance modification, for example via compromised distributed flexible AC transmission system (D-FACTS) devices. To address this problem, we propose a fault diagnosis framework based on meta-mixed-integer programming (MMIP), integrating graph attention network-based fault localization (GAT-FL). First, we derive measurement reconstruction conditions that allow reconstructing unknown measurements in attacked areas from available measurements and the system topology. Based on these conditions, we formulate the diagnosis task as an MMIP model. The GAT-FL predicts a probability distribution over potential physical attacks, which is then incorporated as objective coefficients in the MMIP. Solving the MMIP yields optimal attack location and magnitude estimates, from which the system states are also reconstructed. Experimental simulations are conducted on IEEE 30/118 bus standard test cases to demonstrate the effectiveness of the proposed fault diagnosis algorithms.
- [84] arXiv:2503.18625 (replaced)
Title: Maximum Likelihood Estimation Based Complex-Valued Robust Chinese Remainder Theorem and Its Fast Algorithm
Comments: 22 pages, 18 figures
Subjects: Signal Processing (eess.SP)
Recently, a multi-channel self-reset analog-to-digital converter (ADC) system with complex-valued moduli has been proposed. This system enables the recovery of high dynamic range complex-valued bandlimited signals at low sampling rates via the Chinese remainder theorem (CRT). In this paper, we investigate complex-valued CRT (C-CRT) with erroneous remainders, where the errors follow wrapped complex Gaussian distributions. Based on the existing real-valued CRT utilizing maximum likelihood estimation (MLE), we propose a fast MLE-based C-CRT (MLE C-CRT). The proposed algorithm requires only $2L$ searches to obtain the optimal estimate of the common remainder, where $L$ is the number of moduli. Once the common remainder is estimated, the complex number can be determined using the C-CRT. Furthermore, we obtain a necessary and sufficient condition for the fast MLE C-CRT to achieve robust estimation. Finally, we apply the proposed algorithm to ADCs. The results demonstrate that the proposed algorithm outperforms the existing methods.
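The two-stage structure described above (first the common remainder, then the number itself) can be sketched for the much simpler real-valued, noise-free integer case. The sketch assumes moduli of the form M * Gamma_i with pairwise coprime Gamma_i, so every remainder shares the same value modulo M; the function names and this simplification are illustrative assumptions, and the paper's complex-valued MLE search over noisy remainders is not reproduced here.

```python
from math import prod

def crt(residues, moduli):
    """Classical CRT for pairwise coprime moduli (requires Python 3.8+ for pow(x, -1, m))."""
    total = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        partial = total // m
        x += r * partial * pow(partial, -1, m)
    return x % total

def reconstruct(remainders, common_factor, coprime_parts):
    """Recover N from its remainders modulo common_factor * coprime_parts[i]."""
    # Stage 1: with error-free remainders, each one gives the same common remainder.
    common = remainders[0] % common_factor
    # Stage 2: the folded quotients are the residues of N // common_factor.
    quotients = [(r - common) // common_factor for r in remainders]
    n0 = crt(quotients, coprime_parts)
    return common_factor * n0 + common

# Moduli 30, 50, 70 share the factor M = 10; the true value is 987.
coprime_parts = [3, 5, 7]
moduli = [10 * g for g in coprime_parts]
remainders = [987 % m for m in moduli]
recovered = reconstruct(remainders, 10, coprime_parts)
```

With noisy remainders the common remainder can no longer be read off a single channel, which is where an estimator such as the MLE search described in the abstract comes in.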
- [85] arXiv:2503.23324 (replaced)
Title: A Time Splitting Based Optimization Method for Nonlinear MHE
Subjects: Systems and Control (eess.SY)
Moving Horizon Estimation (MHE) is essentially an optimization-based approach designed to estimate the states of dynamic systems within a moving time horizon. Traditional MHE solutions become computationally prohibitive due to the curse of dimensionality arising from increasing problem complexity and growing length of the time horizon. To address this issue, we propose novel computationally efficient algorithms for solving nonlinear MHE problems. Specifically, we first introduce a distributed reformulation utilizing a time-splitting technique. Leveraging this reformulation, we develop the Efficient Gauss-Newton Augmented Lagrangian Alternating Direction Inexact Newton (ALADIN) to achieve computational efficiency. Additionally, to accommodate limited computational capabilities inherent in some sub-problem solvers, we propose the Efficient Sensitivity Assisted ALADIN, which enables sub-problems to be solved inexactly without hindering computational efficiency. Furthermore, recognizing scenarios where sub-problem solvers possess no computational power, we propose a Distributed Sequential Quadratic Programming (SQP) that relies solely on first- and second-order information of local objective functions. We demonstrate the performance and advantages of our proposed methods through numerical experiments on a differential-drive robot case, a practical nonlinear MHE problem. Our results demonstrate that the three proposed algorithms achieve computational efficiency while preserving high accuracy, thereby satisfying the real-time requirements of MHE.
- [86] arXiv:2503.24105 (replaced)
Title: Data-Driven Distributed Output Synchronization of Heterogeneous Discrete-Time Multi-Agent Systems
Comments: Extended version of a conference paper accepted for presentation at the 64th IEEE Conference on Decision and Control. Compared with the previous version, some typos have been corrected and the proof of Lemma 13 in the appendix has been expanded
Subjects: Systems and Control (eess.SY)
In this paper, we assume that an autonomous exosystem generates a reference output, and we consider the problem of designing a distributed data-driven control law for a family of discrete-time heterogeneous LTI agents, connected through a directed graph, in order to synchronize the agents' outputs to the reference one. The agents of the network are split into two categories: leaders, with direct access to the exosystem output, and followers, which only receive information from their neighbors. All agents aim to achieve output synchronization by means of a state feedback that makes use of their own states as well as of an estimate of the exogenous system state, provided by an internal state observer. Such an observer has a different structure for leaders and followers. Necessary and sufficient conditions for the existence of a solution are first derived in the model-based set-up and then in a data-driven context. An example illustrates both the implementation procedure and the performance of the proposed approach.
- [87] arXiv:2504.07675 (replaced)
Title: Low-Complexity Optimization of Antenna Switching Schemes for Dynamic Channel Sounding
Comments: This paper has been submitted to IEEE Transactions on Wireless Communications. 14 pages, 6 figures, 3 tables.
Subjects: Signal Processing (eess.SP)
Understanding wireless channels is crucial for the design of wireless systems. For mobile communication, sounders and antenna arrays with short measurement times are required to simultaneously capture the dynamic and spatial channel characteristics. Switched antenna arrays are an attractive option that can overcome the high cost of real arrays and the long measurement times of virtual arrays. Optimization of the switching sequences is then essential to avoid aliasing and increase the accuracy of channel parameter estimates. This paper provides a novel and comprehensive analysis of the design of switching sequences. We first review the conventional spatio-temporal ambiguity function, extend it to dual-polarized antenna arrays, and analyze its prohibitive complexity when applied to ultra-massive antenna arrays. We thus propose a new method that uses the Fisher information matrix to tackle the estimation accuracy. We also propose to minimize the ambiguity by choosing a switching sequence that minimizes side lobes in its Fourier spectrum. In this sense, we divide the sequence design problem into Fourier-based ambiguity reduction and Fisher-based accuracy improvement, and refer to the resulting design approach as Fourier-Fisher. Simulations and measurements show that the Fourier-Fisher approach achieves performance identical to that of the conventional ambiguity-based approach at significantly lower computational complexity.
- [88] arXiv:2505.02611 (replaced) [Chinese pdf, pdf, html, other]
-
Title: Multi-dimensional Parameter Estimation in RIS-aided MU-MIMO Channels
Comments: Paper submitted to IEEE Wireless Communications Letters. Copyright may change without notice.
Subjects: Signal Processing (eess.SP)
We address the channel estimation problem in reconfigurable intelligent surface (RIS) aided broadband systems by proposing a dual-structure and multi-dimensional transformations (DS-MDT) algorithm. The proposed approach leverages the dual-structure features of the channel parameters to assist users experiencing weaker channel conditions, thereby enhancing estimation performance. Moreover, given that the channel parameters are distributed across multiple dimensions of the received tensor, the proposed algorithm employs multi-dimensional transformations to effectively isolate and extract distinct parameters. The numerical results demonstrate the proposed algorithm reduces the normalized mean square error (NMSE) by up to 10 dB while maintaining lower complexity compared to state-of-the-art methods.
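For readers unfamiliar with the metric, the sketch below shows how NMSE is usually expressed in dB and what a 10 dB reduction means numerically (a tenfold drop in normalized error energy). The channel vector and the error levels are made-up illustrative values, not data from the paper, and the function name `nmse_db` is an assumption.

```python
import math

def nmse_db(h_true, h_est):
    """Normalized mean square error between two equal-length complex vectors, in dB."""
    err = sum(abs(a - b) ** 2 for a, b in zip(h_true, h_est))
    ref = sum(abs(a) ** 2 for a in h_true)
    return 10.0 * math.log10(err / ref)

# Hypothetical channel coefficients (illustrative only).
h = [1 + 1j, 0.5 - 0.2j, -0.3 + 0.8j]

# Two hypothetical estimates: a 10% relative error, and that error scaled by 1/sqrt(10).
coarse = [x * (1 + 0.1) for x in h]
fine = [x * (1 + 0.1 / math.sqrt(10)) for x in h]

nmse_coarse = nmse_db(h, coarse)
nmse_fine = nmse_db(h, fine)
improvement_db = nmse_coarse - nmse_fine
```

Shrinking the error amplitude by a factor of sqrt(10) shrinks the error energy by a factor of 10, which is exactly a 10 dB NMSE improvement.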
- [89] arXiv:2505.08642 (replaced) [Chinese pdf, pdf, html, other]
-
Title: Robust Beamforming Design for STAR-RIS Aided RSMA Network with Hardware Impairments
Subjects: Signal Processing (eess.SP)
In this article, we investigate the robust beamforming design for a simultaneous transmitting and reflecting reconfigurable intelligent surface (STAR-RIS) aided downlink rate-splitting multiple access (RSMA) communication system, where both transceivers and STAR-RIS suffer from the impact of hardware impairments (HWI). A base station (BS) is deployed to transmit messages concurrently to multiple users, utilizing a STAR-RIS to improve communication quality and expand user coverage. We aim to maximize the achievable sum rate of the users subject to constraints on the transmit power, the STAR-RIS coefficients, and the actual rate of the common stream for all users. To solve this challenging, highly coupled, non-convex problem, we adopt a fractional programming (FP)-based alternating optimization (AO) approach, where each sub-problem is addressed via successive convex approximation (SCA) and penalty function (PF) methods. Numerical results demonstrate that the proposed scheme outperforms other multiple access schemes and conventional passive RIS in terms of the achievable sum rate. Additionally, considering the HWI of the transceiver and STAR-RIS makes our algorithm more robust than when such considerations are not included.
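- [90] arXiv:2506.11671 (replaced) [Chinese pdf, pdf, html, other]
-
Title: Brain Network Analysis Based on Fine-tuned Self-supervised Model for Brain Disease Diagnosis
Comments: 13 pages, 3 figures, International Conference on Neural Computing for Advanced Applications
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)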
Functional brain network analysis has become an indispensable tool for brain disease analysis. It is profoundly impacted by deep learning methods, which can characterize complex connections between ROIs. However, research on foundation models for brain networks is limited and constrained to a single dimension, which restricts their broad application in neuroscience. In this study, we propose a fine-tuned brain network model for brain disease diagnosis. It expands brain region representations across multiple dimensions based on the original brain network model, thereby enhancing its generalizability. Our model consists of two key modules: (1) an adapter module that expands brain region features across different dimensions; (2) a fine-tuned foundation brain network model, based on self-supervised learning and pre-trained on fMRI data from thousands of participants. Specifically, its transformer block is able to effectively extract brain region features and compute the inter-region associations. Moreover, we derive a compact latent representation of the brain network for brain disease diagnosis. Our downstream experiments demonstrate that the proposed model achieves superior performance in brain disease diagnosis, potentially offering a promising approach for brain network analysis research.
- [91] arXiv:2506.13053 (replaced) [Chinese pdf, pdf, html, other]
-
Title: ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching
Authors: Han Zhu, Wei Kang, Zengwei Yao, Liyong Guo, Fangjun Kuang, Zhaoqing Li, Weiji Zhuang, Long Lin, Daniel Povey
Comments: Accepted at ASRU 2025
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
Existing large-scale zero-shot text-to-speech (TTS) models deliver high speech quality but suffer from slow inference speeds due to massive parameters. To address this issue, this paper introduces ZipVoice, a high-quality flow-matching-based zero-shot TTS model with a compact model size and fast inference speed. Key designs include: 1) a Zipformer-based vector field estimator to maintain adequate modeling capabilities under constrained size; 2) average upsampling-based initial speech-text alignment and a Zipformer-based text encoder to improve speech intelligibility; 3) a flow distillation method to reduce sampling steps and eliminate the inference overhead associated with classifier-free guidance. Experiments on 100k hours of multilingual data show that ZipVoice matches state-of-the-art models in speech quality, while being 3 times smaller and up to 30 times faster than a DiT-based flow-matching baseline. Code, model checkpoints, and demo samples are publicly available at https://github.com/k2-fsa/ZipVoice.
- [92] arXiv:2506.15105 (replaced) [Chinese pdf, pdf, html, other]
-
Title: Skew-Induced Insertion Loss Deviation (SILD) and FOM_SILD: Metrics for Quantifying P/N Skew Effects in High-Speed Channels
Subjects: Systems and Control (eess.SY); Signal Processing (eess.SP)
The rise of AI workloads and growing data center demands have driven the need for ultra-high-speed interconnects exceeding 200 Gb/s. As unit intervals (UI) shrink, even a few picoseconds of P/N skew can degrade serializer-deserializer (SerDes) performance. Traditional methods for quantifying skew fall short in capturing its impact. We introduce two new metrics: 1) Skew-Induced Insertion Loss Deviation (SILD) and 2) its complementary Figure of Merit (FOM_SILD), analytically developed to assess P/N skew effects. Measured S-parameters confirm FOM_SILD reciprocity, while simulations of 224G PAM4 SerDes show strong correlation with bit error rate (BER) trends. This approach offers a robust framework for analyzing skew in next-generation ultra-high-speed interconnects.
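To make the "few picoseconds" claim concrete, the sketch below works out the unit interval of a 224 Gb/s PAM4 link and the share of it consumed by a given skew. It also evaluates a textbook idealization in which a lossless differential pair with P/N skew Δt has its differential amplitude scaled by |cos(π f Δt)|. That idealization is an assumption used here for illustration only; it is not the paper's SILD or FOM_SILD definition, and the 2 ps skew value is hypothetical.

```python
import math

def unit_interval_ps(bit_rate_gbps, bits_per_symbol):
    """Unit interval in picoseconds for a given line rate and modulation order."""
    symbol_rate_gbaud = bit_rate_gbps / bits_per_symbol
    return 1000.0 / symbol_rate_gbaud  # 1 / GBd is in ns; x1000 gives ps

def skew_loss_db(freq_ghz, skew_ps):
    """Differential-mode loss from P/N skew under an idealized lossless-pair model."""
    phase = math.pi * freq_ghz * 1e9 * skew_ps * 1e-12  # pi * f * delta_t, in radians
    return -20.0 * math.log10(abs(math.cos(phase)))

ui_ps = unit_interval_ps(224.0, 2)   # PAM4 carries 2 bits per symbol
nyquist_ghz = 224.0 / 2 / 2          # half the symbol rate
skew_ps = 2.0                        # hypothetical skew value
skew_fraction = skew_ps / ui_ps      # share of the unit interval taken by the skew
loss_db = skew_loss_db(nyquist_ghz, skew_ps)
```

At 112 GBd the unit interval is just under 9 ps, so a 2 ps skew already occupies more than a fifth of it.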
- [93] arXiv:2507.06417 (replaced) [Chinese pdf, pdf, html, other]
-
Title: Capsule-ConvKAN: A Hybrid Neural Approach to Medical Image Classification
Comments: Preprint version. Accepted to IEEE SMC 2025
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
This study conducts a comprehensive comparison of four neural network architectures: Convolutional Neural Network, Capsule Network, Convolutional Kolmogorov-Arnold Network, and the newly proposed Capsule-Convolutional Kolmogorov-Arnold Network. The proposed Capsule-ConvKAN architecture combines the dynamic routing and spatial hierarchy capabilities of Capsule Network with the flexible and interpretable function approximation of Convolutional Kolmogorov-Arnold Networks. This novel hybrid model was developed to improve feature representation and classification accuracy, particularly in challenging real-world biomedical image data. The architectures were evaluated on a histopathological image dataset, where Capsule-ConvKAN achieved the highest classification performance with an accuracy of 91.21%. The results demonstrate the potential of the newly introduced Capsule-ConvKAN in capturing spatial patterns, managing complex features, and addressing the limitations of traditional convolutional models in medical image classification.
- [94] arXiv:2507.12237 (replaced) [Chinese pdf, pdf, other]
-
Title: Constructed Realities? Technical and Contextual Anomalies in a High-Profile Image
Comments: 41 pages, 9 figures, 39 references
Subjects: Image and Video Processing (eess.IV)
This study offers a forensic assessment of a widely circulated photograph featuring Prince Andrew, Virginia Giuffre, and Ghislaine Maxwell - an image that has played a pivotal role in public discourse and legal narratives. Through analysis of multiple published versions, several inconsistencies are identified, including irregularities in lighting, posture, and physical interaction, which are more consistent with digital compositing than with an unaltered snapshot. While the absence of the original negative and a verifiable audit trail precludes definitive conclusions, the technical and contextual anomalies suggest that the image may have been deliberately constructed. Nevertheless, without additional evidence, the photograph remains an unresolved but symbolically charged fragment within a complex story of abuse, memory, and contested truth.
- [95] arXiv:2508.01660 (replaced) [Chinese pdf, pdf, html, other]
-
Title: Attitude Determination and Control of GPS Satellites: Stabilization, Orbital Insertion, and Operational Control Mechanisms
Comments: 8 pages, 3 figures
Subjects: Systems and Control (eess.SY); Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM)
Global Positioning System (GPS) satellites are essential for providing accurate navigation and timing information worldwide. Operating in medium Earth orbit (MEO), these satellites must maintain precise Earth-pointing attitudes to transmit signals effectively. This paper presents a comprehensive review of the operational dynamics, attitude determination and control systems (ADCS), and orbital insertion techniques for GPS satellites. We explore the integration of sensors and actuators, control algorithms, stabilization strategies, and the launch procedures required to deploy these satellites. Key equations related to orbital mechanics and attitude control are discussed, and references to recent technical literature are included.
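- [96] arXiv:2508.02881 (replaced) [Chinese pdf, pdf, html, other]
-
Title: Optimizing Preventive and Reactive Defense Resource Allocation with Uncertain Sensor Signals
Comments: 6 pages, 6 figures. Accepted for presentation at the 61st Allerton Conference on Communication, Control, and Computing
Subjects: Systems and Control (eess.SY); Cryptography and Security (cs.CR); Computer Science and Game Theory (cs.GT)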
Cyber attacks continue to be a cause of concern despite advances in cyber defense techniques. Although cyber attacks cannot be fully prevented, standard decision-making frameworks typically focus on how to prevent them from succeeding, without considering the cost of cleaning up the damage incurred by successful attacks. This motivates us to investigate a new resource allocation problem formulated in this paper: the defender must decide how to split its investment between preventive defenses, which aim to harden nodes against attacks, and reactive defenses, which aim to quickly clean up compromised nodes. The problem is complicated by uncertainty in the observation, or sensor signal, of whether a node is truly compromised; this uncertainty is real because attack detectors are not perfect. We investigate how the quality of sensor signals impacts the defender's strategic investment in the two types of defense, and ultimately the level of security that can be achieved. In particular, we show that the optimal investment in preventive resources increases, and thus reactive resource investment decreases, with higher sensor quality. We also show that the defender's performance improvement, relative to a baseline of no sensors employed, is maximal when the attacker can only achieve low attack success probabilities.
- [97] arXiv:2508.04068 (replaced) [Chinese pdf, pdf, html, other]
-
Title: WiFo-CF: Wireless Foundation Model for CSI Feedback
Subjects: Signal Processing (eess.SP)
Deep learning-based channel state information (CSI) feedback schemes demonstrate strong compression capabilities but are typically constrained to fixed system configurations, limiting their generalization and flexibility. To address this challenge, WiFo-CF, a novel wireless foundation model tailored for CSI feedback, is proposed, uniquely accommodating heterogeneous configurations such as varying channel dimensions, feedback rates, and data distributions within a unified framework through its key innovations: (1) a multi-user, multi-rate self-supervised pre-training strategy; and (2) a Mixture of Shared and Routed Expert (S-R MoE) architecture. Supporting the large-scale pre-training of WiFo-CF is the first heterogeneous channel feedback dataset, whose diverse patterns enable the model to achieve superior performance on both in-distribution and out-of-distribution data across simulated and real-world scenarios. Furthermore, the learned representations effectively facilitate adaptation to downstream tasks such as CSI-based indoor localization, validating WiFo-CF's scalability and deployment potential.
- [98] arXiv:2508.04585 (replaced) [Chinese pdf, pdf, html, other]
-
Title: UniTalker: Conversational Speech-Visual Synthesis
Comments: 15 pages, 8 figures, accepted by ACM MM 2025
Subjects: Audio and Speech Processing (eess.AS)
Conversational Speech Synthesis (CSS) is a key task in the user-agent interaction area, aiming to generate more expressive and empathetic speech for users. However, it is well-known that "listening" and "eye contact" play crucial roles in conveying emotions during real-world interpersonal communication. Existing CSS research is limited to perceiving only text and speech within the dialogue context, which restricts its effectiveness. Moreover, speech-only responses further constrain the interactive experience. To address these limitations, we introduce a Conversational Speech-Visual Synthesis (CSVS) task as an extension of traditional CSS. By leveraging multimodal dialogue context, it provides users with coherent audiovisual responses. To this end, we develop a CSVS system named UniTalker, which is a unified model that seamlessly integrates multimodal perception and multimodal rendering capabilities. Specifically, it leverages a large-scale language model to comprehensively understand multimodal cues in the dialogue context, including speaker, text, speech, and the talking-face animations. After that, it employs multi-task sequence prediction to first infer the target utterance's emotion and then generate empathetic speech and natural talking-face animations. To ensure that the generated speech-visual content remains consistent in terms of emotion, content, and duration, we introduce three key optimizations: 1) Designing a specialized neural landmark codec to tokenize and reconstruct facial expression sequences. 2) Proposing a bimodal speech-visual hard alignment decoding strategy. 3) Applying emotion-guided rendering during the generation stage. Comprehensive objective and subjective experiments demonstrate that our model synthesizes more empathetic speech and provides users with more natural and emotionally consistent talking-face animations.
- [99] arXiv:2508.04588 (replaced) [Chinese pdf, pdf, html, other]
-
Title: A Comprehensive Framework for Uncertainty Quantification of Voxel-wise Supervised Models in IVIM MRI
Authors: Nicola Casali, Alessandro Brusaferri, Giuseppe Baselli, Stefano Fumagalli, Edoardo Micotti, Gianluigi Forloni, Riaz Hussein, Giovanna Rizzo, Alfonso Mastropietro
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Accurate estimation of intravoxel incoherent motion (IVIM) parameters from diffusion-weighted MRI remains challenging due to the ill-posed nature of the inverse problem and high sensitivity to noise, particularly in the perfusion compartment. In this work, we propose a probabilistic deep learning framework based on Deep Ensembles (DE) of Mixture Density Networks (MDNs), enabling estimation of total predictive uncertainty and its decomposition into aleatoric (AU) and epistemic (EU) components. The method was benchmarked against non-probabilistic neural networks, a Bayesian fitting approach, and a probabilistic network with a single-Gaussian parametrization. Supervised training was performed on synthetic data, and evaluation was conducted on both simulated data and an in vivo dataset. The reliability of the quantified uncertainties was assessed using calibration curves, output distribution sharpness, and the Continuous Ranked Probability Score (CRPS). MDNs produced better calibrated and sharper predictive distributions for the diffusion coefficient D and the fraction f, although slight overconfidence was observed for the pseudo-diffusion coefficient D*. The Robust Coefficient of Variation (RCV) indicated smoother in vivo estimates for D* with MDNs compared to the Gaussian model. Despite the training data covering the expected physiological range, elevated EU in vivo suggests a mismatch with real acquisition conditions, highlighting the importance of incorporating EU, which DE makes possible. Overall, we present a comprehensive framework for IVIM fitting with uncertainty quantification, which enables the identification and interpretation of unreliable estimates. The proposed approach can also be adopted for fitting other physical models through appropriate architectural and simulation adjustments.
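The decomposition into aleatoric and epistemic parts mentioned above is commonly done for deep ensembles with the law of total variance: the average of the members' predicted variances is treated as aleatoric, and the spread of the members' predicted means as epistemic. The sketch below assumes, for simplicity, that each ensemble member reports one mean and one variance per voxel; with MDN members, the mixture mean and variance would have to be computed first. The function name and the numbers are hypothetical, and this is not claimed to be the authors' exact procedure.

```python
def decompose_uncertainty(means, variances):
    """Split ensemble predictive variance into aleatoric and epistemic parts.

    means, variances: per-member predicted mean and variance for one voxel.
    Returns (ensemble_mean, aleatoric, epistemic, total).
    """
    m = len(means)
    ensemble_mean = sum(means) / m
    aleatoric = sum(variances) / m                                   # mean of variances
    epistemic = sum((mu - ensemble_mean) ** 2 for mu in means) / m   # variance of means
    return ensemble_mean, aleatoric, epistemic, aleatoric + epistemic

# Hypothetical outputs of a three-member ensemble for one parameter at one voxel.
member_means = [1.0, 1.2, 0.8]
member_vars = [0.04, 0.05, 0.03]
mean, au, eu, total = decompose_uncertainty(member_means, member_vars)
```

When the members agree closely, the epistemic term shrinks toward zero; a large epistemic term flags inputs unlike the training data.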
- [100] arXiv:2307.14297 (replaced) [Chinese pdf, pdf, html, other]
-
Title: Robust Regret Optimal Control
Journal reference: International Journal of Robust and Nonlinear Control 34.7 (2024): 4532-4553
Subjects: Optimization and Control (math.OC); Systems and Control (eess.SY)
This paper presents a synthesis method for robust, regret optimal control. The plant is modeled in discrete-time by an uncertain linear time-invariant (LTI) system. An optimal non-causal controller is constructed using the nominal plant model and given full knowledge of the disturbance. Robust regret is defined relative to the performance of this optimal non-causal control. It is shown that a controller achieves robust regret if and only if it satisfies a robust $H_\infty$ performance condition. DK-iteration can be used to synthesize a controller that satisfies this condition and hence achieves a given level of robust regret. The approach is demonstrated on three examples: (i) a simple single-input, single-output classical design, (ii) longitudinal control for a simplified Boeing 747 model, and (iii) an active suspension for a quarter car model. All examples compare the robust regret optimal controller against regret optimal controllers designed without uncertainty.
- [101] arXiv:2403.10934 (replaced) [Chinese pdf, pdf, html, other]
-
Title: Quaternion-Based Sliding Mode Control for Six Degrees of Freedom Flight Control of Quadrotors
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Despite extensive research on sliding mode control (SMC) design for quadrotors, the existing approaches suffer from certain limitations. Euler angle-based SMC formulations suffer from poor performance in high-pitch or -roll maneuvers. Quaternion-based SMC approaches have unwinding issues and complex architecture. Coordinate-free methods are slow and only almost globally stable. This paper presents a new six degrees of freedom SMC flight controller to address the above limitations. We use a cascaded architecture with a position controller in the outer loop and a quaternion-based attitude controller in the inner loop. The position controller generates the desired trajectory for the attitude controller using a coordinate-free approach. The quaternion-based attitude controller uses the natural characteristics of the quaternion hypersphere, featuring a simple structure while providing global stability and avoiding unwinding issues. We compare our controller with three other common control methods conducting challenging maneuvers like flip-over and high-speed trajectory tracking in the presence of model uncertainties and disturbances. Our controller consistently outperforms the benchmark approaches with less control effort and actuator saturation, offering highly effective and efficient flight control.
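The unwinding issue arises because q and -q encode the same attitude, so a controller that uses the raw error quaternion may rotate the long way around. One common remedy, sketched below, is to flip the sign of the error quaternion whenever its scalar part is negative. This illustrates the general idea only; the abstract does not say that the authors' controller uses this mechanism. The helper names and the scalar-first Hamilton convention are assumptions.

```python
import math

def quat_mul(p, q):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return (pw * qw - px * qx - py * qy - pz * qz,
            pw * qx + px * qw + py * qz - pz * qy,
            pw * qy - px * qz + py * qw + pz * qx,
            pw * qz + px * qy - py * qx + pz * qw)

def quat_conj(q):
    """Conjugate, which is the inverse for unit quaternions."""
    w, x, y, z = q
    return (w, -x, -y, -z)

def shortest_error(q_des, q):
    """Error quaternion from q_des to q, with its sign chosen for the shorter rotation."""
    e = quat_mul(quat_conj(q_des), q)
    if e[0] < 0:                      # same attitude, but the long way around
        e = tuple(-c for c in e)
    return e

def rotation_angle_deg(e):
    """Rotation angle encoded by a unit quaternion, in degrees."""
    return math.degrees(2.0 * math.acos(max(-1.0, min(1.0, e[0]))))

# A 350-degree rotation about z, relative to the identity attitude.
theta = math.radians(350.0)
q = (math.cos(theta / 2), 0.0, 0.0, math.sin(theta / 2))
identity = (1.0, 0.0, 0.0, 0.0)

naive_angle = rotation_angle_deg(quat_mul(quat_conj(identity), q))
short_angle = rotation_angle_deg(shortest_error(identity, q))
```

Without the sign check the error corresponds to a 350-degree rotation; with it, the same attitude error is expressed as a 10-degree rotation in the opposite direction.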
- [102] arXiv:2406.17537 (replaced) [Chinese pdf, pdf, html, other]
-
Title: SincVAE: A new semi-supervised approach to improve anomaly detection on EEG data using SincNet and variational autoencoder
Journal reference: Computer Methods and Programs in Biomedicine Update, 5:100213, 2025
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Over the past few decades, electroencephalography (EEG) monitoring has become a pivotal tool for diagnosing neurological disorders, particularly for detecting seizures. Epilepsy, one of the most prevalent neurological diseases worldwide, affects approximately 1% of the population. These patients face significant risks, underscoring the need for reliable, continuous seizure monitoring in daily life. Most of the techniques discussed in the literature rely on supervised Machine Learning (ML) methods. However, the challenge of accurately labeling variations in epileptic EEG waveforms complicates the use of these approaches. Additionally, the rarity of ictal events introduces a high imbalance within the data, which could lead to poor prediction performance in supervised learning approaches. Instead, a semi-supervised approach allows the model to be trained only on data not containing seizures, thus avoiding the issues related to data imbalance. This work proposes a semi-supervised approach for detecting epileptic seizures from EEG data, utilizing a novel Deep Learning-based method called SincVAE. This proposal incorporates the learning of an ad-hoc array of bandpass filters as the first layer of a Variational Autoencoder (VAE), potentially eliminating the preprocessing stage where informative frequency bands are identified and isolated. Results indicate that SincVAE improves seizure detection in EEG data and is capable of identifying early seizures during the preictal stage as well as monitoring patients throughout the postictal stage.
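In SincNet-style layers, each band-pass filter is fully described by two cutoff frequencies, and its taps are the difference of two windowed sinc low-pass responses. The sketch below builds one such filter with fixed cutoffs and checks its gain at a few frequencies. In a trained model the cutoffs would be learnable parameters; here the 8 to 13 Hz band, the 256 Hz sampling rate, and the tap count are hypothetical choices, not settings taken from the paper.

```python
import cmath
import math

def sinc(x):
    """Normalized sinc, sin(pi x) / (pi x), with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def sinc_bandpass(f_low, f_high, fs, num_taps):
    """Hamming-windowed sinc band-pass FIR taps; cutoffs and fs in Hz, num_taps odd."""
    assert num_taps % 2 == 1
    half = num_taps // 2
    f1, f2 = f_low / fs, f_high / fs          # cutoffs in cycles per sample
    taps = []
    for n in range(-half, half + 1):
        # Difference of two ideal low-pass responses gives an ideal band-pass.
        ideal = 2 * f2 * sinc(2 * f2 * n) - 2 * f1 * sinc(2 * f1 * n)
        window = 0.54 + 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    return taps

def gain(taps, freq, fs):
    """Magnitude of the filter's frequency response at freq (Hz)."""
    half = len(taps) // 2
    acc = sum(h * cmath.exp(-2j * math.pi * freq / fs * (i - half))
              for i, h in enumerate(taps))
    return abs(acc)

# Hypothetical alpha-band filter for EEG sampled at 256 Hz.
taps = sinc_bandpass(8.0, 13.0, 256.0, 513)
passband_gain = gain(taps, 10.5, 256.0)
low_stop_gain = gain(taps, 1.0, 256.0)
high_stop_gain = gain(taps, 30.0, 256.0)
```

Because only the two cutoffs define each filter, a learned filter bank of this kind stays interpretable: each filter can be read directly as a frequency band.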
- [103] arXiv:2408.07836 (replaced) [Chinese pdf, pdf, html, other]
-
Title: Learned Single-Pass Multitasking Perceptual Graphics for Immersive Displays
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Image and Video Processing (eess.IV)
Emerging immersive display technologies efficiently utilize resources with perceptual graphics methods such as foveated rendering and denoising. Running multiple perceptual graphics methods challenges devices with limited power and computational resources. We propose a computationally lightweight learned multitasking perceptual graphics model. Given RGB images and text prompts, our model performs text-described perceptual tasks in a single inference step. Simply daisy-chaining multiple models or training dedicated models can lead to model management issues and exhaust computational resources. In contrast, our flexible method unlocks consistent high-quality perceptual effects with reasonable compute, supporting various permutations at varied intensities using adjectives in text prompts (e.g., mildly, lightly). Text guidance provides ease of use for dynamic requirements such as creative processes. To train our model, we propose a dataset containing source and perceptually enhanced images with corresponding text prompts. We evaluate our model on desktop and embedded platforms and validate perceptual quality through a user study.
- [104] arXiv:2410.03751 (replaced) [Chinese pdf, pdf, html, other]
-
Title: Recent Advances in Speech Language Models: A Survey
Authors: Wenqian Cui, Dianzhi Yu, Xiaoqi Jiao, Ziqiao Meng, Guangyan Zhang, Qichao Wang, Yiwen Guo, Irwin King
Comments: A condensed version of this paper has been accepted to ACL 2025
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Large Language Models (LLMs) have recently garnered significant attention, primarily for their capabilities in text-based interactions. However, natural human interaction often relies on speech, necessitating a shift towards voice-based models. A straightforward approach to achieve this involves a pipeline of "Automatic Speech Recognition (ASR) + LLM + Text-to-Speech (TTS)", where input speech is transcribed to text, processed by an LLM, and then converted back to speech. Despite being straightforward, this method suffers from inherent limitations, such as information loss during modality conversion, significant latency due to the complex pipeline, and error accumulation across the three stages. To address these issues, Speech Language Models (SpeechLMs) -- end-to-end models that generate speech without converting from text -- have emerged as a promising alternative. This survey paper provides the first comprehensive overview of recent methodologies for constructing SpeechLMs, detailing the key components of their architecture and the various training recipes integral to their development. Additionally, we systematically survey the various capabilities of SpeechLMs, categorize their evaluation metrics, and discuss the challenges and future research directions in this rapidly evolving field. The GitHub repository is available at https://github.com/dreamtheater123/Awesome-SpeechLM-Survey
- [105] arXiv:2411.18148 (replaced) [Chinese pdf, pdf, html, other]
-
Title: A Runtime-Adaptive Transformer Neural Network Accelerator on FPGAs
Comments: arXiv admin note: text overlap with arXiv:2409.14023
Subjects: Hardware Architecture (cs.AR); Machine Learning (cs.LG); Systems and Control (eess.SY)
Transformer neural networks (TNN) excel in natural language processing (NLP), machine translation, and computer vision (CV) without relying on recurrent or convolutional layers. However, they have high computational and memory demands, particularly on resource-constrained devices like FPGAs. Moreover, transformer models vary in processing time across applications, requiring custom models with specific parameters. Designing custom accelerators for each model is complex and time-intensive. Some custom accelerators exist with no runtime adaptability, and they often rely on sparse matrices to reduce latency. However, hardware designs become more challenging due to the need for application-specific sparsity patterns. This paper introduces ADAPTOR, a runtime-adaptive accelerator for dense matrix computations in transformer encoders and decoders on FPGAs. ADAPTOR enhances the utilization of processing elements and on-chip memory, enhancing parallelism and reducing latency. It incorporates efficient matrix tiling to distribute resources across FPGA platforms and is fully quantized for computational efficiency and portability. Evaluations on Xilinx Alveo U55C data center cards and embedded platforms like VC707 and ZCU102 show that our design is 1.2$\times$ and 2.87$\times$ more power efficient than the NVIDIA K80 GPU and the i7-8700K CPU respectively. Additionally, it achieves a speedup of 1.7 to 2.25$\times$ compared to some state-of-the-art FPGA-based accelerators.
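Matrix tiling, mentioned above as the way work is distributed across hardware resources, splits a large matrix product into small blocks that each fit in fast local memory. The software sketch below shows the loop structure only; it is a generic illustration in plain Python, not ADAPTOR's hardware design, and the tile size and function names are hypothetical.

```python
def tiled_matmul(a, b, tile):
    """Multiply a (n x k) by b (k x m) one tile-sized block at a time."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):              # block row of the output
        for j0 in range(0, m, tile):          # block column of the output
            for k0 in range(0, k, tile):      # block along the shared dimension
                # On an accelerator, this inner block would run on the
                # processing elements out of on-chip buffers.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c

def naive_matmul(a, b):
    """Reference product used to check the tiled version."""
    return [[sum(a[i][kk] * b[kk][j] for kk in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# Small integer matrices whose sizes are not multiples of the tile size.
a = [[i + 2 * j for j in range(5)] for i in range(3)]
b = [[i - j for j in range(4)] for i in range(5)]
c_tiled = tiled_matmul(a, b, tile=2)
```

The `min(...)` bounds handle edge tiles, so the matrix dimensions need not be multiples of the tile size; changing `tile` changes the schedule but not the result.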
- [106] arXiv:2412.00082 (替换) [中文pdf, pdf, html, 其他]
-
标题: PL-DCP:一种成对学习框架,具有领域和类别原型,用于未见目标条件下的EEG情绪识别
标题: PL-DCP: A Pairwise Learning framework with Domain and Class Prototypes for EEG emotion recognition under unseen target conditions
主题: 机器学习 (cs.LG) ; 人工智能 (cs.AI) ; 人机交互 (cs.HC) ; 信号处理 (eess.SP)
脑电图(EEG)信号在情感脑机接口(aBCIs)中是一种强大的工具,并在情感计算中起着关键作用。近年来,深度学习技术的引入显著推动了aBCIs的发展。然而,基于深度迁移学习的当前情感识别方法面临模型对源域和目标域的双重依赖问题,并且受到标签噪声的影响,这严重影响了模型的性能和泛化能力。为了克服这些限制,我们提出了一种在未见目标条件下用于EEG情感识别的成对学习框架,结合领域和类别原型(PL-DCP),并整合了特征解耦和原型推理的概念。在这里,特征解耦模块提取并解耦情感EEG特征以形成领域特征和类别特征,并进一步计算双原型表示。领域原型捕捉个体之间的变化,而类别原型捕捉情感类别的跨个体共性。此外,成对学习策略有效减少了由错误标签引起的噪声影响。PL-DCP框架在公开数据集SEED、SEED-IV和SEED-V上进行了系统实验评估,准确率分别为82.88%、65.15%和61.29%。结果表明,尽管目标域在训练期间完全未见,PL-DCP模型仍然比需要源数据和目标数据的其他最先进的(SOTA)深度迁移学习方法表现出稍好的性能。这项工作为情感识别提供了一种有效且稳健的潜在解决方案。源代码可在https://github.com/WuCB-BCI/PL_DCP获取。
Electroencephalogram (EEG) signals serve as a powerful tool in affective Brain-Computer Interfaces (aBCIs) and play a crucial role in affective computing. In recent years, the introduction of deep learning techniques has significantly advanced the development of aBCIs. However, current emotion recognition methods based on deep transfer learning depend on both the source domain and the target domain and are affected by label noise, which seriously degrades the performance and generalization ability of the model. To overcome these limitations, we propose a Pairwise Learning framework with Domain and Class Prototypes for EEG emotion recognition under unseen target conditions (PL-DCP), integrating the concepts of feature disentanglement and prototype inference. Here, the feature disentanglement module extracts and decouples the emotional EEG features to form domain features and class features, and further calculates the dual prototype representation. The domain prototype captures the individual variations across subjects, while the class prototype captures the cross-individual commonality of emotion categories. In addition, the pairwise learning strategy effectively reduces the effect of noise caused by wrong labels. The PL-DCP framework is systematically evaluated on the public datasets SEED, SEED-IV, and SEED-V, achieving accuracies of 82.88\%, 65.15\%, and 61.29\%, respectively. The results show that, although the target domain is completely unseen during training, the PL-DCP model still achieves slightly better performance than other state-of-the-art (SOTA) deep transfer learning methods that require both source and target data. This work provides an effective and robust potential solution for emotion recognition. The source code is available at https://github.com/WuCB-BCI/PL_DCP.
- [107] arXiv:2502.10154 (替换) [中文pdf, pdf, html, 其他]
-
标题: 视频配乐生成通过情感和时间边界对齐
标题: Video Soundtrack Generation by Aligning Emotions and Temporal Boundaries
主题: 声音 (cs.SD) ; 人工智能 (cs.AI) ; 机器学习 (cs.LG) ; 多媒体 (cs.MM) ; 音频与语音处理 (eess.AS) ; 图像与视频处理 (eess.IV)
我们引入了EMSYNC,这是一个基于视频的符号音乐生成模型,能够将音乐与视频的情感内容和时间边界对齐。 它采用了一个两阶段框架,其中预训练的视频情感分类器提取情感特征,而条件音乐生成器则在情感和时间提示的引导下生成MIDI序列。 我们引入了边界偏移量,这是一种新颖的时间条件机制,使模型能够预测并使音乐和弦与场景切换对齐。 与现有模型不同,我们的方法保留了基于事件的编码,确保细粒度的时间控制和富有表现力的音乐细节。 我们还提出了一种映射方案,以连接产生离散情感类别的视频情感分类器与在连续值效价-唤醒输入上运行的情感条件MIDI生成器。 在主观听觉测试中,EMSYNC在所有主观指标上都优于最先进的模型,无论是对音乐理论敏感的参与者还是普通听众。
We introduce EMSYNC, a video-based symbolic music generation model that aligns music with a video's emotional content and temporal boundaries. It follows a two-stage framework, where a pretrained video emotion classifier extracts emotional features, and a conditional music generator produces MIDI sequences guided by both emotional and temporal cues. We introduce boundary offsets, a novel temporal conditioning mechanism that enables the model to anticipate and align musical chords with scene cuts. Unlike existing models, our approach retains event-based encoding, ensuring fine-grained timing control and expressive musical nuances. We also propose a mapping scheme to bridge the video emotion classifier, which produces discrete emotion categories, with the emotion-conditioned MIDI generator, which operates on continuous-valued valence-arousal inputs. In subjective listening tests, EMSYNC outperforms state-of-the-art models across all subjective metrics, for music theory-aware participants as well as the general listeners.
- [108] arXiv:2502.10452 (替换) [中文pdf, pdf, 其他]
-
标题: 四元数-哈达玛网络:一种新的对抗攻击防御方法及一个新的数据集
标题: Quaternion-Hadamard Network: A Novel Defense Against Adversarial Attacks with a New Dataset
主题: 机器学习 (cs.LG) ; 图像与视频处理 (eess.IV)
本文解决了针对雨、雪和雾霾去除设计的深度学习模型的脆弱性问题。 尽管这些模型可以提高恶劣天气下的图像质量,但它们容易受到破坏其有效性的对抗性攻击。 传统的防御方法,如对抗训练和模型蒸馏,通常需要大量的重新训练,这使得它们在实际部署中成本高且不切实际。 虽然去噪和超分辨率技术可以帮助图像分类模型,但它们会带来高的计算需求,并引入阻碍图像处理任务的视觉伪影。 我们提出了一种模型无关的防御方法,用于应对一阶白盒对抗性攻击,使用四元数-哈达玛网络(QHNet)来解决这些挑战。 白盒攻击尤其难以防御,因为攻击者可以完全访问模型的架构、权重和训练过程。 我们的防御方法引入了四元数哈达玛去噪卷积块(QHDCB)和四元数去噪残差块(QDRB),利用多项式阈值。 QHNet在编码器-解码器架构中结合了这些块,并通过特征细化进行增强,以有效中和对抗性噪声。 此外,我们引入了对抗性天气条件视觉数据集(AWCVD),该数据集是通过对涉及雾霾、雨痕和雪的最先进的天气去除技术应用一阶梯度攻击而创建的。 使用PSNR和SSIM指标,我们证明与最先进的去噪和超分辨率技术相比,QHNet显著提高了低级计算机视觉模型对对抗性攻击的鲁棒性。 源代码和数据集将在本文的最终版本中一同发布。
This paper addresses the vulnerability of deep-learning models designed for rain, snow, and haze removal. Despite enhancing image quality in adverse weather, these models are susceptible to adversarial attacks that compromise their effectiveness. Traditional defenses such as adversarial training and model distillation often require extensive retraining, making them costly and impractical for real-world deployment. While denoising and super-resolution techniques can aid image classification models, they impose high computational demands and introduce visual artifacts that hinder image processing tasks. We propose a model-agnostic defense against first-order white-box adversarial attacks using the Quaternion-Hadamard Network (QHNet) to tackle these challenges. White-box attacks are particularly difficult to defend against since attackers have full access to the model's architecture, weights, and training procedures. Our defense introduces the Quaternion Hadamard Denoising Convolutional Block (QHDCB) and the Quaternion Denoising Residual Block (QDRB), leveraging polynomial thresholding. QHNet incorporates these blocks within an encoder-decoder architecture, enhanced by feature refinement, to effectively neutralize adversarial noise. Additionally, we introduce the Adversarial Weather Conditions Vision Dataset (AWCVD), created by applying first-order gradient attacks on state-of-the-art weather removal techniques in scenarios involving haze, rain streaks, and snow. Using PSNR and SSIM metrics, we demonstrate that QHNet significantly enhances the robustness of low-level computer vision models against adversarial attacks compared with state-of-the-art denoising and super-resolution techniques. The source code and dataset will be released alongside the final version of this paper.
- [109] arXiv:2503.07940 (替换) [中文pdf, pdf, html, 其他]
-
标题: BUFFER-X:面向多样场景的零样本点云配准
标题: BUFFER-X: Towards Zero-Shot Point Cloud Registration in Diverse Scenes
评论: 20页,14张图。被接受为ICCV 2025的亮点论文
主题: 计算机视觉与模式识别 (cs.CV) ; 机器人技术 (cs.RO) ; 图像与视频处理 (eess.IV)
近年来,基于深度学习的点云配准方法提高了泛化能力,但大多数方法在每个新环境中仍需要重新训练或手动调整参数。 在本文中,我们确定了限制泛化的三个关键因素:(a) 依赖于环境特定的体素大小和搜索半径,(b) 基于学习的关键点检测器的域外鲁棒性差,以及 (c) 使用原始坐标,这加剧了尺度差异。 为了解决这些问题,我们提出了一种零样本配准流程 BUFFER-X,通过 (a) 自适应确定体素大小/搜索半径,(b) 使用最远点采样来绕过学习到的检测器,以及 (c) 利用逐块尺度归一化以获得一致的坐标范围。 特别是,我们提出了一种基于多尺度块的描述符生成和跨尺度的层次内联搜索,以提高在不同场景中的鲁棒性。 我们还提出了一种新的泛化性基准,使用 11 个数据集,涵盖各种室内/室外场景和传感器模态,证明了 BUFFER-X 在无需先验信息或手动参数调整的情况下实现了显著的泛化能力。 我们的代码可在 https://github.com/MIT-SPARK/BUFFER-X 获取。
Recent advances in deep learning-based point cloud registration have improved generalization, yet most methods still require retraining or manual parameter tuning for each new environment. In this paper, we identify three key factors limiting generalization: (a) reliance on environment-specific voxel size and search radius, (b) poor out-of-domain robustness of learning-based keypoint detectors, and (c) raw coordinate usage, which exacerbates scale discrepancies. To address these issues, we present a zero-shot registration pipeline called BUFFER-X by (a) adaptively determining voxel size/search radii, (b) using farthest point sampling to bypass learned detectors, and (c) leveraging patch-wise scale normalization for consistent coordinate bounds. In particular, we present a multi-scale patch-based descriptor generation and a hierarchical inlier search across scales to improve robustness in diverse scenes. We also propose a novel generalizability benchmark using 11 datasets that cover various indoor/outdoor scenarios and sensor modalities, demonstrating that BUFFER-X achieves substantial generalization without prior information or manual parameter tuning for the test datasets. Our code is available at https://github.com/MIT-SPARK/BUFFER-X.
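Farthest point sampling, which the abstract uses in place of a learned keypoint detector, is a standard greedy procedure and can be sketched briefly. The function name, the `start` index, and the toy point set below are assumptions for illustration and are not taken from the BUFFER-X code base.

```python
import math


def farthest_point_sampling(points, num_samples, start=0):
    """Greedily pick `num_samples` indices so each new point is the one
    farthest from the set already selected (Euclidean distance)."""
    if not 0 < num_samples <= len(points):
        raise ValueError("num_samples must be between 1 and len(points)")
    selected = [start]
    # min_dist[i] = distance from point i to its nearest selected point
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < num_samples:
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


if __name__ == "__main__":
    cloud = [(0, 0, 0), (1, 0, 0), (10, 0, 0), (0, 5, 0)]
    print(farthest_point_sampling(cloud, 3))
```

Because the procedure depends only on geometry, it needs no training and has no learned parameters that could overfit to one sensor or environment.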
- [110] arXiv:2504.03038 (替换) [中文pdf, pdf, html, 其他]
-
标题: 如何适应控制障碍函数? 一种基于学习的方法及其在垂直起降四旋翼飞机中的应用
标题: How to Adapt Control Barrier Functions? A Learning-Based Approach with Applications to a VTOL Quadplane
评论: 2025年IEEE决策与控制会议(CDC)。项目页面:https://www.taekyung.me/how-to-adapt-cbf
主题: 机器人技术 (cs.RO) ; 系统与控制 (eess.SY)
在本文中,我们提出了一种新的理论框架,用于在输入约束下对控制屏障函数(CBF)参数进行在线适应,即对包含在CBF条件中的K类函数进行在线适应。 我们引入了局部验证的CBF参数概念,这些参数在线适应以保证有限时间范围内的安全性,其依据是来自Nagumo定理和切锥分析的条件。 为了在线识别这些参数,我们将基于学习的方法与一种考虑神经网络预测中固有认知不确定性和随机不确定性的不确定性感知验证过程相结合。 我们的方法在VTOL四旋翼飞机模型上进行了演示,在具有挑战性的过渡和着陆操作中展示了增强的性能同时保持了安全性。
In this paper, we present a novel theoretical framework for online adaptation of Control Barrier Function (CBF) parameters, i.e., of the class K functions included in the CBF condition, under input constraints. We introduce the concept of locally validated CBF parameters, which are adapted online to guarantee finite-horizon safety, based on conditions derived from Nagumo's theorem and tangent cone analysis. To identify these parameters online, we integrate a learning-based approach with an uncertainty-aware verification process that accounts for both epistemic and aleatoric uncertainties inherent in neural network predictions. Our method is demonstrated on a VTOL quadplane model during challenging transition and landing maneuvers, showcasing enhanced performance while maintaining safety.
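To make the role of the class K function concrete, the following toy sketch uses a single integrator with dynamics dx/dt = u, a barrier h(x) = x_max - x, and a linear class K function alpha * h. The CBF condition dh/dt >= -alpha * h then reduces to u <= alpha * h. Everything here (the system, the names `cbf_safe_input` and `simulate`, and the numeric values) is an assumption for illustration and is unrelated to the paper's quadplane model or its learning-based adaptation.

```python
def cbf_safe_input(x, u_nom, alpha, x_max=1.0, u_max=2.0):
    """Closed-form CBF filter for dx/dt = u with h(x) = x_max - x.

    The CBF condition -u >= -alpha * h gives the bound u <= alpha * h.
    The result is then clipped to the input limits [-u_max, u_max].
    """
    h = x_max - x
    u = min(u_nom, alpha * h)
    return max(-u_max, min(u_max, u))


def simulate(alpha, x0=0.0, u_nom=2.0, dt=0.01, steps=1000):
    """Forward-Euler rollout of the filtered system (toy example)."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] + dt * cbf_safe_input(xs[-1], u_nom, alpha))
    return xs


if __name__ == "__main__":
    for alpha in (1.0, 5.0):
        xs = simulate(alpha)
        print(alpha, round(xs[100], 3), round(max(xs), 3))
```

A larger alpha lets the state approach the boundary faster (less conservative), while a smaller alpha brakes earlier. The final clipping step also shows why input constraints matter: if the state were already far outside the safe set, the bound alpha * h could fall below -u_max and the clipped input would no longer satisfy the CBF condition, so the parameter cannot be chosen independently of the input limits.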
- [111] arXiv:2504.12169 (替换) [中文pdf, pdf, html, 其他]
-
标题: 面向通用的零样本合成低光照图像和视频流水线
标题: Towards a General-Purpose Zero-Shot Synthetic Low-Light Image and Video Pipeline
主题: 计算机视觉与模式识别 (cs.CV) ; 图像与视频处理 (eess.IV)
低光条件对人类和机器注释都构成了重大挑战。 这进而导致了对低光图像和(特别是)视频的机器理解研究不足。 一种常见方法是将从高质量数据集获得的注释应用于合成创建的低光版本。 此外,这些方法通常由于使用不现实的噪声模型而受到限制。 在本文中,我们提出了一种新的退化估计网络(DEN),它可以在不需要相机元数据的情况下合成生成真实的标准RGB(sRGB)噪声。 这是通过估计物理信息噪声分布的参数来实现的,这些参数以自监督的方式进行训练。 这种零样本方法使我们的方法能够生成具有多种真实噪声特性的合成噪声内容,与其他专注于重现训练数据噪声特性的方法不同。 我们使用各种在合成数据上训练的方法来评估我们提出的合成管道,用于典型的低光任务,包括合成噪声复制、视频增强和目标检测,分别显示出高达24% KLD、21% LPIPS和62% AP$_{50-95}$的改进。
Low-light conditions pose significant challenges for both human and machine annotation. This in turn has led to a lack of research into machine understanding for low-light images and (in particular) videos. A common approach is to apply annotations obtained from high quality datasets to synthetically created low light versions. In addition, these approaches are often limited through the use of unrealistic noise models. In this paper, we propose a new Degradation Estimation Network (DEN), which synthetically generates realistic standard RGB (sRGB) noise without the requirement for camera metadata. This is achieved by estimating the parameters of physics-informed noise distributions, trained in a self-supervised manner. This zero-shot approach allows our method to generate synthetic noisy content with a diverse range of realistic noise characteristics, unlike other methods which focus on recreating the noise characteristics of the training data. We evaluate our proposed synthetic pipeline using various methods trained on its synthetic data for typical low-light tasks including synthetic noise replication, video enhancement, and object detection, showing improvements of up to 24\% KLD, 21\% LPIPS, and 62\% AP$_{50-95}$, respectively.
- [112] arXiv:2504.15774 (替换) [中文pdf, pdf, html, 其他]
-
标题: Wi-Fi网络中非主信道接入的建模与性能分析
标题: Modelling and Performance Analysis of Non-Primary Channel Access in Wi-Fi Networks
主题: 网络与互联网架构 (cs.NI) ; 信息论 (cs.IT) ; 信号处理 (eess.SP)
本文旨在提高我们对非主信道访问(NPCA)机制性能的理解,这是一种在IEEE 802.11bn中引入的新功能,旨在提高Wi-Fi网络中的频谱利用率。NPCA使设备能够在主信道被重叠基本服务集(OBSS)的传输占用时,争夺并使用次信道进行传输。我们开发了一个连续时间马尔可夫链(CTMC)模型,在启用NPCA时,该模型捕捉了密集无线局域网(WLAN)环境中OBSS之间的交互,包括新的NPCA特定状态和转移。除了模型提供的分析见解外,我们还进行了数值评估和仿真,以量化NPCA在各种场景下对吞吐量和信道接入延迟的影响。我们的结果表明,在支持该机制的BSS有利条件下,NPCA可以显著提高吞吐量并减少接入延迟。此外,NPCA有助于缓解OBSS性能异常问题,其中低速率OBSS传输会降低所有附近设备的网络性能。然而,我们也观察到权衡:NPCA可能会增加次信道上的竞争,从而可能减少在这些信道上运行的BSS的传输机会。总体而言,所提出的建模方法为分析、优化和指导下一代Wi-Fi网络中NPCA的发展提供了基础。
This paper aims to improve our understanding of the performance of the Non-Primary Channel Access (NPCA) mechanism, a new feature introduced in IEEE 802.11bn to enhance spectrum utilization in Wi-Fi networks. NPCA enables devices to contend for and transmit on the secondary channel when the primary channel is occupied by transmissions from an Overlapping Basic Service Set (OBSS). We develop a Continuous-Time Markov Chain (CTMC) model that captures the interactions among OBSSs in dense Wireless Local Area Network (WLAN) environments when NPCA is enabled, incorporating new NPCA-specific states and transitions. In addition to the analytical insights offered by the model, we conduct numerical evaluations and simulations to quantify NPCA's impact on throughput and channel access delay across various scenarios. Our results show that NPCA can significantly improve throughput and reduce access delays in favorable conditions for BSSs that support the mechanism. Moreover, NPCA helps mitigate the OBSS performance anomaly, where low-rate OBSS transmissions degrade network performance for all nearby devices. However, we also observe trade-offs: NPCA may increase contention on secondary channels, potentially reducing transmission opportunities for BSSs operating there. Overall, the proposed modeling approach offers a foundation for analyzing, optimizing, and guiding the development of NPCA in next-generation Wi-Fi networks.
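As background for the CTMC analysis, the sketch below computes the stationary distribution of a small continuous-time Markov chain by uniformization followed by power iteration. The two-state generator in the usage example and its rates are purely hypothetical and are not the NPCA model from the paper; the function name and the 1.1 uniformization factor are likewise assumptions.

```python
def stationary_distribution(q, iters=10_000):
    """Stationary distribution of an irreducible CTMC with generator `q`.

    Uniformization turns the generator into a stochastic matrix
    P = I + Q / lam with lam > max_i |q_ii|, which has the same stationary
    distribution; power iteration on P then converges to it.
    """
    n = len(q)
    lam = 1.1 * max(-q[i][i] for i in range(n))
    p = [[(1.0 if i == j else 0.0) + q[i][j] / lam for j in range(n)]
         for i in range(n)]
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * p[i][j] for i in range(n)) for j in range(n)]
    return pi


if __name__ == "__main__":
    # Hypothetical two-state chain: state 0 = channel idle, state 1 = busy.
    # Rates (per unit time) are made up for illustration.
    q = [[-1.0, 1.0],
         [3.0, -3.0]]
    print(stationary_distribution(q))
```

Once the stationary probabilities are known, long-run metrics such as throughput are typically obtained as weighted sums over the states in which a successful transmission is in progress.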
- [113] arXiv:2504.17103 (替换) [中文pdf, pdf, html, 其他]
-
标题: 基于子框架的多机器人网络方位刚性保持控制
标题: Subframework-based Bearing Rigidity Maintenance Control in Multirobot Networks
评论: 6页
期刊参考: IEEE 控制系统快报,第9卷,第1249-1254页,2025
主题: 机器人技术 (cs.RO) ; 系统与控制 (eess.SY)
这项工作提出了一种新的方法,用于在具有感知约束和动态拓扑的多机器人网络中进行\textit{方位刚性}分析和控制。 通过将系统的框架分解为\textit{子框架},我们将方位刚性——一种全局属性——表示为一组\textit{局部}属性,其中刚性特征值作为自然的\textit{局部刚性度量}。 我们提出了一种去中心化的基于梯度的控制器,仅使用方位测量来执行特定任务命令。 该控制器通过保持刚性特征值高于阈值来保持方位刚性,仅使用子框架内交换的信息。 仿真评估了该方案的有效性,突显了其可扩展性和实用性。
This work presents a novel approach for \textit{bearing rigidity} analysis and control in multi-robot networks with sensing constraints and dynamic topology. By decomposing the system's framework into \textit{subframeworks}, we express bearing rigidity -- a global property -- as a set of \textit{local} properties, with rigidity eigenvalues serving as natural \textit{local rigidity measures}. We propose a decentralized gradient-based controller to execute mission-specific commands using only bearing measurements. The controller preserves bearing rigidity by keeping the rigidity eigenvalues above a threshold, using only information exchanged within subframeworks. Simulations evaluate the scheme's effectiveness, underscoring its scalability and practicality.
- [114] arXiv:2506.11105 (替换) [中文pdf, pdf, html, 其他]
-
标题: 通过输入驱动的显著性适应实现设备端医疗AI助手
标题: Enabling On-Device Medical AI Assistants via Input-Driven Saliency Adaptation
评论: 已接受发表于IEEE BioCAS 2025会议论文集
主题: 计算与语言 (cs.CL) ; 人工智能 (cs.AI) ; 硬件架构 (cs.AR) ; 系统与控制 (eess.SY)
大型语言模型(LLMs)对医疗场景有重大影响,但在实时、资源受限的环境(如边缘设备)中部署仍显得过于庞大。 在本工作中,我们引入了一个通过我们的通用压缩框架优化的新型医疗助手系统,该系统将大型语言模型(LLMs)适配于特定领域部署。 通过在特定领域数据上测量神经元显著性,我们的方法可以大幅剪枝无关神经元,减少模型大小同时保持性能。 剪枝后,我们应用训练后量化以进一步减少内存占用,并在包括MedMCQA、MedQA和PubMedQA在内的医学基准上评估压缩后的模型。 我们还将压缩了50%的Gemma模型和压缩了67%的LLaMA3模型部署在Jetson Orin Nano(18.7W峰值)和Raspberry Pi 5(6.3W峰值)上,在硬件限制下实现了实时、节能的推理。
Large Language Models (LLMs) have a significant impact on healthcare scenarios but remain prohibitively large for deployment in real-time, resource-constrained environments such as edge devices. In this work, we introduce a novel medical assistant system, optimized through our general-purpose compression framework, which tailors LLMs for deployment in specialized domains. By measuring neuron saliency on domain-specific data, our method can aggressively prune irrelevant neurons, reducing model size while preserving performance. Following pruning, we apply post-training quantization to further reduce the memory footprint, and evaluate the compressed model across medical benchmarks including MedMCQA, MedQA, and PubMedQA. We also deploy the 50\% compressed Gemma and the 67\% compressed LLaMA3 models on Jetson Orin Nano (18.7W peak) and Raspberry Pi 5 (6.3W peak), achieving real-time, energy-efficient inference under hardware constraints.
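The prune-then-quantize pipeline described above can be sketched on toy data. The abstract does not define the saliency score, so the mean absolute activation used here is an assumption, as are the function names and the symmetric int8 scheme; the sketch only illustrates the general shape of such a pipeline.

```python
def neuron_saliency(activations):
    """Mean absolute activation per neuron over domain-specific samples.

    `activations` is a list of samples, each a list of per-neuron values.
    The choice of score is an assumption made for this sketch.
    """
    n = len(activations[0])
    return [sum(abs(s[j]) for s in activations) / len(activations)
            for j in range(n)]


def keep_indices(saliency, prune_ratio):
    """Indices of the neurons that survive pruning, in original order."""
    n_keep = max(1, round(len(saliency) * (1 - prune_ratio)))
    ranked = sorted(range(len(saliency)), key=lambda j: saliency[j],
                    reverse=True)
    return sorted(ranked[:n_keep])


def quantize_int8(weights):
    """Symmetric post-training quantization to the range [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


if __name__ == "__main__":
    acts = [[0.1, 2.0, -0.5, 0.0], [-0.1, 1.0, 0.5, 0.0]]
    kept = keep_indices(neuron_saliency(acts), prune_ratio=0.5)
    print(kept)
    print(quantize_int8([0.5, -1.0, 0.25]))
```

Pruning removes whole neurons (and therefore rows or columns of the weight matrices), while quantization shrinks the storage of each remaining weight, so the two reductions multiply.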
- [115] arXiv:2507.12479 (替换) [中文pdf, pdf, html, 其他]
-
标题: 数据驱动的磁流体动力学流动控制
标题: Data-driven control of a magnetohydrodynamic flow
评论: 21页,7图;姓名更改,增加参考文献,润色语言,扩展附录
主题: 流体动力学 (physics.flu-dyn) ; 系统与控制 (eess.SY)
我们通过外部施加的电场和磁场产生的洛伦兹力,展示了对弱导电磁流体动力学(MHD)流的反馈控制。 具体而言,我们使用布置在流体储层周围和下方的电极阵列和电磁铁,引导电解质的流动以达到预定的速度或涡度模式,反馈由平面粒子图像测速技术(PIV)提供。 控制是通过模型预测控制(MPC)框架实现的,在该框架中,通过在预测的流动演化上最小化成本函数来计算控制信号。 预测器完全基于数据构建,使用Koopman算子理论,这使得底层非线性流体动力学能够被线性表示。 这种线性允许通过在两个小型且可高效求解的凸二次规划(QPs)之间交替来解决MPC问题:一个用于电极,另一个用于电磁铁。 所得控制器在一台标准笔记本电脑上闭环运行,实现了对流动的实时控制。 我们通过实验展示了该方法的功能,其中流动被塑造成与一系列参考速度场和一个随时间变化的涡度场相匹配。
We demonstrate the feedback control of a weakly conducting magnetohydrodynamic (MHD) flow via Lorentz forces generated by externally applied electric and magnetic fields. Specifically, we steer the flow of an electrolyte toward prescribed velocity or vorticity patterns using arrays of electrodes and electromagnets positioned around and beneath a fluid reservoir, with feedback provided by planar particle image velocimetry (PIV). Control is implemented using a model predictive control (MPC) framework, in which control signals are computed by minimizing a cost function over the predicted evolution of the flow. The predictor is constructed entirely from data using Koopman operator theory, which enables a linear representation of the underlying nonlinear fluid dynamics. This linearity allows the MPC problem to be solved by alternating between two small and efficiently solvable convex quadratic programs (QPs): one for the electrodes and one for the electromagnets. The resulting controller runs in a closed loop on a standard laptop, enabling real-time control of the flow. We demonstrate the functionality of the approach through experiments in which the flow is shaped to match a range of reference velocity fields and a time-varying vorticity field.
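The idea of building a linear predictor purely from data can be shown in its simplest form. In a Koopman-based approach the state would first be lifted to a vector of observables; in the sketch below a single scalar observable `z` stands in for that lifted state, and the predictor z[k+1] = a * z[k] + b * u[k] is fitted by least squares. The function name and the synthetic trajectory are assumptions for illustration and do not reproduce the paper's predictor.

```python
def fit_linear_predictor(z, u):
    """Least-squares fit of z[k+1] = a * z[k] + b * u[k] from one trajectory.

    `z` holds N + 1 observable samples and `u` the N inputs applied between
    them. The 2x2 normal equations are solved with Cramer's rule.
    """
    z_now, z_next = z[:-1], z[1:]
    szz = sum(p * p for p in z_now)
    szu = sum(p * q for p, q in zip(z_now, u))
    suu = sum(q * q for q in u)
    szn = sum(p * n for p, n in zip(z_now, z_next))
    sun = sum(q * n for q, n in zip(u, z_next))
    det = szz * suu - szu * szu
    if abs(det) < 1e-12:
        raise ValueError("data are not rich enough to identify a and b")
    a = (szn * suu - szu * sun) / det
    b = (szz * sun - szu * szn) / det
    return a, b


if __name__ == "__main__":
    # Synthetic data from a known linear system, used only to check the fit.
    inputs = [1.0, 0.0, -1.0, 2.0, 0.5]
    traj = [1.0]
    for uk in inputs:
        traj.append(0.9 * traj[-1] + 0.5 * uk)
    print(fit_linear_predictor(traj, inputs))
```

With a linear predictor of this form, a quadratic tracking cost over a finite horizon becomes a convex quadratic program in the inputs, which is the property the abstract relies on for real-time MPC.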
- [116] arXiv:2507.21394 (替换) [中文pdf, pdf, html, 其他]
-
标题: 基于脉动阵列的结构化状态空间模型加速器
标题: Systolic Array-based Accelerator for Structured State-Space Models
主题: 机器学习 (cs.LG) ; 系统与控制 (eess.SY)
序列建模对于人工智能理解时间数据和检测复杂的时间依赖模式至关重要。 虽然循环神经网络(RNNs)、卷积神经网络(CNNs)和Transformer在捕捉长距离依赖关系方面取得了进展,但由于有限的记忆保持能力(固定上下文窗口),它们在处理非常长的序列时难以实现高精度。 状态空间模型(SSMs)利用指数衰减的记忆,使得可以拥有很长的上下文窗口,因此它们比循环和基于Transformer的模型更高效地处理非常长的数据序列。 与传统的神经模型如CNNs和RNNs不同,基于SSM的模型需要通过连续积分来求解微分方程,这使得在传统CPU和GPU上进行训练和推理既计算密集又内存密集。 在本文中,我们介绍了一种专用的硬件加速器EpochCore,用于加速SSMs。 EpochCore基于脉动阵列(SAs),旨在提高基于SSM的模型在长距离序列任务中的能效和吞吐量。 在脉动阵列中,我们提出了一种称为LIMA-PE的多功能处理单元(PE),用于执行传统和专门的MAC操作,以支持传统DNNs和SSMs。 为了补充EpochCore的微架构,我们提出了一种新的数据流ProDF,它能够高效执行基于SSM的模型。 通过利用LIMA-PE微架构和ProDF,EpochCore在LRA数据集上的性能平均比GPU提高了2000倍,并且相比传统的基于SA的加速器(TPU)在性能上提高了250倍,在能效方面提高了45倍。
Sequence modeling is crucial for AI to understand temporal data and detect complex time-dependent patterns. While recurrent neural networks (RNNs), convolutional neural networks (CNNs), and Transformers have advanced in capturing long-range dependencies, they struggle to achieve high accuracy on very long sequences due to limited memory retention (fixed context window). State-Space Models (SSMs) leverage exponentially decaying memory, which enables long context windows, and so they process very long data sequences more efficiently than recurrent and Transformer-based models. Unlike traditional neural models like CNNs and RNNs, SSM-based models require solving differential equations through continuous integration, making training and inference both compute- and memory-intensive on conventional CPUs and GPUs. In this paper we introduce a specialized hardware accelerator, EpochCore, for accelerating SSMs. EpochCore is based on systolic arrays (SAs) and is designed to enhance the energy efficiency and throughput of inference of SSM-based models for long-range sequence tasks. Within the SA, we propose a versatile processing element (PE) called LIMA-PE to perform traditional and specialized MAC operations to support traditional DNNs and SSMs. To complement the EpochCore microarchitecture, we propose a novel dataflow, ProDF, which enables highly efficient execution of SSM-based models. By leveraging the LIMA-PE microarchitecture and ProDF, EpochCore achieves on average a 2000x improvement in performance on LRA datasets compared to a GPU, and 250x gains in performance and a 45x improvement in energy efficiency over traditional SA-based accelerators (TPU).
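The "exponentially decaying memory" of a state-space model can be seen in a scalar example. The sketch below discretizes the continuous system dx/dt = a * x + b * u, y = c * x with a zero-order hold and runs the resulting recurrence, which is a chain of multiply-accumulate steps. The scalar form, the parameter values, and the function names are assumptions for illustration; practical SSM layers use structured matrices, and the sketch says nothing about EpochCore's LIMA-PE or ProDF.

```python
import math


def discretize_zoh(a, b, dt):
    """Zero-order-hold discretization of dx/dt = a*x + b*u (scalar, a != 0)."""
    a_d = math.exp(a * dt)
    b_d = (a_d - 1.0) / a * b
    return a_d, b_d


def ssm_scan(a, b, c, dt, inputs):
    """Run x[k+1] = a_d * x[k] + b_d * u[k] and emit y[k] = c * x[k+1]."""
    a_d, b_d = discretize_zoh(a, b, dt)
    x, outputs = 0.0, []
    for u in inputs:
        x = a_d * x + b_d * u  # one multiply-accumulate pair per step
        outputs.append(c * x)
    return outputs


if __name__ == "__main__":
    # Impulse input: the response decays by exp(a * dt) at every step.
    ys = ssm_scan(a=-1.0, b=1.0, c=1.0, dt=0.1, inputs=[1.0, 0.0, 0.0, 0.0])
    print([round(y, 4) for y in ys])
```

Each output depends on all earlier inputs through a geometrically decaying weight, so the state carries an unbounded but fading context using constant memory per step.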
- [117] arXiv:2507.21886 (替换) [中文pdf, pdf, html, 其他]
-
标题: 通过呼吸信号的高效疼痛识别:一种单交叉注意力变换器多窗口融合管道
标题: Efficient Pain Recognition via Respiration Signals: A Single Cross-Attention Transformer Multi-Window Fusion Pipeline
评论: arXiv管理员备注:与arXiv:2507.21881、arXiv:2507.21875文本重叠
主题: 人工智能 (cs.AI) ; 机器学习 (cs.LG) ; 信号处理 (eess.SP)
疼痛是一种影响大量人群的复杂状况。 准确且一致的评估对于经历疼痛的个体至关重要,并有助于开发有效和先进的管理策略。 自动疼痛评估系统提供持续监测并支持临床决策,旨在减少痛苦并预防功能退化。 本研究已提交至\textit{第二部分 下一代疼痛评估多模态传感挑战赛(AI4PAIN)}。 所提出的方法引入了一个利用呼吸作为输入信号的流程,并结合了一种高效交叉注意力变换器以及多窗口策略。 大量实验表明,呼吸是疼痛评估的一种有价值的生理模态。 此外,实验表明,当适当优化时,紧凑且高效的模型可以实现强大的性能,通常超过较大的模型。 所提出的多窗口方法有效地捕捉了短期和长期特征以及全局特性,从而增强了模型的表示能力。
Pain is a complex condition affecting a large portion of the population. Accurate and consistent evaluation is essential for individuals experiencing pain, and it supports the development of effective and advanced management strategies. Automatic pain assessment systems provide continuous monitoring and support clinical decision-making, aiming to reduce distress and prevent functional decline. This study has been submitted to the \textit{Second Multimodal Sensing Grand Challenge for Next-Gen Pain Assessment (AI4PAIN)}. The proposed method introduces a pipeline that leverages respiration as the input signal and incorporates a highly efficient cross-attention transformer alongside a multi-windowing strategy. Extensive experiments demonstrate that respiration is a valuable physiological modality for pain assessment. Moreover, experiments revealed that compact and efficient models, when properly optimized, can achieve strong performance, often surpassing larger counterparts. The proposed multi-window approach effectively captures both short-term and long-term features, as well as global characteristics, thereby enhancing the model's representational capacity.
- [118] arXiv:2507.22769 (替换) [中文pdf, pdf, html, 其他]
-
标题: 用于自动驾驶功能加速虚拟验证的贝叶斯优化
标题: Bayesian Optimization applied for accelerated Virtual Validation of the Autonomous Driving Function
Satyesh Shanker Awasthi, Mohammed Irshadh Ismaaeel Sathyamangalam Imran, Stefano Arrigoni, Francesco Braghin
评论: 12页,更正了参考文献27和38的作者列表,删除了参考文献6的重复引用
主题: 机器人技术 (cs.RO) ; 系统与控制 (eess.SY)
自主驾驶功能(ADFs)的严格验证和确认(V&V)对于确保自动驾驶车辆(AVs)的安全性和公众接受度至关重要。 当前的验证主要依赖于仿真,以在车辆的操作设计域(ODD)内实现足够的测试覆盖率,但全面探索可能场景的庞大参数空间在计算上成本高昂且耗时。 本研究引入了一个基于贝叶斯优化(BO)的框架,以加速关键场景的发现。 我们在一个基于模型预测控制器(MPC)的运动规划器上展示了该框架的有效性,表明它能够使用比暴力实验设计(DoE)方法少多个数量级的仿真来识别危险情况,例如偏离道路事件。 此外,本研究还探讨了该框架在更高维参数空间中的可扩展性,以及其在用作案例研究的运动规划器的ODD内识别多个不同关键区域的能力。
Rigorous Verification and Validation (V&V) of Autonomous Driving Functions (ADFs) is paramount for ensuring the safety and public acceptance of Autonomous Vehicles (AVs). Current validation relies heavily on simulation to achieve sufficient test coverage within the Operational Design Domain (ODD) of a vehicle, but exhaustively exploring the vast parameter space of possible scenarios is computationally expensive and time-consuming. This work introduces a framework based on Bayesian Optimization (BO) to accelerate the discovery of critical scenarios. We demonstrate the effectiveness of the framework on a Model Predictive Controller (MPC)-based motion planner, showing that it identifies hazardous situations, such as off-road events, using orders of magnitude fewer simulations than brute-force Design of Experiments (DoE) methods. Furthermore, this study investigates the scalability of the framework in higher-dimensional parameter spaces and its ability to identify multiple, distinct critical regions within the ODD of the motion planner used as the case study.
- [119] arXiv:2508.00733 (替换) [中文pdf, pdf, html, 其他]
-
标题: AudioGen-Omni:一种用于视频同步音频、语音和歌曲生成的统一多模态扩散Transformer
标题: AudioGen-Omni: A Unified Multimodal Diffusion Transformer for Video-Synchronized Audio, Speech, and Song Generation
评论: 12页,2图
主题: 声音 (cs.SD) ; 计算机视觉与模式识别 (cs.CV) ; 多媒体 (cs.MM) ; 音频与语音处理 (eess.AS)
我们提出AudioGen-Omni——一种基于多模态扩散变换器(MMDit)的统一方法,能够生成与输入视频连贯同步的高质量音频、语音和歌曲。 AudioGen-Omni引入了一种新颖的联合训练范式,无缝整合大规模视频-文本-音频语料库,使模型能够根据多模态输入生成语义丰富的、声学多样的音频,并适应广泛的音频生成任务。 AudioGen-Omni采用了一个统一的歌词转录编码器,将歌曲和口语输入中的字素和音素编码为密集的帧级表示。 这些密集的帧级表示通过基于AdaLN的联合注意力机制进行融合,该机制由相位对齐各向异性位置注入(PAAPI)增强,其中RoPE被选择性地应用于时间结构化的模态,以确保精确且稳健的跨模态对齐。 通过解冻所有模态并遮蔽缺失输入,AudioGen-Omni减轻了文本冻结范式的语义限制,实现了有效的跨模态条件生成。 这种联合训练方法提高了音频质量、语义对齐和唇形同步准确性,同时在文本到音频/语音/歌曲任务上实现了最先进的结果。 对于8秒音频的推理时间为1.91秒,它在效率和通用性方面都有显著提升。
We present AudioGen-Omni - a unified approach based on multimodal diffusion transformers (MMDit), capable of generating high-fidelity audio, speech, and song coherently synchronized with the input video. AudioGen-Omni introduces a novel joint training paradigm that seamlessly integrates large-scale video-text-audio corpora, enabling a model capable of generating semantically rich, acoustically diverse audio conditioned on multimodal inputs and adaptable to a wide range of audio generation tasks. AudioGen-Omni employs a unified lyrics-transcription encoder that encodes graphemes and phonemes from both song and spoken inputs into dense frame-level representations. Dense frame-level representations are fused using an AdaLN-based joint attention mechanism enhanced with phase-aligned anisotropic positional infusion (PAAPI), wherein RoPE is selectively applied to temporally structured modalities to ensure precise and robust cross-modal alignment. By unfreezing all modalities and masking missing inputs, AudioGen-Omni mitigates the semantic constraints of text-frozen paradigms, enabling effective cross-modal conditioning. This joint training approach enhances audio quality, semantic alignment, and lip-sync accuracy, while also achieving state-of-the-art results on Text-to-Audio/Speech/Song tasks. With an inference time of 1.91 seconds for 8 seconds of audio, it offers substantial improvements in both efficiency and generality.
- [120] arXiv:2508.02521 (替换) [中文pdf, pdf, html, 其他]
-
标题: 面向可靠音频深度伪造归属和模型识别:一种多级自编码器框架
标题: Towards Reliable Audio Deepfake Attribution and Model Recognition: A Multi-Level Autoencoder-Based Framework
Andrea Di Pierno (1), Luca Guarnera (2), Dario Allegra (2), Sebastiano Battiato (2) ((1) IMT School of Advanced Studies, (2) University of Catania)
主题: 声音 (cs.SD) ; 计算机视觉与模式识别 (cs.CV) ; 音频与语音处理 (eess.AS)
音频深度伪造的泛滥对数字通信中的信任构成了日益增长的威胁。 尽管检测方法已经取得进展,但将音频深度伪造归因于其源模型仍然是一个研究不足但至关重要的挑战。 在本文中,我们引入了LAVA(语音归因分层架构),这是一种用于音频深度伪造检测和模型识别的分层框架,该框架利用仅在虚假音频上训练的卷积自编码器提取的注意力增强的潜在表示。 两个专门的分类器在这些特征上运行:音频深度伪造归因(ADA),用于识别生成技术;音频深度伪造模型识别(ADMR),用于识别特定的生成模型实例。 为了在开放集条件下提高鲁棒性,我们结合了基于置信度的拒绝阈值。 在ASVspoof2021、FakeOrReal和CodecFake上的实验表现出强劲的性能:ADA分类器在所有数据集上的F1分数均超过95%,而ADMR模块在六个类别上的宏F1达到96.31%。 对ASVspoof2019 LA中未见过的攻击进行的额外测试和错误传播分析证实了LAVA的鲁棒性和可靠性。 该框架通过在开放集条件下引入一种监督方法来进行深度伪造归因和模型识别,已在公开基准上得到验证,并配有公开发布的模型和代码。 模型和代码可在https://www.github.com/adipiz99/lava-framework获取。
The proliferation of audio deepfakes poses a growing threat to trust in digital communications. While detection methods have advanced, attributing audio deepfakes to their source models remains an underexplored yet crucial challenge. In this paper we introduce LAVA (Layered Architecture for Voice Attribution), a hierarchical framework for audio deepfake detection and model recognition that leverages attention-enhanced latent representations extracted by a convolutional autoencoder trained solely on fake audio. Two specialized classifiers operate on these features: Audio Deepfake Attribution (ADA), which identifies the generation technology, and Audio Deepfake Model Recognition (ADMR), which recognizes the specific generative model instance. To improve robustness under open-set conditions, we incorporate confidence-based rejection thresholds. Experiments on ASVspoof2021, FakeOrReal, and CodecFake show strong performance: the ADA classifier achieves F1-scores over 95% across all datasets, and the ADMR module reaches 96.31% macro F1 across six classes. Additional tests on unseen attacks from ASVspoof2019 LA and error propagation analysis confirm LAVA's robustness and reliability. The framework advances the field by introducing a supervised approach to deepfake attribution and model recognition under open-set conditions, validated on public benchmarks and accompanied by publicly released models and code. Models and code are available at https://www.github.com/adipiz99/lava-framework.
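Confidence-based rejection for open-set classification can be sketched in a few lines: the classifier's scores are turned into probabilities, and the prediction is withheld when the top probability falls below a threshold. The function names, the softmax confidence measure, and the threshold value are assumptions for illustration and are not taken from the LAVA code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_with_rejection(logits, threshold=0.9):
    """Return the predicted class index, or None to flag an unknown source."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] >= threshold else None


if __name__ == "__main__":
    print(classify_with_rejection([5.0, 0.0, 0.0]))  # confident prediction
    print(classify_with_rejection([1.0, 1.0, 1.0]))  # ambiguous, rejected
```

Raising the threshold rejects more inputs from unseen generators at the cost of also rejecting more known ones, so in practice it would be tuned on held-out data.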
- [121] arXiv:2508.02873 (替换) [中文pdf, pdf, html, 其他]
-
标题: 可调腿部刚度在单足跳跃器中用于跨不同地面轮廓的节能垂直跳跃
标题: Tunable Leg Stiffness in a Monopedal Hopper for Energy-Efficient Vertical Hopping Across Varying Ground Profiles
评论: 2025年IEEE国际机器人与自动化会议(ICRA)
主题: 机器人技术 (cs.RO) ; 系统与控制 (eess.SY)
我们介绍了HASTA(用于地形适应的可调节刚度跳动机器人)的设计和实现,这是一种具有实时可调腿部刚度的垂直跳跃机器人,旨在优化各种地面轮廓(一对地面刚度和阻尼条件)下的能量效率。通过调整腿部刚度,我们的目标是最大化跳跃顶点高度,这是能量高效垂直跳跃的关键指标。我们假设在柔软、阻尼的地面上,较软的腿部表现更好,通过最小化渗透和能量损失;而在坚硬、阻尼较小的地面上,较硬的腿部表现更佳,通过减少肢体变形和能量耗散。通过实验测试和仿真,我们发现对于每种地面刚度和阻尼的组合,我们选择的最佳腿部刚度,使机器人能够在恒定的能量输入下达到最大稳态跳跃高度。这些结果支持了我们的假设,即可调刚度在受控实验条件下提高了能量高效的运动。此外,仿真提供了有助于未来开发选择腿部刚度的控制器的见解。
We present the design and implementation of HASTA (Hopper with Adjustable Stiffness for Terrain Adaptation), a vertical hopping robot with real-time tunable leg stiffness, aimed at optimizing energy efficiency across various ground profiles (a pair of ground stiffness and damping conditions). By adjusting leg stiffness, we aim to maximize apex hopping height, a key metric for energy-efficient vertical hopping. We hypothesize that softer legs perform better on soft, damped ground by minimizing penetration and energy loss, while stiffer legs excel on hard, less damped ground by reducing limb deformation and energy dissipation. Through experimental tests and simulations, we find the best leg stiffness within our selection for each combination of ground stiffness and damping, enabling the robot to achieve maximum steady-state hopping height with a constant energy input. These results support our hypothesis that tunable stiffness improves energy-efficient locomotion in controlled experimental conditions. In addition, the simulation provides insights that could aid in the future development of controllers for selecting leg stiffness.