Quantitative Methods
Showing new listings for Friday, 8 August 2025
- [1] arXiv:2508.04724
Title: Understanding protein function with a multimodal retrieval-augmented foundation model
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
Protein language models (PLMs) learn probability distributions over natural protein sequences. By learning from hundreds of millions of natural protein sequences, protein understanding and design capabilities emerge. Recent works have shown that scaling these models improves structure prediction, but does not seem to improve mutation understanding and representation quality for protein function prediction. We introduce PoET-2, a multimodal, retrieval-augmented protein foundation model that incorporates in-context learning of family-specific evolutionary constraints with optional structure conditioning to learn generative distributions over protein sequences. PoET-2 uses a hierarchical transformer encoder that is equivariant to sequence context ordering and a dual decoder architecture with both causal and masked language modeling objectives, allowing PoET-2 to operate in both fully generative and bidirectional representation learning modes. PoET-2 achieves state-of-the-art performance on zero-shot variant effect prediction, excelling at scoring variants with multiple mutations and challenging indel mutations. In supervised settings, PoET-2 embeddings outperform previous methods for learning sequence-function relationships, especially with small datasets. This work highlights the benefits of combining retrieval augmentation with multimodal, family-centric modeling for advancing protein foundation models.
- [2] arXiv:2508.04734
Title: Cross-Domain Image Synthesis: Generating H&E from Multiplex Biomarker Imaging
Subjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
While multiplex immunofluorescence (mIF) imaging provides deep, spatially-resolved molecular data, integrating this information with the morphological standard of Hematoxylin & Eosin (H&E) can be very important for obtaining complementary information about the underlying tissue. Generating a virtual H&E stain from mIF data offers a powerful solution, providing immediate morphological context. Crucially, this approach enables the application of the vast ecosystem of H&E-based computer-aided diagnosis (CAD) tools to analyze rich molecular data, bridging the gap between molecular and morphological analysis. In this work, we investigate the use of a multi-level Vector-Quantized Generative Adversarial Network (VQGAN) to create high-fidelity virtual H&E stains from mIF images. We rigorously evaluated our VQGAN against a standard conditional GAN (cGAN) baseline on two publicly available colorectal cancer datasets, assessing performance on both image similarity and functional utility for downstream analysis. Our results show that while both architectures produce visually plausible images, the virtual stains generated by our VQGAN provide a more effective substrate for computer-aided diagnosis. Specifically, downstream nuclei segmentation and semantic preservation in tissue classification tasks performed on VQGAN-generated images demonstrate superior performance and agreement with ground-truth analysis compared to those from the cGAN. This work establishes that a multi-level VQGAN is a robust and superior architecture for generating scientifically useful virtual stains, offering a viable pathway to integrate the rich molecular data of mIF into established and powerful H&E-based analytical workflows.
- [3] arXiv:2508.04735
Title: ERDES: A Benchmark Video Dataset for Retinal Detachment and Macular Status Classification in Ocular Ultrasound
Authors: Pouyan Navard, Yasemin Ozkut, Srikar Adhikari, Elaine Situ-LaCasse, Josie Acuña, Adrienne Yarnish, Alper Yilmaz
Comments: Under review, https://github.com/OSUPCVLab/ERDES
Subjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
Retinal detachment (RD) is a vision-threatening condition that requires timely intervention to preserve vision. Macular involvement -- whether the macula is still intact (macula-intact) or detached (macula-detached) -- is the key determinant of visual outcomes and treatment urgency. Point-of-care ultrasound (POCUS) offers a fast, non-invasive, cost-effective, and accessible imaging modality widely used in diverse clinical settings to detect RD. However, ultrasound image interpretation is limited by a lack of expertise among healthcare providers, especially in resource-limited settings. Deep learning offers the potential to automate ultrasound-based assessment of RD. However, there are no ML ultrasound algorithms currently available for clinical use to detect RD and no prior research has been done on assessing macular status using ultrasound in RD cases -- an essential distinction for surgical prioritization. Moreover, no public dataset currently supports macular-based RD classification using ultrasound video clips. We introduce Eye Retinal DEtachment ultraSound, ERDES, the first open-access dataset of ocular ultrasound clips labeled for (i) presence of retinal detachment and (ii) macula-intact versus macula-detached status. The dataset is intended to facilitate the development and evaluation of machine learning models for detecting retinal detachment. We also provide baseline benchmarks using multiple spatiotemporal convolutional neural network (CNN) architectures. All clips, labels, and training code are publicly available at https://osupcvlab.github.io/ERDES/.
- [4] arXiv:2508.05550
Title: PhysiBoSS-Models: A database for multiscale models
Authors: Vincent Noel, Marco Ruscone, Randy Heiland, Arnau Montagud, Alfonso Valencia, Emmanuel Barillot, Paul Macklin, Laurence Calzone
Subjects: Quantitative Methods (q-bio.QM)
PhysiBoSS is an open-source platform that integrates agent-based modeling of cell populations with intracellular stochastic Boolean networks, enabling multiscale simulations of complex biological behaviors. To promote model sharing and versioning, we present the PhysiBoSS-Models database: a curated repository for multiscale models built with PhysiBoSS. By providing a simple Python API, PhysiBoSS-Models provides an easy way to download and simulate preexisting models through tools such as PhysiCell Studio. By providing standardized access to validated models, PhysiBoSS-Models facilitates reuse, validation, and benchmarking, supporting research in biology.
- [5] arXiv:2508.05594
Title: Data Analysis and Modeling for Transitioning Between Laboratory Methods for Detecting SARS-CoV-2 in Wastewater
Authors: Maria M. Warns, Leah Mrowiec, Christopher Owen, Adam Horton, Chi-Yu Lin, Modou Lamin Jarju, Niall M. Mangan, Aaron Packman, Katelyn Plaisier Leisman, Abhilasha Shrestha, Rachel Poretsky
Subjects: Quantitative Methods (q-bio.QM)
Wastewater surveillance has proven to be a useful tool to monitor pathogens such as SARS-CoV-2 as it is a nonintrusive way to survey the potential disease burden of the population contributing to a sewershed. With the expansion of this field since the beginning of the COVID-19 pandemic, laboratory methods to process wastewater and quantify pathogen nucleic acid levels have improved as technologies changed, efforts expanded in size and scope, and supply chain issues were resolved. Maintaining data continuity is crucial for labs undergoing method transitions to accurately assess infectious disease levels over time and compare measured RNA concentrations to public health data. Despite the dynamic nature of laboratory methods and the necessity to ensure uninterrupted data, to our knowledge there has not been a study that unites two datasets from different lab methods for pathogen quantification from environmental samples. Here, we describe a lab transition from SARS-CoV-2 RNA quantification using a low-throughput, manual filtration-based wastewater concentration and RNA extraction followed by qPCR to a high-throughput, automated magnetic bead-based concentration and extraction followed by dPCR. During the two-month transition period, wastewater samples from across the Chicago metropolitan area were processed with both methods in parallel. We evaluated a variety of regression models to relate the RNA measurements from both methods and found a log-log model was most appropriate after removing outliers and discrepancy points to improve model performance. We also evaluated the consequences of assigning values to samples that were below the detection limit. Our study demonstrates that data continuity can be maintained throughout a transition of laboratory methods if there is a sufficient period of overlap between the methods for an appropriate model to be constructed to relate the datasets.
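The log-log model mentioned in this abstract can be illustrated with a minimal sketch. It fits log10(dPCR) = a + b * log10(qPCR) by ordinary least squares. The paired concentrations and the function names are made up for illustration and are not taken from the paper, and the authors' actual fitting procedure (outlier removal, handling of samples below the detection limit) is not reproduced here. Because logarithms require strictly positive inputs, any value assigned to below-detection samples has to be chosen before this step.

```python
import math

def fit_log_log(x, y):
    """Least-squares fit of log10(y) = intercept + slope * log10(x)."""
    lx = [math.log10(v) for v in x]
    ly = [math.log10(v) for v in y]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def predict(intercept, slope, x_new):
    """Map a measurement from the old method onto the new method's scale."""
    return 10 ** (intercept + slope * math.log10(x_new))

# Hypothetical paired measurements from the overlap period (arbitrary units).
qpcr = [1e3, 1e4, 1e5, 1e6]
dpcr = [2e3, 2e4, 2e5, 2e6]

intercept, slope = fit_log_log(qpcr, dpcr)
converted = predict(intercept, slope, 5e4)
```

In this toy data the new method reads exactly twice the old one, so the fitted slope is close to 1 and the intercept close to log10(2); real paired data would scatter around the fitted line.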
New submissions (showing 5 of 5 entries)
- [6] arXiv:2508.04727 (cross-list from q-bio.TO)
Title: Adaptive k-space Radial Sampling for Cardiac MRI with Reinforcement Learning
Comments: MICCAI 2025 STACOM workshop
Subjects: Tissues and Organs (q-bio.TO); Image and Video Processing (eess.IV); Quantitative Methods (q-bio.QM)
Accelerated Magnetic Resonance Imaging (MRI) requires careful optimization of k-space sampling patterns to balance acquisition speed and image quality. While recent advances in deep learning have shown promise in optimizing Cartesian sampling, the potential of reinforcement learning (RL) for non-Cartesian trajectory optimization remains largely unexplored. In this work, we present a novel RL framework for optimizing radial sampling trajectories in cardiac MRI. Our approach features a dual-branch architecture that jointly processes k-space and image-domain information, incorporating a cross-attention fusion mechanism to facilitate effective information exchange between domains. The framework employs an anatomically-aware reward design and a golden-ratio sampling strategy to ensure uniform k-space coverage while preserving cardiac structural details. Experimental results demonstrate that our method effectively learns optimal radial sampling strategies across multiple acceleration factors, achieving improved reconstruction quality compared to conventional approaches. Code available: https://github.com/Ruru-Xu/RL-kspace-Radial-Sampling
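As background for the golden-ratio sampling strategy mentioned in this abstract, the sketch below generates radial spoke angles with the standard golden-angle increment (180 degrees divided by the golden ratio, about 111.246 degrees). This is an assumption about what "golden-ratio sampling" refers to; it is not the authors' RL policy or reward design, and the function names are hypothetical.

```python
import math

GOLDEN_RATIO = (1 + math.sqrt(5)) / 2
GOLDEN_ANGLE = math.pi / GOLDEN_RATIO  # radians, about 111.246 degrees

def golden_angle_spokes(n_spokes):
    """Return spoke angles in [0, pi), each advanced by the golden angle."""
    return [(k * GOLDEN_ANGLE) % math.pi for k in range(n_spokes)]

def largest_gap(angles):
    """Largest angular gap between neighbouring spokes, including wrap-around."""
    s = sorted(angles)
    gaps = [b - a for a, b in zip(s, s[1:])]
    gaps.append(math.pi - s[-1] + s[0])  # gap across the 0/pi boundary
    return max(gaps)

spokes = golden_angle_spokes(13)
gap = largest_gap(spokes)
```

The appeal of this increment is that any number of consecutive spokes covers the half-circle fairly evenly, so acquisition can stop at different acceleration factors without leaving large unsampled wedges.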
Cross submissions (showing 1 of 1 entries)
- [7] arXiv:2507.21404 (replaced)
Title: Data Leakage and Redundancy in the LIT-PCBA Benchmark
Subjects: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
LIT-PCBA is widely used to benchmark virtual screening models, but our audit reveals that it is fundamentally compromised. We find extensive data leakage and molecular redundancy across its splits, including 2D-identical ligands within and across partitions, pervasive analog overlap, and low-diversity query sets. In ALDH1 alone, for instance, 323 active training-validation analog pairs occur at ECFP4 Tanimoto similarity $\geq 0.6$; across all targets, 2,491 2D-identical inactives appear in both training and validation, with very few corresponding actives. These overlaps allow models to succeed through scaffold memorization rather than generalization, inflating enrichment factors and AUROC scores. These flaws are not incidental -- they are so severe that a trivial memorization-based baseline with no learnable parameters can exploit them to match or exceed the reported performance of state-of-the-art deep learning and 3D-similarity models. As a result, nearly all published results on LIT-PCBA are undermined. Even models evaluated in "zero-shot" mode are affected by analog leakage into the query set, weakening claims of generalization. In its current form, the benchmark does not measure a model's ability to recover novel chemotypes and should not be taken as evidence of methodological progress. All code, data, and baseline implementations are available at: https://github.com/sievestack/LIT-PCBA-audit
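The analog-overlap check described in this abstract rests on Tanimoto similarity between fingerprints. A minimal sketch of that computation follows, with each fingerprint represented as a set of on-bit indices. The molecule names and bit sets are toy values, `leaked_pairs` is a hypothetical helper, and real ECFP4 fingerprints would come from a cheminformatics toolkit; this is not the audit code from the linked repository.

```python
def tanimoto(a, b):
    """Tanimoto similarity: shared on-bits divided by total distinct on-bits."""
    a, b = set(a), set(b)
    union = len(a | b)
    if union == 0:
        return 0.0
    return len(a & b) / union

def leaked_pairs(train, valid, threshold=0.6):
    """List (train_id, valid_id) pairs whose similarity meets the threshold."""
    return [
        (t_id, v_id)
        for t_id, t_fp in train.items()
        for v_id, v_fp in valid.items()
        if tanimoto(t_fp, v_fp) >= threshold
    ]

# Toy fingerprints: sets of on-bit indices, made up for illustration.
train = {"t1": {1, 2, 3, 4}, "t2": {10, 11}}
valid = {"v1": {1, 2, 3, 5}, "v2": {20, 21}}

pairs = leaked_pairs(train, valid)
```

Here `t1` and `v1` share three of five distinct bits, which sits exactly at the 0.6 threshold quoted in the abstract, so that pair is flagged while the others are not.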