CenXiv.org

Genomics (q-bio.GN)

Showing new listings for Friday, 8 August 2025

Total of 7 entries

New submissions (showing 4 of 4 entries)

[1] arXiv:2508.04739
Title: CodonMoE: DNA Language Models for mRNA Analyses
Shiyi Du, Litian Liang, Jiayi Li, Carl Kingsford
Subjects: Genomics (q-bio.GN); Machine Learning (cs.LG)

Genomic language models (gLMs) face a fundamental efficiency challenge: either maintain separate specialized models for each biological modality (DNA and RNA) or develop large multi-modal architectures. Both approaches impose significant computational burdens: modality-specific models require redundant infrastructure despite inherent biological connections, while multi-modal architectures demand massive parameter counts and extensive cross-modality pretraining. To address this limitation, we introduce CodonMoE (Adaptive Mixture of Codon Reformative Experts), a lightweight adapter that transforms DNA language models into effective RNA analyzers without RNA-specific pretraining. Our theoretical analysis establishes CodonMoE as a universal approximator at the codon level, capable of mapping arbitrary functions from codon sequences to RNA properties given sufficient expert capacity. Across four RNA prediction tasks spanning stability, expression, and regulation, DNA models augmented with CodonMoE significantly outperform their unmodified counterparts, with the HyenaDNA+CodonMoE series achieving state-of-the-art results using 80% fewer parameters than specialized RNA models. By maintaining sub-quadratic complexity while achieving superior performance, our approach provides a principled path toward unifying genomic language modeling, leveraging more abundant DNA data and reducing computational overhead while preserving modality-specific performance advantages.
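
To make the idea of a codon-level mixture of experts concrete, the following is a minimal sketch of a generic softmax-gated mixture applied to groups of three nucleotide embeddings. It is not the CodonMoE architecture: the function names (`codon_moe`, `linear`), the random weights, and the dimensions are all assumptions chosen for illustration, and the per-nucleotide embeddings stand in for the outputs of a DNA language model.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, vec):
    """Matrix-vector product; `weights` is a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def codon_moe(token_embeddings, gate_w, expert_ws):
    """Mix expert outputs for each codon (three consecutive embeddings).

    Hypothetical layout: each codon vector is the concatenation of its
    three nucleotide embeddings; trailing nucleotides that do not fill
    a codon are ignored.
    """
    outputs = []
    for i in range(0, len(token_embeddings) - 2, 3):
        codon = token_embeddings[i] + token_embeddings[i + 1] + token_embeddings[i + 2]
        gates = softmax(linear(gate_w, codon))            # one weight per expert
        expert_outs = [linear(w, codon) for w in expert_ws]
        mixed = [
            sum(g * out[d] for g, out in zip(gates, expert_outs))
            for d in range(len(expert_outs[0]))
        ]
        outputs.append(mixed)
    return outputs


# Toy usage with random stand-in embeddings (9 nucleotides -> 3 codons).
random.seed(0)
dim, n_experts, out_dim = 4, 3, 2
emb = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(9)]
gate_w = [[random.gauss(0, 1) for _ in range(3 * dim)] for _ in range(n_experts)]
expert_ws = [
    [[random.gauss(0, 1) for _ in range(3 * dim)] for _ in range(out_dim)]
    for _ in range(n_experts)
]
codon_features = codon_moe(emb, gate_w, expert_ws)
print(len(codon_features), len(codon_features[0]))
```

In a real adapter the gate and expert weights would be trained on the downstream RNA task while the DNA backbone supplies the embeddings; how the paper does this is not specified in the abstract.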

[2] arXiv:2508.04742
Title: Discovery of Disease Relationships via Transcriptomic Signature Analysis Powered by Agentic AI
Ke Chen, Haohan Wang
Subjects: Genomics (q-bio.GN); Machine Learning (cs.LG)

Modern disease classification often overlooks molecular commonalities hidden beneath divergent clinical presentations. This study introduces a transcriptomics-driven framework for discovering disease relationships by analyzing over 1300 disease-condition pairs using GenoMAS, a fully automated agentic AI system. Beyond identifying robust gene-level overlaps, we develop a novel pathway-based similarity framework that integrates multi-database enrichment analysis to quantify functional convergence across diseases. The resulting disease similarity network reveals both known comorbidities and previously undocumented cross-category links. By examining shared biological pathways, we explore potential molecular mechanisms underlying these connections, offering functional hypotheses that go beyond symptom-based taxonomies. We further show how background conditions such as obesity and hypertension modulate transcriptomic similarity, and identify therapeutic repurposing opportunities for rare diseases like autism spectrum disorder based on their molecular proximity to better-characterized conditions. In addition, this work demonstrates how biologically grounded agentic AI can scale transcriptomic analysis while enabling mechanistic interpretation across complex disease landscapes. All results are publicly accessible at github.com/KeeeeChen/Pathway_Similarity_Network.
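
One simple way to quantify pathway-level overlap between two diseases is a Jaccard index over their enriched pathway sets, averaged across enrichment databases. The sketch below illustrates that idea only; the abstract does not state which similarity measure the authors use, and the database keys and pathway identifiers (`db1`, `P1`, ...) are placeholders, not real results.

```python
def jaccard(a, b):
    """Jaccard index of two collections; defined as 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pathway_similarity(enriched_a, enriched_b):
    """Average Jaccard index over the databases present for both diseases.

    Each argument maps a database name to the set of pathways found
    enriched for that disease (hypothetical input format).
    """
    shared_dbs = sorted(set(enriched_a) & set(enriched_b))
    if not shared_dbs:
        return 0.0
    scores = [jaccard(enriched_a[db], enriched_b[db]) for db in shared_dbs]
    return sum(scores) / len(scores)


# Placeholder enrichment results for two diseases.
disease_a = {"db1": {"P1", "P2", "P3"}, "db2": {"Q1", "Q2"}}
disease_b = {"db1": {"P2", "P3", "P4"}, "db2": {"Q2"}}
print(pathway_similarity(disease_a, disease_b))  # (2/4 + 1/2) / 2 = 0.5
```

Pairwise scores of this kind can then be thresholded to form the edges of a disease similarity network.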

[3] arXiv:2508.04747
Title: GRIT: Graph-Regularized Logit Refinement for Zero-shot Cell Type Annotation
Tianxiang Hu, Chenyi Zhou, Jiaxiang Liu, Jiongxin Wang, Ruizhe Chen, Haoxiang Xia, Gaoang Wang, Jian Wu, Zuozhu Liu
Subjects: Genomics (q-bio.GN); Machine Learning (cs.LG)

Cell type annotation is a fundamental step in the analysis of single-cell RNA sequencing (scRNA-seq) data. In practice, human experts often rely on the structure revealed by principal component analysis (PCA) followed by $k$-nearest neighbor ($k$-NN) graph construction to guide annotation. While effective, this process is labor-intensive and does not scale to large datasets. Recent advances in CLIP-style models offer a promising path toward automating cell type annotation. By aligning scRNA-seq profiles with natural language descriptions, models like LangCell enable zero-shot annotation. While LangCell demonstrates decent zero-shot performance, its predictions remain suboptimal, particularly in achieving consistent accuracy across all cell types. In this paper, we propose to refine the zero-shot logits produced by LangCell through a graph-regularized optimization framework. By enforcing local consistency over the task-specific PCA-based $k$-NN graph, our method combines the scalability of pre-trained models with the structural robustness relied upon in expert annotation. We evaluate our approach on 14 annotated human scRNA-seq datasets from 4 distinct studies, spanning 11 organs and over 200,000 single cells. Our method consistently improves zero-shot annotation accuracy, achieving accuracy gains of up to 10%. Further analysis shows the mechanism by which GRIT propagates correct signals through the graph, pulling mislabeled cells back toward more accurate predictions. The method is training-free, model-agnostic, and serves as a simple yet effective plug-in for enhancing automated cell type annotation in practice.
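
The general idea of graph-regularized logit refinement can be sketched as follows. This is a generic Laplacian-smoothing formulation, assumed here for illustration and not taken from the paper: it minimizes sum_i ||z_i - z0_i||^2 + lam * sum_{i~j} ||z_i - z_j||^2 over a symmetric k-NN graph using Jacobi iterations. The 2-D points stand in for PCA coordinates, and the logits are toy values.

```python
def knn_graph(points, k):
    """Symmetric k-NN neighbor sets from squared Euclidean distances."""
    n = len(points)
    neighbors = [set() for _ in range(n)]
    for i in range(n):
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(points[i], points[j])), j)
            for j in range(n)
            if j != i
        )
        for _, j in dists[:k]:
            neighbors[i].add(j)
            neighbors[j].add(i)  # symmetrize the graph
    return neighbors


def refine_logits(logits, neighbors, lam=1.0, n_iter=50):
    """Jacobi updates z_i = (z0_i + lam * sum_j z_j) / (1 + lam * deg_i)."""
    z = [row[:] for row in logits]
    n_classes = len(logits[0])
    for _ in range(n_iter):
        new_z = []
        for i, z0 in enumerate(logits):
            deg = len(neighbors[i])
            agg = [sum(z[j][c] for j in neighbors[i]) for c in range(n_classes)]
            new_z.append(
                [(z0[c] + lam * agg[c]) / (1 + lam * deg) for c in range(n_classes)]
            )
        z = new_z
    return z


def argmax(row):
    return max(range(len(row)), key=lambda c: row[c])


# Two well-separated clusters; cell 3 sits in the first cluster but its
# zero-shot logits favor class 1.
points = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1),
          (5, 5), (5.1, 5), (5, 5.1), (5.1, 5.1)]
logits = [[2.0, 0.0], [2.0, 0.0], [2.0, 0.0], [0.0, 0.5],
          [0.0, 2.0], [0.0, 2.0], [0.0, 2.0], [0.0, 2.0]]
neighbors = knn_graph(points, k=3)
refined = refine_logits(logits, neighbors)
print(argmax(logits[3]), "->", argmax(refined[3]))
```

In this toy case the neighbors of cell 3 pull its prediction back to the class of its cluster, which mirrors the propagation behavior the abstract describes. No training is involved, and the refinement only needs logits and a graph.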

[4] arXiv:2508.04757
Title: Embedding Is (Almost) All You Need: Retrieval-Augmented Inference for Generalizable Genomic Prediction Tasks
Nirjhor Datta, Swakkhar Shatabda, M Sohel Rahman
Subjects: Genomics (q-bio.GN); Machine Learning (cs.LG)

Large pre-trained DNA language models such as DNABERT-2, Nucleotide Transformer, and HyenaDNA have demonstrated strong performance on various genomic benchmarks. However, most applications rely on expensive fine-tuning, which works best when the training and test data share a similar distribution. In this work, we investigate whether task-specific fine-tuning is always necessary. We show that simple embedding-based pipelines that extract fixed representations from these models and feed them into lightweight classifiers can achieve competitive performance. In evaluation settings with different data distributions, embedding-based methods often outperform fine-tuning while reducing inference time by 10x to 20x. Our results suggest that embedding extraction is not only a strong baseline but also a more generalizable and efficient alternative to fine-tuning, especially for deployment in diverse or unseen genomic contexts. For example, in enhancer classification, HyenaDNA embeddings combined with zCurve achieve 0.68 accuracy (vs. 0.58 for fine-tuning), with an 88% reduction in inference time and over 8x lower carbon emissions (0.02 kg vs. 0.17 kg CO2). In non-TATA promoter classification, DNABERT-2 embeddings with zCurve or GC content reach 0.85 accuracy (vs. 0.89 with fine-tuning) with a 22x lower carbon footprint (0.02 kg vs. 0.44 kg CO2). These results show that embedding-based pipelines offer over 10x better carbon efficiency while maintaining strong predictive performance. The code is available here: https://github.com/NIRJHOR-DATTA/EMBEDDING-IS-ALMOST-ALL-YOU-NEED.
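
The hand-crafted features mentioned above are cheap to compute. The sketch below derives GC content and length-normalized Z-curve coordinates from a DNA sequence and appends them to a fixed embedding vector. The Z-curve definition is the standard one (purine/pyrimidine, amino/keto, weak/strong hydrogen bonding). How the paper combines these features with model embeddings is an assumption here, and `build_features` and the placeholder embedding are hypothetical.

```python
def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def z_curve(seq):
    """Final Z-curve coordinates (x, y, z), normalized by sequence length."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return (0.0, 0.0, 0.0)
    a, c, g, t = (seq.count(base) for base in "ACGT")
    x = (a + g) - (c + t)  # purine vs. pyrimidine
    y = (a + c) - (g + t)  # amino vs. keto
    z = (a + t) - (g + c)  # weak vs. strong hydrogen bonding
    return (x / n, y / n, z / n)


def build_features(embedding, seq):
    """Concatenate a fixed embedding with GC content and Z-curve features."""
    return list(embedding) + [gc_content(seq)] + list(z_curve(seq))


# Placeholder embedding standing in for a frozen DNA language model output.
features = build_features([0.1, -0.3, 0.7], "ACGTGGCA")
print(len(features))  # 3 embedding dims + 1 GC value + 3 Z-curve values
```

A feature vector of this kind can be passed to any lightweight classifier, which is what keeps inference inexpensive relative to running a fine-tuned model end to end.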

Cross submissions (showing 1 of 1 entries)

[5] arXiv:2508.04743 (cross-list from q-bio.MN)
Title: Alz-QNet: A Quantum Regression Network for Studying Alzheimer's Gene Interactions
Debanjan Konar, Neerav Sreekumar, Richard Jiang, Vaneet Aggarwal
Journal reference: Computers in Biology and Medicine, Volume 196, Part C, September 2025, 110837
Subjects: Molecular Networks (q-bio.MN); Machine Learning (cs.LG); Genomics (q-bio.GN); Quantum Physics (quant-ph)

Understanding the molecular-level mechanisms underpinning Alzheimer's disease (AD) by studying crucial genes associated with the disease remains a challenge. Alzheimer's, being a multifactorial disease, requires an understanding of the underlying gene-gene interactions to advance theranostics. In this article, a novel attempt is made to use quantum regression to decode how some crucial AD genes, such as Amyloid Beta Precursor Protein ($APP$), Fibroblast Growth Factor 14 ($FGF14$), Yin Yang 1 ($YY1$), and Phospholipase D Family Member 3 ($PLD3$), are influenced by other prominent switching genes during disease progression, which may help in gene expression-based therapy for AD. Our proposed Quantum Regression Network (Alz-QNet) introduces a pioneering approach with insights from the state-of-the-art Quantum Gene Regulatory Networks (QGRN) to unravel the gene interactions involved in AD pathology, particularly within the Entorhinal Cortex (EC), where early pathological changes occur. Using the proposed Alz-QNet framework, we explore the interactions between key genes ($APP$, $FGF14$, $YY1$, $EGR1$, $GAS7$, $AKT3$, $SREBF2$, and $PLD3$), all of which are believed to play a crucial role in the progression of AD, within the EC microenvironment of AD patients, using genetic samples from the dataset $GSE138852$. Our investigation uncovers intricate gene-gene interactions, shedding light on the potential regulatory mechanisms that underlie the pathogenesis of AD, which may help identify potential gene inhibitors or regulators for theranostics.

Replacement submissions (showing 2 of 2 entries)

[6] arXiv:2506.09076 (replaced)
Title: A Probabilistic Framework for Imputing Genetic Distances in Spatiotemporal Pathogen Models
Haley Stone, Jing Du, Hao Xue, Matthew Scotch, David Heslop, Andreas Züfle, Chandini Raina MacIntyre, Flora Salim
Comments: 9 pages, 3 figures | Accepted as a full paper at SIGSPATIAL 2025
Subjects: Genomics (q-bio.GN); Machine Learning (cs.LG); Populations and Evolution (q-bio.PE)

Pathogen genome data offers valuable structure for spatial models, but its utility is limited by incomplete sequencing coverage. We propose a probabilistic framework for inferring genetic distances between unsequenced cases and known sequences within defined transmission chains, using time-aware evolutionary distance modeling. The method estimates pairwise divergence from collection dates and observed genetic distances, enabling biologically plausible imputation grounded in observed divergence patterns, without requiring sequence alignment or known transmission chains. Applied to highly pathogenic avian influenza A/H5 cases in wild birds in the United States, this approach supports scalable, uncertainty-aware augmentation of genomic datasets and enhances the integration of evolutionary information into spatiotemporal modeling workflows.
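
As a rough illustration of time-aware distance imputation, the sketch below fits a constant divergence rate to observed (time gap, genetic distance) pairs and uses it to predict a distance, with an approximate interval, for a case that has only a collection date. This linear, clock-like model is a deliberately simple stand-in and is not the probabilistic framework of the paper. The function names and the toy numbers are assumptions.

```python
import math


def fit_rate(day_gaps, distances):
    """Least-squares slope through the origin plus residual standard deviation."""
    sxx = sum(d * d for d in day_gaps)
    sxy = sum(d * g for d, g in zip(day_gaps, distances))
    rate = sxy / sxx
    residuals = [g - rate * d for d, g in zip(day_gaps, distances)]
    dof = max(len(residuals) - 1, 1)  # one fitted parameter
    sigma = math.sqrt(sum(r * r for r in residuals) / dof)
    return rate, sigma


def impute_distance(day_gap, rate, sigma):
    """Predicted distance and a rough 95% interval, clipped at zero."""
    mean = rate * day_gap
    return mean, (max(mean - 1.96 * sigma, 0.0), mean + 1.96 * sigma)


# Toy observations: days between collection dates and pairwise distances.
gaps = [10, 20, 30, 40]
dists = [0.011, 0.019, 0.032, 0.038]
rate, sigma = fit_rate(gaps, dists)
mean, (low, high) = impute_distance(25, rate, sigma)
print(f"rate={rate:.5f} per day, imputed={mean:.4f} [{low:.4f}, {high:.4f}]")
```

The interval is what makes the imputation uncertainty-aware: downstream spatiotemporal models can sample from it instead of treating the imputed distance as observed.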

[7] arXiv:2506.15671 (replaced)
Title: Quantum-inspired algorithm for simulating viral response
Daria O. Konina, Dmitry I. Korbashov, Ilya V. Kovalchuk, Aygul A. Nizamieva, Dmitry A. Chermoshentsev, Aleksey K. Fedorov
Comments: 35 pages, 6 figures
Subjects: Genomics (q-bio.GN); Molecular Networks (q-bio.MN); Quantum Physics (quant-ph)

Understanding the properties of biological systems is an exciting avenue for applying advanced approaches to solving corresponding computational tasks. A specific class of problems that arises in the resolution of biological challenges is optimization. In this work, we present the results of a proof-of-concept study that applies a quantum-inspired optimization algorithm to simulate a viral response. We formulate an Ising-type model to describe the patterns of gene activity in host responses. Reducing the problem to the Ising form allows the use of available quantum and quantum-inspired optimization tools. We demonstrate the application of a quantum-inspired optimization algorithm to this problem. Our study paves the way for exploring the full potential of quantum and quantum-inspired optimization tools in biological applications.
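
To show what reducing a problem to Ising form buys in practice, the sketch below minimizes an Ising energy E(s) = -sum_{i<j} J_ij s_i s_j - sum_i h_i s_i with plain simulated annealing, a classical heuristic used here as a stand-in for the quantum-inspired optimizer in the paper. Reading each spin as a gene being active (+1) or inactive (-1) is an interpretive assumption, and the couplings and fields are toy values.

```python
import math
import random


def ising_energy(spins, J, h):
    """Ising energy; only the upper triangle of J (i < j) is used."""
    n = len(spins)
    energy = -sum(h[i] * spins[i] for i in range(n))
    energy -= sum(
        J[i][j] * spins[i] * spins[j] for i in range(n) for j in range(i + 1, n)
    )
    return energy


def anneal(J, h, n_steps=2000, t_start=2.0, t_end=0.01, seed=0):
    """Single-spin-flip simulated annealing with a geometric cooling schedule."""
    rng = random.Random(seed)
    n = len(h)
    spins = [rng.choice((-1, 1)) for _ in range(n)]
    energy = ising_energy(spins, J, h)
    best, best_energy = spins[:], energy
    for step in range(n_steps):
        temp = t_start * (t_end / t_start) ** (step / (n_steps - 1))
        i = rng.randrange(n)
        spins[i] = -spins[i]  # propose a flip
        new_energy = ising_energy(spins, J, h)
        accept = new_energy <= energy or rng.random() < math.exp(
            (energy - new_energy) / temp
        )
        if accept:
            energy = new_energy
            if energy < best_energy:
                best, best_energy = spins[:], energy
        else:
            spins[i] = -spins[i]  # reject: undo the flip
    return best, best_energy


# Toy model: genes 0 and 1 co-activate, genes 1 and 2 repress each other,
# and gene 0 is biased toward the active state.
J = [[0, 1, 0], [0, 0, -1], [0, 0, 0]]
h = [0.5, 0.0, 0.0]
state, energy = anneal(J, h)
print(state, energy)
```

Once a model is written with couplings `J` and fields `h`, the same inputs can be passed to any Ising or QUBO solver, which is the portability the abstract refers to.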
