Cross-species modeling of plant genomes at single nucleotide resolution using a pre-trained DNA language model.
Zhai, Jingjing; Gokaslan, Aaron; Schiff, Yair; Berthel, Ana; Liu, Zong-Yan; Miller, Zachary R; Scheben, Armin; Stitzer, Michelle C; Romay, M Cinta; Buckler, Edward S; Kuleshov, Volodymyr.
Affiliations
  • Zhai J; Institute for Genomic Diversity, Cornell University, Ithaca, NY, USA 14853.
  • Gokaslan A; Department of Computer Science, Cornell University, Ithaca, NY, USA 14853.
  • Schiff Y; Department of Computer Science, Cornell University, Ithaca, NY, USA 14853.
  • Berthel A; Institute for Genomic Diversity, Cornell University, Ithaca, NY, USA 14853.
  • Liu ZY; Section of Plant Breeding and Genetics, Cornell University, Ithaca, NY, USA 14853.
  • Miller ZR; Institute for Genomic Diversity, Cornell University, Ithaca, NY, USA 14853.
  • Scheben A; Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY, USA 11724.
  • Stitzer MC; Institute for Genomic Diversity, Cornell University, Ithaca, NY, USA 14853.
  • Romay MC; Institute for Genomic Diversity, Cornell University, Ithaca, NY, USA 14853.
  • Buckler ES; Section of Plant Breeding and Genetics, Cornell University, Ithaca, NY, USA 14853.
  • Kuleshov V; Institute for Genomic Diversity, Cornell University, Ithaca, NY, USA 14853.
bioRxiv ; 2024 Jun 10.
Article in English | MEDLINE | ID: mdl-38895432
ABSTRACT
Understanding the function and fitness effects of diverse plant genomes requires transferable models. Language models (LMs) pre-trained on large-scale biological sequences can learn evolutionary conservation and are thus expected to offer better cross-species prediction, through fine-tuning on limited labeled data, than supervised deep learning models. We introduce PlantCaduceus, a plant DNA LM based on the Caduceus and Mamba architectures, pre-trained on a carefully curated dataset consisting of 16 diverse angiosperm genomes. Fine-tuning PlantCaduceus on limited labeled Arabidopsis data for four tasks involving transcription and translation modeling demonstrated high transferability to maize, which diverged 160 million years ago, outperforming the best baseline model by 1.45-fold to 7.23-fold. PlantCaduceus also enables genome-wide identification of deleterious mutations without multiple sequence alignment (MSA). PlantCaduceus demonstrated a threefold enrichment of rare alleles in prioritized deleterious mutations compared to MSA-based methods and matched state-of-the-art protein LMs. PlantCaduceus is a versatile pre-trained DNA LM expected to accelerate plant genomics and crop breeding applications.
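The abstract does not say how deleterious mutations are scored without an MSA. One common approach with masked DNA LMs, assumed here purely for illustration, is to mask the variant position and compare the model's probability of the alternate allele with that of the reference allele as a log-likelihood ratio. The sketch below shows only that arithmetic: the function name `zero_shot_score` and the probability values are hypothetical and do not come from PlantCaduceus or its API.

```python
import math

NUCLEOTIDES = ("A", "C", "G", "T")


def zero_shot_score(masked_probs, ref, alt):
    """Return log P(alt) - log P(ref) at a masked position.

    masked_probs: dict mapping each nucleotide to the probability a
    masked LM assigns to it at the variant site (assumed input).
    More negative scores mean the model finds the alternate allele
    less plausible than the reference in this sequence context.
    """
    for base in (ref, alt):
        if base not in NUCLEOTIDES:
            raise ValueError(f"unexpected nucleotide: {base!r}")
    return math.log(masked_probs[alt]) - math.log(masked_probs[ref])


# Hypothetical model output at a strongly conserved site (not real data).
probs = {"A": 0.90, "C": 0.05, "G": 0.03, "T": 0.02}
score = zero_shot_score(probs, ref="A", alt="T")
print(f"A>T score: {score:.3f}")
```

Under this assumption, variants could be ranked genome-wide by score, with the most negative values prioritized as candidate deleterious mutations.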

Full text: 1 Collections: 01-international Database: MEDLINE Language: En Publication year: 2024 Document type: Article
