Search | VHL Regional Portal

PyHMMER: a Python library binding to HMMER for efficient sequence analysis.

Larralde, Martin; Zeller, Georg.

Bioinformatics ; 39(5), 2023 05 04.

Article in English | MEDLINE | ID: mdl-37074928

ABSTRACT

SUMMARY: PyHMMER provides Python integration of the popular profile Hidden Markov Model software HMMER via Cython bindings. This allows the annotation of protein sequences with profile HMMs and building new ones directly with Python. PyHMMER increases flexibility of use, allowing creating queries directly from Python code, launching searches, and obtaining results without I/O, or accessing previously unavailable statistics like uncorrected P-values. A new parallelization model greatly improves performance when running multithreaded searches, while producing the exact same results as HMMER. AVAILABILITY AND IMPLEMENTATION: PyHMMER supports all modern Python versions (Python 3.6+) and similar platforms as HMMER (x86 or PowerPC UNIX systems). Pre-compiled packages are released via PyPI (https://pypi.org/project/pyhmmer/) and Bioconda (https://anaconda.org/bioconda/pyhmmer). The PyHMMER source code is available under the terms of the open-source MIT licence and hosted on GitHub (https://github.com/althonos/pyhmmer); its documentation is available on ReadTheDocs (https://pyhmmer.readthedocs.io).

Subject(s)

Documentation , Software , Amino Acid Sequence

proGenomes3: approaching one million accurately and consistently annotated high-quality prokaryotic genomes.

Fullam, Anthony; Letunic, Ivica; Schmidt, Thomas S B; Ducarmon, Quinten R; Karcher, Nicolai; Khedkar, Supriya; Kuhn, Michael; Larralde, Martin; Maistrenko, Oleksandr M; Malfertheiner, Lukas; Milanese, Alessio; Rodrigues, Joao Frederico Matias; Sanchis-López, Claudia; Schudoma, Christian; Szklarczyk, Damian; Sunagawa, Shinichi; Zeller, Georg; Huerta-Cepas, Jaime; von Mering, Christian; Bork, Peer; Mende, Daniel R.

Nucleic Acids Res ; 51(D1): D760-D766, 2023 01 06.

Article in English | MEDLINE | ID: mdl-36408900

ABSTRACT

The interpretation of genomic, transcriptomic and other microbial 'omics data is highly dependent on the availability of well-annotated genomes. As the number of publicly available microbial genomes continues to increase exponentially, the need for quality control and consistent annotation is becoming critical. We present proGenomes3, a database of 907 388 high-quality genomes containing 4 billion genes that passed stringent criteria and have been consistently annotated using multiple functional and taxonomic databases including mobile genetic elements and biosynthetic gene clusters. proGenomes3 encompasses 41 171 species-level clusters, defined based on universal single copy marker genes, for which pan-genomes and contextual habitat annotations are provided. The database is available at http://progenomes.embl.de/.

Subject(s)

Genome , Prokaryotic Cells , Databases, Genetic , Genomics , Molecular Sequence Annotation , Bacteria/classification , Bacteria/genetics

Predicting stop codon reassignment improves functional annotation of bacteriophages.

Cook, Ryan; Telatin, Andrea; Bouras, George; Camargo, Antonio Pedro; Larralde, Martin; Edwards, Robert A; Adriaenssens, Evelien M.

bioRxiv ; 2023 Dec 19.

Article in English | MEDLINE | ID: mdl-38187747

ABSTRACT

The majority of bacteriophage diversity remains uncharacterised, and new intriguing mechanisms of their biology are being continually described. Members of some phage lineages, such as the Crassvirales, repurpose stop codons to encode an amino acid by using alternate genetic codes. Here, we investigated the prevalence of stop codon reassignment in phage genomes and subsequent impacts on functional annotation. We predicted 76 genomes within INPHARED and 712 vOTUs from the Unified Human Gut Virome catalogue (UHGV) that repurpose a stop codon to encode an amino acid. We re-annotated these sequences with modified versions of Pharokka and Prokka, called Pharokka-gv and Prokka-gv, to automatically predict stop codon reassignment prior to annotation. Both tools significantly improved the quality of annotations, with Pharokka-gv performing best. For sequences predicted to repurpose TAG to glutamine (translation table 15), Pharokka-gv increased the median gene length (median of per genome medians) from 287 to 481 bp for UHGV sequences (67.8% increase) and from 318 to 550 bp for INPHARED sequences (72.9% increase). The re-annotation increased mean coding density from 66.8% to 90.0%, and from 69.0% to 89.8% for UHGV and INPHARED sequences. Furthermore, the proportion of genes that could be assigned functional annotation increased, including an increase in the number of major capsid proteins that could be identified. We propose that automatic prediction of stop codon reassignment before annotation is beneficial to downstream viral genomic and metagenomic analyses.

Ontology Development Kit: a toolkit for building, maintaining and standardizing biomedical ontologies.

Matentzoglu, Nicolas; Goutte-Gattat, Damien; Tan, Shawn Zheng Kai; Balhoff, James P; Carbon, Seth; Caron, Anita R; Duncan, William D; Flack, Joe E; Haendel, Melissa; Harris, Nomi L; Hogan, William R; Hoyt, Charles Tapley; Jackson, Rebecca C; Kim, HyeongSik; Kir, Huseyin; Larralde, Martin; McMurry, Julie A; Overton, James A; Peters, Bjoern; Pilgrim, Clare; Stefancsik, Ray; Robb, Sofia Mc; Toro, Sabrina; Vasilevsky, Nicole A; Walls, Ramona; Mungall, Christopher J; Osumi-Sutherland, David.

Database (Oxford) ; 2022, 2022 10 08.

Article in English | MEDLINE | ID: mdl-36208225

ABSTRACT

Similar to managing software packages, managing the ontology life cycle involves multiple complex workflows such as preparing releases, continuous quality control checking and dependency management. To manage these processes, a diverse set of tools is required, from command-line utilities to powerful ontology-engineering environments. Particularly in the biomedical domain, which has developed a set of highly diverse yet inter-dependent ontologies, standardizing release practices and metadata and establishing shared quality standards are crucial to enable interoperability. The Ontology Development Kit (ODK) provides a set of standardized, customizable and automatically executable workflows, and packages all required tooling in a single Docker image. In this paper, we provide an overview of how the ODK works, show how it is used in practice and describe how we envision it driving standardization efforts in our community. Database URL: https://github.com/INCATools/ontology-development-kit.

Subject(s)

Biological Ontologies , Databases, Factual , Metadata , Quality Control , Software , Workflow

Biosynthetic potential of the global ocean microbiome.

Paoli, Lucas; Ruscheweyh, Hans-Joachim; Forneris, Clarissa C; Hubrich, Florian; Kautsar, Satria; Bhushan, Agneya; Lotti, Alessandro; Clayssen, Quentin; Salazar, Guillem; Milanese, Alessio; Carlström, Charlotte I; Papadopoulou, Chrysa; Gehrig, Daniel; Karasikov, Mikhail; Mustafa, Harun; Larralde, Martin; Carroll, Laura M; Sánchez, Pablo; Zayed, Ahmed A; Cronin, Dylan R; Acinas, Silvia G; Bork, Peer; Bowler, Chris; Delmont, Tom O; Gasol, Josep M; Gossert, Alvar D; Kahles, André; Sullivan, Matthew B; Wincker, Patrick; Zeller, Georg; Robinson, Serina L; Piel, Jörn; Sunagawa, Shinichi.

Nature ; 607(7917): 111-118, 2022 07.

Article in English | MEDLINE | ID: mdl-35732736

ABSTRACT

Natural microbial communities are phylogenetically and metabolically diverse. In addition to underexplored organismal groups [1], this diversity encompasses a rich discovery potential for ecologically and biotechnologically relevant enzymes and biochemical compounds [2,3]. However, studying this diversity to identify genomic pathways for the synthesis of such compounds [4] and assigning them to their respective hosts remains challenging. The biosynthetic potential of microorganisms in the open ocean remains largely uncharted owing to limitations in the analysis of genome-resolved data at the global scale. Here we investigated the diversity and novelty of biosynthetic gene clusters in the ocean by integrating around 10,000 microbial genomes from cultivated and single cells with more than 25,000 newly reconstructed draft genomes from more than 1,000 seawater samples. These efforts revealed approximately 40,000 putative mostly new biosynthetic gene clusters, several of which were found in previously unsuspected phylogenetic groups. Among these groups, we identified a lineage rich in biosynthetic gene clusters ('Candidatus Eudoremicrobiaceae') that belongs to an uncultivated bacterial phylum and includes some of the most biosynthetically diverse microorganisms in this environment. From these, we characterized the phospeptin and pythonamide pathways, revealing cases of unusual bioactive compound structure and enzymology, respectively. Together, this research demonstrates how microbiomics-driven strategies can enable the investigation of previously undescribed enzymes and natural products in underexplored microbial groups and environments.

Subject(s)

Biosynthetic Pathways , Microbiota , Oceans and Seas , Bacteria/classification , Bacteria/genetics , Biosynthetic Pathways/genetics , Genomics , Microbiota/genetics , Multigene Family/genetics , Phylogeny

ISA API: An open platform for interoperable life science experimental metadata.

Johnson, David; Batista, Dominique; Cochrane, Keeva; Davey, Robert P; Etuk, Anthony; Gonzalez-Beltran, Alejandra; Haug, Kenneth; Izzo, Massimiliano; Larralde, Martin; Lawson, Thomas N; Minotto, Alice; Moreno, Pablo; Nainala, Venkata Chandrasekhar; O'Donovan, Claire; Pireddu, Luca; Roger, Pierrick; Shaw, Felix; Steinbeck, Christoph; Weber, Ralf J M; Sansone, Susanna-Assunta; Rocca-Serra, Philippe.

Gigascience ; 10(9), 2021 09 16.

Article in English | MEDLINE | ID: mdl-34528664

ABSTRACT

BACKGROUND: The Investigation/Study/Assay (ISA) Metadata Framework is an established and widely used set of open source community specifications and software tools for enabling discovery, exchange, and publication of metadata from experiments in the life sciences. The original ISA software suite provided a set of user-facing Java tools for creating and manipulating the information structured in ISA-Tab, a now widely used tabular format. To make the ISA framework more accessible to machines and enable programmatic manipulation of experiment metadata, the JSON serialization ISA-JSON was developed. RESULTS: In this work, we present the ISA API, a Python library for the creation, editing, parsing, and validating of ISA-Tab and ISA-JSON formats by using a common data model engineered as Python object classes. We describe the ISA API feature set, early adopters, and its growing user community. CONCLUSIONS: The ISA API provides users with rich programmatic metadata-handling functionality to support automation, a common interface, and an interoperable medium between the 2 ISA formats, as well as with other life science data formats required for depositing data in public databases.

Subject(s)

Biological Science Disciplines , Metadata , Databases, Factual , Software

nmrML: A Community Supported Open Data Standard for the Description, Storage, and Exchange of NMR Data.

Schober, Daniel; Jacob, Daniel; Wilson, Michael; Cruz, Joseph A; Marcu, Ana; Grant, Jason R; Moing, Annick; Deborde, Catherine; de Figueiredo, Luis F; Haug, Kenneth; Rocca-Serra, Philippe; Easton, John; Ebbels, Timothy M D; Hao, Jie; Ludwig, Christian; Günther, Ulrich L; Rosato, Antonio; Klein, Matthias S; Lewis, Ian A; Luchinat, Claudio; Jones, Andrew R; Grauslys, Arturas; Larralde, Martin; Yokochi, Masashi; Kobayashi, Naohiro; Porzel, Andrea; Griffin, Julian L; Viant, Mark R; Wishart, David S; Steinbeck, Christoph; Salek, Reza M; Neumann, Steffen.

Anal Chem ; 90(1): 649-656, 2018 01 02.

Article in English | MEDLINE | ID: mdl-29035042

ABSTRACT

NMR is a widely used analytical technique with a growing number of repositories available. As a result, demands for a vendor-agnostic, open data format for long-term archiving of NMR data have emerged with the aim to ease and encourage sharing, comparison, and reuse of NMR data. Here we present nmrML, an open XML-based exchange and storage format for NMR spectral data. The nmrML format is intended to be fully compatible with existing NMR data for chemical, biochemical, and metabolomics experiments. nmrML can capture raw NMR data, spectral data acquisition parameters, and where available spectral metadata, such as chemical structures associated with spectral assignments. The nmrML format is compatible with pure-compound NMR data for reference spectral libraries as well as NMR data from complex biomixtures, i.e., metabolomics experiments. To facilitate format conversions, we provide nmrML converters for Bruker, JEOL and Agilent/Varian vendor formats. In addition, easy-to-use Web-based spectral viewing, processing, and spectral assignment tools that read and write nmrML have been developed. Software libraries and Web services for data validation are available for tool developers and end-users. The nmrML format has already been adopted for capturing and disseminating NMR data for small molecules by several open source data processing tools and metabolomics reference spectral libraries, e.g., serving as storage format for the MetaboLights data repository. The nmrML open access data standard has been endorsed by the Metabolomics Standards Initiative (MSI), and we here encourage user participation and feedback to increase usability and make it a successful standard.

Subject(s)

Databases, Chemical/standards , Magnetic Resonance Spectroscopy/statistics & numerical data , Metabolomics/methods , Software

mzML2ISA & nmrML2ISA: generating enriched ISA-Tab metadata files from metabolomics XML data.

Larralde, Martin; Lawson, Thomas N; Weber, Ralf J M; Moreno, Pablo; Haug, Kenneth; Rocca-Serra, Philippe; Viant, Mark R; Steinbeck, Christoph; Salek, Reza M.

Bioinformatics ; 33(16): 2598-2600, 2017 Aug 15.

Article in English | MEDLINE | ID: mdl-28402395

ABSTRACT

SUMMARY: Submission to the MetaboLights repository for metabolomics data currently places the burden of reporting instrument and acquisition parameters in ISA-Tab format on users, who have to do it manually, a process that is time consuming and prone to user input error. Since the large majority of these parameters are embedded in instrument raw data files, an opportunity exists to capture this metadata more accurately. Here we report a set of Python packages that can automatically generate ISA-Tab metadata file stubs from raw XML metabolomics data files. The parsing packages are separated into mzML2ISA (encompassing mzML and imzML formats) and nmrML2ISA (nmrML format only). Overall, the use of mzML2ISA & nmrML2ISA reduces the time needed to capture metadata substantially (capturing 90% of metadata on assay and sample levels), is much less prone to user input errors, improves compliance with minimum information reporting guidelines and facilitates more finely grained data exploration and querying of datasets. AVAILABILITY AND IMPLEMENTATION: mzML2ISA & nmrML2ISA are available under version 3 of the GNU General Public Licence at https://github.com/ISA-tools. Documentation is available from http://2isa.readthedocs.io/en/latest/. CONTACT: reza.salek@ebi.ac.uk or isatools@googlegroups.com. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Information Storage and Retrieval , Metabolomics/methods , Metadata , Software , Data Mining/methods
