1.
Sci Data ; 11(1): 568, 2024 Jun 01.
Article in English | MEDLINE | ID: mdl-38824125

ABSTRACT

Technological advances in massively parallel sequencing have led to an exponential growth in the number of known protein sequences. Much of this growth originates from metagenomic projects producing new sequences from environmental and clinical samples. The Unified Human Gastrointestinal Proteome (UHGP) catalogue is one of the most relevant metagenomic datasets with applications ranging from medicine to biology. However, the low levels of sequence annotation may impair its usability. This work aims to produce a family classification of UHGP sequences to facilitate downstream structural and functional annotation. This is achieved through the release of the DPCfam-UHGP50 dataset containing 10,778 putative protein families generated using DPCfam clustering, an unsupervised pipeline grouping sequences into single or multi-domain architectures. DPCfam-UHGP50 considerably improves family coverage at protein and residue levels compared to the manually curated repository Pfam. In the hope that DPCfam-UHGP50 will foster future discoveries in the field of metagenomics of the human gut, we release a FAIR-compliant database of our results that is easily accessible via a searchable web server and Zenodo repository.


Subject(s)
Proteome , Humans , Gastrointestinal Tract/metabolism , Cluster Analysis , Molecular Sequence Annotation , Metagenomics , Databases, Protein
3.
PLoS Comput Biol ; 18(10): e1010610, 2022 10.
Article in English | MEDLINE | ID: mdl-36260616

ABSTRACT

Proteins that are known only at a sequence level outnumber those with an experimental characterization by orders of magnitude. Classifying protein regions (domains) into homologous families can generate testable functional hypotheses for as yet unannotated sequences. Existing domain family resources typically rely on at least some degree of manual curation: they grow slowly over time and leave a large fraction of the protein sequence space unclassified. Here we describe the automatic clustering, by Density Peak Clustering, of UniRef50 v. 2017_07, a protein sequence database including approximately 23M sequences. We performed a radical re-implementation of a pipeline we previously developed so that it can handle millions of sequences and data volumes of the order of 3 terabytes. The modified pipeline, which we call DPCfam, finds ∼45,000 protein clusters in UniRef50. Our automatic classification corresponds closely to those of the Pfam and ECOD resources: in particular, about 81% of medium-large Pfam families and 72% of ECOD families can be mapped to clusters generated by DPCfam. In addition, our protocol finds more than 14,000 clusters consisting of protein regions with no Pfam annotation, which are therefore candidates for representing novel protein families. These results are made available to the scientific community through a dedicated repository.


Subject(s)
Proteins , Databases, Protein , Proteins/genetics , Cluster Analysis , Amino Acid Sequence , Protein Domains
4.
Sci Data ; 6(1): 3, 2019 02 05.
Article in English | MEDLINE | ID: mdl-30723195

ABSTRACT

Following further analysis of the Majority Dataset (Data Citation 3, originally https://doi.org/10.23728/b2share.e344a8afef08463a855ada08aadbf352) and 100% Dataset (Data Citation 4, originally https://doi.org/10.23728/b2share.f1aa0f5ad38c456eaf7b04d47a65af53) presented in the original version of this Data Descriptor, it was revealed that a large number of duplicate images were included in both datasets. Both datasets have been corrected in updated versions, removing all replicates. The new version of the Majority Dataset (Data Citation 3) can be accessed via https://doi.org/10.23728/b2share.72758204db9044ab8b3e6b6c4d2eb576 and the 100% Dataset (Data Citation 4) via https://doi.org/10.23728/b2share.80df8606fcdb4b2bae1656f0dc6db8ba. The HTML and PDF versions of the Data Descriptor have been corrected accordingly.

5.
Sci Data ; 5: 180172, 2018 08 28.
Article in English | MEDLINE | ID: mdl-30152811

ABSTRACT

In this paper, we present the first publicly available human-annotated dataset of images obtained by scanning electron microscopy (SEM). A total of roughly 26,000 SEM images at the nanoscale are classified into 10 categories to form 4 labeled training sets suited for image recognition tasks. The selected categories span the range of 0D objects such as particles, 1D nanowires and fibres, 2D films and coated surfaces as well as patterned surfaces, and 3D structures such as microelectromechanical system (MEMS) devices and pillars. Additional categories such as tips and biological are also included to expand the spectrum of possible images. A preliminary degree of hierarchy is introduced by creating a subtree structure for the categories and populating them with the available images, wherever possible.

6.
Sci Rep ; 7(1): 13282, 2017 10 16.
Article in English | MEDLINE | ID: mdl-29038550

ABSTRACT

In this paper, we applied transfer learning techniques for image recognition, automatic categorization, and labeling of nanoscience images obtained by scanning electron microscopy (SEM). Roughly 20,000 SEM images were manually classified into 10 categories to form a labeled training set, which can be used as a reference set for future applications of deep-learning-enhanced algorithms in the nanoscience domain. The categories chosen spanned the range of 0-dimensional (0D) objects such as particles, 1D nanowires and fibres, 2D films and coated surfaces, and 3D patterned surfaces such as pillars. The training set was used to retrain and compare multiple convolutional neural network models (Inception-v3, Inception-v4, ResNet) on the SEM dataset. We obtained compatible results by performing a feature extraction of the different models on the same dataset. We performed additional analysis of the classifier on a second test set to further investigate the results, both on particular cases and from a statistical point of view. Our algorithm successfully classified around 90% of a test dataset consisting of SEM images, while reduced accuracy was found for images at the boundary between two categories or containing elements of multiple categories. In these cases, the image classification did not identify a predominant category with a high score. We used the statistical outcomes from testing to deploy a semi-automatic workflow able to classify and label images generated by the SEM. Finally, a separate training was performed to determine the volume fraction of coherently aligned nanowires in SEM images. The results were compared with those obtained using the Local Gradient Orientation method. This example demonstrates the versatility and the potential of transfer learning to address specific tasks of interest in nanoscience applications.
