Results 1 - 20 of 23
1.
Sci Rep ; 13(1): 9748, 2023 06 16.
Article in English | MEDLINE | ID: mdl-37328502

ABSTRACT

Increased global production of sorghum has the potential to meet many of the demands of a growing human population. Developing automation technologies for field scouting is crucial for long-term and low-cost production. Since 2013, sugarcane aphid (SCA) Melanaphis sacchari (Zehntner) has become an important economic pest causing significant yield loss across the sorghum production region in the United States. Adequate management of SCA depends on costly field scouting to determine pest presence and economic threshold levels to spray insecticides. However, because insecticides also affect natural enemies, there is an urgent need to develop automated detection technologies that support their conservation. Natural enemies play a crucial role in the management of SCA populations. These insects, primarily coccinellids, prey on SCA and help to reduce unnecessary insecticide applications. Although these insects help regulate SCA populations, detecting and classifying them during field scouting is time-consuming and inefficient in lower value crops like sorghum. Advanced deep learning software provides a means to automate laborious agricultural tasks, including detection and classification of insects. However, deep learning models for coccinellids in sorghum have not been developed. Therefore, our objective was to develop and train machine learning models to detect coccinellids commonly found in sorghum and classify them at the genus, species, and subfamily levels. We trained a two-stage object detection model, specifically, Faster Region-based Convolutional Neural Network (Faster R-CNN) with the Feature Pyramid Network (FPN) and also one-stage detection models in the YOLO (You Only Look Once) family (YOLOv5 and YOLOv7) to detect and classify seven coccinellids commonly found in sorghum (i.e., Coccinella septempunctata, Coleomegilla maculata, Cycloneda sanguinea, Harmonia axyridis, Hippodamia convergens, Olla v-nigrum, Scymninae). We used images extracted from the iNaturalist project to perform training and evaluation of the Faster R-CNN-FPN, YOLOv5, and YOLOv7 models. iNaturalist is a web platform on which citizens publish image observations of living organisms. Experimental evaluation using standard object detection metrics, such as average precision (AP) and AP@0.50, has shown that the YOLOv7 model performs the best on the coccinellid images, with an AP@0.50 as high as 97.3 and an AP as high as 74.6. Our research contributes automated deep learning software to the area of integrated pest management, making it easier to detect natural enemies in sorghum.
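The AP@0.50 metric reported above counts a predicted box as correct only when its intersection over union (IoU) with a ground-truth box is at least 0.50. A minimal sketch of that overlap test follows; the function names and the `(x_min, y_min, x_max, y_max)` box layout are assumptions made for illustration and are not taken from the study, and a full AP computation would additionally match classes, rank detections by confidence, and integrate the precision-recall curve.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    # Corners of the overlapping rectangle (if any).
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, truth_box, threshold=0.50):
    """Hypothetical helper: overlap test used by AP@0.50-style metrics."""
    return iou(pred_box, truth_box) >= threshold
```

For example, two 2x2 boxes offset by one unit in each direction overlap in a 1x1 square, so their IoU is 1/7 and the prediction would not count as a match at the 0.50 threshold.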


Subjects
Aphids, Beetles, Deep Learning, Insecticides, Saccharum, Sorghum, Animals, Humans, Edible Grain, Agricultural Crops
2.
Glycobiology ; 33(5): 411-422, 2023 06 03.
Article in English | MEDLINE | ID: mdl-37067908

ABSTRACT

Protein N-linked glycosylation is an important post-translational mechanism in Homo sapiens, playing essential roles in many vital biological processes. It occurs at the N-X-[S/T] sequon in amino acid sequences, where X can be any amino acid except proline. However, not all N-X-[S/T] sequons are glycosylated; thus, the N-X-[S/T] sequon is a necessary but not sufficient determinant for protein glycosylation. In this regard, computational prediction of N-linked glycosylation sites confined to N-X-[S/T] sequons is an important problem that has not been extensively addressed by the existing methods, especially with regard to the creation of negative sets and leveraging the distilled information from protein language models (pLMs). Here, we developed LMNglyPred, a deep learning-based approach, to predict N-linked glycosylated sites in human proteins using embeddings from a pre-trained pLM. On an independent benchmark test set, LMNglyPred achieves sensitivity, specificity, precision, and accuracy of 76.50%, 75.36%, 60.99%, and 75.74%, respectively, and a Matthews correlation coefficient of 0.49. These results demonstrate that LMNglyPred is a robust computational tool to predict N-linked glycosylation sites confined to the N-X-[S/T] sequon.
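The N-X-[S/T] rule described in this abstract can be illustrated with a short scan that lists candidate sequons in a protein sequence. This is only a sketch of the candidate-finding step, not the LMNglyPred method itself; the function name and the example sequence are hypothetical, and deciding which candidates are actually glycosylated is the prediction problem the abstract addresses.

```python
import re

# N-X-[S/T] sequon: Asn, then any residue except Pro, then Ser or Thr.
# The zero-width lookahead lets overlapping sequons all be reported.
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_sequons(sequence: str) -> list[int]:
    """Return 0-based positions of the Asn in every N-X-[S/T] sequon."""
    return [m.start() for m in SEQUON.finditer(sequence.upper())]
```

On the made-up sequence `MNGSANPTNNTS`, the scan reports positions 1, 8, and 9: `NGS`, `NNT`, and `NTS` qualify, while `NPT` is excluded because X is proline.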


Subjects
Amino Acids, Glycoproteins, Humans, Glycosylation, Glycoproteins/metabolism, Amino Acids/chemistry, Post-Translational Protein Processing, Amino Acid Sequence
3.
Front Plant Sci ; 14: 1133115, 2023.
Article in English | MEDLINE | ID: mdl-36968399

ABSTRACT

Chalk, an undesirable grain quality trait in rice, is primarily formed due to high temperatures during the grain-filling process. Owing to the disordered starch granule structure, air spaces, and low amylose content, chalky grains break easily during milling, thereby lowering head rice recovery and market price. The availability of multiple QTLs associated with grain chalkiness and related attributes provided us an opportunity to perform a meta-analysis and identify candidate genes and their alleles contributing to enhanced grain quality. From the 403 previously reported QTLs, 64 meta-QTLs (MQTLs) encompassing 5262 non-redundant genes were identified. MQTL analysis reduced the genetic and physical intervals, and nearly 73% of the meta-QTLs were narrower than 5 cM and 2 Mb, revealing the hotspot genomic regions. By investigating expression profiles of the 5262 genes in previously published datasets, 49 candidate genes were shortlisted on the basis of their differential regulation in at least two of the datasets. We identified non-synonymous allelic variations and haplotypes in 39 candidate genes across the 3K rice genome panel. Further, we phenotyped a subset panel of 60 rice accessions by exposing them to high temperature stress under natural field conditions over two Rabi cropping seasons. Haplo-pheno analysis uncovered haplotype combinations of two starch synthesis genes, GBSSI and SSIIa, significantly contributing towards the formation of grain chalk in rice. We therefore report not only markers and pre-breeding material but also propose superior haplotype combinations, which can be introduced using either marker-assisted breeding or CRISPR-Cas based prime editing to generate elite rice varieties with low grain chalkiness and high HRY traits.

4.
Plant Methods ; 18(1): 9, 2022 Jan 22.
Article in English | MEDLINE | ID: mdl-35065667

ABSTRACT

BACKGROUND: Rice is a major staple food crop for more than half the world's population. As the global population is expected to reach 9.7 billion by 2050, increasing the production of high-quality rice is needed to meet the anticipated increased demand. However, global environmental changes, especially increasing temperatures, can affect grain yield and quality. Heat stress is one of the major causes of an increased proportion of chalkiness in rice, which compromises quality and reduces the market value. Researchers have identified 140 quantitative trait loci linked to chalkiness mapped across 12 chromosomes of the rice genome. However, the available genetic information has not been adequately exploited due to the lack of a reliable, rapid, and high-throughput phenotyping tool to capture chalkiness. To derive extensive benefit from the genetic progress achieved, tools that facilitate high-throughput phenotyping of rice chalkiness are needed. RESULTS: We use a fully automated approach based on convolutional neural networks (CNNs) and Gradient-weighted Class Activation Mapping (Grad-CAM) to detect chalkiness in rice grain images. Specifically, we train a CNN model to distinguish between chalky and non-chalky grains and subsequently use Grad-CAM to identify the area of a grain that is indicative of the chalky class. The area identified by the Grad-CAM approach takes the form of a smooth heatmap that can be used to quantify the degree of chalkiness. Experimental results on both polished and unpolished rice grains using standard instance classification and segmentation metrics have shown that Grad-CAM can accurately identify chalky grains and detect the chalkiness area. CONCLUSIONS: We have successfully demonstrated the application of a Grad-CAM based tool to accurately capture chalkiness in rice induced by high night temperature. The trained models will be made publicly available. They are easy to use, scalable, and can be readily incorporated into ongoing rice breeding programs, without rice researchers requiring computer science or machine learning expertise.

5.
Molecules ; 26(23)2021 Dec 02.
Article in English | MEDLINE | ID: mdl-34885895

ABSTRACT

Protein N-linked glycosylation is a post-translational modification that plays an important role in a myriad of biological processes. Computational prediction approaches serve as complementary methods for the characterization of glycosylation sites. Most of the existing predictors for N-linked glycosylation utilize the information that the glycosylation site occurs at the N-X-[S/T] sequon, where X is any amino acid except proline. Not all N-X-[S/T] sequons are glycosylated; thus, the N-X-[S/T] sequon is a necessary but not sufficient determinant for protein glycosylation. In that regard, computational prediction of N-linked glycosylation sites confined to N-X-[S/T] sequons is an important problem. Here, we report DeepNGlyPred, a deep learning-based approach that encodes the positive and negative sequences in the human proteome dataset (extracted from N-GlycositeAtlas) using sequence-based features (gapped-dipeptide), predicted structural features, and evolutionary information. DeepNGlyPred produces sensitivity (SN), specificity (SP), Matthews correlation coefficient (MCC), and accuracy (ACC) of 88.62%, 73.92%, 0.60, and 79.41%, respectively, on the N-GlyDE independent test set, which is better than the compared approaches. These results demonstrate that DeepNGlyPred is a robust computational technique to predict N-linked glycosylation sites confined to the N-X-[S/T] sequon. DeepNGlyPred will be a useful resource for the glycobiology community.


Subjects
Proteome/chemistry, Deep Learning, Glycosylation, Humans, Biological Models, Neural Networks (Computer), Polysaccharides/analysis, Post-Translational Protein Processing
6.
Front Cell Dev Biol ; 9: 662983, 2021.
Article in English | MEDLINE | ID: mdl-34249915

ABSTRACT

Phosphorylation, which is mediated by protein kinases and opposed by protein phosphatases, is an important post-translational modification that regulates many cellular processes, including cellular metabolism, cell migration, and cell division. Due to its essential role in cellular physiology, a great deal of attention has been devoted to identifying sites of phosphorylation on cellular proteins and understanding how modification of these sites affects their cellular functions. This has led to the development of several computational methods designed to predict sites of phosphorylation based on a protein's primary amino acid sequence. In contrast, much less attention has been paid to dephosphorylation and its role in regulating the phosphorylation status of proteins inside cells. Indeed, to date, dephosphorylation site prediction tools have been restricted to a few tyrosine phosphatases. To fill this knowledge gap, we have employed a transfer learning strategy to develop a deep learning-based model to predict sites that are likely to be dephosphorylated. Based on independent test results, our model, which we termed DTL-DephosSite, achieved sensitivity (SN), specificity (SP), and Matthews correlation coefficient (MCC) of 84%, 84%, and 0.68, respectively, for phosphoserine/phosphothreonine residues. Similarly, DTL-DephosSite achieved SN, SP, and MCC of 75%, 88%, and 0.64, respectively, for phosphotyrosine residues.
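SN, SP, and MCC, as reported in this abstract, are all derived from the four cells of a binary confusion matrix. The sketch below uses the standard definitions; the function name is hypothetical, and the counts in the usage note are invented for illustration (a balanced set of 100 positives and 100 negatives), not the study's actual test set.

```python
import math


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Sensitivity, specificity, accuracy, and MCC from confusion-matrix counts."""

    def ratio(num: float, den: float) -> float:
        # Return 0.0 instead of raising when a class is absent.
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sn": ratio(tp, tp + fn),              # true positive rate
        "sp": ratio(tn, tn + fp),              # true negative rate
        "acc": ratio(tp + tn, tp + fp + tn + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }
```

With the invented counts `tp=84, fn=16, tn=84, fp=16`, the function returns SN = 0.84, SP = 0.84, and MCC = 0.68. On a balanced set, MCC equals SN + SP - 1, which is why 84% sensitivity and specificity correspond to an MCC of 0.68 under that assumption.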

7.
Plant Physiol ; 186(3): 1562-1579, 2021 07 06.
Article in English | MEDLINE | ID: mdl-33856488

ABSTRACT

Stomatal density (SD) and stomatal complex area (SCA) are important traits that regulate gas exchange and abiotic stress response in plants. Despite sorghum (Sorghum bicolor) adaptation to arid conditions, the genetic potential of stomata-related traits remains unexplored due to challenges in available phenotyping methods. Hence, identifying loci that control stomatal traits is fundamental to designing strategies to breed sorghum with optimized stomatal regulation. We implemented both classical and deep learning methods to characterize genetic diversity in 311 grain sorghum accessions for stomatal traits at two different field environments. Nearly 12,000 images collected from abaxial (Ab) and adaxial (Ad) leaf surfaces revealed substantial variation in stomatal traits. Our study demonstrated significant accuracy between manual and deep learning methods in predicting SD and SCA. In sorghum, SD was 32%-39% greater on the Ab versus the Ad surface, while SCA on the Ab surface was 2%-5% smaller than on the Ad surface. Genome-Wide Association Study identified 71 genetic loci (38 were environment-specific) with significant genotype to phenotype associations for stomatal traits. Putative causal genes underlying the phenotypic variation were identified. Accessions with similar SCA but carrying contrasting haplotypes for SD were tested for stomatal conductance and carbon assimilation under field conditions. Our findings provide a foundation for further studies on the genetic and molecular mechanisms controlling stomata patterning and regulation in sorghum. An integrated physiological, deep learning, and genomic approach allowed us to unravel the genetic control of natural variation in stomata traits in sorghum, which can be applied to other plants.


Subjects
Genome-Wide Association Study, Genotype, Phenotype, Plant Stomata/growth & development, Plant Stomata/genetics, Sorghum/growth & development, Sorghum/genetics, Deep Learning, Edible Grain/genetics, Edible Grain/growth & development, Developmental Gene Expression Regulation, Plant Gene Expression Regulation, Plant Genes, Genetic Variation, Plant Leaves
8.
JCO Clin Cancer Inform ; 5: 239-251, 2021 03.
Article in English | MEDLINE | ID: mdl-33656914

ABSTRACT

PURPOSE: Children with acute lymphoblastic leukemia (ALL) are treated according to risk-based protocols defined by the Children's Oncology Group (COG). Alignment between real-world clinical practice and protocol milestones is not widely understood. Aggregate deidentified electronic health record (EHR) data offer a useful resource to evaluate real-world clinical practice. METHODS: A cohort of children with ALL was identified in the Cerner Health Facts deidentified aggregate EHR data. Manual review identified candidate procedural milestones. Automated methods were developed to classify likely standard-risk precursor B-cell ALL patients. Milestone procedures were adjusted relative to initiation of therapy and then aligned to the COG protocols for standard induction therapy. RESULTS: We identified 7,728 patients with pediatric ALL with 188,187 encounters. Records for lumbar punctures (LP) and bone marrow biopsies were frequently present in the data and were appropriate targets to evaluate guideline performance. Alluvial graph analysis of 14 health systems indicated that none of the systems have data from all three COG-recommended lumbar procedures for all patients but alignment demonstrated that most systems test at the recommended times. CONCLUSION: Source-system variation introduces inconsistency and incompleteness into aggregate EHR data. Data visualization was helpful in characterizing and interpreting the data. Health systems with patients meeting the inclusion criteria demonstrated strong alignment with the recommended milestones for LP. Large-scale aggregate EHR data are useful to evaluate alignment of recommended versus actual clinical milestones in support of treating children with ALL. This work can inform other guideline and protocol driven care.


Subjects
Electronic Health Records, Leukemia, Child, Cohort Studies, Humans, Standard of Care
9.
PLoS One ; 13(2): e0191362, 2018.
Article in English | MEDLINE | ID: mdl-29389941

ABSTRACT

Escherichia coli O103, harbored in the hindgut and shed in the feces of cattle, can be enterohemorrhagic (EHEC), enteropathogenic (EPEC), or putative non-pathotype. The genetic diversity within the O103 serogroup, particularly that of virulence gene profiles, is likely to be broad, considering the wide range in severity of illness. However, virulence descriptions of the E. coli O103 strains isolated from cattle feces have been primarily limited to major genes, such as Shiga toxin and intimin genes. Less is known about the frequency at which other virulence genes exist or about genes associated with the mobile genetic elements of E. coli O103 pathotypes. Our objective was to utilize whole genome sequencing (WGS) to identify and compare major and putative virulence genes of EHEC O103 (positive for Shiga toxin gene, stx1, and intimin gene, eae; n = 43), EPEC O103 (negative for stx1 and positive for eae; n = 13), and putative non-pathotype O103 strains (negative for stx and eae; n = 13) isolated from cattle feces. Six strains of EHEC O103 from human clinical cases were also included. All bovine EHEC strains (43/43) and a majority of EPEC (12/13) and putative non-pathotype strains (12/13) were of the O103:H2 serotype. Both bovine and human EHEC strains had significantly larger average genome sizes (P < 0.0001) and were positive for a higher number of adherence and toxin-based virulence genes and genes on mobile elements (prophages, transposable elements, and plasmids) than EPEC or putative non-pathotype strains. The genome size of the three pathotypes positively correlated (R² = 0.7) with the number of genes carried on mobile genetic elements. Bovine strains clustered phylogenetically by pathotypes, which differed in several key virulence genes. The diversity of E. coli O103 pathotypes shed in cattle feces likely reflects the acquisition or loss of virulence genes carried on mobile genetic elements.


Subjects
Escherichia coli Infections/microbiology, Escherichia coli Proteins/genetics, Escherichia coli/genetics, Feces/microbiology, Genomics/methods, Interspersed Repetitive Sequences, Virulence Factors/genetics, Animals, Cattle, Escherichia coli/classification, Escherichia coli/isolation & purification, Escherichia coli/pathogenicity, Genetic Variation, Humans, Phylogeny
10.
Genome Announc ; 5(21)2017 May 25.
Article in English | MEDLINE | ID: mdl-28546486

ABSTRACT

The enteropathogenic Escherichia coli (EPEC) pathotype represents a minor proportion of the E. coli O103 strains shed in the feces of feedlot cattle. The draft genome sequences of 13 strains of EPEC O103 are reported here. The availability of the genome sequences will help in the assessment of the genetic diversity and virulence potential of bovine EPEC O103.

11.
Genome Announc ; 5(19)2017 May 11.
Article in English | MEDLINE | ID: mdl-28495758

ABSTRACT

The enterohemorrhagic pathotype represents a minor proportion of the Escherichia coli O103 strains shed in the feces of cattle. We report here the genome sequences of 43 strains of enterohemorrhagic E. coli (EHEC) O103:H2 isolated from feedlot cattle feces. The genomic analysis will provide information on the genetic diversity and virulence potential of bovine EHEC O103.

12.
IEEE Trans Nanobioscience ; 15(2): 84-92, 2016 03.
Article in English | MEDLINE | ID: mdl-26863669

ABSTRACT

Machine learning algorithms are widely used to annotate biological sequences. Low-dimensional informative feature vectors can be crucial for the performance of the algorithms. In prior work, we proposed the use of a community detection approach to construct low-dimensional feature sets for nucleotide sequence classification. Our approach used the Hamming distance between short nucleotide subsequences, called k-mers, to construct a network, and subsequently used community detection to identify groups of k-mers that appear frequently in a set of sequences. Whereas this approach worked well for nucleotide sequence classification, it could not be directly used for protein sequences, as the Hamming distance is not a good measure for comparing short protein k-mers. To address this limitation, we extended our prior approach by replacing the Hamming distance with substitution scores. Experimental results in different learning scenarios show that the features generated with the new approach are more informative than k-mers.
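The network-construction step mentioned in this abstract (linking k-mers by Hamming distance) can be sketched as follows. The function names, the `max_dist` threshold, and the edge-list representation are assumptions made for illustration; the abstract does not state the threshold or data structures used, and the subsequent community detection step is not shown.

```python
from itertools import combinations


def kmers(sequence: str, k: int) -> set[str]:
    """All distinct length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def kmer_network(sequences: list[str], k: int, max_dist: int = 1) -> list[tuple[str, str]]:
    """Edges between k-mers whose Hamming distance is at most max_dist (assumed threshold)."""
    nodes = sorted(set().union(*(kmers(s, k) for s in sequences)))
    return [(a, b) for a, b in combinations(nodes, 2) if hamming(a, b) <= max_dist]
```

For protein sequences, the abstract reports replacing the Hamming distance with substitution scores; in this sketch that would amount to swapping `hamming` for a score-based comparison.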


Subjects
Algorithms, Computational Biology/methods, Proteins/chemistry, Proteins/classification, Supervised Machine Learning
13.
IEEE Trans Nanobioscience ; 15(2): 75-83, 2016 03.
Article in English | MEDLINE | ID: mdl-26849871

ABSTRACT

Supervised classifiers are highly dependent on abundant labeled training data. Alternatives for addressing the lack of labeled data include: labeling data (but this is costly and time consuming); training classifiers with abundant data from another domain (however, the classification accuracy usually decreases as the distance between domains increases); or complementing the limited labeled data with abundant unlabeled data from the same domain and learning semi-supervised classifiers (but the unlabeled data can mislead the classifier). A better alternative is to train classifiers in a domain adaptation setting, using the abundant labeled data from a source domain together with the limited labeled data and, optionally, the unlabeled data from the target domain. We propose two such classifiers, based on logistic regression, and evaluate them for the task of splice site prediction, a difficult and essential step in gene prediction. Our classifiers achieved high accuracy, with highest areas under the precision-recall curve between 50.83% and 82.61%.


Subjects
Algorithms, Computational Biology/methods, High-Throughput Nucleotide Sequencing/methods, Logistic Models, RNA Splicing/genetics, DNA Sequence Analysis/methods, Animals, Area Under Curve, Statistical Models, ROC Curve
14.
BMC Syst Biol ; 9 Suppl 5: S1, 2015.
Article in English | MEDLINE | ID: mdl-26356316

ABSTRACT

BACKGROUND: Recent biochemical advances have led to inexpensive, time-efficient production of massive volumes of raw genomic data. Traditional machine learning approaches to genome annotation typically rely on large amounts of labeled data. The process of labeling data can be expensive, as it requires domain knowledge and expert involvement. Semi-supervised learning approaches that can make use of unlabeled data, in addition to small amounts of labeled data, can help reduce the costs associated with labeling. In this context, we focus on the problem of predicting splice sites in a genome using semi-supervised learning approaches. This is a challenging problem, due to the highly imbalanced distribution of the data, i.e., small number of splice sites as compared to the number of non-splice sites. To address this challenge, we propose to use ensembles of semi-supervised classifiers, specifically self-training and co-training classifiers. RESULTS: Our experiments on five highly imbalanced splice site datasets, with positive to negative ratios of 1-to-99, showed that the ensemble-based semi-supervised approaches represent a good choice, even when the amount of labeled data consists of less than 1% of all training data. In particular, we found that ensembles of co-training and self-training classifiers that dynamically balance the set of labeled instances during the semi-supervised iterations show improvements over the corresponding supervised ensemble baselines. CONCLUSIONS: In the presence of limited amounts of labeled data, ensemble-based semi-supervised approaches can successfully leverage the unlabeled data to enhance supervised ensembles learned from highly imbalanced data distributions. Given that such distributions are common for many biological sequence classification problems, our work can be seen as a stepping stone towards more sophisticated ensemble-based approaches to biological sequence annotation in a semi-supervised framework.


Subjects
Genomics/methods, Molecular Sequence Annotation/methods, Supervised Machine Learning, Algorithms, Genetic Databases, RNA Splicing/genetics
15.
Curr Biol ; 25(5): 613-20, 2015 Mar 02.
Article in English | MEDLINE | ID: mdl-25660540

ABSTRACT

Gall-forming arthropods are highly specialized herbivores that, in combination with their hosts, produce extended phenotypes with unique morphologies [1]. Many are economically important, and others have improved our understanding of ecology and adaptive radiation [2]. However, the mechanisms that these arthropods use to induce plant galls are poorly understood. We sequenced the genome of the Hessian fly (Mayetiola destructor; Diptera: Cecidomyiidae), a plant parasitic gall midge and a pest of wheat (Triticum spp.), with the aim of identifying genic modifications that contribute to its plant-parasitic lifestyle. Among several adaptive modifications, we discovered an expansive reservoir of potential effector proteins. Nearly 5% of the 20,163 predicted gene models matched putative effector gene transcripts present in the M. destructor larval salivary gland. Another 466 putative effectors were discovered among the genes that have no sequence similarities in other organisms. The largest known arthropod gene family (family SSGP-71) was also discovered within the effector reservoir. SSGP-71 proteins lack sequence homologies to other proteins, but their structures resemble both ubiquitin E3 ligases in plants and E3-ligase-mimicking effectors in plant pathogenic bacteria. SSGP-71 proteins and wheat Skp proteins interact in vivo. Mutations in different SSGP-71 genes avoid the effector-triggered immunity that is directed by the wheat resistance genes H6 and H9. Results point to effectors as the agents responsible for arthropod-induced plant gall formation.


Subjects
Chromosomes/genetics, Diptera/genetics, Multigene Family/genetics, Phylogeny, Plant Tumors/genetics, Triticum/parasitology, Biological Adaptation/genetics, Amino Acid Sequence, Animals, Base Sequence, Diptera/metabolism, Larva/metabolism, Genetic Models, Molecular Sequence Data, DNA Sequence Analysis, Sequence Homology, Animal Sexual Behavior/physiology, Two-Hybrid System Techniques, Ubiquitin-Protein Ligases/genetics
16.
J Proteome Res ; 10(4): 1505-18, 2011 Apr 01.
Article in English | MEDLINE | ID: mdl-21226539

ABSTRACT

The relationship between aphids and their host plants is thought to be functionally analogous to plant-pathogen interactions. Although virulence effector proteins that mediate plant defenses are well-characterized for pathogens such as bacteria, oomycetes, and nematodes, equivalent molecules in aphids and other phloem-feeders are poorly understood. A dual transcriptomic-proteomic approach was adopted to generate a catalog of candidate effector proteins from the salivary glands of the pea aphid, Acyrthosiphon pisum. Of the 1557 transcript supported and 925 mass spectrometry identified proteins, over 300 proteins were identified with secretion signals, including proteins that had previously been identified directly from the secreted saliva. Almost half of the identified proteins have no homologue outside aphids and are of unknown function. Many of the genes encoding the putative effector proteins appear to be evolving at a faster rate than homologues in other insects, and there is strong evidence that genes with multiple copies in the genome are under positive selection. Many of the candidate aphid effector proteins were previously characterized in typical phytopathogenic organisms (e.g., nematodes and fungi) and our results highlight remarkable similarities in the saliva from plant-feeding nematodes and aphids that may indicate the evolution of common solutions to the plant-parasitic lifestyle.


Subjects
Aphids/chemistry, Gene Expression Profiling, Insect Proteins/analysis, Proteome/analysis, Proteomics/methods, Saliva/chemistry, Amino Acid Sequence, Animals, Aphids/metabolism, Two-Dimensional Gel Electrophoresis, Expressed Sequence Tags, Insect Proteins/classification, Insect Proteins/genetics, Mass Spectrometry/methods, Molecular Sequence Data, Phylogeny, Protein Sorting Signals/genetics, Sequence Alignment
17.
BMC Bioinformatics ; 11 Suppl 8: S6, 2010 Oct 26.
Article in English | MEDLINE | ID: mdl-21034431

ABSTRACT

BACKGROUND: Determination of protein subcellular localization plays an important role in understanding protein function. Knowledge of the subcellular localization is also essential for genome annotation and drug discovery. Supervised machine learning methods for predicting the localization of a protein in a cell rely on the availability of large amounts of labeled data. However, because of the high cost and effort involved in labeling the data, the amount of labeled data is quite small compared to the amount of unlabeled data. Hence, there is a growing interest in developing semi-supervised methods for predicting protein subcellular localization from large amounts of unlabeled data together with small amounts of labeled data. RESULTS: In this paper, we present an Abstraction Augmented Markov Model (AAMM) based approach to the semi-supervised protein subcellular localization prediction problem. We investigate the effectiveness of AAMMs in exploiting unlabeled data. We compare semi-supervised AAMMs with: (i) Markov models (MMs), which do not take advantage of unlabeled data; (ii) an expectation maximization (EM) based approach to semi-supervised training of MMs; and (iii) a co-training based approach to semi-supervised training of MMs, both of which make use of unlabeled data. CONCLUSIONS: The results of our experiments on three protein subcellular localization data sets show that semi-supervised AAMMs: (i) can effectively exploit unlabeled data; (ii) are more accurate than both the MMs and the EM based semi-supervised MMs; and (iii) are comparable in performance to, and in some cases outperform, the co-training based semi-supervised MMs.


Subjects
Computational Biology/methods, Markov Chains, Proteins/classification, Proteins/metabolism, Subcellular Fractions/metabolism, Algorithms, Artificial Intelligence, Cluster Analysis, Protein Databases, Biological Models, Plant Proteins/chemistry, Plant Proteins/metabolism, Proteins/chemistry, Reproducibility of Results, Subcellular Fractions/chemistry
18.
Int J Data Min Bioinform ; 4(4): 411-30, 2010.
Article in English | MEDLINE | ID: mdl-20815140

ABSTRACT

Alternative splicing is a mechanism for generating different gene transcripts (called isoforms) from the same genomic sequence. In this paper, we explore the predictive power of a large set of diverse gene features that have been experimentally shown to have an effect on alternative splicing. We use such features to build support vector machine classifiers for predicting alternatively spliced exons. Experimental results show that classifiers built from the diverse set of features give better results than those that consider only basic sequence features. Furthermore, we use feature selection methods to identify the most informative features for the prediction problem at hand.


Subjects
Alternative Splicing, Artificial Intelligence, Computational Biology/methods, Exons, Protein Isoforms/genetics
19.
BMC Genomics ; 11: 463, 2010 Aug 06.
Article in English | MEDLINE | ID: mdl-20691076

ABSTRACT

BACKGROUND: Termites (Isoptera) are eusocial insects whose colonies consist of morphologically and behaviorally specialized castes of sterile workers and soldiers, and reproductive alates. Previous studies on eusocial insects have indicated that caste differentiation and behavior are underlain by differential gene expression. Although much is known about gene expression in the honey bee, Apis mellifera, termites remain relatively understudied in this regard. Therefore, our objective was to assemble an expressed sequence tag (EST) database for the eastern subterranean termite, Reticulitermes flavipes, for future gene expression studies. RESULTS: Soldier, worker, and alate caste and two larval cDNA libraries were constructed, and approximately 15,000 randomly chosen clones were sequenced to compile an EST database. Putative gene functions were assigned based on a BLASTX Swissprot search. Categorical in silico expression patterns for each library were compared using the R-statistic. A significant proportion of the ESTs of each caste and life stage had no significant similarity to those in existing databases. All cDNA libraries, including those of non-reproductive worker and soldier castes, contained sequences with putative reproductive functions. Genes that showed a potential expression bias among castes included a putative antibacterial humoral response and translation elongation protein in soldiers and a chemosensory protein in alates. CONCLUSIONS: We have expanded upon the available sequences for R. flavipes and utilized an in silico method to compare gene expression in different castes of a eusocial insect. The in silico analysis allowed us to identify several genes that may be differentially expressed and involved in caste differences. These include a gene overrepresented in the alate cDNA library with a predicted function of neurotransmitter secretion or cholesterol absorption and a gene predicted to be involved in protein biosynthesis and ligase activity that was overrepresented in the late larval stage cDNA library. The EST database and analyses reported here will be a valuable resource for future studies on the genomics of R. flavipes and other termites.


Subjects
Expressed Sequence Tags, Isoptera/genetics, Life Cycle Stages, Animals, Computational Biology, Nucleic Acid Databases, Insect Proteins/genetics, Isoptera/growth & development, Larva/genetics, DNA Sequence Analysis, Genetic Transcription
20.
Nucleic Acids Res ; 38(Database issue): D437-42, 2010 Jan.
Article in English | MEDLINE | ID: mdl-19820115

ABSTRACT

BeetleBase (http://www.beetlebase.org) has been updated to provide more comprehensive genomic information for the red flour beetle Tribolium castaneum. The database contains genomic sequence scaffolds mapped to 10 linkage groups (genome assembly release Tcas_3.0), genetic linkage maps, the official gene set, Reference Sequences from NCBI (RefSeq), predicted gene models, ESTs, and whole-genome tiling array data representing several developmental stages. The database was reconstructed using the upgraded Generic Model Organism Database (GMOD) modules. The genomic data are stored in a PostgreSQL relational database using the Chado schema and visualized as tracks in GBrowse. The updated genetic map is visualized using the comparative genetic map viewer CMAP. To enhance the database search capabilities, the BLAST and BLAT search tools have been integrated with the GMOD tools. BeetleBase serves as a long-term repository for Tribolium genomic data and is compatible with other model organism databases.


Subjects
Computational Biology/methods, Genetic Databases, Nucleic Acid Databases, Tribolium/genetics, Animals, Computational Biology/trends, Protein Databases, Expressed Sequence Tags, Genome, Genomics, Information Storage and Retrieval/methods, Internet, Genetic Models, Tertiary Protein Structure, Software
...