Pesquisa | BVS IEC

1.

Propagation, detection and correction of errors using the sequence database network.

Goudey, Benjamin; Geard, Nicholas; Verspoor, Karin; Zobel, Justin.

Brief Bioinform ; 23(6)2022 11 19.

Artigo em Inglês | MEDLINE | ID: mdl-36266246

RESUMO

Nucleotide and protein sequences stored in public databases are the cornerstone of many bioinformatics analyses. The records containing these sequences are prone to a wide range of errors, including incorrect functional annotation, sequence contamination and taxonomic misclassification. One source of information that can help to detect errors are the strong interdependency between records. Novel sequences in one database draw their annotations from existing records, may generate new records in multiple other locations and will have varying degrees of similarity with existing records across a range of attributes. A network perspective of these relationships between sequence records, within and across databases, offers new opportunities to detect-or even correct-erroneous entries and more broadly to make inferences about record quality. Here, we describe this novel perspective of sequence database records as a rich network, which we call the sequence database network, and illustrate the opportunities this perspective offers for quantification of database quality and detection of spurious entries. We provide an overview of the relevant databases and describe how the interdependencies between sequence records across these databases can be exploited by network analyses. We review the process of sequence annotation and provide a classification of sources of error, highlighting propagation as a major source. We illustrate the value of a network perspective through three case studies that use network analysis to detect errors, and explore the quality and quantity of critical relationships that would inform such network analyses. This systematic description of a network perspective of sequence database records provides a novel direction to combat the proliferation of errors within these critical bioinformatics resources.

Assuntos

Biologia Computacional , Bases de Dados de Ácidos Nucleicos , Sequência de Aminoácidos

2.

Exploring automatic inconsistency detection for literature-based gene ontology annotation.

Chen, Jiyu; Goudey, Benjamin; Zobel, Justin; Geard, Nicholas; Verspoor, Karin.

Bioinformatics ; 38(Suppl 1): i273-i281, 2022 06 24.

Artigo em Inglês | MEDLINE | ID: mdl-35758780

RESUMO

MOTIVATION: Literature-based gene ontology annotations (GOA) are biological database records that use controlled vocabulary to uniformly represent gene function information that is described in the primary literature. Assurance of the quality of GOA is crucial for supporting biological research. However, a range of different kinds of inconsistencies in between literature as evidence and annotated GO terms can be identified; these have not been systematically studied at record level. The existing manual-curation approach to GOA consistency assurance is inefficient and is unable to keep pace with the rate of updates to gene function knowledge. Automatic tools are therefore needed to assist with GOA consistency assurance. This article presents an exploration of different GOA inconsistencies and an early feasibility study of automatic inconsistency detection. RESULTS: We have created a reliable synthetic dataset to simulate four realistic types of GOA inconsistency in biological databases. Three automatic approaches are proposed. They provide reasonable performance on the task of distinguishing the four types of inconsistency and are directly applicable to detect inconsistencies in real-world GOA database records. Major challenges resulting from such inconsistencies in the context of several specific application settings are reported. This is the first study to introduce automatic approaches that are designed to address the challenges in current GOA quality assurance workflows. The data underlying this article are available in Github at https://github.com/jiyuc/AutoGOAConsistency.

Assuntos

Publicações , Ontologia Genética , Anotação de Sequência Molecular

3.

Automatic consistency assurance for literature-based gene ontology annotation.

Chen, Jiyu; Geard, Nicholas; Zobel, Justin; Verspoor, Karin.

BMC Bioinformatics ; 22(1): 565, 2021 Nov 25.

Artigo em Inglês | MEDLINE | ID: mdl-34823464

RESUMO

BACKGROUND: Literature-based gene ontology (GO) annotation is a process where expert curators use uniform expressions to describe gene functions reported in research papers, creating computable representations of information about biological systems. Manual assurance of consistency between GO annotations and the associated evidence texts identified by expert curators is reliable but time-consuming, and is infeasible in the context of rapidly growing biological literature. A key challenge is maintaining consistency of existing GO annotations as new studies are published and the GO vocabulary is updated. RESULTS: In this work, we introduce a formalisation of biological database annotation inconsistencies, identifying four distinct types of inconsistency. We propose a novel and efficient method using state-of-the-art text mining models to automatically distinguish between consistent GO annotation and the different types of inconsistent GO annotation. We evaluate this method using a synthetic dataset generated by directed manipulation of instances in an existing corpus, BC4GO. We provide detailed error analysis for demonstrating that the method achieves high precision on more confident predictions. CONCLUSIONS: Two models built using our method for distinct annotation consistency identification tasks achieved high precision and were robust to updates in the GO vocabulary. Our approach demonstrates clear value for human-in-the-loop curation scenarios.

Assuntos

Biologia Computacional , Mineração de Dados , Bases de Dados de Proteínas , Ontologia Genética , Humanos , Anotação de Sequência Molecular

4.

GeneMates: an R package for detecting horizontal gene co-transfer between bacteria using gene-gene associations controlled for population structure.

Wan, Yu; Wick, Ryan R; Zobel, Justin; Ingle, Danielle J; Inouye, Michael; Holt, Kathryn E.

BMC Genomics ; 21(1): 658, 2020 Sep 24.

Artigo em Inglês | MEDLINE | ID: mdl-32972363

RESUMO

BACKGROUND: Horizontal gene transfer contributes to bacterial evolution through mobilising genes across various taxonomical boundaries. It is frequently mediated by mobile genetic elements (MGEs), which may capture, maintain, and rearrange mobile genes and co-mobilise them between bacteria, causing horizontal gene co-transfer (HGcoT). This physical linkage between mobile genes poses a great threat to public health as it facilitates dissemination and co-selection of clinically important genes amongst bacteria. Although rapid accumulation of bacterial whole-genome sequencing data since the 2000s enables study of HGcoT at the population level, results based on genetic co-occurrence counts and simple association tests are usually confounded by bacterial population structure when sampled bacteria belong to the same species, leading to spurious conclusions. RESULTS: We have developed a network approach to explore WGS data for evidence of intraspecies HGcoT and have implemented it in R package GeneMates ( github.com/wanyuac/GeneMates ). The package takes as input an allelic presence-absence matrix of interested genes and a matrix of core-genome single-nucleotide polymorphisms, performs association tests with linear mixed models controlled for population structure, produces a network of significantly associated alleles, and identifies clusters within the network as plausible co-transferred alleles. GeneMates users may choose to score consistency of allelic physical distances measured in genome assemblies using a novel approach we have developed and overlay scores to the network for further evidence of HGcoT. Validation studies of GeneMates on known acquired antimicrobial resistance genes in Escherichia coli and Salmonella Typhimurium show advantages of our network approach over simple association analysis: (1) distinguishing between allelic co-occurrence driven by HGcoT and that driven by clonal reproduction, (2) evaluating effects of population structure on allelic co-occurrence, and (3) direct links between allele clusters in the network and MGEs when physical distances are incorporated. CONCLUSION: GeneMates offers an effective approach to detection of intraspecies HGcoT using WGS data.

Assuntos

Transferência Genética Horizontal , Genes Bacterianos , Software , Escherichia coli/genética , Salmonella typhimurium/genética , Sequenciamento Completo do Genoma/métodos

5.

Automated assessment of biological database assertions using the scientific literature.

Bouadjenek, Mohamed Reda; Zobel, Justin; Verspoor, Karin.

BMC Bioinformatics ; 20(1): 216, 2019 Apr 29.

Artigo em Inglês | MEDLINE | ID: mdl-31035936

RESUMO

BACKGROUND: The large biological databases such as GenBank contain vast numbers of records, the content of which is substantively based on external resources, including published literature. Manual curation is used to establish whether the literature and the records are indeed consistent. We explore in this paper an automated method for assessing the consistency of biological assertions, to assist biocurators, which we call BARC, Biocuration tool for Assessment of Relation Consistency. In this method a biological assertion is represented as a relation between two objects (for example, a gene and a disease); we then use our novel set-based relevance algorithm SaBRA to retrieve pertinent literature, and apply a classifier to estimate the likelihood that this relation (assertion) is correct. RESULTS: Our experiments on assessing gene-disease relations and protein-protein interactions using the PubMed Central collection show that BARC can be effective at assisting curators to perform data cleansing. Specifically, the results obtained showed that BARC substantially outperforms the best baselines, with an improvement of F-measure of 3.5% and 13%, respectively, on gene-disease relations and protein-protein interactions. We have additionally carried out a feature analysis that showed that all feature types are informative, as are all fields of the documents. CONCLUSIONS: BARC provides a clear benefit for the biocuration community, as there are no prior automated tools for identifying inconsistent assertions in large-scale biological databases.

Assuntos

Algoritmos , Mineração de Dados/métodos , Bases de Dados Factuais , Bases de Dados de Ácidos Nucleicos , Humanos , Mapas de Interação de Proteínas , Editoração

6.

Exploring effective approaches for haplotype block phasing.

Al Bkhetan, Ziad; Zobel, Justin; Kowalczyk, Adam; Verspoor, Karin; Goudey, Benjamin.

BMC Bioinformatics ; 20(1): 540, 2019 Oct 30.

Artigo em Inglês | MEDLINE | ID: mdl-31666002

RESUMO

BACKGROUND: Knowledge of phase, the specific allele sequence on each copy of homologous chromosomes, is increasingly recognized as critical for detecting certain classes of disease-associated mutations. One approach for detecting such mutations is through phased haplotype association analysis. While the accuracy of methods for phasing genotype data has been widely explored, there has been little attention given to phasing accuracy at haplotype block scale. Understanding the combined impact of the accuracy of phasing tool and the method used to determine haplotype blocks on the error rate within the determined blocks is essential to conduct accurate haplotype analyses. RESULTS: We present a systematic study exploring the relationship between seven widely used phasing methods and two common methods for determining haplotype blocks. The evaluation focuses on the number of haplotype blocks that are incorrectly phased. Insights from these results are used to develop a haplotype estimator based on a consensus of three tools. The consensus estimator achieved the most accurate phasing in all applied tests. Individually, EAGLE2, BEAGLE and SHAPEIT2 alternate in being the best performing tool in different scenarios. Determining haplotype blocks based on linkage disequilibrium leads to more correctly phased blocks compared to a sliding window approach. We find that there is little difference between phasing sections of a genome (e.g. a gene) compared to phasing entire chromosomes. Finally, we show that the location of phasing error varies when the tools are applied to the same data several times, a finding that could be important for downstream analyses. CONCLUSIONS: The choice of phasing and block determination algorithms and their interaction impacts the accuracy of phased haplotype blocks. This work provides guidance and evidence for the different design choices needed for analyses using haplotype blocks. The study highlights a number of issues that may have limited the replicability of previous haplotype analysis.

Assuntos

Haplótipos , Algoritmos , Desequilíbrio de Ligação

7.

Automated detection of records in biological sequence databases that are inconsistent with the literature.

Bouadjenek, Mohamed Reda; Verspoor, Karin; Zobel, Justin.

J Biomed Inform ; 71: 229-240, 2017 07.

Artigo em Inglês | MEDLINE | ID: mdl-28624643

RESUMO

We investigate and analyse the data quality of nucleotide sequence databases with the objective of automatic detection of data anomalies and suspicious records. Specifically, we demonstrate that the published literature associated with each data record can be used to automatically evaluate its quality, by cross-checking the consistency of the key content of the database record with the referenced publications. Focusing on GenBank, we describe a set of quality indicators based on the relevance paradigm of information retrieval (IR). Then, we use these quality indicators to train an anomaly detection algorithm to classify records as "confident" or "suspicious". Our experiments on the PubMed Central collection show assessing the coherence between the literature and database records, through our algorithms, is an effective mechanism for assisting curators to perform data cleansing. Although fewer than 0.25% of the records in our data set are known to be faulty, we would expect that there are many more in GenBank that have not yet been identified. By automated comparison with literature they can be identified with a precision of up to 10% and a recall of up to 30%, while strongly outperforming several baselines. While these results leave substantial room for improvement, they reflect both the very imbalanced nature of the data, and the limited explicitly labelled data that is available. Overall, the obtained results show promise for the development of a new kind of approach to detecting low-quality and suspicious sequence records based on literature analysis and consistency. From a practical point of view, this will greatly help curators in identifying inconsistent records in large-scale sequence databases by highlighting records that are likely to be inconsistent with the literature.

Assuntos

Algoritmos , Curadoria de Dados , Bases de Dados Bibliográficas , Bases de Dados de Ácidos Nucleicos , Armazenamento e Recuperação da Informação , Humanos , PubMed , Publicações , Controle de Qualidade

8.

Accurate and robust genomic prediction of celiac disease using statistical learning.

Abraham, Gad; Tye-Din, Jason A; Bhalala, Oneil G; Kowalczyk, Adam; Zobel, Justin; Inouye, Michael.

PLoS Genet ; 10(2): e1004137, 2014 Feb.

Artigo em Inglês | MEDLINE | ID: mdl-24550740

RESUMO

Practical application of genomic-based risk stratification to clinical diagnosis is appealing yet performance varies widely depending on the disease and genomic risk score (GRS) method. Celiac disease (CD), a common immune-mediated illness, is strongly genetically determined and requires specific HLA haplotypes. HLA testing can exclude diagnosis but has low specificity, providing little information suitable for clinical risk stratification. Using six European cohorts, we provide a proof-of-concept that statistical learning approaches which simultaneously model all SNPs can generate robust and highly accurate predictive models of CD based on genome-wide SNP profiles. The high predictive capacity replicated both in cross-validation within each cohort (AUC of 0.87-0.89) and in independent replication across cohorts (AUC of 0.86-0.9), despite differences in ethnicity. The models explained 30-35% of disease variance and up to â¼43% of heritability. The GRS's utility was assessed in different clinically relevant settings. Comparable to HLA typing, the GRS can be used to identify individuals without CD with ≥99.6% negative predictive value however, unlike HLA typing, fine-scale stratification of individuals into categories of higher-risk for CD can identify those that would benefit from more invasive and costly definitive testing. The GRS is flexible and its performance can be adapted to the clinical situation by adjusting the threshold cut-off. Despite explaining a minority of disease heritability, our findings indicate a genomic risk score provides clinically relevant information to improve upon current diagnostic pathways for CD and support further studies evaluating the clinical utility of this approach in CD and other complex diseases.

Assuntos

Doença Celíaca/genética , Predisposição Genética para Doença , Genômica , Antígenos HLA/genética , Alelos , Biometria , Doença Celíaca/patologia , Feminino , Genoma Humano , Haplótipos , Humanos , Polimorfismo de Nucleotídeo Único , Risco

9.

Bandage: interactive visualization of de novo genome assemblies.

Wick, Ryan R; Schultz, Mark B; Zobel, Justin; Holt, Kathryn E.

Bioinformatics ; 31(20): 3350-2, 2015 Oct 15.

Artigo em Inglês | MEDLINE | ID: mdl-26099265

RESUMO

UNLABELLED: Although de novo assembly graphs contain assembled contigs (nodes), the connections between those contigs (edges) are difficult for users to access. Bandage (a Bioinformatics Application for Navigating De novo Assembly Graphs Easily) is a tool for visualizing assembly graphs with connections. Users can zoom in to specific areas of the graph and interact with it by moving nodes, adding labels, changing colors and extracting sequences. BLAST searches can be performed within the Bandage graphical user interface and the hits are displayed as highlights in the graph. By displaying connections between contigs, Bandage presents new possibilities for analyzing de novo assemblies that are not possible through investigation of contigs alone. AVAILABILITY AND IMPLEMENTATION: Source code and binaries are freely available at https://github.com/rrwick/Bandage. Bandage is implemented in C++ and supported on Linux, OS X and Windows. A full feature list and screenshots are available at http://rrwick.github.io/Bandage. CONTACT: rrwick@gmail.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Assuntos

Biologia Computacional/métodos , Gráficos por Computador , Genoma Bacteriano , Genoma Humano , Análise de Sequência de DNA/métodos , Software , Mapeamento Cromossômico , Genômica/métodos , Humanos

10.

A categorical analysis of coreference resolution errors in biomedical texts.

Choi, Miji; Zobel, Justin; Verspoor, Karin.

J Biomed Inform ; 60: 309-18, 2016 Apr.

Artigo em Inglês | MEDLINE | ID: mdl-26925515

RESUMO

BACKGROUND: Coreference resolution is an essential task in information extraction from the published biomedical literature. It supports the discovery of complex information by linking referring expressions such as pronouns and appositives to their referents, which are typically entities that play a central role in biomedical events. Correctly establishing these links allows detailed understanding of all the participants in events, and connecting events together through their shared participants. RESULTS: As an initial step towards the development of a novel coreference resolution system for the biomedical domain, we have categorised the characteristics of coreference relations by type of anaphor as well as broader syntactic and semantic characteristics, and have compared the performance of a domain adaptation of a state-of-the-art general system to published results from domain-specific systems in terms of this categorisation. We also develop a rule-based system for anaphoric coreference resolution in the biomedical domain with simple modules derived from available systems. Our results show that the domain-specific systems outperform the general system overall. Whilst this result is unsurprising, our proposed categorisation enables a detailed quantitative analysis of the system performance. We identify limitations of each system and find that there remain important gaps in the state-of-the-art systems, which are clearly identifiable with respect to the categorisation. CONCLUSION: We have analysed in detail the performance of existing coreference resolution systems for the biomedical literature and have demonstrated that there clear gaps in their coverage. The approach developed in the general domain needs to be tailored for portability to the biomedical domain. The specific framework for class-based error analysis of existing systems that we propose has benefits for identifying specific limitations of those systems. This in turn provides insights for further system development.

Assuntos

Mineração de Dados/métodos , Registros Eletrônicos de Saúde , Idioma , Processamento de Linguagem Natural , Algoritmos , Reações Falso-Negativas , Humanos , Informática Médica , Reconhecimento Automatizado de Padrão , Resolução de Problemas , Publicações , Reprodutibilidade dos Testes , Semântica

11.

Performance and robustness of penalized and unpenalized methods for genetic prediction of complex human disease.

Abraham, Gad; Kowalczyk, Adam; Zobel, Justin; Inouye, Michael.

Genet Epidemiol ; 37(2): 184-95, 2013 Feb.

Artigo em Inglês | MEDLINE | ID: mdl-23203348

RESUMO

A central goal of medical genetics is to accurately predict complex disease from genotypes. Here, we present a comprehensive analysis of simulated and real data using lasso and elastic-net penalized support-vector machine models, a mixed-effects linear model, a polygenic score, and unpenalized logistic regression. In simulation, the sparse penalized models achieved lower false-positive rates and higher precision than the other methods for detecting causal SNPs. The common practice of prefiltering SNP lists for subsequent penalized modeling was examined and shown to substantially reduce the ability to recover the causal SNPs. Using genome-wide SNP profiles across eight complex diseases within cross-validation, lasso and elastic-net models achieved substantially better predictive ability in celiac disease, type 1 diabetes, and Crohn's disease, and had equivalent predictive ability in the rest, with the results in celiac disease strongly replicating between independent datasets. We investigated the effect of linkage disequilibrium on the predictive models, showing that the penalized methods leverage this information to their advantage, compared with methods that assume SNP independence. Our findings show that sparse penalized approaches are robust across different disease architectures, producing as good as or better phenotype predictions and variance explained. This has fundamental ramifications for the selection and future development of methods to genetically predict human disease.

Assuntos

Doença/genética , Modelos Genéticos , Herança Multifatorial , Polimorfismo de Nucleotídeo Único , Artrite Reumatoide/genética , Estudos de Casos e Controles , Doença Celíaca/genética , Doença da Artéria Coronariana/genética , Doença de Crohn/genética , Diabetes Mellitus Tipo 1/genética , Estudo de Associação Genômica Ampla , Humanos , Desequilíbrio de Ligação , Modelos Logísticos , Reprodutibilidade dos Testes

12.

Gossamer--a resource-efficient de novo assembler.

Conway, Thomas; Wazny, Jeremy; Bromage, Andrew; Zobel, Justin; Beresford-Smith, Bryan.

Bioinformatics ; 28(14): 1937-8, 2012 Jul 15.

Artigo em Inglês | MEDLINE | ID: mdl-22611131

RESUMO

MOTIVATION: The de novo assembly of short read high-throughput sequencing data poses significant computational challenges. The volume of data is huge; the reads are tiny compared to the underlying sequence, and there are significant numbers of sequencing errors. There are numerous software packages that allow users to assemble short reads, but most are either limited to relatively small genomes (e.g. bacteria) or require large computing infrastructure or employ greedy algorithms and thus often do not yield high-quality results. RESULTS: We have developed Gossamer, an implementation of the de Bruijn approach to assembly that requires close to the theoretical minimum of memory, but still allows efficient processing. Our results show that it is space efficient and produces high-quality assemblies. AVAILABILITY: Gossamer is available for non-commercial use from http://www.genomics.csse.unimelb.edu.au/product-gossamer.php.

Assuntos

Sequenciamento de Nucleotídeos em Larga Escala/métodos , Análise de Sequência de DNA/métodos , Software , Algoritmos , Biologia Computacional/métodos

13.

SparSNP: fast and memory-efficient analysis of all SNPs for phenotype prediction.

Abraham, Gad; Kowalczyk, Adam; Zobel, Justin; Inouye, Michael.

BMC Bioinformatics ; 13: 88, 2012 May 10.

Artigo em Inglês | MEDLINE | ID: mdl-22574887

RESUMO

BACKGROUND: A central goal of genomics is to predict phenotypic variation from genetic variation. Fitting predictive models to genome-wide and whole genome single nucleotide polymorphism (SNP) profiles allows us to estimate the predictive power of the SNPs and potentially develop diagnostic models for disease. However, many current datasets cannot be analysed with standard tools due to their large size. RESULTS: We introduce SparSNP, a tool for fitting lasso linear models for massive SNP datasets quickly and with very low memory requirements. In analysis on a large celiac disease case/control dataset, we show that SparSNP runs substantially faster than four other state-of-the-art tools for fitting large scale penalised models. SparSNP was one of only two tools that could successfully fit models to the entire celiac disease dataset, and it did so with superior performance. Compared with the other tools, the models generated by SparSNP had better than or equal to predictive performance in cross-validation. CONCLUSIONS: Genomic datasets are rapidly increasing in size, rendering existing approaches to model fitting impractical due to their prohibitive time or memory requirements. This study shows that SparSNP is an essential addition to the genomic analysis toolkit.SparSNP is available at http://www.genomics.csse.unimelb.edu.au/SparSNP.

Assuntos

Genômica/métodos , Fenótipo , Polimorfismo de Nucleotídeo Único , Software , Doença Celíaca/genética , Biologia Computacional/métodos , Humanos , Modelos Lineares

14.

Short read sequence typing (SRST): multi-locus sequence types from short reads.

Inouye, Michael; Conway, Thomas C; Zobel, Justin; Holt, Kathryn E.

BMC Genomics ; 13: 338, 2012 Jul 24.

Artigo em Inglês | MEDLINE | ID: mdl-22827703

RESUMO

BACKGROUND: Multi-locus sequence typing (MLST) has become the gold standard for population analyses of bacterial pathogens. This method focuses on the sequences of a small number of loci (usually seven) to divide the population and is simple, robust and facilitates comparison of results between laboratories and over time. Over the last decade, researchers and population health specialists have invested substantial effort in building up public MLST databases for nearly 100 different bacterial species, and these databases contain a wealth of important information linked to MLST sequence types such as time and place of isolation, host or niche, serotype and even clinical or drug resistance profiles. Recent advances in sequencing technology mean it is increasingly feasible to perform bacterial population analysis at the whole genome level. This offers massive gains in resolving power and genetic profiling compared to MLST, and will eventually replace MLST for bacterial typing and population analysis. However given the wealth of data currently available in MLST databases, it is crucial to maintain backwards compatibility with MLST schemes so that new genome analyses can be understood in their proper historical context. RESULTS: We present a software tool, SRST, for quick and accurate retrieval of sequence types from short read sets, using inputs easily downloaded from public databases. SRST uses read mapping and an allele assignment score incorporating sequence coverage and variability, to determine the most likely allele at each MLST locus. Analysis of over 3,500 loci in more than 500 publicly accessible Illumina read sets showed SRST to be highly accurate at allele assignment. SRST output is compatible with common analysis tools such as eBURST, Clonal Frame or PhyloViz, allowing easy comparison between novel genome data and MLST data. Alignment, fastq and pileup files can also be generated for novel alleles. CONCLUSIONS: SRST is a novel software tool for accurate assignment of sequence types using short read data. Several uses for the tool are demonstrated, including quality control for high-throughput sequencing projects, plasmid MLST and analysis of genomic data during outbreak investigation. SRST is open-source, requires Python, BWA and SamTools, and is available from http://srst.sourceforge.net.

Assuntos

Tipagem de Sequências Multilocus/métodos , Análise de Sequência de DNA/métodos , Software

15.

When proxy-driven learning is no better than random: The consequences of representational incompleteness.

Zobel, Justin; Vázquez-Abad, Felisa J; Lin, Pauline.

PLoS One ; 17(7): e0271268, 2022.

Artigo em Inglês | MEDLINE | ID: mdl-35830451

RESUMO

Machine learning is widely used for personalisation, that is, to tune systems with the aim of adapting their behaviour to the responses of humans. This tuning relies on quantified features that capture the human actions, and also on objective functions-that is, proxies - that are intended to represent desirable outcomes. However, a learning system's representation of the world can be incomplete or insufficiently rich, for example if users' decisions are based on properties of which the system is unaware. Moreover, the incompleteness of proxies can be argued to be an intrinsic property of computational systems, as they are based on literal representations of human actions rather than on the human actions themselves; this problem is distinct from the usual aspects of bias that are examined in machine learning literature. We use mathematical analysis and simulations of a reinforcement-learning case study to demonstrate that incompleteness of representation can, first, lead to learning that is no better than random; and second, means that the learning system can be inherently unaware that it is failing. This result has implications for the limits and applications of machine learning systems in human domains.

Assuntos

Aprendizado de Máquina , Procurador , Diretivas Antecipadas , Humanos , Reforço Psicológico

16.

Prediction of breast cancer prognosis using gene set statistics provides signature stability and biological context.

Abraham, Gad; Kowalczyk, Adam; Loi, Sherene; Haviv, Izhak; Zobel, Justin.

BMC Bioinformatics ; 11: 277, 2010 May 25.

Artigo em Inglês | MEDLINE | ID: mdl-20500821

RESUMO

BACKGROUND: Different microarray studies have compiled gene lists for predicting outcomes of a range of treatments and diseases. These have produced gene lists that have little overlap, indicating that the results from any one study are unstable. It has been suggested that the underlying pathways are essentially identical, and that the expression of gene sets, rather than that of individual genes, may be more informative with respect to prognosis and understanding of the underlying biological process. RESULTS: We sought to examine the stability of prognostic signatures based on gene sets rather than individual genes. We classified breast cancer cases from five microarray studies according to the risk of metastasis, using features derived from predefined gene sets. The expression levels of genes in the sets are aggregated, using what we call a set statistic. The resulting prognostic gene sets were as predictive as the lists of individual genes, but displayed more consistent rankings via bootstrap replications within datasets, produced more stable classifiers across different datasets, and are potentially more interpretable in the biological context since they examine gene expression in the context of their neighbouring genes in the pathway. In addition, we performed this analysis in each breast cancer molecular subtype, based on ER/HER2 status. The prognostic gene sets found in each subtype were consistent with the biology based on previous analysis of individual genes. CONCLUSIONS: To date, most analyses of gene expression data have focused at the level of the individual genes. We show that a complementary approach of examining the data using predefined gene sets can reduce the noise and could provide increased insight into the underlying biological pathways.

Assuntos

Neoplasias da Mama/diagnóstico , Neoplasias da Mama/genética , Perfilação da Expressão Gênica/métodos , Biomarcadores Tumorais/genética , Feminino , Humanos , Prognóstico

17.

Boolean versus ranked querying for biomedical systematic reviews.

Karimi, Sarvnaz; Pohl, Stefan; Scholer, Falk; Cavedon, Lawrence; Zobel, Justin.

BMC Med Inform Decis Mak ; 10: 58, 2010 Oct 12.

Artigo em Inglês | MEDLINE | ID: mdl-20937152

RESUMO

BACKGROUND: The process of constructing a systematic review, a document that compiles the published evidence pertaining to a specified medical topic, is intensely time-consuming, often taking a team of researchers over a year, with the identification of relevant published research comprising a substantial portion of the effort. The standard paradigm for this information-seeking task is to use Boolean search; however, this leaves the user(s) the requirement of examining every returned result. Further, our experience is that effective Boolean queries for this specific task are extremely difficult to formulate and typically require multiple iterations of refinement before being finalized. METHODS: We explore the effectiveness of using ranked retrieval as compared to Boolean querying for the purpose of constructing a systematic review. We conduct a series of experiments involving ranked retrieval, using queries defined methodologically, in an effort to understand the practicalities of incorporating ranked retrieval into the systematic search task. RESULTS: Our results show that ranked retrieval by itself is not viable for this search task requiring high recall. However, we describe a refinement of the standard Boolean search process and show that ranking within a Boolean result set can improve the overall search performance by providing early indication of the quality of the results, thereby speeding up the iterative query-refinement process. CONCLUSIONS: Outcomes of experiments suggest that an interactive query-development process using a hybrid ranked and Boolean retrieval system has the potential for significant time-savings over the current search process in the systematic reviewing.

Assuntos

Armazenamento e Recuperação da Informação/métodos , Literatura de Revisão como Assunto

18.

White Rabbit Time and Frequency Transfer Over Wireless Millimeter-Wave Carriers.

Gilligan, Jane E; Konitzer, Eric M; Siman-Tov, Elad; Zobel, Justin W; Adles, Eric J.

IEEE Trans Ultrason Ferroelectr Freq Control ; 67(9): 1946-1952, 2020 Sep.

Artigo em Inglês | MEDLINE | ID: mdl-32324550

RESUMO

We demonstrate time and frequency transfer using a White Rabbit (WR) time transfer system over millimeter-wave (mm-wave) 71-76 GHz carriers. To validate the performance of our system, we present overlapping Allan deviation (ADEV), time deviation (TDEV), and phase statistics. Over mm-wave carriers, we report an ADEV of 7.1×10-12 at 1 s and a TDEV of <10 ps at 10 000 s. Our results show that after 4 s of averaging, we have sufficient precision to transfer a cesium atomic frequency standard. We analyze the link budget and architecture of our mm-wave link and discuss possible sources of phase error and their potential impact on the WR frequency transfer. Our data show that WR can synchronize new network architectures, such as physically separated fiber-optic networks, and support new applications, such as the synchronization of intermittently connected platforms. We conclude with recommendations for future investigation, including cascaded hybrid wireline and wireless architectures.

19.

Search Effectiveness in Nonredundant Sequence Databases: Assessments and Solutions.

Chen, Qingyu; Zhang, Xiuzhen; Wan, Yu; Zobel, Justin; Verspoor, Karin.

J Comput Biol ; 26(6): 605-617, 2019 06.

Artigo em Inglês | MEDLINE | ID: mdl-30585742

RESUMO

Duplicate sequence records-that is, records having similar or identical sequences-are a challenge in search of biological sequence databases. They significantly increase database search time and can lead to uninformative search results containing similar sequences. Sequence clustering methods have been used to address this issue to group similar sequences into clusters. These clusters form a nonredundant database consisting of representatives (one record per cluster) and members (the remaining records in a cluster). In this approach, for nonredundant database search, users search against representatives first and optionally expand search results by exploring member records from matching clusters. Existing studies used Precision and Recall to assess the search effectiveness of nonredundant databases. However, the use of Precision and Recall does not model user behavior in practice and thus may not reflect practical search effectiveness. In this study, we first propose innovative evaluation metrics to measure search effectiveness. The findings are that (1) the Precision of expanded sets is consistently lower than that of representatives, with a decrease up to 7% at top ranks; and (2) Recall is uninformative because, for most queries, expanded sets return more records than does search of the original unclustered databases. Motivated by these findings, we propose a solution that returns a user-specified proportion of top similar records, modeled by a ranking function that aggregates sequence and annotation similarities. In experiments undertaken on UniProtKB/Swiss-Prot, the largest expert-curated protein database, we show that our method dramatically reduces the number of returned sequences, increases Precision by 3%, and does not impact effective search time.

Assuntos

Proteínas/química , Análise por Conglomerados , Bases de Dados de Proteínas , Humanos , Interface Usuário-Computador

20.

Literature consistency of bioinformatics sequence databases is effective for assessing record quality.

Bouadjenek, Mohamed Reda; Verspoor, Karin; Zobel, Justin.

Database (Oxford) ; 2017(1)2017 01 01.

Artigo em Inglês | MEDLINE | ID: mdl-28365737

RESUMO

Bioinformatics sequence databases such as Genbank or UniProt contain hundreds of millions of records of genomic data. These records are derived from direct submissions from individual laboratories, as well as from bulk submissions from large-scale sequencing centres; their diversity and scale means that they suffer from a range of data quality issues including errors, discrepancies, redundancies, ambiguities, incompleteness and inconsistencies with the published literature. In this work, we seek to investigate and analyze the data quality of sequence databases from the perspective of a curator, who must detect anomalous and suspicious records. Specifically, we emphasize the detection of inconsistent records with respect to the literature. Focusing on GenBank, we propose a set of 24 quality indicators, which are based on treating a record as a query into the published literature, and then use query quality predictors. We then carry out an analysis that shows that the proposed quality indicators and the quality of the records have a mutual relationship, in which one depends on the other. We propose to represent record-literature consistency as a vector of these quality indicators. By reducing the dimensionality of this representation for visualization purposes using principal component analysis, we show that records which have been reported as inconsistent with the literature fall roughly in the same area, and therefore share similar characteristics. By manually analyzing records not previously known to be erroneous that fall in the same area than records know to be inconsistent, we show that one record out of four is inconsistent with respect to the literature. This high density of inconsistent record opens the way towards the development of automatic methods for the detection of faulty records. We conclude that literature inconsistency is a meaningful strategy for identifying suspicious records. Database URL: https://github.com/rbouadjenek/DQBioinformatics.

Assuntos

Biologia Computacional/métodos , Curadoria de Dados/métodos , Mineração de Dados/métodos , Bases de Dados de Ácidos Nucleicos , Bases de Dados de Proteínas

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

DETALHE DA PESQUISA