Results 1 - 11 of 11
1.
Rev Socionetwork Strateg ; 18(1): 101-121, 2024.
Article in English | MEDLINE | ID: mdl-38646589

ABSTRACT

The challenge of information overload in the legal domain increases every day. The COLIEE competition has created four challenge tasks that are intended to encourage the development of systems and methods to alleviate some of that pressure: case law retrieval (Task 1) and entailment (Task 2), and statute law retrieval (Task 3) and entailment (Task 4). Here we describe our methods for Task 1 and Task 4. In Task 1, we used a sentence-transformer model to create a numeric representation for each case paragraph. We then created a histogram of the similarities between a query case and a candidate case. The histogram is used to build a binary classifier that decides whether a candidate case should be noticed or not. In Task 4, our approach relies on fine-tuning a pre-trained DeBERTa large language model (LLM) trained on the SNLI and MultiNLI datasets. Our method for Task 4 was ranked third among eight participating teams in the COLIEE 2023 competition. For Task 4, we also compared the performance of the DeBERTa model with that of a knowledge distillation model and of ensemble methods including Random Forest and Voting.
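
The Task 1 feature construction described above can be illustrated with a minimal sketch. It assumes the paragraph embeddings have already been computed and are passed in as plain lists of floats; the use of cosine similarity, the number of bins, and the function names are assumptions for illustration, since the abstract does not specify them.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_histogram(query_paras, cand_paras, bins=10):
    """Normalized histogram of all query/candidate paragraph similarities.

    The resulting fixed-length vector could serve as the input features
    of a binary classifier (noticed / not noticed).
    """
    counts = [0] * bins
    for q in query_paras:
        for c in cand_paras:
            s = cosine(q, c)  # nominally in [-1, 1]
            # Map [-1, 1] onto bin indices 0..bins-1, clamping rounding noise.
            idx = int((s + 1.0) / 2.0 * bins)
            idx = max(0, min(idx, bins - 1))
            counts[idx] += 1
    total = sum(counts)
    return [n / total for n in counts] if total else counts
```

Because the histogram has a fixed length regardless of how many paragraphs each case contains, cases of very different sizes map to comparable feature vectors.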

2.
Rev Socionetwork Strateg ; 18(1): 27-47, 2024.
Article in English | MEDLINE | ID: mdl-38646588

ABSTRACT

We summarize the 10th Competition on Legal Information Extraction and Entailment. In this tenth edition, the competition included four tasks on case law and statute law. The case law component includes an information retrieval task (Task 1), and the confirmation of an entailment relation between an existing case and a selected unseen case (Task 2). The statute law component includes an information retrieval task (Task 3), and an entailment/question-answering task based on retrieved civil code statutes (Task 4). Participation was open to any group based on any approach. Ten different teams participated in the case law competition tasks, most of them in more than one task. We received results from eight teams for Task 1 (22 runs) and seven teams for Task 2 (18 runs). On the statute law tasks, nine different teams participated, most in more than one task. Six teams submitted a total of 16 runs for Task 3, and nine teams submitted a total of 26 runs for Task 4. We describe the variety of approaches, our official evaluation, and analysis of our data and submission results.

3.
Rev Socionetwork Strateg ; 16(1): 157-174, 2022.
Article in English | MEDLINE | ID: mdl-35535319

ABSTRACT

We describe the techniques applied by the University of Alberta (UA) team in the most recent Competition on Legal Information Extraction and Entailment (COLIEE 2021). We participated in retrieval and entailment tasks for both case law and statute law; we applied a transformer-based approach for the case law entailment task, an information retrieval technique based on BM25 for legal information retrieval, and a natural language inference mechanism using semantic knowledge applied to statute law texts. This competition included 25 teams from 14 countries; our case law entailment approach was ranked no. 4 in Task 2, the BM25 technique for legal information retrieval was ranked no. 3 in Task 3, and the natural language inference technique incorporating semantic information was ranked no. 4 in Task 4. The combination of the latter two techniques on Task 5 was ranked no. 2. We also performed an error analysis of our system in Task 4, which provides some insight into the current state of the art and research priorities for future directions.
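
For readers unfamiliar with BM25, the following is a minimal sketch of one common variant of the scoring function, not the team's implementation. The parameter defaults `k1=1.5` and `b=0.75`, the smoothed IDF form, and the assumption that documents arrive as non-empty token lists are all illustrative choices.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against a tokenized `query`.

    Assumes `docs` is a non-empty list of non-empty token lists.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed inverse document frequency (always positive).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores
```

Ranking the statutes by descending score then yields the retrieval list; a document sharing no term with the query scores exactly zero.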

4.
J Bioinform Comput Biol ; 11(2): 1350002, 2013 Apr.
Article in English | MEDLINE | ID: mdl-23600820

ABSTRACT

High-throughput single nucleotide polymorphism genotyping assays conveniently produce genotype data for genome-wide genetic linkage and association studies. For pedigree datasets, the unphased genotype data is used to infer the haplotypes for individuals, according to Mendelian inheritance rules. Linkage studies can then locate putative chromosomal regions based on the haplotype allele sharing among the pedigree members and their disease status. Most existing haplotyping programs require rather strict pedigree structures and return a single inferred solution for downstream analysis. In this research, we relax the pedigree structure to contain ungenotyped founders and present a cubic time whole genome haplotyping algorithm to minimize the number of zero-recombination haplotype blocks. With or without explicitly enumerating all the haplotyping solutions, the algorithm determines all distinct haplotype allele identity-by-descent (IBD) sharings among the pedigree members, in linear time in the total number of haplotyping solutions. Our algorithm is implemented as a computer program iBDD. Extensive simulation experiments using 2 sets of 16 pedigree structures from previous studies showed that, in general, there are trillions of haplotyping solutions, but only up to a few thousand distinct haplotype allele IBD sharings. iBDD is able to return all these sharings for downstream genome-wide linkage and association studies.


Subject(s)
Algorithms; Genomics/statistics & numerical data; Haplotypes; Pedigree; Alleles; Computational Biology; Female; Genetic Linkage; Genome, Human; Genotype; Humans; Male; Models, Genetic; Polymorphism, Single Nucleotide; Software
5.
BMC Res Notes ; 5: 404, 2012 Aug 03.
Article in English | MEDLINE | ID: mdl-22863359

ABSTRACT

BACKGROUND: Single nucleotide polymorphism (SNP) genotyping assays normally give rise to a certain percentage of no-calls; the problem becomes severe when the target organisms, such as cattle, do not have a high resolution genomic sequence. Missing SNP genotypes, when related to target traits, would confound downstream data analyses such as genome-wide association studies (GWAS). Existing methods for recovering the missing values are successful to some extent: they are either accurate but not fast enough, or fast but not accurate enough. RESULTS: For a target missing genotype, we take only the SNP loci within a genetic distance vicinity and only the samples within a similarity vicinity into our local imputation process. For missing genotype imputation, the comparative performance evaluations through extensive simulation studies using real human and cattle genotype datasets demonstrated that our nearest neighbor based local imputation method was one of the most efficient methods, and outperformed existing methods except the time-consuming fastPHASE; for missing haplotype allele imputation, the comparative performance evaluations using real mouse haplotype datasets demonstrated that our method was not only one of the most efficient methods, but also one of the most accurate methods. CONCLUSIONS: Given that fastPHASE requires a long imputation time on medium to high density datasets, and that our nearest neighbor based local imputation method performed only slightly worse, yet better than all other methods, one might want to adopt our method as an alternative missing SNP genotype or missing haplotype allele imputation method.


Subject(s)
Polymorphism, Single Nucleotide; Alleles; Animals; Cattle; Genome-Wide Association Study; Genotype; Haplotypes; Humans; Mice; Reproducibility of Results
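
The local imputation idea in the abstract above (restrict both the loci and the samples considered) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' program: genotypes are assumed to be coded 0/1/2 with `None` for a no-call, index distance between loci stands in for genetic distance, similarity is a plain count of matching calls, and the names `impute_one`, `window`, and `k` are hypothetical.

```python
from collections import Counter

def impute_one(genotypes, i, j, window=2, k=3):
    """Impute genotypes[i][j] by majority vote among the k most similar samples.

    Similarity is measured only on loci within `window` positions of locus j
    (an assumed stand-in for a genetic-distance vicinity).
    """
    n_loci = len(genotypes[i])
    local = [l for l in range(max(0, j - window), min(n_loci, j + window + 1))
             if l != j]
    scored = []
    for s, row in enumerate(genotypes):
        if s == i or row[j] is None:
            continue  # a neighbor must have a call at the target locus
        # Count neighboring loci where both samples carry the same call.
        matches = sum(1 for l in local
                      if row[l] is not None and row[l] == genotypes[i][l])
        scored.append((matches, s))
    # Highest match count first; ties broken by lower sample index.
    scored.sort(key=lambda t: (-t[0], t[1]))
    votes = Counter(genotypes[s][j] for _, s in scored[:k])
    return votes.most_common(1)[0][0] if votes else None
```

For example, with five samples where two agree with the target sample on every neighboring locus, `impute_one(genotypes, 0, 2, window=2, k=2)` returns the genotype those two samples carry at locus 2.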
6.
BMC Bioinformatics ; 10: 115, 2009 Apr 21.
Article in English | MEDLINE | ID: mdl-19379528

ABSTRACT

BACKGROUND: The "common disease-common variant" hypothesis and genome-wide association studies have achieved numerous successes in the last three years, particularly in genetic mapping in human diseases. Nevertheless, the power of the association study methods is still low, in particular on quantitative traits, and the description of the full allelic spectrum is deemed still far from reach. Given the increasing density of single nucleotide polymorphisms available, and as suggested by the block-like structure of the human genome, a popular and promising strategy is to use haplotypes to try to capture the correlation structure of SNPs in regions of little recombination. The key to the success of this strategy is thus the ability to unambiguously determine the haplotype allele sharing status among the members. The association studies based on haplotype sharing status would have significantly reduced degrees of freedom and be able to capture the combined effects of tightly linked causal variants. RESULTS: For pedigree genotype datasets of medium density of SNPs, we present two methods for haplotype allele sharing status determination among the pedigree members. An extensive simulation study showed that both methods performed nearly perfectly on breakpoint discovery, mutation haplotype allele discovery, and shared chromosomal region discovery. CONCLUSION: For pedigree genotype datasets, the haplotype allele sharing status among the members can be deterministically, efficiently, and accurately determined, even for very small pedigrees. Given their excellent performance, the presented haplotype allele sharing status determination programs can be useful in many downstream applications including haplotype based association studies.


Subject(s)
Alleles; Haplotypes; Chromosome Mapping/methods; Computer Simulation; Genome, Human; Genome-Wide Association Study; Genotype; Humans; Pedigree; Polymorphism, Single Nucleotide
7.
BMC Bioinformatics ; 9: 279, 2008 Jun 13.
Article in English | MEDLINE | ID: mdl-18554404

ABSTRACT

BACKGROUND: Serotypes of foot-and-mouth disease viruses (FMDVs) have generally been determined by biological experiments. Computational genotyping is not well studied even with the availability of whole viral genomes, due to uneven evolution among genes as well as frequent genetic recombination. Naively using sequence comparison for genotyping achieves only limited success. RESULTS: We used 129 FMDV strains with known serotype as training strains to select as many as 140 of the most serotype-specific nucleotide strings. We then constructed a linear-kernel Support Vector Machine classifier using these 140 strings. Under the leave-one-out cross validation scheme, this classifier was able to assign the correct serotype to 127 of these 129 strains, achieving 98.45% accuracy. It also assigned the serotype correctly to an independent test set of 83 other FMDV strains downloaded separately from NCBI GenBank. CONCLUSION: Computational genotyping is much faster and much cheaper than wet-lab based biological experiments, given the availability of the detailed molecular sequences. The high accuracy of our proposed method suggests the potential of utilizing a few signature nucleotide strings instead of whole genomes to determine the serotypes of novel FMDV strains.


Subject(s)
DNA, Viral/genetics; Foot-and-Mouth Disease Virus/genetics; Foot-and-Mouth Disease Virus/isolation & purification; Genotype; Sequence Analysis, DNA/methods; Base Sequence; Molecular Sequence Data; Species Specificity
8.
Article in English | MEDLINE | ID: mdl-19164008

ABSTRACT

Gene expression microarray technology has enabled advanced biological and medical research, but the data are well recognized to be noisy and must be used with caution, since they are greatly affected by many experimental factors such as RNA concentration, spot typing, hybridization condition, and image analysis. It is highly desirable that the inaccurate data entries ('stains') be identified and subsequently curated. In this paper, we propose a novel computational method, based on feature gene selection and sample classification, to efficiently discover the stains and apply imputation methods to estimate their values. Extensive experimental results on three Affymetrix platforms for human cancer diagnosis showed that by picking only 1-4% of data entries as the most likely stains, the smoothed datasets could be used for better downstream data analyses such as robust biomarker identification and disease diagnosis.


Subject(s)
Biomarkers, Tumor/analysis; Diagnosis, Computer-Assisted/methods; Gene Expression Profiling/methods; Neoplasms/diagnosis; Neoplasms/metabolism; Oligonucleotide Array Sequence Analysis/methods; Pattern Recognition, Automated/methods; Algorithms; Artificial Intelligence; Humans; Neoplasm Proteins/analysis; Reproducibility of Results; Sample Size; Sensitivity and Specificity; Signal Processing, Computer-Assisted
9.
BMC Bioinformatics ; 8: 206, 2007 Jun 16.
Article in English | MEDLINE | ID: mdl-17573973

ABSTRACT

BACKGROUND: Gene expression microarray is a powerful technology for genetically profiling diseases and their associated treatments. Such a process involves a key step of identifying biomarkers, which are expected to be closely related to the disease. A most important use of these identified genes is to construct a classifier which can effectively diagnose disease and even recognize the disease subtypes. Binary classification, for example, diseased or healthy, in microarray data analysis has been successful, while multi-class classification, such as cancer subtyping, remains challenging. RESULTS: We target the challenging multi-class classification in microarray data analysis, especially cancer subtyping using gene expression microarray. We present a novel class discrimination strength vector to represent individual genes and introduce a new measurement to quantify the class discrimination strength difference between two genes. Such a new distance measure is employed in gene clustering, and subsequently the gene cluster information is exploited to select a set of genes which can be used to construct a sample classifier. We tested our method on four real cancer microarray datasets, each containing multiple subtypes of cancer patients. The experimental results show that the constructed classifiers all achieved a higher classification accuracy than the previously best classification results obtained on these four datasets. Additional tests show that the genes selected by our method are less correlated and that they all contribute statistically significantly to the more accurate cancer subtyping. CONCLUSION: The proposed novel class discrimination strength vector is a better representation than the gene expression vector, in the sense that it can be used to effectively eliminate highly correlated but redundant genes for classifier construction. Such a method can build a classifier to achieve a higher classification accuracy, which is demonstrated via cancer subtyping.


Subject(s)
Algorithms; Biomarkers, Tumor/metabolism; Diagnosis, Computer-Assisted/methods; Neoplasm Proteins/metabolism; Neoplasms/diagnosis; Neoplasms/metabolism; Oligonucleotide Array Sequence Analysis/methods; Biomarkers, Tumor/genetics; Genetic Predisposition to Disease/genetics; Humans; Multigene Family/genetics; Neoplasm Proteins/genetics; Neoplasms/classification; Neoplasms/genetics; Reproducibility of Results; Sensitivity and Specificity
10.
Bioinformatics ; 23(14): 1744-52, 2007 Jul 15.
Article in English | MEDLINE | ID: mdl-17495995

ABSTRACT

MOTIVATION: The availability of the whole genomic sequences of HIV-1 viruses provides an excellent resource for studying the HIV-1 phylogenies using all the genetic materials. However, such huge volumes of data create computational challenges in both memory consumption and CPU usage. RESULTS: We propose the complete composition vector representation for an HIV-1 strain, and a string scoring method to extract the nucleotide composition strings that contain the richest evolutionary information for phylogenetic analysis. In this way, a large-scale whole genome phylogenetic analysis for thousands of strains can be done both efficiently and effectively. By using 42 carefully curated strains as references, we apply our method to subtype 1156 HIV-1 strains (10.5 million nucleotides in total), which include 825 pure subtype strains and 331 recombinants. Our results show that our nucleotide composition string selection scheme is computationally efficient, and is able to define both pure subtypes and recombinant forms for HIV-1 strains using the 5000 top ranked nucleotide strings. AVAILABILITY: The Java executable and the HIV-1 datasets are accessible through http://www.cs.ualberta.ca/~ghlin/src/WebTools/hiv.php. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Computational Biology/methods; Genome, Viral; Genomics/methods; HIV-1/genetics; Base Composition; Evolution, Molecular; Genetic Techniques; Genetic Vectors; Models, Statistical; Nucleotides/chemistry; Phylogeny; RNA/genetics; Recombination, Genetic; Reproducibility of Results
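
The general idea of representing a genome by its nucleotide string composition can be illustrated with a deliberately simplified sketch. It records only the relative frequency of every string of length 1 to `max_k` and omits any background correction or string scoring, so it should not be read as the complete composition vector defined in the paper; the function name and the `max_k` parameter are assumptions.

```python
from collections import Counter

def composition_vector(seq, max_k=3):
    """Map each nucleotide string of length 1..max_k to its relative
    frequency among all windows of that length in `seq`."""
    vec = {}
    for k in range(1, max_k + 1):
        n_windows = len(seq) - k + 1
        if n_windows <= 0:
            break  # sequence is shorter than k
        counts = Counter(seq[p:p + k] for p in range(n_windows))
        for kmer, c in counts.items():
            vec[kmer] = c / n_windows
    return vec
```

Two strains can then be compared through their vectors without any sequence alignment, which is what makes this family of methods attractive for thousands of whole genomes.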
11.
Article in English | MEDLINE | ID: mdl-17369636

ABSTRACT

Existing HIV-1 genotyping systems require a computationally expensive phase of multiple sequence alignments, and the alignments must have a sufficiently high quality for accurate genotyping. This is particularly challenging when the number of strains is large. Here we propose a whole genome composition distance (WGCD) to measure the evolutionary closeness between two HIV-1 whole genomic RNA sequences, and use that measure to develop an HIV-1 genotyping system. Such a WGCD-based genotyping system avoids multiple sequence alignments and does not require any prior knowledge about the evolutionary rates. Experimental results showed that the system is able to correctly identify the known subtypes, sub-subtypes, and individual circulating recombinant forms.


Subject(s)
Computational Biology/methods; Genome, Viral; Genotype; HIV-1/genetics; Evolution, Molecular; Genetic Techniques; Genome; Models, Genetic; Models, Statistical; Phylogeny; Recombinant Proteins/chemistry; Software; User-Computer Interface