ABSTRACT
The 'canonical' protein sets distributed by UniProt are widely used for similarity searching, and functional and structural annotation. For many investigators, canonical sequences are the only version of a protein examined. However, higher eukaryotes often encode multiple isoforms of a protein from a single gene. For unreviewed (UniProtKB/TrEMBL) protein sequences, the longest sequence in a Gene-Centric group is chosen as canonical. This choice can create inconsistencies, selecting >95% identical orthologs with dramatically different lengths, which is biologically unlikely. We describe the ortho2tree pipeline, which examines Reference Proteome canonical and isoform sequences from sets of orthologous proteins, builds multiple alignments, constructs gap-distance trees, and identifies low-cost clades of isoforms with similar lengths. After examining 140 000 proteins from eight mammals in UniProtKB release 2022_05, ortho2tree proposed 7804 canonical changes for release 2023_01, while confirming 53 434 canonicals. Gap distributions for isoforms selected by ortho2tree are similar to those in bacterial and yeast alignments, organisms unaffected by isoform selection, suggesting ortho2tree canonicals more accurately reflect genuine biological variation. Of the changes proposed by ortho2tree, 82% agreed with MANE; of the confirmed canonicals, 92% agreed with MANE. Ortho2tree can improve canonical assignment among orthologous sequences that are >60% identical, a group that includes vertebrates and higher plants.
ABSTRACT
The integration of mitochondrial genome fragments into the nuclear genome is well documented, and the transfer of these mitochondrial nuclear pseudogenes (numts) is thought to be an ongoing evolutionary process. With the increasing number of eukaryotic genomes available, genome-wide distributions of numts are often surveyed. However, inconsistencies in genome quality can reduce the accuracy of numt estimates, and methods used for identification can be complicated by the diverse sizes and ages of numts. Numts have been previously characterized in rodent genomes and it was postulated that they might be more prevalent in a group of voles with rapidly evolving karyotypes. Here, we examine 37 rodent genomes, and an additional 26 vertebrate genomes, while also considering numt detection methods. We identify numts using DNA:DNA and protein:translated-DNA similarity searches and compare numt distributions among rodent and vertebrate taxa to assess whether some groups are more susceptible to transfer. A combination of protein sequence comparisons (protein:translated-DNA) and BLASTN genomic DNA searches detects 50% more numts than genomic DNA:DNA searches alone. In addition, higher-quality RefSeq genomes produce lower estimates of numts than GenBank genomes, suggesting that lower-quality genome assemblies can overestimate numt abundance. Phylogenetic analysis shows that mitochondrial transfers are not associated with karyotypic diversity among rodents. Surprisingly, we did not find a strong correlation between numt counts and genome size. Estimates using DNA:DNA analyses can underestimate the amount of mitochondrial DNA that is transferred to the nucleus.
ABSTRACT
Bioinformatics, a discipline that combines aspects of biology, statistics, mathematics, and computer science, is becoming increasingly important for biological research. However, bioinformatics instruction is not yet generally integrated into undergraduate life sciences curricula. To understand why, we studied how bioinformatics is being included in biology education in the US by conducting a nationwide survey of faculty at two- and four-year institutions. The survey asked several open-ended questions that probed barriers to integration, the answers to which were analyzed using a mixed-methods approach. The barrier most frequently reported by the 1,260 respondents was lack of faculty expertise/training, but other deterrents (lack of student interest, overly full curricula, and lack of student preparation) were also common. Interestingly, the barriers faculty face depended strongly on whether they are members of an underrepresented group and on the Carnegie Classification of their home institution. We were surprised to discover that the cohort of faculty who were awarded their terminal degree most recently reported the most preparation in bioinformatics but teach it at the lowest rate.
Subject(s)
Biology/education , Computational Biology/education , Curriculum , Faculty/statistics & numerical data , Female , Humans , Male , Motivation , Students/psychology , Surveys and Questionnaires/statistics & numerical data , United States
ABSTRACT
Relational databases can integrate diverse types of information and manage large sets of similarity search results, greatly simplifying genome-scale analyses. By focusing on taxonomic subsets of sequences, relational databases can reduce the size and redundancy of sequence libraries and improve the statistical significance of homologs. In addition, by loading similarity search results into a relational database, it becomes possible to explore and summarize the relationships between all of the proteins in an organism and those in other biological kingdoms. This unit describes how to use relational databases to improve the efficiency of sequence similarity searching and demonstrates various large-scale genomic analyses of homology-related data. It also describes the installation and use of a simple protein sequence database, seqdb_demo, which is used as a basis for the other protocols. The unit also introduces search_demo, a database that stores sequence similarity search results. The search_demo database is then used to explore the evolutionary relationships between E. coli proteins and proteins in other organisms in a large-scale comparative genomic analysis. © 2017 by John Wiley & Sons, Inc.
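The idea of loading search results into a relational database and summarizing them by taxonomic group can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module; the table names, column names, sample rows, and the 0.001 E-value cutoff are assumptions made up for this example, and they are not the seqdb_demo or search_demo schemas described in the unit.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with hypothetical proteins and search hits."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE protein (acc TEXT PRIMARY KEY, kingdom TEXT);
        CREATE TABLE hit (query_acc TEXT, subj_acc TEXT, evalue REAL);
        """
    )
    # Made-up subject proteins and their taxonomic kingdoms.
    conn.executemany(
        "INSERT INTO protein VALUES (?, ?)",
        [
            ("subjA", "Bacteria"),
            ("subjB", "Archaea"),
            ("subjC", "Eukaryota"),
            ("subjD", "Eukaryota"),
        ],
    )
    # Made-up similarity search results (query, subject, E-value).
    conn.executemany(
        "INSERT INTO hit VALUES (?, ?, ?)",
        [
            ("ecoli1", "subjA", 1e-50),
            ("ecoli1", "subjB", 1e-12),
            ("ecoli1", "subjC", 0.5),
            ("ecoli2", "subjC", 1e-8),
            ("ecoli2", "subjD", 1e-6),
        ],
    )
    return conn


def kingdoms_per_query(conn, max_evalue=1e-3):
    """Count the distinct kingdoms with a significant hit for each query."""
    rows = conn.execute(
        """
        SELECT h.query_acc, COUNT(DISTINCT p.kingdom)
        FROM hit AS h JOIN protein AS p ON p.acc = h.subj_acc
        WHERE h.evalue <= ?
        GROUP BY h.query_acc
        ORDER BY h.query_acc
        """,
        (max_evalue,),
    ).fetchall()
    return dict(rows)
```

Calling `kingdoms_per_query(build_demo_db())` summarizes, per query protein, how many kingdoms contain a hit below the cutoff; the same join-and-group pattern extends to larger result tables.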
Subject(s)
Computational Biology/methods , Databases, Protein , Sequence Analysis, Protein/methods , Software , Escherichia coli/genetics , Evolution, Molecular , Proteins/chemistry , Proteins/genetics , Sequence Alignment , Sequence Homology, Amino Acid
ABSTRACT
Iterative similarity search programs, like psiblast, jackhmmer, and psisearch, are much more sensitive than pairwise similarity search methods like blast and ssearch because they build a position-specific scoring model (a PSSM or HMM) that captures the pattern of sequence conservation characteristic of a protein family. But models are subject to contamination; once an unrelated sequence has been added to the model, homologs of the unrelated sequence will also produce high scores, and the model can diverge from the original protein family. Examination of alignment errors during psiblast PSSM contamination suggested a simple strategy for dramatically reducing PSSM contamination. psiblast PSSMs are built from the query-based multiple sequence alignment (MSA) implied by the pairwise alignments between the query model (PSSM, HMM) and the subject sequences in the library. When the original query sequence residues are inserted into gapped positions in the aligned subject sequence, the resulting PSSM rarely produces alignment over-extensions or alignments to unrelated sequences. This simple step, which tends to anchor the PSSM to the original query sequence and slightly increase target percent identity, can reduce the frequency of false-positive alignments more than 20-fold compared with psiblast and jackhmmer, with little loss in search sensitivity.
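The query-seeding step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the psisearch implementation: the function name is hypothetical, the two input strings are assumed to be a gapped pairwise alignment of equal length, and `-` is assumed to be the gap character.

```python
def query_seeded_row(aligned_query, aligned_subject, gap="-"):
    """Build the subject's row of a query-based MSA, filling subject gaps
    with the corresponding query residues.

    Columns where the query has a gap (insertions in the subject) have no
    position in a query-based MSA, so they are dropped.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    row = []
    for q, s in zip(aligned_query, aligned_subject):
        if q == gap:
            continue  # subject insertion: no query column to place it in
        # Subject deletion: seed the column with the query residue.
        row.append(q if s == gap else s)
    return "".join(row)
```

For example, with the query `MKT-LLV` aligned to the subject `MR-ALIV`, the subject row becomes `MRTLIV`: the inserted `A` is dropped and the gap opposite `T` is filled with the query's `T`, so the row has one residue per query position.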
Subject(s)
Sequence Alignment/methods , Sequence Analysis, Protein/methods , Protein Domains , Software
ABSTRACT
The FASTA programs provide a comprehensive set of rapid similarity searching tools (fasta36, fastx36, tfastx36, fasty36, tfasty36), similar to those provided by the BLAST package, as well as programs for slower, optimal, local, and global similarity searches (ssearch36, ggsearch36), and for searching with short peptides and oligonucleotides (fasts36, fastm36). The FASTA programs use an empirical strategy for estimating statistical significance that accommodates a range of similarity scoring matrices and gap penalties, improving alignment boundary accuracy and search sensitivity. The FASTA programs can produce "BLAST-like" alignment and tabular output, for ease of integration into existing analysis pipelines, and can search small, representative databases, and then report results for a larger set of sequences, using links from the smaller dataset. The FASTA programs work with a wide variety of database formats, including mySQL and postgreSQL databases. The programs also provide a strategy for integrating domain and active site annotations into alignments and highlighting the mutational state of functionally critical residues. These protocols describe how to use the FASTA programs to characterize protein and DNA sequences, using protein:protein, protein:DNA, and DNA:DNA comparisons.
Subject(s)
Nucleotides/chemistry , Proteins/chemistry , Sequence Alignment , Sequence Homology, Amino Acid , Sequence Homology, Nucleic Acid , Databases, Nucleic Acid , Databases, Protein
ABSTRACT
The characterization of new genomes based on their protein sets has been revolutionized by new sequencing technologies, but biologists seeking to exploit new sequence information are often frustrated by the challenges associated with accurately assigning biological functions to newly identified proteins. Here, we highlight some of the challenges in functional inference from sequence similarity. Investigators can improve the accuracy of function prediction by (1) being conservative about the evolutionary distance to a protein of known function; (2) considering the ambiguous meaning of "functional similarity," and (3) being aware of the limitations of annotations in functional databases. Protein function prediction does not offer "one-size-fits-all" solutions. Prediction strategies work better when the idiosyncrasies of function and functional annotation are better understood.
Subject(s)
Databases, Protein , Proteins/chemistry , Proteins/metabolism , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Amino Acid Sequence , Data Mining/methods , Molecular Sequence Data , Structure-Activity Relationship
ABSTRACT
BACKGROUND: Protein domains are commonly used to assess the functional roles and evolutionary relationships of proteins and protein families. Here, we use the Pfam protein family database to examine a set of candidate partial domains. Pfam protein domains are often thought of as evolutionarily indivisible, structurally compact units from which larger functional proteins are assembled; however, almost 4% of Pfam27 PfamA domains are shorter than 50% of their family model length, suggesting that more than half of the domain is missing at those locations. To better understand the structural nature of partial domains in proteins, we examined 30,961 partial domain regions from 136 domain families contained in a representative subset of PfamA domains (RefProtDom2 or RPD2). RESULTS: We characterized three types of apparent partial domains: split domains, bounded partials, and unbounded partials. We find that bounded partial domains are over-represented in eukaryotes and in lower quality protein predictions, suggesting that they often result from inaccurate genome assemblies or gene models. We also find that a large percentage of unbounded partial domains produce long alignments, which suggests that their annotation as a partial is an alignment artifact; yet some can be found as partials in other sequence contexts. CONCLUSIONS: Partial domains are largely the result of alignment and annotation artifacts and should be viewed with caution. The presence of partial domain annotations in proteins should raise the concern that the prediction of the protein's gene may be incomplete. In general, protein domains can be considered the structural building blocks of proteins.
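The 50% criterion used above to flag a candidate partial domain can be expressed as a short check. This is a sketch only: the function names and the use of 1-based, inclusive model coordinates are assumptions, and it does not attempt to distinguish split, bounded, and unbounded partials, whose definitions are not given in the abstract.

```python
def model_coverage(model_start, model_end, model_length):
    """Fraction of the family model covered by an annotated domain.

    Coordinates are assumed to be 1-based and inclusive on the model.
    """
    if not (1 <= model_start <= model_end <= model_length):
        raise ValueError("invalid model coordinates")
    return (model_end - model_start + 1) / model_length


def is_partial_domain(model_start, model_end, model_length, min_fraction=0.5):
    """Flag a domain that covers less than min_fraction of its family model."""
    return model_coverage(model_start, model_end, model_length) < min_fraction
```

A domain aligned to model positions 1 through 40 of a 100-position model covers 40% of the model and is flagged, while one covering positions 10 through 90 is not.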
Subject(s)
Molecular Sequence Annotation , Protein Structure, Tertiary , Proteins/chemistry , Sequence Alignment , Animals , Databases, Genetic , Databases, Protein , Drosophila/genetics , Humans , Mice , Models, Molecular , Software
ABSTRACT
Understanding which are the catalytic residues in an enzyme and what function they perform is crucial to many biology studies, particularly those leading to new therapeutics and enzyme design. The original version of the Catalytic Site Atlas (CSA) (http://www.ebi.ac.uk/thornton-srv/databases/CSA) published in 2004, which catalogs the residues involved in enzyme catalysis in experimentally determined protein structures, had only 177 curated entries and employed a simplistic approach to expanding these annotations to homologous enzyme structures. Here we present a new version of the CSA (CSA 2.0), which greatly expands the number of both curated (968) and automatically annotated catalytic sites in enzyme structures, utilizing a new method for annotation transfer. The curated entries are used, along with the variation in residue type from the sequence comparison, to generate 3D templates of the catalytic sites, which in turn can be used to find catalytic sites in new structures. To ease the transfer of CSA annotations to other resources a new ontology has been developed: the Enzyme Mechanism Ontology, which has permitted the transfer of annotations to Mechanism, Annotation and Classification in Enzymes (MACiE) and UniProt Knowledge Base (UniProtKB) resources. The CSA database schema has been re-designed and both the CSA data and search capabilities are presented in a new modern web interface.
Subject(s)
Catalytic Domain , Databases, Protein , Enzymes/chemistry , Biological Ontologies , Internet , Sequence Analysis, Protein
ABSTRACT
BLAST, FASTA, and other similarity searching programs seek to identify homologous proteins and DNA sequences based on excess sequence similarity. If two sequences share much more similarity than expected by chance, the simplest explanation for the excess similarity is common ancestry (homology). The most effective similarity searches compare protein sequences, rather than DNA sequences, for sequences that encode proteins, and use expectation values, rather than percent identity, to infer homology. The BLAST and FASTA packages of sequence comparison programs provide programs for comparing protein and DNA sequences to protein databases (the most sensitive searches). Protein and translated-DNA comparisons to protein databases routinely allow evolutionary look-back times from 1 to 2 billion years; DNA:DNA searches are 5-10-fold less sensitive. BLAST and FASTA can be run on popular web sites, but can also be downloaded and installed on local computers. With local installation, target databases can be customized for the sequence data being characterized. With today's very large protein databases, search sensitivity can also be improved by searching smaller comprehensive databases, for example, a complete protein set from an evolutionarily neighboring model organism. By default, BLAST and FASTA use scoring strategies targeted at distant evolutionary relationships; for comparisons involving short domains or queries, or searches that seek relatively close homologs (e.g. mouse-human), shallower scoring matrices will be more effective. Both BLAST and FASTA provide very accurate statistical estimates, which can be used to reliably identify protein sequences that diverged more than 2 billion years ago.
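The effect of database size on statistical significance can be illustrated with the standard relation between a bit score and an expectation value, E = m × n × 2^(-S), where m is the query length, n is the database length in residues, and S is the score in bits. The sketch below applies that relation directly; the function name is hypothetical, and the search programs themselves apply length corrections (and FASTA estimates its statistics empirically), so reported E-values will differ somewhat from this simplified calculation.

```python
def expect_value(bit_score, query_len, db_len):
    """Simplified expectation value from a bit score: E = m * n * 2**(-S).

    query_len (m) and db_len (n) are lengths in residues; no effective-length
    correction is applied in this sketch.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


def fold_improvement(large_db_len, small_db_len):
    """Factor by which E shrinks for the same score in the smaller database."""
    return large_db_len / small_db_len
```

Under this relation, the same bit score is 100-fold more significant in a database of 10 million residues than in one of 1 billion residues, which is why searching a smaller comprehensive database can reveal homologs that miss the significance threshold in a very large one.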
Subject(s)
Computational Biology/methods , Sequence Alignment/methods , Software , Amino Acid Sequence , Data Mining , Databases, Protein , Humans , Molecular Sequence Data , Sequence Homology, Amino Acid
ABSTRACT
MOTIVATION: Sequence similarity searches performed with BLAST, SSEARCH and FASTA achieve high sensitivity by using scoring matrices (e.g. BLOSUM62) that target low identity (<33%) alignments. Although such scoring matrices can effectively identify distant homologs, they can also produce local alignments that extend beyond the homologous regions. RESULTS: We measured local alignment start/stop boundary accuracy using a set of queries where the correct alignment boundaries were known, and found that 7% of BLASTP and 8% of SSEARCH alignment boundaries were overextended. Overextended alignments include non-homologous sequences; they occur most frequently between sequences that are more closely related (>33% identity). Adjusting the scoring matrix to reflect the identity of the homologous sequence can correct higher identity overextended alignment boundaries. In addition, the scoring matrix that produced a correct alignment could be reliably predicted based on the sequence identity seen in the original BLOSUM62 alignment. Realigning with the predicted scoring matrix corrected 37% of all overextended alignments, resulting in more correct alignments than using BLOSUM62 alone.
Subject(s)
Computational Biology/methods , Position-Specific Scoring Matrices , Proteins/chemistry , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Algorithms , Amino Acid Sequence , Databases, Protein , Molecular Sequence Data , Sequence Homology, Amino Acid
ABSTRACT
Sequence similarity searching, typically with BLAST, is the most widely used and most reliable strategy for characterizing newly determined sequences. Sequence similarity searches can identify "homologous" proteins or genes by detecting excess similarity: statistically significant similarity that reflects common ancestry. This unit provides an overview of the inference of homology from significant similarity, and introduces other units in this chapter that provide more details on effective strategies for identifying homologs.
Subject(s)
Proteins/chemistry , Sequence Alignment/methods , Databases, Protein , Proteins/genetics , Sequence Analysis , Sequence Homology
ABSTRACT
Protein sequence similarity searching programs like BLASTP, SSEARCH (UNIT 3.10), and FASTA use scoring matrices that are designed to identify distant evolutionary relationships (BLOSUM62 for BLAST, BLOSUM50 for SSEARCH and FASTA). Different similarity scoring matrices are most effective at different evolutionary distances. "Deep" scoring matrices like BLOSUM62 and BLOSUM50 target alignments with 20-30% identity, while "shallow" scoring matrices (e.g. VTML10-VTML80) target alignments that share 90-50% identity, reflecting much less evolutionary change. While "deep" matrices provide very sensitive similarity searches, they also require longer sequence alignments and can sometimes produce alignment overextension into non-homologous regions. Shallower scoring matrices are more effective when searching for short protein domains, or when the goal is to limit the scope of the search to sequences that are likely to be orthologous between recently diverged organisms. Likewise, in DNA searches, the match and mismatch parameters set evolutionary look-back times and domain boundaries. In this unit, we will discuss the theoretical foundations that drive practical choices of protein and DNA similarity scoring matrices and gap penalties. Deep scoring matrices (BLOSUM62 and BLOSUM50) should be used for sensitive searches with full-length protein sequences, but short domains or restricted evolutionary look-back require shallower scoring matrices.
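The identity ranges quoted above can be turned into a rough lookup that suggests whether a "deep" or "shallow" matrix suits a given alignment. This sketch is an illustration only: the function names are hypothetical, percent identity is computed over columns where neither sequence has a gap (definitions vary), and the 30% and 50% boundaries are taken directly from the ranges in the text, which does not say what to use between them.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over aligned columns where neither sequence is gapped."""
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def matrix_depth(identity):
    """Suggest a matrix class from percent identity.

    >= 50%: "shallow" (the text's range for e.g. VTML10-VTML80)
    <= 30%: "deep" (the text's range for BLOSUM62 / BLOSUM50)
    otherwise: "unspecified", because the text gives no range here.
    """
    if identity >= 50.0:
        return "shallow"
    if identity <= 30.0:
        return "deep"
    return "unspecified"
```

For example, two aligned sequences that match at three of four ungapped columns are 75% identical and fall in the "shallow" range.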
Subject(s)
Position-Specific Scoring Matrices , Amino Acid Sequence , Amino Acid Substitution , DNA , Molecular Sequence Data , Sequence Alignment , Sequence Homology, Amino Acid
ABSTRACT
UNLABELLED: Iterative similarity searches with PSI-BLAST position-specific score matrices (PSSMs) find many more homologs than single searches, but PSSMs can be contaminated when homologous alignments are extended into unrelated protein domains, an error known as homologous over-extension (HOE). PSI-Search combines an optimal Smith-Waterman local alignment sequence search, using SSEARCH, with the PSI-BLAST profile construction strategy. An optional sequence boundary-masking procedure, which prevents alignments from being extended after they are initially included, can reduce HOE errors in the PSSM profile. Preventing HOE improves selectivity for both PSI-BLAST and PSI-Search, but PSI-Search has ~4-fold better selectivity than PSI-BLAST and similar sensitivity at 50% and 60% family coverage. PSI-Search also produces 2- to 4-fold fewer false-positives than JackHMMER, but is ~5% less sensitive. AVAILABILITY AND IMPLEMENTATION: PSI-Search is available from the authors as a standalone implementation written in Perl for Linux-compatible platforms. It is also available through a web interface (www.ebi.ac.uk/Tools/sss/psisearch) and SOAP and REST Web Services (www.ebi.ac.uk/Tools/webservices).
Subject(s)
Amino Acid Motifs , Sequence Alignment/methods , Software , Computational Biology/methods , Databases, Protein , Internet , Programming Languages
ABSTRACT
MACiE (which stands for Mechanism, Annotation and Classification in Enzymes) is a database of enzyme reaction mechanisms, and can be accessed from http://www.ebi.ac.uk/thornton-srv/databases/MACiE/. This article presents the release of Version 3 of MACiE, which not only extends the dataset to 335 entries, covering 182 of the EC sub-subclasses with a crystal structure available (~90%), but also incorporates greater chemical and structural detail. This version of MACiE represents a shift in emphasis for new entries, from non-homologous representatives covering EC reaction space to enzymes with mechanisms of interest to our users and collaborators with a view to exploring the chemical diversity of life. We present new tools for exploring the data in MACiE and comparing entries as well as new analyses of the data and new searches, many of which can now be accessed via dedicated Perl scripts.
Subject(s)
Databases, Protein , Enzymes/chemistry , Biocatalysis , Biochemical Phenomena , Catalytic Domain , Coenzymes/chemistry , Enzymes/classification , Internet , Molecular Sequence Annotation
ABSTRACT
UNLABELLED: RefProtDom provides a set of divergent query domains, originally selected from Pfam, and full-length proteins containing their homologous domains, with diverse architectures, for evaluating pair-wise and iterative sequence similarity searches. Pfam homology and domain boundary annotations in the target library were supplemented using local and semi-global searches, PSI-BLAST searches, and SCOP and CATH classifications. AVAILABILITY: RefProtDom is available from http://faculty.virginia.edu/wrpearson/fasta/PUBS/gonzalez09a.
Subject(s)
Databases, Protein , Protein Structure, Tertiary , Proteins , Software
ABSTRACT
BACKGROUND: While the pairwise alignments produced by sequence similarity searches are a powerful tool for identifying homologous proteins (proteins that share a common ancestor and a similar structure), pairwise sequence alignments often fail to represent accurately the structural alignments inferred from three-dimensional coordinates. Since sequence alignment algorithms produce optimal alignments, the best structural alignments must reflect suboptimal sequence alignment scores. Thus, we have examined a range of suboptimal sequence alignments and a range of scoring parameters to understand better which sequence alignments are likely to be more structurally accurate. RESULTS: We compared near-optimal protein sequence alignments produced by the Zuker algorithm and a set of probabilistic alignments produced by the probA program with structural alignments produced by four different structure alignment algorithms. There is significant overlap between the solution spaces of structural alignments and both the near-optimal sequence alignments produced by commonly used scoring parameters for sequences that share significant sequence similarity (E-values < 10^-5) and the ensemble of probA alignments. We constructed a logistic regression model incorporating three input variables derived from sets of near-optimal alignments: robustness, edge frequency, and maximum bits-per-position. A ROC analysis shows that this model more accurately classifies amino acid pairs (edges in the alignment path graph) according to the likelihood of appearance in structural alignments than the robustness score alone. We investigated various trimming protocols for removing incorrect edges from the optimal sequence alignment; the most effective protocol is to remove matches from the semi-global optimal alignment that are outside the boundaries of the local alignment, although trimming according to the model-generated probabilities achieves a similar level of improvement.
The model can also be used to generate novel alignments by using the probabilities in lieu of a scoring matrix. These alignments are typically better than the optimal sequence alignment, and include novel correct structural edges. We find that the probA alignments sample a larger variety of alignments than the Zuker set, which more frequently results in alignments that are closer to the structural alignments, but that using the probA alignments as input to the regression model does not increase performance. CONCLUSIONS: The pool of suboptimal pairwise protein sequence alignments substantially overlaps structure-based alignments for pairs with statistically significant similarity, and a regression model based on information contained in this alignment pool improves the accuracy of pairwise alignments with respect to structure-based alignments.
Subject(s)
Proteins/chemistry , Sequence Alignment/methods , Sequence Analysis, Protein
ABSTRACT
We have characterized a novel type of PSI-BLAST error, homologous over-extension (HOE), using embedded Pfam domain queries in searches against a reference library containing Pfam-annotated UniProt sequences and random synthetic sequences. PSI-BLAST makes two types of errors: alignments to non-homologous regions and HOE alignments that begin in a homologous region, but extend beyond the homology into neighboring sequence regions. When the neighboring sequence region contains a non-homologous domain, PSI-BLAST can incorporate the unrelated sequence into its position-specific scoring matrix, which then finds non-homologous proteins with significant expectation values. HOE accounts for the largest fraction of the initial false positive (FP) errors, and the largest fraction of FPs at iteration 5. In searches against complete protein sequences, 5-9% of alignments at iteration 5 are non-homologous. HOE frequently begins in a partial protein domain; when partial domains are removed from the library, HOE errors decrease from 16 to 3% of weighted coverage (hard queries; from 35 to 5% for sampled queries) and no-error searches increase from 2 to 58% of weighted coverage (hard queries; from 16 to 78% for sampled queries). When HOE is reduced by not extending previously found sequences, PSI-BLAST specificity improves 4-8-fold, with little loss in sensitivity.
Subject(s)
Sequence Alignment/methods , Sequence Homology, Amino Acid , Phylogeny , Position-Specific Scoring Matrices , Protein Structure, Tertiary , Proteins/chemistry , Proteins/classification , Proteins/genetics
ABSTRACT
MOTIVATION: To test whether protein folding constraints and secondary structure sequence preferences significantly reduce the space of amino acid words in proteins, we compared the frequencies of four- and five-amino acid word clumps (independent words) in proteins to the frequencies predicted by four random sequence models. RESULTS: While the human proteome has many overrepresented word clumps, these words come from large protein families with biased compositions (e.g. Zn-fingers). In contrast, in a non-redundant sample of Pfam-AB, only 1% of four-amino acid word clumps (4.7% of 5mer words) are 2-fold overrepresented compared with our simplest random model [MC(0)], and 0.1% (4mers) to 0.5% (5mers) are 2-fold overrepresented compared with a window-shuffled random model. Using a false discovery rate q-value analysis, the number of exceptional four- or five-letter words in real proteins is similar to the number found when comparing words from one random model to another. Consensus overrepresented words are not enriched in conserved regions of proteins, but four-letter words are enriched 1.18- to 1.56-fold in alpha-helical secondary structures (but not beta-strands). Five-residue consensus exceptional words are enriched for alpha-helix 1.43- to 1.61-fold. Protein word preferences in regular secondary structure do not appear to significantly restrict the use of sequence words in unrelated proteins, although the consensus exceptional words have a secondary structure bias for alpha-helix. Globally, words in protein sequences appear to be under very few constraints; for the most part, they appear to be random. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
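The comparison of observed word frequencies with a zero-order random model can be sketched as follows. This is a simplified illustration, not the authors' method: it counts every overlapping occurrence of a word rather than word clumps, it takes MC(0) to mean that residues are drawn independently with the composition of the input sequences, and all function names are hypothetical.

```python
from collections import Counter


def word_counts(seqs, k=4):
    """Count every overlapping k-residue word in a list of sequences."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def mc0_expected(word, seqs):
    """Expected count of a word if residues are independent (zero-order model).

    Residue probabilities are the observed composition of seqs; the expected
    count is the word probability times the number of word positions.
    """
    residues = Counter("".join(seqs))
    total = sum(residues.values())
    positions = sum(max(len(seq) - len(word) + 1, 0) for seq in seqs)
    prob = 1.0
    for aa in word:
        prob *= residues[aa] / total
    return positions * prob


def overrepresented(seqs, k=4, fold=2.0):
    """Words observed at least `fold` times more often than expected."""
    counts = word_counts(seqs, k)
    return {w: c for w, c in counts.items()
            if c >= fold * mc0_expected(w, seqs)}
```

Counting clumps instead of overlapping occurrences, and using the other random models mentioned in the abstract (such as window shuffling), would change the expected counts; this sketch only shows the shape of the observed-versus-expected comparison.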
Subject(s)
Amino Acid Sequence , Proteins/chemistry , Sequence Analysis, Protein/methods , Databases, Protein , Protein Folding , Protein Structure, Secondary
ABSTRACT
Modern sequence alignment algorithms are used routinely to identify homologous proteins, proteins that share a common ancestor. Homologous proteins always share similar structures and often have similar functions. Over the past 20 years, sequence comparison has become both more sensitive, largely because of profile-based methods, and more reliable, because of more accurate statistical estimates. As sequence and structure databases become larger, and comparison methods become more powerful, reliable statistical estimates will become even more important for distinguishing similarities that are due to homology from those that are due to analogy (convergence). The newest sequence alignment methods are more sensitive than older methods, but more accurate statistical estimates are needed for their full power to be realized.