1.
Nucleic Acids Res ; 43(20): e129, 2015 Nov 16.
Article in English | MEDLINE | ID: mdl-26101252

ABSTRACT

Single Molecule, Real-Time (SMRT) Sequencing (Pacific Biosciences, Menlo Park, CA, USA) provides the longest continuous DNA sequencing reads currently available. However, the relatively high error rate in the raw read data requires novel analysis methods to deconvolute sequences derived from complex samples. Here, we present a workflow of novel computer algorithms able to reconstruct viral variant genomes present in mixtures with an accuracy of >QV50. This approach relies exclusively on Continuous Long Reads (CLR), which are the raw reads generated during SMRT Sequencing. We successfully implement this workflow for simultaneous sequencing of mixtures containing up to forty different >9 kb HIV-1 full genomes. This was achieved using a single SMRT Cell for each mixture and desktop computing power. This novel approach opens the possibility of solving complex sequencing tasks that currently lack a solution.
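For readers unfamiliar with the ">QV50" figure: assuming QV here is the usual Phred-scaled quality value, QV = -10 log10(p), where p is the per-base error probability, QV50 corresponds to about one error per 100,000 bases. The short sketch below only illustrates that conversion. The 9,000-base genome length is an illustrative stand-in for ">9 kb", and nothing here reproduces the authors' workflow.

```python
import math


def qv_to_error_prob(qv: float) -> float:
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_prob_to_qv(p: float) -> float:
    """Convert a per-base error probability to a Phred-scaled quality value."""
    return -10 * math.log10(p)


# Illustrative assumption: a 9,000-base genome (the abstract says ">9 kb").
genome_length = 9_000
per_base_error = qv_to_error_prob(50)
expected_errors = per_base_error * genome_length

print(f"per-base error at QV50: {per_base_error:.1e}")
print(f"expected errors in a {genome_length}-base genome: {expected_errors:.2f}")
```

Under this reading, a consensus at QV50 would be expected to contain well under one error per reconstructed genome of that length.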


Subject(s)
Genetic Variation , Genome, Viral , HIV-1/genetics , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Algorithms , Cluster Analysis , Humans , Sequence Alignment
2.
Biostatistics ; 10(3): 424-35, 2009 Jul.
Article in English | MEDLINE | ID: mdl-19234308

ABSTRACT

Classification studies with high-dimensional measurements and relatively small sample sizes are increasingly common. Prospective analysis of the role of sample sizes in the performance of such studies is important for study design and interpretation of results, but the complexity of typical pattern discovery methods makes this problem challenging. The approach developed here combines Monte Carlo methods and new approximations for linear discriminant analysis, assuming multivariate normal distributions. Monte Carlo methods are used to sample the distribution of which features are selected for a classifier and the mean and variance of features given that they are selected. Given selected features, the linear discriminant problem involves different distributions of training data and generalization data, for which 2 approximations are compared: one based on Taylor series approximation of the generalization error and the other on approximating the discriminant scores as normally distributed. Combining the Monte Carlo and approximation approaches to different aspects of the problem allows efficient estimation of expected generalization error without full simulations of the entire sampling and analysis process. To evaluate the method and investigate realistic study design questions, full simulations are used to ask how validation error rate depends on the strength and number of informative features, the number of noninformative features, the sample size, and the number of features allowed into the pattern. Both approximation methods perform well for most cases but only the normal discriminant score approximation performs well for cases of very many weakly informative or uninformative dimensions. The simulated cases show that many realistic study designs will typically estimate substantially suboptimal patterns and may have low probability of statistically significant validation results.
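The "full simulation" baseline that the abstract contrasts with its approximations can be sketched roughly as follows. This is a simplified stand-in and not the authors' method: it assumes independent unit-variance normal features, selects features by the size of the training mean difference, and scores test samples with a linear discriminant that uses an identity covariance. All function and parameter names are hypothetical.

```python
import random


def simulate_error(n_per_class, n_informative, n_noise, effect, n_selected,
                   n_test=200, n_reps=20, seed=0):
    """Average test error of a select-then-classify pipeline over simulated studies."""
    rng = random.Random(seed)
    p = n_informative + n_noise
    # Class 1 is shifted by `effect` on informative features; class 0 has mean 0.
    shift = [effect] * n_informative + [0.0] * n_noise

    def draw(label):
        return [rng.gauss(shift[j] * label, 1.0) for j in range(p)]

    total = 0.0
    for _ in range(n_reps):
        train0 = [draw(0) for _ in range(n_per_class)]
        train1 = [draw(1) for _ in range(n_per_class)]
        m0 = [sum(x[j] for x in train0) / n_per_class for j in range(p)]
        m1 = [sum(x[j] for x in train1) / n_per_class for j in range(p)]
        # Feature selection: keep the largest absolute training mean differences.
        chosen = sorted(range(p), key=lambda j: abs(m1[j] - m0[j]),
                        reverse=True)[:n_selected]
        errors = 0
        for _ in range(n_test):
            label = rng.randint(0, 1)
            x = draw(label)
            # Linear discriminant score with identity covariance (assumption).
            score = sum((m1[j] - m0[j]) * (x[j] - (m0[j] + m1[j]) / 2)
                        for j in chosen)
            errors += int((score > 0) != (label == 1))
        total += errors / n_test
    return total / n_reps


# Hypothetical design: 20 samples per class, 5 informative and 20 noise features.
strong = simulate_error(20, 5, 20, effect=3.0, n_selected=5)
null = simulate_error(20, 5, 20, effect=0.0, n_selected=5)
print(f"strong effect: {strong:.3f}, no effect: {null:.3f}")
```

Varying `effect`, `n_noise`, `n_per_class`, and `n_selected` in such a loop is the kind of brute-force exploration that the approximations described in the abstract are meant to make cheaper.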


Subject(s)
Biometry/methods , Classification/methods , Sample Size , Algorithms , Genomics/statistics & numerical data , Humans , Linear Models , Monte Carlo Method , Multivariate Analysis , Proteomics/statistics & numerical data
3.
Eukaryot Cell ; 6(6): 940-8, 2007 Jun.
Article in English | MEDLINE | ID: mdl-17468393

ABSTRACT

Pre-mRNA splicing is essential to ensure accurate expression of many genes in eukaryotic organisms. In Entamoeba histolytica, a deep-branching eukaryote, approximately 30% of the annotated genes are predicted to contain introns; however, the accuracy of these predictions has not been tested. In this study, we mined an expressed sequence tag (EST) library representing 7% of amoebic genes and found evidence supporting splicing of 60% of the testable intron predictions, the majority of which contain a GUUUGU 5' splice site and a UAG 3' splice site. Additionally, we identified several splice site misannotations, evidence for the existence of 30 novel introns in previously annotated genes, and identified novel genes through uncovering their spliced ESTs. Finally, we provided molecular evidence for the E. histolytica U2, U4, and U5 snRNAs. These data lay the foundation for further dissection of the role of RNA processing in E. histolytica gene expression.
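As a small illustration of the splice-site motifs reported above, the sketch below checks whether a candidate intron begins with GUUUGU and ends with UAG. It is a plain string test written for this summary, not the authors' EST-mining pipeline, and the candidate sequences are hypothetical.

```python
def matches_reported_motifs(intron: str,
                            five_prime: str = "GUUUGU",
                            three_prime: str = "UAG") -> bool:
    """Return True if the intron starts and ends with the given splice-site motifs.

    DNA input is accepted: T is converted to U before comparison.
    """
    rna = intron.upper().replace("T", "U")
    long_enough = len(rna) >= len(five_prime) + len(three_prime)
    return long_enough and rna.startswith(five_prime) and rna.endswith(three_prime)


# Hypothetical candidate introns, for illustration only.
candidates = {
    "candidate_a": "GUUUGUAUUUAAAUUUAUAG",
    "candidate_b": "gtttgtattaaatttttag",
    "candidate_c": "GUAAGUAUUUAAAUUUACAG",
}
for name, seq in candidates.items():
    print(name, matches_reported_motifs(seq))
```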


Subject(s)
Entamoeba histolytica , Introns , RNA, Small Nuclear/metabolism , Spliceosomes/metabolism , Animals , Base Sequence , Entamoeba histolytica/genetics , Entamoeba histolytica/metabolism , Expressed Sequence Tags , Gene Expression Regulation , Molecular Sequence Data , Nucleic Acid Conformation , RNA Splicing , RNA, Small Nuclear/chemistry , RNA, Small Nuclear/genetics , Spliceosomes/genetics
4.
Electrophoresis ; 26(7-8): 1500-12, 2005 Apr.
Article in English | MEDLINE | ID: mdl-15765480

ABSTRACT

A capillary electrophoresis-mass spectrometry (CE-MS) method has been developed to perform routine, automated analysis of low-molecular-weight peptides in human serum. The method incorporates transient isotachophoresis for in-line preconcentration and a sheathless electrospray interface. To evaluate the performance of the method and demonstrate the utility of the approach, an experiment was designed in which peptides were added to sera from individuals at each of two different concentrations, artificially creating two groups of samples. The CE-MS data from the serum samples were divided into separate training and test sets. A pattern-recognition/feature-selection algorithm based on support vector machines was used to select the mass-to-charge (m/z) values from the training set data that distinguished the two groups of samples from each other. The added peptides were identified correctly as the distinguishing features, and pattern recognition based on these peptides was used to assign each sample in the independent test set to its respective group. A twofold difference in peptide concentration could be detected with statistical significance (p-value < 0.0001). The accuracy of the assignment was 95%, demonstrating the utility of this technique for the discovery of patterns of biomarkers in serum.
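The idea of using a support vector machine to pick out distinguishing m/z features can be sketched on synthetic data. The abstract does not specify the algorithm's details, so everything below is an assumption made for illustration: a linear SVM trained by full-batch subgradient descent on the hinge loss, log2-transformed and standardized intensities, and features ranked by absolute weight. The data are simulated, with one hypothetical channel carrying a twofold higher intensity in the second group.

```python
import math
import random


def standardize(X):
    """Scale each column to zero mean and unit variance."""
    n, d = len(X), len(X[0])
    means = [sum(r[j] for r in X) / n for j in range(d)]
    sds = [(sum((r[j] - means[j]) ** 2 for r in X) / n) ** 0.5 for j in range(d)]
    return [[(r[j] - means[j]) / sds[j] for j in range(d)] for r in X]


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Full-batch subgradient descent on the L2-regularized hinge loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [lam * wj for wj in w], 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # only margin violators contribute to the subgradient
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# Synthetic intensities: 5 hypothetical m/z channels, 20 samples per group.
rng = random.Random(0)
spiked = 2  # hypothetical index of the channel carrying the added peptide
X, y = [], []
for label in (-1, 1):
    for _ in range(20):
        row = [rng.gauss(1.0, 0.05) for _ in range(5)]
        if label == 1:
            row[spiked] *= 2.0  # twofold higher intensity in the second group
        X.append([math.log2(v) for v in row])
        y.append(label)

w, b = train_linear_svm(standardize(X), y)
ranking = sorted(range(len(w)), key=lambda j: abs(w[j]), reverse=True)
print("channels ranked by |weight|:", ranking)
```

In a study like the one described, the ranking would be computed on the training set only, and the selected features would then be used to assign samples in the independent test set.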


Subject(s)
Biomarkers/blood , Electrophoresis, Capillary/methods , Spectrometry, Mass, Electrospray Ionization/methods , Automation , Electrophoresis, Gel, Two-Dimensional , Humans