ABSTRACT
MOTIVATION: Blood cell development is thought to be controlled by a circuit of transcription factors (TFs) and chromatin modifications that determine the cell fate through activating cell type-specific expression programs. To shed light on the interplay between histone marks and TFs during blood cell development, we model gene expression from regulatory signals by means of combinations of sparse linear regression models. RESULTS: The mixture of sparse linear regression models was able to improve the gene expression prediction in relation to the use of a single linear model. Moreover, it performed an efficient selection of regulatory signals even when analyzing all TFs with known motifs (>600). The method identified interesting roles for histone modifications and a selection of TFs related to blood development and chromatin remodelling. AVAILABILITY: The method and datasets are available from http://www.cin.ufpe.br/~igcf/SparseMix. CONTACT: igcf@cin.ufpe.br SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Blood Cells/metabolism , Epigenesis, Genetic , Transcription, Genetic , Animals , Bayes Theorem , Binding Sites , Cell Differentiation/genetics , Embryonic Stem Cells/metabolism , Histones/metabolism , Linear Models , Mice , Promoter Regions, Genetic , Transcription Factors/metabolism
ABSTRACT
In breast tomosynthesis, multiple low-dose projections are acquired in a single scanning direction over a limited angular range to produce cross-sectional planes through the breast for three-dimensional imaging interpretation. We built a next-generation tomosynthesis system capable of multidirectional source motion with the intent to customize scanning motions around "suspicious findings". Customized acquisitions can improve the image quality in areas that require increased scrutiny, such as breast cancers, architectural distortions, and dense clusters. In this paper, virtual clinical trial techniques were used to analyze whether a finding or area at high risk of masking cancers can be detected in a single low-dose projection and thus be used for motion planning. This represents a step towards customizing the subsequent low-dose projection acquisitions autonomously, guided by the first low-dose projection; we call this technique "self-steering tomosynthesis." A U-Net was used to classify the low-dose projections into "risk classes" in simulated breasts with soft-tissue lesions; class probabilities were modified using post hoc Dirichlet calibration (DC). DC improved the multiclass segmentation (Dice = 0.43 vs. 0.28 before DC) and significantly reduced false positives (FPs) from the class of the highest risk of masking (sensitivity = 81.3% at 2 FPs per image vs. 76.0%). This simulation-based study demonstrated the feasibility of identifying suspicious areas using a single low-dose projection for self-steering tomosynthesis.
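The Dice score reported above measures overlap between predicted and ground-truth label maps. The following is a minimal sketch of a per-class Dice computation, not the authors' evaluation code; the function name `dice_per_class`, the flattened label lists, and the toy labels are assumptions made for illustration.

```python
def dice_per_class(pred, truth, num_classes):
    """Per-class Dice = 2*|P ∩ T| / (|P| + |T|) for flat lists of integer labels."""
    scores = {}
    for c in range(num_classes):
        n_pred = sum(1 for x in pred if x == c)
        n_true = sum(1 for x in truth if x == c)
        overlap = sum(1 for x, y in zip(pred, truth) if x == c and y == c)
        denom = n_pred + n_true
        # Dice is undefined when the class is absent from both maps.
        scores[c] = 2 * overlap / denom if denom else float("nan")
    return scores


# Hypothetical flattened label maps with three risk classes (0, 1, 2).
pred = [0, 0, 1, 1, 2, 2]
truth = [0, 1, 1, 1, 2, 0]
scores = dice_per_class(pred, truth, num_classes=3)
```

A multiclass summary can then be formed by averaging the per-class values; whether the reported figure is such an average is not stated in the abstract.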
Subjects
Breast Neoplasms , Mammography , Humans , Female , Mammography/methods , Cross-Sectional Studies , Breast/diagnostic imaging , Breast Neoplasms/diagnostic imaging , Breast Neoplasms/pathology , Imaging, Three-Dimensional/methods
ABSTRACT
Long non-coding RNAs (lncRNAs) comprise the most representative transcriptional units of the mammalian genome. They are associated with organ development and linked with the emergence of cardiovascular diseases. We used bioinformatic approaches, machine learning algorithms, systems biology analyses, and statistical techniques to define co-expression modules linked to heart development and cardiovascular diseases. We also uncovered differentially expressed transcripts in subpopulations of cardiomyocytes. In total, we identified eight cardiac cell types; several new coding, lncRNA, and pcRNA markers; and two cardiomyocyte subpopulations at four time points (ventricle E9.5, left ventricle E11.5, right ventricle E14.5 and left atrium P0) that harbored co-expressed gene modules enriched in mitochondrial, heart development, and cardiovascular disease annotations. Our results provide evidence for the role of particular lncRNAs in heart development and highlight the use of co-expression modular approaches for the functional definition of cell types.
Subjects
Cardiovascular Diseases , RNA, Long Noncoding , Animals , Mice , RNA, Long Noncoding/genetics , Gene Expression Profiling/methods , Organogenesis , Myocytes, Cardiac , Mammals/genetics
ABSTRACT
Virtual clinical trials (VCTs) have been used widely to evaluate digital breast tomosynthesis (DBT) systems. VCTs require realistic simulations of the breast anatomy (phantoms) to characterize lesions and to estimate risk of masking cancers. This study introduces the use of Perlin-based phantoms to optimize the acquisition geometry of a novel DBT prototype. These phantoms were developed using a GPU implementation of a novel library called Perlin-CuPy. The breast anatomy is simulated using 3D models under mammography cranio-caudal compression. In total, 240 phantoms were created using compressed breast thickness, chest-wall to nipple distance, and skin thickness that varied in a {[35, 75], [59, 130), [1.0, 2.0]} mm interval, respectively. DBT projections and reconstructions of the phantoms were simulated using two acquisition geometries of our DBT prototype. The performance of both acquisition geometries was compared using breast volume segmentations of the Perlin phantoms. Results show that breast volume estimates are improved with the introduction of posterior-anterior motion of the x-ray source in DBT acquisitions. The breast volume is overestimated in DBT, varying substantially with the acquisition geometry; segmentation errors are more evident for thicker and larger breasts. These results provide additional evidence and suggest that custom acquisition geometries can improve the performance and accuracy in DBT. Perlin phantoms help to identify limitations in acquisition geometries and to optimize the performance of the DBT prototypes.
ABSTRACT
Our lab has built a next-generation tomosynthesis (NGT) system utilizing scanning motions with more degrees of freedom than clinical digital breast tomosynthesis systems. We are working toward designing scanning motions that are customized around the locations of suspicious findings. The first step in this direction is to demonstrate that these findings can be detected with a single projection image, which can guide the remainder of the scan. This paper develops an automated method to identify findings that are prone to be masked. Perlin-noise phantoms and synthetic lesions were used to simulate masked cancers. NGT projections of phantoms were simulated using ray-tracing software. The risk of masking cancers was mapped using the ground-truth labels of phantoms. The phantom labels were used to denote regions of low and high risk of masking suspicious findings. A U-Net model was trained for multiclass segmentation of phantom images. Model performance was quantified with a receiver operating characteristic (ROC) curve using area under the curve (AUC). The ROC operating point was defined to be the point closest to the upper left corner of ROC space. The output predictions showed an accurate segmentation of tissue predominantly adipose (mean AUC of 0.93). The predictions also indicate regions of suspicious findings; for the highest risk class, mean AUC was 0.89, with a true positive rate of 0.80 and a true negative rate of 0.83 at the operating point. In summary, this paper demonstrates with virtual phantoms that a single projection can indeed be used to identify suspicious findings.
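The operating-point rule described above (the ROC point closest to the upper left corner) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function name `closest_to_corner` and the ROC points are hypothetical, with one point chosen to echo the reported true positive rate of 0.80 and true negative rate of 0.83.

```python
import math


def closest_to_corner(fpr, tpr, thresholds):
    """Return (threshold, fpr, tpr) for the ROC point nearest to (FPR=0, TPR=1)."""
    return min(
        zip(thresholds, fpr, tpr),
        # Euclidean distance from (fpr, tpr) to the ideal corner (0, 1).
        key=lambda point: math.hypot(point[1], 1.0 - point[2]),
    )


# Hypothetical ROC curve samples (not taken from the paper).
fpr = [0.0, 0.10, 0.17, 0.40, 1.0]
tpr = [0.0, 0.50, 0.80, 0.90, 1.0]
thresholds = [1.0, 0.8, 0.6, 0.4, 0.0]

thr, op_fpr, op_tpr = closest_to_corner(fpr, tpr, thresholds)
op_tnr = 1.0 - op_fpr  # true negative rate at the operating point
```

Other operating-point criteria exist (for example, maximizing TPR minus FPR); the distance-to-corner rule is the one named in the abstract.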
ABSTRACT
BACKGROUND: The differentiation process from stem cells to fully differentiated cell types is controlled by the interplay of chromatin modifications and transcription factor activity. Histone modifications or transcription factors frequently act in a multifunctional manner, with a given DNA motif or histone modification conveying both transcriptional repression and activation depending on its location in the promoter and on the other regulatory signals surrounding it. RESULTS: To account for the possible multifunctionality of regulatory signals, we model the observed gene expression patterns by a mixture of linear regression models. We apply the approach to identify the underlying histone modifications and transcription factors guiding gene expression of differentiated CD4+ T cells. The method improves gene expression prediction relative to a single linear model, as often used by previous approaches. Moreover, it recovered the known role of the modifications H3K4me3 and H3K27me3 in activating cell-specific genes and of some transcription factors related to CD4+ T cell differentiation.
Subjects
CD4-Positive T-Lymphocytes/cytology , Cell Differentiation , Histones/metabolism , Transcription Factors/metabolism , Bayes Theorem , CD4-Positive T-Lymphocytes/metabolism , DNA/genetics , DNA/metabolism , Gene Expression Regulation , Histones/genetics , Linear Models , Protein Binding , Transcription Factors/genetics
ABSTRACT
Non-coding RNAs (ncRNAs) are important players in the cellular regulation of organisms from different kingdoms. One of the key steps in ncRNA research is the ability to distinguish coding from non-coding sequences. We applied seven machine learning algorithms (Naive Bayes, SVM, KNN, Random Forest, XGBoost, ANN and DL) across 15 model organisms from different evolutionary branches. We then created a stand-alone and web server tool (RNAmining) to distinguish coding and non-coding sequences, selecting the algorithm with the best performance (XGBoost). First, we used coding/non-coding sequences downloaded from Ensembl (April 14th, 2020). The coding and non-coding sets were then balanced, their trinucleotide counts were computed, and the counts were normalized by sequence length. In total, we built 180 models. All machine learning algorithms were tested using 10-fold cross-validation, and we selected the algorithm with the best results (XGBoost) for implementation in RNAmining. The best F1-scores ranged from 97.56% to 99.57% depending on the organism. Moreover, we benchmarked RNAmining against other tools from the literature (CPAT, CPC2, RNAcon and Transdecoder), and it outperformed them, opening opportunities for the further development of RNAmining, which is freely available at https://rnamining.integrativebioinformatics.me/.
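The feature extraction described above (trinucleotide counts normalized by sequence length) might look like the sketch below. It is not RNAmining's code: the function name `trinucleotide_frequencies`, the use of overlapping windows, and the handling of ambiguous bases are assumptions.

```python
from itertools import product


def trinucleotide_frequencies(seq):
    """Count overlapping trinucleotides and divide each count by the sequence length."""
    seq = seq.upper().replace("U", "T")  # treat RNA input as DNA letters
    counts = {"".join(p): 0 for p in product("ACGT", repeat=3)}
    for i in range(len(seq) - 2):
        kmer = seq[i:i + 3]
        if kmer in counts:  # windows with ambiguous bases (e.g. N) are skipped
            counts[kmer] += 1
    length = len(seq)
    return {k: (c / length if length else 0.0) for k, c in counts.items()}


# Toy sequence for illustration; real inputs would be Ensembl transcripts.
features = trinucleotide_frequencies("ATGATG")
```

Each sequence yields a fixed-length vector of 64 values, which is the kind of input a classifier such as XGBoost can consume.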
Subjects
Machine Learning , RNA , Algorithms , Bayes Theorem , Support Vector Machine
ABSTRACT
OBJECTIVE: Data normalization and clustering are mandatory steps in gene expression and downstream analyses, respectively. However, user-friendly implementations of these methodologies are available only under expensive licensing agreements or as stand-alone scripts, which is a major obstacle for users with limited computational skills. RESULTS: We developed an online tool called CORAZON (Correlations Analyses Zipper Online), which implements three unsupervised learning methods to cluster gene expression datasets in a friendly environment. It supports eight gene expression normalization/transformation methodologies and the attribute's influence. Normalizations that require the gene length can be applied only to RNA-seq data, whereas the others can be used with microarray and/or NanoString data. The performance of the clustering methodologies was evaluated with five models, yielding accuracies between 92% and 100%. We applied our tool to obtain functional insights into non-coding RNAs (ncRNAs) based on Gene Ontology enrichment of clusters in a dataset generated by the ENCODE project. The clusters in which the majority of transcripts are coding genes were enriched in the Cellular, Metabolic, Transports, and Systems Development categories, whereas the ncRNAs were enriched in the Detection of Stimulus, Sensory Perception, Immunological System, and Digestion categories. The CORAZON source code is freely available at https://gitlab.com/integrativebioinformatics/corazon and the web server can be accessed at http://corazon.integrativebioinformatics.me.
Subjects
Computers , Software , Cluster Analysis , Gene Expression Profiling , Gene Ontology , Internet , RNA, Untranslated
ABSTRACT
Genomic Islands (GIs) are regions of bacterial genomes that are acquired from other organisms by horizontal transfer. These regions are often responsible for many important acquired adaptations of the bacteria, with great impact on their evolution and behavior. These adaptations are usually associated with pathogenicity, antibiotic resistance, degradation and metabolism, so identifying such regions is of medical and industrial interest. For this reason, different approaches for genomic island prediction have been proposed. However, none of them can precisely predict the complete repertoire of GIs in a genome. The difficulties arise because the performance of the different algorithms changes with the variety of nucleotide distributions across species. In this paper, we present a novel method to predict GIs that is built upon the mean shift clustering algorithm. It does not require any information regarding the number of clusters, and the bandwidth parameter is automatically calculated based on a heuristic approach. The method was implemented in a new user-friendly tool named MSGIP (Mean Shift Genomic Island Predictor). Genomes of bacteria with GIs discussed in other papers were used to evaluate the proposed method. The tool recovered the same GIs predicted by other methods and also novel, previously unpredicted islands. A detailed investigation of the features related to typical GI elements inserted in these new regions confirmed its effectiveness. Stand-alone and user-friendly versions of this new methodology are available at http://msgip.integrativebioinformatics.me.
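To illustrate why mean shift needs no preset number of clusters, here is a minimal one-dimensional sketch with a flat kernel. It is not MSGIP's implementation: the function name `mean_shift_1d`, the per-window GC fractions used as input, and the hand-picked bandwidth (MSGIP derives its bandwidth heuristically) are all assumptions.

```python
def mean_shift_1d(points, bandwidth, tol=1e-6, max_iter=100):
    """Flat-kernel mean shift on scalars; returns (cluster centers, label per point)."""
    modes = []
    for x in points:
        for _ in range(max_iter):
            neighbors = [p for p in points if abs(p - x) <= bandwidth]
            if not neighbors:  # guard against floating-point edge cases
                break
            shifted = sum(neighbors) / len(neighbors)
            if abs(shifted - x) < tol:
                break
            x = shifted
        modes.append(x)

    # Merge converged modes that lie close together into shared centers.
    centers, labels = [], []
    for m in modes:
        for i, c in enumerate(centers):
            if abs(m - c) <= bandwidth / 2:
                labels.append(i)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return centers, labels


# Hypothetical GC fractions of six genome windows; bandwidth chosen by hand.
gc_content = [0.30, 0.31, 0.32, 0.50, 0.51, 0.52]
centers, labels = mean_shift_1d(gc_content, bandwidth=0.05)
```

The number of clusters emerges from how many distinct modes the points converge to, so only the bandwidth has to be chosen.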