ABSTRACT
Organ-specific gene expression datasets comprising hundreds to thousands of experiments allow the reconstruction of organ-level gene regulatory networks (GRNs). However, creating such datasets is greatly hampered by the need for extensive and tedious manual curation. Here, we trained a supervised classification model that accurately classifies the organ of origin of a plant transcriptome. This K-Nearest Neighbor-based multiclass classifier was used to create organ-specific gene expression datasets for the leaf, root, shoot, flower, and seed in Arabidopsis thaliana. A GRN inference approach was then used to determine (i) the influential transcription factors (TFs) in each organ and (ii) the most influential TFs for specific biological processes in that organ. These genome-wide, organ-delimited GRNs (OD-GRNs) recalled many known regulators of organ development and of processes operating in those organs. Importantly, many previously unknown TFs were uncovered as potential regulators of these processes. As a proof of concept, we focused on experimentally validating the predicted TF regulators of lipid biosynthesis in seeds, an important food and biofuel trait. Of the top 20 predicted TFs, eight are known regulators of seed oil content, e.g., WRI1, LEC1, and FUS3. We further validated our predictions of MybS2, TGA4, SPL12, AGL18, and DiV2 as regulators of seed lipid biosynthesis. We elucidated the molecular mechanism of MybS2 and showed that it induces purple acid phosphatase family genes and lipid synthesis genes to enhance seed lipid content. This general approach can potentially be extended to any species with sufficiently large gene expression datasets to find unique regulators of any trait of interest.