*Plant Genome ; 13(3): e20034, 2020 Nov.*

##### ABSTRACT

Wheat quality improvement is an important objective in all wheat breeding programs. However, due to the cost, time and quantity of seed required, wheat quality is typically analyzed only in the last stages of the breeding cycle on a limited number of samples. Genomic prediction could make selection for wheat quality much more efficient by reducing the cost and time required for this analysis. Here, we evaluated the prediction performance for 13 wheat quality traits under two multi-trait models (Bayesian multi-trait multi-environment [BMTME] and multi-trait ridge regression [MTR]) using five data sets of wheat lines evaluated in the field during two consecutive years. Lines in the second year (testing) were predicted using the quality information obtained in the first year (training). Moderate to high prediction accuracies were found for most quality traits, suggesting that the use of genomic selection could be feasible. The best predictions were obtained with the BMTME model for all traits and the worst with the MTR model. Under the mean arctangent absolute percentage error (MAAPE), the best predictions with the BMTME model were for test weight across the five data sets, whereas the worst were for the alveograph trait ALVPL. In contrast, under Pearson's correlation, the best predictions depended on the data set. These results suggest that the BMTME model should be preferred for multi-trait prediction analyses, since it captures not only the correlations among traits but also the correlations among environments, helping to increase prediction accuracy.
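The abstract does not define MAAPE; as a point of reference, a minimal sketch of the metric as it is commonly defined (mean of the arctangent of the absolute percentage error) could look like this:

```python
import math

def maape(observed, predicted):
    """Mean arctangent absolute percentage error (MAAPE).

    Bounded in [0, pi/2]; unlike MAPE, it stays finite when an
    observed value is close to zero.
    """
    errors = [
        math.atan(abs((y - yhat) / y))
        for y, yhat in zip(observed, predicted)
    ]
    return sum(errors) / len(errors)

# Perfect predictions give a MAAPE of 0; larger values mean a worse fit.
print(maape([78.1, 79.4, 80.2], [78.1, 79.4, 80.2]))  # 0.0
print(round(maape([78.1, 79.4, 80.2], [76.0, 81.0, 79.5]), 4))
```

The test-weight values above are illustrative, not taken from the paper's data.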

*Plant Genome ; 13(3): e20033, 2020 Nov.*

##### ABSTRACT

When including genotype × environment interactions (G × E) in genomic prediction models, Hadamard or Kronecker products have been used to model the covariance structure of the interactions. The relation between these two types of modeling has not been made clear in the genomic prediction literature. Here, we demonstrate that a certain model based on a Hadamard formulation and another using the Kronecker product lead to exactly the same statistical model. Moreover, we illustrate how a multiplication of entries of covariance matrices is related to modeling locus × environmental-variable interactions explicitly. Finally, we use a wheat and a maize data set to illustrate that the environmental covariance E can be specified easily, even if no information on environmental variables - such as temperature or precipitation - is available. Given that lines have been tested in different environments, the corresponding environmental covariance can simply be estimated from the training set as the phenotypic covariance between environments. To achieve a large increase in predictive ability, the environmental covariance has to be defined appropriately, and records on the performance of the lines of the test set under different environmental conditions have to be included in the training set.
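The equivalence of the two formulations can be checked numerically on a toy example. With rows ordered line-by-environment in a fully crossed layout, the Hadamard product of the "expanded" line and environment covariances coincides with their Kronecker product (the 2×2 covariances below are hypothetical, not from the paper's data):

```python
def kron(A, B):
    """Kronecker product of two matrices given as nested lists."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]

def hadamard(A, B):
    """Elementwise (Hadamard) product of two equally sized matrices."""
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

G = [[1.0, 0.5], [0.5, 1.0]]   # line (genomic) covariance
E = [[1.0, 0.3], [0.3, 1.0]]   # environmental covariance
J = [[1.0, 1.0], [1.0, 1.0]]   # all-ones matrix of matching size

# Expanded covariances for a fully crossed line-by-environment ordering:
lhs = hadamard(kron(G, J), kron(J, E))  # Hadamard formulation
rhs = kron(G, E)                        # Kronecker formulation
print(lhs == rhs)  # True
```

Entry-wise, both sides evaluate to G[i][j] * E[k][l], which is why the two parameterizations yield the same interaction covariance.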

*Plant Genome ; 13(2): e20021, 2020 Jul.*

##### ABSTRACT

Linear and non-linear models used in applications of genomic selection (GS) can fit different types of responses (e.g., continuous, ordinal, binary). In recent years, several genomic-enabled prediction models have been developed for predicting complex traits in genomic-assisted animal and plant breeding. These include linear, non-linear and non-parametric models, mostly for continuous responses and less frequently for categorical responses. Several linear and non-linear models are special cases of a more general family of statistical models known as artificial neural networks, which can provide better prediction ability than other models. In this paper, we propose a Bayesian Regularized Neural Network (BRNNO) for modeling ordinal data. The proposed model was fitted in a Bayesian framework, using the data augmentation algorithm to facilitate computations, and was implemented with the Gibbs Maximum a Posteriori and Generalized EM algorithms by combining code written in the C and R programming languages. The new model was tested on two real maize datasets evaluated for Septoria and GLS diseases and was compared with the Bayesian Ordered Probit Model (BOPM). Results indicated that the BRNNO model performed better in terms of genomic-based prediction than the BOPM model.

*Theor Appl Genet ; 2020 Oct 10.*

##### ABSTRACT

KEY MESSAGE: Historical data from breeding programs can be efficiently used to improve genomic selection accuracy, especially when the training set is optimized to subset individuals most informative of the target testing set. The current strategy for large-scale implementation of genomic selection (GS) at the International Maize and Wheat Improvement Center (CIMMYT) global maize breeding program has been to train models using information from full-sibs in a "test-half-predict-half approach." Although effective, this approach has limitations, as it requires large full-sib populations and limits the ability to shorten variety testing and breeding cycle times. The primary objective of this study was to identify optimal experimental and training set designs to maximize prediction accuracy of GS in CIMMYT's maize breeding programs. Training set (TS) design strategies were evaluated to determine the most efficient use of phenotypic data collected on relatives for genomic prediction (GP) using datasets containing 849 (DS1) and 1389 (DS2) DH-lines evaluated as testcrosses in 2017 and 2018, respectively. Our results show there is merit in the use of multiple bi-parental populations as TS when selected using algorithms to maximize relatedness between the training and prediction sets. In a breeding program where relevant past breeding information is not readily available, the phenotyping expenditure can be spread across connected bi-parental populations by phenotyping only a small number of lines from each population. This significantly improves prediction accuracy compared to within-population prediction, especially when the TS for within full-sib prediction is small. Finally, we demonstrate that prediction accuracy in either sparse testing or "test-half-predict-half" can further be improved by optimizing which lines are planted for phenotyping and which lines are to be only genotyped for advancement based on GP.

*Nat Commun ; 11(1): 4876, 2020 09 25.*

##### ABSTRACT

In most crops, genetic and environmental factors interact in complex ways giving rise to substantial genotype-by-environment interactions (G×E). We propose that computer simulations leveraging field trial data, DNA sequences, and historical weather records can be used to tackle the longstanding problem of predicting cultivars' future performances under largely uncertain weather conditions. We present a computer simulation platform that uses Monte Carlo methods to integrate uncertainty about future weather conditions and model parameters. We use extensive experimental wheat yield data (n = 25,841) to learn G×E patterns and validate, using left-trial-out cross-validation, the predictive performance of the model. Subsequently, we use the fitted model to generate circa 143 million grain yield data points for 28 wheat genotypes in 16 locations in France, over 16 years of historical weather records. The phenotypes generated by the simulation platform have multiple downstream uses; we illustrate this by predicting the distribution of expected yield at 448 cultivar-location combinations and performing means-stability analyses.

##### Subjects

Computer Simulation , Crops, Agricultural/genetics , Genotype , Uncertainty , Weather , Agriculture/methods , DNA, Plant , Edible Grain/genetics , France , Gene-Environment Interaction , Models, Genetic , Phenotype , Triticum/genetics

*G3 (Bethesda) ; 10(11): 4083-4102, 2020 Nov 05.*

##### ABSTRACT

Due to the ever-increasing data collected in genomic breeding programs, there is a need for genomic prediction models that can deal better with big data. For this reason, we propose here a Maximum a Posteriori Threshold Genomic Prediction (MAPT) model for ordinal traits that is more efficient than the conventional Bayesian Threshold Genomic Prediction model. MAPT performs the predictions of the Threshold Genomic Prediction model using the maximum a posteriori estimates of the parameters, that is, the parameter values that maximize the joint posterior density. We compared the prediction performance of the proposed MAPT with the conventional Bayesian Threshold Genomic Prediction model, multinomial ridge regression and a support vector machine on eight real data sets. We found that the proposed MAPT was competitive with the multinomial and support vector machine models in terms of prediction performance, and slightly better than the conventional Bayesian Threshold Genomic Prediction model. With regard to implementation time, MAPT and the support vector machine were generally the fastest, while the multinomial ridge regression model was the slowest. However, it is important to point out that successful implementation of the proposed MAPT model depends on informative priors to avoid underestimation of variance components.

*G3 (Bethesda) ; 10(11): 4177-4190, 2020 Nov 05.*

##### ABSTRACT

Genomic selection (GS) is a revolutionary paradigm for developing new plants and animals. It is a predictive methodology, since it uses learning methods to perform its task. Unfortunately, there is no universal model that can be used for all types of predictions; for this reason, specific methodologies are required for each type of outcome (response variable). Since efficient methodologies for multivariate count outcomes are lacking, in this paper a multivariate Poisson deep neural network (MPDN) model is proposed for the genomic prediction of several count outcomes simultaneously. The MPDN model uses the minus log-likelihood of a Poisson distribution as its loss function, rectified linear unit (ReLU) activations in the hidden layers to capture nonlinear patterns, and an exponential activation in the output layer to produce outputs on the same scale as the counts. The proposed MPDN model was compared to conventional generalized Poisson regression models and univariate Poisson deep learning models on two experimental count data sets. We found that the proposed MPDN outperformed the univariate Poisson deep neural network models, but did not outperform, in terms of prediction, the univariate generalized Poisson regression models. All deep learning models were implemented with TensorFlow as back-end and Keras as front-end, which allows these models to be fitted on moderate and large data sets, a significant advantage over previous GS models for multivariate count data.
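As a minimal sketch of the loss described above (not the paper's full Keras network): with an exponential output activation, the predicted mean is mu = exp(eta), and the Poisson minus log-likelihood, dropping the constant log(y!) term, reduces to exp(eta) - y * eta:

```python
import math

def poisson_nll(counts, eta):
    """Minus log-likelihood of a Poisson model, dropping the log(y!) term.

    `eta` is the network output before the exponential activation,
    so the predicted mean is mu = exp(eta).
    """
    return sum(math.exp(e) - y * e for y, e in zip(counts, eta))

# The loss decreases as exp(eta) approaches the observed count:
counts = [3.0]
better = poisson_nll(counts, [math.log(3.0)])  # predicted mean mu = 3
worse = poisson_nll(counts, [math.log(1.0)])   # predicted mean mu = 1
print(better < worse)  # True
```

Minimizing this quantity over the network weights is what training such a model amounts to; the ReLU hidden layers only change how eta depends on the markers.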

*Heredity (Edinb) ; 2020 Aug 27.*

##### ABSTRACT

Modern whole-genome prediction (WGP) frameworks that focus on multi-environment trials (MET) integrate large-scale genomics, phenomics, and envirotyping data. However, the more complex the statistical model, the longer the computational processing time, which does not always result in accuracy gains. We investigated the use of new kernel methods and modeling structures involving genomic and nongenomic sources of variation in two MET maize data sets. Five WGP models were considered, advancing in complexity from a main-effect additive model (A) to more complex structures, including dominance deviations (D), genotype × environment interaction (AE and DE), and the reaction-norm model using environmental covariables (W) and their interaction with A and D (AW + DW). Combinations of those models built with three different kernel methods, the Gaussian kernel (GK), the Deep kernel (DK), and the benchmark genomic best linear unbiased predictor (GBLUP/GB), were tested under three prediction scenarios: newly developed hybrids (CV1), sparse MET conditions (CV2), and new environments (CV0). GK and DK outperformed GB in prediction accuracy and in reduction of computation time (up to ~20%) under all model-kernel scenarios. GK was more efficient in capturing the variation due to A + AE and D + DE effects and translating it into accuracy gains (up to ~85% compared with GB). DK provided more consistent predictions, even for more complex structures such as W + AW + DW. Our results suggest that DK and GK are more efficient at translating model complexity into accuracy, and more suitable for including dominance and reaction-norm effects in a biologically accurate and faster way.
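For readers unfamiliar with the GK method named above, one common form of the Gaussian kernel in genomic prediction sets K_ij = exp(-h * d_ij^2 / q), where d_ij^2 is the squared Euclidean distance between marker profiles, q is a scaling constant (the median pairwise distance is a frequent choice), and h is a bandwidth parameter. This sketch uses hypothetical 0/1/2-coded markers, not the paper's data:

```python
import math

def gaussian_kernel(markers, h=1.0):
    """Gaussian kernel from marker profiles: K_ij = exp(-h * d_ij^2 / q),
    where d_ij^2 is the squared Euclidean distance between lines i and j
    and q is the median off-diagonal distance (a common scaling choice)."""
    n = len(markers)
    d2 = [[sum((a - b) ** 2 for a, b in zip(markers[i], markers[j]))
           for j in range(n)] for i in range(n)]
    offdiag = sorted(d2[i][j] for i in range(n) for j in range(n) if i != j)
    q = offdiag[len(offdiag) // 2] or 1.0  # guard against a zero median
    return [[math.exp(-h * d2[i][j] / q) for j in range(n)] for i in range(n)]

# Three hypothetical lines with three markers each:
K = gaussian_kernel([[0, 1, 2], [0, 1, 1], [2, 2, 0]])
print(K[0][0])            # 1.0: each line is at distance 0 from itself
print(K[0][1] > K[0][2])  # True: closer lines get larger kernel values
```

The kernel matrix K then plays the same role as the GBLUP relationship matrix in the mixed model, but with similarity decaying nonlinearly in marker distance.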

*Theor Appl Genet ; 133(11): 3101-3117, 2020 Nov.*

##### ABSTRACT

KEY MESSAGE: Comparative assessment identified the naïve interaction model, and the naïve and informed interaction GS models, as suitable for achieving higher prediction accuracy in groundnut, keeping in mind the high genotype × environment interaction for complex traits. Genomic selection (GS) can be an efficient and cost-effective breeding approach that captures both small- and large-effect genetic factors and therefore promises to achieve higher genetic gains for complex traits such as yield and oil content in groundnut. A training population was constituted with 340 elite lines, followed by genotyping with the 58 K 'Axiom_Arachis' SNP array and phenotyping for key agronomic traits at three locations in India. Four GS models were tested using three different random cross-validation schemes (CV0, CV1 and CV2): (1) model 1 (M1 = E + L), which includes the main effects of environment (E) and line (L); (2) model 2 (M2 = E + L + G), which includes the main effects of markers (G) in addition to E and L; (3) model 3 (M3 = E + L + G + GE), a naïve interaction model; and (4) model 4 (M4 = E + L + G + LE + GE), a naïve and informed interaction model. Prediction accuracies estimated for the four models indicated a clear advantage of including marker information, reflected in the better prediction accuracy achieved with models M2, M3 and M4 compared with M1. High prediction accuracies (> 0.600) were observed for days to 50% flowering, days to maturity, hundred seed weight, oleic acid, rust@90 days, rust@105 days and late leaf spot@90 days, while medium prediction accuracies (0.400-0.600) were obtained for pods/plant, shelling %, and total yield/plant. Assessment of the comparative prediction accuracy of the different GS models for selecting untested genotypes and for unobserved and unevaluated environments provided greater insight into the potential application of GS breeding in groundnut.

*Theor Appl Genet ; 133(10): 2869-2879, 2020 Oct.*

##### ABSTRACT

KEY MESSAGE: Genomic selection with a multiple-year training population dataset could accelerate early-stage testcross testing by skipping the first-stage yield testing, which significantly saves the time and cost of early-stage testcross testing. With the development of doubled haploid (DH) technology, a main task for a maize breeder is to estimate the breeding values of thousands of DH lines annually. In early-stage testcross testing, genomic selection (GS) offers the opportunity to replace expensive multiple-environment phenotyping and phenotypic selection with lower-cost genotyping and genomic estimated breeding value (GEBV)-based selection. In the present study, a total of 1528 maize DH lines, phenotyped in multiple-environment trials in three consecutive years and genotyped with the low-cost per-sample rAmpSeq genotyping platform, were used to explore how to implement GS to accelerate early-stage testcross testing. Results showed that the average prediction accuracy estimated from the cross-validation schemes was above 0.60 across all scenarios. The average prediction accuracies estimated from the independent validation schemes ranged from 0.23 to 0.32 across all scenarios when a one-year dataset was used as the training population (TRN) to predict another year's data as the testing population (TST). The average prediction accuracies increased to between 0.31 and 0.42 across all scenarios when two-year datasets were used as the TRN, and to between 0.50 and 0.56 when the TRN consisted of two years of breeding data plus 50% of the third year's data converted from TST to TRN. These results show that GS with a multiple-year TRN set offers the opportunity to accelerate early-stage testcross testing by skipping the first-stage yield testing, which significantly saves time and cost.

*G3 (Bethesda) ; 10(8): 2629-2639, 2020 Aug 05.*

##### ABSTRACT

Zinc (Zn) deficiency is a major risk factor for human health, affecting about 30% of the world's population. To study the potential of genomic selection (GS) for maize with increased Zn concentration, an association panel and two doubled haploid (DH) populations were evaluated in three environments. Three genomic prediction models (M1: Environment + Line; M2: Environment + Line + Genomic; M3: Environment + Line + Genomic + Genomic × Environment), incorporating the main effects of line and genomic information and the genomic × environment interaction (G × E), were assessed to estimate the prediction ability (rMP) of each model. Two distinct cross-validation (CV) schemes simulating two genomic prediction breeding scenarios were used: CV1 predicts the performance of newly developed lines, whereas CV2 predicts the performance of lines tested in sparse multi-location trials. Prediction abilities for Zn in CV1 ranged from -0.01 to 0.56 for DH1, 0.04 to 0.50 for DH2 and -0.001 to 0.47 for the association panel. For CV2, rMP values ranged from 0.67 to 0.71 for DH1, 0.40 to 0.56 for DH2 and 0.64 to 0.72 for the association panel. The genomic prediction model that included G × E had the highest average rMP under both CV1 (0.39 and 0.44) and CV2 (0.71 and 0.51) for the association panel and the DH2 population, respectively. These results suggest that GS has the potential to accelerate breeding for enhanced kernel Zn concentration by facilitating the selection of superior genotypes.

*Theor Appl Genet ; 133(9): 2743-2758, 2020 Sep.*

##### ABSTRACT

KEY MESSAGE: The expectation and variance of the estimator of the maximized index selection response allow breeders to construct confidence intervals and to complete the analysis of a selection process. The maximized selection response and the correlation of the linear selection index (LSI) with the net genetic merit are the main criteria for comparing the efficiency of any LSI. The estimator of the maximized selection response is the square root of the variance of the estimated LSI values multiplied by the selection intensity. The expectation and variance of this estimator allow the breeder to construct confidence intervals and determine the appropriate sample size to complete the analysis of a selection process. Assuming that the estimated LSI values are normally distributed, we obtained these two parameters as follows. First, using the Fourier transform, we found the distribution of the variance of the estimated LSI values, which was a Gamma distribution; the expectation and variance of this distribution are therefore the expectation and variance of the variance of the estimated LSI values. Second, with these results, we obtained the expectation and variance of the estimator of the selection response using the Delta method. We validated the theoretical results in the phenotypic selection context using real and simulated datasets. With the simulated datasets, we compared LSI efficiency when the genotypic covariance matrix is known versus when it is estimated; the differences were not significant. We conclude that our results are valid for any LSI with a normal distribution and that the method described in this work is useful for finding the expectation and variance of the estimator of any LSI response in the phenotypic or genomic selection context.

*G3 (Bethesda) ; 10(8): 2725-2739, 2020 Aug 05.*

##### ABSTRACT

"Sparse testing" refers to reduced multi-environment breeding trials in which not all genotypes of interest are grown in each environment. Using genomic-enabled prediction and a model embracing genotype × environment interaction (GE), the non-observed genotype-in-environment combinations can be predicted. Consequently, the overall costs can be reduced and the testing capacities can be increased. The accuracy of predicting the unobserved data depends on different factors, including (1) how many genotypes overlap between environments, (2) in how many environments each genotype is grown, and (3) which prediction method is used. In this research, we studied the predictive ability obtained when using a fixed number of plots and different sparse testing designs. The considered designs included the extreme cases of (1) no overlap of genotypes between environments and (2) complete overlap of the genotypes between environments; in the latter case, the prediction set consists entirely of genotypes that have not been tested at all. Moreover, we gradually move from one extreme to the other by considering (3) intermediate cases with varying numbers of non-overlapping (NO) and overlapping (O) genotypes. The empirical study is built upon two maize hybrid data sets consisting of different genotypes crossed to two testers (T1 and T2); each data set was analyzed separately. For each set, phenotypic records on yield from three environments are available. Three prediction models were implemented: two main-effects models (M1 and M2), and a model (M3) including GE. The results showed that the genome-based model including GE (M3) captured more phenotypic variation than the models without this component. Also, M3 provided higher prediction accuracy than models M1 and M2 for the different allocation scenarios.
Reducing the size of the calibration sets decreased the prediction accuracy under all allocation designs, with M3 being the least affected model; however, with the genome-enabled models (i.e., M2 and M3), predictive ability is recovered when more genotypes are tested across environments. Our results indicate that a substantial part of the testing resources can be saved when using genome-based models including GE to optimize sparse testing designs.
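The O/NO allocation idea described above can be sketched with a purely illustrative helper (the paper's designs fix the total number of plots and assign real genotypes; the round-robin scheme and genotype ids here are assumptions for the example):

```python
def sparse_allocation(n_genotypes, environments, n_overlap):
    """Toy sparse-testing plan: the first `n_overlap` genotypes are
    overlapping (O, tested in every environment); the rest are
    non-overlapping (NO, each tested in exactly one environment).
    Returns {environment: sorted list of genotype ids 0..n-1}."""
    overlap = list(range(n_overlap))
    rest = list(range(n_overlap, n_genotypes))
    plan = {}
    for k, env in enumerate(environments):
        # round-robin: every len(environments)-th NO genotype goes to env k
        plan[env] = overlap + rest[k::len(environments)]
    return plan

plan = sparse_allocation(12, ["E1", "E2", "E3"], n_overlap=3)
print(plan["E1"])  # [0, 1, 2, 3, 6, 9]
```

With 12 genotypes, 3 environments and 3 overlapping genotypes, this uses 18 plots instead of the 36 a fully crossed trial would need; the unobserved combinations are what the GE model is asked to predict.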

*Sci Rep ; 10(1): 8195, 2020 05 18.*

##### ABSTRACT

High-throughput phenotyping (HTP) technologies can produce data on thousands of phenotypes per unit being monitored. These data can be used to breed for economically and environmentally relevant traits (e.g., drought tolerance); however, incorporating high-dimensional phenotypes into genetic analyses and breeding schemes poses important statistical and computational challenges. To address this problem, we developed regularized selection indices; the methodology integrates techniques commonly used in high-dimensional phenotypic regressions (including penalization and rank-reduction approaches) into the selection index (SI) framework. Using extensive data from the International Maize and Wheat Improvement Center's (CIMMYT) wheat breeding program, we show that regularized SIs derived from hyper-spectral data offer consistently higher accuracy for grain yield than standard SIs and than vegetation indices commonly used to predict agronomic traits. Regularized SIs offer an effective approach to leveraging the HTP data routinely generated in agriculture; the methodology can also be used to conduct genetic studies using high-dimensional phenotypes often collected in humans and model organisms, including body images and whole-genome gene expression profiles.

*G3 (Bethesda) ; 10(6): 2087-2101, 2020 06 01.*

##### ABSTRACT

A combined multistage linear genomic selection index (CMLGSI) is a linear combination of phenotypic and genomic estimated breeding values useful for predicting the individual net genetic merit, which in turn is a linear combination of the true unobservable breeding values of the traits weighted by their respective economic values. The CMLGSI is a cost-saving strategy for improving multiple traits because the breeder does not need to measure all traits at each stage. The optimum (OCMLGSI) and decorrelated (DCMLGSI) indices are the main CMLGSIs. Whereas the OCMLGSI takes into consideration the index correlation values among stages, the DCMLGSI imposes the restriction that the index correlation values among stages be zero. Using real and simulated datasets, we compared the efficiency of both indices in a two-stage context. The criteria we applied to compare the efficiency of both indices were that the total selection response of each index must be lower than or equal to the single-stage combined linear genomic selection index (CLGSI) response and that the correlation of each index with the net genetic merit should be maximum. Using four different total proportions for the real dataset, the estimated total OCMLGSI and DCMLGSI responses explained 97.5% and 90%, respectively, of the estimated single-stage CLGSI selection response. In addition, at stage two, the estimated correlations of the OCMLGSI and the DCMLGSI with the net genetic merit were 0.84 and 0.63, respectively. We found similar results for the simulated datasets. Thus, we recommend using the OCMLGSI when performing multistage selection.

*Theor Popul Biol ; 132: 16-23, 2020 04.*

##### ABSTRACT

Whole-genome epistasis models with interactions between different loci can be approximated by genomic relationship models based on Hadamard powers of the additive genomic relationship matrix. We illustrate that the quality of this approximation decreases as the degree of interaction d increases. Moreover, considering relationship models defined as weighted sums of interactions of different degrees, we investigate how the decreasing quality of approximation of the summands affects the approximation of the weighted sum. Our results indicate that these approximations remain reliable, but their quality deteriorates when the weights of higher-degree interactions do not decrease quickly.
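The construction discussed above, the d-th Hadamard power of the additive relationship matrix as a proxy for degree-d epistasis, and the weighted sum over degrees, can be sketched as follows (the 2×2 matrix and weights are hypothetical, chosen only to make the arithmetic easy to follow):

```python
def hadamard_power(G, d):
    """Elementwise d-th power of a relationship matrix, G^(od)."""
    return [[g ** d for g in row] for row in G]

def weighted_epistatic_kernel(G, weights):
    """Weighted sum  sum_d w_d * G^(od)  over degrees d = 1, 2, ...,
    a toy version of the relationship models discussed above."""
    n = len(G)
    K = [[0.0] * n for _ in range(n)]
    for d, w in enumerate(weights, start=1):
        Gd = hadamard_power(G, d)
        for i in range(n):
            for j in range(n):
                K[i][j] += w * Gd[i][j]
    return K

G = [[1.0, 0.5], [0.5, 1.0]]                        # additive relationship
K = weighted_epistatic_kernel(G, [1.0, 0.5, 0.25])  # quickly decaying weights
print(K[0][1])  # 1*0.5 + 0.5*0.25 + 0.25*0.125 = 0.65625
```

Because each Hadamard power shrinks off-diagonal entries toward zero while leaving the diagonal (here equal to 1) unchanged, quickly decaying weights keep the higher-degree terms, where the approximation is weakest, from dominating the sum.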

*Front Plant Sci ; 10: 1311, 2019.*

##### ABSTRACT

Although durum wheat (Triticum turgidum var. durum Desf.) is a minor cereal crop representing just 5-7% of the world's total wheat crop, it is a staple food in Mediterranean countries, where it is used to produce pasta, couscous, bulgur and bread. In this paper, we cover multi-trait prediction of grain yield (GY), days to heading (DH) and plant height (PH) of 270 durum wheat lines that were evaluated in 43 environments (country-location-year combinations) across a broad range of water regimes in the Mediterranean Basin and other locations. Multi-trait prediction analyses were performed by implementing a multi-trait deep learning model (MTDL) with a feed-forward network topology and a rectified linear unit activation function with a grid search approach for the selection of hyper-parameters. The results of the multi-trait deep learning method were also compared with univariate predictions of the genomic best linear unbiased predictor (GBLUP) method and the univariate counterpart of the multi-trait deep learning method (UDL). All models were implemented with and without the genotype × environment interaction term. We found that the best predictions were observed without the genotype × environment interaction term in the UDL and MTDL methods. However, under the GBLUP method, the best predictions were observed when the genotype × environment interaction term was taken into account. We also found that in general the best predictions were observed under the GBLUP model; however, the predictions of the MTDL were very similar to those of the GBLUP model. This result provides more evidence that the GBLUP model is a powerful approach for genomic prediction, but also that the deep learning method is a practical approach for predicting univariate and multivariate traits in the context of genomic selection.

*Front Plant Sci ; 10: 1502, 2019.*

##### ABSTRACT

Genomic selection predicts the genomic estimated breeding values (GEBVs) of individuals not previously phenotyped. Several studies have investigated the accuracy of genomic predictions in maize, but there is little empirical evidence on the practical performance of lines selected based on phenotype compared with those selected solely on GEBVs in advanced testcross yield trials. The main objectives of this study were to (1) empirically compare the performance of tropical maize hybrids selected through phenotypic selection (PS) and genomic selection (GS) under well-watered (WW) and managed drought stress (WS) conditions in Kenya, and (2) compare the cost-benefit of GS and PS. For this study, we used two experimental maize data sets (stage I and stage II yield trials). The stage I data set consisted of 1492 doubled haploid (DH) lines genotyped with rAmpSeq SNPs. A subset of these lines (855), representing various DH populations within the stage I cohort, was crossed with an individual single-cross tester chosen to complement each population. These testcross hybrids were evaluated in replicated trials under WW and WS conditions for grain yield and other agronomic traits, while the remaining 637 DH lines were predicted using the 855 lines as a training set. The second data set (stage II) consisted of the 348 best DH lines from the first data set; 172 lines were selected solely on GEBVs, and 176 lines were selected based on phenotypic performance. Each of the 348 DH lines was crossed with three common testers from complementary heterotic groups, and the resulting 1042 testcross hybrids and six commercial checks were evaluated in four to five WW locations and one WS condition in Kenya. For stage I trials, the cross-validated prediction accuracy for grain yield was 0.67 and 0.65 under WW and WS conditions, respectively.
We found similar responses to selection using PS and GS for grain yield and other agronomic traits under WW and WS conditions. The top 15% of hybrids advanced through GS and PS gave 21-23% higher grain yield under WW and 51-52% more grain yield under WS than the mean of the checks. GS reduced the cost by 32% relative to PS with similar selection gains. We conclude that the use of GS for yield under WW and WS conditions in maize can produce selection candidates with performance similar to those generated through conventional PS, but at a lower cost, and should therefore be incorporated into maize breeding pipelines to increase breeding program efficiency.

*BMC Plant Biol ; 19(1): 520, 2019 Nov 27.*

##### ABSTRACT

BACKGROUND: Germplasm banks maintain collections representing the most comprehensive catalogue of native genetic diversity available for crop improvement. Users of germplasm banks are interested in a fixed number of samples representing as broadly as possible the diversity present in the wider collection. A relevant question is whether it is necessary to develop completely independent germplasm samples or whether nested sets can be selected from a pre-defined core set panel rather than from the whole collection. We used data from 15,384 maize landraces stored in the CIMMYT germplasm bank to study the impact on eight diversity criteria and on sample representativeness of: (1) two core selection strategies, statistical sampling (DM) or a numerical maximization method (CH); (2) selecting samples of varying sizes; and (3) selecting samples of different sizes either independently of each other or in a nested manner. RESULTS: Sample sizes greater than 10% of the whole population size retained more than 75% of the polymorphic markers for all selection strategies and types of sample; smaller sample sizes showed more variability (instability) among repetitions; the strongest effect of sample size was observed for the CH-independent combination. Independent and nested samples performed similarly on all criteria for the DM method, but differed for the CH method. The DM method achieved better approximations to the known population values than the CH method; 2-d multidimensional scaling plots of the collection and samples highlighted a tendency of sample selection towards the extremes of diversity under the CH method, compared with sampling more representative of the overall genotypic distribution of diversity under the DM method. CONCLUSIONS: Core subsets of size greater than or equal to 10% of the whole collection satisfied the requirements of representativeness and diversity well.
Nested samples showed diversity and representativeness characteristics similar to those of independent samples, offering a cost-effective method of sample definition for germplasm banks. For most criteria assessed, the DM method achieved better approximations to the known values in the whole population than the CH method; that is, it generated more statistically representative samples from collections.

##### Subjects

Genetic Variation , Seed Bank , Zea mays/genetics , Models, Statistical , Sampling

*PLoS One ; 14(11): e0224631, 2019.*

##### ABSTRACT

For doubled haploid (DH) production in maize, the F1 generation has been the most frequently used for haploid induction because of the simplicity of the process. However, using the F2 generation could be a good alternative for increasing genetic variability, owing to the additional recombination in meiosis. Our goals were to compare the effect of the F1 and F2 generations on DH production in tropical germplasm, evaluating the R1-navajo expression in seeds, the working steps of the methodology, and the genetic variability of the DH lines obtained. Source germplasm in the F1 and F2 generations was crossed with the tropicalized haploid inducer LI-ESALQ. After harvest, the haploid induction rate (HIR), diploid seed rate (DSR), and inhibition seed rate (ISR) were calculated for both induction crosses using the total number of seeds obtained. To study the effectiveness of the DH working steps in each generation, the per se percentage and the relative percentage were assessed. In addition, SNP markers were obtained for genetic variability studies. Results showed that the values for HIR, ISR, and DSR were 1.23%, 23.48%, and 75.21% for F1 and 1.78%, 15.82%, and 82.38% for F2, respectively. The DH working steps showed the same per se percentage (0.4%) for F1 and F2, while the relative percentage was 27.2% for F1 and 22.4% for F2. Estimates of population parameters in DH lines from F1 were higher than in those from F2. Furthermore, population structure and kinship analyses showed that one additional generation was not sufficient to create new genotype subgroups. Additionally, the relative efficiency of the response to selection in F1 was 31.88% higher than in F2 because of the number of cycles required to obtain the DH lines. Our results show that, in tropical maize, use of the F1 generation is recommended due to a superior balance between time and genetic variability.