ABSTRACT
BACKGROUND: The Algerian honey bee population is composed of two described subspecies A. m. intermissa and A. m. sahariensis, of which little is known regarding population genomics, both in terms of genetic differentiation and of possible contamination by exogenous stock. Moreover, the phenotypic differences between the two subspecies are expected to translate into genetic differences and possible adaptation to heat and drought in A. m. sahariensis. To shed light on the structure of this population and to integrate these two subspecies in the growing dataset of available haploid drone sequences, we performed whole-genome sequencing of 151 haploid drones. RESULTS: Integrated analysis of our drone sequences with a similar dataset of European reference populations did not detect any significant admixture in the Algerian honey bees. Interestingly, most of the genetic variation was not found between the A. m. intermissa and A. m. sahariensis subspecies; instead, two main genetic clusters were found along an East-West axis. We found that the correlation between genetic and geographic distances was higher in the Western cluster and that close-family relationships were mostly detected in the Eastern cluster, sometimes at long distances. In addition, we selected a panel of 96 ancestry-informative markers to decide whether a sampled bee is Algerian or not, and tested this panel in simulated cases of admixture. CONCLUSIONS: The differences between the two main genetic clusters suggest differential breeding management between eastern and western Algeria, with greater exchange of genetic material over long distances in the east. The lack of detected admixture events suggests that, unlike what is seen in many places worldwide, imports of queens from foreign countries do not seem to have occurred on a large scale in Algeria, a finding that is relevant for conservation purposes. 
In addition, the proposed panel of 96 markers was found effective to distinguish Algerian from European honey bees. Therefore, we conclude that applying this approach to other taxa is promising, in particular when genetic differentiation is difficult to capture.
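The abstract does not say how the 96 ancestry-informative markers were chosen or scored. A minimal sketch of one common approach, under that caveat, ranks markers by the allele-frequency difference between two reference groups and scores a haploid drone with a log-likelihood ratio. The function names, marker names, and frequencies below are hypothetical.

```python
import math

def select_aims(freq_a, freq_b, panel_size):
    """Rank markers by absolute allele-frequency difference between
    two reference groups and keep the top `panel_size` (assumed rule)."""
    ranked = sorted(freq_a, key=lambda snp: abs(freq_a[snp] - freq_b[snp]),
                    reverse=True)
    return ranked[:panel_size]

def log_likelihood_ratio(alleles, freq_a, freq_b, panel, eps=1e-3):
    """Score a haploid drone: positive favours group A, negative group B.

    alleles: dict marker -> 0/1 (a drone carries a single allele per marker).
    freq_a, freq_b: dict marker -> frequency of allele 1 in each group.
    Frequencies are clipped to [eps, 1 - eps] to avoid log(0).
    """
    score = 0.0
    for snp in panel:
        pa = min(max(freq_a[snp], eps), 1 - eps)
        pb = min(max(freq_b[snp], eps), 1 - eps)
        if alleles[snp] == 1:
            score += math.log(pa / pb)
        else:
            score += math.log((1 - pa) / (1 - pb))
    return score

# Hypothetical frequencies of allele 1 in two reference groups.
freq_algerian = {"s1": 0.95, "s2": 0.50, "s3": 0.10, "s4": 0.55}
freq_european = {"s1": 0.05, "s2": 0.45, "s3": 0.90, "s4": 0.50}
panel = select_aims(freq_algerian, freq_european, panel_size=2)
drone = {"s1": 1, "s2": 0, "s3": 0, "s4": 1}
score = log_likelihood_ratio(drone, freq_algerian, freq_european, panel)
```

Markers with nearly equal frequencies in both groups contribute almost nothing to the score, which is why a small, well-chosen panel can be informative.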
Subject(s)
Breeding , Gene Flow , Humans , Bees/genetics , Animals , Whole Genome Sequencing/veterinary , Polymorphism, Single Nucleotide , Genetic Structures
ABSTRACT
The estimation of the inbreeding coefficient (F) is essential for the study of inbreeding depression (ID) or for the management of populations under conservation. Several methods have been proposed to estimate the realized F using genetic markers, but it remains unclear which one should be used. Here we used whole-genome sequence data for 245 individuals from a Holstein cattle pedigree to empirically evaluate which estimators best capture homozygosity at variants causing ID, such as rare deleterious alleles or loci presenting heterozygote advantage and segregating at intermediate frequency. Estimators relying on the correlation between uniting gametes (FUNI) or on the genomic relationships (FGRM) presented the highest correlations with these variants. However, homozygosity at rare alleles remained poorly captured. A second group of estimators relying on excess homozygosity (FHOM), homozygous-by-descent segments (FHBD), runs-of-homozygosity (FROH) or on the known genealogy (FPED) was better at capturing whole-genome homozygosity, reflecting the consequences of inbreeding on all variants, and for young alleles with low to moderate frequencies (0.10 < . < 0.25). The results indicate that FUNI and FGRM might present a stronger association with ID. However, the situation might be different when recessive deleterious alleles reach higher frequencies, such as in populations with a small effective population size. For locus-specific inbreeding measures or at low marker density, the ranking of the methods can also change as FHBD makes better use of the information from neighboring markers. Finally, we confirmed that genomic measures are in general superior to pedigree-based estimates. In particular, FPED was uncorrelated with locus-specific homozygosity.
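To make the contrast between the two groups of estimators concrete, the sketch below computes three of them for one individual using per-SNP forms that are widely used in the literature. It assumes genotypes coded as 0/1/2 allele counts and known allele frequencies. The exact definitions used in the study may differ (for example, a ratio of sums rather than a mean of ratios), and the function name and example data are hypothetical.

```python
def inbreeding_estimators(genotypes, freqs):
    """Return (F_HOM, F_UNI, F_GRM) for one individual.

    genotypes: allele counts (0, 1 or 2) for the counted allele, one per SNP.
    freqs: frequency p of the counted allele at each SNP (0 < p < 1).
    """
    m = len(genotypes)
    # F_HOM: excess of observed homozygous SNPs over the
    # Hardy-Weinberg expectation.
    observed_hom = sum(1 for x in genotypes if x != 1)
    expected_hom = sum(1 - 2 * p * (1 - p) for p in freqs)
    f_hom = (observed_hom - expected_hom) / (m - expected_hom)
    # F_UNI: correlation between uniting gametes, averaged over SNPs.
    f_uni = sum((x * x - (1 + 2 * p) * x + 2 * p * p) / (2 * p * (1 - p))
                for x, p in zip(genotypes, freqs)) / m
    # F_GRM: diagonal of the genomic relationship matrix minus one.
    f_grm = sum((x - 2 * p) ** 2 / (2 * p * (1 - p)) - 1
                for x, p in zip(genotypes, freqs)) / m
    return f_hom, f_uni, f_grm

# Hypothetical individual, homozygous at four SNPs, including one
# homozygote for a rare allele (last SNP, p = 0.1).
f_hom, f_uni, f_grm = inbreeding_estimators([2, 0, 0, 2], [0.5, 0.5, 0.1, 0.1])
```

In these forms F_HOM counts every homozygous SNP equally, whereas F_UNI and F_GRM divide by 2p(1 - p) and therefore give much more weight to homozygosity at rare alleles. This is consistent with the abstract's observation that the two groups of estimators capture different aspects of homozygosity.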
Subject(s)
Inbreeding , Polymorphism, Single Nucleotide , Alleles , Animals , Cattle/genetics , Genotype , Homozygote , Pedigree
ABSTRACT
We herein study genetic recombination in three cattle populations from France, New Zealand, and the Netherlands. We identify 2,395,177 crossover (CO) events in 94,516 male gametes, and 579,996 CO events in 25,332 female gametes. The average number of COs was found to be larger in males (23.3) than in females (21.4). The heritability of global recombination rate (GRR) was estimated at 0.13 in males and 0.08 in females, with a genetic correlation of 0.66 indicating that shared variants are influencing GRR in both sexes. A genome-wide association study identified seven quantitative trait loci (QTL) for GRR. Fine-mapping following sequence-based imputation in 14,401 animals pinpointed likely causative coding (5) and noncoding (1) variants in genes known to be involved in meiotic recombination (HFM1, MSH4, RNF212, MLH3, MSH5) for 5/7 QTL, and noncoding variants (3) in RNF212B for 1/7 QTL. This suggests that this RNF212 paralog might also be involved in recombination. Most of the identified mutations had significant effects in both sexes, with three of them each accounting for ~10% of the genetic variance in males.
Subject(s)
Cattle/genetics , Homologous Recombination , Polymorphism, Genetic , Animals , Female , Genome-Wide Association Study , Germ Cells/cytology , Germ Cells/metabolism , Male , Meiosis/genetics , Mutation , Quantitative Trait Loci , Sex Factors
ABSTRACT
We herein report the result of a large-scale, next generation sequencing (NGS)-based screen for embryonic lethal (EL) mutations in Belgian beef and New Zealand dairy cattle. We estimated by simulation that cattle might carry, on average, ~0.5 recessive EL mutations. We mined exome sequence data from >600 animals, and identified 1377 stop-gain, 3139 frame-shift, 1341 splice-site, 22,939 disruptive missense, 62,399 benign missense, and 92,163 synonymous variants. We show that cattle have a comparable load of loss-of-function (LoF) variants (defined as stop-gain, frame-shift, or splice-site variants) as humans despite having a more variable exome. We genotyped >40,000 animals for up to 296 LoF and 3483 disruptive missense, breed-specific variants. We identified candidate EL mutations based on the observation of a significant depletion in homozygotes. We estimated the proportion of EL mutations at 15% of tested LoF and 6% of tested disruptive missense variants. We confirmed the EL nature of nine candidate variants by genotyping 200 carrier × carrier trios, and demonstrating the absence of homozygous offspring. The nine identified EL mutations segregate at frequencies ranging from 1.2% to 6.6% in the studied populations and collectively account for the mortality of ~0.6% of conceptuses. We show that EL mutations preferentially affect gene products fulfilling basic cellular functions. The resulting information will be useful to avoid at-risk matings, thereby improving fertility.
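The reasoning behind "depletion in homozygotes" can be illustrated with a short calculation. The sketch below assumes Hardy-Weinberg proportions and independent animals. It is not the test statistic used in the study, which the abstract does not specify, and the sample size and allele frequency are hypothetical.

```python
def expected_homozygotes(n_animals, allele_freq):
    """Expected number of homozygotes under Hardy-Weinberg proportions."""
    return n_animals * allele_freq ** 2

def prob_no_homozygotes(n_animals, allele_freq):
    """Probability of seeing zero homozygotes among n independent animals
    if the variant were NOT lethal (each is homozygous with probability q^2)."""
    return (1 - allele_freq ** 2) ** n_animals

def prob_no_homozygous_offspring(n_trios):
    """Probability that none of n carrier x carrier offspring is homozygous
    if the variant were NOT lethal (Mendelian expectation of 1/4 each)."""
    return 0.75 ** n_trios

# Hypothetical example: 5000 genotyped animals, variant frequency 3%.
expected = expected_homozygotes(5000, 0.03)
p_zero = prob_no_homozygotes(5000, 0.03)
p_trios = prob_no_homozygous_offspring(20)
```

A small probability of observing zero homozygotes under the non-lethal hypothesis is what makes a variant a candidate. The carrier × carrier trios provide a second check that does not depend on population allele frequencies.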
Subject(s)
Cattle/genetics , Fertility/genetics , Genes, Lethal , Mutation , Animals , Cattle/embryology , Cattle/physiology , Genetic Testing/methods , Heterozygote , High-Throughput Nucleotide Sequencing/methods , Homozygote , Reverse Genetics/methods , Sequence Analysis, DNA/methods
ABSTRACT
BACKGROUND: Haplotype reconstruction (phasing) is an essential step in many applications, including imputation and genomic selection. The best phasing methods rely on both familial and linkage disequilibrium (LD) information. With whole-genome sequence (WGS) data, relatively small samples of reference individuals are generally sequenced due to prohibitive sequencing costs, thus only a limited amount of familial information is available. However, reference individuals have many relatives that have been genotyped (at lower density). The goal of our study was to improve phasing of WGS data by integrating familial information from haplotypes that were obtained from a larger genotyped dataset and to quantify its impact on imputation accuracy. RESULTS: Aligning a pre-phased WGS panel [~5 million single nucleotide polymorphisms (SNPs)], which is based on LD information only, to a 50k SNP array that is phased with both LD and familial information (called scaffold) resulted in correctly assigning parental origin for 99.62% of the WGS SNPs, their phase being determined unambiguously based on parental genotypes. Without using the 50k haplotypes as scaffold, that value dropped as expected to 50%. Correctly phased segments were on average longer after alignment to the genotype phase while the number of switches decreased slightly. Most of the incorrectly assigned segments, and subsequent switches, were due to singleton errors. Imputation from 50k SNP array to WGS data with improved phasing had a marginal impact on imputation accuracy (measured as r²), i.e. on average, 90.47% with traditional techniques versus 90.65% with pre-phasing integrating familial information. Differences were larger for SNPs located in chromosome ends and rare variants. Using a denser WGS panel (~13 million SNPs) that was obtained with traditional variant filtering rules, we found similar results, although both phasing and imputation accuracy were lower. 
CONCLUSIONS: We present a phasing strategy for WGS data, which indirectly integrates familial information by aligning WGS haplotypes that are pre-phased with LD information only on haplotypes obtained with genotyping data, with both LD and familial information and on a much larger population. This strategy results in very few mismatches with the phase obtained by Mendelian segregation rules. Finally, we propose a strategy to further improve phasing accuracy based on haplotype clusters obtained with genotyping data.
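The alignment step can be sketched as follows, assuming each haplotype is stored as a mapping from SNP position to allele. This simplified version orients a whole pair of haplotypes at once, whereas a practical implementation would presumably work segment by segment to handle switch errors. The function name and data layout are hypothetical and not taken from the study.

```python
def align_to_scaffold(hap_a, hap_b, scaffold_pat, scaffold_mat):
    """Order two LD-phased haplotypes as (paternal, maternal).

    hap_a, hap_b: dict position -> allele (0/1) from the dense panel,
        phased with LD only, so parental origin is arbitrary.
    scaffold_pat, scaffold_mat: dict position -> allele (0/1) from the
        sparse array, phased with familial information.
    Only positions present in all four haplotypes are compared.
    """
    shared = [pos for pos in scaffold_pat
              if pos in scaffold_mat and pos in hap_a and pos in hap_b]
    # Count allele matches for the two possible orientations.
    straight = sum((hap_a[pos] == scaffold_pat[pos]) +
                   (hap_b[pos] == scaffold_mat[pos]) for pos in shared)
    swapped = sum((hap_b[pos] == scaffold_pat[pos]) +
                  (hap_a[pos] == scaffold_mat[pos]) for pos in shared)
    return (hap_a, hap_b) if straight >= swapped else (hap_b, hap_a)

# Hypothetical data: the dense haplotypes are in the wrong order.
dense_1 = {100: 0, 150: 1, 200: 1, 250: 0, 300: 1}
dense_2 = {100: 1, 150: 0, 200: 0, 250: 0, 300: 0}
array_pat = {100: 1, 200: 0, 300: 0}
array_mat = {100: 0, 200: 1, 300: 1}
paternal, maternal = align_to_scaffold(dense_1, dense_2, array_pat, array_mat)
```

Only heterozygous scaffold positions discriminate between the two orientations. Without a scaffold the choice is arbitrary, which matches the 50% baseline reported in the abstract.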
Subject(s)
Cattle/genetics , Genotyping Techniques/veterinary , Haplotypes , Selective Breeding , Sequence Analysis, DNA/veterinary , Animals , Chromosomes/genetics , Female , Genome , Genotyping Techniques/methods , Genotyping Techniques/standards , Male , Polymorphism, Single Nucleotide , Sequence Analysis, DNA/methods , Sequence Analysis, DNA/standards
ABSTRACT
BACKGROUND: Inbreeding coefficients can be estimated either from pedigree data or from genomic data, and with genomic data, they are either global or local (when the linkage map is used). Recently, we developed a new hidden Markov model (HMM) that estimates probabilities of homozygosity-by-descent (HBD) at each marker position and automatically partitions autozygosity in multiple age-related classes (based on the length of HBD segments). Our objectives were to: (1) characterize inbreeding with our model in an intensively selected population such as the Belgian Blue Beef (BBB) cattle breed; (2) compare the properties of the model at different marker densities; and (3) compare our model with other methods. RESULTS: When using 600 K single nucleotide polymorphisms (SNPs), the inbreeding coefficient (probability of sampling an HBD locus in an individual) was on average 0.303 (ranging from 0.258 to 0.375). HBD-classes associated to historical ancestors (with small segments ≤ 200 kb) accounted for 21.6% of the genome length (71.4% of the total length of the genome in HBD segments), whereas classes associated to more recent ancestors accounted for only 22.6% of the total length of the genome in HBD segments. However, these recent classes presented more individual variation than more ancient classes. Although inbreeding coefficients obtained with low SNP densities (7 and 32 K) were much lower (0.060 and 0.093), they were highly correlated with those obtained at higher density (r = 0.934 and 0.975, respectively), indicating that they captured most of the individual variation. At higher SNP density, smaller HBD segments are identified and, thus, more past generations can be explored. We observed very high correlations between our estimates and those based on homozygosity (r = 0.95) or on runs-of-homozygosity (r = 0.95). As expected, pedigree-based estimates were mainly correlated with recent HBD-classes (r = 0.56). 
CONCLUSIONS: Although we observed high levels of autozygosity associated with small HBD segments in BBB cattle, recent inbreeding accounted for most of the individual variation. Recent autozygosity can be captured efficiently with low-density SNP arrays and relatively simple models (e.g., two HBD classes). The HMM framework provides local HBD probabilities that are still useful at lower SNP densities.
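As an illustration of how local HBD probabilities can be obtained, here is a minimal two-state HMM (one HBD class and one non-HBD class) solved with the forward-backward algorithm. It is a simplified sketch, not the multi-class model described in the abstract. The emission and transition forms, the parameter names (`rho`, `rate`, `eps`) and their default values are assumptions chosen for illustration.

```python
import math

def hbd_posteriors(genotypes, freqs, distances, rho=0.1, rate=10.0, eps=0.001):
    """Posterior probability of HBD at each marker.

    genotypes: allele counts (0/1/2); freqs: frequency of the counted allele.
    distances: genetic distance (Morgans) from the previous marker
        (distances[0] is ignored).
    rho: prior proportion of HBD; rate: rate of ancestry change per Morgan;
    eps: probability of observing a heterozygote inside an HBD segment.
    """
    prior = (1.0 - rho, rho)          # state 0 = non-HBD, state 1 = HBD

    def emission(g, p):
        q = 1.0 - p
        non_hbd = (q * q, 2 * p * q, p * p)[g]        # Hardy-Weinberg
        hbd = ((1 - eps) * q, eps, (1 - eps) * p)[g]  # one ancestral allele
        return (non_hbd, hbd)

    def transition(d):
        stay = math.exp(-rate * d)    # no ancestry change between markers
        return [[stay * (i == j) + (1 - stay) * prior[j] for j in (0, 1)]
                for i in (0, 1)]

    n = len(genotypes)
    fwd = []                          # scaled forward probabilities
    for k in range(n):
        e = emission(genotypes[k], freqs[k])
        if k == 0:
            a = [prior[s] * e[s] for s in (0, 1)]
        else:
            t = transition(distances[k])
            a = [e[s] * sum(fwd[-1][r] * t[r][s] for r in (0, 1))
                 for s in (0, 1)]
        total = sum(a)
        fwd.append([v / total for v in a])

    bwd = [[1.0, 1.0] for _ in range(n)]   # scaled backward probabilities
    for k in range(n - 2, -1, -1):
        e = emission(genotypes[k + 1], freqs[k + 1])
        t = transition(distances[k + 1])
        b = [sum(t[s][r] * e[r] * bwd[k + 1][r] for r in (0, 1))
             for s in (0, 1)]
        total = sum(b)
        bwd[k] = [v / total for v in b]

    posteriors = []
    for k in range(n):
        w = [fwd[k][s] * bwd[k][s] for s in (0, 1)]
        posteriors.append(w[1] / sum(w))
    return posteriors

# Hypothetical stretch of 20 markers, all homozygous for an allele at p = 0.2.
post = hbd_posteriors([2] * 20, [0.2] * 20, [0.001] * 20)
inbreeding = sum(post) / len(post)    # genome-wide F as the mean HBD probability
```

Averaging the per-marker posteriors gives an inbreeding coefficient in the sense used in the abstract (the probability of sampling an HBD locus), while the individual posteriors provide the local information.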
Subject(s)
Cattle/genetics , Genomics/methods , Inbreeding/methods , Models, Statistical , Polymorphism, Single Nucleotide/genetics , Animals , Genome , Genotype , Homozygote , Male , Pedigree
ABSTRACT
BACKGROUND: In recent theoretical developments, the information available (e.g. genotypes) divides the original population into two groups: animals with this information (selected animals) and animals without this information (excluded animals). These developments require inversion of the part of the pedigree-based numerator relationship matrix that describes the genetic covariance between selected animals (A₂₂). Our main objective was to propose and evaluate methodology that takes advantage of any potential sparsity in the inverse of A₂₂ in order to reduce the computing time required for its inversion. This potential sparsity is brought out by searching the pedigree for dependencies between the selected animals. We also expected distant ancestors to provide relationship ties that increase the density of matrix A₂₂ while having only a minor effect on A₂₂⁻¹. This hypothesis was also tested. METHODS: The inverse of A₂₂ can be computed from the inverse of the triangular factor (T⁻¹) obtained by Cholesky root-free decomposition of A₂₂. We propose an algorithm that sets up the sparsity pattern of T⁻¹ using pedigree information. This algorithm provides the positions of the elements of T⁻¹ that are worth computing (i.e. different from zero). A recursive computation of A₂₂⁻¹ is then achieved with or without information on the sparsity pattern, and the time required for each computation was recorded. For three numbers of selected animals (4000, 8000 and 12 000), A₂₂ was computed using different pedigree extractions, and the closeness of the resulting A₂₂⁻¹ to the inverse computed using the fully extracted pedigree was measured by an appropriate norm. RESULTS: The use of prior information on the sparsity of T⁻¹ decreased the computing time for inversion by a factor of 1.73 on average. Computational issues and practical uses of the different algorithms were discussed. Cases involving more than 12 000 selected animals were considered. 
Inclusion of 10 generations was determined to be sufficient when computing A₂₂. CONCLUSIONS: Depending on the size and structure of the selected sub-population, gains in time to compute A₂₂⁻¹ are possible and these gains may increase as the number of selected animals increases. Given the sequential nature of most computational steps, the proposed algorithm can benefit from optimization and may be convenient for genomic evaluations.
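The relation between the root-free Cholesky factor and the inverse can be sketched on a toy matrix. The code below is a dense illustration of A = T D Tᵀ and A⁻¹ = (T⁻¹)ᵀ D⁻¹ T⁻¹. It does not implement the pedigree-driven sparsity search proposed in the abstract, and the four-animal relationship matrix (two unrelated parents and two full sibs) is a hypothetical example.

```python
def root_free_cholesky(a):
    """Return (T, d) with A = T diag(d) T', T unit lower triangular."""
    n = len(a)
    t = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    d = [0.0] * n
    for j in range(n):
        d[j] = a[j][j] - sum(t[j][k] ** 2 * d[k] for k in range(j))
        for i in range(j + 1, n):
            t[i][j] = (a[i][j] - sum(t[i][k] * t[j][k] * d[k]
                                     for k in range(j))) / d[j]
    return t, d

def invert_unit_lower(t):
    """Invert a unit lower triangular matrix by forward substitution."""
    n = len(t)
    inv = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for j in range(n):
        for i in range(j + 1, n):
            inv[i][j] = -sum(t[i][k] * inv[k][j] for k in range(j, i))
    return inv

def inverse_from_factor(a):
    """Compute A^-1 as (T^-1)' D^-1 T^-1."""
    t, d = root_free_cholesky(a)
    t_inv = invert_unit_lower(t)
    n = len(a)
    return [[sum(t_inv[k][i] * t_inv[k][j] / d[k] for k in range(n))
             for j in range(n)] for i in range(n)]

# Hypothetical relationship matrix: sire, dam, and two full-sib offspring.
a22 = [[1.0, 0.0, 0.5, 0.5],
       [0.0, 1.0, 0.5, 0.5],
       [0.5, 0.5, 1.0, 0.5],
       [0.5, 0.5, 0.5, 1.0]]
a22_inv = inverse_from_factor(a22)
```

In this toy case the element of the inverse linking the two full sibs is zero even though their relationship in A₂₂ is 0.5. This is the kind of sparsity that knowing the non-zero positions of T⁻¹ in advance would allow an implementation to skip.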
Subject(s)
Computational Biology/methods , Genome , Pedigree , Algorithms , Animals , Genomics , Genotype , Models, Genetic , Phenotype
ABSTRACT
Many features single out HLA-F: its structure, the regulation of its expression at the cell membrane, and its function. HLA-F mRNA is detected in most cell types, and the protein is localized in the ER and Golgi apparatus. When expressed at the cell surface, HLA-F may be associated with β2-microglobulin and peptide or expressed as an open-conformer molecule. HLA-F reaches the membrane upon activation of different primary cell types and cell lines. HLA-F has its highest affinity for the activating NK receptor KIR3DS1, but also binds inhibitory immune receptors. Some studies have reported that HLA-F expression is associated with its genotype: higher HLA-F mRNA expression was associated with F*01:01:02, and three noncoding SNPs, rs1362126, rs2523405, and rs2523393, located in HLA-F-AS1 or upstream of the HLA-F sequence, were associated with HLA-F mRNA expression. Given the implication of HLA-F in many clinical settings and the still-unknown process regulating its expression, we aimed to confirm the effect of the aforementioned SNPs on HLA-F transcriptional and protein expression. We analyzed the distribution, frequency, and linkage disequilibrium of these SNPs at a worldwide scale in the 1000 Genomes Project samples. The influence of the genotype of each SNP on HLA-F expression was explored using RNAseq data from the 1000 Genomes Project, and using Q-PCR and intracellular cytometry in PBMC from healthy individuals. Our results show that the SNPs under study displayed remarkably different allelic proportions according to geography, and confirm that rs1362126, rs2523405, and rs2523393 displayed the most concordant results, with the highest effect size and a double-dose effect.
Subject(s)
Histocompatibility Antigens Class I , Leukocytes, Mononuclear , Humans , Alleles , Histocompatibility Antigens Class I/genetics , Polymorphism, Single Nucleotide , Genotype , RNA, Messenger/genetics
ABSTRACT
The Brazilian merganser (Mergus octosetaceus) is one of the most endangered bird species in South America and comprises less than 250 mature individuals in wild environments. This is a species extremely sensitive to environmental disturbances and restricted to a few "pristine" freshwater habitats in Brazil, and it has been classified as Critically Endangered on the IUCN Red List since 1994. Thus, biological conservation studies are vital to promote adequate management strategies and to avoid the decline of merganser populations. In this context, to understand the evolutionary dynamics and the current genetic diversity of remaining Brazilian merganser populations, we used the "Genotyping by Sequencing" approach to genotype 923 SNPs in 30 individuals from all known areas of occurrence. These populations revealed a low genetic diversity and high inbreeding levels, likely due to the recent population decline associated with habitat loss. Furthermore, it showed a moderate level of genetic differentiation between all populations located in four separated areas of the highly threatened Cerrado biome. The results indicate that urgent actions for the conservation of the species should be accompanied by careful genetic monitoring to allow appropriate in situ and ex situ management to increase the long-term species' survival in its natural environment.
ABSTRACT
The Nav1.7 voltage-gated sodium channel plays a key role in nociception. Three functional variants in the SCN9A gene (encoding M932L, V991L, and D1908G in Nav1.7), have recently been identified as stemming from Neanderthal introgression and to associate with pain symptomatology in UK BioBank data. In 1000 genomes data, these variants are absent in Europeans but common in Latin Americans. Analysing high-density genotype data from 7594 Latin Americans, we characterized Neanderthal introgression in SCN9A. We find that tracts of introgression occur on a Native American genomic background, have an average length of ~123 kb and overlap the M932L, V991L, and D1908G coding positions. Furthermore, we measured experimentally six pain thresholds in 1623 healthy Colombians. We found that Neanderthal ancestry in SCN9A is significantly associated with a lower mechanical pain threshold after sensitization with mustard oil and evidence of additivity of effects across Nav1.7 variants. Our findings support the reported association of Neanderthal Nav1.7 variants with clinical pain, define a specific sensory modality affected by archaic introgression in SCN9A and are consistent with independent effects of the Neanderthal variants on Nav1.7 function.
Subject(s)
Neanderthals , Pain Threshold , Humans , Animals , Neanderthals/genetics , Pain/genetics , NAV1.7 Voltage-Gated Sodium Channel/genetics , Nociception
ABSTRACT
We report a genome-wide association study of facial features in >6000 Latin Americans based on automatic landmarking of 2D portraits and testing for association with inter-landmark distances. We detected significant associations (P-value <5 × 10⁻⁸) at 42 genome regions, nine of which have been previously reported. In follow-up analyses, 26 of the 33 novel regions replicate in East Asians, Europeans, or Africans, and one mouse homologous region influences craniofacial morphology in mice. The novel region in 1q32.3 shows introgression from Neanderthals and we find that the introgressed tract increases nasal height (consistent with the differentiation between Neanderthals and modern humans). Novel regions include candidate genes and genome regulatory elements previously implicated in craniofacial development, and show preferential transcription in cranial neural crest cells. The automated approach used here should simplify the collection of large study samples from across the world, facilitating a cosmopolitan characterization of the genetics of facial features.
Subject(s)
Neanderthals , Humans , Animals , Mice , Neanderthals/genetics , Genome-Wide Association Study , Nose , Cell Differentiation
ABSTRACT
Repeated application of noxious stimuli leads to a progressively increased pain perception; this temporal summation is enhanced in and predictive of clinical pain disorders. Its electrophysiological correlate is "wind-up," in which dorsal horn spinal neurons increase their response to repeated nociceptor stimulation. To understand the genetic basis of temporal summation, we undertook a GWAS of wind-up in healthy human volunteers and found significant association with SLC8A3 encoding sodium-calcium exchanger type 3 (NCX3). NCX3 was expressed in mouse dorsal horn neurons, and mice lacking NCX3 showed normal, acute pain but hypersensitivity to the second phase of the formalin test and chronic constriction injury. Dorsal horn neurons lacking NCX3 showed increased intracellular calcium following repetitive stimulation, slowed calcium clearance, and increased wind-up. Moreover, virally mediated enhanced spinal expression of NCX3 reduced central sensitization. Our study highlights Ca2+ efflux as a pathway underlying temporal summation and persistent pain, which may be amenable to therapeutic targeting.
Subject(s)
Calcium , Sodium-Calcium Exchanger , Animals , Humans , Mice , Pain , Posterior Horn Cells , Psychophysics , Sodium-Calcium Exchanger/genetics
ABSTRACT
Blood group systems were the first phenotypic markers used in anthropology to decipher the origin of populations, their migratory movements, and their admixture. The recent emergence of new technologies based on the decoding of nucleic acids from an individual's entire genome has relegated them to their primary application, blood transfusion. Thus, despite the finer mapping of the modern human genome in relation to Neanderthal and Denisova populations, little is known about red cell blood groups in these archaic populations. Here we analyze the available high-quality sequences of three Neanderthals and one Denisovan individual for 7 blood group systems that are used today in transfusion (ABO including H/Se, Rh (Rhesus), Kell, Duffy, Kidd, MNS, Diego). We show that Neanderthals and Denisovans were polymorphic for ABO and shared blood group alleles recurrent in modern Sub-Saharan populations. Furthermore, we found ABO-related alleles that currently protect against viral gut infection, and Neanderthal RHD and RHCE alleles that are nowadays associated with a high risk of hemolytic disease of the fetus and newborn. Such a common blood group pattern across time and space is coherent with a Neanderthal population of low genetic diversity exposed to low reproductive success and with their inevitable demise. Lastly, we connect a Neanderthal RHD allele to two present-day Aboriginal Australian and Papuan individuals, suggesting that a segment of archaic genome was introgressed in this gene in non-Eurasian populations. While contributing to both the origin and late evolutionary history of Neanderthal and Denisova, our results further illustrate that blood group systems are a relevant piece of the puzzle helping to decipher it.
Subject(s)
Blood Group Antigens/genetics , Hominidae/genetics , Neanderthals/genetics , Alleles , Animals , Fossils , Genetic Variation , Genotype , INDEL Mutation , Phenotype , Polymorphism, Genetic
ABSTRACT
Here we evaluate the accuracy of prediction for eye, hair and skin pigmentation in a dataset of > 6500 individuals from Mexico, Colombia, Peru, Chile and Brazil (including genome-wide SNP data and quantitative/categorical pigmentation phenotypes - the CANDELA dataset CAN). We evaluated accuracy in relation to different analytical methods and various phenotypic predictors. As expected from statistical principles, we observe that quantitative traits are more sensitive to changes in the prediction models than categorical traits. We find that Random Forest or Linear Regression are generally the best performing methods. We also compare the prediction accuracy of SNP sets defined in the CAN dataset (including 56, 101 and 120 SNPs for eye, hair and skin colour prediction, respectively) to the well-established HIrisPlex-S SNP set (including 6, 22 and 36 SNPs for eye, hair and skin colour prediction respectively). When training prediction models on the CAN data, we observe remarkably similar performances for HIrisPlex-S and the larger CAN SNP sets for the prediction of hair (categorical) and eye (both categorical and quantitative), while the CAN sets outperform HIrisPlex-S for quantitative, but not for categorical skin pigmentation prediction. The performance of HIrisPlex-S, when models are trained in a world-wide sample (although consisting of 80% Europeans, https://hirisplex.erasmusmc.nl), is lower relative to training in the CAN data (particularly for hair and skin colour). Altogether, our observations are consistent with common variation of eye and hair colour having a relatively simple genetic architecture, which is well captured by HIrisPlex-S, even in admixed Latin Americans (with partial European ancestry). By contrast, since skin pigmentation is a more polygenic trait, accuracy is more sensitive to prediction SNP set size, although here this effect was only apparent for a quantitative measure of skin pigmentation. 
Our results support the use of HIrisPlex-S in the prediction of categorical pigmentation traits for forensic purposes in Latin America, while illustrating the impact of training datasets on its accuracy.
Subject(s)
Eye Color/genetics , Hair Color/genetics , Polymorphism, Single Nucleotide , Skin Pigmentation/genetics , Datasets as Topic , Genetics, Population , Genotype , Humans , Latin America , Logistic Models , Phenotype
ABSTRACT
To characterize the genetic basis of facial features in Latin Americans, we performed a genome-wide association study (GWAS) of more than 6000 individuals using 59 landmark-based measurements from two-dimensional profile photographs and ~9,000,000 genotyped or imputed single-nucleotide polymorphisms. We detected significant association of 32 traits with at least 1 (and up to 6) of 32 different genomic regions, more than doubling the number of robustly associated face morphology loci reported until now (from 11 to 23). These GWAS hits are strongly enriched in regulatory sequences active specifically during craniofacial development. The associated region in 1p12 includes a tract of archaic adaptive introgression, with a Denisovan haplotype common in Native Americans affecting particularly lip thickness. Among the nine previously unidentified face morphology loci we identified is the VPS13B gene region, and we show that variants in this region also affect midfacial morphology in mice.
Subject(s)
Face , Polymorphism, Single Nucleotide , Vesicular Transport Proteins , Animals , Face/anatomy & histology , Genome-Wide Association Study , Genotype , Hispanic or Latino/genetics , Humans , Mice , Phenotype , Vesicular Transport Proteins/genetics
ABSTRACT
Many genomic data analyses such as phasing, genotype imputation, or local ancestry inference share a common core task: matching pairs of haplotypes at any position along the chromosome, thereby inferring a target haplotype as a succession of pieces from reference haplotypes, commonly called a mosaic of reference haplotypes. For that purpose, these analyses combine information provided by linkage disequilibrium, linkage and/or genealogy through a set of heuristic rules or, most often, by a hidden Markov model. Here, we develop an extremely randomized trees framework to address the issue of local haplotype matching. In our approach, a supervised classifier using extra-trees (a particular type of random forests) learns how to identify the best local matches between haplotypes using a collection of observed examples. For each example, various features related to the different sources of information are observed, such as the length of a segment shared between haplotypes, or estimates of relationships between individuals, gametes, and haplotypes. The random forests framework was fed with 30 relevant features for local haplotype matching. Repeated cross-validations allowed ranking these features in regard to their importance for local haplotype matching. The distance to the edge of a segment shared by both haplotypes being matched was found to be the most important feature. Similarity comparisons between predicted and true whole-genome sequence haplotypes showed that the random forests framework was more efficient than a hidden Markov model in reconstructing a target haplotype as a mosaic of reference haplotypes. To further evaluate its efficiency, the random forests framework was applied to imputation of whole-genome sequence from 50k genotypes and it yielded average reliabilities similar or slightly better than IMPUTE2. 
Through this exploratory study, we lay the foundations of a new framework to automatically learn local haplotype matching and we show that extra-trees are a promising approach for such purposes. The use of this new technique also reveals some useful lessons on the relevant features for the purpose of haplotype matching. We also discuss potential improvements for routine implementation.