ABSTRACT
The design of proteins with specific tasks is a major challenge in molecular biology with important diagnostic and therapeutic applications. High-throughput screening methods have been developed to systematically evaluate protein activity, but only a small fraction of possible protein variants can be tested using these techniques. Computational models that explore the sequence space in silico to identify the fittest molecules for a given function are needed to overcome this limitation. In this article, we propose AnnealDCA, a machine-learning framework to learn the protein fitness landscape from sequencing data derived from a broad range of experiments that use selection and sequencing to quantify protein activity. We demonstrate the effectiveness of our method by applying it to antibody Rep-Seq data of immunized mice and to screening experiments, assessing the quality of the fitness landscape reconstructions. Our method can be applied to several experimental cases where a population of protein variants undergoes various rounds of selection and sequencing. It does not rely on the computation of variant enrichment ratios and can therefore be used even in cases of disjoint sequence samples.
Subjects
Genetic Fitness, Machine Learning, Animals, Mice, Mutation, Genetic Fitness/genetics
ABSTRACT
Exquisite binding specificity is essential for many protein functions but is difficult to engineer. Many biotechnological or biomedical applications require the discrimination of very similar ligands, which poses the challenge of designing protein sequences with highly specific binding profiles. Experimental methods for generating specific binders rely on in vitro selection, which is limited in terms of library size and control over specificity profiles. Additional control was recently demonstrated through high-throughput sequencing and downstream computational analysis. Here we follow such an approach to demonstrate the design of specific antibodies beyond those probed experimentally. We do so in a context where very similar epitopes need to be discriminated, and where these epitopes cannot be experimentally dissociated from other epitopes present in the selection. Our approach involves the identification of different binding modes, each associated with a particular ligand against which the antibodies are either selected or not. Using data from phage display experiments, we show that the model successfully disentangles these modes, even when they are associated with chemically very similar ligands. Additionally, we demonstrate and validate experimentally the computational design of antibodies with customized specificity profiles, either with specific high affinity for a particular target ligand, or with cross-specificity for multiple target ligands. Overall, our results showcase the potential of leveraging a biophysical model learned from selections against multiple ligands to design proteins with tailored specificity, with applications to protein engineering extending beyond the design of antibodies.
Subjects
Antibody Specificity, Computational Biology, Computational Biology/methods, Peptide Library, Ligands, Epitopes/immunology, Epitopes/chemistry, Antibodies/chemistry, Antibodies/immunology, Protein Engineering/methods, Humans, Protein Binding
ABSTRACT
Adeno-associated viruses 2 (AAV2) are minute viruses renowned for their capacity to infect human cells and akin organisms. They have recently emerged as prominent candidates in the field of gene therapy, primarily attributed to their inherent non-pathogenic nature in humans and the safety associated with their manipulation. The efficacy of AAV2 as gene therapy vectors hinges on their ability to infiltrate host cells, a phenomenon reliant on their competence to construct a capsid capable of breaching the nucleus of the target cell. To enhance their infection potential, researchers have extensively scrutinized various combinatorial libraries by introducing mutations into the capsid, aiming to boost their effectiveness. The emergence of high-throughput experimental techniques, like deep mutational scanning (DMS), has made it feasible to experimentally assess the fitness of these libraries for their intended purpose. Notably, machine learning is starting to demonstrate its potential in addressing predictions within the mutational landscape from sequence data. In this context, we introduce a biophysically-inspired model designed to predict the viability of genetic variants in DMS experiments. This model is tailored to a specific segment of the CAP region within AAV2's capsid protein. To evaluate its effectiveness, we conduct model training with diverse datasets, each tailored to explore different aspects of the mutational landscape influenced by the selection process. Our assessment of the biophysical model centers on two primary objectives: (i) providing quantitative forecasts for the log-selectivity of variants and (ii) deploying it as a binary classifier to categorize sequences into viable and non-viable classes.
Subjects
Mutation, Humans, Capsid Proteins/genetics, Dependovirus/genetics, Parvovirinae/genetics
ABSTRACT
SUMMARY: DCAlign is a new alignment method able to cope with the conservation and the co-evolution signals that characterize the columns of multiple sequence alignments of homologous sequences. However, the pre-processing steps required to align a candidate sequence are computationally demanding. We show in v1.0 how to dramatically reduce the overall computing time by including an empirical prior over an informative set of variables mirroring the presence of insertions and deletions. AVAILABILITY AND IMPLEMENTATION: DCAlign v1.0 is implemented in Julia and it is fully available at https://github.com/infernet-h2020/DCAlign.
Subjects
Sequence Alignment, Computational Biology
ABSTRACT
MOTIVATION: Being able to artificially design novel proteins of desired function is pivotal in many biological and biomedical applications. Generative statistical modeling has recently emerged as a new paradigm for designing amino acid sequences, including in particular models and embedding methods borrowed from natural language processing (NLP). However, most approaches target single proteins or protein domains, and do not take into account any functional specificity or interaction with the context. To extend beyond current computational strategies, we develop a method for generating protein domain sequences intended to interact with another protein domain. Using data from natural multidomain proteins, we cast the problem as a translation problem from a given interactor domain to the new domain to be generated, i.e. we generate artificial partner sequences conditional on an input sequence. We also show in an example that the same procedure can be applied to interactions between distinct proteins. RESULTS: Evaluating our model's quality using diverse metrics, in part related to distinct biological questions, we show that our method outperforms state-of-the-art shallow autoregressive strategies. We also explore the possibility of fine-tuning pretrained large language models for the same task and of using Alphafold 2 for assessing the quality of sampled sequences. AVAILABILITY AND IMPLEMENTATION: Data and code on https://github.com/barthelemymp/Domain2DomainProteinTranslation.
Subjects
Language, Proteins, Amino Acid Sequence, Proteins/chemistry, Protein Domains
ABSTRACT
SUMMARY: Topology determination is one of the most important intermediate steps toward building the atomic structure of proteins from their medium-resolution cryo-electron microscopy (cryo-EM) map. The main goal in topology determination is to identify correct matches (i.e. assignment and direction) between secondary structure elements (SSEs) (α-helices and β-sheets) detected in a protein sequence and in the cryo-EM density map. Despite many recent advances in molecular biology technologies, the problem remains challenging. To address it, this article proposes a linear programming-based topology determination (LPTD) method to solve the secondary structure topology problem in three-dimensional geometrical space. By modeling the protein's sequence with highly reliable extracted features and a distance-based scoring function, the secondary structure matching problem is transformed into a complete weighted bipartite graph matching problem. Subsequently, an algorithm based on linear programming is developed as a decision-making strategy to extract the true topology (native topology) from among all possible topologies. The proposed automatic framework is verified using 12 experimental and 15 simulated α-β proteins. Results demonstrate that LPTD is highly efficient and fast: for 77% of cases in the dataset, the native topology is detected as the first-ranked topology in <2 s. In addition, the method successfully handles large complex proteins with as many as 65 SSEs, a number that has not previously been solved with existing tools or methods. AVAILABILITY AND IMPLEMENTATION: The LPTD package (source code and data) is publicly available at https://github.com/B-Behkamal/LPTD. Two test samples and instructions for using the graphical user interface are provided in the shared readme file. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Linear Programming, Proteins, Cryoelectron Microscopy/methods, Molecular Models, Protein Conformation, Protein Secondary Structure, Proteins/chemistry
ABSTRACT
Despite major environmental and genetic differences, microbial metabolic networks are known to generate consistent physiological outcomes across vastly different organisms. This remarkable robustness suggests that, at least in bacteria, metabolic activity may be guided by universal principles. The constrained optimization of evolutionarily motivated objective functions, such as the growth rate, has emerged as the key theoretical assumption for the study of bacterial metabolism. While conceptually and practically useful in many situations, the idea that certain functions are optimized is hard to validate in data. Moreover, it is not always clear how optimality can be reconciled with the high degree of single-cell variability observed in experiments within microbial populations. To shed light on these issues, we develop an inverse modeling framework that connects the fitness of a population of cells (represented by the mean single-cell growth rate) to the underlying metabolic variability through the maximum entropy inference of the distribution of metabolic phenotypes from data. While no clear objective function emerges, we find that, as the medium gets richer, the fitness and inferred variability for Escherichia coli populations follow and slowly approach the theoretically optimal bound defined by minimal reduction of variability at given fitness. These results suggest that bacterial metabolism may be crucially shaped by a population-level trade-off between growth and heterogeneity.
Subjects
Escherichia coli, Metabolic Networks and Pathways, Bacteria/metabolism, Entropy, Escherichia coli/metabolism, Phenotype
ABSTRACT
The recent technological advances underlying the screening of large combinatorial libraries in high-throughput mutational scans deepen our understanding of adaptive protein evolution and boost its applications in protein design. Nevertheless, the large number of possible genotypes requires suitable computational methods for data analysis, the prediction of mutational effects, and the generation of optimized sequences. We describe a computational method that, trained on sequencing samples from multiple rounds of a screening experiment, provides a model of the genotype-fitness relationship. We tested the method on five large-scale mutational scans, yielding accurate predictions of the mutational effects on fitness. The inferred fitness landscape is robust to experimental and sampling noise and exhibits high generalization power in terms of broader sequence space exploration and higher fitness variant predictions. We investigate the role of epistasis and show that the inferred model provides structural information about the 3D contacts in the molecular fold.
Subjects
Molecular Evolution, Genetic Fitness, Genetic Epistasis, Mutation, Unsupervised Machine Learning
ABSTRACT
BACKGROUND: Boltzmann machines are energy-based models that have been shown to provide an accurate statistical description of domains of evolutionary-related protein and RNA families. They are parametrized in terms of local biases accounting for residue conservation, and pairwise terms to model epistatic coevolution between residues. From the model parameters, it is possible to extract an accurate prediction of the three-dimensional contact map of the target domain. More recently, the accuracy of these models has been also assessed in terms of their ability in predicting mutational effects and generating in silico functional sequences. RESULTS: Our adaptive implementation of Boltzmann machine learning, adabmDCA, can be generally applied to both protein and RNA families and accomplishes several learning set-ups, depending on the complexity of the input data and on the user requirements. The code is fully available at https://github.com/anna-pa-m/adabmDCA . As an example, we have performed the learning of three Boltzmann machines modeling the Kunitz and Beta-lactamase2 protein domains and TPP-riboswitch RNA domain. CONCLUSIONS: The models learned by adabmDCA are comparable to those obtained by state-of-the-art techniques for this task, in terms of the quality of the inferred contact map as well as of the synthetically generated sequences. In addition, the code implements both equilibrium and out-of-equilibrium learning, which allows for an accurate and lossless training when the equilibrium one is prohibitive in terms of computational time, and allows for pruning irrelevant parameters using an information-based criterion.
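The parametrization described in this abstract (local biases for conservation plus pairwise terms for coevolution) is usually written in the Potts-model form below. This is the conventional notation of the field rather than something stated in the abstract: $a_i$ denotes the residue at position $i$ of an aligned sequence of length $L$, $h_i$ the local biases, $J_{ij}$ the pairwise couplings, and $Z$ the normalization constant.

```latex
P(a_1,\dots,a_L) \;=\; \frac{1}{Z}\,
\exp\!\Big( \sum_{i=1}^{L} h_i(a_i) \;+\; \sum_{1 \le i < j \le L} J_{ij}(a_i,a_j) \Big),
\qquad
Z \;=\; \sum_{a_1,\dots,a_L}
\exp\!\Big( \sum_{i=1}^{L} h_i(a_i) \;+\; \sum_{1 \le i < j \le L} J_{ij}(a_i,a_j) \Big).
```

Under this form, strong couplings $J_{ij}$ are the quantities typically used to score candidate contacts between positions $i$ and $j$.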
Subjects
Machine Learning, Proteins, Humans, Proteins/genetics, RNA
ABSTRACT
We present Annealed Mutational approximated Landscape (AMaLa), a new method to infer fitness landscapes from the sequencing data of Directed Evolution experiments. Such experiments typically start from a single wild-type sequence, which undergoes Darwinian in vitro evolution via multiple rounds of mutation and selection for a target phenotype. In recent years, thanks to high-throughput screening and sequencing, Directed Evolution has emerged as a powerful instrument to probe fitness landscapes under controlled experimental conditions and as a relevant testing ground for developing accurate statistical models and inference algorithms. Fitness landscape modeling usually either takes the enrichment of variant abundances as input, which requires observing the same variants at different rounds, or assumes that the last sequenced round is sampled from an equilibrium distribution. AMaLa aims at effectively leveraging the information encoded in the whole time evolution. To do so, while assuming statistical sampling independence between sequenced rounds, the possible trajectories in sequence space are gauged with a time-dependent statistical weight consisting of two contributions: (i) an energy term accounting for the selection process and (ii) a generalized Jukes-Cantor model for the purely mutational step. This simple scheme accurately describes the Directed Evolution dynamics and infers a fitness landscape that correctly reproduces the measurements of the phenotype under selection (e.g., antibiotic drug resistance), notably outperforming widely used inference strategies. In addition, we assess the reliability of AMaLa by showing how the inferred statistical model can be used to predict relevant structural properties of the wild-type sequence.
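The purely mutational contribution mentioned in (ii) can be illustrated with a small sketch. It assumes the textbook Jukes-Cantor model generalized to an alphabet of q symbols, with independent sites and time measured in expected substitutions per site; the exact parametrization used by AMaLa is not given in the abstract, and the function names below are hypothetical.

```python
import math
from itertools import product


def jc_site_probs(mu_t, q):
    """Per-site probabilities under a q-state Jukes-Cantor model.

    mu_t is the expected number of substitutions per site (assumed
    parametrization). Returns (p_same, p_diff), where p_diff is the
    probability of ending in one specific different symbol.
    """
    decay = math.exp(-q * mu_t / (q - 1))
    p_same = 1.0 / q + (1.0 - 1.0 / q) * decay
    p_diff = (1.0 - decay) / q
    return p_same, p_diff


def transition_prob(src, dst, mu_t, q):
    """Probability of mutating src into dst, assuming independent sites."""
    if len(src) != len(dst):
        raise ValueError("sequences must have equal length")
    p_same, p_diff = jc_site_probs(mu_t, q)
    n_diff = sum(a != b for a, b in zip(src, dst))  # Hamming distance
    return p_same ** (len(src) - n_diff) * p_diff ** n_diff


# Usage: the probabilities over all 4**3 possible targets sum to one.
total = sum(
    transition_prob("ACG", "".join(t), 0.3, 4)
    for t in product("ACGT", repeat=3)
)
```

Because the probability depends on the two sequences only through their Hamming distance, such a term is cheap to evaluate for every sequenced variant relative to the wild type.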
Subjects
Computational Biology/methods, Directed Molecular Evolution/methods, Mutation, Algorithms, Molecular Evolution, Genetic Fitness, High-Throughput Nucleotide Sequencing, Genetic Models, DNA Sequence Analysis
ABSTRACT
It is well known that, in order to preserve its structure and function, a protein cannot change its sequence at random, but only by mutations occurring preferentially at specific locations. We here investigate quantitatively the amount of variability that is allowed in protein sequence evolution, by computing the intrinsic dimension (ID) of the sequences belonging to a selection of protein families. The ID is a measure of the number of independent directions that evolution can take starting from a given sequence. We find that the ID is practically constant for sequences belonging to the same family, and moreover it is very similar in different families, with values ranging between 6 and 12. These values are significantly smaller than the raw number of amino acids, confirming the importance of correlations between mutations in different sites. However, we demonstrate that correlations are not sufficient to explain the small value of the ID we observe in protein families. Indeed, we show that the ID of a set of protein sequences generated by maximum entropy models, an approach in which correlations are accounted for, is typically significantly larger than the value observed in natural protein families. We further prove that a critical factor to reproduce the natural ID is to take into consideration the phylogeny of sequences.
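To make the notion of intrinsic dimension concrete, here is a minimal sketch of a two-nearest-neighbour (TwoNN-style) maximum-likelihood estimator. The abstract does not name the estimator used, so this choice is an assumption, and the helper names are hypothetical. The estimator only needs pairwise distances, so a Hamming distance between aligned sequences can be plugged in.

```python
import math
import random


def hamming(a, b):
    """Number of mismatching positions between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))


def two_nn_id(points, dist):
    """Estimate the intrinsic dimension from first and second neighbour distances.

    For each point, mu = r2 / r1 is the ratio of the distances to its second
    and first nearest neighbours; the estimate is N / sum(log(mu)).
    Points with a duplicate (r1 == 0) are skipped.
    """
    log_ratios = []
    for i, p in enumerate(points):
        r1 = r2 = math.inf
        for j, q in enumerate(points):
            if i == j:
                continue
            d = dist(p, q)
            if d < r1:
                r1, r2 = d, r1
            elif d < r2:
                r2 = d
        if r1 > 0 and math.isfinite(r2):
            log_ratios.append(math.log(r2 / r1))
    total = sum(log_ratios)
    if total == 0:
        raise ValueError("not enough distinct neighbour distances")
    return len(log_ratios) / total


# Usage on synthetic data: a flat 2D sheet embedded in 5 dimensions.
random.seed(0)
sheet = [(random.random(), random.random(), 0.0, 0.0, 0.0) for _ in range(400)]
estimate = two_nn_id(sheet, math.dist)
```

For sequence data one would call `two_nn_id(sequences, hamming)`. Discrete distances produce many ties, so the result should be read as a rough indication rather than a precise value.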
Subjects
Molecular Evolution, Proteins/chemistry, Proteins/genetics, Amino Acid Sequence, Computational Biology, Protein Databases/statistics & numerical data, Molecular Models, Mutation, Phylogeny, Protein Conformation, Protein Folding, Proteins/classification, Amino Acid Sequence Homology, Protein Structural Homology
ABSTRACT
Understanding protein-protein interactions is central to our understanding of almost all complex biological processes. Computational tools exploiting rapidly growing genomic databases to characterize protein-protein interactions are urgently needed. Such methods should connect multiple scales from evolutionary conserved interactions between families of homologous proteins, over the identification of specifically interacting proteins in the case of multiple paralogs inside a species, down to the prediction of residues being in physical contact across interaction interfaces. Statistical inference methods detecting residue-residue coevolution have recently triggered considerable progress in using sequence data for quaternary protein structure prediction; they require, however, large joint alignments of homologous protein pairs known to interact. The generation of such alignments is a complex computational task on its own; application of coevolutionary modeling has, in turn, been restricted to proteins without paralogs, or to bacterial systems with the corresponding coding genes being colocalized in operons. Here we show that the direct coupling analysis of residue coevolution can be extended to connect the different scales, and simultaneously to match interacting paralogs, to identify interprotein residue-residue contacts and to discriminate interacting from noninteracting families in a multiprotein system. Our results extend the potential applications of coevolutionary analysis far beyond cases treatable so far.
Subjects
Molecular Evolution, Protein Binding/genetics, Protein Interaction Mapping, Proteins/genetics, Algorithms, Biophysical Phenomena, Computational Biology, Protein Conformation, Proteins/chemistry, Sequence Alignment, Amino Acid Sequence Homology
ABSTRACT
The immune system has developed a number of distinct complex mechanisms to shape and control the antibody repertoire. One of these mechanisms, the affinity maturation process, works in an evolutionary-like fashion: after binding to a foreign molecule, the antibody-producing B-cells exhibit a high-frequency mutation rate in the genome region that codes for the antibody active site. Eventually, cells that produce antibodies with higher affinity for their cognate antigen are selected and clonally expanded. Here, we propose a new statistical approach based on maximum entropy modeling in which a scoring function related to the binding affinity of antibodies against a specific antigen is inferred from a sample of sequences of the immune repertoire of an individual. We use our inference strategy to infer a statistical model on a data set obtained by sequencing a fairly large portion of the immune repertoire of an HIV-1-infected patient. The Pearson correlation coefficient between our scoring function and the IC50 neutralization titer measured on 30 different antibodies of known sequence is as high as 0.77 (p-value 10⁻⁶), outperforming other sequence- and structure-based models.
Subjects
Antibody Affinity/physiology, Antigen-Antibody Reactions/physiology, Immunological Models, Neutralizing Antibodies/chemistry, Neutralizing Antibodies/genetics, Neutralizing Antibodies/metabolism, Antibody Affinity/genetics, Antigen-Antibody Reactions/genetics, B-Lymphocytes/immunology, Antibody Binding Sites/genetics, Antibody Binding Sites/physiology, Cluster Analysis, Computational Biology, Computer Simulation, Entropy, Molecular Evolution, HIV Antibodies/chemistry, HIV Antibodies/genetics, HIV Antibodies/metabolism, HIV Infections/genetics, HIV Infections/immunology, HIV-1/immunology, Humans, Molecular Models, Mutation, Normal Distribution, Sequence Alignment
ABSTRACT
Competitive endogenous (ce)RNAs cross-regulate each other through sequestration of shared microRNAs and form complex regulatory networks based on their microRNA signature. However, the molecular requirements for ceRNA cross-regulation and the extent of ceRNA networks remain unknown. Here, we present a mathematical mass-action model to determine the optimal conditions for ceRNA activity in silico. This model was validated using phosphatase and tensin homolog (PTEN) and its ceRNA VAMP (vesicle-associated membrane protein)-associated protein A (VAPA) as paradigmatic examples. A computational assessment of the complexity of ceRNA networks revealed that transcription factor and ceRNA networks are intimately intertwined. Notably, we found that ceRNA networks are responsive to transcription factor up-regulation or their aberrant expression in cancer. Thus, given optimal molecular conditions, alterations of one ceRNA can have striking effects on integrated ceRNA and transcriptional networks.
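The cross-regulation mechanism described here can be illustrated with a deliberately simplified titration model. This is a sketch under stated assumptions, not the mass-action model of the paper: each ceRNA i is transcribed at rate b[i], decays at rate d[i], and is removed together with one microRNA molecule at rate k[i] times the product of their concentrations, while the microRNA is transcribed at rate beta and decays at rate delta. All names and parameter values are hypothetical.

```python
def steady_state(beta, delta, b, d, k, iters=200):
    """Steady state of a minimal miRNA/ceRNA titration model.

    Setting the time derivatives to zero gives m_i = b_i / (d_i + k_i * mu)
    and a single monotonic equation for the free miRNA level mu, which is
    solved here by bisection on the interval [0, beta / delta].
    """
    def excess(mu):
        # Total miRNA loss (decay plus titration) minus production.
        sink = sum(ki * mu * bi / (di + ki * mu) for bi, di, ki in zip(b, d, k))
        return delta * mu + sink - beta

    lo, hi = 0.0, beta / delta
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0:
            hi = mid
        else:
            lo = mid
    mu = 0.5 * (lo + hi)
    levels = [bi / (di + ki * mu) for bi, di, ki in zip(b, d, k)]
    return mu, levels


# Usage: raise the transcription of ceRNA 2 and observe the level of ceRNA 1.
mu_low, m_low = steady_state(10.0, 1.0, [5.0, 5.0], [1.0, 1.0], [1.0, 1.0])
mu_high, m_high = steady_state(10.0, 1.0, [5.0, 50.0], [1.0, 1.0], [1.0, 1.0])
```

In this toy setting, increasing the transcription of the second ceRNA sequesters more of the shared microRNA, so the free microRNA level drops and the first ceRNA is derepressed, which is the qualitative effect the abstract refers to as cross-regulation.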
Subjects
Gene Expression Regulation, Gene Regulatory Networks/genetics, RNA/genetics, Cell Line, Computational Biology, Gene Dosage, Humans, MicroRNAs/genetics, MicroRNAs/metabolism, Biological Models, RNA/metabolism, Response Elements/genetics, Transcription Factors/metabolism
ABSTRACT
Correlation patterns in multiple sequence alignments of homologous proteins can be exploited to infer information on the three-dimensional structure of their members. The typical pipeline to address this task, which we in this paper refer to as the three dimensions of contact prediction, is to (i) filter and align the raw sequence data representing the evolutionarily related proteins; (ii) choose a predictive model to describe a sequence alignment; (iii) infer the model parameters and interpret them in terms of structural properties, such as an accurate contact map. We show here that all three dimensions are important for overall prediction success. In particular, we show that it is possible to improve significantly along the second dimension by going beyond the pair-wise Potts models from statistical physics, which have hitherto been the focus of the field. These (simple) extensions are motivated by multiple sequence alignments often containing long stretches of gaps which, as a data feature, would be rather untypical for independent samples drawn from a Potts model. Using a large test set of proteins we show that the combined improvements along the three dimensions are as large as any reported to date.
Subjects
Computational Biology/methods, Proteins/chemistry, Protein Sequence Analysis/methods, Statistical Models, Sequence Alignment
ABSTRACT
Systems biology aims at creating mathematical models, i.e., computational reconstructions of biological systems and processes that will result in a new level of understanding: the elucidation of the basic and presumably conserved "design" and "engineering" principles of biomolecular systems. Thus, systems biology will move biology from a phenomenological to a predictive science. Mathematical modeling of biological networks and processes has already greatly improved our understanding of many cellular processes. However, given the massive amount of qualitative and quantitative data currently produced and the number of burning questions in health care and biotechnology that remain to be solved, the field is still in its early phases. It requires novel approaches for abstraction, for modeling bioprocesses that follow different biochemical and biophysical rules, and for combining different modules into larger models that still allow realistic simulation with the computational power available today. We have identified and discussed the currently most prominent problems in systems biology: (1) how to bridge different scales of modeling abstraction, (2) how to bridge the gap between topological and mechanistic modeling, and (3) how to bridge the wet and dry laboratory gap. The future success of systems biology largely depends on bridging these gaps.
Subjects
Biomedical Research/standards, Systems Biology, Humans, Biological Models, Reference Standards
ABSTRACT
We present a powerful experimental-computational technology for inferring network models that predict the response of cells to perturbations, and that may be useful in the design of combinatorial therapy against cancer. The experiments are systematic series of perturbations of cancer cell lines by targeted drugs, singly or in combination. The response to perturbation is quantified in terms of relative changes in the measured levels of proteins, phospho-proteins and cellular phenotypes such as viability. Computational network models are derived de novo, i.e., without prior knowledge of signaling pathways, and are based on simple non-linear differential equations. The prohibitively large solution space of all possible network models is explored efficiently using a probabilistic algorithm, Belief Propagation (BP), which is three orders of magnitude faster than standard Monte Carlo methods. Explicit executable models are derived for a set of perturbation experiments in SKMEL-133 melanoma cell lines, which are resistant to the therapeutically important inhibitor of RAF kinase. The resulting network models reproduce and extend known pathway biology. They empower potential discoveries of new molecular interactions and predict efficacious novel drug perturbations, such as the inhibition of PLK1, which is verified experimentally. This technology is suitable for application to larger systems in diverse areas of molecular biology.
Subjects
Biological Models, Signal Transduction, Systems Biology, Tumor Cell Line, Humans, Monte Carlo Method, Probability
ABSTRACT
The similarity in the three-dimensional structures of homologous proteins imposes strong constraints on their sequence variability. It has long been suggested that the resulting correlations among amino acid compositions at different sequence positions can be exploited to infer spatial contacts within the tertiary protein structure. Crucial to this inference is the ability to disentangle direct and indirect correlations, as accomplished by the recently introduced direct-coupling analysis (DCA). Here we develop a computationally efficient implementation of DCA, which allows us to evaluate the accuracy of contact prediction by DCA for a large number of protein domains, based purely on sequence information. DCA is shown to yield a large number of correctly predicted contacts, recapitulating the global structure of the contact map for the majority of the protein domains examined. Furthermore, our analysis captures clear signals beyond intradomain residue contacts, arising, e.g., from alternative protein conformations, ligand-mediated residue couplings, and interdomain interactions in protein oligomers. Our findings suggest that contacts predicted by DCA can be used as a reliable guide to facilitate computational predictions of alternative protein conformations, protein complex formation, and even the de novo prediction of protein domain structures, contingent on the existence of a large number of homologous sequences which are being rapidly made available due to advances in genome sequencing.
Subjects
Algorithms, Amino Acids/chemistry, Computational Biology/methods, Proteins/chemistry, Amino Acids/genetics, Amino Acids/metabolism, Binding Sites/genetics, Molecular Models, Protein Binding, Protein Conformation, Protein Interaction Mapping/methods, Protein Multimerization, Proteins/genetics, Proteins/metabolism, Reproducibility of Results
ABSTRACT
The potential and promise of deep learning systems to provide an independent assessment and relieve radiologists' burden in screening mammography have been recognized in several studies. However, the low cancer prevalence, the need to process high-resolution images, and the need to combine information from multiple views and scales still pose technical challenges. Multi-view architectures that combine information from the four mammographic views to produce an exam-level classification score are a promising approach to the automated processing of screening mammography. However, training such architectures from exam-level labels, without relying on pixel-level supervision, requires very large datasets and may result in suboptimal accuracy. Emerging architectures such as Vision Transformers (ViT) and graph-based architectures can potentially integrate ipsi-lateral and contra-lateral breast views better than traditional convolutional neural networks, thanks to their stronger ability to model long-range dependencies. In this paper, we extensively evaluate novel transformer-based and graph-based architectures against state-of-the-art multi-view convolutional neural networks, trained in a weakly-supervised setting on a middle-scale dataset, both in terms of performance and interpretability. Extensive experiments on the CSAW dataset suggest that, while transformer-based architectures outperform other architectures, different inductive biases lead to complementary strengths and weaknesses, as each architecture is sensitive to different signs and mammographic features. Hence, an ensemble of different architectures should be preferred over a winner-takes-all approach to achieve more accurate and robust results. Overall, the findings highlight the potential of a wide range of multi-view architectures for breast cancer classification, even in datasets of relatively modest size, although the detection of small lesions remains challenging without pixel-wise supervision or ad-hoc networks.
ABSTRACT
Inter-individual differences in DNA repair capacity (DRC) may lead to genome instability and, consequently, modulate individual cancer risk. Among the different DNA repair pathways, nucleotide excision repair (NER) is one of the most versatile, as it can eliminate a wide range of helix-distorting DNA lesions caused by ultraviolet light irradiation and chemical mutagens. We performed a genotype-phenotype correlation study in 122 healthy subjects to assess whether any associations exist between phenotypic profiles of NER and DNA repair gene single nucleotide polymorphisms (SNPs). Individuals were genotyped for 768 SNPs with a custom Illumina Golden Gate Assay, and peripheral blood mononuclear cells (PBMCs) of the same subjects were tested with a NER comet assay to measure DRC after challenging cells with benzo(a)pyrene diolepoxide (BPDE). We observed a large inter-individual variability of NER capacity, with women showing a significantly lower DRC (mean ± SD: 6.68 ± 4.76; p = 0.004) than men (mean ± SD: 8.89 ± 5.20). Moreover, DRC was significantly lower in individuals carrying a variant allele for the ERCC4 rs1800124 non-synonymous SNP (nsSNP) (p = 0.006) and significantly higher in subjects with the variant allele of MBD4 rs2005618 SNP (p = 0.008), which is in linkage disequilibrium (r² = 0.908) with the rs10342 nsSNP. Traditional in silico docking approaches on protein-DNA and protein-protein interaction showed that the Gly875 variant in ERCC4 (rs1800124) decreases the DNA-protein interaction and that the Ser273 and Thr273 variants in MBD4 (rs10342) indicate complete loss of protein-DNA interactions. Our results show that inter-individual NER capacity can be modulated by cross-talk activity involving nsSNPs in the ERCC4 and MBD4 genes, and they suggest that the effect of SNPs on cancer risk and on the response to chemo- and radiotherapy deserves further investigation.