Genotype to Phenotype Mapping and the Fitness Landscape of the E. coli lac Promoter
- Jakub Otwinowski,
- Ilya Nemenman
- Published: May 1, 2013
- https://doi.org/10.1371/journal.pone.0061570
Abstract
Genotype-to-phenotype maps and the related fitness landscapes that include epistatic interactions are difficult to measure because of their high dimensional structure. Here we construct such a map using the recently collected corpora of high-throughput sequence data from the 75 base pairs long mutagenized E. coli lac promoter region, where each sequence is associated with its phenotype, the induced transcriptional activity measured by a fluorescent reporter. We find that the additive (non-epistatic) contributions of individual mutations account for about two-thirds of the explainable phenotype variance, while pairwise epistasis explains about 7% of the variance for the full mutagenized sequence and about 15% for the subsequence associated with protein binding sites. Surprisingly, there is no evidence for third order epistatic contributions, and our inferred fitness landscape is essentially single peaked, with a small amount of antagonistic epistasis. There is a significant selective pressure on the wild type, which we deduce to be multi-objective optimal for gene expression in environments with different nutrient sources. We identify transcription factor (CRP) and RNA polymerase binding sites in the promoter region and their interactions without difficult optimization steps. In particular, we observe evidence for previously unexplored genetic regulatory mechanisms, possibly kinetic in nature. We conclude with a cautionary note that inferred properties of fitness landscapes may be severely influenced by biases in the sequence data.
Citation: Otwinowski J, Nemenman I (2013) Genotype to Phenotype Mapping and the Fitness Landscape of the E. coli lac Promoter. PLoS ONE 8(5): e61570. https://doi.org/10.1371/journal.pone.0061570
Editor: Andrew J. Yates, Albert Einstein College of Medicine, United States of America
Received: February 14, 2013; Accepted: March 9, 2013; Published: May 1, 2013
Copyright: © 2013 Otwinowski, Nemenman. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: The work has been funded, in part, by the HFSP program and the James S. McDonnell Foundation. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Many aspects of evolution, such as selection, recombination, and speciation, depend on the relationships between genotype, phenotype, and fitness. These relationships often involve complex and collective effects [1], which are difficult to untangle. One approach is to measure the fitness of many different genotypes, and build a fitness landscape, a high dimensional map from genotype/phenotype to reproductive fitness. This concept was first introduced by Sewall Wright in 1932 [2]. Evolutionary dynamics and adaptation depend crucially on features of the fitness landscape, and many studies have quantified large scale features of landscapes, including genetic interactions [3]–[10], the presence of stabilizing selection [11], [12], or the reproducibility of evolutionary paths [7], [13].
A major difficulty that has precluded mapping of large fitness landscapes is epistasis, which is the dependence of the fitness effects of a mutation on the presence of other mutations. Epistasis makes the inference of landscapes combinatorially complex. This problem has attracted substantial attention. For example, millions of interactions between gene pairs have been measured from genetic knockout experiments [14]–[19]. Higher order epistatic interactions, that is, those involving more than two loci at a time, have also been investigated for small fitness landscapes [3].
Another popular approach is mapping genotypes to phenotypes (also known as quantitative trait loci or QTL analysis [20]), which includes the dimensionality reduction problem, but is simpler since many phenotypes are easier to quantify reliably than the number of progeny, which exhibits large fluctuations. One then separately studies the lower dimensional map from the phenotype to the reproductive rate to complete the construction of the fitness landscape.
Unfortunately, few of these pioneering studies have provided a genotype to phenotype or to fitness mapping for longer genetic sequences, and most such large maps are modeled without epistasis (see, e.g., [21]). Indeed, a complete landscape would be defined not by genes or specific loci, but by all possible nucleotide sequences. However, with $4^L$ different sequences of length $L$, it had been impractical to measure the landscapes for sequences of relatively large length until next generation sequencing technologies dramatically lowered the cost [22]. Nonetheless, measuring phenotypes of a large number of sequences is still difficult, and only a few large fitness landscapes have been quantified. For example, Pitt et al. measured the fitness landscape of RNA sequences with an in vitro selection protocol [23]. Similarly, Mora et al. studied frequencies of genetic sequences of IgM molecules in zebrafish B cells (which are related to fitnesses), but they imposed a translational symmetry of the sequence [24]. Finally, Hinkley et al. analyzed 70,000 HIV sequences and their in vitro fitnesses, built a fitness landscape defined on different amino acids of certain HIV genes, and then investigated large scale properties of the ensuing landscape [25], [26]. However, even in these high throughput studies, the data did not contain all possible pairs of mutations, potentially biasing the results, particularly far from the wild type sequences (see Discussion).
In this article, we reconstruct a large, yet detailed bacterial genotype to phenotype map, including quantifying the epistatic interactions in the ensuing fitness landscape. We seek a landscape based on long nucleotide sequences, which additionally allows quantifying phenotypes of transcriptional regulation in addition to those of enzymatic activity. This permits fitnesses to be defined over both coding and non-coding DNA. To map the landscape far from the wild type genotype, we would like sampling of the sequence data that is unbiased by selection.
Recent experiments by Kinney et al. [27] have collected a dataset that comes close to satisfying these criteria. The data consist of mutagenized transcriptional regulatory sequences from the E. coli (MG1655 and TK310 strains) lac promoter. The lac promoter sequences were mutagenized in a 75 nucleotide region containing the cAMP receptor protein (CRP) and RNA polymerase (RNAP) binding sites (−75:−1), with multiple mutations per sequence (see Ref. [27] for additional data set details). The transcriptional activity induced by the mutagenized promoters was measured through fluorescence of the transcribed gene products, and cells were FACS sorted according to the transcriptional activity into up to nine logarithmically spaced categories. All categories were then independently sequenced, so that the quantitative (on the scale of 1 to 9) phenotypic effect of each sequence is known to within a certain accuracy. Further, additional sequence-expression pairs were analyzed for the same operon in different environmental conditions. Thus the data can be used to reconstruct the genotype-to-phenotype map. Moreover, the promoter activity is directly related to lactose metabolism and thus is correlated with growth rate or fitness under conditions where lactose is the preferred energy source. Therefore, the fluorescence may also be viewed as a proxy for fitness of this sequence.
In summary, the Kinney et al. [27] dataset provides simultaneous measurements of sequences and their phenotype. Crucially, the data set is dense, so that every pair of mutations has occurred at least 20 times, each time in a different genetic background of about 5 other random mutations. We use these sequence and transcriptional activity data to infer the detailed genetic landscape for the 75 nucleotide DNA sequence, quantifying pairwise epistatic interactions among all of the nucleotides to the accuracy afforded by the data. This is done by constructing a linear-nonlinear regression model that connects sequences to their phenotypes. Since the number of possible epistatic interactions is comparable with the number of sampled sequences, we control the complexity of the models by regularization, and hence prevent overfitting. This also imposes sparsity on the epistatic interactions, which we expect from the limited number of binding sites. We then analyze the statistics of epistatic effects in the inferred landscape. Finally, analysis of the landscapes obtained under different environmental conditions provides evidence that the wild-type sequence of the E. coli lac promoter is close to optimal in the ecological niche that the bacterium occupies.
Results
Inferring the non-epistatic genotype to phenotype map
The simplest model of a genotype to phenotype map is one where each locus contributes a fixed amount to the phenotype, regardless of the state of other loci. Thus we used the sequence and the fluorescence measurements (see Methods) to fit an additive map using linear regression of the fluorescence values (integers 1 to 9) on the genetic sequence, whose loci are treated as 75 categorical variables with four levels: A, T, G, C. The dummy variables $x_i$ encode the presence of mutations relative to the wild type ($x_i = 1$ when a mutation is present, and $x_i = 0$ otherwise). Since there are four nucleic acids, each locus has three binary numbers, one for each of the possible mutations from the wild type, and the sequence length is effectively tripled. In other words, for each locus, 000 represents the wild type, and 001, 010, 100 represent the three mutations (see Table 1 in Methods). The statistical model is
$$f^\mu = \beta_0 + \sum_{i} \beta_i x_i^\mu + \epsilon^\mu, \tag{1}$$
where $\epsilon$ is the statistical noise, and the superscript $\mu$ stands for a single bacterium, for which the sequence, $x^\mu$, and the fluorescence, $f^\mu$, are known. In subsequent equations, the superscript is suppressed for brevity. Part of the genotype-phenotype map may be non-linear due to the mapping from fluorescence to bin number and due to some remaining background fluorescence. Thus we replace $f$ with a non-linear monotonic function $g(f)$, chosen to optimize the explanatory power of the non-epistatic statistical model, which likely biases downward the inferred effects of epistatic contributions (see Methods). The coefficients, $\beta_0$ and $\beta_i$, are found by ordinary least squares regression, i.e., they are the coefficients that minimize $\sum_\mu (\epsilon^\mu)^2$ in Eq. (1). Since the wild type is a sequence of all zeros, $\beta_0$ is the predicted phenotype of the wild type. The coefficients can be found in File S1.
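The dummy-variable encoding and the least squares fit described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `encode` and `fit_additive`, the ordering of the three mutations within a site, and the toy sequences and phenotype values are all assumptions made for the example, and the nonlinear transformation of the fluorescence is omitted.

```python
import numpy as np

NUCLEOTIDES = "ACGT"

def encode(seq, wild_type):
    """Return 3*L binary dummy variables: 1 where seq carries a given mutation."""
    x = np.zeros(3 * len(wild_type))
    for pos, (s, w) in enumerate(zip(seq, wild_type)):
        if s != w:
            # The three non-wild-type nucleotides at this site, in a fixed
            # (assumed) order; the wild type itself is encoded as 000.
            alternatives = [n for n in NUCLEOTIDES if n != w]
            x[3 * pos + alternatives.index(s)] = 1.0
    return x

def fit_additive(seqs, phenotypes, wild_type):
    """Ordinary least squares fit of phenotype = beta0 + sum_i beta_i * x_i."""
    X = np.array([encode(s, wild_type) for s in seqs])
    design = np.hstack([np.ones((len(seqs), 1)), X])  # column of ones for beta0
    coef, *_ = np.linalg.lstsq(design, np.asarray(phenotypes, float), rcond=None)
    return coef[0], coef[1:]

# Toy data (made up): a 3-site wild type and three mutants.
wild_type = "ACG"
seqs = ["ACG", "TCG", "AGG", "TGG"]
phenotypes = [5.0, 3.0, 4.0, 2.0]
beta0, beta = fit_additive(seqs, phenotypes, wild_type)
```

In this toy set the phenotypes are exactly additive, so the intercept recovers the wild-type value and the two observed mutations receive their individual effects; mutations that never occur in the data get a coefficient of zero from the minimum-norm solution.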
The coefficient of determination, $R^2$, measures the goodness of fit, or how much of the variance in the data is explained by the model. The linear model yields $R^2 \approx 0.51$.
Some variation in the data is experimental noise, such as background fluorescence and cell-to-cell variability, and sets an upper bound on the possible $R^2$. In Methods, we estimate this intrinsic noise to be 10–24%, and therefore about 76–90% of the total variability of the data can be explained by any statistical model, even an arbitrarily complex model. Therefore the linear model accounts for 57–67% of the explainable variance. We emphasize that this statement is not about mechanistic underpinnings of the genotype-to-phenotype relation, but about statistics of the data only. As in any multivariate model, it is possible for the statistical linear effects to emerge from the superposition of many mechanistic epistatic interactions.
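The conversion between raw and explainable variance quoted above can be written out as a short worked calculation. The symbol $\eta$ for the intrinsic noise fraction is introduced here only for illustration, and the value of $R^2$ is the one implied by the quoted percentages rather than an independently reported number.

```latex
\[
R^2_{\text{explainable}} = \frac{R^2}{1-\eta}, \qquad \eta \in [0.10,\ 0.24].
\]
\[
\frac{R^2}{0.90} \approx 0.57, \qquad \frac{R^2}{0.76} \approx 0.67
\quad\Longrightarrow\quad R^2 \approx 0.51 .
\]
```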
Examination of the coefficients with the largest magnitude reveals the consensus locations of the CRP and RNAP binding sites (Fig. 1), which validates the modeling approach. Interestingly, the wild type does not contain the "consensus" binding sequences for CRP [28] and for RNAP [29], but it is only four mutations away. Four of the large positive coefficients in Fig. 1 (positions −54, −34, −9, −8, red circles) correspond to the mutations needed to get the consensus sequences.
Figure 1. Three circles on each stem represent the changes in phenotype for each of the three possible mutations per site. CRP and RNAP are each known to bind at two sites (magenta and cyan areas). Red circles correspond to the mutations needed to get the consensus sequences.
These inferred coefficients may be compared to the energy matrices derived from the same data with information theoretic techniques by Kinney et al. [27]. There the energy matrices were inferred separately for CRP and RNAP, and also over many different experiments, while our regression coefficients were inferred from the whole sequence data. Correlation between our regression coefficients and the energy matrices ranged from 89% to 91% for CRP binding sites. This is comparable to the 95% correlation among energy matrices estimated from different subsets of the data in [27]. Such an agreement between a plainly simple linear-nonlinear model and the results of a computationally complex optimization of information-theoretic quantities is truly surprising and encouraging.
Since correlations among various energy matrices for the RNAP binding are somewhat lower (92%) [27], we expect the agreement between the regression and the information-theoretic methods to be worse for this case. Indeed, the correlations between our regression coefficients and the energy matrices range between 46% and 54%. We expect that this reduction can be attributed partially to the fact that the energy matrices were inferred by Kinney et al. for CRP and RNAP separately or jointly in a thermodynamic model, which assumed a direct relation between RNAP binding and the transcription rate. It has been discussed and measured repeatedly [30], [31] that the transcription rate is strongly affected by kinetics of transcriptional initiation, which is not accounted for by the thermodynamic probability of finding RNAP bound to the regulatory sequence. Unlike the energy matrices, our statistical model inferred from the entire sequence can account for these kinetic effects, and may be more accurate in this context. Since such effects are absent for transcription factor binding, they can potentially explain the differences in agreements between the models observed for CRP and RNAP binding sites. Such kinetic effects may also explain the difference between the wild type and the consensus (that is, the strongest) binding sequences mentioned above. Additional biophysical experiments are needed to carefully explore these issues.
Inferring epistatic contributions to fitness
The simplest model with epistatic interactions between all pairs of nucleotides is a quadratic or bilinear model, written as:
$$g(f) = \beta_0 + \sum_{i} \beta_i x_i + \sum_{i<j} \beta_{ij} x_i x_j + \epsilon, \tag{2}$$
with $g(f)$ the monotonically transformed fluorescence, $x_i$ the mutation indicators, and $\epsilon$ the noise.
The last sum is over all nucleotide pairs. Here nonzero $\beta_{ij}$ would indicate the presence of pairwise epistasis. For example, $\beta_i$, $\beta_j$, and $\beta_{ij}$ all of the same sign is commonly referred to as synergistic epistasis, where the contribution of the pair of mutations is stronger than that of each mutation alone. Other possible types of epistasis are described below.
Note that, in Eq. (2), we keep the same nonlinear function as in the previous section, which maximizes the explanatory power of the non-epistatic terms and minimizes that for the epistatic terms. The number of epistatic terms in this statistical model (about 25,000) should be contrasted with typical biophysical models of protein-DNA interactions, which include only a single energy term describing interactions between the CRP and RNAP proteins [27], [32].
The total number of coefficients (constant, additive, and pairwise) in the quadratic epistasis model, Eq. (2), is 25,201 (accounting for the fact that, in a single genome, only one mutation per site is allowed). Overfitting is a concern since the number of observations, 129,000, is not much larger than the number of coefficients. To infer a model that does not overfit, we applied a standard regularization procedure, which penalizes overly complex models and imposes sparsity on the number of nonzero interaction terms (see Methods). Since available genotypes were not uniformly distributed, but rather biased towards the wild type, we supplemented traditional cross-validation approaches with additional checks to ensure that the regularization selects the model with the highest explanatory power, but no overfitting. The chosen model and its coefficients are discussed in the following. Coefficients of the chosen model, the full model, and the model deemed best by cross-validation can be found in File S1. As we show in Methods, the general structure of the inferred epistatic coefficients is only weakly dependent on the specifics of the model selection.
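The coefficient count of 25,201 can be checked with a few lines of arithmetic. The sketch below assumes the counting convention stated in the text: one constant, three additive terms per site, and nine pairwise terms for every pair of distinct sites (two mutations at the same site cannot co-occur). The variable names are illustrative.

```python
from math import comb

L = 75          # mutagenized sites
MUTATIONS = 3   # non-wild-type nucleotides per site

n_constant = 1
n_additive = MUTATIONS * L                  # one coefficient per possible mutation
n_pairwise = comb(L, 2) * MUTATIONS ** 2    # 9 mutation pairs per pair of distinct sites

total = n_constant + n_additive + n_pairwise
print(n_additive, n_pairwise, total)  # 225 24975 25201
```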
The distribution of inferred phenotype values for randomly generated sequences (Fig. 2) shows that the random sequences are typically not very functional (presumably because the binding sites lose specificity). The peak approximately represents the most common sequence that would be observed under neutral evolution, and the relatively high value for the wild type compared to the random sequences indicates that it is under strong selection. Notice that we can assert this without any comparative genomics or population genetics data, which would typically be required.
Figure 2. Random sequences have very low inferred phenotype values because of the specificity of binding sites. The peak of the distribution indicates what phenotype values evolve under neutral conditions. The wild-type value (green line) is much higher than the neutral value, indicating selective pressure.
The fraction of variance explained by the pairwise epistatic model, $R^2$, is sensitive to the regularization parameter (see Methods). Comparing to the non-epistatic model, and taking into account the intrinsic experimental noise of 10–24%, we see that about 7% of the explainable variance is due to the pairwise epistasis. However, it is possible that more data would increase the predictive power of the epistatic contributions. Furthermore, combinations of multiple epistatic interactions may have a net non-epistatic contribution to the phenotype (but not the other way around). Thus this 7% figure is, in many respects, a negatively biased estimate of the importance of epistasis.
The non-epistatic coefficients are about 70% non-zero, but the interaction terms are very sparse, about 3% non-zero. The phenotype is affected by mutations in some positions more than others. Coefficients with the largest magnitudes belong to positions within the CRP and RNAP binding sites (see Fig. 3). Thus this kind of data allows for identification of binding sites without a biophysical model of protein-DNA interactions, as is done traditionally [33], [34]. More importantly, as Fig. 3 shows, the model can infer functional interactions between amino acid or nucleic acid binding over a much longer range than can be computed from biophysical and structural biology approaches [35]. The consistency of our results with known binding sites validates our inferences. Alternative methods that instead limit the number of inferred coefficients by constraining the range of interactions, or by allowing interactions only between consensus sites, would miss either the long-range effects or the small (but statistically significant) interactions away from the binding sites seen in Fig. 3.
Figure 3. a) Matrix of the sum of the absolute values of the pair interaction coefficients for each pair of sites (3 mutations per site equals 9 interactions) for the chosen statistical model. The clusters near the diagonal are interactions within the RNAP and CRP binding sites, and the off-diagonal clusters are interactions between the binding sites. b) Red: site-specific sum of absolute values of additive coefficients, divided by 3 (the number of possible mutations). Black: site-specific sum of absolute values of epistatic coefficients, divided by 9 (the number of possible mutation pairs). Epistatic and additive effects are strongly correlated, with the correlation coefficient 0.90.
The interaction coefficients are observed to be clustered around the subunits of the system, CRP and RNAP, and their constituent binding sites. The inter- and intra-binding site interactions are easy to separate in Fig. 3, allowing a comparison of the magnitude of the interactions between the subunits, summarized in Table 2. Interestingly, CRP and RNAP interact on the same order of magnitude as their constituent binding sites interact among and within themselves.
Epistatic interactions may be classified into several categories (see Table 2): synergistic epistasis (the effect of two same-sign mutations is larger than the sum of the effects of each one separately), antagonistic epistasis (the effect of two same-sign mutations is smaller than the sum of their individual effects), and other epistatic effects (the individual effects of two mutations have opposite signs, while epistasis is present). We find that most of the interactions in the E. coli lac promoter are antagonistic (388/629 = 62%). This is likely because mutations change protein-DNA binding affinity almost additively, which leads to "diminishing returns" from contributions of individual mutations to transcriptional activity, similar to [4], [6]. Indeed, if the transcription rate is given by a sigmoidal function of the binding free energy $E$, as in [27], then improvements in $E$ are incrementally less important when it is already large and negative. Thus the effect of matching an appropriate nucleotide to the corresponding amino acid decreases when other bases are already matched. Epistasis produced by this mechanism should be antagonistic, but mild [4], [6]. Indeed, we found only one case of a severe type of antagonistic epistasis (reciprocal sign epistasis), where the individual effects are both harmful, but the total effect is beneficial. It is known that reciprocal sign epistasis is a necessary (but insufficient) condition for a multi-peaked landscape [36], and hence we expect this landscape to be fairly smooth (at most two maxima).
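The categories above can be expressed as a small sign test on the fitted coefficients. The sketch below is an illustration under stated assumptions, not the authors' procedure: the function names are hypothetical, the inputs are the two additive coefficients and the pairwise coefficient of a mutation pair, and pairs in which an individual effect is exactly zero are arbitrarily grouped with "other".

```python
def classify_epistasis(beta_i, beta_j, beta_ij):
    """Label a mutation pair by the signs of its additive and interaction terms."""
    if beta_ij == 0:
        return "none"
    if beta_i * beta_j <= 0:
        # Opposite-sign (or zero) individual effects with epistasis present.
        return "other"
    # Same-sign individual effects: does the interaction reinforce or oppose them?
    same_direction = (beta_ij > 0) == (beta_i > 0)
    return "synergistic" if same_direction else "antagonistic"

def is_reciprocal_sign(beta_i, beta_j, beta_ij):
    """Both single mutations harmful, but the double mutant beneficial."""
    return beta_i < 0 and beta_j < 0 and (beta_i + beta_j + beta_ij) > 0

print(classify_epistasis(-1.0, -2.0, 0.5))   # antagonistic
print(is_reciprocal_sign(-1.0, -1.0, 3.0))   # True
```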
While the relationship between phenotype (transcription) and fitness is not precisely known in this experiment, they are likely to be correlated. Therefore the roughness in the genotype-phenotype map is likely to be important for the whole fitness landscape. Identifying fitness with the inferred phenotype value, we characterized this roughness by directly exploring the accessibility of the local optima of the inferred map. We used an adaptive walk similar to the evolution of a large population in the weak mutation regime, which can move only towards higher values and cannot escape local maxima. Starting from the wild-type sequence, the algorithm only chooses mutations that increase the phenotype (or fitness), with probability proportional to the log fitness difference. Out of 1000 random walks, the population ends up in only two very similar sequences, which differ by 2 mutations and are 40 and 39 mutations away from the wild type (far more than the average number of mutations per sequence in the data). Since the sequences are so far away from the training data, their predicted phenotype values are not accurate predictions of the real local maxima.
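An adaptive walk of this kind can be sketched in a few lines. The code below is a minimal illustration under assumptions: sequences are lists of integers 0 to 3 per site, the phenotype value is treated as log fitness so that an uphill move is chosen with probability proportional to its gain, and the toy landscape (minus the Hamming distance to a made-up target) stands in for the inferred map. The names `adaptive_walk` and `toy_phenotype` are hypothetical.

```python
import numpy as np

def adaptive_walk(start, phenotype, n_states=4, rng=None, max_steps=10_000):
    """Uphill walk over single-site changes; stops at a local maximum."""
    rng = np.random.default_rng() if rng is None else rng
    seq = list(start)
    current = phenotype(seq)
    for _ in range(max_steps):
        moves, gains = [], []
        for pos in range(len(seq)):
            for state in range(n_states):
                if state == seq[pos]:
                    continue
                trial = seq.copy()
                trial[pos] = state
                gain = phenotype(trial) - current
                if gain > 0:                 # only uphill moves are allowed
                    moves.append((pos, state))
                    gains.append(gain)
        if not moves:                        # no uphill neighbor: local maximum
            break
        weights = np.array(gains, dtype=float)
        pick = rng.choice(len(moves), p=weights / weights.sum())
        pos, state = moves[pick]
        seq[pos] = state
        current = phenotype(seq)
    return seq, current

# Toy single-peaked landscape (made up): minus the Hamming distance to a target.
target = [1, 0, 3, 2]

def toy_phenotype(seq):
    return -sum(s != t for s, t in zip(seq, target))

end, value = adaptive_walk([0, 0, 0, 0], toy_phenotype, rng=np.random.default_rng(0))
```

On a single-peaked landscape such as this toy one, every walk ends at the same sequence regardless of the random choices; on a rougher landscape, repeating the walk many times and counting the distinct end points gives a rough picture of how many local maxima are accessible.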
Second and higher order epistasis for a subsequence
We have insufficient data to study third and higher order epistasis on the entire 75 bp sequence. However, since most of the linear and the second order epistatic effects in our analysis are concentrated at the consensus binding sites (cf. Fig. 3), we have performed third order epistatic analysis on 22 base pair subsequences of the data, limited to the four known binding sites in the sequence. That is, in addition to the linear and the bilinear model, we also fitted:
$$g(f) = \beta_0 + \sum_{i} \beta_i x_i + \sum_{i<j} \beta_{ij} x_i x_j + \sum_{i<j<k} \beta_{ijk} x_i x_j x_k + \epsilon, \tag{3}$$
where the same procedure was used to find the non-linear function (see Methods). Note that the 22 base pairs were selected based upon consensus binding site locations, not upon our analysis in the preceding sections. Thus one does not expect the overfitting that would ensue if the same data were used to identify the binding sites first, and then to refine their epistatic model.
For this subset of nucleotides, the second order epistatic model, Eq. (2), achieved a higher $R^2$ than the model with only additive effects, Eq. (1). Here the number of interaction coefficients was much smaller (2,212), resulting in no signs of overfitting even without regularization. Thus the importance of quadratic epistasis, which explains 14–20% of the explainable variance for the subsequence, is no longer data limited. As for the full sequence, we investigated the roughness of the landscape created by the binding sites subsequence. We found the landscape to be smooth, with only one global maximum, exactly matching the consensus (but not the wild type) regulatory sequence.
The third order epistatic model, Eq. (3), had 47,972 coefficients, which needed to be regularized in the same manner as the quadratic model (Methods). At the maximum cross-validated $R^2$, the fit was no better than that of the quadratic model. Thus the higher order interactions do not improve the fit, and there is no evidence for these third order epistatic interactions in the data, although it is possible that larger data sets would reveal them. Similarly, further restricting the subset of base pairs used in the analysis did not reveal statistically significant third order effects. In other words, quite surprisingly, for these data, combinatorial effects of triple mutations can be fully modeled by effects produced by constitutive pairs of the triples.
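One way to reproduce the two coefficient counts quoted for the subsequence is shown below. It rests on an assumption that is not stated in the text: that every combination of the $3 \times 22 = 66$ dummy variables is counted, including same-site combinations that can never co-occur in a single sequence, and that the figure of 2,212 includes the constant and additive terms.

```latex
\[
1 + 66 + \binom{66}{2} = 1 + 66 + 2145 = 2212,
\]
\[
2212 + \binom{66}{3} = 2212 + 45760 = 47972 .
\]
```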
Landscape in two environments
In addition to the data from the three experiments analyzed above, Kinney et al. [27] performed experiments with a different strain of bacteria (TK310) that is unable to control its intracellular cAMP levels. Because CRP is activated by cAMP, varying extracellular cAMP levels controls the active intracellular concentration of CRP. E. coli prefers to metabolize glucose over lactose, so cAMP is inhibited by the presence of glucose, and lac expression is suppressed when glucose is present. We inferred genotype-phenotype maps using the non-epistatic model, as in the section "Inferring the non-epistatic genotype to phenotype map", for two conditions, without cAMP and with cAMP, representing an environment with glucose and no glucose, respectively. The datasets are smaller and distinguish only 5 levels of fluorescence, but they are otherwise very similar, so the same linear-nonlinear optimization was used. The results shown below were found with the non-epistatic model. However, here the pair interactions account for a smaller fraction of the variance, and the epistatic model produces very similar fitted values.
As expected, when CRP is not active there is little binding at the CRP sites, and the associated coefficients are nearly all small (Fig. 4). Because of the lack of CRP binding, expression for the wild type sequence, and sequences close to the wild type, is lower when there is glucose (Fig. 5). However, there are some changes to the RNAP binding site coefficients. Random sequences are not functional in the no-glucose environment, but they have some modest functionality, comparable to the wild type, in the glucose environment (Fig. 5), suggesting that there is less specificity in the RNAP binding. Note also that some of the coefficients, especially for the no cAMP case, are large just outside the traditional RNAP binding domain. Unexpectedly, for no cAMP, the transcription rate is comparable to the cAMP present case, when CRP helps polymerase recruitment. This suggests some additional biophysical binding mechanisms, currently unexplored. As discussed above, these mechanisms are quite possibly kinetic in nature.
Figure 5. The wild type is nearly on the optimal front in that very few sequences have both higher expression with cAMP and lower expression without cAMP (above and to the left of the plus sign). The phenotype values range from 1 to 5 in these experiments. The dissimilarity of measured expression and expressions predicted for random sequences along the vertical, but not the horizontal axis, likely signals the presence of poorly understood biophysical mechanisms differentially employed in the two considered environments.
In the no cAMP (glucose) environment, lac expression should decrease the growth rate because the cell is metabolizing glucose instead of lactose, and lac expression costs resources [37], [38]. Therefore we expect sequences under selection, such as the wild type, to have relatively high expression with cAMP, and low expression without cAMP, compared to sequences not under selection (random sequences). Figure 5 shows that there exist very few sequences which are better than the wild type in both environments, i.e., with simultaneously higher expression with cAMP and lower expression without cAMP. The non-elliptical shape of the fitted values for the experimental sequences suggests again that the wild type is under strong selection towards the top left corner of the plot. Finally, we point out that, even when lactose is being metabolized, too high expression of lac genes is costly, possibly because cellular resources are pulled to lac transcription and translation and away from production of essential proteins [37]. This may make sequences in the top right corner of Fig. 5 less fit than our monotonically increasing model assumes, making the wild type even closer to global optimality.
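The two-environment comparison amounts to asking how many sequences dominate the wild type in both objectives at once. A minimal sketch of that check follows; the function name `dominating_fraction` and the expression values are made up for illustration and are not taken from the data.

```python
import numpy as np

def dominating_fraction(with_camp, without_camp, ref_with, ref_without):
    """Fraction of sequences with higher expression with cAMP AND lower
    expression without cAMP than a reference (e.g. the wild type)."""
    with_camp = np.asarray(with_camp, dtype=float)
    without_camp = np.asarray(without_camp, dtype=float)
    better = (with_camp > ref_with) & (without_camp < ref_without)
    return better.mean()

# Toy fitted values (made up) for four sequences in the two environments.
expr_with = [4.5, 5.0, 2.0, 3.0]
expr_without = [0.8, 3.0, 0.5, 2.0]
frac = dominating_fraction(expr_with, expr_without, ref_with=4.0, ref_without=1.0)
print(frac)  # 0.25
```

A small fraction means the reference sits close to the optimal front for the pair of objectives; a sequence that improves one objective while worsening the other does not count as dominating.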
Word
We constructed a genotype-to-phenotype mapping, including effects of all pairwise and some college order epistatic interactions. This was done by analyzing functional properties of over randomly mutated sequences in the vicinity of the wild type Due east. coli lac operon, queried under dissimilar experimental conditions. The control of dimensionality for the epistatic models, along with the large size of the dataset, allows for a much more than detailed assay of epistasis in this bacterial genetic regulatory region.
Our approach is generally similar to those in Refs. [25], [26]. Still, in that location are substantial differences beyond a different model organism used. Our alleles are nucleotides in a regulatory region of a bacteria, instead of amino acid variants. Our landscape is more consummate, in that interaction among all pairs of nucleotides in the sequence are estimated from the data that includes each such pair at least twenty times in different genetic backgrounds. In particular, we take relaxed the status [24] that the interaction terms can depend simply on the distance betwixt the loci, rather than on the specific positions of the loci. Mora et al. [24] used maximum entropy approaches to infer a fitness landscape, while, forth with Hinkley et al. [25], we have focused on linear regression (though with different regularization constraints and different nonlinear mapping between the fitness and the observed phenotype). The epistatic model, Eq. (ii), is the same in the regression and the maximum entropy approach. However, the philosophical ground behind the approaches is different, so are the criteria used to specify the coefficients . Maximum entropy methods choose them to constrain observable correlation functions, while regression attempts to approximate the entire fettle office. Information technology remains to be seen which of the two frameworks provides a better model for genomic data.
Perhaps the largest difference from the previous approaches that considered epistatic interactions for many mutations is that we establish a genotype-phenotype map, rather than the true fitness landscape. While we expect the phenotype and the fitness to be strongly correlated when lactose is being metabolized (and anti-correlated otherwise), the relation between the fitness and either the observed fluorescence or its nonlinearly reparameterized form is likely nontrivial. Ideally, a second experiment would measure the phenotype-to-fitness map to complete the reconstruction of the fitness landscape. In fact, Dekel and Alon [37] have completed this second step for the lac regulatory sequence. However, we cannot apply their findings, since their E. coli strains and growth environments were slightly different from those of Kinney et al. [27].
Binding energy-fitness maps have been inferred from genome-wide studies of transcription factor binding sites using genomic statistics and population genetics models [39]–[42]. In those studies, the genotype-phenotype maps were largely assumed to be non-epistatic, in contrast to our work. It would be interesting to combine the methods to make a more complete account of epistasis from genotype to fitness.
Our observations suggest a few cautionary notes about using genome frequencies in a population to reconstruct fitness landscapes [24], [25]. In such experiments, all sequence data (including any part of it that is left for cross-validation) are localized near the wild type, near-optimal sequences due to selection. Carefully inferred models (whether regression or maximum entropy based) perform well for the observed data, but will generalize badly for sequences far away from the wild type. Our approach samples the genotype space more evenly without selection, and therefore is better suited for making inferences about global landscape properties, such as ruggedness. Still, even in our data, with each sequence only a few mutations away from the wild type, extrapolation to much larger genotypic differences produces absurd results, even if cross-validation fails to find problems (see Methods: Regularization and model selection).
In our inferred landscape, epistasis accounted for about 7% (about 15% for the binding sites subsequence) of the explainable variance. Most of the epistasis was antagonistic, but the landscape was essentially single peaked. This is similar to properties of epistasis in metabolism [4], [6], and the explanation for both likely involves diminishing returns from successive individual mutations. It is useful to contrast these findings with the work on HIV [26] or protein fitness landscapes [7], which have observed more substantial epistasis and many more local maxima. While it is possible that more epistatic effects would be observed for our system if more data were available, more intriguing is the following observation. During model selection (see Methods), we noticed that, because most of the sequences are only a few mutations from the wild type, it was possible to make large prediction errors for sequences with more mutations. In other words, there was a large extrapolation error for sequences outside of the training data, and this led to choosing a more constrained model for the final analysis. A less constrained model (which maximizes the cross-validated fit, cf. Methods: Regularization and model selection) is much more epistatic, with adaptive walks indicating many local maxima. The severity of the problem correlates with the nonuniformity of the genotype sampling, making the data from populations under strong selection especially suspect. To allow studying global properties of landscapes, an ideal experiment would sample the sequence space much more uniformly to avoid extrapolation.
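To make the notion of an adaptive walk concrete, the sketch below runs greedy uphill walks on a toy additive-plus-pairwise landscape and collects the end points, which are local maxima. It rests on assumptions: the text does not specify the walk protocol, so steepest ascent over single-nucleotide changes is used here, and the coefficient dictionaries `additive` and `pairwise` and all their values are hypothetical.

```python
import itertools

ALPHABET = "ACGT"


def predict(seq, additive, pairwise):
    """Additive plus pairwise prediction for one sequence.

    additive maps (site, base) to a coefficient; pairwise maps
    (site_i, base_i, site_j, base_j) to an interaction coefficient.
    """
    total = sum(additive.get((i, b), 0.0) for i, b in enumerate(seq))
    for (i, a, j, b), coef in pairwise.items():
        if seq[i] == a and seq[j] == b:
            total += coef
    return total


def adaptive_walk(start, additive, pairwise):
    """Move to the best single mutant until no neighbor is strictly better."""
    current = start
    current_val = predict(current, additive, pairwise)
    while True:
        best, best_val = current, current_val
        for i in range(len(current)):
            for b in ALPHABET:
                if b == current[i]:
                    continue
                cand = current[:i] + b + current[i + 1:]
                val = predict(cand, additive, pairwise)
                if val > best_val:
                    best, best_val = cand, val
        if best == current:
            return current  # local maximum reached
        current, current_val = best, best_val


def find_peaks(starts, additive, pairwise):
    """Set of local maxima reached from the given starting sequences."""
    return {adaptive_walk(s, additive, pairwise) for s in starts}


# Toy landscape on 3 sites: 'A' is favored everywhere, and one
# hypothetical pairwise term rewards 'C' at sites 0 and 1 together.
length = 3
additive = {(i, "A"): 1.0 for i in range(length)}
pairwise = {(0, "C", 1, "C"): 3.0}
all_starts = ["".join(s) for s in itertools.product(ALPHABET, repeat=length)]

peaks_additive = find_peaks(all_starts, additive, {})
peaks_epistatic = find_peaks(all_starts, additive, pairwise)
```

With only additive terms every walk ends at the same sequence, while the single strong pairwise term adds a second peak. For a 75 bp region the starts would have to be sampled rather than enumerated.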
In addition to the weak epistasis, we also found that the wild-type E. coli lac regulatory region is optimal for the two environments measured. That is, it is on the front of possible sequences that maximize expression when it is beneficial and minimize expression when it is harmful. If under the growth conditions the fitness is a non-monotonic function of the transcriptional activity and decreases at high expression [37], the wild type operon may be not only nearly multi-objective optimal, but nearly globally optimal. To investigate this, experiments are needed that would study fitnesses of many sequences under selection in fluctuating environments.
The ability of our method to identify protein binding sites and epistatic interactions among them raises an important point. These epistatic interactions, inferred by either of the methods we have mentioned in this work, particularly interactions over long ranges, may not correspond to true biophysical interactions between amino acids and nucleotides. They are likely effective interactions resulting from collective effects of many other epistatic terms, including higher order terms, or a small number of interactions, such as binding between CRP and RNAP. While there is an admirable similarity between our linear regression coefficients and energies of protein-DNA interactions, our approach may not be as informative where there is enough data to build a detailed biophysical model, but there are few places in the genome where this is the case. On the other hand, our approach can detect long distance epistasis, or non-thermodynamic effects on transcription, where a priori it is unclear that these effects and interactions exist. When working on the genome scale, effective models that can make accurate predictions of phenotype or fitness for previously unobserved sequences may be useful regardless of their lack of microscopic accuracy. They may be closer to the right level of description of the problem [43], by striking a balance between microscopic, biophysically relevant detail and the ability to describe the richness of phenomena emerging on the genomic scale. As an example of this utility, here we found that, for the 22 bp long subsequence of the regulatory region that includes the binding sites, there was no evidence for third order epistatic effects.
The fact that pairwise effective interaction models, with only a few higher order contributions, provide excellent fits to multivariate data has been observed by now in the context of neurophysiological recordings [44]–[48], microarray-measured gene expression [49]–[51], and sequencing data [24], to which our analysis has just added another example. These frequent successes of pairwise models in diverse domains are certainly surprising and, as of now, unexplained. They raise many interesting questions about general theories of multivariate biological data, which are still waiting for their answers.
Methods
Preparation of the dataset
To make inferences on the largest dataset possible, we combined the data from three experiments done by Kinney et al. [27] (fullwt, crpwt, rnapwt, 129,000 sequences total), which differ only by the regions in which mutations were allowed to take place. Fullwt was mutagenized over the whole sequence (−75:−1), while crpwt and rnapwt were mutagenized only over the CRP binding area and the RNAP binding area, respectively. In addition, some sequences were rejected for data quality reasons: identical sequences in the same bin were likely not to be independent measurements (see Supplemental Materials in Ref. [27]), and sequences with an exceptional number of mutations were probably errors.
Linear-nonlinear model
Part of the genotype-phenotype map may be non-linear due to the mapping from fluorescence to bin number and some remaining background fluorescence. To identify pairwise interactions in the background of an arbitrary mean nonlinear genotype-phenotype map, we introduce a generalized linear-nonlinear model, Eq. (4), in which the regression target is a monotonically increasing, nonlinear function of the bin number. The function is found by maximizing the fit, which corresponds to minimizing the quantity in Eq. (5).
We add two constraints on the function to keep it finite. The function is defined over only 9 values of the bin number, and a constrained non-linear optimization procedure (fmincon from MATLAB) finds an optimum quickly (Fig. 6). Note that this method resembles a type of generalized linear model called ordinal probit regression [52], and is also similar to the inference of non-linear filters in computational neuroscience using information-theoretic tools [53].
Figure 6. Constrained non-linear optimization found the optimal function for the linear model. The non-linearity is due to the first few bins being dominated by background fluorescence and not gene expression.
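The reparameterization step can be sketched in a few lines. This is an illustration under stated assumptions, not the procedure used in the paper: the goodness of fit is taken to be R^2, the end points are pinned at f(1) = 1 and f(9) = 9, a single scalar predictor stands in for the full linear model, and a simple random hill climb replaces MATLAB's fmincon. The data are synthetic, and all function names are hypothetical.

```python
import random

N_BINS = 9


def make_f(increments):
    """Turn 8 positive increments into f(1..9) with f(1)=1 and f(9)=9."""
    total = sum(increments)
    f = [1.0]
    for d in increments:
        f.append(f[-1] + (N_BINS - 1) * d / total)
    f[-1] = float(N_BINS)  # remove rounding drift at the pinned end point
    return f


def r_squared(x, y):
    """Squared Pearson correlation, i.e. R^2 of a one-predictor linear fit."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def fit_monotone(x, bins, n_steps=1000, seed=0):
    """Hill-climb over increments; accept a change only if R^2 improves."""
    rng = random.Random(seed)
    inc = [1.0] * (N_BINS - 1)  # start from the identity map f(mu) = mu
    f = make_f(inc)
    best = r_squared(x, [f[b - 1] for b in bins])
    for _ in range(n_steps):
        trial = inc[:]
        k = rng.randrange(len(trial))
        # Multiplicative update, clamped so every increment stays positive.
        trial[k] = min(max(trial[k] * 2.0 ** rng.uniform(-0.5, 0.5), 0.05), 20.0)
        f_trial = make_f(trial)
        score = r_squared(x, [f_trial[b - 1] for b in bins])
        if score > best:
            inc, f, best = trial, f_trial, score
    return f, best


# Synthetic example: bins are a nonlinear, binned readout of a predictor x.
data_rng = random.Random(1)
x = [data_rng.random() for _ in range(300)]
bins = [1 + min(8, int(9 * xi ** 3)) for xi in x]

baseline = r_squared(x, bins)  # linear fit to the raw bin numbers
f, best = fit_monotone(x, bins)
```

Because the function lives on only nine points, even this crude search has few degrees of freedom to explore, which is consistent with the remark that the optimum is found quickly.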
The summary statistics change when the bin number is replaced with its transformed value. The variance of the bin numbers increases from 6.5 to 7.6, and the goodness of fit increases from 0.476 for the linear model of the raw bin numbers to 0.514 for the linear model of the transformed values. The experimental noise estimates (see below) are also slightly different.
Assuming a monotonic relationship between genotype and phenotype, this function is the one that maximizes the phenotype prediction from the non-epistatic (linear) contributions. This reduces the amount of variability left to be predicted by any epistatic model, whether of the genotype-phenotype map or of the genotype-fitness map (provided that the fitness is monotonically related to the phenotype). This also prevents the epistatic model from fitting any average non-linear effects. Thus our subsequent assessment of the importance of epistasis should be viewed as biased towards underestimation.
Estimates of intrinsic noise in the data
Experimental data are corrupted by errors in both fluorescence measurements and sequencing. One estimate of this intrinsic noise is obtained by averaging the variance of the transformed bin values over identical sequences with different recorded fluorescence values. We express this intrinsic variance as a ratio to the full variance. Since this excludes all sequences that fell into just one bin and have an unknown variance, this estimate is an upper bound on the noise variance.
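The first noise estimate can be sketched as a group-by computation. The snippet is a minimal illustration with assumptions labeled: the records and their values are made up, the population variance is used, and the per-sequence variances are averaged without weights (the text does not state either choice). The helper name `intrinsic_noise_ratio` is hypothetical.

```python
from collections import defaultdict
from statistics import pvariance


def intrinsic_noise_ratio(records):
    """Average within-sequence variance and its ratio to the total variance.

    records is a list of (sequence, transformed_bin_value) pairs. Only
    sequences observed in more than one distinct bin contribute to the
    within-sequence average.
    """
    by_seq = defaultdict(list)
    for seq, value in records:
        by_seq[seq].append(value)
    within = [pvariance(v) for v in by_seq.values() if len(set(v)) > 1]
    mean_within = sum(within) / len(within)
    total = pvariance([value for _, value in records])
    return mean_within, mean_within / total


# Hypothetical measurements: two sequences seen in several bins, one seen once.
records = [
    ("AAC", 2.0), ("AAC", 4.0),
    ("AGC", 5.0),
    ("TTC", 1.0), ("TTC", 3.0), ("TTC", 5.0),
]
mean_within, ratio = intrinsic_noise_ratio(records)
```

The sequence observed in a single bin is skipped, which is the exclusion the text identifies as making this estimate an upper bound.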
Another estimate can be obtained by using the controls from Ref. [27], which provide fluorescence numbers for many individual wild type bacteria. The fluorescence variance in optimized bin units is 0.74, which is a small fraction of the data variance. This number underestimates the average noise, since wild type bacteria express strongly, so that the fluorescence noise for them is smaller than for most other sequences.
Regularization and model selection
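As a rough check on scale, and assuming that the data variance referred to here is the value of 7.6 reported for the bin numbers after the nonlinear transformation (an assumption, since the fraction itself is missing from this copy of the text), the control variance corresponds to:

```latex
\frac{0.74}{7.6} \approx 0.097 \approx 10\%
```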
A statistical model with a number of parameters comparable to the data set size may overfit, that is, model statistical noise in the data. To prevent overfitting, we minimize the mean squared error in Eq. (2) subject to a regularizing constraint, Eq. (6), which acts on the norm of the concatenated vector of all the regression coefficients and whose strength is set by a free parameter (Lagrange multiplier), unknown a priori. Regularization constrains the statistical complexity of the model by minimizing the norm of the coefficients [54]. When the L1 norm is used, this regression is called the Least Absolute Shrinkage and Selection Operator (LASSO) [55]. LASSO favors sparse solutions, which is a reasonable assumption since most of the coefficients are interaction terms, and interactions are presumed to be mainly between the relatively small CRP and RNAP binding sequences. Thanks to an efficient implementation of the algorithm [56], we can compute the LASSO solution for 100 different values of the regularization parameter, from the maximum value (where the solution has all coefficients equal to zero) to four orders of magnitude smaller.
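The regularization path described above can be sketched with a small coordinate-descent LASSO. This is an illustrative stand-in, not the implementation of Ref. [56]: the objective scaling (half the mean squared error plus the parameter times the L1 norm), the fixed number of sweeps, the absence of an intercept, and the synthetic data are all assumptions, and the function names are hypothetical.

```python
import random


def soft_threshold(z, t):
    """Shrink z towards zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, beta, n_sweeps=20):
    """Coordinate descent for (1/2n)*||y - X b||^2 + lam*||b||_1, warm-started."""
    n, p = len(X), len(X[0])
    beta = list(beta)
    resid = [y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual that excludes beta[j].
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def lasso_path(X, y, n_lambdas=100, decades=4):
    """Solutions on a log-spaced grid from lam_max down by `decades` orders."""
    n, p = len(X), len(X[0])
    lam_max = max(abs(sum(X[i][j] * y[i] for i in range(n))) / n for j in range(p))
    lambdas = [lam_max * 10 ** (-decades * k / (n_lambdas - 1)) for k in range(n_lambdas)]
    path, beta = [], [0.0] * p
    for lam in lambdas:
        beta = lasso_cd(X, y, lam, beta)  # warm start from the previous solution
        path.append(beta[:])
    return lambdas, path


# Synthetic data: only coefficients 0 and 3 are truly non-zero.
rng = random.Random(0)
n_samples, n_features = 60, 8
X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
y = [2.0 * row[0] - 1.5 * row[3] + 0.1 * rng.gauss(0.0, 1.0) for row in X]

lambdas, path = lasso_path(X, y)
n_nonzero = [sum(1 for b in beta if b != 0.0) for beta in path]
```

At the largest parameter value every coefficient is exactly zero, and the list `n_nonzero` shows how coefficients enter the model as the constraint is relaxed. Warm starts keep each step cheap, which is what makes a 100-point path practical.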
However, choosing the best solution (i.e., the right value of the regularization parameter) is ambiguous. A common method of model selection is cross-validation. Figure 7 shows that solutions with a large regularization parameter are a poor fit, while small values have less predictive power, as seen through cross-validation. Typically one chooses the best model as the one with the maximum cross-validated fit [55]. However, both the training and the cross-validation data are sequences with an average of only 6.8 mutations from the wild type (9% mutated sites). Thus cross-validation may not ensure predictability for sequences farther away in the genotype space. Indeed, the variance of the fitted values for the experimental data is not sensitive to changes in the regularization parameter (not shown). Nonetheless, Fig. 7 shows that the variance of the fitted values for random sequences blows up for less constrained models (low values of the regularization parameter), where unrealistically high fitted values emerge. This indicates overfitting due to uneven sampling of the genotype space and the resulting correlations in the training and the test data. We thus limit the regularization parameter to the range where the variance of the fitted values for random sequences is comparable to that for the experimental data and is insensitive to the parameter. Incidentally, this is also the place where the fitted and the cross-validated curves split in Figure 7 (dashed line, 629 non-zero coefficients). Finally, Fig. 8 shows that the general structure of the solution is only weakly dependent on the exact choice of the regularization parameter.
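A back-of-the-envelope comparison shows how far random sequences lie from the training data. It assumes that a random sequence draws each of the 75 positions uniformly from the four nucleotides, which the text does not state explicitly, so that each position differs from the wild type with probability 3/4:

```latex
\frac{6.8}{75} \approx 0.09,
\qquad
\mathbb{E}\left[d_{\text{random}}\right] = 75 \times \frac{3}{4} \approx 56 .
```

Under this assumption a random sequence differs from the wild type at roughly eight times as many sites as a typical training sequence, so evaluating the model on random sequences is an extrapolation rather than an interpolation.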
Figure 7. Blue is the goodness of fit, and red is its 10-fold cross-validated value. The green curve is the variance of the fitted values for randomly generated sequences. The variance is too large even for values of the regularization parameter that are larger than the optimal value predicted by the maximum of the cross-validation curve. We choose the model marked by the dashed line for further analysis. This model has 629 non-zero coefficients, most of which are epistatic.
Figure 8. As in Fig. 3, we show the matrices of the sums of the absolute values of the pair interaction coefficients for each pair of sites. a) Coefficients for the model with the maximum cross-validated fit. b) Coefficients for the full model. Notice the same general structure of the coefficients for varying regularization strength, including in Fig. 3. This indicates stability under changes of the parameter.
Supporting Information
Acknowledgments
We would like to thank Justin Kinney for providing us with the sequence data, David Cutler, Thierry Mora, and Minsu Kim for illuminating discussions, Thierry Mora, Justin Kinney, and Philip Johnson for commenting on the manuscript, and Bruce Levine for general guidance.
Author Contributions
Analyzed the data: JO IN. Contributed reagents/materials/analysis tools: JO IN. Wrote the paper: JO IN.
References
- 1. Goldenfeld N, Woese C (2011) Life is Physics: Evolution as a Collective Phenomenon Far From Equilibrium. Ann Rev Cond Mat Phys 2: 375–399.
- 2. Wright S (1932) The roles of mutation, inbreeding, crossbreeding and selection in evolution. Proc 6th Int Congress Genetics 1: 356–365.
- 3. Szendro I, Schenk M, Franke J, Krug J, de Visser J (2012) Quantitative analyses of empirical fitness landscapes. arXiv preprint arXiv:1202.4378.
- 4. Chou H, Chiu H, Delaney N, Segrè D, Marx C (2011) Diminishing returns epistasis among beneficial mutations decelerates adaptation. Science 332: 1190–1192.
- 5. Franke J, Klözer A, de Visser J, Krug J (2011) Evolutionary accessibility of mutational pathways. PLoS Comp Biol 7: e1002134.
- 6. Khan AI, Dinh DM, Schneider D, Lenski RE, Cooper TF (2011) Negative epistasis between beneficial mutations in an evolving bacterial population. Science 332: 1193–6.
- 7. Weinreich D, Delaney N, Depristo M, Hartl D (2006) Darwinian evolution can follow only very few mutational paths to fitter proteins. Science 312: 111–114.
- 8. Hall DW, Agan M, Pope SC (2010) Fitness epistasis among 6 biosynthetic loci in the budding yeast Saccharomyces cerevisiae. J Heredity 101: S75–84.
- 9. da Silva J, Coetzer M, Nedellec R, Pastore C, Mosier DE (2010) Fitness epistasis and constraints on adaptation in a human immunodeficiency virus type 1 protein region. Genetics 185: 293–303.
- 10. Lunzer M, Miller SP, Felsheim R, Dean AM (2005) The biochemical architecture of an ancient adaptive landscape. Science 310: 499–501.
- 11. Kingsolver JG, Hoekstra HE, Hoekstra JM, Berrigan D, Vignieri SN, et al. (2001) The strength of phenotypic selection in natural populations. American Naturalist 157: 245–61.
- 12. Shaw RG, Geyer CJ (2010) Inferring fitness landscapes. Evolution 64: 2510–20.
- 13. Poelwijk FJ, Kiviet DJ, Weinreich DM, Tans SJ (2007) Empirical fitness landscapes reveal accessible evolutionary paths. Nature 445: 383–6.
- 14. Segrè D, Deluna A, Church GM, Kishony R (2005) Modular epistasis in yeast metabolism. Nature Genet 37: 77–83.
- 15. Costanzo M, Baryshnikova A, Bellay J, Kim Y, Spear ED, et al. (2010) The genetic landscape of a cell. Science 327: 425–31.
- 16. Moore JH (2005) A global view of epistasis. Nature Genet 37: 13–4.
- 17. Phillips PC (2008) Epistasis–the essential role of gene interactions in the structure and evolution of genetic systems. Nature Rev Genet 9: 855–67.
- 18. Stark C, Breitkreutz BJ, Chatr-Aryamontri A, Boucher L, Oughtred R, et al. (2011) The BioGRID Interaction Database: 2011 update. Nucl Acids Res 39: D698–704.
- 19. Baryshnikova A, Costanzo M, Kim Y, Ding H, Koh J, et al. (2010) Quantitative analysis of fitness and genetic interactions in yeast on a genome scale. Nature Methods 7: 1017–24.
- 20. Liu BH (1997) Statistical Genomics: Linkage, Mapping, and QTL Analysis. CRC Press.
- 21. Brem R, Kruglyak L (2005) The landscape of genetic complexity across 5,700 gene expression traits in yeast. Proc Natl Acad Sci USA 102: 1572–1577.
- 22. Shendure J, Ji H (2008) Next-generation DNA sequencing. Nature Biotech 26: 1135–1145.
- 23. Pitt JN, Ferré-D'Amaré AR (2010) Rapid construction of empirical RNA fitness landscapes. Science 330: 376–9.
- 24. Mora T, Walczak AM, Bialek W, Callan CG (2010) Maximum entropy models for antibody diversity. Proc Natl Acad Sci USA 107: 5405–10.
- 25. Hinkley T, Martins JA, Chappey C, Haddad M, Stawiski E, et al. (2011) A systems analysis of mutational effects in HIV-1 protease and reverse transcriptase. Nature Genet 43: 487–9.
- 26. Kouyos RD, Leventhal GE, Hinkley T, Haddad M, Whitcomb JM, et al. (2012) Exploring the Complexity of the HIV-1 Fitness Landscape. PLoS Genetics 8: e1002551.
- 27. Kinney JB, Murugan A, Callan CG, Cox EC (2010) Using deep sequencing to characterize the biophysical mechanism of a transcriptional regulatory sequence. Proc Natl Acad Sci USA 107: 9158–9163.
- 28. Berg OG, von Hippel PH (1988) Selection of DNA binding sites by regulatory proteins. II. The binding specificity of cyclic AMP receptor protein to recognition sites. J Mol Biol 200: 709–23.
- 29. Harley CB, Reynolds RP (1987) Analysis of E. coli promoter sequences. Nucl Acids Res 15: 2343–2361.
- 30. Wall M, Markowitz D, Rosner J, Martin R (2009) Model of transcriptional activation by MarA in Escherichia coli. PLoS Comput Biol 5: e1000614.
- 31. Garcia H, Sanchez A, Boedicker J, Osborne M, Gelles J, et al. (2012) Operator sequence alters gene expression independently of transcription factor occupancy in bacteria. Cell Rep 2: 150–161.
- 32. Kuhlman T, Zhang Z, Saier MH, Hwa T (2007) Combinatorial transcriptional control of the lactose operon of Escherichia coli. Proc Natl Acad Sci USA 104: 6043–8.
- 33. Berg O, von Hippel P (1987) Selection of DNA binding sites by regulatory proteins. Statistical-mechanical theory and application to operators and promoters. J Mol Biol 193: 723–750.
- 34. Djordjevic M, Sengupta AM, Shraiman BI (2003) A biophysical approach to transcription factor binding site discovery. Genome Res 13: 2381–90.
- 35. Bauer AL, Hlavacek WS, Unkefer PJ, Mu F (2010) Using Sequence-Specific Chemical and Structural Properties of DNA to Predict Transcription Factor Binding Sites. PLoS Comp Biol 6: 13.
- 36. Poelwijk FJ, Tanase-Nicola S, Kiviet DJ, Tans SJ (2011) Reciprocal sign epistasis is a necessary condition for multi-peaked fitness landscapes. J Theor Biol 272: 141–4.
- 37. Dekel E, Alon U (2005) Optimality and evolutionary tuning of the expression level of a protein. Nature 436: 588–592.
- 38. Perfeito L, Ghozzi S, Berg J, Schnetz K, Lässig M (2011) Nonlinear fitness landscape of a molecular pathway. PLoS Genetics 7: 1–10.
- 39. Gerland U, Hwa T (2002) On the selection and evolution of regulatory DNA motifs. J Mol Evol 55: 386–400.
- 40. Berg J, Willmann S, Lässig M (2004) Adaptive evolution of transcription factor binding sites. BMC Evol Biol 4: 42.
- 41. Mustonen V, Lässig M (2005) Evolutionary population genetics of promoters: predicting binding sites and functional phylogenies. Proc Natl Acad Sci USA 102: 15936–41.
- 42. Mustonen V, Kinney J, Callan CG, Lässig M (2008) Energy-dependent fitness: a quantitative model for the evolution of yeast transcription factor binding sites. Proc Natl Acad Sci USA 105: 12376–81.
- 43. Goldenfeld N, Kadanoff L (1999) Simple lessons from complexity. Science 284: 87–89.
- 44. Schneidman E, Berry M, Segev R, Bialek W (2006) Weak pairwise correlations imply strongly correlated network states in a neural population. Nature 440: 1007–1012.
- 45. Tkacik G, Schneidman E, Berry M, Bialek W (2006) Ising models for networks of real neurons. arXiv:q-bio/0611072.
- 46. Tang A, Jackson J, Hobbs J, Chen W, Smith JL, et al. (2008) A maximum entropy model applied to spatial and temporal correlations from cortical networks in vitro. J Neurosci 28: 505–518.
- 47. Cocco S, Leibler S, Monasson R (2009) Neuronal couplings between retinal ganglion cells inferred by efficient inverse statistical physics methods. Proc Natl Acad Sci USA 106: 14058–14062.
- 48. Ohiorhenuan I, Mechler F, Purpura K, Schmid A, Hu Q, et al. (2010) Sparse coding and high-order correlations in fine-scale cortical networks. Nature 466: 617–621.
- 49. Margolin A, Nemenman I, Basso K, Wiggins C, Stolovitzky G, et al. (2006) ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinf 7: S7.
- 50. Wang K, Saito M, Bisikirska B, Alvarez WM, and Lim, et al. (2009) Genome-wide identification of post-translational modulators of transcription factor activity in human B cells. Nat Biotechnol 27: 829–839.
- 51. Margolin A, Wang K, Califano A, Nemenman I (2010) Multivariate dependence and genetic networks inference. IET Syst Biol 4: 428–440.
- 52. Green WH (2003) Econometric Analysis. Prentice Hall.
- 53. Sharpee TO, Sugihara H, Kurgansky AV, Rebrik SP, Stryker MP, et al. (2006) Adaptive filtering enhances information transmission in visual cortex. Nature 439: 936–42.
- 54. MacKay DJC (2003) Information Theory, Inference, and Learning Algorithms. Cambridge University Press.
- 55. Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B 58: 267–288.
- 56. Friedman J, Hastie T, Tibshirani R (2010) Regularization Paths for Generalized Linear Models via Coordinate Descent. J Stat Software 33.
Source: https://journals.plos.org/plosone/article?id=10.1371%2Fjournal.pone.0061570