Finding the ‘missing environmentality’ (FINDME)
Explaining and predicting individual outcomes in a social population
Understanding, explaining and predicting human behaviour and outcomes is a central goal of the (social) sciences. Since the 1980s, empirical social research has sought to increase the credibility of its explanatory statements by advancing methods for statistical causal analysis. These methods take an effects-of-causes perspective, establishing interpretable causal consequences of specific social variables. More recently, attention has turned from the causal effect of one variable on another to the general social predictability of an individual’s outcome, given that individual’s position in society. A preceding, related but distinct question is: given a population, how much of the individual differences can we explain? One central contrast to current mainstream analyses is that these questions assume a causes-of-effects rather than an effects-of-causes perspective, focussing on the outcome variable and including many, if not all possible, predictors.
The interdisciplinary exploration of the underlying causes of (social) outcomes has a tradition in Behavior Genetics. According to a meta-analysis of all twin studies ever conducted, on average across traits 50% of individual differences are associated with genetic factors (heritability) and 50% with environmental ones (environmentality; including measurement error), with an imbalance towards non-genetic influences for behavioural outcomes. Heritability is defined as the genetic variance in an outcome divided by the total variance in this outcome in the population. More intuitively for social scientists, it is essentially the coefficient of determination (R²) in a regression model of a trait on the genome. Even more intuitively, it tells us how much of the differences between individuals is due to differences in their genomes.
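In symbols, the definitions above can be sketched as follows. This assumes the simple additive decomposition used in classic twin models, without gene-environment correlation or interaction. The symbols (total variance σ²_P, genetic variance σ²_G, environmental variance σ²_E including measurement error, heritability h² and environmentality e²) are notation introduced here for illustration.

```latex
\sigma_P^2 = \sigma_G^2 + \sigma_E^2, \qquad
h^2 = \frac{\sigma_G^2}{\sigma_P^2}, \qquad
e^2 = \frac{\sigma_E^2}{\sigma_P^2} = 1 - h^2
```

Under this decomposition, the average finding of 50% heritability corresponds to h² = e² = 0.5.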
Findings of substantial heritability have motivated researchers and funders to invest in the discovery of genes that explain the outcomes (genome-wide association studies, GWAS), with various promises of applications and understanding of pathways, also for the social sciences. However, it has been a challenge to find the genetic variants responsible for heritability, even when analysing millions of genetic markers and millions of individuals. The initial ‘disappointment’ that no or only a few genetic variants could be discovered despite massive sample sizes (n > 100,000) resulted in over a decade of inspired methodological advances. The central question was: How can we explain ‘missing heritability’? This literature provides a detailed roadmap for game-changing innovations in the social sciences regarding the explanatory power of non-genetic factors.
Figure 1 depicts the parallel we aim for between genetic (Figure 1 left) and non-genetic (Figure 1 right) influences, including social ones. A first finding in the ‘missing heritability’ puzzle was that small effect sizes ‘hide heritability’ (see Figure 1 left), so methods have been developed to model small effects of many variables. Those tools have been extended to investigate the so-called still-missing heritability (see Figure 1 left), for example non-linear effects of single-nucleotide polymorphisms (SNPs), both within SNPs across alleles (dominance effects) and between SNPs (epistasis), as well as questions of whether the right variables are measured (for example, whether rare variants are important). We showed in a study that a substantial part of the missing heritability is due to gene-environment interaction across populations and birth cohorts.
FINDME infuses the social sciences with this theoretical and methodological knowledge and confronts sociology with the ‘missing environmentality’ puzzle, which it sets out to unravel. As a first guideline, we will quantify ‘missing environmentality’ as the difference between the environmental share of variance expected from twin studies and the explanatory power of measured non-genetic variables in various data sources. Importantly, and in ways that are novel for social science researchers, FINDME will be able to disentangle genetic from environmental factors, which are often correlated (gene-environment correlation), as, for example, in the transmission of education from parents to children.
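The quantification described above can be sketched in a few lines. This is a minimal illustration, not the project's method: the function name, its arguments and the example numbers are hypothetical, and it assumes that the expected environmentality is simply one minus the twin-based heritability and that the measured R² is already net of genetic effects.

```python
def missing_environmentality(twin_heritability: float, measured_env_r2: float) -> float:
    """Return the gap between expected and explained environmental variance.

    twin_heritability: share of variance attributed to genes in twin studies.
    measured_env_r2:   share of variance explained by measured non-genetic
                       variables (assumed here to be net of genetic effects).
    """
    for name, value in (("twin_heritability", twin_heritability),
                        ("measured_env_r2", measured_env_r2)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie between 0 and 1, got {value}")
    # Expected environmentality under the simple additive decomposition
    # (includes measurement error, as in twin models).
    expected_environmentality = 1.0 - twin_heritability
    return expected_environmentality - measured_env_r2


# Hypothetical numbers for illustration only: heritability of 0.5 and
# measured environmental variables explaining 20% of the variance.
gap = missing_environmentality(twin_heritability=0.5, measured_env_r2=0.2)
print(f"missing environmentality: {gap:.2f}")
```

Because twin-based environmentality includes measurement error, part of any such gap would reflect unreliability of the outcome measure rather than unmeasured environmental factors.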
We will focus on educational attainment, fertility and well-being and provide quantitative answers to questions such as: Are our theories too simple, and do we need to think about the social world in more complex terms? Does the secret lie in the interplay between genes and the environment, and if so, to what extent? Do we find support for a deterministic world view after all? To what degree are explanations universal across contexts?
The project is of both scientific importance and practical urgency: massive and costly data collections are currently being conducted with proper evaluation of the explanatory and predictive power of their genetic determinants, but not yet of their measured non-genetic/environmental determinants. We will use population- and register-based datasets from Europe and the US as well as the huge UK Biobank, with more than 500,000 genotyped individuals and thousands of environmental measures.
We will systematically adapt statistical methodology to jointly model genes and the environment with extremely high-dimensional data. While classic twin models mostly focus on a nature-and-nurture perspective, we will investigate millions of genetic and millions of nurturing dimensions as well as their interactions. Note that for educational attainment, for example, we might expect only a limited number of variables to contribute to our theoretical explanation, such as parental education, neighbourhoods and so forth. Yet the UK Biobank, for example, already features 10 indices of neighbourhood deprivation, measuring economic deprivation, criminality and other dimensions. Moreover, when considering 20 explanatory variables in a model, the number of possible interaction terms, counting interactions of all orders, exceeds 1 million. Fully taking (potential) social complexity into account therefore requires big data techniques. The proposed methods will also be able to take the interplay between social and genetic factors into account in order to define the explanatory power of measured environmental factors while controlling for genes. We will use higher-order matrix models and recently developed genetic instrumental variable approaches, and we aim to develop new techniques for social science research that control for all genetic effects and correct for measurement error in environmentality research.
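The interaction count for 20 explanatory variables can be checked with a short calculation. The sketch below assumes that an interaction term is any product of two or more distinct variables; the function name and its parameters are hypothetical and serve only this illustration.

```python
from math import comb
from typing import Optional


def interaction_term_count(n_vars: int, min_order: int = 2,
                           max_order: Optional[int] = None) -> int:
    """Count the interaction terms among n_vars distinct variables.

    An interaction of order k is a product of k distinct variables, so there
    are C(n_vars, k) of them. By default all orders from 2 to n_vars count.
    """
    if n_vars < 0:
        raise ValueError("n_vars must be non-negative")
    if max_order is None:
        max_order = n_vars
    return sum(comb(n_vars, k) for k in range(min_order, max_order + 1))


# All orders: 2**20 minus the 20 main effects and the empty set.
all_orders = interaction_term_count(20)
# Pairwise interactions only, for comparison.
pairwise_only = interaction_term_count(20, max_order=2)
print(all_orders, pairwise_only)
```

With pairwise interactions alone the count stays small (190 for 20 variables); the figure above one million arises only once higher-order interactions are included.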
Fig. 1. Geneticists have evaluated to what extent they can explain heritability based on all measured genes (SNP-heritability) or known genetic variants for an outcome. So far, we do not know, however, how much of the environmental component we can explain based on measured variables and theoretical models if we comprehensively control for genes.