Neutral population genetic structure and variance partitioning

Scots pine genomic data

Author

Juliette Archambeau

Published

March 10, 2025

1 Neutral population genetic structure

We estimate the neutral population genetic structure with:

a PCA estimated with the vegan R package; Oksanen et al. (2022).
Ancestry coefficients estimated with the LEA R package, function snmf; Frichot and Francois (2015).

1.1 PCA

We run a Principal Component Analysis (PCA).

1.1.1 Individuals allele counts

First, using individual allele counts (not filtered for MAF, minor allele frequencies).

Code

# We load the imputed allele counts not filtered for MAF
geno <- readRDS(here("data/ScotsPine/GenomicData/data_update_Fev2025/formatted_data/imputed_allele_counts_withMAF.rds")) %>% 
  arrange(PopulationCode) # to be sure that colors in the PCA graph are attributed to the right individuals

geno[1:10,1:10] %>% kable_mydf()

PopulationCode	FieldCode	Family	AX-117395395	AX-117395404	AX-117395409	AX-117395425	AX-117395439
AB	5023	AB6	1	1	0	1	2
AB	5046	AB2	0	1	0	1	1
AB	5109	AB7	1	1	0	0	2
AB	5119	AB5	0	1	0	1	1
AB	5159	AB6	0	1	0	1	2
AB	5161	AB10	0	0	0	2	1
AB	5185	AB5	1	0	0	1	0
AB	5240	AB2	0	1	1	2	1
AB	5256	AB2	1	1	0	1	1
AB	5293	AB7	0	2	0	0	0

Code

# Run the PCA (quite long)
pca <- rda(geno[,-c(1:3)], scale=T)

saveRDS(pca,here("outputs/ScotsPine/PCA/PCA_IndividualAlleleCounts.rds"))

Code

pca <- readRDS(here("outputs/ScotsPine/PCA/PCA_IndividualAlleleCounts.rds"))

Screeplot of the PCA eigenvalues:

Code

# Screeplot of the PCA eigenvalues
screeplot(pca, type = "barplot", npcs=20, main="PCA Eigenvalues")

Code

screeplot(pca, type = "lines", npcs=20, main="PCA Eigenvalues")

There is an elbow point at K = 4.

Code

# Add population colors to the genomic dataset
geno_plot <- tibble(PopulationCode = unique(geno$PopulationCode),
                    PopulationColor = cols25(length(unique(geno$PopulationCode)))) %>% 
  right_join(geno, by="PopulationCode") %>% 
  arrange(PopulationCode)

# We plot only the first and second axis
second_axis <- 2

plot(pca, type="n", scaling=3, choices=c(1,second_axis),
     xlab=paste0("PC1 (",round(summary(eigenvals(pca))[2,1]*100,2),"%)"),
     ylab=paste0("PC",second_axis," (",round(summary(eigenvals(pca))[2,second_axis]*100,2),"%)"))
points(pca, display="sites", pch=21, cex=0.9, 
         col=geno_plot$PopulationColor, 
         scaling=3, 
         bg=geno_plot$PopulationColor, choices=c(1,second_axis))
legend("topleft", 
       legend=unique(geno_plot$PopulationCode), 
       bty="n", 
       col=unique(geno_plot$PopulationColor), 
       pch=21, 
       cex=0.9,
       ncol = 3,
       pt.bg=unique(geno_plot$PopulationColor), 
       title="Populations")

1.1.2 Population allele frequencies

We then look at the population genetic structure with the population allele frequencies.

Code

# We load the population allele frequencies not filtered for MAF
geno <- readRDS(here("data/ScotsPine/GenomicData/data_update_Fev2025/formatted_data/filtered_allele_freq_withMAF.rds")) %>% 
  arrange(PopulationCode)  # to be sure that colors in the PCA graph are attributed to the right individuals
  
geno[1:10,1:10] %>% kable_mydf()

PopulationCode	AX-117395395	AX-117395404	AX-117395409	AX-117395416	AX-117395425	AX-117395432	AX-117395439	AX-117395441	AX-117395444
AB	0.24	0.51	0.10	0.01	0.39	0.01	0.49	0.46	0.09
AC	0.39	0.45	0.05	0.01	0.30	0.00	0.33	0.36	0.27
AM	0.30	0.45	0.08	0.01	0.52	0.01	0.37	0.38	0.15
BB	0.48	0.34	0.19	0.01	0.34	0.00	0.53	0.45	0.14
BE	0.40	0.60	0.08	0.01	0.47	0.00	0.42	0.42	0.28
BW	0.41	0.32	0.08	0.01	0.41	0.00	0.45	0.33	0.26
CC	0.39	0.41	0.25	0.00	0.32	0.03	0.46	0.52	0.30
CG	0.39	0.58	0.08	0.00	0.51	0.02	0.53	0.42	0.10
CR	0.32	0.56	0.03	0.01	0.26	0.03	0.65	0.49	0.30
GA	0.40	0.39	0.05	0.01	0.38	0.00	0.32	0.39	0.19

Code

# we run the PCA 
pca <- rda(geno %>% column_to_rownames("PopulationCode"), scale=T)

Screeplot of the PCA eigenvalues:

Code

# Screeplot of the PCA eigenvalues
screeplot(pca, type = "barplot", npcs=20, main="PCA Eigenvalues")

Code

screeplot(pca, type = "lines", npcs=20, main="PCA Eigenvalues")

The elbow at K = 4 indicates that there are 5 major genetic clusters in the data.

Code

# Add population colors to the genomic dataset
geno_plot <- tibble(PopulationCode = unique(geno$PopulationCode),
                    PopulationColor = cols25(length(unique(geno$PopulationCode)))) %>% 
  right_join(geno, by="PopulationCode") %>% 
  arrange(PopulationCode)

second_axis <- 2

plot_pca_allele_freq <- function(second_axis){
  
  ordipointlabel(pca, 
                 display="sites", # show population names
                 scaling=3, 
                 pch=21, 
                 cex=1, 
                 xlab=paste0("PC1 (",round(summary(eigenvals(pca))[2,1]*100,2),"%)"),
                 ylab=paste0("PC",second_axis," (",round(summary(eigenvals(pca))[2,second_axis]*100,2),"%)"),
                 col=geno_plot$PopulationColor, 
                 bg=geno_plot$PopulationColor, 
                 choices=c(1,second_axis))
  }

plot_pca_allele_freq(second_axis = 2)

Code

plot_pca_allele_freq(second_axis = 3)

We extract the first three PCs of the PCA for the following steps.

Code

# We extract the first three PCs of the PCA
pgsdf <- scores(pca, choices=c(1:3), display="sites", scaling="none") %>% 
  as.data.frame() %>% 
  rownames_to_column("PopulationCode") %>% 
  dplyr::mutate(across(where(is.numeric), ~ (. - mean(.)) / sd(.))) %>% 
  as_tibble()

# save the PCs
saveRDS(pgsdf, here("outputs/ScotsPine/PCA/PCs.rds"))

1.2 LFMM

We look at the population genetic structure with the R function snmf.

Code

# We load the imputed allele counts not filtered for MAF
geno <- readRDS(here("data/ScotsPine/GenomicData/data_update_Fev2025/formatted_data/imputed_allele_counts_withMAF.rds")) %>% 
  mutate(Family_FieldCode = paste0(Family, "_", FieldCode)) %>% 
  dplyr::select(-FieldCode, -PopulationCode, -Family) %>% 
  column_to_rownames("Family_FieldCode") %>% 
  dplyr::mutate(across(everything(), ~replace_na(.x, 9))) %>% 
  as.data.frame()

geno %>% write.geno(here("outputs/ScotsPine/LFMM/snmf/allele_counts_geno_format.geno"))
# kable_mydf(geno[1:10,1:8], boldfirstcolumn = F)

Code

proj_snmf <- snmf(here("outputs/ScotsPine/LFMM/snmf/allele_counts_geno_format.geno"), 
                  K=1:10, 
                  repetitions = 10, # nb of repetitions for each K
                  entropy = TRUE,
                  project="new")

pc <- pca(here("outputs/ScotsPine/LFMM/snmf/allele_counts_geno_format.geno"), scale = TRUE)

Code

proj_snmf <- load.snmfProject(here("outputs/ScotsPine/LFMM/snmf/allele_counts_geno_format.snmfProject"))

Value of the cross-entropy criterion as a function of the number of populations in snmf.

Code

plot(proj_snmf, cex = 1.2, col = "lightblue", pch = 19)

The K value is usually chosen at the minimum cross-entropy. Here, cross-entropy decreases continuously, which is likely be due to very weak population genetic structure in Scottish Scots pine populations.

Here are some tutorials to identify the elbow point:

Code

pc <- load.pcaProject(here("outputs/ScotsPine/LFMM/snmf/allele_counts_geno_format.pcaProject"))
tw = tracy.widom(pc)

[1] "*******************"
[1] " Tracy-Widom tests "
[1] "*******************"
summary of the options:

        -n (number of eigenvalues)          1761
        -i (input file)                     /home/juliette/Documents/Projects/2PinesClimateChangeVulnerability/outputs/ScotsPine/LFMM/snmf/allele_counts_geno_format.pca/allele_counts_geno_format.eigenvalues
        -o (output file)                    /home/juliette/Documents/Projects/2PinesClimateChangeVulnerability/outputs/ScotsPine/LFMM/snmf/allele_counts_geno_format.pca/allele_counts_geno_format.tracywidom

Code

plot(tw$percentage, pch = 19, col = "darkblue", cex = .8, main= "Screeplot for the percentage of variance explained by each component",
     ylab="Percentage of variance explained")

Code

#plot(pc, main = "PCA of Genetic Data")

The percentage of variance explained by each component is very low.

We can look at the Q matrices for K = 2, 3, 4 and 5.

Code

gp_colors <- c("#A1F0A1","#FFF9A1", "#F0A1A1", "#A1C9F0", "#F1A1F0")

# K=2
bp <- barchart(proj_snmf, K = 2,  run= which.min(cross.entropy(proj_snmf, K = 2)),
               border = NA, space = 0, col = gp_colors, 
               xlab = "Individuals", ylab = "Ancestry proportions", 
               main = "Ancestry matrix for K = 2")
axis(1, at = 1:length(bp$order), labels = bp$order, las = 3, cex.axis = .4)

Code

# K=3
bp <- barchart(proj_snmf, K = 3,  run= which.min(cross.entropy(proj_snmf, K = 3)),
               border = NA, space = 0, col = gp_colors, 
               xlab = "Individuals", ylab = "Ancestry proportions", 
               main = "Ancestry matrix for K = 3")
axis(1, at = 1:length(bp$order), labels = bp$order, las = 3, cex.axis = .4)

Code

# K=4
bp <- barchart(proj_snmf, K = 4,  run= which.min(cross.entropy(proj_snmf, K = 4)),
               border = NA, space = 0, col = gp_colors, 
               xlab = "Individuals", ylab = "Ancestry proportions", 
               main = "Ancestry matrix for K = 4")
axis(1, at = 1:length(bp$order), labels = bp$order, las = 3, cex.axis = .4)

Code

# K=5
bp <- barchart(proj_snmf, K = 5,  run= which.min(cross.entropy(proj_snmf, K = 5)),
               border = NA, space = 0, col = gp_colors, 
               xlab = "Individuals", ylab = "Ancestry proportions", 
               main = "Ancestry matrix for K = 5")
axis(1, at = 1:length(bp$order), labels = bp$order, las = 3, cex.axis = .4)

So we may want to run LFMM with K=1 (i.e. no population structure) or K=4 (i.e., strong correction for population genetic structure, based on the elbow point of the PCA with the vegan R package)

2 Variance partitioning

2.1 The data

2.1.1 Climatic data

The climatic variables come from the CHESS-SCAPE climatic database, see here for details on this data and for visualizing climatic variation: https://juliettearchambeau.shinyapps.io/VizClimateDifferencesScotsPine/.

Some details on the choice of the climatic variables are given in the report ExploreChessCapeClimaticData.html.

Correlations among the selected climatic variables can be visualized with the following shinny app : https://juliettearchambeau.shinyapps.io/VizClimateDifferencesScotsPine_SelectedVariables/.

Code

# climatic dataset
climdf <- readRDS(here("data/ScotsPine/subdf.rds"))

# selected climatic variables
clim_var_avg_selected <- unique(climdf$clim_var_avg)

2.1.2 Distance-based Moran’s Eigenvector Maps (dbMEMs)

Some papers/resources :

The vignette Moran’s Eigenvector Maps and related methods for the spatial multiscale analysis of ecological data by Stéphane Dray.
Dray, Legendre, and Peres-Neto (2006)
Forester et al. (2018)
Gutaker et al. (2020)
Lasky et al. (2015)

Moran’s Eigenvector Maps (MEMs) are calculated based on a spatial weighting matrix (SWM) and are orthogonal vectors maximizing the spatial autocorrelation (measured by Moran’s coefficient) of the sampled locations. Distance-based MEMs are calculated based on a particular SWM which is a geographic distance matrix.

Using the function distm from the R package geosphere, we first calculate a matrix of geodesic distances (in meters) among population locations using great-circle distances calculated for the WGS84 ellipsoid. The shortest path between two points on an ellipsoid is called the geodesic. The WGS84 ellipsoid is the best available global ellipsoid.

Then we calculate the dbMEMs with the function dbmem of the adespatial R package. Similarly to Lasky et al. (2015), we only keep MEMs with positive eigenvalues.

We get a message informing the we do not use Euclidean distances, but that’s ok, we do not have to convert geodesic distances to Euclidean distances.

Code

popdf <- climdf %>% 
  dplyr::select(PopulationCode,Longitude, Latitude) %>% 
  distinct()

dbmems <- popdf %>% 
  dplyr::select(Longitude, Latitude) %>% 
  distm(fun = distGeo) %>% # calculate the geodesic distances in meters
  as.dist() %>%  
  dbmem(MEM.autocor = "positive") %>% 
  bind_cols(popdf,.)

Warning in dudi.pco(matdist, scannf = FALSE, nf = 2): Non euclidean distance

2.1.3 Merging climate, population structure and dbMEMs

We merge the climatic data with the dbMEMs and the population genetic structure data.

Code

df <- climdf %>%
  left_join(dbmems, by=c("PopulationCode","Latitude","Longitude")) %>% 
  left_join(pgsdf, by="PopulationCode")

2.2 RDA models

We aim to disentangle the relative contribution of climate, neutral population genetic structure (accounted for with the first three PC scores) and geography (accounted for by the distance-based Moran’s eigenvector maps with positive eigenvalues) in explaining the genomic variation.

For that, we run:

one full RDA with all factors (population structure, climate and geographical distance), i.e. no variable conditioning.
three partial RDA in which the factor of interest is conditioned by the other two factors.

Code

# Function to calculate the 
var_part <- function(df_geno, 
                     clim_var_avg_selected, 
                     df_var, 
                     scale_rda, 
                     center_rda){

# Build formulas of the RDA models
# ================================
form_clim_var <- paste(clim_var_avg_selected,collapse= " + ") # climatic variables
form_pgs_var <- colnames(df_var) %>% stringr::str_subset("^PC") %>%  paste(collapse= " + ") # pop structure

form_geo_var <- colnames(df_var) %>%  
    stringr::str_subset("MEM") %>%  
   `[`(1:length(.)) %>% # keep only the n first MEMs
    paste(collapse= " + ") # coordinates

# Model formulas
form_full_rda <- paste("df_geno ~ ",
                       paste(c(form_clim_var,form_pgs_var,form_geo_var),collapse=" + " )) %>% 
  as.formula()

form_clim_rda <- paste("df_geno ~ ",
                       form_clim_var,"+ Condition(",paste(c(form_pgs_var,form_geo_var),collapse=" + " ), ")") %>% 
  as.formula()

form_pgs_rda <- paste("df_geno ~ ",
                       form_pgs_var,"+ Condition(",paste(c(form_clim_var,form_geo_var),collapse=" + " ), ")") %>% 
  as.formula()

form.geo_rda <- paste("df_geno ~ ",
                       form_geo_var,"+ Condition(",paste(c(form_clim_var,form_pgs_var),collapse=" + " ), ")") %>% 
  as.formula()


# Run the RDA models
# ==================

# Full RDA
full_rda <- rda(form_full_rda, 
                data=df_var, 
                scale=scale_rda,
                center=center_rda) 
anova_full_rda <- anova(full_rda)

# Partial RDA: pure climatic model
clim_rda <- rda(form_clim_rda, 
                data=df_var, 
                scale=scale_rda,
                center=center_rda)
anova_clim_rda <- anova(clim_rda)

# Partial RDA: pure neutral population structure model
pgs_rda <- rda(form_pgs_rda, 
                data=df_var, 
                scale=scale_rda,
                center=center_rda)
anova_pgs_rda <- anova(pgs_rda)

# Partial RDA: pure geography model
geo_rda <- rda(form.geo_rda, 
                data=df_var, 
                scale=scale_rda,
                center=center_rda)
anova_geo_rda <- anova(geo_rda)


# Summary table
# =============

sum_tab_RDA <- tibble("RDA models"=c("Full model: Y ~ clim + geo + pgs.",
                      "Pure climate model: Y ~ clim | (geo + pgs)",
                      "Pure pop. gen. structure model: Y ~ pgs | (geo + clim)",
                      "Pure geography model: Y ~ geo | (pgs + clim)"),
       "Total exp. variance"=c(RsquareAdj(full_rda)[[1]],
              RsquareAdj(clim_rda)[[1]],
              RsquareAdj(pgs_rda)[[1]],
              RsquareAdj(geo_rda)[[1]]),
       "Relative exp. variance"=c(1,
                                  RsquareAdj(clim_rda)[[1]]/RsquareAdj(full_rda)[[1]],
                                  RsquareAdj(pgs_rda)[[1]]/RsquareAdj(full_rda)[[1]],
                                  RsquareAdj(geo_rda)[[1]]/RsquareAdj(full_rda)[[1]]),
       "P-value"=c(anova_full_rda[["Pr(>F)"]][[1]],
                   anova_clim_rda[["Pr(>F)"]][[1]],
                   anova_pgs_rda[["Pr(>F)"]][[1]],
                   anova_geo_rda[["Pr(>F)"]][[1]]))

# # Export the table in latex
# xtable(sum_tab_RDA, type = "latex",digits=2) %>% 
#   print(file = here(paste0("tables/VariancePartitioningRDA_EM",EM_i,".tex")), include.rownames=FALSE)

return(sum_tab_RDA)
}

Code

# Variance partionning estimated with the different EM
list_varpart <- lapply(unique(climdf$ensemble_member), function(EM_i){
  
  format_clim(df, 
              EM = EM_i, 
              RCP = "RCP2.6", 
              fut_timeslice = "2030-2050", 
              clim_var_avg_selected)$climdf_ref_scaled %>% 
    dplyr::select(starts_with("MEM"),
                  contains("PC"),
                  any_of(clim_var_avg_selected)) %>% 
    var_part(df_geno=geno[,-1],
             clim_var_avg_selected=clim_var_avg_selected,
             df_var=.,
             scale_rda=F,
             center_rda=F)
  
}) %>% setNames(unique(climdf$ensemble_member))

saveRDS(list_varpart, here("outputs/ScotsPine/VarPart/VarPart.rds"))

Ensemble member 01:

Code

list_varpart <- readRDS(file=here("outputs/ScotsPine/VarPart/VarPart.rds"))

# Ensemble member 01
kable_mydf(list_varpart$`01`, round_number = 4, boldfirstcolumn = T)

RDA models	Total exp. variance	Relative exp. variance	P-value
Full model: Y ~ clim + geo + pgs.	0.6520	1.0000	0.001
Pure climate model: Y ~ clim \| (geo + pgs)	0.2168	0.3326	0.534
Pure pop. gen. structure model: Y ~ pgs \| (geo + clim)	0.1363	0.2091	0.420
Pure geography model: Y ~ geo \| (pgs + clim)	0.1782	0.2733	0.429

Ensemble member 04:

Code

# Ensemble member 04
kable_mydf(list_varpart$`04`, round_number = 4, boldfirstcolumn = T)

RDA models	Total exp. variance	Relative exp. variance	P-value
Full model: Y ~ clim + geo + pgs.	0.6524	1.0000	0.001
Pure climate model: Y ~ clim \| (geo + pgs)	0.2172	0.3330	0.538
Pure pop. gen. structure model: Y ~ pgs \| (geo + clim)	0.1380	0.2115	0.385
Pure geography model: Y ~ geo \| (pgs + clim)	0.1767	0.2708	0.424

Ensemble member 06:

Code

# Ensemble member 06
kable_mydf(list_varpart$`06`, round_number = 4, boldfirstcolumn = T)

RDA models	Total exp. variance	Relative exp. variance	P-value
Full model: Y ~ clim + geo + pgs.	0.6530	1.0000	0.001
Pure climate model: Y ~ clim \| (geo + pgs)	0.2178	0.3335	0.479
Pure pop. gen. structure model: Y ~ pgs \| (geo + clim)	0.1379	0.2112	0.390
Pure geography model: Y ~ geo \| (pgs + clim)	0.1765	0.2703	0.473

Ensemble member 15:

Code

# Ensemble member 15
kable_mydf(list_varpart$`15`, round_number = 4, boldfirstcolumn = T)

RDA models	Total exp. variance	Relative exp. variance	P-value
Full model: Y ~ clim + geo + pgs.	0.6521	1.0000	0.001
Pure climate model: Y ~ clim \| (geo + pgs)	0.2169	0.3327	0.513
Pure pop. gen. structure model: Y ~ pgs \| (geo + clim)	0.1369	0.2100	0.393
Pure geography model: Y ~ geo \| (pgs + clim)	0.1768	0.2711	0.445

Interpretation

Very little differences among ensemble members, so we can only show the results from one model in the paper (SI).

About 60% of genomic variation can be explained by the three factors: climate, geography and population genetic structure. 22% of genomic variation (more than 1/3 of the variance explained by the three factors combined) can be explained by climate alone. In contrast, population genetic structure and geography alone explain less genomic variation, about 14% and 18% respectively.

As a comparison, in maritime pine, about 80% of genomic variation can be explained by the three factors, but climate can only explain 8% alone.

References

Dray, Stéphane, Pierre Legendre, and Pedro R. Peres-Neto. 2006. “Spatial Modelling: A Comprehensive Framework for Principal Coordinate Analysis of Neighbour Matrices (PCNM).” Ecological Modelling 196 (3): 483–93. https://doi.org/10.1016/j.ecolmodel.2006.02.015.

Forester, Brenna R., Jesse R. Lasky, Helene H. Wagner, and Dean L. Urban. 2018. “Comparing Methods for Detecting Multilocus Adaptation with Multivariate Genotype–Environment Associations.” Molecular Ecology 27 (9): 2215–33. https://doi.org/10.1111/mec.14584.

Frichot, Eric, and Olivier Francois. 2015. “LEA: An R Package for Landscape and Ecological Association Studies.” Methods in Ecology and Evolution. http://membres-timc.imag.fr/Olivier.Francois/lea.html.

Gutaker, Rafal M., Simon C. Groen, Emily S. Bellis, Jae Y. Choi, Inês S. Pires, R. Kyle Bocinsky, Emma R. Slayton, et al. 2020. “Genomic History and Ecology of the Geographic Spread of Rice.” Nature Plants 6 (5): 492–502. https://doi.org/10.1038/s41477-020-0659-6.

Lasky, Jesse R., Hari D. Upadhyaya, Punna Ramu, Santosh Deshpande, C. Tom Hash, Jason Bonnette, Thomas E. Juenger, et al. 2015. “Genome-Environment Associations in Sorghum Landraces Predict Adaptive Traits.” Science Advances 1 (6): e1400218. https://doi.org/10.1126/sciadv.1400218.

Oksanen, Jari, Gavin L. Simpson, F. Guillaume Blanchet, Roeland Kindt, Pierre Legendre, Peter R. Minchin, R. B. O’Hara, et al. 2022. vegan: Community Ecology Package. https://CRAN.R-project.org/package=vegan.