Neutral population genetic structure and variance partitioning
Scots pine genomic data
Author
Juliette Archambeau
Published
March 10, 2025
1 Neutral population genetic structure
We estimate the neutral population genetic structure with:
a PCA estimated with the vegan R package; Oksanen et al. (2022).
Ancestry coefficients estimated with the LEA R package, function snmf; Frichot and Francois (2015).
1.1 PCA
We run a Principal Component Analysis (PCA).
1.1.1 Individuals allele counts
First, using individual allele counts (not filtered for MAF, minor allele frequencies).
Code
# We load the imputed allele counts not filtered for MAFgeno <-readRDS(here("data/ScotsPine/GenomicData/data_update_Fev2025/formatted_data/imputed_allele_counts_withMAF.rds")) %>%arrange(PopulationCode) # to be sure that colors in the PCA graph are attributed to the right individualsgeno[1:10,1:10] %>%kable_mydf()
PopulationCode
FieldCode
Family
AX-117395395
AX-117395404
AX-117395409
AX-117395416
AX-117395425
AX-117395432
AX-117395439
AB
5023
AB6
1
1
0
0
1
0
2
AB
5046
AB2
0
1
0
0
1
0
1
AB
5109
AB7
1
1
0
0
0
0
2
AB
5119
AB5
0
1
0
0
1
0
1
AB
5159
AB6
0
1
0
0
1
0
2
AB
5161
AB10
0
0
0
0
2
0
1
AB
5185
AB5
1
0
0
0
1
0
0
AB
5240
AB2
0
1
1
0
2
0
1
AB
5256
AB2
1
1
0
0
1
0
1
AB
5293
AB7
0
2
0
0
0
0
0
Code
# Run the PCA (quite long)pca <-rda(geno[,-c(1:3)], scale=T)saveRDS(pca,here("outputs/ScotsPine/PCA/PCA_IndividualAlleleCounts.rds"))
# Screeplot of the PCA eigenvaluesscreeplot(pca, type ="barplot", npcs=20, main="PCA Eigenvalues")
Code
screeplot(pca, type ="lines", npcs=20, main="PCA Eigenvalues")
There is an elbow point at K = 4.
Code
# Add population colors to the genomic datasetgeno_plot <-tibble(PopulationCode =unique(geno$PopulationCode),PopulationColor =cols25(length(unique(geno$PopulationCode)))) %>%right_join(geno, by="PopulationCode") %>%arrange(PopulationCode)# We plot only the first and second axissecond_axis <-2plot(pca, type="n", scaling=3, choices=c(1,second_axis),xlab=paste0("PC1 (",round(summary(eigenvals(pca))[2,1]*100,2),"%)"),ylab=paste0("PC",second_axis," (",round(summary(eigenvals(pca))[2,second_axis]*100,2),"%)"))points(pca, display="sites", pch=21, cex=0.9, col=geno_plot$PopulationColor, scaling=3, bg=geno_plot$PopulationColor, choices=c(1,second_axis))legend("topleft", legend=unique(geno_plot$PopulationCode), bty="n", col=unique(geno_plot$PopulationColor), pch=21, cex=0.9,ncol =3,pt.bg=unique(geno_plot$PopulationColor), title="Populations")
1.1.2 Population allele frequencies
We then look at the population genetic structure with the population allele frequencies.
Code
# We load the population allele frequencies not filtered for MAFgeno <-readRDS(here("data/ScotsPine/GenomicData/data_update_Fev2025/formatted_data/filtered_allele_freq_withMAF.rds")) %>%arrange(PopulationCode) # to be sure that colors in the PCA graph are attributed to the right individualsgeno[1:10,1:10] %>%kable_mydf()
PopulationCode
AX-117395395
AX-117395404
AX-117395409
AX-117395416
AX-117395425
AX-117395432
AX-117395439
AX-117395441
AX-117395444
AB
0.24
0.51
0.10
0.01
0.39
0.01
0.49
0.46
0.09
AC
0.39
0.45
0.05
0.01
0.30
0.00
0.33
0.36
0.27
AM
0.30
0.45
0.08
0.01
0.52
0.01
0.37
0.38
0.15
BB
0.48
0.34
0.19
0.01
0.34
0.00
0.53
0.45
0.14
BE
0.40
0.60
0.08
0.01
0.47
0.00
0.42
0.42
0.28
BW
0.41
0.32
0.08
0.01
0.41
0.00
0.45
0.33
0.26
CC
0.39
0.41
0.25
0.00
0.32
0.03
0.46
0.52
0.30
CG
0.39
0.58
0.08
0.00
0.51
0.02
0.53
0.42
0.10
CR
0.32
0.56
0.03
0.01
0.26
0.03
0.65
0.49
0.30
GA
0.40
0.39
0.05
0.01
0.38
0.00
0.32
0.39
0.19
Code
# we run the PCA pca <-rda(geno %>%column_to_rownames("PopulationCode"), scale=T)
Screeplot of the PCA eigenvalues:
Code
# Screeplot of the PCA eigenvaluesscreeplot(pca, type ="barplot", npcs=20, main="PCA Eigenvalues")
Code
screeplot(pca, type ="lines", npcs=20, main="PCA Eigenvalues")
The elbow at K = 4 indicates that there are 5 major genetic clusters in the data.
Code
# Add population colors to the genomic datasetgeno_plot <-tibble(PopulationCode =unique(geno$PopulationCode),PopulationColor =cols25(length(unique(geno$PopulationCode)))) %>%right_join(geno, by="PopulationCode") %>%arrange(PopulationCode)second_axis <-2plot_pca_allele_freq <-function(second_axis){ordipointlabel(pca, display="sites", # show population namesscaling=3, pch=21, cex=1, xlab=paste0("PC1 (",round(summary(eigenvals(pca))[2,1]*100,2),"%)"),ylab=paste0("PC",second_axis," (",round(summary(eigenvals(pca))[2,second_axis]*100,2),"%)"),col=geno_plot$PopulationColor, bg=geno_plot$PopulationColor, choices=c(1,second_axis)) }plot_pca_allele_freq(second_axis =2)
Code
plot_pca_allele_freq(second_axis =3)
We extract the first three PCs of the PCA for the following steps.
Code
# We extract the first three PCs of the PCApgsdf <-scores(pca, choices=c(1:3), display="sites", scaling="none") %>%as.data.frame() %>%rownames_to_column("PopulationCode") %>% dplyr::mutate(across(where(is.numeric), ~ (. -mean(.)) /sd(.))) %>%as_tibble()# save the PCssaveRDS(pgsdf, here("outputs/ScotsPine/PCA/PCs.rds"))
1.2 LFMM
We look at the population genetic structure with the R function snmf.
Code
# We load the imputed allele counts not filtered for MAFgeno <-readRDS(here("data/ScotsPine/GenomicData/data_update_Fev2025/formatted_data/imputed_allele_counts_withMAF.rds")) %>%mutate(Family_FieldCode =paste0(Family, "_", FieldCode)) %>% dplyr::select(-FieldCode, -PopulationCode, -Family) %>%column_to_rownames("Family_FieldCode") %>% dplyr::mutate(across(everything(), ~replace_na(.x, 9))) %>%as.data.frame()geno %>%write.geno(here("outputs/ScotsPine/LFMM/snmf/allele_counts_geno_format.geno"))# kable_mydf(geno[1:10,1:8], boldfirstcolumn = F)
Code
proj_snmf <-snmf(here("outputs/ScotsPine/LFMM/snmf/allele_counts_geno_format.geno"), K=1:10, repetitions =10, # nb of repetitions for each Kentropy =TRUE,project="new")pc <-pca(here("outputs/ScotsPine/LFMM/snmf/allele_counts_geno_format.geno"), scale =TRUE)
Value of the cross-entropy criterion as a function of the number of populations in snmf.
Code
plot(proj_snmf, cex =1.2, col ="lightblue", pch =19)
The K value is usually chosen at the minimum cross-entropy. Here, cross-entropy decreases continuously, which is likely be due to very weak population genetic structure in Scottish Scots pine populations.
Here are some tutorials to identify the elbow point:
pc <-load.pcaProject(here("outputs/ScotsPine/LFMM/snmf/allele_counts_geno_format.pcaProject"))tw =tracy.widom(pc)
[1] "*******************"
[1] " Tracy-Widom tests "
[1] "*******************"
summary of the options:
-n (number of eigenvalues) 1761
-i (input file) /home/juliette/Documents/Projects/2PinesClimateChangeVulnerability/outputs/ScotsPine/LFMM/snmf/allele_counts_geno_format.pca/allele_counts_geno_format.eigenvalues
-o (output file) /home/juliette/Documents/Projects/2PinesClimateChangeVulnerability/outputs/ScotsPine/LFMM/snmf/allele_counts_geno_format.pca/allele_counts_geno_format.tracywidom
Code
plot(tw$percentage, pch =19, col ="darkblue", cex = .8, main="Screeplot for the percentage of variance explained by each component",ylab="Percentage of variance explained")
Code
#plot(pc, main = "PCA of Genetic Data")
The percentage of variance explained by each component is very low.
We can look at the Q matrices for K = 2, 3, 4 and 5.
Code
gp_colors <-c("#A1F0A1","#FFF9A1", "#F0A1A1", "#A1C9F0", "#F1A1F0")# K=2bp <-barchart(proj_snmf, K =2, run=which.min(cross.entropy(proj_snmf, K =2)),border =NA, space =0, col = gp_colors, xlab ="Individuals", ylab ="Ancestry proportions", main ="Ancestry matrix for K = 2")axis(1, at =1:length(bp$order), labels = bp$order, las =3, cex.axis = .4)
Code
# K=3bp <-barchart(proj_snmf, K =3, run=which.min(cross.entropy(proj_snmf, K =3)),border =NA, space =0, col = gp_colors, xlab ="Individuals", ylab ="Ancestry proportions", main ="Ancestry matrix for K = 3")axis(1, at =1:length(bp$order), labels = bp$order, las =3, cex.axis = .4)
Code
# K=4bp <-barchart(proj_snmf, K =4, run=which.min(cross.entropy(proj_snmf, K =4)),border =NA, space =0, col = gp_colors, xlab ="Individuals", ylab ="Ancestry proportions", main ="Ancestry matrix for K = 4")axis(1, at =1:length(bp$order), labels = bp$order, las =3, cex.axis = .4)
Code
# K=5bp <-barchart(proj_snmf, K =5, run=which.min(cross.entropy(proj_snmf, K =5)),border =NA, space =0, col = gp_colors, xlab ="Individuals", ylab ="Ancestry proportions", main ="Ancestry matrix for K = 5")axis(1, at =1:length(bp$order), labels = bp$order, las =3, cex.axis = .4)
So we may want to run LFMM with K=1 (i.e. no population structure) or K=4 (i.e., strong correction for population genetic structure, based on the elbow point of the PCA with the vegan R package)
Moran’s Eigenvector Maps (MEMs) are calculated based on a spatial weighting matrix (SWM) and are orthogonal vectors maximizing the spatial autocorrelation (measured by Moran’s coefficient) of the sampled locations. Distance-based MEMs are calculated based on a particular SWM which is a geographic distance matrix.
Using the function distm from the R package geosphere, we first calculate a matrix of geodesic distances (in meters) among population locations using great-circle distances calculated for the WGS84 ellipsoid. The shortest path between two points on an ellipsoid is called the geodesic. The WGS84 ellipsoid is the best available global ellipsoid.
Then we calculate the dbMEMs with the function dbmem of the adespatial R package. Similarly to Lasky et al. (2015), we only keep MEMs with positive eigenvalues.
We get a message informing the we do not use Euclidean distances, but that’s ok, we do not have to convert geodesic distances to Euclidean distances.
We aim to disentangle the relative contribution of climate, neutral population genetic structure (accounted for with the first three PC scores) and geography (accounted for by the distance-based Moran’s eigenvector maps with positive eigenvalues) in explaining the genomic variation.
For that, we run:
one full RDA with all factors (population structure, climate and geographical distance), i.e. no variable conditioning.
three partial RDA in which the factor of interest is conditioned by the other two factors.
Code
# Function to calculate the var_part <-function(df_geno, clim_var_avg_selected, df_var, scale_rda, center_rda){# Build formulas of the RDA models# ================================form_clim_var <-paste(clim_var_avg_selected,collapse=" + ") # climatic variablesform_pgs_var <-colnames(df_var) %>% stringr::str_subset("^PC") %>%paste(collapse=" + ") # pop structureform_geo_var <-colnames(df_var) %>% stringr::str_subset("MEM") %>%`[`(1:length(.)) %>%# keep only the n first MEMspaste(collapse=" + ") # coordinates# Model formulasform_full_rda <-paste("df_geno ~ ",paste(c(form_clim_var,form_pgs_var,form_geo_var),collapse=" + " )) %>%as.formula()form_clim_rda <-paste("df_geno ~ ", form_clim_var,"+ Condition(",paste(c(form_pgs_var,form_geo_var),collapse=" + " ), ")") %>%as.formula()form_pgs_rda <-paste("df_geno ~ ", form_pgs_var,"+ Condition(",paste(c(form_clim_var,form_geo_var),collapse=" + " ), ")") %>%as.formula()form.geo_rda <-paste("df_geno ~ ", form_geo_var,"+ Condition(",paste(c(form_clim_var,form_pgs_var),collapse=" + " ), ")") %>%as.formula()# Run the RDA models# ==================# Full RDAfull_rda <-rda(form_full_rda, data=df_var, scale=scale_rda,center=center_rda) anova_full_rda <-anova(full_rda)# Partial RDA: pure climatic modelclim_rda <-rda(form_clim_rda, data=df_var, scale=scale_rda,center=center_rda)anova_clim_rda <-anova(clim_rda)# Partial RDA: pure neutral population structure modelpgs_rda <-rda(form_pgs_rda, data=df_var, scale=scale_rda,center=center_rda)anova_pgs_rda <-anova(pgs_rda)# Partial RDA: pure geography modelgeo_rda <-rda(form.geo_rda, data=df_var, scale=scale_rda,center=center_rda)anova_geo_rda <-anova(geo_rda)# Summary table# =============sum_tab_RDA <-tibble("RDA models"=c("Full model: Y ~ clim + geo + pgs.","Pure climate model: Y ~ clim | (geo + pgs)","Pure pop. gen. structure model: Y ~ pgs | (geo + clim)","Pure geography model: Y ~ geo | (pgs + clim)"),"Total exp. variance"=c(RsquareAdj(full_rda)[[1]],RsquareAdj(clim_rda)[[1]],RsquareAdj(pgs_rda)[[1]],RsquareAdj(geo_rda)[[1]]),"Relative exp. variance"=c(1,RsquareAdj(clim_rda)[[1]]/RsquareAdj(full_rda)[[1]],RsquareAdj(pgs_rda)[[1]]/RsquareAdj(full_rda)[[1]],RsquareAdj(geo_rda)[[1]]/RsquareAdj(full_rda)[[1]]),"P-value"=c(anova_full_rda[["Pr(>F)"]][[1]], anova_clim_rda[["Pr(>F)"]][[1]], anova_pgs_rda[["Pr(>F)"]][[1]], anova_geo_rda[["Pr(>F)"]][[1]]))# # Export the table in latex# xtable(sum_tab_RDA, type = "latex",digits=2) %>% # print(file = here(paste0("tables/VariancePartitioningRDA_EM",EM_i,".tex")), include.rownames=FALSE)return(sum_tab_RDA)}
Code
# Variance partionning estimated with the different EMlist_varpart <-lapply(unique(climdf$ensemble_member), function(EM_i){format_clim(df, EM = EM_i, RCP ="RCP2.6", fut_timeslice ="2030-2050", clim_var_avg_selected)$climdf_ref_scaled %>% dplyr::select(starts_with("MEM"),contains("PC"),any_of(clim_var_avg_selected)) %>%var_part(df_geno=geno[,-1],clim_var_avg_selected=clim_var_avg_selected,df_var=.,scale_rda=F,center_rda=F)}) %>%setNames(unique(climdf$ensemble_member))saveRDS(list_varpart, here("outputs/ScotsPine/VarPart/VarPart.rds"))
Ensemble member 01:
Code
list_varpart <-readRDS(file=here("outputs/ScotsPine/VarPart/VarPart.rds"))# Ensemble member 01kable_mydf(list_varpart$`01`, round_number =4, boldfirstcolumn = T)
RDA models
Total exp. variance
Relative exp. variance
P-value
Full model: Y ~ clim + geo + pgs.
0.6520
1.0000
0.001
Pure climate model: Y ~ clim | (geo + pgs)
0.2168
0.3326
0.534
Pure pop. gen. structure model: Y ~ pgs | (geo + clim)
0.1363
0.2091
0.420
Pure geography model: Y ~ geo | (pgs + clim)
0.1782
0.2733
0.429
Ensemble member 04:
Code
# Ensemble member 04kable_mydf(list_varpart$`04`, round_number =4, boldfirstcolumn = T)
RDA models
Total exp. variance
Relative exp. variance
P-value
Full model: Y ~ clim + geo + pgs.
0.6524
1.0000
0.001
Pure climate model: Y ~ clim | (geo + pgs)
0.2172
0.3330
0.538
Pure pop. gen. structure model: Y ~ pgs | (geo + clim)
0.1380
0.2115
0.385
Pure geography model: Y ~ geo | (pgs + clim)
0.1767
0.2708
0.424
Ensemble member 06:
Code
# Ensemble member 06kable_mydf(list_varpart$`06`, round_number =4, boldfirstcolumn = T)
RDA models
Total exp. variance
Relative exp. variance
P-value
Full model: Y ~ clim + geo + pgs.
0.6530
1.0000
0.001
Pure climate model: Y ~ clim | (geo + pgs)
0.2178
0.3335
0.479
Pure pop. gen. structure model: Y ~ pgs | (geo + clim)
0.1379
0.2112
0.390
Pure geography model: Y ~ geo | (pgs + clim)
0.1765
0.2703
0.473
Ensemble member 15:
Code
# Ensemble member 15kable_mydf(list_varpart$`15`, round_number =4, boldfirstcolumn = T)
RDA models
Total exp. variance
Relative exp. variance
P-value
Full model: Y ~ clim + geo + pgs.
0.6521
1.0000
0.001
Pure climate model: Y ~ clim | (geo + pgs)
0.2169
0.3327
0.513
Pure pop. gen. structure model: Y ~ pgs | (geo + clim)
0.1369
0.2100
0.393
Pure geography model: Y ~ geo | (pgs + clim)
0.1768
0.2711
0.445
Interpretation
Very little differences among ensemble members, so we can only show the results from one model in the paper (SI).
About 60% of genomic variation can be explained by the three factors: climate, geography and population genetic structure. 22% of genomic variation (more than 1/3 of the variance explained by the three factors combined) can be explained by climate alone. In contrast, population genetic structure and geography alone explain less genomic variation, about 14% and 18% respectively.
As a comparison, in maritime pine, about 80% of genomic variation can be explained by the three factors, but climate can only explain 8% alone.
References
Dray, Stéphane, Pierre Legendre, and Pedro R. Peres-Neto. 2006. “Spatial Modelling: A Comprehensive Framework for Principal Coordinate Analysis of Neighbour Matrices (PCNM).”Ecological Modelling 196 (3): 483–93. https://doi.org/10.1016/j.ecolmodel.2006.02.015.
Forester, Brenna R., Jesse R. Lasky, Helene H. Wagner, and Dean L. Urban. 2018. “Comparing Methods for Detecting Multilocus Adaptation with Multivariate Genotype–Environment Associations.”Molecular Ecology 27 (9): 2215–33. https://doi.org/10.1111/mec.14584.
Gutaker, Rafal M., Simon C. Groen, Emily S. Bellis, Jae Y. Choi, Inês S. Pires, R. Kyle Bocinsky, Emma R. Slayton, et al. 2020. “Genomic History and Ecology of the Geographic Spread of Rice.”Nature Plants 6 (5): 492–502. https://doi.org/10.1038/s41477-020-0659-6.
Lasky, Jesse R., Hari D. Upadhyaya, Punna Ramu, Santosh Deshpande, C. Tom Hash, Jason Bonnette, Thomas E. Juenger, et al. 2015. “Genome-Environment Associations in Sorghum Landraces Predict Adaptive Traits.”Science Advances 1 (6): e1400218. https://doi.org/10.1126/sciadv.1400218.
Oksanen, Jari, Gavin L. Simpson, F. Guillaume Blanchet, Roeland Kindt, Pierre Legendre, Peter R. Minchin, R. B. O’Hara, et al. 2022. vegan: Community Ecology Package. https://CRAN.R-project.org/package=vegan.