Here we construct a Binomial model first. We have 25 groups of size 50 or less, with a slightly different probability of success in each group. Let’s say we have 25 groups of \(z_i\) turtles each; females are born with probability \(p_i\) in group \(i\).
sample_size_per_group = round(30*runif(25))+20 # group sizes between 20 and 50
n_groups = length(sample_size_per_group)
temperature = (1:n_groups)*0.1 + rnorm(n_groups,0,0.5) # increasing trend plus noise
plot(1:n_groups,temperature,type="o",xlab="Group ID",ylab="Temperature")
Y = p = temperature # preallocate vectors of the right length
for (i in 1:n_groups){
  p[i] = 1/(1+exp(-3*(temperature[i]-mean(temperature)))) # inverse-logit link, slope 3
  Y[i] = rbinom(1,sample_size_per_group[i],p[i]) # binomial number of females
}
plot(temperature,p,type="p",xlab="Temperature",ylab="Pr(female)")
plot(1:n_groups,Y,type="p",xlab="Group ID",ylab="N_females")
Data in a list:
m11.data <- list(N = n_groups, y = Y, temp = temperature, z = sample_size_per_group)
Now we fit that model, which can be written mathematically as
\[ y_i \sim \text{Binomial}\left(z_i,\, p(\text{temp}_i)\right), \qquad p(\text{temp}_i) = \text{logit}^{-1}\left(\gamma\,(\text{temp}_i - \mu_{\text{temp}})\right) \]
data { // observed variables
  int<lower=1> N; // number of observations
  vector[N] temp; // temperature
  int<lower=0,upper=50> y[N]; // response variable
  int<lower=0,upper=50> z[N]; // sample size per group
}
parameters { // unobserved variables
  real mu_temp;
  real gamma;
}
model {
  for (k in 1:N){
    y[k] ~ binomial(z[k], inv_logit(gamma*(temp[k]-mu_temp))); // likelihood
    // y[k] ~ bernoulli_logit(alpha+beta*temp[k]); // does not work: these are binomial counts, not Bernoulli trials!
  }
  mu_temp ~ normal(2, 10); // prior of the mean temp
  gamma ~ normal(1, 10); // prior of the slope
}
m1.1 = stan_model("m1.1.stan") # compile the Stan program (file name assumed, following the other models)
fit.m1.1 <- sampling(m1.1, data = m11.data, iter = 1000, chains = 2, cores = 2)
print(fit.m1.1, probs = c(0.10, 0.5, 0.9))
## Inference for Stan model: 3f1e7d0b1f2848684011f04c4f6f7f52.
## 2 chains, each with iter=1000; warmup=500; thin=1;
## post-warmup draws per chain=500, total post-warmup draws=1000.
##
## mean se_mean sd 10% 50% 90% n_eff Rhat
## mu_temp 1.35 0.00 0.03 1.31 1.35 1.38 622 1
## gamma 3.87 0.01 0.26 3.52 3.86 4.19 683 1
## lp__ -250.01 0.04 0.98 -251.27 -249.72 -249.11 482 1
##
## Samples were drawn using NUTS(diag_e) at Mon Mar 30 08:40:46 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
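As a quick sanity check, we can overlay the curve implied by the posterior means on the simulated probabilities. This is a minimal sketch, assuming rstan is attached and the simulation objects above are still in the workspace; the data were simulated with a slope of 3 centred on the mean temperature, so gamma and mu_temp should land close to those values.
post <- rstan::extract(fit.m1.1) # posterior draws as a list of vectors
xs <- seq(min(temperature), max(temperature), by = 0.01)
plot(temperature, p, xlab = "Temperature", ylab = "Pr(female)")
lines(xs, plogis(mean(post$gamma)*(xs - mean(post$mu_temp))), col = "blue") # plogis() is the inverse logit in base R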
Now we consider a simpler model with a different set of priors. We consider only one group of 50 young turtles, in which each turtle is female with probability 0.3, and this time we have individual-level data. (As we don’t have individual-level covariates, a Bernoulli model and a Binomial model are essentially equivalent here, but I am changing things just for fun and exercise, and to see whether one of the codes is much faster.)
Y=rbinom(50,1,0.3)
m12.data <- list(N = 50, y = Y)
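The contents of m1.2.stan are not shown in this document; a minimal sketch consistent with the output below, assuming the Beta(2,2) prior discussed in the conjugacy paragraph further down, could be:
data {
  int<lower=1> N; // number of individual turtles
  int<lower=0,upper=1> y[N]; // 1 = female, 0 = male
}
parameters {
  real<lower=0,upper=1> p; // Pr(female)
}
model {
  y ~ bernoulli(p); // vectorized Bernoulli likelihood
  p ~ beta(2, 2); // assumed prior, see the discussion below
}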
And now we fit the model
m1.2 = stan_model("m1.2.stan")
## Warning in readLines(file, warn = TRUE): incomplete final line found on '/home/sylvain/Documents/BIOGECO/BiogecoBayes/Workshop3_BinomialModels/m1.2.stan'
fit.m1.2 <- sampling(m1.2, data = m12.data, iter = 1000, chains = 2, cores = 2)
print(fit.m1.2, probs = c(0.10, 0.5, 0.9))
## Inference for Stan model: m1.
## 2 chains, each with iter=1000; warmup=500; thin=1;
## post-warmup draws per chain=500, total post-warmup draws=1000.
##
## mean se_mean sd 10% 50% 90% n_eff Rhat
## p 0.30 0.00 0.06 0.22 0.30 0.38 355 1
## lp__ -31.71 0.03 0.62 -32.51 -31.46 -31.25 469 1
##
## Samples were drawn using NUTS(diag_e) at Mon Mar 30 08:41:26 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
Why did I choose the Beta distribution for the prior of \(p\)? In short, it is the conjugate prior, some sort of canonical choice. In the past, choosing conjugate priors used to speed up computations enormously; nowadays this is less true, but the property remains useful. A conjugate prior is a prior whose distribution family is preserved in the posterior: here, if we have a Beta prior we get a Beta posterior, and if we iterate the data assimilation process 10 000 times it will still be Beta.
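To make conjugacy concrete, here is a minimal sketch (assuming, as in the sketch of m1.2.stan above, a Beta(2,2) prior): with \(k\) females out of \(n\) turtles, the posterior is available in closed form as Beta(2 + k, 2 + n - k), no MCMC needed.
k <- sum(Y); n <- length(Y) # Y is the 0/1 vector simulated above
curve(dbeta(x, 2 + k, 2 + n - k), xlab = "p", ylab = "Density") # analytic posterior
abline(v = (2 + k)/(4 + n), lty = 2) # posterior mean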
curve(dbeta(x,2,2))
curve(dbeta(x,0.5,0.5),add=T,col="red")
Obviously, the red prior would make little sense here. What would it imply?
m1.3 = stan_model("m1.3.stan")
## Warning in readLines(file, warn = TRUE): incomplete final line found on '/home/sylvain/Documents/BIOGECO/BiogecoBayes/Workshop3_BinomialModels/m1.3.stan'
fit.m1.3 <- sampling(m1.3, data = m12.data, iter = 1000, chains = 2, cores = 2)
print(fit.m1.3, probs = c(0.10, 0.5, 0.9))
## Inference for Stan model: m1.
## 2 chains, each with iter=1000; warmup=500; thin=1;
## post-warmup draws per chain=500, total post-warmup draws=1000.
##
## mean se_mean sd 10% 50% 90% n_eff Rhat
## p 0.28 0.00 0.06 0.20 0.29 0.36 344 1.01
## lp__ -29.38 0.03 0.72 -30.17 -29.12 -28.88 485 1.00
##
## Samples were drawn using NUTS(diag_e) at Mon Mar 30 08:42:06 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
It still works well, but this prior brings little information, or even poor information. What happens if, instead, we give the first (turtle) model very vague priors?
data { // observed variables
  int<lower=1> N; // number of observations
  vector[N] temp; // temperature
  int<lower=0,upper=50> y[N]; // response variable
  int<lower=0,upper=50> z[N]; // sample size per group
}
parameters { // unobserved variables
  real mu_temp;
  real gamma;
}
model {
  for (k in 1:N){
    y[k] ~ binomial(z[k], inv_logit(gamma*(temp[k]-mu_temp))); // likelihood
  }
  mu_temp ~ normal(2, 100); // very vague prior of the mean temp
  gamma ~ normal(1, 100); // very vague prior of the slope
}
Let us propagate the uncertainty by visualizing the transformed parameter: we simulate from these priors and look at the curves they imply for Pr(female).
mu_temp = rnorm(100, 2, 100) # prior of the mean temp
gamma = rnorm(100, 1, 100) # prior of the slope
x = seq(min(temperature), max(temperature), by = 0.01)
par(mfrow = c(1, 2))
# left panel: vary mu_temp, with the slope fixed at 0.3
plot(0, bty = 'n', pch = '', ylab = "Pr(female)", xlab = "Temperature", ylim = c(0, 1))
for (kprior in 1:100) {
  prob = 1/(1 + exp(-0.3*(x - mu_temp[kprior])))
  lines(x, prob, type = "l", col = "blue")
}
# right panel: vary gamma, with the centre fixed at the mean temperature
plot(0, bty = 'n', pch = '', ylab = "Pr(female)", xlab = "Temperature", ylim = c(0, 1))
for (kprior in 1:100) {
  prob = 1/(1 + exp(-gamma[kprior]*(x - mean(temperature))))
  lines(x, prob, type = "l", col = "blue")
}
### With such vague priors, we get Pr(female) of essentially either 0 or 1.
### Better priors
mu_temp = rnorm(100, 2, 1) # tighter prior of the mean temp
gamma = rnorm(100, 1, 1) # tighter prior of the slope
par(mfrow = c(1, 2))
plot(0, bty = 'n', pch = '', ylab = "Pr(female)", xlab = "Temperature", ylim = c(0, 1))
for (kprior in 1:100) {
  prob = 1/(1 + exp(-0.3*(x - mu_temp[kprior])))
  lines(x, prob, type = "l", col = "blue")
}
plot(0, bty = 'n', pch = '', ylab = "Pr(female)", xlab = "Temperature", ylim = c(0, 1))
for (kprior in 1:100) {
  prob = 1/(1 + exp(-abs(gamma[kprior])*(x - mean(temperature)))) # abs() forces a positive slope
  lines(x, prob, type = "l", col = "blue")
}
As suggested by McElreath (p. 330 of his book), we now turn to records of 160 (!) salmon-pirating attempts by one Bald Eagle on another Bald Eagle (not always the same birds!). NB: I haven’t implemented the quadratic approximation suggested by McElreath, though that sounds like an excellent exercise.
library(MASS) # the eagles data ship with MASS
?eagles
data(eagles)
### Explanatory variables
# P: size of pirating eagle (L = large, S = small)
# A: age of pirating eagle (I = immature, A = adult)
# V: size of victim eagle (L = large, S = small)
eagles.glm <- glm(cbind(y, n - y) ~ P*A + V, data = eagles,
                  family = binomial) ## classic frequentist modelling
summary(eagles.glm)
# Small and immature pirates capture less often; small victims are more likely
# to have their salmon stolen. So far so good.
m2.data = list(N=nrow(eagles), y=eagles$y, z=eagles$n,
               P=as.numeric(eagles$P)-1, A=as.numeric(eagles$A)-1, V=as.numeric(eagles$V)-1)
# m2.data = list(N=nrow(eagles),y=eagles$y,z=eagles$n,P=eagles$P,A=eagles$A,V=eagles$V)
We are in a similar situation to the turtle sex example, but with categorical explanatory variables.
data { // observed variables
  int<lower=1> N; // number of observations
  int<lower=0,upper=50> y[N]; // response variable
  int<lower=0,upper=50> z[N]; // sample size per group
  int<lower=0,upper=1> A[N]; // age of pirating eagle
  int<lower=0,upper=1> P[N]; // size of pirating eagle
  int<lower=0,upper=1> V[N]; // size of victim eagle
}
parameters { // unobserved variables
  real alpha;
  real beta_P;
  real beta_V;
  real beta_A;
}
model {
  for (k in 1:N){
    y[k] ~ binomial(z[k], inv_logit(alpha + beta_P*P[k] + beta_V*V[k] + beta_A*A[k])); // likelihood
  }
  alpha ~ normal(0, 10); // prior of the intercept
  beta_P ~ normal(0, 5); // prior of the effect of pirate size
  beta_V ~ normal(0, 5); // prior of the effect of victim size
  beta_A ~ normal(0, 5); // prior of the effect of pirate age
}
Fitting the model
m2.1 = stan_model("m2.1.stan",verbose = TRUE)
##
## TRANSLATING MODEL 'm2' FROM Stan CODE TO C++ CODE NOW.
## successful in parsing the Stan model 'm2'.
fit.m2.1 <- sampling(m2.1, data = m2.data, iter = 1000, chains = 2, cores = 2)
print(fit.m2.1, probs = c(0.10, 0.5, 0.9))
## Inference for Stan model: e47cfaa9a07140b50edb541286b764dc.
## 2 chains, each with iter=1000; warmup=500; thin=1;
## post-warmup draws per chain=500, total post-warmup draws=1000.
##
## mean se_mean sd 10% 50% 90% n_eff Rhat
## alpha 1.39 0.02 0.43 0.85 1.38 1.93 536 1
## beta_P -4.60 0.04 0.96 -5.95 -4.54 -3.42 512 1
## beta_V 4.98 0.05 1.02 3.73 4.91 6.35 511 1
## beta_A -1.15 0.02 0.52 -1.80 -1.14 -0.49 575 1
## lp__ -47.98 0.06 1.31 -49.74 -47.67 -46.60 456 1
##
## Samples were drawn using NUTS(diag_e) at Mon Mar 30 08:43:28 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
What do we think of these priors?
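A quick prior-predictive check, in the spirit of the turtle example, helps answer that (a sketch; the object names are mine). Because the four independent normal priors add up on the logit scale, their combined standard deviation is large, and most of the prior mass for Pr(success) ends up near 0 or 1.
n_draws <- 1000
alpha <- rnorm(n_draws, 0, 10)
beta_P <- rnorm(n_draws, 0, 5)
beta_V <- rnorm(n_draws, 0, 5)
beta_A <- rnorm(n_draws, 0, 5)
p_prior <- plogis(alpha + beta_P + beta_V + beta_A) # implied Pr(success) when P = V = A = 1
hist(p_prior, breaks = 30, xlab = "Prior Pr(success)", main = "")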
The Stan book promotes this model:
data {
  int<lower=1> D;
  int<lower=0> N;
  int<lower=1> L;
  int<lower=0,upper=1> y[N]; // raw data
  int<lower=1,upper=L> ll[N]; // set of levels related to each data point
  row_vector[D] x[N];
}
parameters {
  real mu[D];
  real<lower=0> sigma[D];
  vector[D] beta[L];
}
model {
  for (d in 1:D) {
    mu[d] ~ normal(0, 100);
    for (l in 1:L)
      beta[l,d] ~ normal(mu[d], sigma[d]);
  }
  for (n in 1:N)
    y[n] ~ bernoulli(inv_logit(x[n] * beta[ll[n]]));
}
This gives us the opportunity to transform the previous eagles model into something less complex but with a similar philosophy (NB: I am wondering what prior they put on sigma[d] in their example). Currently we have estimated each effect (size of pirating eagle, age of pirating eagle, size of victim eagle) independently. On relatively small datasets this is prone to inflating effects, which is why in practice many statisticians implement some form of shrinkage in such models. It is done easily with an additional prior linking the effects together (see also McElreath’s book on this).
data { // observed variables
  int<lower=1> N; // number of observations
  int<lower=0,upper=50> y[N]; // response variable
  int<lower=0,upper=50> z[N]; // sample size per group
  int<lower=0,upper=1> A[N];
  int<lower=0,upper=1> P[N];
  int<lower=0,upper=1> V[N];
}
parameters { // unobserved variables
  real alpha;
  real beta_P;
  real beta_V;
  real beta_A;
  real<lower=0> gamma_b; // shared scale of the effects; constrained positive since it is a standard deviation
}
model {
  for (k in 1:N){
    y[k] ~ binomial(z[k], inv_logit(alpha + beta_P*P[k] + beta_V*V[k] + beta_A*A[k])); // likelihood
  }
  alpha ~ normal(0, 1); // prior of the intercept
  beta_P ~ normal(0, gamma_b); // priors of the effects, tied together...
  beta_V ~ normal(0, gamma_b);
  beta_A ~ normal(0, gamma_b);
  gamma_b ~ gamma(0.1,0.1); // ...by a common hyperprior
}
Let us now fit that model
m3.2 = stan_model("m3.2.stan")
fit.m3.2 <- sampling(m3.2, data = m2.data, iter = 1000, chains = 2, cores = 2)
## Warning: There were 7 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## Warning: Examine the pairs() plot to diagnose sampling problems
print(fit.m3.2, probs = c(0.10, 0.5, 0.9))
## Inference for Stan model: m3.
## 2 chains, each with iter=1000; warmup=500; thin=1;
## post-warmup draws per chain=500, total post-warmup draws=1000.
##
## mean se_mean sd 10% 50% 90% n_eff Rhat
## alpha 1.09 0.02 0.41 0.61 1.07 1.65 332 1.01
## beta_P -4.37 0.07 1.10 -5.82 -4.22 -3.15 241 1.00
## beta_V 4.95 0.07 1.13 3.60 4.82 6.43 239 1.00
## beta_A -0.87 0.03 0.51 -1.55 -0.85 -0.21 359 1.00
## gamma_b 4.66 0.16 2.42 2.38 4.13 7.46 227 1.00
## lp__ -55.74 0.12 1.91 -58.33 -55.36 -53.65 254 1.00
##
## Samples were drawn using NUTS(diag_e) at Mon Mar 30 08:45:36 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
We see a very moderate amount of shrinkage. There are some divergent transitions as well; perhaps there are not enough effects for shrinkage to pay off, or the Gamma hyperprior is not constraining enough.
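As a first remedy (a sketch, not run here), we can follow the warning’s own advice and raise adapt_delta; replacing the Gamma hyperprior by a half-normal, e.g. gamma_b ~ normal(0, 1) together with the <lower=0> constraint, is another common choice for scale parameters.
fit.m3.2b <- sampling(m3.2, data = m2.data, iter = 1000, chains = 2, cores = 2,
                      control = list(adapt_delta = 0.95)) # smaller sampler steps, fewer divergences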