Occupancy models in R

Infer occupancy with unmarked and stan

This post was originally published on CESAB’s tips and tricks blog

Question. Why do we talk out loud when we know we’re alone? Conjecture. Because we know we’re not.

The Twelfth Doctor, Doctor Who (Series 8, Episode 4: “Listen”)

Introduction

It’s not because we didn’t see something that this thing wasn’t present. At least, that’s the idea behind occupancy models: our observations aren’t a direct representation of reality, but merely of what we can detect.

An owl perched on a photographer's lens
Present ($z_i = 1$), but not detected ($y_{ij} = 0$). Photo by Caroline Kirk ( source)

Occupancy models were first published by MacKenzie et al. (2002) in the context of species occurrence modelling. Many extensions of occupancy have been proposed since, allowing to explicitly model occupancy dynamics (MacKenzie et al. 2003), take into account multiple species (Rota et al. 2016) or a continuous detection process (MacKenzie et al. 2003). This blog post only goes over the original simple occupancy model.

Simple occupancy model

Summary diagram of the structure of an occupancy model.

To discriminate between the real and the observed states, occupancy models have one parameter for each of these states. The true presence or absence of a species in a given site $i$ is noted $z_i$. The observed presence or absence at site $i$ for a given visit $j$ is noted $y_{ij}$.

Then, the occupancy model for a given site $i$ is written as:

$$ y_{ij} \sim Bern(z_i~p) $$
  • if the species is really present in site $i$ ($z_i = 1$), then it is detected ($y_{ij} = 1$) with probability $p$ according to a Bernoulli trial.
  • if the species is absent in site $i$ ($z_i = 0$), then we assume that it cannot be detected ($y_{ij}$ has to be zero) (no false detection).

In the occupancy framework, $z_i$ itself is governed by a Bernoulli trial of probability $\psi$.

$$ z_i \sim Bern(\psi) $$

$\psi$ is called the occupancy probability: most of the times, when dealing with occupancy, that’s the quantity we’re really interested in.

Simulate occupancy

Occupancy models are really easy to simulate: here is a sample R code to simulate data under a simple occupancy model.

 1set.seed(42)
 2
 3M <- 100 # Number of sites
 4p <- 0.4 # Detection probability
 5psi <- 0.8 # Occupancy
 6
 7# Simulate a number of visits for each site
 8nvisit <- rpois(n = M, lambda = 3)
 9nvisit[nvisit == 0] <- 1 # Don't allow zero visits
10
11# Initialize vectors
12z <- vector(mode = "numeric", length = M)
13y <- vector(mode = "list", length = M)
14
15for (i in 1:M) { # For each site
16  # Simulate true presence/absence at site i
17  zi <- rbinom(n = 1, size = 1, prob = psi)
18  
19  # Simulate observed presence/absence at site i for all visits
20  yij <- rbinom(n = nvisit[i], 
21                size = 1, prob = p*zi)
22  
23  z[i] <- zi # True sites states
24  y[[i]] <- yij # Detections
25}

In this simulation, the true proportion of occupied sites is 0.73. It is very close to the occupancy parameter $\psi = 0.8$ (but it is not exactly equal because of the stochastic nature of our model).

1# True proportion of occupied sites
2sum(z)/M
[1] 0.73

The proportion of sites that we detect as occupied (what is called the naive occupancy) is 0.51, which wildly under-estimates $\psi$.

1# Proportion of sites with at least one detection
2sum(sapply(y, function(yi) any(yi != 0)))/M
[1] 0.51

Infer occupancy models in R

Now, the true value of occupancy models lies in the analysis of species detections. Inferring parameters from data is possible under three main conditions:

  • You have repeated visits. This is an crucial point which allows parameter identifiability.
  • The site remains in the same state (occupied or unoccupied) during the entire study period (closure assumption)1.
  • There are no false detections. The model automatically assumes that a detection event means that the site is really occupied, and false detections might induce a positive bias in occupancy estimates.

There are lots of strategies to infer occupancy models in R: among the most popular occupancy packages are unmarked and spOccupancy. Many people will also use Bayesian softwares like JAGS (e.g. with the R package jagsUI or rjags), nimble or Stan (with the R packagescmdstanrorrstan).

This blog post focuses on two of them: unmarked and Stan (using the R interface package cmdstanr).

unmarked

The R package unmarked has functions specifically designed for occupancy inference in R using maximum likelihood estimation (frequentist statistics).

First, we have to format the observed visits to an unmarkedFrameOccu object, as exemplified below:

 1library(unmarked)
 2
 3# Format the list of observed detections y
 4max_visit <- max(sapply(y, length)) # Get maximum number of detections
 5
 6# Transform y to matrix
 7y_matrix <- matrix(data = NA,
 8                   nrow = M,
 9                   ncol = max_visit)
10for (i in 1:M) { # For each site
11  nvisit_i <- length(y[[i]]) # Get number of visits
12  y_matrix[i, 1:nvisit_i] <- y[[i]] # Fill n-th first rows with detection history
13}
14
15# Each row contains the detections at a site, filled with NAs for 
16# sites that have less visits than the most visited one
17head(y_matrix, 3)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,]    0    0    0    1    1   NA   NA   NA
[2,]    1    0    0    1    1    1   NA   NA
[3,]    0    0   NA   NA   NA   NA   NA   NA
1# Cast y_matrix to unmarkedFrameOccu
2y_occu <- unmarked::unmarkedFrameOccu(y_matrix)
3head(y_occu, 3)
Data frame representation of unmarkedFrame object.
  y.1 y.2 y.3 y.4 y.5 y.6 y.7 y.8
1   0   0   0   1   1  NA  NA  NA
2   1   0   0   1   1   1  NA  NA
3   0   0  NA  NA  NA  NA  NA  NA

In unmarked, the main function to infer parameters of a simpler occupancy model is occu, which uses a maximum likelihood estimator. In its simplest form, it only needs two arguments:

  • formula, which gives the formulas for the logit of $p$ and $\psi$. Here, since occupancy and detection are constant, we will set them to ~1 ~1.
  • data: an unmarkedFrameOccu object containing observed detections (here, y_occu).
1# Infer occupancy
2occ <- unmarked::occu(formula = ~1 ~1, 
3                      data = y_occu)
4
5class(occ)
[1] "unmarkedFitOccu"
attr(,"package")
[1] "unmarked"
1summary(occ)
Call:
unmarked::occu(formula = ~1 ~ 1, data = y_occu)

Occupancy (logit-scale):
 Estimate    SE    z P(>|z|)
     1.03 0.406 2.54  0.0112

Detection (logit-scale):
 Estimate    SE     z  P(>|z|)
   -0.623 0.184 -3.38 0.000722

AIC: 360.8983 
Number of sites: 100

The output of the model is an unmarkedFitOccu object. The summary gives us the parameter estimates on the logit scale (i.e. we get $\text{logit}(p)$ and $\text{logit}(\psi)$), but we can get estimates on the natural scale with backTransform:

1# Get detection (p) on the natural scale
2unmarked::backTransform(occ, type = "det")
Backtransformed linear combination(s) of Detection estimate(s)

 Estimate     SE LinComb (Intercept)
    0.349 0.0419  -0.623           1

Transformation: logistic 
1# Get occupancy (state parameter, psi) on the natural scale
2unmarked::backTransform(occ, type = "state")
Backtransformed linear combination(s) of Occupancy estimate(s)

 Estimate     SE LinComb (Intercept)
    0.737 0.0788    1.03           1

Transformation: logistic 

And inferred parameters are a good approximation of true values (see figure below).

Code
 1p_inf <- unmarked::predict(occ, 
 2                           type = "det",
 3                           backTransform = TRUE,
 4                           newdata = data.frame(1))
 5psi_inf <- unmarked::predict(occ, 
 6                             type = "state",
 7                             backTransform = TRUE,
 8                             newdata = data.frame(1))
 9
10umk_param_df <- data.frame(param = c("p", "p", "psi", "psi"),
11                           type = c("inferred", "true", "inferred", "true"),
12                           estimate = c(p_inf$Predicted, p,
13                                        psi_inf$Predicted, psi),
14                           min = c(p_inf$lower, NA,
15                                   psi_inf$lower, NA),
16                           max = c(p_inf$upper, NA,
17                                   psi_inf$upper, NA))
18
19ggplot(umk_param_df) +
20  geom_errorbar(data = subset(umk_param_df, type == "inferred"),
21                aes(xmin = min, xmax = max, y = param,
22                    color = type)) +
23  geom_point(aes(x = estimate, y = param,
24                 color = type)) +
25  xlim(0, 1) +
26  theme(axis.title = element_blank())
True and inferred occupancy parameters with unmarked

Stan

Bayesian frameworks are also quite common to infer occupancy models, notably because they allow for more flexibility in the parameters specifications. Here, we use Stan through the R package cmdstanr. Stan is a Bayesian software using optimized algorithms for the MCMC samplers.

In Stan, it is more efficient to use a Binomial model specification: we model the number of detections $n_i$ at each site:

$$n_i \sim Binom({n_{\text{visit}}}_i, z_i~p) \quad \text{and} \quad z_i \sim Bern(\psi)$$

where ${n_{\text{visit}}}_i$ the number of visits at site $i$. The equations above are strictly equivalent to the general occupancy model specification described earlier.

The first step is to format our data to be used by Stan. For that, we aggregate our visits to total number of detections, and save data parameters to a list named dat.

1# Format data for Stan
2n <- sapply(y, sum) # Number of detections
3nvisit <- sapply(y, length) # Number of visits
4
5# List of parameters for Stan
6dat <- list(M = M,
7            n = n,
8            nvisit = nvisit)

Stan code

Then, we write the Stan code to infer our occupancy parameters. Stan is a programming language in its own right, and here we will only go over the Stan syntax quickly.

Stan programs are organized in code blocks, which are indicated by opening and closing brackets as shown below:

 1data {
 2  // Code block to define data variables
 3}
 4
 5parameters {
 6  // Code block to define parameters
 7}
 8
 9model {
10  // Code block to infer parameters
11}

Note that Stan comments are specified with a double backslash //. There are other optional code blocks (in fact, we will use one later), but the mandatory ones are shown above.

Let’s start by defining variables in the data block. Stan variables are typed (i.e. we must define their type manually with the statements before variables names). Here, we only need the 3 variables that we specified on our data list above.

1data {
2  int<lower=1> M; // Number of sites
3  array[M] int nvisit; // Number of visits per sites-years
4  array[M] int<lower=0> n; // Observations vector
5}

Next, we define the parameters we want to infer in the parameters block. Here, they are defined on the logit scale because inferring unbounded parameters is more efficient in Stan.

1parameters {
2  real psi_logit; // Value of psi on the logit scale
3  real p_logit; // Value of p on the logit scale
4}

The syntax of the model block is slightly more complex. There are 3 steps:

  • Define the parameters priors: here, we use flat normal priors centered on zero with a standard deviation of 3. This amounts to initializing the log-posterior with a log-transformed normal density, which is what target += normal_lpdf(param | 0, 3) does.
  • (Optional) Define intermediate variables. Here, we define nvi as a shortcut for the number of visits for site $i$.
  • Specify the model. This is where it requires a bit of work, because Stan cannot model discrete variables, so we have to update the posterior probability density manually instead. Fortunately, it is fairly easy to write with respect to $p$ and $\psi$. Here, we won’t go into the details of these computations (but see Note 1 below).
 1model {
 2  // 1. Priors
 3  target += normal_lpdf(psi_logit | 0, 3);
 4  target += normal_lpdf(p_logit | 0, 3);
 5
 6  // 2. Variables
 7  int nvi;
 8
 9  // 3. Model specification
10  for (i in 1:M) { // Iterate over sites
11    nvi = nvisit[i]; // Number of visits
12    if (n[i] > 0) { // The species was seen
13      // Update log-likelihood: species was detected | present
14      target += log_inv_logit(psi_logit) + 
15        binomial_logit_lpmf(n[i] | nvi, p_logit);
16    } else {
17      // Update log-likelihood: species was non-detected | present
18      // or non-detected | absent
19      target += log_sum_exp(log_inv_logit(psi_logit) + binomial_logit_lpmf(0 | nvi, p_logit), 
20        log1m_inv_logit(psi_logit));
21    }
22  }
23}

Note 1: Log-posterior probability density computation

Below, we give a bit more explanations about the computation of the log-posterior.

We distinguish between two cases to write the log-posterior:

  • The species was detected at least once (case n[i] > 0). In that case, we know that it was present (no false detections), so the conditional log-probability that the species was seen $n_i$ times (which is what we update target with) is the product of the occupancy probability $\psi$ and the binomial density probability of $N = n$:

    $$P(N = n_i | \{p, \psi\}) = \psi ~ \underbrace{\binom{{n_{\text{visit}}}_i}{n_i} p^{n_i} p^{{n_{\text{visit}}}_i - n_i}}_{\text{Binomial density probability}}$$

    In Stan, parameters are on the logit scale: psi_logit is first back-transformed with log_inv_logit and Stan uses the Binomial density probability on the logit scale with binomial_logit_lpmf. Finally, because we compute the log-posterior, by properties of the logarithm the multiplication is changed to an addition.

  • When the species was not seen (else block), we have two options. Either the species was present, but undetected; or it was absent. In that case, the conditional log-probability is the sum of these two events:

    $$P(N = 0 | \{p, \psi\}) = \underbrace{\psi ~ \binom{{n_{\text{visit}}}_i}{0} p^{0} p^{{n_{\text{visit}}}_i - 0}}_{\text{present, undetected}} + \underbrace{(1 - \psi)}_{\text{absent}}$$

    In Stan, log_sum_exp is an efficient way to compute a logarithm of the sum above, and log1m_inv_logit(psi_logit) returns $1 - \psi$.

More details on log-likelihood computation can be found in this blog post or in chapter 2.4.6 of Kéry and Royle (2016).

The final (and optional) code block we use here is a generated quantities block. It allows to compute $p$ and $\psi$ on the natural scale by using inv_logit to back-transform parameters.

1generated quantities {
2  real<lower=0,upper=1> p = inv_logit(p_logit);
3  real<lower=0, upper=1> psi = inv_logit(psi_logit);
4}

The final Stan code is summarised below:

 1data {
 2  int<lower=1> M; // Number of sites
 3  array[M] int nvisit; // Number of visits per sites-years
 4  array[M] int<lower=0> n; // Observations vector
 5}
 6
 7parameters {
 8  real psi_logit; // Value of psi on the logit scale
 9  real p_logit; // Value of p on the logit scale
10}
11
12model {
13  // 1. Priors
14  target += normal_lpdf(psi_logit | 0, 3);
15  target += normal_lpdf(p_logit | 0, 3);
16
17  // 2. Variables
18  int nvi;
19
20  // 3. Model specification
21  for (i in 1:M) { // Iterate over sites
22    nvi = nvisit[i]; // Number of visits
23    if (n[i] > 0) { // The species was seen
24      // Update log-likelihood: species was detected | present
25      target += log_inv_logit(psi_logit) + binomial_logit_lpmf(n[i] | nvi, p_logit);
26    } else {
27      // Update log-likelihood: species was non-detected | present
28      // or non-detected | absent
29      target += log_sum_exp(log_inv_logit(psi_logit) + binomial_logit_lpmf(0 | nvi, p_logit), log1m_inv_logit(psi_logit));
30    }
31  }
32}
33
34generated quantities {
35  real<lower=0,upper=1> p = inv_logit(p_logit);
36  real<lower=0, upper=1> psi = inv_logit(psi_logit);
37}

Inference with cmdstanr

Now, we can use this code to infer our occupancy parameters from R. The first step is to compile the Stan model using cmdstan_model, as shown in the code below. Here, we assume the Stan code above is written in a file, which path is stored in the stan_file variable in R.

1library(cmdstanr)
2
3model <- cmdstanr::cmdstan_model(stan_file = stan_file)

We can finally perform the inference on our model object, with the $sample method from the CmdStanModel class. The $sample method can be customised with many arguments, but here we only use our data list (data), the number of burn-in iterations (warmup), the number of post-burn-in iterations (iter_sampling) and the number of chains and their parallelisation (chains and parallel_chains).

1stanfit <- model$sample(data = dat,
2                        iter_warmup = 200,
3                        iter_sampling = 500,
4                        chains = 4,
5                        parallel_chains = 4)

We can inspect the inferred values with the $summary method.

1(stan_inf <- stanfit$summary())
# A tibble: 5 × 10
  variable      mean   median     sd    mad       q5      q95  rhat ess_bulk
  <chr>        <dbl>    <dbl>  <dbl>  <dbl>    <dbl>    <dbl> <dbl>    <dbl>
1 lp__      -115.    -114.    1.05   0.798  -117.    -114.    1.00      893.
2 psi_logit    1.17     1.11  0.506  0.446     0.483    2.07  1.000     873.
3 p_logit     -0.655   -0.654 0.191  0.190    -0.977   -0.343 1.00      689.
4 p            0.343    0.342 0.0428 0.0426    0.274    0.415 1.00      689.
5 psi          0.752    0.751 0.0815 0.0841    0.618    0.888 1.000     873.
# ℹ 1 more variable: ess_tail <dbl>

The summary gives information about the log-posterior (lp__) and all inferred parameters (on the logit and the natural scales). For each parameter, a set of summary statistics which summarize their distribution are given. Three diagnostic parameters to evaluate MCMC sampling are also given (a convergence statistic $\hat{R}$, and effective sample sizes (ESS)).

Finally, a graphical summary shows that inferred parameters are close to the real ones:

Code
 1library(ggplot2)
 2
 3stan_param_df <- data.frame(param = c("p", "p", "psi", "psi"),
 4                            type = c("inferred", "true", "inferred", "true"),
 5                            estimate = c(stan_inf$mean[stan_inf$variable == "p"], p,
 6                                         stan_inf$mean[stan_inf$variable == "psi"], psi),
 7                            min = c(stan_inf$q5[stan_inf$variable == "p"], NA,
 8                                    stan_inf$q5[stan_inf$variable == "psi"], NA),
 9                            max = c(stan_inf$q95[stan_inf$variable == "p"], NA,
10                                    stan_inf$q95[stan_inf$variable == "psi"], NA))
11
12ggplot(stan_param_df) +
13  geom_errorbar(data = subset(stan_param_df, type == "inferred"),
14                aes(xmin = min, xmax = max, y = param,
15                    color = type)) +
16  geom_point(aes(x = estimate, y = param,
17                 color = type)) +
18  xlim(0, 1)
True and inferred occupancy parameters with Stan

Conclusion

In this blog post, we have seen how the simple occupancy model is written and how simulate data under an occupancy model in R. We have then inferred occupancy models with the unmarked and cmdstanr R packages.

There is much more one can do with occupancy models. In particular, the parameters $p$ and $\psi$ are often not constant, and can be modeled as functions of covariates with a logit link, such as $\text{logit}(\psi_i) = \beta_0 + \beta_1 x_i$ or $\text{logit}(p_{ij}) = \beta_0 + \beta_1 x_{ij}$. But this is a story for another blog post…

References

Kéry, Marc, and J. Andrew Royle. 2016. Applied Hierarchical Modeling in Ecology: Analysis of Distribution, Abundance and Species Richness in R and BUGS. Amsterdam; Boston: Elsevier/AP, Academic Press is an imprint of Elsevier.

MacKenzie, Darryl I., James D. Nichols, James E. Hines, Melinda G. Knutson, and Alan B. Franklin. 2003. “Estimating Site Occupancy, Colonization, and Local Extinction When a Species Is Detected Imperfectly.” Ecology 84 (8): 2200–2207. https://doi.org/10.1890/02-3090.

MacKenzie, Darryl I., James D. Nichols, Gideon B. Lachman, Sam Droege, J. Andrew Royle, and Catherine A. Langtimm. 2002. “Estimating Site Occupancy Rates When Detection Probabilities Are Less Than One.” Ecology 83 (8): 2248–55. https://doi.org/10.1890/0012-9658(2002)083[2248:ESORWD]2.0.CO;2.

Rota, Christopher T., Marco A. R. Ferreira, Roland W. Kays, Tavis D. Forrester, Elizabeth L. Kalies, William J. McShea, Arielle W. Parsons, and Joshua J. Millspaugh. 2016. “A Multispecies Occupancy Model for Two or More Interacting Species.” Methods in Ecology and Evolution 7 (10): 1164–73. https://onlinelibrary.wiley.com/doi/abs/10.1111/2041-210X.12587.

Valente, Jonathon J., Vitek Jirinec, and Matthias Leu. 2024. “Thinking Beyond the Closure Assumption: Designing Surveys for Estimating Biological Truth with Occupancy Models.” Methods in Ecology and Evolution 15 (12): 2289–2300. https://doi.org/10.1111/2041-210X.14439.

Other resources

  • Richard A. Erickson’s GitHub repository with Stan occupancy models tutorials
  • Olivier Gimenez’s workshop with resources on occupancy modelling with unmarked

  1. In practice, this assumption is often violated in studies using occupancy models: this has been discussed extensively in the literature (see for example Valente, Jirinec, and Leu (2024)), and authors have proposed to critically evaluate whether the closure assumption is met to interpret the inferred occupancy. ↩︎