Missing values I

Ideally, in a breeding bird monitoring scheme, all sites are surveyed every year. If so, it is easy to assess the changes in the yearly all-sites totals of breeding pairs. These totals are usually represented as indices by setting the first year at a value of 100. In reality, many sites in large-scale monitoring schemes are skipped once or several times during a scheme’s lifetime, because some fieldworkers enroll years after the scheme’s start while others drop out after several years. With such missing counts, simple comparisons of yearly all-sites totals of breeding pairs give misleading inferences on trends, as a simplified example shows.

The number of breeding pairs of a given species in the example has declined in sites 1 and 2, which were sampled each year. Site 3 was only surveyed in the third year. Consequently, the yearly total and the index would be highest in year 3 if they were based on the simple sum across all sites. This is, of course, an artefact caused by the enlargement of the monitoring scheme in year 3. Taking the mean count per surveyed site is also incorrect, because site 3 happens to hold more breeding pairs of the species than the other sites: the yearly mean returns to 4 in year 3 although both sites surveyed in all three years have declined.

                         year 1          year 2          year 3
site 1                   4               3               2
site 2                   4               3               2
site 3                   missing count   missing count   8
yearly all-sites total   8               6               12
yearly indices           100             75              150
yearly mean              4               3               4
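
The naive figures in the table can be reproduced with a few lines of Python. This is only an illustrative sketch: the variable names are chosen here for the example, and `None` is used as an assumed marker for a year in which a site was not surveyed.

```python
# Counts per site and year from the example above.
# None marks a year in which the site was not surveyed.
counts = {
    "site 1": [4, 3, 2],
    "site 2": [4, 3, 2],
    "site 3": [None, None, 8],
}
n_years = 3


def observed_in_year(year):
    """Return the counts of all sites that were actually surveyed in a year."""
    return [row[year] for row in counts.values() if row[year] is not None]


# Naive yearly totals: simply sum whatever was counted.
totals = [sum(observed_in_year(t)) for t in range(n_years)]

# Indices relative to the first year (first year = 100).
indices = [100 * total / totals[0] for total in totals]

# Naive yearly means over the surveyed sites only.
means = [sum(observed_in_year(t)) / len(observed_in_year(t)) for t in range(n_years)]

print(totals)   # [8, 6, 12]
print(indices)  # [100.0, 75.0, 150.0]
print(means)    # [4.0, 3.0, 4.0]
```

Both summaries suggest a recovery in year 3, although every site with repeated counts declined.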

To solve the problem by simply disregarding site 3 would be a waste of useful information, especially if site 3 continues to be surveyed in the years to come. It is a better solution to estimate (impute) the missing counts with sound statistical methods. Such an imputation makes it possible to compare the years fairly, ruling out artefacts and producing more reliable figures.

We use the predominant statistical technique to impute missing values in count data, viz. Poisson regression (log-linear models), as implemented in TRIM software (TRends and Indices for Monitoring data; Pannekoek & Van Strien, 2001). Poisson regression is also available in the generalized linear model modules of many other statistical packages. TRIM is an efficient implementation of Poisson regression for analyzing time series of counts collected at many sites and for producing indices with their associated standard errors. It is widely used and available as freeware.

TRIM implements several log-linear models to impute missing data. The basic model contains both site effects and year effects and estimates missing values from all visited sites’ data. The key assumption is that changes observed in surveyed sites also apply to non-surveyed sites. The next example shows the result.

TRIM produces the following values for the sites in the example:

                         year 1          year 2          year 3
site 1 count             4               3               2
site 2 count             4               3               2
site 3 count             estimated: 16   estimated: 12   8
yearly all-sites total   24              18              12
yearly indices           100             75              50

The estimated changes in site 3 are based on the changes observed in sites 1 and 2. The yearly totals and indices now make sense. The same procedure is applied to impute values for sites that were surveyed in past years but are no longer surveyed.
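
The imputed values for this example can be reproduced with a small Python sketch. It assumes the basic model in its simplest form, in which the expected count is the product of a site factor and a year factor, and fits it by alternately rescaling the two sets of factors until fitted and observed totals agree over the surveyed cells (the Poisson maximum likelihood conditions for this model). The function names are hypothetical, and this is not TRIM's own algorithm; it only illustrates the principle on the three sites above.

```python
def fit_site_year_model(counts, n_iter=200):
    """Fit expected counts mu[i][t] = site[i] * year[t] to the observed cells.

    counts is a list of rows (one per site); None marks a missing count.
    Assumes every site and every year has at least one positive count.
    """
    n_sites, n_years = len(counts), len(counts[0])
    site = [1.0] * n_sites
    year = [1.0] * n_years
    for _ in range(n_iter):
        # Rescale each site factor using only the years in which it was surveyed.
        for i in range(n_sites):
            obs = [t for t in range(n_years) if counts[i][t] is not None]
            site[i] = sum(counts[i][t] for t in obs) / sum(year[t] for t in obs)
        # Rescale each year factor using only the sites surveyed in that year.
        for t in range(n_years):
            obs = [i for i in range(n_sites) if counts[i][t] is not None]
            year[t] = sum(counts[i][t] for i in obs) / sum(site[i] for i in obs)
    return [[site[i] * year[t] for t in range(n_years)] for i in range(n_sites)]


def impute(counts):
    """Keep observed counts and fill missing ones with the model's expected counts."""
    fitted = fit_site_year_model(counts)
    return [
        [c if c is not None else mu for c, mu in zip(row, fit_row)]
        for row, fit_row in zip(counts, fitted)
    ]


counts = [
    [4, 3, 2],        # site 1
    [4, 3, 2],        # site 2
    [None, None, 8],  # site 3, surveyed in year 3 only
]
completed = impute(counts)
totals = [sum(row[t] for row in completed) for t in range(3)]
indices = [100 * total / totals[0] for total in totals]

print([round(x, 3) for x in completed[2]])  # approximately 16, 12, 8
print([round(x, 3) for x in totals])        # approximately 24, 18, 12
print([round(x, 3) for x in indices])       # approximately 100, 75, 50
```

Because sites 1 and 2 fall in the ratio 4 : 3 : 2, the single count of 8 at site 3 is extended backwards in the same ratio, giving 16 and 12.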

Note that such imputation does not affect the trend estimation, precisely because the missing values are calculated from the changes in sites with observations. Estimating missing values only serves to allow a fair comparison between years. Also, note that the aim is not to get reliable information on changes in site 3, but only to get reliable information on trends based on all available information. Imputed values are less valuable than real observations. The major drawback is that the more missing values occur in the data, the wider the confidence intervals of the indices become. This is because imputed values do not enlarge the sample size; in the example, the sample size for the first two years is still 2.

The basic model may be elaborated by including covariates, such as habitat or region. Changes between years for non-surveyed sites are then derived from changes in surveyed sites with a similar habitat or within the same region, thereby relaxing the assumption mentioned above. Incorporating covariates may lead to better model fit, better imputations, and narrower confidence intervals of the resulting indices and trends. The penalty for not using such elaborate TRIM models is larger standard errors of the indices.

The usual statistical inference approach for log-linear models is maximum likelihood estimation and associated calculations of standard errors and test statistics. These estimations and testing procedures are based on the assumption of independent Poisson distributions for the counts. Such an assumption is likely to be violated when animals are counted because the variance may be larger than expected for a Poisson distribution (overdispersion), for instance, when the animals occur in colonies. Furthermore, counts are often not independently distributed because the counts at a particular point in time may depend on the previous time-point counts (serial correlation). TRIM uses procedures for estimation and testing that take into account these two phenomena.
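
Both phenomena can be quantified from the residuals of a fitted model. The Python sketch below uses common moment estimators: the Pearson chi-square statistic divided by the residual degrees of freedom for overdispersion, and the scaled mean product of successive Pearson residuals within sites for lag-1 serial correlation. The counts are made up for illustration, the fitted values come from an assumed constant-mean model (one parameter), and the function names are hypothetical. It is not intended to reproduce TRIM's own procedures.

```python
import math


def pearson_residuals(observed, fitted):
    """Pearson residuals (y - mu) / sqrt(mu) per site; None where no count exists."""
    return [
        [None if y is None else (y - mu) / math.sqrt(mu) for y, mu in zip(obs_row, fit_row)]
        for obs_row, fit_row in zip(observed, fitted)
    ]


def overdispersion(residuals, n_params):
    """Pearson chi-square divided by (number of observations - number of parameters).

    Values near 1 are consistent with Poisson variation; larger values
    indicate overdispersion.
    """
    values = [r for row in residuals for r in row if r is not None]
    return sum(r * r for r in values) / (len(values) - n_params)


def lag1_correlation(residuals, dispersion):
    """Mean product of residuals in consecutive years within a site, scaled by dispersion."""
    products = [
        a * b
        for row in residuals
        for a, b in zip(row, row[1:])
        if a is not None and b is not None
    ]
    return sum(products) / (len(products) * dispersion)


# Made-up counts for two sites over three years; None marks a missing count.
observed = [[10, 2, 10], [4, 4, None]]
# Constant-mean fit: the mean of the five observed counts is 6.
fitted = [[6.0, 6.0, 6.0], [6.0, 6.0, 6.0]]

residuals = pearson_residuals(observed, fitted)
dispersion = overdispersion(residuals, n_params=1)
rho = lag1_correlation(residuals, dispersion)

print(round(dispersion, 3))  # about 2.333, i.e. more variable than Poisson
print(round(rho, 3))         # about -0.667
```

Standard errors computed under the plain Poisson assumption would be too small for data like these, which is why the estimation procedure needs to account for both quantities.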

Recently, TRIM has been reimplemented in R as the RTRIM package. A special version, modified for PECBMS’ needs and available for download, is called RTRIM-shell. RTRIM-shell is a set of three R scripts, developed by Statistics Netherlands, that use the RTRIM package to calculate national species indices.

Besides TRIM, an older software tool called BirdSTATs is also available for the computation of population indices and trends. BirdSTATs is an open-source Microsoft Access database that automatically runs TRIM in batch mode to perform the statistical analysis for a series of bird counts in the dataset.

BirdSTATs can import different kinds of count data, enable stratification of count sites and selection of subsets of the data, produce standardized TRIM input and command files, and run TRIM in batch mode for all or a selection of strata. It collects the output of the batched TRIM runs in a convenient, standardized format that fits the requirements of PECBMS.