Missing values I

Ideally, in a breeding bird monitoring scheme all sites are surveyed every year. If so, it is easy to assess the changes in the yearly all-sites totals of breeding pairs. These totals are usually represented as indices by setting the first year at value 100. But the reality in large-scale monitoring schemes is that many sites are skipped once or several times during the lifetime of a scheme because some fieldworkers enroll years after the start of the scheme, while others drop out after a number of years. Missing counts thus arise and simple comparisons of yearly all-sites totals of breeding pairs give misleading inferences on trends, as a simplified example shows.

The number of breeding pairs of a given species in the example has declined in sites 1 and 2, which were sampled each year. Site 3 was surveyed only in the third year. As a consequence, the yearly total as well as the index would be highest in year 3 if they are based on the simple sum across all sites, which is of course an artifact caused by the enlargement of the monitoring scheme in year 3. Taking the mean count per surveyed site is also misleading, because site 3 happens to hold more breeding pairs of the species than the other sites, which inflates the mean for year 3.

                         year 1          year 2          year 3
site 1                   4               3               2
site 2                   4               3               2
site 3                   missing count   missing count   8
yearly all-sites total   8               6               12
yearly indices           100             75              150
yearly mean              4               3               4
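
The arithmetic behind the table can be reproduced with a few lines of Python. This is only an illustrative sketch: the variable names are our own, and a missing count is represented by None.

```python
# Counts from the example table; None marks a missing count.
counts = {
    "site 1": [4, 3, 2],
    "site 2": [4, 3, 2],
    "site 3": [None, None, 8],
}
n_years = 3

totals, means = [], []
for year in range(n_years):
    # Only the sites that were actually surveyed in this year contribute.
    observed = [row[year] for row in counts.values() if row[year] is not None]
    totals.append(sum(observed))
    means.append(sum(observed) / len(observed))

# Index: yearly total relative to the first year, which is set to 100.
indices = [100 * total / totals[0] for total in totals]

print(totals)   # [8, 6, 12]
print(indices)  # [100.0, 75.0, 150.0]
print(means)    # [4.0, 3.0, 4.0]
```

Both the totals and the means hide the decline seen in sites 1 and 2, because site 3 enters the calculation only in year 3.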

To solve the problem by simply disregarding site 3 would be a waste of useful information, especially if site 3 continues to be surveyed in the years to come. It is a better solution to estimate (impute) the missing counts with sound statistical methods. Such an imputation makes it possible to compare the years in a fair way, ruling out artifacts and producing more reliable figures.

We use the predominant statistical technique to impute missing values in count data, viz. Poisson regression (log-linear models), as implemented in TRIM software (TRends and Indices for Monitoring data; Pannekoek & Van Strien, 2001). Poisson regression is also available in the generalized linear model modules of many other statistical packages. TRIM is an efficient implementation of Poisson regression to analyze time-series of count data collected in many sites and to produce indices and associated standard errors. It is a widely used freeware program (available via https://pecbms.info/methods/software/trim/).

TRIM implements several log-linear models to impute missing data. The basic model contains both site effects and year effects and estimates missing values from the data of all visited sites. The key assumption is that changes observed in surveyed sites also apply to non-surveyed sites. The next example shows the result.

TRIM produces the following values for the sites in the example:

                         year 1          year 2          year 3
site 1 count             4               3               2
site 2 count             4               3               2
site 3 count             estimated: 16   estimated: 12   8
yearly all-sites total   24              18              12
yearly indices           100             75              50

Changes in site 3 have been based on the changes in sites 1 and 2. It is clear that the yearly totals and indices now make sense. The same procedure is applied to impute values for sites that were surveyed in past years but are no longer surveyed.
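
The imputed values can be reproduced with a minimal sketch of the basic model, in which the expected count is the product of a site effect and a year effect (on the log scale: ln(expected count) = site effect + year effect). The code below is not TRIM's own algorithm. It is an assumed, simplified alternating update that solves the Poisson maximum likelihood equations for this model, and the function and variable names are hypothetical.

```python
# Counts from the example; None marks a missing count.
counts = [
    [4, 3, 2],        # site 1
    [4, 3, 2],        # site 2
    [None, None, 8],  # site 3
]

def fit_site_year_model(counts, n_iter=200):
    """Fit expected count = site[i] * year[j] using observed cells only.

    Each update makes the fitted total of one site (or one year) equal
    to its observed total, which is the Poisson likelihood equation
    for a log-linear model with site and year effects.
    """
    n_sites, n_years = len(counts), len(counts[0])
    site = [1.0] * n_sites
    year = [1.0] * n_years
    for _ in range(n_iter):
        for i in range(n_sites):
            obs = [j for j in range(n_years) if counts[i][j] is not None]
            site[i] = sum(counts[i][j] for j in obs) / sum(year[j] for j in obs)
        for j in range(n_years):
            obs = [i for i in range(n_sites) if counts[i][j] is not None]
            year[j] = sum(counts[i][j] for i in obs) / sum(site[i] for i in obs)
    return site, year

site, year = fit_site_year_model(counts)

# Keep observed counts; fill missing cells with the model prediction.
imputed = [
    [c if c is not None else site[i] * year[j] for j, c in enumerate(row)]
    for i, row in enumerate(counts)
]
totals = [sum(row[j] for row in imputed) for j in range(len(year))]
indices = [100 * t / totals[0] for t in totals]

print([round(v, 3) for v in imputed[2]])  # [16.0, 12.0, 8]
print([round(t, 3) for t in totals])      # [24.0, 18.0, 12.0]
print([round(x, 3) for x in indices])     # [100.0, 75.0, 50.0]
```

In this sketch the indices computed from the imputed totals coincide with the fitted year effects rescaled to 100 in year 1, which illustrates that the imputed values follow the changes in the surveyed sites rather than adding new information.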

Note that such imputation does not affect the trend estimation, because the missing values are calculated from the changes in sites with observations. Estimating missing values only serves a fair comparison between years. Also note that the aim is not to get reliable information on changes in site 3, but only to get reliable information on trends based on all available information. Imputed values are less valuable than real observations. The major drawback is that the more missing values occur in the data, the wider the confidence intervals of the indices will be. This is because imputed values don't enlarge the sample size; in the example, the sample size for the first two years is still 2.

The basic model may be elaborated by including covariates, such as habitat or region. Any changes between years for non-surveyed sites are then derived from changes in surveyed sites with a similar habitat or within the same region, thereby relaxing the assumption mentioned above. The incorporation of covariates may lead to a better model fit, better imputations and narrower confidence intervals for the resulting indices and trends. The penalty for not using such elaborate TRIM models is larger standard errors of the indices.

The usual approach to statistical inference for log-linear models is maximum likelihood estimation and associated calculations of standard errors and test statistics. These estimation and testing procedures are based on the assumption of independent Poisson distributions for the counts. Such an assumption is likely to be violated when animals are counted because the variance may be larger than expected for a Poisson distribution (overdispersion), for instance when the animals occur in colonies. Furthermore, counts are often not independently distributed because the counts at a particular point in time may depend on the counts at the previous time-point (serial correlation). TRIM uses procedures for estimation and testing that take into account these two phenomena.
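
To make the notion of overdispersion concrete, the sketch below computes a common estimate of the dispersion: the Pearson chi-square statistic divided by the residual degrees of freedom. Values well above 1 suggest more variation than a Poisson distribution allows. The function name and the numbers are hypothetical and serve only as an illustration; they are not taken from a real dataset, and the code is not TRIM's implementation.

```python
def pearson_dispersion(observed, fitted, n_params):
    """Pearson chi-square divided by the residual degrees of freedom."""
    chi2 = sum((y - mu) ** 2 / mu for y, mu in zip(observed, fitted))
    return chi2 / (len(observed) - n_params)

# Hypothetical counts with a single fitted mean (one parameter).
observed = [12, 3, 9, 0]
fitted = [6.0, 6.0, 6.0, 6.0]  # mean of the observed counts

dispersion = pearson_dispersion(observed, fitted, n_params=1)
print(dispersion)  # 5.0, i.e. variance about five times the Poisson variance
```

Under a Poisson assumption the expected value of this ratio is about 1, so standard errors based on a plain Poisson model would be too small for data like these.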

Apart from TRIM, a software tool called BirdSTATs is also available for the computation of population indices and trends. BirdSTATs is an open-source Microsoft Access database that is programmed to run TRIM automatically in batch mode to perform the statistical analysis for series of bird counts in the dataset.

BirdSTATs can import different kinds of count data, and it enables stratification of count sites and selection of subsets of the count data. It produces standardised TRIM input and command files and runs TRIM in batch mode for all or a selection of strata. Finally, it collects the output of the batched TRIM runs in a convenient, standardised format that fits the requirements of PECBMS.