Monday, 16 April 2018

SABAP2 The Movie maps in R

SABAP2 'The Movie' maps in R

Professor Les Underhill's vision for SABAP2 would be that it would continue to collect data in perpetuity in order to examine patterns in space and time: SABAP2 The Movie.

Now, with 10 years of data and counting, we can create these movies. There are many stories that wait to be told.

The existing SABAP2 map is static: it is the result of compounded lists across the 10 year history. This hides any dynamic patterns. The first and most obvious is seasonality, especially for migratory species. First, we will look at the monthly occurrence of a migratory species, building on what we did here:

# These are the packages required to run these scripts
library(ggplot2); library(dplyr); library(gganimate); library(animation) 
# gganimate is still only available via GitHub at this stage:

# You will need to download and install the free image processing software ImageMagick: 

# Tell R where ImageMagick lives:
magickPath = shortPathName("C:\\Program Files\\ImageMagick-7.0.7-Q16\\magick.exe") 

# Download your data here by replacing the 493 here with the species code of your choice (or pasting the link)
swallow =
  read.csv("", stringsAsFactors = F)

# restrict the data to southern Africa (the region with the best coverage)
swallow = filter(swallow, lat<(-15))

# add reporting rate
 swallow$reporting_rate = swallow$cards_with_spp/swallow$cards

The new code starts here. First, we use the raw data to create a background map of where we have bird lists (coverage) using the atlas data. We then create a new dataframe of average reporting rate by month. Finally, we plot the black coverage map and overlay reporting rate (on a gradient from black with nothing, to red i.e. high reporting rate). Pentad size here is proportionally larger than in reality, which helps fill in atlasing holes. This is a movie after all, not an analysis.

#first, background pentad map
pentad_map_data = swallow%>%select(pentad, lat, lng)%>%distinct()

# now dataframe for mapping
annual_pattern = swallow%>%group_by(mnth, lng, lat)%>%summarise(rr=mean(reporting_rate))%>filter(rr>0)

swallow_space_time =
    ggplot(pentad_map_data, aes(lng, lat))+
      geom_point(shape=15)+ylim(min(swallow$lat), -15)+xlim(min(swallow$lng), 37)+
      theme_bw(base_size = 15)+coord_equal()+xlab("Longitude")+ylab("Latitude")+

      geom_point(data=annual_pattern, aes(lng,lat, colour=rr, frame=mnth),shape=15)+
      scale_colour_gradient(low="black", high="#de2d26", guide="none")+
      ggtitle("Spatial distribution of Barn Swallow for month ")

 suppressMessages(gganimate(swallow_space_time, interval=1))

A static map is also not appropriate if a species is nomadic: range will change from year to year depending on rainfall and resources. This is especially the case for several species of the arid zone. One of the most common and familiar is the Lark-like Bunting Emberiza impetuani. Here we examine year by year reporting rate for this species, revealing a remarkeably dynamic range (as you will see when we compare this to an endemic bird species below). SABAP2 data reveals an extraordinary irruption of Lark-like Bunting in 2013 that went largely unmentioned in South Africa's ornithological literature.

# get the data:
llbunting =
  read.csv("", stringsAsFactors = F)

llbunting = filter(llbunting, !yrr%in%c(2007, 2018))
llbunting$reporting_rate = llbunting$cards_with_spp/llbunting$cards
annual_pattern = llbunting%>%group_by(yrr, lng, lat)%>%

# now a spatial map

llbunting_space_time =
    ggplot(pentad_map_data, aes(lng, lat))+
    geom_point(shape=15)+ylim(min(llbunting$lat), -15)+xlim(min(llbunting$lng), 37)+
    theme_bw(base_size = 15)+coord_equal()+xlab("Longitude")+ylab("Latitude")+

    geom_point(data=annual_pattern, aes(lng,lat, colour=rr, frame=yrr),shape=15)+
    scale_colour_gradient(low="black", high="#de2d26", guide="none")+
    ggtitle("Spatial distribution of Lark-like bunting for year ")

suppressMessages(gganimate(llbunting_space_time, interval=2))

By comparison, the 'movie' for Cape Sugarbird is far less interesting, with the exception of perhaps of the odd individual wondering out of range.

# get the data:
csb =
  read.csv("", stringsAsFactors = F)

csb = filter(csb, !yrr%in%c(2007, 2018))
csb$reporting_rate = csb$cards_with_spp/csb$cards
annual_pattern = csb%>%group_by(yrr, lng, lat)%>%
# now an animated spatial map csb_space_time = ( ggplot(pentad_map_data, aes(lng, lat))+ geom_point(shape=15)+ylim(min(csb$lat), -15)+xlim(min(csb$lng), 37)+ theme_bw(base_size = 15)+coord_equal()+xlab("Longitude")+ylab("Latitude")+ geom_point(data=annual_pattern, aes(lng,lat, colour=rr, frame=yrr),shape=15)+ scale_colour_gradient(low="black", high="#de2d26", guide="none")+
ggtitle("Spatial distribution of Cape Sugarbird for year ")

suppressMessages(gganimate(csb_space_time, interval=2))

And finally, we watch the spread of atlasing coverage over the last decade, coupled with the presence of Pied Crow. This chart is a cumulative one: each year laid over the previous.

pied_crow =   read.csv("", stringsAsFactors = F)

pied_crow$rr = with(pied_crow, (cards_with_spp/cards))

annual_pattern = pied_crow%>%group_by(yrr, lng, lat)%>%summarise(rr=mean(rr))

pied_crow_space_time = 
      ylim(min(pied_crow$lat), -15)+xlim(min(pied_crow$lng), 37)+
      geom_point(data=annual_pattern, aes(lng,lat, colour=rr, frame=yrr, cumulative=T),shape=15)+

      theme_bw(base_size = 15)+coord_equal()+xlab("Longitude")+ylab("Latitude")+

      scale_colour_gradient(low="black", high="#de2d26")+
ggtitle("Coverage and Pied Crow CUMULATIVE year ")

gganimate(pied_crow_space_time, interval = 1)

Illustrating inter-annual trends of SABAP2 reporting rate in R

Illustrating inter-annual trends of SABAP2 reporting rate in R

In my last post we learnt how to animate annual trends in monthly reporting rate. But was there any change in overall reporting rate across years?

So how is reporting rate changing between years for Barn Swallow Hirundo rustica. We can't really tell from the species summary page:

For this exercise you will need the data at the link marked “Pentad level summary (monthly)” under the Data downloads options.

For the animated chart you will need to download and install the free image processing software ImageMagick:

# These are the packages required to run these scripts
library(ggplot2); library(dplyr); library(gganimate); library(animation) 

# Tell R where ImageMagick lives:
magickPath <- ani.options="" code="" convert="magickPath)" files="" magemagick-7.0.7-q16="" magick.exe="" rogram="" shortpathname="">

Here we prepare the data (for explanation of these steps see previous post). I've included this so the code can run as a stand-along script.

# Download your data here by replacing the 493 here with the species code of your choice (or pasting the link)
swallow <- africa="" best="" coverage="" data="" filter="" group_by="" http:="" inc="" lat="" occurrence="" of="" pentad="" range="" read.csv="" region="" restrict="""" southern="" species="" species_data_download.php="" spp="493&section=6" stringsasfactors="F)" swallow="" the="" to="" with="">%

occurrence <- add="" amp="" cards="" cards_with_spp="" code="" filter="" occurrence="" ok="" rate="" reporting="" reporting_rate="" swallow="">

The new code starts here. First, it is a good idea to remove data from 2007 (and maybe even 2008). 2007 is the year SABAP2 got going: submission rates and spatial coverage were low. Also remove the current year.

# exclude unwanted years:
swallow <- 2018="" a="" and="" by="" c="" error="" filter="" group_by="" here="" in="" include="" measure="" now:="" of="" rate="" reporting="" standard="" summarise="" summary="" swallow="" year="" yrr="">%
summarise(mean=mean(reporting_rate), sd=sd(reporting_rate), n=sum(cards_with_spp), se=sd/sqrt(n))

chart_title = c("SABAP2 reporting rate trend for \n Barn Swallow")

p <- 2018="" aes="" base_size="14)+" chart="" code="" coord_cartesian="" cumulative="T))+geom_bar(stat=" ear="" eporting="" frame="yrr," geom_errorbar="" ggplot="" identity="" is="" labs="" mean="" p="" proportion="" rate="" static="" summary="" the="" theme_bw="" this="" title="chart_title)+" xlab="" xlim="c(2007," ylab="" ymax="mean+se))" ymin="mean-se," yrr="">

# Here I add a regression line across the interannual mean reporting rate values
q = p+
geom_smooth(data=summary, aes(yrr, mean),   method="lm", color="red")


# The animated version  
gganimate(q, interval = .5)

# Save the above as gif: suppressMessages(gganimate(q, interval=6, "SWALLOW_barchart_time_series.gif"))

# A simple linear regression suggests no significant change. 
# Obviously real life may be much more complicated than this suggests.   

summary(lm(mean~yrr, data=summary))
## Call:
## lm(formula = mean ~ yrr, data = summary)
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.042814 -0.015509  0.005625  0.024023  0.025913 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  8.854237   6.314515   1.402    0.198
## yrr         -0.004226   0.003138  -1.347    0.215
## Residual standard error: 0.0285 on 8 degrees of freedom
## Multiple R-squared:  0.1848, Adjusted R-squared:  0.08293 
## F-statistic: 1.814 on 1 and 8 DF,  p-value: 0.215

##  Try a more statistically robust approach to interannual change in reporting rate
## using a clt / bootstrapping approach with random samples

#initialise the target dataframe:

df = data.frame(year=2007, rr=NA)

# Run a loop that selects each year, removes sites with low sampling effort, 
# and then subsamples a random set of the available data 1000 times

for(i in 2008:2017){
  temp = filter(swallow, yrr==i&cards>2)%>%select(reporting_rate)
  holder = rep(NA, 1000)

    for(b in 1:1000){
    temp2 = sample(temp$reporting_rate, 100, replace = F)
    holder[b] = mean(temp2)

  holder2 = data.frame(year=i, rr=holder)
  df = rbind(df, holder2)

df = filter(df, year!=2007)

# plot it:

ggplot(df, aes(year, rr))+
  geom_smooth(method="lm", colour="red")+

Friday, 13 April 2018

Animated SABAP2 reporting rate timelines

Making animated monthly reporting rate charts from SABAP2 public data in R

Currently you can get a lot of information from the SABAP2 website. For example, go here to find information on Barn Swallow Hirundo rustica.

The aim of this post is to animate the current static monthly reporting rate chart to the right of the map. As there are now many years of data, it is hard to understand what is going on in the static chart now.

For this exercise you will need the data at the link marked “Pentad level summary (monthly)” under the Data downloads options. Note this file is fairly large (8 Mb), so download can take a minute or two, depending on your bandwidth.

You will need to download and install the free image processing software ImageMagick:

In addition gganimate is currently only available on github

# These are the packages required to run these scripts
library(ggplot2); library(dplyr); library(gganimate); library(animation) 

# For me (but maybe not for you), I need to tell R where ImageMagick lives:
magickPath <- shortPathName("C:\\Program Files\\ImageMagick-7.0.7-Q16\\magick.exe") 
#update your path as necessary
# Download your data here by replacing the 493 here with the species code of your choice (or pasting the link)
swallow <- read.csv("", stringsAsFactors = F)

Have you got the data?

Here is a quick chart to make sure you have the data, a histogram of the number of full protocol cards submitted by year


Since SABAP2 now extends into Africa on a fairly ad-hoc basis, you may wish to restrict your analysis to the southern African region by running swallow <- filter(swallow, lat<(-15)).
Most of the atlased area (currently in your data frame) may well be outside the range of the species, especially if you are dealing with a range-restricted or endemic species. If we use all data, then we will have very low reporting rates. This is likely not what you want: you want to know reporting rate from within a species range. So this step restricts the dataset to the set of pentads where the species has ever been recorded.

occurrence <- group_by(swallow, pentad) %>%

# now get rid of pentads with ok ==0 i.e non occurrence. 
occurrence <- filter(occurrence, ok!=0) 
Unlike some of the data sets available for download, this one does not come with reporting rate. So we need to calculate that. It will be a proportion (between 0 and 1).

 swallow$reporting_rate <- swallow$cards_with_spp/swallow$cards
#lots of 0 and 1: artefacts of coverage where pentads surveyed only once can only be 0 or 1

This is something like what the chart on the SABAP2 website currently looks like:

ggplot(data = group_by(swallow, yrr,mnth)%>%
  , aes(mnth, mean_reporting_rate,colour=factor(yrr)))+geom_line()

So lets animate that.

# we need a time series value that combines year and month

swallow$yrr_mnth <- swallow$yrr+round(swallow$mnth/100, 2)

# Now create the chart
p <- ggplot(data = group_by(swallow, yrr, mnth, yrr_mnth)%>%
    filter(pentad%in%occurrence$pentad  & yrr%in%c(2009, 2012, 2015, 2017))%>% # displaying only a few years for clarity
  aes(mnth, mean_reporting_rate,colour=factor(yrr), frame=yrr_mnth,cumulative=T, size=yrr/2009))+
  geom_path(alpha=0.5)+ggtitle("Monthly reporting rates for Barn Swallow for ") +
  theme_bw(base_size = 14)

# Display the animated plot. Interval controls the speed. 
suppressMessages(gganimate(p, interval=0.5))
If you want to save the chart as a gif file run this. Various other formats exist: check the animation package documentation.

suppressMessages(gganimate(p, interval=0.5, "barn_swallow_time_series.gif"))

Tuesday, 27 March 2018

Estimating conservation metrics from atlas data: the case of southern African endemic birds - the images

The publication "Estimating conservation metrics from atlas data: the case of southern African endemic birds" did not include several charts in color and so they are hard to interpret. Here they are for anyone who is interested.

Figure 2: Reporting rate change for 58 South African endemic bird species plotted against change
in reported range between SABAP1 and SABAP2. Point size represents the absolute value of
the mean z-score.

Figure 3. Reporting rate change for seven South African endemic bird species plotted against
change in range between SABAP1 and SABAP2. This figure shows the lower left hand corner of
Figure 2 in more detail - species qualifying as those of conservation concern due to range and
population decrease. Size of the points is weighted by mean z-score.

Figure 4. Presence/absence ratios for 58 endemic bird species for each atlas period. Species on the
negative end of the x-axis are generally infrequently reported, while those on the positive side are
commonly reported: negative values indicate species reported from less than 50% of cells. Shading
represents the 95% confidence interval of the regression between the values on the two axes.
Species below the 1:1 line (black diagonal) are species reported less frequently in SABAP2.
Selected species classified as Least Concern with a lower reporting rate in SABAP2 are labelled, as
are selected species with threatened status with higher reporting rates in SABAP2.

Figure 5. Mean population change across all species within each grid cell (left panel). N for each
grid cell is indicated by endemic species richness (right panel). Grids not included in this analysis
due to insufficient coverage (<2 are="" atlas="" black="" both="" for="" lists="" p="" periods="" points.="" white="" with="">

Figure 6. Connectivity (left-hand panels) and corrected connectivity (connected score/log(range);
right hand panels) for southern African endemic bird species. The lower two charts are the lower
left sections of the upper charts, indicating species with small ranges and low connectivity; QDGC
= quarter degree grid cells.

And these didn't make it into the manuscript, but are interesting:

Firstly median reporting rate within a species range:

And how large is the species range:

Why and How does the South African Bird Atlas Project (SABAP2) work in getting conservation metrics?

SABAP1 and 2 are two amazing citizen science projects: millions of bird records compiled by birdwatchers and submitted to a central database managed at the University of Cape Town.

Birdwatchers are requested to submit lists after surveying a set area (pentad) for a minimum of 2 hours, ideally covering all habitats within the 8x8 km pentad. These are known as full protocol cards.

The data can be used for many purposes, but my interest is focused on the information that can be acquired for supporting conservation decision making.

For this we need an index of relative abundance: how common is the bird? May seem like a simple question: but it isn’t. There are many problems that need to be overcome in interpreting the data that is part of the listing process.

In its simplest form, the basic unit of relative abundance is known as the ‘reporting rate’. This is the proportion of a set of lists that a species appears on. For instance, for a pentad, if 10 lists have been submitted, and a species is recorded on 6 of these, then the reporting rate is 6/10 = 0.6 (or 60%).

So our first challenge: what happens if a pentad has been surveyed only once? Then reporting rate can only be 0 (not recorded) or 1 (the species is recorded). These numbers are not very useful, so the more lists for a pentad, the better: the closer we come to a realistic probability that you will record a species should you go to a site.   Unfortunately, most of southern Africa is either inaccessible, or has few bird watchers. So much of the country is either not visited, or has been visited only once.

Here is a chart that illustrates this for the first SABAP1 conducted from 1987-1992 and SABAP2 which was initiated in 2007 and is ongoing. Here, the sampling unit is the Quarter Degree Grid Cell (QDGC), the sampling unit of SABAP1: each QDGC contains 9 pentads. The x axis (number of cards) is truncated at 50 in each case: some pentads have many more cards than this.

Both SABAP1 and SABAP2 have a very large number of QDGCs sampled only once. As the sampling effort of SABAP2 is now higher than SABAP1, there are a very large number of cells that have been visited only once (nearly 1500!).

In this paper below I focused on calculating single index parameters of change to compare how species were doing between atlas periods:

Estimating conservation metrics from atlas data: the case of southern African endemic birds

For reporting rate change, I calculate the mean reporting rate across all QDGCs for SABAP2, subtract the mean of reporting rate from SABAP1, then divide that results by the mean of the reporting rate from SABAP1. A positive number indicates an increase in reporting rate (inferring increased abundance), while a negative number infers decreased abundance. For example, if reporting rate for SABAP2 was 100%, but only 50% for SABAP1, then the change is 100% (i.e. 100-50 / 50) i.e. a doubling in probability of reporting. Zero means no change. It might seem complicated, but it is the simplest way to look at change: there are many many more complicated ways. However, reporting rates make sense to most people, since they are proportions and percentages. Other metrics can get rather esoteric (z-scores, standardized population change, occupancy modelling probability output). And rather importantly, there is good agreement between these in terms of what they are telling us.

To create summary reporting rate figures, I did not want to use QDGCs that had been surveyed only once to avoid the bias of cells with reporting rate of 0 or 1 only, so I filtered data to include QDGCs that were filtered only twice or more for both atlas periods.

But what happens if I had not filtered at all, or if I had filtered more?

Here I illustrate what the difference is in overall reporting rate between atlas periods using different filters, from no filter all the way to using only the subset of QDGCs with 25 or more lists. My study subject of choice is the African Black Oystercatcher, South Africa’s Bird of the Year for 2018. This coastal bird has been recorded in 177 QDGCs over the course of the two atlas projects.

We can see in the above chart that without applying a filter that reporting rate change is the lowest value: 11%, while filtering the data so at least 1 card was atlased during both projects gives us the highest value (c18%). After that it is fairly stable: around 15% increase, no matter what the filter.

Similarly, range change shows an odd result for unfiltered data: here range change is negative for unfiltered data (implying a range contraction), but hovering around a 10% increase in range for most other filters. So it looks like filtering is a good idea to get rid of some of that instability introduced with too many 1s and 0s. A minimum filter would be 2 or more lists for both atlas projects.

But what about sampling bias? This is a big problem with atlas data. Most atlasing is conducted around urban areas, and these are likely not representative of the wider landscape. Also, we need to deal with spatial autocorrelation, because sites close to each other are likely not independent. So in order to get a better idea of how much reporting rate has changed between atlas periods, here I appeal to the central limit theorem and sub-sample randomly using 50 QDGCs from the range of the Oyc 1000 times i.e. I calculate a reporting rate change based on 1000 random draws from the Oyc dataset, each time calculating a change in reporting rate value. I get an answer with mean of 16% plus or minus 9%. So a 15% increase seems reasonable using the entire data set wasn’t too bad. To think of it simply, you are 15% more likely to see an Oyc now in its range than you were during SABAP1. It isn’t a huge increase, but it is on the positive side, which is good news for Oycs.

For the above random sampling, 50 seems like a good minimum value: lower than this and standard deviation starts to increase i.e. we start to become less sure about where a reasonable answer lies. This is because large numbers are key to this process, and so inference for species with small ranges is a bad idea when it comes to using atlas data.

Let’s get back to that Law of Large numbers. The law of large numbers is a principle of probability according to which the frequencies of events with the same likelihood of occurrence even out, given enough trials or instances. As the number of samples increases, the actual ratio of outcomes will converge on the theoretical, or expected, ratio of outcomes. I.e. The Law of Large Number states that when sample size tends to get very large, the sample mean equals the population mean.

With the SABAP projects, we have large numbers. Almost certainly, some of the answers we are getting from relative abundance at the pentad level will be wrong, but using them all together we start to satisfy the law of large numbers. Here I illustrate for a species with theoretical relative abundance of 65% how many pentads we would require that have been sampled 4 times (giving us pentads with possibilities of 0, 1, 2, 3, 4) before the mean value starts to become acceptable enough that we can be confident it represents the ‘real’ value. Note that the confidence intervals are pretty narrow by about 30 pentads already.

And the last question I will answer in this post is: how does presence/absence data (binomial distribution) i.e. 0s and 1s eventually create a value that is between 0 and 100? Here, the Central Limit Theory is what is important.  The Central limit Theorem states that when sample size tends to infinity, the sample mean will be normally distributed.  For our example of pentads with 4 checklists, see how quickly the distribution becomes normal. Here the p value is a significant deviation away from normal. We get lift off (i.e. normal data) once we start hitting 30 pentads. Certainly then, 30 (pentads or QDGCs) is the minimum that would be required to make inference from atlas data for a population.

Don’t forget though: single value conservation metrics are not the whole picture, they are simply something easy to grasp in a complex world. But more about complex modelling another day.

Related Posts Plugin for WordPress, Blogger...