Biodiversity Data Journal : General research article
Trends in access of plant biodiversity data revealed by Google Analytics
Corresponding author: Timothy Mark Jones (firstname.lastname@example.org)
Academic editor: Andreas Beck
Received: 21 Aug 2014 | Accepted: 04 Nov 2014 | Published: 11 Nov 2014
© 2014 Timothy Mark Jones, David G. Baxter, Gregor Hagedorn, Ben Legler, Edward Gilbert, Kevin Thiele, Yalma Vargas-Rodriguez, Lowell E. Urbatsch.
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0) which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation: Jones T, Baxter D, Hagedorn G, Legler B, Gilbert E, Thiele K, Vargas-Rodriguez Y, Urbatsch L (2014) Trends in access of plant biodiversity data revealed by Google Analytics. Biodiversity Data Journal 2: e1558. doi: 10.3897/BDJ.2.e1558
The amount of plant biodiversity data available via the web has exploded in the last decade, but making these data available requires a considerable investment of time and work, both vital considerations for organizations and institutions looking to validate the impact factors of these online works. Here we used Google Analytics (GA) to measure the value of this digital presence. In this paper we examine usage trends using 15 different GA accounts, spread across 451 institutions or botanical projects that comprise over five percent of the world's herbaria. Usage was studied both over a single year and over the full tracking period of each account. User data from the sample reveal: 1) over 17 million web sessions, 2) five primary operating systems, 3) search and direct traffic dominate, with minimal impact from social media, 4) mobile and new device types have doubled each year for the past three years, and 5) web browsers, the tools we use to interact with the web, are changing. Server-side analytics differ from site to site, making the comparison of their data sets difficult. However, use of Google Analytics erases the reporting heterogeneity of unique server-side analytics, as they can now be examined with a standard that provides clarity for data-driven decisions. The knowledge gained here empowers any collection-based environment, regardless of size, with metrics about usability, design, and possible directions for future development.
Biodiversity, big data, herbarium, Google Analytics, botany, museums, vascular plants, systematics, taxonomy, collections, digitization, web development, Kingdom Plantae
Herbaria are natural history museums that preserve collections of millions of specimens that offer a well established distributional model for a large-scale taxon (
The goal of this manuscript is twofold: to provide recommendations for current information managers and developers concerning the user interface and experience, and to provide a picture of the possible directions to take for those in charge of the creation of information at all levels. Online plant databases can facilitate the democratization of botanical information through their availability, via open information that exceeds the speed of retrieval from a cabinet or bookshelf. Specimens, including type specimens, no longer need to be shipped back and forth across the globe, thereby limiting wear and tear to these important biodiversity objects while eliminating shipping costs. And importantly, all researchers can now share equal access globally, without travel, to a well established model at kingdom level (
Understanding how taxonomic resources now provided via the World Wide Web (WWW) are used represents a new challenge. For this reason, we present data collected from contributors using Google Analytics, which functioned as a standard report (
We selected GA for website usage analytics for multiple reasons: 1) it is free to use and therefore widely adopted, 2) it is standardized, so analytics can be compared across institutional users, and 3) it tracks only human usage, as opposed to most server-side analytics programs, which track human and robot traffic indiscriminately.
What is Google Analytics?
A user directs a browser to a website that contains a tracking code. This tracking code, or script, leverages the information already being gathered by the browser, but it also writes a cookie back to the device that yields additional information that the browser cannot provide, such as time-on-site or page-views. The packaged set of collected data is then sent back to a Google server in the form of a GIF file. Lastly, the GIF file is interpreted and incorporated into reports.
Sites were selected for this study by searching Hyper Text Markup Language (HTML) source code of biodiversity websites for the presence of Google Analytics. After identifying sites of interest, Jones contacted curators, directors, and developers via email or phone. This process led to the inclusion of fifteen sites (
Participants and their start dates.
| Project | GA Start date | Participants | Website | Tracked analytic |
| Consortium of California Herbaria (CCH) | 2-May-07 | 30 | ucjeps.berkeley.edu | UA-1304595-1 |
| Consortium of North American Bryophyte Herbaria (CNABH) | 1-Jul-12 | 62 | bryophyteportal.org | UA-50594803-2 |
| Consortium of North American Lichen Herbaria (CNALH) | 17-Jul-12 | 59 | lichenportal.org | UA-50594803-1 |
| Consortium of Pacific Northwest Herbaria (PNW) | 20-Aug-11 | 24 | pnwherbaria.org | UA-29550699-1 |
| Cooperative Taxonomic Resource for American Myrtaceae (CoTRAM) | 8-May-11 | 5 | cotram.org | UA-19854426-5 |
| Global Biodiversity Information Facility (GBIF) | 28-Jun-13 | 172* | gbif.org | UA-42057855-1 |
| Herbario Virtual Austral Americano (HVAA) | 8-May-11 | 5 | herbariovaa.org | UA-19854426-4 |
| Jepson eFlora (Jepson) | 18-Nov-11 | 1 | ucjeps.berkeley.edu | UA-43909100-1 |
| Louisiana State University Herbarium Keys (LSU Keys) | 24-Aug-08 | 1 | herbarium.lsu.edu/keys | UA-1414632-44 |
| Offene Naturführer (ON) | 6-Nov-11 | 1* | offene-naturfuehrer.de | UA-27110487-1 |
| Southwest Environmental Information Network (SEINet) | 19-Nov-10 | 87 | swbiodiversity.org | UA-19854426-1 |
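The site-selection step, searching HTML source for the presence of GA, can be illustrated with a minimal Python sketch. It assumes that tracked pages embed an identifier of the form UA-account-property, as in the "Tracked analytic" column of the table. The function name `find_ga_ids` and the sample page are hypothetical; the study itself does not describe the tooling used for this search.

```python
import re
from typing import List

# Pattern for GA identifiers of the form "UA-1304595-1"
# (format inferred from the identifiers listed in the participant table).
GA_ID_PATTERN = re.compile(r"\bUA-\d+-\d+\b")


def find_ga_ids(html: str) -> List[str]:
    """Return distinct GA identifiers found in an HTML string, in order of first appearance."""
    seen: List[str] = []
    for match in GA_ID_PATTERN.findall(html):
        if match not in seen:
            seen.append(match)
    return seen


# Hypothetical page source, used for illustration only.
sample_html = """
<html><head><script>
  var _gaq = _gaq || [];
  _gaq.push(['_setAccount', 'UA-1304595-1']);
  _gaq.push(['_trackPageview']);
</script></head><body>Herbarium portal</body></html>
"""

print(find_ga_ids(sample_html))  # prints ['UA-1304595-1']
```

A page without any such identifier returns an empty list, which under this assumption would mark the site as not using GA.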
A total of four types of GA resources are charted (
Four variants of GA are represented in this study. Urchin is the first iteration of GA, derived from software developed by Urchin Software, which Google acquired in 2005. It is unique in that it employed multiple means of information gathering, using both server logs and multiple cookies. The second iteration, synchronous or traditional, released in late 2007, also used multiple cookies and required that the JavaScript (JS) load in a linear fashion, penalizing content in favor of tracking. Asynchronous came out two years later and allowed for faster loading of content, as the webpage loads first and the GA JS loads after content delivery. The latest variant, universal, addresses issues with mobile and the internet-of-things (emerging wearable devices and existing household appliances that can communicate via the web), as it can assimilate into reports any device that can contact a server.
Number of sessions – 17,198,976 sessions from inception (when each organization began tracking) were found across the 15 GA numbers (
One year of use, across all sites from June 01, 2013 to June 01, 2014, showing over 4.5 million sessions.
| Project | Sessions | Average Page Views | Average User Duration (min) |
Stable bounce rates – Bounce is defined as the user visiting the primary page only and then exiting. Bounces are not included across the statistics, as they are treated as zeros. All participants in the study show relatively stable bounce rates. See discussion (
Historical bounce rates of study participants as compared year by year from January 01 to January 01 (
Operating systems – The data revealed five major operating systems: Windows, Macintosh, Linux, iOS, and Android (
Historical operating systems to January 01, 2014 (
One year of operating systems from January 01, 2013 to January 01, 2014, showing same ON trend (
Yearly traffic* broken down by search, direct, referral, 'not set', and social (
Outreach – Each site's traffic favors its country of origin but all nations, territories, and/or commonwealths are represented across the sample (
Long-term outreach in countries, cities, and networks across variable project start dates through June 01, 2014.
One-year outreach in countries, cities, and networks from June 01, 2013 to June 01, 2014.
Mobile growth – Phone & tablet usage is steadily increasing for all resources (
Combined phone and tablet usage by percentage at log, showing emergence of mobile in 2010 in a changing landscape of device use (see
Ten top International Organization for Standardization (ISO) languages in use at Tropicos over six years; in order of percentage of usage (
Device types – The number of different device types has grown exponentially in recent years, from just a few types in 2010 to over 1500 in 2014 (
Tropicos showing the exponential growth of mobile device types over a five year period (
Consistent pattern of usage over seven years of returning users for each resource (
Browser Wars – Five web browsers are in a slow-motion knife fight for dominance (
Browsers and their design are vital to how we interact with the WWW. Browser usage at Tropicos from 2009 reveals a changing landscape in the user base of browsers. This same trend is seen at CCH, eFlora, LSU Keys, and SEINet. Nostalgically and historically, the Netscape browser is also noted in these data at a high of two percent (
Search, Direct, Referrals, and Social – Traffic types were examined in a one-year study (
Language – Tropicos demonstrated relatively stable language usage across the user base, with the dominant languages being English, Spanish, Brazilian Portuguese, French, German, and Chinese (
Returning Visitors Vs. New Visitors – Consistent usage demonstrated a stable regime of returning plant biodiversity data consumers (
Reinvention and re-purposing of traditional materials have enabled disciplines surrounding plant biodiversity to grow online, as these types of data are ideally suited for the web (
Total session time of 271 years in seven years. Total user duration since inception amounts to 271 years, derived by multiplying each project's sessions by its average session duration and converting the result to years of usage. *Caveat: projects denoted by asterisks are sub-sampled by GA because of their scale.
| Project | Sessions | Average User Duration (seconds) | Total Duration (years) |
|Total time||271 Years|
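The derivation of total duration, sessions multiplied by average duration and expressed in years, can be sketched in Python as follows. The per-project figures are hypothetical placeholders rather than values from the study, and a year is assumed to be 365.25 days.

```python
# Assumption: one year is taken as 365.25 days.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def total_duration_years(sessions: int, avg_duration_seconds: float) -> float:
    """Total time on site in years: sessions x average session duration (seconds)."""
    return sessions * avg_duration_seconds / SECONDS_PER_YEAR


# Hypothetical (sessions, average duration in seconds) pairs, NOT data from the study.
projects = {
    "Project A": (1_000_000, 300.0),
    "Project B": (250_000, 120.0),
}

for name, (sessions, avg_seconds) in projects.items():
    print(f"{name}: {total_duration_years(sessions, avg_seconds):.2f} years")

grand_total = sum(total_duration_years(s, d) for s, d in projects.values())
print(f"Total time: {grand_total:.2f} years")
```

Summing the per-project results in this way is how a grand total such as the 271 years reported in the table would be obtained.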
How a session is determined – A session starts when a browser requests a tracked webpage. For each session, time spent and page views are recorded via a cookie (on desktops, which account for 90% of these data). By default, each session expires after thirty minutes. If the user does not progress to another page, the visit is recorded as a bounce. For example, a researcher clicks on a webpage and then decides to eat lunch for thirty minutes, without clicking on anything after visiting the site. This would count as a 30 minute session, right? No, because they bounced.
Bounce rate – Bounces are not recorded as sessions since the user did not progress through the site after visiting the first page. For example, the same researcher uses the identical website again after lunch for 30 seconds and searches for Carex aurea, which returns a results page. This results page links to data-based specimen images, which the researcher, importantly, clicks on. The researcher is now three clicks and pages into the site, with a good broadband connection. Immediately after the third page loads, the researcher gets a phone call that lasts for 30 minutes. Here, because of the progression over three different web pages (two pages would count too), the session counts, and a bonus dwell time of 30 minutes is recorded in the report, while the actual session lasted only ~30 seconds. Nevertheless, total session duration remains informative because it allows for comparison, albeit with a somewhat blurry picture of what is actually happening due to the lunch problem. So, progression is the key to a session, as visits that do not progress do not count. This possibly skews overall results downwards, especially for those serving one-page websites such as blogs or apps.
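The session and bounce rules described above can be sketched in Python. This is a simplified illustration rather than GA's actual implementation: it assumes that a gap of thirty minutes or more between page views starts a new session, and that a session with a single page view is a bounce. The function names and timestamps are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List

# Default session expiry stated in the text.
SESSION_TIMEOUT = timedelta(minutes=30)


def split_into_sessions(hits: List[datetime]) -> List[List[datetime]]:
    """Group one visitor's page-view timestamps into sessions.

    Assumption: a gap of SESSION_TIMEOUT or more starts a new session.
    """
    sessions: List[List[datetime]] = []
    for hit in sorted(hits):
        if sessions and hit - sessions[-1][-1] < SESSION_TIMEOUT:
            sessions[-1].append(hit)
        else:
            sessions.append([hit])
    return sessions


def is_bounce(session: List[datetime]) -> bool:
    """A visit that never progresses past its first page is a bounce."""
    return len(session) == 1


# Hypothetical researcher: one page before lunch, three pages after lunch.
hits = [
    datetime(2014, 6, 1, 12, 0, 0),
    datetime(2014, 6, 1, 13, 0, 0),
    datetime(2014, 6, 1, 13, 0, 15),
    datetime(2014, 6, 1, 13, 0, 30),
]

summary = [(len(s), is_bounce(s)) for s in split_into_sessions(hits)]
print(summary)  # prints [(1, True), (3, False)]
```

The first visit is flagged as a bounce and would be excluded from the session statistics, while the second visit, which progresses over three pages, counts.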
Did that latest upgrade really do anything? – Additionally, when a user clicks on a directed event (campaign), new informational chains are instantiated. Campaigns are modifications to the JS that reveal supplementary information such as URL parameters that can identify a "web development push". FloraBase is unique in this sample, in that they are modifying their GA JS code to reveal additional parameters with their use. However, this can result in occasional double counting of sessions. This minor discrepancy is trivial when compared to the valuable information that can be gleaned from the data about the change in user behavior after an upgrade.
Bring your own device (B.Y.O.D.) or here comes mobile – 2013 was the first year that over one billion smartphones were shipped worldwide, and during this same time period only 300 million PCs were purchased (https://www.gartner.com/doc/2665319). Not so surprisingly, mobile growth has nearly doubled for the examined projects over the years examined (
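As an illustration of how campaign information can travel in URL parameters, the Python sketch below extracts campaign fields from a landing-page URL. It assumes the `utm_` query-parameter convention used for GA campaign tagging; the URL is hypothetical, and the source does not state which parameters FloraBase actually exposes.

```python
from typing import Dict
from urllib.parse import parse_qs, urlparse


def campaign_parameters(url: str) -> Dict[str, str]:
    """Return the utm_* query parameters of a URL (first value of each)."""
    query = parse_qs(urlparse(url).query)
    return {key: values[0] for key, values in query.items() if key.startswith("utm_")}


# Hypothetical landing-page URL tagged for a "web development push".
url = (
    "https://herbarium.example.org/search?taxon=Carex+aurea"
    "&utm_source=newsletter&utm_medium=email&utm_campaign=spring_upgrade"
)

print(campaign_parameters(url))
# prints {'utm_source': 'newsletter', 'utm_medium': 'email', 'utm_campaign': 'spring_upgrade'}
```

Sessions carrying such a tag could then be compared with untagged sessions to gauge the change in user behavior after an upgrade.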
Plants aren't social? – Overall, the amount of social media interaction was found to be trivial (
What not to do – While canvassing institutions for access to their GA accounts, a few unexpected issues arose concerning the administration of GA accounts:
Many institutions still rely only on server-based tracking. This inflates the data through the inclusion of bots or spiders that constantly scour the web to index pages for search or for other not-so-noble reasons. It was recently estimated that over half of all web traffic is now non-human or machine based (http://www.incapsula.com/blog/bot-traffic-report-2013.html), basically rendering those that use this server-log method data blind (
Next-generation of GA? – Upgrading any GA user to Universal GA requires the replacement of GA codes on all pages being tracked. A relatively new method that still requires a one-time total code replacement is the use of Google Tag Manager (GTM) (http://www.google.com/tagmanager/), as the International Plant Names Index (http://www.ipni.org/) is currently doing. GTM uniquely generates a script that permits future changes by functioning as an "analytic tattoo" for a website, thereby allowing for easy updating across all the deployed pages without wholesale replacement of all scripts. The tattooed script remains the same, but the instructions to that script are mutable, allowing for coding on-the-fly and for rapid experimentation across site(s). Surely, traffic for all biodiversity-based web sites would dwarf these figures for plant biodiversity sites alone. Then considering that less than five percent of all collections-based biodiversity information is now online (
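To illustrate why raw server logs overstate human usage, the following Python sketch separates probable bot requests from other requests by inspecting the user-agent string. The marker list is an illustrative assumption and far from exhaustive; real bot detection is considerably more involved, and this is not the method used by GA or by any participant in the study.

```python
from typing import List, Tuple

# Assumption: a few common substrings that self-identifying crawlers include.
BOT_MARKERS = ("bot", "spider", "crawler")


def is_probable_bot(user_agent: str) -> bool:
    """Heuristic check: does the user-agent string self-identify as a crawler?"""
    lowered = user_agent.lower()
    return any(marker in lowered for marker in BOT_MARKERS)


def split_traffic(user_agents: List[str]) -> Tuple[int, int]:
    """Return (probable human hits, probable bot hits) for a list of user agents."""
    bots = sum(1 for ua in user_agents if is_probable_bot(ua))
    return len(user_agents) - bots, bots


# Illustrative user-agent strings as they might appear in a server log.
sample_agents = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (Windows NT 6.1; rv:31.0) Gecko/20100101 Firefox/31.0",
    "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
]

print(split_traffic(sample_agents))  # prints (1, 2)
```

Bots that do not identify themselves pass through such a filter unnoticed, so counts from server logs remain inflated even after this kind of cleaning.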
The authors would like to thank Chuck Miller at the Missouri Botanical Garden, for taking the time. We would also like to thank Rod Page & Tim Hirsch for quickly providing a global dataset with Global Biodiversity Information Facility; and Corinna Gries and Les Landrum for the sharing of their GA data from their resources. Plus thanks to Barbara Thiers, of New York Botanical Garden, for the sharing of Index Herbariorum georeferenced data. Greatly appreciated are the contributions of Pedro Lake for the constant editing of this MS. And thank you Mary Barkworth for the discussion that started this chapter.
Tim Jones contacted David Baxter, Ed Gilbert, Tim Hirsch, Ben Legler, Chuck Miller, Rod Page, and Kevin Thiele, for the sharing of GA account information. David Baxter provided all information for CCH and Jepson via Google Sheets (https://docs.google.com/spreadsheets/d/19Rvea4-qtOXEUKBu3c0nEOJo2IfzbSkuQpn83x6Argg/edit?usp=sharing).
Georeferenced list of world's herbaria
The data for all years prior to 1981 were taken from the herbarium's annual report to the Utah Agricultural Experiment Station. Initially, only specimen growth was included in these reports. With time, we started tracking additional aspects. We have never included our GA data in the report. This is something we should have added when we first installed the software on our pages, but we did not. We no longer have easy access to the web site and the GA data.
Total of page, user, and duration
Different devices used on Tropicos over the past year by model and manufacturer.
List of herbaria and specimen numbers in respective institutions
Bounce rates by years
Years are determined by using January 01 (or start date of that year) to January 01
From January 01, 2013 to January 01, 2014
Long and short term operating systems across top-five operating systems.
Top five languages over time at Tropicos
Search, direct, referrals, not set, and social
Browser percentage by years at Jan. 01 to Jan. 01.
Percent returning sessions.