Biodiversity Data Journal :
Research Article
Corresponding author: Alison Specht (a.specht@uq.edu.au)
Academic editor: Quentin Groom
Received: 29 Jun 2018 | Accepted: 29 Oct 2018 | Published: 07 Nov 2018
© 2018 Alison Specht, Matthew Bolton, Bryn Kingsford, Raymond Specht, Lee Belbin
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Specht A, Bolton M, Kingsford B, Specht R, Belbin L (2018) A story of data won, data lost and data re-found: the realities of ecological data preservation. Biodiversity Data Journal 6: e28073. https://doi.org/10.3897/BDJ.6.e28073
This paper discusses the process of retrieving and updating legacy data to allow on-line discovery and delivery. There are many pitfalls of institutional and non-institutional ecological data conservation over the long term. Interruptions to custodianship, old media, lost knowledge and the continuous evolution of species names make resurrection of old data challenging. We caution against technological arrogance and emphasise the importance of international standards.
We use a case study of a compiled set of continent-wide vegetation survey data for which, although the analyses had been published, the raw data had not. In the original study, publications containing plot data collected from the 1880s onwards had been gathered, interpreted, digitised and integrated for the classification of vegetation and analysis of its conservation status across Australia. These compiled data are an extremely valuable national collection that demanded publishing in open, readily accessible online repositories, such as the Terrestrial Ecosystem Research Network (http://www.tern.org.au) and the Atlas of Living Australia (ALA: http://www.ala.org.au), the Australian node of the Global Biodiversity Information Facility (GBIF: http://www.gbif.org). It is hoped that the lessons learnt from this project may trigger a sober review of the value of endangered data, the cost of retrieval and the importance of suitable and timely archiving through the vicissitudes of technological change, so the initial unique collection investment enables multiple re-use in perpetuity.
data conservation, data retrieval, legacy data, data curation, long-term data accessibility
An argument without evidence is mere assertion (
The scale of past data collections is often beyond today’s means so replication may be nearly impossible. Cook’s various explorations of the Pacific, Humboldt’s expedition to South America, and Darwin’s voyages in the Beagle required great planning, the assembly of many personnel across many disciplines, and occurred over great distances and time. Data-collecting expeditions of similar scale would be prohibitively expensive to launch in modern times (
These famous expeditions are mere examples of an abundance of organised data collections made over the centuries. We benefit from only a fraction of this knowledge as a huge mass of data from the filing cabinets and the computers of scientists and research teams, despite the best intentions, are poorly described and managed, unavailable, or completely lost (
Routine long-term data collection and its ongoing management and conservation are often a low priority in policy-driven government departments, while physical and digital data storage has been increasingly ‘rationalised’ as data custodians have been made redundant and agencies are either downsized, re-structured or abolished (
Recovery of past data is a difficult challenge depending on how the data have been stored (see
Data may further be broken up across multiple files, in various formats, and may violate basic principles of current best-practice data structures. Although the principles of relational database design were well established by the 1980s (
Data communities (e.g. Data Science Central: https://www.datasciencecentral.com and the Research Data Alliance: https://www.rd-alliance.org), data repositories (e.g. PANGAEA: https://www.pangaea.de; the Australian Antarctic Data Centre: https://data.aad.gov.au; the Knowledge Network for Biocomplexity: https://knb.ecoinformatics.org; DRYAD: https://datadryad.org; the Atlas of Living Australia and the Terrestrial Ecosystem Research Network) and data management support initiatives such as DataONE have been developed to facilitate systematic data sharing and long-term data preservation by scientists. Such good intentions will require, however, consistent advocacy and ongoing monitoring.
Ecological data present a particular challenge for management and preservation because they are:
The heterogeneity of ecological datasets is arguably a consequence of the nature of the profession. A survey of 751 Australian ecologists in 2011 produced more than 160 self-identified sub-categories of ‘ecologist’ (
Although great strides have been made in the past twenty years towards the routine publication of data, properly described, protected and archived for future use, the recovery of past ecological data remains in its infancy. Synthesis centres such as NCEAS, CESAB, sDiv, John Wesley Powell and ACEAS (see www.synthesis-consortium.org) support ecological analyses that only use existing data (
We present a case study of a continental set of ecological data that has had a long history of recovery and digitisation: once in the 1980s-90s and again this century. Through this example, we illustrate the challenges imposed by changing norms of publication and technology, show the benefits of deposition in a curated repository and provide some guidance for data management.
The chosen case study arose at the dawn of ‘Big Data in Biology’ sensu
Published data in refereed journal articles and ‘grey’ literature (i.e. government and research reports) were retrieved in hard copy (Fig.
The workflow from collation of original documents (A) through the publication of the ‘Conservation Atlas’ (E) to the retrieval project (G). The first step was to extract and digitise data from written publications (A-B). Due to the computing limitations of the time, it was necessary to split the data into sub-files (B and C) for analysis (D) which was the aim of the original project ('The Conservation Atlas' 1975-1995). Storage throughout the Conservation Atlas project was in both hard copy printouts and digital form. The ‘mainframe’ computers referred to were those from the PDP-10 computer family through the University of Queensland computer centre. The magnetic tapes were used as backup storage from the PDP-10s and the Exabyte tape was used to store the data from the magnetic tapes at the end of the Conservation Atlas project.
Note: Letters are used to facilitate reference to the figure from the text. The temporal axis is not to scale.
Due to the computational limitations of the time, the data were organised according to vegetation formation (e.g. forests, sclerophyll vegetation, mallee;
Numbers of sites and species in each vegetation formation in the initial project. These numbers include species that occur in more than one vegetation formation.
* = Not including introduced species or singletons within the formation; ** = Not including tree species >10 m tall
Formation | Locations | Communities | Species* |
Closed forests | n/a | 644 | 1,418 |
Dry scrubs – SE Queensland | 232 | 232 | 475 |
Dry scrubs – Northern Territory | n/a | 1,219 | 559 |
Eucalypt open-forests and woodlands (tree species) | 201 | 1,275 | 276 |
Sclerophyll vegetation SW Western Australia | 64 | 172 | 1,761 |
Sclerophyll vegetation Central and Eastern Australia | 188 | 549 | 2,581** |
Sclerophyll vegetation – heathland and tall shrubland | 136 | 312 | 2,071** |
Alpine vegetation | 73 | 61 | 556 |
Savannah understorey | 56 | 198 | 1,313 |
Mallee open-scrub | 28 | 41 | 395 |
Desert Acacia | 54 | 148 | 1,229 |
Chenopod shrubland | 30 | 68 | 410 |
Forested wetlands (including brigalow) | 31 | 36 | 193 |
Arid wetlands | 20 | 42 | 642 |
Freshwater swamp vegetation | 80 | 80 | 139 |
Coastal dune vegetation | 45 | 56 | 315 |
Coastal wetland vegetation (mangroves and saltmarshes) | n/a | 15 | 74 |
Once entered and organised, the data were analysed to define floristic associations using the non-parametric programmes TAXON (
In 1991, when the mainframe computers at the University of Queensland were de-commissioned, the data from four of the five magnetic tapes – only readable on the PDPs – were transferred to Exabyte tape, considered the best option at the time. The information on the fifth tape could not be retrieved. The company making Exabyte tapes ceased operations in 2006 (Fig.
Physical copies of the original papers, various analyses and data files were stored in Ray Specht’s house when he retired (Fig.
Illustration of the data resources available to the retrieval project: (i) a sample of the boxes of original copies of papers and reports (A), (ii) a table extracted from a publication prepared for data entry (B), (iii) a sample of the hard copy printouts showing alphanumeric lists of species under each location and community (C), (iv) the magnetic tapes on which backups were kept from day to day during the 1980s project (D), and (v) an Exabyte tape on to which the data from the magnetic tapes were transferred in 1991 (E).
The retrieval project (Fig.
The first challenge was to develop a system for checking and updating the species names used at the time of the ‘Conservation Atlas’ data collection. The most efficient and relevant mechanism to do this was through a web-service interface with the ALA (see http://api.ala.org.au, accessed 3 May 2018) which is the relevant authority for Australian species (see https://www.rbg.vic.gov.au/science/projects/taxonomy/atlas-of-living-australia-national-species-lists-project, accessed 11 December 2017).
The plot-based, species structure of the original data was converted to individual observations of species with freely associated data, such as location, date and time, observer, vegetation classification, source and team comments. We wanted to ensure that no information was lost in re-structuring the data for publication using the widely-supported Darwin Core Standard (
Diagrammatic representation of the workflow for retrieval of data from the original reference files (A). These files were separated into two parts for editing influenced by the 1980s organisation of the data: (i) information on the sites at which data were collected (B), and (ii) the species lists, which were updated through the Biodiversity Information Explorer, BIE (http://bie.ala.org.au/ws) (C). Once these components were updated, they were re-assembled using Darwin Core standards (D) to enable delivery through a data portal (in this case the Knowledge Network for Biocomplexity, KNB: https://knb.ecoinformatics.org). Ecological Metadata Language (EML) was used to describe the dataset.
The Darwin Core standard (
When data retrieval began, the comprehensive computer printouts were the only information source immediately available (Figs
A total of 461 locations (135 of these had multiple survey sites within each broad location) were identified from the paper copies and these provided a checklist and structure for the future data compilation. After locating an Exabyte tape reader (not an easy matter either), we found that the tape was fortunately readable but had overlapping content, containing several different file types including basic species and site data, computer programmes for the original data transformation and intermediate and final analysis results (as had the original magnetic tapes and printouts). As noted previously, most of the basic species and site files were consistently structured and were named according to vegetation formation leading to duplication. While confusing, duplication was far preferable to gaps in data. No data remained in either paper or digital form for the rainforest, dry scrubs, alpine vegetation and coastal wetland vegetation formations. This proved to be a loss of a large proportion of the data originally digitised.
Data on 1390 communities were recovered across the remaining formations, with alphanumeric codes for 9450 taxa and associated metadata. The estimated present cost of repeating the collection of raw data from the 461 locations, including species identification, preservation and documentation, would be conservatively AU$29 million. The estimated present cost of extracting and digitising the species lists from the initial articles collated would be around AU$8 million.
The most recent versions of the files were identified relative to the surviving hard copies. The following provides an insight into the complexity of decoding the available files. The digital information was organised (within files by formation) hierarchically: location; source (author); community parameters; and the species codes. Each category was given a control digit to identify the nature of the data following. This provided inputs to (mostly) sequential algorithms programmed in FORTRAN for precise formatting or Pascal for reformatting and quality assurance.
The fundamental problem with the data format (Table
An example of the core data available from printouts and (mostly) retrieved from Exabyte tapes according to formation and State. These examples are from the forested wetlands and desert acacia formations in New South Wales (N) and the Northern Territory (P).
LINE ID | Information |
800000 | N |
503200 | LOCATION N032 = CENTRAL COAST: SYDNEY (PIDGEON 1940) |
903200 | 33 51 151 13 |
503201 | COMMUNITY 01 = FRESHWATER RIVER (COMBINED LIST) |
003201 | UTRIAUST UTRIEXOL UTRIBILO VALLGIGA POTAOCHR POTAPERF POTATRIC BRASSCHR # |
003201 | NAJAMARI MYRIPROP PHRAAUST ELEOCHAR* TYPHORIE TYPHDOMI TRIGPROC TRIGSTRI # |
003201 | JUNCPAUC JUNCPALL JUNCPLAN AGROAVEN GAHNIA__* CASUCUNN MELALINA MELASTYP # |
003201 | CALLSALI EUCAROBU EUCAAMPL CAREX___* ISOLPROL VILLRENI ALISPLAN RANURIVU # |
003201 | GRATPUBE GOODPANI HYDRPEDU CENTASIA VIOLHEDE PRUNVULG STELFLAC SCHOAPOG # |
003201 | OPLIIMBE BLECINDI ADIAAETH PHILLANU # |
503202 | COMMUNITY 02 = FRESHWATER SWAMPS ON WIND BLOWN SAND (PORT STEPHENS) |
003202 | BAUMTERE BAUMARTI TRIGPROC TRIGSTRI PHILLANU LEPIARTI MELAQUIN EUCAROBU # |
003202 | ISOLINUN GRATPEDU DROSSPAT VILLRENI BAUMJUNC SCHOBREV RESTAUST LEPTTENA # |
003202 | RESTTETR SPREINCA BOROPARV EPACOBTU GONOMICR BLECINDI HYDRTRIP SPHAGNUM* # |
003202 | VIOLHEDE # |
500000 | ------------------------------- |
800000 | P |
503700 | LOCATION P037 = TANAMI DESERT: LAKE SURPRISE, N.T. (MACONOCHIE 1973) |
903700 | 20 15 131 45 |
503701 | COMMUNITY 01 = TUSSOCK GRASS-SEDGE-LAND + TREES |
303701 | EUCAPAPU ACACVICT # |
003701 | ABUTOTOC ACACADSU ACACJENS ACACMELL ACACSTIP ACACTENU ALTEANGU ARISBROW # |
003701 | ARISINAE BERGTRIM BONALINE BRACHOLO BRUNAUS2 BULBBARB CANTATTE CASSCOST # |
003701 | CASSHELM CASSOLIG CASSFILI CLEOVISC CLERFLOR COMESYLV CROTCUNN CROTEREM # |
003701 | CYPEBULB CYPECUNN CYPEHOLO CYPEIRIA DAMPCAND DESMMUEL DICRLEWE DODOPETI # |
003701 | ECTRSCHU ELYTSPIC ERAGLANF ERIAARIS ERIABENT EUCAASPE EUCAPRUI EUCASETO # |
003701 | EUCATERM EULAFULV EUPHDRUM EUPHWHEE GOODAZUR GOODENIA*GOMPCONI GREVJUNC # |
003701 | GREVWICK HALGSOLA HELIAMBI HIBILEPT HIBISTURC HIBISTURP INDIBREV IPOMMUEL # |
003701 | ISOTATRO LOMALEUC MARSEXAR MELAGLOM MELALASI MELANERV MELHOBLO MELOMADE # |
003701 | MERRDAVE MIRBVIMI MORGFLOR NEPTDIMO PANIAUST PARAMUEL PHYLCARP PHYLHUNT # |
003701 | PHYLRHYT PIMEAMMO PLECPUNG PLUCTETR PLUCTETRT POLYSYNA POLYGALA *PORTFILI # |
003701 | PORTOLER PSORMART PTILARTH PTILASTR PTILCALO RULILOXO SANTLANC SCAEPARV # |
003701 | SCIRLAEV SIDAPLAT STACMEGA SWAIBUR3 SYNATILL TINOSMIL TRIAPILO TRIOPUNG # |
003701 | TRIUGLAU WALTINDI ZORNALBI # |
500000 | ------------------------------- |
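The hierarchical layout shown in the table above can be read back with a short script. The following is a minimal sketch in Python, not the code used in the project. The meaning of each control digit (8 = State, 5 = location or community label, 9 = coordinates, 0 and 3 = species codes) is inferred from the sample lines only and is an assumption, as is the reading of the coordinate line as degrees and minutes of southern latitude and eastern longitude. The function names are hypothetical.

```python
import re

# Control digits as inferred from the sample lines (assumptions, not a
# documented specification of the original files).
STATE, LABEL, COORDS = "8", "5", "9"
SPECIES_DIGITS = {"0", "3"}

# Species codes are runs of capitals, digits, '_' or '/'. Markers such as
# '*' and the end-of-line '#' are simply ignored in this sketch.
CODE_PATTERN = re.compile(r"[A-Z0-9_/]+")


def to_decimal_degrees(payload):
    """Convert 'latDeg latMin lonDeg lonMin' to decimal degrees.

    Assumes southern latitudes and eastern longitudes (Australia).
    """
    lat_deg, lat_min, lon_deg, lon_min = (int(p) for p in payload.split())
    return -(lat_deg + lat_min / 60), lon_deg + lon_min / 60


def parse_legacy(lines):
    """Collect labels, coordinates and species codes from legacy lines."""
    state = None
    labels, coords, species = {}, {}, {}
    for line in lines:
        line_id, _, payload = line.partition("|")
        line_id, payload = line_id.strip(), payload.strip()
        control, location, community = line_id[0], line_id[1:4], line_id[4:6]
        if control == STATE:
            state = payload
        elif control == LABEL and location != "000":  # 500000 is a separator
            labels[(state, location, community)] = payload
        elif control == COORDS:
            coords[(state, location)] = to_decimal_degrees(payload)
        elif control in SPECIES_DIGITS:
            codes = CODE_PATTERN.findall(payload)
            species.setdefault((state, location, community), []).extend(codes)
    return labels, coords, species


sample = [
    "800000 | N",
    "503200 | LOCATION N032 = CENTRAL COAST: SYDNEY (PIDGEON 1940)",
    "903200 | 33 51 151 13",
    "503201 | COMMUNITY 01 = FRESHWATER RIVER (COMBINED LIST)",
    "003201 | UTRIAUST UTRIEXOL UTRIBILO VALLGIGA POTAOCHR POTAPERF POTATRIC BRASSCHR #",
    "500000 | -------------------------------",
]
labels, coords, species = parse_legacy(sample)
```

Keying every dictionary on the State letter as well as the location number reflects the fact that location numbers restart within each State block in the sample.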
The list of publications, from which the data had been retrieved, was fortunately readable and only required checking and updating. Each citation was given a unique number for the purposes of retrieval (Table
Example of records from the publications spreadsheet. ID = our imposed identification number (roughly alphabetical).
ID | Author(s) | Date | Title | Journal etc. | Volume No. | Page numbers |
1 | Abbott, J. | 1977 | Species richness, turnover and equilibrium in insular floras near Perth, Western Australia. | Aust. J. Bot. | 25 | 193-208 |
8 | Adams, L. D. & Craven, L. A. | 1976 | Checklist of vascular plants in a study area of the South Coast of N.S.W. | C.S.I.R.O. Land Use Res. Tech. Mem. | 76/16 | |
387 | McMahon, A.R.G., Carr, G.W., Todd, J.A. & Race, G.J. | 1990 | The Conservation Status of Major Plant Communities in Australia: Victoria. | Ecological Horticulture Pty Ltd, Clifton Hill, Vic. | | |
474 | Pye, K. | 1982 | Morphology and sediments of the Ramsay Bay sand dunes, Hinchinbrook Island, North Queensland. | Proc. R. Soc. Qld | 93 | 31-47 |
560 | Tate, R. | 1880 | On the geological and botanical features of southern Yorke Peninsula, South Australia. | Trans. R. Soc. S. Aust. | 13 | 112-120 |
705 | Willis, J.H. | 1967 | Systematic arrangement of vascular plants noted on the slopes and summit of the peak: The Rocks Nature Reserve, New South Wales. | Nat. Pks & Wildl. Serv., N.S.W. | 705 | |
The locations in this project were governed by the historical record (Table
The attributes (columns) of the master site file were:
Throughout the project there was an evolution of the fields in the master site file. In the original lists, multiple authorities were often cited, with sequential dates, one building on the work of the other or acknowledging a re-citation (an attempt to trace the provenance of a species list). Such multiple citations were not supported in Darwin Core format. Ray and Alison Specht, in consequence, reviewed all multiple author attributions and selected the most relevant to be the primary authority.
Precise definition of location, now possible with the Global Positioning System (GPS), was not available in most of the original studies. The broad latitude and longitude information in the original datasets (Table
Several typographical and procedural inconsistencies were highlighted as the datasets were ingested. The duration of the original project, from the first datasets (late 1970s) to the last (early 1990s), and the splitting of the data into different formations resulted in variations in the way associated information was recorded, from state/territory codes to the numbers associated with record lines for plant communities in the datasets (Table
When protected species are encountered in the ALA, some of their locations may be obfuscated, resulting in locational refinements being undone. The ALA’s Sensitive Data Service (SDS) examines records of any sensitive species (state, territory, federal or IUCN status) and applies rules depending on the location. As these data are in the public domain, we considered it was justifiable to overrule the SDS.
Full taxonomic names are used in most biodiversity information systems and analysis packages (
At the time of the original study, a species name was updated if a new species name was identified. To retain fidelity with the original record, both names were recorded. These updates were performed by R.L. Specht as part of the original CAVE protocol (
An example of the species conversion file for the sclerophyll formation and of alphacodes. This example does not illustrate the size of the files.
Sequential row number | Validity and Growth habit flag | Species code | Scientific name (in publication) | New Scientific name (at time of original entry) |
2 | L G | ABELMOSC | Abelmoschus moschatus | |
14 | LZG | ACACACAN | Acacia acanthoclada | |
19 | LMG | ACACARGY | Acacia argyrodendron | |
20 | SZG | ACACARMA | Acacia armata | Acacia paradoxa |
21 | MLG | ACACASHA | Acacia ashanesii | Acacia oshanesii |
174 | S G | ACACKEMP | Acacia sp. aff. A. sibirica | Acacia sp. aff. A. kempeana |
466 | S G | BORRCARP/ | Borreria sp. aff. Carpentariae | Spermacoce sp. aff. stenophylla |
704 | S G | CARPAEQU | Carpobrotus aequilaterus | Carpobrotus modestus |
705 | L G | CARPMODE | Carpobrotus modestus | |
3019 | SIG | RUMEACET | Rumex acetosella | Acetosella vulgaris sens. lat. * |
3020 | SIG | RUMEANGI | Rumex angiocarpus | Acetosella vulgaris sens. lat. * |
3647 | S G | ZYGOFRUT | Zygophyllum fruticulosum | Zygophyllum aurantiacum |
3650 | L G | ZYGOIODO | Zygophyllum iodocarpum | |
The datasets in this project were large enough to preclude manual processing. As with the original study, we were therefore dependent on several computer programmes to extract, integrate and validate the data matched to Darwin Core standard terms. This process was facilitated by access to modern tools and programming languages such as Pentaho, Java and JavaScript, use of the JSON format, and ALA web services as noted above.
The largest problem encountered was matching the species names in the data against the National Species Lists. There will always be arguments about species identification and nomenclature. There is no universally agreed taxonomy. This phase took around half of the project programming time, even with recourse to the Australian National Species Lists (http://www.rbg.vic.gov.au/science/projects/taxonomy/atlas-of-living-australia-national-species-lists-project, accessed 26 June 2018). Many names had been superseded over the intervening decades. The 9-character alphacodes, required for the original TWINSPAN analyses, presented an additional complication, since the codes were guaranteed unique only within each vegetation formation.
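The complication caused by formation-scoped codes can be illustrated with a small sketch. The coding rule used here (first four letters of the genus plus first four letters of the epithet, padded with underscores) is inferred from examples such as ACACARMA and CAREX___ and is an assumption; it ignores any suffix character and does not reproduce every code in the conversion file. The assignment of the two Eragrostis names to formations is hypothetical and serves only to show the collision.

```python
def make_alphacode(name):
    """Build an 8-character code from a scientific name (assumed rule).

    Genus-only names are padded with '_' (e.g. 'Carex' -> 'CAREX___').
    """
    parts = name.split()
    genus = parts[0]
    if len(parts) == 1:
        return genus.upper()[:8].ljust(8, "_")
    return (genus[:4] + parts[1][:4]).upper().ljust(8, "_")


def build_lookup(rows):
    """Index names by (formation, code) rather than by code alone.

    A code is only required to be unique within one formation, so the
    formation must be part of the key. A clash inside a single formation
    is reported instead of being silently overwritten.
    """
    lookup = {}
    for formation, name in rows:
        key = (formation, make_alphacode(name))
        if key in lookup and lookup[key] != name:
            raise ValueError(f"code collision within {formation}: {key[1]}")
        lookup[key] = name
    return lookup


# Hypothetical formation assignments, for illustration only.
rows = [
    ("arid wetlands", "Eragrostis cilianensis"),
    ("savannah understorey", "Eragrostis ciliolata"),
]
lookup = build_lookup(rows)
```

Both names reduce to the same code, so a lookup keyed on the code alone would lose one of them when the formation files are merged.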
In many cases, the original name for the taxon had moved to a third name. In some cases, the original name was again the currently accepted name for the taxon. Splits of broadly-defined taxa, e.g. Acacia aneura and Senecio lautus, into multiple taxa were mostly unresolvable into current names.
Amongst the information returned through this process were the scientific name for the taxon, its globally unique identifier, the taxon concept (essentially the name, named by and named date), common names and a match score. This ALA web service was the key component of the programme that produced a master species spreadsheet containing the best guess scientific name, taxon concept, match type and scores, source files and other parameters. We used five name match categories (Table
CODE | Meaning | Action |
MATCH | Near-exact match or better | accept |
PARTIAL-L and PARTIAL-R | A significant substring match | manual check |
FUZZY | Fuzzy matching algorithm built on the score from the web service using a 'letter-pair similarity' score | manual check |
WEAK | A weak match falling below thresholds; the best match is retained | manual check |
TAXM | No match or major problem with original or subsequent species name | refer to expert |
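The FUZZY category above relies on a 'letter-pair similarity' score. The exact formulation used in the project is not given here, so the following is a sketch of one common version only: the Dice coefficient computed over adjacent letter pairs within each word. The function names are hypothetical, and the thresholds separating FUZZY from WEAK matches are not reproduced.

```python
from collections import Counter


def letter_pairs(text):
    """Count adjacent letter pairs within each word, ignoring case."""
    pairs = Counter()
    for word in text.upper().split():
        for i in range(len(word) - 1):
            pairs[word[i:i + 2]] += 1
    return pairs


def letter_pair_similarity(a, b):
    """Dice coefficient on letter pairs: 1.0 for identical, 0.0 for disjoint."""
    pairs_a, pairs_b = letter_pairs(a), letter_pairs(b)
    total = sum(pairs_a.values()) + sum(pairs_b.values())
    if total == 0:
        return 0.0
    shared = sum((pairs_a & pairs_b).values())  # multiset intersection
    return 2.0 * shared / total


def rank_candidates(name, candidates):
    """Return candidates sorted from most to least similar to the name."""
    return sorted(
        candidates,
        key=lambda candidate: letter_pair_similarity(name, candidate),
        reverse=True,
    )


ranked = rank_candidates(
    "Acacia ashanesii",
    ["Acacia oshanesii", "Acacia paradoxa", "Carpobrotus modestus"],
)
```

A score of this kind tolerates single-letter misspellings such as 'ashanesii' for 'oshanesii', but it cannot decide between several near-identical candidates, which is why such matches were still checked manually.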
This process used online and offline resources in roughly the following priority order, dependent on the nature of the uncertainty:
The workflow for name resolution typically followed three stages.
Stage 1: Current name check
Often an incorrect name lookup using the ALA web service was caused by the name being misspelled in the original data, sometimes as a result of a simple typographical error. Taxonomists register common mistakes as ‘orth. var.’ and these are registered in APNI (https://biodiversity.org.au/nsl/services/apni).
The ALA name lookup sometimes returned an ambiguous result requiring further investigation. For example, Eragrostis ciliata could be mapped to E. cilianensis, E. ciliolata or Ericachne ciliata. Where only a single letter was used to represent a genus (as was occasionally the case in sequential lists in the digital master species conversion file: Table
Stage 2: Validation
Validation was dependent on the botanical knowledge of the assessor, in this case primarily Bolton. For the cases of taxonomic splits and misapplied names, additional information was required for name resolution. If no obvious match could be found from the available resources, we checked the original data file. In cases where no clarifying information could be found, the ALA’s ‘Explore your area’ or the ALA’s Spatial Portal (http://spatial.ala.org.au, accessed 26 June 2018) was used to identify potential candidates restricted to one or two of the original sites. Where the sites were associated with a small national park or reserve, the Spatial Portal was used to define the park or reserve as the area of interest and a species list was produced from the area report. Matches were usually found amongst the small number of species in the target genus. A good candidate species was one that was most common and occurred across the park/reserve. This strategy worked well for many taxa in south-western Western Australia.
Stage 3: Reference to an expert
Where no obvious species matches could be identified, the list of unmatched names was sent to Ray Specht, the lead author of the 1995 study (
The result of this process was a master species file with 9450 taxa, mostly species names. It would be desirable to link all the species listed in this project to voucher specimens which would potentially enable the several remaining incomplete identifications (to genus, family, sp. aff. etc.) to be resolved. Comprehensively linking these records to vouchers was, however, well beyond the scope of the current project. The voucher specimens will have been deposited in relevant state and national and, possibly, international herbaria. Users may wish to pursue this if necessary and practical for repurposing.
The intention of this data recovery project was to enable the data to be discoverable through as many systems as possible. As the largest challenge was updating the species lists, the resources of the ALA were considered of primary importance. A set of programmes was written to interrogate:
to produce the Darwin Core Records (Fig.
It was not trivial to map the attributes to Darwin Core (DwC). Five main output files were created, each file containing overlapping parts of the DwC Standard, as well as additional data that were not DwC-compliant, either for debugging purposes or because there was no DwC equivalent (Fig.
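The re-structuring from plot-based species lists to individual observations can be sketched as follows. This is an illustration under stated assumptions, not the mapping used for the published dataset: the input field names (plot_id, latitude and so on) are hypothetical, the identifier scheme is invented for the example, and the two scientific names are assumed expansions of the codes PHRAAUST and TYPHORIE. The output keys are standard Darwin Core terms.

```python
def plot_to_occurrences(plot):
    """Expand one plot record into one Darwin Core style record per species."""
    records = []
    for index, name in enumerate(plot["species"], start=1):
        records.append({
            # Hypothetical identifier scheme: plot id plus a running number.
            "occurrenceID": f"{plot['plot_id']}-{index:04d}",
            "eventID": plot["plot_id"],
            "basisOfRecord": "HumanObservation",
            "scientificName": name,
            "decimalLatitude": plot["latitude"],
            "decimalLongitude": plot["longitude"],
            "stateProvince": plot["state"],
            "locality": plot["locality"],
            "habitat": plot["community"],
            "associatedReferences": plot["source"],
        })
    return records


# Input structure and species names are assumptions for illustration.
example_plot = {
    "plot_id": "N032-01",
    "latitude": -33.85,
    "longitude": 151.2167,
    "state": "New South Wales",
    "locality": "CENTRAL COAST: SYDNEY",
    "community": "FRESHWATER RIVER (COMBINED LIST)",
    "source": "PIDGEON 1940",
    "species": ["Phragmites australis", "Typha orientalis"],
}
occurrences = plot_to_occurrences(example_plot)
```

Repeating the site attributes on every record is redundant, but it lets each observation stand alone in an occurrence-based system, while the shared eventID preserves the original plot grouping.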
This case study highlights the importance of providing for sustained data curation if we wish to expose data for maximal re-use. The recovery project was started because of the perceived value of the historic data, its national coverage, the fear of complete data loss and the continued existence of the key player in the initial exercise, Ray Specht. The estimated cost of the time the authors have spent in recovering and processing these data is minimally AU$100,000 in addition to the AU$50,000 invested by each of the funding organisations, the ALA and TERN. As a consequence of this effort and commitment, the data are now integrated with the ALA, Australia’s largest repository of species observations (https://collections.ala.org.au/public/show/dr8212) and will, in the future, be delivered as plot-based data through the Eco-informatics facility of TERN. The data set is downloadable from the Knowledge Network for Biocomplexity (
Even though we had access to digital data and supporting materials, a wide range of unanticipated problems were encountered. These should provide a strong warning to those active in or retired from the ecological research community. Many of the problems encountered were the result of:
(a) technological limitations at the time of the initial project and the work therefore required to update the data and formats to suit modern requirements,
(b) changed spatial referencing between the source material and modern standards,
(c) the long time taken to complete the initial project (resulting in variations in formatting and structure of the core data),
(d) the lapse in time between the compilation in 1995 and the start of the retrieval process in 2015 (Fig.
(e) the evolution of species names.
Changes in species names were expected, but even with the recent digital tools available through the Atlas of Living Australia, bespoke programming and expert taxonomic skills, considerably more time than initially anticipated was needed to resolve ambiguities. Without the effort, expertise and persistence of the authors, the recovery would have been impossible.
The involvement of three people from the original data collection (Specht, Specht and Bolton) in the recovery effort was invaluable for the resolution of taxonomic names, understanding the nature of the overlapping files, interpreting the information recorded, and understanding how the original project had been refined as it developed. Access to the CAVE manual (
It is interesting to note that there is wide acceptance of the value of the systematic collection of long-term data (e.g.
In the open-data world, with deposition of data for public use increasingly encouraged and supported through organisations like the Atlas of Living Australia, the Terrestrial Ecosystem Research Network, Elixir (https://www.elixir-europe.org), the Research Data Alliance, DataONE and GBIF, hopefully data loss will be less likely into the future. Even so, scientists need to be trained and encouraged to take advantage of repositories, and sustained funding is required to support the infrastructure necessary for good data conservation outcomes.
The original project was envisioned as a stock-take of the past, and by its conversion to and storage in digital form, a resource for the future. Despite initial enthusiasm for the project, lack of subsequent funding and continuity of effort meant this resource was almost lost. This is a common story even in cases where there was more substantial initial investment (
As our environment and our technological sophistication change, we need to respect information as it was originally reported. An object lesson from this project is not to be scornful of the efforts of times past, but to value them for the information they provide.
Sufficient resources need to be set aside to ensure that:
(a) scientists deposit their data as closely as possible to the time of their creation in appropriate, sustainable digital repositories,
(b) the technology of repositories is updated, and
(c) the data are appropriately conserved, allowing access, while maintaining integrity.
Only thus will data be useful to a myriad of future applications. Otherwise, recovering data in the future will cost far more than might be imagined and may, in fact, be impossible.
As is clear from the article, this recovery project was only possible through the contribution of an array of people and organisations over a sustained period of time. This includes the initial data collectors and their organisations, funding obtained in the 1980s from a vast number of organisations listed in the Conservation Atlas and latterly material and library support from the University of Queensland and the financial and collegial support provided by the Terrestrial Ecosystem Research Network (TERN: www.tern.org.au). AS would like to thank Bob Parsons and Peter Saenger for providing critical missing data, Eric Garnier and Bill Michener for valuable comments on earlier versions of this manuscript and several people, including Siddeswara Guru, Baptiste Laporte, Todd Vision and Chris Lortie for their encouragement.
We particularly thank the late John LaSalle, Director of the Atlas of Living Australia (ALA: www.ala.org.au) for supporting this recovery project through funding Bryn Kingsford for coding the scripts and providing the time of Lee Belbin (Advisor to the ALA) and Miles Nicholls (the ALA's Data Manager): John had a rare understanding of the true value of these data.
AS & LB devised the project; AS was primarily responsible for the article.
AS, MPB, LB and RLS verified data collation, geographical and taxonomic mapping.
BK was primarily responsible for developing the code to retrieve the data from the original files and deliver them in Darwin Core format.
BK and MPB monitored programme linkages.