Data Paper
Biodiversity Data Journal 8: e59249
https://doi.org/10.3897/BDJ.8.e59249 (17 Nov 2020)
This article sets a dubious precedent: it suggests that a data paper can be published without any serious work on the primary data. In addition, the dataset on which the data paper is based was published in violation of the licenses attached to most of the records it includes. The published dataset also largely duplicates data already published through GBIF. Furthermore, the data described in the article are not under the management of the authors, who performed no data structure design, data quality control, or data cleaning on the published dataset. Nothing was calculated directly from the data. Everything the authors describe was done inside the iNaturalist system, and they were limited to the facilities this system provides: every figure and metric value was taken from iNaturalist rather than calculated and plotted from the dataset itself with R, Excel, or other software.
First, it is essential to look at the published dataset (deposited in Zenodo). According to the BDJ data publishing guidelines, "large primary biodiversity data sets ... should be published with the GBIF Integrated Publishing Toolkit". As most of the data described had already been published on behalf of iNaturalist through GBIF, the authors duplicated the data into Zenodo, a non-specialised repository, to meet the formal requirements. Because Zenodo accepted only CC-BY as an open license, this approach violates the licenses that thousands of iNaturalist users specified for their data. The dataset published through Zenodo under CC-BY contains records licensed CC-BY-NC, CC-BY-SA, and CC-BY-ND, and even records under full copyright. The license violation affects more than 90% of the records in this dataset.
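The license claim above can be checked directly against a downloaded export. The following is a minimal sketch, assuming each record carries a `license` field with values such as `CC-BY-NC`. The field name, the treatment of an empty value as all rights reserved, the set of licenses regarded as compatible with CC-BY, and the sample rows are all assumptions made for illustration; none of them is taken from the dataset itself.

```python
from collections import Counter

# Licenses treated here as allowing re-release under CC-BY (an assumption of this sketch).
CC_BY_COMPATIBLE = {"CC0", "CC-BY"}


def license_summary(records, field="license"):
    """Count records per license and return (counts, share_incompatible)."""
    counts = Counter()
    for record in records:
        value = (record.get(field) or "").strip().upper()
        # An empty license field is counted as all rights reserved (assumption).
        counts[value or "ALL RIGHTS RESERVED"] += 1
    total = sum(counts.values())
    incompatible = sum(n for lic, n in counts.items() if lic not in CC_BY_COMPATIBLE)
    share = incompatible / total if total else 0.0
    return counts, share


# Hypothetical sample rows, for illustration only.
sample = [
    {"id": "1", "license": "CC-BY-NC"},
    {"id": "2", "license": "CC-BY"},
    {"id": "3", "license": ""},
    {"id": "4", "license": "CC-BY-NC"},
]
counts, share = license_summary(sample)
print(counts.most_common())
print(f"Share not compatible with CC-BY: {share:.0%}")
```

Run over the full export, a tally of this kind would make the proportion of affected records reproducible by any reader.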
There was no consistent data collection method in the sense of a protocol defining what, where, and how to collect. The detailed description of methods was replaced by a description of the iNaturalist system and of the data publishing process, which was automated by the system's developers. A smartphone with the iNaturalist application, or the web interface, is a data-gathering tool, not a research method, even considering all the power of the neural networks hidden behind the application interface that help to determine what the user has photographed.
Not only is the collection method missing, but so is the purpose for which all these observations were collected. Some of the data were gathered during observation-collecting competitions and some during student field practice; many observations were made by naturalists on individual excursions to satisfy their own curiosity; and in some places the system was used for biodiversity inventory, as evidenced by hundreds of local projects dedicated to protected areas. This activity therefore does not meet the definition of citizen science, in which volunteers help to collect and/or process scientific data for specific goals, because scientific goals are mostly absent here.
The dataset published through Zenodo is not the product of the authors collating data from different sources, but a simple duplication of data from a global source into a non-specialised scientific repository. Such important steps of dataset processing as data structure design, data standardisation, and data publishing are absent. The data presented in the paper are therefore not under the management of the authors, because these data were already available via iNaturalist and GBIF.
Moreover, the authors did nothing to process this dataset after downloading the raw data. This is shown by the fact that the Excel files in the archive have the same form as files downloaded from the iNaturalist system. The text of the data paper, however, mentions the CSV format. The data are presented in the archive as four files (because of iNaturalist technical restrictions). It is unlikely that the metrics presented in the tables were calculated on each file separately; all metrics were most likely taken from local project summaries.
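Computing the metrics from the dataset itself would require merging the files first, because counts of distinct species or observers cannot simply be summed across parts. The sketch below illustrates this under stated assumptions: the column names `id`, `scientific_name`, and `user_login`, and the two small in-memory parts standing in for the archive files, are hypothetical.

```python
import csv
import io


def read_rows(handles):
    """Yield rows from several CSV file objects as one stream."""
    for handle in handles:
        yield from csv.DictReader(handle)


def overall_metrics(rows):
    """Compute metrics over the merged records, de-duplicated by id."""
    seen, species, observers = set(), set(), set()
    for row in rows:
        if row["id"] in seen:
            continue  # the same observation appears in more than one part
        seen.add(row["id"])
        species.add(row["scientific_name"])
        observers.add(row["user_login"])
    return {
        "observations": len(seen),
        "species": len(species),
        "observers": len(observers),
    }


# Two hypothetical in-memory parts stand in for the files of the archive.
part_1 = io.StringIO(
    "id,scientific_name,user_login\n"
    "1,Betula pendula,anna\n"
    "2,Pinus sylvestris,ivan\n"
)
part_2 = io.StringIO(
    "id,scientific_name,user_login\n"
    "3,Betula pendula,ivan\n"
    "2,Pinus sylvestris,ivan\n"
)
print(overall_metrics(read_rows([part_1, part_2])))
```

In this toy case each part contains two species, yet the merged set also contains only two, which is why per-file summaries cannot be added together.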
Moreover, the paper contains no figures composed by the authors directly from the dataset using R, Excel, Statistica, or other software. A simple and relevant question, "What portion of these data was collected by the authors themselves?", asked in the post section of the "Flora of Russia" project, remains unanswered.
Undoubtedly, the data accumulated through the efforts of the "Flora of Russia" project participants could be valuable for various kinds of research: flora inventory, especially for protected areas; monitoring of invasive species, as mentioned in the data paper; species distribution modelling; and so on. But this first requires comprehensive and thorough work on data quality assessment and data cleaning. Except for species identification, no work has been done in this direction. The data paper lists some common mistakes, but does not describe any work on data quality checking.
To some extent, these data are curated by the authors inside the iNaturalist system. Considerable work was performed on identifying observations and verifying identifications. Leading Russian botanists became involved in this process, which ensured the high quality of the identifications. This could be the subject of interesting articles describing the experience and its outcomes, but not of a data paper in this form. Unfortunately, data quality control was limited to species identification. Two other essential things, WHEN and WHERE each observation was made, received no attention. The exclusion of records by a 50 km coordinate uncertainty threshold is too rough and too formal an approach.
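To illustrate what record-level checks of WHEN and WHERE could look like beyond a single cutoff, here is a minimal sketch. The field names (`observed_on`, `created_on`, `latitude`, `longitude`, `uncertainty_m`), the flag wording, and the assumption that the threshold excludes values above 50 km are all hypothetical; the checks are generic illustrations, not the authors' procedure and not an exhaustive list of known error types.

```python
from datetime import date


def check_record(record, today, max_uncertainty_m=50_000):
    """Return a list of quality flags for one observation record."""
    flags = []

    # WHEN: the observation date must exist and be plausible.
    observed = record.get("observed_on")
    created = record.get("created_on")
    if observed is None:
        flags.append("missing observation date")
    else:
        if observed > today:
            flags.append("observation date in the future")
        if created is not None and observed > created:
            flags.append("observed after upload")

    # WHERE: coordinates must exist, be in range, and not be the (0, 0) placeholder.
    lat, lon = record.get("latitude"), record.get("longitude")
    if lat is None or lon is None:
        flags.append("missing coordinates")
    elif not (-90 <= lat <= 90 and -180 <= lon <= 180):
        flags.append("coordinates out of range")
    elif lat == 0 and lon == 0:
        flags.append("zero coordinates")

    # A missing uncertainty is reported separately from one that is too large.
    uncertainty = record.get("uncertainty_m")
    if uncertainty is None:
        flags.append("missing coordinate uncertainty")
    elif uncertainty > max_uncertainty_m:
        flags.append("coordinate uncertainty above threshold")
    return flags


# Hypothetical record, for illustration only.
record = {
    "observed_on": date(2020, 7, 1),
    "created_on": date(2020, 6, 1),
    "latitude": 55.75,
    "longitude": 37.62,
    "uncertainty_m": None,
}
print(check_record(record, today=date(2020, 11, 17)))
```

Flagging rather than silently dropping records keeps each decision visible, so it can be described in the data paper and revisited by later users of the dataset.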
The types of errors associated with the observation date have long been widely known, but they were not mentioned.
There is also an instance of negligent use of terminology. A "portal", by definition, is a resource that unites data that are heterogeneous in structure, origin, initial purpose, and so on. Examples of portals are GBIF, OBIS, and BOLD. In the paper, even a local iNaturalist project with hundreds of uniform observations is called a portal. The alternative links are also confused: the authors provided links to different sources instead of different links to one source.
In summary, this paper contains neither a description of a dataset collated, standardised, and published by the authors, nor any comprehensive processing of the consolidated data array. The dataset, formally deposited by the authors in Zenodo, cannot be reused for research without serious data quality control and data cleaning. Anyone can get the data directly from iNaturalist in the same form as published in Zenodo, and download similar graphics from the iNaturalist system.