"Flora of Russia" on iNaturalist: a dataset

Alexey Seregin; Dmitriy Bochkov; Julia Shner; Eduard Garin; Igor Pospelov; Vadim Prokhorov; Pavel Golyakov; Sergey Mayorov; Sergey Svirin; Alexander Khimin; Marina Gorbunova; Ekaterina Kashirina; Olga Kuryakova; Boris Bolshakov; Aleksandr Ebel; Anatoliy Khapugin; Maxim Mallaliev; Sergey Mirvoda; Sergey Lednev; Dina Nesterkova; Nadezhda Zelenova; Svetlana Nesterova; Viktoriya Zelenkova; Georgy Vinogradov; Olga Biryukova; Alla Verkhozina; Alexey Zyrianov; Sergey Gerasimov; Ramazan Murtazaliev; Yurii Basov; Kira Marchenkova; Dmitry Vladimirov; Dina Safina; Sergey Dudov; Nikolai Degtyarev; Diana Tretyakova; Daba Chimitov; Evgenij Sklyar; Alesya Kandaurova; Svetlana Bogdanovich; Alexander Dubynin; Olga Chernyagina; Aleksandr Lebedev; Mikhail Knyazev; Irina Mitjushina; Nina Filippova; Kseniia Dudova; Igor Kuzmin; Tatyana Svetasheva; Vladimir Zakharov; Vladimir Travkin; Yaroslav Magazov; Vladimir Teploukhov; Andrey Efremov; Olesya Deineko; Viktor Stepanov; Eugene Popov; Dmitry Kuzmenckin; Tatiana Strus; Tatyana Zarubo; Konstantin Romanov; Alexei Ebel; Denis Tishin; Vladimir Arkhipov; Vladimir Korotkov; Svetlana Kutueva; Vladimir Gostev; Mikhail Krivosheev; Natalia Gamova; Veronica Belova; Oleg Kosterin; Sergey Prokopenko; Rinat Sultanov; Irina Kobuzeva; Nikolay Dorofeev; Alexander Yakovlev; Yuriy Danilevsky; Irina Zolotukhina; Damir Yumagulov; Valerii Glazunov; Vladimir Bakutov; Andrey Danilin; Igor Pavlov; Elena Pushay; Elena Tikhonova; Konstantin Samodurov; Dmitrii Epikhin; Tatyana Silaeva; Andrei Pyak; Yulia Fedorova; Evgeniy Samarin; Denis Shilov; Valentina Borodulina; Ekaterina Kropocheva; Gennadiy Kosenkov; Uladzimir Bury; Anna Mitroshenkova; Tatiana Karpenko; Ruslan Osmanov; Maria Kozlova; Tatiana Gavrilova; Stepan Senator; Maxim Khomutovskiy; Eugene Borovichev; Ilya Filippov; Serguei Ponomarenko; Elena Shumikhina; Dmitry Lyskov; Evgeny Belyakov; Mikhail Kozhin; Leonid Poryadin; Artem Leostrin

doi:10.3897/BDJ.8.e59249

Data Paper

Biodiversity Data Journal 8: e59249
https://doi.org/10.3897/BDJ.8.e59249 (17 Nov 2020)

General comments

Maxim Shashkov

This article sets a dubious example: it suggests that a data paper can be published without serious work on the primary data itself. Moreover, the dataset on which the data paper is based was published in violation of the licenses of most of the records it includes. The published dataset also largely duplicates data already published through GBIF. Furthermore, the data described in the article are not under the management of the authors, as they did not perform data structure design, data quality control, or data cleaning on the published dataset. No calculations were made directly from the data. Everything the authors described was done inside the iNaturalist system, and they were restricted to the facilities provided by that system (every picture and metric value was taken from iNaturalist, not calculated and plotted with R, Excel, or other software using the dataset itself).

First, it is essential to pay attention to the published dataset (through Zenodo). According to the data publishing guidelines of BDJ: "large primary biodiversity data sets ... should be published with the GBIF Integrated Publishing Toolkit". As most of the data described had already been published on behalf of iNaturalist through GBIF, the authors duplicated the data in Zenodo, a non-specialised repository, to meet the formal requirements. As Zenodo accepted only CC-BY as an open license, this approach violates the licenses specified by thousands of iNaturalist users for their data. The dataset published through Zenodo under CC-BY contains records with CC-BY-NC, CC-BY-SA, and CC-BY-ND licenses, and even records under full copyright. The license violation affects more than 90% of the records in this dataset.
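
A license share of this kind can be checked from the downloaded table itself. The following is a minimal sketch in Python, not the procedure used by the authors or by the commenter. It assumes a CSV export with a `license` column in which an empty value means all rights reserved; the column name, the sample rows, and the set of licenses treated as compatible with a CC-BY release are all assumptions made for illustration.

```python
import csv
import io
from collections import Counter


def license_shares(rows, allowed=("CC0", "CC-BY")):
    """Count records per license and return the share that is not in `allowed`.

    `allowed` is an assumption about which licenses permit a CC-BY release.
    """
    # An empty license field is treated as "all rights reserved" (assumption).
    counts = Counter((row["license"] or "all rights reserved") for row in rows)
    total = sum(counts.values())
    incompatible = sum(n for lic, n in counts.items() if lic not in allowed)
    share = incompatible / total if total else 0.0
    return counts, share


# Hypothetical sample standing in for a real export.
SAMPLE = """id,license
1,CC-BY
2,CC-BY-NC
3,CC-BY-NC
4,
5,CC0
"""

counts, share = license_shares(csv.DictReader(io.StringIO(SAMPLE)))
print(dict(counts))
print(f"incompatible share: {share:.0%}")
```

Running the same function on each of the archive files and summing the counters would give the overall figure.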

There was no consistent data collection method in the sense of a protocol specifying "what, where, and how to collect." The detailed description of methods was replaced by a description of the iNaturalist system and of the data publishing process, which was automated by the system's developers. Obviously, a smartphone with the iNaturalist application or web interface is a data-gathering tool, not a research method, even considering all the power of neural networks hidden under the application interface and helping to determine what the user has photographed.

Not only the collection method was missing, but also the purpose for which all these observations were collected. Some of the data were gathered during competitions to collect observations and some during student field practice; many observations were made during individual excursions by naturalists to satisfy their curiosity; and in some places the system was used for biodiversity inventory, as evidenced by hundreds of local projects dedicated to protected areas. So this activity does not meet the definition of citizen science, in which volunteers help collect and/or process scientific data for specific goals, because the scientific goals of this activity are mostly absent.

The dataset published through Zenodo is not the product of the authors collating data from different sources, but a simple duplication of data from a global source into a non-specialised scientific repository. Such important steps of dataset processing as data structure design, data standardisation, and data publishing are absent. So the data presented in the paper are not under the management of the authors, because these data were already available via iNaturalist and GBIF.

Moreover, the authors did nothing to process this dataset after downloading the raw data. This is shown by the fact that the Excel files in the archive have the same form as those downloaded from the iNaturalist system. The text of the data paper, however, mentions the CSV format. The data are presented in the archive as four files (owing to iNaturalist technical restrictions). It is hardly likely that the metrics presented in the tables were calculated on each file separately. All metrics have likely been taken from local project summaries.

Moreover, the paper contains no figures composed by the authors directly from the dataset using R, Excel, Statistica, or other software. A simple and relevant question, "What portion of these data was collected by the authors themselves?", asked in the posts section of the "Flora of Russia" project, remains unanswered.

Undoubtedly, the data accumulated through the efforts of the "Flora of Russia" project participants could be valuable for different kinds of research: flora inventory, especially for protected areas; monitoring of invasive species, as mentioned in the data paper; species distribution modelling; etc. But this first requires comprehensive and thorough work on data quality assessment and data cleaning. Except for species identification, no work has been done in this direction. The data paper lists some common mistakes, but without describing any work on data quality checking.

These data are, to some extent, curated by the authors inside the iNaturalist system. Considerable work was performed on identifying observations and verifying identifications. Leading Russian botanists became involved in this process, which ensured the high quality of the identifications. That could be the subject of possible exciting articles describing this experience and its outcomes, but not of a data paper in this form. Unfortunately, data quality control is limited to species identification. Two other main things, WHEN and WHERE the organism was observed, remained without attention. Excluding records on the basis of a 50 km coordinate uncertainty threshold is too rough and too formal an approach.

The types of errors associated with the observation date have long been widely known but were not mentioned:

The photo does not match the date (season mismatch)
The error of "1 January."
The observation was posted in the citizen science system earlier than the photo was taken; this is hard to find with the iNaturalist website but very simple to detect when handling the spreadsheet.
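
Two of the date checks listed above lend themselves to a few lines of code once the table is loaded. The sketch below is illustrative only and assumes a CSV export with `id`, `observed_on` (ISO date), and `created_at` (ISO 8601 timestamp) columns; these column names, the timestamp format, and the sample rows are assumptions, and a real export may need different parsing. The season mismatch cannot be detected this way, since it requires looking at the photo.

```python
import csv
import io
from datetime import date, datetime


def flag_date_problems(rows):
    """Yield (record id, reason) for records whose dates look suspicious."""
    for row in rows:
        observed = date.fromisoformat(row["observed_on"])
        created = datetime.fromisoformat(row["created_at"]).date()
        # "1 January" often indicates a default date, not a real one.
        if observed.month == 1 and observed.day == 1:
            yield row["id"], "observed on 1 January"
        # A record cannot be uploaded before the photo was taken.
        if created < observed:
            yield row["id"], "uploaded before the observation date"


# Hypothetical sample standing in for a real export.
SAMPLE = """id,observed_on,created_at
1,2019-06-14,2019-06-15T10:00:00
2,2018-01-01,2019-03-02T08:30:00
3,2020-05-20,2020-05-01T12:00:00
"""

flags = list(flag_date_problems(csv.DictReader(io.StringIO(SAMPLE))))
for record_id, reason in flags:
    print(record_id, reason)
```

Flagged records are candidates for manual review rather than automatic deletion, since a genuine observation can fall on 1 January.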

There is an instance of negligent use of terminology: a "portal", by definition, is a resource that unites data that are heterogeneous in structure, origin, initial purpose, and so on. Examples of portals are GBIF, OBIS, and BOLD. In the paper, even a local iNaturalist project with hundreds of uniform observations is called a portal. The alternative links are also a mess: the authors provided links to different sources instead of different links to one source.

So this paper contains neither a description of a dataset collated, standardised, and published by the authors, nor comprehensive processing of the consolidated data array. The dataset, formally placed by the authors in Zenodo, cannot be reused for research without serious data quality control and data cleaning work. Everyone can get the data directly from the iNaturalist in the same form as it was published in Zenodo and download similar graphics from the iNaturalist system.

Oleg Kosterin

Thank you for the comment, Dr. Shashkov. It’s nice to see that nowadays the genre of scientific criticism has still not been crushed by the explosion of papers ‘communicating something positive’. You pointed out some objective drawbacks (from license discordance to misuse of the word ‘portal’) but also overly minor points (e.g. using the term ‘citizen science’ beyond its strict definition, for data not collected specially for this project), and made many subjective evaluations. In so long a comment you collected virtually everything critical you could (though you may say it was just a small selection) and so oversaturated the comment with an aura of overly personal irritation with the project as such (or maybe with some of its key persons). This works against you, as it will inspire in the reader a critical attitude to the comment rather than to the paper and will impair perception of its constructive part.

The comment is focused on your verdict at the end: “Everyone can get the data directly from the iNaturalist in the same form ... and download similar graphics from the iNaturalist system.”

Very well, let us assume that is so. Most scientific papers are written not because their authors alone are capable of doing so but because they took the trouble to do it. This long paper with tables, graphs and a map has appeared because some people were, fortunately, not too lazy to take on that amount of work. It would be just as good if the paper had been produced by ‘anyone’, but it would be regrettable if it had never appeared.

It seems that you are irritated that someone did what would have been easy for people with your skills, had they been interested. That means you are simply outside the circle of interested readers of this paper (that is the case, isn’t it?), a circle which nevertheless exists and is even broader than the circle of its authors.

You repeatedly regretted that the authors did not instead produce some different and much better, ‘comprehensive’ paper, with different and prominent goals and sophisticated methods (you used the words ‘serious work’, ‘comprehensive processing’, ‘possible exciting articles’). But isn’t that advice relevant to any paper on earth?

Fortunately, the reviewers and the Editor considered this paper useful as it is. And fortunately there is always a variety of opinions on what is relevant, useful or interesting. This allows science to go on.

On more particular issues:

(i) You claim that using iNaturalist functions is not a scientific method. It is. Look, even advanced analyses of DNA sequences involve software developed by other teams for such purposes. Methods may be sophisticated or very simple, precise or rough, with more or less resolution. Any is useful for a relevant goal, provided it is explicitly indicated and its limitations are considered.

(ii) You claim that no tool but those of iNaturalist was used for the paper. This is simply not true: the data taken from iNaturalist were further processed and to a certain extent checked (e.g. with respect to geographical accuracy). Most importantly, the data were thoroughly checked with respect to identification accuracy even before they reached research grade in iNaturalist and entered its project ‘Flora of Russia’. All authors are among the top 200 identifiers, hence your claim that “they did not perform ... data quality control” is simply false, while the mention of “neural network hidden under the application interface” is irrelevant. OK, you acknowledged those efforts at the end of your comment, but your strict claim “they did not perform” comes first.

(iii) “Not only the collection method was missing, but also the purpose for which all these observations were collected”. It is evident that there was a distinct purpose for which they were collected from iNaturalist, even if the purposes for which they were posted there varied. That is one of the main points of the work.

(iv) Maybe content duplication is not the best thing, but it is still better than its absence. Why not duplicate data if formally required? And OK, the data are present in iNaturalist but, again, to get such an elaborate overview someone had to do the job of taking and processing them, which is exactly what preparing this paper involved.

Maxim Shashkov

Some changes can be noticed: observations with an unsuitable license were excluded from the archive. However, this gave rise to a new difficulty.
It was apparent previously that the schemes and charts were not based on the dataset (meaning the Zenodo archive, specified in the paper as the primary data source). Now the inconsistency between the data archive and the data paper has become more explicit: the dataset has been changed, while all the illustrations have remained the same.
Moreover, the authors paid particular attention to the distribution of occurrences among the subjects of the Russian Federation, so a field with the province ID could be essential, but there is none in the dataset. So even a species list for a particular province cannot be extracted from the dataset without a GIS and additional processing.
Besides, the archive on ResearchGate that the paper refers to still contains records with inappropriate licenses.
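
The additional processing needed to recover a province for each record amounts to a point-in-polygon test of the coordinates against administrative boundaries. The sketch below shows the idea in plain Python with a ray-casting test. The two rectangular "regions" are hypothetical placeholders, not real boundaries, and the planar treatment of longitude and latitude is a simplification; real work would use actual boundary data and a GIS library.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: True if (lon, lat) lies inside the polygon.

    `polygon` is a list of (lon, lat) vertices; coordinates are treated as planar.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Consider only edges that straddle the horizontal line through the point.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def assign_region(lon, lat, regions):
    """Return the name of the first region containing the point, or None."""
    for name, polygon in regions.items():
        if point_in_polygon(lon, lat, polygon):
            return name
    return None


# Hypothetical rectangles standing in for real administrative boundaries.
REGIONS = {
    "Region A": [(30.0, 50.0), (40.0, 50.0), (40.0, 60.0), (30.0, 60.0)],
    "Region B": [(40.0, 50.0), (50.0, 50.0), (50.0, 60.0), (40.0, 60.0)],
}

print(assign_region(35.0, 55.0, REGIONS))
print(assign_region(45.0, 55.0, REGIONS))
print(assign_region(10.0, 10.0, REGIONS))
```

Grouping the records by the assigned region name would then yield a species list per province.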

Dmitry Schigel

Thank you for the lively discussion. I am informed that the archived data snapshot at Zenodo has now been corrected and its metadata updated: https://zenodo.org/record/4061848. As before, data reuse is enabled through the links to the dynamic data resources through iNaturalist and GBIF.org.

Dariia Borovyk

This article is a violation of international law and ethical principles. The authors of this article consider Crimea a part of Russia, and add a comment, "Republic of Crimea and the City of Sevastopol claimed by Ukraine". But they are not "claimed". They are simply a part of Ukraine occupied by Russia, which is recognized by the entire civilized world. The authors of this article who are from Crimea are listed with affiliations in Russia, which is also a mistake. I am surprised that the reviewers of this article and the editors missed such serious mistakes.

I have only one question for the coordinators of this very "successful" project - are you going to include the "newly acquired" territories of your motherland in your floristic research? You can have even more data if you do not care about the moral and ethical context at all.
