The role of plants in the formation of species-specific features in grass flies (Diptera, Chloropidae, Meromyza)

Abstract In the current manuscript, we present the results of comparative analysis of seven species of Meromyza flies in the “variegata” cluster and of the evolutionary close species M.inornata, based the following criteria: 1) 14 external key features; 2) shape and area of the anterior processes of postgonites; 3) mtDNA CO1 region and 4) host plant diversity data. We could demonstrate the primary role of host plants in species formation inside genus Meromyza and calculated the timing of the divergence of M.inornata and the species of “variegata” cluster. Based on our estimates of evolution rate for mtDNA CO1 gene, we could conclude that that divergence of herbs happened before the speciation of grass flies Meromyza. Meromyza species, close to the ancestral species of the cluster, are adapted to the wide range of host plants. We revealed the most informative variables h1, S and Plant analysing data with the following statistical methods: linear discriminant analysis - LDA, regularised discriminant analysis - RDA, flexible discriminant analysis – FDA and probabilistic neural network - PNN. The highest classification accuracy was achieved using PNN (99%) and the lowest when using LDA (95.8%). When the Plant trait was excluded, the classification accuracy decreased by 14%. We revealed the significant trends in size change of the anterior process of the postgonite amongst studies species. This morphological structure is an element of male reproductive apparatus critical for the restriction of interspecies mating. We determined three branches of speciation in the “variegata” cluster and five trends in the evolution of this cluster, based on the external morphological features. We showed that M.variegata and especially M.mosquensis, the species closest to the ancestral haplotype, have the largest number of features typical of those of M.inornata. Based on the external features and the area of the anterior process of the postgonite, we reconstructed the phylogenetic position of M.elbergi in the cluster. In accordance with the obtained outcomes, we could conclude that the distribution, species diversity and the adaptation of the grass flies to narrow oligophagy were directly connected to host plant diversity. The adaptation to different host plants could be the main factor in divergence of grass flies and their evolution started later than the diversification in the Pooideae subfamily of grasses.


Introduction
Studies of the mechanisms and factors contributing to species biodiversity are an important part of modern research in zoology (Chesson 2000, Vorobjeva and Striganova 2005, Parmesan 2006, Gaujour et al. 2012, Jeltsch et al. 2013. Modern approaches to phylogenetic reconstructions are based on interaction of cladistics, numerical and genetic phyletics (Scotland et al. 2003, Pavlinov 2003, Pavlinov 2005, Patwardhan et al. 2014. In classical phylogenetics, construction of a cladogram represents the initial stage of phylogenetic studies where reconstruction of an evolutionary scenario requires a set of additional data (Pavlinov 2005). In contrast, the creation of a cladogram in modern phylogenetics is the final stage of phylogenesis reconstruction.
Grass flies of the genus Meromyza (Diptera, Chloropidae, Meromyza Meigen, 1830) represent a perfect model for employment of both modern and classical approaches to phylogenetic reconstructions. To date, Meromyza includes more than 90 species, distributed throughout the Northern Hemisphere. The identification key of this genus was developed, based on a set of external morphological features and specific features of male genital apparatus (Narchuk and Fedoseeva 2010).
The results of genetic analysis made it possible to divide the genus into eight clusters (Safonkin et al. 2016) which triggered the question about the factors contributing to diversification of this genus. The change of host plant is important in speciation of phytophages (Forbes et al. 2017). Assuming that evolution of host plants can be an important factor, we looked closely into a list of host plants for one third of the species of Meromyza flies ( Panteleeva 1989, Nartshuk andFedoseeva 2011). Additionally, the genetic analysis has allowed us to trace the formation of species groups (clusters) within the genus, as well as the relative time of speciation within these groups and the degree of species nutritional adaptation (Safonkin et al. 2020a). However, no data are available about the development of specific features, in particular, for species during the evolution of genus or the role of host plant. In flies of genus Meromyza, all host plants, the area, the relative time of speciation, the degree of relationship between species, based on the postgonite structure and the mtDNA COI locus are currently known only for a representative sample of species from the "variegata" cluster.
Species of the "variegata" cluster are widespread throughout Europe (Safonkin et al. 2020b). These species show striking similarities despite wide variability of external features, but for species-specific male genital apparatus (Safonkin et al. 2020b). Some species within the cluster have a broad range of host plants, but other species are specialised in limited host plant groups (Panteleeva 1989, Nartshuk andFedoseeva 2011).
In the current manuscript, we show the results from our study of the distribution of various external key features and structural features of male genital apparatus amongst the species of "variegata" cluster. We also present here the results of comparative analysis of seven species of Meromyza flies in the "variegata" cluster and evolutionary close species M. inornata Becker, 1910, based on external key features, shape and area of the anterior processes of postgonites, mtDNA CO1 region and host plant diversity data. We demonstrate the important role of host plants in species formation inside genus Meromyza.

Genetic analysis.
We analysed the relationship and time of species divergence, based on the nucleotide sequences of mtDNA CO1 locus previously deposited by us in GenBank. A phylogenetic tree was constructed via the Bayesian approach using the BEAST v.1.10.4 software package with default parameters, except for the Tree Prior parameter, for which the G. Yule speciation model was chosen. The numbers in the nodes indicate the replacement for the site for 1 million years (Triseleva et al. 2014, Safonkin et al. 2016, Safonkin et al. 2020b.

Methods for species differentiation, based on comparative analysis of external features and male genitals.
For comparative analysis in 66 males and 58 females of eight species of grass flies, we have selected the most distinct features (out of 35 analysed): eight quantitative features: h1 (ratio of the gena height to the height of the 3rd antennal segment), h2 (ratio of the length of triangle to the length of head), h3 (ratio of the height of ocellar triangle to the base of triangle), h4 (ratio of the width of hind femur to the width of hind tibia), L (length of the body without abdomen), L1 (ratio of the length of head to the length of mesonotum), L2 (ratio of mesonotum to scutellum), S (area of anterior process of the postgonite) and six qualitative traits (Table 1). Additionally, we analysed three ecological characters: habitat, biotope and species of host plants. For comparative analysis of the shape of anterior process of the postgonites, we used our previous data .

The analysis of host plants.
The species of host plants are taken from the work of Safonkin et al. (2020b). To time the origin of grasses, we used data from the work of Pimentel et al. (2017).

Statistical methods for species differentiation, based on quantitative and qualitative traits.
Evaluation of the separating ability of the selected quantitative (8) and qualitative (1) traits for the differentiation of 96 individuals, belonging to seven species of grass flies, was carried out using two types of methods. The first type includes various forms of discriminant analysis (DA) (Friedman 1989, Petrosyan 2014, Shitikov and Mastitsky 2017 (linear discriminant analysis -LDA and regularised discriminant analysis -RDA, flexible discriminant analysis -FDA). The second type includes methods based on artificial neural networks, in particular, probabilistic neural network (PNN) (Specht 1990, Petrosyan 2014, Mohebali et al. 2020. The use of different methods of DA is due to the fact that these methods have different requirements for the dataset, in particular, the LDA method assumes that the data distribution in each class is normal and that the intragroup variance and correlation matrices are equal. Another DA method, RDA, builds a species differentiation rule, based on information features by regularising group covariance matrices, allowing a more reliable model to be created taking into account the multicollinearity of the data. This technique is usually useful for large multidimensional data containing highly correlated predictors. The third modification method of DA is FDA, which is a flexible extension to LDA that uses non-linear combinations of predictors, such as splines. The FDA is known to be useful for modelling multidimensional abnormal and nonlinear relationships between variables in each group, resulting in a more accurate classification. It is important to note that DA methods work only with quantitative traits. In our case, there is one quality feature (the number of host plant species). For this reason, PNN is used to assess the informative nature of this trait for the separation of species.
To determine the set of linear discriminant functions (LDF) separating species, we used a stepwise discriminant analysis procedure with the threshold value of the inclusion of variables F = 4 (Petrosyan 2014). A quantitative assessment of the distinguished discriminant functions was carried out using a set of parameters, including eigenvectors, coefficients of canonical correlations for discriminant functions, Wilkes statistics, С hisquare statistics, as well as P-values of testing hypotheses for the significance of the separation of species (Petrosyan 2014). After the selection of informative features, the quality of the LDA model was assessed using a leave-one-out CV-procedures (Shitikov and Mastitsky 2017). For other forms of discriminant analysis in FDA and RDA, we also used a cross-check procedure for the final assessment of the quality of the models (Shitikov and Mastitsky 2017).
In the present study we use PNN, which has four layers: input, pattern, summation and output (Specht 1990, Petrosyan 2014, Mohebali et al. 2020. The task of the input layer is to distribute the input layer data for the pattern layer. The number of input signals is equal to the number of variables that define the samples (individuals). The pattern layer has one element for each sample (individual) from the training dataset. The input layer and the sample layer form a fully-connected structure. The output layer consists of seven neurons that determine whether an individual belongs to one of the seven species based on voting, taking into account the signals received from the summation layer. Effective methods for constructing PNNs are presented in the literature (Specht 1990, Petrosyan 2014, Mohebali et al. 2020. In this PNN, information moves in only one direction -forward from the input nodes (neurons) through the pattern layer to the output layer (neurons). As input variables of the network, we used all the quantitative variables that were used in the LDA, FDA and RDA methods and, additionally, we added the variable plant. The assessment of the classification accuracy was determined by averaging the accuracy of a 24-fold repetition of the network models. At each run, 92 individuals were randomly used as training and four individuals for testing (Petrosyan 2014). The small amount of selection of individuals to test the accuracy of the PNN-models is due to the fact that two species M. rufa and M. laeta are represented by a limited number of some features, three and four, respectively.
All assessments with the method of discriminant analysis were carried out in the RStudio v. 1.4.1106 using basic, special R-packages (MASS, klaR, mda, tidyverse, caret, dplyr, FactoMineR) and additional programmes in the R language. Probabilistic neural networks were created using the Biosystem office (Petrosyan 2014).

Assessment of species diversity, based on genetic analysis
Сlustering of Meromyza species, based on the mtDNA CO1 gene, revealed two close clusters. The "variegata" cluster included six species and the "inornata" cluster included only the single species because, currently, we have no other allied species.
There are three branches in the "variegata" cluster: 1). M. variegata and M. laeta, which are closer to the hypothetical ancestor of the cluster; 2). M. mosquensis; and 3). the more recent species M. femorata, M. rufa and M. bohemica (Fig. 1). The timing of the divergence of M. inornata and the species of the "variegata" cluster is shown in

Assessment of species diversity, based on external features and male genitals.
Based on the combination of morphological features, we include the seventh species M. elbergi into the " variegata" cluster. We can reveal five trends in the evolution of the " variegata" cluster, based on the external key features (Table 1): Table 2.
The divergence of M. inornata and the species of the "variegata" cluster.
The role of plants in the formation of species-specific features in grass ...

1.
Pale colour of the mid-stripe and occipital spot, but bright occipital strips ( Table 1). The shape of the anterior process of the postgonite is similar in M. variegata and M. inornata, but different considering the degree of curvature of the upper and lower contours. In M. laeta, the shape of the anterior process of the postgonite is acuminate. In other species, the anterior process of the postgonite is curved forward (Fig. 1  b).

M. mosquensis
The time of speciation of the grasses and the time of divergence of lineages ( †) is given according to Pimentel et al. (2017), − -no data, ? -supposedly.

Discriminant analysis results
Using a stepwise selection algorithm, it was determined that six variables (S, h1, h2, L1, L and h4) were significant predictors of species (Tables 4, 5). The order of these variables is given by the importance of their inclusion in the LDA model. The stepwise selection algorithm in LDA showed that four discriminating functions have P-values less than 0.05, i.e. are statistically significant at the 95.0% confidence level (Table 5). In our case, although four functions are statistically significant, nevertheless, the first two functions account for Table 3.
The speciation of host plants of species of the «variegata» cluster.
The role of plants in the formation of species-specific features in grass ... the overwhelming majority of the separation of species, i.e. Wilkes' statistics, the relative percentage of discrimination and the canonical correlation coefficient for the first two functions are 0.003, 85.79%, 0.98 and 0.82, 9.26%, 0.86, respectively (Table 5). For the other two functions, the relative percentages of discrimination are 3.4% and 1.23%, respectively. Canonical correlation coefficients indicate that each subsequent discriminant function contributes less to discrimination than the previous one (Table 4). Unlike the first discriminant function, which has a canonical correlation coefficient of 0.98, for the fifth, this coefficient is 0.28. The classification accuracy using the four LDA functions is presented in Table 6.  Results of the stepwise selection algorithm in the LDA model, including eigenvectors, the relative contribution of each function to species differentiation and the coefficients of canonical correlations. Table 5.
Characteristics of the statistical significance of the separation of seven species of flies within the selected LDA model.  The application of the leave-one-out CV procedure showed that the selected LDA discriminant functions allow species classification with an accuracy of 95.83% (Table 6). Species projection onto the first linear discriminants LD1 and LD2 is shown in Fig. 2.
The use of other methods of DA in the form of RDA and FDA using six traits (S, h1, h2, L1, L and h4) showed that the classification accuracy of these methods does not improve significantly. In the RDA methods of classification, the numbers of correct and incorrect classification of individuals are 94 and 2, respectively (Table 7), i.e. the classification accuracy is 97.92%. Regularised discriminant analysis projection on the first (RDA1) and second (RDA2) canonical axes is shown in Fig. 3. The highest classification accuracy is achieved using the FDA method. In the FDA method, the numbers of correct and incorrect classification of individuals are 95 and 1, respectively (Table 8), i.e. the classification accuracy is 98.96%. Flexible discriminant analysis projection of seven species on the first (FDA1) and second (FDA2) canonical axes is shown in Fig. 4.  The results of using a PNN.
The general architecture of a PNN, which was used to differentiate seven species, based on quantitative (6) and qualitative (1) features, is shown in Fig. 5. Application of PNN to the seven species of flies showed that the use of a qualitative variable plant further improves the classification results.    The conducted 24 realisations of PNN models showed that the classification accuracy on the training samples is 98.5% (± 0.3) and on the verification (testing) samples is 99% (± 1.1). To visualise the species differentiation, as an example, Fig. 6 shows the areas of change in the values of traits of individuals of seven species, which are used by PNN to differentiate individuals. These diagrams represent the areas of change in the values of features that characterise the most informative variables h1, S and Plant. Fig. 6A shows that each species in the plane S and h1 has its own characteristic area, determined by the ranges of variation of the variables S and h1. It is important to note that the diagram in Figure 6A is divided into seven parts, i.e. each species is characterised by a certain area of definition of the variables h1 and S. However, this statement is not fulfilled in terms of informative features of Plant and S (Fig. 6B). In the plane of these variables, the areas for six species, with the exception of M. inornata, are clearly distinguished. This diagram shows that the area of variation for Plant and S traits for M. inornata overlaps with that for M. variegata (see Fig. 7).
In the plane of informative features for Plant and h1, the areas of variation of these variables are presented for five species (Fig. 6C), with the exception of M. laeta and M. rufa. This diagram shows that, with the average values of the remaining variables, the area of variation for Plant and h1 traits for M. bohemica overlaps with that for M. laeta and M. rufa (Figs 6, 7), which indicates close values of Plant and h1.
To check the differentiating importance of the Plant trait in PNN, we built another network with six traits without Plant trait. Assessments, based on 24 realisations of PNN models using six features, showed that the classification accuracy on training samples is 84.8% (± 0.4) and, on testing samples, is 90% (± 2.7). When this trait is excluded, the classification accuracy decreased by 14%.
Typical errors in the classification of individuals of the species when using different methods of DA are presented in Tables 6-8  Three-dimensional scatter plot of seven species individuals in the space of three informative features (S, h1 and Plant).

Discussion
Based on the molecular clock of insect mtDNA CO1, the divergence rate is about 1.5 -4% per one million years (from 0.0075 to 0.012 substitutions per site (Rowan and Hunt 1991, Brower 1994, Jamnongluk et al. 2003  . Based on our estimates of evolution rate for the above gene, we can conclude that divergence of herbs was before the speciation of grass flies Meromyza. For example, the speciation time of host grass Lolium perenne L. (Table 3) exceeds the speciation time of the youngest species of the cluster, M. bohemica by ten times.
The adaptation to phytophagy in some dipterans was probably linked to climatic deterioration in the Neogene and the formation of new trophic connections. Tamura et al. demonstrated the correlation between the species evolution, the lowering temperature of the paleoclimate and the fragmentation of habitat in the Cenozoic using the Drosophila group as a model for the analysis (Tamura et al. 2004). Russo et al. assumed utilisation of the variety of new fruits of flowering plants as one of the possible factors in the speciation in the Drosophila group, resulting in Drosophila specialisation and their ecological diversity (Russo et al. 2013). We concluded that similar patterns of speciation observed in grass flies due to the drying of the climate at the end of the Miocene and were related less to formation of grass host plants than to the increase in plant abundance. The adaptation of fly larvae to other species of host grasses, especially to the species of another Subtribe, could be a factor triggering a new speciation in the "variegata" cluster. The change of host plant was important in speciation of some dipteran from Tephritidae: Eurosta solidarines Fitch, 1855(Craig et al. 1993, Stireman et al. 2005 and Rhagoletis pomonella Walsh,1867 (Forbes et al. 2005 The earliest by origin, Supertribe Triticodae includes Elymus sp. grasses. According to Tsvelev (Tsvelev 1987), an unusually wide variability of hybrid grass species, including Elymus repens (L.) Gould, contributed to rapid evolution of the species. Development on evolutionarily more ancient and variable grass species suggests that grass flies should have morphological, physiological and behavioral features close to those of the ancestral species of the cluster, which were wide oligophages. Two oligophage species meet this criteria: M. variegata is close to ancestral haplotype and M. mosquensis represents a separate branch from the ancestral haplotype (Tables 1, 2, 3, Fig. 1).
From the results indicated above, a connection can be assumed between the early origin of plants and their species diversity and the distribution and formation of initial groups of species of grass flies. Climate change and variability of ecological conditions has resulted in the divergence of grass flies, as they adapt to evolutionarily younger, more abundant grasses with expanded distribution ranges. Herbert et al. demonstrated that the reproductive isolating barriers in Phytomyza glabricola Kulp, 1968 (Diptera: Agromyzidae) were associated with different host plants (Herbert et al. 2013).
In general, preferences of grass fly feeding are independent from the ecological characteristics of grasses, except for M. rufa which feeds on firm bunchgrass, M. laeta -on short grasses and M. variegata -on tall and semi-tall grasses. Tall grasses (Elymus repens ), unlike short grasses, are characterised by high shoots, large and raw stems and leaves, as well as low stooling. M. variegata and M. mosquensis are more xerophilic, which may be more consistent with an early origin.
Various methods of statistical analysis demonstrated the set of the most informative differentiating traits of studied grass flies which allowed to differentiate the species with an accuracy of 95 -99%. The highest classification accuracy is achieved when using PNN (99%) and the lowest when using LDA (95.8%). The accuracy of classification using RDA and FDA is 97.9% and 98.96%, respectively. The outcomes from different methods of analysis lead to the conclusion that the most important differentiating features of species in the "variegata" cluster are the traits S, h1 and Plant.
Input data in PNN analysis without Plant trait decreases the classification accuracy indicating the importance of the differentiating role of the trait Plant in this group of flies. Obviously, the contribution of the Plant trait is not limited only to the number of plant species because each species of grass flies in each evolutionary lineage of "variegata" cluster is associated to a specific host plant for its development (Table 3). It is an indication of importance of original plant species in the evolution of grass flies, besides the number of species of host plants.
We think that the area of the anterior process of the postgonite is the most significant criterion of species division. The trends of shape change of the anterior process are less visible than the change of its size (Table 1, Fig. 1B). The Mahalanobis distance calculation confirmed the uniformity of the shape of the anterior process of postgonite both between the "variegata" and "inornata" clusters and within the "variegata" cluster (M. mosquensis -M. variegata, M. femorata -M. variegata) . However, the nonoverlapping dimensions of the anterior process indicate its important role in reproductive isolation.
Possibly, geographic isolation between the East Asian M. inornata and the species of the " variegata" cluster resulted in similarity in shape and size of the anterior process of the postgonite in M. inornata and M. variegata (Table 1).
The revealed trends in the evolution of the "variegata" cluster, based on the external key features, may be correlated with different adaptations of grass flies to the environment. Pale colour of mid-stripes and occipital spot, but bright occiput stripes (except for M. mosquensis and M. elbergi), may be due to adaptation of Meromyza flies to the colour range of the host plants. The considerable thickening of hind femurs may be an adaptation of Meromyza flies mainly to the jumping motion along the stems as a characteristic of the adult behaviour of the genus Meromyza. Black setae on the lower surface of gena are the characteristic of the West European origin of some species . Many features, similar to those of M. inornata, have been revealed for M. mosquensis and, to a lesser degree for M. elbergi (Table 1), suggesting that these features have been rooted in a common hypothetical ancestor of both clusters.
Lacking genetic data for M. elbergi, we can use only external features and the structure of the postgonite for reconstruction of this species phylogenetic position in the cluster (Fig.  1A). The first hypothesis is that M. elbergi could originate from the M. femorata "lineage" since it is close to M. rufa, M. bohemica by postgonite structure. The second is that M. elbergi could originate from the M. mosquensis lineage, since it has a combination of external key features inferred for the ancestral haplotypes of the clusters (M. inornata -M. mosquensis).

Conclusions
This study highlights that the adaptation to different host plants could be the main factor in divergence of grass flies of the "variegata" cluster. We think that formation of the specific set of external features, which are rather uniform within a cluster, was associated with the developing of grass flies in similar conditions of the grass biome. Stabilising selection for a set of species external feature resulted in the formation of differences in structure and size, with insignificant change in shape and specific features of the male genital apparatus. The increased availability of host plants could be directly connected to the distribution, diversification and the adaptation of these grass flies to narrow oligophagy, but the development of these changes started later than the diversification in the Pooideae subfamily of grasses and the distribution of these grasses in biomes in the Middle and Late Miocene.