The phylogeny of Empis and Rhamphomyia (Diptera, Empididae) investigated using UCEs including an over 150 years old museum specimen

The genera Empis Linneus, 1758 and Rhamphomyia Meigen, 1822 (Empidoidea, Empididae Latreille, 1809) are two large genera of flies commonly named dagger flies. They are widely distributed in the world with most species described from the Palearctic Region. Empis comprises about 810 described species and Rhamphomyia comprises about 610 described species, together they represent one third of the known species diversity in Empididae. Two recent studies on the phylogeny of the two genera using Sanger sequencing on a few genetic markers, did not support monophyly of them. In this study high throughput sequencing of target enriched molecular data of ultraconserved elements or UCEs was used to investigate the phylogenetic relationships of included representatives of the genera. This method has proven useful on old and dry museum specimens with high amounts of degraded DNA, which was also tested herein. For this purpose, a commercially synthesized bait kit has previously been developed for Diptera which this study was the first one to test. Three out of nine old and dry museum specimens were successfully sequenced, one with an age of at least 154 years. Higher DNA concentration yielded a greater number of reads. Analyses conducted in the study confirmed that both Empis and Rhamphomyia are non-monophyletic.


Introduction
The family Empididae Latreille, in the superfamily Empidoidea Latreille, commonly known as dagger flies, is a family within Diptera consisting of around 3 051 known species in the world (Roskov et al. 2019). Dagger flies gets their vernacular name from the long and dagger-like piercing mouthparts. An older name is dance flies; however, this name is today assigned to the family Hybotidae Fallén, in the same superfamily. In both Empididae and Hybotidae many species form swarms where a typical mating ritual, which is perceived as a dance, takes place. Members in the subfamily Empidinae Latreille, constitute a high interspecific variation in mating rituals. (Cumming 1994;LeBas et al. 2003;Murray et al. 2018). The empidid tribe Empidini Latreille is highly diverse and consists of 14 genera spread all over the world, with a particular diversity in the Neotropical Region (Wiegmann et al. 2011). The two most species-rich genera in the tribe are the sister groups of Empis Linneus, 1758 and Rhamphomyia Meigen, 1822, and the majority of species of these two genera are described from the Palearctic Region (Watts et al. 2015). Empis constitute about 810 described species and Rhamphomyia about 610 described species according to 2019 Annual Checklist (CoL). Together they represent more than one third of all known Empididae species (Roskov et al. 2019). In an attempt to obtain a better overview of the diversity of the two genera several subgenera have been established. However, there is no clear number on how many subgenera there are but approximately 24 are given for Empis and 18 for Rhamphomyia (Chvála 1994;Poole and Gentili 1996;Yang et al. 2007;Saigusa 2012;Evenhuis and Pape 2019).
Previous studies have indicated that Empis and Rhamphomyia are non-monophyletic (Watts et al. 2016;Wahlberg and Johanson 2018). The study by Watts et al. (2015) found the genera to be polyphyletic, and that there is a Neotropical linage of Empis more closely related to the tribe Hilarini Collin, 1961, sister to Empidini. It was also hypothesized that further studies with additional sampling from the Palearctic and Nearctic regions and a larger molecular data set is necessary to resolve the phylogenetic relationships. Wahlberg and Johanson (2018) found Empis to be monophyletic except by two species of Rhamphomyia nested within it, implying non-monophyly of Rhamphomyia, but far from all subgenera were represented in the study and the specimens were mainly representatives of the Palearctic Region. The genera Empis and Rhamphomyia have several morphological resemblances involving a small head with large eyes, elongated mouth parts, long legs and an elongated abdomen with a high interspecific variability in male genitalia. The morphology of females has been much less studied compared to that of the males, and identification keys generally rely on male characters (Chvála 1994). Bothgenera can be morphologically distinguished from each other by the possession of a forked vein R 4+5 in the wings in Empis, a feature lacking in Rhamphomyia. In addition, Empis species have much longer mouthparts compared to Rhamphomyia (Chvála 1994;Watts et al. 2016).
Former studies on the phylogeny of Empis and Rhamphomyia focused on either morphological or molecular data from traditional Sanger sequencing on only few genetic markers (Chvála 1994;Watts et al. 2016;Wahlberg and Johanson 2018. In this study we use a high throughput sequencing (HTS) method named target enrichment. This method allows for hundreds of genetic markers to be analysed at a less time and money expense per marker and specimen compared to the traditional Sanger sequencing. The targeted genetic markers focused on herein are called UCEs, ultraconserved elements. Which has proven useful for resolving phylogenies of less distant taxa of insects (Faircloth et al. 2012;Faircloth et al. 2014;Blaimer et al. 2016;Ješovnic et al. 2017;Van Dam et al. 2017). Another advantage of using this method is that it is useful on old and dried specimens which otherwise can pose a problem when using traditional Sanger sequencing due to the fragmentation of old DNA (Blaimer et al. 2016). For this purpose, commercially synthesized bait set which is complementary to the targeted UCEs sequences can be used to collect UCE data and its highly variable flanking regions adjacent to the UCE loci. (Faircloth et al. 2012). Until this point UCEs have not yet been tested on dipterans, this study is the first to evaluate the application of UCEs on Diptera (Ultraconserved 2017). Amplification of the cytochrome oxidase subunit I (COI) barcode gene is also performed in order to evaluate if the extractions went well and to validate species determinations in The Barcode of Life Database (BOLD) (Ratnasingham and Hebert 2007). In this study we include both relatively new material kept in ethanol sampled between 1995 and 2015 and older, dry museum specimens sampled between the years 1843 and 1993. The molecular data from the high throughput sequencing is used to investigate the phylogenetic relationship of the genera Empis and Rhamphomyia.

DNA extraction and COI barcode amplification
For DNA extraction the KingFisher™ Cell and Tissue DNA Kit (Thermo Scientific, USA) was used together with KingFisher™ Duo (Thermo Scientific, USA) extraction robot following the manufacturer's protocols. For extractions of large specimens one leg was removed from the body, for medium sized specimens the abdomen was removed and for small ones the whole animal was used. Lysis was performed in 56 °C overnight. After extraction the body part was returned to the specimen. For pinned material only oneg was used, due to restrictions from the museum. All extracted material is kept in 80% ethanol as vouchers at the Swedish Museum of Natural History (NHRS). For pinned material a voucher code was attached to the pin and the specimen was returned to the collection. Amplification of the COI barcode gene was performed for all samples in order to evaluate if the extractions went well and to validate species determinations. The PCR reactions were carried out with a 25 µl reaction containing Ready-To-Go PCR Beads (Amersham Biosciences, Great Britain) and 1 µl of each primer, 2 µl DNA template and 21 µl ddH 2 0 for each sample. The primers used were LCO1490 and HCO2198 (Folmer et al. 1994). The amplification program included 95 °C for 5 min, followed by 40 cycles of 95 °C for 30 sec, 50 °C for 30 sec and 72 °C for 50 sec and a final step at 72 °C for 8 min. To determine whether the amplification was successful the PCR product was inspected using gel electrophoresis. The successful PCR products were purified using Exo-Fast (Qiagen, Germany), and then sent to Macrogen Inc (Netherlands) for Sanger sequencing. Barcode sequences were deposited at NCBI GenBank, accession numbers are given in Table 1.

Library construction and target enrichment
The DNA concentration and fragmentation in each extraction of the nine old and dry samples was measured using Qubit (Thermo Fischer Scientific, USA) and BioAnalyzer (Agilent, USA). The samples from newer specimens were measured and compared to the older samples. The newer samples were fragmented on a Covaris sonicator, at SciLife Lab (Solna), to the target fragment length of 500-600 bp. Libraries for each sample were constructed following a modified version of the Meyer and Kircher (2010) protocol for Illumina sequencing, using magnetic AMPure beads for cleaning steps. The modified protocol does not contain the step of fragmentation and purification of sample DNA and the temperature profile of the PCR reactions is slightly different. This modification has been developed in house to fit museum samples. The protocols used for library preparation and amplification following hybridization are available in Meyer and Kircher (2010). Adapters used were IS1_adapter_P5.F, IS2_adapter_P7.F and IS3_ adapter_P5+P7.R. The libraries were amplified with dual index primers. Before hybridization step the DNA concentration was measured again and fragment size distribution inspected on BioAnalyser. Size selection and purification of libraries was carried out using AmPure XP (Agencourt, France) beads, with a 1.8X ratio, and thereafter pooled in equimolar amounts into 8 pools for the following hybridization step. The hybridization was conducted following the myBaits Hybridization Capture for Targeted NGS version 4.01 protocol and the myBaits UCE Diptera 2.7Kv1 baits kit constructed by Faircloth (2017) was used (synthesized by Arbor Biosciences, USA). The hybridization was conducted in 65 °C for 18 hours. KAPA HiFi HotStart was used for library amplification with the primers IS5_re-amp.P5 forward library primer (10 μM) and IS6_reamp. P7 reverse library primer (10 μM). After hybridization the 8 pools were pooled into one pool in equimolar amounts and sequenced on an Illumina MiSeq v3 2x300bp pair-end platform at SciLife Lab (Solna, Sweden). Raw reads were deposited at NCBI Sequence Read Archive (SRA) as a Bi-oProject, accession number PRJNA596621.

Assembly and alignment of UCE sequences
Demultiplexed reads were quality checked and filtered using the pre-processing tool fastp (Chen et al. 2018) with standard settings and base correction for paired end data. Using the base correction for paired end data also merged forward and reverse reads in one step. Assembly of paired reads were conducted using METASPADES (Nurk et al. 2017). The extraction of UCE sequences, alignment, cleaning and preparation of UCE data followed the PHYLUCE pipeline by Faircloth (2016). The extraction of UCE data was performed using the Diptera 2.7Kv1 probes (Faircloth 2017). Nucleotide based alignment was carried out in MAFFT v7 (Katoh and Standley 2013) with no trimming. Edge and internal trimming of the alignments was conducted outside the pipeline with TRIMAL v.1.2 (Capella- Gutierrez et al. 2009) to remove poorly aligned or ambiguous sites. The alignments were optimized prior to the phylogenetic analysis by finding the best partitioning scheme and substitution models. To create a table of partitions for UCE data PFINDERUCE-SWSC-EN v1.0.0 (Tagliacollo and Lanfear 2018) was used to identify the conservative core and variable flanking regions. Partition scheme and substitution model test was performed in PARTITIONFINDER v2.1.1 (Lanfear et al. 2012) with the options for rcluster and RAxML algorithms. A final dataset of 70% completeness was created for further phylogenetic analysis.

Phylogenetic analysis
Bayesian inference was performed on the partitioned dataset using MRBAYES v3.2.6 (Huelsenbeck and Ronquist 2001;Ronquist and Huelsenbeck 2003) and set with the substitution models generated by PartitionFinder for each partition. The following settings were used for the analysis; 4 chains, 2 runs, 100 million generations, sampling frequency of 10 000, the temperature was set to 0.11, burnin of 25%. The log files were inspected in TRACER v1.

Libraries, alignment and partitioning
Of the 48 specimens ten lack sufficient reads or target loci and were removed from the analyses. Five out of the nine old and dry samples were successfully aligned but only three of them, E. picipes (AE2C), E. hyalipennis (AE5C) and R. gibba (AF1C), with a satisfactory amount of data.
In 35 of the 38 samples there was an increase in DNA concentration after library amplification. DNA concentration for the old and dry samples in the study range between 0.626 and 1.09 ng/µl before the library amplification. For newer material stored in ethanol the concentration range between 0.198 and 11.0 ng/µl. Following library amplification DNA concentration of the old samples range between 1.58 and 13.9 ng/µl and for the newer 0.612 and 46.8 ng/µl. The number of reads for each of the 38 specimens varies between 38 000 and 3 000 000. Seven out of the ten poorly sequenced samples have a lower number of reads than the rest, ranging from 24 to 28 000. However, three of them have reads ranging between 80 000 and 176 000. The measured DNA concentration before and after library construction is depicted for each taxon in Appendix 1 including the ten excluded samples. The dataset has 41 out of 48 specimens with enough loci represented. Six of the seven excluded specimens were old and dry. The number of aligned loci were 15, the alignment length was 7 394 bp and the number of informative sites were 1 900 bp. The number of partitions for the dataset was 21. Detailed partition schemes with models chosen for the partitions are summarized in Appendix 2.

Phylogenetic analysis
The phylogenetic analysis of the dataset generated trees with a total of 38 taxa; 36 ingroup taxa and two outgroup taxa (Figs 1, 2). The two specimens E. grisea (AE3C) and E. nigricans (AD9C) were old and dry museum samples and were removed from the analysis because their sequences were short and has long gaps and generated very long branches in the phylogenetic trees. The ESS-values range between 4800 and 7500 for separate runs, and 2700 to 7500 for combined runs. The two trees have some differences in topology; however, the Bayesian inference-tree has a higher support in general. In the Bayesian inference-tree ( Fig. 1) 33 nodes have a posterior probability support above 93%. In the Maximum likelihood-tree (Fig. 2) 12 of the most recent nodes have a bootstrap support above 85. Empis and Rhamphomyia were divided into multiple well supported monophyletic groups scattered in the tree, leaving both non-monophyletic. The clades A, B and C marked in blue in Fig. 1 contain both genera. Clades A, B and C have high (> 93%) posterior probabilities in the Bayesian inference-tree, in the Maximum likelihood-tree Clade A has a support value of 100, Clade B has a support of 41 and Clade C is not present. In the Bayesian inference-tree Clade B includes five Empis species and three Rhamphomyia species with a support of 0.93.

Discussion
The application of target enrichment of UCEs on old and dry museum samples of Diptera was in general successful. Three of the old samples were sequenced E. (Euempis) picipes (AE2C), R. (Rhamphomyia) plumipes (AE6C) and R. (Amydroneura) gibba (AF1C) with sufficient data, but two specimens, E. (Leptempis) nigricans (AD9C) and E. (Leptempis) grisea (AE3C), had too little data coverage. When performing Sanger sequencing of COI none of the old and dry specimens were recovered. This was expected because old and dry samples usually have highly degraded DNA, and to conduct a successful PCR, preceding Sanger sequencing, it is necessary with sufficient good DNA quality (Lindahl 1993;Junqueira et al. 2002). A recently conducted study applying target enrichment using UCEs succeeded in analysis of a 121 year old museum specimen of carpenter bee (Blaimer et al. 2016). In this study R. (Amydroneura) gibba (AF1C) was sampled by Carl H. Boheman (1796-1868), either in 1852 or 1865, so this sample is at least 154 years old. The specimens of E.
(Euempis) picipes (AE2C) and R. (Rhamphomyia) plumipes (AE6C) were sampled in 1991 and 1993 respectively. This result shows that there is a great possibility to utilize the DNA of the immense collections of old and dry specimens of Diptera at natural history museums in the world. The average DNA concentration in this study (2.26-12.94 ng/µl) were within the same range as in Blaimer et al. (2016), with old samples (1.11-14.67 ng/µl,), whereas in Ješovnic et al. (2017), a study with newer samples, the DNA concentration were in a higher range (3.87-79 ng/ µl). Blaimer et al. (2016) found that DNA concentration and number of reads of old pinned specimens decreased with an increasing age of the sample. Most of the old samples in this study did not express a large increase in DNA concentration after library amplification and the number of reads were low compared to the majority of the newer samples. Comparing the number of reads of the 38 samples (38 000-3 000 000) to three other studies the samples range start from a much lower number. Samples in Blaimer et al. (2016) ranged between 70 256-3 479 137, samples in Ješovnic et al. (2017) ranged between 299 485-3 500 409, in Van Dam et al. (2017) samples ranged between 1 716 890-31 283 213. As stated above, these studies also had a generally higher DNA concentration than our specimens. The studies by Van Dam et al. (2017) and Ješovnic et al. (2017) did not have any samples as old as in the current study, which might explain the difference. But it could also be due to the amount of tissue sampled, from the old samples we were only allowed to extract one leg of quite small specimens. Possibly a larger amount of tissue would increase the DNA concentration and thereby the number of reads. Other factors that might affect DNA concentrations and fragmentations are extraction protocols. This is the first study conducted on dipterans using target enrichment of UCEs. Further development of specially DNA extraction protocols might refine the methodology.
The phylogenetic analyses adopting Bayesian inference inferred a tree with high support values (Fig. 1). The genera were widely scattered in the tree which contradicts the hypothesis that the genera are monophyletic. What strengthens the non-monophyly even more is the three clades A, B and C depicted in the tree (Fig. 1). In clade B there is a subclade of Rhamphomyia sp. 3 (AF7C) and Empis sp. 6 (AH2C) with short branches, these two species belong to the Rhamphomyia subgenus Aclonempis and Empis subge-nus Coptophlebia. These subgenera have been discussed by Chvála (1994), who suggested their monophyly if including the subgenus Empis s. s. All sampled individuals belonging to Empis s. s., Coptophlebia and Aclonempis are in this study grouped in clade B. Chvála (1994) has also stated that the Empis subgenus Lissempis is more closely related to the Rhamphomyia subgenus Lundstroemiella than to any other Empis subgenus. In clade C the Empis subgenus Lissempis is more related to the Empis subgenus Xanthempis but is sister group to Lundstroemiella. The high posterior probability values in the Bayesian inference-tree confirms the non-monophyly of the genera previously suggested by Watts et al. (2016) based on analyses of Sanger sequenced data. In the study by Watts et al. (2016) geographic distribution was taken into account and it was found that there are two linages, one linage with Palearctic + Nearctic Empis and Rhamphomyia and one linage with Neotropical Empis. The two linages were recovered as sister groups and Neotropical Empis was more closely related to the Empidini genera Lamprempis Wheeler & Melander, 1901, Opeatocerata Melander, 1928, Macrostomus Wiedemann, 1817and Porphyrochroa Melander, 1928 than to the other linage. Our sampling is mainly Palearctic; Sweden, Russia, Japan, Greece, but with additional taxa from French Guiana, Chile and New Caledonia. One of the two Neotropical species, the Chile-species Empis sp. (AG9C), is placed as a sister group to all other taxa except three Palearctic species R. (Amydroneura) gibba (AF1C), Empis sp. (AF2C) and E. (Planempis) latro (AF4C). This corresponds to the findings by Watts et al. (2016).
The species within the two genera are morphologically quite similar, and the traditional characters used to distinguish the genera are the wing venation and mouthpart length. However, there are exceptions. For example, species in the Rhamphomyia subgenera Aclonempis and Vockerotempis, Saigusa 2012 possess a long labrum much like those found in Empis species. Another important fac-tor affecting stability of classification based on wing venation is that of intraspecific variation, even within the same exemplar, i.e. one wing having a R 4+5 fork and the other lacking the fork (Chvála 1994). Such a case was found in this study, the species E. (Empis) planetica (AA7C) (Fig. 3). This species was placed as a sister taxon to E. (Empis) nuntia (AB5C) in clade B in the Bayesian inference and Maximum likelihood-tree. This raises the question of how reliable these morphological traits are for separating the genera and species. Lastly, another interesting finding is that the two Chelipodini species representing the outgroup together with Hilara are both nested within Empidinae as a sister group to all other species of the subfamily except the Chilean species and three Palearctic species R. (Amydroneura) gibba (AF1C), Empis sp. (AF2C) and E. (Planempis) latro (AF4C). Rooting on these two groups respectively did not change the tree topology. The trees were rooted on Hilarini, as according to the latest research of the superfamily Empidoidea which concluded that Chelipodini is more closely related to Empidini than Hilarini is (Wahlberg and Johanson 2018). The previous study by Watts et al. (2016) suggest that Hilara is a sistergroup to Empis and Rhamphomyia. However, the support for the placement of Chelipodini in this study is low, 0.64.

Conclusion
The first-time application of the Diptera 2.7Kv1 probe kit (Faircloth 2017) for target enrichment using UCEs was successful regarding inferring phylogenies with high support in the Bayesian inference analysis. Sequences of five out of nine old museum samples were successfully aligned, however only three were good enough to be used in a phylogenetic analysis. For future studies we suggest increasing the tissue sampling on old material of Diptera to increase the chances of higher DNA concentration. However, this result shows that it is possible to use the immense collections of old and dry Diptera samples for DNA studies. The application of this technique can reduce the sampling of new specimens which would be beneficial for the biodiversity. Exchange of specimens between natural history museums, universities and other collections can provide researchers with specimens from all over the world. The analyses performed well in this study and inferred Empis and Rhamphomyia as non-monophyletic. This corresponds with the studies of Watts et al. (2016) and Wahlberg and Johanson (2018) using Sanger sequencing (;). Table A1. DNA concentration before and after library construction and number of reads for each taxon. Old referring to pinned museum specimens sampled before year 1993, new referring to specimens kept in ethanol and sampled after year 1995. Specimens below lines of average, min and max, refers to specimens excluded from the study.