Distributed System of Scientific Collections
Leiden, The Netherlands
Research output, citation impact, and the most-cited recent papers from Distributed System of Scientific Collections. Aggregated across the NobleBlocks index of 300M+ scholarly works.
Top-cited papers from Distributed System of Scientific Collections
BiCIKL is a European Union Horizon 2020 project that will initiate and build a new European starting community of key research infrastructures, establishing open science practices in the domain of biodiversity through provision of access to data, associated tools and services at each separate stage of and along the entire research cycle. BiCIKL will provide new methods and workflows for integrated access to harvesting, liberating, linking, accessing and re-using of subarticle-level data (specimens, material citations, samples, sequences, taxonomic names, taxonomic treatments, figures, tables) extracted from literature. BiCIKL will provide, for the first time, access and tools for seamless linking and usage tracking of data along the line: specimens > sequences > species > analytics > publications > biodiversity knowledge graph > re-use.
Persistent identifiers (PIDs) that unambiguously and uniquely identify digital representations of physical specimens in natural science collections (i.e., digital specimens) on the Internet are one of the mechanisms for digitally transforming collections-based science. Digital Specimen PIDs contribute to building and maintaining long-term community trust in the accuracy and authenticity of the scientific data to be managed and presented by the Distributed System of Scientific Collections (DiSSCo) research infrastructure, planned in Europe to commence implementation in 2024. Not only are such PIDs valid over the very long timescales common in the heritage sector, but they can also transcend changes in the underlying technologies of their implementation. They are part of the mechanism for widening access to natural science collections. DiSSCo technical experts previously selected the Handle System as the choice to meet core PID requirements. Using a two-step approach, this options appraisal captures, characterises and analyses different alternative Handle-based PID schemes and the possible operational modes of use. In a first step, a weighting and ranking of the options was applied, followed by a structured qualitative assessment of social and technical compliance across several assessment dimensions: levels of scalability, community trust, persistence, governance, appropriateness of the scheme and suitability for future global adoption. The results are discussed in relation to branding, community perceptions and global context to determine a preferred PID scheme for DiSSCo that also has potential for adoption and acceptance globally. DiSSCo will adopt a ‘driven-by DOI’ PID scheme customised with natural sciences community characteristics. 
Establishing a new Registration Agency in collaboration with the International DOI Foundation is a practical way forward to support the FAIR (findable, accessible, interoperable, reusable) data architecture of the DiSSCo research infrastructure. This approach is compatible with the policies of the European Open Science Cloud (EOSC) and is aligned to existing practices across the global community of natural science collections.
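The relationship between Handle-based and DOI-based identifiers described above can be illustrated with a short sketch. It relies on two general facts: Handle-style identifiers have the form prefix/suffix, and DOI prefixes begin with "10.". The public resolvers at doi.org and hdl.handle.net are real; the function name parse_pid and the sample identifiers are made up for illustration and are not DiSSCo identifiers.

```python
def parse_pid(pid: str) -> dict:
    """Split a Handle-style identifier into prefix and suffix and build a resolver URL."""
    prefix, separator, suffix = pid.partition("/")
    if not separator or not prefix or not suffix:
        raise ValueError(f"not a prefix/suffix identifier: {pid!r}")
    # DOIs are Handles whose prefix starts with "10."; other prefixes are plain Handles.
    is_doi = prefix.startswith("10.")
    resolver = "https://doi.org/" if is_doi else "https://hdl.handle.net/"
    return {
        "scheme": "doi" if is_doi else "handle",
        "prefix": prefix,
        "suffix": suffix,
        "url": resolver + pid,
    }


# Made-up identifiers, used only to exercise the function.
doi_example = parse_pid("10.1234/ABC-123")
handle_example = parse_pid("20.5000.1025/abc-123")
print(doi_example["url"])
print(handle_example["url"])
```

Because both schemes share the same prefix/suffix structure, a service written against the Handle form can treat a DOI as a special case rather than as a separate identifier type.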
European Natural Science Collections (NSC) are part of the global natural and cultural capital and represent 80% of the world's bio- and geo-diversity. Data derived from these collections underpin thousands of scholarly publications and official reports (used to support legislative and regulatory processes relating to health, food, security, sustainability and environmental change) and have led to inventions and products that today play an important role in our bio-economy. In recent decades, research practice in the natural sciences has changed dramatically. Advances in digital, genomic and information technologies enable natural science collections to provide new insights, but they also call for changing the current operational and business models of individual collections held at local natural history museums and universities towards a new business model that provides unified access to collection objects and all scientific data derived from them. Although aggregating infrastructures like the Global Biodiversity Information Facility, GenBank and Catalogue of Life now successfully aggregate data on specific data classes, the landscape remains fragmented, with limited capacity to bring together this information in a systematic and robust manner and with scattered access to the physical objects. The Distributed System of Scientific Collections (DiSSCo) represents a pan-European initiative, and the largest ever agreement of natural science museums, to jointly address the fragmentation of European collections. DiSSCo is unifying European natural science collections into a coherent new research infrastructure, able to provide bio- and geo-diversity data at the scale, form and precision required by a multi-disciplinary user base in science. 
DiSSCo is harmonising digitisation, curation and publication processes and workflows across the scientific collections in Europe and enables linking of occurrence, genomic, chemical and morphological data classes as well as publications and experts to the physical object. In this paper we will present the socio-cultural and governance aspects of this research infrastructure. DiSSCo is receiving political support from 11 countries in Europe and will gradually change its funding model from institutional to national funding, with temporary funding from the EC to support the preparation and development. Solutions to achieve large scale digitisation are currently designed in the EC funded ICEDIG project to underpin the future large scale digitisation carried out by the countries. Unified virtual (digitisation on demand) and transnational physical access to the collections is over the next four years being developed in the EC funded SYNTHESYS+ project. The governance of DiSSCo is designed to gradually change from a steering committee composed of a few large natural history museums contributing in cash to initiate the development into a legal entity in which national consortia are represented, with a central coordination office for daily management. Each country individually decides how its entities (scientific collection facilities, research councils, governmental bodies) are organised in their national consortium. A stakeholder and user forum, Scientific Advisory Board and International Advisory Board will ensure that DiSSCo will be functional in enabling science across disciplines and within the international landscape of infrastructures. Training and short scientific missions are being developed in the MOBILISE COST Action to build capacity in FAIR data production, publication and usage of scientific collection-derived data in Europe and to initiate the socio-cultural changes needed in the collection-holding institutes. 
A Helpdesk is being constructed in the SYNTHESYS+ and DiSSCo Prepare projects to further facilitate its use, and scientific use cases have been collected in ICEDIG to develop and facilitate e-services tailored to scientific needs.
International mass digitization efforts through infrastructures like the European Distributed System of Scientific Collections (DiSSCo), the US resource for Digitization of Biodiversity Collections (iDigBio), the National Specimen Information Infrastructure (NSII) of China, and Australia’s digitization of National Research Collections (NRCA Digital) make geo- and biodiversity specimen data freely, fully and directly accessible. Complementary, overarching infrastructure initiatives like the European Open Science Cloud (EOSC) were established to enable mutual integration, interoperability and reusability of multidisciplinary data streams including biodiversity, Earth system and life sciences (De Smedt et al. 2020). Natural Science Collections (NSC) are of particular importance for such multidisciplinary and internationally linked infrastructures, since they provide hard scientific evidence by allowing direct traceability of derived data (e.g., images, sequences, measurements) to physical specimens and material samples in NSC. To open up the large amounts of trait and habitat data and to link these data to digital resources like sequence databases (e.g., ENA), taxonomic infrastructures (e.g., GBIF) or environmental repositories (e.g., PANGAEA), proper annotation of specimen data with rich (meta)data early in the digitization process is required, next to bridging technologies to facilitate the reuse of these data. This was addressed in recent studies (Younis et al. 2018, Younis et al. 2020), where we employed computational image processing and artificial intelligence technologies (Deep Learning) for the classification and extraction of features like organs and morphological traits from digitized collection data (with a focus on herbarium sheets). 
However, such applications of artificial intelligence are rarely—this applies both for (sub-symbolic) machine learning and (symbolic) ontology-based annotations—integrated in the workflows of NSC’s management systems, which are the essential repositories for the aforementioned integration of data streams. This was the motivation for the development of a Deep Learning-based trait extraction and coherent Digital Specimen (DS) annotation service providing “Machine learning as a Service” (MLaaS) with a special focus on interoperability with the core services of DiSSCo, notably the DS Repository (nsidr.org) and the Specimen Data Refinery (Walton et al. 2020), as well as reusability within the data fabric of EOSC. Taking up the use case to detect and classify regions of interest (ROI) on herbarium scans, we demonstrate a MLaaS prototype for DiSSCo involving the digital object framework, Cordra, for the management of DS as well as instant annotation of digital objects with extracted trait features (and ROIs) based on the DS specification openDS (Islam et al. 2020). Source code available at: https://github.com/jgrieb/plant-detection-service
The Distributed System of Scientific Collections (DiSSCo) is a new world-class Research Infrastructure (RI) for Natural Science Collections. The DiSSCo RI aims to create a new business model for one European collection that digitally unifies all European natural science assets under common access, curation, policies and practices that ensure that all the data is easily Findable, Accessible, Interoperable and Reusable (FAIR principles). DiSSCo represents the largest ever formal agreement between natural history museums, botanic gardens and collection-holding institutions in the world. DiSSCo entered the European Roadmap for Research Infrastructures in 2018 and launched its main preparatory phase project (DiSSCo Prepare) in 2020. DiSSCo Prepare is the primary vehicle through which DiSSCo reaches the overall maturity necessary for its construction and eventual operation. DiSSCo Prepare raises DiSSCo’s implementation readiness level (IRL) across the five dimensions: technical, scientific, data, organisational and financial. Each dimension of implementation readiness is separately addressed by specific Work Packages (WP) with distinct targets, actions and tasks that will deliver DiSSCo’s Construction Masterplan. This comprehensive and integrated Masterplan will be the product of the outputs of all of its content related tasks and will be the project’s final output. It will serve as the blueprint for construction of the DiSSCo RI, including establishing it as a legal entity. DiSSCo Prepare builds on the successful completion of DiSSCo’s design study, ICEDIG and the outcomes of other DiSSCo-linked projects such as SYNTHESYS+ and MOBILISE. This paper is an abridged version of the original DiSSCo Prepare grant proposal. It contains the overarching scientific case for DiSSCo Prepare, alongside a description of our major activities.
Natural Science Collections (NSCs) contain specimen-related data from which we extract valuable information for science and policy. Openness of those collections facilitates development of science. Moreover, virtual accessibility to physical containers by means of their digitization will allow an exponential increase in the level of available information. Digitization of collections will allow us to set up a comprehensive registry of reliable, accurate, updated, comparable and interconnected information. Equally, the scope of interested potential users will largely expand, and so will the different levels of granularity required by researchers, institutions and governmental bodies. Meeting diverse needs entails a special effort in data management and data analysis to extract, digest and present information in a compressed but still precise and objective-oriented format. The Collections Digitisation Dashboard (CDD) underpins such an attempt. The CDD stands as a practical tool that specifically aims to support high-level decisions with a wide coverage of data, by providing a visual, simplified and structured arrangement that will allow discovery of key indicators concerning digitization of bio- and geodiversity collections. The realm of possible approaches to the CDD covers levels of digitization, collection exceptionality, resource availability and many others. Still, all those different angles need to be aligned and processed at once to provide an overview of the status of NSCs in the digitization process and to analyse its further development. The CDD is a powerful mechanism to identify priorities, specialisation lines, regional development, gaps and niches, future capabilities, and strengths and weaknesses across collections, institutions, countries and regions. It can underpin measurable and comparable assessments, with evolution indexes and progress indicators, all under an overarching homogeneous approach. 
The Distributed System of Scientific Collections (DiSSCo) Research Infrastructure, currently in its preparatory phase, is built on top of the largest ever community of collections-related institutions across Europe and anchored on the Consortium of European Taxonomic Facilities (CETAF). It aims to provide a unique virtual access point to NSCs by facilitating a massive digitisation effort throughout Europe. Setting up priorities and specialization areas is pivotal to its success. To that end, the DiSSCo CDD will provide a valuation tool to summarize and showcase the digitization status of NSCs in a first-hand visualization. Different projects and initiatives will contribute, jointly and on a synergetic basis, to the production of the DiSSCo CDD. The ICEDIG project will address its basic features, terms of classification and tiers of information, and will produce a prototype and a set of recommendations on how to better attempt a massive dashboard by collating specific collections-based information and defining global strategic representations. CETAF working groups on collections and digitization will provide the desired homogeneity in describing and capturing the different implementation requirements from the users’ perspectives, which will be complemented by the contributions made under the umbrella of the COST Action MOBILISE. The Action will use networking activities to identify the right standards and policies to enable enlarging the scope of the DiSSCo CDD and its broader implementation by linking to the TDWG criteria and adopted standards. Complementarily, the ELViS platform to be developed under the SYNTHESYS+ project will provide the right virtual environment. Furthermore, SYNTHESYS+ will address the assessment capabilities of the CDD so that the visual representation becomes a practical assessment mechanism, and will endow it with a dynamic feature for analysis over time. 
The DiSSCo CDD will thus become an instrumental mechanism for decision-making that will be embedded into the clustering initiative of products and services provided to the EOSC by the ENVRI-FAIR project in the environmental domain.
Two Southeast Asian spider collections, that of Frances and John Murphy, now in the Manchester University Museum, and the Deeleman collection, now at the Naturalis Biodiversity Center in Leiden, constituted the basis of this analysis of Chrysilla Thorell, 1887 and related genera. The latter collection also includes many thousands of spiders obtained by canopy fogging for an ecological project in Borneo by A. Floren. Some incongruences within the genera of the tribe Chrysillini are disentangled. The transfer of C. jesudasi Caleb & Mathai, 2014 from Chrysilla as type species of Phintelloides Kanesharatnam & Benjamin, 2019, based on analysis of molecular data, is validated by morphology. An interesting new species known only from the forest canopy in Borneo, Phintelloides scandens sp. nov., is described based on both male and female specimens. Distinguishing chrysilline genera is mostly based on traditional somatic characters, e.g., habitus, carapace and abdomen patterns, mouthparts, and genital organs. The utility of two character systems for distinguishing chrysilline genera is highlighted: 1) the presence of a flexible, articulating embolic tegular branch (etb) in combination with the characteristic construction of the epigyne in Chrysilla and Phintelloides; 2) the presence of red colour on the carapace and abdomen of live males and females, in combination with abundant blue/violet/white iridescent scales, such as in Chrysilla and Siler. The red colour usually gets lost in alcohol, hampering species identification of alcohol material. The genera Chrysilla and Phintelloides are redefined. Specimens of the heretofore unknown female of Chrysilla deelemani Prószyński & Deeleman-Reinhold, 2010 are described. The male and female of Chrysilla lauta and the male of C. volupe are redescribed. 
The genus Chrysilla is diagnosed and discriminated from Phintella Bösenberg & Strand, 1906, Siler Simon, 1889, Phintelloides Kanesharatnam & Benjamin, 2019 and Proszynskia Kanesharatnam & Benjamin, 2019. The structure of the female genital organ of Phintelloides flavumi Kanesharatnam & Benjamin, 2019 is scrutinized and the generic placement of Phintelloides is discussed. Males and females of one of the most variable species, Phintelloides versicolor (C. L. Koch, 1846), are redescribed. Phintelloides munita (Bösenberg & Strand, 1906) is removed from synonymy with P. versicolor. Phintella leucaspis Simon 1903 (male, Sumatra) is synonymized with P. versicolor.
Biodiversity data are increasingly reliant on digital infrastructure. By linking physical specimens to digital representations of their associated data, we can lower barriers to information flow. Here we demonstrate a workflow whereby persistent identifiers (PIDs) in the form of DOIs issued by DataCite are assigned to specimens. Recognized taxa are identified by their Catalogue of Life identifier, or by registration in ZooBank where no Catalogue of Life identifier is available. We demonstrate the use of nanopublications, creating a series of machine-readable, scientifically meaningful assertions regarding the provenance and identification of cited specimens. All human agents associated with the specimen data are linked to a persistent identifier issued by either ORCID or Wikidata.
The engine for our Distributed System of Scientific Collections (DiSSCo) is running! Core technical components supporting this new research infrastructure are currently being implemented and the engine that will support it is already working. Even though some nuts and bolts may still be missing, we aim to show it in action to present how it will enable annotation and curation of the Digital Specimen. The Digital Specimen is a technical implementation based on FAIR Digital Objects (FAIR stands for Findable, Accessible, Interoperable and Reusable) to support the Digital Extended Specimen concept (Webster et al. 2021). We will also present and demonstrate how we will implement standardized quality checks as they are being developed in Biodiversity Information Standards (TDWG) to enhance the quality of the data. DiSSCo is currently in its preparation phase. This phase will end in January 2023 with the completion of the DiSSCo Prepare project funded by the European Commission. Part of that project is the design of the Digital Specimen infrastructure, which is not an easy task considering the wide range of use cases, stakeholders and the many possibilities it offers. However, as we are moving towards the end of the project, we have defined clear goals and priorities to give shape to that infrastructure. This is where we take a fail fast approach: to quickly implement the proposed solution and see if it really fits. One of the major needs we want to support with the Digital Specimen infrastructure (based on collected user stories (Fitzgerald et al. 2021)) is to provide services for improving the quality and usability of specimen data. Our infrastructure aims to support annotating and community-curation of the data by both machines and users. Examples of these annotations are image-based determinations, automated- or citizen science-contributed label translations or the semi-automated linking with other biodiversity data. 
Semi-automated linking is currently being piloted as part of the Biodiversity Community Integrated Knowledge Library project (BiCIKL) and will use a process of link prediction through artificial intelligence in combination with human validation. Improvements in data quality made together by human and machine through the curation and annotation services will help in producing a digital specimen data object with high quality, curated and extended data. As part of the presentation we aim to give a live demonstration with the first setup in which we will ingest a dataset, run standardized quality checks and automated data enrichment services. The end result will be a digital specimen that we will present in a user-friendly interface, which has been validated by quality checks and annotated by both a human and a machine. The result will also be accessible as a FAIR Digital Object through an API. During the demonstration, we aim to give the audience a clear view on how DiSSCo can help them create higher quality specimen data, and how we will benefit in this process from the outputs of the TDWG Data quality tests and assertions taskgroup.
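The demonstration flow described above (ingest a record, run quality checks, attach machine and human annotations) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the DiSSCo implementation: the classes DigitalSpecimen and Annotation, the function names, and the identifier are hypothetical, and the coordinate range check is a stand-in for the TDWG data quality tests, not one of their actual definitions. Only the attribute names decimalLatitude and decimalLongitude are borrowed from Darwin Core.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    creator: str  # "machine" or "human"
    target: str   # name of the attribute the annotation refers to
    body: str     # free-text remark


@dataclass
class DigitalSpecimen:
    pid: str  # placeholder identifier, not a real PID
    attributes: dict
    annotations: list = field(default_factory=list)


def check_coordinate_range(specimen):
    """Toy quality check: flag missing or out-of-range coordinates."""
    limits = {"decimalLatitude": 90.0, "decimalLongitude": 180.0}
    found = []
    for name, limit in limits.items():
        value = specimen.attributes.get(name)
        if value is None:
            found.append(Annotation("machine", name, "value is missing"))
        elif not -limit <= value <= limit:
            found.append(Annotation("machine", name, f"{value} is outside +/-{limit}"))
    return found


def run_checks(specimen, checks):
    """Run every check and attach the resulting annotations to the specimen."""
    for check in checks:
        specimen.annotations.extend(check(specimen))
    return specimen


specimen = DigitalSpecimen(
    pid="example/specimen-1",
    attributes={"decimalLatitude": 95.2, "decimalLongitude": 4.49},
)
run_checks(specimen, [check_coordinate_range])
# A human curator can then add a remark alongside the machine-generated one.
specimen.annotations.append(
    Annotation("human", "decimalLatitude", "needs checking against the label")
)
for annotation in specimen.annotations:
    print(annotation.creator, annotation.target, annotation.body)
```

Keeping annotations separate from the attributes they describe means the original record is never overwritten, and each remark stays attributable to the machine or person that made it.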
DiSSCo (Distributed System of Scientific Collections) is a research infrastructure (RI) under development, which will provide services for the global research community to support and enhance physical and digital access to the natural history collections in Europe. These services include training, support, documentation and e-services. This talk will focus on the e-services and will give an overview of the current status, roadmap and first results as an introduction to the next talks in the session, which focus on some of the services in more detail and the standards work undertaken in Biodiversity Information Standards (TDWG) to enable them. The RI community will provide the envisioned e-services, which will use the novel FAIR Digital Object (FDO) infrastructure serving digital specimens from the European collections. The infrastructure will provide integrated data analysis, enhanced interpretation, annotation and access services for community curation and visualisation. The FDO infrastructure enables specimen data to be (re-)connected with genomic, geographical, morphological, taxonomic and environmental information through the digital specimen, making them Digital Extended Specimens. A large number of user stories have been collected through the DiSSCo-linked projects ICEDIG, SYNTHESYS+ and DiSSCo Prepare, to guide which e-Services to build and what functionality to provide. These user stories are publicly available in a github repository. The e-services are developed based on the user stories and prioritization provided by collection providers and the scientific community. A variety of mechanisms are used to collect input: surveys, workshops, roundtables and workpackage meetings, and feedback from users that have already been using beta versions of some of the services. DiSSCo aims to become operational in 2026 but several of the services are already being piloted or implemented. 
Experimental services and demonstrators are publicly available through DiSSCo Labs for testing and feedback. By connecting the specimen data with derived and related information in a FAIR way (Findable, Accessible, Interoperable and Reusable), the e-services will accelerate biodiversity discovery and support novel research questions. The FDO infrastructure has a data model that also integrates the PROV Ontology (PROV-O), which allows for the e-services to capture activities and improve the visibility of researcher contributions. This vision towards FAIR and high quality data is essential for community curation of the specimen data and making better use of the limited number of experts available. To provide the DiSSCo e-services in a FAIR way, the data derived from the natural history collections in Europe needs to be integrated as one virtual collection. The data has to be findable and accessible as soon as it is being created for services like a Specimen Data Refinery prior to publication in a facility like GBIF (Global Biodiversity Information Facility). This requires new standards for describing collections and specimen data. Standards being created to fill these gaps are TDWG CD (Collection Descriptions) and TDWG MIDS (Minimum Information about a Digital Specimen). The DiSSCo e-Services vision brings the data, standards, and processes together to serve the user community.
Natural science collections are vast repositories of bio- and geodiversity specimens. These collections, originating from natural history cabinets or expeditions, are increasingly becoming unparalleled sources of data facilitating multidisciplinary research (Meineke et al. 2018, Heberling et al. 2019, Cook et al. 2020, Thompson et al. 2021). Due to various global data mobilization and digitisation efforts (Blagoderov et al. 2012, Nelson and Ellis 2018), this digitised information about specimens includes database records along with two/three-dimensional images, sonograms, sound or video recordings, computerised tomography scans, machine-readable texts from labels on the specimens as well as media items and notes related to the discovery sites and acquisition (Hedrick et al. 2020, Phillipson 2022). The scope and practice of specimen gathering are also evolving. The term extended specimen was coined to refer to the specimen and associated data extending beyond the singular physical object to other physical or digital entities such as chemical composition, genetic sequence data or species data. Thus the specimen becomes an interconnected network of data resources that have incredible potential to enhance integrative and data-driven research (Webster 2017, Lendemer et al. 2019, Hardisty et al. 2022). These practices also reflect the role of data and the curatorial data life-cycle starting from the initial material sampling process to the downstream analysis. We are also seeing growing acknowledgement that disparate and domain specific data elements prevent interdisciplinarity which is crucial for a holistic understanding of biodiversity and climate crisis (Hicks et al. 2010, Craven et al. 2019, Folk and Siniscalchi 2021). Thus the data elements are not just records or rows in a database or data pipelines going from one repository to another. 
They have the potential to become self-describing digital artefacts that can revolutionise how machines interpret and work with specimen data. Within this context, the Distributed System of Scientific Collections (DiSSCo), a new European Research Infrastructure for natural science collections, envisions an infrastructure based on FAIR Digital Objects (FDO) that can unify more than 170 European natural science collections under common and FAIR-compliant (Findable, Accessible, Interoperable, Reusable) (Wilkinson et al. 2016) access and curation policies and practices. DiSSCo’s key element in achieving FAIR is the implementation of Digital Specimen (a domain specific FDO) that closely aligns with the extended specimen practices. The idea behind Digital Specimen – an FDO that acts as a digital surrogate for a specific physical specimen in a natural science collection – was influenced by global conversations around the implementation of the Digital Object Architecture for biodiversity data (De Smedt et al. 2020, Islam et al. 2020, Hardisty et al. 2020). The main purpose of this talk is to explain the vision of how FAIR and FDO can create a data infrastructure that can not only take advantage of existing databases and repositories but at the same time provide support for innovative services such as AI and digital twinning. With scientific use cases in mind, the talk will highlight a few key FAIR and FDO components (persistent identifiers, metadata, ontologies) within the collaborative modelling activity of Digital Specimen specification. These components provide the template for specifying how a Digital Specimen should look so DiSSCo can build a FAIR service ecosystem based on FDOs (Addink et al. 2021). We will also give examples of envisioned services that can help with image feature extraction and model training (Grieb et al. 2021, Hardisty et al. 2022) and digital twinning (Schultes et al. 2022). 
We believe this is an exciting new paradigm powered by FAIR and FDO that can help both humans and machines to accelerate the use of specimen data. From physical objects curated over hundreds of years, we have developed data pipelines, aggregators and repositories (Barberousse 2021). Now is the time to look for solutions where these data records can become FAIR Digital Objects to enable wider access and multidisciplinary research.
Predictability is one of the core requirements for creating machine actionable data. The more predictable the data, the more generic the service acting on the data can be. The more generic the service, the easier it is to exchange ideas, collaborate on initiatives and leverage machines to do the work. Predictability is essential for implementing the FAIR Principles (Findable, Accessible, Interoperable, Reusable), as it provides the “I” for Interoperability (Jacobsen et al. 2020). The FAIR principles emphasise machine actionability because the amount of data generated is far too large for humans to handle. While Biodiversity Information Standards (TDWG) standards have massively improved the standardisation of biodiversity data, there is still room for improvement. Within the Distributed System of Scientific Collections (DiSSCo), we aim to harmonise all scientific data derived from European specimen collections, including geological specimens, into a single data specification. We call this data specification the open Digital Specimen (openDS). It is being built on top of existing and developing biodiversity information standards such as Darwin Core (DwC), Minimum Information about a Digital Specimen (MIDS), Latimer Core, Access to Biological Collection Data (ABCD) Schema, Extension for Geosciences (EFG) and also on the new Global Biodiversity Information Facility (GBIF) Unified Model. In openDS we leverage the existing standards within the TDWG community but combine these with stricter constraints and controlled vocabularies, with the aim of improving the FAIRness of the data. This will not only make the data easier to use, but will also increase its quality and machine actionability. As the first step, the harmonisation of terms, we make sure that similar values use the same standard term as their key. This enables the next step, in which we harmonise the values: we can transform free-text values into standardised or controlled vocabularies. 
For example: instead of using the names J. Doe, John Doe and J. Doe sr. for a collector, we aim to standardise these to J. Doe, with a person identifier that connects this name with more information about the collector. Biodiversity information standards such as DwC were developed to lower the bar for data sharing. The downside of imposing minimal constraints and allowing flexibility is that the standards leave room for ambiguity, leading to multiple interpretations. This limits interoperability and hampers machine actionability. In DiSSCo, data will come from different sources that use different biodiversity information standards. To cover this, we need to harmonise terms between these standards. To complicate things further, different serialisation methods are used for data exchange. Darwin Core Archives (DwC-A; GBIF 2021) use comma-separated values (CSV) files, ABCD(EFG) exposed through Biological Collection Access Service (BioCASe) uses XML, and most custom formats use JavaScript Object Notation (JSON). In this lightning talk, we will dive into DiSSCo’s technical implementation of the harmonisation process. DiSSCo currently supports two biodiversity information standards, DwC and ABCD(EFG), and maps the data to our openDS specification on a record-by-record basis. We will highlight some of the more problematic mappings, but also show how a harmonised model massively simplifies generic actions, such as the calculation of MIDS levels, which provide information about the digitisation completeness of a specimen. We will conclude by having a quick look at the next steps and hope to start a discussion about controlled vocabularies. The development of high-quality, standardised data based on a strict specification with controlled vocabularies, rooted in community-accepted standards, can have a huge impact on biodiversity research and is an essential step towards scaling up research with computational support.
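The two harmonisation steps described above (first terms, then values) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the harmonised keys, the simplified source term names, the person identifier and the completeness check are hypothetical, and they are not the actual openDS terms, DiSSCo mappings or MIDS criteria.

```python
from typing import Dict, Optional, Tuple

# Hypothetical crosswalk: (source standard, source term) -> harmonised key.
# Source term names are simplified; the harmonised keys are illustrative only.
TERM_MAP: Dict[Tuple[str, str], str] = {
    ("dwc", "recordedBy"): "collector",
    ("abcd", "GatheringAgent"): "collector",
    ("dwc", "scientificName"): "scientific_name",
    ("abcd", "FullScientificNameString"): "scientific_name",
}

# Hypothetical controlled vocabulary:
# free-text variant -> (standardised name, person identifier).
COLLECTORS: Dict[str, Tuple[str, Optional[str]]] = {
    "j. doe": ("J. Doe", "person:0001"),
    "john doe": ("J. Doe", "person:0001"),
    "j. doe sr.": ("J. Doe", "person:0001"),
}


def harmonise(standard: str, record: Dict[str, str]) -> Dict[str, object]:
    """Step 1: map source terms to one key. Step 2: standardise the values."""
    out: Dict[str, object] = {}
    for term, value in record.items():
        key = TERM_MAP.get((standard, term))
        if key is None:
            continue  # unmapped terms are simply dropped in this sketch
        if key == "collector":
            # Unknown names are kept as-is, without an identifier.
            name, pid = COLLECTORS.get(value.strip().lower(), (value, None))
            out[key] = {"name": name, "id": pid}
        else:
            out[key] = value
    return out


def completeness(record: Dict[str, object],
                 required=("collector", "scientific_name")) -> float:
    """Generic check on the harmonised record: fraction of required keys present."""
    present = sum(1 for key in required if record.get(key))
    return present / len(required)


from_dwc = harmonise("dwc", {"recordedBy": "John Doe",
                             "scientificName": "Bellis perennis"})
from_abcd = harmonise("abcd", {"GatheringAgent": "J. Doe sr.",
                               "FullScientificNameString": "Bellis perennis"})
print(from_dwc == from_abcd, completeness(from_dwc))
```

Once both sources are mapped to the same keys, a generic function such as `completeness` no longer needs to know which standard a record came from, which is the simplification the abstract attributes to a harmonised model.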
Digital specimens are new information objects on the internet, which act as digital surrogates of the physical objects they represent. They are designed to be extended with data derived from the specimen, such as genetic, morphological and chemical data, and with data that puts the specimen in the context of its gathering event and the environment it was derived from. This requires linking the digital specimens and their related entities to information about agents, locations, publications, taxa and environmental information. To establish reliable links and (re-)connect data to specimens, a new framework is needed that creates persistent identifiers (PIDs) for the digital specimen and its related entities. These PIDs should be actionable by machines but should also be usable by humans for data citation and communication purposes. The framework that enables this is a new PID infrastructure, produced by the European Commission-funded BiCIKL project (Biodiversity Community Integrated Knowledge Library), that creates persistent and actionable identifiers. It is a generic PID infrastructure that will be used by the Distributed System of Scientific Collections research infrastructure (DiSSCo), but it can also be used by other infrastructures and institutions. PIDs minted by DiSSCo will be linked to the digital specimens and samples provided through DiSSCo. The new PIDs are a key element in enabling the concept of Digital Extended Specimens (Webster et al. 2021) and provide unique and resolvable references to enable bidirectional linking. DiSSCo has done extensive work to select the most appropriate PID scheme (Hardisty et al. 2021) and to design a PID infrastructure for the pan-European specimens. 
The draft design has been discussed with technical specialists in the joint DiSSCo and Consortium of European Taxonomic Facilities (CETAF) community and with international stakeholders like the Global Biodiversity Information Facility (GBIF) and Integrated Digitized Biocollections (iDigBio), and was discussed at the 2022 conference of the Society for the Preservation of Natural History Collections (SPNHC). A first implementation was demonstrated at the Biodiversity Information Standards (TDWG) annual conference in 2022 and illustrated key elements in the design. To be able to provide digital specimen identifiers as DOIs (Digital Object Identifiers), a pilot project was started in 2023 with DataCite to investigate whether Digital Specimen DOIs in the new PID infrastructure can be created using the DataCite service. The pilot aimed to create metadata crosswalks to the DataCite schema in consultation with the DataCite Metadata Working Group, to evaluate synergies with the IGSN (International Generic Sample Number) metadata schema, to develop and test PID kernel metadata registration, and to evaluate performance and the impact of using DataCite services. There are around two billion specimens, and creating PIDs for them as DOIs requires creating DOIs at an unprecedented scale. Also, PID kernel metadata registration is new for DOIs. The included metadata for specimens will complement existing Biodiversity Information Standards such as Darwin Core, and will support the new MIDS (Minimum Information about a Digital Specimen) standard that is under development. The design, development and testing of the new PID infrastructure is being done as part of the BiCIKL project, which aims to foster collaboration between infrastructures and develop bidirectional connections (Penev et al. 2022). 
In the session, we will demonstrate the results of the development of the PID infrastructure as part of the BiCIKL toolbox to link biodiversity data, and discuss the progress with creating digital specimen DOIs.
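What a metadata crosswalk to the DataCite schema involves can be shown with a small Python sketch. It is an illustration under stated assumptions, not the crosswalk developed in the pilot: the source field names (`name`, `institution`, `year_published`, `landing_page`) are hypothetical, the choice of `PhysicalObject` as the general resource type is an assumption, and the output only loosely follows the JSON layout used for DOI registration with DataCite.

```python
# Hypothetical source fields of a digital specimen record (not openDS terms).
REQUIRED_FIELDS = ("name", "institution", "year_published", "landing_page")


def to_datacite_payload(specimen: dict, doi: str) -> dict:
    """Map a simplified specimen record onto DataCite-style DOI metadata."""
    missing = [field for field in REQUIRED_FIELDS if field not in specimen]
    if missing:
        raise ValueError(f"missing source fields: {missing}")
    return {
        "data": {
            "type": "dois",
            "attributes": {
                "doi": doi,
                "titles": [{"title": specimen["name"]}],
                # Assumption: the holding institution acts as creator and publisher.
                "creators": [{"name": specimen["institution"],
                              "nameType": "Organizational"}],
                "publisher": specimen["institution"],
                "publicationYear": specimen["year_published"],
                # Assumption: resource type values chosen for illustration only.
                "types": {"resourceTypeGeneral": "PhysicalObject",
                          "resourceType": "Digital Specimen"},
                "url": specimen["landing_page"],
            },
        }
    }


example = to_datacite_payload(
    {
        "name": "Bellis perennis L.",
        "institution": "Example Natural History Museum",
        "year_published": 2023,
        "landing_page": "https://example.org/ds/0001",
    },
    doi="10.1234/example-0001",  # placeholder DOI, not a registered identifier
)
print(example["data"]["attributes"]["titles"])
```

At the scale mentioned in the abstract, a mapping like this would be applied record by record in bulk, which is why the pilot also evaluates the performance of the registration service.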
The Biodiversity Digital Twin (BioDT) project (2022-2025) aims to create prototypes that integrate various data sets, models, and expert domain knowledge enabling prediction capabilities and decision-making support for critical issues in biodiversity dynamics. While digital twin concepts have been applied in industries for continuous monitoring of physical phenomena, their application in biodiversity and environmental sciences presents novel challenges (Bauer et al. 2021, de Koning et al. 2023). In addition, successfully developing digital twins for biodiversity requires addressing interoperability challenges in data standards. BioDT is developing prototype digital twins based on use cases that span various data complexities, from point occurrence data to bioacoustics, covering nationwide forest states to specific communities and individual species. The project relies on FAIR principles (Findable, Accessible, Interoperable, and Reusable) and FAIR enabling resources like standards and vocabularies (Schultes et al. 2020) to enable the exchange, sharing, and reuse of biodiversity information, fostering collaboration among participating research infrastructures (DiSSCo, eLTER, GBIF, and LifeWatch) and data providers. It also involves creating a harmonised abstraction layer using Persistent Identifiers (PID) and FAIR Digital Object (FDO) records, alongside semantic mapping and crosswalk techniques to provide machine-actionable metadata (Schultes and Wittenburg 2019, Schwardmann 2020). Governance and engagement with research infrastructure stakeholders play crucial roles in this regard, with a focus on aligning technical and data standards discussions. In addition to data, models and workflows are key elements in BioDT. Models in the BioDT context are formal representations of problems or processes, implemented through equations, algorithms, or a combination of both, which can be executed by machine entities. 
The current twin prototypes are considering both statistical and mechanistic models, introducing significant variations in (1) data requirements, (2) modelling approaches and philosophy, and (3) model output. The BioDT consortium will develop guidelines and protocols for how to describe these models, what metadata to include, and how they will interact with the diverse datasets. While discussions on this topic exist within the broader context of biodiversity and ecological sciences (Jeltsch et al. 2013, Fer et al. 2020), the BioDT project is strongly committed to finding a solution within its scope. In the twinning context, data and models need to be executed within a computing infrastructure and also need to adhere to FAIR principles. Software within BioDT includes a suite of tools that facilitate data acquisition, storage, processing, and analysis. While some of these tools already exist, the challenge lies in integrating them within the digital twinning framework. One approach to achieving integration is through workflow representation, encompassing standardised procedures and protocols that guide the acquisition, packaging, processing, and analysis of data. The project is exploring Research Object Crate (RO-Crate) implementation for this (Soiland-Reyes et al. 2022). Implementing workflows can ensure reproducibility, scalability, and transparency in research practices, enabling scientists to validate and replicate findings. The BioDT project offers a novel and transformative approach to biodiversity research and application. By leveraging collaborative research infrastructures and adhering to data standards, BioDT aims to harness the power of data, software, supercomputers, models, and expertise to provide new insights. 
The foundation provided by the data standards, including those of Biodiversity Information Standards (TDWG), is crucial in realising the full potential of digital twins, facilitating the seamless integration of diverse data sources and combinations with models.
With the rise of Artificial Intelligence (AI), a large set of new tools and services is emerging that supports specimen data mapping, standards alignment, quality enhancement and enrichment of the data. These tools currently operate in isolation, targeted to individual collections, collection management systems and institutional datasets. To address this challenge, DiSSCo, the Distributed System of Scientific Collections, is developing a new infrastructure for digital specimens, transforming them into actionable information objects. This infrastructure incorporates a framework for annotation and curation that allows the objects to be enriched or enhanced by both experts and machines. This creates the unique possibility to plug in AI-assisted services that can then leverage digital specimens through this infrastructure, which serves as a harmonised Findable, Accessible, Interoperable and Reusable (FAIR) abstraction layer on top of individual institutional systems or datasets. Early examples of such services are those developed in the Specimen Data Refinery workflow (Hardisty et al. 2022). The new architecture, DS Arch or Digital Specimen Architecture, is built on the concept of FAIR Digital Objects (FDO) (Islam et al. 2020). All digital specimens and related objects are served with persistent identifiers and machine-readable FDO records, which contain information for machines about the object together with a pointer to its machine-readable type description. The type describes the structure of the object, its attributes and the allowed operations. The digital specimen type and specimen media type are based on existing Biodiversity Information Standards (TDWG) such as Darwin Core, Access to Biological Collection Data (ABCD) Schema and Audiovisual Core Multimedia Resources Metadata Schema, and include support for annotation operations based on the World Wide Web Consortium (W3C) Annotations Data Model. 
This enables AI-assisted services registered with DS Arch to autonomously discover digital specimen objects and determine the actions they are authorised to perform. AI-assisted services can facilitate various tasks, such as digitising specimens, extracting new information from specimen images, creating relations with other objects or standardising data. These operations can be done autonomously, upon user request, or in tandem with expert validation. AI-assisted services registered with DS Arch can interact in the same way with all digital specimens worldwide when these are served through DS Arch with their uniform FDO representation, even if the content richness, level of standardisation and scope of the specimens differ. DS Arch has been designed to serve digital specimens for living and preserved specimens, and for preserved environmental, earth system and astrogeology samples. With the AI-assisted services, data can be annotated with new data, alternative values, corrections, and new entity relationships. As a result, the digital specimens become Digital Extended Specimens, enabling new science and applications (Webster et al. 2021). With the implementation of a sophisticated trust model in DS Arch for community acceptance, these annotations will become part of the data itself and can be made available for inclusion in source systems such as collection management systems and in aggregators such as the Global Biodiversity Information Facility (GBIF), Geoscience Collections Access Service (GeoCASe) and Catalogue of Life. We aim to demonstrate in the session how AI-assisted services can be registered and used to annotate specimen data. Although the DiSSCo DS Arch is still in development and planned to become operational in 2025, we already have a sandbox environment available in which the concept can be tested and AI-assisted services can be piloted to act on digital specimen data. 
For testing purposes, the operations on specimens are currently limited to individual specimens and open data; however, batch operations will also be possible in the future production environment.
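The idea that a service reads an FDO record, follows the pointer to the type description and then determines which operations it may perform can be sketched in a few lines of Python. This is a conceptual illustration only: the record fields, the type identifiers, the operation names and the in-memory registry are all hypothetical and do not reflect the actual DS Arch interfaces.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass(frozen=True)
class FdoRecord:
    """Hypothetical minimal FDO record: identifier, type pointer and location."""
    pid: str
    type_pid: str
    location: str


# Hypothetical type registry: type identifier -> description with allowed operations.
TYPE_REGISTRY: Dict[str, Dict[str, Set[str]]] = {
    "type:digital-specimen": {"operations": {"read", "annotate"}},
    "type:specimen-media": {"operations": {"read", "annotate"}},
}


def allowed_operations(record: FdoRecord, service_grants: Set[str]) -> Set[str]:
    """Operations a service may perform: those the type allows AND the service holds."""
    type_description = TYPE_REGISTRY.get(record.type_pid)
    if type_description is None:
        return set()  # unknown type: the service cannot act on the object
    return type_description["operations"] & service_grants


specimen = FdoRecord(
    pid="pid:example-0001",
    type_pid="type:digital-specimen",
    location="https://example.org/ds/0001",
)
# A service that was granted 'annotate' and 'delete' may only annotate here,
# because the (hypothetical) specimen type does not define 'delete'.
print(allowed_operations(specimen, {"annotate", "delete"}))
```

Because the check depends only on the type pointer, the same service logic applies to every object of that type, regardless of how rich or standardised the underlying record is.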
DiSSCo, a Distributed System of Scientific Collections, is a Research Infrastructure (RI) with 114 self-sustaining partners in Europe aiming at providing physical and digital (data) access to the approximately 1.5 billion biological and geological specimens in collections distributed across Europe. It is a facility to generate and aggregate data derived from the collections and repackage them as linked data objects with unified access to enable science and underpin FAIR (Findable, Accessible, Interoperable, Reusable) data principles. In the European landscape of environmental Research Infrastructures, the effectiveness of services that aim at aggregating, monitoring, analysing and modelling geo-diversity information relies on the primary description of the bio- and geo-diversity. It also relies on the availability of this primary reference data, which today is scattered and disconnected. DiSSCo provides the required bio-geographical, taxonomic and species trait data at the level of precision and accuracy required to enable and speed up research towards achieving the Targets of the Sustainable Development Goals for Life on Earth, Life below Water and Climate Action.

DiSSCo requires further development of TDWG standards, RDA (Research Data Alliance) recommendations, practices developed in the CETAF (Consortium of European Taxonomic Facilities) network and novel technological approaches to deliver data at the economies of scale and scope needed.

In this paper, we:

discuss technical barriers for interoperability and possible action lines to overcome these, including practices and technologies to underpin the FAIR data principles;

outline the DiSSCo API (Application Programming Interface) services to provide data suitable for thematic services in environmental Research Infrastructures like LifeWatch, eLTER (European Long-Term Ecosystem and socio-ecological Research Infrastructure) as well as RIs in other domains such as E-RIHS (European Research Infrastructure for Heritage Science) in the field of social sciences. The services enable better connections between collection data and observations in biodiversity observation networks, such as EUBON (European Biodiversity Observation Network) and GEOBON (Group on Earth Observations Biodiversity Observation Network);

explain the DiSSCo strategy to align project outcomes and standards development towards a common unified research infrastructure.
The infrastructure for the Distributed System of Scientific Collections (DiSSCo) is in full development. Work within the DiSSCo Transition Project has been focused on building infrastructure, creating data models, and setting up Application Programming Interfaces (APIs) (Koureas et al. 2024). In the past years, DiSSCo has presented this work at different Biodiversity Information Standards (TDWG) conferences (Leeflang and Addink 2023, Leeflang et al. 2022, Addink et al. 2021). In this year’s session, we would like to focus on the human-facing application: DiSSCover. DiSSCover is the graphical user interface through which users can interact with Findable, Accessible, Interoperable and Reusable (FAIR) Digital Objects (FDOs), facilitating the curation and enhancement of specimen data (Islam 2024). Development started in 2022 and is ongoing. The interface acts as a gateway into the DiSSCo infrastructure, providing access to digital specimens and media. Extracted from the core DiSSCo API, the data is converted into an easily readable format and made discoverable through a diverse set of filters. DiSSCover’s main focus is to allow users to make annotations on the data. Through the concept of annotations, we connect expert and machine-generated information to create extended digital specimens (Hardisty et al. 2022), e.g., by creating linkages to other infrastructures, correcting or adding new information, or by triggering machine annotation services. Machine annotation services are automated, scalable tools that run in the background and automatically curate and extend the specimen (Addink et al. 2023). Human users will remain important, as all annotations made by machine annotation services can be reviewed by a trusted person. Annotations are FAIR Digital Objects and target a specific part of a specimen, be it a data fragment or an associated media file. At the heart of DiSSCover lies the Open Digital Specimen data specification (Leeflang and Addink 2023). 
It tries to harmonise multiple data standards into one generic specification based on the new Global Biodiversity Information Facility (GBIF) Unified Model (Robertson et al. 2022). The data is stored as JavaScript Object Notation (JSON) based on JSON Schemas (Anonymous 2024). Annotations are linked to specific data attributes using a JSON path as the identifier. Data attributes can be individual terms, collections of terms called classes, or the whole object. This creates a flexible but complex data structure, as the basis for which we used the World Wide Web Consortium (W3C) Web Annotation Data Model (Sanderson et al. 2017). The W3C Web Annotation Data Model contains two main components: the target and the body. The target specifies which data attribute the annotation is made on, for example, the term ‘ods:specimenName’. This is a local term within the open Digital Specimen namespace (ods), which holds the accepted name of the digital specimen. The annotation body holds the value(s) that are appended to the digital specimen, and differs based upon the annotation motivation. DiSSCo recognises five different annotation motivations: addition, modification, comment, assessment and deletion, each of which has its own unique function. This creates a flexible structure that should be able to handle any information the user wants to add to the object. The challenge of DiSSCover is to preserve the complex structure of annotations, whilst making it convenient for users to work with. The session will provide a look at the different kinds of annotations and their use from a practical perspective. A demonstration of DiSSCover will show how users can create annotations, providing knowledge about the process that will give shape to DiSSCo’s main goal of enriching natural history data.
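The target and body structure described above can be made concrete with a short Python sketch. It is shaped after the target/body split of the W3C model but is not a conformant W3C annotation and not the DiSSCo implementation: the selector type `JsonPathSelector`, the helper functions, the dot-only path syntax and the way each of the five motivations is applied are assumptions made for illustration.

```python
import copy
from typing import Any, Dict, List

# The five motivations named in the text; their handling below is an assumption.
MOTIVATIONS = {"addition", "modification", "comment", "assessment", "deletion"}


def make_annotation(target_pid: str, motivation: str, path: str,
                    value: Any = None) -> Dict[str, Any]:
    """Build a simplified annotation with a target (where) and a body (what)."""
    if motivation not in MOTIVATIONS:
        raise ValueError(f"unknown motivation: {motivation}")
    return {
        "type": "Annotation",
        "motivation": motivation,
        "target": {
            "source": target_pid,
            # Hypothetical selector type holding a JSON path to the attribute.
            "selector": {"type": "JsonPathSelector", "value": path},
        },
        "body": {"value": value},
    }


def _keys(path: str) -> List[str]:
    """Resolve only simple paths like '$.a.b'; '$' alone means the whole object."""
    if path == "$":
        return []
    if not path.startswith("$."):
        raise ValueError(f"unsupported path: {path}")
    return path[2:].split(".")


def apply_annotation(obj: Dict[str, Any], annotation: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of obj with the annotation applied; the input is not changed."""
    result = copy.deepcopy(obj)
    motivation = annotation["motivation"]
    if motivation in ("comment", "assessment"):
        return result  # these motivations do not alter the data in this sketch
    keys = _keys(annotation["target"]["selector"]["value"])
    if not keys:
        raise ValueError("this sketch only edits individual terms")
    parent = result
    for key in keys[:-1]:
        parent = parent[key]
    last = keys[-1]
    if motivation == "addition":
        if last in parent:
            raise KeyError(f"{last} already exists")
        parent[last] = annotation["body"]["value"]
    elif motivation == "modification":
        if last not in parent:
            raise KeyError(f"{last} does not exist")
        parent[last] = annotation["body"]["value"]
    else:  # deletion
        del parent[last]
    return result


specimen = {"ods:specimenName": "Bellis perenis"}  # misspelt on purpose
fix = make_annotation("pid:example-0001", "modification",
                      "$.ods:specimenName", "Bellis perennis")
print(apply_annotation(specimen, fix))
```

Keeping the annotation separate from the object and applying it to a copy mirrors the review step described in the abstract, where annotations can be checked by a trusted person before they become part of the data.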