Main Page: Difference between revisions

From Protist-Prokaryote Symbiosis Database
(updates section)
 
(12 intermediate revisions by the same user not shown)
Line 1: Line 1:
== Introduction ==


This project aims to describe symbiotic interactions between protists and prokaryotes as [https://en.wikipedia.org/wiki/Linked_data Linked Open Data]. Information in the database is compiled from the scientific literature; over 500 symbiotic interactions described in 300 scientific publications are represented at present. The database is hosted on [https://www.wikibase.cloud/ Wikibase Cloud], a hosting service for Wikibase instances provided by [https://www.wikimedia.de/ Wikimedia Deutschland].
This project aims to describe symbiotic interactions between protists and prokaryotes as [https://en.wikipedia.org/wiki/Linked_data Linked Open Data]. Information in the database is compiled from the scientific literature; ~700 symbiotic interactions described in ~380 scientific publications are represented at present. The database is hosted on [https://www.wikibase.cloud/ Wikibase Cloud], a hosting service for Wikibase instances provided by [https://www.wikimedia.de/ Wikimedia Deutschland].


What can the database be used for?
Line 9: Line 9:
* Share data to be indexed in [https://www.globalbioticinteractions.org/ GloBI], through [https://github.com/kbseah/ppsdb-globi-export periodic data exports]


What information do we wish to capture about symbiotic interactions?
Documentation of the workflow and project administration is linked from [[Project:Main]]. Your questions may be answered at [[Project:Q&A]].
 
==Updates==
 
* 2024-09-21 : Addshore, the original developer of the precursor to Wikibase.cloud (WBStack), wrote a [https://addshore.com/2024/09/2-years-of-wikibase-cloud-by-wmde/ blog post] reflecting on the past two years since the project was transferred to Wikimedia Deutschland. I contributed a short user testimonial.
* 2024-08-26 : Released a [https://doi.org/10.32942/X2ZW4S preprint manuscript] describing the database motivation and design, intended for biologists who may use the database or want to build similar projects
 
==Explore the data==
 
Some example entries to see how the data are modeled:
 
* [[Item:Q22|''Parduczia'' sp. from Santa Barbara Basin]] ("brown ciliate"), a marine ciliate
* [[Item:Q206|''Pelomyxa palustris'']], a freshwater amoebozoan
* [[Item:Q296|''Mixotricha paradoxa'']], itself a hindgut symbiont of the termite [[Item:Q298|''Mastotermes darwiniensis'']]
* [[Item:Q141|''Candidatus'' Megaira polyxenophila]], a bacterial endosymbiont of a diverse array of protists and algae
 
Each interaction statement is supported by one or more references to the scientific literature, linked by their DOI if available.
 
Entities (e.g. taxa, publications) can be found by a '''free-text search''' of their labels with the search bar (top of each page) or at [[Special:Search]].
 
'''Semantic queries''', which make use of the data model (e.g. "find host species where symbionts are Alphaproteobacteria"), require the use of the SPARQL query language. Use the Query Service (link on menu bar) to launch SPARQL queries; try the [[Project:SPARQL/examples|example queries]] to get started.
 
==Data linking==
 
We wish to capture different facets of symbiotic interactions, and link these to other databases and ontologies.


{| class="wikitable"
{| class="wikitable"
Line 29: Line 53:
| Environment where organisms were isolated
| [https://sites.google.com/site/environmentontology/ Environment Ontology]
|-
| Representative SSU rRNA or genome sequences
| [https://www.ncbi.nlm.nih.gov/genbank/ Genbank]
|-
| Scientific publications describing symbiosis
| DOI, [https://www.wikidata.org/ Wikidata]
|}
|}


Terms will be linked to other linked open data or ontologies, if there is a suitable exact match. The [https://www.ebi.ac.uk/ols4/ EMBL-EBI Ontology Lookup Service] is a useful resource for browsing life science related ontologies.
==Links==


This project originated as part of my [http://nbn-resolving.de/urn:nbn:de:gbv:46-00106172-12 doctoral dissertation] (2017).


Similar projects elsewhere:
* [https://www.globalbioticinteractions.org Global Biotic Interactions (GloBI)], an aggregator for biotic interactions datasets across all domains of life
* [https://www.globalbioticinteractions.org Global Biotic Interactions (GloBI)] ([https://doi.org/10.1016/j.ecoinf.2014.08.005 Poelen et al., 2014]), an aggregator for biotic interactions datasets across all domains of life
* [https://github.com/ramalok/PIDA Protist Interaction Database (PIDA)], [https://doi.org/10.1038/s41396-019-0542-5 Bjorbækmo et al., 2019] (last updated 2018)
* [https://github.com/ramalok/PIDA Protist Interaction Database (PIDA)] ([https://doi.org/10.1038/s41396-019-0542-5 Bjorbækmo et al., 2019]) (last updated 2018)
* [https://github.com/FloraVincent/DIDB Diatom Interaction Database (DIDB)] (last updated 2019)
* [https://github.com/FloraVincent/DIDB Diatom Interaction Database (DIDB)] ([https://doi.org/10.1128/msystems.00444-19 Vincent & Bowler, 2020]) (last updated 2019)
* [http://www.aquasymbio.fr/ AQUASYMBIO] (last updated 2017)
Documentation of the workflow and project administration is linked from [[Project:Main]].
==Explore the data==
Some example entries to see how the data are modeled:
* [[Item:Q22|''Parduczia'' sp. from Santa Barbara Basin]] ("brown ciliate"), a marine ciliate
* [[Item:Q206|''Pelomyxa palustris'']], a freshwater amoebozoan
* [[Item:Q296|''Mixotricha paradoxa'']], itself a hindgut symbiont of the termite [[Item:Q298|''Mastotermes darwiniensis'']]
* [[Item:Q141|''Candidatus'' Megaira polyxenophila]], a bacterial endosymbiont of a diverse array of protists and algae
Each interaction statement is supported by one or more references to the scientific literature, linked by their DOI if available.
Use the Query Service (link on menu bar) to launch SPARQL queries; try the [[Project:SPARQL/examples|example queries]] to get started.
== Q & A ==
; Which types of interactions are in the scope of this database?
: Interactions between protists (microbial eukaryotes) and prokaryotes (bacteria and archaea). In a few cases, interactions between two eukaryotic partners, or interactions with viruses, may also be included if they occur in organisms that also participate in protist-prokaryote interactions.
: Fungi and macroalgae are generally not included, nor are microbiome studies. Reports of putative symbionts described solely by morphology without further phylogenetic characterization may be included if the host organism is somehow notable, or if they are in older literature about host taxa that have not yet been studied with modern methods.
; How are biological interactions between taxa represented in the database?
: Taxa are represented as items in the database, linked together by statements (claims) representing the biological interaction. Each statement is supported by one or more references to the scientific literature.
: Example: ''Pelomyxa palustris'' ([[Item:Q206]]) interacts with ([[Property:P19]]) ''Methanosaeta'' sp. ARCH ([[Item:Q203]]).
: The item that goes before "interacts with" is the '''subject''' of the statement, while the item after it is the '''object'''.
: In this database, we follow the convention that the subject of the "interacts with" statements is the "host" of the biological interaction, i.e. the physically larger organism in a pair, usually the eukaryote.
: Further information about the interaction is encoded as "qualifier" statements. Attached to the example statement above is the following qualifier:
: subject body part ([[Property:P20]]) cytoplasm ([[Item:Q29]])
: I.e. the symbiont ''Methanosaeta'' sp. ARCH is located in the cytoplasm of the host, ''Pelomyxa palustris''.
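The statement and qualifier structure described above can be sketched in a few lines of Python. This is only an illustrative model, not how Wikibase stores data internally: the class names <code>Statement</code> and <code>Qualifier</code> and the helper <code>describe</code> are hypothetical, while the item and property IDs (Q206, P19, Q203, P20, Q29) are the ones quoted in the example. References are left empty here rather than inventing DOIs.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Qualifier:
    """Extra detail attached to a statement, e.g. 'subject body part'."""
    property_id: str
    value_id: str


@dataclass
class Statement:
    """One 'subject - property - object' claim with optional qualifiers."""
    subject_id: str   # the host, by the convention described above
    property_id: str
    object_id: str    # the symbiont
    qualifiers: list = field(default_factory=list)
    references: list = field(default_factory=list)  # e.g. DOIs (omitted here)


# Labels copied from the example in the text.
LABELS = {
    "Q206": "Pelomyxa palustris",
    "P19": "interacts with",
    "Q203": "Methanosaeta sp. ARCH",
    "P20": "subject body part",
    "Q29": "cytoplasm",
}


def describe(stmt: Statement, labels: dict) -> str:
    """Render a statement and its qualifiers as readable text."""
    def name(entity_id: str) -> str:
        return f"{labels.get(entity_id, entity_id)} ({entity_id})"

    text = f"{name(stmt.subject_id)} {name(stmt.property_id)} {name(stmt.object_id)}"
    for q in stmt.qualifiers:
        text += f"; {name(q.property_id)}: {name(q.value_id)}"
    return text


example = Statement("Q206", "P19", "Q203", qualifiers=[Qualifier("P20", "Q29")])
print(describe(example, LABELS))
```

Keeping qualifiers as a list on the statement mirrors the idea that several facets (location, interaction type, and so on) can be attached to a single "interacts with" claim.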
; How are taxa represented in the database?
: Ideally, the taxa should be identified to the species/strain level, be associated with published sequence information, and be linked to external taxonomy databases. Sequence data are especially important for microorganisms, where morphology alone is insufficient to identify most species reliably. The associated sequence information provides an empirical reference point for disambiguation, should the taxonomy change in the future.
:; 1. The taxon name/concept in the original scientific publication(s) corresponds to the NCBI taxon, and sequence data are available.
:: Example: ''Candidatus'' Megaira polyxenophila is a bacterium that has been described in the scientific literature, and given a provisional ("Candidatus") name at the species level. An NCBI taxon ID has been assigned to this taxon name, which corresponds to the taxon concept from the literature.
:: In this database, such items are encoded as instances of the class "taxon".
:: In some cases, a taxon concept has been assigned a different name in the publication from the name that appears in the NCBI Taxonomy record. This usually happens when a sequence was first published in NCBI under a temporary name, which has not been updated with the subsequent formal name. These are nonetheless encoded here as "taxon" items, but are additionally tagged as [[Item:Q658|"NCBI taxon name needs updating"]]. Example: [[Item:Q2018|''Candidatus'' Vesiculincola pelomyxae]] is still listed under a provisional name "Eubacteriales bacterium SKADARSKE-1" in the NCBI Taxonomy, but the provisional name is clearly unique and refers to the same taxon entity as the published name.
:; 2. The taxon name/concept in the original publication does not correspond exactly to the NCBI taxon, but sequence data are available.
:: Example: The ciliate species [[Item:Q52|''Eufolliculina methanicola'']] was described and named in a scientific publication, but sequences from that study were published under a non-specific identifier, "Folliculinidae sp.", in the NCBI Taxonomy, which potentially represents sequences from multiple taxa.
:: Therefore, the item for ''Eufolliculina methanicola'' is encoded in this database as an instance of [[Item:Q56|"taxon without equivalent NCBI taxon"]], because it does not have an exact equivalent in the NCBI Taxonomy, although we think it represents a legitimate taxon, based on the information in the cited references.
:: To allow us to track the identity, should the NCBI Taxonomy be updated in the future, a representative NCBI sequence record for this taxon, as given in the original publication that described it, is linked to this item with the property [[Property:P28|"representative sequence for taxon without equivalent NCBI taxon"]].
:; 3. No sequence data are available for a given taxon name/concept.
:: Possible reasons:
::* Sequence data were produced in a study but not published
::* Organism was identified by morphology alone, without using sequencing methods
:: Such items are encoded as instances of [[Item:Q1579|"named organism without published sequence data"]].
:: Example: A [[Item:Q410|species of ''Arcobacter'']] was identified as a symbiont of [[Item:Q409|''Bihospites bacati'']], but although sequencing of a marker gene was reported, the sequence was not published.
:; 4. The organism is only identified to a higher taxonomic level.
:: Possible reasons:
::* Organism was identified by morphology alone, without using sequencing methods
::* Organism was identified with group-specific molecular probes, but these were not sufficient to identify them to the species or strain level.
:: Example: The ciliate [[Item:Q1782|''Frontonia leucas'']] is associated with [[Item:Q158|unclassified Alphaproteobacteria]], identified with group-specific molecular probes. However, no sequence data were reported that would allow the bacteria to be identified more specifically.
:: Such items are also encoded as instances of [[Item:Q1579|"named organism without published sequence data"]].
:; Summary:
{| class="wikitable"
! Item class/subclass in this database
! Sequence data available
! NCBI Taxon equivalent
! Wikidata item equivalent
! taxon rank
|-
| [[Item:Q2|taxon]]
| yes
| yes
| maybe
| species or strain
|-
| [[Item:Q56|taxon without equivalent NCBI taxon]]
| yes
| no
| maybe
| species or strain
|-
| [[Item:Q1579|named organism without published sequence data]]
| no
| no
| maybe
| any
|}
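The summary table can be read as a small decision rule. The sketch below is a simplification under that reading: the function name <code>classify_taxon_item</code> is hypothetical, and it considers only the two columns that distinguish the classes (sequence data and NCBI taxon equivalent). The returned item IDs are the ones given in the table.

```python
def classify_taxon_item(has_sequence_data: bool, has_ncbi_equivalent: bool) -> str:
    """Return the item ID of the class a taxon item would be assigned to.

    Q2    = taxon
    Q56   = taxon without equivalent NCBI taxon
    Q1579 = named organism without published sequence data
    """
    if not has_sequence_data:
        # Cases 3 and 4: no published sequence, regardless of rank.
        return "Q1579"
    if has_ncbi_equivalent:
        # Case 1: publication and NCBI Taxonomy agree.
        return "Q2"
    # Case 2: sequence exists, but no exact NCBI taxon match.
    return "Q56"


print(classify_taxon_item(True, True))    # Q2
print(classify_taxon_item(True, False))   # Q56
print(classify_taxon_item(False, False))  # Q1579
```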
; How is the taxonomic hierarchy represented in the database?
: The taxon items linked by "interacts with" statements should be at the species or strain level, similar to the concept of a "submittable organism name" in the ENA database (https://ena-docs.readthedocs.io/en/latest/faq/taxonomy_requests.html#submittable-organism-names).
: Where possible, taxon items are linked to their corresponding NCBI taxon ID and/or Wikidata item, so they can be placed into the context of either the NCBI Taxonomy or the Wikidata taxonomy trees.
: However, not all taxa have corresponding NCBI taxon IDs or Wikidata items. In some cases ("placeholder taxon"), the NCBI Taxonomy may be inconsistent with the published literature or not yet updated. In other cases ("named organism without published sequence data"), there is information about their taxonomy, but sequence data are not available. Or the taxon may not meet the notability criteria for inclusion in the Wikidata database (e.g. strains or species that have not been formally named).
: Therefore, every taxon item is linked to its next highest formally named parent taxon. The parent taxon item is linked to Wikidata. This allows us to query the database taxonomically using the Wikidata taxonomy tree, with a [[Project:SPARQL/examples#Find_interactions_where_symbiont_is_member_of_Alphaproteobacteria_according_to_Wikidata|federated SPARQL query]], even for taxon items that are not represented in external databases.
: The parent taxon items are instances of the class "higher taxon" and should not be used for "interacts with" statements.
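The two-step lookup described above (climb to the nearest parent taxon that is linked to Wikidata, then follow the Wikidata taxonomy tree) can be illustrated with a toy in-memory example. This is a sketch of the logic only, not the federated SPARQL query itself: the function <code>is_member_of</code>, the <code>local:</code> and <code>wd:</code> identifiers, and the unnamed strain are all made up for illustration, and the small lineage stands in for the real Wikidata tree.

```python
def is_member_of(item, target, parent_of, wikidata_of, wikidata_parent):
    """Check whether a local taxon item falls under a Wikidata taxon.

    parent_of       : local item -> local parent ("higher taxon") item
    wikidata_of     : local item -> linked Wikidata identifier
    wikidata_parent : Wikidata identifier -> parent Wikidata identifier
    """
    # Step 1: climb the local hierarchy until an item has a Wikidata link.
    current, seen = item, set()
    while current not in wikidata_of:
        if current in seen or current not in parent_of:
            return False  # no linked ancestor found
        seen.add(current)
        current = parent_of[current]

    # Step 2: walk up the (toy) Wikidata taxonomy tree.
    node, visited = wikidata_of[current], set()
    while node is not None and node not in visited:
        if node == target:
            return True
        visited.add(node)
        node = wikidata_parent.get(node)
    return False


# Hypothetical data: an unnamed strain with no Wikidata item of its own.
parent_of = {"local:unnamed-strain": "local:Megaira"}
wikidata_of = {"local:Megaira": "wd:Megaira"}
wikidata_parent = {
    "wd:Megaira": "wd:Rickettsiaceae",
    "wd:Rickettsiaceae": "wd:Rickettsiales",
    "wd:Rickettsiales": "wd:Alphaproteobacteria",
}

print(is_member_of("local:unnamed-strain", "wd:Alphaproteobacteria",
                   parent_of, wikidata_of, wikidata_parent))
```

In the real database the second step is delegated to the Wikidata query service, so the local instance never needs to store the full taxonomy.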
; Is this a standalone project? Why not directly integrate this into Wikidata?
: This project was built as a standalone Wikibase instance, instead of directly adding entries to Wikidata, because:
:* Not all taxa meet the notability criteria for Wikidata, e.g. species that have not been formally named
:* We wish to structure the concepts and statements differently from Wikidata, e.g. using "interacts with" statements
:* Our preferred classification may conflict with what is already present in Wikidata
: Nonetheless, this is not a standalone project as items are cross-referenced to Wikidata, NCBI Taxonomy, and other ontologies, to permit federated queries. Where possible, taxa that were previously not represented in Wikidata and which meet their notability criteria have been added there. We have also proactively informed the NCBI Taxonomy team about taxon names that required updating in their database.
; Why use a single 'interacts with' statement, with qualifiers for interaction type, instead of different properties for each interaction type?
: The nature of an interaction is often not fully understood, or may have multiple facets. Coding interaction types as qualifiers allows us to stack multiple functional roles on a single interaction.
; Why host this on Wikibase?
: This database has seen a number of iterations: starting as a table in a word processor file, then spreadsheets, a custom SQLite database, and an attempt to homebrew a structured database with XML files and Python scripts. After getting some experience on Wikidata, I found that Wikibase offers the key features that I wanted: flexible and extensible schemata, graphical frontend for manual data entry, options for programmatic data import from tables, integration with external databases, and a sophisticated query interface.
; What is the beautiful organism depicted in the logo?
: [[Item:Q7|''Kentrophoros'' sp. H]]
: (The logo may not be visible in the mobile version of this site.)

Latest revision as of 09:41, 23 September 2024

Introduction

This project aims to describe symbiotic interactions between protists and prokaryotes as Linked Open Data. Information in the database is compiled from the scientific literature; ~700 symbiotic interactions described in ~380 scientific publications are represented at present. The database is hosted on Wikibase Cloud, a hosting service for Wikibase instances provided by Wikimedia Deutschland.

What can the database be used for?

  • Search and browse symbiotic interactions by biological taxonomy, leveraging cross-references to external taxonomies and databases
  • Find interactions that were described in earlier literature but not yet studied with modern methods
  • Programmatically find sequence data, literature, etc. by querying the NCBI databases using linked NCBI taxon IDs
  • Share data to be indexed in GloBI, through periodic data exports
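
As a rough illustration of the NCBI use case in the list above, the following sketch builds an NCBI E-utilities search URL from a taxon ID. It assumes the taxon ID has already been retrieved from the database; the function name is hypothetical, and taxon ID 2 (Bacteria) is used only as a placeholder. No request is sent, so the code runs offline.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def ncbi_nucleotide_search_url(taxon_id: int, retmax: int = 20) -> str:
    """Build an ESearch URL for nucleotide records under an NCBI taxon."""
    params = {
        "db": "nuccore",
        # 'txid<ID>[Organism:exp]' also matches descendants of the taxon.
        "term": f"txid{taxon_id}[Organism:exp]",
        "retmode": "json",
        "retmax": retmax,
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# Placeholder taxon ID; substitute one linked from a database item.
print(ncbi_nucleotide_search_url(2, retmax=5))
```

The resulting URL can be fetched with any HTTP client; NCBI asks users of E-utilities to respect its rate limits.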

Documentation of the workflow and project administration is linked from Project:Main. Your questions may be answered at Project:Q&A.

Updates

  • 2024-09-21 : Addshore, the original developer of the precursor to Wikibase.cloud (WBStack), wrote a blog post reflecting on the past two years since the project was transferred to Wikimedia Deutschland. I contributed a short user testimonial.
  • 2024-08-26 : Released a preprint manuscript describing the database motivation and design, intended for biologists who may use the database or want to build similar projects

Explore the data

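Some example entries to see how the data are modeled:

  • Parduczia sp. from Santa Barbara Basin ("brown ciliate"), a marine ciliate
  • Pelomyxa palustris, a freshwater amoebozoan
  • Mixotricha paradoxa, itself a hindgut symbiont of the termite Mastotermes darwiniensis
  • Candidatus Megaira polyxenophila, a bacterial endosymbiont of a diverse array of protists and algae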

Each interaction statement is supported by one or more references to the scientific literature, linked by their DOI if available.

Entities (e.g. taxa, publications) can be found by a free-text search of their labels with the search bar (top of each page) or at Special:Search.

Semantic queries, which make use of the data model (e.g. "find host species where symbionts are Alphaproteobacteria"), require the use of the SPARQL query language. Use the Query Service (link on menu bar) to launch SPARQL queries; try the example queries to get started.
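
For readers new to SPARQL, here is a minimal sketch of what a semantic query might look like when assembled from Python. It relies on several assumptions: the base URL is a placeholder rather than this site's real address, the prefix and endpoint paths follow the usual Wikibase Cloud layout, and the function names are hypothetical. The only identifier taken from this project is P19, the "interacts with" property described in the project Q&A.

```python
from urllib.parse import urlencode


def build_interaction_query(base_url: str, limit: int = 10) -> str:
    """Return a SPARQL query listing host/symbiont pairs linked by P19."""
    return f"""
PREFIX wdt: <{base_url}/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?host ?hostLabel ?symbiont ?symbiontLabel WHERE {{
  ?host wdt:P19 ?symbiont .
  ?host rdfs:label ?hostLabel . FILTER(LANG(?hostLabel) = "en")
  ?symbiont rdfs:label ?symbiontLabel . FILTER(LANG(?symbiontLabel) = "en")
}}
LIMIT {limit}
""".strip()


def build_request_url(base_url: str, query: str) -> str:
    """Build a GET URL for the (assumed) SPARQL endpoint path."""
    return f"{base_url}/query/sparql?" + urlencode({"query": query, "format": "json"})


# Placeholder base URL; replace with the address of the actual instance.
BASE = "https://example.wikibase.cloud"
query = build_interaction_query(BASE, limit=5)
print(query)
print(build_request_url(BASE, query))
```

The printed query can also be pasted directly into the Query Service, which is the simpler route for one-off exploration.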

Data linking

We wish to capture different facets of symbiotic interactions, and link these to other databases and ontologies.

{| class="wikitable"
! Information
! Relevant database or ontology
|-
| Taxonomy of interacting organisms
| NCBI Taxonomy, Wikidata
|-
| Localization of symbionts in host organism
| Gene Ontology, Uberon
|-
| Nature of biotic interactions, if known/inferred
| Relation Ontology
|-
| Analytical methods used to identify organisms or interaction type
| OBI, Evidence Ontology
|-
| Environment where organisms were isolated
| Environment Ontology
|-
| Representative SSU rRNA or genome sequences
| Genbank
|-
| Scientific publications describing symbiosis
| DOI, Wikidata
|}

Terms will be linked to other linked open data or ontologies, if there is a suitable exact match. The EMBL-EBI Ontology Lookup Service is a useful resource for browsing life science related ontologies.

Links

This project originated as part of my doctoral dissertation (2017).

Similar projects elsewhere: