Project:Curation workflow

From Protist-Prokaryote Symbiosis Database

How to curate/edit entries (overview)

For more details on each step, refer to the section "Annotation guidance" below.

  • Find descriptions of symbiotic interactions in the literature, see current backlog: Project:New studies to add
  • Check if taxon items representing symbiotic interaction partners are already represented in the database (search for taxon names, potential alternative names)
  • If not, create taxon items, linking the NCBI taxon ID (Property:P11) if available
  • From the item representing host, link to the symbiont with an 'interacts with' Property:P19 statement
  • Add the following information about the symbiotic interaction as qualifiers to the 'interacts with' statement. Skip any qualifier that is not known or not specified in the publication:
      • Localization of symbiont in host cell/body, Property:P20
      • Analytical methods used to identify the symbiont, Property:P22, e.g. light microscopy, metagenome sequencing, phylogenetic marker sequencing
      • Analytical methods used to identify the host, Property:P42, e.g. microbiological culture, light microscopy
      • Nature/outcome of the interaction, Property:P26 (under development)
  • If the interaction is not found in nature but the result of experimental manipulation, use the 'interacts experimentally with' Property:P41 statement instead, with the same qualifiers.
  • Cite the source of the information in a reference, using the reference DOI Property:P27 (without the 'doi:' prefix). Reference items will be semi-automatically created/linked later.
  • Add information about the environment where the organism was found. For cultured organisms, this should reflect the original environment where the isolate was collected, if known.
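
The steps above can be sketched as a small data structure. This is only an illustration of how the pieces of an interaction claim fit together, not the schema used by PPSDB or its scripts: the class name, field names, and the example DOI and taxon labels are hypothetical placeholders, and only the property IDs (P19, P41, P20, P27) are taken from this page.

```python
from dataclasses import dataclass, field


def normalize_doi(raw: str) -> str:
    """Return the DOI without surrounding whitespace or a leading 'doi:' prefix."""
    doi = raw.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[len("doi:"):].strip()
    return doi


@dataclass
class InteractionClaim:
    """Hypothetical in-memory model of one 'interacts with' statement."""
    host: str                    # taxon item holding the statement
    symbiont: str                # taxon item that is the statement's value
    reference_doi: str           # stored in the reference as Property:P27
    experimental: bool = False   # True if the interaction is not found in nature
    qualifiers: dict = field(default_factory=dict)

    def __post_init__(self):
        # P27 values are entered without the 'doi:' prefix
        self.reference_doi = normalize_doi(self.reference_doi)

    @property
    def main_property(self) -> str:
        # P41 'interacts experimentally with' replaces P19 for manipulated systems
        return "P41" if self.experimental else "P19"

    def add_qualifier(self, prop: str, value) -> None:
        # Qualifiers that are unknown or not stated in the publication are skipped
        if value is None:
            return
        self.qualifiers.setdefault(prop, []).append(value)


# Placeholder values, not a real record
claim = InteractionClaim("Host taxon A", "Symbiont taxon B", "doi:10.1234/example")
claim.add_qualifier("P20", "cytoplasm")
claim.add_qualifier("P26", None)  # not specified in the publication, so skipped
```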

Automated maintenance tasks

Scripts to run some annotation tasks: https://github.com/kbseah/ppsdb-utils

Scripts and export for GloBI indexing: https://github.com/kbseah/ppsdb-globi-export

The following tasks are (semi-)automated, so it is not necessary to do them manually. Scripts are currently triggered manually and run on an ad hoc basis.

  • New citations can be added by giving only Property:P27 "reference DOI" in the reference claim of a statement.
  • The scripts in ppsdb-utils create a new reference item if necessary and link it to the reference claim. Citations are linked to their corresponding Wikidata items, and metadata is pulled from there.

Annotation guidance

NCBI Taxon IDs and representative sequences

(details to come)

Labels

Labels for taxon items

  • The label for a formally named taxon should be its scientific taxon name, without authors. Other taxon synonyms can be given as aliases.
  • Retain the "Candidatus" prefix for Candidatus names. Use the originally published form in the initial publication. Subsequent "corrected" names can be given as aliases.
  • For informally named taxa, use the name that is in the cited publication, and add other synonyms as aliases.
  • If there is an equivalent NCBI taxon concept, add the names/synonyms listed in NCBI Taxonomy as aliases too.
  • In some cases, the informal name used in a publication may not be enough to disambiguate the taxon, e.g. "Rickettsiales endosymbiont": potentially many studies describe a "Rickettsiales endosymbiont" from various host species. In such cases, formulate a descriptive primary label, e.g. "Rickettsiales sp. ex Nuclearia pattersoni", or use the NCBI taxon name as the primary label.
  • Experimental: Labels are therefore not uniformly formatted taxon names. For the purposes of name alignment, formulate a taxon name with Open Nomenclature (ON) signs for Property:P12 (see below)
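
As an illustration of the descriptive label pattern in the "Rickettsiales sp. ex Nuclearia pattersoni" example, here is a minimal sketch. The function name is hypothetical, and the "<higher taxon> sp. ex <host>" pattern is inferred from that single example rather than from a documented naming rule.

```python
def descriptive_label(higher_taxon: str, host_name: str) -> str:
    """Build a label of the form '<higher taxon> sp. ex <host name>'."""
    return f"{higher_taxon.strip()} sp. ex {host_name.strip()}"


label = descriptive_label("Rickettsiales", "Nuclearia pattersoni")
```

The name used in the publication can still be recorded as an alias so that the item remains findable under it.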

Labels for reference items

  • The primary label should be the title of the publication
  • The DOI of the article should be an alias of the publication, to enable easier searching with the search bar on the webpage

Higher taxonomy (P29, P32)

Higher taxa (ranks genus and above) are present in PPSDB to ensure that all taxon items can be placed within a taxonomy tree, even if they are not formally described taxa.

For example, consider an informal taxon concept "Alphaproteobacteria sp. A" which is mentioned in a publication on the basis of an SSU rRNA sequence. There is no corresponding NCBI Taxonomy item, nor a formal taxon name that can be added to Wikidata, so the taxon item itself cannot be mapped to an external taxonomy. However, informal taxa are almost always contextualized in relation to a formal taxon. In this case, the authors claim that it is contained in Alphaproteobacteria, which can be mapped to both external taxonomies (NCBI and Wikidata). We therefore link the informal taxon "Alphaproteobacteria sp. A" to its parent taxon item, Alphaproteobacteria. We can then use that property in SPARQL queries to find taxa that are themselves not mapped to other taxonomies: example query.

  • Link taxon items to their higher taxa with Property:P29 "parent taxon".
  • Higher taxa are instances of Item:Q1488 "higher taxon".
  • Higher taxon items should not be used in interaction claims (Property:P19) directly.
  • Higher taxa must be formally named taxa that are suitable for inclusion in Wikidata.
  • They should at least be mapped to Wikidata with Property:P2 "Wikidata mapping". Create new Wikidata items if necessary, and ensure that they have the basic properties required for a taxon item on Wikidata.
  • Taxon rank should be specified with Property:P32 "has taxon rank".
  • It is not necessary to populate a complete taxonomy tree in PPSDB, as we are mapping our taxon items to the Wikidata and NCBI taxonomies. (There is an experiment with representing the PR2 taxonomy tree in PPSDB, but it is not up to date.)
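
The idea behind the parent taxon link can be sketched with a toy in-memory model: an unmapped informal taxon is placed in an external taxonomy by walking up its "parent taxon" links until an item with a Wikidata mapping is found. The item IDs, the dictionary keys, and the mapping value below are hypothetical placeholders, with "parent" standing in for Property:P29 and "wikidata" for Property:P2. In practice this lookup would be a SPARQL query against the database.

```python
# Toy data; IDs and the mapping value are placeholders, not real PPSDB or Wikidata IDs
ITEMS = {
    "item_sp_A": {"label": "Alphaproteobacteria sp. A", "parent": "item_alpha", "wikidata": None},
    "item_alpha": {"label": "Alphaproteobacteria", "parent": None, "wikidata": "WIKIDATA-ID-PLACEHOLDER"},
}


def nearest_mapped_ancestor(item_id, items):
    """Follow parent links upward and return the first ancestor with a Wikidata mapping."""
    seen = set()  # guards against accidental cycles in the parent links
    current = items[item_id]["parent"]
    while current is not None and current not in seen:
        seen.add(current)
        if items[current]["wikidata"] is not None:
            return current
        current = items[current]["parent"]
    return None


anchor = nearest_mapped_ancestor("item_sp_A", ITEMS)
```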

Interaction claims (P19, P41) and their qualifiers

Subject body part (P20)

Terms should be mapped to Gene Ontology or Uberon.

Methods used to identify interaction partners (P22, P42)

Interaction type (function, outcome) (P26)

Methods used to characterize interaction type (P45)

References that support or refute a given claim (P23, P43)

Environment terms (P36, P38, P40)

Follow the guidelines for using EnvO terms in the MIxS standard: https://github.com/EnvironmentOntology/envo/wiki/ENVO-annotations-for-MIxS-v5

Use subclasses/instances of the following classes with the respective properties. The items within each class should be mapped to subclasses/instances of the corresponding EnvO terms with Property:P37.

Create new environment term items if necessary. These should be drawn from EnvO and mapped to the equivalent EnvO term. Within PPSDB, the environment term should be filed as a subclass of one of the three classes below. For now, it is not necessary to replicate the subclass structure of EnvO within PPSDB.

Property | Class | EnvO term
Property:P36 "environmental material of origin" | Item:Q1597 "environmental material" | http://purl.obolibrary.org/obo/ENVO_00010483
Property:P38 "environmental system of origin" | Item:Q1602 "environmental system" | http://purl.obolibrary.org/obo/ENVO_01000254
Property:P40 "local environmental context" | Item:Q1673 "astronomical body part" | http://purl.obolibrary.org/obo/ENVO_01000813
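
For scripted checks, the property-to-class mapping in the table above can be written as a lookup. This is a sketch only: the variable and function names are hypothetical, and the helper assumes the usual OBO convention of turning a PURL such as .../obo/ENVO_00010483 into the compact form ENVO:00010483.

```python
OBO_PREFIX = "http://purl.obolibrary.org/obo/"

# property ID -> (property label, PPSDB class item, EnvO term IRI), copied from the table
ENVIRONMENT_PROPERTIES = {
    "P36": ("environmental material of origin", "Q1597", OBO_PREFIX + "ENVO_00010483"),
    "P38": ("environmental system of origin", "Q1602", OBO_PREFIX + "ENVO_01000254"),
    "P40": ("local environmental context", "Q1673", OBO_PREFIX + "ENVO_01000813"),
}


def envo_curie(iri: str) -> str:
    """Convert an OBO PURL into a compact identifier, e.g. 'ENVO:00010483'."""
    if not iri.startswith(OBO_PREFIX):
        raise ValueError(f"not an OBO PURL: {iri}")
    prefix, sep, number = iri[len(OBO_PREFIX):].partition("_")
    if not sep or not prefix or not number:
        raise ValueError(f"unexpected term identifier in: {iri}")
    return f"{prefix}:{number}"


material_class = ENVIRONMENT_PROPERTIES["P36"][1]
material_curie = envo_curie(ENVIRONMENT_PROPERTIES["P36"][2])
```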

References without DOIs

If a publication does not have a DOI, a reference item has to be created and linked manually. If the full text is available online, specify the URL (Property:P15); otherwise give the citation as free text (Property:P14).

Representative sequence records (P34, P46)

  • A representative sequence is important to clarify the identity of taxa, particularly informally named ones, in the case of updates or changes to classification and taxonomy.
  • The representative sequence record must be one that is stated in a publication that describes the interaction/organism. The publication DOI/item should be cited in a reference statement.
  • Items that are instances of taxon but have no equivalent NCBI taxon must have a representative SSU rRNA or genome sequence accession, to ensure that the taxon concept can be reconstructed later.
  • For SSU rRNA sequences (Property:P34), use the GenBank accession, with or without the version suffix. This is typically one that is directly stated in a publication, either as a single accession or within a range of accession IDs.
  • For genome data (Property:P46), publications may cite various identifiers (BioProject, BioSample, WGS, GenBank), so the accession for the genome sequence itself may not appear directly in the cited publication. The preferred accession to cite is either the assembly accession (GCA_ prefix) or the WGS contig set accession.
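
A rough sanity check for accession strings can help catch typos before they are entered. The sketch below is not part of ppsdb-utils. The regular expressions are an assumption based on commonly documented INSDC accession shapes (1 letter plus 5 digits, or 2 letters plus 6 or 8 digits, for nucleotide records; "GCA_" plus 9 digits for assemblies; 4 or 6 letters followed by digits for WGS sets) and may not cover every valid accession. The example strings are made-up placeholders, not real records.

```python
import re

VERSION = r"(?:\.\d+)?"  # optional version suffix such as '.1'
NUCLEOTIDE = re.compile(rf"^(?:[A-Z]\d{{5}}|[A-Z]{{2}}\d{{6}}|[A-Z]{{2}}\d{{8}}){VERSION}$")
ASSEMBLY = re.compile(rf"^GCA_\d{{9}}{VERSION}$")
WGS = re.compile(rf"^(?:[A-Z]{{4}}\d{{8,}}|[A-Z]{{6}}\d{{9,}}){VERSION}$")


def strip_version(accession: str) -> str:
    """Remove a trailing version suffix, e.g. 'AB123456.1' -> 'AB123456'."""
    return re.sub(r"\.\d+$", "", accession.strip())


def classify_accession(accession: str) -> str:
    """Return 'assembly', 'wgs', 'nucleotide', or 'unknown' for an accession string."""
    acc = accession.strip().upper()
    if ASSEMBLY.match(acc):
        return "assembly"
    if WGS.match(acc):
        return "wgs"
    if NUCLEOTIDE.match(acc):
        return "nucleotide"
    return "unknown"


# Placeholder strings that only follow the expected shapes
kinds = [classify_accession(a) for a in ("AB123456.1", "GCA_000000000.1", "ABCD01000000", "PRJNA000")]
```

Under these assumptions, "assembly" and "wgs" results correspond to the preferred forms for Property:P46, while an "unknown" result (for example a BioProject identifier) suggests looking up the sequence accession itself.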

Images (P33)

(Experimental)

  • Open-licensed images hosted on Wikimedia Commons may be linked to taxon items. A thumbnail of that image should appear as a preview in PPSDB, and can also be used in visualizations in the SPARQL query service.
  • The images should be micrographs or illustrations that depict the named organisms.
  • The corresponding item in Commons should be annotated with the organisms they depict (under Structured Data).
  • If the organism is described in an open access publication published under a suitable license (e.g. CC-BY-4.0), the figures can be uploaded to Commons and then linked to the PPSDB item.

Taxon name with Open Nomenclature signs (P12)

Schema validation