Project:Curation workflow
Manual annotation steps
- Find descriptions of symbiotic interactions in the literature, see current backlog: Project:New studies to add
- Check whether taxon items for the symbiotic interaction partners already exist in the database (search for taxon names and potential alternative names)
- If not, create taxon items, linking the NCBI taxon ID Property:P11 if available
- From the item representing the host, link to the symbiont with an 'interacts with' Property:P19 statement
- Add the following information about the symbiotic interaction as qualifiers to the 'interacts with' statement. Skip if not known or not specified in the publication.
- Localization of symbiont in host cell/body, Property:P20
- Analytical methods used to identify the symbiont, Property:P22, e.g. light microscopy, metagenome sequencing, phylogenetic marker sequencing
- Analytical methods used to identify the host, Property:P42, e.g. microbiological culture, light microscopy
- Nature/outcome of the interaction, Property:P26 (under development)
- If the interaction is not found in nature but is the result of experimental manipulation, use the 'interacts experimentally with' Property:P41 statement instead, with the same qualifiers.
- Cite the source of the information in a reference, using the reference DOI Property:P27 (without the 'doi:' prefix). Reference items will be semi-automatically created/linked later.
- Add information about the environment where the organism was found. For cultured organisms, this should reflect the original environment where the isolate was collected, if known.
- Environmental material of origin Property:P36
- Broad environmental context Property:P38
- Local environmental context Property:P40
- The order in which properties are displayed is set at MediaWiki:Wikibase-SortedProperties; otherwise, statements are displayed in the order they are added.
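As a rough illustration of the manual steps above, the following Python sketch assembles one interaction claim as a plain dictionary. The property IDs are the ones listed above; the function names, the dictionary layout, the item labels, and the example DOI are hypothetical and do not correspond to the Wikibase data model or to any PPSDB tooling.

```python
def normalize_doi(raw):
    """Return the DOI without a leading 'doi:' prefix.

    Also stripping a doi.org resolver URL is an assumption made for this sketch.
    """
    doi = raw.strip()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):].strip()
            break
    return doi


def build_interaction_claim(host, symbiont, doi, experimental=False,
                            localization=None, symbiont_methods=None,
                            host_methods=None):
    """Assemble one claim as a plain dict (hypothetical layout)."""
    qualifiers = {
        "P20": localization,       # localization of symbiont in host
        "P22": symbiont_methods,   # methods used to identify the symbiont
        "P42": host_methods,       # methods used to identify the host
    }
    return {
        "subject": host,
        # P41 for experimentally induced interactions, P19 otherwise
        "property": "P41" if experimental else "P19",
        "value": symbiont,
        # Qualifiers that are unknown or unspecified are skipped
        "qualifiers": {p: v for p, v in qualifiers.items() if v},
        "references": {"P27": normalize_doi(doi)},
    }


# Placeholder labels and a made-up DOI, for illustration only
claim = build_interaction_claim(
    "host taxon item", "symbiont taxon item",
    doi="doi:10.1000/example", localization="cytoplasm",
)
print(claim)
```

Unknown qualifiers are simply left out of the result, mirroring the instruction to skip information that is not specified in the publication.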
Automated maintenance tasks
(under development) The following tasks are (semi-)automated, so it is not necessary to do them manually. Scripts are currently run ad hoc.
- Scripts to create new reference items from DOIs in reference statements and link them
Annotation guidance
NCBI Taxon IDs and representative sequences
(details to come)
Higher taxonomy (P29, P32)
Higher taxa (ranks genus and above) are present in PPSDB to ensure that all taxon items, including those that are not formally described taxa, can be placed within a taxonomy tree.
For example, consider an informal taxon concept "Alphaproteobacteria sp. A" which is mentioned in a publication on the basis of an SSU rRNA sequence. There is no corresponding NCBI Taxonomy item, nor a formal taxon name that can be added to Wikidata. Therefore the taxon item itself cannot be mapped to an external taxonomy. However, informal taxa are almost always contextualized in relation to a formal taxon. In this case, the authors claim that it is contained in Alphaproteobacteria, which can be mapped to both external taxonomies (NCBI and Wikidata). We therefore link the informal taxon "Alphaproteobacteria sp. A" to its parent taxon item, Alphaproteobacteria. We can then use the parent taxon property (Property:P29) in SPARQL queries to find taxa that are themselves not mapped to other taxonomies: example query.
- Link taxon items to their higher taxa with Property:P29 "parent taxon".
- Higher taxa are instances of Item:Q1488 "higher taxon".
- Higher taxon items should not be used in interaction claims (Property:P19) directly.
- Higher taxa must be formally named taxa that are suitable for inclusion in Wikidata.
- They should at least be mapped to Wikidata with Property:P2 "Wikidata mapping". Create new Wikidata items if necessary, and ensure that they have the basic properties required for a taxon item on Wikidata.
- Taxon rank should be specified with Property:P32 "has taxon rank".
- It is not necessary to populate a complete taxonomy tree in PPSDB, as we are mapping our taxon items to the Wikidata and NCBI taxonomies. (There is an experimental representation of the PR2 taxonomy tree in PPSDB, but it is not up to date.)
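The kind of SPARQL query mentioned above can be sketched as follows. This is not the example query linked from this page; it is a minimal illustration that builds a query string selecting items with a parent taxon (P29) but neither an NCBI taxon ID (P11) nor a Wikidata mapping (P2). The function name is hypothetical and the default property prefix URI is a placeholder, since the actual prefix depends on the Wikibase instance.

```python
def unmapped_taxa_query(direct_prop_prefix="https://example.org/prop/direct/"):
    """Build a SPARQL query string; the default prefix URI is a placeholder."""
    return f"""PREFIX wdt: <{direct_prop_prefix}>
SELECT ?taxon ?parent WHERE {{
  ?taxon wdt:P29 ?parent .                      # has a parent taxon
  FILTER NOT EXISTS {{ ?taxon wdt:P11 ?ncbi }}  # no NCBI taxon ID
  FILTER NOT EXISTS {{ ?taxon wdt:P2 ?wd }}     # no Wikidata mapping
}}"""


query = unmapped_taxa_query()
print(query)
```

The resulting string can be pasted into a SPARQL query service once the prefix has been replaced with the one used by the target instance.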
Interaction claims (P19, P41) and their qualifiers
Subject body part (P20)
Terms should be mapped to Gene Ontology or Uberon.
Methods used to identify interaction partners (P22, P42)
Interaction type (function, outcome) (P26)
Methods used to characterize interaction type (P45)
References that support or refute a given claim (P23, P43)
Environment terms (P36, P38, P40)
Follow the guidelines for using EnvO terms in the MIxS standard: https://github.com/EnvironmentOntology/envo/wiki/ENVO-annotations-for-MIxS-v5
Use subclasses/instances of the following classes with the respective properties. The items within each class should be mapped to subclasses/instances of the corresponding EnvO terms with Property:P37.
Create new environment term items if necessary. These should be drawn from EnvO and mapped to the equivalent EnvO term. Within PPSDB, the environment term should be filed as a subclass of one of the three classes below. For now, it is not necessary to replicate the subclass structure of EnvO within PPSDB.
Property | Class | EnvO term |
---|---|---|
Property:P36 environmental material of origin | Item:Q1597 environmental material | http://purl.obolibrary.org/obo/ENVO_00010483 |
Property:P38 environmental system of origin | Item:Q1602 environmental system | http://purl.obolibrary.org/obo/ENVO_01000254 |
Property:P40 local environmental context | Item:Q1673 astronomical body part | http://purl.obolibrary.org/obo/ENVO_01000813 |
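For scripted checks, the table above could be held as a small lookup structure. The sketch below is only an illustration: the dictionary name, its keys, and the helper function are hypothetical, while the property IDs, item IDs, and EnvO PURLs are copied from the table.

```python
# Property -> PPSDB class item and corresponding EnvO term (from the table above)
ENV_PROPERTIES = {
    "P36": {"class_item": "Q1597",
            "envo": "http://purl.obolibrary.org/obo/ENVO_00010483"},
    "P38": {"class_item": "Q1602",
            "envo": "http://purl.obolibrary.org/obo/ENVO_01000254"},
    "P40": {"class_item": "Q1673",
            "envo": "http://purl.obolibrary.org/obo/ENVO_01000813"},
}


def purl_to_curie(purl):
    """Convert an OBO PURL such as .../obo/ENVO_00010483 to 'ENVO:00010483'."""
    local = purl.rstrip("/").rsplit("/", 1)[-1]
    prefix, sep, number = local.partition("_")
    if not sep or not number.isdigit():
        raise ValueError(f"not an OBO-style PURL: {purl}")
    return f"{prefix}:{number}"


for prop, info in ENV_PROPERTIES.items():
    print(prop, info["class_item"], purl_to_curie(info["envo"]))
```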
References without DOIs
If a publication does not have a DOI, a reference item has to be created and linked manually. If the full text is available online, specify the URL Property:P15; otherwise, give the citation as free text Property:P14.
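The order of preference described here (DOI first, then URL, then free-text citation) can be written as a small decision function. This is a sketch only; the function name and return format are hypothetical, and it does not create or link any reference items.

```python
def reference_property(doi=None, url=None, citation=None):
    """Pick the property that identifies a source, in order of preference."""
    if doi:
        return ("P27", doi)       # reference DOI
    if url:
        return ("P15", url)       # full text available online
    if citation:
        return ("P14", citation)  # free-text citation as a last resort
    raise ValueError("no DOI, URL, or citation given")


# Placeholder URL, for illustration only
print(reference_property(url="https://example.org/fulltext"))
```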
Representative sequence records (P34, P46)
- A representative sequence is important to clarify the identity of taxa, particularly informally named ones, in the case of updates or changes to classification and taxonomy.
- The representative sequence record must be one that is stated in a publication that describes the interaction/organism. The publication DOI/item should be cited in a reference statement.
- Instances of taxon that have no equivalent NCBI taxon must have a representative SSU rRNA or genome sequence accession, so that the taxon concept can be reconstructed later.
- For SSU rRNA sequences Property:P34, use the GenBank accession, with or without the version suffix. This is typically one that is directly stated in a publication as a single accession or within a range of accession IDs.
- For genome data Property:P46, publications may cite various identifiers (BioProject, BioSample, WGS, GenBank), so the accession for the genome sequence itself may not appear directly in the cited publication. The preferred accession to cite is either the assembly (GCA_ prefix) or WGS contig set accession.
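When checking accessions copied from a publication, a rough format check such as the following may help to tell assembly accessions from ordinary nucleotide records. The regular expressions are approximations of common accession formats and are an assumption of this sketch, not PPSDB validation rules; WGS contig set accessions are not covered. The function names are hypothetical, and the accessions shown are format examples only.

```python
import re

# Approximate patterns (assumptions): 1 letter + 5 digits, 2 letters + 6 digits,
# or 2 letters + 8 digits, each with an optional ".version" suffix
NUCLEOTIDE = re.compile(r"(?:[A-Z]\d{5}|[A-Z]{2}\d{6}|[A-Z]{2}\d{8})(?:\.\d+)?")
# Assembly accessions: GCA_ prefix, 9 digits, optional ".version" suffix
ASSEMBLY = re.compile(r"GCA_\d{9}(?:\.\d+)?")


def accession_kind(accession):
    """Return 'assembly', 'nucleotide', or None if neither pattern matches."""
    acc = accession.strip()
    if ASSEMBLY.fullmatch(acc):
        return "assembly"
    if NUCLEOTIDE.fullmatch(acc):
        return "nucleotide"
    return None


def strip_version(accession):
    """Drop the version suffix, e.g. 'AB123456.1' -> 'AB123456'."""
    return accession.strip().split(".", 1)[0]


for acc in ("AB123456.1", "GCA_000000000.1", "PRJNA000000"):
    print(acc, accession_kind(acc), strip_version(acc))
```

A nucleotide-style accession may refer to either an SSU rRNA sequence or a genome sequence, so the check cannot decide between Property:P34 and Property:P46; that choice remains with the curator.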
Images (P33)
(Experimental)
- Open-licensed images hosted on Wikimedia Commons may be linked to taxon items. A thumbnail of that image should appear as a preview in PPSDB, and can also be used in visualizations in the SPARQL query service.
- The images should be micrographs or illustrations that depict the named organisms.
- The corresponding item in Commons should be annotated with the organisms it depicts (under Structured Data).
- If the organism is described in an open access publication published under a suitable license (e.g. CC-BY-4.0), the figures can be uploaded to Commons and then linked to the PPSDB item.