Project:Main: Difference between revisions

From Protist-Prokaryote Symbiosis Database
(Items that need some attention: conflated items fixed)

Revision as of 07:53, 9 July 2024

Lists of pages

Project pages

EntitySchema pages

MediaWiki pages

Software

  • ppsdb-utils (https://github.com/kbseah/ppsdb-utils/) -- Python scripts for various maintenance tasks (private GitHub repo)
  • ppsdb-globi-export (https://github.com/kbseah/ppsdb-globi-export) -- Export of core interaction information from PPSDB for indexing by GloBI

To do lists

NCBI taxon IDs requiring attention from NCBI Taxonomy team

Items that need some attention

  • Item:Q448 - should be split to two entries
  • Item:Q994 - check that all listed host species are included

Taxonomic groups that need updates

  • Euplotes symbionts
  • Ciliate symbionts generally

Project maintenance to-dos

In progress:

  • Add P42 statements "Method used to identify subject"
  • Add environmental origin statements: P36, P38, P40
  • Check for statements that should be moved from P19 to P41 "interacts experimentally with"
  • Interaction statements: If different aspects of the same symbiosis are described in different publications (e.g. phylogenetic identity vs. interaction type), encode these as separate statements with different qualifier values and references, instead of merging them into one statement where it is not clear which reference is cited in support of which claim. Example: Item:Q501. Use qualifier P45 "method used to determine interaction type".
  • To deal with taxa that are incertae sedis or not represented in NCBI or Wikidata: add "parent taxon" statements on all taxon items to link them to the next-higher-ranking formal taxon. The parent taxon is then linked to Wikidata. We should then be able to query taxonomically via parent taxa, using the Wikidata taxonomy.
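The semi-automatic parent-taxon matching listed under Chores (match the first word of a taxon name to a genus name, leave the rest for manual curation) could look roughly like the sketch below. The function name and both inputs are hypothetical; how the taxon labels and genus names are obtained from the Wikibase is not shown and is not taken from ppsdb-utils.

```python
def propose_parent_taxa(taxon_names, known_genera):
    """Split taxon names into (matched, unmatched) by their first word.

    taxon_names  -- iterable of taxon item labels (hypothetical input)
    known_genera -- iterable of genus names that already exist as items
                    (hypothetical input)
    """
    genera = set(known_genera)
    matched = {}    # taxon name -> proposed parent genus
    unmatched = []  # names that still need a manually added parent taxon
    for name in taxon_names:
        words = name.split()
        first = words[0] if words else ""
        # A genus item must not be proposed as its own parent
        if first in genera and name != first:
            matched[name] = first
        else:
            unmatched.append(name)
    return matched, unmatched
```

Names such as "endosymbiont of ..." do not start with a genus name and end up in the unmatched list, which corresponds to the items that have to be added manually.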

Chores (have to periodically clear them):

  • Process backlog of new references
  • Add higher taxonomy for newly added taxa that are not yet represented (PR2 for eukaryotes)
  • Add parent taxa for all taxon items semi-automatically: try to match the first word in the taxon name to a genus name; remaining items have to be added manually
  • Ensure that all items have a class, check that class "placeholder taxon" is consistently used
  • Find references that have only DOIs and link them to reference items, or create new reference items for them by Wikidata lookup
  • Link taxon items to Wikidata by their NCBI taxon IDs
  • Find (prokaryote) taxon items that have an NCBI taxon ID and an LPSN record, but which are not in Wikidata, and export them to Wikidata
  • Add formatted citations to reference items; get these from Crossref using the DOI: https://citation.crosscite.org/docs.html
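For the formatted-citation chore, the crosscite documentation linked above describes DOI content negotiation: a request to the DOI resolver with an Accept header of type text/x-bibliography returns a formatted citation. A minimal Python sketch, assuming the APA style and en-US locale are acceptable defaults (both are illustrative choices, as are the function names):

```python
import urllib.parse
import urllib.request


def citation_request(doi, style="apa", locale="en-US"):
    """Build a content-negotiation request for a formatted citation."""
    url = "https://doi.org/" + urllib.parse.quote(doi, safe="/")
    accept = f"text/x-bibliography; style={style}; locale={locale}"
    return urllib.request.Request(url, headers={"Accept": accept})


def fetch_citation(doi, **kwargs):
    """Send the request and return the citation as a plain string."""
    request = citation_request(doi, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8").strip()
```

Calling fetch_citation with the DOI stored on a reference item would return the text to store as the formatted citation. Error handling for unresolvable DOIs and rate limiting are left out of this sketch.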

Ideas:

  • Better documentation of the data modeling, example workflow for adding a new entry based on information in a publication
  • Interaction statement: Qualifier if identified in cultured strains (vs. in samples that were directly sampled from environment)
  • Copyright statement, privacy policy

NB: The taxonomy version (e.g. PR2 v5.0.0 for eukaryotes) is recorded by adding a 'stated in' reference to the 'parent taxon' statement, pointing to a reference item for that particular taxonomy version. Ideally we should be able to specify which taxonomy version we want to use, in case branches get moved around in the future.

Modeling:

  • Sane way to model placeholder taxa and non-specific taxon statements? E.g. general statements like "all members of this family are associated with methanogens", without creating items for each individual?
  • Modeling interactions: RO is insufficient?
  • Add statements about metabolism ("phototroph", "nitrogen fixer") to symbiont items?
  • Dummy taxon for "microbiome" to enable us to add references for microbiome studies?
  • Which environmental material is most appropriate for protists that are symbionts located in the digestive tract environment?

Draft annotation guidelines:

Export for GloBI

SPARQL query to export a table with the fields used by GloBI. The output may need to be processed further.

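# List all interactions and, optionally, the localization, interaction type, and references
# (rdfs:, prov:, wikibase: and bd: are not declared below; this assumes the query service predefines them)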
PREFIX pp: <https://ppsdb.wikibase.cloud/entity/>
PREFIX ppt: <https://ppsdb.wikibase.cloud/prop/direct/>
PREFIX pps: <https://ppsdb.wikibase.cloud/prop/>
PREFIX ppss: <https://ppsdb.wikibase.cloud/prop/statement/>
PREFIX ppsq: <https://ppsdb.wikibase.cloud/prop/qualifier/>
PREFIX ppsr: <https://ppsdb.wikibase.cloud/prop/reference/>


#SELECT DISTINCT ?argumentTypeName ?sourceTaxon ?sourceTaxonName ?sourceWdmap ?sourceTaxonId ?interactionTypeName ?interactionTypeId ?targetTaxon ?targetTaxonName ?targetWdmap ?targetTaxonId ?sourceBodyPartName ?sourceBodyPartId ?referenceDoi ?referenceCitation WHERE {
SELECT DISTINCT ?argumentTypeName ?sourceTaxonName ?sourceTaxonId ?interactionTypeName ?interactionTypeId ?targetTaxonName ?targetTaxonId ?sourceBodyPartName ?sourceBodyPartId ?referenceDoi ?referenceCitation WHERE {
  ?sourceTaxon pps:P19 ?interaction.
  ?interaction ppss:P19 ?targetTaxon.
  OPTIONAL {
    ?interaction ppsq:P20 ?sourceBodyPart. 
    ?sourceBodyPart rdfs:label ?sourceBodyPartName.
    OPTIONAL { ?sourceBodyPart ppt:P17 ?sourceBodyPartId. }
    OPTIONAL { ?sourceBodyPart ppt:P44 ?sourceBodyPartId. }
  }
  OPTIONAL {
    ?interaction ppsq:P26 ?type. 
    ?type rdfs:label ?typeLabel. 
    OPTIONAL { ?type ppt:P16 ?interactionTypeId. }
  }
  # if no interaction type is given, then default to "host of"
  BIND (EXISTS { ?interaction ppsq:P26 ?type. } AS ?existsType )
  BIND (IF(?existsType, ?typeLabel, "host of") AS ?interactionTypeName)
  OPTIONAL {
    ?interaction prov:wasDerivedFrom ?refnode.
    # OPTIONAL { ?refnode ppsr:P27 ?doi }
    OPTIONAL {
      ?refnode ppsr:P23 ?statedIn.
      OPTIONAL { ?statedIn ppt:P13 ?referenceDoi. }
      OPTIONAL { ?statedIn ppt:P14 ?referenceCitation. }
      BIND (STR("support") AS ?argumentTypeName)
    }
    OPTIONAL {
      ?refnode ppsr:P43 ?statedIn.
      OPTIONAL { ?statedIn ppt:P13 ?referenceDoi. }
      OPTIONAL { ?statedIn ppt:P14 ?referenceCitation. }
      BIND (STR("refute") AS ?argumentTypeName)
    }
  }
  OPTIONAL {
    ?sourceTaxon ppt:P11 ?sourceTaxon_ncbi. 
    BIND ( CONCAT("NCBI:txid", STR(?sourceTaxon_ncbi)) as ?sourceTaxonId )
  }
  OPTIONAL {
    ?targetTaxon ppt:P11 ?targetTaxon_ncbi. 
    BIND ( CONCAT("NCBI:txid", STR(?targetTaxon_ncbi)) as ?targetTaxonId )
  }
  ?sourceTaxon rdfs:label ?sourceTaxonName .
  OPTIONAL { ?targetTaxon rdfs:label ?targetTaxonName. }
  OPTIONAL { ?sourceTaxon ppt:P2 ?sourceWdmap . }
  OPTIONAL { ?targetTaxon ppt:P2 ?targetWdmap . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
} ORDER BY ?sourceTaxonName ?targetTaxonName

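The page does not say what further processing the output needs. As one illustration only, the Python sketch below takes the query result saved as CSV text, drops rows that lack a target taxon name, and collapses exact duplicate rows. The function name and both cleaning rules are assumptions, not the procedure used in ppsdb-globi-export.

```python
import csv
import io


def clean_export(csv_text):
    """Return de-duplicated rows (as dicts) from a CSV export of the query."""
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = set()
    rows = []
    for row in reader:
        # Assumption: a row without a target taxon name is not usable
        if not row.get("targetTaxonName"):
            continue
        key = tuple(row.items())
        # Keep only the first occurrence of each identical row
        if key not in seen:
            seen.add(key)
            rows.append(row)
    return rows
```

The column names are those of the SELECT clause above, so the same function works whether or not the optional columns are present in the export.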