Add curation guide for publications and improve testing
cthoyt committed Oct 16, 2024
1 parent 5ef643d commit 8d07ce2
Showing 3 changed files with 87 additions and 5 deletions.
5 changes: 3 additions & 2 deletions docs/guides/README.md
@@ -2,7 +2,8 @@

This folder contains various task-specific curation guides.

-- [Curating new providers](/curation/providers)
+- [Curating new providers](curation/providers)
+- [Curating new publications and references](curation/publications)

## How to add new guides

@@ -11,7 +12,7 @@ This folder contains various task-specific curation guides.
(see
[here](https://github.com/biopragmatics/bioregistry/blob/fe2a685503ae2c9ff863908bf885c71fd240c21d/docs/guides/providers.md?plain=1#L1-L5)
for an example)
-3. Add it to the list above
+3. Add it to the list above. Don't include a forward slash `/` at the beginning of the link!

## What makes a good guide

76 changes: 76 additions & 0 deletions docs/guides/publications.md
@@ -0,0 +1,76 @@
---
layout: page
title: Curating Publications and References
permalink: /curation/publications
---

The example below shows a subset of the record for
[3D Metabolites (3dmet)](https://bioregistry.io/3dmet) that highlights the `publications` list.
Note that each entry is a dictionary with several parts:

1. `title` (required) - the title of the paper
2. `year` (highly recommended) - the year of publication of the paper
3. `pubmed`, `doi`, and `pmc` (one or more required) - identifiers for the paper

```json
"3dmet": {
  "name": "3D Metabolites",
  "publications": [
    {
      "doi": "10.2142/biophysico.15.0_87",
      "pmc": "PMC5992871",
      "pubmed": "29892514",
      "title": "Chemical curation to improve data accuracy: recent development of the 3DMET database",
      "year": 2018
    },
    {
      "doi": "10.1021/ci300309k",
      "pubmed": "23293959",
      "title": "Three-dimensional structure database of natural metabolites (3DMET): a novel database of curated 3D structures",
      "year": 2013
    }
  ]
},
```
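
The field rules above can be checked mechanically. The following is a minimal sketch, assuming a publication entry is a plain dictionary shaped like the JSON above. `check_publication` and `IDENTIFIER_KEYS` are hypothetical names for illustration and are not part of the Bioregistry API.

```python
from typing import Any

# Identifier keys named in this guide; at least one must be present.
IDENTIFIER_KEYS = ("pubmed", "doi", "pmc")


def check_publication(entry: dict[str, Any]) -> list[str]:
    """Return a list of problems found in a publication entry (hypothetical helper)."""
    problems = []
    if not entry.get("title"):
        problems.append("missing required `title`")
    if not any(entry.get(key) for key in IDENTIFIER_KEYS):
        problems.append("needs at least one of `pubmed`, `doi`, or `pmc`")
    if "year" not in entry:
        # `year` is only highly recommended, so it is reported as a warning.
        problems.append("warning: `year` is highly recommended")
    return problems


entry = {
    "doi": "10.1021/ci300309k",
    "pubmed": "23293959",
    "title": "Three-dimensional structure database of natural metabolites (3DMET): a novel database of curated 3D structures",
    "year": 2013,
}
print(check_publication(entry))  # prints []
```

An entry that passes returns an empty list, so the helper can be dropped into a loop over every record without special-casing.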

Some URLs worth curating are not _publications_. These can be stored in the `references` list. For example, the
[Registry of Toxic Effects of Chemical Substances (rtecs)](https://bioregistry.io/rtecs) entry appears in the
Bioregistry because of its usage, but information about it is hard to find online. The `references` list is
therefore a good place to store links to PDFs and webpages that describe the resource.

```json
"rtecs": {
  "name": "Registry of Toxic Effects of Chemical Substances",
  "publications": [
    {
      "doi": "10.1016/s1074-9098%2899%2900058-1",
      "title": "An overview of the Registry of Toxic Effects of Chemical Substances (RTECS): Critical information on chemical hazards",
      "year": 1999
    }
  ],
  "references": [
    "https://www.cdc.gov/niosh/docs/97-119/pdfs/97-119.pdf",
    "https://www.cdc.gov/niosh/npg/npgdrtec.html"
  ]
}
```

Other things worth tracking in the `references` list:

1. Bioregistry issues or pull requests about the resource
2. Links to webpages describing the identifier resource
3. Links to discussions on Slack or other platforms (keeping in mind links might not last forever)
4. Any other context that's useful for a Bioregistry reader
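
The test suite in this repository rejects entries in `references` that look like paper identifiers, since those belong in `publications`. A rough sketch of the same substring check, assuming references are plain URL strings. `misplaced_references` is a hypothetical name, not a Bioregistry function.

```python
# Substrings the repository test looks for in each reference URL.
PUBLICATION_MARKERS = ("doi", "pubmed", "pmc", "arxiv")


def misplaced_references(references: list[str]) -> list[str]:
    """Return references that look like publications (hypothetical helper)."""
    return [
        reference
        for reference in references
        if any(marker in reference for marker in PUBLICATION_MARKERS)
    ]


references = [
    "https://www.cdc.gov/niosh/docs/97-119/pdfs/97-119.pdf",
    "https://www.cdc.gov/niosh/npg/npgdrtec.html",
    "https://doi.org/10.1021/ci300309k",
]
print(misplaced_references(references))
```

Because this is a plain substring match, any URL that happens to contain one of the markers is flagged, so a flagged link is a prompt to review rather than proof that it is a paper.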

## Why Should I Curate Publications and References?

1. They give additional context for Bioregistry readers who want to know more about the paper
2. They make it easier to attribute usage of identifiers from a given resource to its authors
3. They enable global landscape analysis of when and where identifier resources are being made. The following image is
automatically regenerated with each Bioregistry update:

![](https://raw.githubusercontent.com/biopragmatics/bioregistry/refs/heads/main/docs/img/bibliography_years.svg)
4. They support the training of a machine learning model for semi-automated curation of additional literature. See
this [talk](https://docs.google.com/presentation/d/1h2IajyGkUxUPHubEi8_WE6xW6TOuOihn5zsmi4kYrrc/edit?usp=sharing)
from the 2022 Workshop on Prefixes, CURIEs, and IRIs.
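
As a rough illustration of the landscape analysis in point 3, publication years can be tallied across records. This sketch assumes the registry is available as a dictionary shaped like the JSON examples in this guide. The `registry` literal and `count_publication_years` are illustrative stand-ins, not the code that generates the figure.

```python
from collections import Counter


def count_publication_years(registry: dict) -> Counter:
    """Count publications per year across all records (hypothetical helper)."""
    years = Counter()
    for record in registry.values():
        for publication in record.get("publications", []):
            year = publication.get("year")
            if year is not None:
                years[year] += 1
    return years


# Tiny stand-in registry using the years from the examples in this guide.
registry = {
    "3dmet": {"publications": [{"year": 2018}, {"year": 2013}]},
    "rtecs": {"publications": [{"year": 1999}]},
}
for year, count in sorted(count_publication_years(registry).items()):
    print(year, count)
```

Entries without a `year` are skipped, which is one reason the field is highly recommended.
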
11 changes: 8 additions & 3 deletions tests/test_data.py
@@ -854,13 +854,18 @@ def test_request_issue(self):

     def test_publications(self):
         """Test references and publications are sorted right."""
+        msg_fmt = (
+            "Rather than writing a {} link in the `references` list, "
+            "you should encode it in the `publications` instead. "
+            "See https://biopragmatics.github.io/bioregistry/curation/publications for help."
+        )
         for prefix, resource in self.registry.items():
             with self.subTest(prefix=prefix):
                 if resource.references:
                     for reference in resource.references:
-                        self.assertNotIn("doi", reference)
-                        self.assertNotIn("pubmed", reference)
-                        self.assertNotIn("pmc", reference)
+                        self.assertNotIn("doi", reference, msg=msg_fmt.format("DOI"))
+                        self.assertNotIn("pubmed", reference, msg=msg_fmt.format("PubMed"))
+                        self.assertNotIn("pmc", reference, msg_fmt.format("PMC"))
                         self.assertNotIn("arxiv", reference)
                 if resource.publications:
                     for publication in resource.publications:
