Feature request: Indicate papers that cite a Dandiset in the DLP #1897
Quick one: I would feel great if such information (a "Cited by ??" badge leading to a listing) was displayed on the DLP.
This would be great to include on the DLP as yet another way of demonstrating DANDI usage (in addition to the work-in-progress access stats). Also great for reporting purposes, I'd imagine.
Could someone check further whether https://support.datacite.org/docs/consuming-citations-and-references could be the one to go after? All our DOIs are minted by DataCite (through the Dartmouth library subscription). I could not resist, so here is a crude bash script where I used their REST API on the list of our dandisets' most recent published versions (so not all versions per dandiset -- to be tuned!):

```bash
#!/bin/bash
cd /tmp
curl -X 'GET' \
'https://api.dandiarchive.org/api/dandisets/?page_size=1000&draft=false&empty=false&embargoed=false' \
-H 'accept: application/json' | jq . > published_dandisets.json
mkdir citations
jq -r '.results[] | "\(.identifier) \(.most_recent_published_version.version)"' < /tmp/published_dandisets.json \
| while read id version; do
curl --silent https://api.datacite.org/events?doi=10.48324/dandi.$id/$version > citations/$id-$version.json
done
```
Looking at the results which are not empty, we do get some! 000458 is not in the list :-/ but looking inside at the different event types, the interesting one seems to be, e.g., the one that points to https://www.nature.com/articles/s41597-022-01280-y, which is the paper reporting that the data was shared on DANDI. So I think for now we could easily provide a basic "citations gatherer" service to run on cron, e.g. weekly, and produce badges per dandiset. The only question would be how to integrate it with the archive -- I do not think it should modify the metadata record, since that could later be changed by the author(s).
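For concreteness, here is a minimal sketch of what such a weekly gatherer could look like, assuming we simply write shields.io "endpoint"-style badge JSON per dandiset into a static directory (the `BADGE_DIR` layout and the use of shields.io are assumptions for illustration, not an agreed design):

```python
#!/usr/bin/env python3
# Hypothetical weekly "citations gatherer": count DataCite citation events per
# published dandiset and write shields.io endpoint-style badge JSON for each.
import json
import os

import requests

BADGE_DIR = "badges"  # assumed output directory, served as static files
os.makedirs(BADGE_DIR, exist_ok=True)

dandisets = requests.get(
    "https://api.dandiarchive.org/api/dandisets/",
    params={"page_size": 1000, "draft": "false", "empty": "false", "embargoed": "false"},
    headers={"accept": "application/json"},
).json()["results"]

for ds in dandisets:
    identifier = ds["identifier"]
    version = ds["most_recent_published_version"]["version"]
    events = requests.get(
        "https://api.datacite.org/events",
        params={"doi": f"10.48324/dandi.{identifier}/{version}"},
    ).json().get("data", [])
    # drop events whose citing side is itself a dandiset DOI, as in the script above
    count = sum(1 for e in events if "dandi" not in e["attributes"]["subj-id"])
    badge = {"schemaVersion": 1, "label": "cited by", "message": str(count), "color": "blue"}
    with open(os.path.join(BADGE_DIR, f"{identifier}.json"), "w") as f:
        json.dump(badge, f)
```

The DLP (or anything else) could then render the badge via https://img.shields.io/endpoint?url=<location of the per-dandiset JSON>, without the archive ever touching the metadata record.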
Note that this loosely relates to @magland's annotations as well -- we might want to post a banner pointing to the list of annotations for NWBs in the dandiset. It also relates to notebooks etc. -- i.e., how should we build services which provide extra linkages that we do not want to become part of the metadata records?
This is great, @yarikoptic! It looks like this could work well for automatically gathering citation information.
I very much support this idea. This feature would allow us to notify dataset owners when their data is reused, create a data reuse score for researchers like an h-index that can be used in performance evaluations / career advancement, show funders that standards and archives can generate new science and methods, and generally foster a culture of data sharing and reuse.
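Just to make the "reuse score" idea concrete: it could be computed exactly like an h-index, only over per-dataset reuse/citation counts instead of per-paper citations (the function below is purely illustrative):

```python
def reuse_index(citation_counts: list[int]) -> int:
    """h-index-style score: the largest h such that the researcher has h
    datasets that have each been reused/cited at least h times."""
    counts = sorted(citation_counts, reverse=True)
    return sum(1 for rank, c in enumerate(counts, start=1) if c >= rank)

# A researcher whose datasets were cited 9, 4, 2 and 1 times has a reuse index of 2.
assert reuse_index([9, 4, 2, 1]) == 2
```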
I found many such examples when searching for data reuse examples of dandisets (ad hoc listing here). Data are often cited not in the References section but in the Data Availability section, and I think DataCite / CrossRef do not pick those up (editors need to do better at addressing this!). I also found DataCite to be more effective than CrossRef at finding examples. Some general heuristics I used were to search "dandi", "nwb", "dandiarchive.org", "neurophysiology data available" and related terms on Google Scholar. I think LLMs are well-suited to help solve this problem, assuming papers can be scraped from PubMed/bioRxiv/elsewhere (maybe using NeuroQuery?). The LLM could 1) detect that a DANDI dataset has been used and 2) distinguish between primary use, secondary use, and mere referencing (maybe it could give a general score that a human can review afterward). Some related efforts:
Great info @rly -- thanks!!
This is on our roadmap, but I do not believe we have started on this. I briefly looked into the DataCite API, but I didn't get as far as Yarik did. @nellh or @rwblair may have, so pinging them. We have previously tasked @jbwexler with finding reuses and citations. I believe this was mostly scraping search engine results, but he might have thoughts here.
Agreed this would be a great feature to add for both Dandi and ON. I unfortunately don't have too much to add. My approach was basically a semi-automated version of:
The first two steps could of course be easily automated. If we skip the third to avoid the labor cost, that would leave us with a list of "papers that might mention this dataset". That seems potentially useful, but it probably leaves too much room for error for something akin to an h-index. I like the LLM idea for doing step 3; it would be fun to try to get that working.
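As a rough sketch of the automatable first step (finding papers whose full text mentions a dandiset), a regex over scraped text could flag candidates, leaving step 3 -- primary use vs. secondary use vs. mere mention -- to an LLM or a human reviewer. The patterns below are my guess at how dandisets tend to be referenced, not an exhaustive list:

```python
import re

# Common ways a dandiset shows up in full text (assumed, not exhaustive):
# a DataCite DOI, a dandiarchive.org URL, or a "DANDI:000xyz" style identifier.
DANDISET_PATTERN = re.compile(
    r"(?:10\.48324/dandi\.(\d{6})"          # DOI, e.g. 10.48324/dandi.000055/0.220127.0436
    r"|dandiarchive\.org/dandiset/(\d{6})"  # landing-page URL
    r"|DANDI[: ](\d{6}))",                  # plain identifier, e.g. DANDI:000055
    re.IGNORECASE,
)

def find_dandiset_mentions(text: str) -> set[str]:
    """Return the set of six-digit dandiset IDs mentioned in a paper's text."""
    return {g for m in DANDISET_PATTERN.finditer(text) for g in m.groups() if g}

print(find_dandiset_mentions(
    "Data are available at https://dandiarchive.org/dandiset/000055 (DANDI:000055)."
))  # -> {'000055'}
```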
+1 for working with DataCite. There is emerging work happening related to the Global Data Citation Corpus that could be helpful here, and your use case for Dandisets might be an interesting test of what they already have. Because only citations that appear in the References section of articles are counted in the shared Crossref/DataCite events database, they are working with CZI on applying AI (named entity recognition) to the scholarly record (PubMed?) to pull out unstructured data mentions into something that makes sense.
I suppose I can take up the baton here. In Python, for mere mortals:

```python
import os
import requests
import json
from tqdm import tqdm
# get all published dandiset IDs
dandi_api_url = "https://api.dandiarchive.org/api/dandisets/"
params = {
"page_size": 1000,
"empty": "false",
"draft": "false",
"embargoed": "false",
}
headers = {"accept": "application/json"}
# Fetch the list of published dandisets
response = requests.get(dandi_api_url, headers=headers, params=params)
response.raise_for_status() # Check for HTTP request errors
published_dandisets = response.json()
published_dandisets_ids = [x["identifier"] for x in published_dandisets["results"]]
# get all versions of each published dandiset
all_versions = {}
for id_ in tqdm(published_dandisets_ids, desc="get dandiset versions"):
dandi_api_url = f"https://api.dandiarchive.org/api/dandisets/{id_}/versions"
params = {"page_size": 1000}
headers = {"accept": "application/json"}
response = requests.get(dandi_api_url, headers=headers, params=params)
versions = response.json()
all_versions[id_] = [x["version"] for x in versions["results"] if x["version"] != "draft"]
# Iterate over each version of each dandiset and fetch citation data from DataCite
from collections import defaultdict
from dateutil import parser
results = []
# iterate over versions of dandisets and get citations
for identifier, versions in tqdm(all_versions.items(), desc="get citations"):
for version in versions:
datacite_url = f"https://api.datacite.org/events?doi=10.48324/dandi.{identifier}/{version}"
citation_response = requests.get(datacite_url)
citation_response.raise_for_status()
citation_data = citation_response.json()
for x in citation_data["data"]:
if "dandi" in x["attributes"]["subj-id"]:
continue # exclude citations from other dandisets
results.append(
dict(
dandiset_id=identifier,
doi=x["attributes"]["subj-id"],
timestamp=parser.parse(x["attributes"]["timestamp"]),
)
)
import pandas as pd
df = pd.DataFrame(results)
df
```
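Continuing from that dataframe, per-dandiset counts -- essentially the number a "Cited by" badge would display -- are one groupby away (deduplicating first, since I believe the events feed can report the same citing DOI under more than one relation type):

```python
# continues from the script above: count unique citing DOIs per dandiset
citation_counts = (
    df.drop_duplicates(subset=["dandiset_id", "doi"])
      .groupby("dandiset_id")
      .size()
      .sort_values(ascending=False)
)
print(citation_counts.head())
```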
This is great! Though many people don't actually cite the DOI, which is why @jbwexler had to resort to the manual approach.
Yes, I also see a lot of references to unpublished dandisets that don't have DOIs, so they don't show up here. Still, it's nice to get what we can from the fully automated approach. This might work better for ON.
Definitely!
A systematic (looking "forward") solution IMHO would be to provide DOIs for draft dandisets too. Related: |
FWIW, in a chat with ChatGPT about a different but very similar need, it pointed me to https://opencitations.net/ and their API. Here is the script it gave me to feed a DOI and get what references it:

```python
#!/usr/bin/env python3
import requests
import os
import sys
import json
from platformdirs import user_cache_dir
import bibtexparser
CACHE_DIR = os.path.join(user_cache_dir("citing_works"))
os.makedirs(CACHE_DIR, exist_ok=True)
def fetch_and_cache_doi_info(doi):
"""
Fetches the citation record for a DOI and caches the result.
"""
cache_file = os.path.join(CACHE_DIR, f"{doi.replace('/', '_')}.json")
if os.path.exists(cache_file):
with open(cache_file, "r") as f:
return json.load(f)
bibtex_url = f"https://doi.org/{doi}"
headers = {"Accept": "application/x-bibtex"}
try:
response = requests.get(bibtex_url, headers=headers)
response.raise_for_status()
bibtex_entry = response.text
# Parse BibTeX
bib_database = bibtexparser.loads(bibtex_entry)
if bib_database.entries:
entry = bib_database.entries[0]
result = {
"title": entry.get("title", "No Title"),
"year": entry.get("year", "No Year"),
"doi": doi,
}
with open(cache_file, "w") as f:
json.dump(result, f)
return result
else:
print(f"No valid BibTeX found for DOI: {doi}")
return {"title": "No Title", "year": "No Year", "doi": doi}
except requests.exceptions.RequestException as e:
print(f"Error fetching citation data for DOI {doi}: {e}")
return {"title": "No Title", "year": "No Year", "doi": doi}
def get_citing_works(doi):
"""
Fetches works citing the given DOI using the OpenCitations API.
"""
base_url = "https://opencitations.net/index/coci/api/v1/citations/"
headers = {"Accept": "application/json"}
try:
response = requests.get(f"{base_url}{doi}", headers=headers)
response.raise_for_status()
data = response.json()
if data:
return data
else:
print("No citing works found for this DOI.")
return []
except requests.exceptions.RequestException as e:
print(f"Error fetching citing works: {e}")
return []
def main():
"""
Main entry point for the script.
"""
if len(sys.argv) != 2:
print("Usage: python get_citing_works.py <DOI>")
sys.exit(1)
doi = sys.argv[1].strip()
if doi.startswith("https://doi.org/"):
doi = doi.replace("https://doi.org/", "")
print(f"Fetching works citing DOI: {doi}")
citing_works = get_citing_works(doi)
if citing_works:
print(f"\nFound {len(citing_works)} citing works:\n")
for i, work in enumerate(citing_works, start=1):
citing_doi = work.get("citing", "No DOI")
if citing_doi != "No DOI":
record = fetch_and_cache_doi_info(citing_doi)
title = record.get("title", "No Title")
year = record.get("year", "No Year")
print(f"{i}. Title: {title}, Year: {year}, DOI: {citing_doi}")
else:
print("No citing works found.")
if __name__ == "__main__":
    main()
```

Paired with that ugly bash script:

```bash
#!/bin/bash
# cd /tmp
curl -X 'GET' \
'https://api.dandiarchive.org/api/dandisets/?page_size=1000&draft=false&empty=false&embargoed=false' \
-H 'accept: application/json' | jq . > published_dandisets.json
mkdir citations
jq -r '.results[] | "\(.identifier) \(.most_recent_published_version.version)"' < /tmp/published_dandisets.json \
| while read id version; do
works=$(python citeref_publications2.py 10.48324/dandi.$id/$version )
if ! echo "$works" | grep -q "No citing" ; then
echo "$works"
fi
done
```
This gives us the following listing:

```
Fetching works citing DOI: 10.48324/dandi.000055/0.220127.0436
Found 1 citing works:
1. Title: AJILE12: Long-term naturalistic human intracranial neural recordings and pose, Year: 2022, DOI: 10.1038/s41597-022-01280-y
Fetching works citing DOI: 10.48324/dandi.000140/0.220113.0408
Found 1 citing works:
1. Title: A spiking neural network with continuous local learning for robust online brain machine interface, Year: 2023, DOI: 10.1088/1741-2552/ad1787
Fetching works citing DOI: 10.48324/dandi.000301/0.230806.0034
Found 1 citing works:
1. Title: Neural mechanisms for the localization of unexpected external motion, Year: 2023, DOI: 10.1038/s41467-023-41755-z
Fetching works citing DOI: 10.48324/dandi.000488/0.230602.2022
Found 1 citing works:
1. Title: Differential encoding of temporal context and expectation under representational drift across hierarchically connected areas, Year: 2023, DOI: 10.1101/2023.06.02.543483
```

So it seems to find significantly less than what we get from DataCite.
@jgrethe shared a pointer to their scripts for SPARC, https://github.com/SciCrunch/SPARC-Citations, which were used to produce the quite comprehensive https://github.com/SciCrunch/SPARC-Citations/blob/main/dataset_data_citations.tsv.
I found a paper (https://doi.org/10.1016/j.neuron.2023.08.005) that cites Dandiset 000458 (https://doi.org/10.48324/dandi.000458/0.230317.0039). When I went to the Dandiset landing page, I found that there are some papers associated with this Dandiset, but not the paper that I found. This is because the paper is a secondary use of this Dandiset and did not exist when the Dandiset was published.
I think we are missing a huge opportunity here. If we want to influence the behavior of scientists to reuse data, one of the best ways to do that is to educate them about others that are already doing this behavior. In doing so, we will establish that this is a high-quality dataset worth analyzing, demonstrate that you can achieve publications through reuse of data, and advance social norms around using data. All the better if the publications are from high-impact journals like Neuron. Therefore, I think in some way indicating papers that use and cite a Dandiset should be a high priority. While GitHub-like stars, page views, and download stats are all very important, IMO this metric is even more important than all of those.
I think this should really go on the DLP, and it should not be under the control of the Dandiset owner. Ideally, this would follow UX patterns that users are already familiar with. For example, every scientist is familiar with the Google Scholar "Cited by [x]" link:
I think the most straightforward UX solution would be to add a button here:
that says "Cited by [#]". Then that button would lead to a modal window that contains a list of papers that cite this Dandiset, formatted similarly to how this is done in Google Scholar:
This may not be ideal because it does not make the citation metrics as prominent as I would like, but it would be a massive improvement over not having this metric on the DLP at all.
Then the question is: how do we gather this information? It looks like this can be done with Crossref (https://www.crossref.org/documentation/cited-by/retrieve-citations/), which would require credentials, and I don't know whether Crossref even tracks usage of DANDI DOIs.
OpenCitations provides a service for this that works on Science papers, e.g.
http://opencitations.net/index/coci/api/v1/citations/10.1126/science.abf4588
but not on Dandisets: http://opencitations.net/index/coci/api/v1/citations/10.48324/dandi.000458/0.230317.0039
returns an empty list. It is possible the citations have just not been indexed yet. This is hard to test because a lot of publications, like https://www.nature.com/articles/s41586-023-06031-6, do not properly cite the Dandiset DOI. This is another issue: we might want to be able to manually add citation information for cases like this, where high-profile papers use Dandisets but do not cite them in a way that our system will be able to detect.

Once we have the DOIs of the citing papers, I can confirm that Crossref is a great tool for gathering information about a specific publication. https://api.crossref.org/works/{doi} returns all the information we would need, e.g.
https://api.crossref.org/works/10.1126/science.abf4588
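For example, a small helper could pull out just the fields a "Cited by" list on the DLP would need (the Crossref field names are real, but which fields to show is my assumption):

```python
import requests

def crossref_summary(doi: str) -> dict:
    """Fetch a citing paper's metadata from the Crossref works API."""
    msg = requests.get(f"https://api.crossref.org/works/{doi}").json()["message"]
    return {
        "title": (msg.get("title") or [""])[0],
        "journal": (msg.get("container-title") or [""])[0],
        "year": msg.get("issued", {}).get("date-parts", [[None]])[0][0],
        "authors": [
            f"{a.get('given', '')} {a.get('family', '')}".strip()
            for a in msg.get("author", [])
        ],
    }

print(crossref_summary("10.1126/science.abf4588"))
```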
Beyond putting this on the DLP, this is a very important metric for us to track. Looking at publications over the last year or so, I am seeing examples of high-profile papers that use Dandisets that we don't even know about, and this is quickly getting to the point where we need automated tools to track it.