Journal article data collected for ContentMine

Richard Smith-Unna edited this page Jan 11, 2015 · 1 revision

Every journal scraper in the collection targets the same data. A scraper should collect as many of the elements listed below as possible. The identifiers in the list (for example, journal_name) are the keywords to use as element names in the scraper definition.

Metadata

  • publisher - the name of the publisher
  • journal:
    • journal_name
    • journal_issn
    • volume
    • issue
    • firstpage
  • title
  • keywords - either a single string containing all the keywords, or each keyword captured separately
  • authors:
    • author_name
    • author_institution
    • author_givenName
    • author_familyName
    • author_orcid
  • date:
    • date_published
    • date_accepted
    • date_submitted
  • identifiers:
    • doi
    • pmid - PubMed ID
  • license
  • copyright
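
As a rough illustration of how these keywords might appear as element names, the Python sketch below builds a small scraper definition and checks its element names against the metadata list above. The definition layout (a url pattern plus an elements map with selector and attribute fields) is an assumption about the scraper format, not something this page specifies. The URL pattern, the XPath selectors, and the helper names unknown_elements and missing_elements are hypothetical.

```python
import json

# Metadata keywords copied from the list above. Grouping labels such as
# "journal" and "authors" are left out (assumed to be headings, not elements).
METADATA_KEYWORDS = {
    "publisher", "journal_name", "journal_issn", "volume", "issue",
    "firstpage", "title", "keywords", "author_name", "author_institution",
    "author_givenName", "author_familyName", "author_orcid",
    "date_published", "date_accepted", "date_submitted",
    "doi", "pmid", "license", "copyright",
}

# Hypothetical definition: the URL pattern and selectors are illustrative
# only and are not taken from any real scraper.
definition = {
    "url": "example\\.org",
    "elements": {
        "title": {
            "selector": "//meta[@name='citation_title']",
            "attribute": "content",
        },
        "doi": {
            "selector": "//meta[@name='citation_doi']",
            "attribute": "content",
        },
        "journal_name": {
            "selector": "//meta[@name='citation_journal_title']",
            "attribute": "content",
        },
        "date_published": {
            "selector": "//meta[@name='citation_date']",
            "attribute": "content",
        },
    },
}


def unknown_elements(defn, allowed):
    """Return element names in the definition that are not known keywords."""
    return sorted(set(defn["elements"]) - allowed)


def missing_elements(defn, allowed):
    """Return known keywords that the definition does not yet collect."""
    return sorted(allowed - set(defn["elements"]))


print(json.dumps(definition, indent=2))
print("unknown:", unknown_elements(definition, METADATA_KEYWORDS))
print("not yet collected:", missing_elements(definition, METADATA_KEYWORDS))
```

A check like this catches misspelled element names early and shows which of the target elements a scraper still lacks.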

Content

  • links:
    • fulltext_html
    • fulltext_pdf
    • fulltext_xml
    • supplementary_file
  • sections - generally captured as HTML, text, or both. HTML versions should use the html attribute, while text versions should use the text attribute.
    • abstract:
      • abstract_html
      • abstract_text
    • introduction:
      • introduction_html
      • introduction_text
    • methods:
      • methods_html
      • methods_text
    • results:
      • results_html
      • results_text
    • discussion:
      • discussion_html
      • discussion_text
    • conclusions:
      • conclusion_html
      • conclusion_text
    • author contributions:
      • author_contrib_html
      • author_contrib_text
    • competing interests:
      • competing_interests_html
      • competing_interests_text
    • figures - currently only captured as HTML and image file download
      • figures_html
      • figures_image - a download of the image file, with no renaming
    • tables - currently only captured as HTML
      • tables_html
    • references - currently only captured as HTML
      • references_html
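
The naming convention for sections (a prefix followed by _html or _text, paired with the html or text attribute) could be sketched in Python as follows. The helper section_elements, the selector and attribute field names, and the XPath selectors are assumptions made for illustration. Only the element names and the html/text attribute rule come from the list above.

```python
def section_elements(prefix, selector, html_only=False):
    """Build element entries for one section.

    The HTML version uses the "html" attribute and the text version uses
    the "text" attribute, as described in the list above. Set html_only
    for elements that are currently captured only as HTML.
    """
    elements = {prefix + "_html": {"selector": selector, "attribute": "html"}}
    if not html_only:
        elements[prefix + "_text"] = {"selector": selector, "attribute": "text"}
    return elements


# Hypothetical selectors; a real scraper would use XPath expressions that
# match the markup of the journal being scraped.
elements = {}
elements.update(section_elements("abstract", "//div[@class='abstract']"))
elements.update(section_elements("methods", "//div[@id='methods']"))
elements.update(
    section_elements("references", "//ol[@class='references']", html_only=True)
)

for name in sorted(elements):
    print(name, "->", elements[name]["attribute"])
```

Note that the prefixes follow the element names exactly as listed, so the conclusions section uses conclusion_html and conclusion_text, and author contributions use the author_contrib prefix.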