This tool takes a JSON list of papers that are indexed on both Wikidata and PubMed Central, processes them into HTML, and performs simple dictionary-based text mining; the results are then pushed to a Wikibase instance.
This tool requires that you have xsltproc installed so that it can process the papers' JATS XML.
Run

```sh
./bin/ScienceSourceIngest -help
```

to see the command options and details. There are typically six options you need to provide (explained in more detail below):
- -feed [file path] - a JSON file containing the list of papers as fetched from Wikidata
- -output [directory path] - a directory where the tool will store its working state
- -urlbase [http(s)://wikibase.server.name] - the protocol and hostname of your Wikibase server
- -oauth [file path] - a JSON file containing the Consumer and Access information for your Wikibase
- -dictionaries [directory path] - a directory containing the dictionaries of words to be annotated
- -xsltproc [file path] - the location of the xsltproc tool; defaults to "/usr/bin/xsltproc"
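
Putting these together, a typical invocation looks something like the following (the feed, output, OAuth, and dictionary paths here are placeholders; substitute your own):

```sh
./bin/ScienceSourceIngest \
    -feed feed.json \
    -output ./state \
    -urlbase https://sciencesource.wmflabs.org \
    -oauth oauth.json \
    -dictionaries ./dictionaries \
    -xsltproc /usr/bin/xsltproc
```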
There is an example feed of papers in the repository; please refer to that as a full example. It was generated from Wikidata using a SPARQL query of the form:
```sparql
SELECT DISTINCT ?item ?itemLabel ?pmcid ?journalLabel ?title ?date ?licenseLabel ?mainsubjectLabel
WHERE
{
  ?item wdt:P31 wd:Q13442814 .
  ?item wdt:P5008 wd:Q55439927 .
  ?item wdt:P932 ?pmcid .
  ?item wdt:P1433 wd:Q3359737 .
  ?item wdt:P1433 ?journal .
  ?item wdt:P1476 ?title .
  ?item wdt:P577 ?date .
  ?item wdt:P275 ?license .
  ?item wdt:P921 ?mainsubject .
  ?mainsubject wdt:P361* wd:Q18123741 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
LIMIT 100
```
This should be exported from Wikidata as a JSON feed.
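
The simplest route is the JSON download option on query.wikidata.org, but as a sketch, the same results can also be fetched from the public Wikidata SPARQL endpoint on the command line (this assumes the query above is saved as query.rq, and feed.json is a placeholder name; check that the resulting JSON matches the shape of the query service's export before using it):

```sh
# Run the saved SPARQL query against the public endpoint and save the JSON results.
curl -G 'https://query.wikidata.org/sparql' \
    -H 'Accept: application/sparql-results+json' \
    --data-urlencode query@query.rq \
    -o feed.json
```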
The output directory is where ScienceSourceIngest will store its state. It is recommended that you use the same output directory for multiple runs against the same Wikibase target server, but a different output directory for each target server.
The URL base should be the prefix of the URL for the Wikibase instance you're uploading to. E.g., to upload to Science Source you'd set it to https://sciencesource.wmflabs.org, for a local test instance to http://localhost:8181, and so forth.
For ScienceSourceIngest to talk to your Wikibase target server you need to authenticate with it. On the target Wikibase server, navigate to /wiki/Special:OAuthConsumerRegistration/propose and pick the following options:
- Application name - set to something you'll remember for this use
- Consumer version - doesn't matter, so set it to v1.0 or such
- Application description - set to something you'll remember
- This consumer is for use only by [USERNAME] - turn this option on
The rest can remain at their defaults. That last option is the most important, so please ensure you select it; if you do not, Wikibase will require you to authorise the client via a web interface, which will not work for this tool.
For Applicable grants select the following:
- Basic rights
- High-volume editing
- Edit existing pages
- Edit protected pages
- Create, edit, and move pages
- Protect and unprotect pages
When you click done, you will be shown a page with the following four strings on it:
- Consumer Token
- Consumer Secret
- Access Token
- Access Secret
If you don't see these you probably forgot to tick "This consumer is for use only by [USERNAME]". You should take these values and put them in a JSON file like so:
```json
{
    "consumer": {
        "key": "633d4025c53c4179ba7260a801a6aee3",
        "secret": "9b4403abf1e9e7fbb208866081df0e5f5770d322"
    },
    "access": {
        "token": "9f5b9e2e20b655299c699e5172dd9ba1",
        "secret": "3c31dd374458e2ce34a4b5163f3ae231cb304d3c"
    }
}
```
You then pass this file to the -oauth option when you start ScienceSourceIngest.
The annotations that ScienceSourceIngest finds in the papers are based on the dictionaries supplied via the -dictionaries option. There are sample dictionaries in the project's dictionaries folder.
The program requires that the three XSL files (jats-html.xsl, jats-parsoid.xsl, and jats-common.xsl) are in the current working directory.
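
If you are running the tool from outside the source tree, you can copy the stylesheets into the working directory first; a minimal sketch, where /path/to/source is a placeholder for wherever your copies of the jats-*.xsl files live:

```sh
# Placeholder path: point this at wherever your jats-*.xsl files actually live.
cp /path/to/source/jats-html.xsl \
   /path/to/source/jats-parsoid.xsl \
   /path/to/source/jats-common.xsl .
```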
Please note that uploading data in bulk can be slow: annotations require a lot of items to be created and properties to be set in the Wikibase instance, and each call takes around a second to complete on a remote server, so a single paper can take a minute or so to upload fully.
If you re-run the program with the same input feed and output directory, it should safely resume the upload from where it left off and will not re-upload anything it has already uploaded.
The ingest process relies on certain Items and Properties being defined in your Wikibase instance. Note that the labels are important, as the tool looks these up by label rather than using hard-coded Item or Property IDs, which will invariably change between servers (e.g., test, staging, production). Label lookup is case sensitive, and labels must be unique for their kind on the server.
If the item and property definitions below are not found on the server, the tool will create them automatically.
| Label | Example instance |
|---|---|
| anchor point | https://sciencesource.wmflabs.org/wiki/Item:Q2 |
| article | https://sciencesource.wmflabs.org/wiki/Item:Q4 |
| annotation | https://sciencesource.wmflabs.org/wiki/Item:Q5 |
| terminus | https://sciencesource.wmflabs.org/wiki/Item:Q6 |
ScienceSourceIngest is written in Go and built with Make. You also need to set GOPATH to the directory of the project; if you're in the root directory of this source tree then you can simply type:

```sh
export GOPATH=$PWD
```

You will need to fetch the libraries that this depends on, which you can do with:

```sh
make get
```

Then you should be able to just run:

```sh
make
```

The tool will be built and put into the $GOPATH/bin directory.
This software is copyright Content Mine Ltd 2018, and released under the Apache 2.0 License.
Relies on