Update code to add elsevier_api_key and wiley_api_key, changes are reflected in the readme and version 0.3.13
s2010515 committed Jul 10, 2023
1 parent e690644 commit d472308
Showing 10 changed files with 261 additions and 146 deletions.
38 changes: 31 additions & 7 deletions README.md
@@ -12,6 +12,13 @@ You need to git clone the project and install it.

An API key from NCBI (this is used to search PubMed for articles using a search string or list of PubMed IDs, you can find more information [here](https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/)).

**Recommended requirements:**

An API key from Wiley. This key gives you access to the OA publications and to the publications that you or your institution have the right to access from Wiley. You can find more information [here](https://onlinelibrary.wiley.com/library-info/resources/text-and-datamining).

An API key from Elsevier. This key gives you access to the OA publications and to the publications that you or your institution have the right to access from Elsevier. You can find more information [here](https://dev.elsevier.com/).


## Installation
Cadmus depends on a number of other Python packages, so it is recommended to install it in an isolated environment.

@@ -31,18 +38,20 @@ In order to create your corpora you are going to use the function called `bioscr

The function can also receive optional parameters.

1. The "wiley_api_key" parameter allows Wiley to identify which publications you or your institution have the right to access. It also gives you access to OA publications that you would not get without a key. RECOMMENDED
2. The "elsevier_api_key" parameter allows Elsevier to identify which publications you or your institution have the right to access. It also gives you access to OA publications that you would not get without a key. RECOMMENDED
3. The "start" parameter tells the function which service it had reached before the failure (e.g. crossref, doi, PubMed Central API...).
4. The "idx" parameter tells the function the last saved row index (article).

Start and idx are designed to be used when restarting cadmus after a program failure. While Cadmus is running, a repeated output feed appears at the top of the live output. This line shows the stage and index at which your output dataframe was last saved, in case of failure for whatever reason. With these optional parameters, the programme picks up where it left off, saving you from starting the process from the beginning again.

5. The "full_search" parameter lets you check whether a document has become available since the last time you tried. It has three predefined values:

- The default value 'None': the function only looks for new articles since the last run.
- 'light': the function looks for new articles since the last run and retries the rows for which no format was retrieved.
- 'heavy': the function looks for new articles since the last run and retries the rows for which it did not retrieve at least one tagged version (i.e. html or xml) in combination with the pdf format.

6. The "keep_abstract" parameter has the default value 'True' and can be changed to 'False'. When set to 'True', our parsing loads any format from the beginning of the document. When set to 'False', our parsing tries to identify the abstract in any format and starts extracting the text after it. We offer the option of removing the abstract, but we cannot guarantee that our approach is the most reliable way of doing so. If you would like to apply your own parsing method for removing the abstract, feel free to load any file saved during the retrieval, available in the output folder:
```"output/formats/{format}s/{index}.{suffix}.zip"```.
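
As a minimal sketch of loading one of these saved files for your own parsing, the helpers below build the path from the template above and read the archive with the standard library. The function names `saved_file_path` and `read_saved_file` are hypothetical and not part of cadmus, the example values for format and suffix are assumptions, and the code assumes (this is not confirmed by the README) that each archive holds a single file.

```python
import zipfile
from pathlib import Path


def saved_file_path(output_dir, fmt, index, suffix):
    """Build output/formats/{format}s/{index}.{suffix}.zip under output_dir."""
    return Path(output_dir) / "formats" / f"{fmt}s" / f"{index}.{suffix}.zip"


def read_saved_file(zip_path):
    """Return the raw bytes of the first member of the archive.

    Assumption: each archive contains exactly one file.
    """
    with zipfile.ZipFile(zip_path) as archive:
        first_member = archive.namelist()[0]
        return archive.read(first_member)


# Example values only; check your own output folder for the real names.
path = saved_file_path("output", "html", 12, "html")
if path.exists():
    raw_bytes = read_saved_file(path)
```

The bytes are returned undecoded so that you can choose the decoding and parsing that suit the format.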

You need to set the export path before every use so that cadmus is able to retrieve more than 10 000 records from NCBI. For that we offer a function called `display_export_path`. You just need to call this function and copy and paste the result into your terminal before calling `bioscraping`.
@@ -59,12 +68,24 @@ export PATH=${PATH}:YOUR_WORKING_DIRECTORY/output/medline/edirect

After copying and pasting the above export command into your terminal, you can run `bioscraping` as in the following examples:

**Minimum requirements:**
```python
from cadmus import bioscraping
bioscraping(
INPUT, #type str
EMAIL, #type str
NCBI_API_KEY #type str
)
```
**Minimum recommended requirements:**
```python
from cadmus import bioscraping
bioscraping(
INPUT, #type str
EMAIL, #type str
NCBI_API_KEY, #type str
wiley_api_key = YOUR_WILEY_API_KEY, #type str
elsevier_api_key = YOUR_ELSEVIER_API_KEY #type str
)
```
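
If you prefer not to write the keys into your scripts, one option is to read them from environment variables and pass only those that are set. This is a sketch under stated assumptions: the helper `optional_key_kwargs` and the variable names `WILEY_API_KEY`, `ELSEVIER_API_KEY` and `NCBI_API_KEY` are hypothetical and not part of cadmus. Only the keyword names `wiley_api_key` and `elsevier_api_key` come from this README.

```python
import os


def optional_key_kwargs(env=None):
    """Collect the optional publisher keys that are set in the environment.

    The environment variable names are assumptions made for this sketch.
    Unset or empty variables are skipped.
    """
    env = os.environ if env is None else env
    mapping = {
        "wiley_api_key": "WILEY_API_KEY",
        "elsevier_api_key": "ELSEVIER_API_KEY",
    }
    return {kw: env[var] for kw, var in mapping.items() if env.get(var)}


# Usage (requires cadmus; INPUT and EMAIL as in the examples above):
# from cadmus import bioscraping
# bioscraping(INPUT, EMAIL, os.environ["NCBI_API_KEY"], **optional_key_kwargs())
```

Because missing keys are simply left out, the same call works whether you have both publisher keys, one, or none.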

@@ -251,6 +272,9 @@ A: It seems that you are on a shared computer, you need to identify who is the o

## Version

### Version 0.3.13
-> Crossref retired the API key feature that let Elsevier and Wiley identify the author of a publication request, so wiley_api_key and elsevier_api_key have been added as optional input parameters. They are not mandatory, but they greatly increase the retrieval rate because they give access to Wiley and Elsevier publications respectively.

### Version 0.3.12
-> Applied some changes in clean_up_dir.py.

1 change: 1 addition & 0 deletions cadmus/__init__.py
@@ -5,6 +5,7 @@
from cadmus.pre_retrieval.creation_retrieved_df import creation_retrieved_df
from cadmus.pre_retrieval.ncbi_id_converter_batch import ncbi_id_converter_batch
from cadmus.retrieval.HTTP_setup import HTTP_setup
from cadmus.retrieval.HTTP_setup_elsevier import HTTP_setup_elsevier
from cadmus.retrieval.get_request import get_request
from cadmus.retrieval.get_tdm_links import get_tdm_links
from cadmus.pre_retrieval.key_fields import key_fields