Skip to content

Latest commit

 

History

History
147 lines (115 loc) · 5.67 KB

README.md

File metadata and controls

147 lines (115 loc) · 5.67 KB

Scopus Harvester

A set of functions to call Scopus Serial Title Metadata API and harvest following serial title's attributes:

Scopus Serial Title's attributes
Journal Title eISSN
Journal ID Scopus Subject Area
SJR Scopus Subject Area Code
Citescore Scopus Subject Classification
ISSN Open Access

Installation

For the installation, you need to have a git on your system.

pip install git+https://github.com/rbrtjwrk/scopus_harvester.git

Usage

Although it is possible to call standalone functions separately, I recommend you to call the function scopus_journals(subject_abbrev=None, subject_code=None, count=None, start=0) to obtain all of the attributes at once.

To see all Scopus Subject Areas, call function scopus_subject_areas().

>>> import scopus_harvester as sh
>>> 
>>> sh.scopus_subject_areas().head()
  Subject_Area                        Subject_Area: Full_Name
0         AGRI           Agricultural and Biological Sciences
1         ARTS                            Arts and Humanities
2         BIOC   Biochemistry, Genetics and Molecular Biology
3         BUSI            Business, Management and Accounting
4         CENG                           Chemical Engineering
>>>

To see all Scopus Subject Area Codes, call function scopus_subject_area_codes().

>>> import scopus_harvester as sh
>>>
>>> sh.scopus_subject_area_codes().head()
     Subject_Area_Code                             Subject_Classification
0                 1000                                  Multidisciplinary
1                 1100         Agricultural and Biological Sciences (all)
2                 1101   Agricultural and Biological Sciences (miscell...
3                 1102                          Agronomy and Crop Science
4                 1103                         Animal Science and Zoology
>>>

Before harvesting, you must first manually set up your API Key in the file scopus_get_journals.py.

Then call the function scopus_journals(subject_abbrev=None, subject_code=None, count=None, start=0).
Parameters:

  • subject_abbrev: str, default None; you could either leave this parameter unspecified or select exactly one subject area.
  • subject_code: int, defalut None; you could either leave this parameter unspecified or select exactly one subject area code.
  • count: int, default None; count cannot be lower than 1.
  • start: int, default 0.
>>> df=sh.scopus_journals(subject_abbrev="ARTS", count=3, start=0)
>>>
>>> df
                           Journal_Title   Journal_ID       ISSN  ...  Subject_Area_Code                                 Subject_Classification  Open_Access
0                     21st Century Music  18500162600  1534-3219  ...  [1210]              [Music]                                               None
1  3L: Language, Linguistics, Literature  19700200922  0128-5157  ...  [1203, 3310, 1208]  [Language and Linguistics, Linguistics and Language]  1
2                                   452F  21101005201             ...  [1208]              [Literature and Literary Theory]                      1
[3 rows x 8 columns]
>>>

If you want to harvest all Scopus indexed serial titles at once, you may encounter API limits, therefore it is advisable to download the data in batches. E.g. harvest data in batches per subject area/subject area code:

>>> import pandas as pd
>>> import scopus_harvester as sh
>>>
>>> def get_entries(subject_area):
...    output=[]
...    s=0
...    for _ in range(1000):
...        try:
...            r=sh.scopus_journals(subject_abbrev=subject_area, count=200, start=s)
...            ooutput.append(r)
...            s+=200
...        # if there are no more journals in a given subject area
...        except KeyError: 
...            return output
>>>
>>> def flatten_dfs(list_of_dfs):
...    ooutput=pd.DataFrame()
...    for _ in list_of_dfs:
...        output=output.append(_)
...    return output
>>>
>>> subject_areas=sh.scopus_subject_areas()
>>>
>>> res=pd.DataFrame()
>>>
>>> for sa in subject_areas.Subject_Area:
...    print(sa)
...    entries=get_entries(sa)
...    print("--- entries downloaded")
...    flattened_entries=flatten_dfs(entries)
...    print(f"--- {sa}: {len(flattened_entries)}")
...    print("--- entries flattend")
...    res=res.append(flattened_entries)
...    print("--- entries appended")
...    print("")
>>>

It is also possible to compute SJR rank per subject area code per each serial title. To do that, call the function sjr_rank_per_subject_area_code(dataframe).

>>> df=sh.sjr_rank_per_subject_area_code(df)
>>>
>>> df.iloc[:, [0,6,8]]
                           Journal_Title Subject_Area_Code  SJR_Rank_per_Subject_Area_Code
0                     21st Century Music              1210                             1.0
1  3L: Language, Linguistics, Literature              1203                             1.0
2  3L: Language, Linguistics, Literature              3310                             1.0
3  3L: Language, Linguistics, Literature              1208                             1.0
4                                   452F              1208                             NaN
>>>

Note that the SJR rank per subject area code is computed only on the data you harvest. Also, for serial titles that do not have a SJR, the rank is not computed.

License

MIT, see.