A set of functions that call the Scopus Serial Title Metadata API and harvest the following serial title attributes:
Scopus Serial Title's attributes | |
---|---|
Journal Title | eISSN |
Journal ID | Scopus Subject Area |
SJR | Scopus Subject Area Code |
Citescore | Scopus Subject Classification |
ISSN | Open Access |
Installation requires git on your system.
pip install git+https://github.com/rbrtjwrk/scopus_harvester.git
Although the individual functions can be called separately, I recommend calling scopus_journals(subject_abbrev=None, subject_code=None, count=None, start=0) to obtain all of the attributes at once.
To see all Scopus Subject Areas, call the function scopus_subject_areas().
>>> import scopus_harvester as sh
>>>
>>> sh.scopus_subject_areas().head()
Subject_Area Subject_Area: Full_Name
0 AGRI Agricultural and Biological Sciences
1 ARTS Arts and Humanities
2 BIOC Biochemistry, Genetics and Molecular Biology
3 BUSI Business, Management and Accounting
4 CENG Chemical Engineering
>>>
To see all Scopus Subject Area Codes, call the function scopus_subject_area_codes().
>>> import scopus_harvester as sh
>>>
>>> sh.scopus_subject_area_codes().head()
Subject_Area_Code Subject_Classification
0 1000 Multidisciplinary
1 1100 Agricultural and Biological Sciences (all)
2 1101 Agricultural and Biological Sciences (miscell...
3 1102 Agronomy and Crop Science
4 1103 Animal Science and Zoology
>>>
Before harvesting, you must first manually set up your API Key in the file scopus_get_journals.py.
Then call the function scopus_journals(subject_abbrev=None, subject_code=None, count=None, start=0).
Parameters:
- subject_abbrev: str, default None; either leave this parameter unspecified or pass exactly one subject area.
- subject_code: int, default None; either leave this parameter unspecified or pass exactly one subject area code.
- count: int, default None; count cannot be lower than 1.
- start: int, default 0.
>>> df=sh.scopus_journals(subject_abbrev="ARTS", count=3, start=0)
>>>
>>> df
Journal_Title Journal_ID ISSN ... Subject_Area_Code Subject_Classification Open_Access
0 21st Century Music 18500162600 1534-3219 ... [1210] [Music] None
1 3L: Language, Linguistics, Literature 19700200922 0128-5157 ... [1203, 3310, 1208] [Language and Linguistics, Linguistics and Language] 1
2 452F 21101005201 ... [1208] [Literature and Literary Theory] 1
[3 rows x 8 columns]
>>>
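The parameter constraints listed above can be checked before any request is sent. The following is only a sketch: check_journal_query is a hypothetical helper, not part of scopus_harvester, and the rule that start cannot be negative is an assumption.

```python
from typing import Optional


def check_journal_query(subject_abbrev: Optional[str] = None,
                        subject_code: Optional[int] = None,
                        count: Optional[int] = None,
                        start: int = 0) -> dict:
    """Hypothetical helper: validate the arguments and return them as a dict."""
    if subject_abbrev is not None and not isinstance(subject_abbrev, str):
        raise TypeError("subject_abbrev must be a single string such as 'ARTS'")
    if subject_code is not None and not isinstance(subject_code, int):
        raise TypeError("subject_code must be a single integer such as 1210")
    if count is not None and count < 1:
        raise ValueError("count cannot be lower than 1")
    if start < 0:  # assumption: a negative offset is never meaningful
        raise ValueError("start cannot be negative")
    return {
        "subject_abbrev": subject_abbrev,
        "subject_code": subject_code,
        "count": count,
        "start": start,
    }


kwargs = check_journal_query(subject_abbrev="ARTS", count=3)
# With an API key configured, the checked arguments could then be passed on:
# df = sh.scopus_journals(**kwargs)
```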
If you try to harvest all Scopus-indexed serial titles at once, you may hit API limits, so it is advisable to download the data in batches, for example one subject area or subject area code at a time:
>>> import pandas as pd
>>> import scopus_harvester as sh
>>>
>>> def get_entries(subject_area):
...     output=[]
...     s=0
...     for _ in range(1000):
...         try:
...             r=sh.scopus_journals(subject_abbrev=subject_area, count=200, start=s)
...             output.append(r)
...             s+=200
...         # if there are no more journals in a given subject area
...         except KeyError:
...             break
...     return output
>>>
>>> def flatten_dfs(list_of_dfs):
...     # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
...     if not list_of_dfs:
...         return pd.DataFrame()
...     return pd.concat(list_of_dfs)
>>>
>>> subject_areas=sh.scopus_subject_areas()
>>>
>>> res=pd.DataFrame()
>>>
>>> for sa in subject_areas.Subject_Area:
...     print(sa)
...     entries=get_entries(sa)
...     print("--- entries downloaded")
...     flattened_entries=flatten_dfs(entries)
...     print(f"--- {sa}: {len(flattened_entries)}")
...     print("--- entries flattened")
...     res=pd.concat([res, flattened_entries])
...     print("--- entries appended")
...     print("")
>>>
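The paging pattern used above can be tried without an API key. The sketch below uses only the standard library; fake_fetch is a hypothetical stand-in for sh.scopus_journals, and treating KeyError as the end-of-data signal simply mirrors the loop above rather than any documented behaviour of the API.

```python
from typing import Callable, List


def harvest_in_batches(fetch_page: Callable[[int, int], list],
                       batch_size: int = 200,
                       max_batches: int = 1000) -> List[list]:
    """Collect pages until the fetcher raises KeyError or max_batches is hit."""
    pages = []
    start = 0
    for _ in range(max_batches):
        try:
            page = fetch_page(start, batch_size)
        except KeyError:
            break  # assumed end-of-data signal
        pages.append(page)
        start += batch_size
    return pages


# Hypothetical data source serving 450 made-up records.
FAKE_TITLES = [f"journal-{i}" for i in range(450)]


def fake_fetch(start: int, count: int) -> list:
    if start >= len(FAKE_TITLES):
        raise KeyError("entry")  # simulate a response without entries
    return FAKE_TITLES[start:start + count]


pages = harvest_in_batches(fake_fetch)
print([len(p) for p in pages])  # [200, 200, 50]
```

The max_batches cap keeps the loop from running forever if the end-of-data signal never arrives.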
You can also compute the SJR rank of each serial title within each of its subject area codes. To do that, call the function sjr_rank_per_subject_area_code(dataframe).
>>> df=sh.sjr_rank_per_subject_area_code(df)
>>>
>>> df.iloc[:, [0,6,8]]
Journal_Title Subject_Area_Code SJR_Rank_per_Subject_Area_Code
0 21st Century Music 1210 1.0
1 3L: Language, Linguistics, Literature 1203 1.0
2 3L: Language, Linguistics, Literature 3310 1.0
3 3L: Language, Linguistics, Literature 1208 1.0
4 452F 1208 NaN
>>>
Note that the SJR rank per subject area code is computed only on the data you have harvested. For serial titles that do not have an SJR, no rank is computed.
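To illustrate what a rank per subject area code means, here is a minimal pandas sketch on made-up data. It is not the implementation of sjr_rank_per_subject_area_code; the sample values, the choice to rank the highest SJR first, and the tie-breaking method are all assumptions.

```python
import pandas as pd

# Made-up sample data; column names follow the output shown above.
journals = pd.DataFrame({
    "Journal_Title": ["A", "B", "C"],
    "SJR": [0.5, 1.2, None],
    "Subject_Area_Code": [[1210], [1210, 1208], [1208]],
})

# One row per (journal, subject area code) pair.
long = journals.explode("Subject_Area_Code", ignore_index=True)

# Rank within each code, highest SJR first (assumption);
# rows without an SJR keep NaN as their rank.
long["SJR_Rank_per_Subject_Area_Code"] = (
    long.groupby("Subject_Area_Code")["SJR"]
        .rank(method="min", ascending=False)
)
print(long)
```

Because the ranking only sees the rows in the dataframe, a journal ranked first here may rank lower once more titles from the same subject area code are harvested.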
MIT, see.