Ukvsconverter follows DRY conventions for JSON #40

Merged 44 commits on Aug 3, 2022
44 commits
bdd3552
Removing merge to main in main.yml
WinstonShields Jul 20, 2022
3ccfb5c
Merge branch 'date-range-for-each-individual-ids' of github.com:oduws…
WinstonShields Jul 20, 2022
a51d42d
Add changes
actions-user Jul 20, 2022
4272014
modifying main.yml
WinstonShields Jul 20, 2022
6266e45
Add changes
actions-user Jul 20, 2022
a903fb9
modified main.yml
WinstonShields Jul 20, 2022
aadf70a
Merge branch 'date-range-for-each-individual-ids' of github.com:oduws…
WinstonShields Jul 20, 2022
57dcd03
Add changes
actions-user Jul 20, 2022
16a05d2
modified main.yml
WinstonShields Jul 20, 2022
af46064
Merge branch 'date-range-for-each-individual-ids' of github.com:oduws…
WinstonShields Jul 20, 2022
c08dfcf
modified main.yml
WinstonShields Jul 20, 2022
5b823e9
modified main.yml
WinstonShields Jul 20, 2022
7ee9c1e
modified main.yml
WinstonShields Jul 20, 2022
d838818
modified main.yml
WinstonShields Jul 20, 2022
f319708
Add changes
actions-user Jul 20, 2022
4f23151
modified main.yml
WinstonShields Jul 20, 2022
d9cea5d
Add changes
actions-user Jul 20, 2022
29d0f0c
modified main.yml
WinstonShields Jul 20, 2022
2f3123a
Merge branch 'date-range-for-each-individual-ids' of github.com:oduws…
WinstonShields Jul 20, 2022
b45235d
Add changes
actions-user Jul 20, 2022
61cfa3e
start year and end year now optional
WinstonShields Jul 27, 2022
0cbc2ca
removed generated files
WinstonShields Jul 27, 2022
786bb3f
Add changes
actions-user Jul 27, 2022
1d144af
updated main.yml and readme
WinstonShields Jul 27, 2022
8afcf3a
Merge branch 'date-range-for-each-individual-ids' of github.com:oduws…
WinstonShields Jul 27, 2022
bd401de
removed extra '-i'
WinstonShields Jul 27, 2022
412a40c
Add changes
actions-user Jul 27, 2022
5f4d68b
actions should fail if no HTML is generated
WinstonShields Jul 27, 2022
e5def22
Add changes
actions-user Jul 27, 2022
9b0e17f
testing if fail statement works
WinstonShields Jul 27, 2022
319cb67
testing if fail statement works
WinstonShields Jul 27, 2022
83212b9
testing
WinstonShields Jul 27, 2022
ae8b42f
testing
WinstonShields Jul 27, 2022
3972ef8
added output verification to git actions
WinstonShields Jul 27, 2022
5553943
added output verification to git actions
WinstonShields Jul 27, 2022
916c54d
Add changes
actions-user Jul 27, 2022
f66742f
Add changes
actions-user Aug 3, 2022
c6ba795
markdown files extracted year now shows
WinstonShields Aug 3, 2022
8dfa9ab
Merge branch 'markdown-files-do-not-always-display-extracted-year' of…
WinstonShields Aug 3, 2022
699c598
JSON object is now parsed to simplify program
WinstonShields Aug 3, 2022
00bf6c4
Add changes
actions-user Aug 3, 2022
8d74f59
make sure year is displayed
WinstonShields Aug 3, 2022
91ad47b
Stringify year
WinstonShields Aug 3, 2022
8b51f2a
Add changes
actions-user Aug 3, 2022
65 changes: 47 additions & 18 deletions .github/workflows/main.yml
@@ -21,15 +21,49 @@ jobs:
- name: htmlsave
run: |
sudo find output/data -type f -iname \*.html -delete
python3 ./code/htmlsave.py --output output/data oWQaPnwAAAAJ MOLPTqcAAAAJ OkEoChMAAAAJ -eRsYx8AAAAJ QjHw7ugAAAAJ Of8dNP0AAAAJ
python3 ./code/htmlsave.py --output output/data oWQaPnwAAAAJ MOLPTqcAAAAJ OkEoChMAAAAJ -eRsYx8AAAAJ QjHw7ugAAAAJ Of8dNP0AAAAJ jDmcdsUAAAAJ
echo "num_of_html=$(ls output/*.html | wc -l)" >> $GITHUB_ENV
echo "empty_html=$(find . -name '*.html' -size 0 | wc -l)" >> $GITHUB_ENV
echo "corrupted_html=$( grep -irm 1 '<p class=\"a2CQh\" jsname=\"VdSJob\">to continue to Google Scholar Citations</p>' --include \*.html . | wc -l)" >> $GITHUB_ENV
- name: Check if HTML files are generated on htmlsave
if: ${{ env.num_of_html < 1 }}
uses: actions/github-script@v3
with:
script: |
core.setFailed('No files generated on htmlsave')
- name: Check for empty HTML files generated on htmlsave
if: ${{ env.empty_html > 1 }}
uses: actions/github-script@v3
with:
script: |
core.setFailed('Empty HTML files')
- name: Check for corrupted HTML files generated on htmlsave
if: ${{ env.corrupted_html > 1 }}
uses: actions/github-script@v3
with:
script: |
core.setFailed('Corrupted HTML files')

- name: html2ukvs
run: |
sudo apt-get install python3-bs4
sudo apt-get install w3m
sudo find output/data -type f -iname \*.ukvs -delete
python3 ./code/html2ukvs.py output/data/*html

python3 ./code/html2ukvs.py output/data/*.html -i oWQaPnwAAAAJ --start=2002 -i MOLPTqcAAAAJ --start=2011 -i -eRsYx8AAAAJ --start=2018 -i OkEoChMAAAAJ --start=2018 -i Of8dNP0AAAAJ --start=2019 -i QjHw7ugAAAAJ --start=2020 -i jDmcdsUAAAAJ --start=2002 --end=2005
echo "num_of_ukvs=$(ls output/*.ukvs | wc -l)" >> $GITHUB_ENV
echo "empty_ukvs=$(find . -name '*.ukvs' -size 0 | wc -l)" >> $GITHUB_ENV
- name: Check if UKVS files are generated on html2ukvs
if: ${{ env.num_of_ukvs < 1 }}
uses: actions/github-script@v3
with:
script: |
core.setFailed('No files generated on html2ukvs')
- name: Check for empty UKVS files generated on html2ukvs
if: ${{ env.empty_ukvs > 1 }}
uses: actions/github-script@v3
with:
script: |
core.setFailed('Empty UKVS files')
- name: Sort
run: |
cat output/data/*ukvs | sort -u -k1,1 | sort -k2 -rn > output/comprehensive.ukvs
@@ -48,7 +82,6 @@ jobs:
- name: Commit files
id: commit
run: |
git branch
git config --local user.email "[email protected]"
git config --local user.name "github-actions"
git add --all
@@ -58,23 +91,19 @@ jobs:
git commit -m "Add changes" -a
echo "::set-output name=push::true"
fi
echo ${GITHUB_REF##*/}
git pull origin ${GITHUB_REF##*/}
git push origin ${GITHUB_REF##*/}
shell: bash
- name: Push changes via push
if: steps.commit.outputs.push == 'true' && github.event_name == 'push'
# uses: ad-m/github-push-action@master
run: |
git push origin ${GITHUB_REF##*/}
- uses: actions/checkout@v2
with:
ref: main
- name: Pull to main
ref: ${{ github.event.pull_request.head.ref }}
- name: Push changes via pull request
if: steps.commit.outputs.push == 'true' && github.event_name == 'pull_request'
# uses: ad-m/github-push-action@master
run: |
git branch
git merge origin/${GITHUB_REF##*/}
git add --all
git add .
shell: bash
- name: Push changes
if: steps.commit.outputs.push == 'true'
uses: ad-m/github-push-action@master
with:
github_token: ${{ secrets.GITHUB_TOKEN }}
git push origin ${{ github.event.pull_request.head.ref }}
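
The output checks added to the workflow above (no files, empty files, files containing the Google sign-in text) can be sketched in Python for local testing. This is a minimal sketch, not the workflow's implementation: `check_outputs` is a hypothetical helper, the match is case-sensitive where the workflow's `grep -i` is not, and it flags any empty or sign-in file, which is stricter than the `> 1` thresholds used in the steps.

```python
from pathlib import Path
import tempfile

# Marker string copied from the diff; it identifies a Google sign-in page.
SIGN_IN_MARKER = (
    '<p class="a2CQh" jsname="VdSJob">'
    'to continue to Google Scholar Citations</p>'
)


def check_outputs(directory, suffix, marker=None):
    """Return a list of problems for files under `directory` ending in `suffix`.

    Hypothetical helper: the workflow does this with ls, find, and grep.
    """
    files = sorted(Path(directory).rglob("*" + suffix))
    if not files:
        return ["No " + suffix + " files generated"]
    problems = []
    for path in files:
        if path.stat().st_size == 0:
            problems.append("Empty file: " + path.name)
        elif marker and marker in path.read_text(encoding="utf-8", errors="replace"):
            problems.append("Corrupted file: " + path.name)
    return problems


if __name__ == "__main__":
    # Build a throwaway directory with one good, one empty, one sign-in page.
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "good.html").write_text("<html>ok</html>", encoding="utf-8")
        Path(tmp, "empty.html").write_text("", encoding="utf-8")
        Path(tmp, "login.html").write_text(SIGN_IN_MARKER, encoding="utf-8")
        for problem in check_outputs(tmp, ".html", SIGN_IN_MARKER):
            print(problem)
```

The same function could be pointed at `.ukvs` files by passing a different suffix and no marker.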

2 changes: 1 addition & 1 deletion README.md
@@ -214,7 +214,7 @@ Specifying only a start or end year will give you the remaining results when an
```
#### HTML List Format:

Articles may be formatted into different list types such as one list(1), lists separated by year(all), lists separated by scholar ID(scholarid), or no list(none) by using the command line argument --list:
Articles may be formatted into different list types such as one list(1), lists separated by year(all), or no list(none) by using the command line argument --list:

```
./ukvsconvert.py --html --startyear "2010" --endyear "2021" --list=all --title "Article Results" comprehensive.ukvs > all.html
39 changes: 27 additions & 12 deletions code/html2ukvs.py
@@ -8,6 +8,7 @@
from bs4 import BeautifulSoup
import argparse
import math
import json

"""
The findinitiallink(): function is used to search the saved HTML file for the link for each
@@ -56,14 +57,25 @@ def createpopupURL(initial_link):
parser.add_argument('files', action='append', nargs='+')
# Add parser argument for the scholar IDs and their start years and end years.
parser.add_argument('-i', action='append', nargs='+')

args = parser.parse_args()

files = args.files[0]

scholar_options = {}

if args.i:
scholar_options = { option[0]: list(map(int, option[1:])) for option in args.i }
# scholar_options = { option[0]: list(map(int, option[1:])) for option in args.i }
scholar_options = {}

for option in args.i:
start = [string for string in option if '--start' in string]
start = int(start[0].split('=')[1]) if start else None
end = [string for string in option if '--end' in string]
end = int(end[0].split('=')[1]) if end else None

scholar_options[option[0]] = [start, end]


# Import the html file contents and open it with Beautiful Soup. The HTML file is read by
# byte and uses the 'lxml' conversion parser. This uses Beautiful Soup version 4.
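
The per-scholar option loop in this hunk can be exercised on its own. The sketch below assumes each `-i` group arrives as a list whose first element is the scholar ID, followed by optional `--start=YYYY` and `--end=YYYY` strings, as the loop implies; how `argparse` delivers those groups is not shown in the diff. `build_scholar_options` is a hypothetical name.

```python
def build_scholar_options(groups):
    """Map each scholar ID to [start, end]; a missing bound becomes None.

    `groups` is assumed to look like the value of `args.i` in the diff,
    e.g. [['jDmcdsUAAAAJ', '--start=2002', '--end=2005'], ...].
    """
    options = {}
    for scholar_id, *flags in groups:
        bounds = {"--start": None, "--end": None}
        for flag in flags:
            # '--start=2002' splits into ('--start', '=', '2002').
            name, _, value = flag.partition("=")
            if name in bounds and value:
                bounds[name] = int(value)
        options[scholar_id] = [bounds["--start"], bounds["--end"]]
    return options


if __name__ == "__main__":
    groups = [
        ["oWQaPnwAAAAJ", "--start=2002"],
        ["jDmcdsUAAAAJ", "--start=2002", "--end=2005"],
    ]
    print(build_scholar_options(groups))
```

Unlike the substring test in the diff (`'--start' in string`), this version matches the flag name exactly, so an ID that happens to contain that text is not mistaken for a bound.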
@@ -92,6 +104,8 @@ def createpopupURL(initial_link):
if scholar_options:
if scholar_id_key in scholar_options.keys():
startyear, endyear = scholar_options[scholar_id_key]
startyear = startyear if startyear is not None else -math.inf
endyear = endyear if endyear is not None else math.inf
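
Replacing a missing bound with `-math.inf` or `math.inf` lets a single comparison handle open-ended ranges. A minimal sketch follows; `in_year_range` is a hypothetical helper, and treating both bounds as inclusive is an assumption, since the comparison itself lies outside this hunk.

```python
import math


def in_year_range(year, startyear=None, endyear=None):
    """Return True when `year` lies within the (assumed inclusive) bounds."""
    lower = startyear if startyear is not None else -math.inf
    upper = endyear if endyear is not None else math.inf
    # Integers compare cleanly against float infinities.
    return lower <= year <= upper


if __name__ == "__main__":
    print(in_year_range(2004, startyear=2002, endyear=2005))
    print(in_year_range(1999, startyear=2002))
    print(in_year_range(2030))
```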

"""
The program uses Beautiful Soup functions to extract specific elements. The elements are
@@ -135,17 +149,18 @@ def createpopupURL(initial_link):

# Items in the UKVS file are arrays of entries with each entry being saved as
# multiple key-value pairs in a dictionary format following a hash and year key.
gs_lists.append((
hashID + ' ' + pageYear + ' { ' + \
'"DirectURL":"' + directURL + '", ' + \
#'"PopURL":"' + popURL + '", ' + \ # Removed while no longer functional in GS page
'"Title":"' + title + '", ' + \
'"Authors":"' + authors + '", ' + \
'"Source":"' + source + '", ' + \
'"CitedBy":"' + citedBy + '", ' + \
'"Citations":"' + citations + '", ' + \
'"PageYear":"' + pageYear + '"}'
))

items = {
'DirectURL' : directURL,
'Title' : title,
'Authors' : authors,
'Source' : source,
'CitedBy' : citedBy,
'Citations' : citations,
'PageYear' : pageYear
}

gs_lists.append(hashID + ' ' + pageYear + ' ' + json.dumps(items))

# Save the contents as an UKVS file with the same name as the original HTML file
f_name, f_ext = os.path.splitext(html_file.name)
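
Building the entry with `json.dumps` rather than string concatenation means quotes and backslashes inside a title are escaped, so the line can be read back with `json.loads`. The sketch below shows the round trip; `format_ukvs_line` and `parse_ukvs_line` are hypothetical names, and it assumes the hash and year contain no spaces.

```python
import json


def format_ukvs_line(hash_id, page_year, items):
    """Build one UKVS line: hash, year, then the JSON object."""
    return hash_id + " " + page_year + " " + json.dumps(items)


def parse_ukvs_line(line):
    """Split a UKVS line back into (hash, year, dict)."""
    # Split only twice so spaces inside the JSON payload are preserved.
    hash_id, page_year, payload = line.split(" ", 2)
    return hash_id, page_year, json.loads(payload)


if __name__ == "__main__":
    items = {"Title": 'A "quoted" title', "PageYear": "2020"}
    line = format_ukvs_line("abc123", "2020", items)
    print(line)
    print(parse_ukvs_line(line))
```

With the earlier concatenation approach, a title containing a double quote would have produced invalid JSON.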
4 changes: 3 additions & 1 deletion code/htmlsave.py
@@ -101,6 +101,8 @@ def createURL():
qualifier = '>There are no articles in this profile.<'
article_test = True

corrupted_file = '<p class="a2CQh" jsname="VdSJob">to continue to Google Scholar Citations</p>'

# Program checks status code to verify a valid page was received. A status code
# of '200' is valid. A '302' redirect to a '200' is normally accepted as well.
statuscode = page.status_code
@@ -126,7 +128,7 @@ def createURL():
begin_value = begin_value + 100
page = requests.get(createURL())
new_test = page.text
if qualifier in new_test:
if qualifier in new_test or corrupted_file in new_test:
article_test = False
statuscode = page.status_code
x = x+1
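
The stop condition added here ends pagination when a page either reports no articles or is the Google sign-in page. A minimal sketch of that loop, decoupled from `requests`, is shown below. `fetch` is a hypothetical callable that returns page text for a given offset, `max_pages` is an added safety limit not present in the diff, and whether the original saves a page before or after testing it is not visible in this hunk.

```python
# Both marker strings are copied from the diff.
NO_ARTICLES = ">There are no articles in this profile.<"
SIGN_IN = '<p class="a2CQh" jsname="VdSJob">to continue to Google Scholar Citations</p>'


def should_stop(page_text):
    """True when the page has no articles or is a sign-in page."""
    return NO_ARTICLES in page_text or SIGN_IN in page_text


def collect_pages(fetch, page_size=100, max_pages=50):
    """Fetch pages at increasing offsets until a stop marker appears."""
    pages = []
    offset = 0
    for _ in range(max_pages):
        text = fetch(offset)
        if should_stop(text):
            break
        pages.append(text)
        offset += page_size
    return pages


if __name__ == "__main__":
    fake = {0: "page one", 100: "page two", 200: SIGN_IN}
    print(collect_pages(lambda offset: fake[offset]))
```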