Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use OCW FM exported keywords for CR_KEYWORDS #9

Merged
merged 16 commits into from
Jan 29, 2024

Conversation

ibrahimjaved12
Copy link
Collaborator

@ibrahimjaved12 ibrahimjaved12 commented Jan 19, 2024

What are the relevant tickets?

Closes https://github.com/mitodl/hq/issues/3378

Description (What does it do?)

Uses FM exported keywords for CR_keywords field in OER template. Previously we were just using OCW course topics as keywords.
Now we are relying on FM exported keywords. If they're not available for a course, then we use course's topics as keywords.

We were able to get 2269/2496 OCW courses in FM list.

Updated CSV

Here is the updated CSV file (3.1mb):
https://docs.google.com/spreadsheets/d/1zT0atM14ZbBx7fcy1NWfr94pdhygbIJgKgKaM661krM/edit#gid=1443282546

How to test

  1. Look at the updated CSV file above. Observe CR_KEYWORDS
  2. Make sure there isn't anything weird or inconsistent in that column. By that I mean, in the file, ocw_oer_export/mapping_files/fm_keywords_export.csv, I found different cases:
    • Comma separated keywords
    • Semicolon separated keywords
    • Newline separated, etc (and multiples of those). Make sure there isn't anything else.

Now checkout to this branch, run the program, make sure everything is good:
3. Build the container:

docker compose build
  1. Start the container:
docker compose run --rm app
  1. To generate a JSON file containing complete API data:
docker compose run --rm app --create_json
  1. To create a CSV file from the local JSON file:
docker compose run --rm app --create_csv --source=json
  1. To run unit tests:
docker run --rm ocw_oer_export python -m unittest discover

@ibrahimjaved12 ibrahimjaved12 changed the base branch from main to github-action-tests January 19, 2024 14:05
Base automatically changed from github-action-tests to main January 22, 2024 06:25
@ibrahimjaved12 ibrahimjaved12 added the Needs Review An open Pull Request that is ready for review label Jan 22, 2024
@ibrahimjaved12 ibrahimjaved12 marked this pull request as ready for review January 22, 2024 09:59
@ibrahimjaved12 ibrahimjaved12 changed the title Use fm exported keywords Use FM exported keywords for CR_KEYWORDS Jan 22, 2024
@ibrahimjaved12 ibrahimjaved12 changed the title Use FM exported keywords for CR_KEYWORDS Use OCW FM exported keywords for CR_KEYWORDS Jan 22, 2024
@gumaerc gumaerc self-assigned this Jan 22, 2024
Copy link

@gumaerc gumaerc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

This looks good to me. I did notice one thing that was kind of strange, but I don't think it's related to the code in this PR. In some of the CR_KEYWORDS values, where there was an apostrophe s and the S is capitalized. For example:

|First-Order Ode'S|Linear Ode'S|

In this particular case, I think the intention is to indicate multiples rather than as a possessive, so it should be:

|First-Order Odes|Linear Odes|

This shows up in a few places and I'm not exactly sure why.

@ibrahimjaved12
Copy link
Collaborator Author

ibrahimjaved12 commented Jan 25, 2024

👍

This looks good to me. I did notice one thing that was kind of strange, but I don't think it's related to the code in this PR. In some of the CR_KEYWORDS values, where there was an apostrophe s and the S is capitalized. For example:

|First-Order Ode'S|Linear Ode'S|

In this particular case, I think the intention is to indicate multiples rather than as a possessive, so it should be:

|First-Order Odes|Linear Odes|

This shows up in a few places and I'm not exactly sure why.

Thanks, that's a good catch. The 'S problem is coming from this PR. We are using Python's title() function to capitalize. I noticed that CR_KEYWORDS expects keywords to be capitalized (although not by requirement). But most of our keywords from the mapping file are lower-case (except acronyms or explicit stuff like that). Plus the other keywords, the ones taken as subject, were capitalized. So there was also an inconsistency.

The title() function capitalizes but it also has its side effects. For example,
U.S.A --> U.S.A (good, as it should be)
TCP --> Tcp (should be TCP)
USA --> Usa (that's not how it's supposed to be)

so it just capitalizes the first letter, and ignores everything else. And in case of special characters, such as apostrophes, it will also capitalize the next letter. So we get Newton'S. Which originally is intended for possession, but the function messes it up.

string.capwords() fixes our apostrophe problem, but introduces its different set of problems. Eg. U.S.A becomes U.s.a.

The best fix out there should be titlecase. It has a bit of smart conditions (heuristics) involved, it will ignore to capitalize common words like in, or a (using some list containing those words, + it also gives us some room for customization if wanted), and also fixes those other problems mentioned above.

So for academic context, this should be best to use.

Here's an example using all 3:

Input: "in the U.S.A, TCP's and UDPs are widely used in the Networking applications."

title(): In The U.S.A, Tcp'S And Udps Are Widely Used In The Networking Applications
string.capwords(): In The U.s.a, Tcp's And Udps Are Widely Used In The Networking Applications.
titlecase(): In the U.S.A, TCP's and UDPs Are Widely Used in the Networking Applications.

Edit: Turns out I wasn't entirely right about titlecase(). I think the logic it uses is pertaining to particular words, eg. TCP becomes TCP, yet UDP becomes Udp, DNA becomes Dna, RNA becomes Rna.

I think we could however fix this by using isupper() with titlecase(). That should do the trick.

@pt2302
Copy link

pt2302 commented Jan 25, 2024

ODEs should be capitalized like this, since it’s an acronym for “ordinary differential equation”.

@ibrahimjaved12
Copy link
Collaborator Author

ibrahimjaved12 commented Jan 25, 2024

ODEs should be capitalized like this, since it’s an acronym for “ordinary differential equation”.

That is true, but multiples vs possession would be out of scope here. Since keywords were manually added by course authors (I think so?). We could look into that but would need to see if it's worth solving it.

I've manually added the ODEs fix though in the mapping file.

The possession problem though, like Newton'S was a different problem that needed fixing.

@ibrahimjaved12
Copy link
Collaborator Author

@gumaerc I have updated the PR, pertaining to the capitalization and apostrophe problems.

  • Added titlecase library
  • Used isupper() along with above
  • Manually fixed ODE's to ODEs.
  • Added unit tests related to above
  • Removed requirements.txt as we had shifted to using Poetry. Added the new dependency in pyproject.toml

I have updated the CSV file in the PR. PR is ready to be re-reviewed.

Copy link

@gumaerc gumaerc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 LGTM!

@ibrahimjaved12 ibrahimjaved12 merged commit 632c5dc into main Jan 29, 2024
2 checks passed
@ibrahimjaved12 ibrahimjaved12 deleted the use-fm-exported-keywords branch January 29, 2024 10:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Needs Review An open Pull Request that is ready for review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants