check random subset of dicts to verify they match each other and the website #12

Closed
jwzimmer-zz opened this issue Oct 27, 2020 · 16 comments

@jwzimmer-zz
Owner

from issue #8 - before we close the issue, i think we should randomly spot check our results:

  • pick a subset of individual trope articles at random
  • check that the results from julia and phil are the same
  • manually inspect a different small random subset on the website vs. in either julia's or phil's results, and make sure that matches
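A minimal sketch of the first two steps, assuming each person's results can be loaded as a Python dict mapping an article name to its list of linked tropes. The names `spot_check`, `julia_links`, and `phil_links` and the toy data are hypothetical, not taken from either script.

```python
import random
from collections import Counter


def spot_check(dict_a, dict_b, sample_size=5, seed=None):
    """Compare link lists for a random sample of articles present in both dicts.

    Returns {article: (only_in_a, only_in_b)} for each sampled article whose
    link lists differ. Order is ignored, but duplicates are counted.
    """
    rng = random.Random(seed)
    shared = sorted(set(dict_a) & set(dict_b))
    sample = rng.sample(shared, min(sample_size, len(shared)))
    mismatches = {}
    for article in sample:
        count_a, count_b = Counter(dict_a[article]), Counter(dict_b[article])
        if count_a != count_b:
            only_a = sorted((count_a - count_b).elements())
            only_b = sorted((count_b - count_a).elements())
            mismatches[article] = (only_a, only_b)
    return mismatches


# Hypothetical toy data, for illustration only.
julia_links = {"ArticleA": ["TropeX", "ArticleA"], "ArticleB": ["TropeX"]}
phil_links = {"ArticleA": ["TropeX"], "ArticleB": ["TropeX"]}
print(spot_check(julia_links, phil_links, sample_size=2, seed=0))
```

Any article that shows up in the result is a candidate for the manual check against the website.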
@jwzimmer-zz
Owner Author

Revised plan for sanity checking due to #13 (comment):

  • i can't think of an obvious way to use the two kinds of lists to verify each other - no obvious subset relationships - but we can still randomly choose some articles to check manually against the website to make sure it means what we think it means.

@nguyenhphilip
Collaborator

Will do this tonight!

@jwzimmer-zz
Owner Author

jwzimmer-zz commented Oct 27, 2020

Problems

@jwzimmer-zz
Owner Author

I think I fixed the issues I found above ^ in 2485da7 and 1c0110a?

=== Before I re-run my script to get updated dicts: ===

  • Are all the tropes I'm finding now things we actually want?
  • Do we want to record duplicates (if something links to something else multiple times)? I think... yes?
    • Are the duplicates my script finds in Absent Aliens legit duplicates anyway?
  • Are there other folders we don't want to ignore tropes from, besides Main and UsefulNotes?
  • Are there articles structured in other ways than the ones my script expects now? (Paragraphs or ULs)

@jwzimmer-zz
Owner Author

@nguyenhphilip I'll check this issue again later/ tmrw to see what changes I should make based on whatever you find, too. : )

@nguyenhphilip
Collaborator

Hm, so this is written on the UsefulNotes page:

"Useful Notes articles are not tropes and are not to be included in a work's trope list. See, however, Historical Domain Character. Similarly, tropes are not to be used to describe the subject of a Useful Notes article directly. You may, however, list tropes that are commonly found in media portraying the subject."

The links on the UsefulNotes page GunsOfFiction bear this out: it just links to different types of guns like 'revolver' and 'sniper rifle', but there isn't really any content inside them, which makes me think these pages aren't meant to be meta tropes.

One UsefulNote, MuhammadAli, does link to other tropes that are in Main, but I don't think Muhammad Ali would be a meta trope that ties together these other tropes (could be wrong). That probably ties to this part of the paragraph above: "Similarly, tropes are not to be used to describe the subject of a Useful Notes article directly."

The World War II UsefulNote links to other Useful Notes (e.g. 1 and 2 and 3)

So maybe UsefulNotes won't be so useful for us :\

@jwzimmer-zz
Owner Author

Hmmmm ok, well, that's easy to take back out! : )

@nguyenhphilip
Collaborator

nguyenhphilip commented Oct 28, 2020

Made a dict of links from the articles in every subfolder in Indices

Also while spot checking ABoyAGirlAndABabyFamily using my dict and Julia's dict, it looks like Julia's script captures links that are nested in the Example folders that you have to click to expand on their website that mine doesn't. I think this is because J loops through <p> and <li> tags for links, while I use only <div id = main-article>. Some of these links aren't in the main trope_list (AlwaysMale), but some are (UntoUsASonandDaughterAreBorn).

Doing some further digging, it looks like AlwaysMale IS in Main, the folder we used for our initial master trope list. Maybe a better filtering strategy would be to grab all links within <p> and <li> tags, and exclude links that are in neither Main nor the master trope_list.

Everything in Julia's FightSceneFailure dict lines up with what's in mine except for an extra link to FightSceneFailure. Looks like there are some other <li> items on the page that link back to itself?
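To illustrate the difference between the two approaches, here is a rough sketch of collecting only the links that sit inside <p> or <li> tags, using Python's standard-library `html.parser`. This is not either of our actual scripts; the class and function names, the sample HTML, and the `/Main/...` href shape are assumptions for illustration.

```python
from html.parser import HTMLParser


class ParagraphAndListLinks(HTMLParser):
    """Collect href values of <a> tags nested inside <p> or <li> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # number of <p>/<li> elements currently open
        self.links = []  # hrefs in page order, duplicates kept

    def handle_starttag(self, tag, attrs):
        if tag in ("p", "li"):
            self.depth += 1
        elif tag == "a" and self.depth > 0:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("p", "li") and self.depth > 0:
            self.depth -= 1


def links_in_p_and_li(html):
    parser = ParagraphAndListLinks()
    parser.feed(html)
    return parser.links


# Hypothetical page fragment; the real site's markup and URLs may differ.
sample_html = """
<div id="main-article">
  <p>See <a href="/Main/AlwaysMale">Always Male</a>.</p>
  <ul><li><a href="/Main/FightSceneFailure">a list item link</a></li></ul>
</div>
<a href="/Main/SomewhereElse">link outside any p or li</a>
"""
print(links_in_p_and_li(sample_html))
```

One caveat with this sketch: it assumes every <p> and <li> is explicitly closed. Pages that omit closing tags would leave the depth counter above zero, so later links would be picked up too.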

@jwzimmer-zz
Owner Author

jwzimmer-zz commented Oct 28, 2020

Cool, super helpful!

So my script is currently doing this "Maybe a better filtering strategy would be to grab all links within <p> and <li> tags, and exclude links not in Main" but it isn't doing the additional step you mentioned, checking against the master trope list. Should I add that? That would screen out things like the "Trope" page, is that the idea?

Allowing self-links and duplicates: I think there are analyses that could care about this, especially the duplicates part (e.g. it would be cool to make a network only showing edges above a certain threshold, like articles that connected to each other more than 10 times or whatever)... maybe we should keep duplicates but get rid of self-links?
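A quick sketch of why keeping duplicates helps with the threshold idea, assuming the dicts map an article name to a list of linked article names with repeats preserved. The function names and toy data are hypothetical.

```python
from collections import Counter


def edge_weights(link_dicts, drop_self_links=True):
    """Count how many times each source article links to each target."""
    weights = Counter()
    for source, targets in link_dicts.items():
        for target in targets:
            if drop_self_links and target == source:
                continue  # skip pages linking to themselves
            weights[(source, target)] += 1
    return weights


def edges_above(weights, threshold):
    """Keep only edges whose link count is strictly greater than threshold."""
    return {edge: n for edge, n in weights.items() if n > threshold}


# Hypothetical toy data, for illustration only.
toy = {"A": ["B", "B", "B", "A", "C"], "C": ["A"]}
weights = edge_weights(toy)
print(edges_above(weights, 2))
```

If duplicates were dropped when the dicts are built, every edge would have weight 1 and this kind of filter would have nothing to work with.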

@nguyenhphilip
Collaborator

Yeah the idea i had in mind was to grab tropes that were in Main but not listed in trope_list so that we can capture as many tropes as we can.

self-links and duplicates: sorry i forgot to think about this last night! yes i agree, i think having duplicates could be interesting depending on the analysis and that we probably don't care about self-links.

side note: Once we do some final filtering/update of scripts I feel like we probably have a large enough sample to begin looking at some questions! :)

@jwzimmer-zz
Owner Author

ok so i should make the dicts such that: (0) one dict for every trope in the masterlist of tropes, (1) include links to things not on the masterlist, which we can ignore or include as needed, (2) only include links in the Main namespace or in the masterlist of tropes, (3) include duplicates, (4) do not include self-links

@jwzimmer-zz
Owner Author

Remade in 026cfc1 such that:

  • there is one dict for every trope in the master trope list
  • duplicates are included (linking to something more than once)
  • self-links are not included (if a page links to itself)
  • links from the page were included in the dict if they had Main in their link, OR if they were themselves in the trope masterlist
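The rules above can be sketched roughly as follows. This is an illustration, not the code in 026cfc1; the function name, the `/Main/...` link shape, and the toy data are assumptions.

```python
def filter_links(page_name, raw_links, masterlist):
    """Apply the filtering rules to one page's raw links.

    raw_links: link strings in page order, e.g. "/Main/TropeX" (assumed shape).
    masterlist: set of article names labelled as tropes.
    """
    kept = []
    for link in raw_links:
        # Treat the last path segment as the article name.
        name = link.rstrip("/").rsplit("/", 1)[-1]
        if name == page_name:
            continue  # self-links are not included
        if "/Main/" in link or name in masterlist:
            kept.append(name)  # duplicates are kept on purpose
    return kept


# Hypothetical toy data, for illustration only.
masterlist = {"PageA", "TropeZ"}
raw = ["/Main/TropeX", "/Main/TropeX", "/Main/PageA",
       "/UsefulNotes/NoteY", "/Other/TropeZ"]

# One dict entry per trope in the masterlist would be built this way.
print({"PageA": filter_links("PageA", raw, masterlist)})
```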

To reiterate from elsewhere: the trope masterlist is all the articles they've labelled as tropes, https://github.com/jwzimmer/tv-tropes/tree/main/trope_list/tropes

i will manually check that these pages #15 are fine to exclude (if not, I'll add them as new dicts).

@jwzimmer-zz
Owner Author

@nguyenhphilip these dicts I think look reasonable - not too different from the ones we had before but with the revisions above - so at some point would you mind sanity-checking a few and making sure they look like what you expect? Thanks.

@nguyenhphilip
Collaborator

QC'd 3 random articles in the list linked_article_tropes:

  • TimeKeepsOnTicking looks good. J's script appears to have kept only links in Main and our master trope_list while ignoring other links (e.g. WhoWantsToBeAMillionare)

  • SharingABody looks good. 179 tropes in this list, so just spot checked a few random ones. They were all listed in Main.

  • DidntSeeThatComing looks good as well, with a spot check of the first, last, and a few tropes in the middle of the page.

I feel confident that Julia's script got the things we wanted!

@jwzimmer-zz
Owner Author

Great! Thanks, @nguyenhphilip!
