Annie's Chpt 17 Notes #50

Open
anniecollins opened this issue Oct 18, 2022 · 0 comments
  • Line 65: What do you mean by "simulate" when it comes to text data? This chapter does not appear to follow the same workflow as previous chapters.
  • Line 67: The chapter does not include regression or word embeddings. Are these to come?
  • Code block at line 157: When removing stop words, are adjacent stop words meant to remain in the string? The original code was not removing them because of spacing issues (see the output in the book), so I added a few extra cleaning steps on the assumption that they should be removed. If that is not the case, the text should explain why some stop words are allowed to remain in the data.
  • Code block at line 282: This was not a very informative example for n-grams, so I changed it to some text from Don Quixote.
  • Line 296: The explanation of "Canadians", "Canadian", and "Canada" here does not match the output of char_wordstem(c("Canadians", "Canadian", "Canada")).
  • Line 743: I don't think the Dirichlet distribution can be meaningfully explained from a theoretical standpoint in this amount of space. Is there a way to give a more applied overview of the distribution? Otherwise I would skip this part.
  • Section 17.4.1 ("What is talked about in Canadian parliament?"): It would be useful to demonstrate the results of testing different values of K with stm() in this section, and potentially a training and test set process like the one mentioned at the end of Section 17.4.
  • In general, I think many of the examples in this chapter are not contextualized enough. It would be useful to give more concrete suggestions about how the cleaning steps might come in handy, or how the techniques could be used to draw conclusions about a body of text. For example, how might you analyze Table 17.1 or the TF/IDF/TF-IDF scores themselves? Is there something that could be drawn from your work with Callie for the Topic Models section?
  • Have you considered including a section on sentiment analysis? That might be interesting with the horoscope data.
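
On the stop-word spacing point above: the failure mode is easy to reproduce in any language. Here is a minimal Python sketch (the stop-word list and sentence are my own toy examples, not the book's data) showing why space-padded patterns miss adjacent stop words, and why a word-boundary pattern plus a whitespace squeeze removes them all:

```python
import re

# Toy stop-word list and sentence, for illustration only.
text = "the quality of the data and of the code"

# Space-padded alternatives (the R equivalent of collapsing stop words
# with " | ") miss adjacent stop words: matching " of " consumes the
# space that the following " the " needed in order to match.
padded = re.sub(r" (?:of|the|and) ", " ", text)
# -> "the quality the data of code"  (several stop words survive)

# Word-boundary patterns, followed by a whitespace squeeze, remove them all.
cleaned = re.sub(r"\b(?:of|the|and)\b", "", text)
cleaned = re.sub(r"\s+", " ", cleaned).strip()
# -> "quality data code"
```

This is the kind of "extra cleaning step" I mean: remove on word boundaries rather than padded spaces, then collapse the leftover whitespace.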
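For the n-gram bullet, a sketch of what bigram extraction actually produces may help readers: each pair of adjacent tokens, here joined with an underscore the way quanteda displays them. The sentence is just the well-known Don Quixote opening, used for illustration:

```python
# Tokenize a short illustrative sentence.
tokens = "in a village of la mancha".split()

# Bigrams: every pair of adjacent tokens, joined with "_".
bigrams = ["_".join(pair) for pair in zip(tokens, tokens[1:])]
# -> ["in_a", "a_village", "village_of", "of_la", "la_mancha"]
```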
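On the stemming bullet: the point of the discrepancy is that "Canadians" and "Canadian" reduce to the same stem while "Canada" does not. A deliberately simplified toy stemmer in Python (my own stand-in, applying only one Porter-style rule; the real rules behind char_wordstem() are far more extensive) is enough to show this:

```python
def toy_stem(word):
    """One Porter-style step: lowercase, then strip a plural 's'.
    Illustration only -- real stemmers apply many more suffix rules."""
    word = word.lower()
    if word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    return word

stems = [toy_stem(w) for w in ["Canadians", "Canadian", "Canada"]]
# "Canadians" and "Canadian" share the stem "canadian";
# "Canada" stays "canada", a different stem.
```

Whatever the prose ends up saying, it should match this grouping: two of the three words collapse together and the third does not.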
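On testing different values of K: one applied way to frame it is fitting candidate models on a training set and comparing them on held-out documents. A hedged sketch, using scikit-learn's variational LDA as a stand-in for stm() (the corpus and the K grid here are made up; stm also offers searchK() for this in R):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus standing in for the parliamentary speeches.
docs = [
    "trade tariffs and exports",
    "health care and hospitals",
    "exports and trade policy",
    "hospitals funding and care",
] * 10

train, test = train_test_split(docs, test_size=0.25, random_state=0)
vec = CountVectorizer()
X_train = vec.fit_transform(train)
X_test = vec.transform(test)

# Fit one model per candidate K and score each on the held-out set;
# lower held-out perplexity suggests a better choice of K.
scores = {}
for k in (2, 3, 4):
    lda = LatentDirichletAllocation(n_components=k, random_state=0)
    lda.fit(X_train)
    scores[k] = lda.perplexity(X_test)
```

Showing a small table or plot of scores like these for the parliament data would make the "choose K" advice concrete.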
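On interpreting the TF/IDF/TF-IDF scores themselves: a tiny sketch of the arithmetic, with hypothetical horoscope-flavoured tokens, shows the reading I think the chapter should spell out — a term that is frequent in one document but rare across the corpus scores highest, flagging it as characteristic of that document:

```python
import math

# Three hypothetical tokenized documents.
docs = [
    ["aries", "luck", "today"],
    ["taurus", "luck", "today"],
    ["aries", "aries", "mars"],
]

def tf_idf(term, doc, corpus):
    """Plain TF-IDF: term frequency in doc times log inverse document
    frequency across the corpus (one common variant of the formula)."""
    tf = doc.count(term) / len(doc)
    df = sum(term in d for d in corpus)
    idf = math.log(len(corpus) / df)
    return tf * idf

# "mars" appears only in the third document, so it scores higher there
# than the corpus-wide word "luck" does anywhere.
```

A sentence or two like the comment above, attached to the book's actual scores, would give readers something to conclude from the table.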