4. Exploring Semantic Spaces - challenge #34
Comments
Two intuitions: The semantic networks in speeches given by East German members of the Bundestag differ significantly from those given by West German members. *Over time (post-1990), this difference will become less noticeable within parties, but East German (and) right-wing extremist parties will occupy an increasingly distinguishable semantic space.
Data: The dataset I will draw on is the first corpus containing speeches given in the German Bundestag, spotty before 1990 but comprehensive afterwards. I will analyze only speeches given after German reunification in 1990. The data can be found here. I would also love to discuss this dataset with anyone who's interested in the project! |
Intuitions: Computing word embeddings for a corpus of popular music lyrics from the 1970s to the 2010s will allow me to analyze shifts in the connotations of gender slurs. I expect such words to trend towards neutrality, or even positivity. However, I expect hegemonic gender associations to remain present over time.
Data: |
Intuitions:
Data: |
You might find the GDELT project interesting! https://www.gdeltproject.org/ Intuition:
|
Intuition:
Data: web-scraped from Sina Weibo and freeweibo. |
Intuitions: 1. Topics around physical appearance and caste will emerge in year-wise documents of matrimonial ads, and this has not changed over the years.*
Data can be provided upon request, but I will need to scrape advertisements for more years and compile them into a meaningful corpus. |
Intuitions:
Data:
I'll welcome anyone who's willing to discuss this project with me. |
I am sticking with the same intuitions as last week, but using different datasets.
Intuition: Financial news
Data: Dow Jones Newswires; analyst-level data: I/B/E/S
I'd love to talk about model building with the TAs and anyone interested! |
Intuitions:
Data Source: |
Intuition:
Newspaper data available upon request! |
Two intuitions:
The environmental dataset is the same one we used last week. |
Intuitions:
|
Intuitions:
Dataset: not available. It should be all Twitter data with a specific hashtag such as #BLM. I am currently working out how to get historical Twitter data (the official API requires a token that is hard to obtain). |
Intuitions:
Data: the Movies corpus available on Canvas. |
Intuition: Data: |
Intuitions:
Dataset: discussions in seven social media communities hosted by seven ethnic minority groups on the most-used Chinese communication platform. An example is like this. |
Intuitions: dataset: the fake news corpus I proposed to use last week. |
Intuitions about short movie reviews (https://movie.douban.com):
Douban movie reviews (I found a scraper script on GitHub: https://github.com/csuldw/AntSpider) |
Intuitions:
|
Intuition:
Data: NOW corpora. (Still working on getting an optimal subsample). |
Intuition: Data: from Netflix's "Terrace House" |
Intuition:
Data: Currently still scraping, transcribing and translating speeches of various political figures |
Two intuitions:
The corpus I am interested in is the Coronavirus Corpus from https://www.english-corpora.org/corona/. |
Similar to @Halifaxi, I'm interested in the changing contextual embeddings of COVID over time. I would focus more on fake news during the pandemic. The intuitions are:
Dataset: a Kaggle dataset that collects COVID-related fake news articles and posts. Would like to discuss with anyone interested! |
Words that changed the most in the cleaned telehealth corpus from 2017 to 2023:
Data: purchased NOW corpus data from 2017-2023 |
Post your response to our challenge questions.
First, write down two intuitions you have about broad content patterns you will discover in your data. These can be the same as those from last week, or they can evolve based on last week's explorations and the novel possibilities that emerge from continuous, high-dimensional embeddings. As before, place an asterisk next to the one you expect most firmly, and a plus next to the one that, if true, would be the biggest or most important surprise to others (especially the research community to whom you might communicate it, if robustly supported). Note that these expectations become the basis of abduction, conditioning your surprise.
Second, describe the dataset(s) on which you will build an embedding model to explore these intuitions. Then place (a) a link to the data, (b) a script to download and clean it, (c) a reference to a class dataset, (d) an invitation for a TA to contact you about it, OR (e) a brief explanation why the data cannot be made available.
Please do NOT spend time/space explaining the precise embedding or analysis strategy you will use to explore your intuitions.
(Then upvote the 5 most interesting, relevant and challenging challenge responses from others.)