4. Exploring Semantic Spaces - challenge #34

Open
JunsolKim opened this issue Jan 10, 2022 · 25 comments

Comments

@JunsolKim

JunsolKim commented Jan 10, 2022

Post your response to our challenge questions.

First, write down two intuitions you have about broad content patterns you will discover in your data. These can be the same as those from last week...or they can evolve based on last week's explorations and the novel possibilities that emerge from continuous, high-dimensional embeddings. As before, place an asterisk next to the one you expect most firmly, and a plus next to the one that, if true, would be the biggest or most important surprise to others (especially the research community to whom you might communicate it, if robustly supported). Note that these expectations become the basis of abduction--to condition your surprise.

Second, describe the dataset(s) on which you will build an embedding model to explore these intuitions. Then place (a) a link to the data, (b) a script to download and clean it, (c) a reference to a class dataset, (d) an invitation for a TA to contact you about it, OR (e) a brief explanation why the data cannot be made available.

Please do NOT spend time/space explaining the precise embedding or analysis strategy you will use to explore your intuitions. (Then upvote the 5 most interesting, relevant and challenging challenge responses from others).

@konratp

konratp commented Feb 3, 2022

Two intuitions:

The semantic networks in speeches given by East German members of the Bundestag differ significantly from those given by West German members of the Bundestag

*Over time (post-1990), the above difference will become less noticeable within parties, but East German (and) right-wing extremist parties will occupy an increasingly distinguishable semantic space.

Data:

The dataset I will be drawing on is the first corpus containing speeches given in the German Bundestag, spotty before 1990 but comprehensive after 1990. I will only analyze those speeches given after the German reunification in 1990. The data can be found here. I would also love to discuss this dataset with anyone who's interested in the project!

@Jasmine97Huang

Intuitions:

Computing word embeddings for a corpus of popular music lyrics from the 1970s to the 2010s will allow me to analyze shifts in the connotations of gendered slurs. I expect to see trends for such words towards neutrality, or even positivity. However, hegemonic gender associations are expected to persist over time.

Data:
Billboard and Spotify data available on request.

@Qiuyu-Li

Qiuyu-Li commented Feb 3, 2022

Intuitions:

  1. The similarity or relatedness between news reports in two different countries can be used to identify big events in the two countries. The intuition is that news media from two different countries care about different problems. When they start reporting the same issue, it must be very important and have global impact.
  2. The auto-similarity of news media can be used to track recurring topics in history.

Data:
(d) Dear TA: I'm a little bit attracted by my ideas the moment I wrote them down. Do you think they are interesting and novel enough to work on? If they are, is there any data source you would recommend for me to explore? Thank you!
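
To make the first intuition concrete, one could compare pooled news text from two countries period by period and look for spikes in similarity. The following is only a minimal sketch under stated assumptions: the texts are made-up toy strings, plain term counts stand in for a real embedding, and the names `country_a`, `country_b`, and `similarity_by_period` are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase, whitespace-tokenized term counts for one pooled text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy data: one pooled text per country per period.
country_a = {
    "2020-01": "local election budget vote parliament",
    "2020-03": "pandemic lockdown virus hospital cases",
}
country_b = {
    "2020-01": "football transfer league final score",
    "2020-03": "virus pandemic cases lockdown travel",
}

# A high value in a period suggests both countries cover the same issue.
similarity_by_period = {
    period: cosine(bag_of_words(country_a[period]), bag_of_words(country_b[period]))
    for period in country_a
}

for period, score in sorted(similarity_by_period.items()):
    print(period, round(score, 2))
```

With real data, the term counts would likely be replaced by document embeddings, and a threshold for what counts as a "spike" would have to be chosen empirically.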

@isaduan

isaduan commented Feb 4, 2022

In reply to @Qiuyu-Li: You might find the GDELT project interesting! https://www.gdeltproject.org/

Intuition:

  • American technology policy is closer to evidence-based terms (citing more scientific, quantitative studies) and number-like tokens
  • New technology tends to be discussed in a negative light, and then neutralized
  • American technology policy is closer to nationalistic terms (e.g., jobs, America) than to globalist terms (e.g., international, UN)

Data: https://api.govinfo.gov/docs/

@sizhenf

sizhenf commented Feb 4, 2022

Intuition:

  • The Chinese government very harshly censors critiques related to its leader's personalization behaviors, but tolerates critiques of its public goods provision
  • On topics that it censors more (personalization), there is more government propaganda, and vice versa
  • We expect to see bursts in volume when the government releases news on new policies.

Data: web-scraped from Sina Weibo and freeweibo.

@pranathiiyer

Intuitions:

1. Topics around physical appearance and caste will emerge in year-wise documents of matrimonial ads, and this has not changed over the years. *
2. Topics around caste and appearance fail to emerge in advertisements of more recent years. +
3. Topics centered around women are different from topics centered around men. Word embeddings can be used to understand which words are associated with men versus those associated with women.

Data can be provided upon request, but I will need to scrape advertisements for more years and compile them into a meaningful corpus.

@Sirius2713

Sirius2713 commented Feb 4, 2022

Intuitions:

  1. Trump's tweets mentioning company names impacted the public's attitudes toward those companies. Consequently, public attitudes would affect the stock market. *
  2. The public's attitudes towards Trump's tweets were polarized. Some agreed strongly, while some disagreed strongly. Therefore, the impact of his tweets would not always be consistent with his sentiments.

Data:

  1. The archive of Trump's tweets: https://www.thetrumparchive.com/
  2. Stock price data can be gathered through Yahoo! Finance

I'll welcome anyone who's willing to discuss this project with me.

@chentian418

I stick to the same intuitions from last week, but using different datasets.

Intuition:
*Uncertainty is grouped with macroeconomic cyclicality, as they jointly affect management and analyst forecasts.
+Non-financial news can be grouped by market sentiment, as there should be implied chains of impact through which different companies connect and have influence on relevant parties.
Financial news can be grouped by its relevance to market expectations, i.e., whether it is relevant to increasing or decreasing expectations of the corresponding firm's performance.

Financial news data: Dow Jones Newswires; analyst-level data: I/B/E/S

I'd love to talk about the model building with TAs and anyone interested!

@Emily-fyeh

Intuitions:

  1. The human rights authority (who holds the narrative of defining and interpreting international human rights) has been enlarging the meaning and coverage of human rights over the years. For example, human rights officials and scholars now pay more attention to states' obligation to interfere with human rights violations in other countries.
  2. The new, extensive content of human rights can be observed in the documents published by UN human rights institutions and NGOs.

Data Source:
Country Reports on Human Rights Practices (U.S. Department of State): https://www.state.gov/reports/2020-country-reports-on-human-rights-practices/
Universal Periodic Review dataset

@ValAlvernUChic

Intuitions:

  1. Topics about race are hardly mentioned in Singaporean newspapers and, if they are, only in either a neutral or a positive context*
  2. Racism will only be addressed distantly

Newspaper data available upon request!

@mikepackard415

Two intuitions:

  1. Word embeddings trained on a corpus of online articles and blog posts written by a diverse range of people will have some significant differences to word embeddings trained on a corpus of peer-reviewed academic journal articles. *

  2. Semantic change will be detected more significantly in a corpus of environment blogs and articles than in a corpus of environmental science academic literature. +

The environmental dataset is the same one we used last week.
I don't have the academic dataset fully constructed yet.

@hshi420

hshi420 commented Feb 4, 2022

Intuitions:

  1. Topics in Chinese social media and topics in US social media.
  2. Posts on Chinese social media and US social media may show different sentiment towards the same event. *
  3. Posts on Chinese social media and US social media show different sentiment towards their government.
Dataset: not available. It would be very large, both cross-sectionally and longitudinally. Chinese social media companies also have policies that might forbid using their data for social science or political science research.

@LuZhang0128

LuZhang0128 commented Feb 4, 2022

Intuitions:

  1. Elite actors get involved in online social movements at a later stage, and the increasing elite participation indicates the start of the bureaucratization stage.
  2. There exist distinguishable sub-groups focusing on different sub-topics in an online social movement's network. Those sub-groups are talking about different topics that can be tested using embedding models.
  3. Non-elites can also be core actors in online social movements, whose tweets can cause a shift in the existing cultural pattern.

Dataset: not available. It should be all Twitter data with a specific hashtag like #BLM. I am currently working on how to get all historical Twitter data (the official API requires a token that is hard to get).

@Jiayu-Kang

Intuitions:

  1. In movie scripts, gendered pronouns' associations with family, career, etc., reflect gender stereotypes.
  2. The biases are decreasing over time.

Data: the Movies corpus available on Canvas.
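
One common way to probe such associations is to compare how close a target word sits to "he" versus "she" in the embedding space. The sketch below assumes this cosine-difference measure; it is not necessarily the approach the author has in mind. The three-dimensional vectors are made-up stand-ins for embeddings trained on the scripts, and `gender_lean` is a hypothetical helper name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors; real ones would come from a trained model.
vectors = {
    "he": [1.0, 0.1, 0.0],
    "she": [0.1, 1.0, 0.0],
    "career": [0.9, 0.2, 0.3],
    "family": [0.2, 0.9, 0.3],
}


def gender_lean(word):
    """Positive if the word is closer to 'he', negative if closer to 'she'."""
    return cosine(vectors[word], vectors["he"]) - cosine(vectors[word], vectors["she"])


for word in ("career", "family"):
    print(word, round(gender_lean(word), 3))
```

To examine the second intuition, the same score could be computed on embeddings trained separately per decade and compared across decades.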

@YileC928

YileC928 commented Feb 4, 2022

Intuition:
Negative financial news tends to attract more investor attention.+
Investors tend to consume more firm-level news than macro news.*
Negative financial news tends to travel faster and deeper among social networks.

Data:
Possibly Twitter data obtained through scraping. Happy to discuss and take any suggestions.

@NaiyuJ

NaiyuJ commented Feb 4, 2022

Intuitions:

  • Whereas most ethnic minorities in China are content with the preferential policies, they're concerned more about networking and job-seeking in their daily life. *
  • Compared to groups without religious beliefs, minority groups with religions more frequently talk about sensitive topics in China, like terrorism, democracy, and independence.
  • Ethnic minority groups that have their own languages are more content with the state and policies in contrast to groups without unique languages.

Dataset: the discussions in seven social media communities hosted by seven ethnic minority groups on the most used Chinese communication platform. An example is like this.

@kelseywu99

Intuitions:
*Fake news articles use simpler yet strong adjectives geared toward a broader sentimental audience base, with or without a background in higher education.
+Fake news articles are about politics and national security, rather than soft news and feature stories.
Fake news stories craft their own "dumbed-down" words to break down opaque political terms for their audiences, despite being fake in content per se.

Dataset: the fake news corpus I proposed to use last week.

@Hongkai040

Intuitions about short movie reviews (https://movie.douban.com):

  1. Review upvotes could be predicted by post time and content (sentiment, relatedness to the movie, etc.). *

  2. Reviews become more polarized and self-centered over time. +

Data: Douban movie reviews (I found a scraper script on GitHub: https://github.com/csuldw/AntSpider)

@chuqingzhao

chuqingzhao commented Feb 4, 2022

Intuitions:

  1. The business self-descriptions of early-stage companies are more likely to use terms or institutionalized words (consumer-intelligence platform, cloud-computing products) than buzzwords over time.
  2. Business self-descriptions become more professional and specific as the business grows, because their target audiences tend to shift from ordinary people to professional investors.

Crunchbase data (scraped from the Wayback Machine) available on request.

@facundosuenzo

Intuition:

  • Technological innovations (e.g., AI, Blockchain, Cryptocurrencies, Social Media Networks):
    a) will be generally framed negatively in news corpora across the years.
    b) we'll be able to see different gradients: skepticism, rejection, demonization.
  • Perceived social media's negative impact on society will increase over the years.*

Data: NOW corpora. (Still working on getting an optimal subsample).

@ttsujikawa

Intuition:
I will apply semantic analysis to scripts of reality shows and compare the results for one from the United States and one from Japan. This would allow me to reveal cultural differences in how people build relationships with each other (somehow) in reality.

Data: from Netflix "Terrace House"

@sudhamshow

sudhamshow commented Feb 4, 2022

Intuition:

  • Politicians use perceivably different discourse and vocabulary when addressing people in different contexts. The contexts of interest are: campaigning for election or rallying support; addressing the nation in times of difficulty, grief, or national achievement, or on a historical day; addressing opponents in the congress or parliament; and addressing counterparts on an international stage. *
  • Albeit with a different vocabulary set, most politicians use similar words during different speech contexts and these words lie close to each other in the hyperspace (when normalised for the meaning of the word). I suspect most politicians will be using similar vocabulary (attacking the opponent during a rally, calling for unity during a tragic event etc) given a particular speech setting.
  • I suspect that the vocabulary used during election rallies has more fervour and greater call for action than ones delivered during a national address. +

Data: Currently still scraping, transcribing and translating speeches of various political figures

@GabeNicholson

Two intuitions:

  • *Covid information keywords will have different contextual embeddings later in the pandemic.
  • +Vaccines and booster shots have different contextual embeddings associated with them, for better or worse.

The corpus I am interested in is on the Coronavirus, from https://www.english-corpora.org/corona/.
It has detailed text records from articles and media sources that show how the language around the pandemic has evolved and changed over the course of the pandemic. With so many articles and text data, I suspect a word embedding model could reveal some very interesting insights about how words have changed over the course of the pandemic.

@AllisonXiong

AllisonXiong commented Feb 4, 2022

Two intuitions:

  • *Covid information keywords will have different contextual embeddings later in the pandemic.
  • +Vaccines and booster shots have different contextual embeddings associated with them, for better or worse.

The corpora I am interested in is on the Coronavirus from https://www.english-corpora.org/corona/. It has detailed text records from articles and media sources that show how the language around the pandemic has evolved and changed over the course of the pandemic. With so many articles and text data, I suspect a word embedding model could reveal some very interesting insights about how words have changed over the course of the pandemic.

Similar to @Halifaxi, I'm interested in how the contextual embeddings of covid change over time. I would focus more on fake news during the pandemic. The intuitions are:

  • The contextual embeddings of covid and vaccination change over time, and the sentiment becomes more positive;
  • The word embeddings would differ between fake and mainstream news, as the former would attach more skepticism and conspiracy theories to the covid vaccine.

Dataset: a Kaggle dataset that collects some covid-related fake news articles and posts. I would like to discuss with anyone interested!

@floriatea

floriatea commented Feb 2, 2024

Words that changed the most in the cleaned telehealth corpus from 2017 to 2023:

  • Words like 'pacifica', 'persistence', 'analytics', 'evangelist', 'telecardiology', 'mantra', 'enduser', 'wise', and 'devoted' are among those that have shown significant semantic shifts.
  • These words likely represent emerging concepts, technologies, or trends that have gained prominence or evolved in meaning during the period studied.
  • For instance, 'analytics' and 'telecardiology' might indicate a growing focus on data analysis and remote healthcare, respectively. 'Evangelist' in a modern context often refers to someone who promotes a particular technology or innovation, which could suggest an evolving role in the tech or business sectors.

Data: purchased NOW corpus data from 2017-2023.
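
The post does not say how "changed the most" was measured. One simple option, shown here purely as an assumed illustration, is to compare a word's nearest neighbors in embeddings trained on the earlier and later periods. The two-dimensional vectors, the vocabulary, and the names `space_2017`, `space_2023`, and `neighbor_shift` are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def nearest_neighbors(space, word, k):
    """The k words most similar to `word` within one embedding space."""
    others = [w for w in space if w != word]
    others.sort(key=lambda w: cosine(space[word], space[w]), reverse=True)
    return set(others[:k])


def neighbor_shift(space_old, space_new, word, k=2):
    """1 minus the Jaccard overlap of the word's k nearest neighbors per period."""
    old = nearest_neighbors(space_old, word, k)
    new = nearest_neighbors(space_new, word, k)
    return 1.0 - len(old & new) / len(old | new)


# Hypothetical toy spaces: only 'evangelist' moves between periods.
stable = {
    "church": [1.0, 0.0], "preacher": [0.98, 0.05], "sermon": [0.97, 0.08],
    "startup": [0.0, 1.0], "platform": [0.05, 0.98], "software": [0.08, 0.97],
}
space_2017 = dict(stable, evangelist=[0.8, 0.3])
space_2023 = dict(stable, evangelist=[0.3, 0.8])

for word in ("evangelist", "church"):
    print(word, neighbor_shift(space_2017, space_2023, word))
```

Neighbor overlap avoids having to align two separately trained spaces, at the cost of depending on the choice of `k` and on the shared vocabulary.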
