2. Counting Words & Phrases - challenge #46
I have a dataset of 200 letters in which billionaires promise to give away their wealth. In the letters, the authors describe their intentions and motivations. I also have data on the authors, including their net worth, place of residence, age, date of pledging, industry, and Forbes rank. Hypotheses: Higher age, a source of wealth in traditional industries, and an earlier date of pledging are correlated with the use of moral words connected to family, religion, tradition, and social responsibility. Lower age, a source of wealth in new industries, and a later date of pledging are correlated with the use of language related to impact, efficiency, cost-benefit analysis, and return on investment. Reasoning: Over the past three decades there has been a shift in elite philanthropy towards effective, impact-oriented giving. This shift coincides with a demographic shift in the population of billionaires, who are now on average younger and made their wealth in technology or finance. Here we would check whether the change in the language of philanthropy is related to the change in personal characteristics. Corpus: The billionaire letters are from the Buffett-Gates Giving Pledge, accessible here: https://givingpledge.org/pledgerlist The variables describing characteristics of the letter writers (net worth, place of residence, age, date of pledging, industry, Forbes rank) come from a website collecting data on transparency in philanthropy: https://glasspockets.org/philanthropy-in-focus/eye-on-the-giving-pledge/profiles/jared-and-monica-isaacman The datasets could be made available to the class. |
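One way the hypothesis above could be operationalized, as a minimal sketch only: count dictionary hits per 1,000 tokens in each letter and correlate that rate with the author's age. The two word lists, the three toy letters, and all function names below are invented for illustration and are not from the Giving Pledge data; a real analysis would need a validated lexicon and the actual corpus.

```python
import re
from math import sqrt

# Hypothetical mini-lexicons (assumptions, not a validated dictionary).
MORAL_TERMS = {"family", "faith", "church", "tradition", "duty", "responsibility"}
IMPACT_TERMS = {"impact", "efficiency", "effective", "measurable", "return", "investment"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def rate_per_1000(text, lexicon):
    """Number of lexicon hits per 1,000 tokens (0.0 for empty text)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok in lexicon)
    return 1000.0 * hits / len(tokens)


def pearson(xs, ys):
    """Pearson correlation coefficient; NaN if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return float("nan")
    return cov / (sd_x * sd_y)


# Invented toy letters standing in for the real corpus.
letters = [
    {"age": 82, "text": "Our family and our faith taught us duty and responsibility."},
    {"age": 67, "text": "Tradition and family guide our giving, with some focus on impact."},
    {"age": 41, "text": "We want measurable impact and a high return on every investment."},
]

ages = [d["age"] for d in letters]
moral_rates = [rate_per_1000(d["text"], MORAL_TERMS) for d in letters]
impact_rates = [rate_per_1000(d["text"], IMPACT_TERMS) for d in letters]

print("age vs. moral rate:", pearson(ages, moral_rates))
print("age vs. impact rate:", pearson(ages, impact_rates))
```

Normalizing by letter length matters here because longer letters would otherwise show more hits for every category. With only a few hundred letters, any correlation would also need a significance check, which this sketch leaves out.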
Hunch Reasoning Corpus Corpus not yet completed, but happy to make available to class. |
Hypothesis: Reasoning: Data: |
My hunch:
Reasoning: Data: |
My hunch:
Reasoning: Data: |
Hunch Reasoning Corpus |
Hypothesis: Reasoning: (A follow-up would be to find whether there are more quotations in journal-linking comments than in news- or blog-linking comments) Data: The pushshift Reddit comment data |
Hypothesis: Reasoning: Data: |
Hunch:
Reasoning: Data- |
Hypothesis: Members of the German parliament born in West Germany refer to the communist regime in the former German Democratic Republic (GDR, East Germany) in negative terms more often than those born in East Germany do. Reasoning: In general political discourse in Germany, it is very common to talk about the GDR in negative terms, pointing to the shooting of people at the Berlin Wall, the Stasi surveillance state, etc. However, many East Germans are fundamentally alienated in reunified Germany and hold positive, nostalgic sentiments towards the GDR. West Germans are largely oblivious to this, and I believe this to be true even for West Germans who now represent East German districts (which is quite common -- Chancellor Olaf Scholz is one of them, alongside Foreign Minister Annalena Baerbock). If this is true, one can make the case that this dynamic plays into the political alienation of East Germans, among whom turnout is particularly low. Data: I plan on using this dataset, containing all speeches given in the German Bundestag since 1990. I don't know how feasible it is to discuss this in class since the dataset is in German. |
Hypothesis: Reasoning: Data: |
Hypothesis: Reasoning: Data: |
Hypothesis Reasoning Corpus |
Hypothesis: Reasoning: Corpus: |
Hypothesis Reasoning Corpus |
Hypothesis: Reasoning: Corpus: |
Hunch: Reasoning: Corpus: Ideally, we can use the Google Ngrams dataset (or a portion of it). |
Hypothesis: Reasoning: Corpus:
|
Hypothesis: Reasoning: Data: |
Data used for my final project: Tweets that are identified to be sexist. Hunch: Terms or phrases in tweets that are identified to be sexist may evolve over time. Reasoning: Data and examination: |
Hypothesis: |
My hunch: Reasoning: Data: |
Hypothesis: Reasoning: Data: |
Hunch: Numeric ratings in online reviews are inflated, and the text of customer reviews and of messages exchanged with sellers can be a good complement for measuring the quality of goods and sellers. Reasoning: Past literature shows that numeric ratings are inflated (Jie et al, 2020), so they might not be good enough on their own to measure the quality of goods. There are also rich text resources: the reviews written by customers, which should be genuine and can be used to measure the quality of the goods. Another text resource can be obtained by communicating with the sellers directly and recording how soon and what they reply; these replies are genuine and can be used to check the reviews of the sellers. Data: Web-scraped data from e-commerce sites. The communication data is available, but the review data still needs to be collected. |
Hunch: Reasoning: Data: |
Hunch: Traditional wife influencers' veiled message on femininity actually appeals more to people who subscribe to alt-right beliefs. Reasoning: Traditional wife influencers usually write and talk about traditional femininity, which ranges from surface-level advice on how to act in a feminine rather than masculine way to topics such as conforming to traditional gender roles, how to treat one's partner or husband in a feminine way, and why one should be proud of one's (European) heritage. While this content may appeal to some women who wish to be feminine housewives, there is a large overlap with alt-right ideology, i.e. gender roles, white pride, anti-egalitarianism, etc. Data: |
Hypothesis: The gender difference in writing is becoming less discernible. Reasoning: There are long-standing stereotypes about female writers: that they focus on family, romantic relationships, and maternal themes, while male writers focus on more scientific and/or grand themes, and that men's wording tends to be recognizable because men are more daring and logical and have no problem using sex-related words. With improvements in gender equality, however, there are now more female writers, and they show larger in-group variation. There are women who write about science, war, alcohol, and sex. It is reasonable to assume that gender signals have become less discernible over time. Data: The data source mentioned in the orientation reading, which contains about 4% of all books ever printed and metadata about their authors. |
Hypothesis: Startups' pitches and pivots become more concrete over time, but pitches that try to differentiate themselves from the market do not necessarily lead to future investment. A differentiated strategy in business communication would lead to future success if the pitch strikes a structural balance between what is already known (for example, by using analogy) and what consumers expect from new products. Reasoning: There is a learning process for startup companies. In the beginning, startups may only have a broad or novel concept. Over time, startups learn from investors and markets by adapting their expectations. In their self-written business descriptions, companies would add more words to describe their products based on what they learn from the markets. Intuitively, novel products or ideas can be attractive to venture capitalists, but not every novel idea makes sense from the perspective of investors. Sometimes an idea might be too "novel" or too difficult to understand. Data: Crunchbase business descriptions. |
Hypothesis Data |
Hypothesis: Reasoning: Data: |
Hunch: Reasoning: Corpus: |
Hypothesis: Reasoning: Data: |
Hypothesis: Rationale: Corpora options: |
Hypothesis: Words associated with violence (e.g., "war") would be more likely to be found in game titles with larger player bases. Rationale: A key factor affecting people's motivations to play video games is engagement. Real life can often be boring or depressing, so video games are frequently used to turn one's focus away from real-life happenings. Violent media content has been shown to, on average, be more engaging than non-violent media content, so violent associations likely signal more engaging games. Additionally, violent environments have been shown to help fulfill certain needs, like autonomy (feeling in control of one's life) and comradery (since violent games often involve needing help and/or helping others). This likely contributes to the appeal of violent connotations as well. Data: The steamcharts.com webpage, which tracks player counts and game titles. |
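As an illustration of the comparison this hypothesis implies, the following minimal sketch flags titles that contain a violence-related word and compares median player counts between the two groups. The word list, the game titles, and the player counts are all invented placeholders, not data from steamcharts.com, and the function names are hypothetical.

```python
import re
from statistics import median

# Hypothetical list of violence-related words (an assumption for this sketch).
VIOLENCE_TERMS = {"war", "warfare", "combat", "kill", "battle", "strike"}


def has_violent_term(title):
    """True if any whole word in the title is in VIOLENCE_TERMS."""
    tokens = re.findall(r"[a-z]+", title.lower())
    return any(tok in VIOLENCE_TERMS for tok in tokens)


def split_by_violence(games):
    """Split (title, players) pairs into player counts for violent and other titles."""
    violent = [players for title, players in games if has_violent_term(title)]
    other = [players for title, players in games if not has_violent_term(title)]
    return violent, other


# Invented toy data: (title, average concurrent players).
games = [
    ("Total War Example", 50000),
    ("Farm Life Example", 12000),
    ("Battle Arena Example", 80000),
    ("Puzzle Garden Example", 3000),
    ("Warm Tea Simulator", 7000),  # "warm" must not match "war"
]

violent, other = split_by_violence(games)
print("median players, violent titles:", median(violent))
print("median players, other titles:", median(other))
```

Matching whole tokens rather than substrings avoids false hits such as "warm" or "award". The median is used instead of the mean because player counts tend to be dominated by a handful of very large games.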
Post your response to our challenge questions.
Articulate a one-sentence computational linguistics hunch or hypothesis regarding the distribution of words, phrases or parsed claims within your corpus relative to some variable (e.g., time, city size, number of likes), between your corpora, or between your corpus and some linguistic baseline (e.g., all current Wikipedia articles; a sample of 2020 news articles; French tweets from 2016 Paris). This need not be critical to your final project...but it could lead there. Next, in a short (2-5 sentence) paragraph, describe why you reason this hunch or hypothesis might be correct. Finally, list the corpus or corpora on which you will test it, and mention whether it could be made available to class this week for evaluation (not required...but if you offer it, you might get some free work done!) Please do NOT spend time/space explaining how you will explore your hunch or validate your hypothesis with the mentioned corpus. (Then upvote the 5 most interesting, relevant and challenging challenge responses from others).