5. Classifying Meanings & Documents - orienting #33

Open
JunsolKim opened this issue Jan 10, 2022 · 21 comments
Comments

@JunsolKim

Post questions here for this week's orienting reading: Hopkins, Daniel J. and Gary King. 2010. "A Method of Automated Nonparametric Content Analysis for Social Science." American Journal of Political Science 54(1): 229-247.

@pranathiiyer

There are social science problems where we might want to classify documents individually (for spam, hate speech, defamation, misinformation, etc.) but also to understand the proportion of documents that fall into each category. These problems can also be volatile and sensitive to time. Given population drift, how can this nonparametric method be applied meaningfully to such problems?

@GabeNicholson

I'm confused about why taking a random sample of blogs and classifying them individually (which is unbiased) can't be used to get unbiased estimates of document category proportions. Couldn't you just total the individual classifications, normalize, and estimate the population proportions from that? This makes me believe that the biased results in my example above happen only under strange conditions that aren't likely in applied settings.
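
A tiny numerical illustration of the point at issue (the 90% sensitivity and specificity values are made up): even a classifier that is quite accurate on each individual document gives a biased aggregate proportion whenever its two error rates don't happen to cancel at the true proportion.

```python
# Hypothetical two-category example: per-document accuracy of 90% in both directions.
sens, spec = 0.90, 0.90
for true_prop in (0.50, 0.20, 0.05):
    # expected value of the normalized sum of individual classifications
    est_prop = true_prop * sens + (1 - true_prop) * (1 - spec)
    print(f"true proportion {true_prop:.2f} -> expected aggregated estimate {est_prop:.2f}")
# 0.50 -> 0.50 (errors cancel), 0.20 -> 0.26, 0.05 -> 0.14: the further the true
# proportion is from the point where errors cancel, the larger the aggregate bias,
# even though every individual document is classified with 90% accuracy.
```

The bias does not shrink as the unlabeled sample grows; it shrinks only if the error rates vanish or are explicitly corrected for.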

@isaduan

isaduan commented Feb 10, 2022

I am pretty confused in general about what this paper is doing and why we care... I would really appreciate a short, clear exposition of the method!

@konratp

konratp commented Feb 10, 2022

The authors lay out two issues with existing approaches to estimating P(D): first, that the supposedly random samples are not actually random, and second, the assumption that S (the word-stem profile?) can predict D, when in reality the opposite is true. I'm wondering if someone could elaborate on why random sampling is a flawed approach, since I found the authors' explanation a bit hard to follow. Second, I don't fully understand the issue with using S to predict D, or why this is done in the first place.
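
For what it's worth, here is a minimal sketch of the identity the paper exploits: it models P(S) as P(S|D) times P(D) and solves for P(D) directly, rather than predicting D from S document by document. All of the data, stem probabilities, and category labels below are made up for illustration, and the real method averages over many random subsets of word stems with constrained estimation.

```python
import numpy as np

rng = np.random.default_rng(0)

def profile_index(doc_row):
    """Map a binary word-stem vector to an integer word-stem profile id."""
    return int("".join(map(str, doc_row)), 2)

K = 4                      # number of word stems (tiny, so all 2^K profiles are enumerable)
n_profiles = 2 ** K
categories = [0, 1, 2]     # e.g. negative / neutral / positive (hypothetical)

# each category has its own stem-occurrence probabilities; P(S|D) is held fixed everywhere
stem_probs = np.array([[0.1, 0.7, 0.3, 0.2],
                       [0.5, 0.5, 0.5, 0.5],
                       [0.8, 0.2, 0.6, 0.9]])

def simulate(n, props):
    D = rng.choice(categories, size=n, p=props)
    S = rng.binomial(1, stem_probs[D])
    return D, S

D_lab, S_lab = simulate(500, [0.6, 0.3, 0.1])    # labeled set: non-random category mix
true_props = np.array([0.2, 0.3, 0.5])
_, S_pop = simulate(5000, true_props)            # unlabeled population

# estimate P(S|D) from the labeled set and P(S) from the population
P_S_given_D = np.zeros((n_profiles, len(categories)))
for j in categories:
    ids = np.array([profile_index(r) for r in S_lab[D_lab == j]])
    P_S_given_D[:, j] = np.bincount(ids, minlength=n_profiles) / len(ids)

ids_pop = np.array([profile_index(r) for r in S_pop])
P_S = np.bincount(ids_pop, minlength=n_profiles) / len(ids_pop)

# solve P(S) = P(S|D) P(D) for P(D) by least squares, then project back onto the simplex
est, *_ = np.linalg.lstsq(P_S_given_D, P_S, rcond=None)
est = np.clip(est, 0, None)
est /= est.sum()
print("true proportions:     ", true_props)
print("estimated proportions:", est.round(3))
```

Note that the labeled set's category mix (60/30/10) deliberately differs from the population's (20/30/50), yet the estimate can still recover the population proportions, because only P(S|D), not P(D), is assumed to carry over from the hand-coded set.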

@melody1126

In the section critiquing existing methods, the authors write that "when the labeled set is not a random sample from the population, both methods fail" (p. 234). Why would the proposed alternative method work with a non-random sample and not produce skewed results?

@Sirius2713

Like some of my classmates above, I'm also confused about why estimating population proportions this way is biased. And how do the authors resolve this problem?

@ValAlvernUChic

I think it's pretty cool that this method handles the work of essentially counting and sorting our documents of interest into categories and then manually coming up with category proportions. It does seem, though, that this method is restricted to corpora where all the documents are more or less in the same domain: "the prevalence of particular word profiles in the labeled set should be the same in expectation as in the population set." Broader cultural research often draws on documents across domains (policy speeches, tweets, Reddit threads, newspapers, etc.), so I'm wondering if there is a way to MacGyver the method to account for this.

@Jiayu-Kang

Like other classmates, I found the paper a bit difficult to understand, especially the "Issues with Existing Approaches" section. I'm wondering: 1) how is aggregating individual document classifications to estimate P(D) flawed; 2) why can estimating population proportions still be biased even when classification succeeds with high accuracy; and 3) why does the approach still work with a biased classifier?

@Hongkai040

Hongkai040 commented Feb 11, 2022

After reading the paper, I think the authors are doing a cool job trying to overcome the ecological fallacy. However, in the "What can go wrong" section I think I found something tricky: "Third, each category of D should be defined so as to be mutually exclusive, exhaustive, and relatively homogeneous." Does this mark a limitation of the approach? We can't really do analysis with something that has heterogeneous attributes. And if we want our categories to be mutually exclusive and exhaustive, it effectively means we can only do classifications similar to the one the authors propose in the paper: "extremely negative (−2), negative (−1), neutral (0), positive (1), extremely positive (2), no opinion (NA), and not a blog (NB)." If that is the case, why don't we use individual-level automated classification tools and correct the aggregated results at the group level?

@NaiyuJ

NaiyuJ commented Feb 11, 2022

I have two questions about this paper: (1) Regarding the critical assumption in equation 7: why would we think the documents in the hand-coded set contain sufficiently good examples of the language used for each document category in the population, and why is this assumption more practical than those of other possible methods? (2) I'm curious which kinds of political science corpora fit this method best. There are many different classification methods; how do we know which one is better in a given research context?

@Qiuyu-Li

I think the paper is doing a very brave and excellent job here! The authors seem to be trying to establish a general procedure for text classification tasks, but I wonder how general it can be. Perhaps their method works well with blogs, but what about news, speeches, books, and other kinds of material in other languages? And what about tasks other than sentiment classification?

@mikepackard415

This is an interesting one, and it's interesting that it's boggling so many of us! I think it is useful to identify the difference in goal between classifying individual documents (and aggregating) and estimating topic proportions directly. I wonder, though, would it be possible to apply this method to get topic proportions not at the highest level but maybe within different time slices?

@Jasmine97Huang

I also got confused by the article, particularly when the authors claim that "the quantity of interest in most of the supervised learning literature is the set of individual classifications for all documents in the population... the quantity of interest for most content analyses in social science is the aggregate proportion of all (or a subset of all) of these population documents that fall into each category."

@LuZhang0128

I think this article makes a good point that computer science's models and social science's models should differ. Since a lot of data are discussed in the article, I wonder, in the "how many documents need to be hand coded" section, whether the 100-document rule applies to other data as well, or whether it should be read as a proportion. Also, since all the examples in the paper are fairly small (a few thousand instances), I wonder if the bias would naturally shrink as the sample size increased, even without the algorithm.

@hshi420

hshi420 commented Feb 11, 2022

Is this method language-specific? I would like to see its performance on other languages, especially languages from other language families (e.g., Sino-Tibetan).

@kelseywu99

I agree with other classmates that the paper is a bit obscure. Would anyone care to elaborate on the aggregation of individual classifications, specifically the methods for reversing misclassification of unlabeled documents?
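
A minimal sketch of the misclassification-correction idea the paper discusses as an existing approach: the observed classified proportions satisfy P(D-hat) = P(D-hat | D) P(D), so with the confusion matrix estimated on the labeled set one can solve for the true P(D). The confusion matrix and observed proportions below are hypothetical numbers.

```python
import numpy as np

# rows = classifier output D-hat, columns = true D, estimated from the labeled set
P_Dhat_given_D = np.array([[0.85, 0.10],
                           [0.15, 0.90]])
# raw aggregated classifier output on the unlabeled documents
P_Dhat_observed = np.array([0.25, 0.75])

# invert P(D-hat) = P(D-hat | D) P(D) to recover the corrected category proportions
P_D = np.linalg.solve(P_Dhat_given_D, P_Dhat_observed)
print(P_D)   # -> [0.2, 0.8]
```

In practice the confusion matrix is itself estimated with error and the solution is constrained to the simplex; the paper's contribution is to apply this same logic directly to word-stem profiles, skipping the individual-classification step entirely.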

@YileC928

YileC928 commented Feb 11, 2022

The paper proposes a statistical method, distinct from conventional computer science classification, to estimate document category proportions, especially in social science contexts. Though the authors emphasize that their approach is simple and does not rely on strong assumptions (i.e., random selection), I was wondering: isn't ensuring the same misclassification probability for labeled and unlabeled data also a strong assumption, one that would also require proper randomization?

@sudhamshow

A couple of what-ifs after reading the paper:

  1. From the textbook (Text as Data) we see how important word counts are for finding discriminating words and for grouping documents together by words and topics. By considering only the presence or absence of a word, does this method sacrifice accuracy in the trained classification model?
  2. Since the hand coding was done for a limited number of documents from a time frame of very particular interest (November 2006), is there a possibility this could bias predictions? Would the results be reproducible had the hand-coders used data from a different time frame?

@chentian418

First, I was wondering what the advantage is of estimating the proportion of documents in given categories over making broad characterizations about the whole set of documents in a social science context. When does individual-level classification become unimportant?

Second, I am confused about the process of this supervised learning. Does each text have a pre-determined category of interest?

Thanks!

@Emily-fyeh

I wonder how the method in this paper would really outperform others in different cases, and how it would justify itself. The paper does point out that in practice we often want to know the proportion of documents in each category, but it does not persuade me to embrace the new method.

@ttsujikawa

As some classmates ask above, I was wondering how estimated document proportions are leveraged in social science settings. How does the bias of the method relate to research in social science?
