5. Classifying Meanings & Documents - orienting #33

Open
JunsolKim opened this issue Jan 10, 2022 · 21 comments
Comments

@JunsolKim

Post questions here for this week's orienting reading: Hopkins, Daniel J. and Gary King. 2010. "A Method of Automated Nonparametric Content Analysis for Social Science." American Journal of Political Science 54(1): 229-247.

@pranathiiyer

There are social science problems where we might want to classify documents individually (for spam, hate speech, defamation, misinformation, etc.) but also to understand the proportion of documents that fall into each category. These problems can also be volatile and sensitive to time. Given population drift, how can this nonparametric method be applied meaningfully to such problems?

@GabeNicholson

I'm confused about why taking a random sample of blogs and classifying them individually (which is unbiased) can't be used to get unbiased estimates of document category proportions. Couldn't you just total the individual classifications, normalize, and estimate the population proportions from that? This makes me believe that the biased results in my example above happen only under strange conditions that aren't likely in applied settings.
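
A tiny numerical illustration of the point at issue (the 90% sensitivity and specificity values are made up): even a classifier that is quite accurate on each individual document gives a biased aggregate proportion whenever its two error rates don't happen to cancel at the true proportion.

```python
# Hypothetical two-category example: per-document accuracy of 90% in both directions.
sens, spec = 0.90, 0.90
for true_prop in (0.50, 0.20, 0.05):
    # expected value of the normalized sum of individual classifications
    est_prop = true_prop * sens + (1 - true_prop) * (1 - spec)
    print(f"true proportion {true_prop:.2f} -> expected aggregated estimate {est_prop:.2f}")
# 0.50 -> 0.50 (errors cancel), 0.20 -> 0.26, 0.05 -> 0.14: the further the true
# proportion is from the point where errors cancel, the larger the aggregate bias,
# even though every individual document is classified with 90% accuracy.
```

The bias does not shrink as the unlabeled sample grows; it shrinks only if the error rates vanish or are explicitly corrected for.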

@isaduan

isaduan commented Feb 10, 2022

I am pretty confused in general about what this paper is doing and why we care... I would really appreciate a short, clear exposition of the method!

@konratp

konratp commented Feb 10, 2022

The authors lay out two issues with existing approaches to estimating P(D): first, that the supposedly random samples are not actually random, and second, the assumption that S (the word-stem profile?) can predict D, when in reality the opposite is true. I'm wondering if someone could elaborate on why random sampling is a flawed approach, since I found the authors' explanation a bit hard to follow. Second, I don't fully understand the issue with using S to predict D, or why this is done in the first place.
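
For what it's worth, here is a minimal sketch of the identity the paper exploits: it models P(S) as P(S|D) times P(D) and solves for P(D) directly, rather than predicting D from S document by document. All of the data, stem probabilities, and category labels below are made up for illustration, and the real method averages over many random subsets of word stems with constrained estimation.

```python
import numpy as np

rng = np.random.default_rng(0)

def profile_index(doc_row):
    """Map a binary word-stem vector to an integer word-stem profile id."""
    return int("".join(map(str, doc_row)), 2)

K = 4                      # number of word stems (tiny, so all 2^K profiles are enumerable)
n_profiles = 2 ** K
categories = [0, 1, 2]     # e.g. negative / neutral / positive (hypothetical)

# each category has its own stem-occurrence probabilities; P(S|D) is held fixed everywhere
stem_probs = np.array([[0.1, 0.7, 0.3, 0.2],
                       [0.5, 0.5, 0.5, 0.5],
                       [0.8, 0.2, 0.6, 0.9]])

def simulate(n, props):
    D = rng.choice(categories, size=n, p=props)
    S = rng.binomial(1, stem_probs[D])
    return D, S

D_lab, S_lab = simulate(500, [0.6, 0.3, 0.1])    # labeled set: non-random category mix
true_props = np.array([0.2, 0.3, 0.5])
_, S_pop = simulate(5000, true_props)            # unlabeled population

# estimate P(S|D) from the labeled set and P(S) from the population
P_S_given_D = np.zeros((n_profiles, len(categories)))
for j in categories:
    ids = np.array([profile_index(r) for r in S_lab[D_lab == j]])
    P_S_given_D[:, j] = np.bincount(ids, minlength=n_profiles) / len(ids)

ids_pop = np.array([profile_index(r) for r in S_pop])
P_S = np.bincount(ids_pop, minlength=n_profiles) / len(ids_pop)

# solve P(S) = P(S|D) P(D) for P(D) by least squares, then project back onto the simplex
est, *_ = np.linalg.lstsq(P_S_given_D, P_S, rcond=None)
est = np.clip(est, 0, None)
est /= est.sum()
print("true proportions:     ", true_props)
print("estimated proportions:", est.round(3))
```

Note that the labeled set's category mix (60/30/10) deliberately differs from the population's (20/30/50), yet the estimate can still recover the population proportions, because only P(S|D), not P(D), is assumed to carry over from the hand-coded set.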

@melody1126

In the section critiquing existing methods, the authors write that "when the labeled set is not a random sample from the population, both methods fail" (p. 234). Why would the proposed alternative method work with a non-random sample and not produce skewed results?

@Sirius2713

Like some of my classmates above, I'm also confused about why estimating population proportions this way is biased. And how do the authors resolve this problem?

@ValAlvernUChic

I think it's pretty cool that this method handles the work of essentially counting and sorting our documents of interest into categories and then manually coming up with category proportions. It does seem, though, that this method is restricted to corpora where all the documents are more or less in the same domain: "the prevalence of particular word profiles in the labeled set should be the same in expectation as in the population set." Broader cultural research often draws on documents across domains (policy speeches, tweets, Reddit threads, newspapers, etc.), so I'm wondering if there is a way to MacGyver the method to account for this.

@Jiayu-Kang

Like other classmates, I found the paper a bit difficult to understand, especially the "Issues with Existing Approaches" section. I'm wondering: 1) how is aggregating individual document classifications to estimate P(D) flawed; 2) why can estimating population proportions still be biased even when classification succeeds with high accuracy; and 3) why does the approach still work with a biased classifier?

@Hongkai040

Hongkai040 commented Feb 11, 2022

After reading the paper, I think the authors are doing a cool job trying to overcome the ecological fallacy. However, in the "What can go wrong" section I think I found something tricky: "Third, each category of D should be defined so as to be mutually exclusive, exhaustive, and relatively homogeneous." Does this mark a limitation of the approach? We can't really do analysis with something that has heterogeneous attributes. And if we want our categories to be mutually exclusive and exhaustive, it effectively means we can only do classifications similar to the one the authors propose in the paper: "extremely negative (−2), negative (−1), neutral (0), positive (1), extremely positive (2), no opinion (NA), and not a blog (NB)." If that is the case, why don't we use individual-level automated classification tools and correct the aggregated results at the group level?

@NaiyuJ

NaiyuJ commented Feb 11, 2022

I have two questions about this paper: (1) Regarding the critical assumption in equation 7: why would we think the documents in the hand-coded set contain sufficiently good examples of the language used for each document category in the population, and why is this assumption more practical than those of other possible methods? (2) I'm curious which kinds of political science corpora fit this method best. There are many different classification methods; how do we know which one is better in a given research context?

@Qiuyu-Li

I think the paper is doing a very brave and excellent job here! The authors seem to be trying to establish a general procedure for text classification tasks, but I wonder how general it can be. Perhaps their method works well with blogs, but what about news, speeches, books, and other kinds of material in other languages? And what about tasks other than sentiment classification?

@mikepackard415

This is an interesting one, and it's interesting that it's boggling so many of us! I think it is useful to identify the difference in goal between classifying individual documents (and aggregating) and estimating topic proportions directly. I wonder, though, would it be possible to apply this method to get topic proportions not at the highest level but maybe within different time slices?

@Jasmine97Huang

I also got confused by the article, particularly when the authors claim that "the quantity of interest in most of the supervised learning literature is the set of individual classifications for all documents in the population... the quantity of interest for most content analyses in social science is the aggregate proportion of all (or a subset of all) of these population documents that fall into each category."

@LuZhang0128

I think this article makes a good point that computer science's models and social science's models should differ. Since a lot of data are discussed in the article, I wonder, in the "how many documents need to be hand coded" section, whether the 100-document rule applies to other data as well, or whether it should be read as a proportion. Also, since all the examples in the paper are fairly small (a few thousand instances), I wonder if the bias would naturally shrink as the sample size increased, even without the algorithm.

@hshi420

hshi420 commented Feb 11, 2022

Is this method language-specific? I would like to see its performance on other languages, especially languages from other language families (e.g., Sino-Tibetan).

@kelseywu99

I agree with other classmates that the paper is a bit obscure. Would anyone care to elaborate on the aggregation of individual classifications, specifically the methods for reversing misclassification of unlabeled documents?
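
A minimal sketch of the misclassification-correction idea the paper discusses as an existing approach: the observed classified proportions satisfy P(D-hat) = P(D-hat | D) P(D), so with the confusion matrix estimated on the labeled set one can solve for the true P(D). The confusion matrix and observed proportions below are hypothetical numbers.

```python
import numpy as np

# rows = classifier output D-hat, columns = true D, estimated from the labeled set
P_Dhat_given_D = np.array([[0.85, 0.10],
                           [0.15, 0.90]])
# raw aggregated classifier output on the unlabeled documents
P_Dhat_observed = np.array([0.25, 0.75])

# invert P(D-hat) = P(D-hat | D) P(D) to recover the corrected category proportions
P_D = np.linalg.solve(P_Dhat_given_D, P_Dhat_observed)
print(P_D)   # -> [0.2, 0.8]
```

In practice the confusion matrix is itself estimated with error and the solution is constrained to the simplex; the paper's contribution is to apply this same logic directly to word-stem profiles, skipping the individual-classification step entirely.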

@YileC928

YileC928 commented Feb 11, 2022

The paper proposes a statistical method, distinct from conventional computer science classification, to estimate document category proportions, especially in social science contexts. Though the authors emphasize that their approach is simple and does not rely on strong assumptions (i.e., random selection), I was wondering: isn't ensuring the same misclassification probability for labeled and unlabeled data also a strong assumption, one that would also require proper randomization?

@sudhamshow

A couple of what-ifs after reading the paper:

  1. From the textbook (Text as Data) we see how important word counts are for finding discriminating words and for grouping documents together by words and topics. By considering only the presence or absence of a word, does this method sacrifice accuracy in the trained classification model?
  2. Since the hand coding was done for a limited number of documents from a time frame of very particular interest (November 2006), is there a possibility this could bias predictions? Would the results be reproducible had the hand-coders used data from a different time frame?

@chentian418

First, I was wondering what the advantage is of estimating the proportion of documents in given categories over making broad characterizations about the whole set of documents in a social science context. When does individual-level classification become unimportant?

Second, I am confused about the process of this supervised learning. Does each text have a pre-determined category of interest?

Thanks!

@Emily-fyeh

I wonder how the method in this paper would really outperform others in different cases, and how it would justify itself. The paper does point out that in practice we often want to know the proportion of documents in each category, but it does not persuade me to embrace the new method.

@ttsujikawa

As some classmates ask above, I was wondering how estimated document proportions are leveraged in social science settings. How does the bias of the method relate to research in social science?
