
Credibility Of LDA #34

Open
amritbhanu opened this issue Oct 11, 2016 · 4 comments

@amritbhanu
Contributor

amritbhanu commented Oct 11, 2016

IDEA:

ACTUAL: the dominant topic assigned to each document from its doc-topic distribution.

          T1        T2         T3      .. .. . . .
Doc1
Doc2
Doc3

PREDICTED: the topic whose top words best match the words in the document.

          W1        W2         W3      .. .. . . .
Doc1
Doc2
Doc3

**According to the literature, if a document is hard-assigned to its dominant
topic, the top words of that topic should appear in the actual document. If they do not:
 - the probability of the dominant topic is very low, and another topic might be
a better choice of dominant topic;
- or the top words were wrongly selected, and the word weights could be used to
better identify the same dominant topic.**
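The check above can be sketched in a few lines. This is a minimal illustration, not the issue's actual code; the function name and the sample words are hypothetical.

```python
# Hedged sketch: measure how many of the dominant topic's top words
# actually occur in a document (hard-assignment sanity check).

def top_words_present(doc_words, dominant_top_words):
    """Return the fraction of the dominant topic's top words found in the document."""
    doc = set(doc_words)
    return sum(w in doc for w in dominant_top_words) / len(dominant_top_words)

# Hypothetical example: a bug-report document and a topic's top-7 words.
doc = ["bug", "crash", "stack", "trace", "fix"]
top7 = ["bug", "crash", "error", "fix", "fail", "trace", "exception"]
print(top_words_present(doc, top7))  # 4/7, i.e. about 0.571
```

A low fraction here would point to either of the two failure modes listed above: a weak dominant topic, or badly chosen top words.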

Experiment:

  • Once the top n words are selected for each topic, each topic is represented by those n words.
  • The dominant topic (from the doc-topic distribution) is selected for each document; we call that the actual topic.
  • For each document, we count how many of each topic's n words appear in it, say m of them. The topic with the largest m becomes the predicted topic for that document.
We now have x documents. For example, x=4, k (no. of topics)=3.
For x=4, we have [D1, D2, D3, D4]
Actual    = [1, 1, 2, 0]
Predicted = [1, 0, 2, 0]
The score is 3/4 = 0.75 (three of the four documents agree).
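The experiment above can be sketched end-to-end. This is a minimal illustration under the stated assumptions, not the actual experimental code; the helper names (`predict_topic`, `agreement`) are hypothetical.

```python
# Hedged sketch of the scoring step: predict a topic per document by
# top-word overlap, then score agreement with the actual dominant topics.

def predict_topic(doc_words, topic_top_words):
    """Pick the topic whose top-n words overlap most with the document."""
    overlaps = [len(set(doc_words) & set(words)) for words in topic_top_words]
    return overlaps.index(max(overlaps))  # ties go to the lowest topic id

def agreement(actual, predicted):
    """Fraction of documents where predicted topic matches the actual one."""
    hits = sum(a == p for a, p in zip(actual, predicted))
    return hits / len(actual)

# Worked example from the issue: x=4 documents, k=3 topics.
actual = [1, 1, 2, 0]
predicted = [1, 0, 2, 0]
print(agreement(actual, predicted))  # 0.75
```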

Results:

  • Higher is better (see the attached results file).

Conclusion:

  • Tuned LDA with the top 7 words performs much better than untuned (default, k=10) with the top 7 words.
  • Tuned with the top 7 words performs as well as or better than untuned (default, k=10) with the top 10 words.
  • With tuning, we get better top 7 words defining each topic.
@timm

timm commented Oct 12, 2016

plz clarify:

  • was this using LDA terms as input for a subsequent use of SVM?
  • the above results show 5 cases where tuned was as good or better than the other things. so why are you reporting this as a negative result?

@amritbhanu
Contributor Author

amritbhanu commented Oct 12, 2016

We have 2 tracks in LDA now:

  • one for reporting stable conclusions (related to model stability).
  • another for using LDA features in an SVM (related to classification).

This issue is related to the first track. We want to report stable topic generation, and that only the top 7 words are important after tuning, rather than reporting 10 words with the defaults.

I am reporting positive results.

@timm

timm commented Oct 12, 2016

ur reporting positive results for...

  1. for reporting stable conclusions.
  2. another one for using LDA features into svm.

@amritbhanu
Contributor Author

just for the first right now.
