Simplify LDA input parameterization #143

Open: wants to merge 1 commit into base: master

gokceneraslan

I tried to simplify the LDA input representation by using a single M x V matrix of word frequencies, where M is the number of documents and V is the number of distinct words. The model now iterates over each element of the M x V matrix instead of over every word of every document.
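
For readers who want to see the representation concretely, here is a minimal sketch in plain Python (not Stan) that builds such an M x V matrix from per-document lists of word ids. The function name `count_matrix` and the toy corpus are hypothetical, and word ids are assumed to be 1-based, as they would be in Stan.

```python
def count_matrix(docs, V):
    """Return an M x V list of lists; entry [i][j] counts word id j + 1 in document i."""
    counts = [[0] * V for _ in docs]
    for i, doc in enumerate(docs):
        for w in doc:
            if not 1 <= w <= V:
                raise ValueError(f"word id {w} is outside 1..{V}")
            counts[i][w - 1] += 1
    return counts


# Toy corpus (hypothetical): M = 2 documents over a vocabulary of V = 4 words.
docs = [[1, 3, 3], [2, 2, 2, 4]]
counts = count_matrix(docs, V=4)
print(counts)  # [[1, 0, 2, 0], [0, 3, 0, 1]]
```

Word order inside a document is discarded, which is acceptable for a bag-of-words model such as LDA.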

@bob-carpenter
Contributor

Thanks for submitting. I've been out for a while, so haven't been able to review this, but I'll get to it ASAP.

Contributor

@bob-carpenter left a comment

Please just add new models rather than replacing the existing ones.

The new implementations have very different memory properties (which will only be better in some dense cases).
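
A rough way to see the memory trade-off, sketched in plain Python. It assumes the existing model stores one word id and one document id per token (two integers per token) while the new one stores a dense M x V matrix; the helper `storage_cells` and both toy corpora are hypothetical.

```python
def storage_cells(docs, V):
    """Integers stored by each layout: (token list, dense M x V count matrix)."""
    n_tokens = sum(len(doc) for doc in docs)
    token_list = 2 * n_tokens  # assumed layout: one word id + one document id per token
    dense = len(docs) * V      # one count per (document, word) pair, zeros included
    return token_list, dense


# Short documents over a large vocabulary: the dense matrix is mostly zeros.
sparse_case = storage_cells([[1, 5, 9, 200, 731], [2, 5, 88, 412, 999]], V=1000)

# Long documents over a small vocabulary: the dense matrix is far smaller.
dense_case = storage_cells([[1] * 500, [2] * 500], V=10)

print(sparse_case, dense_case)  # (20, 2000) (2000, 20)
```

The dense layout only wins when the number of tokens is large relative to M x V, which matches the "some dense cases" caveat above.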

if (count > 0) {
  for (k in 1:K) {
    gamma[k] = (log(theta[i,k]) + log(phi[k,j]))*count;
  }

Rather than looping to define gamma, a one-liner will do it:

gamma = count * (log(theta[i, ]) + log(phi[, j]));

The loop works for the count == 0 case (though a bit inefficiently), but presumably there aren't any zero-length documents in well-formed data sets, so I'd just replace this whole loop with:

for (i in 1:M)
  for (j in 1:V)
    target += log_sum_exp(count * (log(theta[i, ]) + log(phi[, j])));

increment_log_prob is deprecated; this model has been around for a while without being updated.
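
As a numeric sanity check on the vectorized form, the following plain-Python sketch (not Stan) compares the element-wise loop with the one-line version for a single (document, word) pair. The values of `theta_i`, `phi_j`, and `count` are made up for illustration. It also evaluates the variant with `count` multiplied outside `log_sum_exp`, which is a different quantity whenever count > 1; which of the two the model intends is not settled in this thread.

```python
import math


def log_sum_exp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


theta_i = [0.7, 0.2, 0.1]    # hypothetical topic proportions for one document
phi_j = [0.05, 0.30, 0.10]   # hypothetical probability of one word under each topic
count = 3                    # hypothetical count of that word in that document

# Element-wise loop, mirroring the quoted diff.
gamma_loop = []
for k in range(len(theta_i)):
    gamma_loop.append((math.log(theta_i[k]) + math.log(phi_j[k])) * count)

# Vectorized form, mirroring the suggested one-liner.
log_joint = [math.log(t) + math.log(p) for t, p in zip(theta_i, phi_j)]
gamma_vec = [count * g for g in log_joint]

inside = log_sum_exp(gamma_vec)           # count applied inside log_sum_exp
outside = count * log_sum_exp(log_joint)  # count applied outside log_sum_exp

print(all(math.isclose(a, b) for a, b in zip(gamma_loop, gamma_vec)))
print(inside, outside)
```

With count == 0 the inside form evaluates to log(K), a constant, so dropping the `if (count > 0)` guard shifts the log density by a constant rather than changing its shape.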

for (m in 1:K) {
  Sigma[m,m] <- sigma[m] * sigma[m] * Omega[m,m];

Thanks for replacing <-. That would be a good change for the original model as well, along with replacing increment_log_prob with target +=.

square(sigma[m]) will be a bit more efficient, as would be sigma[m]^2.
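
To illustrate what the quoted line computes, here is a small plain-Python sketch (not Stan). Only the diagonal assignment appears in the diff; the off-diagonal rule sigma[m] * sigma[n] * Omega[m][n] is assumed here as the usual way to combine scales with a correlation matrix, and the numbers are invented.

```python
def square(x):
    """Stand-in for Stan's square(); x * x."""
    return x * x


sigma = [0.5, 2.0, 1.5]        # hypothetical scales
Omega = [[1.0, 0.3, 0.0],      # hypothetical correlation matrix
         [0.3, 1.0, -0.2],
         [0.0, -0.2, 1.0]]
K = len(sigma)

# Assumed full construction: Sigma[m][n] = sigma[m] * sigma[n] * Omega[m][n].
Sigma = [[sigma[m] * sigma[n] * Omega[m][n] for n in range(K)] for m in range(K)]

# Diagonal written with square(), as suggested in the review.
diag = [square(sigma[m]) * Omega[m][m] for m in range(K)]

print([Sigma[m][m] for m in range(K)], diag)
```

Both spellings give the same diagonal; the suggestion concerns efficiency and readability in Stan, not the result.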

@bob-carpenter
Contributor

Oh, and I'd suggest adding suffixes to existing model names like _counts to indicate you're taking sufficient stats rather than the raw data.

@gokceneraslan
Author

> Oh, and I'd suggest adding suffixes to existing model names like _counts to indicate you're taking sufficient stats rather than the raw data.

You mean adding _counts to the new model? Because it's the one that uses counts.

@bob-carpenter
Contributor

bob-carpenter commented Nov 18, 2018 via email
