Assess the difference between two stats methods #10

Open
chrismbryant opened this issue Mar 23, 2020 · 3 comments
Labels
question Further information is requested

Comments

@chrismbryant
Owner

For each Amazon product, there is a distribution of star ratings (the number of ratings per star value). To compute the probability of a positive experience (i.e. the satisfaction probability) with the product, we'd like to label 4- and 5-star ratings as "positive" and 1-, 2-, and 3-star ratings as "not positive". Given this labeling, we assume that each rating can be viewed as an independent Bernoulli trial with a fixed (but unknown) probability of success. Under this assumption, we can use the Beta distribution to compute a confidence interval on our measurement in a straightforward way.
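A minimal sketch of the histogram-based estimate described above. The issue does not say which prior or interval type is used, so a uniform Beta(1, 1) prior and an equal-tailed 95% interval are assumptions here, and the function names are hypothetical. To stay within the standard library, the interval is approximated by sampling with `random.betavariate` rather than computed from an exact quantile function.

```python
import random


def beta_interval(successes, failures, level=0.95, draws=20000, seed=0):
    """Approximate an equal-tailed interval for Beta(successes + 1, failures + 1).

    Assumption: uniform Beta(1, 1) prior. Quantiles are estimated from
    sorted random draws, so the bounds are approximate.
    """
    rng = random.Random(seed)
    samples = sorted(
        rng.betavariate(successes + 1, failures + 1) for _ in range(draws)
    )
    tail = (1.0 - level) / 2.0
    lower = samples[int(tail * draws)]
    upper = samples[int((1.0 - tail) * draws) - 1]
    return lower, upper


def satisfaction_from_histogram(star_counts, level=0.95):
    """star_counts maps a star value (1..5) to its number of ratings."""
    total = sum(star_counts.values())
    if total <= 0:
        raise ValueError("need at least one rating")
    # 4 and 5 stars count as "positive"; 1, 2, and 3 stars do not.
    positive = star_counts.get(4, 0) + star_counts.get(5, 0)
    point = positive / total
    lower, upper = beta_interval(positive, total - positive, level)
    return point, lower, upper


# Hypothetical histogram, for illustration only.
point, lower, upper = satisfaction_from_histogram({1: 5, 2: 5, 3: 10, 4: 30, 5: 50})
print(point, lower, upper)
```

With more ratings the interval narrows, which is the behavior we want the displayed confidence to reflect.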

However, it is not feasible to fetch the full histogram of star ratings for every product on a page immediately on page load (Amazon locks you out if you fetch these results too quickly, and rate limiting results in a poor user experience). Instead, the only values readily available to us are the average star rating and the number of ratings for each product. Given these two pieces of information, we can obtain an alternative satisfaction probability and confidence interval by linearly scaling the average star rating from the range [1, 5] to a success probability in the range [0, 1], then building a confidence interval from that proportion in the same way as before.
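The average-based alternative can be sketched the same way. The scaling is the linear map p = (average - 1) / 4 described above; treating p times the number of ratings as a (possibly fractional) success count is my reading of "in the same way as before", not something the issue spells out. The uniform prior, the 95% level, and the function names are assumptions as well.

```python
import random


def beta_interval(alpha, beta, level=0.95, draws=20000, seed=0):
    """Approximate an equal-tailed interval for Beta(alpha, beta) by sampling."""
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(alpha, beta) for _ in range(draws))
    tail = (1.0 - level) / 2.0
    return samples[int(tail * draws)], samples[int((1.0 - tail) * draws) - 1]


def satisfaction_from_average(avg_stars, num_ratings, level=0.95):
    """Estimate satisfaction from the average star rating and rating count."""
    if not 1.0 <= avg_stars <= 5.0:
        raise ValueError("average rating must lie in [1, 5]")
    if num_ratings <= 0:
        raise ValueError("need at least one rating")
    # Linearly rescale [1, 5] stars to a proportion in [0, 1].
    p = (avg_stars - 1.0) / 4.0
    # Assumption: treat p * n as pseudo-successes, plus a uniform Beta(1, 1) prior.
    pseudo_successes = p * num_ratings
    pseudo_failures = (1.0 - p) * num_ratings
    lower, upper = beta_interval(pseudo_successes + 1.0, pseudo_failures + 1.0, level)
    return p, lower, upper


# Hypothetical product: 4.2 stars from 250 ratings.
print(satisfaction_from_average(4.2, 250))
```

Note that the pseudo-counts are generally not integers, so this is no longer a literal count of Bernoulli successes, only a quantity shaped like one.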

What are the consequences of choosing the "average star rating" as a proxy for "proportion of 4 or 5 star ratings"? Can the Beta-distribution-derived confidence interval still be trusted?
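One way to get a feel for the question is to compare the two point estimates on a few extreme histograms. The sketch below uses made-up histograms purely for illustration, and the helper names are hypothetical. It only compares point estimates and says nothing about how Amazon actually computes its displayed average.

```python
def proportion_positive(star_counts):
    """Fraction of ratings that are 4 or 5 stars."""
    total = sum(star_counts.values())
    return (star_counts.get(4, 0) + star_counts.get(5, 0)) / total


def scaled_average(star_counts):
    """Mean star rating, linearly rescaled from [1, 5] to [0, 1]."""
    total = sum(star_counts.values())
    mean = sum(star * count for star, count in star_counts.items()) / total
    return (mean - 1.0) / 4.0


# Hypothetical histograms chosen to show where the two estimates diverge.
examples = {
    "all 4-star": {4: 100},
    "all 3-star": {3: 100},
    "half 1-star, half 5-star": {1: 50, 5: 50},
}
for name, counts in examples.items():
    print(name, proportion_positive(counts), scaled_average(counts))
```

For a product with only 4-star ratings, the proportion of positive ratings is 1.0 while the scaled average is 0.75; for only 3-star ratings they are 0.0 and 0.5. The two agree at 0.5 for the evenly polarized case. So the proxy can sit noticeably above or below the "4 or 5 star" proportion depending on the shape of the histogram.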

@chrismbryant chrismbryant added the question Further information is requested label Mar 23, 2020
@aeciorc
Collaborator

aeciorc commented Apr 9, 2020

@chrismbryant @musicin3d, I just noticed this disclaimer under the reviews:
[Image: Screen Shot 2020-04-09 at 1 42 11 PM]

Could this mean that using the average star rating actually serves us better?

@chrismbryant
Owner Author

Interesting! I had no idea. I think you might be right, then, that it's probably better to use the average star rating rather than the raw star distribution.

@musicin3d
Collaborator

I have returned from my sulking. I needed some time to accept that all my work was in vain. ;)

Since we've realized the averages serve us better, are we ready to close this issue or are there questions remaining?

3 participants