For each Amazon product, there exists a distribution of star ratings (number of ratings per star value). To compute the probability of positive experience (i.e. satisfaction probability) with the product, we'd like to assign 4 and 5 stars the binary label of "positive" and {1, 2, 3} stars the binary label of "not positive". Given this assignment, we assume that each rating can be viewed as an independent Bernoulli trial with fixed (but unknown) probability of success. Taking this assumption, we can use the Beta distribution to compute a confidence interval on our measurement in a straightforward way.
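As a sketch of the histogram approach, here is a minimal stdlib-only Python example. The star counts are hypothetical, and the interval uses a normal approximation to the Beta(successes + 1, failures + 1) posterior (uniform prior) rather than exact Beta quantiles, to avoid a scipy dependency:

```python
import math

def beta_interval(successes: int, failures: int, z: float = 1.96):
    """Approximate 95% interval for a Bernoulli success probability,
    using the mean and variance of a Beta(successes+1, failures+1)
    posterior with a normal approximation (no scipy needed)."""
    a, b = successes + 1, failures + 1
    mean = a / (a + b)
    var = (a * b) / ((a + b) ** 2 * (a + b + 1))
    half = z * math.sqrt(var)
    return max(0.0, mean - half), min(1.0, mean + half)

# Hypothetical star histogram: {star value: number of ratings}.
# Stars 4 and 5 are labeled "positive"; 1-3 are "not positive".
hist = {1: 10, 2: 5, 3: 15, 4: 40, 5: 130}
positive = hist[4] + hist[5]
negative = hist[1] + hist[2] + hist[3]
lo, hi = beta_interval(positive, negative)
```

With exact Beta quantiles (e.g. `scipy.stats.beta.ppf`) the endpoints differ slightly, but for the rating counts typical of Amazon products the normal approximation is close.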
However, it is not feasible to access the full histogram of star ratings for a full page of products immediately on page load (Amazon locks you out if you fetch these results too quickly, and rate limiting results in a poor user experience). Instead, the only variables readily available to us are the average star rating and the number of ratings for each product. Given these two pieces of information, we can obtain an alternative satisfaction probability and confidence interval by linearly scaling the average star rating in the range [1, 5] to a success probability in the range [0, 1], then building a confidence interval from that proportion in the same way as before.
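The scaled-average alternative can be sketched the same way. This is an assumption-laden illustration: it treats the rescaled average as a binomial proportion and uses fractional pseudo-counts with the same normal-approximated Beta interval as the histogram version:

```python
import math

def scaled_rating_interval(avg_stars: float, n_ratings: int, z: float = 1.96):
    """Satisfaction interval from only the average star rating and the
    rating count: linearly rescale [1, 5] -> [0, 1], then treat the
    result as a binomial proportion with fractional pseudo-counts and
    apply a normal approximation to the Beta posterior."""
    p = (avg_stars - 1.0) / 4.0          # linear rescale to [0, 1]
    successes = p * n_ratings            # fractional pseudo-successes
    failures = (1.0 - p) * n_ratings
    a, b = successes + 1, failures + 1   # Beta(a, b), uniform prior
    mean = a / (a + b)
    var = (a * b) / ((a + b) ** 2 * (a + b + 1))
    half = z * math.sqrt(var)
    return max(0.0, mean - half), min(1.0, mean + half)

# Hypothetical product: 4.4 average stars across 200 ratings.
lo, hi = scaled_rating_interval(4.4, 200)
```

Note the scaled average and the histogram proportion generally disagree: a product whose ratings are all 3 stars scales to p = 0.5, while its true proportion of 4-or-5-star ratings is 0.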
What are the consequences of choosing the "average star rating" as a proxy for "proportion of 4 or 5 star ratings"? Can the Beta-distribution-derived confidence interval still be trusted?
Interesting! I had no idea. I think you might be right then that it’s probably better to use the average star rating rather than the raw star distribution.