Different people often hold quite different opinions of the same object, which typically leads to a broad distribution of ratings for it.

Usually the variance of the distribution of ratings is dismissed with a sentence like “taste is personal” or “beauty is in the eye of the beholder”. We rarely ask which facets of a person influence taste, or which parts of the eye define beauty.

Before buying an item, we usually compare the averages of the ratings and buy the item with the greatest one. We assume we are the “average reviewer”, and hence that our satisfaction with the object will, in expectation, equal that average rating. This is surely a good prior, but what if we could refine the estimate? What if we could look only at reviews written by people “with the same eyes”, or weight each rating by our similarity to the reviewer who wrote it?

An adjacent problem, that of recommender systems, is at the core of various tech companies (e.g. Netflix). Their approach, though, differs from ours. Their goal is to find products of theirs that a given user is expected to like, based on the user's metadata and similarity with other users. Since explainability brings zero marginal revenue, huge neural networks are the workhorses of those systems, providing accurate predictions but little insight into human nature.

We instead want a transparent model that will allow us to test hypotheses about what underlies the ratings we give to an object. In this way we will see whether we can find those “facets of a person” that determine taste and thereby explain that variance.



Our analysis is based on the BeerAdvocate dataset. It consists of reviews (numerical and textual) of beers, together with metadata about users, beers and breweries, collected from 2001 to 2017 from the beer-reviewing website BeerAdvocate.

The dataset was cleaned by removing reviews without a textual part or with difficult-to-analyze text, reviews by users whose nationality was unknown, and reviews of beers with only a single review. This took us from 8,393,032 rows down to 2,457,794 rows, each containing the rating and the various features collected and engineered for a review.

We decompose a given rating as rating = beer_quality + “taste_buds” and focus on the second addend. We think it is appropriate to define beer_quality as the average of the ratings for a given beer.
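The decomposition above can be sketched in a few lines: beer_quality is the per-beer mean rating, and “taste_buds” is the residual each rating leaves after subtracting it. The toy reviews below are hypothetical, not taken from the dataset.

```python
from collections import defaultdict

# Hypothetical toy reviews: (user, beer, rating) tuples.
reviews = [
    ("alice", "ipa_x", 4.0),
    ("bob",   "ipa_x", 3.0),
    ("carol", "ipa_x", 5.0),
    ("alice", "stout_y", 2.0),
    ("bob",   "stout_y", 4.0),
]

# beer_quality: the mean rating of each beer.
totals = defaultdict(lambda: [0.0, 0])
for _, beer, rating in reviews:
    totals[beer][0] += rating
    totals[beer][1] += 1
beer_quality = {beer: s / n for beer, (s, n) in totals.items()}

# taste_buds: the user's deviation from the beer's mean rating.
taste_buds = [(user, beer, rating - beer_quality[beer])
              for user, beer, rating in reviews]

print(beer_quality)   # {'ipa_x': 4.0, 'stout_y': 3.0}
print(taste_buds[0])  # ('alice', 'ipa_x', 0.0)
```

By construction, the “taste_buds” residuals for any one beer sum to zero, so everything they carry is variation between reviewers, not between beers.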



Above is the distribution of the standard deviations of the ratings for each beer. Using D’Agostino and Pearson’s normality test, we cannot reject that it is normally distributed, with mean 0 and variance 0.5.
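The D’Agostino–Pearson omnibus test is available as `scipy.stats.normaltest`; the sketch below runs it on synthetic stand-in values (the `per_beer_std` sample here is made up for illustration, not our actual data).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical stand-in for the per-beer standard deviations.
per_beer_std = rng.normal(loc=0.4, scale=0.1, size=500)

# D'Agostino-Pearson omnibus test; H0 = the sample comes from a
# normal distribution, combining skewness and kurtosis statistics.
stat, p_value = stats.normaltest(per_beer_std)
print(f"statistic={stat:.3f}, p-value={p_value:.3f}")
# A large p-value means we cannot reject normality.
```

Note that the test only assesses the shape of the distribution; the mean and variance reported above are sample estimates, not outputs of the test.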


As we can see here, the distribution of ratings for a given beer and the distribution of average ratings across beers are quite similar in spread, as indicated by their standard deviations (0.39 on average, and 0.46, respectively). This underscores the significance of our analysis: there is almost as much to gain from predicting where a user falls within a given beer’s rating distribution as from predicting where the beer falls in the distribution across beers.


In this section we introduce the features we expect to capture “taste_buds” and explain our thought process.

Number of reviews by each user

We think it is quite possible that the number of reviews a user has written is reflected in their ratings. For example, a user who has written many reviews might consider themselves an expert and therefore be harsher in their judgment.

User’s location

We think that the nationality of a user might influence their ratings. Users from different countries are used to different grading schemes and customs (e.g. school grades in Germany run from 1 to 6, where 1 is the best; in Switzerland grades also run from 1 to 6, but 1 is the worst and 6 the best). Other cultural differences might matter as well.

Because most of the users are from the US, to have a significant number of users in every area we coarse-grained our map, grouping countries and US states into areas.

User-brewery location match

A user might be more accustomed to the taste and style of beer from their own country, or even have a more or less subconscious nationalistic bias towards beers brewed there. To test this, we add a binary feature that is 1 if the coarse-grained locations of the user and the brewery match, and 0 otherwise.
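The coarse-graining and the match feature can be sketched as follows; the area mapping below is a small hypothetical example, not the actual grouping used in our analysis.

```python
# Hypothetical coarse-grained area mapping (illustrative only; the
# real grouping covers all countries and US states in the dataset).
AREA = {
    "California": "US West", "Oregon": "US West",
    "Germany": "Central Europe", "Austria": "Central Europe",
    "Belgium": "Western Europe",
}

def location_match(user_location: str, brewery_location: str) -> int:
    """1 if user and brewery fall in the same coarse-grained area, else 0."""
    return int(user_location in AREA and brewery_location in AREA
               and AREA[user_location] == AREA[brewery_location])

print(location_match("California", "Oregon"))  # 1: both map to "US West"
print(location_match("Germany", "Belgium"))    # 0: different areas
```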

Complexity and length of textual review

From the textual review we wanted to extract a proxy for the “amount of care put into the review”. We hypothesize that users who put less time into a review (captured by the length of the text) might stick closer to the average, while users extremely disappointed with a beer will use a more complex lexicon.

To estimate the complexity of the textual review we used two well-known metrics: the Flesch reading-ease score and the Dale-Chall readability score. Both score passages of text by difficulty; the Dale-Chall score in particular maps to the minimum US school grade required to comprehend the text.
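As a rough sketch of how such a metric works, the Flesch reading-ease formula combines average sentence length and average syllables per word; higher scores mean easier text. The syllable counter below is a crude vowel-group heuristic of our own, so the scores are only indicative (readability libraries use better counters).

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count groups of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n = max(1, len(words))
    syllables = sum(count_syllables(w) for w in words)
    # Flesch reading-ease formula: penalize long sentences and long words.
    return 206.835 - 1.015 * (n / sentences) - 84.6 * (syllables / n)

easy = flesch_reading_ease("The beer is good. I like it.")
hard = flesch_reading_ease(
    "This extraordinarily complicated imperial stout demonstrates "
    "remarkable characteristics throughout an extended fermentation.")
print(round(easy, 1), round(hard, 1))  # the first score is much higher
```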

Time of review relative to the first review for said beer

This feature can, for example, capture a beer leaving its niche and consequently seeing its ratings decline.

Time of the review relative to the first ever review

It is reasonable to assume that users adjust their ratings to the rating distribution they see in their “environment” (i.e. if the ratings on a website tend to fall between 3 and 5, a user will be pulled towards giving 3.5 to a pretty bad beer and 4.5 to a pretty good one, even if the possible range is wider; the same argument, suitably translated, holds for other ranges). It is quite possible that these customs and the “effective rating range” evolve over time inside the forum, perhaps starting lower and increasing, or vice versa. This phenomenon would be captured by this feature.

KDE scatter

Scatter plot matrix of insightful features with kernel density estimation


To understand which, if any, of the proposed features can explain the variation in ratings for a beer, we run a ridge regression on the features, trying to predict “taste_buds”. We interpret the $R^2$ as the variance explained by the model.

To decompose the explained variance among the features, we add them one by one and look at the change in $R^2$ before and after each addition. The variance explained by a feature is then the average of this difference over all possible orders (permutations) of insertion. We followed the method explained here.

The features were prepared for the fit by (log-)scaling and normalizing as necessary, both to aid the fit and to make the model coefficients comparable. We chose ridge regression (linear regression with a Gaussian prior, hence $L^2$ regularization) because some features may be mutually correlated or even spuriously correlated with the target. The regularization hyperparameter is optimized by bootstrapping.
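The permutation-based decomposition can be sketched as below, using a closed-form ridge solution on toy data. All names and the toy setup are ours for illustration; the actual analysis runs on the prepared features, and the hyperparameter `alpha` is fixed here rather than bootstrapped.

```python
import itertools
import numpy as np

def ridge_r2(X, y, features, alpha=1.0):
    """R^2 of a ridge fit using only the given feature columns.

    Assumes y is centered, so total sum of squares is y @ y."""
    if not features:
        return 0.0
    Xs = X[:, list(features)]
    # Closed-form ridge solution: w = (X^T X + alpha * I)^{-1} X^T y
    w = np.linalg.solve(Xs.T @ Xs + alpha * np.eye(Xs.shape[1]), Xs.T @ y)
    resid = y - Xs @ w
    return 1.0 - (resid @ resid) / (y @ y)

def shapley_r2(X, y, alpha=1.0):
    """Average marginal R^2 gain of each feature over all insertion orders."""
    k = X.shape[1]
    contrib = np.zeros(k)
    perms = list(itertools.permutations(range(k)))
    for perm in perms:
        used, r2_prev = [], 0.0
        for j in perm:
            used.append(j)
            r2 = ridge_r2(X, y, used, alpha)
            contrib[j] += r2 - r2_prev  # marginal gain of feature j
            r2_prev = r2
    return contrib / len(perms)

# Toy data: y depends strongly on feature 0, weakly on feature 1,
# not at all on feature 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500)
y = y - y.mean()  # center the target

shares = shapley_r2(X, y)
print(shares, shares.sum())  # shares sum to the full-model R^2
```

By the telescoping of the marginal gains, the per-feature shares add up exactly to the $R^2$ of the model with all features, which is what makes this a clean decomposition of explained variance.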