The Wine Quality dataset consists of red wine samples. The inputs include objective tests (e.g. pH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent).
The dataset input variables (based on physicochemical tests) are:
The output variable (based on sensory data) is:
The goal of this analysis is to understand better what red wine features (variables 1-11) may have most impact on red wine good or bad quality (variable 12).
# Read data to rw dataframe
# row.names = 1 to avoid an index column creation upon dataset reading into a dataframe
rw <- read.csv('wineQualityReds.csv', sep = ',', row.names = 1)
Let’s start with learning more about Red Wine dataset.
# Find dimensions of rw dataframe
dim(rw)
## [1] 1599 12
There are 1599 observations and 12 variables in the dataset.
# List rw dataframe's column names, types and a subset of values
str(rw)
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
There are only numerical values in the dataframe. Quality variable is the only one of integer type. All column names are consistent.
The dataset variables summary can be found below:
# Display summary statistics for each variable
summary(rw)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
The values ranges of different variables vary significantly and may require normalization in later phases of the analysis.
The data set is tidy and does not contain missing values:
# Check how many missing values (NA) are in each column/variable, sum them up per column
colSums(is.na(rw))
## fixed.acidity volatile.acidity citric.acid
## 0 0 0
## residual.sugar chlorides free.sulfur.dioxide
## 0 0 0
## total.sulfur.dioxide density pH
## 0 0 0
## sulphates alcohol quality
## 0 0 0
The following histogram matrix depics distributions of all variables in the dataset.
# Draw a histogram for a given dataframe and variable
# Use deparse() and substitute() functions to decode column name from
# a variable passed as an argument to the function, to be displayed
# on x axis (xlab())
draw_hist <- function(dataframe, variable)
{
# Save histogram definition to the plot variable
plot <- ggplot(data = dataframe, aes(x = variable)) +
geom_histogram(color = 'black', fill = '#099DD9') +
xlab(deparse(substitute(variable)))
return(plot)
}
# Build a matrix of small histograms with 3 columns
# using customly defined draw_hist() function
grid.arrange(draw_hist(rw, rw$fixed.acidity),
draw_hist(rw, rw$volatile.acidity),
draw_hist(rw, rw$citric.acid),
draw_hist(rw, rw$residual.sugar),
draw_hist(rw, rw$chlorides),
draw_hist(rw, rw$free.sulfur.dioxide),
draw_hist(rw, rw$total.sulfur.dioxide),
draw_hist(rw, rw$density),
draw_hist(rw, rw$pH),
draw_hist(rw, rw$sulphates),
draw_hist(rw, rw$alcohol),
draw_hist(rw, rw$quality),
ncol = 3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The histograms reveal that density and pH are normally disributed but the rest of variables are more or less right skewed (long-tailed). The quality dependent variable has a semi-normal discrete distribution.
# Plot a histogram of quality values
ggplot(data = rw, aes(x = quality)) +
geom_histogram(color = 'black', fill = '#099DD9', binwidth = 1) +
# Used to show 0-10 range, even if there are no values close to 0 or 10
scale_x_continuous(limits = c(0, 10), breaks = seq(0, 10, 1)) +
xlab('Quality of Red Wine') +
ylab('Number of Red Wines')
# Display summary statistics for quality
summary(rw$quality)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
According to dataset documentation, the quality value should be between 0-10. In the dataset (based on the histogram plot and numerical summary), the minimum quality value is 3 and maximum is 8. The majority of red wines (more than 1200) have either 5 or 6 ranking for their quality. About 200 red wines have quality ranking as 7, but the number is not as significant as 5 and 6. The rest are either 3, 4 or 8. The average (mean) quality of red wines in the database is 5.63 and median is 6.
As we want to find variables differentiating most good or bad wines, it might be a good idea to bin quality variable into 3 categories (thus converting it into a factor variable). This may reduce comparison complexity and improve readability of the results.
The quality variable will be split into 3 bins: low [0-5), medium [5-7) and high [7-10].
# Set boundaries for intervals
breaks <- c(0, 5, 7, 10)
# Bucket data points into intervals
rw$quality.category <- cut(rw$quality, breaks, include.lowest = TRUE, right = FALSE)
# Check intervals
summary(rw$quality.category)
## [0,5) [5,7) [7,10]
## 63 1319 217
# Add labels to intervals
labels <- c("Low", "Medium", "High")
rw$quality.category <- cut(rw$quality, breaks, include.lowest = TRUE, right = FALSE, labels=labels)
# Check if labels are applied properly
table(rw$quality.category)
##
## Low Medium High
## 63 1319 217
# Draw the result
# Combine rows by quality category
y <- cbind(rw, rw$quality.category)
# Plot category bars and fill the colour based on number counted
ggplot(data = y, aes(x = y$quality.category, fill = ..count..)) +
geom_bar(color = 'black', alpha = 0.9) +
# Add text labels with number of rows in each category
stat_count(geom = "text", aes(label = ..count..), hjust = -0.1) +
theme_bw() +
labs(y = "Number of Observations", x = "Red Wine Quality Category") +
# Make the plot horizontal
coord_flip() +
# Limit y axis
ylim(0, 2000) +
# Include bins of length zero
scale_x_discrete(drop = FALSE)
The number of observations in each quality interval is not balanced with the rest. Most observations are in the Medium category (1319), 217 are in the High category and 63 in the Low category. The low number of observations in the Low and High categories may influence quality of our analysis.
Because the dataset is not balanced in terms of number of observation for each quality category, instead of number of observation, a proportion will be used when comparing samples belonging to different quality categories (normalization).
In order to find biggest differentiators (variables having much more influence on differentiation of wine quality then others) let’s look at each variable distributions and density separately and see how observations fall into bad and good wine categories. To distinguish the best candidates for further analysis, low and high guality intervals should be investigated first.
For each dependent variable I will compute and display:
Based on different views I will identify four biggest differentiators (based on their variability regarding low and high quality) and perform deeper analysis on how they relate to the quality level and each other.
draw_hist(rw, rw$fixed.acidity)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(rw$fixed.acidity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
The distribution of fixed.acidity data is lightly right skewed with minimum value of 4.60, maximum of 15.9 and median of 7.90 and mean of 8.32. Let’s see how many observations fall in each quality category.
# Split fixed.acidity by quality and summarize
by(rw$fixed.acidity, rw$quality.category, summary)
## rw$quality.category: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.600 6.800 7.500 7.871 8.400 12.500
## --------------------------------------------------------
## rw$quality.category: Medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.700 7.100 7.800 8.254 9.100 15.900
## --------------------------------------------------------
## rw$quality.category: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.900 7.400 8.700 8.847 10.100 15.600
# Histograms
ggplot(data = rw, aes(x = fixed.acidity)) +
geom_histogram(binwidth = 0.2, color = 'black', fill = '#099DD9') +
# Split data by category and display plots horizontally
facet_wrap(~quality.category)
The distribution shapes for low and high categories looks similar, but the high distribution tends to be moved to the right. This kind of a diagram cannot show precisely how far the means of low and high distributions are from each other.
Let’s compare the shapes of the quality category distributions using frequency polygons and density plots to see if we can reveal more.
# Frequency polygons
# For y calculate proportions within each quality category group
ggplot(data = rw, aes(x = fixed.acidity, y = ..density../sum(..density..))) +
geom_freqpoly(aes(color = quality.category), binwidth = 0.5)
# Density Plot
ggplot(rw, aes(citric.acid)) +
geom_density(aes(fill = factor(quality.category)), alpha = 0.8)
Having all distributions stacked on top of each other, we can more easily see the differences in their shape but is is still hard to reason about the means.
Using box plots allows us to combine all required information in a visual way.
# Box Plots
ggplot(data = rw, aes(x = quality.category, y = fixed.acidity, color = quality.category)) +
geom_boxplot()
The low and high means are not far from each other and their IQRs overlap (see also summary statistics per category above). The variability between low and high quality categories is moderate comparing to other variables and this feature will not be investigated later.
draw_hist(rw, rw$volatile.acidity)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(rw$volatile.acidity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
The distribution of volatile.acidity data is almost normal with minimum value of 0.12, maximum of 1.58 and median of 0.52 and mean of 0.5278.
by(rw$volatile.acidity, rw$quality.category, summary)
## rw$quality.category: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2300 0.5650 0.6800 0.7242 0.8825 1.5800
## --------------------------------------------------------
## rw$quality.category: Medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1600 0.4100 0.5400 0.5386 0.6400 1.3300
## --------------------------------------------------------
## rw$quality.category: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3000 0.3700 0.4055 0.4900 0.9150
ggplot(data = rw, aes(x = quality.category, y = volatile.acidity, color = quality.category)) +
geom_boxplot()
The low and high means are far from each other and even their IQRs do not overlap (see also summary statistics per category above). The variability between low and high quality categories is high comparing to other variables and this feature will have to be investigated later.
draw_hist(rw, rw$citric.acid)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(rw$citric.acid)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
The distribution of citric.acid data is right skewed with minimum value of 0, maximum of 1 (outlayer) and median of 0.26 and mean of 0.271.
by(rw$citric.acid, rw$quality.category, summary)
## rw$quality.category: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0200 0.0800 0.1737 0.2700 1.0000
## --------------------------------------------------------
## rw$quality.category: Medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0900 0.2400 0.2583 0.4000 0.7900
## --------------------------------------------------------
## rw$quality.category: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.3000 0.4000 0.3765 0.4900 0.7600
ggplot(data = rw, aes(x = quality.category, y = citric.acid, color = quality.category)) +
geom_boxplot()
The low and high means are very far from each other and their IQRs do not overlap (see also summary statistics per category above). The variability between low and high quality categories is high comparing to other variables and this feature will have to be investigated later.
draw_hist(rw, rw$residual.sugar)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(rw$residual.sugar)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
The distribution of residual.sugar data is right skewed (long-tailed) with minimum value of 0.9, maximum of 15.5 (many outlayers) and median of 2.2 and mean of 2.539.
by(rw$residual.sugar, rw$quality.category, summary)
## rw$quality.category: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.200 1.900 2.100 2.685 2.950 12.900
## --------------------------------------------------------
## rw$quality.category: Medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.504 2.600 15.500
## --------------------------------------------------------
## rw$quality.category: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.200 2.000 2.300 2.709 2.700 8.900
ggplot(data = rw, aes(x = quality.category, y = residual.sugar, color = quality.category)) +
geom_boxplot()
The low and high means are very close to each other and their IQRs overlap (see also summary statistics per category above). The variability between low and high quality categories is minimum comparing to other variables and this feature will not be investigated later.
draw_hist(rw, rw$chlorides)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(rw$chlorides)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
The distribution of chlorides data is also right skewed (long-tailed) and very similar to distribution of residual.sugar with minimum value of 0.012, maximum of 0.611 (many outlayers) and median of 0.079 and mean of 0.087.
by(rw$chlorides, rw$quality.category, summary)
## rw$quality.category: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04500 0.06850 0.08000 0.09573 0.09450 0.61000
## --------------------------------------------------------
## rw$quality.category: Medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03400 0.07100 0.08000 0.08897 0.09100 0.61100
## --------------------------------------------------------
## rw$quality.category: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.06200 0.07300 0.07591 0.08500 0.35800
ggplot(data = rw, aes(x = quality.category, y = chlorides, color = quality.category)) +
geom_boxplot()
Similarily to residual.sugar, the low and high means are very close to each other and their IQRs overlap (see also summary statistics per category above). The variability between low and high quality categories is minimum comparing to other variables and this feature will not be investigated later.
draw_hist(rw, rw$free.sulfur.dioxide)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(rw$free.sulfur.dioxide)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
The distribution of free.sulfur.dioxide data is right skewed with minimum value of 1, maximum of 72 and median of 14 and mean of 15.87.
by(rw$free.sulfur.dioxide, rw$quality.category, summary)
## rw$quality.category: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 5.00 9.00 12.06 15.50 41.00
## --------------------------------------------------------
## rw$quality.category: Medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 8.00 14.00 16.37 22.00 72.00
## --------------------------------------------------------
## rw$quality.category: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 6.00 11.00 13.98 18.00 54.00
ggplot(data = rw, aes(x = quality.category, y = free.sulfur.dioxide, color = quality.category)) +
geom_boxplot()
Similarily to residual.sugar or chlorides, the low and high means are very close to each other and their IQRs overlap (see also summary statistics per category above), but in this case the data is much more dispersed. The variability between low and high quality categories is low comparing to other variables and this feature will not be investigated later.
draw_hist(rw, rw$total.sulfur.dioxide)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(rw$total.sulfur.dioxide)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
The distribution of total.sulfur.dioxid data is right skewed (long-tailed) and very similar to distribution of free.sulfur.dioxide with minimum value of 6, maximum of 289 (outlayers) and median of 38 and mean of 46.47.
by(rw$total.sulfur.dioxide, rw$quality.category, summary)
## rw$quality.category: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 13.50 26.00 34.44 48.00 119.00
## --------------------------------------------------------
## rw$quality.category: Medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 24.00 40.00 48.95 65.00 165.00
## --------------------------------------------------------
## rw$quality.category: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 17.00 27.00 34.89 43.00 289.00
ggplot(data = rw, aes(x = quality.category, y = total.sulfur.dioxide, color = quality.category)) +
geom_boxplot()
Similarily to free.sulfur.dioxide, the low and high means are very close to each other and their IQRs overlap (see also summary statistics per category above), but in this case the data is much less dispersed. The variability between low and high quality categories is low comparing to other variables and this feature will not be investigated later.
draw_hist(rw, rw$density)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(rw$density)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
The distribution of density data is normal with minimum value of 0.9901, maximum of 1.0037 and median of 0.9968 and mean of 0.9967.
by(rw$density, rw$quality.category, summary)
## rw$quality.category: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9934 0.9957 0.9966 0.9967 0.9977 1.0010
## --------------------------------------------------------
## rw$quality.category: Medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9958 0.9968 0.9969 0.9979 1.0037
## --------------------------------------------------------
## rw$quality.category: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9906 0.9947 0.9957 0.9960 0.9973 1.0032
ggplot(data = rw, aes(x = quality.category, y = density, color = quality.category)) +
geom_boxplot()
The low and high means are close to each other and their IQRs overlap (see also summary statistics per category above). The variability between low and high quality categories is low comparing to other variables and this feature will not be investigated later.
draw_hist(rw, rw$pH)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(rw$pH)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
The distribution of pH data is normal with minimum value of 2.740, maximum of 4.010 and median of 3.310 and mean of 3.311.
by(rw$pH, rw$quality.category, summary)
## rw$quality.category: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.300 3.380 3.384 3.500 3.900
## --------------------------------------------------------
## rw$quality.category: Medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.860 3.210 3.310 3.311 3.400 4.010
## --------------------------------------------------------
## rw$quality.category: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.880 3.200 3.270 3.289 3.380 3.780
ggplot(data = rw, aes(x = quality.category, y = pH, color = quality.category)) +
geom_boxplot()
The low and high means are not too close to each other but their IQRs overlap (see also summary statistics per category above). The variability between low and high quality categories is semi-low comparing to other variables and this feature will not be investigated later.
draw_hist(rw, rw$sulphates)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(rw$sulphates)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
The distribution of sulphates data is right skewed (long-tailed) with minimum value of 0.33, maximum of 2 and median of 0.62 and mean of 0.6581.
by(rw$sulphates, rw$quality.category, summary)
## rw$quality.category: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.4950 0.5600 0.5922 0.6000 2.0000
## --------------------------------------------------------
## rw$quality.category: Medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3700 0.5400 0.6100 0.6473 0.7000 1.9800
## --------------------------------------------------------
## rw$quality.category: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3900 0.6500 0.7400 0.7435 0.8200 1.3600
ggplot(data = rw, aes(x = quality.category, y = sulphates, color = quality.category)) +
geom_boxplot()
The low and high means are far from each other and their IQRs do not overlap (see also summary statistics per category above). The variability between low and high quality categories is high comparing to other variables and this feature will have to be investigated later.
draw_hist(rw, rw$alcohol)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(rw$alcohol)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The distribution of alcohol data is right skewed but does not have many outliers with minimum value of 8.4, maximum of 14.9 and median of 10.2 and mean of 10.42.
by(rw$alcohol, rw$quality.category, summary)
## rw$quality.category: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.60 10.00 10.22 11.00 13.10
## --------------------------------------------------------
## rw$quality.category: Medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.00 10.25 10.90 14.90
## --------------------------------------------------------
## rw$quality.category: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.60 11.52 12.20 14.00
ggplot(data = rw, aes(x = quality.category, y = alcohol, color = quality.category)) +
geom_boxplot()
The low and high means are very far from each other and their IQRs do overlap only slightly being very dispersed (see also summary statistics per category above). The variability between low and high quality categories is very high comparing to other variables and this feature will have to be investigated later.
For every variable distribution and density differences were explored from different perspectives: numerical summary statistics, histograms, frequency polygons, density distributions and boxplots. The different perspectives complement each other giving most complete view on each variable and their variability related to red wine quality (biggest distribution difference for Low and High red wine quality categories).
It seems that volatile.acidity, citric.acid, sulphates and alcohol show the biggest variability (difference in means and IQR ranges) and these variables will be used in further analysis.
draw_boxplot <- function(dataframe, variable, ylab)
{
plot <- ggplot(data = dataframe, aes(x = quality.category, y = variable, color = quality.category)) +
geom_boxplot() +
xlab('Quality') +
#ylab(deparse(substitute(variable))) +
ylab(ylab) +
theme(legend.position = "none")
return(plot)
}
# Build 4 boxplots summarizing distributions of 4 selected features
draw_univ_summary <- function()
{
grid.arrange(draw_boxplot(rw, rw$volatile.acidity, expression(Volatile~Acidity~(g/dm^{3}))),
draw_boxplot(rw, rw$citric.acid, expression(Citric~Acid~(g/dm^{3}))),
draw_boxplot(rw, rw$sulphates, expression(Sulphates~(g/dm^{3}))),
draw_boxplot(rw, rw$alcohol, 'Alcohol (% by Volume)'),
ncol = 4,
top = 'Features With Biggest Variability by Quality Category')
}
draw_univ_summary()
Just for curiosity, let’s see what variables are correlated with each other (if any).
# Create a new dataframe and calculate correlations
# between rw variables
rwcor <- cor(rw[c(1:11, 12)])
# Draw a correlation matrix
corrplot(rwcor, method = 'square', order = "hclust",
tl.col = "black", tl.cex = 0.8, tl.offset = 1)
The correlation matrix shows that fixed.acidity is highly positively correlated with density and citric.acid. total.sulfur.dioxide is highly positively correlated with free.sulful.dioxide. pH is highly negatively correlated with fixed.acidity. citric.acid is correlated negatively with volatile.acidity and pH. We will not be investigating most of these correlations as most of the correlated features were eliminated earlier from the analysis as less relevant for meeting this EDA objective.
Next, I will focus on negative correlation between volatile.acidity and citric.acid, and also on correlations between quality and the four features selected during univariate exploration phase.
Let’s start with a closer look at how the four features selected for the next round of analysis relate to each other and to quality.
# Create a dataframe with only selected variables
rw_sample = subset(rw, select = c(volatile.acidity, citric.acid, sulphates, alcohol, quality))
# Draw a correlation matrix
ggpairs(rw_sample,
# Define how scatterplots should be displayed
lower = list(continuous = wrap('points', shape = I('.')))) +
theme(legend.position = "none", # do not show legend
panel.grid.major = element_blank(), # do not display grid
panel.border = element_rect(linetype = "dashed", colour = "black", fill = NA))
The above correlation matrix between variables selected for further analysis and the quality variable reveals that:
The strongest correlation found is between volatile.acidity and citric.acid (cerrelation coefficient = -0.552) nevertheless it will not explain much which red wine sample features influence (low and high) wine quality.
Let’s look deeper into quality versus selected features to see how they influence red wine quality. Scatter plots with some overplotting reduction as well as grouping and summarizing features by quality category should help in confirming initial findings and in revealing new ones.
cor.test(rw$volatile.acidity,rw$quality)
##
## Pearson's product-moment correlation
##
## data: rw$volatile.acidity and rw$quality
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4313210 -0.3482032
## sample estimates:
## cor
## -0.3905578
The negative correlation between volatile.acidity and quality is weak-to-moderate and this can be noticed in the plot, where one could argue if less volatile.acidity in a sample results in higher wine quality. More samples in the dataset in low and high quality categories could help in clarification.
How does the average volatile.acidity varies with quality?
# Calculate means, medians and number of observations per existing
# in rw quality levels, sort ascending by quality
rw.volatile_acidity_by_quality <- rw %>%
group_by(quality) %>%
summarize(volatile_acidity_mean = mean(volatile.acidity),
volatile_acidity_median = median(volatile.acidity),
number_of_obs = n()) %>%
arrange(quality)
# Show the result
head(rw.volatile_acidity_by_quality)
## # A tibble: 6 x 4
## quality volatile_acidity_mean volatile_acidity_median number_of_obs
## <int> <dbl> <dbl> <int>
## 1 3 0.884 0.845 10
## 2 4 0.694 0.67 53
## 3 5 0.577 0.580 681
## 4 6 0.497 0.49 638
## 5 7 0.404 0.37 199
## 6 8 0.423 0.37 18
Looking at the average mean and median values of volatile.acidity per each quality rating, we can see that there is a clear pattern showing that the less volatile.acidity is in a sample on average the better quality of the sample is. The plot below confirms this visually.
ggplot(rw.volatile_acidity_by_quality) +
# Jitter data points for better visibility and add depth with alpha
geom_jitter(data = rw, aes(x = volatile.acidity, y = quality,
color = quality.category), width = 0, alpha = 0.5) +
# Put calculated earlier means/medians (line plots on top)
geom_line(aes(x = volatile_acidity_mean, y = quality), color = "blue") +
geom_line(aes(x = volatile_acidity_median, y = quality), color = "red") +
# Set labels and title
xlab("Volatile Acidity Mean (blue) and Median (red)") +
ylab("Red Wine Quality Level") +
ggtitle("Volatile Acidity Mean and Median by Quality")
cor.test(rw$citric.acid,rw$quality)
##
## Pearson's product-moment correlation
##
## data: rw$citric.acid and rw$quality
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1793415 0.2723711
## sample estimates:
## cor
## 0.2263725
The correlation between citric.acid and quality is positive and weak.
rw.citric_acid_by_quality <- rw %>%
group_by(quality) %>%
summarize(citric_acid_mean = mean(citric.acid),
citric_acid_median = median(citric.acid),
number_of_obs = n()) %>%
arrange(quality)
head(rw.citric_acid_by_quality)
## # A tibble: 6 x 4
## quality citric_acid_mean citric_acid_median number_of_obs
## <int> <dbl> <dbl> <int>
## 1 3 0.171 0.035 10
## 2 4 0.174 0.09 53
## 3 5 0.244 0.23 681
## 4 6 0.274 0.26 638
## 5 7 0.375 0.4 199
## 6 8 0.391 0.42 18
Looking at the average mean and median values of citric.acid per each quality rating, we can see a pattern showing that the bigger citric.acid level is in a sample on average the better quality of the sample is. Furthermore, as clearly shown on the plot below, the samples with citric.acid level above 0.5 will almost always be classsified as of Medium or High quality.
ggplot(rw.citric_acid_by_quality) +
geom_jitter(data = rw, aes(x = citric.acid, y = quality,
color = quality.category), width = 0, alpha = 0.5) +
geom_line(aes(x = citric_acid_mean, y = quality), color = "blue") +
geom_line(aes(x = citric_acid_median, y = quality), color = "red") +
xlab("Citric Acid Mean (blue) and Median (red)") +
ylab("Red Wine Quality Level") +
ggtitle("Citric Acid Mean and Median by Quality")
cor.test(rw$sulphates,rw$quality)
##
## Pearson's product-moment correlation
##
## data: rw$sulphates and rw$quality
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2049011 0.2967610
## sample estimates:
## cor
## 0.2513971
The correlation between sulphates and quality is positive and weak similarily to the correlation between citric.acid and quality.
rw.sulphates_by_quality <- rw %>%
group_by(quality) %>%
summarize(sulphates_mean = mean(sulphates),
sulphates_median = median(sulphates),
number_of_obs = n()) %>%
arrange(quality)
head(rw.sulphates_by_quality)
## # A tibble: 6 x 4
## quality sulphates_mean sulphates_median number_of_obs
## <int> <dbl> <dbl> <int>
## 1 3 0.57 0.545 10
## 2 4 0.596 0.56 53
## 3 5 0.621 0.580 681
## 4 6 0.675 0.64 638
## 5 7 0.741 0.74 199
## 6 8 0.768 0.74 18
Looking at the average mean and median values of sulphates per each quality rating, we can see that the bigger sulphates level is in a sample on average the better quality of the sample is. However, the sulphates values are less spread than values of other variables.
ggplot(rw.sulphates_by_quality) +
geom_jitter(data = rw, aes(x = sulphates, y = quality,
color = quality.category), width = 0, alpha = 0.5) +
geom_line(aes(x = sulphates_mean, y = quality), color = "blue") +
geom_line(aes(x = sulphates_median, y = quality), color = "red") +
xlab("Sulphates Mean (blue) and Median (red)") +
ylab("Red Wine Quality Level") +
ggtitle("Sulphates Mean and Median by Quality")
cor.test(rw$alcohol,rw$quality)
##
## Pearson's product-moment correlation
##
## data: rw$alcohol and rw$quality
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
The correlation between alcohol and quality is positive and moderate. It is the strongest correlation identified between the four differentiating features and quality.
rw.alcohol_by_quality <- rw %>%
group_by(quality) %>%
summarize(alcohol_mean = mean(alcohol),
alcohol_median = median(alcohol),
number_of_obs = n()) %>%
arrange(quality)
head(rw.alcohol_by_quality)
## # A tibble: 6 x 4
## quality alcohol_mean alcohol_median number_of_obs
## <int> <dbl> <dbl> <int>
## 1 3 9.96 9.93 10
## 2 4 10.3 10 53
## 3 5 9.90 9.7 681
## 4 6 10.6 10.5 638
## 5 7 11.5 11.5 199
## 6 8 12.1 12.2 18
Looking at the average mean and median values of alcohol per each quality rating, we can see a pattern showing that only alcohol level above 12 gives more certainty that the sample will be considered as of Medium or High quality. If the alcohol level goes below 10 a sample will most likely be considered as of a Medium or Low quality.
ggplot(rw.alcohol_by_quality) +
geom_jitter(data = rw, aes(x = alcohol, y = quality,
color = quality.category), width = 0, alpha = 0.5) +
geom_line(aes(x = alcohol_mean, y = quality), color = "blue") +
geom_line(aes(x = alcohol_median, y = quality), color = "red") +
xlab("Alcohol Mean (blue) and Median (red)") +
ylab("Red Wine Quality Level") +
ggtitle("Alcohol Mean and Median by Quality")
So far, in this phase of EDA I have confirmed the following:
All of these are clearly visible in the following plot.
draw_main_quality_corrs <- function(dataframe, variable, title)
{
plot <- ggplot(data = dataframe, aes(x = variable, y = quality.category)) +
geom_point(aes(color = quality.category), alpha = 1/4, position = 'jitter') +
ggtitle(title) +
xlab('') +
ylab('Quality') +
theme(legend.position = "none")
return(plot)
}
draw_biv_summary <- function()
{
grid.arrange(draw_main_quality_corrs(rw, rw$volatile.acidity, expression(Volatile~Acidity~(g/dm^{3}))),
draw_main_quality_corrs(rw, rw$citric.acid, expression(Citric~Acid~(g/dm^{3}))),
draw_main_quality_corrs(rw, rw$sulphates, expression(Sulphates~(g/dm^{3}))),
draw_main_quality_corrs(rw, rw$alcohol, 'Alcohol (% by Volume)'),
ncol = 2,
top = 'Quality and Features Correlation by Category')
}
draw_biv_summary()
The last thing to check is the relation between volatile.acidity, alcohol and quality.
A scatterplot with local polynomial regression fitting line should smoth the plot.
# Plot a scatter plot
ggplot(data = rw, aes(x = volatile.acidity, y = alcohol, color = quality.category)) +
geom_point(alpha = 0.6) +
# Use local polynomial regression fitting line
geom_smooth(method="loess", col="firebrick", size=1)
draw_muv_summary <- function()
{
ggplot(data = rw, aes(x = volatile.acidity, y = alcohol)) +
geom_jitter(aes(color = quality.category, bg = quality.category), alpha = 1/2, pch = 21, cex = 2) +
facet_wrap(~quality.category) +
scale_color_brewer(type = 'div') +
xlab(expression(Volatile~Acidity~(g/dm^{3}))) +
ylab("Alcohol (% by Volume)") +
ggtitle('Volatile.acidity and Alcohol by Quality Category')
}
draw_muv_summary()
ggplot(data = rw, aes(x = volatile.acidity, y = alcohol)) +
geom_jitter(aes(color = quality.category, bg = quality.category), alpha = 1/2, pch = 21, cex = 2) +
facet_wrap(~quality) +
scale_color_brewer(type = 'div') +
ggtitle('Volatile.acidity and Alcohol Relationship by Quality Level')
It is interesting to notice that high quality red wines are the ones having rather lower level of volatile.acidity and higher level of alcohol while low quality red wines tend to have move volatile.acidity and less alcohol levels. It is worth mentioning that the low category data is very much dispersed and low amount of observations (comparing to medium category for example) may influence limited clarity of the results.
To summarize:
draw_muv_summary()
The goal of this EDA was to understand better what red wine features may have most impact on red wine good or bad quality.
The Red Wine dataset was tidy and no data munging activities were necessary, nevertheless the number of observations in each quality interval was not balanced with the rest. Most observations are in the Medium category (1319), 217 are in the High category and 63 in the Low category. The low number of observations in the Low and High categories might influence quality of our analysis.
In order to find biggest red wine quality influencers a univariate analysis was performed and four features with the highest variability by quality category were selected.
draw_univ_summary()
The four features: volatile.acidity, citric.acid, sulphates and alcohol showing the biggest variability were used in further analysis.
Based on a correlation matrix, correlations between quality and the four features selected during univariate exploration phase, and a negative correlation between volatile.acidity and citric.acid were chosen as a next steps focus.
draw_biv_summary()
The second phase of the analysis revealed that:
draw_muv_summary()
In addition (as depicted on the plot above):
This analysis was done in a top-down manner, where through elimination most significant wine quality features were selected leading to some final conclusions. It is one of the possible approaches and the author of this EDA does not claim that the approach taken in this EDA is the best one.
There were some challenges spotted while performing the analysis:
In the future, there could be more features added (grown country, weather conditions, wine making process specifics, etc.) to the dataset and sample number for low and high wine quality should be balanced with the medium level for better results.