Exploratory Data Analysis of Red Wine Quality Dataset

Dataset Description

The Wine Quality dataset consists of red wine samples. The inputs include objective tests (e.g. pH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent).

The dataset input variables (based on physicochemical tests) are:

fixed acidity (tartaric acid - g / dm^3)
volatile acidity (acetic acid - g / dm^3)
citric acid (g / dm^3)
residual sugar (g / dm^3)
chlorides (sodium chloride - g / dm^3
free sulfur dioxide (mg / dm^3)
total sulfur dioxide (mg / dm^3)
density (g / cm^3)
pH
sulphates (potassium sulphate - g / dm^3)
alcohol (% by volume)

The output variable (based on sensory data) is:

quality (score between 0 and 10)

EDA Objective

The goal of this analysis is to understand better what red wine features (variables 1-11) may have most impact on red wine good or bad quality (variable 12).

# Read data to rw dataframe
# row.names = 1 to avoid an index column creation upon dataset reading into a dataframe
rw <- read.csv('wineQualityReds.csv', sep = ',', row.names = 1)

Dataset Exploration

Let’s start with learning more about Red Wine dataset.

# Find dimensions of rw dataframe
dim(rw)

## [1] 1599   12

There are 1599 observations and 12 variables in the dataset.

# List rw dataframe's column names, types and a subset of values
str(rw)

## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

There are only numerical values in the dataframe. Quality variable is the only one of integer type. All column names are consistent.

The dataset variables summary can be found below:

# Display summary statistics for each variable
summary(rw)

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

The values ranges of different variables vary significantly and may require normalization in later phases of the analysis.

The data set is tidy and does not contain missing values:

# Check how many missing values (NA) are in each column/variable, sum them up per column
colSums(is.na(rw))

##        fixed.acidity     volatile.acidity          citric.acid 
##                    0                    0                    0 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##                    0                    0                    0 
## total.sulfur.dioxide              density                   pH 
##                    0                    0                    0 
##            sulphates              alcohol              quality 
##                    0                    0                    0

Univariate and Bivariate Exploration

The following histogram matrix depics distributions of all variables in the dataset.

# Draw a histogram for a given dataframe and variable
# Use deparse() and substitute() functions to decode column name from 
# a variable passed as an argument to the function, to be displayed
# on x axis (xlab())
draw_hist <- function(dataframe, variable)
{
  # Save histogram definition to the plot variable
  plot <- ggplot(data = dataframe, aes(x = variable)) + 
           geom_histogram(color = 'black', fill = '#099DD9') +
           xlab(deparse(substitute(variable)))
  return(plot)
}

# Build a matrix of small histograms with 3 columns
# using customly defined draw_hist() function
grid.arrange(draw_hist(rw, rw$fixed.acidity),
             draw_hist(rw, rw$volatile.acidity),
             draw_hist(rw, rw$citric.acid),
             draw_hist(rw, rw$residual.sugar),
             draw_hist(rw, rw$chlorides),
             draw_hist(rw, rw$free.sulfur.dioxide),
             draw_hist(rw, rw$total.sulfur.dioxide),
             draw_hist(rw, rw$density),
             draw_hist(rw, rw$pH),
             draw_hist(rw, rw$sulphates),
             draw_hist(rw, rw$alcohol),
             draw_hist(rw, rw$quality),
             ncol = 3)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The histograms reveal that density and pH are normally disributed but the rest of variables are more or less right skewed (long-tailed). The quality dependent variable has a semi-normal discrete distribution.

Dependent Variable

quality

# Plot a histogram of quality values
ggplot(data = rw, aes(x = quality)) +
  geom_histogram(color = 'black', fill = '#099DD9', binwidth = 1) +
  # Used to show 0-10 range, even if there are no values close to 0 or 10
  scale_x_continuous(limits = c(0, 10), breaks = seq(0, 10, 1)) +
  xlab('Quality of Red Wine') +
  ylab('Number of Red Wines')

# Display summary statistics for quality
summary(rw$quality)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

According to dataset documentation, the quality value should be between 0-10. In the dataset (based on the histogram plot and numerical summary), the minimum quality value is 3 and maximum is 8. The majority of red wines (more than 1200) have either 5 or 6 ranking for their quality. About 200 red wines have quality ranking as 7, but the number is not as significant as 5 and 6. The rest are either 3, 4 or 8. The average (mean) quality of red wines in the database is 5.63 and median is 6.

As we want to find variables differentiating most good or bad wines, it might be a good idea to bin quality variable into 3 categories (thus converting it into a factor variable). This may reduce comparison complexity and improve readability of the results.

The quality variable will be split into 3 bins: low [0-5), medium [5-7) and high [7-10].

# Set boundaries for intervals
breaks <- c(0, 5, 7, 10)

# Bucket data points into intervals
rw$quality.category <- cut(rw$quality, breaks, include.lowest = TRUE, right = FALSE)

# Check intervals
summary(rw$quality.category)

##  [0,5)  [5,7) [7,10] 
##     63   1319    217

# Add labels to intervals
labels <- c("Low", "Medium", "High")
rw$quality.category <- cut(rw$quality, breaks, include.lowest = TRUE, right = FALSE, labels=labels)

# Check if labels are applied properly
table(rw$quality.category)

## 
##    Low Medium   High 
##     63   1319    217

# Draw the result
# Combine rows by quality category
y <- cbind(rw, rw$quality.category)
# Plot category bars and fill the colour based on number counted
ggplot(data = y, aes(x = y$quality.category, fill = ..count..)) + 
  geom_bar(color = 'black', alpha = 0.9) + 
  # Add text labels with number of rows in each category
  stat_count(geom = "text", aes(label = ..count..), hjust = -0.1) +
  theme_bw() + 
  labs(y = "Number of Observations", x = "Red Wine Quality Category") +
  # Make the plot horizontal
  coord_flip() +
  # Limit y axis
  ylim(0, 2000) +
  # Include bins of length zero
  scale_x_discrete(drop = FALSE)

The number of observations in each quality interval is not balanced with the rest. Most observations are in the Medium category (1319), 217 are in the High category and 63 in the Low category. The low number of observations in the Low and High categories may influence quality of our analysis.

Because the dataset is not balanced in terms of number of observation for each quality category, instead of number of observation, a proportion will be used when comparing samples belonging to different quality categories (normalization).

Independent Variables

In order to find biggest differentiators (variables having much more influence on differentiation of wine quality then others) let’s look at each variable distributions and density separately and see how observations fall into bad and good wine categories. To distinguish the best candidates for further analysis, low and high guality intervals should be investigated first.

For each dependent variable I will compute and display:

Numerical summary statistics (overall and split into quality categories),
Histograms (distribution of values for all observations and by quality category),
Frequency polygons and density plots (to better compare distributions for each quality category),
Box plots (to visualize all important aspects and find easier the differentiators).

Based on different views I will identify four biggest differentiators (based on their variability regarding low and high quality) and perform deeper analysis on how they relate to the quality level and each other.

fixed.acidity

draw_hist(rw, rw$fixed.acidity)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(rw$fixed.acidity)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

The distribution of fixed.acidity data is lightly right skewed with minimum value of 4.60, maximum of 15.9 and median of 7.90 and mean of 8.32. Let’s see how many observations fall in each quality category.

# Split fixed.acidity by quality and summarize
by(rw$fixed.acidity, rw$quality.category, summary)

## rw$quality.category: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.600   6.800   7.500   7.871   8.400  12.500 
## -------------------------------------------------------- 
## rw$quality.category: Medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.700   7.100   7.800   8.254   9.100  15.900 
## -------------------------------------------------------- 
## rw$quality.category: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.900   7.400   8.700   8.847  10.100  15.600

# Histograms
ggplot(data = rw, aes(x = fixed.acidity)) +
  geom_histogram(binwidth = 0.2, color = 'black', fill = '#099DD9') +
  # Split data by category and display plots horizontally
  facet_wrap(~quality.category)

The distribution shapes for low and high categories looks similar, but the high distribution tends to be moved to the right. This kind of a diagram cannot show precisely how far the means of low and high distributions are from each other.

Let’s compare the shapes of the quality category distributions using frequency polygons and density plots to see if we can reveal more.

# Frequency polygons
# For y calculate proportions within each quality category group
ggplot(data = rw, aes(x = fixed.acidity, y = ..density../sum(..density..))) +
  geom_freqpoly(aes(color = quality.category), binwidth = 0.5)

# Density Plot
ggplot(rw, aes(citric.acid)) +
  geom_density(aes(fill = factor(quality.category)), alpha = 0.8)

Having all distributions stacked on top of each other, we can more easily see the differences in their shape but is is still hard to reason about the means.

Using box plots allows us to combine all required information in a visual way.

# Box Plots
ggplot(data = rw, aes(x = quality.category, y = fixed.acidity, color = quality.category)) +
  geom_boxplot()

The low and high means are not far from each other and their IQRs overlap (see also summary statistics per category above). The variability between low and high quality categories is moderate comparing to other variables and this feature will not be investigated later.

volatile.acidity

draw_hist(rw, rw$volatile.acidity)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(rw$volatile.acidity)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

The distribution of volatile.acidity data is almost normal with minimum value of 0.12, maximum of 1.58 and median of 0.52 and mean of 0.5278.

by(rw$volatile.acidity, rw$quality.category, summary)

## rw$quality.category: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2300  0.5650  0.6800  0.7242  0.8825  1.5800 
## -------------------------------------------------------- 
## rw$quality.category: Medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1600  0.4100  0.5400  0.5386  0.6400  1.3300 
## -------------------------------------------------------- 
## rw$quality.category: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3000  0.3700  0.4055  0.4900  0.9150

ggplot(data = rw, aes(x = quality.category, y = volatile.acidity, color = quality.category)) +
  geom_boxplot()

The low and high means are far from each other and even their IQRs do not overlap (see also summary statistics per category above). The variability between low and high quality categories is high comparing to other variables and this feature will have to be investigated later.

citric.acid

draw_hist(rw, rw$citric.acid)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(rw$citric.acid)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

The distribution of citric.acid data is right skewed with minimum value of 0, maximum of 1 (outlayer) and median of 0.26 and mean of 0.271.

by(rw$citric.acid, rw$quality.category, summary)

## rw$quality.category: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0200  0.0800  0.1737  0.2700  1.0000 
## -------------------------------------------------------- 
## rw$quality.category: Medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2400  0.2583  0.4000  0.7900 
## -------------------------------------------------------- 
## rw$quality.category: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.3000  0.4000  0.3765  0.4900  0.7600

ggplot(data = rw, aes(x = quality.category, y = citric.acid, color = quality.category)) +
  geom_boxplot()

The low and high means are very far from each other and their IQRs do not overlap (see also summary statistics per category above). The variability between low and high quality categories is high comparing to other variables and this feature will have to be investigated later.

residual.sugar

draw_hist(rw, rw$residual.sugar)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(rw$residual.sugar)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

The distribution of residual.sugar data is right skewed (long-tailed) with minimum value of 0.9, maximum of 15.5 (many outlayers) and median of 2.2 and mean of 2.539.

by(rw$residual.sugar, rw$quality.category, summary)

## rw$quality.category: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.200   1.900   2.100   2.685   2.950  12.900 
## -------------------------------------------------------- 
## rw$quality.category: Medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.504   2.600  15.500 
## -------------------------------------------------------- 
## rw$quality.category: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.200   2.000   2.300   2.709   2.700   8.900

ggplot(data = rw, aes(x = quality.category, y = residual.sugar, color = quality.category)) +
  geom_boxplot()

The low and high means are very close to each other and their IQRs overlap (see also summary statistics per category above). The variability between low and high quality categories is minimum comparing to other variables and this feature will not be investigated later.

chlorides

draw_hist(rw, rw$chlorides)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(rw$chlorides)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

The distribution of chlorides data is also right skewed (long-tailed) and very similar to distribution of residual.sugar with minimum value of 0.012, maximum of 0.611 (many outlayers) and median of 0.079 and mean of 0.087.

by(rw$chlorides, rw$quality.category, summary)

## rw$quality.category: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.04500 0.06850 0.08000 0.09573 0.09450 0.61000 
## -------------------------------------------------------- 
## rw$quality.category: Medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.03400 0.07100 0.08000 0.08897 0.09100 0.61100 
## -------------------------------------------------------- 
## rw$quality.category: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.06200 0.07300 0.07591 0.08500 0.35800

ggplot(data = rw, aes(x = quality.category, y = chlorides, color = quality.category)) +
  geom_boxplot()

Similarily to residual.sugar, the low and high means are very close to each other and their IQRs overlap (see also summary statistics per category above). The variability between low and high quality categories is minimum comparing to other variables and this feature will not be investigated later.

free.sulfur.dioxide

draw_hist(rw, rw$free.sulfur.dioxide)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(rw$free.sulfur.dioxide)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

The distribution of free.sulfur.dioxide data is right skewed with minimum value of 1, maximum of 72 and median of 14 and mean of 15.87.

by(rw$free.sulfur.dioxide, rw$quality.category, summary)

## rw$quality.category: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    5.00    9.00   12.06   15.50   41.00 
## -------------------------------------------------------- 
## rw$quality.category: Medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    8.00   14.00   16.37   22.00   72.00 
## -------------------------------------------------------- 
## rw$quality.category: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    6.00   11.00   13.98   18.00   54.00

ggplot(data = rw, aes(x = quality.category, y = free.sulfur.dioxide, color = quality.category)) +
  geom_boxplot()

Similarily to residual.sugar or chlorides, the low and high means are very close to each other and their IQRs overlap (see also summary statistics per category above), but in this case the data is much more dispersed. The variability between low and high quality categories is low comparing to other variables and this feature will not be investigated later.

total.sulfur.dioxide

draw_hist(rw, rw$total.sulfur.dioxide)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(rw$total.sulfur.dioxide)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

The distribution of total.sulfur.dioxid data is right skewed (long-tailed) and very similar to distribution of free.sulfur.dioxide with minimum value of 6, maximum of 289 (outlayers) and median of 38 and mean of 46.47.

by(rw$total.sulfur.dioxide, rw$quality.category, summary)

## rw$quality.category: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   13.50   26.00   34.44   48.00  119.00 
## -------------------------------------------------------- 
## rw$quality.category: Medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   24.00   40.00   48.95   65.00  165.00 
## -------------------------------------------------------- 
## rw$quality.category: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   17.00   27.00   34.89   43.00  289.00

ggplot(data = rw, aes(x = quality.category, y = total.sulfur.dioxide, color = quality.category)) +
  geom_boxplot()

Similarily to free.sulfur.dioxide, the low and high means are very close to each other and their IQRs overlap (see also summary statistics per category above), but in this case the data is much less dispersed. The variability between low and high quality categories is low comparing to other variables and this feature will not be investigated later.

density

draw_hist(rw, rw$density)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(rw$density)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037

The distribution of density data is normal with minimum value of 0.9901, maximum of 1.0037 and median of 0.9968 and mean of 0.9967.

by(rw$density, rw$quality.category, summary)

## rw$quality.category: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9934  0.9957  0.9966  0.9967  0.9977  1.0010 
## -------------------------------------------------------- 
## rw$quality.category: Medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9958  0.9968  0.9969  0.9979  1.0037 
## -------------------------------------------------------- 
## rw$quality.category: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9906  0.9947  0.9957  0.9960  0.9973  1.0032

ggplot(data = rw, aes(x = quality.category, y = density, color = quality.category)) +
  geom_boxplot()

The low and high means are close to each other and their IQRs overlap (see also summary statistics per category above). The variability between low and high quality categories is low comparing to other variables and this feature will not be investigated later.

pH

draw_hist(rw, rw$pH)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(rw$pH)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

The distribution of pH data is normal with minimum value of 2.740, maximum of 4.010 and median of 3.310 and mean of 3.311.

by(rw$pH, rw$quality.category, summary)

## rw$quality.category: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.300   3.380   3.384   3.500   3.900 
## -------------------------------------------------------- 
## rw$quality.category: Medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.860   3.210   3.310   3.311   3.400   4.010 
## -------------------------------------------------------- 
## rw$quality.category: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.880   3.200   3.270   3.289   3.380   3.780

ggplot(data = rw, aes(x = quality.category, y = pH, color = quality.category)) +
  geom_boxplot()

The low and high means are not too close to each other but their IQRs overlap (see also summary statistics per category above). The variability between low and high quality categories is semi-low comparing to other variables and this feature will not be investigated later.

sulphates

draw_hist(rw, rw$sulphates)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(rw$sulphates)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

The distribution of sulphates data is right skewed (long-tailed) with minimum value of 0.33, maximum of 2 and median of 0.62 and mean of 0.6581.

by(rw$sulphates, rw$quality.category, summary)

## rw$quality.category: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.4950  0.5600  0.5922  0.6000  2.0000 
## -------------------------------------------------------- 
## rw$quality.category: Medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3700  0.5400  0.6100  0.6473  0.7000  1.9800 
## -------------------------------------------------------- 
## rw$quality.category: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3900  0.6500  0.7400  0.7435  0.8200  1.3600

ggplot(data = rw, aes(x = quality.category, y = sulphates, color = quality.category)) +
  geom_boxplot()

The low and high means are far from each other and their IQRs do not overlap (see also summary statistics per category above). The variability between low and high quality categories is high comparing to other variables and this feature will have to be investigated later.

alcohol

draw_hist(rw, rw$alcohol)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(rw$alcohol)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

The distribution of alcohol data is right skewed but does not have many outliers with minimum value of 8.4, maximum of 14.9 and median of 10.2 and mean of 10.42.

by(rw$alcohol, rw$quality.category, summary)

## rw$quality.category: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.60   10.00   10.22   11.00   13.10 
## -------------------------------------------------------- 
## rw$quality.category: Medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.00   10.25   10.90   14.90 
## -------------------------------------------------------- 
## rw$quality.category: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.60   11.52   12.20   14.00

ggplot(data = rw, aes(x = quality.category, y = alcohol, color = quality.category)) +
  geom_boxplot()

The low and high means are very far from each other and their IQRs do overlap only slightly being very dispersed (see also summary statistics per category above). The variability between low and high quality categories is very high comparing to other variables and this feature will have to be investigated later.

Univariate and Bivariate Exploration Summary

For every variable distribution and density differences were explored from different perspectives: numerical summary statistics, histograms, frequency polygons, density distributions and boxplots. The different perspectives complement each other giving most complete view on each variable and their variability related to red wine quality (biggest distribution difference for Low and High red wine quality categories).

It seems that volatile.acidity, citric.acid, sulphates and alcohol show the biggest variability (difference in means and IQR ranges) and these variables will be used in further analysis.

draw_boxplot <- function(dataframe, variable, ylab)
{
  plot <- ggplot(data = dataframe, aes(x = quality.category, y = variable, color = quality.category)) + 
    geom_boxplot() + 
    xlab('Quality') +
    #ylab(deparse(substitute(variable))) +
    ylab(ylab) +
    theme(legend.position = "none")
  return(plot)
}

# Build 4 boxplots summarizing distributions of 4 selected features
draw_univ_summary <- function()
{
  grid.arrange(draw_boxplot(rw, rw$volatile.acidity, expression(Volatile~Acidity~(g/dm^{3}))),
             draw_boxplot(rw, rw$citric.acid, expression(Citric~Acid~(g/dm^{3}))),
             draw_boxplot(rw, rw$sulphates, expression(Sulphates~(g/dm^{3}))),
             draw_boxplot(rw, rw$alcohol, 'Alcohol (% by Volume)'),
             ncol = 4,
             top = 'Features With Biggest Variability by Quality Category')
}

draw_univ_summary()

Bivariate and Multivariate Exploration

Just for curiosity, let’s see what variables are correlated with each other (if any).

# Create a new dataframe and calculate correlations 
# between rw variables
rwcor <- cor(rw[c(1:11, 12)])
# Draw a correlation matrix
corrplot(rwcor, method = 'square', order = "hclust", 
         tl.col = "black", tl.cex = 0.8, tl.offset = 1)

The correlation matrix shows that fixed.acidity is highly positively correlated with density and citric.acid. total.sulfur.dioxide is highly positively correlated with free.sulful.dioxide. pH is highly negatively correlated with fixed.acidity. citric.acid is correlated negatively with volatile.acidity and pH. We will not be investigating most of these correlations as most of the correlated features were eliminated earlier from the analysis as less relevant for meeting this EDA objective.

Next, I will focus on negative correlation between volatile.acidity and citric.acid, and also on correlations between quality and the four features selected during univariate exploration phase.

Let’s start with a closer look at how the four features selected for the next round of analysis relate to each other and to quality.

# Create a dataframe with only selected variables
rw_sample = subset(rw, select = c(volatile.acidity, citric.acid, sulphates, alcohol, quality))
 
# Draw a correlation matrix
ggpairs(rw_sample, 
        # Define how scatterplots should be displayed
        lower = list(continuous = wrap('points', shape = I('.')))) +
  theme(legend.position = "none", # do not show legend
        panel.grid.major = element_blank(), # do not display grid
        panel.border = element_rect(linetype = "dashed", colour = "black", fill = NA))

The above correlation matrix between variables selected for further analysis and the quality variable reveals that:

Less volatile.acidity in a sample results in higher wine quality (correlation coefficient = -0.391),
The presence of citric.acid and sulphates indicates rather medium quality wine samples,
The more alcohol in a sample the higher quality score it can get (correlation coefficient = 0.476).

The strongest correlation found is between volatile.acidity and citric.acid (cerrelation coefficient = -0.552) nevertheless it will not explain much which red wine sample features influence (low and high) wine quality.

Let’s look deeper into quality versus selected features to see how they influence red wine quality. Scatter plots with some overplotting reduction as well as grouping and summarizing features by quality category should help in confirming initial findings and in revealing new ones.

volatile.acidity & quality

cor.test(rw$volatile.acidity,rw$quality)

## 
##  Pearson's product-moment correlation
## 
## data:  rw$volatile.acidity and rw$quality
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578

The negative correlation between volatile.acidity and quality is weak-to-moderate and this can be noticed in the plot, where one could argue if less volatile.acidity in a sample results in higher wine quality. More samples in the dataset in low and high quality categories could help in clarification.

How does the average volatile.acidity varies with quality?

# Calculate means, medians and number of observations per existing
# in rw quality levels, sort ascending by quality 
rw.volatile_acidity_by_quality <- rw %>%
  group_by(quality) %>%
  summarize(volatile_acidity_mean = mean(volatile.acidity),
            volatile_acidity_median = median(volatile.acidity),
            number_of_obs = n()) %>%
  arrange(quality)

# Show the result
head(rw.volatile_acidity_by_quality)

## # A tibble: 6 x 4
##   quality volatile_acidity_mean volatile_acidity_median number_of_obs
##     <int>                 <dbl>                   <dbl>         <int>
## 1       3                 0.884                   0.845            10
## 2       4                 0.694                   0.67             53
## 3       5                 0.577                   0.580           681
## 4       6                 0.497                   0.49            638
## 5       7                 0.404                   0.37            199
## 6       8                 0.423                   0.37             18

Looking at the average mean and median values of volatile.acidity per each quality rating, we can see that there is a clear pattern showing that the less volatile.acidity is in a sample on average the better quality of the sample is. The plot below confirms this visually.

ggplot(rw.volatile_acidity_by_quality)  +
  # Jitter data points for better visibility and add depth with alpha
  geom_jitter(data = rw, aes(x = volatile.acidity, y = quality,
                      color = quality.category), width = 0, alpha = 0.5) + 
  # Put calculated earlier means/medians (line plots on top)
  geom_line(aes(x = volatile_acidity_mean, y = quality), color = "blue") +
  geom_line(aes(x = volatile_acidity_median, y = quality), color = "red") +
  # Set labels and title
  xlab("Volatile Acidity Mean (blue) and Median (red)") +
  ylab("Red Wine Quality Level") +
  ggtitle("Volatile Acidity Mean and Median by Quality")

citric.acid & quality

cor.test(rw$citric.acid,rw$quality)

## 
##  Pearson's product-moment correlation
## 
## data:  rw$citric.acid and rw$quality
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1793415 0.2723711
## sample estimates:
##       cor 
## 0.2263725

The correlation between citric.acid and quality is positive and weak.

rw.citric_acid_by_quality <- rw %>%
  group_by(quality) %>%
  summarize(citric_acid_mean = mean(citric.acid),
            citric_acid_median = median(citric.acid),
            number_of_obs = n()) %>%
  arrange(quality)
head(rw.citric_acid_by_quality)

## # A tibble: 6 x 4
##   quality citric_acid_mean citric_acid_median number_of_obs
##     <int>            <dbl>              <dbl>         <int>
## 1       3            0.171              0.035            10
## 2       4            0.174              0.09             53
## 3       5            0.244              0.23            681
## 4       6            0.274              0.26            638
## 5       7            0.375              0.4             199
## 6       8            0.391              0.42             18

Looking at the average mean and median values of citric.acid per each quality rating, we can see a pattern showing that the bigger citric.acid level is in a sample on average the better quality of the sample is. Furthermore, as clearly shown on the plot below, the samples with citric.acid level above 0.5 will almost always be classsified as of Medium or High quality.

ggplot(rw.citric_acid_by_quality)  +
  geom_jitter(data = rw, aes(x = citric.acid, y = quality,
              color = quality.category), width = 0, alpha = 0.5) + 
  geom_line(aes(x = citric_acid_mean, y = quality), color = "blue") +
  geom_line(aes(x = citric_acid_median, y = quality), color = "red") +
  xlab("Citric Acid Mean (blue) and Median (red)") +
  ylab("Red Wine Quality Level") +
  ggtitle("Citric Acid Mean and Median by Quality")

sulphates & quality

cor.test(rw$sulphates,rw$quality)

## 
##  Pearson's product-moment correlation
## 
## data:  rw$sulphates and rw$quality
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2049011 0.2967610
## sample estimates:
##       cor 
## 0.2513971

The correlation between sulphates and quality is positive and weak similarily to the correlation between citric.acid and quality.

rw.sulphates_by_quality <- rw %>%
  group_by(quality) %>%
  summarize(sulphates_mean = mean(sulphates),
            sulphates_median = median(sulphates),
            number_of_obs = n()) %>%
  arrange(quality)
head(rw.sulphates_by_quality)

## # A tibble: 6 x 4
##   quality sulphates_mean sulphates_median number_of_obs
##     <int>          <dbl>            <dbl>         <int>
## 1       3          0.57             0.545            10
## 2       4          0.596            0.56             53
## 3       5          0.621            0.580           681
## 4       6          0.675            0.64            638
## 5       7          0.741            0.74            199
## 6       8          0.768            0.74             18

Looking at the average mean and median values of sulphates per each quality rating, we can see that the bigger sulphates level is in a sample on average the better quality of the sample is. However, the sulphates values are less spread than values of other variables.

ggplot(rw.sulphates_by_quality)  +
  geom_jitter(data = rw, aes(x = sulphates, y = quality,
              color = quality.category), width = 0, alpha = 0.5) + 
  geom_line(aes(x = sulphates_mean, y = quality), color = "blue") +
  geom_line(aes(x = sulphates_median, y = quality), color = "red") +
  xlab("Sulphates Mean (blue) and Median (red)") +
  ylab("Red Wine Quality Level") +
  ggtitle("Sulphates Mean and Median by Quality")

alcohol & quality

cor.test(rw$alcohol,rw$quality)

## 
##  Pearson's product-moment correlation
## 
## data:  rw$alcohol and rw$quality
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663

The correlation between alcohol and quality is positive and moderate. It is the strongest correlation identified between the four differentiating features and quality.

rw.alcohol_by_quality <- rw %>%
  group_by(quality) %>%
  summarize(alcohol_mean = mean(alcohol),
            alcohol_median = median(alcohol),
            number_of_obs = n()) %>%
  arrange(quality)
head(rw.alcohol_by_quality)

## # A tibble: 6 x 4
##   quality alcohol_mean alcohol_median number_of_obs
##     <int>        <dbl>          <dbl>         <int>
## 1       3         9.96           9.93            10
## 2       4        10.3           10               53
## 3       5         9.90           9.7            681
## 4       6        10.6           10.5            638
## 5       7        11.5           11.5            199
## 6       8        12.1           12.2             18

Looking at the average mean and median values of alcohol per each quality rating, we can see a pattern showing that only alcohol level above 12 gives more certainty that the sample will be considered as of Medium or High quality. If the alcohol level goes below 10 a sample will most likely be considered as of a Medium or Low quality.

ggplot(rw.alcohol_by_quality)  +
  geom_jitter(data = rw, aes(x = alcohol, y = quality,
              color = quality.category), width = 0, alpha = 0.5) + 
  geom_line(aes(x = alcohol_mean, y = quality), color = "blue") +
  geom_line(aes(x = alcohol_median, y = quality), color = "red") +
  xlab("Alcohol Mean (blue) and Median (red)") +
  ylab("Red Wine Quality Level") +
  ggtitle("Alcohol Mean and Median by Quality")

So far, in this phase of EDA I have confirmed the following:

Less volatile.acidity in a sample results in higher wine quality,
The bigger citric.acid level is in a sample on average the better quality of the sample is. The samples with citric.acid level above 0.5 will almost never be classsified as of Low quality,
The bigger sulphates level is in a sample on average the better quality of the sample is. However, the sulphates values are less spread than values of other variables,
Only alcohol level above 12 gives more certainty that the sample will be considered as of Medium or High quality. If the alcohol level goes below 10 a sample will most likely be considered as of a Medium or Low quality.

All of these are clearly visible in the following plot.

draw_main_quality_corrs <- function(dataframe, variable, title)
{
  plot <- ggplot(data = dataframe, aes(x = variable, y = quality.category)) + 
    geom_point(aes(color = quality.category), alpha = 1/4, position = 'jitter') +
    ggtitle(title) +
    xlab('') +
    ylab('Quality') +
    theme(legend.position = "none")
  return(plot)
}

draw_biv_summary <- function()
{
  grid.arrange(draw_main_quality_corrs(rw, rw$volatile.acidity, expression(Volatile~Acidity~(g/dm^{3}))),
             draw_main_quality_corrs(rw, rw$citric.acid, expression(Citric~Acid~(g/dm^{3}))),
             draw_main_quality_corrs(rw, rw$sulphates, expression(Sulphates~(g/dm^{3}))),
             draw_main_quality_corrs(rw, rw$alcohol, 'Alcohol (% by Volume)'),
             ncol = 2,
             top = 'Quality and Features Correlation by Category')
}

draw_biv_summary()

The last thing to check is the relation between volatile.acidity, alcohol and quality.

volatile.acidity, alcohol & quality

A scatterplot with local polynomial regression fitting line should smoth the plot.

# Plot a scatter plot
ggplot(data = rw, aes(x = volatile.acidity, y = alcohol, color = quality.category)) + 
  geom_point(alpha = 0.6) +
  # Use local polynomial regression fitting line
  geom_smooth(method="loess", col="firebrick", size=1)

draw_muv_summary <- function()
{
  ggplot(data = rw, aes(x = volatile.acidity, y = alcohol)) + 
    geom_jitter(aes(color = quality.category, bg = quality.category), alpha = 1/2, pch = 21, cex = 2) +
    facet_wrap(~quality.category) +
    scale_color_brewer(type = 'div') +
    xlab(expression(Volatile~Acidity~(g/dm^{3}))) + 
    ylab("Alcohol (% by Volume)") +
    ggtitle('Volatile.acidity and Alcohol by Quality Category')
}

draw_muv_summary()

ggplot(data = rw, aes(x = volatile.acidity, y = alcohol)) + 
  geom_jitter(aes(color = quality.category, bg = quality.category), alpha = 1/2, pch = 21, cex = 2) +
  facet_wrap(~quality) +
  scale_color_brewer(type = 'div') +
  ggtitle('Volatile.acidity and Alcohol Relationship by Quality Level')

It is interesting to notice that high quality red wines are the ones having rather lower level of volatile.acidity and higher level of alcohol while low quality red wines tend to have move volatile.acidity and less alcohol levels. It is worth mentioning that the low category data is very much dispersed and low amount of observations (comparing to medium category for example) may influence limited clarity of the results.

Multivariate Exploration Summary

To summarize:

Medium quality red wines have alcohol level between 9-12 and volatile.acidity level between 0.2-0.8.
High quality red wines have alcohol level between 10-14 and volatile.acidity level between 0.2-0.6.
Low quality red wines have alcohol level between 9-12 and volatile.acidity level between 0.4-1.2.

draw_muv_summary()

Final Plots and Summary

The goal of this EDA was to understand better what red wine features may have most impact on red wine good or bad quality.

The Red Wine dataset was tidy and no data munging activities were necessary, nevertheless the number of observations in each quality interval was not balanced with the rest. Most observations are in the Medium category (1319), 217 are in the High category and 63 in the Low category. The low number of observations in the Low and High categories might influence quality of our analysis.

In order to find biggest red wine quality influencers a univariate analysis was performed and four features with the highest variability by quality category were selected.

draw_univ_summary()

The four features: volatile.acidity, citric.acid, sulphates and alcohol showing the biggest variability were used in further analysis.

Based on a correlation matrix, correlations between quality and the four features selected during univariate exploration phase, and a negative correlation between volatile.acidity and citric.acid were chosen as a next steps focus.

draw_biv_summary()

The second phase of the analysis revealed that:

Less volatile.acidity in a sample results in higher wine quality,
The bigger citric.acid level is in a sample on average the better quality of the sample is. The samples with citric.acid level above 0.5 will almost never be classsified as of Low quality,
The bigger sulphates level is in a sample on average the better quality of the sample is. However, the sulphates values are less spread than values of other variables,
Only alcohol level above 12 gives more certainty that the sample will be considered as of Medium or High quality. If the alcohol level goes below 10 a sample will most likely be considered as of a Medium or Low quality.

draw_muv_summary()

In addition (as depicted on the plot above):

Medium quality red wines have alcohol level between 9-12 and volatile.acidity level between 0.2-0.8.
High quality red wines have alcohol level between 10-14 and volatile.acidity level between 0.2-0.6.
Low quality red wines have alcohol level between 9-12 and volatile.acidity level between 0.4-1.2.

Reflection

This analysis was done in a top-down manner, where through elimination most significant wine quality features were selected leading to some final conclusions. It is one of the possible approaches and the author of this EDA does not claim that the approach taken in this EDA is the best one.

There were some challenges spotted while performing the analysis:

Unballanced number of samples belonging to different quality levels. This problem was minimized (but not entirely) by grupping samples into three categories and normalization (ratios) while comparing samples distributions among different quality categories.
To avoid misleading visualization and overplotting due to dataset size difference, different alpha levels were used in scatterplots for better transparency.
A wine subject matter expert quidance would result in a more directional analysis and maybe better results in terms of a practical use.

In the future, there could be more features added (grown country, weather conditions, wine making process specifics, etc.) to the dataset and sample number for low and high wine quality should be balanced with the medium level for better results.