The Wine Quality dataset consists of red wine samples. The inputs include objective tests (e.g. pH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent).
The dataset input variables (based on physicochemical tests) are:
The output variable (based on sensory data) is:
The goal of this analysis is to understand better what red wine features (variables 1-11) may have most impact on red wine good or bad quality (variable 12).
Let’s start with learning more about Red Wine dataset.
## [1] 1599 12
There are 1599 observations and 12 variables in the dataset.
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
There are only numerical values in the dataframe. Quality variable is the only one of integer type. All column names are consistent.
The dataset variables summary can be found below:
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
The values ranges of different variables vary significantly and may require normalization in later phases of the analysis.
The data set is tidy and does not contain missing values:
## fixed.acidity volatile.acidity citric.acid
## 0 0 0
## residual.sugar chlorides free.sulfur.dioxide
## 0 0 0
## total.sulfur.dioxide density pH
## 0 0 0
## sulphates alcohol quality
## 0 0 0
The following histogram matrix depics distributions of all variables in the dataset.
The histograms reveal that density and pH are normally disributed but the rest of variables are more or less right skewed (long-tailed). The quality dependent variable has a semi-normal discrete distribution.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
According to dataset documentation, the quality value should be between 0-10. In the dataset (based on the histogram plot and numerical summary), the minimum quality value is 3 and maximum is 8. The majority of red wines (more than 1200) have either 5 or 6 ranking for their quality. About 200 red wines have quality ranking as 7, but the number is not as significant as 5 and 6. The rest are either 3, 4 or 8. The average (mean) quality of red wines in the database is 5.63 and median is 6.
As we want to find variables differentiating most good or bad wines, it might be a good idea to bin quality variable into 3 categories (thus converting it into a factor variable). This may reduce comparison complexity and improve readability of the results.
The quality variable will be split into 3 bins: low [0-5), medium [5-7) and high [7-10].
## [0,5) [5,7) [7,10]
## 63 1319 217
##
## Low Medium High
## 63 1319 217
The number of observations in each quality interval is not balanced with the rest. Most observations are in the Medium category (1319), 217 are in the High category and 63 in the Low category. The low number of observations in the Low and High categories may influence quality of our analysis.
Because the dataset is not balanced in terms of number of observation for each quality category, instead of number of observation, a proportion will be used when comparing samples belonging to different quality categories (normalization).
In order to find biggest differentiators (variables having much more influence on differentiation of wine quality then others) let’s look at each variable distributions and density separately and see how observations fall into bad and good wine categories. To distinguish the best candidates for further analysis, low and high guality intervals should be investigated first.
For each dependent variable I will compute and display:
Based on different views I will identify four biggest differentiators (based on their variability regarding low and high quality) and perform deeper analysis on how they relate to the quality level and each other.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
The distribution of fixed.acidity data is lightly right skewed with minimum value of 4.60, maximum of 15.9 and median of 7.90 and mean of 8.32. Let’s see how many observations fall in each quality category.
## rw$quality.category: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.600 6.800 7.500 7.871 8.400 12.500
## --------------------------------------------------------
## rw$quality.category: Medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.700 7.100 7.800 8.254 9.100 15.900
## --------------------------------------------------------
## rw$quality.category: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.900 7.400 8.700 8.847 10.100 15.600
The distribution shapes for low and high categories looks similar, but the high distribution tends to be moved to the right. This kind of a diagram cannot show precisely how far the means of low and high distributions are from each other.
Let’s compare the shapes of the quality category distributions using frequency polygons and density plots to see if we can reveal more.
Having all distributions stacked on top of each other, we can more easily see the differences in their shape but is is still hard to reason about the means.
Using box plots allows us to combine all required information in a visual way.
The low and high means are not far from each other and their IQRs overlap (see also summary statistics per category above). The variability between low and high quality categories is moderate comparing to other variables and this feature will not be investigated later.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
The distribution of volatile.acidity data is almost normal with minimum value of 0.12, maximum of 1.58 and median of 0.52 and mean of 0.5278.
## rw$quality.category: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2300 0.5650 0.6800 0.7242 0.8825 1.5800
## --------------------------------------------------------
## rw$quality.category: Medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1600 0.4100 0.5400 0.5386 0.6400 1.3300
## --------------------------------------------------------
## rw$quality.category: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3000 0.3700 0.4055 0.4900 0.9150
The low and high means are far from each other and even their IQRs do not overlap (see also summary statistics per category above). The variability between low and high quality categories is high comparing to other variables and this feature will have to be investigated later.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
The distribution of citric.acid data is right skewed with minimum value of 0, maximum of 1 (outlayer) and median of 0.26 and mean of 0.271.
## rw$quality.category: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0200 0.0800 0.1737 0.2700 1.0000
## --------------------------------------------------------
## rw$quality.category: Medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0900 0.2400 0.2583 0.4000 0.7900
## --------------------------------------------------------
## rw$quality.category: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.3000 0.4000 0.3765 0.4900 0.7600
The low and high means are very far from each other and their IQRs do not overlap (see also summary statistics per category above). The variability between low and high quality categories is high comparing to other variables and this feature will have to be investigated later.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
The distribution of residual.sugar data is right skewed (long-tailed) with minimum value of 0.9, maximum of 15.5 (many outlayers) and median of 2.2 and mean of 2.539.
## rw$quality.category: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.200 1.900 2.100 2.685 2.950 12.900
## --------------------------------------------------------
## rw$quality.category: Medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.504 2.600 15.500
## --------------------------------------------------------
## rw$quality.category: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.200 2.000 2.300 2.709 2.700 8.900
The low and high means are very close to each other and their IQRs overlap (see also summary statistics per category above). The variability between low and high quality categories is minimum comparing to other variables and this feature will not be investigated later.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
The distribution of chlorides data is also right skewed (long-tailed) and very similar to distribution of residual.sugar with minimum value of 0.012, maximum of 0.611 (many outlayers) and median of 0.079 and mean of 0.087.
## rw$quality.category: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04500 0.06850 0.08000 0.09573 0.09450 0.61000
## --------------------------------------------------------
## rw$quality.category: Medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03400 0.07100 0.08000 0.08897 0.09100 0.61100
## --------------------------------------------------------
## rw$quality.category: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.06200 0.07300 0.07591 0.08500 0.35800
Similarily to residual.sugar, the low and high means are very close to each other and their IQRs overlap (see also summary statistics per category above). The variability between low and high quality categories is minimum comparing to other variables and this feature will not be investigated later.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
The distribution of free.sulfur.dioxide data is right skewed with minimum value of 1, maximum of 72 and median of 14 and mean of 15.87.
## rw$quality.category: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 5.00 9.00 12.06 15.50 41.00
## --------------------------------------------------------
## rw$quality.category: Medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 8.00 14.00 16.37 22.00 72.00
## --------------------------------------------------------
## rw$quality.category: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 6.00 11.00 13.98 18.00 54.00
Similarily to residual.sugar or chlorides, the low and high means are very close to each other and their IQRs overlap (see also summary statistics per category above), but in this case the data is much more dispersed. The variability between low and high quality categories is low comparing to other variables and this feature will not be investigated later.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
The distribution of total.sulfur.dioxid data is right skewed (long-tailed) and very similar to distribution of free.sulfur.dioxide with minimum value of 6, maximum of 289 (outlayers) and median of 38 and mean of 46.47.
## rw$quality.category: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 13.50 26.00 34.44 48.00 119.00
## --------------------------------------------------------
## rw$quality.category: Medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 24.00 40.00 48.95 65.00 165.00
## --------------------------------------------------------
## rw$quality.category: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 17.00 27.00 34.89 43.00 289.00
Similarily to free.sulfur.dioxide, the low and high means are very close to each other and their IQRs overlap (see also summary statistics per category above), but in this case the data is much less dispersed. The variability between low and high quality categories is low comparing to other variables and this feature will not be investigated later.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
The distribution of density data is normal with minimum value of 0.9901, maximum of 1.0037 and median of 0.9968 and mean of 0.9967.
## rw$quality.category: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9934 0.9957 0.9966 0.9967 0.9977 1.0010
## --------------------------------------------------------
## rw$quality.category: Medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9958 0.9968 0.9969 0.9979 1.0037
## --------------------------------------------------------
## rw$quality.category: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9906 0.9947 0.9957 0.9960 0.9973 1.0032
The low and high means are close to each other and their IQRs overlap (see also summary statistics per category above). The variability between low and high quality categories is low comparing to other variables and this feature will not be investigated later.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
The distribution of pH data is normal with minimum value of 2.740, maximum of 4.010 and median of 3.310 and mean of 3.311.
## rw$quality.category: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.300 3.380 3.384 3.500 3.900
## --------------------------------------------------------
## rw$quality.category: Medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.860 3.210 3.310 3.311 3.400 4.010
## --------------------------------------------------------
## rw$quality.category: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.880 3.200 3.270 3.289 3.380 3.780
The low and high means are not too close to each other but their IQRs overlap (see also summary statistics per category above). The variability between low and high quality categories is semi-low comparing to other variables and this feature will not be investigated later.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
The distribution of sulphates data is right skewed (long-tailed) with minimum value of 0.33, maximum of 2 and median of 0.62 and mean of 0.6581.
## rw$quality.category: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.4950 0.5600 0.5922 0.6000 2.0000
## --------------------------------------------------------
## rw$quality.category: Medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3700 0.5400 0.6100 0.6473 0.7000 1.9800
## --------------------------------------------------------
## rw$quality.category: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3900 0.6500 0.7400 0.7435 0.8200 1.3600
The low and high means are far from each other and their IQRs do not overlap (see also summary statistics per category above). The variability between low and high quality categories is high comparing to other variables and this feature will have to be investigated later.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The distribution of alcohol data is right skewed but does not have many outliers with minimum value of 8.4, maximum of 14.9 and median of 10.2 and mean of 10.42.
## rw$quality.category: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.60 10.00 10.22 11.00 13.10
## --------------------------------------------------------
## rw$quality.category: Medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.00 10.25 10.90 14.90
## --------------------------------------------------------
## rw$quality.category: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.60 11.52 12.20 14.00
The low and high means are very far from each other and their IQRs do overlap only slightly being very dispersed (see also summary statistics per category above). The variability between low and high quality categories is very high comparing to other variables and this feature will have to be investigated later.
For every variable distribution and density differences were explored from different perspectives: numerical summary statistics, histograms, frequency polygons, density distributions and boxplots. The different perspectives complement each other giving most complete view on each variable and their variability related to red wine quality (biggest distribution difference for Low and High red wine quality categories).
It seems that volatile.acidity, citric.acid, sulphates and alcohol show the biggest variability (difference in means and IQR ranges) and these variables will be used in further analysis.
Just for curiosity, let’s see what variables are correlated with each other (if any).
The correlation matrix shows that fixed.acidity is highly positively correlated with density and citric.acid. total.sulfur.dioxide is highly positively correlated with free.sulful.dioxide. pH is highly negatively correlated with fixed.acidity. citric.acid is correlated negatively with volatile.acidity and pH. We will not be investigating most of these correlations as most of the correlated features were eliminated earlier from the analysis as less relevant for meeting this EDA objective.
Next, I will focus on negative correlation between volatile.acidity and citric.acid, and also on correlations between quality and the four features selected during univariate exploration phase.
Let’s start with a closer look at how the four features selected for the next round of analysis relate to each other and to quality.
The above correlation matrix between variables selected for further analysis and the quality variable reveals that:
The strongest correlation found is between volatile.acidity and citric.acid (cerrelation coefficient = -0.552) nevertheless it will not explain much which red wine sample features influence (low and high) wine quality.
Let’s look deeper into quality versus selected features to see how they influence red wine quality. Scatter plots with some overplotting reduction as well as grouping and summarizing features by quality category should help in confirming initial findings and in revealing new ones.
##
## Pearson's product-moment correlation
##
## data: rw$volatile.acidity and rw$quality
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4313210 -0.3482032
## sample estimates:
## cor
## -0.3905578
The negative correlation between volatile.acidity and quality is weak-to-moderate and this can be noticed in the plot, where one could argue if less volatile.acidity in a sample results in higher wine quality. More samples in the dataset in low and high quality categories could help in clarification.
How does the average volatile.acidity varies with quality?
## # A tibble: 6 x 4
## quality volatile_acidity_mean volatile_acidity_median number_of_obs
## <int> <dbl> <dbl> <int>
## 1 3 0.884 0.845 10
## 2 4 0.694 0.67 53
## 3 5 0.577 0.580 681
## 4 6 0.497 0.49 638
## 5 7 0.404 0.37 199
## 6 8 0.423 0.37 18
Looking at the average mean and median values of volatile.acidity per each quality rating, we can see that there is a clear pattern showing that the less volatile.acidity is in a sample on average the better quality of the sample is. The plot below confirms this visually.
##
## Pearson's product-moment correlation
##
## data: rw$citric.acid and rw$quality
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1793415 0.2723711
## sample estimates:
## cor
## 0.2263725
The correlation between citric.acid and quality is positive and weak.
## # A tibble: 6 x 4
## quality citric_acid_mean citric_acid_median number_of_obs
## <int> <dbl> <dbl> <int>
## 1 3 0.171 0.035 10
## 2 4 0.174 0.09 53
## 3 5 0.244 0.23 681
## 4 6 0.274 0.26 638
## 5 7 0.375 0.4 199
## 6 8 0.391 0.42 18
Looking at the average mean and median values of citric.acid per each quality rating, we can see a pattern showing that the bigger citric.acid level is in a sample on average the better quality of the sample is. Furthermore, as clearly shown on the plot below, the samples with citric.acid level above 0.5 will almost always be classsified as of Medium or High quality.
##
## Pearson's product-moment correlation
##
## data: rw$sulphates and rw$quality
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2049011 0.2967610
## sample estimates:
## cor
## 0.2513971
The correlation between sulphates and quality is positive and weak similarily to the correlation between citric.acid and quality.
## # A tibble: 6 x 4
## quality sulphates_mean sulphates_median number_of_obs
## <int> <dbl> <dbl> <int>
## 1 3 0.57 0.545 10
## 2 4 0.596 0.56 53
## 3 5 0.621 0.580 681
## 4 6 0.675 0.64 638
## 5 7 0.741 0.74 199
## 6 8 0.768 0.74 18
Looking at the average mean and median values of sulphates per each quality rating, we can see that the bigger sulphates level is in a sample on average the better quality of the sample is. However, the sulphates values are less spread than values of other variables.
##
## Pearson's product-moment correlation
##
## data: rw$alcohol and rw$quality
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
The correlation between alcohol and quality is positive and moderate. It is the strongest correlation identified between the four differentiating features and quality.
## # A tibble: 6 x 4
## quality alcohol_mean alcohol_median number_of_obs
## <int> <dbl> <dbl> <int>
## 1 3 9.96 9.93 10
## 2 4 10.3 10 53
## 3 5 9.90 9.7 681
## 4 6 10.6 10.5 638
## 5 7 11.5 11.5 199
## 6 8 12.1 12.2 18
Looking at the average mean and median values of alcohol per each quality rating, we can see a pattern showing that only alcohol level above 12 gives more certainty that the sample will be considered as of Medium or High quality. If the alcohol level goes below 10 a sample will most likely be considered as of a Medium or Low quality.
So far, in this phase of EDA I have confirmed the following:
All of these are clearly visible in the following plot.
The last thing to check is the relation between volatile.acidity, alcohol and quality.
A scatterplot with local polynomial regression fitting line should smoth the plot.
It is interesting to notice that high quality red wines are the ones having rather lower level of volatile.acidity and higher level of alcohol while low quality red wines tend to have move volatile.acidity and less alcohol levels. It is worth mentioning that the low category data is very much dispersed and low amount of observations (comparing to medium category for example) may influence limited clarity of the results.
To summarize:
The goal of this EDA was to understand better what red wine features may have most impact on red wine good or bad quality.
The Red Wine dataset was tidy and no data munging activities were necessary, nevertheless the number of observations in each quality interval was not balanced with the rest. Most observations are in the Medium category (1319), 217 are in the High category and 63 in the Low category. The low number of observations in the Low and High categories might influence quality of our analysis.
In order to find biggest red wine quality influencers a univariate analysis was performed and four features with the highest variability by quality category were selected.
The four features: volatile.acidity, citric.acid, sulphates and alcohol showing the biggest variability were used in further analysis.
Based on a correlation matrix, correlations between quality and the four features selected during univariate exploration phase, and a negative correlation between volatile.acidity and citric.acid were chosen as a next steps focus.
The second phase of the analysis revealed that:
In addition (as depicted on the plot above):
This analysis was done in a top-down manner, where through elimination most significant wine quality features were selected leading to some final conclusions. It is one of the possible approaches and the author of this EDA does not claim that the approach taken in this EDA is the best one.
There were some challenges spotted while performing the analysis:
In the future, there could be more features added (grown country, weather conditions, wine making process specifics, etc.) to the dataset and sample number for low and high wine quality should be balanced with the medium level for better results.