Wine quality prediction
The goal of this work is to predict the quality of red wine from physicochemical
parameters such as the pH value or the density of the wine. This makes it possible to
presort the wines so that expensive expert sommeliers do not have to taste every
substandard wine.
Implementation
We used RapidMiner. The following methods were tested:
1. Linear Regression: NAE = 0.746 +/- 0.038
2. Regression Tree: NAE = 0.745 +/- 0.041
3. Neural Net: NAE = 0.862 +/- 0.149
4. Model Tree: NAE = 0.743 +/- 0.035
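The figures above are easier to read with a definition of the metric at hand. The sketch below assumes that NAE stands for normalized absolute error, i.e. the summed absolute error of the model divided by the summed absolute error of always predicting the mean quality. This reading is an assumption, and the function name and sample values are hypothetical, not taken from the data set.

```python
from statistics import mean


def normalized_absolute_error(y_true, y_pred):
    """Absolute error of the model relative to a predict-the-mean baseline.

    Assumed definition: sum(|y - y_hat|) / sum(|y - mean(y)|).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    baseline = mean(y_true)
    model_error = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    baseline_error = sum(abs(t - baseline) for t in y_true)
    if baseline_error == 0:
        raise ValueError("all targets are identical, so the ratio is undefined")
    return model_error / baseline_error


# Hypothetical quality scores and predictions, for illustration only.
actual = [5, 6, 5, 7, 4]
predicted = [5.2, 5.8, 5.4, 6.1, 5.0]
nae = normalized_absolute_error(actual, predicted)
print(f"NAE = {nae:.3f}")
```

Under this definition a value of 1 means the model is no better than predicting the mean, and smaller values are better.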
Data source
We used the following data set: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties. In Decision
Support Systems, Elsevier, 47(4):547-553, 2009. It can be accessed
here: https://archive.ics.uci.edu/ml/datasets/Wine+Quality
2. Sources
Created by: Paulo Cortez (Univ. Minho), António Cerdeira, Fernando Almeida, Telmo
Matos and José Reis (CVRVV) @ 2009
3. Past Usage:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
In the above reference, two datasets were created, using red and white wine samples.
The inputs include objective tests (e.g. pH values) and the output is based on sensory
data (median of at least 3 evaluations made by wine experts). Each expert graded the
wine quality between 0 (very bad) and 10 (very excellent). Several data mining methods
were applied to model these datasets under a regression approach. The support vector
machine model achieved the best results. Several metrics were computed: MAD, confusion
matrix for a fixed error tolerance (T), etc. Also, we plot the relative importances of
the input variables (as measured by a sensitivity analysis procedure).
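As an illustration of the metrics mentioned above, the following sketch computes the mean absolute deviation (MAD) and the share of predictions that fall within an error tolerance T of the expert score. The second function is one plausible reading of the tolerance-based evaluation, not the exact procedure of the reference. All names and values are hypothetical.

```python
def mean_absolute_deviation(y_true, y_pred):
    """Average absolute difference between actual and predicted quality."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy_within_tolerance(y_true, y_pred, tolerance):
    """Fraction of predictions whose absolute error is at most `tolerance`."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) <= tolerance)
    return hits / len(y_true)


# Hypothetical quality scores and predictions, for illustration only.
actual = [5, 6, 5, 7, 4, 6]
predicted = [5.3, 5.6, 5.1, 6.2, 5.2, 6.0]

mad = mean_absolute_deviation(actual, predicted)
print(f"MAD = {mad:.3f}")
for t in (0.5, 1.0):
    share = accuracy_within_tolerance(actual, predicted, t)
    print(f"within T = {t}: {share:.2f}")
```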
4. Relevant Information:
These datasets can be viewed as classification or regression tasks.
The classes are ordered and not balanced (e.g. there are much more normal wines than
excellent or poor ones). Outlier detection algorithms could be used to detect the few
excellent or poor wines. Also, we are not sure if all input variables are relevant. So
it could be interesting to test feature selection methods.
5. Number of Instances: red wine - 1599; white wine - 4898.
6. Number of Attributes: 11 + output attribute
Note: several of the attributes may be correlated, thus it makes sense to apply some
sort of feature selection.
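A simple first step towards such a feature selection could be to flag pairs of attributes with a high absolute Pearson correlation. The sketch below is a minimal example under that assumption. The threshold of 0.8 and the column values are invented for illustration and are not taken from the data set.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("a constant column has no defined correlation")
    return cov / (norm_x * norm_y)


def correlated_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs


# Invented values for three of the eleven input attributes.
columns = {
    "free sulfur dioxide": [11, 25, 15, 17, 13],
    "total sulfur dioxide": [34, 67, 54, 60, 40],
    "pH": [3.51, 3.20, 3.26, 3.16, 3.51],
}
flagged = correlated_pairs(columns)
for a, b, r in flagged:
    print(f"{a} / {b}: r = {r:.2f}")
```

One attribute of each flagged pair is then a candidate for removal. Which one to drop remains a modelling decision.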
7. Attribute information:
For more information, read [Cortez et al., 2009].
Input variables (based on physicochemical tests):
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
8. Missing Attribute Values: None
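For experiments outside RapidMiner, the attributes listed above can be read with a few lines of Python. The sketch assumes a semicolon-separated file with a header row that uses the attribute names from the list. The helper name and the two sample rows are illustrative only.

```python
import csv
import io

FEATURES = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
    "pH", "sulphates", "alcohol",
]
TARGET = "quality"


def read_wine_rows(handle):
    """Read feature vectors and quality scores from a ';'-separated file."""
    reader = csv.DictReader(handle, delimiter=";")
    features, targets = [], []
    for row in reader:
        features.append([float(row[name]) for name in FEATURES])
        targets.append(int(row[TARGET]))
    return features, targets


# Illustrative sample in the assumed file layout (quoted header, ';' delimiter).
header = ";".join(f'"{name}"' for name in FEATURES + [TARGET])
SAMPLE = header + "\n" + "\n".join([
    "7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5",
    "7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5",
]) + "\n"

X, y = read_wine_rows(io.StringIO(SAMPLE))
print(len(X), "rows,", len(X[0]), "input attributes, quality scores:", y)
```

To read a downloaded file instead of the sample, pass a handle such as open("winequality-red.csv", newline="") to read_wine_rows. The file name is an assumption and may differ in your copy of the data.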