A few questions always come up when splitting data in machine learning:
(1).Why only three partitions (training, validation, test)?
(2).What is the difference between the test set and the validation set?
In machine learning, the study and construction of algorithms that can learn from and make predictions on data is a common task. Such algorithms work by making data-driven predictions or decisions, through building a mathematical model from input data.
The data used to build the final model usually comes from multiple datasets. In particular, three datasets are commonly used in different stages of the creation of the model.
The model is initially fit on a training dataset, which is a set of examples used to fit the parameters of the model (e.g. the weights of connections between neurons in an artificial neural network). The model (e.g. a neural net or a naive Bayes classifier) is trained on the training dataset using a supervised learning method, for example with an optimization method such as gradient descent or stochastic gradient descent. In practice, the training dataset often consists of pairs of an input vector and the corresponding answer vector or scalar, which is commonly denoted as the target. The current model is run on each input vector in the training dataset and produces a result, which is then compared with the target. Based on the result of the comparison and the specific learning algorithm being used, the parameters of the model are adjusted. Model fitting can include both feature selection and parameter estimation.
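The loop described above (run the model, compare with the target, adjust the parameters) can be sketched in a few lines. This is only a minimal illustration, assuming a one-variable linear model fit by batch gradient descent on mean squared error; the toy data, the function name `fit_linear`, and the learning rate are hypothetical choices made for this example.

```python
# Toy training set (hypothetical): (input, target) pairs generated from y = 2x + 1.
train_pairs = [(x, 2.0 * x + 1.0) for x in [0.0, 1.0, 2.0, 3.0, 4.0]]


def fit_linear(pairs, learning_rate=0.05, epochs=2000):
    """Fit y = w*x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0  # the model parameters to be learned
    n = len(pairs)
    for _ in range(epochs):
        grad_w, grad_b = 0.0, 0.0
        for x, target in pairs:
            prediction = w * x + b       # run the current model on the input
            error = prediction - target  # compare the result with the target
            grad_w += 2.0 * error * x / n
            grad_b += 2.0 * error / n
        # Adjust the parameters based on the comparison.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


w, b = fit_linear(train_pairs)
print(f"w = {w:.4f}, b = {b:.4f}")
```

The learning rate and the number of epochs are set by hand here rather than learned from the data, which makes them hyperparameters in the sense used later in this post.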
Subsequently, the fitted model is used to predict the responses for the observations in a second dataset called the validation dataset. The validation dataset provides an unbiased evaluation of a model fit on the training dataset while tuning the model’s hyperparameters (e.g. the number of hidden units in a neural network). The validation dataset can also be used for regularization by early stopping: stop training when the error on the validation dataset increases, as this is a sign of overfitting to the training dataset. This simple procedure is complicated in practice by the fact that the validation dataset’s error may fluctuate during training, producing multiple local minima. This complication has led to the creation of many ad-hoc rules for deciding when overfitting has truly begun.
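One common ad-hoc rule of this kind is "patience": keep training until the validation error has failed to improve for a fixed number of consecutive epochs, then fall back to the best epoch seen. The sketch below is an illustration under that assumption, not a rule prescribed by the text above; the function name `early_stopping_epoch` and the validation error values are hypothetical.

```python
def early_stopping_epoch(val_errors, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation errors.

    Training stops once the validation error has not improved for
    `patience` consecutive epochs; the model from `best_epoch` is kept.
    """
    best_error = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, error in enumerate(val_errors):
        if error < best_error:
            best_error = error
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    # Never triggered: training ran through every recorded epoch.
    return best_epoch, len(val_errors) - 1


# Hypothetical validation errors with a local minimum at epoch 1.
val_errors = [0.90, 0.70, 0.72, 0.60, 0.55, 0.57, 0.56, 0.58, 0.61]

print(early_stopping_epoch(val_errors, patience=1))
print(early_stopping_epoch(val_errors, patience=3))
```

With `patience=1` the rule stops at the first increase and keeps the model from the local minimum at epoch 1, while `patience=3` rides out the fluctuation and reaches the lower error at epoch 4. The patience value is itself a hyperparameter.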
Finally, the test dataset is a dataset used to provide an unbiased evaluation of a final model fit on the training dataset.
1.Training dataset
A training dataset is a dataset of examples used for learning, that is, to fit the parameters (e.g. weights) of, for example, a classifier.
Most approaches that search through training data for empirical relationships tend to overfit the data, meaning that they can identify apparent relationships in the training data that do not hold in general.
2.Test dataset
A test dataset is a dataset that is independent of the training dataset, but that follows the same probability distribution as the training dataset. If a model fit to the training dataset also fits the test dataset well, minimal overfitting has taken place. A noticeably better fit on the training dataset than on the test dataset usually points to overfitting.
A test set is therefore a set of examples used only to assess the performance (i.e. generalization) of a fully specified classifier.
3.Validation dataset
A validation dataset is a set of examples used to tune the hyperparameters (e.g. the architecture) of a classifier. In artificial neural networks, a hyperparameter is, for example, the number of hidden units. The validation dataset, like the test set (as mentioned above), should follow the same probability distribution as the training dataset.
In order to avoid overfitting, when any classification parameter needs to be adjusted, it is necessary to have a validation dataset in addition to the training and test datasets. For example, if the most suitable classifier for the problem is sought, the training dataset is used to train the candidate algorithms, the validation dataset is used to compare their performances and decide which one to take and, finally, the test dataset is used to obtain the performance characteristics such as accuracy, sensitivity, specificity, F-measure, and so on. The validation dataset functions as a hybrid: it is training data used for testing, but neither as part of the low-level training nor as part of the final testing.
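The three roles can be illustrated with a small model-selection sketch. It assumes a 1-D k-nearest-neighbour classifier whose hyperparameter is k; the classifier, the helper names `knn_predict` and `accuracy`, and the hand-written toy data are all hypothetical and chosen only to keep the example short.

```python
def knn_predict(train, x, k):
    """Predict the majority label (0 or 1) among the k training points nearest to x."""
    neighbours = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    votes = sum(label for _, label in neighbours)
    return 1 if 2 * votes > k else 0


def accuracy(train, dataset, k):
    """Fraction of examples in `dataset` classified correctly using `train` and k."""
    correct = sum(knn_predict(train, x, k) == label for x, label in dataset)
    return correct / len(dataset)


# Hypothetical (input, label) pairs; the three sets are disjoint.
train = [(0.1, 0), (0.4, 0), (0.9, 0), (1.2, 1), (1.6, 0), (2.1, 1), (2.5, 1), (3.0, 1)]
validation = [(0.3, 0), (1.0, 0), (1.4, 0), (2.2, 1), (2.8, 1)]
test = [(0.2, 0), (0.8, 0), (1.9, 1), (2.6, 1)]

# 1. The training set "fits" each candidate (for k-NN, fitting means storing the data).
# 2. The validation set compares the candidates and picks the hyperparameter.
candidate_ks = [1, 3, 5]
val_scores = {k: accuracy(train, validation, k) for k in candidate_ks}
best_k = max(candidate_ks, key=lambda k: val_scores[k])

# 3. The test set is touched once, only to report the chosen model's performance.
test_accuracy = accuracy(train, test, best_k)
print(val_scores, best_k, test_accuracy)
```

Because `best_k` was chosen to maximize the validation score, that score is an optimistic estimate of performance, which is why the final number is taken from the untouched test set.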
4.Significance
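Generally speaking, machine learning involves two kinds of parameters: ordinary parameters and hyperparameters. In a neural network, the ordinary parameters are the weights, while the hyperparameters include the number of hidden units, the number of training iterations, and so on. We usually split the dataset into three parts: a training set, a validation set, and a test set, with 8:1:1 being a common ratio. The three sets are disjoint and follow the same distribution. The training set is used to learn the ordinary parameters. The validation set is used to check the accuracy of the learned model and to tune its hyperparameters, which in a neural network may be the number of hidden units or the structure of the network (model selection). The test set is used for a final performance check of the model selected on the validation set.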
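An 8:1:1 split into three disjoint sets can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `split_dataset`, the fixed seed, and the use of a simple random shuffle (which assumes the examples are independent and identically distributed) are assumptions made for this example.

```python
import random


def split_dataset(examples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle the examples and split them into train, validation and test lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is reproducible
    n = len(shuffled)
    n_train = round(n * ratios[0])
    n_val = round(n * ratios[1])
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains goes to the test set
    return train, validation, test


train, validation, test = split_dataset(range(100))
print(len(train), len(validation), len(test))
```

Slicing one shuffled list guarantees that the three sets do not overlap, and shuffling before slicing is what makes it plausible that they follow the same distribution.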
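For example, when training a neural network, the validation set does not take part in the training process itself. However, when choosing hyperparameters we adjust things such as the number of iterations based on the results on the validation set, so from this point of view the validation set does take part in the tuning process.
Finally, in a competition the organizers usually provide only a training set and an unlabeled test set. Because the training set they provide is generally not very large, there is usually no need to split off another test set from it.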
Reference: Wikipedia