Ideal separation of a data set into Training, Validation and Testing

2

I would like to know if there is a rule of thumb recommendation for a Machine Learning problem to split a set of data into 3 sets: Training, Validation and Testing.

If yes, what would be the ideal split?

I'd also like to better understand what the difference is between the validation set and the test set, and why you need to have both.

asked by anonymous 02.02.2018 / 01:45

1 answer

3

In general, we randomly set aside about 70% for training, 15% for validation, and 15% for testing. But this varies a lot and depends on the problem: for example, when there is a temporal factor we cannot split randomly, and it is common to take different time periods for training, validation, and testing. Depending on the size of the dataset, these percentages may not even make sense.
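A minimal sketch of that random 70/15/15 split, using only the standard library (the function name and the fixed seed are my own choices, not from the question):

```python
import random

def train_val_test_split(data, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle a dataset and split it into ~70/15/15 train/validation/test."""
    items = list(data)
    random.Random(seed).shuffle(items)  # seeded so the split is reproducible
    n = len(items)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test

# With 100 examples this yields 70 / 15 / 15 items.
train, val, test = train_val_test_split(range(100))
```

In practice a library helper (e.g. scikit-learn's `train_test_split`, applied twice) does the same job; the point is simply that the three sets are disjoint and drawn at random.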

About your other question: why do we use both a validation set and a test set?

In general, we fit a large number of models and check the prediction error of each on the validation set; in the end we choose the model with the lowest validation error. The problem is that, because we fit many models, it is easy to end up with one that becomes too specific (overfitted) to the validation set and does not generalize to other data. So we keep a separate test set to estimate the prediction error of the chosen model and make sure it is not overfitted.
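The selection procedure above can be sketched as follows (the model names and error values are made-up placeholders, just to show the two distinct roles of the sets):

```python
# Hypothetical validation errors measured for several candidate models.
# The validation set is used many times: once per candidate.
val_errors = {"model_a": 0.21, "model_b": 0.18, "model_c": 0.25}

# Model selection: pick the candidate with the lowest validation error.
best = min(val_errors, key=val_errors.get)

# The test set is used only ONCE, on the chosen model, so its error is an
# honest estimate (hypothetical value below, for illustration).
test_error_of_best = 0.20
```

Because `best` was chosen by searching over the validation set, its validation error is optimistically biased; the single evaluation on the untouched test set is what you report.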

21.02.2018 / 12:00