IMPORTANT: When importing the dataset in R, DO NOT uncheck the "Strings as factors" option. It is needed for the cross_validate function to operate correctly. Import the dataset as is.
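If you load the file with read.csv instead of the Import Dataset dialog, the same setting applies; a minimal sketch, assuming a file named M2017_train.csv (the file name here is just a placeholder):
# Keep strings as factors; in R 4.0 and later the default is FALSE, so set it explicitly.
M2017_train <- read.csv("M2017_train.csv", stringsAsFactors = TRUE)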
EDIT: There used to be an error while running the cross-validation function that said "new levels encountered in test". Reinstall the cross-validation package (if it was previously installed). Make sure the dataset is imported again with strings as factors set to true. Restart RStudio, and it should work. If it does not, remove the cross-validation package by using
remove.packages("CrossValidation")
Restart RStudio, install the package again, and it should work:
devtools::install_github("devanshagr/CrossValidation")
You can also see the documentation of this function inside RStudio by running:
?CrossValidation::cross_validate
Overfitting takes place when you have a high accuracy on the training dataset but a low accuracy on the test dataset. But how do you know whether you are overfitting, especially since you cannot determine the accuracy on the test dataset? That is where cross-validation comes into play.
Because we cannot determine the accuracy on the test dataset, we partition our training dataset into a train partition and a validation (testing) partition. We train our model (rpart or lm) on the train partition and test it on the validation partition. The partition is defined by the split ratio. If the split ratio is 0.7, 70% of the training dataset will be used for the actual training of your model (rpart or lm), and 30% will be used for validation (testing). The accuracy on this validation partition is called the cross-validation accuracy.
To know whether you are overfitting, compare the training accuracy with the cross-validation accuracy. If your training accuracy is high and your cross-validation accuracy is low, you are overfitting.
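For intuition, here is a rough manual sketch of that comparison, assuming a data frame M2017_train with the GRADE, SCORE, and PARTICIPATION columns used later in this post; the cross_validate function automates this, so treat it only as an illustration:
library(rpart)
set.seed(1)
# Hold out 30% of the training data as a validation partition (split ratio = 0.7).
n <- nrow(M2017_train)
train_idx <- sample(n, size = round(0.7 * n))
train_part <- M2017_train[train_idx, ]
valid_part <- M2017_train[-train_idx, ]
tree <- rpart(GRADE ~ SCORE + PARTICIPATION, data = train_part)
# Compare training accuracy with validation (cross-validation) accuracy.
mean(predict(tree, newdata = train_part, type = "class") == train_part$GRADE)
mean(predict(tree, newdata = valid_part, type = "class") == valid_part$GRADE)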
Cross-validation for Rpart
Install the cross-validation package using the following commands:
install.packages("devtools")
devtools::install_github("devanshagr/CrossValidation")
How the function works - The function takes the following arguments:
1) Data Frame
2) Decision Tree or rpart object
3) Number of iterations
4) Split Ratio
To call the function, first create a decision tree.
tree <- rpart(GRADE~SCORE+PARTICIPATION, data=M2017_train, control = rpart.control(minsplit=30))
And call the function:
CrossValidation::cross_validate(M2017_train, tree, 2, 0.8)
The way the function works is that it randomly partitions your data into training and validation. It constructs the following two decision trees on the training partition:
1. The tree that you pass to the function
2. The tree constructed on the set of all attributes
It then determines the accuracy of both trees on the validation partition and returns the accuracy values for both. The first column corresponds to the cross-validation accuracy of the tree that you pass; the second is the accuracy of the tree built on all attributes.
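As a rough illustration of what one iteration might do (this is not the package's actual code, and the all-attributes formula GRADE ~ . is an assumption), a single run with a split ratio of 0.8 could look like this:
library(rpart)
# Randomly split the data into an 80% training partition and a 20% validation partition.
idx <- sample(nrow(M2017_train), size = round(0.8 * nrow(M2017_train)))
train_part <- M2017_train[idx, ]
valid_part <- M2017_train[-idx, ]
# Tree 1: the formula of the tree you passed, refit on the training partition.
tree_subset <- rpart(GRADE ~ SCORE + PARTICIPATION, data = train_part)
# Tree 2: a tree constructed on the set of all attributes.
tree_all <- rpart(GRADE ~ ., data = train_part)
# Accuracy of both trees on the validation partition.
accuracy_subset <- mean(predict(tree_subset, valid_part, type = "class") == valid_part$GRADE)
accuracy_all <- mean(predict(tree_all, valid_part, type = "class") == valid_part$GRADE)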
The values in the first column (accuracy_subset) returned by the cross-validation function are the more important ones when it comes to detecting overfitting. If these values are much lower than the training accuracy you get, you are overfitting.
We would also want the values in accuracy_subset to be close to each other (in other words, to have low variance). If the values are quite different from each other, your model (or tree) has high variance, which is not desired.
The second column (accuracy_all) tells you what happens if you construct a tree based on all attributes. If these values are larger than accuracy_subset, you are probably leaving relevant attributes out of your tree.
Each iteration of cross-validation creates a different random partition of train and validation, and so you have possibly different accuracy values for every iteration.
Example of cross-validation output:
tree<-rpart(GRADE~SCORE+PARTICIPATION, data=M2017_train)
predictedGrade<-predict(tree, newdata=M2017_train, type="class")
training_accuracy<-mean(predictedGrade==M2017_train$GRADE)
training_accuracy
[1] 0.942789
CrossValidation::cross_validate(M2017_train, tree, 5, 0.7)
accuracy_subset accuracy_all
1 0.9087302 0.9087302
2 0.9087302 0.9087302
3 0.9246032 0.9246032
4 0.8849206 0.8849206
5 0.9126984 0.9126984
You can see that the cross-validation accuracies for the tree that was passed (accuracy_subset) are fairly high and close to our training accuracy of 94.2%. This means we are not overfitting. Also observe that accuracy_subset and accuracy_all have the same values, which means that the only relevant attributes are SCORE and PARTICIPATION, and adding more attributes doesn't make any difference to the tree. Finally, the values in accuracy_subset are reasonably close to each other, which means low variance.
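If you want to quantify how close those values are, and assuming the result of cross_validate can be stored as a data frame with the columns shown above (an assumption based on this sample output), something like the following would do:
cv <- CrossValidation::cross_validate(M2017_train, tree, 5, 0.7)
# Average cross-validation accuracy to compare against the training accuracy,
# and the standard deviation as a rough measure of variance.
mean(cv$accuracy_subset)
sd(cv$accuracy_subset)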