advanced category---cut

cut(x, ...)

cut divides the range of x into intervals and codes the values in x according to which interval they fall in. The leftmost interval corresponds to level one, the next leftmost to level two and so on.
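
To see this coding on a small scale, here is a minimal sketch with made-up scores. The vector `scores` and the break points are assumptions chosen for illustration; they are not part of the wine data used below.

```r
# Made-up scores, used only for illustration
scores <- c(12, 48, 50, 75, 99)

# Two intervals: (0,50] and (50,100]
groups <- cut(scores, breaks = c(0, 50, 100))

levels(groups)      # "(0,50]"   "(50,100]"
as.integer(groups)  # 1 1 1 2 2: the leftmost interval is level one
```

Note that 50 lands in the first interval, because by default each interval is open on the left and closed on the right.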

Example

Sometimes we need to group values into intervals. For example, suppose I have scores from 1 to 100 and I want to analyze them. It doesn't really matter whether a student got 50 or 50.1 (either way, the student failed).

We want to group data with similar properties together so that it is easier to interpret. For this, we use the "cut" function.

Before choosing the intervals, we should first look at the range of the values that we want to deal with.

 wine <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/541WINE.csv")
 range(wine$PRICE)
 #The range is 0.5 to 909.0, so the intervals need to cover values from about 0.5 to about 909.

So now let's try to divide those values into five ranges, ideally 0-200, 200-400, 400-600, 600-800, 800-1000, by using "cut".

 wine <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/541WINE.csv")
 price <- cut(wine$PRICE,5)
 price
 # "5" is the number of intervals that we want to create. Given a single number, cut splits the range of the data into that many intervals of equal length.

But if we look at the output, it didn't divide the data the way we wanted. At the very least, we want the interval boundaries to be round numbers.
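
The reason is that the break points depend on the minimum and maximum of the data rather than on round numbers. A small sketch of this, using a made-up vector `prices` (an assumption for illustration) that spans the same range as reported above:

```r
# Made-up prices spanning 0.5 to 909, used only for illustration
prices <- c(0.5, 59, 102, 149, 909)

equal_width <- cut(prices, 5)
nlevels(equal_width)     # 5
as.integer(equal_width)  # 1 1 1 1 5

# Equally spaced points between the minimum and the maximum
seq(min(prices), max(prices), length.out = 6)
# 0.5 182.2 363.9 545.6 727.3 909.0
```

According to the R documentation, cut also moves the two outer limits outward by 0.1% of the range, so the first and last labels differ slightly from these values.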

So how can we create better intervals?

The pretty function can be used to compute rounder break points, which gives nicer default labels, but then it may not return the number of levels that are actually desired.

 wine <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/541WINE.csv")
 price <- cut(wine$PRICE,pretty(wine$PRICE,5))
 price
 #Here the data is still divided into five categories, but this time the boundaries look better.

We created new intervals, but the default labels are verbose and hard to interpret.

So how can we also rename the labels of the intervals at the same time?
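
The caveat about the number of levels can be checked without the data set, since pretty only needs the range. In this sketch, `rng` is an assumed stand-in for range(wine$PRICE) as reported above:

```r
# Assumed stand-in for range(wine$PRICE)
rng <- c(0.5, 909)

breaks_5 <- pretty(rng, 5)
breaks_5                # 0 200 400 600 800 1000
length(breaks_5) - 1    # 5 intervals, as requested

breaks_4 <- pretty(rng, 4)
length(breaks_4) - 1    # still 5 intervals, although about 4 were requested
```

The second argument of pretty is only a target, so it is worth checking the number of break points before supplying a fixed number of labels.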

 wine <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/541WINE.csv")
 price <- cut(wine$PRICE,pretty(wine$PRICE,5),labels=c("very low","low","medium","high","very high"))
 price
 #List the labels in order from the lowest interval to the highest, and give exactly one label per interval, so that each label corresponds to the right interval.
 levels(price)
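
A quick way to check that the labels landed on the intended intervals is to count the observations per label with table. This is a minimal sketch on a made-up vector `prices`, with the break points written out explicitly instead of coming from pretty; both are assumptions for illustration:

```r
# Made-up prices, used only for illustration
prices <- c(0.5, 59, 102, 149, 450, 909)

price_group <- cut(prices,
                   breaks = seq(0, 1000, by = 200),
                   labels = c("very low", "low", "medium", "high", "very high"))

# Number of observations in each labelled interval
table(price_group)
# very low: 4, low: 0, medium: 1, high: 0, very high: 1
```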

How can we use quantiles to create the intervals? In other words, we want each interval to hold the same amount of data rather than span the same range of values.

For example, I want to divide the price into 4 categories such that each category contains roughly the same number of observations.

 wine <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/541WINE.csv")
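 price <- cut(wine$PRICE,quantile(wine$PRICE,(0:4)/4),include.lowest=TRUE)
 price
 #"(0:4)/4" gives the probabilities 0, 1/4, 2/4, 3/4 and 1, so the quartiles separate the data into four categories with about the same amount of data each. include.lowest=TRUE keeps the minimum price in the first interval; without it, the minimum becomes NA. The intervals are: 0/4(0.5) - 1/4(59); 1/4(59) - 2/4(102); 2/4(102) - 3/4(149); 3/4(149) - 4/4(909);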
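
One detail is worth checking with quantile-based breaks: intervals are closed on the right by default, so a value equal to the lowest break falls outside the first interval unless include.lowest = TRUE is given. A small sketch on a made-up vector `x` (an assumption for illustration):

```r
# Made-up data, used only for illustration
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
q <- quantile(x, (0:4) / 4)   # 1.00 2.75 4.50 6.25 8.00

open_low   <- cut(x, q)                         # first interval is (1,2.75]
closed_low <- cut(x, q, include.lowest = TRUE)  # first interval is [1,2.75]

sum(is.na(open_low))  # 1: the minimum value was dropped
table(closed_low)     # 2 observations in each of the four intervals
```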

What if you don't wish to use any of the above methods to create intervals? We can create intervals with custom border values (user-defined intervals).

Say we want to create the price intervals 0-10, 10-100, 100-1000:

 wine <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/541WINE.csv")
 price <- cut(wine$PRICE,breaks=c(0,10,100,1000))
 #"breaks" sets the break points, that is, the boundaries between the intervals
 price

Now let's change the labels of the intervals as we did above:

 wine <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/541WINE.csv")
 price <- cut(wine$PRICE,breaks=c(0,10,100,1000),labels=c("low","medium","high"))
 price

This is how we can create user-defined intervals.
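
With user-defined breaks, it helps to know what happens at and beyond the boundaries. The sketch below uses a made-up vector `prices` (an assumption for illustration) together with the same break points as above. Values outside the breaks become NA, and the `right` argument of cut controls which side of each interval is closed.

```r
# Made-up prices, including boundary values and one value above the last break
prices <- c(0, 5, 10, 100, 1500)

# Default: intervals are (0,10], (10,100], (100,1000]
right_closed <- cut(prices, breaks = c(0, 10, 100, 1000))
as.integer(right_closed)  # NA 1 1 2 NA

# right = FALSE: intervals are [0,10), [10,100), [100,1000)
left_closed <- cut(prices, breaks = c(0, 10, 100, 1000), right = FALSE)
as.integer(left_closed)   # 1 1 2 3 NA
```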