subset

subset()

The subset() function returns the subsets of vectors, matrices, or data frames that meet given conditions.

Example

R offers many ways to subset data, and the subset() function is just one of them. subset() can be applied to vectors, matrices, and data frames.
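Before turning to data frames, here is a minimal sketch of subset() on a vector and on a matrix. The objects x and m and their values are made up for illustration and are not part of the Moody data set.

```r
# A numeric vector: keep only the elements greater than 3.
x <- c(1, 5, 2, 8, 3)
big <- subset(x, x > 3)
big

# A matrix with columns "a" and "b" (hypothetical values).
m <- matrix(1:8, nrow = 4, dimnames = list(NULL, c("a", "b")))

# Keep the rows where column "a" exceeds 2, and only column "b".
# For a matrix the row condition must be written out in full (m[, "a"]).
sub_m <- subset(m, m[, "a"] > 2, select = b)
sub_m
```

Note that for vectors and matrices the condition refers to the object explicitly; the shorthand of using bare column names in the condition applies to data frames.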

Here we focus on data frames. First, we will see how to subset rows using the subset() function.

The subset() function with a conditional statement lets us subset the data frame by rows. In the following example, we create a new data frame named sub_df that contains only the rows for which the value of the variable SCORE is greater than 50.

 moody_df <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/SmallMoody.csv")
 moody_df
 sub_df <- subset(moody_df, SCORE > 50)
 sub_df

The cool thing is, you don't need to provide the long qualified name (moody_df$SCORE) of the variable SCORE. R assumes that this variable belongs to the data frame you are trying to subset.
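To make the difference concrete, the sketch below compares subset() with plain bracket indexing. The data frame toy_df is a hypothetical stand-in with made-up values, used so the example runs without downloading the CSV file.

```r
# Hypothetical stand-in for the Moody data; the values are made up.
toy_df <- data.frame(
  STUDENTID = c(101, 102, 103, 104),
  SCORE     = c(45, 72, 50, 88),
  GRADE     = c("D", "B", "C", "A")
)

# subset() looks SCORE up inside toy_df, so the bare name is enough.
by_subset <- subset(toy_df, SCORE > 50)
by_subset

# Bracket indexing needs the qualified name toy_df$SCORE.
by_bracket <- toy_df[toy_df$SCORE > 50, ]
by_bracket
```

Both calls keep the same two rows here. The two forms can differ when the condition evaluates to NA for some rows: subset() drops those rows, while bracket indexing returns them as rows filled with NA.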

Let us now see how to subset by columns. We use the select argument to subset by column. In the example that follows, we select only the STUDENTID and GRADE columns into a new, subset data frame. We pass a vector with the names of the columns that we want to include in the subset.

 moody_df <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/SmallMoody.csv")
 moody_df
 sub_df <- subset(moody_df, select = c(STUDENTID, GRADE))
 sub_df

We have seen how to subset by rows and how to subset by columns. Let's now subset by rows and columns together.

 moody_df <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/SmallMoody.csv")
 moody_df
 sub_df <- subset(moody_df, SCORE > 50, select = c(STUDENTID, GRADE))
 sub_df

If there are many columns to include and they are all contiguous (adjacent to each other) in the data frame, then we can use ':' (the colon operator) to subset. This operator signifies 'through'. If we say v2:v5, then we want all the columns starting with column v2 up to and including column v5. If v3 and v4 fall between v2 and v5, they are included in the result. The following snippet shows how to subset from column STUDENTID to column ON_SMARTPHONE.

 moody_df <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/SmallMoody.csv")
 moody_df
 sub_df <- subset(moody_df, SCORE > 50, select = STUDENTID:ON_SMARTPHONE)
 sub_df
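
The v2:v5 case described above can be tried on a small made-up data frame. The name wide_df, its columns v1 to v6, and its values are hypothetical and serve only to show which columns the colon operator picks up.

```r
# Hypothetical data frame with columns v1..v6; the values are made up.
wide_df <- data.frame(v1 = 1:3, v2 = 4:6, v3 = 7:9,
                      v4 = 10:12, v5 = 13:15, v6 = 16:18)

# v2:v5 selects v2, v3, v4 and v5; v1 and v6 are left out.
range_df <- subset(wide_df, select = v2:v5)
names(range_df)

# A row condition can be combined with the column range as before.
subset(wide_df, v1 > 1, select = v2:v5)
```

The range follows the order of the columns in the data frame, not alphabetical order, so the result depends on how the columns are arranged.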