Boxplots

Generating a basic boxplot of scores of 1580 students

The boxplot function only works on numeric data. Passing it a non-numeric column results in an error.

We refer to a particular column in a data frame using the '$' operator.

Example

For example, to refer to the SCORE column in the student_performance data frame, we write: 'student_performance$SCORE'.

 
 #Reading the CSV file into a data frame using the assignment operator '<-'
 student_performance <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv")
 #Basic boxplot command
 boxplot(student_performance$SCORE)
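Before passing a column to boxplot, it can help to confirm that the column is numeric. The sketch below uses a small made-up data frame (called toy here, with SCORE and GRADE columns standing in for the real data) so that it runs without downloading anything. The values are illustrative only.

```r
# Hypothetical stand-in for student_performance; the values are made up.
toy <- data.frame(
  SCORE = c(55, 62, 71, 88, 93),
  GRADE = c("F", "D", "C", "B", "A"),
  stringsAsFactors = FALSE
)

# '$' pulls a single column out of the data frame as a vector.
scores <- toy$SCORE

# is.numeric() tells us whether a column holds numeric data.
score_ok <- is.numeric(toy$SCORE)
grade_ok <- is.numeric(toy$GRADE)

# Names of all numeric columns, i.e. the ones boxplot() can summarise directly.
numeric_columns <- names(toy)[sapply(toy, is.numeric)]
print(numeric_columns)
```

Running the same check on the real data frame shows which of its columns can be plotted this way.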

The boxplot we get above is not that pretty. Let's work on the cosmetics a bit.

How about giving the plot a name/title?

 student_performance <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv")
 #To give the Boxplot a name we use the 'main' parameter.
 
 boxplot(student_performance$SCORE, main='My first Boxplot')

Let's now label the X and Y axes.

 student_performance <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv")
 #To give the Y axis a label we use the 'ylab' parameter. X axis can be named using 'xlab' parameter
 boxplot(student_performance$SCORE, main='My first Boxplot', ylab='Score')

To change the size of the plot title text and the axis label text, we use the 'cex' family of arguments: 'cex.main' for the title and 'cex.lab' for the axis labels. Setting cex.main=2 doubles the size of the title and setting cex.main=0.5 halves it. The same goes for the axis labels with cex.lab.

 student_performance <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv")
 boxplot(student_performance$SCORE, main='My first Boxplot', ylab='Score', cex.main=2, cex.lab=1.5)

Looks more informative now, doesn't it?

Now, if we wish to use boxplots to compare the distribution of a numeric variable across the groups formed by a categorical variable, we use the '~' (tilde) operator. The column on the right side of the ~ operator can be of any type (character, numeric, logical, etc.), but the column on the left has to be a numeric column.

 #Reading the CSV file into a data frame using the assignment operator '<-'
 student_performance <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv")
 boxplot(student_performance$SCORE ~ student_performance$GRADE)

Can you edit the code to display the title for the boxplot?

Can you edit the code to display the name of the Y-axis?
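To see how the '~' operator splits the scores into groups, the following sketch asks boxplot for its summary instead of a drawing by passing plot=FALSE. The toy data frame and its values are made up for illustration and are not taken from the real file.

```r
# Hypothetical stand-in for student_performance; the values are made up.
toy <- data.frame(
  SCORE = c(91, 95, 82, 85, 88, 70, 74),
  GRADE = c("A", "A", "B", "B", "B", "C", "C"),
  stringsAsFactors = FALSE
)

# plot = FALSE returns the numbers behind the boxes without drawing anything.
grouped <- boxplot(toy$SCORE ~ toy$GRADE, plot = FALSE)

group_names   <- grouped$names        # one label per box
group_sizes   <- grouped$n            # number of scores in each group
group_medians <- grouped$stats[3, ]   # row 3 of 'stats' holds the medians

print(data.frame(group_names, group_sizes, group_medians))
```

Each value of the column on the right of '~' gets its own box, and the numeric column on the left supplies the values summarised inside it.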

Fun fact: If you pass the data frame as an argument to the boxplot function using the 'data=' parameter, you don't have to type in the fully qualified column names (Just in case you have something against the '$' operator).

 student_performance <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv")
 boxplot(SCORE ~ GRADE, data=student_performance)

What do you think will happen in case you remove the 'data=' parameter from the above code? Can you try doing so, and then editing the code to fix this error without putting the 'data=' parameter back in the code?

Now wouldn't it be great if a glance at the boxplot itself can reveal some information about sample sizes in different groups?

We can do this by setting the width of the boxplot (on X-axis) proportional to the square root of the sample sizes. The argument used for this is 'varwidth=TRUE'.

 student_performance <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv")
 boxplot(student_performance$SCORE ~ student_performance$GRADE, varwidth=TRUE)

Can you edit the code so that there is no need to provide the fully qualified column names?
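The square-root rule behind varwidth can be worked out by hand. The sketch below uses made-up group sizes (16, 4 and 1) and computes each box width relative to the widest box. It illustrates the proportionality only and is not the exact drawing code used by boxplot.

```r
# Made-up grades: 16 students with an A, 4 with a B and 1 with a C.
grades <- rep(c("A", "B", "C"), times = c(16, 4, 1))

# Count how many observations fall in each group.
group_sizes <- table(grades)

# Widths proportional to the square root of the sample size,
# scaled so that the largest group has width 1.
relative_widths <- sqrt(as.vector(group_sizes)) / sqrt(max(group_sizes))
names(relative_widths) <- names(group_sizes)

print(relative_widths)
```

A group with 16 times as many observations is therefore drawn 4 times as wide, not 16 times as wide.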

Now, sometimes (or every time, for some people) it is more intuitive to plot horizontal boxplots rather than vertical ones. This can be done using the argument 'horizontal = TRUE'.

 student_performance <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv")
 boxplot(student_performance$SCORE ~ student_performance$GRADE, horizontal = TRUE)

Try to set the width of the boxplots in the above code proportional to the square root of the sample sizes.

You can also rotate the labels on the Y-axis (A, B, C, D and F in this case).

This is done using the argument 'las=1'.

The las argument takes one of 4 values: 0, 1, 2 and 3. 0 is the default (the same as leaving las out of the code altogether) and draws every label parallel to its axis. las = 1 draws all labels horizontally, which rotates the labels on the Y-axis (makes them parallel to the X-axis). las = 2 draws all labels perpendicular to their axis, which rotates the labels on the X-axis as well as the Y-axis. las = 3 draws all labels vertically, which rotates only the labels on the X-axis.

 student_performance <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv")
 boxplot(student_performance$SCORE ~ student_performance$GRADE, horizontal=TRUE, las=1)
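One way to compare the four las values is to draw them side by side. The sketch below writes a 2 x 2 grid of plots to a temporary PDF file. The toy data frame, its values and the choice of a PDF output are assumptions made for this example.

```r
# Hypothetical stand-in for student_performance; the values are made up.
toy <- data.frame(
  SCORE = c(91, 95, 97, 82, 85, 88, 70, 74, 78),
  GRADE = rep(c("A", "B", "C"), each = 3),
  stringsAsFactors = FALSE
)

# Draw into a temporary PDF so the example also works without a screen.
out_file <- tempfile(fileext = ".pdf")
pdf(out_file)

# Split the page into a 2 x 2 grid and draw one plot per las value.
old_par <- par(mfrow = c(2, 2))
for (las_value in 0:3) {
  boxplot(SCORE ~ GRADE, data = toy, horizontal = TRUE,
          las = las_value, main = paste("las =", las_value))
}
par(old_par)

# Close the PDF device so the file is written to disk.
dev.off()
print(out_file)
```

Opening the file shows how the tick labels on each axis turn as las changes.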

Generating a boxplot for only a subset of values of the numerical column can be achieved by passing the subsetting condition in '[ ]' with the numerical column.

 student_performance <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv")
 boxplot(student_performance$SCORE[student_performance$GRADE=="A"],student_performance$SCORE[student_performance$GRADE=="B"])

The '==' operator in the above code tests for equality: it compares each value of GRADE with "A" (or "B") and returns TRUE where they are equal and FALSE otherwise.

Can you try to edit the above code to get the box plots for all the grades from A through F?

Now, can you get the same result using a '~' operator instead of subsets?

Now, if you want to only plot some subset of values of the numerical column based on the row number, you can achieve this by passing the starting row number and ending row number as follows:

 student_performance <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv")
 boxplot(student_performance$SCORE[1:50])
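Both kinds of subsetting used in this section, by condition and by row range, can be tried on a small vector first. The scores and grades below are made up for illustration.

```r
# Made-up scores and matching grades.
scores <- c(90, 75, 88, 60, 95)
grades <- c("A", "B", "A", "C", "A")

# '==' compares element by element and returns a logical vector.
is_a <- grades == "A"

# Indexing with the logical vector keeps only the positions that are TRUE.
a_scores <- scores[is_a]

# Indexing with start:end keeps a range of positions.
first_three <- scores[1:3]

print(is_a)
print(a_scores)
print(first_three)
```

Whatever ends up inside the '[ ]' decides which values reach boxplot, so it is worth printing the subset once before plotting it.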

Notched Boxplots:

Notched boxplots can be generated using the argument 'notch=TRUE'.

Significance: If the notches of two boxes do not overlap, then there is strong evidence that their medians differ.

 student_performance <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv")
 boxplot(student_performance$SCORE ~ student_performance$GRADE, notch=TRUE, horizontal=TRUE)

Try to set the width of the boxplots in the above code proportional to the square root of the sample sizes.
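The notch limits can also be read as numbers rather than judged by eye. The sketch below uses two made-up groups and the 'conf' component returned by boxplot with plot=FALSE, which holds the lower and upper notch limits of each box. The group names X and Y and the overlap check are assumptions made for this example.

```r
# Two made-up groups with clearly different medians.
groups <- list(
  X = c(10, 11, 12, 13, 14),
  Y = c(30, 31, 32, 33, 34)
)

# plot = FALSE returns the summary without drawing.
summary_stats <- boxplot(groups, plot = FALSE)

# 'conf' has one column per group: row 1 is the lower notch limit,
# row 2 is the upper notch limit.
notch_limits <- summary_stats$conf

# Two intervals overlap when each one starts before the other one ends.
notches_overlap <- notch_limits[2, 1] >= notch_limits[1, 2] &&
  notch_limits[2, 2] >= notch_limits[1, 1]

print(notch_limits)
print(notches_overlap)
```

Here the notches do not overlap, which is the situation in which the medians of the two groups are taken to differ.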

Adding color to the boxplots:

This can be done by passing a vector containing colors with the 'col' argument.

The colors are recycled across the boxplots: if you provide 4 colors and there are 6 boxes, the 5th box gets the same color as the 1st, and the 6th box gets the same color as the 2nd.

 student_performance <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv")
 boxplot(student_performance$SCORE ~ student_performance$GRADE, horizontal = TRUE, col=c("lightblue", "red"), outline=FALSE)

Try generating notched boxplots by editing the above code.
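The recycling of colors can be previewed without drawing anything. The sketch below uses rep_len to repeat a vector of 4 colors over 6 boxes. The color names and the number of boxes are made up, and rep_len is used here only to mimic the rule described above.

```r
# Four made-up colors for six boxes.
colors_given <- c("lightblue", "red", "green", "yellow")
box_count <- 6

# rep_len() repeats the vector from the start until it has box_count entries.
colors_used <- rep_len(colors_given, box_count)

print(colors_used)
```

The 5th and 6th entries repeat the 1st and 2nd colors, matching the behaviour described for the 'col' argument.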

Outliers: Outliers can be hidden from the plot by setting 'outline=FALSE'. The data itself is not changed.

 student_performance <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv")
 boxplot(student_performance$SCORE ~ student_performance$GRADE, outline=FALSE)

Can you edit the above code to add colors to the boxplots?

Can you make the boxplots horizontal instead of vertical?
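To find out which values boxplot treats as outliers, the 'out' component of its summary can be inspected. By default a point counts as an outlier when it lies more than 1.5 times the height of the box beyond the box edges. The scores below are made up for illustration.

```r
# Made-up scores with one value far away from the rest.
scores <- c(50, 52, 53, 55, 56, 58, 120)

# plot = FALSE returns the summary; 'out' lists the points drawn as outliers.
flagged <- boxplot(scores, plot = FALSE)$out

print(flagged)

# outline = FALSE only affects the drawing; the data keeps all of its values.
print(length(scores))
```

This makes it easy to check what would disappear from the picture before setting outline=FALSE.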

Custom fixed width boxplots:

To set a custom fixed width to the plots we use the 'boxwex' argument.

 student_performance <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv")
 boxplot(student_performance$SCORE ~ student_performance$GRADE, horizontal=TRUE, boxwex=0.9)

Try playing around with the boxwex values.

Do you remember what else can be used to change the width (variable width) of the boxplots? Try that in the above code.

Can you edit the code such that the boxplots do not show outliers?

Now, if we wish to change the character used for showing the outliers, we can do that using the 'pch' argument.

Setting pch to different numerical values gives different characters.

 student_performance <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv")
 boxplot(student_performance$SCORE ~ student_performance$GRADE, horizontal=TRUE, pch=2)

Try playing around with different pch values.