Histograms

A histogram is a visual representation of the distribution of a dataset. Its shape is its most informative characteristic: it shows at a glance where most of the data is concentrated and where there is less of it. In other words, you can see where the center of your data distribution lies, how tightly the data clusters around that center, and where possible outliers are to be found.

Example


 #Reading the CSV file into a data frame using the assignment operator '<-'
 student_performance <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv")
 #Plotting a very basic histogram
 hist(student_performance$SCORE)

Can you interpret the above plot? It shows how many students fall into each score range. The 40-50 score range has the tallest bar, so it contains the largest number of students.
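
To check such a reading numerically rather than by eye, hist() can return the bin boundaries and counts without drawing anything. The following is a minimal sketch that uses randomly generated scores (an assumption, standing in for the SCORE column so that the example runs without downloading the file); the variable names are illustrative only.

```r
# Synthetic stand-in for student_performance$SCORE (assumption, not the real data)
set.seed(1)
scores <- runif(200, min = 0, max = 100)

# plot = FALSE returns the histogram object instead of drawing it
h <- hist(scores, plot = FALSE)

h$breaks   # bin boundaries
h$counts   # number of values in each bin

# Locate the tallest bar and report its score range
tallest <- which.max(h$counts)
tallest_range <- c(h$breaks[tallest], h$breaks[tallest + 1])
tallest_range
```

The same two fields, breaks and counts, are available for any histogram drawn in this section.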

The histogram above looks a bit plain. R lets us customize it by passing additional arguments to the hist() function.

Let us add a title to our histogram plot. We can do this by using the main argument.


 student_performance <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv")
 hist(student_performance$SCORE, main="Student score histogram")

Great! Let us now label the X-axis using the xlab parameter.


 student_performance <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv")
 hist(student_performance$SCORE, main="Student score histogram", xlab="Score")

The Y-axis in a histogram usually represents the frequency (number of occurrences). Let's keep that as the default label for the Y-axis.

Now, what if you want to add color to your histogram? We have the col parameter for that.


 student_performance <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv")
 hist(student_performance$SCORE, main="Student score histogram", xlab="Score", col="brown")

Can you guess what the border parameter does?


 student_performance <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv")
 hist(student_performance$SCORE, main="Student score histogram", xlab="Score", col="brown", border="green")

If you want to set custom lower and upper bounds on the X-axis values that appear on the plot, you can do so by using the xlim parameter.


 student_performance <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv")
 hist(student_performance$SCORE, main="Student score histogram", xlab="Score", col="brown", xlim=c(20, 80))

The result of this change is not so good. There are students who scored below 20 and students who scored above 80, and they no longer appear in the plot. Using xlim makes sense only when we know that there are no data points outside those limits. Not every parameter is useful in every plot.
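
A short sketch can make the effect of xlim concrete: it changes only the visible window, not how the data is binned. The scores below are randomly generated (an assumption, standing in for the SCORE column), and the variable names are illustrative.

```r
# Synthetic stand-in for student_performance$SCORE (assumption, not the real data)
set.seed(2)
scores <- runif(200, min = 1, max = 99)

# Same data, drawn with and without custom X-axis limits
h_full <- hist(scores, main = "Full range", xlab = "Score")
h_zoom <- hist(scores, main = "Limited range", xlab = "Score", xlim = c(20, 80))

# The bins and counts are identical: xlim only changes what is shown
same_counts <- identical(h_full$counts, h_zoom$counts)
same_counts

# How many values fall outside the visible window?
n_outside <- sum(scores < 20 | scores > 80)
n_outside
```

If n_outside is greater than zero, the limited plot hides part of the data, which is the problem described above.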

The values on the Y-scale sometimes look better if they are written horizontally rather than vertically.

This can be done using the las argument. Setting it to 1 does the job.


 student_performance <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv")
 hist(student_performance$SCORE, main="Student score histogram", xlab="Score", col="brown", las=1)

If you want your histogram to have a certain number of bins, you can request this using the breaks parameter. For example, setting breaks=5 asks R to divide the X-axis into about 5 intervals. Remember that this is just a suggestion: R rounds the breakpoints to "pretty" values, so the actual number of bins may differ from what you asked for. The number of bins in the histogram always equals the number of breakpoints minus 1.


 student_performance <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv")
 hist(student_performance$SCORE, main="Student score histogram", xlab="Score", col="brown", breaks=5)
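
The relationship between the requested number, the breakpoints, and the resulting bins can be inspected directly. This is a minimal sketch on randomly generated scores (an assumption, standing in for the SCORE column); the call to pretty() is included only to show the kind of rounded breakpoints R tends to choose.

```r
# Synthetic stand-in for student_performance$SCORE (assumption, not the real data)
set.seed(3)
scores <- runif(200, min = 1, max = 99)

# Ask for about 5 intervals and inspect what R actually chose
h <- hist(scores, breaks = 5, plot = FALSE)
h$breaks                      # the rounded breakpoints
n_bins <- length(h$counts)    # number of bins actually used
n_bins

# Rounded breakpoints for roughly 5 intervals over the data range
pretty(range(scores), n = 5)
```

Comparing length(h$breaks) with n_bins shows that there is always one more breakpoint than there are bins.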

Probability density is closely related to histograms. By default, the Y-axis of a histogram shows the frequency (count) of each interval. If you instead want to see how likely a value is to fall in a certain interval, you can plot the probability density: the bars are rescaled so that their total area is 1, and the probability of an interval is its bar height multiplied by its width. There are two ways to plot the probability density.

a) Setting the freq argument to FALSE.


 student_performance <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv")
 hist(student_performance$SCORE, main="Student score histogram", xlab="Score", freq=FALSE)

b) The same result can also be achieved by setting the prob argument to TRUE (prob is short for the probability argument of hist()).


 student_performance <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv")
 hist(student_performance$SCORE, main="Student score histogram", xlab="Score", prob=TRUE)
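
To see how density relates to probability, the sketch below multiplies each bar's density by its width and checks that the areas add up to 1. It uses randomly generated scores (an assumption, standing in for the SCORE column), and the variable names are illustrative.

```r
# Synthetic stand-in for student_performance$SCORE (assumption, not the real data)
set.seed(4)
scores <- runif(200, min = 1, max = 99)

# The histogram object always carries both counts and densities
h <- hist(scores, plot = FALSE)

bin_widths <- diff(h$breaks)
bin_probs  <- h$density * bin_widths   # area of each bar = probability of the bin
total_area <- sum(bin_probs)
total_area

# The same probabilities, computed directly from the counts
bin_probs_from_counts <- h$counts / sum(h$counts)
```

The density on the Y-axis is therefore not a probability by itself; it becomes one only after multiplying by the bin width.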

Let's make our plot cooler! Using the density parameter, we can fill the bars with shading lines instead of a solid color. The value is the density of the shading lines in lines per inch. Here we provide a vector containing the density we want for each bar.


 student_performance <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv")
 hist(student_performance$SCORE, main="Student score histogram", xlab="Score", density=c(20,40,60,80,100))

If you want to change the angle of the shading lines, you can do so with the angle parameter, which takes the angle in degrees.


 student_performance <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv")
 hist(student_performance$SCORE, main="Student score histogram", xlab="Score", density=c(20,40,60,80,100), angle=90)

You can change the size of the text in the plot using the cex family of parameters. Setting cex.main=2 doubles the size of the title, and setting cex.main=0.5 halves it. The same goes for the axis labels (cex.lab) and the numbers along the axes (cex.axis).


 student_performance <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv")
 hist(student_performance$SCORE, main="Student score histogram", xlab="Score", cex.main=2)
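
As an illustration of the related text-size parameters, the sketch below sets the title, axis label, and axis number sizes in one call. The scores are randomly generated (an assumption, standing in for the SCORE column), and the chosen scaling factors are arbitrary.

```r
# Synthetic stand-in for student_performance$SCORE (assumption, not the real data)
set.seed(5)
scores <- runif(200, min = 1, max = 99)

h <- hist(scores,
          main = "Student score histogram", xlab = "Score",
          cex.main = 1.5,   # title text, 1.5 times the default size
          cex.lab  = 1.2,   # axis labels such as "Score" and "Frequency"
          cex.axis = 0.8)   # numbers printed along the axes
```

Each value is a multiplier relative to the default text size, so 1 leaves the text unchanged.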

The right parameter, if set to TRUE, makes the histogram cells right-closed (left-open) intervals; this is the default. Setting right=FALSE makes them left-closed (right-open) instead. Note that the effect is rarely obvious visually, because it only changes which bin receives the values that fall exactly on a breakpoint.


 student_performance <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv")
 hist(student_performance$SCORE, main="Student score histogram", xlab="Score", right=TRUE)
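
Because the effect of right is hard to see on real data, a tiny made-up vector with values sitting exactly on a breakpoint shows it more clearly. The values and breakpoints below are hypothetical and chosen only for illustration.

```r
# Two of the four values sit exactly on the middle breakpoint (20)
x <- c(10, 20, 20, 30)
bins <- c(10, 20, 30)

# Right-closed bins: [10, 20] and (20, 30], so both 20s go to the first bin
closed_right <- hist(x, breaks = bins, right = TRUE, plot = FALSE)$counts

# Left-closed bins: [10, 20) and [20, 30], so both 20s go to the second bin
closed_left <- hist(x, breaks = bins, right = FALSE, plot = FALSE)$counts

closed_right  # 3 1
closed_left   # 1 3
```

The lowest and highest values stay inside the histogram in both cases because hist() includes the outermost boundaries by default.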

Now, if you only want to use a subset of the data that you have, this can be done using the subset function. For example, if you only want to plot a histogram for students who never used cell phones in the classroom you can subset the data based on this condition and then apply the hist function.

 student_performance <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv")
 student_performance <- subset(student_performance, ON_SMARTPHONE=="never")
 hist(student_performance$SCORE, main="Student score histogram", xlab="Score")
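
The following sketch shows the same filtering step on a small made-up data frame, storing the result under a new name so that the full data stays available. The column names follow the text; the rows and the values other than "never" are hypothetical.

```r
# Small made-up data frame; the values are hypothetical, not taken from the real file
students <- data.frame(
  SCORE = c(55, 72, 38, 91, 64, 47),
  ON_SMARTPHONE = c("never", "always", "never", "sometimes", "never", "always")
)

# Inside subset(), columns can be referred to by name without the data frame prefix
never_users <- subset(students, ON_SMARTPHONE == "never")
nrow(never_users)

# Plot only the filtered scores; the original data frame is left untouched
hist(never_users$SCORE, main = "Scores of students who never used phones", xlab = "Score")
```

Keeping the filtered rows in a separate variable makes it easy to compare the subset with the full data later.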

Now suppose you want to plot a histogram in which each bin is one score unit wide. For example, if the lowest score is 5 and the highest score is 90, then the first bin runs from 5 to 6, the second from 6 to 7, and so on. To do this, we first need to know the minimum and maximum scores in the data. We can get them with the min() and max() functions respectively.

 student_performance <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv")
 min(student_performance$SCORE)
 max(student_performance$SCORE)

Now we can create a sequence that starts at the minimum score and increases by 1 until it reaches the maximum score; each value in the sequence is a breakpoint between bins. The minimum score is 1.12, so to keep it simple let's start at the integer 1. The maximum score is 98.97, so let's end at 99.

Here is how we supply the values to the seq function: seq(minimum value, maximum value, step size)

 bins <- seq(1, 99, 1)
 bins

Check the values in the bins variable by just typing 'bins' and executing it.

Now to plot the histogram we use the 'bins' variable and pass it to the breaks parameter.

 student_performance <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv")
 bins <- seq(1, 99, 1)
 hist(student_performance$SCORE, main="Student score histogram", xlab="Score", breaks=bins)
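
Instead of reading the minimum and maximum off the console and typing them in by hand, the bounds can be derived from the data. The sketch below is one possible way to do this, using randomly generated scores (an assumption, standing in for the SCORE column); floor() and ceiling() round the bounds outward so that every score falls inside a bin.

```r
# Synthetic stand-in for student_performance$SCORE (assumption, not the real data)
set.seed(8)
scores <- runif(200, min = 1.12, max = 98.97)

# Round the bounds outward to whole numbers so that no score is left out
lower <- floor(min(scores))
upper <- ceiling(max(scores))

# One breakpoint per whole score between the two bounds
bins <- seq(lower, upper, by = 1)

h <- hist(scores, breaks = bins, plot = FALSE)
n_bins <- length(h$counts)
n_bins
```

This matters because hist() stops with an error when the supplied breakpoints do not span the whole range of the data.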