read.table()

The read.table() function reads tabular data from a file, such as a .csv or .txt file, into a data frame. It is a very versatile function, and the examples below show what it can do.

Example

read.table() offers a lot of control over how a CSV file is read. Let us see a few examples. For these examples, we will use a .csv file stored at "https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv".

First, let us simply read the file into a data frame.

 moody_df<- read.table("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv")
 moody_df

If you observe the data, you will notice that something is not right with the result. Let us look into it a bit more closely to see what exactly is happening here. Let us check the structure of our data frame.

 moody_df<- read.table("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv")
 str(moody_df)

When you observe this, you will notice that even though the data was read into the data frame, there is only one column in the frame. All the columns were merged into one. This happened because the values are separated by commas, while read.table() splits fields on white space by default, and we did not tell R otherwise. Is there a way to tell R that the values are comma separated? Yes.

 moody_df<- read.table("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv", sep = ',')
 str(moody_df)

We used the 'sep' parameter to specify that the file has comma-separated values. If the file used some other separator, say ', we would use sep="'" instead.
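
The effect of sep can also be tried without downloading anything. The following is a minimal sketch, assuming a small made-up data set passed through the text argument (the variable names csv_text, one_col and two_col are invented for illustration).

```r
# Hypothetical inline data standing in for a small comma-separated file.
csv_text <- "id,score\n1,90\n2,85"

# Default separator (white space): every line is read as a single field.
one_col <- read.table(text = csv_text)

# Telling R that the fields are separated by commas yields two columns.
two_col <- read.table(text = csv_text, sep = ",")

ncol(one_col)
ncol(two_col)
```

Note that the header line is still read as an ordinary row here, because header was not set.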

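If we do not tell R that the data has a header row, the first row is normally treated as an observation like any other (R assumes a header on its own only when the first row has one fewer field than the remaining rows). If we use header=TRUE, then the first row is used for the names of the columns.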

 moody_df<- read.table("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv", sep = ',', header = TRUE)
 str(moody_df)

You will notice that the names of the columns make more sense now.
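
A minimal sketch of the same comparison, assuming made-up inline data rather than the Moody file (csv_text, no_header and with_header are illustrative names):

```r
# Hypothetical inline data: a header line followed by two observations.
csv_text <- "id,score\n1,90\n2,85"

# Without header = TRUE the header line becomes the first data row
# and the columns get the automatic names V1, V2.
no_header <- read.table(text = csv_text, sep = ",")

# With header = TRUE the first line supplies the column names.
with_header <- read.table(text = csv_text, sep = ",", header = TRUE)

names(no_header)
names(with_header)
```

Because the header line is no longer mixed in with the data, the score column of with_header can be read as a number.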

There is another argument, 'quote', that is used to specify the quoting characters in the data. By default, read.table() treats both " and ' as quoting characters. So if some of the data fields contain the ' character, R may treat the text from one ' up to the next ' as a single string. To disable this default behavior we can use quote="", i.e. an empty string.

 moody_df<- read.table("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv", sep = ',', header = TRUE, quote = "")
 str(moody_df)

If you want only the double quote to be treated as a quoting character, you should use quote="\"".
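
To illustrate why quote="" can matter, here is a small sketch, assuming invented names that contain apostrophes (csv_text and df are illustrative names, and the data is made up). With quoting disabled, each apostrophe is kept as an ordinary character.

```r
# Hypothetical inline data in which the name field contains apostrophes.
csv_text <- "name,score\nO'Brien,90\nD'Souza,85"

# quote = "" switches quoting off, so ' is read as a plain character.
# stringsAsFactors = FALSE keeps the names as character strings in any R version.
df <- read.table(text = csv_text, sep = ",", header = TRUE,
                 quote = "", stringsAsFactors = FALSE)

df$name
```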

Sometimes the decimal point is represented by a different character (a comma, for example). The dec parameter specifies which character is used as the decimal point.

 moody_df<- read.table("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv", sep = ',', header = TRUE, quote = "", dec = ".")
 str(moody_df)
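
The Moody file already uses "." as the decimal point, so dec = "." changes nothing there. As a sketch of a case where dec does matter, assume made-up data in the style common in parts of Europe, with ";" between fields and "," as the decimal mark (csv_text and df are illustrative names):

```r
# Hypothetical inline data: semicolon-separated, comma as decimal mark.
csv_text <- "x;y\n1,5;2,25\n3,0;4,75"

# dec = "," tells R that 1,5 means one and a half.
df <- read.table(text = csv_text, sep = ";", header = TRUE, dec = ",")

str(df)
```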

Now, if your data happens to contain numbers with many digits, you can use the numerals parameter to specify how to convert numbers whose conversion to double precision would lose accuracy. This parameter takes one of the following values: allow.loss, warn.loss and no.loss. allow.loss, as the name suggests, allows the loss (this is the default). warn.loss converts the number but warns about the accuracy loss. no.loss does not convert such values to numbers at all and keeps them as text.

 moody_df<- read.table("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv", sep = ',', header = TRUE, quote = "", dec = ".", numerals = "warn.loss")
 str(moody_df)

If the file contains a header row, the column names are taken from that row. Is there any way we can give some other names to columns?

Yes, there is. Even if there is a header row and header is set to TRUE, col.names overrides this behavior and enables us to give our own column names.

 moody_df<- read.table("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv", sep = ',', header = TRUE, col.names =c("ser_no","Stu_ID","Marks","Grade","on_smartphone","asks_questions","leaves_early","leate_in_class","final"))
 str(moody_df)
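
A minimal sketch of the same idea on made-up inline data (csv_text, df and the replacement names are invented for illustration). The header line is still consumed, but its names are replaced by the ones supplied in col.names.

```r
# Hypothetical inline data with its own header line.
csv_text <- "id,score\n1,90\n2,85"

# col.names replaces the names found in the header row.
df <- read.table(text = csv_text, sep = ",", header = TRUE,
                 col.names = c("student_id", "marks"))

names(df)
```

The vector passed to col.names should have one entry per column of the data.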

If you carefully observe the result, you may notice that some columns have been treated as factors. This tells us that the character variables are imported as factors, or categorical variables. (This is the default in versions of R before 4.0.0. From R 4.0.0 onward, character columns are kept as character vectors unless stringsAsFactors=TRUE is set.) A factor or a categorical variable is one that takes only a few distinct values. Can we stop this from happening? Yes, we can use as.is to do that. We can pass the information telling which columns should be imported as they are.

There are two methods to use it. First, we specify a vector of logical values. A TRUE value signifies that this column should be kept as it is and a FALSE signifies otherwise.

 moody_df<- read.table("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv", sep = ',', header = TRUE, as.is=c(FALSE,FALSE,TRUE,TRUE,TRUE,TRUE,FALSE))
 str(moody_df)

The second method is to specify the numbers of the columns that we want to keep as they are. For example, if we want to keep the 4th and 5th columns as they are, then we pass a vector containing 4 and 5.

 moody_df<- read.table("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv", sep = ',', header = TRUE, as.is=c(4,5))
 str(moody_df)
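
Both forms of as.is can be compared on a tiny example. This is a sketch on made-up inline data (csv_text, by_flag and by_index are illustrative names). Columns that are not marked "as is" are converted to factors when they hold text.

```r
# Hypothetical inline data with two text columns.
csv_text <- "name,grade\nann,A\nbob,B"

# Logical form: keep column 1 as it is, convert column 2.
by_flag <- read.table(text = csv_text, sep = ",", header = TRUE,
                      as.is = c(TRUE, FALSE))

# Numeric form: keep only column 2 as it is.
by_index <- read.table(text = csv_text, sep = ",", header = TRUE,
                       as.is = 2)

sapply(by_flag, class)
sapply(by_index, class)
```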

What if the data set is too large and you want to limit the number of rows to be read? Can this be done?

Yes, we can limit the number of rows to be read by using the argument nrows.

 moody_df<- read.table("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv", sep = ',', header = TRUE,nrows=100)
 str(moody_df)
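
A minimal sketch of nrows on made-up inline data (csv_text and df are illustrative names). Only the first two data rows after the header are read.

```r
# Hypothetical inline data: a header line and four observations.
csv_text <- "id,score\n1,10\n2,20\n3,30\n4,40"

# nrows counts data rows; the header line is not included in the count.
df <- read.table(text = csv_text, sep = ",", header = TRUE, nrows = 2)

df
```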

Can we skip some rows at the start when we are reading the data? For example, you may want to start from the 4th row (skipping the first 3 rows) instead of the 1st row.

Yes, this can be done using the 'skip' parameter.

Observe the result of the following code snippet.

 moody_df<- read.table("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv", sep = ',', header = TRUE,nrows=100)
 head(moody_df)

The 4th row has student ID 16792. If I want to skip the first 3 rows, then I need to use skip=3. The first row, in this case, should be Student ID 16792.

 moody_df<- read.table("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv", sep = ',', header = TRUE,nrows=100, skip=3)
 head(moody_df)

Note: Because the first three lines were skipped and header is set to TRUE, the column names are no longer what you would want them to be. The three lines that are skipped are the header and the first 2 data rows, so the names are picked from the 3rd data row. Can this be fixed? Yes, we have already discussed how to give custom names to the columns with col.names. We need to use that parameter here.

 moody_df<- read.table("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv", sep = ',', header = TRUE,nrows=100, skip=3,col.names =
  c("ser_no","Stu_ID","Marks","Grade","on_smartphone","asks_questions","leaves_early","leate_in_class","final"))
 head(moody_df)
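
The interaction of skip, header and col.names is easier to follow on a few rows. The following sketch uses made-up inline data (csv_text and df are illustrative names). skip = 3 drops the header line and data rows 1 and 2, header = TRUE then consumes data row 3 as a header, and col.names replaces the names taken from it.

```r
# Hypothetical inline data: a header line and five observations.
csv_text <- "id,score\n1,10\n2,20\n3,30\n4,40\n5,50"

# The first remaining line ("3,30") is used up as the header row,
# so the data starts at the observation with id 4.
df <- read.table(text = csv_text, sep = ",", header = TRUE, skip = 3,
                 col.names = c("id", "score"))

df
```

If the row consumed as a header should be kept as data, one option is to use header = FALSE together with col.names.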

Can we use something to check if the names we are giving to the columns, or the names that R is picking for the columns, are syntactically valid variable names?

Yes, we can do this with check.names=TRUE, which is the default. To see the difference, let us first switch the check off with check.names=FALSE and give one column a name that contains a space.

 moody_df<- read.table("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv", sep = ',', header = TRUE, col.names =
  c("ser_no","Stu ID","Marks","Grade","on_smartphone","asks_questions","leaves_early","leate_in_class","final"),
  check.names = FALSE)
 head(moody_df)

If you observe the result of the above snippet, the name of the second column is 'Stu ID'. Although the space character is not syntactically valid in a variable name, R allows it here because check.names is set to FALSE.

Let us see what happens if we set check.names=TRUE

 moody_df<- read.table("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv", sep = ',', header = TRUE, col.names =
  c("ser_no","Stu ID","Marks","Grade","on_smartphone","asks_questions","leaves_early","leate_in_class","final"),
  check.names = TRUE)
 head(moody_df)

See how the name is adjusted (to Stu.ID) to make it syntactically valid?

Let's see what happens if we have two identical column names.

 moody_df<- read.table("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv", sep = ',', header = TRUE, col.names =
  c("ser_no","Stu_ID","Marks","Marks","on_smartphone","asks_questions","leaves_early","leate_in_class","final"),
  check.names = TRUE)
 head(moody_df)

One of them is adjusted (to Marks.1) so that they are not identical anymore.
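
Both adjustments can be reproduced on a single line of made-up data. This is a sketch, with csv_text, raw_names and checked_names as illustrative names. The header deliberately contains a name with a space and a duplicated name.

```r
# Hypothetical inline data whose header has a space and a duplicate.
csv_text <- "Stu ID,Marks,Marks\n1,90,80"

# check.names = FALSE keeps the names exactly as written.
raw_names <- names(read.table(text = csv_text, sep = ",", header = TRUE,
                              check.names = FALSE))

# check.names = TRUE (the default) repairs them.
checked_names <- names(read.table(text = csv_text, sep = ",", header = TRUE,
                                  check.names = TRUE))

raw_names      # "Stu ID" "Marks" "Marks"
checked_names  # "Stu.ID" "Marks" "Marks.1"
```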

What happens if the rows have unequal lengths? Can we fix this issue?

Yes, using the fill parameter. If fill=TRUE, rows that have fewer fields than the others are implicitly padded with blank fields.

 moody_df<- read.table("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv", sep = ',', header = TRUE, fill = TRUE)
 str(moody_df)
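
The Moody file has rows of equal length, so fill has no visible effect on it. As a sketch of the case it is meant for, assume made-up inline data whose last row is one field short (csv_text and df are illustrative names):

```r
# Hypothetical inline data: the second observation lacks a value for column c.
csv_text <- "a,b,c\n1,2,3\n4,5"

# fill = TRUE pads the short row; in a numeric column the blank becomes NA.
df <- read.table(text = csv_text, sep = ",", header = TRUE, fill = TRUE)

df
```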

What if we have some character fields that have leading or trailing white space? Can we get rid of it? (Note: Numeric fields are always stripped by default. Also, strip.white is used only when sep has been specified.)

Yes, we can use strip.white=TRUE to get this done.

 moody_df<- read.table("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv", sep = ',', header = TRUE, strip.white = TRUE)
 str(moody_df)
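
A minimal sketch on made-up inline data with padded names (csv_text and df are illustrative names):

```r
# Hypothetical inline data: the name field has extra spaces around it.
csv_text <- "name,score\n  ann  ,90\n bob,85"

# strip.white = TRUE trims the spaces from unquoted character fields.
# stringsAsFactors = FALSE keeps the names as character strings in any R version.
df <- read.table(text = csv_text, sep = ",", header = TRUE,
                 strip.white = TRUE, stringsAsFactors = FALSE)

df$name
```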

What if we have some blank lines in the input and we don't want them to be read at all?

We can do this with the blank.lines.skip parameter, which is TRUE by default.

 moody_df<- read.table("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv", sep = ',', header = TRUE, blank.lines.skip=TRUE)
 str(moody_df)

If the data uses a particular character to mark the rest of a line as a comment, we can tell R about it using the comment.char parameter.

 moody_df<- read.table("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv", sep = ',', header = TRUE, comment.char="#")
 str(moody_df)

comment.char="#" signifies that # is the comment character. This is also the default for read.table().
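
The Moody file is not known to contain comments or blank lines, so here is a sketch on made-up inline data that has both (csv_text and df are illustrative names). The comment line and the empty line are dropped, leaving only the two observations.

```r
# Hypothetical inline data: a header, a comment line, and a blank line
# between the two observations.
csv_text <- "id,score\n# recorded by hand\n1,10\n\n2,20"

# Both settings below are the defaults; they are written out for clarity.
df <- read.table(text = csv_text, sep = ",", header = TRUE,
                 comment.char = "#", blank.lines.skip = TRUE)

df
```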

There is a parameter called stringsAsFactors. If set to FALSE, it prevents R from converting the character variables into factors.

NOTE: as.is is used to do something similar. as.is overrides stringsAsFactors.

 moody_df<- read.table("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv", sep = ',', header = TRUE, stringsAsFactors=FALSE)
 str(moody_df)
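
A small sketch contrasting the two settings on made-up inline data (csv_text, as_factor and as_text are illustrative names):

```r
# Hypothetical inline data with one text column.
csv_text <- "name,score\nann,90\nbob,85"

# stringsAsFactors = TRUE turns the text column into a factor.
as_factor <- read.table(text = csv_text, sep = ",", header = TRUE,
                        stringsAsFactors = TRUE)

# stringsAsFactors = FALSE keeps it as a character vector.
as_text <- read.table(text = csv_text, sep = ",", header = TRUE,
                      stringsAsFactors = FALSE)

class(as_factor$name)
class(as_text$name)
```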

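Is there a way we can skip embedded nul characters (\0 bytes) in the input?

Yes, skipNul is the parameter that we can use in this case. Note that it concerns nul bytes in the file, not missing (NA) values.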

 moody_df<- read.table("https://raw.githubusercontent.com/kunal0895/RDatasets/master/Moody2018.csv", sep = ',', header = TRUE, skipNul=TRUE)
 str(moody_df)

If instead of a data file you have a text string that you need to convert into a table, you can do that with the 'text' parameter. Pass the string through text in place of the file path. Each line of the string becomes one row.

 moody_df<- read.table(text = "a b 1 2 3 4
  b c 5 6 7 8", quote = "")
 str(moody_df)
 moody_df