merge

merge()

Used to merge two data frames by common columns or row names. Analogous to the Join operation in SQL.

Example

Let us first see a very simple form of the merge operation. Remember that the merge operation only makes sense when there is a common field in the two data frames.

Now let us run a very simple form of merge on these two data frames.

 moody_df <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/SmallMoody.csv")
 student_df <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/SmallSt.csv")
 merge(moody_df,student_df)

Okay, so we have our merged results. But if you notice the results, R merged based on the serial number column. Wouldn't it be better if we can merge the columns based on something more meaningful? Can we explicitly specify which column to use for the merge? Yes, we can.

 moody_df  <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/SmallMoody.csv")
 student_df <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/SmallSt.csv")
 merge(moody_df,student_df,by.x = "STUDENTID",by.y="id")

You can even merge based on multiple columns. The 'by' parameter can take a vector containing all the column names you want to merge the data frames on.

Now if you observe the result carefully, you will notice that after the merge operation, only those rows which had the same agenID in both the data frames survived. Other rows were simply dropped from both the data frames.

What if we don't want that? What if we want that one of the frames should retain all its rows?

We can do that by using the arguments all.x and all.y. all.x signifies the first data frame that is passed to the merge function (moody_df in this case) and all.y signifies the second data frame passed to the merge function (student_df in this case).

 moody_df  <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/SmallMoody.csv")
 student_df <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/SmallSt.csv")
 merge(moody_df,student_df,by.x = "STUDENTID",by.y="id",all.x = TRUE)

Since all the rows in the moody_df in our particular example have a match in the student_df, the results are the same.

The above code shows how to preserve all the rows of the first data frame. All the rows that do not have a match are filled with NA values where ever needed. Let's see how to preserve the rows in the second data frame.

 moody_df  <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/SmallMoody.csv")
 student_df <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/SmallSt.csv")
 merge(moody_df,student_df,by.x = "STUDENTID",by.y="id",all.y = TRUE)

And if we want all the rows from both the data frames to be preserved, then what do we do?

We simply use the 'all' parameter.

 moody_df  <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/SmallMoody.csv")
 student_df <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/SmallSt.csv")
 merge(moody_df,student_df,by.x = "STUDENTID",by.y="id",all.y = TRUE)

Did you notice that the results are sorted by the STUDENTID? This is because R by default sorts the results of the merge operation on the merge column. For some reason, if you don't want the result to be sorted, you can use sort = FALSE parameter to get that done. The result, in this case, is in the order in which the row appears in the first data frame (df1 in this case).

 moody_df <-  read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/SmallMoody.csv")
 student_df <- read.csv("https://raw.githubusercontent.com/kunal0895/RDatasets/master/SmallSt.csv")
 merge(moody_df,student_df,by.x = "STUDENTID",by.y="id",sort=FALSE)