1 Merging datasets

Sometimes, data that we want to combine for analyses are separated across different spreadsheets or data tables. How can we combine these different data tables? Join operations (FMI on joining two data tables) offer a way to merge data across multiple data tables (also called data frames in R parlance).

Below, I will first create two data tables that store different characteristics about fruit.

### Load dplyr and stringr packages into R workspace
library(dplyr)
library(stringr)

### Table storing colors of 5 fruits
dt1 <- tibble::tibble(fruit=c("apple","pear","orange","kiwi","mangosteen"),
                      color=c("red","green","orange","brown","purple"))
dt1 # display table
## # A tibble: 5 × 2
##   fruit      color 
##   <chr>      <chr> 
## 1 apple      red   
## 2 pear       green 
## 3 orange     orange
## 4 kiwi       brown 
## 5 mangosteen purple
### Table storing prices of 3 fruits
dt2 <- tibble::tibble(fruit=c("pear","orange","mangosteen","cherimoya"),
                      price=c(1.25,1,5,4.7)) # price per pound
dt2 # display table
## # A tibble: 4 × 2
##   fruit      price
##   <chr>      <dbl>
## 1 pear        1.25
## 2 orange      1   
## 3 mangosteen  5   
## 4 cherimoya   4.7

All join operations assume there is some column between your data tables that has consistent values that you can use to merge the records in the data tables together. In the case of this simple example, the shared column is fruit.

1.1 Left join

The first join that we will look at is the left_join function from the dplyr package. The left_join will keep all of the rows in the left-hand side table and attempt to match entries from the second (right-hand side) table. If there is no match, R will populate NA (missing value) at that missing join location.

left_join(dt1, dt2)
## # A tibble: 5 × 3
##   fruit      color  price
##   <chr>      <chr>  <dbl>
## 1 apple      red    NA   
## 2 pear       green   1.25
## 3 orange     orange  1   
## 4 kiwi       brown  NA   
## 5 mangosteen purple  5
# Note that the left_join function detects that both data tables both have a column named "fruit". What would happen if they didn't have a column with the same name?

We see that R has merged the two data tables, dt1 and dt2 together; the resulting data table above has all of the unique columns across these two data tables (fruit - the column we used to merge the two data tables together, color from dt1, and price from dt2). We also see that R has matched the records across the two data tables–see for yourself for example that the row for pear has the correct values from dt1 (color=green) and dt2 (price=1.25).

1.1.1 What is going on with those weird NA values?

What we see above is that because dt2 didn’t have any data for apple and kiwi, R has attempted to match all 5 fruits from dt1 (left-hand side data table) with the fruits in dt2, and the fruits that were in dt1 but not dt2 get an NA value in the price column, which is merged from dt2 to dt1. Note also that the fruit that is only in dt2 but not dt1 doesn’t show up in the merged data table (no cherimoya row).

1.2 Right join

In this case, R retains all of the records from dt2. Because dt1 has some fruits that dt2 does not, that means that the fruits that are unique to dt1 will not show up in the merged data table.

right_join(dt1, dt2)
## # A tibble: 4 × 3
##   fruit      color  price
##   <chr>      <chr>  <dbl>
## 1 pear       green   1.25
## 2 orange     orange  1   
## 3 mangosteen purple  5   
## 4 cherimoya  <NA>    4.7

As before, because cherimoya is only defined in dt2, but not dt1, R populates an NA value for all of the columns that are coming from dt1 (in this case, the column color) to the final merged data table.

1.3 Inner join

In the inner_join function, only those rows that have exact matches across the two data tables are preserved in the final merged data table.

inner_join(dt1, dt2)
## # A tibble: 3 × 3
##   fruit      color  price
##   <chr>      <chr>  <dbl>
## 1 pear       green   1.25
## 2 orange     orange  1   
## 3 mangosteen purple  5

We see that this resulting data table presents only the three fruits that are included in both dt1 and dt2.

1.4 Full join

In a full join operation, all unique elements across dt1 and dt2 are preserved in the final merged data table. For non-matching pairs, R will fill in an NA value.

full_join(dt1, dt2)
## # A tibble: 6 × 3
##   fruit      color  price
##   <chr>      <chr>  <dbl>
## 1 apple      red    NA   
## 2 pear       green   1.25
## 3 orange     orange  1   
## 4 kiwi       brown  NA   
## 5 mangosteen purple  5   
## 6 cherimoya  <NA>    4.7

We see that this merged table has all 6 unique fruits across the two data tables. We see that R has assigned NA in price (the column from dt2) for the two fruits that are in dt1 but not dt2 (apple and kiwi). We also see that there is an NA value in color (the column from dt1) for cherimoya, which is a fruit that was only listed in dt2.

1.5 The need for consistency to ensure successful joins

R can’t read our minds, for better or for worse. So if we notice that our data tables are not consistent in how they present the fruits, then we’ll see different join behavior. R is looking for an exact match of the words in the joining column (fruit).

### Modifying dt2
dt2 <- tibble::tibble(fruit=c("Pear","Orange","Mangosteen","Cherimoya"),
                      price=c(1.25,1,5,4.7)) # price per pound
dt2 # display table
## # A tibble: 4 × 2
##   fruit      price
##   <chr>      <dbl>
## 1 Pear        1.25
## 2 Orange      1   
## 3 Mangosteen  5   
## 4 Cherimoya   4.7
# dt1 # display dt1 to compare the fruit column visually
### See what happens - R can't join the fruits up - the words are not identical because R is case sensitive
full_join(dt1,dt2) 
## # A tibble: 9 × 3
##   fruit      color  price
##   <chr>      <chr>  <dbl>
## 1 apple      red    NA   
## 2 pear       green  NA   
## 3 orange     orange NA   
## 4 kiwi       brown  NA   
## 5 mangosteen purple NA   
## 6 Pear       <NA>    1.25
## 7 Orange     <NA>    1   
## 8 Mangosteen <NA>    5   
## 9 Cherimoya  <NA>    4.7

Dang! R has treated pear as a different value than Pear. That’s really silly and unfortunate. It’s not “automagically” joining up the data across the two tables because R is case sensitive (that is, pear is different from Pear to R).

We can fix this example here by taking dt2 and making the fruit column all lower-case then re-running the join operation.

### Let's fix this problem and make the fruit names consistent
dt2new <- dt2 %>% 
  mutate(fruit=tolower(fruit)) # a function that takes character/strings (words) and converts them to lower case
dt2new # confirm that now the fruit names are lower case
## # A tibble: 4 × 2
##   fruit      price
##   <chr>      <dbl>
## 1 pear        1.25
## 2 orange      1   
## 3 mangosteen  5   
## 4 cherimoya   4.7
### Re-run full_join
full_join(dt1,dt2new)
## # A tibble: 6 × 3
##   fruit      color  price
##   <chr>      <chr>  <dbl>
## 1 apple      red    NA   
## 2 pear       green   1.25
## 3 orange     orange  1   
## 4 kiwi       brown  NA   
## 5 mangosteen purple  5   
## 6 cherimoya  <NA>    4.7

1.5.1 Additional exercises to build understanding

Given dt1, dt2, and dt2new that have just been defined above, run and contrast the outputs of the following:

  • left_join(dt1, dt2)
    • versus: left_join(dt1, dt2new)
  • right_join(dt1, dt2)
    • versus: right_join(dt1, dt2new)
  • inner_join(dt1, dt2)
    • versus: inner_join(dt1, dt2new)
    • notice that inner_join(dt1, dt2) tells you that it didn’t find any matches! It produces a table that has 0 x 3 dimensions, or 0 rows because it didn’t find any matching records across dt1 and dt2.

2 Adding socioeconomic data to the Provided Datasheet

As an additional illustration of data merging, I will illustrate below how one can add on socio-economic data from the USDA and the US Census to the Provided Datasheet. In the code chunk below, I pull in data from the GitHub repository that stores the class R tutorials.

### Read in data
CA_county_data <- readr::read_tsv("https://raw.githubusercontent.com/chchang47/BIOL104PO/master/data/CA_protectedareas_datasheet.tsv") # this is a link to the Provided Datasheet for the class project
  ### Take a look at the first few rows of the data table
CA_county_data
## # A tibble: 58 × 26
##    County   Birds Trees Reptiles Mammals area_of_county_ha lat_county_centroid_y
##    <chr>    <dbl> <dbl>    <dbl>   <dbl>             <dbl>                 <dbl>
##  1 Alameda    235    30       27      54           212722.                  37.7
##  2 Alpine     174    29       15      71           192485                   38.6
##  3 Amador     172    41       21      69           156940.                  38.4
##  4 Butte      184    45       21      67           434375.                  39.7
##  5 Calaver…   173    39       21      66           268563.                  38.2
##  6 Colusa     189    40       20      58           299495.                  39.2
##  7 Contra …   233    35       25      57           208175                   37.9
##  8 Del Nor…   199    56       19      71           318486.                  41.7
##  9 El Dora…   174    41       21      71           462662.                  38.8
## 10 Fresno     173    38       25      72          1556855.                  36.8
## # ℹ 48 more rows
## # ℹ 19 more variables: long_county_centroid_x <dbl>, PAs_gapstatus1_HAs <dbl>,
## #   PAs_gapstatus2_HAs <dbl>, PAs_gapstatus3_HAs <dbl>,
## #   PAs_gapstatus4_HAs <dbl>, PAs_gapstatus1thru4_HAs <dbl>,
## #   PAs_gapstatus1_Ct <dbl>, PAs_gapstatus2_Ct <dbl>, PAs_gapstatus3_Ct <dbl>,
## #   PAs_gapstatus4_Ct <dbl>, PAs_gapstatus1thru4_Cts <dbl>,
## #   farmland_value_USDperHA <dbl>, farmland_area_HAs <dbl>, …
### Data on socio-economic statuses in California
CA_county_SES <- readr::read_tsv("https://raw.githubusercontent.com/chchang47/BIOL104PO/master/data/CA_ses.tsv") # URL for spreadsheet storing socioeconomic data by county in California
  ### Take a look at the first few rows of the data table
CA_county_SES
## # A tibble: 58 × 6
##    State County              rurban PctPoverty PctChildPoverty MedianHHincome
##    <chr> <chr>               <chr>       <dbl>           <dbl>          <dbl>
##  1 CA    Alameda County      Urban         8.9             9.8         107589
##  2 CA    Alpine County       Rural        17.2            29.6          58112
##  3 CA    Amador County       Rural         9.8            12.5          62640
##  4 CA    Butte County        Urban        16.1            16.1          58394
##  5 CA    Calaveras County    Rural        12.1            19.6          68248
##  6 CA    Colusa County       Rural        12              15.3          59048
##  7 CA    Contra Costa County Urban         7.9             9.9         106555
##  8 CA    Del Norte County    Rural        17.9            25.8          48979
##  9 CA    El Dorado County    Urban         8.7             9.1          86202
## 10 CA    Fresno County       Urban        20.5            29            56926
## # ℹ 48 more rows
### Data on racial composition in California
CA_county_demog <- readr::read_tsv("https://raw.githubusercontent.com/chchang47/BIOL104PO/master/data/CA_census.tsv")
  ### Take a look at the first few rows of the data table
CA_county_demog
## # A tibble: 58 × 10
##    STNAME     CTYNAME TOT_POP    BAC    IAC    AAC     NAC LATINX NHLATINX    WA
##    <chr>      <chr>     <dbl>  <dbl>  <dbl>  <dbl>   <dbl>  <dbl>    <dbl> <dbl>
##  1 California Alamed… 1650950 0.133  0.0235 0.334  0.0162   0.224    0.776 0.493
##  2 California Alpine…    1047 0.0143 0.233  0.0210 0.00287  0.100    0.900 0.729
##  3 California Amador…   37429 0.0289 0.0418 0.0244 0.00532  0.134    0.866 0.900
##  4 California Butte …  226231 0.0296 0.0445 0.0616 0.00653  0.160    0.840 0.858
##  5 California Calave…   45322 0.0159 0.0390 0.0259 0.00519  0.116    0.884 0.914
##  6 California Colusa…   21496 0.0203 0.0377 0.0228 0.00791  0.587    0.413 0.911
##  7 California Contra… 1137268 0.111  0.0221 0.202  0.0113   0.254    0.746 0.654
##  8 California Del No…   27382 0.0424 0.127  0.0452 0.00427  0.196    0.804 0.782
##  9 California El Dor…  185976 0.0165 0.0278 0.0599 0.00545  0.127    0.873 0.890
## 10 California Fresno…  976830 0.0679 0.0421 0.121  0.00517  0.527    0.473 0.764
## # ℹ 48 more rows

2.1 Data explanation

Note that you can find a description of the Provided Datasheet Data Explanation here.

On the other hand, the columns in CA_county_SES are as follows:

  • State: Abbreviated name for California (CA)
  • County: County name
  • rurban: Rural or urban designation for the county
  • PctPoverty: The percentage of the county’s population that lives in poverty in 2019
  • PctChildPoverty: The percentage of the county’s children (ages 0-17) that live in poverty in 2019
  • MedianHHincome: Median annual household income in the county

These data for California were modified from national County-level Data setsat the USDA Economic Research Service. The county-level data sets website opens up a data viewer; here is the data viewer for California.

The columns in CA_county_demog are as follows:

  • STNAME: The name of the state
  • CTYNAME: The name of the county
  • TOT_POP: Total population of the county
  • BAC: The proportion of county residents who self-identify as Black or African-American alone or in combination (in combination includes multiracial respondents)
  • IAC: The proportion of county residents who self-identify as American Indian and Alaska Native alone or in combination
  • AAC: The proportion of county residents who self-identify as Asian alone or in combination
  • NAC: The proportion of county residents who self-identify as Native Hawaiian and Other Pacific Islander alone or in combination
  • LATINX: The proportion of the county residents who self-identify as Hispanic and/or Latinx
  • NHLATINX: The proportion of the county residents who self-identify as Not Hispanic and/or not Latinx
  • WA: The proportion of the county residents who self-identify as White

All of the racial/ethnic groupings above are those defined by the US Census as described in the metadata document. These data were calculated from the US Census Bureau County Population by Characteristics: 2010-2019; specifically the Annual County Resident Population Estimates by Age, Sex, Race... section of the site.

2.2 Performing iterative joining operations

Let’s say that for an analysis, you wanted to combine some information about biodiversity (say Birds species richness) with socio-economic data. We can use the data joining operations that we just covered above to merge these data tables together, which permits us to plot our data and (as we will see after Spring Break) perform other analyses such as linear regression models.

We will join up CA_county_data (the Provided Datasheet with biodiversity and land use data) with CA_county_SES and CA_county_demog. To do so, we will merge the data tables based on the column that stores county names.

The first thing that we will need to do is ensure that the county names are consistent (no weird issues with lower/upper case or anything else). What do we see?

### Show the first six counties in each data table
head(CA_county_data$County) # Here we are using the $ symbol to refer to a particular column that we want to pull out of a data table. This is pretty much equivalent to calling dplyr::select(CA_county_data, County)
## [1] "Alameda"   "Alpine"    "Amador"    "Butte"     "Calaveras" "Colusa"
head(CA_county_SES$County)
## [1] "Alameda County"   "Alpine County"    "Amador County"    "Butte County"    
## [5] "Calaveras County" "Colusa County"
head(CA_county_demog$CTYNAME) # note that county in this data table is stored in a column called CTYNAME, not County
## [1] "Alameda County"   "Alpine County"    "Amador County"    "Butte County"    
## [5] "Calaveras County" "Colusa County"

Oof! We see that our matches won’t work–both CA_county_SES and CA_county_demog have County added to the end of each county’s name, while CA_county_data does not. No problem there–we’ll use a “string substitution” function (akin to using find and replace to search for then delete a string or replace it with something else) to remove County from the end of the county names. We’ll then check the county name columns in the two data tables to ensure that we’ve got names for counties that will match across all three data tables.

CA_county_SES$County <- str_replace_all(CA_county_SES$County," County","") # replace " County" at the end of the county names with nothing - basically like find and replace deleting the " County" string.
CA_county_demog$CTYNAME <- str_replace_all(CA_county_demog$CTYNAME," County","")
### Checking that we've removed " County" by inspecting the first few values in the county name columns
head(CA_county_SES$County)
## [1] "Alameda"   "Alpine"    "Amador"    "Butte"     "Calaveras" "Colusa"
head(CA_county_demog$CTYNAME) 
## [1] "Alameda"   "Alpine"    "Amador"    "Butte"     "Calaveras" "Colusa"

Below, we will use inner_join to first join CA_county_data with CA_county_SES, then CA_county_demog.

### First we will check the dimensions of the CA_county_data column
dim(CA_county_data) # 58 rows by 26 columns
## [1] 58 26
### Merge the data table that has socioeconomic data to CA_county_data
CA_merged <- inner_join(CA_county_data, CA_county_SES)
dim(CA_merged) # now we have 31 columns
## [1] 58 31
### Merge the demographic data table to the merged data table above, which was stored in CA_merged
CA_merged <- inner_join(CA_merged, CA_county_demog, by=c("County"="CTYNAME"))
  # Remember my question earlier about doing matches across data tables where the column names aren't the same? Specifying by=c(columnNameForMatchingInDataTable1 = column_name_for_matching_in_data_table_2) is how we can tell R which columns to merge the data tables on.
dim(CA_merged) # now we have 40 columns
## [1] 58 40
### Display the final merged data table
# View(CA_merged) # uncomment to run