1 Data visualization with ggplot2

One of the most powerful ways to explore your data is through data visualization. These activities will focus on three types of plots: scatterplots, boxplots, and histograms. We will use the ggplot2 package to create plots in R.

ggplot2 creates plots based on different layers. The three necessary components are:

While there are additional layering components that one can include, we will confine our attention to these three layers. For more information, please consult the free, online ggplot2 Book.

2 Creating a scatterplot

Scatterplots are most useful for visualizing two continuous, numeric variables. In defining a plot using ggplot2, we specify the data source (in the ggplot function), and the mapping of the variables (aes, short for aesthetic, within ggplot). Subsequently, we specify the type of geom (geometry) we want in the plot.

Below, we will follow a convention where we assign an object (p) to store the plot. We specify what layers we want to add to the object p by using the + sign.

In the code chunk here, we will create a scatterplot of protected area counts on the x-axis and protected area spatial extent in hectares on the y-axis across all of the 58 California counties in this dataset. To that end, we will specify PAs_gapstatus1thru4_Cts as the x variable and PAs_gapstatus1thru4_HAs as the y variable in the aesthetics field.

### Loading packages into R workspace
library(dplyr)
library(ggplot2)

### Read in data
CA_county_data <- readr::read_tsv("https://raw.githubusercontent.com/chchang47/BIOL104PO/master/data/CA_protectedareas_datasheet.tsv") # this is a link to the Provided Datasheet for the class project

### Take a look at the first few rows of the data table
CA_county_data

### Creating a scatterplot
  # Note that we will store the scatterplot in an object named "p" for plot
p <- ggplot(CA_county_data, aes(x=PAs_gapstatus1thru4_Cts, y=PAs_gapstatus1thru4_HAs)) # specifying an aesthetic where x is the count of protected areas and y is the area in hectares of all PAs in each county
p <- p + geom_point() # adding on points to create a scatterplot
p <- p + labs(x="Protected area count", y="Protected area extent (ha)") # changing the default x and y axis labels to more informative labels
p # calling p which will be displayed in the plot viewer

2.1 Exercise

How would you create a scatterplot of iucn_threat_count_species (y variable) against the total sum of bird, tree, reptile, and mammal species (x variable) in each county in California?

3 Boxplot

Boxplots are useful for depicting how different discrete categories within a variable exhibit variation. Below, we will specify the variable that we use to group our data (the categories of the 5 regions in California) as the x variable, and the values of interest as the y variable.

### Creating a boxplot
p <- ggplot(CA_county_data, aes(x=Region, y=iucn_threat_count_species))
p <- p + geom_boxplot()
p <- p + labs(x="Region",y="IUCN Red-Listed species")
# p <- p + theme_classic() # compare and contrast what happens when you uncomment this line
p <- p + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) # here, we are specifying that R should rotate the x-axis text by 90 degrees to avoid the long regional names getting written over each other. No need to worry about this line of code too much.
p

3.1 Exercise

How would you create a boxplot of bird species richness across the 5 regions in California?

4 Histogram

Histograms are useful for displaying the distribution of values for a continuous, numeric variable. In this case, we are going to display the distribution of values for the area of all 58 counties in California in hectares by specifying area_of_county_ha as the x variable in the aesthetics field.

### Creating a histogram
p <- ggplot(CA_county_data, aes(x=area_of_county_ha))
p <- p + geom_histogram()
p <- p + labs(x="County area (ha)")
# p <- p + theme_minimal() # compare and contrast what happens when you uncomment this line
p

4.1 Exercise

How would you create a histogram of reptile species richness in this dataset?

5 Answers to exercises

5.1 Scatterplot of IUCN Red-listed species against total richness

How would you create a scatterplot of iucn_threat_count_species (y variable) against the total sum of bird, tree, reptile, and plant species (x variable) in each county in California? Note that we will first have to use the mutate function to create a new variable that stores the sum of bird, tree, reptile, and plant species richness.

### Running mutate command and storing output in CA_county_data
CA_county_data <- CA_county_data %>%
    mutate(totalRichness = Birds + Trees + Reptiles + Mammals)
    
### Creating a scatterplot
  # Note that we will store the scatterplot in an object named "p" for plot
p <- ggplot(CA_county_data, aes(x=totalRichness, y=iucn_threat_count_species)) # specifying an aesthetic where x is the count of protected areas and y is the area in hectares of all PAs in each county
p <- p + geom_point() # adding on points to create a scatterplot
p <- p + labs(x="Total species richness", y="IUCN Red-listed species") # changing the default x and y axis labels to more informative labels
p # calling p which will be displayed in the plot viewer

5.2 Boxplot of bird species richness

How would you create a boxplot of bird species richness across the 5 regions in California?

### Creating a boxplot
p <- ggplot(CA_county_data, aes(x=Region, y=Birds))
p <- p + geom_boxplot()
p <- p + labs(x="Region",y="Bird richness")
p <- p + theme_classic()
p <- p + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
p

5.3 Histogram of reptile richness

How would you create a histogram of reptile species richness in this dataset?

### Creating a histogram for reptile species richness
p <- ggplot(CA_county_data, aes(x=Reptiles))
p <- p + geom_histogram()
p <- p + labs(x="Reptile richness")
p