Useful for showing the correlation between two variables.
In this example, we will be using a robust linear regression model. More information on this kind of model can be found here. Unlike Stata, base R does not have an option to specify a robust model. Fortunately, the MASS package adds this functionality.
library("readr")
library("tidyverse")
library("MASS", warn.conflicts = FALSE)
Dataset stored as mydata
mydata <- read_csv('Data/line2.csv', show_col_types = FALSE)
mydata
We are running the regression model before we plot the graph. This allows us to obtain the \(\beta\) coefficient (slope) and the standard error and store them in variables. We will be using these variables later in the graph.
The rlm model (robust linear model) is made available to us through the MASS package that we have previously loaded.
model <- rlm(y_var ~ x_var, data = mydata)
model_summary <- summary(model)
model_summary
##
## Call: rlm(formula = y_var ~ x_var, data = mydata)
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.353685 -0.117405 -0.007895  0.123693  0.518680
##
## Coefficients:
##             Value   Std. Error t value
## (Intercept)  0.0896  0.0361     2.4859
## x_var        0.8523  0.0645    13.2142
##
## Residual standard error: 0.179 on 98 degrees of freedom
slope <- round(model_summary$coefficients[2,1], 2)
se <- round(model_summary$coefficients[2,2], 2)
print(paste0("The slope is ", slope, ", the SE is ", se))
## [1] "The slope is 0.85, the SE is 0.06"
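To see why a robust model can matter, the following minimal sketch compares lm and rlm on made-up data containing a few outliers. The toy data frame and all of its values are assumptions for illustration only; they are not the line2.csv dataset used in this guide.

```r
library(MASS)  # provides rlm()

# Hypothetical data: a clean linear trend (true slope 0.8) plus three outliers
set.seed(42)
toy <- data.frame(x_var = seq(0, 1, length.out = 50))
toy$y_var <- 0.1 + 0.8 * toy$x_var + rnorm(50, sd = 0.05)
toy$y_var[48:50] <- toy$y_var[48:50] - 1  # contaminate the last three points

ols_fit    <- lm(y_var ~ x_var, data = toy)
robust_fit <- rlm(y_var ~ x_var, data = toy)

# Same extraction pattern as above: row = term, column = statistic
ols_slope    <- summary(ols_fit)$coefficients["x_var", "Estimate"]
robust_slope <- summary(robust_fit)$coefficients["x_var", "Value"]
robust_se    <- summary(robust_fit)$coefficients["x_var", "Std. Error"]

print(round(c(ols = ols_slope, robust = robust_slope, robust_se = robust_se), 2))
```

In this sketch the robust slope should stay closer to the true value of 0.8, because rlm down-weights observations with large residuals.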
Setting the aesthetic arguments which will be inherited by subsequent geometric objects.
plot <- mydata %>%
ggplot(aes(x = x_var, y = y_var))
plot
geom_smooth
This geometric object is placed first. We do this to ensure that the points appear on top of the fitted line. The following arguments have been used:
method allows us to specify which regression model to use. We put rlm to indicate that we want a robust linear model.
color allows us to specify the color of the fitted line.
formula allows us to specify the regression equation.
fill allows us to specify the color of the 95% (default) confidence envelope.
plot <- plot +
geom_smooth(method = rlm, color = "maroon", formula = y ~ x, fill = "grey10")
plot
geom_point
The following arguments have been used:
color allows us to specify the color of the points. Here we have chosen a shade of blue which is consistent with IDinsight brand colors.
size allows us to specify the size of the points.
plot <- plot +
geom_point(color = "#264D96", size = 2.5)
plot
annotate
The following arguments have been used:
geom allows us to specify how information will be presented. Here, we use the label option as that will print a box surrounding the output (like the geom_label object). Aside from the label option, you can also try text, but this will not provide a box around the output (like the geom_text object).
x allows us to specify where we want the coefficient and standard error to appear on the x-axis.
y allows us to specify where we want the coefficient and standard error to appear on the y-axis.
label allows us to specify what we want to display on the graph. Here we use the paste0 function to concatenate the coefficient and standard error into a single label. The \n character creates a new line. We use it here to split the information into two lines, resulting in a nicer presentation.
size allows us to specify the size of the label we are displaying on the graph.
plot <- plot +
annotate(geom = "label", x = 0.15, y = 0.9,
label = paste0("Slope: ", slope, "\n(", se, ")"),
size = 4.5)
plot
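The label construction can be checked on its own before it goes into annotate. This is a minimal sketch in base R; the two values are copied from the regression output shown earlier rather than recomputed.

```r
# Values copied from the regression output shown earlier
slope <- 0.85
se <- 0.06

# paste0() joins its arguments with no separator; "\n" starts a new line
label_text <- paste0("Slope: ", slope, "\n(", se, ")")

# cat() renders the line break, whereas print() would show the escaped "\n"
cat(label_text, "\n")
```

Inspecting the string with cat is a quick way to confirm that the slope and the bracketed standard error land on separate lines.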
While the graph we have created so far is functional, we can improve the presentation by tweaking the axes. Since we are analyzing continuous data, we will be using the scale_y_continuous layer to modify the y-axis. If the data were discrete, we would be using the scale_y_discrete layer. The following arguments have been used:
expand allows us to specify the padding added at the ends of the y-axis. Here we are specifying padding equal to 3.5% of the axis range, which leaves a small gap below 0 (the same padding is applied above the upper limit). This is often tricky to master and we would recommend going over the documentation.
limits allows us to specify the length of the axis line. Here we are specifying that the y-axis line should range between 0 and 1.05. In effect, we are simply zooming out. The data is not affected since the y-axis variable lies between 0 and 1. We do not recommend using the limits argument to zoom in, as that restricts the underlying data. For example, if we specified limits = c(0, 0.8), we would be restricting not just the display of the data on the plot, but also any associated analysis. In this case, the fitted line (and confidence envelope) would change as well, since observations with a y value outside the range 0 to 0.8 would be dropped before the model is fitted. If you are interested in focusing or zooming into a specific region of the graph, we recommend using the xlim and/or ylim argument of the coord_cartesian layer.
breaks allows us to specify the axis ticks. We take advantage of the seq function to create ticks between 0 and 1 at 0.2 intervals.
plot <- plot +
scale_y_continuous(expand = c(0.035, 0),
limits = c(0, 1.05),
breaks = c(seq(from = 0, to = 1, by = 0.2)))
plot
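The difference between zooming with scale limits and zooming with coord_cartesian can be checked directly. The sketch below is illustrative only: the toy data is made up, and lm is used instead of rlm purely to keep the example free of extra packages. It uses layer_data to retrieve the fitted line that geom_smooth computes in each case.

```r
library(ggplot2)

# Hypothetical data: y_var rises from 0 to just above 1 as x_var goes from 0 to 1
toy <- data.frame(x_var = seq(0, 1, length.out = 100))
toy$y_var <- toy$x_var + 0.05 * sin(20 * toy$x_var)

base <- ggplot(toy, aes(x = x_var, y = y_var)) +
  geom_smooth(method = "lm", formula = y ~ x)

# Scale limits: rows with y_var outside 0 to 0.8 are dropped before fitting
zoom_scale <- base + scale_y_continuous(limits = c(0, 0.8))

# Coordinate limits: every row is used for the fit, only the view is cropped
zoom_coord <- base + coord_cartesian(ylim = c(0, 0.8))

# layer_data() returns the values computed for a given layer of a plot
fit_scale <- suppressWarnings(layer_data(zoom_scale, 1))
fit_coord <- layer_data(zoom_coord, 1)

range(fit_scale$x)  # fitted line stops early because rows were removed
range(fit_coord$x)  # fitted line still spans the full x range
```

With scale limits, the fitted line covers a shorter stretch of the x-axis because the model never saw the dropped rows. With coord_cartesian, the fit is unchanged and only the visible window differs.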
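We use the scale_x_continuous layer to modify the x-axis. The following arguments have been used:
expand allows us to specify the padding added at the ends of the x-axis. Here we are adding a constant 0.002 units of space at each end of the axis. The second element of expand is additive, whereas the first is a proportion of the axis range, so this amounts to roughly 0.2% of an axis that spans 0 to 1.
breaks allows us to specify the axis ticks.
labels allows us to specify the text corresponding to each axis tick.
plot <- plot +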
scale_x_continuous(expand = c(0, 0.002),
breaks = c(seq(from = 0, to = 1, by = 0.2)),
labels = c("0", ".2", ".4", ".6", ".8", "1"))
plot
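Because the two elements of expand behave differently, it can help to work through the arithmetic. The sketch below uses a hypothetical helper, padded_limits, which is not part of ggplot2; it only mirrors the rule that each end of the axis moves outwards by the multiplicative constant times the axis range, plus the additive constant. The x-axis range of 0 to 1 is an assumption, since no x limits are set in this guide.

```r
# Hypothetical helper (not part of ggplot2) that mirrors the padding rule:
# each end moves outwards by mult * (axis range) + add
padded_limits <- function(limits, mult = 0, add = 0) {
  pad <- mult * diff(limits) + add
  c(limits[1] - pad, limits[2] + pad)
}

# y-axis: expand = c(0.035, 0) together with limits = c(0, 1.05)
y_padded <- padded_limits(c(0, 1.05), mult = 0.035, add = 0)

# x-axis: expand = c(0, 0.002); a data range of 0 to 1 is assumed here
x_padded <- padded_limits(c(0, 1), mult = 0, add = 0.002)

print(y_padded)
print(x_padded)
```

If different padding is needed at each end, the expansion function in ggplot2 accepts separate lower and upper values for both constants.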
We use the labs layer to create labels. The following arguments have been used:
x allows us to specify the x-axis label.
y allows us to specify the y-axis label.
title allows us to specify the graph title.
plot <- plot +
labs(x = "Variable 1",
y = "Variable 2",
title = "Graph Title")
plot
Up until this point, we have focused on getting the data onto the plot. While we have been able to modify some visual elements, the options available to us have been limited. In this section, we will modify the plot background, font face, axis text font and more.
ggplot comes with a set of pre-installed themes. Often, applying one of them takes care of all of our customization needs. Since we want our graph to be as clutter-free as possible, we apply the classic theme (theme_classic). This removes the plot background color and grid lines, and applies a simple line for the x and y axes.
plot <- plot +
theme_classic()
plot
theme() layer
If we want to customize the visual properties further, we can do so by specifying what we want to change inside the theme() layer. If you choose to use this layer in addition to applying a built-in theme, please apply the built-in theme first. Doing so will ensure that you are building on top of the changes already made by the built-in theme.
We will use the theme() layer to change the font face, change the size of the axis text and horizontally align the graph title.
plot <- plot +
theme(text = element_text(family = "Comic Sans MS"),
axis.text = element_text(size = 12),
plot.title = element_text(hjust = 0.5, size = 15))
plot
Comic Sans used for demonstration only 🙃
Now that our graph is ready, we can save it using the ggsave command:
ggsave("R/Line graphs/Exports/scatter_fitted_line.png")
## Saving 7 x 5 in image
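By default, ggsave saves the most recently displayed plot at the size of the current graphics device, which is where the "Saving 7 x 5 in image" message comes from. For reproducible exports it may be preferable to pass the plot and its dimensions explicitly. The following is a minimal sketch; demo_plot and the temporary output path are placeholders, not the plot or folder used in this guide.

```r
library(ggplot2)

# Hypothetical stand-in for the finished plot built in this guide
demo_plot <- ggplot(data.frame(x_var = c(0, 0.5, 1), y_var = c(0.1, 0.5, 0.9)),
                    aes(x = x_var, y = y_var)) +
  geom_point()

# A temporary file is used so the sketch does not depend on a folder layout
out_file <- file.path(tempdir(), "scatter_fitted_line.png")

# Naming the plot and its size avoids relying on the last plot and device size
ggsave(out_file, plot = demo_plot, width = 7, height = 5, units = "in", dpi = 300)

file.exists(out_file)
```

Setting width, height and dpi explicitly means the exported image does not change when the plotting window is resized.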