Useful for showing the correlation between two variables.

In this example, we will be using a robust linear regression model. More information on this kind of model can be found here. Unlike Stata, base R does not have an option to specify a robust model. Fortunately, the MASS package adds this functionality.

Data preparation

Loading required packages
library("readr")
library("tidyverse")
library("MASS", warn.conflicts = FALSE)


Loading a dummy dataset

Dataset stored as mydata

mydata <- read_csv('Data/line2.csv', show_col_types = FALSE)
mydata
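The file Data/line2.csv is not included with this page. To follow along without it, a stand-in dataset can be simulated. This is only a sketch: the column names x_var and y_var and the 100 rows are consistent with the regression output shown later (98 residual degrees of freedom), but the name mydata_sim, the seed, the coefficients and the noise level are assumptions, not the original data.

```r
# Hypothetical stand-in for Data/line2.csv (NOT the original data).
set.seed(123)                        # arbitrary seed, for reproducibility only
n <- 100                             # 98 residual df + 2 estimated coefficients

x_var <- runif(n, min = 0, max = 1)
y_raw <- 0.09 + 0.85 * x_var + rnorm(n, sd = 0.18)

# Clamp y to [0, 1] so it matches the 0-1 range assumed later for the y-axis.
y_var <- pmin(pmax(y_raw, 0), 1)

mydata_sim <- data.frame(x_var = x_var, y_var = y_var)
head(mydata_sim)
```

Assigning mydata_sim to mydata would let the remaining steps run unchanged, although the numbers will differ from the output printed on this page.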


Regression model

We are running the regression model before we plot the graph. This allows us to obtain the \(\beta\) coefficient (slope) and the standard error and store them in variables. We will be using these variables later in the graph.
The rlm (robust linear model) function is made available to us through the MASS package that we loaded earlier.

model <- rlm(y_var ~ x_var, data = mydata)
model_summary <- summary(model)
model_summary
## 
## Call: rlm(formula = y_var ~ x_var, data = mydata)
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.353685 -0.117405 -0.007895  0.123693  0.518680 
## 
## Coefficients:
##             Value   Std. Error t value
## (Intercept)  0.0896  0.0361     2.4859
## x_var        0.8523  0.0645    13.2142
## 
## Residual standard error: 0.179 on 98 degrees of freedom
slope <- round(model_summary$coefficients[2,1], 2)
se <- round(model_summary$coefficients[2,2], 2)
print(paste0("The slope is ", slope, ", the SE is ", se))
## [1] "The slope is 0.85, the SE is 0.06"
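Indexing the coefficient matrix by position works here, but it depends on the order of the terms in the formula. A minimal sketch of the same extraction by row and column name, using the names that appear in the summary output above. The simulated data frame sim is a hypothetical stand-in for mydata.

```r
library(MASS)

# Hypothetical stand-in data with the same column names as mydata.
set.seed(1)
sim <- data.frame(x_var = runif(100))
sim$y_var <- 0.1 + 0.85 * sim$x_var + rnorm(100, sd = 0.18)

fit_summary <- summary(rlm(y_var ~ x_var, data = sim))
coefs <- fit_summary$coefficients

# Rows are named after the model terms, columns after the statistics.
slope_by_name <- round(coefs["x_var", "Value"], 2)
se_by_name    <- round(coefs["x_var", "Std. Error"], 2)

print(paste0("The slope is ", slope_by_name, ", the SE is ", se_by_name))
```

Name-based indexing fails loudly if a term is renamed or removed, instead of silently returning a different coefficient.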


Basic graphing

Creating base plot

Setting the aesthetic arguments which will be inherited by subsequent geometric objects.

plot <- mydata %>%
  ggplot(aes(x = x_var, y = y_var))
plot

Creating the fitted line using geom_smooth

This geometric object is placed first. We do this to ensure that the points appear on top of the fitted line. The following arguments have been used:

  • method allows us to specify which regression model to use. We put rlm to indicate that we want a robust linear model.
  • color allows us to specify the color of the fitted line.
  • formula allows us to specify the regression equation.
  • fill allows us to specify the color of the 95% (default) confidence envelope.
plot <- plot +
  geom_smooth(method = rlm, color = "maroon", formula = y ~ x, fill = "grey10")
plot
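The confidence envelope defaults to 95%, and geom_smooth also accepts a level argument to change it (or se = FALSE to drop the envelope). The sketch below compares two levels on simulated data. The data frame sim and the object names are assumptions made for illustration.

```r
library(ggplot2)
library(MASS)

# Hypothetical stand-in data with the same column names as mydata.
set.seed(1)
sim <- data.frame(x_var = runif(100))
sim$y_var <- 0.1 + 0.85 * sim$x_var + rnorm(100, sd = 0.18)

base <- ggplot(sim, aes(x = x_var, y = y_var))

# Default 95% envelope versus a narrower 80% envelope.
p95 <- base + geom_smooth(method = rlm, formula = y ~ x, level = 0.95)
p80 <- base + geom_smooth(method = rlm, formula = y ~ x, level = 0.80)

# layer_data() returns the values ggplot2 computed for a layer,
# including the envelope bounds ymin and ymax.
width95 <- mean(layer_data(p95)$ymax - layer_data(p95)$ymin)
width80 <- mean(layer_data(p80)$ymax - layer_data(p80)$ymin)

c(width95 = width95, width80 = width80)
```

A lower confidence level gives a narrower envelope around the same fitted line.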

Creating the points using geom_point

The following arguments have been used:

  • color allows us to specify the color of the points. Here we have chosen a shade of blue which is consistent with IDinsight brand colors.
  • size allows us to specify the size of the points.
plot <- plot +
  geom_point(color = "#264D96", size = 2.5)
plot

Creating the box for the \(\beta\) coefficient and standard error using annotate

The following arguments have been used:

  • geom allows us to specify how information will be presented. Here, we use the label option as that will print a box surrounding the output (like the geom_label object). Aside from the label option, you can also try text, but this will not provide a box around the output (like the geom_text object).
  • x allows us to specify where we want the coefficient and standard error to appear on the x-axis.
  • y allows us to specify where we want the coefficient and standard error to appear on the y-axis.
  • label allows us to specify what we want to display on the graph. Here we use the paste0 function to concatenate the coefficient and standard error into a single label. The \n character creates a new line; we use it to split the information across two lines, which gives a nicer presentation.
  • size allows us to specify the size of the label we are displaying on the graph.
plot <- plot +
  annotate(geom = "label", x = 0.15, y = 0.9,
           label = paste0("Slope: ", slope, "\n(", se, ")"),
           size = 4.5)
plot
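One caveat with round() followed by paste0() is that trailing zeros are dropped, so a slope of 0.80 would be printed as 0.8. A possible alternative, sketched here with made-up values (slope_raw and se_raw are hypothetical), is to format the label with sprintf:

```r
# Hypothetical values, chosen to show the trailing-zero behaviour.
slope_raw <- 0.8
se_raw    <- 0.0645

# round() + paste0(): the trailing zero of 0.80 is lost.
label_paste <- paste0("Slope: ", round(slope_raw, 2), "\n(", round(se_raw, 2), ")")

# sprintf() with "%.2f" always prints exactly two decimal places.
label_sprintf <- sprintf("Slope: %.2f\n(%.2f)", slope_raw, se_raw)

cat(label_paste, "\n\n")
cat(label_sprintf, "\n")
```

Either string can be passed to the label argument of annotate.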

Customizing the y-axis

While the graph we have created so far is functional, we can improve the presentation by tweaking the axes. Since we are analyzing continuous data, we will be using the scale_y_continuous layer to modify the y-axis. If the data were discrete, we would use the scale_y_discrete layer. The following arguments have been used:

  • expand allows us to specify the padding added at the ends of the y-axis, which keeps the data slightly away from the plot edges. The vector c(0.035, 0) is read as (multiplicative, additive): it pads both the bottom and the top of the axis by 3.5% of the axis range and adds no constant. This is often tricky to master, and we recommend going over the documentation.
  • limits allows us to specify the range of the axis. Here we specify that the y-axis should run from 0 to 1.05. In effect, we are simply zooming out; the data is not affected, since the y-axis variable lies between 0 and 1. We do not recommend using the limits argument to zoom in, because values outside the limits are dropped before any statistics are computed. For example, limits = c(0, 0.8) would restrict not just the display of the data but also the associated analysis: the fitted line (and confidence envelope) would change, since they would be estimated only from observations with y between 0 and 0.8. To zoom into a specific region of the graph without discarding data, use the xlim and/or ylim arguments of the coord_cartesian layer.
  • breaks allows us to specify the axis ticks. We take advantage of the seq function to create ticks between 0 and 1 at 0.2 intervals.
plot <- plot + 
  scale_y_continuous(expand = c(0.035, 0),
                     limits = c(0, 1.05),
                     breaks = seq(from = 0, to = 1, by = 0.2))
plot
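The difference between restricting the scale and zooming the view can be checked directly. The sketch below uses simulated data and an ordinary lm fit rather than rlm to keep it small; the data frame sim and the object names are assumptions made for illustration.

```r
library(ggplot2)

# Hypothetical stand-in data; some y values fall above 0.8.
set.seed(1)
sim <- data.frame(x_var = runif(100))
sim$y_var <- 0.1 + 0.85 * sim$x_var + rnorm(100, sd = 0.1)

base <- ggplot(sim, aes(x = x_var, y = y_var)) +
  geom_smooth(method = "lm", formula = y ~ x)

# Scale limits: observations outside [0, 0.8] are dropped before fitting.
clipped <- base + scale_y_continuous(limits = c(0, 0.8))

# coord_cartesian: only the visible window changes, the fit uses all data.
zoomed <- base + coord_cartesian(ylim = c(0, 0.8))

# Fitted values computed by geom_smooth in each version of the plot.
fit_full    <- layer_data(base)$y
fit_clipped <- suppressWarnings(layer_data(clipped)$y)  # warns about removed rows
fit_zoomed  <- layer_data(zoomed)$y

isTRUE(all.equal(fit_full, fit_zoomed))   # is the zoomed fit unchanged?
isTRUE(all.equal(fit_full, fit_clipped))  # is the clipped fit unchanged?
```

The zoomed plot keeps the fit from the full data, while the clipped plot refits on the observations that survive the limits.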

Customizing the x-axis

We use the scale_x_continuous layer to modify the x-axis. The following arguments have been used:

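  • expand allows us to specify the padding added at the ends of the x-axis. Here c(0, 0.002) applies no multiplicative padding and adds a constant 0.002 (in x-axis units) at each end, so the axis starts just to the left of the smallest x value.
  • breaks allows us to specify the axis ticks.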
  • labels allows us to specify the text corresponding to each axis tick.
plot <- plot +
  scale_x_continuous(expand = c(0, 0.002),
                     breaks = seq(from = 0, to = 1, by = 0.2),
                     labels = c("0", ".2", ".4", ".6", ".8", "1"))
plot
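Because a bare two-element vector for expand is easy to misread, the expansion() helper (available in ggplot2 3.3.0 and later) makes the multiplicative and additive parts explicit and also allows different padding at each end. A short sketch, with object names chosen for illustration:

```r
library(ggplot2)

# Same meaning as expand = c(0.035, 0): 3.5% padding at both ends.
both_ends <- expansion(mult = 0.035, add = 0)

# Padding at the lower end only: mult = c(lower, upper).
lower_only <- expansion(mult = c(0.035, 0))

# Additive padding of 0.002 at both ends, as used for the x-axis.
x_padding <- expansion(mult = 0, add = 0.002)

# expansion() returns c(mult_lower, add_lower, mult_upper, add_upper).
both_ends
lower_only
x_padding

# The result plugs straight into the expand argument of a position scale.
y_scale <- scale_y_continuous(expand = lower_only, limits = c(0, 1.05))
```

The one-sided form is useful when padding is wanted below 0 but not above the top of the axis.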

Applying graph and axis labels

We use the labs layer to create labels. The following arguments have been used:

  • x allows us to specify the x-axis label.
  • y allows us to specify the y-axis label.
  • title allows us to specify the graph title.
plot <- plot + 
  labs(x = "Variable 1",
       y = "Variable 2",
       title = "Graph Title")
plot

Graph formatting

Applying an in-built theme

Up until this point, we have focused on getting the data onto the plot. While we have been able to modify some visual elements, the options available to us have been limited. In this section, we will modify the plot background, font face, axis text font and more.
ggplot comes with a set of built-in themes. Often, applying one of them takes care of all of our customization needs. Since we want our graph to be as clutter-free as possible, we apply the classic theme (theme_classic). This removes the plot background color and grid lines, and draws a simple line for the x and y axes.

plot <- plot +
  theme_classic()
plot

Using the theme() layer

If we want to customize the visual properties further, we can do so by specifying what we want to change inside the theme() layer. If you use this layer in addition to an in-built theme, apply the in-built theme first and theme() second. An in-built theme is a complete theme, so adding it after theme() would overwrite your changes; adding theme() afterwards builds on top of the in-built theme.
We will use the theme() layer to change the font face, change the size of the axis text and center the graph title.

plot <- plot + 
  theme(text = element_text(family = "Comic Sans MS"),
        axis.text = element_text(size = 12),
        plot.title = element_text(hjust = 0.5, size = 15))
plot

Comic Sans used for demonstration only 🙃
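The effect of the ordering can be seen by adding theme objects together directly, without a plot. This is a small sketch; the object names are chosen for illustration, and accessing elements with $ is assumed to work as it does in ggplot2 3.x.

```r
library(ggplot2)

# A partial theme that only centers the title.
centered <- theme(plot.title = element_text(hjust = 0.5))

# In-built theme first, tweak second: the tweak survives.
tweak_last <- theme_classic() + centered

# Tweak first, in-built theme second: the complete theme replaces it.
theme_last <- centered + theme_classic()

tweak_last$plot.title$hjust
theme_last$plot.title$hjust
```

Only the first ordering keeps the centered title.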

Saving

Now that our graph is ready, we can save it using the ggsave command:

ggsave("R/Line graphs/Exports/scatter_fitted_line.png")
## Saving 7 x 5 in image
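By default, ggsave saves the last plot displayed and takes its size from the current graphics device, which is why the message above reports 7 x 5 in. For reproducible exports, the plot and dimensions can be passed explicitly. A minimal sketch follows; the small example plot p and the temporary output path are assumptions made so the block runs on its own.

```r
library(ggplot2)

# Small example plot so the block is self-contained.
p <- ggplot(data.frame(x = 1:10, y = 1:10), aes(x = x, y = y)) +
  geom_point()

# Hypothetical output location in the session's temporary directory.
out_file <- file.path(tempdir(), "scatter_fitted_line.png")

# Name the plot and fix the size and resolution explicitly.
ggsave(out_file, plot = p, width = 7, height = 5, units = "in", dpi = 300)

file.exists(out_file)
```

Passing plot explicitly avoids saving the wrong figure when another plot has been displayed in between.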