1 Introduction
In this session, you will learn more about the factor type in R. Factors can be very useful, but you have to be mindful of implicit conversions from simple vectors to factors! They are the source of a lot of pain for R programmers.
As usual, we will need the tidyverse library.
Solution
library(tidyverse)
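The pain usually shows up when numbers stored as text have been turned into a factor without you noticing, for instance by data.frame(..., stringsAsFactors = TRUE), which was the default before R 4.0.0. The minimal sketch below uses a made-up data frame df with a column dose (both names are illustrative and not part of this session's data) to show what goes wrong:

```r
# Hypothetical data: numbers stored as text, converted to a factor on the way in.
df <- data.frame(dose = c("10", "20", "5"), stringsAsFactors = TRUE)

class(df$dose)    # the column is a factor, not a character or numeric vector
levels(df$dose)   # the levels are sorted as text

# as.numeric() returns the internal integer codes, not the original numbers.
codes <- as.numeric(df$dose)

# Converting to character first recovers the real values.
values <- as.numeric(as.character(df$dose))
```

Here codes contains 1, 2, 3 while values contains 10, 20, 5, so any computation done on codes would be silently wrong.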
2 Creating factors
Imagine that you have a variable that records month:
x1 <- c("Dec", "Apr", "Jan", "Mar")
Using a string to record this variable has two problems:
- There are only twelve possible months, and there’s nothing saving you from typos:
x2 <- c("Dec", "Apr", "Jam", "Mar")
- It doesn’t sort in a useful way:
sort(x1)
[1] "Apr" "Dec" "Jan" "Mar"
You can fix both of these problems with a factor.
month_levels <- c(
"Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
y1 <- factor(x1, levels = month_levels)
y1
[1] Dec Apr Jan Mar
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
sort(y1)
[1] Jan Mar Apr Dec
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
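Sorting works because a factor stores each value as an integer code, namely the position of its label in the vector of levels. A small sketch, repeating the definitions above so that it runs on its own (codes and sorted are illustrative names):

```r
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
x1 <- c("Dec", "Apr", "Jan", "Mar")
y1 <- factor(x1, levels = month_levels)

# Each element is stored as the position of its label in levels(y1).
codes <- as.integer(y1)

# sort() orders by these codes, so the months come out in calendar order.
sorted <- as.character(sort(y1))
```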
Any values not in the set of levels will be converted to NA. factor() does this silently; parse_factor() from readr additionally raises a warning:
y2 <- parse_factor(x2, levels = month_levels)
Warning: 1 parsing failure.
row col expected actual
3 -- value in level set Jam
y2
[1] Dec Apr <NA> Mar
attr(,"problems")
# A tibble: 1 × 4
row col expected actual
<int> <int> <chr> <chr>
1 3 NA value in level set Jam
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
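For comparison, base factor() performs the same conversion without any message, so it is worth checking for unexpected NA values yourself. A minimal sketch, in which y2_silent and bad are illustrative names:

```r
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
x2 <- c("Dec", "Apr", "Jam", "Mar")

# No warning is raised here, even though "Jam" is not a valid level.
y2_silent <- factor(x2, levels = month_levels)

# Values that were present in the input but became NA in the factor.
bad <- x2[is.na(y2_silent) & !is.na(x2)]
bad
```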
Sometimes you’d prefer that the order of the levels match the order of the first appearance in the data.
f2 <- x1 %>% factor() %>% fct_inorder()
f2
[1] Dec Apr Jan Mar
Levels: Dec Apr Jan Mar
levels(f2)
[1] "Dec" "Apr" "Jan" "Mar"
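If you prefer to stay in base R, the same level order can be obtained when the factor is created, by passing the unique values in their order of appearance. This is only a sketch of an alternative, with f_default and f3 as illustrative names:

```r
x1 <- c("Dec", "Apr", "Jan", "Mar")

# By default, factor() sorts the levels alphabetically.
f_default <- factor(x1)

# unique() keeps the order of first appearance, so the levels follow the data.
f3 <- factor(x1, levels = unique(x1))

levels(f_default)
levels(f3)
```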
4 Modifying factor order
It’s often useful to change the order of the factor levels in a visualisation.
relig_summary <- gss_cat %>%
group_by(relig) %>%
summarise(
age = mean(age, na.rm = TRUE),
tvhours = mean(tvhours, na.rm = TRUE),
n = n()
)
ggplot(relig_summary, aes(tvhours, relig)) + geom_point()
It is difficult to interpret this plot because there’s no overall pattern. We can improve it by reordering the levels of relig using fct_reorder(). fct_reorder() takes three arguments:
- .f, the factor whose levels you want to modify.
- .x, a numeric vector that you want to use to reorder the levels.
- Optionally, .fun, a function that’s used if there are multiple values of .x for each value of .f. The default value is median.
ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) +
geom_point()
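In relig_summary there is only one row per level, so the summary function plays no role. To see its effect, here is a small sketch on made-up data (f and x are toy vectors, not taken from gss_cat), where each level has two values and the summary function is passed through the .fun argument of the forcats signature:

```r
library(forcats)

# Toy data: three levels, two numeric values each.
f <- factor(c("a", "a", "b", "b", "c", "c"))
x <- c(5, 7, 1, 2, 3, 100)

# Default summary (median): a = 6, b = 1.5, c = 51.5.
by_median <- levels(fct_reorder(f, x))

# Using the minimum instead: a = 5, b = 1, c = 3.
by_min <- levels(fct_reorder(f, x, .fun = min))
```

With the median the levels are ordered b, a, c; with the minimum they become b, c, a, because the single large value in c no longer matters.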
As you start making more complicated transformations, I’d recommend moving them out of aes() and into a separate mutate() step. For example, you could rewrite the plot above as:
relig_summary %>%
mutate(relig = fct_reorder(relig, tvhours)) %>%
ggplot(aes(tvhours, relig)) +
geom_point()
5 fct_reorder2()
Another type of reordering is useful when you are colouring the lines on a plot. fct_reorder2() reorders the factor by the y values associated with the largest x values. This makes the plot easier to read because the line colours line up with the legend.
by_age <- gss_cat %>%
filter(!is.na(age)) %>%
count(age, marital) %>%
group_by(age) %>%
mutate(prop = n / sum(n))
ggplot(by_age, aes(age, prop, colour = marital)) +
geom_line(na.rm = TRUE)
ggplot(by_age, aes(age, prop, colour = fct_reorder2(marital, age, prop))) +
geom_line() +
labs(colour = "marital")
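To check what fct_reorder2() does to the levels without drawing a plot, you can try it on a tiny made-up data frame (toy, with columns g, x and y, is purely illustrative). This sketch assumes the default behaviour of fct_reorder2(), which sorts the levels by the y value found at the largest x, in decreasing order:

```r
library(forcats)

# Toy data: three groups observed at x = 1 and x = 2.
toy <- data.frame(
  g = factor(rep(c("a", "b", "c"), each = 2)),
  x = rep(c(1, 2), times = 3),
  y = c(1, 2, 5, 9, 8, 4)
)

# At the largest x (x = 2) the y values are: a = 2, b = 9, c = 4.
reordered <- fct_reorder2(toy$g, toy$x, toy$y)
levels(reordered)
```

The levels come out as b, c, a, which is the order in which the lines end on the right-hand side of a plot, from top to bottom.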
6 Materials
There is a lot of material online for R, and more particularly for the tidyverse and RStudio.
You can find cheat sheets for all the packages of the tidyverse on this page:
https://www.rstudio.com/resources/cheatsheets/
The RStudio websites are also a good place to learn more about R and the meta-package maintained by the RStudio community:
- https://www.rstudio.com/resources/webinars/
- https://www.rstudio.com/products/rpackages/
For example, rmarkdown is a great way to turn your analyses into high-quality documents, reports, presentations and dashboards.
In addition, most packages provide vignettes on how to perform an analysis from scratch. On the bioconductor.org website (specialised in R packages for biologists), you will find direct links to each package’s vignettes.
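Vignettes of installed packages can also be listed and opened from the R console. A minimal sketch, assuming forcats is installed (it comes with the tidyverse); v is an illustrative name:

```r
# List the vignettes shipped with an installed package.
v <- vignette(package = "forcats")
v$results[, c("Item", "Title"), drop = FALSE]

# Open the first listed vignette, or browse all of them in a web browser:
# vignette(v$results[1, "Item"], package = "forcats")
# browseVignettes("forcats")
```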
Finally, don’t forget to search the web for your problems or errors in R. Websites like Stack Overflow contain high-quality and well-curated answers.