1 Introduction
Until now we have worked with data already formated in a nice way.
In the tidyverse
data formated in a nice way are called tidy
The goal of this practical is to understand how to transform an hugly blob of information into a tidy data set.
1.1 Tidydata
There are three interrelated rules which make a dataset tidy:
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
Doing this kind and transformation is often called data wrangling, due to the felling that we have to wrangle with the data to force them into a tidy format.
But once this step is finish most of the subsequent analysis will be realy fast to do !
As usual we will need the tidyverse
library.
Solution
library(tidyverse)
For this practical we are going to use the table
set of datasets which demonstrate multiple ways to layout the same tabular data.
Use the help to know more about table1
dataset
Solution
?table1
table1
, table2
, table3
, table4a
, table4b
, and table5
all display the number of TB (Tuberculosis) cases documented by the World Health Organization in Afghanistan, Brazil, and China between 1999 and 2000. The data contains values associated with four variables (country, year, cases, and population), but each table organizes the values in a different layout.
2 Pivoting data
2.1 pivot longer
wide_example <- tibble(X1 = c("A","B"),
X2 = c(1,2),
X3 = c(0.1,0.2),
X4 = c(10,20))
If you have a wide dataset, such as wide_example
, that you want to make longer, you will use the pivot_longer()
function.
You have to specify the names of the columns you want to pivot into longer format (X2,X3,X4):
wide_example %>%
pivot_longer(c(X2,X3,X4))
… or the reverse selection (-X1):
wide_example %>% pivot_longer(-X1)
You can specify the names of the columns where the data will be tidy (by default, it is names
and value
):
long_example <- wide_example %>%
pivot_longer(-X1), names_to = "V1", values_to = "V2")
2.1.1 Exercice
Visualize the table4a
dataset (you can use the View()
function).
View(table4a)
Is the data tidy ? How would you transform this dataset to make it tidy ?
Solution
We have information about 3 variables in the table4a
: country
, year
and number of cases
.
However, the variable information (year
) is stored as column names.
We want to pivot the horizontal column year, vertically and make the table longer.
You can use the pivot_longer
fonction to make your table longer and have one observation per row and one variable per column.
For this we need to :
- specify which column to select (all except
country
). - give the name of the new variable (
year
) - give the name of the variable stored in the cells of the columns years (
case
)
table4a %>%
pivot_longer(-country,
names_to = "year",
values_to = "case")
# A tibble: 6 × 3
country year case
<chr> <chr> <int>
1 Afghanistan 1999 745
2 Afghanistan 2000 2666
3 Brazil 1999 37737
4 Brazil 2000 80488
5 China 1999 212258
6 China 2000 213766
2.2 pivot wider
If you have a long dataset, that you want to make wider, you will use the pivot_wider()
function.
You have to specify which column contains the name of the output column (names_from
), and which column contains the cell values from (values_from
).
long_example %>% pivot_wider(names_from = V1,
values_from = V2)
2.2.1 Exercice
Visualize the table2
dataset
Is the data tidy ? How would you transform this dataset to make it tidy ? (you can now make also make a guess from the name of the subsection)
Solution
The column count
store two types of information: the population
size of the country and the number of cases
in the country.
You can use the pivot_wider
fonction to make your table wider and have one observation per row and one variable per column.
table2 %>%
pivot_wider(names_from = type,
values_from = count)
# A tibble: 6 × 4
country year cases population
<chr> <int> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
3 Merging data
3.1 Relational data
To avoid having a huge table and to save space, information is often splited between different tables.
In our flights
dataset, information about the carrier
or the airports
(origin and dest) are saved in a separate table (airlines
, airports
).
library(nycflights13)
flights
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
# carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
airlines
# A tibble: 16 × 2
carrier name
<chr> <chr>
1 9E Endeavor Air Inc.
2 AA American Airlines Inc.
3 AS Alaska Airlines Inc.
4 B6 JetBlue Airways
5 DL Delta Air Lines Inc.
6 EV ExpressJet Airlines Inc.
7 F9 Frontier Airlines Inc.
8 FL AirTran Airways Corporation
9 HA Hawaiian Airlines Inc.
10 MQ Envoy Air
11 OO SkyWest Airlines Inc.
12 UA United Air Lines Inc.
13 US US Airways Inc.
14 VX Virgin America
15 WN Southwest Airlines Co.
16 YV Mesa Airlines Inc.
airports
# A tibble: 1,458 × 8
faa name lat lon alt tz dst tzone
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/…
2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A America/…
3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/…
4 06N Randall Airport 41.4 -74.4 523 -5 A America/…
5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/…
6 0A9 Elizabethton Municipal Airport 36.4 -82.2 1593 -5 A America/…
7 0G6 Williams County Airport 41.5 -84.5 730 -5 A America/…
8 0G7 Finger Lakes Regional Airport 42.9 -76.8 492 -5 A America/…
9 0P2 Shoestring Aviation Airfield 39.8 -76.6 1000 -5 U America/…
10 0S9 Jefferson County Intl 48.1 -123. 108 -8 A America/…
# … with 1,448 more rows
weather
# A tibble: 26,115 × 15
origin year month day hour temp dewp humid wind_dir wind_speed
<chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4
2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06
3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5
4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7
5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7
6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5
7 EWR 2013 1 1 7 39.0 28.0 64.4 240 15.0
8 EWR 2013 1 1 8 39.9 28.0 62.2 250 10.4
9 EWR 2013 1 1 9 39.9 28.0 62.2 260 15.0
10 EWR 2013 1 1 10 41 28.0 59.6 260 13.8
# … with 26,105 more rows, and 5 more variables: wind_gust <dbl>, precip <dbl>,
# pressure <dbl>, visib <dbl>, time_hour <dttm>
flights2 <- flights %>%
select(year:day, hour, origin, dest, tailnum, carrier)
3.2 Relational schema
The relationships between tables can be seen in a relational graph. The variables used to connect each pair of tables are called keys. A key is a variable (or set of variables) that uniquely identifies an observation.
3.3 Joints
If you have to combine data from 2 tables in a a new table, you will use joints
.
There are several types of joints depending of what you want to get.
Small concrete examples:
3.3.1 inner_joint()
keeps observations in x
AND y
flights2 %>%
inner_join(airlines)
Joining, by = "carrier"
# A tibble: 336,776 × 9
year month day hour origin dest tailnum carrier name
<int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr>
1 2013 1 1 5 EWR IAH N14228 UA United Air Lines Inc.
2 2013 1 1 5 LGA IAH N24211 UA United Air Lines Inc.
3 2013 1 1 5 JFK MIA N619AA AA American Airlines Inc.
4 2013 1 1 5 JFK BQN N804JB B6 JetBlue Airways
5 2013 1 1 6 LGA ATL N668DN DL Delta Air Lines Inc.
6 2013 1 1 5 EWR ORD N39463 UA United Air Lines Inc.
7 2013 1 1 6 EWR FLL N516JB B6 JetBlue Airways
8 2013 1 1 6 LGA IAD N829AS EV ExpressJet Airlines Inc.
9 2013 1 1 6 JFK MCO N593JB B6 JetBlue Airways
10 2013 1 1 6 LGA ORD N3ALAA AA American Airlines Inc.
# … with 336,766 more rows
3.3.2 left_joint()
keeps all observations in x
flights2 %>%
left_join(airlines)
Joining, by = "carrier"
# A tibble: 336,776 × 9
year month day hour origin dest tailnum carrier name
<int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr>
1 2013 1 1 5 EWR IAH N14228 UA United Air Lines Inc.
2 2013 1 1 5 LGA IAH N24211 UA United Air Lines Inc.
3 2013 1 1 5 JFK MIA N619AA AA American Airlines Inc.
4 2013 1 1 5 JFK BQN N804JB B6 JetBlue Airways
5 2013 1 1 6 LGA ATL N668DN DL Delta Air Lines Inc.
6 2013 1 1 5 EWR ORD N39463 UA United Air Lines Inc.
7 2013 1 1 6 EWR FLL N516JB B6 JetBlue Airways
8 2013 1 1 6 LGA IAD N829AS EV ExpressJet Airlines Inc.
9 2013 1 1 6 JFK MCO N593JB B6 JetBlue Airways
10 2013 1 1 6 LGA ORD N3ALAA AA American Airlines Inc.
# … with 336,766 more rows
3.3.3 right_joint()
keeps all observations in y
flights2 %>%
right_join(airlines)
Joining, by = "carrier"
# A tibble: 336,776 × 9
year month day hour origin dest tailnum carrier name
<int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr>
1 2013 1 1 5 EWR IAH N14228 UA United Air Lines Inc.
2 2013 1 1 5 LGA IAH N24211 UA United Air Lines Inc.
3 2013 1 1 5 JFK MIA N619AA AA American Airlines Inc.
4 2013 1 1 5 JFK BQN N804JB B6 JetBlue Airways
5 2013 1 1 6 LGA ATL N668DN DL Delta Air Lines Inc.
6 2013 1 1 5 EWR ORD N39463 UA United Air Lines Inc.
7 2013 1 1 6 EWR FLL N516JB B6 JetBlue Airways
8 2013 1 1 6 LGA IAD N829AS EV ExpressJet Airlines Inc.
9 2013 1 1 6 JFK MCO N593JB B6 JetBlue Airways
10 2013 1 1 6 LGA ORD N3ALAA AA American Airlines Inc.
# … with 336,766 more rows
3.3.4 full_joint()
keeps all observations in x
and y
flights2 %>%
full_join(airlines)
Joining, by = "carrier"
# A tibble: 336,776 × 9
year month day hour origin dest tailnum carrier name
<int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr>
1 2013 1 1 5 EWR IAH N14228 UA United Air Lines Inc.
2 2013 1 1 5 LGA IAH N24211 UA United Air Lines Inc.
3 2013 1 1 5 JFK MIA N619AA AA American Airlines Inc.
4 2013 1 1 5 JFK BQN N804JB B6 JetBlue Airways
5 2013 1 1 6 LGA ATL N668DN DL Delta Air Lines Inc.
6 2013 1 1 5 EWR ORD N39463 UA United Air Lines Inc.
7 2013 1 1 6 EWR FLL N516JB B6 JetBlue Airways
8 2013 1 1 6 LGA IAD N829AS EV ExpressJet Airlines Inc.
9 2013 1 1 6 JFK MCO N593JB B6 JetBlue Airways
10 2013 1 1 6 LGA ORD N3ALAA AA American Airlines Inc.
# … with 336,766 more rows
3.4 Defining the key columns
The default, by = NULL
, uses all variables that appear in both tables, the so called natural join.
flights2 %>%
left_join(weather)
Joining, by = c("year", "month", "day", "hour", "origin")
# A tibble: 336,776 × 18
year month day hour origin dest tailnum carrier temp dewp humid
<int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 2013 1 1 5 EWR IAH N14228 UA 39.0 28.0 64.4
2 2013 1 1 5 LGA IAH N24211 UA 39.9 25.0 54.8
3 2013 1 1 5 JFK MIA N619AA AA 39.0 27.0 61.6
4 2013 1 1 5 JFK BQN N804JB B6 39.0 27.0 61.6
5 2013 1 1 6 LGA ATL N668DN DL 39.9 25.0 54.8
6 2013 1 1 5 EWR ORD N39463 UA 39.0 28.0 64.4
7 2013 1 1 6 EWR FLL N516JB B6 37.9 28.0 67.2
8 2013 1 1 6 LGA IAD N829AS EV 39.9 25.0 54.8
9 2013 1 1 6 JFK MCO N593JB B6 37.9 27.0 64.3
10 2013 1 1 6 LGA ORD N3ALAA AA 39.9 25.0 54.8
# … with 336,766 more rows, and 7 more variables: wind_dir <dbl>,
# wind_speed <dbl>, wind_gust <dbl>, precip <dbl>, pressure <dbl>,
# visib <dbl>, time_hour <dttm>
If the two tables contain columns with the same names but corresponding to different things (such as year
in flights2
and planes
) you have to manually define the key or the keys.
flights2 %>%
left_join(planes, by = "tailnum")
# A tibble: 336,776 × 16
year.x month day hour origin dest tailnum carrier year.y type
<int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <int> <chr>
1 2013 1 1 5 EWR IAH N14228 UA 1999 Fixed wing mult…
2 2013 1 1 5 LGA IAH N24211 UA 1998 Fixed wing mult…
3 2013 1 1 5 JFK MIA N619AA AA 1990 Fixed wing mult…
4 2013 1 1 5 JFK BQN N804JB B6 2012 Fixed wing mult…
5 2013 1 1 6 LGA ATL N668DN DL 1991 Fixed wing mult…
6 2013 1 1 5 EWR ORD N39463 UA 2012 Fixed wing mult…
7 2013 1 1 6 EWR FLL N516JB B6 2000 Fixed wing mult…
8 2013 1 1 6 LGA IAD N829AS EV 1998 Fixed wing mult…
9 2013 1 1 6 JFK MCO N593JB B6 2004 Fixed wing mult…
10 2013 1 1 6 LGA ORD N3ALAA AA NA <NA>
# … with 336,766 more rows, and 6 more variables: manufacturer <chr>,
# model <chr>, engines <int>, seats <int>, speed <int>, engine <chr>
If you want to join by data that are in two columns with different names, you must specify the correspondence with a named character vector: by = c("a" = "b")
. This will match variable a
in table x
to variable b
in table y
.
flights2 %>%
left_join(airports, c("dest" = "faa"))
# A tibble: 336,776 × 15
year month day hour origin dest tailnum carrier name lat lon alt
<int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 2013 1 1 5 EWR IAH N14228 UA Georg… 30.0 -95.3 97
2 2013 1 1 5 LGA IAH N24211 UA Georg… 30.0 -95.3 97
3 2013 1 1 5 JFK MIA N619AA AA Miami… 25.8 -80.3 8
4 2013 1 1 5 JFK BQN N804JB B6 <NA> NA NA NA
5 2013 1 1 6 LGA ATL N668DN DL Harts… 33.6 -84.4 1026
6 2013 1 1 5 EWR ORD N39463 UA Chica… 42.0 -87.9 668
7 2013 1 1 6 EWR FLL N516JB B6 Fort … 26.1 -80.2 9
8 2013 1 1 6 LGA IAD N829AS EV Washi… 38.9 -77.5 313
9 2013 1 1 6 JFK MCO N593JB B6 Orlan… 28.4 -81.3 96
10 2013 1 1 6 LGA ORD N3ALAA AA Chica… 42.0 -87.9 668
# … with 336,766 more rows, and 3 more variables: tz <dbl>, dst <chr>,
# tzone <chr>
If 2 columns have identical names in the input tables but are not used in the join, they are automatically renamed with the suffix .x
and .y
because all column names must be different in the output table.
flights2 %>%
left_join(airports, c("dest" = "faa")) %>%
left_join(airports, c("origin" = "faa"))
# A tibble: 336,776 × 22
year month day hour origin dest tailnum carrier name.x lat.x lon.x alt.x
<int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 2013 1 1 5 EWR IAH N14228 UA Georg… 30.0 -95.3 97
2 2013 1 1 5 LGA IAH N24211 UA Georg… 30.0 -95.3 97
3 2013 1 1 5 JFK MIA N619AA AA Miami… 25.8 -80.3 8
4 2013 1 1 5 JFK BQN N804JB B6 <NA> NA NA NA
5 2013 1 1 6 LGA ATL N668DN DL Harts… 33.6 -84.4 1026
6 2013 1 1 5 EWR ORD N39463 UA Chica… 42.0 -87.9 668
7 2013 1 1 6 EWR FLL N516JB B6 Fort … 26.1 -80.2 9
8 2013 1 1 6 LGA IAD N829AS EV Washi… 38.9 -77.5 313
9 2013 1 1 6 JFK MCO N593JB B6 Orlan… 28.4 -81.3 96
10 2013 1 1 6 LGA ORD N3ALAA AA Chica… 42.0 -87.9 668
# … with 336,766 more rows, and 10 more variables: tz.x <dbl>, dst.x <chr>,
# tzone.x <chr>, name.y <chr>, lat.y <dbl>, lon.y <dbl>, alt.y <dbl>,
# tz.y <dbl>, dst.y <chr>, tzone.y <chr>
You can change the suffix using the option suffix
flights2 %>%
left_join(airports, by = c("dest" = "faa")) %>%
left_join(airports, by = c("origin" = "faa"), suffix = c(".dest",".origin"))
# A tibble: 336,776 × 22
year month day hour origin dest tailnum carrier name.dest lat.dest
<int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl>
1 2013 1 1 5 EWR IAH N14228 UA George Bush In… 30.0
2 2013 1 1 5 LGA IAH N24211 UA George Bush In… 30.0
3 2013 1 1 5 JFK MIA N619AA AA Miami Intl 25.8
4 2013 1 1 5 JFK BQN N804JB B6 <NA> NA
5 2013 1 1 6 LGA ATL N668DN DL Hartsfield Jac… 33.6
6 2013 1 1 5 EWR ORD N39463 UA Chicago Ohare … 42.0
7 2013 1 1 6 EWR FLL N516JB B6 Fort Lauderdal… 26.1
8 2013 1 1 6 LGA IAD N829AS EV Washington Dul… 38.9
9 2013 1 1 6 JFK MCO N593JB B6 Orlando Intl 28.4
10 2013 1 1 6 LGA ORD N3ALAA AA Chicago Ohare … 42.0
# … with 336,766 more rows, and 12 more variables: lon.dest <dbl>,
# alt.dest <dbl>, tz.dest <dbl>, dst.dest <chr>, tzone.dest <chr>,
# name.origin <chr>, lat.origin <dbl>, lon.origin <dbl>, alt.origin <dbl>,
# tz.origin <dbl>, dst.origin <chr>, tzone.origin <chr>
3.5 Filtering joins
Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables. There are two types:
semi_join(x, y)
keeps all observations inx
that have a match iny
.anti_join(x, y)
drops all observations inx
that have a match iny
.
top_dest <- flights %>%
count(dest, sort = TRUE) %>%
head(10)
flights %>%
semi_join(top_dest)
Joining, by = "dest"
# A tibble: 141,145 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 542 540 2 923 850
2 2013 1 1 554 600 -6 812 837
3 2013 1 1 554 558 -4 740 728
4 2013 1 1 555 600 -5 913 854
5 2013 1 1 557 600 -3 838 846
6 2013 1 1 558 600 -2 753 745
7 2013 1 1 558 600 -2 924 917
8 2013 1 1 558 600 -2 923 937
9 2013 1 1 559 559 0 702 706
10 2013 1 1 600 600 0 851 858
# … with 141,135 more rows, and 11 more variables: arr_delay <dbl>,
# carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
3.6 Set operations
These expect the x and y inputs to have the same variables, and treat the observations like sets:
intersect(x, y)
: return only observations in bothx
andy
.union(x, y)
: return unique observations inx
andy
.setdiff(x, y)
: return observations inx
, but not iny
.