::i_am("quarto/Wrangel-Task.qmd")
herelibrary(here)
library(dplyr)
library(tidyr)
library(data.table)
Possible solution to the data wrangling task
1 Task 3: data wrangling
There are two main aspects of this task:
- Import the raw data and make sure this works
- Make the data tidy
1.1 Import raw data
First, I import the raw data:
<- here("data/raw/wrangel_1.csv")
data_path <- as_tibble(data.table::fread(
data_raw file = data_path, header = TRUE))
head(data_raw)
# A tibble: 6 × 18
country name `2005` `2006` `2007` `2008` `2009` `2010` `2011` `2012` `2013`
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Germany Growth 0.732 3.82 2.98 0.960 -5.69 4.18 3.93 0.418 0.438
2 Germany Educat… NA 4.29 4.37 4.44 4.91 4.94 4.82 4.96 4.94
3 Germany Health… 10.3 10.2 10.1 10.3 11.2 11.1 10.8 10.9 11.0
4 Italy Growth 0.818 1.79 1.49 -0.962 -5.28 1.71 0.707 -2.98 -1.84
5 Italy Educat… 4.24 4.53 4.11 4.39 4.52 4.33 4.12 4.06 4.14
6 Italy Health… 8.34 8.44 8.14 8.53 8.95 8.92 8.77 8.78 8.78
# ℹ 7 more variables: `2014` <dbl>, `2015` <dbl>, `2016` <dbl>, `2017` <dbl>,
# `2018` <dbl>, `2019` <dbl>, `2020` <dbl>
Without header = TRUE
, the file will not be imported correctly. You can see this from the columns V1
, …
1.2 Make the data set tidy
In a tidy data set,…
- every column corresponds to one variable
- Here not satisfied, years should be placed in their proper column
year
- Also, under the column
name
, several varaiable names are mentioned
- Here not satisfied, years should be placed in their proper column
- every cell contains only one value
- This is already satisfied
- every row corresponds to one observation
- This is not true because the observations for Germany in 2008, for instance, are scattered across several rows
It is a good idea to describe how the tidy data set should look like. In the tidy data set, we should have the following columns:
country
,year
,Growth
,EducationSpending
,HealthSpending
We then start with one step that we expect to be helpful and then see how to continue. A good first stept to bring us closer to a tidy data set is to get rid of the year columns:
<- tidyr::pivot_longer(
data_v1 data = data_raw,
cols = -all_of(c("country", "name")), # also works without all_of
names_to = "year",
values_to = "value")
head(data_v1)
# A tibble: 6 × 4
country name year value
<chr> <chr> <chr> <dbl>
1 Germany Growth 2005 0.732
2 Germany Growth 2006 3.82
3 Germany Growth 2007 2.98
4 Germany Growth 2008 0.960
5 Germany Growth 2009 -5.69
6 Germany Growth 2010 4.18
A viable next step would be to transform the columns name
and value
into proper variable names with values. To this end, we make the data wider, taking the names for the new columns from the column name
and the values from the column value
:
<- tidyr::pivot_wider(
data_v2 data = data_v1, names_from = "name", values_from = "value")
head(data_v2)
# A tibble: 6 × 5
country year Growth EducationSpending HealthSpending
<chr> <chr> <dbl> <dbl> <dbl>
1 Germany 2005 0.732 NA 10.3
2 Germany 2006 3.82 4.29 10.2
3 Germany 2007 2.98 4.37 10.1
4 Germany 2008 0.960 4.44 10.3
5 Germany 2009 -5.69 4.91 11.2
6 Germany 2010 4.18 4.94 11.1
Hurray, finished! All three requirements are met!
If there was also the tasked of saving the data:
::fwrite(
data.tablex = data_v2, file = here::here("data/tidy/wrangel_1_tidy.csv"))