library(here)
library(dplyr)
library(tidyr)
library(ggplot2)
library(data.table)
Possible solutions for the recap exercises
1 Packages used
2 CO2
We first import the raw data. Please make sure you use the here
-package and adjust the relative paths of the following code.
Since many of the column headers were numbers (a.k.a. years), we need to make explicit that these are not values but header names. We do so by seeting the optional argument header = TRUE
:
<- fread("co2_raw.csv", header = TRUE) co2_data_raw
After inspecting the data using functions such as str()
, unique()
or head()
, we first remove colums we obviously do not need and that might be irritating:
<- co2_data_raw %>%
co2_data_tidy_1 select(-c(
"Indicator Name", "Indicator Code",
# unique() tells you there is only one indicator
"Country Code", # Not needed
"V69" # Sometimes such erroneous columns are part of what you download
))
We then move the year columns into rows by using tidyr::pivot_longer()
:
<- co2_data_tidy_1 %>%
co2_data_tidy_2 ::pivot_longer(
tidyrcols = -"Country Name",
names_to = "year",
values_to = "co2_percap")
head(co2_data_tidy_2)
# A tibble: 6 × 3
`Country Name` year co2_percap
<chr> <chr> <dbl>
1 Aruba 1960 NA
2 Aruba 1961 NA
3 Aruba 1962 NA
4 Aruba 1963 NA
5 Aruba 1964 NA
6 Aruba 1965 NA
We see that the year column is still a character
. So me transform it into a double
to then filter the years. We can also filter for the required countries within the same function call and then rename the column:
<- co2_data_tidy_2 %>%
co2_data_tidy_3 mutate(year = as.double(year)) %>%
filter(
>= 2000, year<=2020,
year `Country Name` %in% c(
"South Africa", "United States", "Sub-Saharan Africa",
"European Union", "Germany", "China")
%>%
) rename(country = `Country Name`)
We could have done everything in one call as well:
<- co2_data_raw %>%
co2_data_tidy select(-c(
"Indicator Name", "Indicator Code",
# unique() tells you there is only one indicator
"Country Code", # Not needed
"V69" # Sometimes such erroneous columns are part of what you download
%>%
)) ::pivot_longer(
tidyrcols = -"Country Name",
names_to = "year",
values_to = "co2_percap") %>%
mutate(year = as.double(year)) %>%
filter(
>= 2000, year<=2020,
year `Country Name` %in% c(
"South Africa", "United States", "Sub-Saharan Africa",
"European Union", "Germany", "China")
%>%
) rename(country = `Country Name`)
Then think about a useful location to store the data and do something like:
fwrite(co2_data_tidy, file = here("data/tidy/co2_tidy.csv"))
3 Data wrangling I
Please make sure you use the here
-package and adjust the relative paths of the following code:
Compute, for each country, the percentage change of the spending from the year 2010 to the year 2020 and save this as a variable called perc_change
.
<- fread("education_income.csv")
educ_exercise_data_raw
<- educ_exercise_data_raw %>%
educ_exercise_data ::select(-c("income", "GDPpc")) %>%
dplyr::filter(year %in% c(2010, 2020)) %>%
dplyr::pivot_wider(
tidyrnames_from = "year",
values_from = "EducationSpending"
%>%
) ::mutate(
dplyrperc_change = ((`2020`-`2010`)/`2010`)*100
%>%
) ::filter(!is.na(perc_change))
dplyrhead(educ_exercise_data)
# A tibble: 6 × 4
iso3c `2010` `2020` perc_change
<chr> <dbl> <dbl> <dbl>
1 ALB 3.41 3.34 -2.07
2 AND 2.98 2.86 -4.05
3 AGO 3.42 2.74 -19.8
4 ARG 5.02 5.28 5.18
5 ARM 3.25 2.71 -16.7
6 AUS 5.54 5.61 1.27
Then think about a useful location to store the data and do something like:
::fwrite(
data.tablex = educ_exercise_data,
file = here("data/tidy/educ_perc_change.csv"))
4 Data wrangling II
We use educ_exercise_data_raw
as imported above as a starting point and proceed as follows:
Compute for each income group the average expense of education over the whole period. Make sure missing values are ignored.
Save the new data set under a useful name in an adequate location.
<- educ_exercise_data_raw %>%
educ_exercise_summarized ::summarise(
dplyrEducExpense_avg = mean(EducationSpending, na.rm = TRUE),
.by = "income")
educ_exercise_summarized
income EducExpense_avg
1 Low income 3.342121
2 Upper middle income 4.723520
3 Lower middle income 4.534556
4 High income 4.644050
Then think about a useful location to store the data and do something like:
::fwrite(
data.tablex = educ_exercise_summarized,
file = here("data/tidy/educ_perc_income-groups.csv"))
5 Visualization and Quarto
The quarto header should look like this:
title: "Sessions 12 and 13: Recap and Practice"
author: "Claudius Gräbner-Radkowitsch"
format:
html:
number-sections: true
table-of-contents: true
toc-location: body
execute:
echo: false
warning: false
message: false
6 Visualization and Quarto
To read in the data set do something as the following, but make sure you are using the here
-package and set the path accordingly.
<- data.table::fread("child_mortality.csv")
child_mortality head(child_mortality)
iso3c year ChildMortality GDPpc
<char> <int> <num> <num>
1: AFG 2017 64.6 2096.093
2: AFG 2014 73.4 2110.830
3: AFG 2016 67.2 2023.835
4: AFG 2012 80.3 1958.448
5: AFG 2021 55.7 1673.144
6: AFG 2007 100.0 1287.064
To summarize the data:
<- child_mortality %>%
child_mortality_summarized ::summarise(
dplyrChildMortality = mean(ChildMortality, na.rm = TRUE),
GDPpc = mean(GDPpc, na.rm = TRUE),
.by = "iso3c")
head(child_mortality_summarized)
iso3c ChildMortality GDPpc
1 AFG 88.459091 1660.568
2 ALB 15.031818 9437.101
3 DZA 29.486364 11735.174
4 ASM NaN NaN
5 AND 4.718182 NaN
6 AGO 125.831818 6029.127
We can then directly create a simple scatter plot:
::ggplot(
ggplot2data = child_mortality_summarized,
mapping = aes(x = GDPpc, y = ChildMortality)
+
) geom_point() +
theme_linedraw()
We see a clear non-linear relationship.
We now plot the data in logarithms. You can do this by changing the underlying data, rescale an axis, or make the change directly in the data
argument of ggplot2::ggplot()
:
::ggplot(
ggplot2data = child_mortality_summarized,
mapping = aes(x = log(GDPpc), y = log(ChildMortality))
+
) geom_point() +
theme_linedraw()
The relationship now becomes almost linear. This is typical for relationships that are exponential. We can say: an increase in GDP per capita by one percent is on average associated with a reduction of child mortality by 0.83 per cent (the latter value is given by a regression, but we come to this later).