Chapter 6 Gapminder
The Gapminder Foundation [7] is a Swedish non-profit organization that promotes global development through the use of statistics that can help reduce common myths and sensationalist stories about global health and economics. A useful selection of its data is available in the gapminder data frame of the dslabs library. Our case study is to answer these two questions:
- Is it still reasonable to divide the world between Western countries* and developing countries?
- Is it true that every day we are worse off and rich countries get richer while poor countries get poorer?
(*) In 1993 Samuel Huntington published an article called Clash of Civilizations [8], in which he defined Western countries as those located in the regions of North America, Northern/Southern/Western Europe and Australia and New Zealand.
To address these questions, we will follow a structured data science workflow. We’ll start by exploring the data to understand its structure and content, then move to in-depth analysis to identify relevant variables. Finally, we will use visualization and summarization techniques to synthesize our findings and provide clear answers.
First let’s load the packages used in this chapter and explore the structure of the data frame with str():
library(dslabs)
library(dplyr)
library(ggplot2)
gapminder |>
str()
#> 'data.frame': 10545 obs. of 9 variables:
#> $ country : Factor w/ 185 levels "Albania","Algeria",..: 1 2 3 4 5 6 7 8 9 10 ...
#> $ year : int 1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
#> $ infant_mortality: num 115.4 148.2 208 NA 59.9 ...
#> $ life_expectancy : num 62.9 47.5 36 63 65.4 ...
#> $ fertility : num 6.19 7.65 7.32 4.43 3.11 4.55 4.82 3.45 2.7 5.57 ...
#> $ population : num 1636054 11124892 5270844 54681 20619075 ...
#> $ gdp : num NA 1.38e+10 NA NA 1.08e+11 ...
#> $ continent : Factor w/ 5 levels "Africa","Americas",..: 4 1 1 2 2 3 2 5 4 3 ...
#> $ region : Factor w/ 22 levels "Australia and New Zealand",..: 19 11 10 2 15 21 2 1 22 21 ...
We have a data frame with 10,545 observations and 9 variables.
Now let’s take a look at the data with head():
gapminder |>
head()
#> country year infant_mortality life_expectancy fertility
#> 1 Albania 1960 115.40 62.87 6.19
#> 2 Algeria 1960 148.20 47.50 7.65
#> 3 Angola 1960 208.00 35.98 7.32
#> 4 Antigua and Barbuda 1960 NA 62.97 4.43
#> 5 Argentina 1960 59.87 65.39 3.11
#> 6 Armenia 1960 NA 66.86 4.55
#> population gdp continent region
#> 1 1636054 NA Europe Southern Europe
#> 2 11124892 13828152297 Africa Northern Africa
#> 3 5270844 NA Africa Middle Africa
#> 4 54681 NA Americas Caribbean
#> 5 20619075 108322326649 Americas South America
#> 6 1867396 NA Asia Western Asia
Remember that data frames that come from a library are usually documented, which lets us understand each variable faster. In this case we can open the help page with ?gapminder.
Jumping straight to the questions would leave no room for curiosity about what else is in the data. So we are going to start with other variables such as infant mortality, fertility or population.
We can filter all the rows that are from Peru and select the columns country, year, infant_mortality and population:
gapminder |>
filter(country == "Peru") |>
select(country, year, infant_mortality, population)
#> country year infant_mortality population
#> 1 Peru 1960 135.9 10061519
#> 2 Peru 1961 132.6 10350239
#> 3 Peru 1962 129.1 10650672
#> 4 Peru 1963 125.4 10961539
#> 5 Peru 1964 121.8 11281015
#> 6 Peru 1965 118.2 11607684
#> 7 Peru 1966 114.8 11941327
#> 8 Peru 1967 111.6 12282081
#> 9 Peru 1968 108.7 12629333
#> 10 Peru 1969 106.0 12982444
#> 11 Peru 1970 103.4 13341071
#> 12 Peru 1971 100.9 13704333
#> 13 Peru 1972 98.3 14072476
#> 14 Peru 1973 95.8 14447649
#> 15 Peru 1974 93.3 14832839
#> 16 Peru 1975 91.0 15229951
#> 17 Peru 1976 88.9 15639898
#> 18 Peru 1977 87.0 16061327
#> 19 Peru 1978 85.5 16491087
#> 20 Peru 1979 83.9 16924758
#> 21 Peru 1980 82.4 17359118
#> 22 Peru 1981 80.7 17792551
#> 23 Peru 1982 78.7 18225727
#> 24 Peru 1983 76.3 18660443
#> 25 Peru 1984 73.7 19099575
#> 26 Peru 1985 70.7 19544950
#> 27 Peru 1986 67.6 19996250
#> 28 Peru 1987 64.6 20451712
#> 29 Peru 1988 61.7 20909897
#> 30 Peru 1989 58.9 21368856
#> 31 Peru 1990 56.3 21826658
#> 32 Peru 1991 53.7 22283130
#> 33 Peru 1992 51.0 22737056
#> 34 Peru 1993 48.2 23184222
#> 35 Peru 1994 45.4 23619358
#> 36 Peru 1995 42.5 24038761
#> 37 Peru 1996 39.7 24441076
#> 38 Peru 1997 36.9 24827409
#> 39 Peru 1998 34.3 25199744
#> 40 Peru 1999 31.8 25561297
#> 41 Peru 2000 29.6 25914875
#> 42 Peru 2001 27.6 26261363
#> 43 Peru 2002 25.7 26601463
#> 44 Peru 2003 24.1 26937737
#> 45 Peru 2004 22.6 27273188
#> 46 Peru 2005 21.3 27610406
#> 47 Peru 2006 20.1 27949958
#> 48 Peru 2007 19.0 28292768
#> 49 Peru 2008 18.0 28642048
#> 50 Peru 2009 17.1 29001563
#> 51 Peru 2010 16.3 29373644
#> 52 Peru 2011 15.6 29759891
#> 53 Peru 2012 14.9 30158768
#> 54 Peru 2013 14.2 30565461
#> 55 Peru 2014 13.6 30973148
#> 56 Peru 2015 13.1 31376670
#> 57 Peru 2016 NA NA
Let’s add a filter to obtain only the data from 2015:
gapminder |>
filter(country == "Peru" & year == 2015) |>
select(country, year, infant_mortality, population)
#> country year infant_mortality population
#> 1 Peru 2015 13.1 31376670
We can compare Peru and Chile by creating a vector of countries and replacing the == operator with the %in% operator, which checks whether each value of country is in that vector.
vector_countries <- c("Peru", "Chile")
gapminder |>
filter(country %in% vector_countries & year == 2015) |>
select(country, year, infant_mortality, population)
#> country year infant_mortality population
#> 1 Chile 2015 7.0 17948141
#> 2 Peru 2015 13.1 31376670
Infant mortality is measured as the number of infant deaths per 1,000 live births. Because it is a rate, it already takes population size into account and can be compared directly between countries. In 2015 Peru had a higher infant mortality rate than Chile.
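The difference between == and %in% is easier to see on a small vector. The following is a minimal base R sketch; the vector country is hypothetical and only stands in for the country column of gapminder.

```r
# Hypothetical vector of country names (illustrative only)
country <- c("Peru", "Chile", "Bolivia", "Peru")

# == compares every element against a single value
is_peru <- country == "Peru"

# %in% checks every element for membership in a vector of values
vector_countries <- c("Peru", "Chile")
is_selected <- country %in% vector_countries

is_peru      # TRUE FALSE FALSE TRUE
is_selected  # TRUE TRUE FALSE TRUE
```

Inside filter(), each of these logical vectors plays the same role: only the rows where the value is TRUE are kept.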
6.1 Initial gapminder plots
However, if we want to analyze global data, comparing countries one by one would be impractical. Let’s use ggplot to see if there is a relationship in our data.
Let’s create a scatter plot with data from the year 1990 of the fertility variable (fertility), which is the average number of children per woman, and the life expectancy variable (life_expectancy).
gapminder |>
filter(year == 1990) |>
ggplot() +
aes(x = fertility, y = life_expectancy) +
geom_point()
From this graph we can see that countries where women have on average around 7.5 children have a lower life expectancy. On the other hand, in countries with high life expectancy the average is fewer than 2 children per woman.
As we have done previously, we can color the points according to some other variable. In this case, knowing which continent they belong to could give us a better idea of the data.
gapminder |>
filter(year == 1990) |>
ggplot() +
aes(x = fertility, y = life_expectancy, color = continent) +
geom_point()
In this graph, groups begin to emerge. Several European countries are in the upper left quadrant, while several African countries are in the lower right quadrant.
6.2 Facets
Although the previous graph already shows a relationship between the variables, we cannot see how it has changed from one year to another. For this we will use the facet layer (facet_).
In the layer facet_grid(row_variable ~ column_variable) we replace row_variable and column_variable with the names of the variables that will define the rows and columns of panels, or with a . when we do not want to split in that direction. For example, let’s compare how the distribution changed between 1960 and 2013.
vector_years <- c(1960, 2013)
gapminder |>
filter(year %in% vector_years) |>
ggplot() +
aes(x = fertility, y = life_expectancy, color = continent) +
geom_point() +
facet_grid(year ~ .)
We can make it even clearer which continent changed the most if we add the continent variable as a column.
vector_years <- c(1960, 2013)
gapminder |>
filter(year %in% vector_years) |>
ggplot() +
aes(x = fertility, y = life_expectancy, color = continent) +
geom_point() +
facet_grid(year ~ continent)
Having one column per continent makes the graph harder to read because the panels become narrow. It is better to keep the number of columns small, so we swap year and continent.
vector_years <- c(1960, 2013)
gapminder |>
filter(year %in% vector_years) |>
ggplot() +
aes(x = fertility, y = life_expectancy, color = continent) +
geom_point() +
facet_grid(continent ~ year)
Here the change by continent is much more evident: the majority of countries have reduced fertility while increasing life expectancy. We are living longer than in the 1960s and having fewer children per woman. These trends have occurred across all continents.
We don’t always have to show all variables, in this case continents. We can continue applying filters so that it shows us a subset of continents that we want to compare. For example:
vector_years <- c(1960, 2013)
vector_continents <- c("Europe", "Asia")
gapminder |>
filter(year %in% vector_years & continent %in% vector_continents) |>
ggplot() +
aes(x = fertility, y = life_expectancy, color = continent) +
geom_point() +
facet_grid(continent ~ year)
In this case it would be visually better if the continents were not in separate rows but could still be distinguished in the graph. To do this, we will use the wrap facet, facet_wrap(~ x), where x is the variable that defines the panels. In our case that variable is the year: each year gets its own panel, the panels are placed side by side, and the continents are distinguished only by color.
vector_years <- c(1960, 2013)
vector_continents <- c("Europe", "Asia")
gapminder |>
filter(year %in% vector_years & continent %in% vector_continents) |>
ggplot() +
aes(x = fertility, y = life_expectancy, color = continent) +
geom_point() +
facet_wrap( ~ year)
We can show more panels by adding values to the vectors. For example, let’s add an intermediate year, 1985, between 1960 and 2013.
vector_years <- c(1960, 1985, 2013)
vector_continents <- c("Europe", "Asia")
gapminder |>
filter(year %in% vector_years & continent %in% vector_continents) |>
ggplot() +
aes(x = fertility, y = life_expectancy, color = continent) +
geom_point() +
facet_wrap( ~ year)
6.3 Time series
Time series are sequences of data measured at specific moments and ordered chronologically. R allows us to plot time series easily; we only need our data frame to include a time variable.
6.3.1 Individual time series
In an individual time series we only analyze how a single variable has evolved, for example the evolution of the fertility rate in Peru. For this we can use a scatter plot with points or with lines.
As we will remember, we use geom_point() for points:
gapminder |>
filter(country == "Peru") |>
ggplot() +
aes(x = year, y = fertility) +
geom_point()
#> Warning: Removed 1 row containing missing values or values
#> outside the scale range (`geom_point()`).
We get a warning indicating that some values cannot be drawn because they are NA (not available). This does not prevent the graph from being shown.
If we want a line graph, which is the most used in time series, we use geom_line():
gapminder |>
filter(country == "Peru") |>
ggplot() +
aes(x = year, y = fertility) +
geom_line()
#> Warning: Removed 1 row containing missing values or values
#> outside the scale range (`geom_line()`).
6.3.2 Multiple time series
With multiple time series we can compare how the same variable evolved for several groups. For example, this would be the time series if we compare Peru, Bolivia and Chile:
countries <- c("Peru", "Bolivia", "Chile")
gapminder |>
filter(country %in% countries) |>
ggplot() +
aes(x = year, y = fertility, color = country) +
geom_line()
#> Warning: Removed 3 rows containing missing values or values
#> outside the scale range (`geom_line()`).
We can also remove the legend and show the name of the countries as labels on the same graph. To do this we will first have to create a data frame using the function data.frame() that indicates the coordinates where we want each label to appear:
countries <- c("Peru", "Bolivia", "Chile")
labels <- data.frame(country_names = countries, x = c(1975, 1965, 1962), y = c(6, 7, 4))
labels
#> country_names x y
#> 1 Peru 1975 6
#> 2 Bolivia 1965 7
#> 3 Chile 1962 4
Each row gives the position of one label. For example, the second row indicates that Bolivia should be written at the year 1965 and a fertility rate of 7.
To use these labels in ggplot we add a geom_text layer. We use its data argument to indicate that the labels come from a separate data frame, and we include aes() inside geom_text to map the columns of that data frame to the graph. Keep in mind that the column holding the names must be called the same in both data frames, in this case country, because the color aesthetic is mapped to country. In the following code we therefore rename that column and also adjust the coordinates so that each label sits next to its line:
countries <- c("Peru", "Bolivia", "Chile")
labels <- data.frame(country = countries, x = c(1976, 1972, 1965), y = c(5.2, 6.8, 5.5))
gapminder |> filter(country %in% countries) |>
ggplot() +
aes(year, fertility, col = country) +
geom_line() +
geom_text(data = labels, aes(x, y, label = country)) +
theme(legend.position = "none")
#> Warning: Removed 3 rows containing missing values or values
#> outside the scale range (`geom_line()`).
6.4 Exercises
For these exercises we will continue using the gapminder data frame.
- Generate a scatter plot comparing fertility rates and life expectancy for the Americas in the year 2000. Use color to differentiate between the regions within the continent.
Tip: To create a vector of consecutive integers we can use X:Y. This creates a vector that goes from number X to number Y.
- During the Vietnam War, both the US and Vietnam suffered significant losses. Create a line chart visualizing how life expectancy changed in both countries from 1955 to 1990 to observe the war’s impact.
- Expand the previous chart to include Cambodia, allowing us to visualize the devastating impact of the Khmer Rouge regime (1975-1979) on life expectancy alongside the Vietnam War data.
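The X:Y notation mentioned in these exercises can be sketched in base R as follows, using the range of years from the Vietnam War exercise. The name years is only illustrative.

```r
# Years covered by the exercise, built with the colon operator
years <- 1955:1990

length(years)   # 36
head(years, 3)  # 1955 1956 1957

# A vector like this can be combined with %in%,
# for example filter(year %in% years)
```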
6.5 Histograms with ggplot
We could continue exploring the data until we understand it much better. Eventually we would get to the GDP (gdp) data and realize that comparing GDP alone makes no sense, since some countries have much larger populations than others. Data transformation is nothing new, and we will see that it recurs throughout our analyses.
We are going to use a transformation that gives the GDP per capita per day for each country in each year: we divide gdp by population and then by 365.
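As a quick check of this transformation, the sketch below applies the same expression used in the mutate() calls of this chapter (gdp/population/365) to the figures gapminder stores for Argentina in 2010. The standalone variable names are only for illustration.

```r
# Argentina, 2010 (values as stored in gapminder)
gdp <- 434405530244     # total GDP
population <- 41222875  # number of inhabitants

# Divide by population to get GDP per person,
# then by 365 to get the amount per day
gdp_per_capita_per_day <- gdp / population / 365

round(gdp_per_capita_per_day, 2)  # 28.87
```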
We could visualize this variable first by creating a histogram of it. A histogram in ggplot is nothing more than one of the geoms we have available, in this case it would be geom_histogram(binwidth = x), where x is the width of the bar. For example, let’s calculate the distribution of our created variable in the year 2010:
gapminder |>
mutate(gdp_per_capita_per_day = gdp/population/365) |>
filter(year == 2010) |>
ggplot() +
aes(gdp_per_capita_per_day) +
geom_histogram(binwidth = 5)
#> Warning: Removed 9 rows containing non-finite outside the
#> scale range (`stat_bin()`).
We can filter out the NA values so that we no longer get the warning, using the is.na() function we saw previously. Since in this case we want the rows that are not NA, we negate the function by placing the symbol ! in front of it.
gapminder |>
mutate(gdp_per_capita_per_day = gdp/population/365) |>
filter(year == 2010 & !is.na(gdp_per_capita_per_day)) |>
ggplot() +
aes(gdp_per_capita_per_day) +
geom_histogram(binwidth = 5)
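To see the negation in isolation, here is a minimal base R sketch. It uses the infant mortality values that gapminder stores for Peru in 2015 and 2016, the second of which is missing.

```r
# Infant mortality for Peru in 2015 and 2016 (2016 is NA in gapminder)
infant_mortality <- c(13.1, NA)

is.na(infant_mortality)   # FALSE TRUE
!is.na(infant_mortality)  # TRUE FALSE

# Keep only the values that are not NA
infant_mortality[!is.na(infant_mortality)]  # 13.1
```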
At this point it should be quick to detect that there is a concentration of data from countries with low GDP per capita and we could be tempted to apply a scale transformation on the x-axis. Let’s try with logarithm in base 2:
gapminder |>
mutate(gdp_per_capita_per_day = gdp/population/365) |>
filter(year == 2010 & !is.na(gdp_per_capita_per_day)) |>
ggplot() +
aes(gdp_per_capita_per_day) +
geom_histogram(binwidth = 0.5) + #Change the width to 0.5 due to logarithmic scale
scale_x_continuous(trans = "log2")
Let’s be careful interpreting this data. We cannot say that it is a symmetric distribution, even when with this scale we are tempted to do so. Remember the scale and use it appropriately.
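One way to keep the scale in mind is that, on a log2 axis, equal distances correspond to equal ratios and not to equal differences. A minimal base R sketch, with illustrative values that are not taken from the data:

```r
# Illustrative values of GDP per capita per day (not from gapminder)
values <- c(1, 2, 4, 8, 16, 32, 64)

# Positions on a log2 axis: each doubling moves one unit to the right
log2(values)        # 0 1 2 3 4 5 6

# The gap from 1 to 2 looks the same as the gap from 32 to 64
diff(log2(values))  # 1 1 1 1 1 1
```

This is why a histogram that looks roughly symmetric on the log2 scale can still describe a strongly skewed variable.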
Tip: For smooth distribution curves, you can also use geom_density() instead of geom_histogram(). Density plots are particularly useful when comparing multiple groups on the same plot.
6.6 Box plots with ggplot
In the same way, box plots are one more geom among those available. For them we will use the geom_boxplot() layer.
For example, let’s create a box plot to analyze GDP per capita per day by continent:
gapminder |>
mutate(gdp_per_capita_per_day = gdp/population/365) |>
filter(year == 2010 & !is.na(gdp_per_capita_per_day)) |>
ggplot() +
aes(continent, gdp_per_capita_per_day) +
geom_boxplot()
Now let’s zoom in. Within each continent we have regions, for example in the Americas we have South America, Central America, North America, and so on with each continent. Let’s change the continent variable to region.
gapminder |>
mutate(gdp_per_capita_per_day = gdp/population/365) |>
filter(year == 2010 & !is.na(gdp_per_capita_per_day)) |>
ggplot() +
aes(region, gdp_per_capita_per_day) +
geom_boxplot()
As we can see, this visualization allows us to infer very little. Before discarding a graph, let’s think about whether we can change its configuration to improve it.
The first thing we can improve is the names of the regions. They are written horizontally, but we can rotate them 45 degrees using the theme() layer.
gapminder |>
mutate(gdp_per_capita_per_day = gdp/population/365) |>
filter(year == 2010 & !is.na(gdp_per_capita_per_day)) |>
ggplot() +
aes(region, gdp_per_capita_per_day) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 45, hjust = 1) )
The names are understood, but if we want to find the top 3 (either by median or average) we would have to look for them one by one. Let’s reorder it, but first let’s be aware of some previous considerations:
The region column is a factor, not a character vector. Even though the two look the same when printed, factors are used to categorize data, for example customers classified as bronze, silver or platinum.
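To make the difference between a character vector and a factor concrete, here is a minimal base R sketch using the customer categories from the example. The vector customers is hypothetical.

```r
# Hypothetical customer categories (illustrative only)
customers <- c("silver", "bronze", "platinum", "bronze")
customer_type <- factor(customers)

class(customers)      # "character"
class(customer_type)  # "factor"

# The levels are the distinct categories, sorted alphabetically by default
levels(customer_type)      # "bronze" "platinum" "silver"

# Internally each value is stored as the position of its level
as.integer(customer_type)  # 3 1 2 1
```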
Factors are useful because internally they are replaced by numbers and numbers, at a computational level, are faster to sort. The default sorting is alphabetical, as we can appreciate if we use the levels function.
levels(gapminder$region)
#> [1] "Australia and New Zealand" "Caribbean"
#> [3] "Central America" "Central Asia"
#> [5] "Eastern Africa" "Eastern Asia"
#> [7] "Eastern Europe" "Melanesia"
#> [9] "Micronesia" "Middle Africa"
#> [11] "Northern Africa" "Northern America"
#> [13] "Northern Europe" "Polynesia"
#> [15] "South America" "South-Eastern Asia"
#> [17] "Southern Africa" "Southern Asia"
#> [19] "Southern Europe" "Western Africa"
#> [21] "Western Asia" "Western Europe"
We will use the reorder() function to change the order of the factor levels, and since we are modifying a column of the data frame we have to call it inside mutate(). The reorder() function takes as its first argument the factor to reorder, then the vector of values to take into account, and finally a summary function that is applied to those values within each level. For example, let’s order the regions by their median (in a box plot, the median is the thick line inside each box):
gapminder |>
mutate(gdp_per_capita_per_day = gdp/population/365) |>
filter(year == 2010 & !is.na(gdp_per_capita_per_day)) |>
mutate(region = reorder(region, gdp_per_capita_per_day, FUN = median)) |>
ggplot() +
aes(region, gdp_per_capita_per_day) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Note that this mutate() has been placed after filtering the data. This guarantees that the NA values have already been removed. Otherwise, the median of any region that contains an NA would itself be NA, and that region could not be ordered by its values.
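The effect of reorder(), and the reason for removing NA first, can be checked on a toy example. This is a minimal base R sketch in which the region names come from gapminder but the numbers are made up for illustration.

```r
# Toy data: two values per region (made-up numbers, not from gapminder)
region <- factor(c("Caribbean", "Caribbean",
                   "Middle Africa", "Middle Africa",
                   "Western Europe", "Western Europe"))
value <- c(10, 26, 3, 5, 90, 110)

levels(region)  # alphabetical: "Caribbean" "Middle Africa" "Western Europe"

# Reorder the levels by the median of value within each region
by_median <- reorder(region, value, FUN = median)
levels(by_median)  # "Middle Africa" "Caribbean" "Western Europe"

# A single NA makes the median NA, so NA rows are filtered out first
median(c(3, NA))                # NA
median(c(3, NA), na.rm = TRUE)  # 3
```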
We see some regions of Africa at the far left, and Europe and Northern America at the far right. Remember that we can add color according to some variable. In this case let’s add color based on the continent:
gapminder |>
mutate(gdp_per_capita_per_day = gdp/population/365) |>
filter(year == 2010 & !is.na(gdp_per_capita_per_day)) |>
mutate(region = reorder(region, gdp_per_capita_per_day, FUN = median)) |>
ggplot() +
aes(region, gdp_per_capita_per_day, color = continent) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Although we can already tell the continents apart, in a box plot it is usually the fill of the box that is colored. So, let’s change the color aesthetic to fill. Let’s also remove the title of the x-axis, which is unnecessary here because the region names are self-explanatory.
gapminder |>
mutate(gdp_per_capita_per_day = gdp/population/365) |>
filter(year == 2010 & !is.na(gdp_per_capita_per_day)) |>
mutate(region = reorder(region, gdp_per_capita_per_day, FUN = median)) |>
ggplot() +
aes(region, gdp_per_capita_per_day, fill = continent) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
xlab("")
This graph helps us see the top 5, but since there are several regions concentrated in small values of GDP per capita we visually lose those regions. We need a scale transformation.
If you are thinking of adding a logarithmic scale layer for the y-axis you are on the right track. Let’s try:
gapminder |>
mutate(gdp_per_capita_per_day = gdp/population/365) |>
filter(year == 2010 & !is.na(gdp_per_capita_per_day)) |>
mutate(region = reorder(region, gdp_per_capita_per_day, FUN = median)) |>
ggplot() +
aes(region, gdp_per_capita_per_day, fill = continent) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
xlab("") +
scale_y_continuous(trans = "log2")
Sometimes it is necessary to show not only the boxes but also where each data point is located. For this we can add the geom_point() layer that we used previously.
gapminder |>
mutate(gdp_per_capita_per_day = gdp/population/365) |>
filter(year == 2010 & !is.na(gdp_per_capita_per_day)) |>
mutate(region = reorder(region, gdp_per_capita_per_day, FUN = median)) |>
ggplot() +
aes(region, gdp_per_capita_per_day, fill = continent) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
xlab("") +
scale_y_continuous(trans = "log2") +
geom_point(size = 0.5)
6.7 Comparison of distributions
To be able to solve the first question of the case we would have to compare the distributions of the “Western” countries versus the developing countries.
For this, since we do not have a column that indicates which countries are Western, we are going to create a vector western_countries with the list of regions that fall into this category:
western_countries <- c("Western Europe", "Northern Europe", "Southern Europe", "Northern America", "Australia and New Zealand")
We will also use the ifelse(test, yes, no) function to create a new column that stores one value if the region is in the West and another value if it is not. It is recommended to read the documentation in ?ifelse.
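Before applying it to the whole data frame, the behavior of ifelse() can be seen on a few region names. This is a minimal base R sketch; the short vector region is only an illustrative stand-in for the region column.

```r
western_countries <- c("Western Europe", "Northern Europe", "Southern Europe",
                       "Northern America", "Australia and New Zealand")

# A few region names taken from the levels of gapminder$region
region <- c("Southern Europe", "Northern Africa", "Caribbean", "Northern America")

# test: is the region in the Western list?
# yes: "Western"   no: "Developing"
group <- ifelse(region %in% western_countries, "Western", "Developing")

group  # "Western" "Developing" "Developing" "Western"
```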
Let’s add the column for the group each country belongs to:
western_countries <- c("Western Europe", "Northern Europe", "Southern Europe", "Northern America", "Australia and New Zealand")
gapminder |>
mutate(gdp_per_capita_per_day = gdp/population/365) |>
filter(year == 2010 & !is.na(gdp_per_capita_per_day)) |>
mutate(group = ifelse(region %in% western_countries, "Western", "Developing")) |>
head()
#> country year infant_mortality life_expectancy fertility
#> 1 Albania 2010 14.8 77.2 1.74
#> 2 Algeria 2010 23.5 76.0 2.82
#> 3 Angola 2010 109.6 57.6 6.22
#> 4 Antigua and Barbuda 2010 7.7 75.8 2.13
#> 5 Argentina 2010 13.0 75.8 2.22
#> 6 Armenia 2010 16.1 73.0 1.55
#> population gdp continent region gdp_per_capita_per_day
#> 1 2901883 6137563946 Europe Southern Europe 5.794597
#> 2 36036159 79164339611 Africa Northern Africa 6.018638
#> 3 21219954 26125663270 Africa Middle Africa 3.373106
#> 4 87233 836686777 Americas Caribbean 26.277814
#> 5 41222875 434405530244 Americas South America 28.871158
#> 6 2963496 4102285513 Asia Western Asia 3.792527
#> group
#> 1 Western
#> 2 Developing
#> 3 Developing
#> 4 Developing
#> 5 Developing
#> 6 Developing
Now that we can differentiate the countries, we can look at their distributions to answer our question. We start by creating a histogram with a logarithmic scale on the x-axis and separate it with facet_grid according to the group each country belongs to:
western_countries <- c("Western Europe", "Northern Europe", "Southern Europe", "Northern America", "Australia and New Zealand")
gapminder |>
mutate(gdp_per_capita_per_day = gdp/population/365) |>
filter(year == 2010 & !is.na(gdp_per_capita_per_day)) |>
mutate(group = ifelse(region %in% western_countries, "Western", "Developing")) |>
ggplot() +
aes(gdp_per_capita_per_day) +
geom_histogram(binwidth = 1) +
scale_x_continuous(trans = "log2") +
facet_grid(. ~ group)
We see that the daily GDP per capita of Western countries has a distribution with higher values than that of developing countries. However, the picture of a single year is not everything. Let’s see whether the separation was the same 40 years before the year in the example (2010). We are also going to add the color argument to the geom_histogram() layer to draw a visible border around the bars, which by default are grey.
western_countries <- c("Western Europe", "Northern Europe", "Southern Europe", "Northern America", "Australia and New Zealand")
gapminder |>
mutate(gdp_per_capita_per_day = gdp/population/365) |>
filter(year %in% c(1970, 2010) & !is.na(gdp_per_capita_per_day)) |>
mutate(group = ifelse(region %in% western_countries, "Western", "Developing")) |>
ggplot() +
aes(gdp_per_capita_per_day) +
geom_histogram(binwidth = 1, color = "black") +
scale_x_continuous(trans = "log2") +
facet_grid(year ~ group)
Both groups, “Western” and “Developing”, have improved over that 40-year span, but developing countries have advanced more than Western countries.
So far we have assumed something: that all countries that reported in 2010 also reported data in 1970. To make the comparison finer we have to look for the distribution of countries that have data reported both in 1970 and in 2010.
To do this, we are going to create a vector that lists the countries with data in 1970 and another of those that have data in 2010 and then look for the intersection. Remember that to extract a column we use the pull() function.
western_countries <- c("Western Europe", "Northern Europe", "Southern Europe", "Northern America", "Australia and New Zealand")
list_1 <- gapminder |>
mutate(gdp_per_capita_per_day = gdp/population/365) |>
filter(year %in% c(1970) & !is.na(gdp_per_capita_per_day)) |>
pull(country)
list_2 <- gapminder |>
mutate(gdp_per_capita_per_day = gdp/population/365) |>
filter(year %in% c(2010) & !is.na(gdp_per_capita_per_day)) |>
pull(country)
To find the intersection of these two vectors we will use the function intersect(vector_1, vector_2), which gives us the vector we are looking for. We store the result in intersection_vector:
intersection_vector <- intersect(list_1, list_2)
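As a small illustration of what intersect() returns, here is a minimal base R sketch with two hypothetical vectors of country names. They do not reflect which countries actually reported data in each year.

```r
# Hypothetical lists of countries with data in each year (illustrative only)
countries_1970 <- c("Peru", "Chile", "Bolivia")
countries_2010 <- c("Chile", "Peru", "Ecuador")

# Countries present in both vectors
in_both <- intersect(countries_1970, countries_2010)

in_both  # "Peru" "Chile"
```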
So, we recreate our histogram including only the countries in intersection_vector.
gapminder |>
mutate(gdp_per_capita_per_day = gdp/population/365) |>
filter(year %in% c(1970, 2010) & !is.na(gdp_per_capita_per_day)) |>
filter(country %in% intersection_vector) |>
mutate(group = ifelse(region %in% western_countries, "Western", "Developing")) |>
ggplot() +
aes(gdp_per_capita_per_day) +
geom_histogram(binwidth = 1, color = "black") +
scale_x_continuous(trans = "log2") +
facet_grid(year ~ group)
We now see more clearly, with comparable data, that more countries in the developing group increased their GDP per capita, and by a larger margin than Western countries. But this first inference is still visual; we need to compare how the median, range, etc. changed. For this we will use a box plot very similar to the previous one, but this time we will edit geom_boxplot() so that a single graph shows how each region changed from 1970 to 2010.
gapminder |>
mutate(gdp_per_capita_per_day = gdp/population/365) |>
filter(year %in% c(1970, 2010) & !is.na(gdp_per_capita_per_day)) |>
filter(country %in% intersection_vector) |>
mutate(region = reorder(region, gdp_per_capita_per_day, FUN = median)) |>
ggplot() +
aes(region, gdp_per_capita_per_day) +
geom_boxplot(aes(region, gdp_per_capita_per_day, fill=factor(year))) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
xlab("") +
scale_y_continuous(trans = "log2")
We see that some regions within Asia have grown substantially. It is widely known that some Asian countries are now economic powers, and these graphs let us see how much each region has changed along the way.
Therefore, we can now answer both questions of the case:
- It is not reasonable to continue using the categorization of “Western” and “developing” since there are more and more regions that are poorly represented by those categories, such as East Asia.
- It is not true that rich countries get richer while poor countries get poorer. We have seen that developing countries have grown even more than Western countries.
6.8 Exercises
For this series of exercises we will use the stars data frame from the dslabs library. This dataset contains attributes of stars including their temperature, spectral type, and magnitude. The magnitude column represents absolute magnitude—a measure of intrinsic brightness where more negative values indicate greater luminosity.
- The temperature data is currently in Kelvin. Create a new column temp_celsius using the formula \(C = K - 273.15\), then visualize the relationship between temperature and magnitude. Color the points by star type and use a base-10 logarithmic scale for the x-axis to better display the wide range of temperatures.
- Since lower magnitude values correspond to higher brightness, reverse the y-axis scale using scale_y_reverse() to make the plot more intuitive (brighter stars at the top).
- The Sun is a G-type star. To determine whether G-type stars are the most luminous, create a box plot comparing the magnitude distributions across the different star types.
No, G-type stars are not the most luminous, as a box plot of magnitude by star type shows.
6.9 Key Takeaways
In this chapter, we followed a complete data exploration process, starting with understanding the data structure before diving into visualization. We utilized faceting to reveal trends across categories and time, and used time series plots to track specific country trajectories. We also explored distributions using histograms and box plots, applying scale transformations to handle skewed economic data. Finally, we practiced iterative refinement, adjusting our plots step-by-step to tell a clearer story.
[8] Full text of Huntington’s article (in English)