Chapter 6 Gapminder

The Gapminder Foundation ⁷ is a Swedish non-profit organization that promotes global development by using statistics to dispel common myths and sensationalist stories about global health and economics. A substantial selection of its data is available in the dslabs library as the gapminder data frame. In this case study we will try to answer two questions:

Is it still reasonable to divide the world between Western countries* and developing countries?

Is it true that every day we are worse off and rich countries get richer while poor countries get poorer?

(*): In 1993, Samuel Huntington published an article titled “The Clash of Civilizations?” ⁸, in which he defined Western countries as those located in North America, Northern/Southern/Western Europe, and Australia and New Zealand.

To address these questions, we will follow a structured data science workflow. We’ll start by exploring the data to understand its structure and content, then move to in-depth analysis to identify relevant variables. Finally, we will use visualization and summarization techniques to synthesize our findings and provide clear answers.

First let’s explore the structure of the data frame with str():

gapminder |> 
  str()
#> 'data.frame':    10545 obs. of  9 variables:
#>  $ country         : Factor w/ 185 levels "Albania","Algeria",..: 1 2 3 4 5 6 7 8 9 10 ...
#>  $ year            : int  1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
#>  $ infant_mortality: num  115.4 148.2 208 NA 59.9 ...
#>  $ life_expectancy : num  62.9 47.5 36 63 65.4 ...
#>  $ fertility       : num  6.19 7.65 7.32 4.43 3.11 4.55 4.82 3.45 2.7 5.57 ...
#>  $ population      : num  1636054 11124892 5270844 54681 20619075 ...
#>  $ gdp             : num  NA 1.38e+10 NA NA 1.08e+11 ...
#>  $ continent       : Factor w/ 5 levels "Africa","Americas",..: 4 1 1 2 2 3 2 5 4 3 ...
#>  $ region          : Factor w/ 22 levels "Australia and New Zealand",..: 19 11 10 2 15 21 2 1 22 21 ...

We have a data frame with 10,545 observations (rows) and 9 variables (columns).
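The str() output above already shows some NA entries, for example in infant_mortality and gdp. One quick way to count missing values per column is to combine is.na() with colSums(). The sketch below is only an illustration: toy_data is a hypothetical name, and it holds just the first five values of two columns as printed by str() above.

```r
# Toy data frame: first five rows of three gapminder columns,
# copied from the str() output above (toy_data is a hypothetical name)
toy_data <- data.frame(
  country = c("Albania", "Algeria", "Angola", "Antigua and Barbuda", "Argentina"),
  infant_mortality = c(115.4, 148.2, 208, NA, 59.9),
  gdp = c(NA, 1.38e10, NA, NA, 1.08e11)
)

# is.na() returns a logical matrix; colSums() counts the TRUE values per column
na_counts <- colSums(is.na(toy_data))
print(na_counts)
```

With the full data loaded, the same idea would be colSums(is.na(gapminder)).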

Now let’s take a look at the data with head():

gapminder |> 
  head()
#>               country year infant_mortality life_expectancy fertility
#> 1             Albania 1960           115.40           62.87      6.19
#> 2             Algeria 1960           148.20           47.50      7.65
#> 3              Angola 1960           208.00           35.98      7.32
#> 4 Antigua and Barbuda 1960               NA           62.97      4.43
#> 5           Argentina 1960            59.87           65.39      3.11
#> 6             Armenia 1960               NA           66.86      4.55
#>   population          gdp continent          region
#> 1    1636054           NA    Europe Southern Europe
#> 2   11124892  13828152297    Africa Northern Africa
#> 3    5270844           NA    Africa   Middle Africa
#> 4      54681           NA  Americas       Caribbean
#> 5   20619075 108322326649  Americas   South America
#> 6    1867396           NA      Asia    Western Asia

Remember that for data frames that come from a library we can usually open the documentation to understand each variable more quickly:

?gapminder

Jumping straight to the questions would keep us from exploring what else is in the data. So we will start with other variables, such as infant mortality, fertility or population.

We can filter the rows that correspond to Peru and select the columns country, year, infant_mortality and population:

gapminder |> 
  filter(country == "Peru") |> 
  select(country, year, infant_mortality, population)
#>    country year infant_mortality population
#> 1     Peru 1960            135.9   10061519
#> 2     Peru 1961            132.6   10350239
#> 3     Peru 1962            129.1   10650672
#> 4     Peru 1963            125.4   10961539
#> 5     Peru 1964            121.8   11281015
#> 6     Peru 1965            118.2   11607684
#> 7     Peru 1966            114.8   11941327
#> 8     Peru 1967            111.6   12282081
#> 9     Peru 1968            108.7   12629333
#> 10    Peru 1969            106.0   12982444
#> 11    Peru 1970            103.4   13341071
#> 12    Peru 1971            100.9   13704333
#> 13    Peru 1972             98.3   14072476
#> 14    Peru 1973             95.8   14447649
#> 15    Peru 1974             93.3   14832839
#> 16    Peru 1975             91.0   15229951
#> 17    Peru 1976             88.9   15639898
#> 18    Peru 1977             87.0   16061327
#> 19    Peru 1978             85.5   16491087
#> 20    Peru 1979             83.9   16924758
#> 21    Peru 1980             82.4   17359118
#> 22    Peru 1981             80.7   17792551
#> 23    Peru 1982             78.7   18225727
#> 24    Peru 1983             76.3   18660443
#> 25    Peru 1984             73.7   19099575
#> 26    Peru 1985             70.7   19544950
#> 27    Peru 1986             67.6   19996250
#> 28    Peru 1987             64.6   20451712
#> 29    Peru 1988             61.7   20909897
#> 30    Peru 1989             58.9   21368856
#> 31    Peru 1990             56.3   21826658
#> 32    Peru 1991             53.7   22283130
#> 33    Peru 1992             51.0   22737056
#> 34    Peru 1993             48.2   23184222
#> 35    Peru 1994             45.4   23619358
#> 36    Peru 1995             42.5   24038761
#> 37    Peru 1996             39.7   24441076
#> 38    Peru 1997             36.9   24827409
#> 39    Peru 1998             34.3   25199744
#> 40    Peru 1999             31.8   25561297
#> 41    Peru 2000             29.6   25914875
#> 42    Peru 2001             27.6   26261363
#> 43    Peru 2002             25.7   26601463
#> 44    Peru 2003             24.1   26937737
#> 45    Peru 2004             22.6   27273188
#> 46    Peru 2005             21.3   27610406
#> 47    Peru 2006             20.1   27949958
#> 48    Peru 2007             19.0   28292768
#> 49    Peru 2008             18.0   28642048
#> 50    Peru 2009             17.1   29001563
#> 51    Peru 2010             16.3   29373644
#> 52    Peru 2011             15.6   29759891
#> 53    Peru 2012             14.9   30158768
#> 54    Peru 2013             14.2   30565461
#> 55    Peru 2014             13.6   30973148
#> 56    Peru 2015             13.1   31376670
#> 57    Peru 2016               NA         NA

Let’s add a filter to obtain only the data from 2015:

gapminder |> 
  filter(country == "Peru" & year == 2015) |> 
  select(country, year, infant_mortality, population)
#>   country year infant_mortality population
#> 1    Peru 2015             13.1   31376670

We can compare Peru and Chile by creating a vector of countries and replacing the == operator with the %in% operator, which checks whether each value is contained in that vector.

vector_countries <- c("Peru", "Chile")

gapminder |> 
  filter(country %in% vector_countries & year == 2015) |> 
  select(country, year, infant_mortality, population)
#>   country year infant_mortality population
#> 1   Chile 2015              7.0   17948141
#> 2    Peru 2015             13.1   31376670

Infant mortality is measured as the number of infant deaths per 1,000 infants, so it already accounts for population size. In 2015 Peru had a higher infant mortality rate than Chile.
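To see why %in% is the right operator when filtering by several countries, it may help to compare it with == on a small vector. This is a minimal sketch in base R; the vector countries below is made up for illustration.

```r
# Made-up toy vector, only for illustration
countries <- c("Chile", "Peru", "Bolivia", "Chile", "Peru")
wanted <- c("Peru", "Chile")

# %in% asks, element by element: "is this value one of the wanted ones?"
in_result <- countries %in% wanted
print(in_result)

# == recycles `wanted` along `countries` and compares position by position,
# so matches depend on the order of the elements (R also warns about lengths)
eq_result <- suppressWarnings(countries == wanted)
print(eq_result)
```

Here %in% keeps every Chile and Peru entry, while == misses the first two because it compares "Chile" with "Peru" and "Peru" with "Chile".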

6.1 Initial gapminder plots

However, if we want to analyze global data, comparing countries one by one would be impractical. Let’s use ggplot to see if there is a relationship in our data.

Let’s create a scatter plot for the year 1990 with fertility (fertility), the average number of children per woman, on the x-axis and life expectancy (life_expectancy) on the y-axis.

gapminder |> 
  filter(year == 1990) |> 
  ggplot() +
  aes(x = fertility, y = life_expectancy) +
  geom_point()

From this graph we can see that countries with a fertility of around 7.5 children per woman tend to have a lower life expectancy. In countries with high life expectancy, on the other hand, the average number of children per woman is typically below 2.

As we have done previously, we can color the points according to some other variable. In this case, knowing which continent they belong to could give us a better idea of the data.

gapminder |> 
  filter(year == 1990) |> 
  ggplot() +
  aes(x = fertility, y = life_expectancy, color = continent) +
  geom_point()

In this graph, groups begin to emerge. Several European countries are in the upper left quadrant, while several African countries are in the lower right quadrant.

6.2 Facets

Although the previous graph already suggests a correlation between the variables, it does not show how the relationship has changed from one year to another. For this we will use a facet layer (facet_).

In the layer facet_grid(row_variable ~ column_variable) we replace row_variable and column_variable with the names of the variables that should define the rows and columns of panels, or with a . if we do not want to split in that direction. For example, let’s compare how the distribution from the previous plot changed between 1960 and 2013.

vector_years <- c(1960, 2013)

gapminder |> 
  filter(year %in% vector_years) |> 
  ggplot() +
  aes(x = fertility, y = life_expectancy, color = continent) +
  geom_point() +
  facet_grid(year ~ .)
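The position of the variable in the formula decides the layout. As a rough sketch with a made-up data frame (toy and its values are invented and are not gapminder data), the two orientations look like this:

```r
library(ggplot2)

# Made-up toy data, only to show the layout (not real gapminder values)
toy <- data.frame(
  year = c(1960, 1960, 2013, 2013),
  fertility = c(6.5, 3.0, 2.5, 1.8),
  life_expectancy = c(48, 68, 74, 81)
)

# year ~ .  : one ROW of panels per year (same layout as the code above)
p_rows <- ggplot(toy) +
  aes(x = fertility, y = life_expectancy) +
  geom_point() +
  facet_grid(year ~ .)

# . ~ year  : one COLUMN of panels per year
p_cols <- ggplot(toy) +
  aes(x = fertility, y = life_expectancy) +
  geom_point() +
  facet_grid(. ~ year)
```

Printing p_rows or p_cols draws the corresponding plot. Stacked rows share the x-axis, which makes horizontal shifts (here, fertility) easier to compare.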

We can make it even clearer which continent changed the most if we add the continent variable as a column.

vector_years <- c(1960, 2013)

gapminder |> 
  filter(year %in% vector_years) |> 
  ggplot() +
  aes(x = fertility, y = life_expectancy, color = continent) +
  geom_point() +
  facet_grid(year ~ continent)

Having one column per continent makes the graph harder to read because each panel becomes narrower. It is usually better to have few columns, so we swap year and continent.

vector_years <- c(1960, 2013)

gapminder |> 
  filter(year %in% vector_years) |> 
  ggplot() +
  aes(x = fertility, y = life_expectancy, color = continent) +
  geom_point() +
  facet_grid(continent ~ year)

Here the change by continent is much more evident: most countries have reduced their fertility while increasing their life expectancy. We are living longer than in the 1960s and having fewer children. These trends have occurred across all continents.

We don’t always have to show all variables, in this case continents. We can continue applying filters so that it shows us a subset of continents that we want to compare. For example:

vector_years <- c(1960, 2013)
vector_continents <- c("Europe", "Asia")

gapminder |> 
  filter(year %in% vector_years & continent %in% vector_continents) |> 
  ggplot() +
  aes(x = fertility, y = life_expectancy, color = continent) +
  geom_point() +
  facet_grid(continent ~ year)

In this case it would be visually better if the continents were not in separate rows but could still be distinguished in the graph. To do this, we will use facet_wrap(~ x), where x is the variable that defines the panels. In our case that variable is the year: the continents now share each panel, distinguished by color, and the years appear side by side.

vector_years <- c(1960, 2013)
vector_continents <- c("Europe", "Asia")

gapminder |> 
  filter(year %in% vector_years & continent %in% vector_continents) |> 
  ggplot() +
  aes(x = fertility, y = life_expectancy, color = continent) +
  geom_point() +
  facet_wrap( ~ year)

We can show more panels by adding values to the vectors. For example, let’s add 1985, a point roughly halfway between 1960 and 2013.

vector_years <- c(1960, 1985, 2013)
vector_continents <- c("Europe", "Asia")

gapminder |> 
  filter(year %in% vector_years & continent %in% vector_continents) |> 
  ggplot() +
  aes(x = fertility, y = life_expectancy, color = continent) +
  geom_point() +
  facet_wrap( ~ year)

6.3 Time series

Time series are sequences of data measured at successive moments and ordered chronologically. R makes it easy to plot time series; we only need our data frame to include a time variable.

6.3.1 Individual time series

In an individual time series we analyze how a single variable has evolved, for example the fertility rate in Peru. We can plot it with points or with lines.

As we saw earlier, we use geom_point() for points:

gapminder |> 
  filter(country == "Peru") |> 
  ggplot() +
  aes(x = year, y = fertility) +
  geom_point()
#> Warning: Removed 1 row containing missing values or values
#> outside the scale range (`geom_point()`).

We get a warning indicating that one row could not be drawn because its value is NA (not available). This does not prevent the graph from being shown.

If we want a line graph, which is the most used in time series, we use geom_line():

gapminder |> 
  filter(country == "Peru") |> 
  ggplot() +
  aes(x = year, y = fertility) +
  geom_line()
#> Warning: Removed 1 row containing missing values or values
#> outside the scale range (`geom_line()`).

6.3.2 Multiple time series

With multiple time series we compare how a variable evolved over time for several groups. For example, this is the time series that compares Peru, Bolivia and Chile:

countries <- c("Peru", "Bolivia", "Chile")

gapminder |> 
  filter(country %in% countries) |> 
  ggplot() +
  aes(x = year, y = fertility, color = country) +
  geom_line()
#> Warning: Removed 3 rows containing missing values or values
#> outside the scale range (`geom_line()`).

We can also remove the legend and show the name of the countries as labels on the same graph. To do this we will first have to create a data frame using the function data.frame() that indicates the coordinates where we want each label to appear:

countries <- c("Peru", "Bolivia", "Chile")

labels <- data.frame(country_names = countries, x = c(1975, 1965, 1962), y = c(6, 7, 4))
  
labels
#>   country_names    x y
#> 1          Peru 1975 6
#> 2       Bolivia 1965 7
#> 3         Chile 1962 4

With this data frame we indicate, for example, that we want Bolivia to be written at the point where the year is 1965 and the fertility rate is 7.

To use these labels in ggplot we add a geom_text layer. We use its data argument to indicate that the labels come from a separate data frame, and we include an aes() call inside geom_text to map the columns of that data frame to the graph. Keep in mind that the column name must be the same in both data frames, so below we name the column country instead of country_names; we also adjust the coordinates so that each label sits next to its line:

countries <- c("Peru", "Bolivia", "Chile")

labels <- data.frame(country = countries, x = c(1976, 1972, 1965), y = c(5.2, 6.8, 5.5))

gapminder |> 
  filter(country %in% countries) |>
  ggplot() +
  aes(year, fertility, col = country) +
  geom_line() +
  geom_text(data = labels, aes(x, y, label = country)) +
  theme(legend.position = "none")
#> Warning: Removed 3 rows containing missing values or values
#> outside the scale range (`geom_line()`).

6.4 Exercises

For these exercises we will continue using the gapminder data frame.

Generate a scatter plot comparing fertility rates and life expectancy for the Americas in the year 2000. Use color to differentiate between the regions within the continent.

Solution

gapminder |>
  filter( continent == "Americas" & year == 2000) |>
  ggplot() +
  aes(fertility, life_expectancy, color = region) +
  geom_point()

Tip: to create a vector with a sequence of consecutive integers we can use X:Y. This creates a vector that goes from X to Y in steps of 1.
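As a small illustration of this tip, the snippet below builds a year range with the colon operator. It also shows seq(), which is not used in this chapter but is the usual base R choice when a step other than 1 is needed; the variable name decades is made up for the example.

```r
# Every integer from 1955 to 1990, both ends included
year_sequence <- 1955:1990
print(length(year_sequence))   # 36 years

# seq() accepts a custom step; here one value per decade
decades <- seq(1960, 2010, by = 10)
print(decades)
```

Either vector can be used on the right-hand side of %in% inside filter().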

During the Vietnam War, both the US and Vietnam suffered significant losses. Create a line chart visualizing how life expectancy changed in both countries from 1955 to 1990 to observe the war’s impact. Note that the gapminder data begin in 1960, so the lines will start in that year.

Solution

countries <- c("Vietnam", "United States")
year_sequence <- 1955:1990

gapminder |>
  filter(country %in% countries & year %in% year_sequence) |> 
  ggplot() +
  aes(year, life_expectancy, color = country) +
  geom_line()

Expand the previous chart to include Cambodia, allowing us to visualize the devastating impact of the Khmer Rouge regime (1975-1979) on life expectancy alongside the Vietnam War data.

Solution

countries <- c("Vietnam", "United States", "Cambodia")
year_sequence <- 1955:1990

gapminder |>
  filter(country %in% countries & year %in% year_sequence) |> 
  ggplot() +
  aes(year, life_expectancy, color = country) +
  geom_line()

6.5 Histograms with ggplot

We could continue exploring the data until we understand it much better. Eventually we would get to the GDP data (gdp) and realize that comparing GDP alone makes little sense, since some countries have far larger populations than others. Data transformation is nothing new, but we will see that it comes up again and again in our analyses.

We are going to use a transformation that gives us the GDP per capita per day for each country in each year:

gapminder <- gapminder |> 
  mutate(gdp_per_capita_per_day = gdp/population/365)
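To make the arithmetic behind this transformation concrete, here is the calculation for a single row. The two input values are those of Argentina in 2010 as they appear in the output of section 6.7 of this chapter; the intermediate name gdp_per_capita is introduced only for this illustration.

```r
# Argentina, 2010 (values copied from the head() output in section 6.7)
gdp <- 434405530244      # total GDP of the country for the year
population <- 41222875   # number of inhabitants

# Step 1: share of GDP per person for the whole year
gdp_per_capita <- gdp / population

# Step 2: spread the yearly figure over 365 days
gdp_per_capita_per_day <- gdp_per_capita / 365
print(round(gdp_per_capita_per_day, 2))   # 28.87
```

Inside mutate() the same two divisions are applied to every row at once, because arithmetic on columns is vectorized.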

We can start by visualizing this variable with a histogram. In ggplot a histogram is just another of the available geoms, geom_histogram(binwidth = x), where x is the width of each bin. For example, let’s plot the distribution of our new variable in the year 2010:

gapminder |> 
  mutate(gdp_per_capita_per_day = gdp/population/365) |> 
  filter(year == 2010) |> 
  ggplot() +
  aes(gdp_per_capita_per_day) +
  geom_histogram(binwidth = 5)
#> Warning: Removed 9 rows containing non-finite outside the
#> scale range (`stat_bin()`).

We can filter out the NA values so that the warning no longer appears, using the is.na() function we saw previously. Since we want the rows that are not NA, we negate the function by placing the ! symbol in front of it.

gapminder |> 
  mutate(gdp_per_capita_per_day = gdp/population/365) |> 
  filter(year == 2010 & !is.na(gdp_per_capita_per_day)) |> 
  ggplot() +
  aes(gdp_per_capita_per_day) +
  geom_histogram(binwidth = 5)

At this point it is easy to see that the data are concentrated in countries with low GDP per capita, so we may be tempted to transform the scale of the x-axis. Let’s try a base-2 logarithm:

gapminder |> 
  mutate(gdp_per_capita_per_day = gdp/population/365) |> 
  filter(year == 2010 & !is.na(gdp_per_capita_per_day)) |> 
  ggplot() +
  aes(gdp_per_capita_per_day) +
  geom_histogram(binwidth = 0.5) + #Change the width to 0.5 due to logarithmic scale
  scale_x_continuous(trans = "log2")

Let’s be careful when interpreting this graph. The distribution looks roughly symmetric only because the x-axis is logarithmic; on the original scale, as we saw, the values are concentrated at the low end. Always keep the scale in mind when reading a transformed plot.
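One way to keep the log2 scale in mind is to remember that equal steps on the axis correspond to doublings, not to equal differences. The values below are made up (think of them as dollars per day) and serve only to illustrate the transformation.

```r
# Made-up values: each one is double the previous one
values <- c(1, 2, 4, 8, 16, 32, 64)

# On a log2 scale these doublings become equally spaced steps of 1
log_values <- log2(values)
print(log_values)   # 0 1 2 3 4 5 6

# A bin width of 0.5 on the log2 scale therefore spans a factor of 2^0.5
bin_factor <- 2^0.5
print(round(bin_factor, 2))   # 1.41
```

So the distance between 1 and 2 on the transformed axis is the same as the distance between 32 and 64, even though the second gap is 32 times larger in absolute terms.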

Tip: For smooth distribution curves, you can also use geom_density() instead of geom_histogram(). Density plots are particularly useful when comparing multiple groups on the same plot.

6.6 Box plots with ggplot

Box plots are likewise just another geom; to draw them we use the geom_boxplot() layer.

For example, let’s create a box plot to analyze GDP per capita per day by continent:

gapminder |> 
  mutate(gdp_per_capita_per_day = gdp/population/365) |> 
  filter(year == 2010 & !is.na(gdp_per_capita_per_day)) |> 
  ggplot() +
  aes(continent, gdp_per_capita_per_day) +
  geom_boxplot()

Now let’s zoom in. Each continent is divided into regions; the Americas, for example, include South America, Central America, Northern America and the Caribbean. Let’s change the continent variable to region.

gapminder |> 
  mutate(gdp_per_capita_per_day = gdp/population/365) |> 
  filter(year == 2010 & !is.na(gdp_per_capita_per_day)) |> 
  ggplot() +
  aes(region, gdp_per_capita_per_day) +
  geom_boxplot()

As we can see, this visualization allows us to infer very little. Before discarding a graph, let’s consider whether we can change its configuration to improve it.

The first thing we can improve is the region names. They are printed horizontally, but we can rotate them 45 degrees using the theme() layer.

gapminder |> 
  mutate(gdp_per_capita_per_day = gdp/population/365) |> 
  filter(year == 2010 & !is.na(gdp_per_capita_per_day)) |> 
  ggplot() +
  aes(region, gdp_per_capita_per_day) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1) )

The names are now readable, but if we want to find the top 3 regions (by median or by mean) we would have to look for them one by one. Let’s reorder the boxes, but first let’s keep some considerations in mind:

The region column is a factor, not a character vector. Even though they look the same when printed, factors are meant to represent categorical data, for example bronze, silver and platinum customers.

Factors are useful because internally they are stored as integer codes, which are cheaper to compare and sort than text. The default order of the levels is alphabetical, as we can see with the levels() function:

levels(gapminder$region)
#>  [1] "Australia and New Zealand" "Caribbean"                
#>  [3] "Central America"           "Central Asia"             
#>  [5] "Eastern Africa"            "Eastern Asia"             
#>  [7] "Eastern Europe"            "Melanesia"                
#>  [9] "Micronesia"                "Middle Africa"            
#> [11] "Northern Africa"           "Northern America"         
#> [13] "Northern Europe"           "Polynesia"                
#> [15] "South America"             "South-Eastern Asia"       
#> [17] "Southern Africa"           "Southern Asia"            
#> [19] "Southern Europe"           "Western Africa"           
#> [21] "Western Asia"              "Western Europe"
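The relationship between a factor, its levels and its internal integer codes can be checked on a small example. The customer tiers below are made up, following the bronze, silver and platinum example mentioned above; tier and tier_ranked are hypothetical names.

```r
# Made-up customer tiers
tier <- factor(c("silver", "bronze", "platinum", "bronze"))

# Levels are sorted alphabetically by default
print(levels(tier))        # "bronze" "platinum" "silver"

# Internally each value is stored as the position of its level
print(as.integer(tier))    # 3 1 2 1

# Supplying levels explicitly changes the order (and therefore the codes)
tier_ranked <- factor(tier, levels = c("bronze", "silver", "platinum"))
print(levels(tier_ranked))
print(as.integer(tier_ranked))   # 2 1 3 1
```

The order of the levels is what ggplot uses to arrange categories along an axis, which is why changing the level order changes the plot.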

We will use the reorder() function to change the order of the factor levels, and since we are modifying a column of the data frame we call it inside mutate(). The reorder() function takes as its first argument the factor to reorder, then the vector whose values determine the order, and finally a summary function that is applied to each group. For example, to order by the median of each region (remember that the median is the thick line inside each box):

gapminder |> 
  mutate(gdp_per_capita_per_day = gdp/population/365) |> 
  filter(year == 2010 & !is.na(gdp_per_capita_per_day)) |> 
  mutate(region = reorder(region, gdp_per_capita_per_day, FUN = median)) |> 
  ggplot() +
  aes(region, gdp_per_capita_per_day) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Note that the mutate() that reorders the regions is placed after filtering the data. This guarantees that the NA values have already been removed. Otherwise the median of any region that contains an NA would itself be NA, and that region could not be ranked correctly.
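The behavior described in this note can be reproduced on a tiny example. The factor and values below are made up; the last lines show a possible alternative, passing na.rm = TRUE through reorder() to median(), which is not the approach used in this chapter.

```r
# Made-up groups and values
region <- factor(c("B", "A", "C", "A", "B", "C"))
value  <- c(10, 1, 5, 3, 20, 7)

# Medians: A = 2, C = 6, B = 15, so the levels are reordered as A, C, B
reordered <- reorder(region, value, FUN = median)
print(levels(reordered))   # "A" "C" "B"

# A single NA makes the median of its group NA
print(median(c(1, NA, 3)))                 # NA
print(median(c(1, NA, 3), na.rm = TRUE))   # 2

# Alternative sketch: extra arguments of reorder() are passed on to FUN
value_na <- c(10, NA, 5, 3, 20, 7)
reordered_na <- reorder(region, value_na, FUN = median, na.rm = TRUE)
print(levels(reordered_na))   # "A" "C" "B"
```

Filtering the NA values first, as done above, has the advantage that the plot and the ordering are computed from exactly the same rows.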

At the far left we see some regions in Africa, and at the far right regions in Europe and Northern America. Remember that we can add color according to some variable. In this case let’s add color based on the continent:

gapminder |> 
  mutate(gdp_per_capita_per_day = gdp/population/365) |> 
  filter(year == 2010 & !is.na(gdp_per_capita_per_day)) |> 
  mutate(region = reorder(region, gdp_per_capita_per_day, FUN = median)) |> 
  ggplot() +
  aes(region, gdp_per_capita_per_day, color = continent) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Although we can already tell the continents apart, in a box plot it is more common to color the inside of the box. So let’s change the color attribute to the fill attribute. Let’s also remove the title of the x-axis with xlab(""), since the region names are self-explanatory.

gapminder |> 
  mutate(gdp_per_capita_per_day = gdp/population/365) |> 
  filter(year == 2010 & !is.na(gdp_per_capita_per_day)) |> 
  mutate(region = reorder(region, gdp_per_capita_per_day, FUN = median)) |> 
  ggplot() +
  aes(region, gdp_per_capita_per_day, fill = continent) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  xlab("")

This graph helps us see the top 5, but since there are several regions concentrated in small values of GDP per capita we visually lose those regions. We need a scale transformation.

If you are thinking of adding a logarithmic scale layer for the y-axis you are on the right track. Let’s try:

gapminder |> 
  mutate(gdp_per_capita_per_day = gdp/population/365) |> 
  filter(year == 2010 & !is.na(gdp_per_capita_per_day)) |> 
  mutate(region = reorder(region, gdp_per_capita_per_day, FUN = median)) |> 
  ggplot() +
  aes(region, gdp_per_capita_per_day, fill = continent) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  xlab("") +
  scale_y_continuous(trans = "log2")

Sometimes we need to show not only the boxes but also where each individual data point lies. For this we can add the geom_point() layer that we used previously.

gapminder |> 
  mutate(gdp_per_capita_per_day = gdp/population/365) |> 
  filter(year == 2010 & !is.na(gdp_per_capita_per_day)) |> 
  mutate(region = reorder(region, gdp_per_capita_per_day, FUN = median)) |> 
  ggplot() +
  aes(region, gdp_per_capita_per_day, fill = continent) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  xlab("") +
  scale_y_continuous(trans = "log2") +
  geom_point(size = 0.5)

6.7 Comparison of distributions

To answer the first question of the case we need to compare the distribution of the “Western” countries with that of the developing countries.

Since we do not have a column indicating which countries are Western, we create a vector western_countries with the regions that fall into this category:

western_countries <- c("Western Europe", "Northern Europe", "Southern Europe", "Northern America", "Australia and New Zealand")

We will also use the ifelse(test, yes, no) function to create a new column such that if the region is in the West it stores a value, and if it is not in the West it stores another value. It is recommended to read the documentation in ?ifelse.
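Before applying it to the whole data frame, it can help to see ifelse() at work on a short vector. The sketch below uses a shortened, made-up list of regions (western_regions and region are illustrative names, not the vectors used in the chapter).

```r
# Shortened list of Western regions, for illustration only
western_regions <- c("Western Europe", "Northern America")

# Made-up sample of region values
region <- c("Western Europe", "Caribbean", "Northern America", "Eastern Asia")

# test: is each region in the Western list?
# yes:  value to store when the test is TRUE
# no:   value to store when the test is FALSE
group <- ifelse(region %in% western_regions, "Western", "Developing")
print(group)
```

The result has one element per input value, which is what allows ifelse() to be used inside mutate() to build a new column.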

Let’s add the column for the group each country belongs to:

western_countries <- c("Western Europe", "Northern Europe", "Southern Europe", "Northern America", "Australia and New Zealand")

gapminder |> 
  mutate(gdp_per_capita_per_day = gdp/population/365) |> 
  filter(year == 2010 & !is.na(gdp_per_capita_per_day)) |> 
  mutate(group = ifelse(region %in% western_countries, "Western", "Developing")) |> 
  head()
#>               country year infant_mortality life_expectancy fertility
#> 1             Albania 2010             14.8            77.2      1.74
#> 2             Algeria 2010             23.5            76.0      2.82
#> 3              Angola 2010            109.6            57.6      6.22
#> 4 Antigua and Barbuda 2010              7.7            75.8      2.13
#> 5           Argentina 2010             13.0            75.8      2.22
#> 6             Armenia 2010             16.1            73.0      1.55
#>   population          gdp continent          region gdp_per_capita_per_day
#> 1    2901883   6137563946    Europe Southern Europe               5.794597
#> 2   36036159  79164339611    Africa Northern Africa               6.018638
#> 3   21219954  26125663270    Africa   Middle Africa               3.373106
#> 4      87233    836686777  Americas       Caribbean              26.277814
#> 5   41222875 434405530244  Americas   South America              28.871158
#> 6    2963496   4102285513      Asia    Western Asia               3.792527
#>        group
#> 1    Western
#> 2 Developing
#> 3 Developing
#> 4 Developing
#> 5 Developing
#> 6 Developing

Now that we can tell the two groups apart, we can look at their distributions to answer our question. We start with a histogram with a logarithmic x-axis, split with facet_grid by the group each country belongs to:

western_countries <- c("Western Europe", "Northern Europe", "Southern Europe", "Northern America", "Australia and New Zealand")

gapminder |> 
  mutate(gdp_per_capita_per_day = gdp/population/365) |> 
  filter(year == 2010 & !is.na(gdp_per_capita_per_day)) |> 
  mutate(group = ifelse(region %in% western_countries, "Western", "Developing")) |> 
  ggplot() +
  aes(gdp_per_capita_per_day) +
  geom_histogram(binwidth = 1) +
  scale_x_continuous(trans = "log2") +
  facet_grid(. ~ group)

We see that daily GDP per capita in Western countries is distributed around higher values than in developing countries. However, the picture from a single year is not the whole story. Let’s check whether the separation was the same 40 years before 2010, that is, in 1970. We will also add the color argument to the geom_histogram() layer to outline the bars, which by default are drawn in plain grey.

western_countries <- c("Western Europe", "Northern Europe", "Southern Europe", "Northern America", "Australia and New Zealand")

gapminder |> 
  mutate(gdp_per_capita_per_day = gdp/population/365) |> 
  filter(year %in% c(1970, 2010) & !is.na(gdp_per_capita_per_day)) |> 
  mutate(group = ifelse(region %in% western_countries, "Western", "Developing")) |> 
  ggplot() +
  aes(gdp_per_capita_per_day) +
  geom_histogram(binwidth = 1, color = "black") +
  scale_x_continuous(trans = "log2") +
  facet_grid(year ~ group)

Both groups, “Western” and “Developing”, have improved in that 40-year span, but developing countries have advanced more than Western countries.

So far we have assumed that all countries that reported data in 2010 also reported data in 1970. To make the comparison fairer, we should restrict it to the countries that have data for both 1970 and 2010.

To do this, we are going to create a vector that lists the countries with data in 1970 and another of those that have data in 2010 and then look for the intersection. Remember that to extract a column we use the pull() function.

western_countries <- c("Western Europe", "Northern Europe", "Southern Europe", "Northern America", "Australia and New Zealand")

list_1 <- gapminder |> 
  mutate(gdp_per_capita_per_day = gdp/population/365) |> 
  filter(year %in% c(1970) & !is.na(gdp_per_capita_per_day)) |> 
  pull(country)

list_2 <- gapminder |> 
  mutate(gdp_per_capita_per_day = gdp/population/365) |> 
  filter(year %in% c(2010) & !is.na(gdp_per_capita_per_day)) |> 
  pull(country)

To find the intersection of these two vectors we will use the function intersect(vector_1, vector_2), which will give us the vector we are looking for.

intersection_vector <- intersect(list_1, list_2)
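The set logic is easy to verify on two short vectors. The country lists below are made up and do not reflect which countries actually have data; setdiff() is shown as an optional extra for inspecting what gets dropped.

```r
# Made-up lists of countries with data in each year
with_data_1970 <- c("Chile", "Peru", "Bolivia")
with_data_2010 <- c("Peru", "Angola", "Chile")

# Countries present in BOTH vectors
both_years <- intersect(with_data_1970, with_data_2010)
print(both_years)   # "Chile" "Peru"

# Countries that appear in 2010 but not in 1970
only_2010 <- setdiff(with_data_2010, with_data_1970)
print(only_2010)    # "Angola"
```

Checking the dropped countries with setdiff() is a quick way to confirm that the intersection is not silently removing something important.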

So, we recreate our histogram including only the countries on this list.

gapminder |> 
  mutate(gdp_per_capita_per_day = gdp/population/365) |> 
  filter(year %in% c(1970, 2010) & !is.na(gdp_per_capita_per_day)) |> 
  filter(country %in% intersection_vector) |> 
  mutate(group = ifelse(region %in% western_countries, "Western", "Developing")) |> 
  ggplot() +
  aes(gdp_per_capita_per_day) +
  geom_histogram(binwidth = 1, color = "black") +
  scale_x_continuous(trans = "log2") +
  facet_grid(year ~ group)

With comparable data we now see more clearly that more countries in the developing group increased their GDP per capita, and by a larger margin than Western countries. But this first inference is still visual; we need to compare how the median, the range and other statistics changed. For this we will use a box plot very similar to the previous one, but this time we map fill to the year inside geom_boxplot() so that a single graph shows how each region changed from 1970 to 2010.

gapminder |> 
  mutate(gdp_per_capita_per_day = gdp/population/365) |> 
  filter(year %in% c(1970, 2010) & !is.na(gdp_per_capita_per_day)) |> 
  filter(country %in% intersection_vector) |> 
  mutate(region = reorder(region, gdp_per_capita_per_day, FUN = median)) |> 
  ggplot() +
  aes(region, gdp_per_capita_per_day) +
  geom_boxplot(aes(region, gdp_per_capita_per_day, fill=factor(year))) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  xlab("") +
  scale_y_continuous(trans = "log2")

We see that some regions within Asia have grown substantially. It is general knowledge that several Asian countries are now economic powers, but with these graphs we can see how much each region has changed along the way.
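A box plot shows the medians visually; if we also want them as numbers, one option is to summarize the data with group_by() and summarize() from dplyr. The sketch below runs on a made-up data frame (toy and all its values are invented purely to show the mechanics), so the resulting figures say nothing about the real gapminder data.

```r
library(dplyr)

# Invented values, two countries per group and year
toy <- data.frame(
  group = rep(c("Western", "Developing"), each = 4),
  year = rep(c(1970, 1970, 2010, 2010), times = 2),
  gdp_per_capita_per_day = c(20, 30, 40, 60, 1, 3, 4, 12)
)

# Median of the variable for every combination of group and year
medians <- toy |>
  group_by(group, year) |>
  summarize(median_income = median(gdp_per_capita_per_day), .groups = "drop")

print(medians)
```

In this invented example the median of the Developing group goes from 2 to 8 (four times higher) while the Western median goes from 25 to 50 (twice as high), which is the kind of comparison of relative growth discussed above.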

Therefore, we can now answer both questions of the case:

It is no longer reasonable to divide the world into “Western” and “developing” countries, since more and more regions, such as Eastern Asia, are poorly represented by those categories.
It is not true that rich countries get richer while poor countries get poorer. We have seen that GDP per capita in developing countries has grown even faster than in Western countries.

6.8 Exercises

For this series of exercises we will use the stars data frame from the dslabs library. This dataset contains attributes of stars including their temperature, spectral type, and magnitude. The magnitude column represents absolute magnitude—a measure of intrinsic brightness where more negative values indicate greater luminosity.

library(dslabs)
data(stars)
head(stars)

The temperature data is currently in Kelvin. Create a new column temp_celsius using the formula \(C = K - 273.15\), then visualize the relationship between temperature and magnitude. Color the points by star type and use a base-10 logarithmic scale for the x-axis to better display the wide range of temperatures.

Solution

stars |> 
  mutate(temp_celsius = temp - 273.15) |> 
  ggplot() +
  aes(temp_celsius, magnitude, color = type) +
  scale_x_log10() +
  geom_point()

Since lower magnitude values correspond to higher brightness, reverse the y-axis scale using scale_y_reverse() to make the plot more intuitive (brighter stars at the top).

Solution

stars |> 
  mutate(temp_celsius = temp - 273.15) |> 
  ggplot() +
  aes(temp_celsius, magnitude, color = type) +
  scale_x_log10() +
  geom_point() +
  scale_y_reverse()

The Sun is a G-type star. To determine if these are the most luminous, create a box plot comparing the magnitude distributions across different star types.

No, G-type stars are not the most luminous, as the following box plot shows:

Solution

stars |> 
  ggplot() +
  aes(type, magnitude) +
  geom_boxplot() +
  scale_y_reverse()

6.9 Key Takeaways

In this chapter, we followed a complete data exploration process, starting with understanding the data structure before diving into visualization. We utilized faceting to reveal trends across categories and time, and used time series plots to track specific country trajectories. We also explored distributions using histograms and box plots, applying scale transformations to handle skewed economic data. Finally, we practiced iterative refinement, adjusting our plots step-by-step to tell a clearer story.

⁷ https://www.gapminder.org/
⁸ Full text of Huntington’s article (in English)