Chapter 3 Data Frames

3.1 Introduction to Data Frames

In previous chapters, we explored different types of objects in R, such as variables, vectors, lists, and matrices. These objects allow us to store and organize information efficiently. Now, in this chapter, we will delve into the world of data frames, an essential tool for organizing and analyzing information that will help you make the best decision about your move to the United States.

3.1.1 What are data frames?

Imagine a spreadsheet, with rows and columns organizing information in a tabular way. In R, a data frame is precisely that: a data structure that stores information in a tabular format, with rows representing observations (for example, each US city) and columns representing variables (such as population, cost of living, or crime rate).

Each column of a data frame can contain a different data type: numeric, character, logical, factor, etc. This makes data frames very versatile for storing diverse information.

For example, a data frame about US cities could serve as a comprehensive record. It might contain a character column for the city name and another for the state it belongs to. Numeric columns could store the population and the area in square kilometers, while a logical column like has_beach could indicate whether the city is coastal.
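For instance, a minimal sketch of such a record (with made-up, illustrative values) might look like this:

``` r
# Hypothetical example: one row per city, one column per variable
us_cities <- data.frame(
  city = c("Miami", "Denver"),        # character
  state = c("Florida", "Colorado"),   # character
  population = c(450000, 715000),     # numeric (illustrative values)
  area_km2 = c(145, 401),             # numeric (illustrative values)
  has_beach = c(TRUE, FALSE)          # logical
)
```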

3.1.2 Why data frames?

In R, there are various structures for organizing data, such as vectors, lists, and matrices. However, data frames stand out as a fundamental tool in data analysis. Why?

Data frames offer a unique combination of features that make them ideal for representing and manipulating complex information:

Their tabular structure organizes data into rows and columns, similar to a spreadsheet, making them intuitive to visualize. They are flexible, allowing each column to hold a different data type, such as numbers, text, or dates. This structure also makes them efficient, as most R analysis packages are optimized to work directly with data frames.

In summary, data frames are a versatile and powerful data structure that adapts to the needs of modern data analysis.

3.1.3 Data Frames in action: exploring information about the United States

In the context of your move to the United States, data frames will be essential for organizing and analyzing the information you need to make the best decision.

We can use data frames to store and correlate various aspects of your potential new home. You might track crime rates across different states, compare the cost of living (housing, food, transportation) in target cities, analyze climate data like temperature and precipitation, or study demographics such as population age and education levels.

With this information organized in data frames, you will be able to perform deeper analyses and make more informed decisions about your move.

3.2 Creating Data Frames: Building your database for the move

Now that you know what data frames are and why they are so important in data analysis, it’s time to learn how to create them. In R, we can create data frames in different ways: importing data from external files or creating them manually.

3.2.1 Importing data from files: CSV, Excel

A common way to create data frames is by importing data from external files, such as CSV (Comma Separated Values) files or Excel files. R offers us functions to read data from different formats.

For CSV files, we rely on the read_csv() function from the readr package (part of the tidyverse), which is faster and more robust than base R’s read.csv(). To import a file, you simply provide its URL or file path:

``` r
library(readr)
url <- "https://dparedesi.github.io/Data-Science-with-R-book/data/student-grades.csv"

# Import data from a CSV file called "student-grades.csv"
grades <- read_csv(url)
#> Rows: 21 Columns: 9
#> ── Column specification ─────────────────────────────
#> Delimiter: ","
#> chr (3): start_date, gender, type
#> dbl (6): P1, P2, P3, P4, P5, P6
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

grades
#> # A tibble: 21 × 9
#>    start_date gender type                 P1    P2    P3    P4    P5    P6
#>    <chr>      <chr>  <chr>             <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 03/05/2020 female Individual Work 1     5     5     5     5     5     5
#>  2 03/05/2020 male   Individual Work 1     5     5     5     5     4     5
#>  3 03/05/2020 female Individual Work 1     5     5     4     5     5     5
#>  4 03/05/2020 male   Individual Work 1     5     5     5     5     5     5
#>  5 03/05/2020 male   Individual Work 1     2     5     5     5     5     5
#>  6 03/05/2020 male   Individual Work 1     5     4     5     1     5     5
#>  7 03/05/2020 male   Individual Work 1     2     1     5     5     2     5
#>  8 03/05/2020 male   Individual Work 1     5     5     5     5     5     5
#>  9 03/05/2020 male   Individual Work 1     4     5     5     5     5     5
#> 10 03/05/2020 male   Individual Work 1     3     4     5     5     5     5
#> # ℹ 11 more rows
```

The read_csv() function offers several arguments to customize how files are read. The col_names argument lets you indicate whether the first row contains column names (or supply the names yourself), and the locale argument controls region-specific details such as the decimal mark. For European-style files that use a semicolon as separator and a comma as decimal mark, readr provides read_csv2(). (The header, sep, and dec arguments you may have seen elsewhere belong to base R’s read.csv(), not read_csv().)
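As a quick sketch of these options (here I() tells readr to treat the string as literal data rather than as a file path):

``` r
library(readr)

# First row is data, not column names: supply the names yourself
read_csv(I("3,4\n5,6"), col_names = c("x", "y"), show_col_types = FALSE)

# European-style files (";" separator, "," decimal mark): use read_csv2()
read_csv2(I("x;y\n1,5;2,5"), show_col_types = FALSE)
```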

For Excel files, we use the read_excel() function from the readxl package. This function works similarly but includes specific arguments like sheet to specify which spreadsheet tab to import.

``` r
# Install the readxl package (if you don't have it installed)
install.packages("readxl")

# Load the readxl package
library(readxl)

# Import data from an Excel file called "states.xlsx"
states <- read_excel("states.xlsx")
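
# To import a specific tab, pass the sheet argument
# (hypothetical sheet name shown):
# states_2020 <- read_excel("states.xlsx", sheet = "2020")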
```

3.2.2 Creating data frames manually

We can also create data frames manually, combining vectors with the data.frame() function.

``` r
# Create vectors with information about cities
cities <- c("New York", "Los Angeles", "Chicago")
states <- c("New York", "California", "Illinois")
population <- c(8.4e6, 3.9e6, 2.7e6)

# Create a data frame with city information
df_cities_simple <- data.frame(city = cities, state = states, population = population)

df_cities_simple
#>          city      state population
#> 1    New York   New York    8400000
#> 2 Los Angeles California    3900000
#> 3     Chicago   Illinois    2700000
```

In this example, we create a data frame called df_cities_simple with three columns: city, state, and population. Each column is created from a vector. Note that the vectors must have the same length to be combined into a data frame.

3.2.3 Examples

We can use data frames to organize diverse information about our move to the United States. For example, we could create a data frame with information about different cities, including their cost of living, crime rate, and climate. We could also create a data frame with information about the different states, including their population, gross domestic product (GDP), and education system.

``` r
# Create a data frame with information about cities
df_cities <- data.frame(
  city = c("New York", "Los Angeles", "Chicago", "Houston"),
  state = c("New York", "California", "Illinois", "Texas"),
  cost_of_living = c(3.5, 2.8, 2.5, 2.0),  # In thousands of dollars
  crime_rate = c(400, 350, 500, 450),  # Per 100,000 inhabitants
  climate = c("Temperate", "Mediterranean", "Continental", "Subtropical")
)

df_cities
#>          city      state cost_of_living crime_rate       climate
#> 1    New York   New York            3.5        400     Temperate
#> 2 Los Angeles California            2.8        350 Mediterranean
#> 3     Chicago   Illinois            2.5        500   Continental
#> 4     Houston      Texas            2.0        450   Subtropical
```

``` r
# Create a data frame with information about states
df_states <- data.frame(
  state = c("California", "Texas", "Florida", "New York"),
  population = c(39.2e6, 29.0e6, 21.4e6, 19.4e6),
  gdp = c(3.2e12, 1.8e12, 1.1e12, 1.7e12),  # In dollars
  education_system = c("Good", "Regular", "Good", "Excellent")
)

df_states
#>        state population     gdp education_system
#> 1 California   39200000 3.2e+12             Good
#> 2      Texas   29000000 1.8e+12          Regular
#> 3    Florida   21400000 1.1e+12             Good
#> 4   New York   19400000 1.7e+12        Excellent
```

These data frames will allow us to analyze the information more efficiently and make more informed decisions about our move.

3.3 Exploring Data Frames: Discovering the secrets of your data

We have already learned to create data frames; now it is time to explore their content and discover the information they hold. R offers us various tools to examine and understand our data.

3.3.1 Accessing rows, columns, and cells

A data frame is like a map organized in rows and columns. To access the information we need, we must know how to navigate this map. R provides us with different ways to access rows, columns, and cells of a data frame.

There are several ways to access specific data within a data frame, as shown in the sketch below. To retrieve a column, you can use the $ operator (e.g., df_cities$state) or bracket notation with the column name in quotes (e.g., df_states["population"]). To access a specific row, use brackets with the row number (e.g., df_cities[3, ]). For a precise cell at the intersection of a row and column, specify both indices (e.g., df_states[2, 3]). You can also filter rows based on conditions, such as extracting all cities where the cost of living is less than 3 using a logical expression inside the brackets.
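A minimal sketch of each of these forms, using the df_cities and df_states data frames from the previous section:

``` r
# A whole column as a vector
df_cities$state

# A column by name (returns a one-column data frame)
df_states["population"]

# The third row
df_cities[3, ]

# A single cell: row 2, column 3 (the GDP of Texas)
df_states[2, 3]

# Rows meeting a condition: cities with cost of living below 3
df_cities[df_cities$cost_of_living < 3, ]
```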

3.3.2 Functions for exploring data frames

R offers several useful functions for exploring data frames:

head() displays the first six rows, while tail() shows the last six. To understand the structure, such as column names and data types, use str(). For a statistical overview including mean, median, and quartiles, summary() is the go-to function. Additionally, View() opens an interactive spreadsheet-style window to browse the data.
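For example, applying str() to the df_cities data frame from the previous section summarizes its dimensions and column types (output shown as a sketch):

``` r
str(df_cities)
#> 'data.frame':    4 obs. of  5 variables:
#>  $ city          : chr  "New York" "Los Angeles" "Chicago" "Houston"
#>  $ state         : chr  "New York" "California" "Illinois" "Texas"
#>  $ cost_of_living: num  3.5 2.8 2.5 2
#>  $ crime_rate    : num  400 350 500 450
#>  $ climate       : chr  "Temperate" "Mediterranean" "Continental" "Subtropical"
```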

3.3.3 Examples: exploring data frames with move information

By exploring the data frames we created in the previous section, we can obtain valuable information about US cities and states. For example, we could use summary() to get descriptive statistics of the cost of living in different cities, or View() to examine information about each state in detail.

``` r
# Get descriptive statistics of cost of living in different cities
summary(df_cities$cost_of_living)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   2.000   2.375   2.650   2.700   2.975   3.500
# Examine detailed information about each state
View(df_states)
```

In addition to the functions mentioned above, we can use other tools to explore our data frames. For example, we can use the table() function to get the frequency of each value in a categorical column, such as the climate column in the df_cities data frame.

``` r
table(df_cities$climate)
#> 
#>   Continental Mediterranean   Subtropical     Temperate 
#>             1             1             1             1
```

We can also use the hist() function to create a histogram of a numeric column, such as the population column in the df_states data frame.

``` r
hist(df_states$population)
```

These are just some ideas of how we can explore our data frames. As you become familiar with R, you will discover new functions and techniques for analyzing and visualizing your data.

3.4 Manipulating Data Frames: Transforming your data

In the previous section, we learned to explore data frames and access the information they contain. Now, we will go a step further and learn to manipulate data frames, transforming data to answer specific questions and obtain relevant information for our move.

3.4.1 Introduction to the pipe operator (|>)

Before modifying data frames, we will introduce a tool for writing more readable and efficient code: the native pipe operator (|>). This operator was introduced in R 4.1 (2021) as a built-in language feature, meaning it works without any additional packages.

Note: You may also encounter the %>% pipe operator from the magrittr package (part of the tidyverse). Both |> and %>% work similarly for most data analysis tasks. We use the native |> operator throughout this book as it is built into R, but %>% is still widely used in older codebases.

The pipe operator allows us to chain several operations sequentially. Instead of writing nested code, we can use the pipe to “pass” the result of one operation to the next.
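A tiny example makes the difference clear; both expressions compute the same result:

``` r
# Nested code: read from the inside out
sqrt(sum(c(1, 4, 9)))
#> [1] 3.741657

# With the pipe: read from left to right
c(1, 4, 9) |> sum() |> sqrt()
#> [1] 3.741657
```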

To use additional data manipulation functions, we’ll load the tidyverse package, which includes dplyr, a package with many useful functions for working with data frames.

A package in R is like a toolbox with additional functions and data for performing specific tasks. To use a package’s functions, we must first install it and then load it into our working environment.

To install the tidyverse package, we can use the following instruction in the R console:

install.packages("tidyverse")

This will install tidyverse and all the packages it contains, including dplyr. Once the package is installed, we can load it with the library() function:

``` r
library(tidyverse)
```

Now we can use the pipe operator (|>) and functions from dplyr.

For example, we’ll use the murders dataset from the dslabs package. This dataset contains gun murder data by US state in 2010, including variables like state name, abbreviation, region, population, and total murders. Let’s use a pipeline to view selected columns:

install.packages("dslabs")
# Load library and dataset
library(dslabs)
data(murders)

# Pipe: pass murders to select()
murders |> select(state, population, total)
#>                   state population total
#> 1               Alabama    4779736   135
#> 2                Alaska     710231    19
#> 3               Arizona    6392017   232
#> 4              Arkansas    2915918    93
#> 5            California   37253956  1257
#> 6              Colorado    5029196    65
#> 7           Connecticut    3574097    97
#> 8              Delaware     897934    38
#> 9  District of Columbia     601723    99
#> 10              Florida   19687653   669
#> 11              Georgia    9920000   376
#> 12               Hawaii    1360301     7
#> 13                Idaho    1567582    12
#> 14             Illinois   12830632   364
#> 15              Indiana    6483802   142
#> 16                 Iowa    3046355    21
#> 17               Kansas    2853118    63
#> 18             Kentucky    4339367   116
#> 19            Louisiana    4533372   351
#> 20                Maine    1328361    11
#> 21             Maryland    5773552   293
#> 22        Massachusetts    6547629   118
#> 23             Michigan    9883640   413
#> 24            Minnesota    5303925    53
#> 25          Mississippi    2967297   120
#> 26             Missouri    5988927   321
#> 27              Montana     989415    12
#> 28             Nebraska    1826341    32
#> 29               Nevada    2700551    84
#> 30        New Hampshire    1316470     5
#> 31           New Jersey    8791894   246
#> 32           New Mexico    2059179    67
#> 33             New York   19378102   517
#> 34       North Carolina    9535483   286
#> 35         North Dakota     672591     4
#> 36                 Ohio   11536504   310
#> 37             Oklahoma    3751351   111
#> 38               Oregon    3831074    36
#> 39         Pennsylvania   12702379   457
#> 40         Rhode Island    1052567    16
#> 41       South Carolina    4625364   207
#> 42         South Dakota     814180     8
#> 43            Tennessee    6346105   219
#> 44                Texas   25145561   805
#> 45                 Utah    2763885    22
#> 46              Vermont     625741     2
#> 47             Virginia    8001024   250
#> 48           Washington    6724540    93
#> 49        West Virginia    1852994    27
#> 50            Wisconsin    5686986    97
#> 51              Wyoming     563626     5
```

Code written with the pipe is easier to read and understand, as it follows the natural flow of operations. Note that the pipe creates a view; we are not editing the murders data frame.

We can show the first rows using the head() function:

``` r
head(murders |> select(state, population, total))
#>        state population total
#> 1    Alabama    4779736   135
#> 2     Alaska     710231    19
#> 3    Arizona    6392017   232
#> 4   Arkansas    2915918    93
#> 5 California   37253956  1257
#> 6   Colorado    5029196    65
```

We can also use the pipe operator to show the first rows:

``` r
murders |> select(state, population, total) |> head()
#>        state population total
#> 1    Alabama    4779736   135
#> 2     Alaska     710231    19
#> 3    Arizona    6392017   232
#> 4   Arkansas    2915918    93
#> 5 California   37253956  1257
#> 6   Colorado    5029196    65
```

For better readability, we will use one function per line, obtaining the same result:

``` r
murders |>
  select(state, population, total) |> # Select columns
  head() # Show first 6 rows
#>        state population total
#> 1    Alabama    4779736   135
#> 2     Alaska     710231    19
#> 3    Arizona    6392017   232
#> 4   Arkansas    2915918    93
#> 5 California   37253956  1257
#> 6   Colorado    5029196    65
```

3.4.2 Transforming a table with mutate()

We can create new columns or modify existing ones using the mutate() function. For example, to add a column with the homicide rate per 100,000 inhabitants to the murders data frame:

``` r
murders |>
  mutate(ratio = total / population * 100000) |>
  head()
#>        state abb region population total    ratio
#> 1    Alabama  AL  South    4779736   135 2.824424
#> 2     Alaska  AK   West     710231    19 2.675186
#> 3    Arizona  AZ   West    6392017   232 3.629527
#> 4   Arkansas  AR  South    2915918    93 3.189390
#> 5 California  CA   West   37253956  1257 3.374138
#> 6   Colorado  CO   West    5029196    65 1.292453
```

This creates a view with the additional ratio column.

If we want to modify the murders data frame directly, we use the assignment operator <-:

``` r
murders <- murders |>
  mutate(ratio = total / population * 100000)
```

3.4.3 Filtering data: selecting cities that interest you

We can filter rows meeting a condition using the filter() function. For example, to get states with less than 1 homicide per 100,000 inhabitants:

``` r
# Load dataset
data(murders)

murders |>
  mutate(ratio = total / population * 100000) |>
  filter(ratio < 1)
#>            state abb        region population total     ratio
#> 1         Hawaii  HI          West    1360301     7 0.5145920
#> 2          Idaho  ID          West    1567582    12 0.7655102
#> 3           Iowa  IA North Central    3046355    21 0.6893484
#> 4          Maine  ME     Northeast    1328361    11 0.8280881
#> 5      Minnesota  MN North Central    5303925    53 0.9992600
#> 6  New Hampshire  NH     Northeast    1316470     5 0.3798036
#> 7   North Dakota  ND North Central     672591     4 0.5947151
#> 8         Oregon  OR          West    3831074    36 0.9396843
#> 9   South Dakota  SD North Central     814180     8 0.9825837
#> 10          Utah  UT          West    2763885    22 0.7959810
#> 11       Vermont  VT     Northeast     625741     2 0.3196211
#> 12       Wyoming  WY          West     563626     5 0.8871131
```

We can use different operators to create our conditions:

R supports standard comparison operators for building conditions: greater than (>), less than (<), greater than or equal to (>=), less than or equal to (<=), equal to (==), and not equal to (!=). You can combine multiple conditions using the logical operators & for AND, | for OR, and ! for NOT.

For example, to filter by ratio less than 1 and West region:

``` r
murders |>
  mutate(ratio = total / population * 100000) |>
  filter(ratio < 1 & region == "West")
#>     state abb region population total     ratio
#> 1  Hawaii  HI   West    1360301     7 0.5145920
#> 2   Idaho  ID   West    1567582    12 0.7655102
#> 3  Oregon  OR   West    3831074    36 0.9396843
#> 4    Utah  UT   West    2763885    22 0.7959810
#> 5 Wyoming  WY   West     563626     5 0.8871131
```

3.4.4 Sorting data: finding the safest cities

The arrange() function from the dplyr package allows us to order the rows of a data frame based on one or more columns. Imagine you have a data frame with information about different cities, and you want to order them from safest to least safe, based on their crime rate. Or perhaps you want to order them by cost of living, from cheapest to most expensive. arrange() allows you to do this easily.

For example, to order states by homicide rate (from lowest to highest):

``` r
murders |>
  mutate(ratio = total / population * 100000) |>
  arrange(ratio) |>
  head()
#>           state abb        region population total     ratio
#> 1       Vermont  VT     Northeast     625741     2 0.3196211
#> 2 New Hampshire  NH     Northeast    1316470     5 0.3798036
#> 3        Hawaii  HI          West    1360301     7 0.5145920
#> 4  North Dakota  ND North Central     672591     4 0.5947151
#> 5          Iowa  IA North Central    3046355    21 0.6893484
#> 6         Idaho  ID          West    1567582    12 0.7655102
```

If we want to sort in descending order, we use the desc() function:

``` r
murders |>
  mutate(ratio = total / population * 100000) |>
  arrange(desc(ratio)) |>
  head()
#>                  state abb        region population total     ratio
#> 1 District of Columbia  DC         South     601723    99 16.452753
#> 2            Louisiana  LA         South    4533372   351  7.742581
#> 3             Missouri  MO North Central    5988927   321  5.359892
#> 4             Maryland  MD         South    5773552   293  5.074866
#> 5       South Carolina  SC         South    4625364   207  4.475323
#> 6             Delaware  DE         South     897934    38  4.231937
```

We can also sort by multiple columns. For example, if we want to sort first by region and then by state (in alphabetical order):

``` r
murders |>
  arrange(region, state) |>
  head()
#>           state abb    region population total
#> 1   Connecticut  CT Northeast    3574097    97
#> 2         Maine  ME Northeast    1328361    11
#> 3 Massachusetts  MA Northeast    6547629   118
#> 4 New Hampshire  NH Northeast    1316470     5
#> 5    New Jersey  NJ Northeast    8791894   246
#> 6      New York  NY Northeast   19378102   517
```

3.4.5 Aggregating and summarizing data: obtaining a general overview

The summarize() function from the dplyr package allows us to calculate descriptive statistics for one or more columns of a data frame. It’s like summarizing information from our data frame into a single number or a set of numbers.

For example, to calculate the mean population of states:

``` r
murders |>
  summarize(mean_population = mean(population))
#>   mean_population
#> 1         6075769
```

We can combine summarize() with group_by() to calculate statistics by groups. For example, to calculate average population by region:

``` r
murders |>
  group_by(region) |>
  summarize(mean_population = mean(population))
#> # A tibble: 4 × 2
#>   region        mean_population
#>   <fct>                   <dbl>
#> 1 Northeast            6146360 
#> 2 South                6804378.
#> 3 North Central        5577250.
#> 4 West                 5534273.
```

3.4.6 Joining data frames: combining information

Imagine you have two data frames: one with information about cities (name, population, etc.) and another with information about the states those cities belong to (state name, governor, etc.). If you want to combine information from both data frames to have a single data frame with all information about cities and their states, you can use dplyr join functions.

dplyr offers several functions for joining data frames, such as left_join(), right_join(), inner_join(), and full_join(). Each function performs a different type of join, depending on how data frame rows are combined.

The left_join() function joins two data frames keeping all rows from the first data frame (the one on the left) and adding columns from the second data frame that match the first data frame’s rows. If a row from the first data frame has no match in the second data frame, new columns will have NA values.

For example, if we have a data frame with city information and another with state information, we can join them by the state column:

``` r
df_cities_states <- left_join(df_cities, df_states, by = "state")
```

The resulting data frame df_cities_states will contain information from both data frames combined. If a city in df_cities does not have a corresponding state in df_states, columns from df_states will have NA values for that city.

Let’s see a concrete example. Suppose we have the following data frames:

``` r
df_cities <- data.frame(
  city = c("New York", "Los Angeles", "Chicago", "Houston"),
  state = c("New York", "California", "Illinois", "Texas")
)

df_states <- data.frame(
  state = c("California", "Texas", "Florida"),
  governor = c("Gavin Newsom", "Greg Abbott", "Ron DeSantis")
)

# Join data frames by "state" column
df_cities_states <- left_join(df_cities, df_states, by = "state")

df_cities_states
#>          city      state     governor
#> 1    New York   New York         <NA>
#> 2 Los Angeles California Gavin Newsom
#> 3     Chicago   Illinois         <NA>
#> 4     Houston      Texas  Greg Abbott
```

In this example, left_join() combines df_cities and df_states data frames by the state column. Note that “New York” and “Chicago” cities have NA values in the governor column, since their states (“New York” and “Illinois”) are not present in the df_states data frame.

The other join functions work similarly but with different inclusion criteria. right_join() does the opposite of left_join(), keeping all rows from the right data frame and only the matching rows from the left. inner_join() is more restrictive, keeping only rows that have matches in both tables, while full_join() is the most inclusive, retaining all rows from both data frames and filling in NA where no match exists.
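Using the same df_cities and df_states from the example above, a quick sketch of the difference (output shown as expected):

``` r
# inner_join(): only cities whose state appears in df_states
inner_join(df_cities, df_states, by = "state")
#>          city      state     governor
#> 1 Los Angeles California Gavin Newsom
#> 2     Houston      Texas  Greg Abbott

# full_join(): every row from both tables; Florida keeps NA for city
full_join(df_cities, df_states, by = "state")
#>          city      state     governor
#> 1    New York   New York         <NA>
#> 2 Los Angeles California Gavin Newsom
#> 3     Chicago   Illinois         <NA>
#> 4     Houston      Texas  Greg Abbott
#> 5        <NA>    Florida Ron DeSantis
```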

You can consult the dplyr documentation for more information about these functions.

3.4.7 Examples

The dplyr functions we have seen allow us to perform complex data transformations to answer specific questions about our move to the United States. Let’s see some examples with R code:

Examples of analysis questions

We can combine these tools to answer specific questions, as sketched below. To find suitable locations, we might filter for cities with a “Good” education system and a cost of living index below 2.5. Alternatively, to study economic prosperity, we could sort states by their GDP per capita (calculated as GDP divided by population) in descending order. For a more comprehensive climate analysis, we could join our city data with a separate climate table.
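For instance, here is a sketch of the first two questions, assuming the fuller versions of df_cities and df_states created in Section 3.2.3 (the trimmed-down versions from the join example above lack these columns):

``` r
# Cities in states with a "Good" education system and cost of living below 2.5
df_cities |>
  left_join(df_states, by = "state") |>
  filter(education_system == "Good" & cost_of_living < 2.5)

# States sorted by GDP per capita, highest first
df_states |>
  mutate(gdp_per_capita = gdp / population) |>
  arrange(desc(gdp_per_capita))
```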

With these tools, you will be able to explore and analyze information about the United States to make the best decision about your move.

3.5 Exercises

  1. Report the state abbreviation (abb) and population (population) columns from the murders data frame.
Solution
``` r
murders |>
  select(abb, population)
```
  2. Report all rows of the data frame that are not from the South region.
Solution
``` r
murders |>
  filter(region != "South")
```

If we want to filter all records that are from the South or West regions, we use %in% instead of == to compare against a vector.

  3. Create the vector south_and_west containing the values “South” and “West”. Then filter the records that come from those two regions.
Solution
``` r
south_and_west <- c("South", "West")

murders |>
  filter(region %in% south_and_west)
```
  4. Add the ratio column to the murders data frame with the murder ratio per 100,000 inhabitants. Then filter those with a ratio less than 0.5 that are from the “South” and “West” regions. Report the state, abb, and ratio columns.
Solution
``` r
data(murders)

south_and_west <- c("South", "West")

murders <- murders |>
  mutate(ratio = total/population*100000) |>
  filter(ratio < 0.5 & region %in% south_and_west) |>
  select(state, abb, ratio)

murders
```

To sort within a pipe we use the arrange(x) function, where x is the column to sort by in ascending order, or arrange(desc(x)) to sort in descending order.

  5. Modify the code from the previous exercise to sort the result by the ratio column.
Solution
``` r
data(murders)

south_and_west <- c("South", "West")

murders <- murders |>
  mutate(ratio = total/population*100000) |>
  filter(ratio < 0.5 & region %in% south_and_west) |>
  select(state, abb, ratio) |>
  arrange(ratio)

murders
```

With this result, we finally know which states are candidates for our move, resolving the case we presented.

3.6 Data frames in plots

Now we will look at some functions that allow us to visualize our data. Little by little we will build more complex and more polished graphs for presentation. First, let’s look at the most basic functions R offers. In the next chapter we will cover chart types in more detail, along with the situations in which each one is recommended.

3.6.1 Scatter plots

One of the most used plots in R is the scatter plot, which is a type of mathematical diagram using Cartesian coordinates to show values for two variables for a set of data (Jarrell 1994, 492). By default we assume the variables to analyze are independent. Thus, the scatter plot will show the degree of correlation (not causality) between the two variables.

The simplest way to plot a scatter plot is with the plot(x,y) function, where x and y are vectors indicating the x-axis coordinates and y-axis coordinates of each point we want to plot. For example, let’s see the relationship between population size and total murders.

``` r
# Let's store population data in the x_axis object
x_axis <- murders$population

# Let's store total murders data in the y_axis object
y_axis <- murders$total

# With this code we create the scatter plot
plot(x_axis, y_axis)
```

We can see a correlation between population and the number of cases. Let's transform x_axis by dividing by one million (\({10}^6\)), so the x axis is expressed in millions.

``` r
x_axis <- murders$population/10^6
y_axis <- murders$total

plot(x_axis, y_axis)
```

3.6.2 Histograms

We can also create histograms from a vector with the hist() function.

``` r
data(murders)

murders <- murders |>
  mutate(ratio = total/population*100000)

hist(murders$ratio)
```

The ease with which R creates graphs will save us time in our analysis. Here we can quickly see that most states have a ratio < 5.
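We can confirm that impression numerically; the proportion of states with a ratio below 5 is:

``` r
# Share of states with fewer than 5 murders per 100,000 inhabitants
mean(murders$ratio < 5)
#> [1] 0.9215686
```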

3.6.3 Box plot

Finally, R allows us to create box plots easily with the boxplot() function. So, if we wanted to analyze the distribution of ratio, we would use the following code:

``` r
boxplot(murders$ratio)
```

3.7 Data interpretation

We have seen graphs that can be generated with a single line of code, but we still need to interpret them. To do so, we need to learn, or recall, some statistics. Throughout this book we will learn statistical concepts without going deep into the mathematics, approaching them from the practical side and leveraging the functions that already exist in R.

Let’s remember our case/problem. We have a list of murders in each of the 51 states. If we order them by the total column we would have:

``` r
murders |>
  arrange(total) |>
  head()
#>           state abb        region population total     ratio
#> 1       Vermont  VT     Northeast     625741     2 0.3196211
#> 2  North Dakota  ND North Central     672591     4 0.5947151
#> 3 New Hampshire  NH     Northeast    1316470     5 0.3798036
#> 4       Wyoming  WY          West     563626     5 0.8871131
#> 5        Hawaii  HI          West    1360301     7 0.5145920
#> 6  South Dakota  SD North Central     814180     8 0.9825837
```

R provides us with the summary() function, which gives us a summary of a vector’s data.

``` r
summary(murders$total)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>     2.0    24.5    97.0   184.4   268.0  1257.0
```

The summary provides key insights: the Min and Max show the range of the data; the 1st Qu (first quartile) and 3rd Qu (third quartile) indicate the 25th and 75th percentiles; the Median marks the exact middle of the distribution; and the Mean gives the arithmetic average.

3.7.1 Quartiles

To understand quartiles, let's visualize the total data in sorted order. To extract just one column at the end of a pipe, we use the pull() function:

``` r
murders |>
  arrange(total) |>
  pull(total)
#>  [1]    2    4    5    5    7    8   11   12   12   16   19   21   22   27   32
#> [16]   36   38   53   63   65   67   84   93   93   97   97   99  111  116  118
#> [31]  120  135  142  207  219  232  246  250  286  293  310  321  351  364  376
#> [46]  413  457  517  669  805 1257
```

Quartiles divide our vector into 4 parts with (roughly) the same amount of data. Since we have 51 values, 51/4 = 12.75, so the groups cannot all be the same size: we end up with three groups of 13 elements and one group of 12.

For example, the first group would be composed of these numbers:

``` r
#>  [1]  2  4  5  5  7  8 11 12 12 16 19 21 22
```

The second group would be composed of these numbers:

``` r
#>  [1] 27 32 36 38 53 63 65 67 84 93 93 97 97
```

And so on: in total, 4 groups each containing about 25% of the data.
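R can compute these cut points directly with the quantile() function; its output should match the values reported by summary():

``` r
quantile(murders$total)
#>     0%    25%    50%    75%   100% 
#>    2.0   24.5   97.0  268.0 1257.0
```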

3.7.1.1 First quartile

Therefore, when we see the first quartile, 1st Qu., think of it as the cut point below which we find the first 25% of the data.

``` r
summary(murders$total)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>     2.0    24.5    97.0   184.4   268.0  1257.0
```

In our example 24.5 indicates that every number less than or equal to that number will be within the first 25% of data (25% of 51 data points = 12.75, rounded to 13 data points).

If we list numbers less than or equal to 24.5 we will have this list:

``` r
murders |>
  arrange(total) |>
  filter(total <= 24.5) |>
  pull(total)
#>  [1]  2  4  5  5  7  8 11 12 12 16 19 21 22
```

Which is exactly the same list we obtained previously for the first group.

3.7.1.2 Second quartile or median

The second quartile, also called the median (Median), indicates the cut of the second group. The first group contains the first 25% of the data, and the second group an additional 25%, so this cut gives us exactly the value found in the middle.

``` r
summary(murders$total)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>     2.0    24.5    97.0   184.4   268.0  1257.0
```

In our example 97 indicates that below that number we will find 50% of total data (50% of 51 data points = 25.5, rounded to 26 data points).

``` r
murders |>
  arrange(total) |>
  filter(total <= 97) |>
  pull(total)
#>  [1]  2  4  5  5  7  8 11 12 12 16 19 21 22 27 32 36 38 53 63 65 67 84 93 93 97
#> [26] 97
```

3.7.1.3 Third quartile

The third quartile is the cut of the third group. Up to the median we already had 50% of the data; adding another 25% brings us to 75%.

``` r
summary(murders$total)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>     2.0    24.5    97.0   184.4   268.0  1257.0
```

In our example 268 indicates that below that number we will find 75% of total data (75% of 51 data points = 38.25, rounded to 38 data points).
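We can verify the count just as we did for the first quartile:

``` r
murders |>
  filter(total <= 268) |>
  pull(total) |>
  length()
#> [1] 38
```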

3.7.2 Interpretation of box plot

We are now ready to create a box plot with total murders and interpret results.

``` r
boxplot(murders$total)
```

The box starts at value 24.5 (first quartile) and ends at value 268 (third quartile). The thick line represents the median (second quartile), 97 in our example.

Between the first quartile and the third quartile (between 24.5 and 268 in our example) we find 50% of the data; this span is called the interquartile range, or IQR.
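R computes this spread directly with the IQR() function:

``` r
IQR(murders$total)
#> [1] 243.5
```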

Outside the box we see a vertical line upwards and another downwards (the whiskers), showing the range of the bulk of the data. Beyond those lines we see dots: observations that lie far from the rest of the distribution, known as outliers.

We can quickly find which states these extreme values belong to if we sort the table in descending order using the desc() function:

``` r
murders |>
  arrange(desc(total)) |>
  head()
#>          state abb        region population total    ratio
#> 1   California  CA          West   37253956  1257 3.374138
#> 2        Texas  TX         South   25145561   805 3.201360
#> 3      Florida  FL         South   19687653   669 3.398069
#> 4     New York  NY     Northeast   19378102   517 2.667960
#> 5 Pennsylvania  PA     Northeast   12702379   457 3.597751
#> 6     Michigan  MI North Central    9883640   413 4.178622
```

We see that 1257 cases were reported in California. That is one of the extreme data points we see in the box plot.

3.7.3 Examples

  1. Create the variable pop_log10 and store the log base 10 of population (using the log10() function). Apply the same log base 10 transformation to total murders and store it in the variable tot_log10. Generate a scatter plot of these two variables.
``` r
pop_log10 <- log10(murders$population)
tot_log10 <- log10(murders$total)

plot(pop_log10, tot_log10)
```

  2. Create a histogram of population in millions (divided by \({10}^6\)).
``` r
hist(murders$population/10^6)
```

  3. Create a box plot of population.
``` r
boxplot(murders$population)
```

3.8 Exercises

Below, you will find a series of exercises with different levels of difficulty. It is time to put into practice what you have learned in this chapter. Remember you can use dplyr functions like filter(), arrange(), mutate(), summarize(), group_by() and left_join() to manipulate data frames.

  1. Create a data frame called my_expenses. It should contain a category factor with levels “Housing”, “Transport”, “Food”, and “Entertainment”, along with three numeric columns (january, february, march) recording expenses for each category.
Solution
``` r
my_expenses <- data.frame(
  category = factor(c("Housing", "Transport", "Food", "Entertainment")),
  january = c(1500, 300, 500, 200),
  february = c(1500, 250, 400, 150),
  march = c(1500, 350, 550, 250)
)

my_expenses
#>        category january february march
#> 1       Housing    1500     1500  1500
#> 2     Transport     300      250   350
#> 3          Food     500      400   550
#> 4 Entertainment     200      150   250
```
  2. Use the head(), tail(), str(), and summary() functions to explore the my_expenses data frame.
Solution
``` r
head(my_expenses)
#>        category january february march
#> 1       Housing    1500     1500  1500
#> 2     Transport     300      250   350
#> 3          Food     500      400   550
#> 4 Entertainment     200      150   250
tail(my_expenses)
#>        category january february march
#> 1       Housing    1500     1500  1500
#> 2     Transport     300      250   350
#> 3          Food     500      400   550
#> 4 Entertainment     200      150   250
str(my_expenses)
#> 'data.frame':    4 obs. of  4 variables:
#>  $ category: Factor w/ 4 levels "Entertainment",..: 3 4 2 1
#>  $ january : num  1500 300 500 200
#>  $ february: num  1500 250 400 150
#>  $ march   : num  1500 350 550 250
summary(my_expenses)
#>           category    january        february        march       
#>  Entertainment:1   Min.   : 200   Min.   : 150   Min.   : 250.0  
#>  Food         :1   1st Qu.: 275   1st Qu.: 225   1st Qu.: 325.0  
#>  Housing      :1   Median : 400   Median : 325   Median : 450.0  
#>  Transport    :1   Mean   : 625   Mean   : 575   Mean   : 662.5  
#>                    3rd Qu.: 750   3rd Qu.: 675   3rd Qu.: 787.5  
#>                    Max.   :1500   Max.   :1500   Max.   :1500.0
```
  3. Access the february column of the my_expenses data frame using the $ operator. Then access the second row of the data frame using brackets.
Solution
``` r
my_expenses$february
#> [1] 1500  250  400  150
my_expenses[2, ]
#>    category january february march
#> 2 Transport     300      250   350
```
  4. Filter the my_expenses data frame to get only the rows where january expenses are greater than 400.
Solution
``` r
my_expenses |>
  filter(january > 400)
#>   category january february march
#> 1  Housing    1500     1500  1500
#> 2     Food     500      400   550
```
  5. Sort the my_expenses data frame in descending order by march expenses.
Solution
``` r
my_expenses |>
  arrange(desc(march))
#>        category january february march
#> 1       Housing    1500     1500  1500
#> 2          Food     500      400   550
#> 3     Transport     300      250   350
#> 4 Entertainment     200      150   250
```
  6. Add a column called total to the my_expenses data frame containing the sum of the january, february, and march expenses for each category.
Solution
``` r
my_expenses <- my_expenses |>
  mutate(total = january + february + march)
```
  7. Calculate the mean and standard deviation of the total expenses in the my_expenses data frame.
Solution
``` r
my_expenses |>
  summarize(mean_total = mean(total), 
            std_total = sd(total))
#>   mean_total std_total
#> 1     1862.5  1793.216
```
  8. Group the my_expenses data frame by category and calculate the sum of expenses for each month.
Solution
``` r
my_expenses |>
  group_by(category) |> 
  summarize(sum_january = sum(january), 
            sum_february = sum(february), 
            sum_march = sum(march))
#> # A tibble: 4 × 4
#>   category      sum_january sum_february sum_march
#>   <fct>               <dbl>        <dbl>     <dbl>
#> 1 Entertainment         200          150       250
#> 2 Food                  500          400       550
#> 3 Housing              1500         1500      1500
#> 4 Transport             300          250       350
```
  9. Visually analyze the following chart describing the distribution of total murders by region (a chart like this can be produced with boxplot(total ~ region, data = murders)). Just by looking at it, could you point out which region has the smallest data range, ignoring outliers? Which region has the highest median?

Solution

West has the smallest data range and has two outliers. South has the highest median among all regions.

Analyzing a chart on its own puts us in the final observer’s shoes and helps us judge whether decisions can be made with only the information presented.
  10. Create a vector south that stores the filtered total murders that occurred in the South region. Then create a histogram of the south vector.
Solution
``` r
south <- murders |>
  filter(region == "South") |>
  pull(total)

hist(south)
```
  11. Create a new data frame called df_cities_climate combining information from df_cities and df_climate (you must create the df_climate data frame with city climate information). Ensure the resulting data frame contains all cities from df_cities, even those without climate information in df_climate.
Solution
``` r
df_climate <- data.frame(
  city = c("New York", "Los Angeles", "Chicago"),
  average_temperature = c(12.8, 17.7, 10.7),  # In degrees Celsius
  annual_precipitation = c(1269, 373, 965)  # In millimeters
)

df_cities_climate <- left_join(df_cities, df_climate, by = "city")
```
  12. Create a data frame with some missing values (NA). Then replace the missing values with the mean of the non-missing values in the same column.
Solution
``` r
# Create data frame with missing values
df_with_na <- data.frame(
  x = c(1, 2, NA, 4, 5),
  y = c(NA, 7, 8, NA, 10)
)

df_with_na
#>    x  y
#> 1  1 NA
#> 2  2  7
#> 3 NA  8
#> 4  4 NA
#> 5  5 10

# Replace missing values with mean
df_with_na <- df_with_na |> 
  mutate(x = ifelse(is.na(x), mean(x, na.rm = TRUE), x),
         y = ifelse(is.na(y), mean(y, na.rm = TRUE), y))

df_with_na
#>   x         y
#> 1 1  8.333333
#> 2 2  7.000000
#> 3 3  8.000000
#> 4 4  8.333333
#> 5 5 10.000000
```
  13. Create a function called clean_data_frame() that receives a data frame as an argument and replaces missing values with the mean of the non-missing values in each column.
Solution
``` r
clean_data_frame <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      df[[col]] <- ifelse(is.na(df[[col]]), mean(df[[col]], na.rm = TRUE), df[[col]])
    }
  }
  return(df)
}

## Test created function
# Create data frame with missing values to test function
df_test <- data.frame(
  age = c(25, 30, NA, 28, 35),
  height = c(1.75, 1.80, 1.65, NA, 1.70),
  weight = c(70, 80, 75, 65, NA)
)

df_test
#>   age height weight
#> 1  25   1.75     70
#> 2  30   1.80     80
#> 3  NA   1.65     75
#> 4  28     NA     65
#> 5  35   1.70     NA

# Apply function to test data frame
df_clean <- clean_data_frame(df_test)

# Show clean data frame
df_clean
#>    age height weight
#> 1 25.0  1.750   70.0
#> 2 30.0  1.800   80.0
#> 3 29.5  1.650   75.0
#> 4 28.0  1.725   65.0
#> 5 35.0  1.700   72.5
```

References

Jarrell, Stephen B. 1994. Basic Statistics. 1st ed. Brown (William C.) Co.