Chapter 3 Data Frames
3.1 Introduction to Data Frames
In previous chapters, we explored different types of objects in R, such as variables, vectors, lists, and matrices. These objects allow us to store information in more efficient ways. Now, in this chapter, we will delve into the world of data frames, an essential tool for organizing and analyzing information that will help you make the best decision about your move to the United States.
3.1.1 What are data frames?
Imagine a spreadsheet, with rows and columns organizing information in a tabular way. In R, a data frame is precisely that: a data structure that stores information in tabular format, with rows representing observations (for example, individual US cities) and columns representing variables (such as population, cost of living, or crime rate).
Each column of a data frame can contain a different data type: numeric, character, logical, factor, etc. This makes data frames very versatile for storing diverse information.
For example, a data frame about US cities could serve as a comprehensive record. It might contain a character column for the city name and another for the state it belongs to. Numeric columns could store the population and the area in square kilometers, while a logical column like has_beach could indicate whether the city is coastal.
3.1.2 Why data frames?
In R, there are various structures for organizing data, such as vectors, lists, and matrices. However, data frames stand out as a fundamental tool in data analysis. Why?
Data frames offer a unique combination of features that makes them ideal for representing and manipulating complex information. Their tabular structure organizes data into rows and columns, similar to a spreadsheet, making them intuitive to visualize. They offer flexibility by allowing each column to hold a different data type, such as numbers, text, or dates. This structure also ensures efficiency, as most R analysis packages are optimized to work directly with data frames.
In summary, data frames are a versatile and powerful data structure that adapts to the needs of modern data analysis.
3.1.3 Data Frames in action: exploring information about the United States
In the context of your move to the United States, data frames will be essential for organizing and analyzing the information you need to make the best decision. We can use them to store and correlate various aspects of your potential new home: you might track crime rates across different states, compare the cost of living (housing, food, transportation) in target cities, analyze climate data like temperature and precipitation, or study demographics such as population age and education levels.
With this information organized in data frames, you will be able to perform deeper analyses and make more informed decisions about your move.
3.2 Creating Data Frames: Building your database for the move
Now that you know what data frames are and why they are so important in data analysis, it’s time to learn how to create them. In R, we can create data frames in different ways: importing data from external files or creating them manually.
3.2.1 Importing data from files: CSV, Excel
One of the most common ways to create data frames is by importing data from external files, such as CSV (Comma-Separated Values) or Excel files. R offers functions for reading data in several formats. For CSV files, we rely on the read_csv() function from the readr package (part of the tidyverse), which is faster and more robust than base R's read.csv(). To import a file, you simply provide its URL or file path:
``` r
library(readr)
url <- "https://dparedesi.github.io/Data-Science-with-R-book/data/student-grades.csv"
# Import data from a CSV file called "student-grades.csv"
grades <- read_csv(url)
#> Rows: 21 Columns: 9
#> ── Column specification ─────────────────────────────
#> Delimiter: ","
#> chr (3): start_date, gender, type
#> dbl (6): P1, P2, P3, P4, P5, P6
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
grades
#> # A tibble: 21 × 9
#> start_date gender type P1 P2 P3 P4 P5 P6
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 03/05/2020 female Individual Work 1 5 5 5 5 5 5
#> 2 03/05/2020 male Individual Work 1 5 5 5 5 4 5
#> 3 03/05/2020 female Individual Work 1 5 5 4 5 5 5
#> 4 03/05/2020 male Individual Work 1 5 5 5 5 5 5
#> 5 03/05/2020 male Individual Work 1 2 5 5 5 5 5
#> 6 03/05/2020 male Individual Work 1 5 4 5 1 5 5
#> 7 03/05/2020 male Individual Work 1 2 1 5 5 2 5
#> 8 03/05/2020 male Individual Work 1 5 5 5 5 5 5
#> 9 03/05/2020 male Individual Work 1 4 5 5 5 5 5
#> 10 03/05/2020 male Individual Work 1 3 4 5 5 5 5
#> # ℹ 11 more rows
```
The read_csv() function offers several arguments to customize how files are read. The col_names argument specifies whether the first row contains column names, and the locale argument controls region-specific details such as the decimal mark. For files with a separator other than a comma, readr provides read_delim(), whose delim argument sets the column separator. (The header, sep, and dec arguments serve these purposes in base R's read.csv().)
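For instance, a minimal sketch for a non-default format (the file name is hypothetical, assuming a semicolon-separated file that uses commas as decimal marks):

``` r
library(readr)
# Hypothetical European-style CSV: ";" as separator, "," as decimal mark
grades_eu <- read_delim("grades_eu.csv", delim = ";",
                        locale = locale(decimal_mark = ","))
```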
For Excel files, we use the read_excel() function from the readxl package. This function works similarly but includes specific arguments like sheet to specify which spreadsheet tab to import.
``` r
# Install the readxl package (if you don't have it installed)
install.packages("readxl")
# Load the readxl package
library(readxl)
# Import data from an Excel file called "states.xlsx"
states <- read_excel("states.xlsx")
```
3.2.2 Creating data frames manually
We can also create data frames manually, combining vectors with the data.frame() function.
``` r
# Create vectors with information about cities
cities <- c("New York", "Los Angeles", "Chicago")
states <- c("New York", "California", "Illinois")
population <- c(8.4e6, 3.9e6, 2.7e6)
# Create a data frame with city information
df_cities_simple <- data.frame(city = cities, state = states, population = population)
df_cities_simple
#> city state population
#> 1 New York New York 8400000
#> 2 Los Angeles California 3900000
#> 3     Chicago   Illinois    2700000
```

In this example, we create a data frame called df_cities_simple with three columns: city, state, and population. Each column is created from a vector. Note that the vectors must have the same length to be combined into a data frame.
3.2.3 Examples
We can use data frames to organize diverse information about our move to the United States. For example, we could create a data frame with information about different cities, including their cost of living, crime rate, and climate. We could also create a data frame with information about the different states, including their population, gross domestic product (GDP), and education system.
``` r
# Create a data frame with information about cities
df_cities <- data.frame(
city = c("New York", "Los Angeles", "Chicago", "Houston"),
state = c("New York", "California", "Illinois", "Texas"),
cost_of_living = c(3.5, 2.8, 2.5, 2.0), # In thousands of dollars
crime_rate = c(400, 350, 500, 450), # Per 100,000 inhabitants
climate = c("Temperate", "Mediterranean", "Continental", "Subtropical")
)
df_cities
#> city state cost_of_living crime_rate climate
#> 1 New York New York 3.5 400 Temperate
#> 2 Los Angeles California 2.8 350 Mediterranean
#> 3 Chicago Illinois 2.5 500 Continental
#> 4 Houston Texas 2.0 450 Subtropical
# Create a data frame with information about states
df_states <- data.frame(
state = c("California", "Texas", "Florida", "New York"),
population = c(39.2e6, 29.0e6, 21.4e6, 19.4e6),
gdp = c(3.2e12, 1.8e12, 1.1e12, 1.7e12), # In dollars
education_system = c("Good", "Regular", "Good", "Excellent")
)
df_states
#> state population gdp education_system
#> 1 California 39200000 3.2e+12 Good
#> 2 Texas 29000000 1.8e+12 Regular
#> 3 Florida 21400000 1.1e+12 Good
#> 4   New York   19400000 1.7e+12        Excellent
```

These data frames will allow us to analyze the information more efficiently and make more informed decisions about our move.
3.3 Exploring Data Frames: Discovering the secrets of your data
We have already learned to create data frames, now it is time to explore their content and discover the information they hide. R offers us various tools to examine and understand our data.
3.3.1 Accessing rows, columns, and cells
A data frame is like a map organized in rows and columns. To access the information we need, we must know how to navigate this map. R provides us with different ways to access rows, columns, and cells of a data frame.
There are several ways to access specific data within a data frame. To retrieve a column, you can use the $ operator (e.g., df_cities$state) or bracket notation with the column name in quotes (e.g., df_states["population"]). To access a specific row, use brackets with the row number followed by a comma (e.g., df_cities[3, ]). For a single cell at the intersection of a row and column, specify both indices (e.g., df_states[2, 3]). You can also filter rows based on conditions, such as extracting all cities where the cost of living is less than 3, using a logical expression inside the brackets.
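For example, using the df_cities and df_states data frames created in the previous section:

``` r
# Access the state column with the $ operator
df_cities$state

# Access the population column with bracket notation
df_states["population"]

# Access the third row (all columns)
df_cities[3, ]

# Access the cell in row 2, column 3
df_states[2, 3]

# Filter rows with a logical expression: cities with cost of living below 3
df_cities[df_cities$cost_of_living < 3, ]
```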
3.3.2 Functions for exploring data frames
R provides several useful functions for a quick overview of your data. head() displays the first six rows, while tail() shows the last six. To understand the structure, such as column names and data types, you can use str(). For a statistical summary including the mean, median, and quartiles, summary() is the go-to function. Additionally, View() opens an interactive spreadsheet-style window to browse the data.
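For example, applied to the df_cities data frame:

``` r
head(df_cities)     # first rows
tail(df_cities)     # last rows
str(df_cities)      # structure: column names and data types
summary(df_cities)  # descriptive statistics per column
View(df_cities)     # interactive viewer (in RStudio)
```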
3.3.3 Examples: exploring data frames with move information
By exploring the data frames we created in the previous section, we can obtain valuable information about US cities and states. For example, we could use summary() to get descriptive statistics of the cost of living in different cities, or View() to examine information about each state in detail.
``` r
# Get descriptive statistics of cost of living in different cities
summary(df_cities$cost_of_living)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 2.000 2.375 2.650 2.700 2.975 3.500
```

In addition to the mentioned functions, we can use other tools to explore our data frames. For example, we can use the table() function to get the frequency of each value in a categorical column, such as the climate column in the df_cities data frame.
We can also use the hist() function to create a histogram of a numeric column, such as the population column in the df_states data frame.
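For example:

``` r
# Frequency of each climate type
table(df_cities$climate)

# Histogram of state populations
hist(df_states$population)
```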

These are just some ideas of how we can explore our data frames. As you become familiar with R, you will discover new functions and techniques for analyzing and visualizing your data.
3.4 Manipulating Data Frames: Transforming your data
In the previous section, we learned to explore data frames and access the information they contain. Now, we will go a step further and learn to manipulate data frames, transforming data to answer specific questions and obtain relevant information for our move.
3.4.1 Introduction to the pipeline operator (|>)
Before modifying data frames, we will introduce a tool to write more readable and efficient code: the native pipeline operator (|>). This operator was introduced in R 4.1 (2021) as a built-in language feature, meaning it works without any additional packages.
Note: You may also encounter the %>% pipe operator from the magrittr package (part of the tidyverse). Both |> and %>% work similarly for most data analysis tasks. We use the native |> operator throughout this book as it is built into R, but %>% is still widely used in older codebases.
The pipeline operator allows us to chain several operations sequentially. Instead of writing nested code, we can use the pipeline operator to “pass” the result of one operation to the next.
To use additional data manipulation functions, we’ll load the tidyverse package, which includes dplyr - a package with many useful functions for working with data frames.
A package in R is like a toolbox with additional functions and data for performing specific tasks. To use a package’s functions, we must first install it and then load it into our working environment.
To install the tidyverse package, we can use the following instruction in the R console:
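``` r
install.packages("tidyverse")
```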
This will install tidyverse and all the packages it contains, including dplyr. Once the package is installed, we can load it with the library() function:
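``` r
library(tidyverse)
```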
Now we can use the pipeline operator (|>) and functions from dplyr.
For example, we’ll use the murders dataset from the dslabs package. This dataset contains gun murder data by US state in 2010, including variables like state name, abbreviation, region, population, and total murders. Let’s use a pipeline to view selected columns:
``` r
# Load library and dataset
library(dslabs)
data(murders)
# Pipeline
murders |> select(state, population, total)
#> state population total
#> 1 Alabama 4779736 135
#> 2 Alaska 710231 19
#> 3 Arizona 6392017 232
#> 4 Arkansas 2915918 93
#> 5 California 37253956 1257
#> 6 Colorado 5029196 65
#> 7 Connecticut 3574097 97
#> 8 Delaware 897934 38
#> 9 District of Columbia 601723 99
#> 10 Florida 19687653 669
#> 11 Georgia 9920000 376
#> 12 Hawaii 1360301 7
#> 13 Idaho 1567582 12
#> 14 Illinois 12830632 364
#> 15 Indiana 6483802 142
#> 16 Iowa 3046355 21
#> 17 Kansas 2853118 63
#> 18 Kentucky 4339367 116
#> 19 Louisiana 4533372 351
#> 20 Maine 1328361 11
#> 21 Maryland 5773552 293
#> 22 Massachusetts 6547629 118
#> 23 Michigan 9883640 413
#> 24 Minnesota 5303925 53
#> 25 Mississippi 2967297 120
#> 26 Missouri 5988927 321
#> 27 Montana 989415 12
#> 28 Nebraska 1826341 32
#> 29 Nevada 2700551 84
#> 30 New Hampshire 1316470 5
#> 31 New Jersey 8791894 246
#> 32 New Mexico 2059179 67
#> 33 New York 19378102 517
#> 34 North Carolina 9535483 286
#> 35 North Dakota 672591 4
#> 36 Ohio 11536504 310
#> 37 Oklahoma 3751351 111
#> 38 Oregon 3831074 36
#> 39 Pennsylvania 12702379 457
#> 40 Rhode Island 1052567 16
#> 41 South Carolina 4625364 207
#> 42 South Dakota 814180 8
#> 43 Tennessee 6346105 219
#> 44 Texas 25145561 805
#> 45 Utah 2763885 22
#> 46 Vermont 625741 2
#> 47 Virginia 8001024 250
#> 48 Washington 6724540 93
#> 49 West Virginia 1852994 27
#> 50 Wisconsin 5686986 97
#> 51 Wyoming 563626 5
```

Code written with the pipeline operator is easier to read and understand, as it follows the natural flow of operations. Note that the pipeline creates a view of the data; it does not modify the murders data frame.
We can show the first rows using the head() function:
``` r
head(murders |> select(state, population, total))
#> state population total
#> 1 Alabama 4779736 135
#> 2 Alaska 710231 19
#> 3 Arizona 6392017 232
#> 4 Arkansas 2915918 93
#> 5 California 37253956 1257
#> 6 Colorado 5029196 65
```

We can also use the pipeline operator to show the first rows:
``` r
murders |> select(state, population, total) |> head()
#> state population total
#> 1 Alabama 4779736 135
#> 2 Alaska 710231 19
#> 3 Arizona 6392017 232
#> 4 Arkansas 2915918 93
#> 5 California 37253956 1257
#> 6 Colorado 5029196 65
```

For better readability, we will use one function per line, obtaining the same result:
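``` r
murders |>
  select(state, population, total) |>
  head()
```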
3.4.2 Transforming a table with mutate()
We can create new columns or modify existing ones using the mutate() function. For example, to add a column with the homicide rate per 100,000 inhabitants to the murders data frame:
``` r
murders |>
  mutate(ratio = total / population * 100000) |>
  head()
#> state abb region population total ratio
#> 1 Alabama AL South 4779736 135 2.824424
#> 2 Alaska AK West 710231 19 2.675186
#> 3 Arizona AZ West 6392017 232 3.629527
#> 4 Arkansas AR South 2915918 93 3.189390
#> 5 California CA West 37253956 1257 3.374138
#> 6 Colorado CO West 5029196 65 1.292453
```

This creates a view with the additional ratio column.
If we want to modify the murders data frame directly, we use the assignment operator <-:
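``` r
murders <- murders |>
  mutate(ratio = total / population * 100000)
```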
3.4.3 Filtering data: selecting cities that interest you
We can filter rows meeting a condition using the filter() function. For example, to get states with less than 1 homicide per 100,000 inhabitants:
``` r
# Load dataset
data(murders)

murders |>
  mutate(ratio = total / population * 100000) |>
  filter(ratio < 1)
#> state abb region population total ratio
#> 1 Hawaii HI West 1360301 7 0.5145920
#> 2 Idaho ID West 1567582 12 0.7655102
#> 3 Iowa IA North Central 3046355 21 0.6893484
#> 4 Maine ME Northeast 1328361 11 0.8280881
#> 5 Minnesota MN North Central 5303925 53 0.9992600
#> 6 New Hampshire NH Northeast 1316470 5 0.3798036
#> 7 North Dakota ND North Central 672591 4 0.5947151
#> 8 Oregon OR West 3831074 36 0.9396843
#> 9 South Dakota SD North Central 814180 8 0.9825837
#> 10 Utah UT West 2763885 22 0.7959810
#> 11 Vermont VT Northeast 625741 2 0.3196211
#> 12 Wyoming WY West 563626 5 0.8871131
```

We can use different operators to create our conditions:
R supports the standard comparison operators for building conditions: greater than (>), less than (<), greater than or equal to (>=), less than or equal to (<=), equal to (==), and not equal to (!=). You can combine multiple conditions using the logical operators & (AND), | (OR), and ! (NOT).
For example, to filter by ratio less than 1 and West region:
``` r
murders |>
  mutate(ratio = total / population * 100000) |>
  filter(ratio < 1 & region == "West")
#> state abb region population total ratio
#> 1 Hawaii HI West 1360301 7 0.5145920
#> 2 Idaho ID West 1567582 12 0.7655102
#> 3 Oregon OR West 3831074 36 0.9396843
#> 4 Utah UT West 2763885 22 0.7959810
#> 5 Wyoming WY West 563626 5 0.8871131
```

3.4.4 Sorting data: finding the safest cities
The arrange() function from the dplyr package allows us to order the rows of a data frame based on one or more columns. Imagine you have a data frame with information about different cities, and you want to order them from safest to least safe, based on their crime rate. Or perhaps you want to order them by cost of living, from cheapest to most expensive. arrange() allows you to do this easily.
For example, to order states by homicide rate (from lowest to highest):
``` r
murders |>
  mutate(ratio = total / population * 100000) |>
  arrange(ratio) |>
  head()
#> state abb region population total ratio
#> 1 Vermont VT Northeast 625741 2 0.3196211
#> 2 New Hampshire NH Northeast 1316470 5 0.3798036
#> 3 Hawaii HI West 1360301 7 0.5145920
#> 4 North Dakota ND North Central 672591 4 0.5947151
#> 5 Iowa IA North Central 3046355 21 0.6893484
#> 6 Idaho ID West 1567582 12 0.7655102
```

If we want to sort in descending order, we use the desc() function:
``` r
murders |>
  mutate(ratio = total / population * 100000) |>
  arrange(desc(ratio)) |>
  head()
#> state abb region population total ratio
#> 1 District of Columbia DC South 601723 99 16.452753
#> 2 Louisiana LA South 4533372 351 7.742581
#> 3 Missouri MO North Central 5988927 321 5.359892
#> 4 Maryland MD South 5773552 293 5.074866
#> 5 South Carolina SC South 4625364 207 4.475323
#> 6 Delaware DE South 897934 38 4.231937
```

We can also sort by multiple columns. For example, if we want to sort first by region and then by state (in alphabetical order):
``` r
murders |>
  arrange(region, state) |>
  head()
#> state abb region population total
#> 1 Connecticut CT Northeast 3574097 97
#> 2 Maine ME Northeast 1328361 11
#> 3 Massachusetts MA Northeast 6547629 118
#> 4 New Hampshire NH Northeast 1316470 5
#> 5 New Jersey NJ Northeast 8791894 246
#> 6 New York NY Northeast 19378102 517
```

3.4.5 Aggregating and summarizing data: obtaining a general overview
The summarize() function from the dplyr package allows us to calculate descriptive statistics for one or more columns of a data frame. It’s like summarizing information from our data frame into a single number or a set of numbers.
For example, to calculate the mean population of states:
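``` r
murders |>
  summarize(mean_population = mean(population))
```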
We can combine summarize() with group_by() to calculate statistics by groups. For example, to calculate average population by region:
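``` r
murders |>
  group_by(region) |>
  summarize(avg_population = mean(population))
```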
3.4.6 Joining data frames: combining information
Imagine you have two data frames: one with information about cities (name, population, etc.) and another with information about the states those cities belong to (state name, governor, etc.). If you want to combine information from both data frames to have a single data frame with all information about cities and their states, you can use dplyr join functions.
dplyr offers several functions for joining data frames, such as left_join(), right_join(), inner_join(), and full_join(). Each function performs a different type of join, depending on how data frame rows are combined.
The left_join() function joins two data frames keeping all rows from the first data frame (the one on the left) and adding columns from the second data frame that match the first data frame’s rows. If a row from the first data frame has no match in the second data frame, new columns will have NA values.
For example, if we have a data frame with city information and another with state information, we can join them by the state column:
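``` r
df_cities_states <- left_join(df_cities, df_states, by = "state")
```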
The resulting data frame df_cities_states will contain information from both data frames combined. If a city in df_cities does not have a corresponding state in df_states, columns from df_states will have NA values for that city.
Let’s see a concrete example. Suppose we have the following data frames:
``` r
df_cities <- data.frame(
  city = c("New York", "Los Angeles", "Chicago", "Houston"),
  state = c("New York", "California", "Illinois", "Texas")
)

df_states <- data.frame(
  state = c("California", "Texas", "Florida"),
  governor = c("Gavin Newsom", "Greg Abbott", "Ron DeSantis")
)

# Join data frames by "state" column
df_cities_states <- left_join(df_cities, df_states, by = "state")
df_cities_states
#> city state governor
#> 1 New York New York <NA>
#> 2 Los Angeles California Gavin Newsom
#> 3 Chicago Illinois <NA>
#> 4     Houston California       Greg Abbott
```

In this example, left_join() combines the df_cities and df_states data frames by the state column. Note that the cities "New York" and "Chicago" have NA values in the governor column, since their states ("New York" and "Illinois") are not present in the df_states data frame.
The other join functions work similarly but with different inclusion criteria. right_join() does the opposite of left_join(), keeping all rows from the right data frame and only matching rows from the left. inner_join() is more restrictive, keeping only rows that have matches in both tables, while full_join() is the most inclusive, retaining all rows from both data frames and filling in NA where no match exists.
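For example, with the same df_cities and df_states defined above:

``` r
# Keep only cities whose state appears in df_states
inner_join(df_cities, df_states, by = "state")

# Keep all rows from both data frames, with NA where there is no match
full_join(df_cities, df_states, by = "state")
```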
You can consult dplyr documentation for more information about these functions.
3.4.7 Examples
The dplyr functions we have seen allow us to perform complex data transformations to answer specific questions about our move to the United States. Let’s see some examples with R code:
Examples of analysis questions
We can combine these tools to answer specific questions. To find suitable locations, we might filter for cities with a "Good" education system and a cost of living below 2.5. Alternatively, to study economic prosperity, we could sort states by their GDP per capita (calculated as GDP divided by population) in descending order, as sketched below. For a more comprehensive climate analysis, we could join our city data with a separate climate table.
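For instance, a minimal sketch of the GDP per capita question, assuming the df_states data frame from section 3.2.3 (with its gdp and population columns):

``` r
df_states |>
  mutate(gdp_per_capita = gdp / population) |>
  arrange(desc(gdp_per_capita))
```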
With these tools, you will be able to explore and analyze information about the United States to make the best decision about your move.
3.5 Exercises
- Report the state abbreviation (abb) and the population (population) columns from the murders data frame.
- Report all rows of the data frame that are not from the South region.

If we want to filter all records that are from the South and West regions, we will use %in% instead of == to compare against a vector.

- Create the vector south_and_west containing the values "South" and "West". Then filter the records that are from those two regions.
- Add the ratio column to the murders data frame with the murder ratio per 100,000 inhabitants. Then, filter those with a ratio less than 0.5 that are from the "South" and "West" regions. Report the state, abb, and ratio columns.
Solution
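A possible solution for the last exercise:

``` r
south_and_west <- c("South", "West")

murders |>
  mutate(ratio = total / population * 100000) |>
  filter(ratio < 0.5 & region %in% south_and_west) |>
  select(state, abb, ratio)
```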
To sort using the pipeline we use the arrange(x) function, where x is the name of the column we want to take as reference, sorting in ascending order, or arrange(desc(x)) to sort in descending order.
- Modify the code generated in the previous exercise to sort the result by the ratio field.
Solution
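A possible solution, adding arrange() at the end of the previous pipeline:

``` r
murders |>
  mutate(ratio = total / population * 100000) |>
  filter(ratio < 0.5 & region %in% south_and_west) |>
  select(state, abb, ratio) |>
  arrange(ratio)
```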
So, finally, we know which states are options for the move, resolving the presented case.

3.6 Data frames in plots
Now we will look at some functions that allow us to visualize our data. Little by little we will build more complex and more polished graphs. First, let's see the most basic plotting functions R offers. In the next chapter we will cover plot types in more detail, along with the situations in which each is recommended.
3.6.1 Scatter plots
One of the most used plots in R is the scatter plot, a type of mathematical diagram that uses Cartesian coordinates to show the values of two variables for a set of data (Jarrell 1994, 492). By default we assume the two variables are independent; the scatter plot then shows the degree of correlation (not causality) between them.
The simplest way to draw a scatter plot is with the plot(x, y) function, where x and y are vectors containing the x-axis and y-axis coordinates of each point we want to plot. For example, let's look at the relationship between population size and total murders.
``` r
# Let's store population data in the x_axis object
x_axis <- murders$population
# Let's store total murders data in the y_axis object
y_axis <- murders$total
# With this code we create the scatter plot
plot(x_axis, y_axis)
```
We can see a correlation between population and number of cases. Let's transform x_axis by dividing by one million (\({10}^6\)), so the x-axis is expressed in millions:
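``` r
plot(x_axis / 10^6, y_axis)
```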

3.7 Data interpretation
We have seen graphs that can be generated with a single line of code, but we still need to interpret them. To do so, we need to learn, or recall, some statistics. Throughout this book we will learn statistical concepts not by going deep into the math, but from a practical standpoint, leveraging functions that already exist in R.
Let's remember our case/problem. We have the murder count for each of the 50 states plus the District of Columbia (51 rows in total). If we order them by the total column we would have:
``` r
murders |>
  arrange(total) |>
  head()
#> state abb region population total ratio
#> 1 Vermont VT Northeast 625741 2 0.3196211
#> 2 North Dakota ND North Central 672591 4 0.5947151
#> 3 New Hampshire NH Northeast 1316470 5 0.3798036
#> 4 Wyoming WY West 563626 5 0.8871131
#> 5 Hawaii HI West 1360301 7 0.5145920
#> 6 South Dakota SD North Central 814180 8 0.9825837
```

R provides us with the summary() function, which gives us a summary of a vector's data.
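For example, applied to the total column (the quartile values here follow from the data listed in the next section):

``` r
summary(murders$total)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>     2.0    24.5    97.0   184.4   268.0  1257.0
```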
The summary provides key insights: the Min and Max show the range of the data; the 1st Qu (first quartile) and 3rd Qu (third quartile) indicate the 25th and 75th percentiles; the Median marks the exact middle of the distribution; and the Mean gives the arithmetic average.
3.7.1 Quartiles
To understand quartiles, let's visualize the total data in order. To extract a single column at the end of a pipeline, we use the pull() function:
``` r
murders |>
  arrange(total) |>
  pull(total)
#> [1] 2 4 5 5 7 8 11 12 12 16 19 21 22 27 32
#> [16] 36 38 53 63 65 67 84 93 93 97 97 99 111 116 118
#> [31] 120 135 142 207 219 232 246 250 286 293 310 321 351 364 376
#> [46] 413 457 517 669 805 1257
```

Quartiles divide our vector into 4 parts with the same amount of data. Given that we have 51 values, 51/4 = 12.75, so the groups cannot all be the same size: we would have three groups of 13 elements and one group of 12.
For example, the first group would be composed of these numbers:
#> [1] 2 4 5 5 7 8 11 12 12 16 19 21 22
The second group would be composed of these numbers:
#> [1] 27 32 36 38 53 63 65 67 84 93 93 97 97
And so on: in total, 4 groups, each containing 25% of the data.
3.7.1.1 First quartile
Therefore, when we see the first quartile, 1st Qu., think of it as the cut-off up to which we find 25% of the data.
In our example, 24.5 indicates that every number less than or equal to it falls within the first 25% of the data (25% of 51 data points = 12.75, about 13 data points).
If we list numbers less than or equal to 24.5 we will have this list:
``` r
murders |>
  arrange(total) |>
  filter(total <= 24.5) |>
  pull(total)
#> [1] 2 4 5 5 7 8 11 12 12 16 19 21 22
```

This is exactly the same list we obtained previously for the first group.
3.7.1.2 Second quartile or median
The second quartile, also called the median (Median), indicates the cut-off of the second group. The first group contains the first 25% of the data, and the second group adds another 25%, so this cut-off gives us exactly the value found in the middle of the distribution.
In our example, 97 indicates that at or below that number we find 50% of the data (50% of 51 data points = 25.5, about 26 data points).
3.7.1.3 Third quartile
The third quartile is the cut-off of the third group. Up to the median we already had 50% of the data; adding another 25% brings us to 75%.
In our example, 268 indicates that at or below that number we find 75% of the data (75% of 51 data points = 38.25, about 38 data points).
3.7.2 Interpretation of box plot
We are now ready to create a box plot with total murders and interpret results.
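A minimal way to generate the plot below, using the base boxplot() function:

``` r
boxplot(murders$total)
```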

The box starts at value 24.5 (first quartile) and ends at value 268 (third quartile). The thick line represents the median (second quartile), 97 in our example.
Between the first quartile and the third quartile (between 24.5 and 268 in our example) we find the middle 50% of the data; this span is also called the interquartile range, or IQR.
Outside the box we see a vertical line extending upwards and another downwards, showing the range of the rest of the data. Beyond those lines we see dots: atypical values very far from the rest of the distribution, known as outliers.
We can quickly find which states these extreme values belong to by sorting the table in descending order using the desc() function:
``` r
murders |>
  arrange(desc(total)) |>
  head()
#> state abb region population total ratio
#> 1 California CA West 37253956 1257 3.374138
#> 2 Texas TX South 25145561 805 3.201360
#> 3 Florida FL South 19687653 669 3.398069
#> 4 New York NY Northeast 19378102 517 2.667960
#> 5 Pennsylvania PA Northeast 12702379 457 3.597751
#> 6 Michigan MI North Central 9883640 413 4.178622
```

We see that 1257 cases were reported in California. That is one of the extreme data points we see in the box plot.
3.7.3 Examples
- Create the variable pop_log10 and store the log base 10 of the population (using the log10() function). Perform the same log base 10 transformation for total murders and store it in the variable tot_log10. Generate a scatter plot of these two variables.

- Create a histogram of population in millions (divided by \({10}^6\)).

- Create a box plot of population.

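One possible way to produce the three plots above (a sketch using base R plotting functions):

``` r
# Scatter plot of log-transformed population vs. total murders
pop_log10 <- log10(murders$population)
tot_log10 <- log10(murders$total)
plot(pop_log10, tot_log10)

# Histogram of population in millions
hist(murders$population / 10^6)

# Box plot of population
boxplot(murders$population)
```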
3.8 Exercises
Below, you will find a series of exercises with different levels of difficulty. It is time to put into practice what you have learned in this chapter. Remember you can use dplyr functions like filter(), arrange(), mutate(), summarize(), group_by() and left_join() to manipulate data frames.
- Create a data frame called my_expenses. It should contain a category factor with levels "Housing", "Transport", "Food", and "Entertainment", along with three numeric columns (january, february, march) recording expenses for each category.
Solution
``` r
my_expenses <- data.frame(
  category = factor(c("Housing", "Transport", "Food", "Entertainment")),
  january = c(1500, 300, 500, 200),
  february = c(1500, 250, 400, 150),
  march = c(1500, 350, 550, 250)
)
my_expenses
#> category january february march
#> 1 Housing 1500 1500 1500
#> 2 Transport 300 250 350
#> 3 Food 500 400 550
#> 4 Entertainment 200 150 250
```

- Use the head(), tail(), str(), and summary() functions to explore the my_expenses data frame.
Solution
``` r
head(my_expenses)
#> category january february march
#> 1 Housing 1500 1500 1500
#> 2 Transport 300 250 350
#> 3 Food 500 400 550
#> 4 Entertainment 200 150 250
tail(my_expenses)
#> category january february march
#> 1 Housing 1500 1500 1500
#> 2 Transport 300 250 350
#> 3 Food 500 400 550
#> 4 Entertainment 200 150 250
str(my_expenses)
#> 'data.frame': 4 obs. of 4 variables:
#> $ category: Factor w/ 4 levels "Entertainment",..: 3 4 2 1
#> $ january : num 1500 300 500 200
#> $ february: num 1500 250 400 150
#> $ march : num 1500 350 550 250
summary(my_expenses)
#> category january february march
#> Entertainment:1 Min. : 200 Min. : 150 Min. : 250.0
#> Food :1 1st Qu.: 275 1st Qu.: 225 1st Qu.: 325.0
#> Housing :1 Median : 400 Median : 325 Median : 450.0
#> Transport :1 Mean : 625 Mean : 575 Mean : 662.5
#> 3rd Qu.: 750 3rd Qu.: 675 3rd Qu.: 787.5
#> Max. :1500 Max. :1500 Max. :1500.0
```

- Access the february column of the my_expenses data frame using the $ operator. Then, access the second row of the data frame using brackets.
Solution
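A possible solution:

``` r
# Access the february column with the $ operator
my_expenses$february

# Access the second row with brackets
my_expenses[2, ]
```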
- Filter the my_expenses data frame to get only the rows where expenses in january are greater than 400.
Solution
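A possible solution:

``` r
my_expenses |>
  filter(january > 400)
```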
- Sort the my_expenses data frame in descending order by expenses in march.
Solution
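A possible solution:

``` r
my_expenses |>
  arrange(desc(march))
```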
- Add a column called total to the my_expenses data frame containing the sum of January, February, and March expenses for each category.
- Calculate the mean and standard deviation of total expenses for each category in the my_expenses data frame.
Solution
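A possible solution, interpreting the task as the mean and standard deviation of the total column across categories:

``` r
my_expenses |>
  mutate(total = january + february + march) |>
  summarize(mean_total = mean(total),
            sd_total = sd(total))
```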
- Group the my_expenses data frame by category and calculate the sum of expenses for each month.
Solution
``` r
my_expenses |>
  group_by(category) |>
  summarize(sum_january = sum(january),
            sum_february = sum(february),
            sum_march = sum(march))
#> # A tibble: 4 × 4
#> category sum_january sum_february sum_march
#> <fct> <dbl> <dbl> <dbl>
#> 1 Entertainment 200 150 250
#> 2 Food 500 400 550
#> 3 Housing 1500 1500 1500
#> 4 Transport 300 250 350
```

- Visually analyze the following chart describing the distribution of total murders by region. Just by looking at it, could you point out which region has the smallest data range, ignoring outliers? Which region has the highest median?

Solution
West has the smallest data range and has two outliers. South has the highest median among all regions.
Analyzing a chart purely visually puts us in the final observer's shoes and helps us understand whether decisions can be made with just the information presented.

- Create a south vector storing the filtered total murders that occurred in the South region. Then, create a histogram of the south vector.
- Create a new data frame called df_cities_climate combining information from df_cities and df_climate (you must create the df_climate data frame with city climate information). Ensure the resulting data frame contains all cities from df_cities, even if they don't have climate information in df_climate.
Solution
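A possible solution; the df_climate contents here are hypothetical:

``` r
# Hypothetical climate data (no entry for Houston on purpose)
df_climate <- data.frame(
  city = c("New York", "Los Angeles", "Chicago"),
  avg_temperature = c(13, 19, 10)  # illustrative values, in degrees Celsius
)

# left_join() keeps all cities from df_cities
df_cities_climate <- left_join(df_cities, df_climate, by = "city")
df_cities_climate
```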
- Create a data frame with some missing values (NA). Then, replace the missing values with the mean of the non-missing values in the same column.
Solution
``` r
# Create data frame with missing values
df_with_na <- data.frame(
  x = c(1, 2, NA, 4, 5),
  y = c(NA, 7, 8, NA, 10)
)
df_with_na
#> x y
#> 1 1 NA
#> 2 2 7
#> 3 NA 8
#> 4 4 NA
#> 5 5 10
# Replace missing values with mean
df_with_na <- df_with_na |>
mutate(x = ifelse(is.na(x), mean(x, na.rm = TRUE), x),
y = ifelse(is.na(y), mean(y, na.rm = TRUE), y))
df_with_na
#> x y
#> 1 1 8.333333
#> 2 2 7.000000
#> 3 3 8.000000
#> 4 4 8.333333
#> 5 5 10.000000
```

- Create a function called clean_data_frame() that receives a data frame as an argument and replaces missing values with the mean of the non-missing values in each column.
Solution
``` r
clean_data_frame <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      df[[col]] <- ifelse(is.na(df[[col]]),
                          mean(df[[col]], na.rm = TRUE),
                          df[[col]])
    }
  }
  return(df)
}
## Test created function
# Create data frame with missing values to test function
df_test <- data.frame(
age = c(25, 30, NA, 28, 35),
height = c(1.75, 1.80, 1.65, NA, 1.70),
weight = c(70, 80, 75, 65, NA)
)
df_test
#> age height weight
#> 1 25 1.75 70
#> 2 30 1.80 80
#> 3 NA 1.65 75
#> 4 28 NA 65
#> 5 35 1.70 NA
# Apply function to test data frame
df_clean <- clean_data_frame(df_test)
# Show clean data frame
df_clean
#> age height weight
#> 1 25.0 1.750 70.0
#> 2 30.0 1.800 80.0
#> 3 29.5 1.650 75.0
#> 4 28.0 1.725 65.0
#> 5 35.0 1.700 72.5
```
