Chapter 13 String processing and text mining

13.1 Basic functions

We have already learned how to import and consolidate data. However, that data is rarely ready to work with as-is. We have to validate it through string processing and guarantee a minimum level of quality before we can perform our analyses.

For example, in the previous chapter we imported data from Wikipedia, but we did not stop to check whether we could actually perform operations or build visualizations with it.

library(rvest)
library(tidyverse) # also loads tibble, so as_tibble() below is available
url <- "https://es.wikipedia.org/wiki/Anexo:Pa%C3%ADses_hispanos_por_poblaci%C3%B3n"
#url <- "https://es.wikipedia.org/wiki/Distribuci%C3%B3n_geogr%C3%A1fica_del_idioma_espa%C3%B1ol" #as a back up URL
html_data <- read_html(url)

web_tables <- html_data |>
  html_nodes("body") |>
  html_nodes("table")

raw_table <- web_tables[[2]] |>
  html_table()

raw_table <- raw_table |> 
  setNames(c("N", "country", "population", "prop_population", "avg_change", "link")) 

raw_table <- raw_table |>
  as_tibble()

raw_table |> head(5)

We may not notice it at first glance, but some columns contain spaces or commas where there should be numbers. We can validate this not only by checking the class of the column, but also by trying to calculate the mean of that variable.

class(raw_table$population)

mean(raw_table$population)

Nor can we convert directly to numbers, because the white spaces and commas are non-numeric characters and the coercion produces NAs.

as.numeric(raw_table$population)

String-processing needs are so frequent, and the possible use cases so varied, that the tidyverse already includes multiple functions for working with strings. Likewise, there is usually more than one way to process a string; the right approach always depends on the state of the raw data.
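
For a first taste of that toolkit, here is a small sketch of a few common stringr helpers applied to a toy vector (fruit_vec is just an illustrative name):

library(stringr)

fruit_vec <- c("  Apple ", "banana", "Cherry pie")

str_detect(fruit_vec, "an") # does each element contain "an"?
#> [1] FALSE  TRUE FALSE
str_trim(fruit_vec)         # remove leading/trailing whitespace
str_to_lower(fruit_vec)     # normalize case
str_length(fruit_vec)       # number of characters per element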

13.1.1 Replacing characters

One of the basic operations we will use most often is replacing characters. We apply it when we are sure the change will not compromise the rest of the data. Our column contains both spaces and commas, so we can start by replacing one of the two to normalize the values, using the str_replace_all(string, pattern, replacement) function. For the pattern argument we will use \\s, which matches any whitespace character (the s comes from space). We will first learn to modify the data stored in a vector and then replicate the change across our entire table.

library(tidyverse)
library(stringr)

population_vector <- raw_table$population

population_vector <- str_replace_all(population_vector, "\\s", ",")

population_vector

We deliberately made all the values comma-separated because now we can simply use the parse_number(vector) function. It not only drops grouping characters such as commas, but also ignores any non-numeric characters before the first number (which helps with monetary values), and it converts the result from character to numeric.

population_vector <- parse_number(population_vector)

# Additional example in case we had a monetary value:
parse_number("$345,153")

This vector now allows us to perform mathematical operations or visualization of the distribution.

# Convert to millions
population_vector <- population_vector/10^6

# We remove the last value which is the world population:
length_val <- length(population_vector)
population_vector <- population_vector[-length_val]

# Visualization
boxplot(population_vector)

We now know which functions transform the fields in our case; however, we have applied them only to vectors. To mutate the columns of our raw table we will use mutate(across(columns, function)) together with the pipe operator |>. Let's apply the space-to-comma replacement, not only to column 3, population, but also to column 5, average change.

raw_table |> 
  mutate(across(c(3,5), ~str_replace_all(., "\\s", ",")))

Inside str_replace_all we removed the string argument and replaced it with a dot (.). The ~ turns the call into an anonymous function, and the dot stands for each of the columns in c(3, 5) as across() evaluates them one by one.
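
Equivalently, we could refer to the columns by name rather than by position, which is more robust if the column order ever changes; a quick sketch using the names we assigned earlier with setNames():

raw_table |> 
  mutate(across(c(population, avg_change), ~str_replace_all(., "\\s", ",")))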

Now, let’s apply the parse_number function that we applied previously.

raw_table |> 
  mutate(across(c(3,5), ~str_replace_all(., "\\s", ","))) |> 
  mutate(across(c(3,5), ~parse_number(.)))

13.2 Regular expressions

A regular expression (or regex for short) is a pattern that describes a set of strings. We already used a regex in the previous section, with the single pattern \\s. Usually, however, we will face many more use cases, requiring patterns that can match a wider range of strings.
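
As a quick illustration of how one pattern describes a whole set of strings, consider a toy vector and two patterns we will meet in this section:

test_strings <- c("abc", "a1c", "123", "a c")

str_detect(test_strings, "\\d") # contains at least one digit
#> [1] FALSE  TRUE  TRUE FALSE
str_detect(test_strings, "\\s") # contains a whitespace character
#> [1] FALSE FALSE FALSE  TRUE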

Although we could go through every construct available in the documentation, we learn faster through use cases. Let's analyze one that will let us pick up several patterns little by little.

In the dslabs library we previously found and used heights, the heights of students from a university expressed in inches.

library(dslabs)
library(tidyverse)
data(heights)

heights |> 
  head(10)
#>       sex height
#> 1    Male     75
#> 2    Male     70
#> 3    Male     68
#> 4    Male     74
#> 5    Male     61
#> 6  Female     65
#> 7  Female     66
#> 8  Female     62
#> 9  Female     66
#> 10   Male     67

These data were ready to analyze, but that is not how they arrived from the source. The students had to fill out a survey and, even though they were asked for their height in inches, they answered in inches, feet, centimeters, words, and so on. We can see the original form responses in the reported_heights data frame.

reported_heights |> 
  head(10)
#>             time_stamp    sex height
#> 1  2014-09-02 13:40:36   Male     75
#> 2  2014-09-02 13:46:59   Male     70
#> 3  2014-09-02 13:59:20   Male     68
#> 4  2014-09-02 14:51:53   Male     74
#> 5  2014-09-02 15:16:15   Male     61
#> 6  2014-09-02 15:16:16 Female     65
#> 7  2014-09-02 15:16:19 Female     66
#> 8  2014-09-02 15:16:21 Female     62
#> 9  2014-09-02 15:16:21 Female     66
#> 10 2014-09-02 15:16:22   Male     67

Although we might assume that everyone entered the data correctly, we should not take that on trust; it is always better to validate the quality of our data. There are multiple ways to do so, as we can see below:

heights <- reported_heights$height

# Validation option 1: Random sample
sample(heights, 100)
#>   [1] "71"       "67"       "66"       "64"       "70"       "70"      
#>   [7] "76"       "1"        "59"       "67.2"     "69"       "67"      
#>  [13] "67"       "6"        "62"       "69"       "74"       "69"      
#>  [19] "178"      "69"       "74"       "169"      "67"       "68.5"    
#>  [25] "68.5"     "71"       "68"       "158"      "6.1"      "708,661" 
#>  [31] "6.2"      "69"       "75"       "5'6"      "67"       "68.4"    
#>  [37] "75"       "5.7"      "72"       "77"       "75"       "68"      
#>  [43] "69"       "72"       "62"       "65"       "73"       "67"      
#>  [49] "67"       "5.2"      "67.71"    "67"       "5'3"      "66"      
#>  [55] "5'7.5''"  "69"       "65"       "150"      "69"       "5'11''"  
#>  [61] "68.11024" "175"      "152"      "5' 10"    "65"       "74.5"    
#>  [67] "70"       "72"       "73.22"    "63"       "5'9''"    "68.5"    
#>  [73] "74"       "74"       "5'7\""    "6'3\""    "67"       "73.22"   
#>  [79] "74"       "72.8346"  "67.72"    "175"      "5.69"     "69.3"    
#>  [85] "5'10''"   "72"       "60"       "68"       "69"       "73"      
#>  [91] "75"       "70"       "64"       "170"      "649,606"  "73"      
#>  [97] "58"       "60"       "174"      "64.173"

# Validation option 2: convert to numbers and count if there are NAs
x <- as.numeric(heights)
#> Warning: NAs introduced by coercion
sum(is.na(x))
#> [1] 81

# Validation option 3: add column of those that cannot be converted to number:
reported_heights |> 
  mutate(numeric_height = as.numeric(height)) |> 
  filter(is.na(numeric_height)) |> 
  head(10)
#> Warning: There was 1 warning in `mutate()`.
#> ℹ In argument: `numeric_height = as.numeric(height)`.
#> Caused by warning:
#> ! NAs introduced by coercion
#>             time_stamp    sex                 height numeric_height
#> 1  2014-09-02 15:16:28   Male                  5' 4"             NA
#> 2  2014-09-02 15:16:37 Female                  165cm             NA
#> 3  2014-09-02 15:16:52   Male                    5'7             NA
#> 4  2014-09-02 15:16:56   Male                  >9000             NA
#> 5  2014-09-02 15:16:56   Male                   5'7"             NA
#> 6  2014-09-02 15:17:09 Female                   5'3"             NA
#> 7  2014-09-02 15:18:00   Male 5 feet and 8.11 inches             NA
#> 8  2014-09-02 15:19:48   Male                   5'11             NA
#> 9  2014-09-04 00:46:45   Male                  5'9''             NA
#> 10 2014-09-04 10:29:44   Male                 5'10''             NA

We might be tempted to simply eliminate these NA values, since they are a small share of the 1,095 data points. However, several of them follow a well-defined pattern and, instead of being discarded, can be converted to the scale used by the rest of the data. For example, some people entered their height as 5'7"; for those who remember the conversion, 1 foot is 12 inches, so \(5*12+7=67\). We can detect patterns like that one, but we must again be careful to detect the exact pattern and not one so generic that it alters other cases. If everyone followed the same pattern \(x'y''\) or \(x'y\), converting to inches would simply be \(x*12+y\).
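
As a sanity check, we can reproduce that arithmetic for the 5'7" example directly in R:

feet <- 5
inches <- 7
feet * 12 + inches
#> [1] 67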

Let's start by extracting our column into a character vector holding every value that either does not convert automatically to a number or appears to have been entered in feet rather than inches. We flag the latter when the value is between 5 and 7 (about 1.5 m to 2.1 m if read as feet). After that we will build the transformations step by step.

problematic_heights <- reported_heights |> 
  filter(is.na(as.numeric(height)) | # Does not convert to number
         (!is.na(as.numeric(height)) & as.numeric(height) >= 5 &
            as.numeric(height) <= 7 ) # or entered in feet and not inches
        ) |> 
  pull(height)

length(problematic_heights)
#> [1] 168

Adding the condition of values entered in feet, we have 168 problematic entries. We cannot ignore 15.3% of the data.
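
We can verify that percentage directly, given that reported_heights has 1,095 rows:

length(problematic_heights) / nrow(reported_heights)
#> [1] 0.1534247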

We will use str_view() to visualize matches. This function is extremely helpful when debugging regular expressions as it highlights exactly what is matching your pattern.

# Let's visualize entries containing "feet"
str_view(problematic_heights, "feet", match=TRUE)
#>  [10] │ 5 <feet> and 8.11 inches
#>  [82] │ 5 <feet> 7inches
#> [140] │ 5 <feet> 6 inches

We can also use str_detect(string, pattern) to get a logical value (TRUE/FALSE) to filter our vector.

index <- str_detect(problematic_heights, "feet")

problematic_heights[index] # Match the pattern
#> [1] "5 feet and 8.11 inches" "5 feet 7inches"         "5 feet 6 inches"
problematic_heights[!index] |> # Do not match the pattern
  head(40) 
#>  [1] "6"                      "5' 4\""                 "5.3"                   
#>  [4] "165cm"                  "6"                      "5'7"                   
#>  [7] ">9000"                  "5'7\""                  "5'3\""                 
#> [10] "5.25"                   "5'11"                   "5.5"                   
#> [13] "5'9''"                  "6"                      "6.5"                   
#> [16] "5'10''"                 "5.8"                    "5"                     
#> [19] "5.6"                    "5,3"                    "6'"                    
#> [22] "6"                      "5.9"                    "6,8"                   
#> [25] "5' 10"                  "5.5"                    "6.2"                   
#> [28] "Five foot eight inches" "6.2"                    "5.8"                   
#> [31] "5.1"                    "5.11"                   "5'5\""                 
#> [34] "5'2\""                  "5.75"                   "5,4"                   
#> [37] "7"                      "5.4"                    "6.1"                   
#> [40] "5'3"

13.2.1 Alternation

| is the alternation operator: it matches any one of several alternative patterns. So far we have detected only the word "feet", but our data also contains "ft" and "foot" to refer to the same thing. We can therefore build the pattern "feet" or "ft" or "foot".

# Visualize the matches
str_view(problematic_heights, "feet|ft|foot", match=TRUE)
#>  [10] │ 5 <feet> and 8.11 inches
#>  [29] │ Five <foot> eight inches
#>  [82] │ 5 <feet> 7inches
#> [124] │ 5<ft> 9 inches
#> [125] │ 5 <ft> 9 inches
#> [140] │ 5 <feet> 6 inches
index <- str_detect(problematic_heights, "feet|ft|foot")
problematic_heights[index] # Match
#> [1] "5 feet and 8.11 inches" "Five foot eight inches" "5 feet 7inches"        
#> [4] "5ft 9 inches"           "5 ft 9 inches"          "5 feet 6 inches"

In the same way we can find the variations for inches and other symbols that we can remove:

index <- str_detect(problematic_heights, "inches|in|''|\"|cm|and")
problematic_heights[index] # Match
#>  [1] "5' 4\""                 "165cm"                  "5'7\""                 
#>  [4] "5'3\""                  "5 feet and 8.11 inches" "5'9''"                 
#>  [7] "5'10''"                 "Five foot eight inches" "5'5\""                 
#> [10] "5'2\""                  "5'10''"                 "5'3''"                 
#> [13] "5'7''"                  "5'3\""                  "5'6''"                 
#> [16] "5'7.5''"                "5'7.5''"                "5'2\""                 
#> [19] "5' 7.78\""              "5 feet 7inches"         "5'8\""                 
#> [22] "5'11\""                 "5'7\""                  "5' 11\""               
#> [25] "6'1\""                  "69\""                   "5' 7\""                
#> [28] "5'10''"                 "5ft 9 inches"           "5 ft 9 inches"         
#> [31] "5'11''"                 "5'8\""                  "5 feet 6 inches"       
#> [34] "5'10''"                 "6'3\""                  "5'5''"                 
#> [37] "5'7\""                  "6'4\""                  "170 cm"

Here we included '' to detect the respondents who used two single quotes to denote inches, and \" for those who used a double quote. In the latter case we escape it with \ so that R does not interpret it as closing the string.
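
To see how that escaping works, here is a small sketch: inside a double-quoted R string, \" produces a literal double quote, and that character can then be matched like any other:

writeLines("5'7\"")
#> 5'7"
str_detect("5'7\"", "\"") # detect the double-quote character
#> [1] TRUE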

We could already start replacing based on the detected patterns:

problematic_heights <- str_replace_all(problematic_heights, "feet|ft|foot", "'")
problematic_heights <- str_replace_all(problematic_heights, "inches|in|''|\"|cm|and", "")

problematic_heights |> 
  head(30)
#>  [1] "6"             "5' 4"          "5.3"           "165"          
#>  [5] "6"             "5'7"           ">9000"         "5'7"          
#>  [9] "5'3"           "5 '  8.11 "    "5.25"          "5'11"         
#> [13] "5.5"           "5'9"           "6"             "6.5"          
#> [17] "5'10"          "5.8"           "5"             "5.6"          
#> [21] "5,3"           "6'"            "6"             "5.9"          
#> [25] "6,8"           "5' 10"         "5.5"           "6.2"          
#> [29] "Five ' eight " "6.2"

As an extra step, we can also handle the fact that some people wrote words instead of numbers. For this we create a function that replaces each number word with a digit and apply it to the vector:

words_to_number <- function(s){
  str_to_lower(s) |>  
    str_replace_all("zero", "0") |>
    str_replace_all("one", "1") |>
    str_replace_all("two", "2") |>
    str_replace_all("three", "3") |>
    str_replace_all("four", "4") |>
    str_replace_all("five", "5") |>
    str_replace_all("six", "6") |>
    str_replace_all("seven", "7") |>
    str_replace_all("eight", "8") |>
    str_replace_all("nine", "9") |>
    str_replace_all("ten", "10") |>
    str_replace_all("eleven", "11")
}

problematic_heights <- words_to_number(problematic_heights)
problematic_heights |> 
  head(30)
#>  [1] "6"          "5' 4"       "5.3"        "165"        "6"         
#>  [6] "5'7"        ">9000"      "5'7"        "5'3"        "5 '  8.11 "
#> [11] "5.25"       "5'11"       "5.5"        "5'9"        "6"         
#> [16] "6.5"        "5'10"       "5.8"        "5"          "5.6"       
#> [21] "5,3"        "6'"         "6"          "5.9"        "6,8"       
#> [26] "5' 10"      "5.5"        "6.2"        "5 ' 8 "     "6.2"

13.2.2 Anchoring

Now that the data is more standardized, we can write regexes with more general characteristics. For example, one person entered 6'. It would be convenient to have every value in the form feet plus inches, so this should become 6'0. To achieve this we build a regex for this situation, using the symbol ^ to anchor the match to the start of the string and the symbol $ to match the end of the string. Before replacing, let's first see what matches.

str_view(problematic_heights, "^6'$", match=TRUE)
#> [22] │ <6'>

This regex says the string starts with 6' and ends right there. We can still make it more generic, to cover anyone who, in the future, writes 5 feet (1.52 m) or 6 feet (1.83 m). For this we use brackets, and inside them we list all the values we will accept.

index <- str_detect(problematic_heights, "^[56]'$")
problematic_heights[index] # Match
#> [1] "6'"

There is still only one match, but our regex is now more generic and ready to use for the replacement. Before applying it to our vector, let's run a small test to learn how to build what we need from a pattern.

test_vec <- c("5'", "6'")

str_replace_all(test_vec, "^([56])'$", "\\1'0")
#> [1] "5'0" "6'0"

The parentheses define a capture group: whatever matches inside them becomes our first captured value, and \\1 in the replacement refers back to it. So we are saying: write the captured value, then a quote ', and then a zero 0.
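
Capture groups are completely general: each pair of parentheses is numbered from left to right, and \\1, \\2, and so on refer back to them in the replacement. A small sketch with toy data, reordering a "surname, name" string:

names_vec <- c("Austen, Jane", "Cervantes, Miguel")

str_replace(names_vec, "^(\\w+),\\s(\\w+)$", "\\2 \\1")
#> [1] "Jane Austen"      "Miguel Cervantes"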

Now we are ready to apply it to our entire vector. We will extend the range to consider not only 5 and 6 but values up to 7 feet (2.1 m). Likewise, we will handle the cases where there is only a bare number without the foot symbol '.

problematic_heights <- str_replace_all(problematic_heights, "^([5-7])'$", "\\1'0")
problematic_heights <- str_replace_all(problematic_heights, "^([5-7])$", "\\1'0")

problematic_heights |> 
  head(30)
#>  [1] "6'0"        "5' 4"       "5.3"        "165"        "6'0"       
#>  [6] "5'7"        ">9000"      "5'7"        "5'3"        "5 '  8.11 "
#> [11] "5.25"       "5'11"       "5.5"        "5'9"        "6'0"       
#> [16] "6.5"        "5'10"       "5.8"        "5'0"        "5.6"       
#> [21] "5,3"        "6'0"        "6'0"        "5.9"        "6,8"       
#> [26] "5' 10"      "5.5"        "6.2"        "5 ' 8 "     "6.2"

13.2.3 Repetitions

We can control how many times a pattern matches using repetition operators. The question mark ? indicates that the preceding element matches 0 or 1 time (making it optional). The plus sign + requires 1 or more matches, ensuring the element is present at least once. The asterisk * allows for 0 or more matches, meaning the element can be absent or repeated indefinitely.
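
Here is a quick sketch of the three operators on a toy vector (rep_examples is just an illustrative name):

rep_examples <- c("clr", "color", "colour", "colouur")

str_detect(rep_examples, "colou?r") # ? : "u" appears 0 or 1 time
#> [1] FALSE  TRUE  TRUE FALSE
str_detect(rep_examples, "colou+r") # + : "u" appears 1 or more times
#> [1] FALSE FALSE  TRUE  TRUE
str_detect(rep_examples, "colou*r") # * : "u" appears 0 or more times
#> [1] FALSE  TRUE  TRUE  TRUE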

For example, to find all the cases where, instead of the foot symbol ', people entered a comma or a period (possibly surrounded by spaces), we will use the following pattern:

pattern <- "^([4-7])\\s*[,\\.]\\s*(\\d*)$"

Let's read the pattern:

  1. The string starts with a digit from 4 to 7, captured by the first pair of parentheses.
  2. \\s matches a whitespace character, and the * indicates that it appears 0 or more times.
  3. Then we look for either of the characters inside the brackets: a comma , or a period \\. (we escape the period with a double backslash because, on its own, a period in a pattern means "any character").
  4. We use \\s* again to allow zero or more whitespace characters.
  5. Finally, the string ends with the digits captured by the second pair of parentheses: \\d denotes any digit (d for digit), and the * keeps zero or more of them.

In summary: a digit, then a separator, then more digits, possibly with whitespace around the separator. That is our pattern.

str_view(problematic_heights, pattern, match=TRUE)
#>  [3] │ <5.3>
#> [11] │ <5.25>
#> [13] │ <5.5>
#> [16] │ <6.5>
#> [18] │ <5.8>
#> [20] │ <5.6>
#> [21] │ <5,3>
#> [24] │ <5.9>
#> [25] │ <6,8>
#> [27] │ <5.5>
#> [28] │ <6.2>
#> [30] │ <6.2>
#> [31] │ <5.8>
#> [32] │ <5.1>
#> [33] │ <5.11>
#> [36] │ <5.75>
#> [37] │ <5,4>
#> [39] │ <5.4>
#> [40] │ <6.1>
#> [42] │ <5.6>
#> ... and 48 more

We already found the values that match the pattern, so we are ready to replace.

problematic_heights <- str_replace_all(
                        problematic_heights, 
                        "^([4-7])\\s*[,\\.]\\s*(\\d*)$", "\\1.\\2'0"
                   )

problematic_heights |> 
  head(30)
#>  [1] "6'0"        "5' 4"       "5.3'0"      "165"        "6'0"       
#>  [6] "5'7"        ">9000"      "5'7"        "5'3"        "5 '  8.11 "
#> [11] "5.25'0"     "5'11"       "5.5'0"      "5'9"        "6'0"       
#> [16] "6.5'0"      "5'10"       "5.8'0"      "5'0"        "5.6'0"     
#> [21] "5.3'0"      "6'0"        "6'0"        "5.9'0"      "6.8'0"     
#> [26] "5' 10"      "5.5'0"      "6.2'0"      "5 ' 8 "     "6.2'0"

Another pattern we now see is white space before or after the foot symbol '. Let's make the change with what we have learned, also covering the cases that contain decimals:

index <- str_detect(problematic_heights, 
                     "^([4-7]\\.?\\d*)\\s*'\\s*(\\d+\\.?\\d*)\\s*$")

problematic_heights[index] |> # Match
  head(30)
#>  [1] "6'0"        "5' 4"       "5.3'0"      "6'0"        "5'7"       
#>  [6] "5'7"        "5'3"        "5 '  8.11 " "5.25'0"     "5'11"      
#> [11] "5.5'0"      "5'9"        "6'0"        "6.5'0"      "5'10"      
#> [16] "5.8'0"      "5'0"        "5.6'0"      "5.3'0"      "6'0"       
#> [21] "6'0"        "5.9'0"      "6.8'0"      "5' 10"      "5.5'0"     
#> [26] "6.2'0"      "5 ' 8 "     "6.2'0"      "5.8'0"      "5.1'0"

problematic_heights <- str_replace_all(
                      problematic_heights, 
                      "^([4-7]\\.?\\d*)\\s*'\\s*(\\d+\\.?\\d*)\\s*$",
                      "\\1'\\2"
                   )

problematic_heights |> 
  head(30)
#>  [1] "6'0"    "5'4"    "5.3'0"  "165"    "6'0"    "5'7"    ">9000"  "5'7"   
#>  [9] "5'3"    "5'8.11" "5.25'0" "5'11"   "5.5'0"  "5'9"    "6'0"    "6.5'0" 
#> [17] "5'10"   "5.8'0"  "5'0"    "5.6'0"  "5.3'0"  "6'0"    "6'0"    "5.9'0" 
#> [25] "6.8'0"  "5'10"   "5.5'0"  "6.2'0"  "5'8"    "6.2'0"

Likewise, there is the pattern feet + space + inches, with no symbol in between. Let's make the change with what we have learned.

index <- str_detect(problematic_heights, "^([4-7])\\s+(\\d*)\\s*$")

problematic_heights[index] # Match
#> [1] "5 11" "6 04"

problematic_heights <- str_replace_all(
                      problematic_heights, 
                      "^([4-7])\\s+(\\d*)\\s*$", "\\1'\\2"
                   )

problematic_heights |> 
  head(30)
#>  [1] "6'0"    "5'4"    "5.3'0"  "165"    "6'0"    "5'7"    ">9000"  "5'7"   
#>  [9] "5'3"    "5'8.11" "5.25'0" "5'11"   "5.5'0"  "5'9"    "6'0"    "6.5'0" 
#> [17] "5'10"   "5.8'0"  "5'0"    "5.6'0"  "5.3'0"  "6'0"    "6'0"    "5.9'0" 
#> [25] "6.8'0"  "5'10"   "5.5'0"  "6.2'0"  "5'8"    "6.2'0"

We are ready to put all the patterns together, and the power of patterns is that they can serve us in future exercises. We will therefore create a function that applies, one by one, each change we have verified on a string.

format_errors <- function(string){
  string |> 
    str_replace_all("feet|ft|foot", "'") |> # Replace feet words with '
    str_replace_all("inches|in|''|\"|cm|and", "") |> # Remove inch words and symbols
    str_replace_all("^([5-7])'$", "\\1'0") |> # Add 0 inches to 5', 6' or 7'
    str_replace_all("^([5-7])$", "\\1'0") |> # Add '0 to a bare 5, 6 or 7
    str_replace_all("^([4-7])\\s*[,\\.]\\s*(\\d*)$", "\\1.\\2'0") |> # Turn 5.3 or 5,3 into 5.3'0
    str_replace_all("^([4-7]\\.?\\d*)\\s*'\\s*(\\d+\\.?\\d*)\\s*$", "\\1'\\2") |> # Remove spaces around '
    str_replace_all("^([4-7])\\s+(\\d*)\\s*$", "\\1'\\2") |> # Insert ' between feet and inches
    str_replace("^([12])\\s*,\\s*(\\d*)$", "\\1.\\2") |> # Decimal comma to dot (meter values)
    str_trim() # Remove spaces at start and end
}

We have now created two functions that could be useful again if we ever work with surveys of the same type.

Before applying them to our entire table, let's extract the values into a vector again and apply the functions we created.

problematic_heights <- reported_heights |> 
  filter(is.na(as.numeric(height)) | # Does not convert to number
         (!is.na(as.numeric(height)) & as.numeric(height) >= 5 &
            as.numeric(height) <= 7 ) # or entered in feet and not inches
        ) |> 
  pull(height)

Now let’s apply the created functions:

formatted_heights <- problematic_heights |> 
  words_to_number() |> 
  format_errors()

pattern <- "^([4-7]\\.?\\d*)\\s*'\\s*(\\d+\\.?\\d*)\\s*$"
index <- str_detect(formatted_heights, pattern)
formatted_heights[!index] # Do not match the pattern
#>  [1] "165"       ">9000"     "2'33"      "1.70"      "yyy"       "6*12"     
#>  [7] "69"        "708,661"   "649,606"   "728,346"   "170"       "7,283,465"

We have reduced the errors from 168 out of 1,095 records (15.3%) to 12 out of 1,095 (about 1%). We can now apply the functions to our initial table.

# Apply the created functions
heights <- reported_heights |> 
  mutate(height = words_to_number(height) |> format_errors())

# Get random samples to validate quality
random_indices <- sample(1:nrow(heights)) 
heights[random_indices, ] |> 
  head(15)
#>               time_stamp    sex height
#> 155  2014-09-02 15:17:12 Female     63
#> 426  2015-01-06 22:58:54   Male   5'12
#> 1029 2016-07-26 13:02:34   Male     67
#> 326  2014-10-14 05:18:11   Male     71
#> 789  2016-01-25 08:15:45 Female    5'5
#> 985  2016-04-23 17:15:26   Male   67.5
#> 39   2014-09-02 15:16:31   Male     72
#> 822  2016-01-25 21:18:33   Male     68
#> 986  2016-04-25 06:11:45   Male    180
#> 137  2014-09-02 15:17:02   Male     68
#> 455  2015-01-28 03:59:44   Male  5.5'0
#> 589  2015-05-25 16:19:20   Male     69
#> 1089 2017-09-04 07:28:40   Male     69
#> 196  2014-09-02 15:18:30 Female  64.57
#> 680  2015-09-01 22:45:11   Male     68

We still have some conversions to do. However, since the remaining values follow a known pattern, we can use tidyr's extract(column, new_columns, regex, remove) function, which creates a new column for each capture group in our pattern.

pattern <- "^([4-7]\\.?\\d*)\\s*'\\s*(\\d+\\.?\\d*)\\s*$"

heights |> 
  extract(height, c("feet", "inches"), regex = pattern, remove = FALSE) |> 
  head(15)
#>             time_stamp    sex height feet inches
#> 1  2014-09-02 13:40:36   Male     75 <NA>   <NA>
#> 2  2014-09-02 13:46:59   Male     70 <NA>   <NA>
#> 3  2014-09-02 13:59:20   Male     68 <NA>   <NA>
#> 4  2014-09-02 14:51:53   Male     74 <NA>   <NA>
#> 5  2014-09-02 15:16:15   Male     61 <NA>   <NA>
#> 6  2014-09-02 15:16:16 Female     65 <NA>   <NA>
#> 7  2014-09-02 15:16:19 Female     66 <NA>   <NA>
#> 8  2014-09-02 15:16:21 Female     62 <NA>   <NA>
#> 9  2014-09-02 15:16:21 Female     66 <NA>   <NA>
#> 10 2014-09-02 15:16:22   Male     67 <NA>   <NA>
#> 11 2014-09-02 15:16:22   Male     72 <NA>   <NA>
#> 12 2014-09-02 15:16:23   Male    6'0    6      0
#> 13 2014-09-02 15:16:23   Male     69 <NA>   <NA>
#> 14 2014-09-02 15:16:26   Male     68 <NA>   <NA>
#> 15 2014-09-02 15:16:26   Male     69 <NA>   <NA>

Now that the values matching the pattern live in two new columns, and we know they represent numbers, we can convert all three columns to numeric.

heights |> 
  extract(height, c("feet", "inches"), regex = pattern, remove = FALSE) |> 
  mutate(across(c("height", "feet", "inches"), ~as.numeric(.))) |> 
  head(15)
#> Warning: There was 1 warning in `mutate()`.
#> ℹ In argument: `across(c("height", "feet", "inches"),
#>   ~as.numeric(.))`.
#> Caused by warning:
#> ! NAs introduced by coercion
#>             time_stamp    sex height feet inches
#> 1  2014-09-02 13:40:36   Male     75   NA     NA
#> 2  2014-09-02 13:46:59   Male     70   NA     NA
#> 3  2014-09-02 13:59:20   Male     68   NA     NA
#> 4  2014-09-02 14:51:53   Male     74   NA     NA
#> 5  2014-09-02 15:16:15   Male     61   NA     NA
#> 6  2014-09-02 15:16:16 Female     65   NA     NA
#> 7  2014-09-02 15:16:19 Female     66   NA     NA
#> 8  2014-09-02 15:16:21 Female     62   NA     NA
#> 9  2014-09-02 15:16:21 Female     66   NA     NA
#> 10 2014-09-02 15:16:22   Male     67   NA     NA
#> 11 2014-09-02 15:16:22   Male     72   NA     NA
#> 12 2014-09-02 15:16:23   Male     NA    6      0
#> 13 2014-09-02 15:16:23   Male     69   NA     NA
#> 14 2014-09-02 15:16:26   Male     68   NA     NA
#> 15 2014-09-02 15:16:26   Male     69   NA     NA

Now that our columns are numeric we can perform operations to calculate height.

heights |> 
  extract(height, c("feet", "inches"), regex = pattern, remove = FALSE) |> 
  mutate(across(c("height", "feet", "inches"), ~as.numeric(.))) |> 
  mutate(fixed_heights = feet*12 + inches) |> 
  head(15)
#> Warning: There was 1 warning in `mutate()`.
#> ℹ In argument: `across(c("height", "feet", "inches"),
#>   ~as.numeric(.))`.
#> Caused by warning:
#> ! NAs introduced by coercion
#>             time_stamp    sex height feet inches fixed_heights
#> 1  2014-09-02 13:40:36   Male     75   NA     NA            NA
#> 2  2014-09-02 13:46:59   Male     70   NA     NA            NA
#> 3  2014-09-02 13:59:20   Male     68   NA     NA            NA
#> 4  2014-09-02 14:51:53   Male     74   NA     NA            NA
#> 5  2014-09-02 15:16:15   Male     61   NA     NA            NA
#> 6  2014-09-02 15:16:16 Female     65   NA     NA            NA
#> 7  2014-09-02 15:16:19 Female     66   NA     NA            NA
#> 8  2014-09-02 15:16:21 Female     62   NA     NA            NA
#> 9  2014-09-02 15:16:21 Female     66   NA     NA            NA
#> 10 2014-09-02 15:16:22   Male     67   NA     NA            NA
#> 11 2014-09-02 15:16:22   Male     72   NA     NA            NA
#> 12 2014-09-02 15:16:23   Male     NA    6      0            72
#> 13 2014-09-02 15:16:23   Male     69   NA     NA            NA
#> 14 2014-09-02 15:16:26   Male     68   NA     NA            NA
#> 15 2014-09-02 15:16:26   Male     69   NA     NA            NA

Finally, we validate whether each height falls within a plausible interval and whether it was expressed in centimeters or meters.

# We assume for a person a minimum of 50" (about 1.27 m) and a maximum of 84" (about 2.13 m)
min <- 50
max <- 84

heights <- heights |> 
  extract(height, c("feet", "inches"), regex = pattern, remove = FALSE) |> 
  mutate(across(c("height", "feet", "inches"), ~as.numeric(.))) |> 
  mutate(fixed_heights = feet*12 + inches) |> 
  mutate(final_height = case_when(
    !is.na(height) & between(height, min, max) ~ height, #inches 
    !is.na(height) & between(height/2.54, min, max) ~ height/2.54, #cm
    !is.na(height) & between(height*100/2.54, min, max) ~ height*100/2.54, #meters
    !is.na(fixed_heights) & inches < 12 & 
      between(fixed_heights, min, max) ~ fixed_heights, #feet'inches
    TRUE ~ as.numeric(NA)))
#> Warning: There was 1 warning in `mutate()`.
#> ℹ In argument: `across(c("height", "feet", "inches"),
#>   ~as.numeric(.))`.
#> Caused by warning:
#> ! NAs introduced by coercion

# Random Sample:
random_indices <- sample(1:nrow(heights)) 
heights[random_indices, ] |> 
  select(-time_stamp) |> # Shows all columns except time_stamp
  head(10)
#>         sex   height feet inches fixed_heights final_height
#> 201  Female 67.00000   NA     NA            NA     67.00000
#> 1006   Male 68.11024   NA     NA            NA     68.11024
#> 651    Male 70.00000   NA     NA            NA     70.00000
#> 545    Male 68.00000   NA     NA            NA     68.00000
#> 617    Male 69.00000   NA     NA            NA     69.00000
#> 17     Male 75.00000   NA     NA            NA     75.00000
#> 102  Female 71.00000   NA     NA            NA     71.00000
#> 71     Male 73.00000   NA     NA            NA     73.00000
#> 80   Female 72.00000   NA     NA            NA     72.00000
#> 643    Male 72.00000   NA     NA            NA     72.00000

Our sample is now validated; all that remains is to select the columns we need and start using the object for the analyses we require.

final_heights <- heights |> 
  select(gender = sex, heights = final_height)

final_heights |> 
  head(10)
#>    gender heights
#> 1    Male      75
#> 2    Male      70
#> 3    Male      68
#> 4    Male      74
#> 5    Male      61
#> 6  Female      65
#> 7  Female      66
#> 8  Female      62
#> 9  Female      66
#> 10   Male      67

13.3 From strings to dates

When we import data, we will not only want to transform numeric fields. There will also be many cases where we need to convert a string to a date in a particular format. For this we will use the lubridate library, included in the tidyverse, which provides a range of functions that make working with dates more accessible.

library(lubridate)

When the text string is in the ISO 8601 date format (YYYY-MM-DD), we can use the month(), day(), and year() functions directly.

dates_char <- c("2010-05-19", "2020-05-06", "2010-02-03")

str(dates_char)
#>  chr [1:3] "2010-05-19" "2020-05-06" "2010-02-03"

month(dates_char)
#> [1] 5 5 2

However, we do not always have the date in that format, and lubridate provides other functions that are more flexible when coercing data. Look at this example:

dates <- c(20090101, "2009-01-02", "2009 01 03", "2009-1-4",
       "2009-1, 5", "Created on 2009 1 6", "200901 !!! 07")

str(dates)
#>  chr [1:7] "20090101" "2009-01-02" "2009 01 03" "2009-1-4" "2009-1, 5" ...

ymd(dates)
#> [1] "2009-01-01" "2009-01-02" "2009-01-03" "2009-01-04" "2009-01-05"
#> [6] "2009-01-06" "2009-01-07"

The first value entered was a number, but as we already know, it is coerced to text inside the character vector. The remaining values are written differently, yet they all follow the same order: first the year, then the month, and then the day. When we know that order, we use the ymd() function to convert all the dates to ISO 8601 format.
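
Note that the result of ymd() is no longer a character vector but a Date object, so we can extract its components directly:

parsed_dates <- ymd(dates)

class(parsed_dates)
#> [1] "Date"
day(parsed_dates)
#> [1] 1 2 3 4 5 6 7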

In the same way, there is a family of functions to use depending on the order in which our source provides the date components. In every case it will be convenient to convert to ISO 8601 format. Here, for example, we can see when the format is recognized correctly and when parsing fails.

x <- "28/03/89"
ymd(x)
#> [1] NA
mdy(x)
#> [1] NA
ydm(x)
#> [1] NA
myd(x)
#> [1] NA
dmy(x)
#> [1] "1989-03-28"
dym(x)
#> [1] NA

Finally, just as these functions handle days, months, and years, there are variants that also parse hours, minutes, and seconds.

# Format with hours, minutes and seconds
date_val <- "Feb/2/2012 12:34:56"
mdy_hms(date_val)
#> [1] "2012-02-02 12:34:56 UTC"

# Additional data: Showing system date:
now()
#> [1] "2025-12-25 16:09:37 GMT"

13.4 Exercises

Before solving the following exercise, run this script:

sales <- tibble(
  month = c("April", "May", "June"),
  revenue = c("s/32,124", "s/35,465", "S/38,332"),
  profit = c("s/8,120", "s/9,432", "s/10,543")
)
  1. Convert the revenue and profit columns in the sales object to numeric values, removing any currency symbols or formatting characters.
Solution
# Solution 1
sales |> 
  mutate(across(c(2,3), ~parse_number(.)))

# Alternative solution, longer
sales |>
  mutate(across(c(2,3), ~str_replace_all(., "\\S/|,", ""))) |> 
  mutate(across(c(2,3), ~as.numeric(.)))
  2. Clean the universities vector so that all university names are standardized. Specifically, replace abbreviations like “Univ.” or “U.” at the beginning of the string with the full word “University”.
Solution
universities |> 
    str_replace("^Univ\\.?\\s|^U\\.?\\s", "University ")

For the following exercises, we are going to work with survey data collected prior to the Brexit referendum in the UK. Run this script first:

library(rvest)
library(tidyverse)
url <- "https://en.wikipedia.org/w/index.php?title=Opinion_polling_for_the_United_Kingdom_European_Union_membership_referendum&oldid=896735054"
table_html <- read_html(url) |> html_nodes("table")
polls <- table_html[[5]] |> html_table(fill = TRUE)
  3. Update the polls object by renaming columns to c("date", "remain", "leave", "undecided", "spread", "sample", "pollster", "type", "notes"). Then, filter the dataset to retain only rows where the remain column contains a percentage symbol (“%”).
Solution
names(polls) <- c("date", "remain", "leave", "undecided", "spread",
                  "sample", "pollster", "type", "notes")
polls <- polls[str_detect(polls$remain, "%"), ]
polls 

# If we want to validate the number of polls:
nrow(polls)
  4. Extract the remain column into a vector and convert the text percentages into proper numeric probabilities (e.g., convert “50%” to 0.5).
Solution
remain <- polls$remain

# Solution 1:
percentages <- parse_number(remain)/100

# Solution 2:
temp <- str_replace(remain, "%", "")
percentages <- as.numeric(temp)/100

# Solution 3:
temp <- str_remove(remain, "%")
percentages <- as.numeric(temp)/100
  5. In the undecided column, the value “N/A” appears when the sum of remain and leave equals 100%. Create a vector for undecided where these “N/A” values are replaced with “0%”.
Solution
undecided <- polls$undecided

str_replace(undecided, "N/A", "0%")
  6. Encapsulate your cleaning logic into a single function named format_percentage(string). Test this function with the vector c("13.5%", "N/A", "10%") to verify it handles both percentages and “N/A” values correctly.
Solution

format_percentage <- function(string){
  string |> 
    str_replace("N/A", "0%") |> 
    parse_number()/100
}

# Function test:
test_vec <- c("13.5%", "N/A", "10%")

format_percentage(test_vec)
  7. Apply format_percentage to the remain, leave, undecided, and spread columns in the polls dataset. Also, ensure the sample column is converted to a numeric type.
Solution
polls <- polls |> 
  mutate(across(c("remain", "leave", "undecided", "spread"), ~format_percentage(.))) |> 
  mutate(across(c("sample"), ~parse_number(.)))
  8. Import the Peruvian COVID-19 dataset from this URL into an object named covid_peru. Convert the birth date column (FECHA_NACIMIENTO) to a proper Date format and calculate the age distribution of the infected individuals using a histogram.
Solution
url <- "https://www.datosabiertos.gob.pe/sites/default/files/DATOSABIERTOS_SISCOVID.csv"
covid_peru <- read_csv(url)

# We look for those that do not follow the ISO 8601 standard:
index <- str_detect(covid_peru$FECHA_NACIMIENTO, "\\d{4}-\\d{2}-\\d{2}")
covid_peru$FECHA_NACIMIENTO[!index]

# We see dates in DD/MM/YYYY format
# We replace to ISO 8601 format:
covid_peru <- covid_peru |> 
  mutate(across("FECHA_NACIMIENTO", 
            ~str_replace(., "(\\d{2})/(\\d{2})/(\\d{4})", "\\3-\\2-\\1")
            ))

# We search again for those that do not follow ISO 8601 standard:
index <- str_detect(covid_peru$FECHA_NACIMIENTO, "\\d{4}-\\d{2}-\\d{2}")
covid_peru$FECHA_NACIMIENTO[!index]

# Convert column to date:
covid_peru <- covid_peru |> 
  mutate(across("FECHA_NACIMIENTO", ~ymd(.)))

# Now that it is date format we create histogram:
covid_peru |> 
  mutate(age = year(now()) - year(FECHA_NACIMIENTO)) |> 
  pull(age) |> 
  hist()

13.5 Text Mining using Tidy Data

Text mining is the computer-driven discovery of new, previously unknown information by automatically extracting it from different written resources. Written resources can be websites, books, chats, comments, emails, reviews, articles, etc.

To perform text mining efficiently in R, we will use the tidytext package. The “tidy” text format is defined as a table with one token per row. A token can be a word, a sentence, or a paragraph, but usually, it is single words. This structure allows us to use all the standard tools we’ve learned (dplyr, ggplot2) to analyze text.

# Install packages if you haven't yet
# install.packages("tidytext")

library(tidytext)
library(tidyverse)
library(stringr)
library(syuzhet) # For sentiment analysis
library(wordcloud) 

13.5.1 Importing the data

Word clouds (or word maps) let us quickly identify which words are repeated most often in a text.

We are going to analyze the work “Pride and Prejudice”, written by Jane Austen. We will obtain the text from the Project Gutenberg website and import it with readLines(), which reads the file one line at a time and preserves its structure.

url <- "https://www.gutenberg.org/cache/epub/1342/pg1342.txt"

# Import text as a single string
pride_book <- get_text_as_string(url)

# Convert to a data frame with sentences or just lines
# Here we will split by newline to create a rudimentary structure
text_df <- tibble(
  text = str_split(pride_book, "\n")[[1]]
)

# Remove empty lines
text_df <- text_df |> 
  filter(text != "")

head(text_df)
#> # A tibble: 1 × 1
#>   text                                                                          
#>   <chr>                                                                         
#> 1 "The Project Gutenberg eBook of Pride and Prejudice      This ebook is for th…

13.5.2 Text cleaning and Tokenization

Now we will clean the text and convert it to specific tokens (words). The unnest_tokens() function automatically: 1. Splits text into tokens (words by default). 2. Removes punctuation. 3. Converts to lowercase.
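
Before applying it to the whole book, we can see those three steps on a toy sentence:

toy_df <- tibble(text = "The QUICK brown fox JUMPED!")

toy_df |>
  unnest_tokens(word, text)
#> # A tibble: 5 × 1
#>   word  
#>   <chr> 
#> 1 the   
#> 2 quick 
#> 3 brown 
#> 4 fox   
#> 5 jumped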

Note on AI: This process of breaking text into “tokens” is essentially how Large Language Models like GPT-4 ingest text, although they typically use subword tokens rather than whole words. In Chapter 14 (Data Science in the Age of AI), we will see that LLMs are probabilistic engines that predict the next token in a sequence. Understanding how to handle tokens here is the foundation for understanding Generative AI.

# Drop the Project Gutenberg header/notes at the top of the file
start_line <- 115
text_df <- text_df[start_line:nrow(text_df), ]

# Tokenize
tidy_pride <- text_df |>
  unnest_tokens(word, text)

# See the result
head(tidy_pride)

Now we have a table where each row is a word. This is the “tidy” format.

However, we clearly have words that do not add meaning (stop words), such as “the”, “and”, “of”. We can remove them using a list of stop words. The tm package provides a good list for English.

library(tm)
english_stop_words <- tibble(word = stopwords("english"))

# Remove stop words using anti_join
tidy_pride_clean <- tidy_pride |>
  anti_join(english_stop_words, by = "word")

head(tidy_pride_clean)

We might also want to remove custom words or numbers that appeared in the extraction.

custom_stop_words <- tibble(word = c("mr", "mrs", "miss", "said", "will", 
                                     "one", "much", "may", "can", "now", "sir", "lady"))

tidy_pride_clean <- tidy_pride_clean |>
  anti_join(custom_stop_words, by = "word") |>
  filter(!str_detect(word, "^\\d+$")) # Remove pure numbers

13.5.3 Word Cloud

Now that we have our clean data, calculating word frequency is as simple as using count().

word_counts <- tidy_pride_clean |>
  count(word, sort = TRUE)

head(word_counts)
#> # A tibble: 6 × 2
#>   word          n
#>   <chr>     <int>
#> 1 elizabeth   605
#> 2 darcy       383
#> 3 must        322
#> 4 bennet      309
#> 5 jane        274
#> 6 bingley     262

We can create the word cloud directly from this data frame.

wordcloud(words = word_counts$word, 
          freq = word_counts$n,
          min.freq = 5,
          max.words = 80, 
          random.order = FALSE, 
          colors = brewer.pal(name = "Dark2", n = 8))

13.5.4 Word Frequency Plot

Since we have the data in a tidy format, plotting a bar chart of the most frequent words is straightforward with ggplot2.

word_counts |>
  head(20) |>
  ggplot(aes(n, reorder(word, n))) +
  geom_col(fill = "blue") +
  labs(y = NULL, x = "Frequency", title = "Most common words in Pride and Prejudice")

13.6 Sentiment Analysis

Sentiment analysis allows us to gauge the tone of messages. We will use the syuzhet package combined with our tidy data skills.

Let's work with an example: a collection of tweets.

library(readxl)

# Download tweets
url <- "https://dparedesi.github.io/Data-Science-with-R-book/data/rmapalacios-tweets.xlsx"
temp_file <- tempfile()
download.file(url, temp_file)
posts <- read_excel(temp_file)
file.remove(temp_file)
#> [1] TRUE

# Filter for tweets only and create a tidy dataframe
tweets_df <- posts |> 
  filter(`Tweet Type` == "Tweet") |> 
  select(text = Text) |>
  mutate(tweet_id = row_number()) 

head(tweets_df)
#> # A tibble: 6 × 2
#>   text                                                                  tweet_id
#>   <chr>                                                                    <int>
#> 1 "Le agradezco mucho regidor.\nUna visita al niño y a su madre pueden…        1
#> 2 "Esto esta prohibido en tantas normas que no se por donde empezar.\n…        2
#> 3 "Nadie lo sabe y a los ministros sectoriales parece importarles poco…        3
#> 4 "Ahora se llama \"trabajo remoto\" con el auspicio del Estado peruan…        4
#> 5 "¿Y usted esta muy seguro que va a salir a trabajar el lunes 25? Vay…        5
#> 6 "No saben como abundan. https://t.co/r1qMGOhGcR"                             6

Now we clean the tweets. Before scoring them, we remove the URL links and user mentions, since they carry no sentiment.

# Custom cleaning function for tweets before tokenization
clean_tweets <- tweets_df |>
  mutate(text = str_replace_all(text, "http\\S+", "")) |> # remove URLs
  mutate(text = str_replace_all(text, "@\\S+", "")) # remove mentions

# Get Sentiment scores for each tweet
# syuzhet works well with the full text vector for scoring
tweet_sentiments <- get_nrc_sentiment(clean_tweets$text, language = "spanish")

# Combine with original data
tweets_with_sentiment <- bind_cols(clean_tweets, tweet_sentiments)

head(tweets_with_sentiment)
#> # A tibble: 6 × 12
#>   text    tweet_id anger anticipation disgust  fear   joy sadness surprise trust
#>   <chr>      <int> <dbl>        <dbl>   <dbl> <dbl> <dbl>   <dbl>    <dbl> <dbl>
#> 1 "Le ag…        1     0            1       0     2     1       3        0     3
#> 2 "Esto …        2     4            0       3     5     0       3        0     0
#> 3 "Nadie…        3     0            0       0     0     0       0        0     0
#> 4 "Ahora…        4     0            1       0     0     1       0        1     1
#> 5 "¿Y us…        5     0            2       0     1     3       2        0     4
#> 6 "No sa…        6     0            0       0     0     0       0        0     0
#> # ℹ 2 more variables: negative <dbl>, positive <dbl>

We can now reshape this data to visualize emotions using pivot_longer, just like we do with any tidy dataset.

translate_emotions <- function(string){
  case_when(
    string == "anger" ~ "Anger",
    string == "anticipation" ~ "Anticipation",
    string == "disgust" ~ "Disgust",
    string == "fear" ~ "Fear",
    string == "joy" ~ "Joy",
    string == "sadness" ~ "Sadness",
    string == "surprise" ~ "Surprise",
    string == "trust" ~ "Trust",
    string == "negative" ~ "Negative",
    string == "positive" ~ "Positive",
    TRUE ~ string
  )
}

# Summarize totals
sentiment_totals <- tweets_with_sentiment |> 
  summarise(across(anger:positive, sum)) |>
  pivot_longer(cols = everything(), names_to = "sentiment", values_to = "total") |>
  mutate(sentiment = translate_emotions(sentiment))

sentiment_totals
#> # A tibble: 10 × 2
#>    sentiment    total
#>    <chr>        <dbl>
#>  1 Anger          805
#>  2 Anticipation   905
#>  3 Disgust        807
#>  4 Fear          1344
#>  5 Joy            549
#>  6 Sadness       1378
#>  7 Surprise       421
#>  8 Trust         1373
#>  9 Negative      2535
#> 10 Positive      2314

Visualizing:

# Separate positive/negative from specific emotions
general_sentiments <- c("Positive", "Negative")

sentiment_totals |>
  filter(!sentiment %in% general_sentiments) |>
  ggplot(aes(reorder(sentiment, total), total, fill = sentiment)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Total Score", title = "Emotions in Tweets") +
  theme(legend.position = "none")

sentiment_totals |>
  filter(sentiment %in% general_sentiments) |>
  ggplot(aes(sentiment, total, fill = sentiment)) +
  geom_col() +
  labs(x = NULL, y = "Total Score", title = "Positive vs Negative Sentiment")

This tidy approach makes it much easier to inspect the data at every step and to integrate standard data science workflows (filtering, joining, plotting) without learning a separate system just for text.

13.7 Exercises

For these exercises we will use more books from Project Gutenberg using the gutenbergr library.

# install.packages("gutenbergr")
library(gutenbergr)

# Tibble: list of books in Gutenberg.org
gutenberg_metadata

# List of books in Spanish
gutenberg_works(languages = "es")
  1. Use gutenberg_download(2000) to download the text of “El ingenioso hidalgo don Quijote de la Mancha” and store the result in an object named download.
Solution
download <- gutenberg_download(2000)
quijote_text <- download$text
head(quijote_text)
  2. Extract a random sample of 1,000 lines from the text. Clean this sample by tokenizing into words and removing standard Spanish stop words.
Solution
set.seed(123)
sample_lines <- tibble(text = sample(quijote_text, 1000))

# Build a Spanish stop-word list with tm, as we did for English
spanish_stop_words <- tibble(word = stopwords("spanish"))

tidy_quijote <- sample_lines |>
  unnest_tokens(word, text) |>
  anti_join(spanish_stop_words, by = "word") |>
  # Remove extra stop words if needed
  filter(!word %in% c("don", "quijote", "sancho")) 
  3. Visualize the most frequent words in your cleaned Quijote sample using a word cloud.
Solution
quijote_counts <- tidy_quijote |>
  count(word, sort = TRUE)

wordcloud(words = quijote_counts$word, 
          freq = quijote_counts$n,
          min.freq = 2,
          max.words = 80, 
          colors = brewer.pal(8, "Dark2"))
  4. Analyze the sentiments present in your text sample to determine the overall emotional tone.
Solution
# syuzhet scores whole text strings, so we score the original sample lines,
# which preserves more context than scoring isolated tokens:

# Extract sentiments from the lines
quijote_sentiments <- get_nrc_sentiment(sample_lines$text, language = "spanish")

# Summarize/Plot
quijote_sentiments |>
  summarise(across(everything(), sum)) |>
  pivot_longer(everything(), names_to = "sentiment", values_to = "count") |>
  ggplot(aes(reorder(sentiment, count), count)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Sentiments in Don Quijote Sample")