Chapter 14 Data Science in the Age of AI
The field of Data Science is in a state of constant evolution. We started by learning how to handle vectors and lists in Base R, moved on to the elegance of the tidyverse for data manipulation, and explored the robustness of tidymodels for machine learning. Now, we are facing a new paradigm shift: Generative AI.
Just as the calculator did not replace the mathematician, Large Language Models (LLMs) will not replace the Data Scientist. However, a Data Scientist using AI will likely replace one who does not.
In this part of the book, we will demystify these “magic black boxes”. We will learn what they are, how to control them programmatically from R, and how to use them to unlock unstructured data that was previously inaccessible.
14.1 What is a Large Language Model?
To work effectively with LLMs, we must stop treating them as “people” and start treating them as probabilistic engines.
14.1.1 It’s all about Probability
At its most fundamental level, an LLM like GPT-4, Claude, or Llama is a “next token prediction machine”. It has been trained on a massive corpus of text (books, websites, code repositories) to answer a simple statistical question:
Given the sequence of text “The capital of France is…”, what is the most likely next piece of text?
The model does not “know” geography. It knows that, statistically, the token “Paris” appears more frequently after that sequence than “London” or “Potato”.
14.1.2 Tokens vs. Words
We often think models read words, but they actually process tokens, which can be whole words, fragments, or even spaces. For instance, “apple” might be a single token, while a complex word like “antidisestablishmentarianism” could be split into four or five. A useful rule of thumb is that 1,000 tokens are roughly equivalent to 750 words. This distinction is critical for two reasons: Cost, as you are billed by the token for both input and output; and Context Window, which serves as the model’s short-term memory. A model with a 128k context window can effectively “remember” about 96,000 words of conversation before it begins to lose track of the beginning.
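To make the rule of thumb concrete, here is a back-of-the-envelope estimator; real tokenizers split text differently, so treat the numbers as rough approximations only.

``` r
# Rough token estimate from the "1,000 tokens ≈ 750 words" rule of thumb
estimate_tokens <- function(text) {
  n_words <- lengths(strsplit(text, "\\s+"))
  ceiling(n_words / 0.75)
}

estimate_tokens("The capital of France is Paris")  # roughly 8 "tokens"
```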
14.1.3 Temperature: Controlling Creativity
One of the most important parameters you can control is Temperature, which dictates the randomness of the output. A temperature of 0 makes the model deterministic, always selecting the most probable next token—ideal for tasks requiring precision like data extraction, coding, or math. Conversely, raising the temperature to 1 or higher encourages the model to take risks and choose less likely tokens, making it suitable for creative writing, brainstorming, and poetry.
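Under the hood, temperature rescales the model's token probabilities before one is sampled. The toy sketch below is not a real model, just the arithmetic: dividing the log-probabilities by the temperature sharpens the distribution when the temperature is low and flattens it when it is high.

``` r
# Toy illustration of temperature scaling (softmax over log-probabilities)
probs <- c(Paris = 0.90, Lyon = 0.07, London = 0.03)

with_temperature <- function(p, temp) {
  scaled <- exp(log(p) / temp)
  scaled / sum(scaled)
}

round(with_temperature(probs, temp = 0.2), 3)  # sharpened: "Paris" dominates
round(with_temperature(probs, temp = 2.0), 3)  # flattened: riskier choices become likely
```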
[!TIP] For Data Science, start at 0. When writing code or extracting data, we want reliability, not creativity.
14.2 Setting Up Your AI Environment
Before we write code, we must secure our environment. Accessing high-quality models usually requires an API Key (from OpenAI, Anthropic, Google, etc.).
[!DANGER] NEVER paste your API key directly into your R script. If you push that script to GitHub, bots will steal your key in seconds and drain your bank account.

Anonymize your data: if you must use a public tool, rename columns (Client_A, Revenue_X) and inject fake values before prompting.
14.2.1 The Solution: Local LLMs
For sensitive data, the best solution is running a Local LLM on your own machine using tools like Ollama or LM Studio. This approach ensures 100% privacy and offline access, though it does come with trade-offs: it requires a capable computer (such as a Mac M-series or NVIDIA GPU), and local models are typically smaller and less capable than massive cloud models like GPT-4.
14.2.2 The .Renviron File
The standard way to handle secrets in R is the .Renviron file. This file lives in your project’s root or your home directory and is not tracked by Git (ensure it is in your .gitignore).
The workflow has four steps (a minimal sketch follows below):

1. Open or create the file using R.
2. Add your keys, one KEY=value pair per line.
3. Restart your R session.
4. Access them in R with Sys.getenv().
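A minimal sketch of those steps, assuming you use the usethis helper (editing ~/.Renviron by hand works just as well):

``` r
# Step 1: open (or create) the user-level .Renviron file
usethis::edit_r_environ()

# Step 2: add your keys to the file, one per line, without quotes:
# OPENAI_API_KEY=sk-...

# Step 3: restart your R session so the file is re-read

# Step 4: read the key whenever you need it
api_key <- Sys.getenv("OPENAI_API_KEY")
```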
14.3 AI as the “Pair Programmer”
The most immediate value of AI is not replacing your analysis, but accelerating the code you write to perform it.
14.3.1 The Great Refactorer
We all have old code: nested for loops, variable names like x1, x2, and manual indexing. AI excels at modernizing legacy code.
Scenario: You have this Base R code to filter and clean data:
# Old Code
data <- read.csv("sales.csv")
clean_data <- data[data$amount > 100, ]
clean_data$date <- as.Date(clean_data$date)
final <- clean_data[order(clean_data$date), ]

Prompt to AI:
> “Refactor this R code to use the tidyverse and the pipe (|>) operator. Ensure variable names are snake_case.”
AI Output:
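The exact reply will vary, but a typical refactor looks something like this (the column names and file name come from the snippet above):

``` r
# Refactored with the tidyverse and the native pipe
library(tidyverse)

final_data <- read_csv("sales.csv") |>
  filter(amount > 100) |>
  mutate(date = as.Date(date)) |>
  arrange(date)
```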
14.3.2 The Translator
One of the hardest parts of learning R is knowing which package does what you want. You can describe your intent in plain English (or Spanish!) and get the function.
Example Prompt:
> “I have this R code using purrr::map. Can you explain what it does in simple terms and suggest if there is a more modern way to write it?”
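As a concrete illustration, this is the kind of snippet you might paste alongside that prompt, together with the more modern, type-stable alternative the model will likely point out:

``` r
library(purrr)

# Original: returns a list of column means
col_means <- map(mtcars, mean)

# A likely suggestion: map_dbl() returns a named numeric vector instead
col_means_dbl <- map_dbl(mtcars, mean)
```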
14.3.3 Pro Tip: Prompt Engineering 101
Getting good code from an LLM isn’t magic; it’s engineering. A high-quality prompt typically combines four key components. First, establish a Role (“You are an expert R programmer…”) to frame the model’s perspective. Second, clearly define the Task (“Write a function to…”). Third, set explicit Constraints (“Use dplyr, not base R; do not assume clean data”). Finally, specify the desired Format (“Return the code in a single block with comments”) to ensure the output matches your needs.
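For instance, the four components might be assembled into a single prompt string like this (the task itself is only an illustration):

``` r
# Role + Task + Constraints + Format, assembled into one prompt
prompt <- paste(
  "You are an expert R programmer.",                             # Role
  "Write a function that summarises missing values by column.",  # Task
  "Use dplyr, not base R; do not assume the data is clean.",     # Constraints
  "Return the code in a single block with comments.",            # Format
  sep = "\n"
)

cat(prompt)
```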
[!TIP] Iterate. Your first prompt uses vague terms. Your second prompt clarifies them. Your third prompt gets the perfect answer.

Example Prompt:
> “I have a date column ‘2023-12-25’. I want to extract the week number of the year. Which lubridate function should I use?”
AI Output:
> “You should use lubridate::isoweek() or lubridate::week().”
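Before trusting the answer, verify it in the console; both functions exist in lubridate, they simply count weeks differently (ISO-8601 weeks versus blocks of seven days from January 1st):

``` r
library(lubridate)

isoweek(as.Date("2023-12-25"))  # 52 (ISO-8601 week, weeks start on Monday)
week(as.Date("2023-12-25"))     # 52 (full 7-day blocks since January 1st)
```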
14.3.4 The Regex Master
Regular Expressions (Regex) are powerful but notoriously difficult to write. This is arguably the best use case for LLMs.
Scenario: You have a column with messy Peru phone numbers: (51) 999-999-999, +51 999 999 999.
Prompt:
> “I have inconsistent phone numbers. Write a regex compatible with stringr to extract only the 9 digits of the mobile number, ignoring country code.”
AI Output:
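The exact pattern the model returns will vary. One plausible answer, sketched with stringr, assumes Peruvian mobile numbers are the trailing nine digits and start with 9:

``` r
library(stringr)

phones <- c("(51) 999-999-999", "+51 999 999 999")

phones |>
  str_remove_all("\\D") |>   # drop spaces, dashes, parentheses and "+"
  str_extract("9\\d{8}$")    # keep the trailing nine digits starting with 9
```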
14.3.5 The Error Decoder
R error messages can be cryptic:

- “Error in result[[1]] : subscript out of bounds”
- “Error: aesthetics must be either length 1 or the same as the data”
Instead of staring at the screen, paste the error and the code chunk into the AI. It will usually pinpoint the exact mismatch in list lengths or ggplot layers.
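For example, the second error above is easy to trigger by hard-coding an aesthetic of the wrong length. A minimal reproducible example like the one below (with the offending line commented out) is usually all the context the AI needs to pinpoint the fix:

``` r
library(ggplot2)

df <- data.frame(x = 1:5, y = 1:5)

# Bug: a hard-coded colour vector of length 3 against 5 rows of data
# ggplot(df, aes(x, y, colour = c("a", "b", "c"))) + geom_point()

# Fix: map colour to a column of the data (or use a single constant value)
ggplot(df, aes(x, y, colour = factor(x %% 2))) + geom_point()
```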
14.4 The Risks: Hallucinations
We cannot finish this introduction without a warning. LLMs are people pleasers. They want to give you an answer, even if they have to invent it.
14.4.1 The “Package” Hallucination
It is common for an LLM to invent an R function that should exist but doesn’t.
User: “How do I calculate the Gini coefficient in dplyr?”
AI: “Just use summarize(gini = gini_coeff(income))!”
There is no gini_coeff() function in dplyr. The answer sounded plausible, but running it will crash your script. Always verify that a function exists in the Help tab (?function_name); a working alternative is sketched below.
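What the model should have suggested is either a dedicated package (ineq and DescTools both provide a Gini function) or a manual computation inside summarize(). A hedged sketch of the manual route, where survey_df and income are placeholder names:

``` r
library(dplyr)

# Gini coefficient: mean absolute difference between all pairs of values,
# divided by twice the mean (a standard textbook formula)
gini_manual <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  sum(abs(outer(x, x, "-"))) / (2 * n^2 * mean(x))
}

# Hypothetical usage on a data frame with an income column
# survey_df |> summarize(gini = gini_manual(income))
```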
In the next chapter, we will stop chatting and start coding. We will build an engine to send data to the AI and get structured insights back.
# LLMs as an Analysis Engine {#genai-api}
In the previous chapter, we treated AI as a chatbot that helps us write code. Now, we are going to flip the script. We will treat the Large Language Model as a **function** within our R code: one that accepts unstructured text as input and returns structured data as output.
This is the transition from "Chatting with AI" to "Building with AI".
## The API Economy
To interact with models programmatically, we use **APIs** (Application Programming Interfaces). Instead of a web interface, we send HTTP requests.
While there are R packages like `openai`, `ellmer` or `chattr` that wrap these APIs, as a Data Scientist it is critical to understand how to build the connection yourself using `httr2`. This gives you full control over error handling, retries, and costs.
### Prerequisite: The Setup
Ensure you have your API key stored in the `.Renviron` file as discussed in the previous chapter.
``` r
library(tidyverse)
library(httr2)
library(jsonlite)

# Reload environment if needed
readRenviron(".Renviron")
```
14.5 Building a Robust Request
A production-quality API request needs more than just a URL. It needs Authentication, Retry Logic, and Error Handling.
Let’s build a wrapper function to query OpenAI’s GPT-4o-mini (a cost-effective model).
query_openai <- function(prompt, system_prompt = "You are a helpful assistant.") {
  api_key <- Sys.getenv("OPENAI_API_KEY")
  if (api_key == "") stop("Error: OPENAI_API_KEY not found in environment.")

  # 1. Construct the Request
  req <- request("https://api.openai.com/v1/chat/completions") |>
    req_headers(Authorization = paste("Bearer", api_key)) |>
    req_body_json(list(
      model = "gpt-4o-mini",
      messages = list(
        list(role = "system", content = system_prompt),
        list(role = "user", content = prompt)
      ),
      temperature = 0 # Deterministic for data tasks
    )) |>
    # 2. Add Robustness: retry up to 3 times on transient failures (e.g. rate limits)
    req_retry(max_tries = 3, backoff = ~ 2^.x) |> # Exponential backoff: 2, 4, 8 seconds
    req_throttle(rate = 100 / 60) # At most 100 requests per minute

  # 3. Perform the Request (req_perform() errors on HTTP failures)
  response <- req_perform(req)

  # 4. Parse the content
  result <- response |> resp_body_json()
  return(result$choices[[1]]$message$content)
}

Now we have a function query_openai() that we can use like any other R function.
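A quick sanity check of the wrapper (note that this makes a real, billable API call, so keep it tiny):

``` r
query_openai("In one word, what is the capital of France?")
# Expected: a short string such as "Paris"
```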
14.6 The Holy Grail: Structured Data Extraction
The biggest problem with LLMs is that they love to talk. If you ask for a sentiment score, they might say: “Here is the sentiment score you requested based on my analysis: Positive.”
We don’t want that. We want "Positive". Or even better, we want a JSON object.
14.6.1 Forcing JSON Output
Most modern models support “JSON Mode”. This guarantees the output is machine-readable valid JSON.
Let’s say we have a dataset of raw customer reviews and want to extract specific insights. We need to capture the Sentiment (Positive, Neutral, or Negative), a list of mentioned Topics, and the Urgency level, flagged as ‘High’ if the user is angry or at risk of churning and ‘Low’ otherwise.
extract_review_data <- function(review_text) {
  system_instructions <- "
  You are a data extraction engine.
  Extract the following fields from the user review and return ONLY a JSON object:
  - sentiment: 'Positive', 'Neutral', or 'Negative'
  - topics: a list of strings (e.g., ['Price', 'UX'])
  - urgency: 'High' if the user is angry/churning, else 'Low'
  "

  # Note: To enforce strict JSON, we often need to tell the model in the prompt
  # AND set response_format = { type: 'json_object' } if supported.
  response_json <- query_openai(review_text, system_prompt = system_instructions)

  # Parse JSON string to R list
  return(fromJSON(response_json))
}

14.7 Batch Processing: The purrr Workflow
Now, let’s apply this to a Data Frame. When processing hundreds of rows, we must be careful. First, we need to respect Rate Limits, as APIs will block you if you send too many requests too quickly (e.g., 1000 in a second). Second, consider Cost by always testing on a small sample like head(df, 5) before running the full job. Finally, ensure Error Safety: if row 99 fails, we want to capture that error gracefully so the entire loop doesn’t crash.
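A rough cost check before committing to the full run; every number below is an assumption, so check your provider’s current prices:

``` r
# Back-of-the-envelope cost estimate (all figures are assumptions)
n_rows <- 500                  # rows to process
tokens_per_review <- 200       # rough input tokens per review
price_per_1k_input <- 0.00015  # assumed USD per 1,000 input tokens

n_rows * tokens_per_review / 1000 * price_per_1k_input  # ~ $0.015
```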
We use purrr::map together with possibly() (or safely()) so that a single failing row does not crash the whole run, and for API calls it is wise to add a small Sys.sleep() between requests.
# Sample Data
reviews_df <- tibble(
  id = 1:3,
  text = c(
    "I love this product! Best purchase ever.",
    "The delivery was late and the item is broken. I want a refund.",
    "It's okay, but a bit expensive for what it is."
  )
)

# 1. Create a Safe Function (returns NULL instead of crashing)
safe_extract <- possibly(extract_review_data, otherwise = NULL)

# 2. Iterate
results_df <- reviews_df |>
  mutate(ai_data = map(text, function(t) {
    Sys.sleep(0.5) # Be polite to the API
    safe_extract(t)
  })) |>
  # 3. Unnest the JSON structure
  unnest_wider(ai_data)

print(results_df)

Resulting Data Frame:
| id | text | sentiment | topics | urgency |
|---|---|---|---|---|
| 1 | I love… | Positive | [“Product”] | Low |
| 2 | The delivery… | Negative | [“Shipping”, “Product”] | High |
| 3 | It’s okay… | Neutral | [“Price”] | Low |
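With those extracted columns in place, the rest is ordinary tidyverse work. A sketch, assuming results_df and the columns shown above:

``` r
# Filter to urgent reviews for a follow-up queue
results_df |>
  filter(urgency == "High") |>
  select(id, text, sentiment)

# Plot the sentiment distribution
results_df |>
  count(sentiment) |>
  ggplot(aes(sentiment, n)) +
  geom_col()
```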
14.8 Summary
We have turned an unstructured text column into usable columns for filtering and plotting. This is the true power of “LLMs as Data Engines”.
- We use httr2 for robust connections.
- We use System Prompts to force JSON structure.
- We use purrr and unnest_wider to flatten that AI insight back into our tidyverse workflow.
In the next chapter, we will discuss Ethics. But before that, there is one more superpower we need to unlock: Embeddings.
14.9 Beyond Generation: Embeddings
So far, we have used LLMs to generate text. But they can also understand text by converting it into numbers. This is called an Embedding.
An embedding is a list of numbers (a vector, e.g., 1536 numbers long) that represents the semantic meaning of a text.
Consider the difference between a “Dog” and a “Puppy”; their corresponding vectors will be mathematically very close because they share similar semantic meanings. In contrast, “Dog” and “Sandwich” will be far apart. This capability powers Semantic Search. Unlike a standard keyword search that looks for exact matches like “Climate” or “Change”—and potentially misses relevant documents—a semantic search converts your query into a vector. It then finds documents with the closest vectors, allowing you to retrieve a report on “Global warming effect on maize” even if it doesn’t contain the exact words from your query “Climate change impact on corn.”
14.9.1 R Implementation
Getting an embedding is just another API call:
get_embedding <- function(text) {
  req <- request("https://api.openai.com/v1/embeddings") |>
    req_headers(Authorization = paste("Bearer", Sys.getenv("OPENAI_API_KEY"))) |>
    req_body_json(list(
      model = "text-embedding-3-small",
      input = text
    ))

  resp <- req_perform(req)

  # Extract the vector
  resp |> resp_body_json() |> pluck("data", 1, "embedding") |> unlist()
}
v1 <- get_embedding("The dog barked")
v2 <- get_embedding("The canine made noise")
# Cosine similarity: dot product divided by the product of the vector magnitudes
cosine_sim <- sum(v1 * v2) / (sqrt(sum(v1^2)) * sqrt(sum(v2^2)))
cosine_sim # The result will be very high (close to 1).

This vectorization is the foundation of RAG (Retrieval Augmented Generation), which allows you to chat with your own PDFs.
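As a final sketch, here is how those embeddings power a tiny semantic search, reusing get_embedding() from above (the documents and query are just illustrations):

``` r
cosine_similarity <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

docs <- c(
  "Global warming effect on maize yields",
  "Recipe for a grilled cheese sandwich"
)

# Embed every document once, then embed the query and rank by similarity
doc_vectors <- map(docs, get_embedding)
query_vec   <- get_embedding("Climate change impact on corn")

scores <- map_dbl(doc_vectors, cosine_similarity, b = query_vec)
docs[which.max(scores)]  # should retrieve the maize document, despite no shared keywords
```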