Chapter 16 Case Study: Real Estate Market Analysis

In this case study, we will apply our data transformation and visualization skills to analyzing the real estate market. We will use the txhousing dataset provided by the ggplot2 package, which contains information about housing sales in Texas.

This dataset allows us to explore concepts like tracking value over time, comparing categories, and analyzing transaction volume—skills that transfer directly to financial analysis, sales reporting, and business intelligence.

16.1 Objectives

  1. Data Cleaning: Handle missing values and format dates.
  2. Transformation: Aggregate data by city and year to find trends.
  3. Visualization: Create time-series plots to analyze market health.
  4. Analysis: Identify cities with the highest growth and stability.

16.2 Loading Libraries

We will load the tidyverse along with a few helper packages: lubridate for working with dates, scales for formatting axis labels, and ggthemes for additional plot themes.

library(tidyverse)
library(lubridate)
library(scales)
library(ggthemes)

16.3 Exploring the Data

First, let’s load and inspect the data.

data("txhousing")
glimpse(txhousing)
#> Rows: 8,602
#> Columns: 9
#> $ city      <chr> "Abilene", "Abilene", "Abilene", "Abilene", "Abilene", "Abil…
#> $ year      <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, …
#> $ month     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, …
#> $ sales     <dbl> 72, 98, 130, 98, 141, 156, 152, 131, 104, 101, 100, 92, 75, …
#> $ volume    <dbl> 5380000, 6505000, 9285000, 9730000, 10590000, 13910000, 1263…
#> $ median    <dbl> 71400, 58700, 58100, 68600, 67300, 66900, 73500, 75000, 6450…
#> $ listings  <dbl> 701, 746, 784, 785, 794, 780, 742, 765, 771, 764, 721, 658, …
#> $ inventory <dbl> 6.3, 6.6, 6.8, 6.9, 6.8, 6.6, 6.2, 6.4, 6.5, 6.6, 6.2, 5.7, …
#> $ date      <dbl> 2000.000, 2000.083, 2000.167, 2000.250, 2000.333, 2000.417, …

The dataset contains the following variables:

  - city: Name of the city.
  - year, month: Date components.
  - sales: Number of sales.
  - volume: Total value of sales.
  - median: Median sale price.
  - listings: Total active listings.
  - inventory: "Months inventory": the amount of time it would take to sell all current listings at the current sales pace.
  - date: Date in decimal format (e.g., 2000.083 for February 2000).
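Before cleaning anything, it helps to get a sense of the dataset's scope. The quick check below (a minimal sketch using only the columns shown above) counts the cities tracked and the range of years covered.

# How many cities does the dataset cover, and over what period?
txhousing |>
  summarise(
    n_cities   = n_distinct(city),
    first_year = min(year),
    last_year  = max(year)
  )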

16.4 Data Cleaning

Real-world data often has missing values (NA). Let’s check how many missing values we have in the sales column.

sum(is.na(txhousing$sales))
#> [1] 568
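The sales column is not the only one with gaps. A quick way to count missing values in every column at once (a small base R sketch) is:

# Count NAs per column
colSums(is.na(txhousing))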

There are 568 records with no sales data. For our analysis of market volume, we should remove these incomplete records. We will create a clean dataset called housing_clean.

We will also create a proper date column using lubridate::make_date(), which is easier to work with than the decimal date.

housing_clean <- txhousing |> 
  filter(!is.na(sales)) |> 
  mutate(date_proper = make_date(year, month, 1)) |> 
  select(-date) # Remove the decimal date
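As a quick sanity check, we can confirm that no missing sales values remain and that the new column is a proper Date:

# Verify the cleaning step
sum(is.na(housing_clean$sales))
#> [1] 0
class(housing_clean$date_proper)
#> [1] "Date"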

16.5 Exploratory Analysis

16.5.1 Market Volume Over Time

Let’s look at the total sales volume across all of Texas over time. This gives us a “macro” view of the market, similar to how we might look at total portfolio value in a financial context.

# Aggregate by date
total_market <- housing_clean |> 
  group_by(date_proper) |> 
  summarise(
    total_sales = sum(sales),
    total_volume = sum(volume, na.rm = TRUE)
  )

# Plot Volume
total_market |> 
  ggplot(aes(x = date_proper, y = total_volume)) +
  geom_line(color = "steelblue") +
  scale_y_continuous(labels = label_dollar(scale = 1e-9, suffix = "B")) +
  theme_minimal() +
  labs(
    title = "Total Texas Housing Market Volume",
    subtitle = "Monthly Total Sales Volume (Billions)",
    x = "Year",
    y = "Volume ($)"
  )

We can clearly see the seasonality (peaks in summer) and the impact of the 2008 financial crisis (dip around 2008-2010), followed by a strong recovery.
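To see the downturn in numbers rather than on the plot, we can roll the same data up to yearly totals (a small sketch reusing total_market; the exact figures depend on the records kept after cleaning):

# Total volume per year, in billions of dollars
total_market |>
  mutate(year = year(date_proper)) |>
  group_by(year) |>
  summarise(volume_billions = sum(total_volume) / 1e9)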

16.5.2 Comparing Cities

Just as we might compare different companies or portfolios, let’s compare the median housing prices in the major cities. We’ll focus on the “Big 4” Texas cities: Austin, Dallas, Houston, and San Antonio.

major_cities <- c("Austin", "Dallas", "Houston", "San Antonio")

city_trends <- housing_clean |> 
  filter(city %in% major_cities)

city_trends |> 
  ggplot(aes(x = date_proper, y = median, color = city)) +
  geom_line(alpha = 0.7) +
  theme_minimal() +
  scale_y_continuous(labels = label_dollar()) +
  labs(
    title = "Median Housing Prices in Major Cities",
    x = "Year",
    y = "Median Price",
    color = "City"
  )

Austin clearly shows the steepest growth curve, especially post-2012.
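To put a rough number on that visual impression, we can compare each city's first and most recent median price (a quick sketch; this is total growth over the whole series, not an annualized rate):

# Percentage change in median price from each city's first to last record
city_trends |>
  filter(!is.na(median)) |>
  group_by(city) |>
  summarise(
    first_median = median[which.min(date_proper)],
    last_median  = median[which.max(date_proper)],
    growth_pct   = round((last_median / first_median - 1) * 100, 1)
  )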

16.6 Creating Indicators

In financial analysis, we often create ratios. Here, let's look at inventory, which is a measure of supply vs. demand:

  - High inventory: buyer's market (prices might drop).
  - Low inventory: seller's market (prices might rise).

Let’s look at the average inventory per year for these cities.

city_inventory <- city_trends |> 
  group_by(city, year) |> 
  summarise(avg_inventory = mean(inventory, na.rm = TRUE), .groups = "drop")

city_inventory |> 
  ggplot(aes(x = year, y = avg_inventory, color = city)) +
  geom_line(linewidth = 1) +
  theme_fivethirtyeight() +
  labs(
    title = "Market Health: Months of Inventory",
    subtitle = "Lower means easier to sell (Seller's Market)",
    color = "City"
  )

We see a convergence around 2014-2015 where inventory became very tight across all major cities.
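To check this numerically, we can pull the 2014-2015 values out of city_inventory (a quick look, nothing more):

# Average months of inventory in the major cities, 2014-2015
city_inventory |>
  filter(year %in% c(2014, 2015)) |>
  arrange(year, avg_inventory)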

16.7 Try It Yourself

  1. Expand the Analysis: Add “Fort Worth” to major_cities and re-run the median price comparison. How does it compare to the Big 4?

  2. Calculate Growth: Compute the year-over-year percentage change in median price for Austin. Which year had the highest growth?

  3. Seasonality Deep Dive: Which month typically has the highest sales volume? Create a boxplot of sales by month to visualize seasonal patterns.

16.8 Conclusions

Through this case study, we performed essential Data Science tasks on a real dataset:

  1. Cleaning: Handling NA values and formatting dates with lubridate.
  2. Aggregation: Summarizing billions of dollars of volume into clear trend lines.
  3. Comparison: Benchmarking cities against each other to identify leaders.
  4. Indicators: Creating business-relevant metrics like months of inventory.

[!TIP] This exact workflow applies to countless domains: stock prices, customer churn, inventory management, or any time-series business data. Master the pattern here, and you can adapt it anywhere.