Lab 1: Census Data Quality for Policy Decisions

Evaluating Data Reliability for Algorithmic Decision-Making

Author

Shaurya Chauhan

Published

February 21, 2026

Assignment Overview

Scenario

You are a data analyst for the California Department of Human Services. The department is considering implementing an algorithmic system to identify communities that should receive priority for social service funding and outreach programs. Your supervisor has asked you to evaluate the quality and reliability of available census data to inform this decision.

Drawing on our Week 2 discussion of algorithmic bias, you need to assess not just what the data shows, but how reliable it is and what communities might be affected by data quality issues.

Learning Objectives

  • Apply dplyr functions to real census data for policy analysis
  • Evaluate data quality using margins of error
  • Connect technical analysis to algorithmic decision-making
  • Identify potential equity implications of data reliability issues
  • Create professional documentation for policy stakeholders

Submission Instructions

Submit by posting your updated portfolio link on Canvas. Your assignment should be accessible at your-portfolio-url/labs/lab_1/

Make sure to update your _quarto.yml navigation to include this assignment under a “Labs” menu.

Part 1: Portfolio Integration

Create this assignment in your portfolio repository under a labs/lab_1/ folder structure. Update your navigation menu to include:

- text: Assignments
  menu:
    - href: labs/lab_1/your_file_name.qmd
      text: "Lab 1: Census Data Exploration"

If the text contains a special character such as a colon, wrap it in double quotes so that Quarto reads it as text.

Setup

# Load required packages (hint: you need tidycensus, tidyverse, and knitr)
library(tidycensus)
library(tidyverse)
library(knitr)

# Set your Census API key

# Choose your state for analysis - assign it to a variable called my_state
my_state <- 'California'

State Selection: I have chosen California for this analysis because I have previously worked with datasets from New York and Pennsylvania and wanted to try a West Coast state.

Part 2: County-Level Resource Assessment

2.1 Data Retrieval

Your Task: Use get_acs() to retrieve county-level data for your chosen state.

Requirements:

  • Geography: county level
  • Variables: median household income (B19013_001) and total population (B01003_001)
  • Year: 2022
  • Survey: acs5
  • Output format: wide

Hint: Remember to give your variables descriptive names using the variables = c(name = "code") syntax.

# Write your get_acs() code here
ca_data <- get_acs(
  geography = "county",
  variables = c(
    total_pop = "B01003_001",
    hshld_income = "B19013_001"
  ),
  state = my_state,   # defined in Setup
  year = 2022,
  survey = "acs5",    # 5-year estimates, stated explicitly per the requirements
  output = "wide"
)

# Clean the county names to remove state name and "County" 
# Hint: use mutate() with str_remove()
ca_clean <- ca_data %>%
  mutate(
    county_name = str_remove(NAME, " County, California$")
  )


# Display the first few rows
head(ca_clean)
# A tibble: 6 × 7
  GEOID NAME       total_popE total_popM hshld_incomeE hshld_incomeM county_name
  <chr> <chr>           <dbl>      <dbl>         <dbl>         <dbl> <chr>      
1 06001 Alameda C…    1663823         NA        122488          1231 Alameda    
2 06003 Alpine Co…       1515        206        101125         17442 Alpine     
3 06005 Amador Co…      40577         NA         74853          6048 Amador     
4 06007 Butte Cou…     213605         NA         66085          2261 Butte      
5 06009 Calaveras…      45674         NA         77526          3875 Calaveras  
6 06011 Colusa Co…      21811         NA         69619          5745 Colusa     

2.2 Data Quality Assessment

Your Task: Calculate margin of error percentages and create reliability categories.

Requirements:

  • Calculate MOE percentage: (margin of error / estimate) * 100
  • Create reliability categories:
      • High Confidence: MOE < 5%
      • Moderate Confidence: MOE 5-10%
      • Low Confidence: MOE > 10%
  • Create a flag for unreliable estimates (MOE > 10%)

Hint: Use mutate() with case_when() for the categories.

# Calculate MOE percentage and reliability categories using mutate()
ca_reliability <- ca_clean %>%
  mutate(
    moe_percentage = round((hshld_incomeM / hshld_incomeE) * 100, 2),

    # case_when() evaluates top to bottom, so the second rule covers 5-10%
    reliability = case_when(
      moe_percentage < 5 ~ "High Confidence",
      moe_percentage <= 10 ~ "Moderate",
      moe_percentage > 10 ~ "Low Confidence"
    ),

    # Flag for unreliable estimates (MOE > 10%)
    unreliable = moe_percentage > 10
  )

# Create a summary showing count of counties in each reliability category
# Hint: use count() and mutate() to add percentages
reliability_summary <- ca_reliability %>%
  group_by(reliability) %>%
  summarize(
    counties = n(),
    avg_income = round(mean(hshld_incomeE, na.rm = TRUE), 0)
  ) %>%
  # Share of all counties that falls in each category
  mutate(pct_of_counties = round(counties / sum(counties) * 100, 1))

2.3 High Uncertainty Counties

Your Task: Identify the 5 counties with the highest MOE percentages.

Requirements:

  • Sort by MOE percentage (highest first)
  • Select the top 5 counties
  • Display: county name, median income, margin of error, MOE percentage, reliability category
  • Format as a professional table using kable()

Hint: Use arrange(), slice(), and select() functions.

# Create table of top 5 counties by MOE percentage
# Sorting and slicing is enough to get the top 5; no threshold filter is needed
high_uncertainty <- ca_reliability %>%
  arrange(desc(moe_percentage)) %>%
  slice_head(n = 5) %>%
  select(county_name, hshld_incomeE, moe_percentage, hshld_incomeM, reliability)

glimpse(high_uncertainty)
Rows: 5
Columns: 5
$ county_name    <chr> "Mono", "Alpine", "Sierra", "Trinity", "Plumas"
$ hshld_incomeE  <dbl> 82038, 101125, 61108, 47317, 67885
$ moe_percentage <dbl> 18.76, 17.25, 15.12, 12.45, 11.45
$ hshld_incomeM  <dbl> 15388, 17442, 9237, 5890, 7772
$ reliability    <chr> "Low Confidence", "Low Confidence", "Low Confidence", "…
# Format as table with kable() - include appropriate column names and caption
kable(high_uncertainty,
      col.names = c("County", "Household Income", "MOE %", "MOE", "Reliability Category"),
      caption = "Counties with Highest Income Data Uncertainty",
      format.args = list(big.mark = ","))
Counties with Highest Income Data Uncertainty

County    Household Income   MOE %   MOE      Reliability Category
Mono      82,038             18.76   15,388   Low Confidence
Alpine    101,125            17.25   17,442   Low Confidence
Sierra    61,108             15.12   9,237    Low Confidence
Trinity   47,317             12.45   5,890    Low Confidence
Plumas    67,885             11.45   7,772    Low Confidence

Data Quality Commentary:

For these five counties the margin of error exceeds 10% of the estimate, so reliability is low. This makes decision-making difficult for authorities, because the estimates are not precise enough to support targeted policy choices. The higher uncertainty might be related to adjustments made for households that did not respond, or to the weighting applied to make the sample representative, which may compound variation in the data.
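One way to put these percentages in context is to convert them to a coefficient of variation (CV). ACS margins of error are published at the 90 percent confidence level, so the implied standard error is the MOE divided by 1.645. The sketch below applies that conversion to the five rows in the table above. The helper name moe_to_cv is illustrative and is not part of tidycensus.

```r
# Illustrative helper (not from tidycensus): convert a 90% ACS margin of
# error into a coefficient of variation, expressed as a percentage.
moe_to_cv <- function(estimate, moe, z = 1.645) {
  se <- moe / z            # standard error implied by a 90% MOE
  (se / estimate) * 100    # CV in percent
}

# Values copied from the high-uncertainty table above
top5 <- data.frame(
  county   = c("Mono", "Alpine", "Sierra", "Trinity", "Plumas"),
  estimate = c(82038, 101125, 61108, 47317, 67885),
  moe      = c(15388, 17442, 9237, 5890, 7772)
)

top5$cv_pct <- round(moe_to_cv(top5$estimate, top5$moe), 2)
top5
```

Because the CV is simply the MOE percentage divided by 1.645, it ranks the counties in the same order; it only changes the scale on which a cutoff would be set.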

Part 3: Neighborhood-Level Analysis

3.1 Focus Area Selection

Your Task: Select 2-3 counties from your reliability analysis for detailed tract-level study.

Strategy: Choose counties that represent different reliability levels (e.g., 1 high confidence, 1 moderate, 1 low confidence) to compare how data quality varies.

# Use filter() to select 2-3 counties from your ca_reliability data
# Store the selected counties in a variable called selected_counties
# One county per reliability category, named explicitly so the selection is reproducible
selected_counties <- ca_reliability %>%
  filter(county_name %in% c("Butte", "San Benito", "Plumas"))


# Display the selected counties with their key characteristics
# Show: county name, median income, MOE percentage, reliability category
selected_counties %>%
  select(county_name, hshld_incomeE, moe_percentage, reliability)
# A tibble: 3 × 4
  county_name hshld_incomeE moe_percentage reliability    
  <chr>               <dbl>          <dbl> <chr>          
1 Butte               66085           3.42 High Confidence
2 Plumas              67885          11.4  Low Confidence 
3 San Benito         104451           5.23 Moderate       

Comment on the output: Reliability tends to increase with the population estimate, although the relationship is not strictly proportional. In the full county table, large counties such as Los Angeles and Santa Clara have income MOEs of 1% or less, while Alpine, with a population estimate of only 1,515, has one of the least reliable household income estimates (MOE 17.25%).
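A practical consequence of these margins is that two counties with different point estimates may not be distinguishable. The sketch below applies the standard Z test for the difference between two ACS estimates to Plumas and Butte, using the estimates and margins of error reported earlier in this lab. It assumes the two estimates are independent and that the MOEs are at the 90 percent level; the function name acs_z is illustrative.

```r
# Z statistic for the difference between two ACS estimates, assuming
# independent estimates and 90% margins of error.
acs_z <- function(est1, moe1, est2, moe2, z = 1.645) {
  se1 <- moe1 / z
  se2 <- moe2 / z
  (est1 - est2) / sqrt(se1^2 + se2^2)
}

# Plumas vs. Butte median household income (values from the output above)
z_stat <- acs_z(est1 = 67885, moe1 = 7772, est2 = 66085, moe2 = 2261)

# TRUE would mean the gap is larger than sampling noise at the 90% level
significant <- abs(z_stat) > 1.645

round(z_stat, 2)
significant
```

Here the statistic is about 0.37, well under 1.645, so the $1,800 gap between the two counties is within sampling noise. An algorithm that ranks counties on point estimates alone would still treat Plumas as the higher-income county.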

3.2 Tract-Level Demographics

Your Task: Get demographic data for census tracts in your selected counties.

Requirements:

  • Geography: tract level
  • Variables: white alone (B03002_003), Black/African American (B03002_004), Hispanic/Latino (B03002_012), total population (B03002_001)
  • Use the same state and year as before
  • Output format: wide
  • Challenge: You’ll need county codes, not names. Look at the GEOID patterns in your county data for hints.
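As a hint for the challenge above, a county GEOID is five characters long: a two-digit state FIPS code followed by a three-digit county FIPS code. The sketch below pulls the county part out of a few GEOIDs taken from the county output shown earlier. It uses base R's substr(); the vector names are illustrative.

```r
# County GEOIDs: 2-digit state FIPS + 3-digit county FIPS.
# These three values come from the county-level output shown earlier.
county_geoids <- c(Alameda = "06001", Alpine = "06003", Butte = "06007")

# Characters 3 to 5 hold the county code
county_codes <- unname(substr(county_geoids, 3, 5))
county_codes
```

A vector like county_codes could then be passed to the county argument of get_acs() to limit a tract request to those counties.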

# Define your race/ethnicity variables with descriptive names
race_vars <- c(
  white_pop = "B03002_003",
  black_pop = "B03002_004",
  latino_pop = "B03002_012",
  total_pop = "B03002_001"
)

# Use get_acs() to retrieve tract-level data
# Hint: You may need to specify county codes in the county parameter
# No county argument is given here, so tracts for every California county are returned
ca_tract_data <- get_acs(
  geography = "tract",
  variables = race_vars,
  state = "CA",
  year = 2022,
  output = "wide"
)

# Add readable tract and county name columns by splitting NAME
# (2022 tract names are "tract; county; state", separated by "; ")
ca_tract_clean <- ca_tract_data %>%
  separate(NAME,
           into = c("tract_name", "county_name", "state_name"),
           sep = "; ")

# Calculate percentage of each group using mutate()
# Create percentages for white, Black, and Hispanic populations
# Tracts with zero total population produce NaN here
tract_pop_percentages <- ca_tract_clean %>%
  mutate(
    pct_white    = round((white_popE / total_popE) * 100, 1),
    pct_black    = round((black_popE / total_popE) * 100, 1),
    pct_hispanic = round((latino_popE / total_popE) * 100, 1)
  )
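The percentages above are derived from two estimates, each with its own margin of error, so the percentage itself also has a margin of error. The sketch below implements the Census Bureau's approximation for the MOE of a proportion in base R; tidycensus offers moe_prop() for the same purpose. The function name moe_proportion and the tract values are hypothetical and are used only for illustration.

```r
# Approximate MOE of a derived proportion p = num / denom.
# If the term under the square root is negative, the ratio form
# (plus instead of minus) is used, following Census Bureau guidance.
moe_proportion <- function(num, denom, moe_num, moe_denom) {
  p <- num / denom
  under_root <- moe_num^2 - p^2 * moe_denom^2
  under_root <- ifelse(under_root < 0,
                       moe_num^2 + p^2 * moe_denom^2,
                       under_root)
  sqrt(under_root) / denom
}

# Hypothetical tract: 400 Hispanic residents (+/- 150) out of 4,000 (+/- 300)
moe_p <- moe_proportion(num = 400, denom = 4000, moe_num = 150, moe_denom = 300)

# Margin of error of the 10% share, in percentage points
round(moe_p * 100, 2)
```

In this hypothetical tract the 10% share carries a margin of roughly 3.7 percentage points, which is a large band for a tract-level comparison.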

3.3 Demographic Analysis

Your Task: Analyze the demographic patterns in your selected areas.

# Find the tract with the highest percentage of Hispanic/Latino residents
# Hint: use arrange() and slice() to get the top tract
highest_latino_pct <- tract_pop_percentages %>%
  arrange(desc(pct_hispanic)) %>%
  slice_head(n = 1)

# Calculate average demographics by county using group_by() and summarize()
# Show: number of tracts, average percentage for each racial/ethnic group
county_averages <- tract_pop_percentages %>%
  group_by(county_name) %>%
  summarize(
    avg_white = round(mean(pct_white, na.rm = TRUE), 1),
    avg_black = round(mean(pct_black, na.rm = TRUE), 1),
    avg_hispanic = round(mean(pct_hispanic, na.rm = TRUE), 1),
    total_tracts = n() # Bonus: see how many tracts are in each county
  )

# Create a nicely formatted table of your results using kable()
# kable() comes from knitr, which is already loaded in Setup
county_averages %>%
  kable(
    col.names = c("County", "% White", "% Black", "% Hispanic", "Total Tracts"),
    caption = "Average Demographics by County",
    align = "lrrrr" # one alignment code per column
  )
Average Demographics by County
County % White % Black % Hispanic Total Tracts
Alameda County 31.0 10.7 21.4 379
Alpine County 58.1 0.0 14.1 1
Amador County 75.7 1.6 14.9 10
Butte County 69.3 1.5 17.4 54
Calaveras County 81.0 0.9 11.6 14
Colusa County 34.0 1.6 60.4 6
Contra Costa County 42.6 8.0 25.1 242
Del Norte County 59.5 2.2 19.6 9
El Dorado County 76.0 0.6 13.8 55
Fresno County 28.4 4.1 54.1 225
Glenn County 54.0 0.3 39.2 8
Humboldt County 72.1 1.3 11.6 36
Imperial County 11.4 2.6 82.3 40
Inyo County 62.1 0.8 23.2 6
Kern County 33.1 4.7 54.0 236
Kings County 29.6 5.7 57.3 31
Lake County 69.3 2.5 20.0 21
Lassen County 70.0 5.0 15.3 9
Los Angeles County 26.3 7.6 47.6 2498
Madera County 33.7 2.4 58.0 34
Marin County 69.2 2.5 16.5 63
Mariposa County 76.8 0.8 13.4 6
Mendocino County 64.6 0.5 24.9 24
Merced County 25.4 2.8 62.1 63
Modoc County 76.6 1.4 15.1 4
Mono County 64.2 0.2 27.8 4
Monterey County 35.2 2.0 52.6 104
Napa County 54.6 2.1 31.7 40
Nevada County 83.4 0.3 9.6 26
Orange County 41.3 1.5 32.4 614
Placer County 70.9 1.4 14.5 92
Plumas County 85.2 0.6 8.6 7
Riverside County 34.6 5.7 49.9 518
Sacramento County 43.2 9.1 23.8 363
San Benito County 32.9 0.8 59.5 12
San Bernardino County 28.5 7.1 53.3 466
San Diego County 45.5 4.4 33.3 737
San Francisco County 39.5 5.1 15.1 244
San Joaquin County 29.8 6.7 43.6 174
San Luis Obispo County 67.1 1.3 23.2 70
San Mateo County 37.9 2.1 23.5 174
Santa Barbara County 46.0 1.8 43.6 109
Santa Clara County 29.6 2.3 25.3 408
Santa Cruz County 56.7 0.8 34.2 70
Shasta County 77.7 0.9 10.6 50
Sierra County 86.6 0.2 11.4 1
Siskiyou County 73.9 1.2 14.5 16
Solano County 36.1 13.0 28.2 100
Sonoma County 63.6 1.4 25.6 122
Stanislaus County 38.7 2.7 48.9 112
Sutter County 45.8 1.8 32.5 21
Tehama County 65.9 0.9 26.0 14
Trinity County 79.2 1.8 7.0 4
Tulare County 27.4 1.2 65.3 103
Tuolumne County 78.1 1.9 13.5 18
Ventura County 45.0 1.7 42.1 190
Yolo County 46.5 2.7 31.1 53
Yuba County 56.0 3.2 26.7 19

Part 4: Comprehensive Data Quality Evaluation

4.1 MOE Analysis for Demographic Variables

Your Task: Examine margins of error for demographic variables to see if some communities have less reliable data.

Requirements:

  • Calculate MOE percentages for each demographic variable
  • Flag tracts where any demographic variable has MOE > 15%
  • Create summary statistics

# Calculate MOE percentages for white, Black, and Hispanic variables
# Hint: use the same formula as before (margin/estimate * 100)
MOE_demographic <- tract_pop_percentages %>%
  mutate(
    moe_percentage_white = round((white_popM / white_popE) * 100, 2),
    moe_percentage_black = round((black_popM / black_popE) * 100, 2),
    moe_percentage_latino = round((latino_popM / latino_popE) * 100, 2),
    moe_percentage_popln_total = round((total_popM / total_popE) * 100, 2),
    
    reliability = case_when(
      moe_percentage_popln_total < 5 ~ "High Confidence",
      moe_percentage_popln_total >= 5 & moe_percentage_popln_total <= 12 ~ "Moderate Confidence",
      moe_percentage_popln_total > 12 ~ "Low Confidence",
      TRUE ~ "No Data"
    )
  )



# Create a flag for tracts with high MOE on any demographic variable
# Use logical operators (| for OR); case_when() is used below in place of ifelse()
MOE_demographic <- MOE_demographic %>%
  mutate(
    high_moe_flag = case_when(
      # Handle cases where population is zero (prevents 100% unreliability)
      (white_popE == 0 & black_popE == 0 & latino_popE == 0) ~ "No Population",
      
      # The strict "OR" logic based on your assigned cutoffs
      (moe_percentage_white > 12 | 
       moe_percentage_black > 12 | 
       moe_percentage_latino > 12) ~ "Unreliable Demographics",
      
      # If it passes both above, it's reliable
      TRUE ~ "Reliable Demographics"
    )
  )

# Create summary statistics showing how many tracts have data quality issues
quality_summary <- MOE_demographic %>%
  count(high_moe_flag) %>%
  mutate(percentage = round((n / sum(n)) * 100, 1))
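One detail of the "any variable" rule is worth seeing in isolation. When a group's estimate is zero, the MOE percentage is a division by zero, which R evaluates to Inf, and Inf is greater than any cutoff. The sketch below uses hypothetical estimates and margins of error for one group in three tracts, and shows one possible alternative that records the zero-estimate case as NA rather than as a failed check.

```r
# Hypothetical estimates and margins of error for one group in three tracts
est <- c(1200, 35, 0)
moe <- c(180, 40, 13)

moe_pct <- round(moe / est * 100, 2)
moe_pct            # the third value is Inf because the estimate is 0

# With a 12% cutoff, the zero-estimate tract is flagged along with the others
flag <- moe_pct > 12
flag

# One possible alternative: treat a zero estimate as "not assessable"
moe_pct_safe <- ifelse(est == 0, NA, moe_pct)
moe_pct_safe
```

Under the OR logic above, a single zero estimate among the three groups is enough to mark a tract as unreliable, which should be kept in mind when reading the summary counts.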

4.2 Pattern Analysis

Your Task: Investigate whether data quality problems are randomly distributed or concentrated in certain types of communities.

# Group tracts by whether they have high MOE issues
# Calculate average characteristics for each group:
# - population size, demographic percentages
data_quality_pattern <- MOE_demographic %>%
  group_by(high_moe_flag) %>%
  summarize(
    count = n(),
    avg_total_pop = mean(total_popE, na.rm = TRUE),
    avg_white_pct = mean(pct_white, na.rm = TRUE),
    avg_black_pct = mean(pct_black, na.rm = TRUE),
    avg_latino_pct = mean(pct_hispanic, na.rm = TRUE)
  ) %>%
  mutate(across(where(is.numeric), \(x) round(x, 1)))

# Use group_by() and summarize() to create this comparison
# Create a professional table showing the patterns
# Built from data_quality_pattern so the table stays in sync with the data
pattern_table <- data_quality_pattern %>%
  filter(high_moe_flag != "Reliable Demographics") %>%
  arrange(desc(avg_total_pop)) %>%
  select(high_moe_flag, avg_total_pop, avg_white_pct, avg_latino_pct, avg_black_pct)

kable(
  pattern_table, 
  caption = "Characteristics of Tracts with Data Quality Issues",
  align = "lrrrr",
  col.names = c("Quality Status", "Avg Population", "% White", "% Latino", "% Black")
)
Characteristics of Tracts with Data Quality Issues

Quality Status            Avg Population   % White   % Latino   % Black
Unreliable Demographics   4334.0           38        38         5.3
No Population             2.6              0         0          0.0

Pattern Analysis: Data quality issues are mostly found in smaller neighborhoods where fewer people live, which makes the ACS estimates less certain. A major pattern is that when a specific group, like the Black population in this data, makes up a very small share of the neighborhood (around 5.3% on average), the margin of error becomes too large to meet the 12% threshold. We also see tracts with almost no people, which are likely parks or industrial areas rather than residential communities. This shows that it is very hard to get highly reliable data in places that are either sparsely populated or have very small numbers of a specific demographic group.

Part 5: Policy Recommendations

5.1 Analysis Integration and Professional Summary

Your Task: Write an executive summary that integrates findings from all four analyses.

Executive Summary Requirements:

  1. Overall Pattern Identification: What are the systematic patterns across all your analyses?
  2. Equity Assessment: Which communities face the greatest risk of algorithmic bias based on your findings?
  3. Root Cause Analysis: What underlying factors drive both data quality issues and bias risk?
  4. Strategic Recommendations: What should the Department implement to address these systematic issues?

Executive Summary:

Across California, data reliability follows a clear systematic pattern: accuracy is closely tied to population size and demographic concentration. At the county level, high-population hubs like Santa Clara exhibit high confidence, while smaller rural areas like Alpine County show significant uncertainty in household income data. This pattern is even more pronounced at the tract level, where neighborhoods with smaller total populations or very small shares of specific demographic groups frequently fail to meet our reliability standards. Hence, the further the analysis zooms in on small sub-populations, the noisier and less reliable the census estimates become.

These findings present a significant equity risk: communities with the highest data uncertainty often face the greatest risk of algorithmic bias. Specifically, neighborhoods where the Black population makes up a small percentage (around 5.3% on average in tracts flagged as unreliable) are disproportionately flagged as having low-confidence data. If an algorithmic system is used to prioritize social funding based on these metrics, these communities may be unfairly excluded or misrepresented. Under-represented or sparsely populated groups are almost hidden by high margins of error, which could lead to a systematic denial of resources to those who need them most.

The root cause of this reliability gap is sampling variability within the American Community Survey. Because the ACS is a sample rather than a full count, the margin of error increases as the sample size decreases. In tracts with lower residential density or areas with very small numbers of specific demographic groups, the noise in the data often exceeds our 12% reliability threshold. In addition, tracts with almost no population, such as non-residential industrial zones or parks, can seemingly produce mathematical errors (for example, division by zero in the MOE percentage) that could further skew an algorithm’s results.
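The link between sample size and margin of error can be illustrated with the textbook formula for a proportion. The sketch below assumes simple random sampling, which the ACS does not use (it has a complex design with its own variance estimation), so the numbers only show the direction and rough scale of the effect. The function name moe_srs and the sample sizes are hypothetical.

```r
# 90% margin of error for a proportion under simple random sampling.
# Assumption: SRS, used here only to illustrate the 1 / sqrt(n) relationship.
moe_srs <- function(p, n, z = 1.645) {
  z * sqrt(p * (1 - p) / n)
}

# A group that makes up 5% of a tract, at three hypothetical sample sizes
n <- c(100, 400, 1600)
moe <- moe_srs(p = 0.05, n = n)

# MOE as a percentage of the 5% share itself
relative_moe_pct <- round(moe / 0.05 * 100, 1)
data.frame(n = n, relative_moe_pct = relative_moe_pct)
```

Quadrupling the sample size only halves the margin of error, which is why small groups in small tracts stay noisy even across pooled 5-year samples.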

To address these systematic issues, a tiered decision-making framework could be used. Algorithmic systems should be implemented immediately only in “High Confidence” counties, where margins of error are below 5%. For “Moderate Confidence” areas, the Department could implement outcome monitoring to catch potential biases. Finally, “Low Confidence” tracts or counties should require manual review or the use of proxy data sources, such as school enrollment or state tax records, before any significant decisions are made.

5.2 Specific Recommendations

Your Task: Create a decision framework for algorithm implementation.

# Create a summary table using your county reliability data
# Include: county name, median income, MOE percentage, reliability category

# Add a new column with algorithm recommendations using case_when():
# - High Confidence: "Safe for algorithmic decisions"
# - Moderate Confidence: "Use with caution - monitor outcomes"  
# - Low Confidence: "Requires manual review or additional data"

county_recommendations <- ca_reliability %>%
  mutate(
    # Creating the decision framework
    algorithm_recommendation = case_when(
      reliability == "High Confidence" ~ "Safe for algorithmic decisions",
      reliability == "Moderate" ~ "Use with caution - monitor outcomes",
      reliability == "Low Confidence" ~ "Requires manual review or additional data",
      TRUE ~ "Data unavailable"
    )
  ) %>%
  # Selecting and renaming columns for the final table
  select(
    `County` = county_name,
    `Median Income` = hshld_incomeE,
    `MOE %` = moe_percentage,
    `Reliability Category` = reliability,
    `Recommendation` = algorithm_recommendation
  ) %>%
  # Sorting by MOE % to show most reliable data first
  arrange(`MOE %`)



# Format as a professional table with kable()
kable(
  county_recommendations,
  caption = "County-Level Algorithmic Implementation Framework",
  format.args = list(big.mark = ","),
  align = "lrrll"
)
County-Level Algorithmic Implementation Framework
County Median Income MOE % Reliability Category Recommendation
Los Angeles 83,411 0.53 High Confidence Safe for algorithmic decisions
Orange 109,361 0.81 High Confidence Safe for algorithmic decisions
Sacramento 84,010 0.97 High Confidence Safe for algorithmic decisions
Alameda 122,488 1.00 High Confidence Safe for algorithmic decisions
Santa Clara 153,792 1.00 High Confidence Safe for algorithmic decisions
San Diego 96,974 1.02 High Confidence Safe for algorithmic decisions
San Bernardino 77,423 1.04 High Confidence Safe for algorithmic decisions
Contra Costa 120,020 1.25 High Confidence Safe for algorithmic decisions
Riverside 84,505 1.26 High Confidence Safe for algorithmic decisions
Fresno 67,756 1.43 High Confidence Safe for algorithmic decisions
San Francisco 136,689 1.43 High Confidence Safe for algorithmic decisions
Ventura 102,141 1.50 High Confidence Safe for algorithmic decisions
Placer 109,375 1.70 High Confidence Safe for algorithmic decisions
San Joaquin 82,837 1.75 High Confidence Safe for algorithmic decisions
San Mateo 149,907 1.75 High Confidence Safe for algorithmic decisions
Solano 97,037 1.78 High Confidence Safe for algorithmic decisions
Stanislaus 74,872 1.83 High Confidence Safe for algorithmic decisions
Sonoma 99,266 2.00 High Confidence Safe for algorithmic decisions
Santa Barbara 92,332 2.05 High Confidence Safe for algorithmic decisions
Kern 63,883 2.07 High Confidence Safe for algorithmic decisions
Monterey 91,043 2.09 High Confidence Safe for algorithmic decisions
Tulare 64,474 2.31 High Confidence Safe for algorithmic decisions
San Luis Obispo 90,158 2.56 High Confidence Safe for algorithmic decisions
Yolo 85,097 2.74 High Confidence Safe for algorithmic decisions
Napa 105,809 2.82 High Confidence Safe for algorithmic decisions
Marin 142,019 2.89 High Confidence Safe for algorithmic decisions
Santa Cruz 104,409 3.04 High Confidence Safe for algorithmic decisions
Kings 68,540 3.29 High Confidence Safe for algorithmic decisions
Merced 64,772 3.31 High Confidence Safe for algorithmic decisions
El Dorado 99,246 3.36 High Confidence Safe for algorithmic decisions
Butte 66,085 3.42 High Confidence Safe for algorithmic decisions
Mendocino 61,335 3.58 High Confidence Safe for algorithmic decisions
Shasta 68,347 3.63 High Confidence Safe for algorithmic decisions
Humboldt 57,881 3.68 High Confidence Safe for algorithmic decisions
Madera 73,543 3.87 High Confidence Safe for algorithmic decisions
Imperial 53,847 4.11 High Confidence Safe for algorithmic decisions
Yuba 66,693 4.19 High Confidence Safe for algorithmic decisions
Lake 56,259 4.34 High Confidence Safe for algorithmic decisions
Sutter 72,654 4.71 High Confidence Safe for algorithmic decisions
Nevada 79,395 4.82 High Confidence Safe for algorithmic decisions
Siskiyou 53,898 4.90 High Confidence Safe for algorithmic decisions
Calaveras 77,526 5.00 Moderate Use with caution - monitor outcomes
San Benito 104,451 5.23 Moderate Use with caution - monitor outcomes
Lassen 59,515 5.97 Moderate Use with caution - monitor outcomes
Glenn 64,033 6.19 Moderate Use with caution - monitor outcomes
Tuolumne 70,432 6.66 Moderate Use with caution - monitor outcomes
Tehama 59,029 6.95 Moderate Use with caution - monitor outcomes
Del Norte 61,149 7.16 Moderate Use with caution - monitor outcomes
Amador 74,853 8.08 Moderate Use with caution - monitor outcomes
Colusa 69,619 8.25 Moderate Use with caution - monitor outcomes
Inyo 63,417 8.60 Moderate Use with caution - monitor outcomes
Mariposa 60,021 8.82 Moderate Use with caution - monitor outcomes
Modoc 54,962 9.80 Moderate Use with caution - monitor outcomes
Plumas 67,885 11.45 Low Confidence Requires manual review or additional data
Trinity 47,317 12.45 Low Confidence Requires manual review or additional data
Sierra 61,108 15.12 Low Confidence Requires manual review or additional data
Alpine 101,125 17.25 Low Confidence Requires manual review or additional data
Mono 82,038 18.76 Low Confidence Requires manual review or additional data

Key Recommendations:

Your Task: Use your analysis results to provide specific guidance to the department.

  1. Counties suitable for immediate algorithmic implementation: Santa Clara, Los Angeles, Sacramento, and the other “High Confidence” counties. Why: These counties have large population estimates, which go along with higher data reliability and low margins of error (MOE < 5%). Their large sample sizes mean that algorithmic decisions rest on stable estimates, minimizing the risk of misallocation.

  2. Counties requiring additional oversight: Calaveras, San Benito, Lassen, and the other “Moderate” counties. Monitoring needed: For these areas, the department should implement outcome monitoring, regularly auditing the algorithm’s decisions against real-world feedback to ensure that minor data fluctuations aren’t causing systematic biases.

  3. Counties needing alternative approaches: Plumas, Trinity, Sierra, Alpine, and Mono. Because these areas have small or sparsely distributed populations, the department should rely on manual review or supplemental data (such as local administrative records). Relying solely on a census-based algorithm here would likely lead to bias due to the high noise in the estimates.

Questions for Further Investigation

  1. Spatial Correlation of Unreliability: Is there a geographic pattern to data unreliability? For example, are rural “border” tracts consistently more difficult to count than urban centers, regardless of total population size?
  2. Impact of Time on Minority Margin of Error: How have the margins of error for small demographic subgroups (like the 5.3% Black population observed) changed over the last three ACS 5-year cycles? Is the data becoming more or less reliable over time for such vulnerable groups?

Technical Notes

Data Sources:

  • U.S. Census Bureau, American Community Survey 2018-2022 5-Year Estimates
  • Retrieved via tidycensus R package on 3/2/26

Reproducibility:

  • All analysis conducted in R version 4.5.1
  • Census API key required for replication
  • Complete code and documentation available at: https://shaurya-chauhan.github.io/PortfolioPPA/

Methodology Notes: This analysis used 2022 ACS 5-year estimates via the tidycensus package, focusing on California’s county- and tract-level data. To compare how population size affects data stability, I selected counties representing high, moderate, and low confidence levels. A strict 12% margin of error (MOE) threshold was applied to demographic variables to flag unreliable data. I used case_when() logic to separate “No Population” tracts from “Unreliable” ones, so that mathematical errors in empty tracts did not skew the quality assessment. These analytical choices highlight that while strict standards are useful for policy, they frequently label smaller sub-populations as unreliable.

Limitations: The primary limitation of this analysis is the reliance on a strict 12% margin of error threshold for demographic variables at the tract level. While this ensures high data quality, it tends to disproportionately flag smaller minority populations as “unreliable”, potentially masking the needs of those groups in policy decisions. There is also a temporal limitation: the study uses 2022 ACS 5-year estimates, which provide a stable average over time but may not capture rapid recent shifts in local conditions.


Submission Checklist

Before submitting your portfolio link on Canvas:

Remember: Submit your portfolio URL on Canvas, not the file itself. Your assignment should be accessible at your-portfolio-url/labs/lab_1/your_file_name.html