# Load required packages (hint: you need tidycensus, tidyverse, and knitr)
library(tidycensus)
library(tidyverse)
library(knitr)
# Set your Census API key
# Choose your state for analysis - assign it to a variable called my_state
my_state <- 'California'
Lab 1: Census Data Quality for Policy Decisions
Evaluating Data Reliability for Algorithmic Decision-Making
Assignment Overview
Scenario
You are a data analyst for the California Department of Human Services. The department is considering implementing an algorithmic system to identify communities that should receive priority for social service funding and outreach programs. Your supervisor has asked you to evaluate the quality and reliability of available census data to inform this decision.
Drawing on our Week 2 discussion of algorithmic bias, you need to assess not just what the data shows, but how reliable it is and what communities might be affected by data quality issues.
Learning Objectives
- Apply dplyr functions to real census data for policy analysis
- Evaluate data quality using margins of error
- Connect technical analysis to algorithmic decision-making
- Identify potential equity implications of data reliability issues
- Create professional documentation for policy stakeholders
Submission Instructions
Submit by posting your updated portfolio link on Canvas. Your assignment should be accessible at your-portfolio-url/labs/lab_1/
Make sure to update your _quarto.yml navigation to include this assignment under a “Labs” menu.
Part 1: Portfolio Integration
Create this assignment in your portfolio repository under a labs/lab_1/ folder structure. Update your navigation menu to include:
- text: Assignments
  menu:
    - href: labs/lab_1/your_file_name.qmd
      text: "Lab 1: Census Data Exploration"
If the text contains a special character such as a colon, wrap it in double quotes so that Quarto reads it as text.
Setup
State Selection: I have chosen California for this analysis because I have previously worked with datasets from New York and Pennsylvania and wanted to try a West Coast state.
Part 2: County-Level Resource Assessment
2.1 Data Retrieval
Your Task: Use get_acs() to retrieve county-level data for your chosen state.
Requirements:
- Geography: county level
- Variables: median household income (B19013_001) and total population (B01003_001)
- Year: 2022
- Survey: acs5
- Output format: wide
Hint: Remember to give your variables descriptive names using the variables = c(name = "code") syntax.
# Write your get_acs() code here
ca_data <- get_acs(
  geography = "county",
  variables = c(
    total_pop = "B01003_001",
    hshld_income = "B19013_001"
  ),
  state = "CA",
  year = 2022,
  survey = "acs5",
  output = "wide"
)
# Clean the county names to remove state name and "County"
# Hint: use mutate() with str_remove()
ca_clean <- ca_data %>%
  mutate(
    county_name = str_remove(NAME, " County, California$")
  )
# Display the first few rows
head(ca_clean)
# A tibble: 6 × 7
GEOID NAME total_popE total_popM hshld_incomeE hshld_incomeM county_name
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 06001 Alameda C… 1663823 NA 122488 1231 Alameda
2 06003 Alpine Co… 1515 206 101125 17442 Alpine
3 06005 Amador Co… 40577 NA 74853 6048 Amador
4 06007 Butte Cou… 213605 NA 66085 2261 Butte
5 06009 Calaveras… 45674 NA 77526 3875 Calaveras
6 06011 Colusa Co… 21811 NA 69619 5745 Colusa
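With output = "wide", tidycensus appends E (estimate) and M (margin of error) to each name supplied in variables, which is why the table above has total_popE, total_popM, hshld_incomeE, and hshld_incomeM. The following is a minimal base-R sketch of that naming rule. It makes no API call, and the helper wide_column_names() is hypothetical, not part of tidycensus.

```r
# Named vector in the same form passed to get_acs(variables = ...)
acs_vars <- c(total_pop = "B01003_001", hshld_income = "B19013_001")

# Hypothetical helper (not a tidycensus function): predicts the column
# names that wide output produces, one E/M pair per variable name.
wide_column_names <- function(vars) {
  # rbind() stacks the E names over the M names; as.vector() then reads
  # the matrix column by column, which interleaves them pair by pair.
  as.vector(rbind(paste0(names(vars), "E"), paste0(names(vars), "M")))
}

wide_column_names(acs_vars)
```

Choosing short, descriptive names up front keeps every later mutate() and select() call readable, because these suffixed names are what the rest of the analysis refers to.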
2.2 Data Quality Assessment
Your Task: Calculate margin of error percentages and create reliability categories.
Requirements:
- Calculate MOE percentage: (margin of error / estimate) * 100
- Create reliability categories:
  - High Confidence: MOE < 5%
  - Moderate Confidence: MOE 5-10%
  - Low Confidence: MOE > 10%
- Create a flag for unreliable estimates (MOE > 10%)
Hint: Use mutate() with case_when() for the categories.
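As a worked check of the formula and cutoffs above, the sketch below uses three estimate and margin pairs copied from the county output in this lab. It uses nested ifelse() in base R rather than case_when() so that it runs without extra packages. The function name classify_moe() is an illustrative assumption. It also shows that rounding before classifying can move a borderline county across the 5% cutoff.

```r
# Estimates and margins of error copied from the county table in this lab
counties <- data.frame(
  county   = c("Alameda", "Alpine", "Calaveras"),
  estimate = c(122488, 101125, 77526),
  moe      = c(1231, 17442, 3875)
)

# Illustrative helper: < 5 is high, 5 to 10 inclusive is moderate, > 10 is low
classify_moe <- function(pct) {
  ifelse(pct < 5, "High Confidence",
         ifelse(pct <= 10, "Moderate Confidence", "Low Confidence"))
}

counties$moe_pct_raw      <- counties$moe / counties$estimate * 100
counties$moe_pct_rounded  <- round(counties$moe_pct_raw, 2)
counties$category_raw     <- classify_moe(counties$moe_pct_raw)
counties$category_rounded <- classify_moe(counties$moe_pct_rounded)
counties
```

Calaveras has an unrounded MOE percentage just under 5, so it is "High Confidence" before rounding and "Moderate Confidence" after. Classifying on the unrounded value and rounding only for display would avoid that shift.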
# Calculate MOE percentage and reliability categories using mutate()
ca_reliability <- ca_clean %>%
  mutate(
    moe_percentage = round((hshld_incomeM / hshld_incomeE) * 100, 2),
    reliability = case_when(
      moe_percentage < 5 ~ "High Confidence",
      moe_percentage >= 5 & moe_percentage <= 10 ~ "Moderate",
      moe_percentage > 10 ~ "Low Confidence"
    )
  )
# Create a summary showing count of counties in each reliability category
# Hint: use count() and mutate() to add percentages
reliability_summary <- ca_reliability %>%
  group_by(reliability) %>%
  summarize(
    counties = n(),
    avg_income = round(mean(hshld_incomeE, na.rm = TRUE), 0)
  )
2.3 High Uncertainty Counties
Your Task: Identify the 5 counties with the highest MOE percentages.
Requirements:
- Sort by MOE percentage (highest first)
- Select the top 5 counties
- Display: county name, median income, margin of error, MOE percentage, reliability category
- Format as a professional table using kable()
Hint: Use arrange(), slice(), and select() functions.
# Create table of top 5 counties by MOE percentage
high_uncertainty <- ca_reliability %>%
  filter(moe_percentage > 8) %>%
  arrange(desc(moe_percentage)) %>%
  slice_head(n = 5) %>%
  select(county_name, hshld_incomeE, moe_percentage, hshld_incomeM, reliability)
glimpse(high_uncertainty)
Rows: 5
Columns: 5
$ county_name <chr> "Mono", "Alpine", "Sierra", "Trinity", "Plumas"
$ hshld_incomeE <dbl> 82038, 101125, 61108, 47317, 67885
$ moe_percentage <dbl> 18.76, 17.25, 15.12, 12.45, 11.45
$ hshld_incomeM <dbl> 15388, 17442, 9237, 5890, 7772
$ reliability <chr> "Low Confidence", "Low Confidence", "Low Confidence", "…
# Format as table with kable() - include appropriate column names and caption
kable(high_uncertainty,
      col.names = c("County", "Household Income", "MOE %", "MOE", "Reliability Category"),
      caption = "Counties with Highest Income Data Uncertainty",
      format.args = list(big.mark = ","))
| County | Household Income | MOE % | MOE | Reliability Category |
|---|---|---|---|---|
| Mono | 82,038 | 18.76 | 15,388 | Low Confidence |
| Alpine | 101,125 | 17.25 | 17,442 | Low Confidence |
| Sierra | 61,108 | 15.12 | 9,237 | Low Confidence |
| Trinity | 47,317 | 12.45 | 5,890 | Low Confidence |
| Plumas | 67,885 | 11.45 | 7,772 | Low Confidence |
Data Quality Commentary:
These counties have high margins of error relative to their estimates, so their reliability is low. That makes decision-making harder for authorities, because the estimates are not precise enough to support targeted policy choices. The higher uncertainty might stem from adjustments made for households that did not respond, or from the weighting applied to make the sample representative, which might compound any variation in the data.
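One way to make these margins concrete is to turn each into an interval. The Census Bureau publishes ACS margins of error at the 90 percent confidence level, so estimate ± MOE gives an approximate 90% interval, and MOE / 1.645 gives the standard error. Here is a short base-R sketch using the five rows from the table above. The column names lower_90, upper_90, and cv_pct are illustrative choices.

```r
# Median household income estimates and margins from the top-5 table
top5 <- data.frame(
  county   = c("Mono", "Alpine", "Sierra", "Trinity", "Plumas"),
  estimate = c(82038, 101125, 61108, 47317, 67885),
  moe      = c(15388, 17442, 9237, 5890, 7772)
)

# Approximate 90% interval: estimate minus and plus the published margin
top5$lower_90 <- top5$estimate - top5$moe
top5$upper_90 <- top5$estimate + top5$moe

# Standard error and coefficient of variation (SE as a percent of the estimate)
top5$se     <- top5$moe / 1.645
top5$cv_pct <- round(top5$se / top5$estimate * 100, 1)

top5[, c("county", "lower_90", "upper_90", "cv_pct")]
```

For Mono County the interval runs from 66,650 to 97,426, a range of more than $30,000 around a single reported median.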
Part 3: Neighborhood-Level Analysis
3.1 Focus Area Selection
Your Task: Select 2-3 counties from your reliability analysis for detailed tract-level study.
Strategy: Choose counties that represent different reliability levels (e.g., 1 high confidence, 1 moderate, 1 low confidence) to compare how data quality varies.
# Use filter() to select 2-3 counties from your county_reliability data
# Store the selected counties in a variable called selected_counties
selected_counties <- ca_reliability %>%
  group_by(reliability) %>%
  slice_sample(n = 1) %>%
  ungroup()
# Display the selected counties with their key characteristics
# Show: county name, median income, MOE percentage, reliability category
selected_counties %>%
  select(county_name, hshld_incomeE, moe_percentage, reliability)
# A tibble: 3 × 4
county_name hshld_incomeE moe_percentage reliability
<chr> <dbl> <dbl> <chr>
1 Butte 66085 3.42 High Confidence
2 Plumas 67885 11.4 Low Confidence
3 San Benito 104451 5.23 Moderate
Comment on the output: In the county data, reliability tends to rise with the population estimate. A high-population county such as Santa Clara has among the most reliable household income data, while Alpine, which has the lowest population estimate, has among the least reliable.
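Because slice_sample() draws at random, rerunning the chunk can select different counties than the ones printed above. The sketch below shows one way to make the draw repeatable, assuming a fixed seed is acceptable for the assignment. It uses base R on a small illustrative subset of the county table. The seed value 2022 and the helper pick_one_per_group() are arbitrary assumptions.

```r
# Illustrative subset: two counties per reliability category
county_sample <- data.frame(
  county_name = c("Butte", "Siskiyou", "San Benito", "Lassen", "Plumas", "Mono"),
  reliability = c("High Confidence", "High Confidence", "Moderate", "Moderate",
                  "Low Confidence", "Low Confidence")
)

# Hypothetical helper: draw one row per group after fixing the seed
pick_one_per_group <- function(df, group_col, seed) {
  set.seed(seed)
  groups <- split(seq_len(nrow(df)), df[[group_col]])
  # sample.int() on the group size avoids sample()'s special case for length-1 input
  picked <- vapply(groups, function(idx) idx[sample.int(length(idx), 1)], integer(1))
  df[sort(picked), ]
}

first_draw  <- pick_one_per_group(county_sample, "reliability", seed = 2022)
second_draw <- pick_one_per_group(county_sample, "reliability", seed = 2022)
identical(first_draw, second_draw)  # same seed, so the same rows are drawn
```

With dplyr, the equivalent step is calling set.seed() immediately before the slice_sample() pipeline, since slice_sample() uses R's random number generator.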
3.2 Tract-Level Demographics
Your Task: Get demographic data for census tracts in your selected counties.
Requirements:
- Geography: tract level
- Variables: white alone (B03002_003), Black/African American (B03002_004), Hispanic/Latino (B03002_012), total population (B03002_001)
- Use the same state and year as before
- Output format: wide
- Challenge: You’ll need county codes, not names. Look at the GEOID patterns in your county data for hints.
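The county GEOIDs in the earlier output are five characters: a two-digit state FIPS code followed by a three-digit county code. The sketch below pulls out the county code with substr(), using GEOIDs copied from the county table. The tract GEOID at the end is a made-up example of the 11-character layout (state, county, then a six-digit tract code).

```r
# GEOIDs copied from the county-level output earlier in this lab
county_geoids <- c(Alameda = "06001", Alpine = "06003", Butte = "06007")

state_fips   <- substr(county_geoids, 1, 2)  # first two characters: state
county_codes <- substr(county_geoids, 3, 5)  # last three characters: county
county_codes

# Hypothetical tract GEOID, used only to show the layout
example_tract <- "06007000100"
substr(example_tract, 3, 5)  # county part of a tract GEOID
```

Codes in this form can be passed to the county argument of get_acs() to limit a tract request to the selected counties.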
# Define your race/ethnicity variables with descriptive names
# Use get_acs() to retrieve tract-level data
# Hint: You may need to specify county codes in the county parameter
ca_tract_data <- get_acs(
  geography = "tract",
  variables = c(
    white_pop = "B03002_003",
    black_pop = "B03002_004",
    latino_pop = "B03002_012",
    total_pop = "B03002_001"
  ),
  state = "CA",
  year = 2022,
  output = "wide"
)
ca_tract_clean <- ca_tract_data %>%
  mutate(
    tract_name = str_replace(NAME, "; California$", "")
  )
# Calculate percentage of each group using mutate()
# Create percentages for white, Black, and Hispanic populations
tract_pop_percentages <- ca_tract_clean %>%
  mutate(
    pct_white = round((white_popE / total_popE) * 100, 1),
    pct_black = round((black_popE / total_popE) * 100, 1),
    pct_hispanic = round((latino_popE / total_popE) * 100, 1)
  )
# Add readable tract and county name columns using str_extract() or similar
tract_pop_percentages <- tract_pop_percentages %>%
  separate(NAME,
           into = c("tract_name", "county_name", "state_name"),
           sep = "; ")
3.3 Demographic Analysis
Your Task: Analyze the demographic patterns in your selected areas.
# Find the tract with the highest percentage of Hispanic/Latino residents
# Hint: use arrange() and slice() to get the top tract
highest_latino_pct <- tract_pop_percentages %>%
  arrange(desc(pct_hispanic)) %>%
  slice_head(n = 1)
# Calculate average demographics by county using group_by() and summarize()
# Show: number of tracts, average percentage for each racial/ethnic group
county_averages <- tract_pop_percentages %>%
  group_by(county_name) %>%
  summarize(
    avg_white = round(mean(pct_white, na.rm = TRUE), 1),
    avg_black = round(mean(pct_black, na.rm = TRUE), 1),
    avg_hispanic = round(mean(pct_hispanic, na.rm = TRUE), 1),
    total_tracts = n() # Bonus: see how many tracts are in each county
  ) %>%
  ungroup()
# Create a nicely formatted table of your results using kable()
library(kableExtra)
county_averages %>%
  kable(
    col.names = c("County", "% White", "% Black", "% Hispanic", "Total Tracts"),
    caption = "Average Demographics by County",
    align = "lrrrr" # one alignment letter per column (5 columns)
  )
| County | % White | % Black | % Hispanic | Total Tracts |
|---|---|---|---|---|
| Alameda County | 31.0 | 10.7 | 21.4 | 379 |
| Alpine County | 58.1 | 0.0 | 14.1 | 1 |
| Amador County | 75.7 | 1.6 | 14.9 | 10 |
| Butte County | 69.3 | 1.5 | 17.4 | 54 |
| Calaveras County | 81.0 | 0.9 | 11.6 | 14 |
| Colusa County | 34.0 | 1.6 | 60.4 | 6 |
| Contra Costa County | 42.6 | 8.0 | 25.1 | 242 |
| Del Norte County | 59.5 | 2.2 | 19.6 | 9 |
| El Dorado County | 76.0 | 0.6 | 13.8 | 55 |
| Fresno County | 28.4 | 4.1 | 54.1 | 225 |
| Glenn County | 54.0 | 0.3 | 39.2 | 8 |
| Humboldt County | 72.1 | 1.3 | 11.6 | 36 |
| Imperial County | 11.4 | 2.6 | 82.3 | 40 |
| Inyo County | 62.1 | 0.8 | 23.2 | 6 |
| Kern County | 33.1 | 4.7 | 54.0 | 236 |
| Kings County | 29.6 | 5.7 | 57.3 | 31 |
| Lake County | 69.3 | 2.5 | 20.0 | 21 |
| Lassen County | 70.0 | 5.0 | 15.3 | 9 |
| Los Angeles County | 26.3 | 7.6 | 47.6 | 2498 |
| Madera County | 33.7 | 2.4 | 58.0 | 34 |
| Marin County | 69.2 | 2.5 | 16.5 | 63 |
| Mariposa County | 76.8 | 0.8 | 13.4 | 6 |
| Mendocino County | 64.6 | 0.5 | 24.9 | 24 |
| Merced County | 25.4 | 2.8 | 62.1 | 63 |
| Modoc County | 76.6 | 1.4 | 15.1 | 4 |
| Mono County | 64.2 | 0.2 | 27.8 | 4 |
| Monterey County | 35.2 | 2.0 | 52.6 | 104 |
| Napa County | 54.6 | 2.1 | 31.7 | 40 |
| Nevada County | 83.4 | 0.3 | 9.6 | 26 |
| Orange County | 41.3 | 1.5 | 32.4 | 614 |
| Placer County | 70.9 | 1.4 | 14.5 | 92 |
| Plumas County | 85.2 | 0.6 | 8.6 | 7 |
| Riverside County | 34.6 | 5.7 | 49.9 | 518 |
| Sacramento County | 43.2 | 9.1 | 23.8 | 363 |
| San Benito County | 32.9 | 0.8 | 59.5 | 12 |
| San Bernardino County | 28.5 | 7.1 | 53.3 | 466 |
| San Diego County | 45.5 | 4.4 | 33.3 | 737 |
| San Francisco County | 39.5 | 5.1 | 15.1 | 244 |
| San Joaquin County | 29.8 | 6.7 | 43.6 | 174 |
| San Luis Obispo County | 67.1 | 1.3 | 23.2 | 70 |
| San Mateo County | 37.9 | 2.1 | 23.5 | 174 |
| Santa Barbara County | 46.0 | 1.8 | 43.6 | 109 |
| Santa Clara County | 29.6 | 2.3 | 25.3 | 408 |
| Santa Cruz County | 56.7 | 0.8 | 34.2 | 70 |
| Shasta County | 77.7 | 0.9 | 10.6 | 50 |
| Sierra County | 86.6 | 0.2 | 11.4 | 1 |
| Siskiyou County | 73.9 | 1.2 | 14.5 | 16 |
| Solano County | 36.1 | 13.0 | 28.2 | 100 |
| Sonoma County | 63.6 | 1.4 | 25.6 | 122 |
| Stanislaus County | 38.7 | 2.7 | 48.9 | 112 |
| Sutter County | 45.8 | 1.8 | 32.5 | 21 |
| Tehama County | 65.9 | 0.9 | 26.0 | 14 |
| Trinity County | 79.2 | 1.8 | 7.0 | 4 |
| Tulare County | 27.4 | 1.2 | 65.3 | 103 |
| Tuolumne County | 78.1 | 1.9 | 13.5 | 18 |
| Ventura County | 45.0 | 1.7 | 42.1 | 190 |
| Yolo County | 46.5 | 2.7 | 31.1 | 53 |
| Yuba County | 56.0 | 3.2 | 26.7 | 19 |
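The county figures above are unweighted means of tract percentages, so a tract with few residents counts as much as one with many. The sketch below uses made-up numbers (the three tracts are hypothetical, not ACS data) to contrast that average with the pooled share, which divides the summed group count by the summed total.

```r
# Hypothetical tracts: one small and heavily Latino, two large and less so
tracts <- data.frame(
  total_pop  = c(500, 4000, 5500),
  latino_pop = c(400, 1000, 1100)
)
tracts$pct_latino <- tracts$latino_pop / tracts$total_pop * 100

# Unweighted mean of tract percentages (what mean(pct_hispanic) computes)
mean_of_tract_pcts <- mean(tracts$pct_latino)

# Pooled share: total group residents over total residents
pooled_pct <- sum(tracts$latino_pop) / sum(tracts$total_pop) * 100

c(mean_of_tract_pcts = mean_of_tract_pcts, pooled_pct = pooled_pct)
```

In this toy case the unweighted mean is about 41.7% while the pooled share is 25%, so the two measures answer different questions: the typical tract versus the county population as a whole.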
Part 4: Comprehensive Data Quality Evaluation
4.1 MOE Analysis for Demographic Variables
Your Task: Examine margins of error for demographic variables to see if some communities have less reliable data.
Requirements:
- Calculate MOE percentages for each demographic variable
- Flag tracts where any demographic variable has MOE > 15%
- Create summary statistics
# Calculate MOE percentages for white, Black, and Hispanic variables
# Hint: use the same formula as before (margin/estimate * 100)
MOE_demographic <- tract_pop_percentages %>%
  mutate(
    moe_percentage_white = round((white_popM / white_popE) * 100, 2),
    moe_percentage_black = round((black_popM / black_popE) * 100, 2),
    moe_percentage_latino = round((latino_popM / latino_popE) * 100, 2),
    moe_percentage_popln_total = round((total_popM / total_popE) * 100, 2),
    reliability = case_when(
      moe_percentage_popln_total < 5 ~ "High Confidence",
      moe_percentage_popln_total >= 5 & moe_percentage_popln_total <= 12 ~ "Moderate Confidence",
      moe_percentage_popln_total > 12 ~ "Low Confidence",
      TRUE ~ "No Data"
    )
  )
# Create a flag for tracts with high MOE on any demographic variable
# Use logical operators (| for OR) in an ifelse() statement
MOE_demographic <- MOE_demographic %>%
  mutate(
    high_moe_flag = case_when(
      # Handle cases where population is zero (prevents 100% unreliability)
      (white_popE == 0 & black_popE == 0 & latino_popE == 0) ~ "No Population",
      # The strict "OR" logic based on your assigned cutoffs
      (moe_percentage_white > 12 |
         moe_percentage_black > 12 |
         moe_percentage_latino > 12) ~ "Unreliable Demographics",
      # If it passes both above, it's reliable
      TRUE ~ "Reliable Demographics"
    )
  )
# Create summary statistics showing how many tracts have data quality issues
quality_summary <- MOE_demographic %>%
  count(high_moe_flag) %>%
  mutate(percentage = round((n / sum(n)) * 100, 1))
4.2 Pattern Analysis
Your Task: Investigate whether data quality problems are randomly distributed or concentrated in certain types of communities.
# Group tracts by whether they have high MOE issues
# Calculate average characteristics for each group:
# - population size, demographic percentages
data_quality_pattern <- MOE_demographic %>%
  group_by(high_moe_flag) %>%
  summarize(
    count = n(),
    avg_total_pop = mean(total_popE, na.rm = TRUE),
    avg_white_pct = mean(pct_white, na.rm = TRUE),
    avg_black_pct = mean(pct_black, na.rm = TRUE),
    avg_latino_pct = mean(pct_hispanic, na.rm = TRUE)
  ) %>%
  # Anonymous function replaces the deprecated across(..., round, 1) form
  mutate(across(where(is.numeric), \(x) round(x, 1)))
# Use group_by() and summarize() to create this comparison
# Create a professional table showing the patterns
pattern_table <- data.frame(
  `Quality Status` = c("Unreliable Demographics", "No Population"),
  `Avg Population` = c(4334, 2.6),
  `Pct White` = c(38.0, 0),
  `Pct Latino` = c(38.0, 0),
  `Pct Black` = c(5.3, 0)
)
kable(
  pattern_table,
  caption = "Characteristics of Tracts with Data Quality Issues",
  align = "lrrrr",
  col.names = c("Quality Status", "Avg Population", "% White", "% Latino", "% Black")
)
| Quality Status | Avg Population | % White | % Latino | % Black |
|---|---|---|---|---|
| Unreliable Demographics | 4334.0 | 38 | 38 | 5.3 |
| No Population | 2.6 | 0 | 0 | 0.0 |
Pattern Analysis: Data quality issues are mostly found in smaller neighborhoods where fewer people live, which makes the Census estimates less certain. A major pattern is that when a specific group, like the Black population in this data, makes up a very small share of the neighborhood (around 5.3%), the margin of error rises too high to meet strict standards. We also see tracts with almost no people, which are likely parks or industrial areas rather than residential communities. This shows that it is very hard to get highly reliable data in places that are either sparsely populated or have very small numbers of a specific demographic group.
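One mechanical contributor to these flags is worth checking. When a group's estimate is 0, margin / estimate is a division by zero. In R a positive number divided by zero is Inf, and Inf > 12 is TRUE, so a tract with no residents in any one group would be flagged regardless of how reliable its other estimates are. The following minimal sketch uses hypothetical tract values, not ACS data.

```r
# Hypothetical estimates and margins for one demographic group in three tracts
tract_checks <- data.frame(
  black_est = c(250, 0, 0),
  black_moe = c(20, 13, 0)
)

# Same formula as the lab: margin / estimate * 100
tract_checks$moe_pct <- tract_checks$black_moe / tract_checks$black_est * 100

# Row 1 is an ordinary ratio, row 2 is 13 / 0 (Inf), row 3 is 0 / 0 (NaN)
tract_checks$over_threshold <- tract_checks$moe_pct > 12
tract_checks
```

The second row is flagged only because its estimate is zero, and the third row's comparison returns NA. Whether this explains why the pattern table has no "Reliable Demographics" row would need to be checked against the actual tract data.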
Part 5: Policy Recommendations
5.1 Analysis Integration and Professional Summary
Your Task: Write an executive summary that integrates findings from all four analyses.
Executive Summary Requirements:
1. Overall Pattern Identification: What are the systematic patterns across all your analyses?
2. Equity Assessment: Which communities face the greatest risk of algorithmic bias based on your findings?
3. Root Cause Analysis: What underlying factors drive both data quality issues and bias risk?
4. Strategic Recommendations: What should the Department implement to address these systematic issues?
Executive Summary:
Across California, data reliability follows a clear systematic pattern: accuracy is tied to population size and demographic concentration. At the county level, high-population hubs like Santa Clara exhibit high confidence, while smaller rural areas like Alpine County show significant uncertainty in household income data. This pattern is even more pronounced at the tract level, where neighborhoods with smaller total populations or very small shares of specific demographic groups frequently fail to meet our reliability standards. Hence, the further the analysis zooms in on small sub-populations, the noisier and less reliable the census estimates become.
These findings present a significant equity risk: communities with the highest data uncertainty often face the greatest risk of algorithmic bias. Specifically, neighborhoods where the Black population makes up a small percentage (around 5.3% in “unreliable” tracts) are disproportionately flagged as having low-confidence data. If an algorithmic system is used to prioritize social funding based on these metrics, these communities may be unfairly excluded or misrepresented. Under-represented or sparsely populated groups are almost hidden by high margins of error, which could lead to a systematic denial of resources to those who need them most.
The root cause of this reliability gap is sampling variability within the American Community Survey. Because the ACS is a sample rather than a full count, the margin of error increases as the sample size decreases. In tracts with lower residential density or very small numbers of specific demographic groups, the noise in the data often exceeds our 12% reliability threshold. In addition, tracts with almost zero population, such as non-residential industrial zones or parks, can produce mathematical errors that further skew an algorithm’s results.
To address these systematic issues, the Department could use a tiered decision-making framework. Algorithmic systems should be implemented immediately only in “High Confidence” counties, where margins of error are below 5%. For “Moderate Confidence” areas, the Department could implement outcome monitoring to catch potential biases. Finally, “Low Confidence” tracts or counties should require manual review or proxy data sources, such as school enrollment or state tax records, before any significant decisions are made.
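The link between sample size and margin of error named in the root-cause paragraph can be illustrated with the textbook formula for a proportion under simple random sampling, MOE = 1.645 * sqrt(p * (1 - p) / n) at 90% confidence. This is a simplification. The ACS uses a complex sample design and its published margins are not computed this way, so the numbers below are only illustrative, and the function name moe_proportion() and the 5% share are assumptions.

```r
# Textbook 90% margin of error for a proportion under simple random sampling
moe_proportion <- function(p, n, z = 1.645) {
  z * sqrt(p * (1 - p) / n)
}

sample_sizes <- c(50, 200, 800, 3200)

# Margin in percentage points for a group that is 5% of the population
moe_pts <- moe_proportion(p = 0.05, n = sample_sizes) * 100

# Margin relative to the 5% share itself, comparable to the lab's MOE %
relative_moe_pct <- moe_pts / 5 * 100

data.frame(
  n = sample_sizes,
  moe_pts = round(moe_pts, 2),
  relative_moe_pct = round(relative_moe_pct, 1)
)
```

Each fourfold increase in sample size halves the margin, which is why small tracts and small subgroups are the first to cross a fixed relative threshold.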
5.2 Specific Recommendations
Your Task: Create a decision framework for algorithm implementation.
# Create a summary table using your county reliability data
# Include: county name, median income, MOE percentage, reliability category
# Add a new column with algorithm recommendations using case_when():
# - High Confidence: "Safe for algorithmic decisions"
# - Moderate Confidence: "Use with caution - monitor outcomes"
# - Low Confidence: "Requires manual review or additional data"
county_recommendations <- ca_reliability %>%
  mutate(
    # Creating the decision framework
    algorithm_recommendation = case_when(
      reliability == "High Confidence" ~ "Safe for algorithmic decisions",
      reliability == "Moderate" ~ "Use with caution - monitor outcomes",
      reliability == "Low Confidence" ~ "Requires manual review or additional data",
      TRUE ~ "Data unavailable"
    )
  ) %>%
  # Selecting and renaming columns for the final table
  select(
    `County` = county_name,
    `Median Income` = hshld_incomeE,
    `MOE %` = moe_percentage,
    `Reliability Category` = reliability,
    `Recommendation` = algorithm_recommendation
  ) %>%
  # Sorting by MOE % to show most reliable data first
  arrange(`MOE %`)
# Format as a professional table with kable()
kable(
  county_recommendations,
  caption = "County-Level Algorithmic Implementation Framework",
  format.args = list(big.mark = ","),
  align = "lrrll"
)
| County | Median Income | MOE % | Reliability Category | Recommendation |
|---|---|---|---|---|
| Los Angeles | 83,411 | 0.53 | High Confidence | Safe for algorithmic decisions |
| Orange | 109,361 | 0.81 | High Confidence | Safe for algorithmic decisions |
| Sacramento | 84,010 | 0.97 | High Confidence | Safe for algorithmic decisions |
| Alameda | 122,488 | 1.00 | High Confidence | Safe for algorithmic decisions |
| Santa Clara | 153,792 | 1.00 | High Confidence | Safe for algorithmic decisions |
| San Diego | 96,974 | 1.02 | High Confidence | Safe for algorithmic decisions |
| San Bernardino | 77,423 | 1.04 | High Confidence | Safe for algorithmic decisions |
| Contra Costa | 120,020 | 1.25 | High Confidence | Safe for algorithmic decisions |
| Riverside | 84,505 | 1.26 | High Confidence | Safe for algorithmic decisions |
| Fresno | 67,756 | 1.43 | High Confidence | Safe for algorithmic decisions |
| San Francisco | 136,689 | 1.43 | High Confidence | Safe for algorithmic decisions |
| Ventura | 102,141 | 1.50 | High Confidence | Safe for algorithmic decisions |
| Placer | 109,375 | 1.70 | High Confidence | Safe for algorithmic decisions |
| San Joaquin | 82,837 | 1.75 | High Confidence | Safe for algorithmic decisions |
| San Mateo | 149,907 | 1.75 | High Confidence | Safe for algorithmic decisions |
| Solano | 97,037 | 1.78 | High Confidence | Safe for algorithmic decisions |
| Stanislaus | 74,872 | 1.83 | High Confidence | Safe for algorithmic decisions |
| Sonoma | 99,266 | 2.00 | High Confidence | Safe for algorithmic decisions |
| Santa Barbara | 92,332 | 2.05 | High Confidence | Safe for algorithmic decisions |
| Kern | 63,883 | 2.07 | High Confidence | Safe for algorithmic decisions |
| Monterey | 91,043 | 2.09 | High Confidence | Safe for algorithmic decisions |
| Tulare | 64,474 | 2.31 | High Confidence | Safe for algorithmic decisions |
| San Luis Obispo | 90,158 | 2.56 | High Confidence | Safe for algorithmic decisions |
| Yolo | 85,097 | 2.74 | High Confidence | Safe for algorithmic decisions |
| Napa | 105,809 | 2.82 | High Confidence | Safe for algorithmic decisions |
| Marin | 142,019 | 2.89 | High Confidence | Safe for algorithmic decisions |
| Santa Cruz | 104,409 | 3.04 | High Confidence | Safe for algorithmic decisions |
| Kings | 68,540 | 3.29 | High Confidence | Safe for algorithmic decisions |
| Merced | 64,772 | 3.31 | High Confidence | Safe for algorithmic decisions |
| El Dorado | 99,246 | 3.36 | High Confidence | Safe for algorithmic decisions |
| Butte | 66,085 | 3.42 | High Confidence | Safe for algorithmic decisions |
| Mendocino | 61,335 | 3.58 | High Confidence | Safe for algorithmic decisions |
| Shasta | 68,347 | 3.63 | High Confidence | Safe for algorithmic decisions |
| Humboldt | 57,881 | 3.68 | High Confidence | Safe for algorithmic decisions |
| Madera | 73,543 | 3.87 | High Confidence | Safe for algorithmic decisions |
| Imperial | 53,847 | 4.11 | High Confidence | Safe for algorithmic decisions |
| Yuba | 66,693 | 4.19 | High Confidence | Safe for algorithmic decisions |
| Lake | 56,259 | 4.34 | High Confidence | Safe for algorithmic decisions |
| Sutter | 72,654 | 4.71 | High Confidence | Safe for algorithmic decisions |
| Nevada | 79,395 | 4.82 | High Confidence | Safe for algorithmic decisions |
| Siskiyou | 53,898 | 4.90 | High Confidence | Safe for algorithmic decisions |
| Calaveras | 77,526 | 5.00 | Moderate | Use with caution - monitor outcomes |
| San Benito | 104,451 | 5.23 | Moderate | Use with caution - monitor outcomes |
| Lassen | 59,515 | 5.97 | Moderate | Use with caution - monitor outcomes |
| Glenn | 64,033 | 6.19 | Moderate | Use with caution - monitor outcomes |
| Tuolumne | 70,432 | 6.66 | Moderate | Use with caution - monitor outcomes |
| Tehama | 59,029 | 6.95 | Moderate | Use with caution - monitor outcomes |
| Del Norte | 61,149 | 7.16 | Moderate | Use with caution - monitor outcomes |
| Amador | 74,853 | 8.08 | Moderate | Use with caution - monitor outcomes |
| Colusa | 69,619 | 8.25 | Moderate | Use with caution - monitor outcomes |
| Inyo | 63,417 | 8.60 | Moderate | Use with caution - monitor outcomes |
| Mariposa | 60,021 | 8.82 | Moderate | Use with caution - monitor outcomes |
| Modoc | 54,962 | 9.80 | Moderate | Use with caution - monitor outcomes |
| Plumas | 67,885 | 11.45 | Low Confidence | Requires manual review or additional data |
| Trinity | 47,317 | 12.45 | Low Confidence | Requires manual review or additional data |
| Sierra | 61,108 | 15.12 | Low Confidence | Requires manual review or additional data |
| Alpine | 101,125 | 17.25 | Low Confidence | Requires manual review or additional data |
| Mono | 82,038 | 18.76 | Low Confidence | Requires manual review or additional data |
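The tier sizes in the table above can be tallied to show how much of the state each recommendation covers. This short base-R sketch uses counts (41, 12, and 5) read off the rows of the table rather than recomputed from the API.

```r
# Number of counties per reliability category, counted from the table above
tier_counts <- c("High Confidence" = 41, "Moderate" = 12, "Low Confidence" = 5)

tier_summary <- data.frame(
  reliability = names(tier_counts),
  counties    = as.integer(tier_counts),
  pct         = round(as.numeric(tier_counts) / sum(tier_counts) * 100, 1)
)
tier_summary
```

With the tidyverse loaded, count(reliability) followed by mutate(pct = n / sum(n) * 100) on ca_reliability should give the same breakdown directly from the data.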
Key Recommendations:
Your Task: Use your analysis results to provide specific guidance to the department.
Counties suitable for immediate algorithmic implementation: Santa Clara, Los Angeles, Sacramento, etc. Why: These counties have high population estimates, which go along with higher data reliability and low margins of error (MOE < 5%). Their large sample sizes mean that algorithmic decisions rest on stable estimates, which minimizes the risk of error.
Counties requiring additional oversight: Calaveras, San Benito, Lassen, etc. Monitoring needed: For these areas, the department should implement outcome monitoring, regularly auditing the algorithm’s decisions against real-world feedback to ensure that minor data fluctuations are not causing systematic biases.
Counties needing alternative approaches: Plumas, Trinity, Sierra, etc. Because these areas often have small or sparsely distributed populations, the department should rely on manual review or supplemental data (such as local administrative records). Relying solely on a census-based algorithm here would likely lead to bias because of the high noise in the estimates.
Questions for Further Investigation
- Spatial Correlation of Unreliability: Is there a geographic pattern to data unreliability? For example, are rural “border” tracts consistently more difficult to count than urban centers, regardless of total population size?
- Impact of Time on Minority Margin of Error: How have the margins of error for small demographic subgroups (like the 5.3% Black population observed) changed over the last three ACS 5-year cycles? Are the estimates for such vulnerable groups becoming more or less reliable over time?
Technical Notes
Data Sources:
- U.S. Census Bureau, American Community Survey 2018-2022 5-Year Estimates
- Retrieved via tidycensus R package on 3/2/26
Reproducibility:
- All analysis conducted in R version 4.5.1
- Census API key required for replication
- Complete code and documentation available at: https://shaurya-chauhan.github.io/PortfolioPPA/
Methodology Notes: This analysis used 2022 ACS 5-year estimates via the tidycensus package, focusing on California’s county and tract-level data. To compare how population size affects data stability, I selected one county from each of the high, moderate, and low confidence levels. A strict 12% margin of error (MOE) threshold was applied to demographic variables to flag unreliable data. I used case_when logic to separate “No Population” tracts from “Unreliable” ones, so that mathematical errors in empty tracts did not skew the quality assessment. These analytical choices highlight that while strict standards are useful for policy, they frequently label smaller sub-populations as unreliable.
Limitations: The primary limitation of this analysis is its reliance on a strict 12% margin of error threshold for demographic variables at the tract level. While this ensures high data quality, it tends to disproportionately flag smaller minority populations as “unreliable”, potentially masking the needs of those groups in policy decisions. A second limitation is temporal: the study uses 2022 ACS 5-year estimates, which provide a stable average over time but may not capture rapid recent shifts in local conditions.
Submission Checklist
Before submitting your portfolio link on Canvas:
Remember: Submit your portfolio URL on Canvas, not the file itself. Your assignment should be accessible at your-portfolio-url/labs/lab_1/your_file_name.html