Lab 1: Census Data Quality for Policy Decisions

Evaluating Data Reliability for Algorithmic Decision-Making

Author

Shaurya Chauhan

Published

February 21, 2026

Assignment Overview

Scenario

You are a data analyst for the California Department of Human Services. The department is considering implementing an algorithmic system to identify communities that should receive priority for social service funding and outreach programs. Your supervisor has asked you to evaluate the quality and reliability of available census data to inform this decision.

Drawing on our Week 2 discussion of algorithmic bias, you need to assess not just what the data shows, but how reliable it is and what communities might be affected by data quality issues.

Learning Objectives

Apply dplyr functions to real census data for policy analysis
Evaluate data quality using margins of error
Connect technical analysis to algorithmic decision-making
Identify potential equity implications of data reliability issues
Create professional documentation for policy stakeholders

Submission Instructions

Submit by posting your updated portfolio link on Canvas. Your assignment should be accessible at your-portfolio-url/labs/lab_1/

Make sure to update your _quarto.yml navigation to include this assignment under a “Labs” menu.

Part 1: Portfolio Integration

Create this assignment in your portfolio repository under a labs/lab_1/ folder structure. Update your navigation menu to include:

- text: Assignments
  menu:
    - href: labs/lab_1/your_file_name.qmd
      text: "Lab 1: Census Data Exploration"

If the text contains a special character such as a colon, wrap it in double quotes so that Quarto reads it as text.

Setup

# Load required packages (hint: you need tidycensus, tidyverse, and knitr)
library(tidycensus)
library(tidyverse)
library(knitr)

# Set your Census API key (run once per machine; never commit a real key)
# census_api_key("YOUR_API_KEY", install = TRUE)

# Choose your state for analysis - assign it to a variable called my_state
my_state <- 'California'

State Selection: I have chosen California for this analysis because I have previously worked with datasets from New York and Pennsylvania and wanted to try a West Coast state.

Part 2: County-Level Resource Assessment

2.1 Data Retrieval

Your Task: Use get_acs() to retrieve county-level data for your chosen state.

Requirements:
- Geography: county level
- Variables: median household income (B19013_001) and total population (B01003_001)
- Year: 2022
- Survey: acs5
- Output format: wide

Hint: Remember to give your variables descriptive names using the variables = c(name = "code") syntax.

# Write your get_acs() code here
ca_data <- get_acs(
  geography = "county",
  variables = c(
    total_pop = "B01003_001",
    hshld_income = "B19013_001"
  ),
  state = "CA",
  year = 2022,
  output = "wide"
)

# Clean the county names to remove state name and "County" 
# Hint: use mutate() with str_remove()
ca_clean <- ca_data %>%
  mutate(
    county_name = str_replace(NAME, " County, California$", "")
  )


# Display the first few rows
head(ca_clean)

# A tibble: 6 × 7
  GEOID NAME       total_popE total_popM hshld_incomeE hshld_incomeM county_name
  <chr> <chr>           <dbl>      <dbl>         <dbl>         <dbl> <chr>      
1 06001 Alameda C…    1663823         NA        122488          1231 Alameda    
2 06003 Alpine Co…       1515        206        101125         17442 Alpine     
3 06005 Amador Co…      40577         NA         74853          6048 Amador     
4 06007 Butte Cou…     213605         NA         66085          2261 Butte      
5 06009 Calaveras…      45674         NA         77526          3875 Calaveras  
6 06011 Colusa Co…      21811         NA         69619          5745 Colusa

2.2 Data Quality Assessment

Your Task: Calculate margin of error percentages and create reliability categories.

Requirements:
- Calculate MOE percentage: (margin of error / estimate) * 100
- Create reliability categories:
  - High Confidence: MOE < 5%
  - Moderate Confidence: MOE 5-10%
  - Low Confidence: MOE > 10%
- Create a flag for unreliable estimates (MOE > 10%)

Hint: Use mutate() with case_when() for the categories.

# Calculate MOE percentage and reliability categories using mutate()
ca_reliability <- ca_clean %>%
  mutate(
    moe_percentage = round((hshld_incomeM / hshld_incomeE) * 100, 2),
    
    reliability = case_when(
      moe_percentage < 5 ~ "High Confidence",
      moe_percentage >= 5 & moe_percentage <= 10 ~ "Moderate",
      moe_percentage > 10 ~ "Low Confidence"
    )
  )

# Create a summary showing count of counties in each reliability category
# Hint: use count() and mutate() to add percentages
reliability_summary <- ca_reliability %>%
  group_by(reliability) %>%
  summarize(
    counties = n(),
    avg_income = round(mean(hshld_incomeE, na.rm = TRUE), 0)
  )
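
The hint above suggests count() followed by mutate() to add percentages. The following is a minimal sketch of that pattern, not the code used for the results in this lab. It starts from the category counts that can be tallied from the county-level framework table in this lab (41 High Confidence, 12 Moderate, 5 Low Confidence); the object names reliability_counts and reliability_shares are illustrative only.

```r
library(dplyr)

# Category counts as tallied from the county-level framework table in this lab.
# With the real data, `ca_reliability %>% count(reliability)` gives this shape.
reliability_counts <- tibble(
  reliability = c("High Confidence", "Moderate", "Low Confidence"),
  n = c(41L, 12L, 5L)
)

# Add each category's share of all counties, rounded to one decimal place.
reliability_shares <- reliability_counts %>%
  mutate(pct_of_counties = round(n / sum(n) * 100, 1))

print(reliability_shares)
```

Because the shares are computed from sum(n), they stay correct if the thresholds, and therefore the counts, change.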

2.3 High Uncertainty Counties

Your Task: Identify the 5 counties with the highest MOE percentages.

Requirements:
- Sort by MOE percentage (highest first)
- Select the top 5 counties
- Display: county name, median income, margin of error, MOE percentage, reliability category
- Format as a professional table using kable()

Hint: Use arrange(), slice(), and select() functions.

# Create table of top 5 counties by MOE percentage
high_uncertainty <- ca_reliability %>%
  filter(moe_percentage > 8) %>%
  arrange(desc(moe_percentage)) %>%
  slice_head(n = 5) %>%
  select(county_name, hshld_incomeE, moe_percentage, hshld_incomeM, reliability)

glimpse(high_uncertainty)

Rows: 5
Columns: 5
$ county_name    <chr> "Mono", "Alpine", "Sierra", "Trinity", "Plumas"
$ hshld_incomeE  <dbl> 82038, 101125, 61108, 47317, 67885
$ moe_percentage <dbl> 18.76, 17.25, 15.12, 12.45, 11.45
$ hshld_incomeM  <dbl> 15388, 17442, 9237, 5890, 7772
$ reliability    <chr> "Low Confidence", "Low Confidence", "Low Confidence", "…

# Format as table with kable() - include appropriate column names and caption
kable(high_uncertainty,
      col.names = c("County", "Household Income", "MOE %", "MOE", "Reliability Category"),
      caption = "Counties with Highest Income Data Uncertainty",
      format.args = list(big.mark = ","))

Counties with Highest Income Data Uncertainty
County	Household Income	MOE %	MOE	Reliability Category
Mono	82,038	18.76	15,388	Low Confidence
Alpine	101,125	17.25	17,442	Low Confidence
Sierra	61,108	15.12	9,237	Low Confidence
Trinity	47,317	12.45	5,890	Low Confidence
Plumas	67,885	11.45	7,772	Low Confidence

Data Quality Commentary:

The margins of error for these counties are large relative to their estimates, so reliability is low. This makes decisions harder for authorities, because the estimates are not precise enough to support targeted policy choices. The higher uncertainty may stem from adjustments made for households that did not respond, or from the weighting applied to make the sample representative, which can amplify variation in the data.
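
To make the commentary concrete, a margin of error can be translated into a confidence interval and a coefficient of variation (CV). The sketch below uses the Mono County figures from the table above and relies on the fact that ACS margins of error are published at the 90 percent confidence level, so the standard error is the MOE divided by 1.645. The variable names are illustrative and not part of the lab code.

```r
# Mono County values taken from the table above.
estimate <- 82038   # median household income estimate
moe      <- 15388   # published margin of error (ACS MOEs use a 90% confidence level)

# 90% confidence interval: estimate plus or minus the MOE.
ci_lower <- estimate - moe
ci_upper <- estimate + moe

# Standard error and coefficient of variation (CV), using the 1.645 z-value
# that corresponds to a 90% confidence level.
std_error <- moe / 1.645
cv_pct    <- std_error / estimate * 100

cat("90% CI:", ci_lower, "to", ci_upper, "\n")
cat("CV (%):", round(cv_pct, 1), "\n")
```

The interval runs from 66,650 to 97,426, a spread of more than $30,000 around a single county's median income, which illustrates why estimates like this are hard to use for ranking communities.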

Part 3: Neighborhood-Level Analysis

3.1 Focus Area Selection

Your Task: Select 2-3 counties from your reliability analysis for detailed tract-level study.

Strategy: Choose counties that represent different reliability levels (e.g., 1 high confidence, 1 moderate, 1 low confidence) to compare how data quality varies.

# Use filter() to select 2-3 counties from your county_reliability data
# Store the selected counties in a variable called selected_counties
selected_counties <- ca_reliability %>%
  group_by(reliability) %>%
  slice_sample(n = 1) %>%
  ungroup()


# Display the selected counties with their key characteristics
# Show: county name, median income, MOE percentage, reliability category
selected_counties %>%
  select(county_name, hshld_incomeE, moe_percentage, reliability)

# A tibble: 3 × 4
  county_name hshld_incomeE moe_percentage reliability    
  <chr>               <dbl>          <dbl> <chr>          
1 Butte               66085           3.42 High Confidence
2 Plumas              67885          11.4  Low Confidence 
3 San Benito         104451           5.23 Moderate

Comment on the output: Across the county data, reliability tends to rise with the population estimate. A high-population county such as Santa Clara is among the most reliable, while Alpine has the lowest population estimate and also some of the least reliable household income data.
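
Because slice_sample() draws at random, the counties selected above can change on each render unless a seed is fixed. The following is a minimal sketch of one way to make the draw repeatable. The tibble county_pool is a small stand-in for ca_reliability built from values shown in this lab, and the helper name pick_one_per_tier and the seed value 2022 are assumptions made for illustration.

```r
library(dplyr)

# Small stand-in for ca_reliability, with values copied from tables in this lab.
county_pool <- tibble(
  county_name    = c("Butte", "Santa Clara", "San Benito", "Calaveras", "Plumas", "Alpine"),
  moe_percentage = c(3.42, 1.00, 5.23, 5.00, 11.45, 17.25),
  reliability    = c("High Confidence", "High Confidence", "Moderate",
                     "Moderate", "Low Confidence", "Low Confidence")
)

# Helper (hypothetical name): draw one county per reliability tier after
# fixing the random seed, so repeated renders return the same counties.
pick_one_per_tier <- function(data, seed = 2022) {
  set.seed(seed)  # arbitrary seed value; any fixed integer works
  data %>%
    group_by(reliability) %>%
    slice_sample(n = 1) %>%
    ungroup()
}

first_draw  <- pick_one_per_tier(county_pool)
second_draw <- pick_one_per_tier(county_pool)

# Both draws use the same seed, so they should match exactly.
identical(first_draw, second_draw)
```

An alternative design choice is to name the counties explicitly with filter(), which removes randomness altogether and makes the written commentary easier to keep in sync with the output.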

3.2 Tract-Level Demographics

Your Task: Get demographic data for census tracts in your selected counties.

Requirements:
- Geography: tract level
- Variables: white alone (B03002_003), Black/African American (B03002_004), Hispanic/Latino (B03002_012), total population (B03002_001)
- Use the same state and year as before
- Output format: wide
- Challenge: You’ll need county codes, not names. Look at the GEOID patterns in your county data for hints.

# Define your race/ethnicity variables with descriptive names

# Use get_acs() to retrieve tract-level data
# Hint: You may need to specify county codes in the county parameter
ca_tract_data <- get_acs(
  geography = "tract",
  variables = c(
    white_pop = "B03002_003",
    black_pop = "B03002_004",
    latino_pop = "B03002_012",
    total_pop = "B03002_001"
  ),
  state = "CA",
  year = 2022,
  output = "wide"
)
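
The call above requests every tract in California. To limit the request to selected counties, the county codes can be derived from the county GEOIDs: the first two characters are the state FIPS code and the last three are the county FIPS code. The sketch below uses GEOIDs that appear in the county-level output earlier in this lab; the names county_lookup and county_codes are illustrative, and passing a vector to the county argument is shown only as a commented suggestion, on the assumption that the installed tidycensus version accepts it.

```r
library(dplyr)
library(stringr)

# County GEOIDs as they appear in the county-level output earlier in this lab.
county_lookup <- tibble(
  GEOID       = c("06001", "06003", "06007"),
  county_name = c("Alameda", "Alpine", "Butte")
)

# A five-digit county GEOID is the two-digit state FIPS code followed by
# the three-digit county FIPS code.
county_lookup <- county_lookup %>%
  mutate(
    state_fips  = str_sub(GEOID, 1, 2),
    county_fips = str_sub(GEOID, 3, 5)
  )

county_codes <- county_lookup$county_fips
print(county_codes)

# The vector could then be supplied to the tract request, for example:
# get_acs(geography = "tract", state = "CA", county = county_codes,
#         year = 2022, output = "wide", variables = c(total_pop = "B03002_001"))
```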

ca_tract_clean <- ca_tract_data %>%
  mutate(
    tract_name = str_replace(NAME, "; California$", "")
  )

# Calculate percentage of each group using mutate()
# Create percentages for white, Black, and Hispanic populations
tract_pop_percentages <- ca_tract_clean %>%
  mutate(
    pct_white    = round((white_popE / total_popE) * 100, 1),
    pct_black    = round((black_popE / total_popE) * 100, 1),
    pct_hispanic = round((latino_popE / total_popE) * 100, 1)
  )

# Add readable tract and county name columns using str_extract() or similar
tract_pop_percentages <- tract_pop_percentages %>%
  separate(NAME, 
           into = c("tract_name", "county_name", "state_name"), 
           sep = "; ")

3.3 Demographic Analysis

Your Task: Analyze the demographic patterns in your selected areas.

# Find the tract with the highest percentage of Hispanic/Latino residents
# Hint: use arrange() and slice() to get the top tract
highest_latino_pct <- tract_pop_percentages %>%
  arrange(desc(pct_hispanic)) %>%
  slice_head(n = 1)

# Calculate average demographics by county using group_by() and summarize()
# Show: number of tracts, average percentage for each racial/ethnic group
county_averages <- tract_pop_percentages %>%
  group_by(county_name) %>%
  summarize(
    avg_white = round(mean(pct_white, na.rm = TRUE), 1),
    avg_black = round(mean(pct_black, na.rm = TRUE), 1),
    avg_hispanic = round(mean(pct_hispanic, na.rm = TRUE), 1),
    total_tracts = n() # Bonus: see how many tracts are in each county
  ) %>%
  ungroup()

# Create a nicely formatted table of your results using kable()
library(kableExtra)
county_averages %>%
  kable(
    col.names = c("County", "% White", "% Black", "% Hispanic", "Total Tracts"),
    caption = "Average Demographics by County",
    align = "lrrrr"  # one alignment character per column (5 columns)
  )

Average Demographics by County
County	% White	% Black	% Hispanic	Total Tracts
Alameda County	31.0	10.7	21.4	379
Alpine County	58.1	0.0	14.1	1
Amador County	75.7	1.6	14.9	10
Butte County	69.3	1.5	17.4	54
Calaveras County	81.0	0.9	11.6	14
Colusa County	34.0	1.6	60.4	6
Contra Costa County	42.6	8.0	25.1	242
Del Norte County	59.5	2.2	19.6	9
El Dorado County	76.0	0.6	13.8	55
Fresno County	28.4	4.1	54.1	225
Glenn County	54.0	0.3	39.2	8
Humboldt County	72.1	1.3	11.6	36
Imperial County	11.4	2.6	82.3	40
Inyo County	62.1	0.8	23.2	6
Kern County	33.1	4.7	54.0	236
Kings County	29.6	5.7	57.3	31
Lake County	69.3	2.5	20.0	21
Lassen County	70.0	5.0	15.3	9
Los Angeles County	26.3	7.6	47.6	2498
Madera County	33.7	2.4	58.0	34
Marin County	69.2	2.5	16.5	63
Mariposa County	76.8	0.8	13.4	6
Mendocino County	64.6	0.5	24.9	24
Merced County	25.4	2.8	62.1	63
Modoc County	76.6	1.4	15.1	4
Mono County	64.2	0.2	27.8	4
Monterey County	35.2	2.0	52.6	104
Napa County	54.6	2.1	31.7	40
Nevada County	83.4	0.3	9.6	26
Orange County	41.3	1.5	32.4	614
Placer County	70.9	1.4	14.5	92
Plumas County	85.2	0.6	8.6	7
Riverside County	34.6	5.7	49.9	518
Sacramento County	43.2	9.1	23.8	363
San Benito County	32.9	0.8	59.5	12
San Bernardino County	28.5	7.1	53.3	466
San Diego County	45.5	4.4	33.3	737
San Francisco County	39.5	5.1	15.1	244
San Joaquin County	29.8	6.7	43.6	174
San Luis Obispo County	67.1	1.3	23.2	70
San Mateo County	37.9	2.1	23.5	174
Santa Barbara County	46.0	1.8	43.6	109
Santa Clara County	29.6	2.3	25.3	408
Santa Cruz County	56.7	0.8	34.2	70
Shasta County	77.7	0.9	10.6	50
Sierra County	86.6	0.2	11.4	1
Siskiyou County	73.9	1.2	14.5	16
Solano County	36.1	13.0	28.2	100
Sonoma County	63.6	1.4	25.6	122
Stanislaus County	38.7	2.7	48.9	112
Sutter County	45.8	1.8	32.5	21
Tehama County	65.9	0.9	26.0	14
Trinity County	79.2	1.8	7.0	4
Tulare County	27.4	1.2	65.3	103
Tuolumne County	78.1	1.9	13.5	18
Ventura County	45.0	1.7	42.1	190
Yolo County	46.5	2.7	31.1	53
Yuba County	56.0	3.2	26.7	19

Part 4: Comprehensive Data Quality Evaluation

4.1 MOE Analysis for Demographic Variables

Your Task: Examine margins of error for demographic variables to see if some communities have less reliable data.

Requirements:
- Calculate MOE percentages for each demographic variable
- Flag tracts where any demographic variable has MOE > 15%
- Create summary statistics

# Calculate MOE percentages for white, Black, and Hispanic variables
# Hint: use the same formula as before (margin/estimate * 100)
MOE_demographic <- tract_pop_percentages %>%
  mutate(
    moe_percentage_white = round((white_popM / white_popE) * 100, 2),
    moe_percentage_black = round((black_popM / black_popE) * 100, 2),
    moe_percentage_latino = round((latino_popM / latino_popE) * 100, 2),
    moe_percentage_popln_total = round((total_popM / total_popE) * 100, 2),
    
    reliability = case_when(
      moe_percentage_popln_total < 5 ~ "High Confidence",
      moe_percentage_popln_total >= 5 & moe_percentage_popln_total <= 12 ~ "Moderate Confidence",
      moe_percentage_popln_total > 12 ~ "Low Confidence",
      TRUE ~ "No Data"
    )
  )



# Create a flag for tracts with high MOE on any demographic variable
# Use logical operators (| for OR) in an ifelse() statement
MOE_demographic <- MOE_demographic %>%
  mutate(
    high_moe_flag = case_when(
      # Handle cases where population is zero (prevents 100% unreliability)
      (white_popE == 0 & black_popE == 0 & latino_popE == 0) ~ "No Population",
      
      # The strict "OR" logic based on your assigned cutoffs
      (moe_percentage_white > 12 | 
       moe_percentage_black > 12 | 
       moe_percentage_latino > 12) ~ "Unreliable Demographics",
      
      # If it passes both above, it's reliable
      TRUE ~ "Reliable Demographics"
    )
  )

# Create summary statistics showing how many tracts have data quality issues
quality_summary <- MOE_demographic %>%
  count(high_moe_flag) %>%
  mutate(percentage = round((n / sum(n)) * 100, 1))
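
One detail worth checking in the flag above: when an estimate is zero and its margin of error is positive, the ratio evaluates to Inf in R, and Inf > 12 is TRUE, so a tract with no residents in a single group is flagged as unreliable. The following is a minimal sketch of an explicit guard for that case. The tract values are hypothetical and chosen only to illustrate the behavior, and the names toy_tracts, moe_pct_raw, and moe_pct_safe are not part of the lab code.

```r
library(dplyr)

# Hypothetical tract values, chosen only to illustrate the zero-estimate case.
toy_tracts <- tibble(
  tract      = c("A", "B", "C"),
  black_popE = c(250, 0, 40),
  black_popM = c(20, 13, 30)
)

toy_tracts <- toy_tracts %>%
  mutate(
    # Unguarded ratio: a zero estimate with a positive MOE yields Inf.
    moe_pct_raw  = black_popM / black_popE * 100,
    # Guarded ratio: treat a zero estimate as "not computable" (NA) instead.
    moe_pct_safe = if_else(black_popE > 0, black_popM / black_popE * 100, NA_real_),
    flag = case_when(
      is.na(moe_pct_safe) ~ "Not computable",
      moe_pct_safe > 12   ~ "Unreliable",
      TRUE                ~ "Reliable"
    )
  )

print(toy_tracts)
```

Whether zero-estimate groups should count as unreliable, be excluded, or be reported separately is an analytical choice; the sketch only makes that choice visible rather than leaving it to how R handles division by zero.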

4.2 Pattern Analysis

Your Task: Investigate whether data quality problems are randomly distributed or concentrated in certain types of communities.

# Group tracts by whether they have high MOE issues
# Calculate average characteristics for each group:
# - population size, demographic percentages
data_quality_pattern <- MOE_demographic %>%
  group_by(high_moe_flag) %>%
  summarize(
    count = n(),
    avg_total_pop = mean(total_popE, na.rm = TRUE),
    avg_white_pct = mean(pct_white, na.rm = TRUE),
    avg_black_pct = mean(pct_black, na.rm = TRUE),
    avg_latino_pct = mean(pct_hispanic, na.rm = TRUE)
  ) %>%
  mutate(across(where(is.numeric), \(x) round(x, 1)))  # anonymous function avoids the deprecated across(..., round, 1) form

# Use group_by() and summarize() to create this comparison
# Create a professional table showing the patterns
pattern_table <- data.frame(
  `Quality Status` = c("Unreliable Demographics", "No Population"),
  `Avg Population` = c(4334, 2.6),
  `Pct White` = c(38.0, 0),
  `Pct Latino` = c(38.0, 0),
  `Pct Black`  = c(5.3, 0)
)

kable(
  pattern_table, 
  caption = "Characteristics of Tracts with Data Quality Issues",
  align = "lrrrr",
  col.names = c("Quality Status", "Avg Population", "% White", "% Latino", "% Black")
)

Characteristics of Tracts with Data Quality Issues
Quality Status	Avg Population	% White	% Latino	% Black
Unreliable Demographics	4334.0	38	38	5.3
No Population	2.6	0	0	0.0
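
The table above is typed in by hand, so it can drift out of sync with the data if the analysis is re-run. The following is a sketch of building the same display directly from a summary shaped like data_quality_pattern. The stand-in tibble pattern_summary reuses the figures printed in the table (the count column is omitted because its values are not shown), and both object names are illustrative.

```r
library(dplyr)
library(knitr)

# Stand-in for data_quality_pattern, reusing the figures printed in the table.
pattern_summary <- tibble(
  high_moe_flag  = c("No Population", "Unreliable Demographics"),
  avg_total_pop  = c(2.6, 4334),
  avg_white_pct  = c(0, 38),
  avg_black_pct  = c(0, 5.3),
  avg_latino_pct = c(0, 38)
)

# Reorder rows and columns for display, then let kable() label the columns.
pattern_display <- pattern_summary %>%
  arrange(desc(avg_total_pop)) %>%
  select(high_moe_flag, avg_total_pop, avg_white_pct, avg_latino_pct, avg_black_pct)

kable(
  pattern_display,
  col.names = c("Quality Status", "Avg Population", "% White", "% Latino", "% Black"),
  caption = "Characteristics of Tracts with Data Quality Issues",
  align = "lrrrr"
)
```

With the real data, replacing pattern_summary with data_quality_pattern would keep the table tied to the computed values.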

Pattern Analysis: Data quality issues are mostly found in smaller neighborhoods where fewer people live, which makes the Census estimates less certain. A major pattern is that when a specific group (such as the Black population in this data) makes up a very small share of the neighborhood (around 5.3%), its margin of error becomes too large relative to the estimate to meet strict standards. We also see tracts with almost no people, which are likely parks or industrial areas rather than residential communities. This shows that it is very hard to get highly reliable data in places that are either sparsely populated or have very small numbers of a specific demographic group.
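
The link between a group's small share and a large relative margin of error can be illustrated with the textbook formula for a proportion under simple random sampling. This is a deliberate simplification: the ACS uses a more complex sample design and publishes its own margins of error, so the numbers below are indicative only. The sample sizes are hypothetical, the 5.3% and 38% shares come from the table above, and the function name relative_moe_pct is an assumption made for this sketch.

```r
# Relative margin of error (as a % of the estimate) for a proportion p
# estimated from a simple random sample of size n, at 90% confidence.
# Simplifying assumption: the ACS design is more complex than this.
relative_moe_pct <- function(p, n, z = 1.645) {
  moe <- z * sqrt(p * (1 - p) / n)
  moe / p * 100
}

sample_sizes <- c(100, 400, 1600)   # hypothetical numbers of sampled residents

small_group <- relative_moe_pct(p = 0.053, n = sample_sizes)  # 5.3% share
large_group <- relative_moe_pct(p = 0.38,  n = sample_sizes)  # 38% share

data.frame(
  sample_size         = sample_sizes,
  rel_moe_small_group = round(small_group, 1),
  rel_moe_large_group = round(large_group, 1)
)
```

Under these assumptions, the relative margin of error halves each time the sample size quadruples, and at any given sample size the smaller group carries a much larger relative margin of error than the larger one.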

Part 5: Policy Recommendations

5.1 Analysis Integration and Professional Summary

Your Task: Write an executive summary that integrates findings from all four analyses.

Executive Summary Requirements:
1. Overall Pattern Identification: What are the systematic patterns across all your analyses?
2. Equity Assessment: Which communities face the greatest risk of algorithmic bias based on your findings?
3. Root Cause Analysis: What underlying factors drive both data quality issues and bias risk?
4. Strategic Recommendations: What should the Department implement to address these systematic issues?

Executive Summary:

Across California, data reliability follows a clear systematic pattern: accuracy is closely tied to population size and demographic concentration. At the county level, high-population hubs like Santa Clara exhibit high confidence, while smaller rural areas like Alpine County show significant uncertainty in household income data. This pattern is even more pronounced at the tract level, where neighborhoods with smaller total populations or very small shares of specific demographic groups frequently fail to meet our reliability standards. Hence, the further the analysis zooms in on small sub-populations, the noisier and less reliable the census estimates become.

These findings present a significant equity risk: communities with the highest data uncertainty often face the greatest risk of algorithmic bias. Specifically, neighborhoods where the Black population makes up a small percentage (about 5.3% Black in tracts flagged as unreliable) are disproportionately flagged as having low-confidence data. If an algorithmic system is used to prioritize social funding based on these metrics, these communities may be unfairly excluded or misrepresented. Under-represented or sparsely populated groups are almost hidden by high margins of error, which could lead to a systematic denial of resources to those who need them most.

The root cause of this reliability gap is sampling variability within the American Community Survey. Because the ACS is a sample rather than a full count, the margin of error increases as the sample size decreases. In tracts with lower residential density or areas with very small numbers of specific demographic groups, the noise in the data often exceeds our 12% reliability threshold. In addition, tracts with almost no population, such as non-residential industrial zones or parks, can produce undefined or extreme MOE percentages that could further skew an algorithm’s results.

To address these systematic issues, the Department could adopt a tiered decision-making framework. Algorithmic systems should be implemented immediately only in “High Confidence” counties, where margins of error are below 5%. For “Moderate Confidence” areas, the Department could implement outcome monitoring to catch potential biases. Finally, “Low Confidence” tracts or counties should require manual review or proxy data sources (such as school enrollment or state tax records) before any significant decisions are made.

5.2 Specific Recommendations

Your Task: Create a decision framework for algorithm implementation.

# Create a summary table using your county reliability data
# Include: county name, median income, MOE percentage, reliability category

# Add a new column with algorithm recommendations using case_when():
# - High Confidence: "Safe for algorithmic decisions"
# - Moderate Confidence: "Use with caution - monitor outcomes"  
# - Low Confidence: "Requires manual review or additional data"

county_recommendations <- ca_reliability %>%
  mutate(
    # Creating the decision framework
    algorithm_recommendation = case_when(
      reliability == "High Confidence" ~ "Safe for algorithmic decisions",
      reliability == "Moderate" ~ "Use with caution - monitor outcomes",
      reliability == "Low Confidence" ~ "Requires manual review or additional data",
      TRUE ~ "Data unavailable"
    )
  ) %>%
  # Selecting and renaming columns for the final table
  select(
    `County` = county_name,
    `Median Income` = hshld_incomeE,
    `MOE %` = moe_percentage,
    `Reliability Category` = reliability,
    `Recommendation` = algorithm_recommendation
  ) %>%
  # Sorting by MOE % to show most reliable data first
  arrange(`MOE %`)



# Format as a professional table with kable()
kable(
  county_recommendations,
  caption = "County-Level Algorithmic Implementation Framework",
  format.args = list(big.mark = ","),
  align = "lrrll"
)

County-Level Algorithmic Implementation Framework
County	Median Income	MOE %	Reliability Category	Recommendation
Los Angeles	83,411	0.53	High Confidence	Safe for algorithmic decisions
Orange	109,361	0.81	High Confidence	Safe for algorithmic decisions
Sacramento	84,010	0.97	High Confidence	Safe for algorithmic decisions
Alameda	122,488	1.00	High Confidence	Safe for algorithmic decisions
Santa Clara	153,792	1.00	High Confidence	Safe for algorithmic decisions
San Diego	96,974	1.02	High Confidence	Safe for algorithmic decisions
San Bernardino	77,423	1.04	High Confidence	Safe for algorithmic decisions
Contra Costa	120,020	1.25	High Confidence	Safe for algorithmic decisions
Riverside	84,505	1.26	High Confidence	Safe for algorithmic decisions
Fresno	67,756	1.43	High Confidence	Safe for algorithmic decisions
San Francisco	136,689	1.43	High Confidence	Safe for algorithmic decisions
Ventura	102,141	1.50	High Confidence	Safe for algorithmic decisions
Placer	109,375	1.70	High Confidence	Safe for algorithmic decisions
San Joaquin	82,837	1.75	High Confidence	Safe for algorithmic decisions
San Mateo	149,907	1.75	High Confidence	Safe for algorithmic decisions
Solano	97,037	1.78	High Confidence	Safe for algorithmic decisions
Stanislaus	74,872	1.83	High Confidence	Safe for algorithmic decisions
Sonoma	99,266	2.00	High Confidence	Safe for algorithmic decisions
Santa Barbara	92,332	2.05	High Confidence	Safe for algorithmic decisions
Kern	63,883	2.07	High Confidence	Safe for algorithmic decisions
Monterey	91,043	2.09	High Confidence	Safe for algorithmic decisions
Tulare	64,474	2.31	High Confidence	Safe for algorithmic decisions
San Luis Obispo	90,158	2.56	High Confidence	Safe for algorithmic decisions
Yolo	85,097	2.74	High Confidence	Safe for algorithmic decisions
Napa	105,809	2.82	High Confidence	Safe for algorithmic decisions
Marin	142,019	2.89	High Confidence	Safe for algorithmic decisions
Santa Cruz	104,409	3.04	High Confidence	Safe for algorithmic decisions
Kings	68,540	3.29	High Confidence	Safe for algorithmic decisions
Merced	64,772	3.31	High Confidence	Safe for algorithmic decisions
El Dorado	99,246	3.36	High Confidence	Safe for algorithmic decisions
Butte	66,085	3.42	High Confidence	Safe for algorithmic decisions
Mendocino	61,335	3.58	High Confidence	Safe for algorithmic decisions
Shasta	68,347	3.63	High Confidence	Safe for algorithmic decisions
Humboldt	57,881	3.68	High Confidence	Safe for algorithmic decisions
Madera	73,543	3.87	High Confidence	Safe for algorithmic decisions
Imperial	53,847	4.11	High Confidence	Safe for algorithmic decisions
Yuba	66,693	4.19	High Confidence	Safe for algorithmic decisions
Lake	56,259	4.34	High Confidence	Safe for algorithmic decisions
Sutter	72,654	4.71	High Confidence	Safe for algorithmic decisions
Nevada	79,395	4.82	High Confidence	Safe for algorithmic decisions
Siskiyou	53,898	4.90	High Confidence	Safe for algorithmic decisions
Calaveras	77,526	5.00	Moderate	Use with caution - monitor outcomes
San Benito	104,451	5.23	Moderate	Use with caution - monitor outcomes
Lassen	59,515	5.97	Moderate	Use with caution - monitor outcomes
Glenn	64,033	6.19	Moderate	Use with caution - monitor outcomes
Tuolumne	70,432	6.66	Moderate	Use with caution - monitor outcomes
Tehama	59,029	6.95	Moderate	Use with caution - monitor outcomes
Del Norte	61,149	7.16	Moderate	Use with caution - monitor outcomes
Amador	74,853	8.08	Moderate	Use with caution - monitor outcomes
Colusa	69,619	8.25	Moderate	Use with caution - monitor outcomes
Inyo	63,417	8.60	Moderate	Use with caution - monitor outcomes
Mariposa	60,021	8.82	Moderate	Use with caution - monitor outcomes
Modoc	54,962	9.80	Moderate	Use with caution - monitor outcomes
Plumas	67,885	11.45	Low Confidence	Requires manual review or additional data
Trinity	47,317	12.45	Low Confidence	Requires manual review or additional data
Sierra	61,108	15.12	Low Confidence	Requires manual review or additional data
Alpine	101,125	17.25	Low Confidence	Requires manual review or additional data
Mono	82,038	18.76	Low Confidence	Requires manual review or additional data

Key Recommendations:

Your Task: Use your analysis results to provide specific guidance to the department.

Counties suitable for immediate algorithmic implementation: Santa Clara, Los Angeles, Sacramento, etc. Why: These counties have large population estimates, which go along with higher data reliability and low margins of error (MOE < 5%). Their large sample sizes mean that algorithmic decisions rest on stable estimates, which minimizes the risk of error.
Counties requiring additional oversight: Calaveras, San Benito, Lassen, etc. Monitoring needed: For these areas, the department should implement outcome monitoring, regularly auditing the algorithm’s decisions against real-world feedback to ensure that minor data fluctuations are not causing systematic biases.
Counties needing alternative approaches: Plumas, Trinity, Sierra, etc. Why: Because these areas often have small or sparsely distributed populations, the department should rely on manual review or supplemental data (such as local administrative records). Relying solely on a census-based algorithm here would likely lead to bias because of the high noise in the estimates.

Questions for Further Investigation

Spatial Correlation of Unreliability: Is there a geographic pattern to data unreliability? For example, are rural “border” tracts consistently more difficult to count than urban centers, regardless of total population size?
Impact of Time on Minority Margin of Error: How have the margins of error for small demographic subgroups (like the 5.3% Black population observed) changed over the last three ACS 5-year cycles? Is the data for such vulnerable groups becoming more or less reliable over time?

Technical Notes

Data Sources:
- U.S. Census Bureau, American Community Survey 2018-2022 5-Year Estimates
- Retrieved via tidycensus R package on 3/2/26

Reproducibility:
- All analysis conducted in R version 4.5.1
- Census API key required for replication
- Complete code and documentation available at: https://shaurya-chauhan.github.io/PortfolioPPA/

Methodology Notes: This analysis used 2022 ACS 5-year estimates via the tidycensus package, focusing on California’s county and tract-level data. To compare reliability levels, I randomly selected one county from each of the high, moderate, and low confidence categories to observe how population size affects data stability. A strict 12% Margin of Error (MOE) threshold was applied to demographic variables to flag unreliable data. I specifically used case_when logic to separate “No Population” tracts from “Unreliable” ones, so that mathematical errors in empty tracts did not skew the quality assessment. These analytical choices highlight that while strict standards are useful for policy, they frequently label smaller sub-populations as unreliable.

Limitations: The primary limitation of this analysis is the reliance on a strict 12% Margin of Error threshold for demographic variables at the tract level. While this ensures high data quality, it tends to disproportionately flag smaller minority populations as “unreliable”, potentially masking the needs of those specific groups in policy decisions. Temporally, the study uses 2022 ACS 5-year estimates, which provide a stable average over time but may not capture rapid recent shifts in local conditions.

Submission Checklist

Before submitting your portfolio link on Canvas:

All code chunks run without errors
All “[Fill this in]” prompts have been completed
Tables are properly formatted and readable
Executive summary addresses all four required components
Portfolio navigation includes this assignment
Census API key is properly set
Document renders correctly to HTML

Remember: Submit your portfolio URL on Canvas, not the file itself. Your assignment should be accessible at your-portfolio-url/labs/lab_1/your_file_name.html