Pipeline Status Flow: pending → processing → processed → ready → completed
Error States: error, duplicate, failed, rollback
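The status values above can be captured as a small enum so that workers and queries share one vocabulary. This is an illustrative sketch; the class and constant names are assumptions, not part of the pipeline code.

```python
from enum import Enum

class RecordStatus(str, Enum):
    """Statuses a staging record moves through (names illustrative)."""
    PENDING = "pending"        # extracted, awaiting processing
    PROCESSING = "processing"  # claimed by a worker
    PROCESSED = "processed"    # Claude analysis complete
    READY = "ready"            # enriched data validated
    COMPLETED = "completed"    # promoted out of staging
    # Error states
    ERROR = "error"
    DUPLICATE = "duplicate"
    FAILED = "failed"
    ROLLBACK = "rollback"

# Happy-path order, useful when validating status transitions
HAPPY_PATH = [RecordStatus.PENDING, RecordStatus.PROCESSING,
              RecordStatus.PROCESSED, RecordStatus.READY,
              RecordStatus.COMPLETED]
```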
Module 1: Data Extraction (City-Specific)
Purpose
Extract health inspection data from various sources (CSV, API, web scraping) and normalize it into a standardized format for processing. Each city requires its own extraction module due to different data formats and sources.
General Approach
Core Responsibilities:
- Source Connection: Connect to data source (CSV file, API endpoint, or web page)
- Data Retrieval: Download or fetch inspection records
- Field Extraction: Extract all available fields from raw data into corresponding database fields
- Record Combining: Merge multiple violations from the same inspection date into a single inspection record
- Normalization: Convert to standardized key:value pairs format
- Staging: Insert normalized data into temporary staging table
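The responsibilities above can be sketched as a small per-city extractor for the CSV case. This is a minimal sketch, not the actual module: the function name and `field_map` shape are assumptions, and it only covers retrieval, field extraction, and grouping rows for later record combining.

```python
import csv
from collections import defaultdict

def extract_city_csv(path, field_map):
    """Read a city's CSV, map source columns to database fields, and
    group rows by (facility, inspection date) so multiple violations
    from the same inspection can later be combined into one record."""
    grouped = defaultdict(list)
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Field extraction: copy only mapped columns, stripped
            record = {db_field: row.get(src, "").strip()
                      for src, db_field in field_map.items()}
            key = (record.get("facility_name"), record.get("inspection_date"))
            grouped[key].append(record)
    return grouped
```

Each value in the returned dict is a list of rows for one inspection; a list longer than one signals that consolidation is needed downstream.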
Implementation Progress - Chicago
✅ Completed Steps:
- Created /data/chicago/ folder structure

- Downloaded Chicago food inspections CSV (312MB)
- Identified 15+ fields that can be directly extracted from CSV
Field Extraction Strategy
Principle: Any field that can be directly extracted from raw data should be populated at this extraction stage, minimizing the work needed in later processing stages.
Standardized Metadata Approach
Two-Phase Data Enrichment Strategy:
We separate factual extraction from intelligent inference, letting Claude AI handle all complex analysis to ensure maximum accuracy and context awareness.
Important Update: We no longer use hardcoded inference rules. All intelligent metadata is generated by Claude during processing, ensuring comprehensive and contextually accurate analysis.
Metadata Categories & Purpose
1. ESTABLISHMENT PROFILING
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Fields: cuisine_type, service_style, price_range, is_chain, chain_name
Purpose: Creates establishment "fingerprints" for pattern analysis
Example Pattern: "Fast food chains have 3x more temperature violations than casual dining"
AI Benefit: Enables predictive risk scoring based on establishment type
2. GEOGRAPHIC CONTEXTUALIZATION
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Fields: neighborhood, neighborhood_type, nearby_landmarks, nearby_transit, nearby_universities
Purpose: Maps spatial relationships and demographic patterns
Example Pattern: "Restaurants near universities show seasonal violation spikes"
AI Benefit: Location-based risk prediction and targeted inspections
3. VIOLATION INTELLIGENCE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Fields: violation_category, violation_severity, critical_violations, is_closure
Purpose: Structured violation taxonomy for trend analysis
Example Pattern: "Pest violations increase 40% in summer months"
AI Benefit: Seasonal forecasting and resource allocation
4. RISK INDICATORS
━━━━━━━━━━━━━━━━━━━━━━━━━━
Fields: risk_factors, typical_violations, food_safety_concerns, remediation_required
Purpose: Pre-computed risk signals for real-time scoring
Example Pattern: "Sushi restaurants + summer = elevated temperature risk"
AI Benefit: Proactive alerts before violations occur
Vectorization Benefits
How Metadata Improves AI Performance:
- Dimensional Richness: 60+ fields create high-dimensional vectors that capture subtle patterns invisible to simple text search
- Semantic Clustering: Similar establishments naturally cluster in vector space (e.g., all pizza places near colleges)
- Anomaly Detection: Outliers become mathematically identifiable (e.g., fine dining with pest violations)
- Predictive Power: Historical patterns + metadata = future risk prediction
- Query Precision: "Find Korean BBQ restaurants in tourist areas with temperature violations" becomes a simple vector similarity search
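The "query precision" point can be illustrated with plain cosine similarity over toy metadata vectors. The four dimensions and their encodings below are invented purely for illustration; the real vectors would come from an embedding of the 60+ fields.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy 4-dimensional "metadata vectors" (dimensions are assumptions):
# [is_korean_bbq, in_tourist_area, temp_violations_norm, price_tier]
query       = [1.0, 1.0, 1.0, 0.5]   # "Korean BBQ, tourist area, temp violations"
korean_bbq  = [1.0, 0.9, 0.8, 0.5]   # matches the query pattern
pizza_place = [0.0, 0.2, 0.9, 0.2]   # violations match, cuisine/location don't
```

Ranking establishments by `cosine(query, …)` surfaces the Korean BBQ record first, which is exactly the "vector similarity search" behavior described above, just at toy scale.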
Pattern Recognition Examples
Patterns Enabled by Rich Metadata:
- Chain Analysis: "Chipotle locations within 1 mile of universities have 2x violation rate"
- Cuisine Risks: "Buffet-style restaurants have 5x higher contamination violations"
- Geographic Trends: "Loop restaurants improve compliance before convention seasons"
- Temporal Patterns: "Food trucks show violation spikes during summer festivals"
- Demographic Correlations: "High-traffic tourist areas correlate with handwashing violations"
Two-Phase Implementation
Phase 1: Extraction (Facts Only)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1. Read raw CSV/API data
2. Apply direct field mappings only
3. Store factual data as-is:
- facility_name: "Luigi's Pizza"
- address: "123 N Michigan Ave"
- violations: [raw text]
4. Leave inference fields empty
5. Mark status: "pending"
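Phase 1 can be sketched as a pure function over one raw row. The function name and the list of inference fields are illustrative; the point is that only direct mappings are applied, inference fields stay empty, and the record leaves this phase marked `pending`.

```python
def phase1_record(raw_row, field_map):
    """Phase 1: copy mapped facts verbatim, leave inference fields
    empty, and mark the record pending for Claude processing."""
    record = {db_field: raw_row.get(src) for src, db_field in field_map.items()}
    # Inference fields are left empty; Claude fills them in Phase 2
    for inferred in ("cuisine_type", "service_style", "is_chain",
                     "price_range", "neighborhood"):
        record[inferred] = None
    record["status"] = "pending"
    return record
```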
Phase 2: Claude Processing (Intelligence)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Claude analyzes each record and provides:
ESTABLISHMENT ANALYSIS:
- cuisine_type: "italian" (from name/context)
- service_style: "casual_dining" (from full context)
- is_chain: true/false (from knowledge base)
- price_range: "$-$$$$" (from knowledge)
LOCATION INTELLIGENCE:
- neighborhood: "Magnificent Mile" (knows Chicago)
- nearby_landmarks: ["Water Tower", "Navy Pier"]
- nearby_transit: ["Red Line Grand Station"]
- district_characteristics: "High-end shopping district"
VIOLATION ANALYSIS:
- violation_category: "temperature_control"
- severity_score: 1-10 (contextual assessment)
- critical_violations: [extracted list]
- risk_factors: [identified concerns]
HISTORICAL CONTEXT:
- chain_history: "Previous E. coli outbreak 2018"
- remediation_required: "Retrain staff on HACCP"
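Phase 2 amounts to building one prompt per record and asking Claude for the inferred fields as JSON. The prompt wording and field list below are illustrative, and the model name in the commented-out call is an assumption; only the prompt builder is real code here.

```python
import json

def build_enrichment_prompt(record):
    """Ask Claude to return the inferred metadata fields as JSON.
    The exact field list and wording are illustrative."""
    return (
        "Analyze this health inspection record and return JSON with keys: "
        "cuisine_type, service_style, is_chain, price_range, neighborhood, "
        "nearby_landmarks, violation_category, severity_score, risk_factors.\n\n"
        f"Record:\n{json.dumps(record, indent=2)}"
    )

# Actual call (sketch; requires the `anthropic` package and an API key):
# client = anthropic.Anthropic()
# reply = client.messages.create(
#     model="claude-sonnet-4-20250514",   # model name is an assumption
#     max_tokens=1024,
#     messages=[{"role": "user", "content": build_enrichment_prompt(record)}],
# )
# enriched = json.loads(reply.content[0].text)
```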
Advantages of Claude-Based Inference:
- ✓ No hardcoded rules to maintain
- ✓ Complete Chicago knowledge (all neighborhoods, landmarks)
- ✓ Contextual understanding (Luigi's could be pizza OR fine Italian)
- ✓ Historical awareness (knows past incidents)
- ✓ Consistent, comprehensive analysis
- ✓ Adapts to new patterns automatically
Chicago Direct Field Mappings
CSV Field → Database Field
========================
Inspection ID → inspection_id
DBA Name → facility_name
AKA Name → aka_name
License # → license_number
Facility Type → facility_type
Risk → risk_level
Address → address
City → city
State → state
Zip → zip_code
Inspection Date → inspection_date
Inspection Type → inspection_type
Results → results
Violations → violations
Latitude → latitude
Longitude → longitude
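The mapping table above translates directly into a dictionary that can drive the extraction step:

```python
# Chicago CSV column -> database field (taken from the table above)
CHICAGO_FIELD_MAP = {
    "Inspection ID":   "inspection_id",
    "DBA Name":        "facility_name",
    "AKA Name":        "aka_name",
    "License #":       "license_number",
    "Facility Type":   "facility_type",
    "Risk":            "risk_level",
    "Address":         "address",
    "City":            "city",
    "State":           "state",
    "Zip":             "zip_code",
    "Inspection Date": "inspection_date",
    "Inspection Type": "inspection_type",
    "Results":         "results",
    "Violations":      "violations",
    "Latitude":        "latitude",
    "Longitude":       "longitude",
}
```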
Data Consolidation Logic (Two-Stage Process)
Stage 1: Bulk Insert
First, ALL rows from the CSV are inserted directly into the chicago_temp table without modification, preserving all original data.
Stage 2: Consolidation
After bulk insert, consolidate multiple violation records for the same inspection:
Important Distinction:
- Multiple Violations ≠ Duplicates: Each row represents a separate violation instance, even if the violation type is the same
- Example: 3 rows with "Temperature violation" = 3 separate temperature issues, NOT duplicates
- True Duplicates: Only identical inspection_id values are actual duplicates (data errors)
Consolidation Process:
1. Query for inspections with multiple violations:
SELECT facility_name, inspection_date, COUNT(*)
FROM chicago_temp
GROUP BY facility_name, inspection_date
HAVING COUNT(*) > 1
2. For each facility+date with multiple rows:
- Combine ALL violations into raw_data as:
violation_1:First violation text
violation_2:Second violation text
violation_3:Third violation text (even if same type)
- Add metadata:
total_violations:3
combined_from_rows:3
consolidation_date:timestamp
- Update first record with combined data
- DELETE the other records
- Mark as consolidated: is_combined = true
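The consolidation steps above can be sketched against an in-memory SQLite table. This is a simplified sketch: only a few columns are shown, the `raw_data` joining format is abbreviated, and the real pipeline may run on a different database engine.

```python
import sqlite3

def consolidate(conn):
    """Merge rows sharing (facility_name, inspection_date) into one row:
    violations become numbered violation_N pairs in raw_data, the extra
    rows are deleted, and the survivor is flagged is_combined."""
    cur = conn.cursor()
    groups = cur.execute(
        "SELECT facility_name, inspection_date FROM chicago_temp "
        "GROUP BY facility_name, inspection_date HAVING COUNT(*) > 1"
    ).fetchall()
    for name, date in groups:
        rows = cur.execute(
            "SELECT id, violations FROM chicago_temp "
            "WHERE facility_name = ? AND inspection_date = ? ORDER BY id",
            (name, date),
        ).fetchall()
        # Every row stays a separate violation, even if the type repeats
        parts = [f"violation_{i}:{v}" for i, (_, v) in enumerate(rows, 1)]
        parts += [f"total_violations:{len(rows)}",
                  f"combined_from_rows:{len(rows)}"]
        keep_id = rows[0][0]
        cur.execute(
            "UPDATE chicago_temp SET raw_data = ?, is_combined = 1 WHERE id = ?",
            (", ".join(parts), keep_id),
        )
        cur.execute(
            "DELETE FROM chicago_temp WHERE facility_name = ? "
            "AND inspection_date = ? AND id != ?",
            (name, date, keep_id),
        )
    conn.commit()
```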
Example Consolidation
BEFORE (3 rows for Joe's Pizza on 2025-08-15):
Row 1: Temperature violation - Cold holding 45°F
Row 2: Temperature violation - Cold holding 48°F
Row 3: Handwashing station - No soap
AFTER (1 consolidated row):
raw_data: "facility:Joe's Pizza, date:2025-08-15,
violation_1:Temperature violation - Cold holding 45°F,
violation_2:Temperature violation - Cold holding 48°F,
violation_3:Handwashing station - No soap,
total_violations:3,
combined_from_rows:3"
Critical: Every violation row is preserved as a separate violation, maintaining accurate violation counts and severity for article generation.
Output Format
Standardized key:value pairs stored in raw_data field:
facility_name:Restaurant Name
inspection_date:2025-08-15
city:Chicago
state:IL
violations:Temperature violation|Handwashing station|Pest activity
total_points:30
result:Failed
inspector:John Smith
address:123 Main St
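The output format above can be produced and parsed with a small pair of helpers. These are illustrative (the real pipeline's serializer may differ); list values such as violations are pipe-joined as in the example.

```python
def to_raw_data(fields):
    """Serialize a record into the key:value line format, joining
    list values (e.g. violations) with '|' as shown above."""
    lines = []
    for key, value in fields.items():
        if isinstance(value, list):
            value = "|".join(value)
        lines.append(f"{key}:{value}")
    return "\n".join(lines)

def from_raw_data(text):
    """Parse key:value lines back into a dict (values stay strings;
    only the first ':' on each line separates key from value)."""
    return dict(line.split(":", 1) for line in text.splitlines() if ":" in line)
```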
Staging Table Structure (chicago_temp)
The temp table mirrors the main database structure with all extractable fields populated:
id - Auto-increment primary key
inspection_id - Unique inspection identifier from source
facility_name - Restaurant/facility name (DBA Name)
aka_name - Alternative business name
license_number - Business license number
facility_type - Type of establishment
risk_level - Risk category (1, 2, 3)
address - Street address
city - City name
state - State code
zip_code - Postal code
inspection_date - Date of inspection
inspection_type - Type of inspection performed
results - Inspection outcome (Pass/Fail/etc.)
violations - Violation details text
latitude - Geographic coordinate
longitude - Geographic coordinate
raw_data - Complete key:value pairs from source
status - Processing status (pending/processing/processed/error)
source_city - Source identifier (chicago)
created_at - Extraction timestamp
processed_at - Processing completion timestamp
claude_response - AI-generated content storage
error_log - Error details if processing fails
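The column list above corresponds to a DDL statement along these lines. SQLite types are used here for the sketch; the production schema and engine may differ.

```python
import sqlite3

# DDL sketch for the staging table described above
CHICAGO_TEMP_DDL = """
CREATE TABLE IF NOT EXISTS chicago_temp (
    id              INTEGER PRIMARY KEY AUTOINCREMENT,
    inspection_id   TEXT UNIQUE,
    facility_name   TEXT,
    aka_name        TEXT,
    license_number  TEXT,
    facility_type   TEXT,
    risk_level      TEXT,
    address         TEXT,
    city            TEXT,
    state           TEXT,
    zip_code        TEXT,
    inspection_date TEXT,
    inspection_type TEXT,
    results         TEXT,
    violations      TEXT,
    latitude        REAL,
    longitude       REAL,
    raw_data        TEXT,
    status          TEXT DEFAULT 'pending',
    source_city     TEXT DEFAULT 'chicago',
    created_at      TEXT DEFAULT CURRENT_TIMESTAMP,
    processed_at    TEXT,
    claude_response TEXT,
    error_log       TEXT
)
"""

def create_staging(conn):
    """Create the chicago_temp staging table if it does not exist."""
    conn.execute(CHICAGO_TEMP_DDL)
```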
City-Specific Considerations: Each city's extraction module will need custom logic to handle its unique data formats, violation codes, and scoring systems.