CleanKitchens Pipeline Modules Documentation

Pipeline Status Flow: pending → processing → processed → ready → completed
Error States: error, duplicate, failed, rollback
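The status flow above can be sketched as a small Python enum with a helper that walks the happy path. This is a minimal sketch, assuming error states have no automatic successor; the names come straight from the flow above, the transition helper is illustrative.

```python
from enum import Enum
from typing import Optional

class Status(str, Enum):
    # Happy-path states, in pipeline order
    PENDING = "pending"
    PROCESSING = "processing"
    PROCESSED = "processed"
    READY = "ready"
    COMPLETED = "completed"
    # Error states (terminal unless manually re-queued)
    ERROR = "error"
    DUPLICATE = "duplicate"
    FAILED = "failed"
    ROLLBACK = "rollback"

# Forward transitions along the happy path
FLOW = [Status.PENDING, Status.PROCESSING, Status.PROCESSED,
        Status.READY, Status.COMPLETED]

def next_status(current: Status) -> Optional[Status]:
    """Return the next happy-path status, or None for terminal/error states."""
    if current not in FLOW:
        return None
    idx = FLOW.index(current)
    return FLOW[idx + 1] if idx + 1 < len(FLOW) else None
```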

Module 1: Data Extraction (City-Specific)

Purpose

Extract health inspection data from various sources (CSV, API, web scraping) and normalize it into a standardized format for processing. Each city requires its own extraction module due to different data formats and sources.

General Approach

Core Responsibilities:

Implementation Progress - Chicago

✅ Completed Steps:

Field Extraction Strategy

Principle: Any field that can be directly extracted from raw data should be populated at this extraction stage, minimizing the work needed in later processing stages.

Standardized Metadata Approach

Two-Phase Data Enrichment Strategy:

We separate factual extraction from intelligent inference, letting Claude AI handle all complex analysis to ensure maximum accuracy and context awareness.

Important Update: We no longer use hardcoded inference rules. All intelligent metadata is generated by Claude during processing, ensuring comprehensive and contextually accurate analysis.

Metadata Categories & Purpose

1. ESTABLISHMENT PROFILING
   Fields: cuisine_type, service_style, price_range, is_chain, chain_name
   Purpose: Creates establishment "fingerprints" for pattern analysis
   Example Pattern: "Fast food chains have 3x more temperature violations than casual dining"
   AI Benefit: Enables predictive risk scoring based on establishment type

2. GEOGRAPHIC CONTEXTUALIZATION
   Fields: neighborhood, neighborhood_type, nearby_landmarks, nearby_transit, nearby_universities
   Purpose: Maps spatial relationships and demographic patterns
   Example Pattern: "Restaurants near universities show seasonal violation spikes"
   AI Benefit: Location-based risk prediction and targeted inspections

3. VIOLATION INTELLIGENCE
   Fields: violation_category, violation_severity, critical_violations, is_closure
   Purpose: Structured violation taxonomy for trend analysis
   Example Pattern: "Pest violations increase 40% in summer months"
   AI Benefit: Seasonal forecasting and resource allocation

4. RISK INDICATORS
   Fields: risk_factors, typical_violations, food_safety_concerns, remediation_required
   Purpose: Pre-computed risk signals for real-time scoring
   Example Pattern: "Sushi restaurants + summer = elevated temperature risk"
   AI Benefit: Proactive alerts before violations occur
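The four categories above map naturally onto a record schema. A minimal sketch, assuming every inference field is optional (empty after extraction, filled by Claude during processing); the field names are taken from the categories above, the class name is illustrative.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class EstablishmentMetadata:
    """Inference fields grouped by the four metadata categories.
    All fields default to empty: Phase 1 extraction leaves them blank,
    Phase 2 (Claude) fills them in."""
    # 1. Establishment profiling
    cuisine_type: Optional[str] = None
    service_style: Optional[str] = None
    price_range: Optional[str] = None
    is_chain: Optional[bool] = None
    chain_name: Optional[str] = None
    # 2. Geographic contextualization
    neighborhood: Optional[str] = None
    neighborhood_type: Optional[str] = None
    nearby_landmarks: List[str] = field(default_factory=list)
    nearby_transit: List[str] = field(default_factory=list)
    nearby_universities: List[str] = field(default_factory=list)
    # 3. Violation intelligence
    violation_category: Optional[str] = None
    violation_severity: Optional[int] = None
    critical_violations: List[str] = field(default_factory=list)
    is_closure: Optional[bool] = None
    # 4. Risk indicators
    risk_factors: List[str] = field(default_factory=list)
    typical_violations: List[str] = field(default_factory=list)
    food_safety_concerns: List[str] = field(default_factory=list)
    remediation_required: Optional[str] = None
```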

Vectorization Benefits

How Metadata Improves AI Performance:

Pattern Recognition Examples

Patterns Enabled by Rich Metadata:

Two-Phase Implementation

Phase 1: Extraction (Facts Only)
1. Read raw CSV/API data
2. Apply direct field mappings only
3. Store factual data as-is:
   - facility_name: "Luigi's Pizza"
   - address: "123 N Michigan Ave"
   - violations: [raw text]
4. Leave inference fields empty
5. Mark status: "pending"
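The Phase 1 steps reduce to a direct column rename plus a status flag. A minimal sketch, using a small subset of the Chicago field mappings; the helper name is hypothetical.

```python
import csv
import io

# Subset of the Chicago direct field mappings (CSV column -> database column)
DIRECT_FIELDS = {
    "DBA Name": "facility_name",
    "Address": "address",
    "Violations": "violations",
}

def extract_row(raw: dict) -> dict:
    """Phase 1: copy factual fields as-is, leave inference fields empty,
    and mark the record pending for Claude processing."""
    record = {db: raw.get(col, "") for col, db in DIRECT_FIELDS.items()}
    record["status"] = "pending"
    return record

# Example: one row from an in-memory CSV
sample = io.StringIO(
    "DBA Name,Address,Violations\nLuigi's Pizza,123 N Michigan Ave,raw text\n"
)
row = next(csv.DictReader(sample))
```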
Phase 2: Claude Processing (Intelligence)
Claude analyzes each record and provides:

ESTABLISHMENT ANALYSIS:
- cuisine_type: "italian" (from name/context)
- service_style: "casual_dining" (from full context)
- is_chain: true/false (from knowledge base)
- price_range: "$-$$$$" (from knowledge)

LOCATION INTELLIGENCE:
- neighborhood: "Magnificent Mile" (knows Chicago)
- nearby_landmarks: ["Water Tower", "Navy Pier"]
- nearby_transit: ["Red Line Grand Station"]
- district_characteristics: "High-end shopping district"

VIOLATION ANALYSIS:
- violation_category: "temperature_control"
- severity_score: 1-10 (contextual assessment)
- critical_violations: [extracted list]
- risk_factors: [identified concerns]

HISTORICAL CONTEXT:
- chain_history: "Previous E. coli outbreak 2018"
- remediation_required: "Retrain staff on HACCP"
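Phase 2 can be split into a live API call and a pure merge step, so the merge logic is testable without credentials. A minimal sketch: the prompt wording and model name are illustrative, and the live call (commented out) assumes the Anthropic Python SDK with an ANTHROPIC_API_KEY in the environment.

```python
import json

PROMPT = """Analyze this health-inspection record and return a JSON object with
keys such as cuisine_type, service_style, is_chain, neighborhood,
violation_category, severity_score, risk_factors.

Record:
{record}"""

def merge_inference(record: dict, claude_reply: str) -> dict:
    """Merge the JSON object Claude returns into the factual record
    and advance the status. Returns a new dict; the input is not mutated."""
    inferred = json.loads(claude_reply)
    return {**record, **inferred, "status": "processed"}

# Live call sketch (requires `pip install anthropic` and an API key):
# import anthropic
# client = anthropic.Anthropic()
# msg = client.messages.create(
#     model="claude-sonnet-4-20250514",  # illustrative model name
#     max_tokens=1024,
#     messages=[{"role": "user",
#                "content": PROMPT.format(record=json.dumps(record))}],
# )
# record = merge_inference(record, msg.content[0].text)
```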
Advantages of Claude-Based Inference:

Chicago Direct Field Mappings

CSV Field → Database Field
========================
Inspection ID → inspection_id
DBA Name → facility_name
AKA Name → aka_name
License # → license_number
Facility Type → facility_type
Risk → risk_level
Address → address
City → city
State → state
Zip → zip_code
Inspection Date → inspection_date
Inspection Type → inspection_type
Results → results
Violations → violations
Latitude → latitude
Longitude → longitude
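The mapping above translates directly into a lookup table plus a rename helper. A minimal sketch; the function name is hypothetical, and unmapped CSV columns are simply dropped here.

```python
# Chicago CSV column -> staging-table column, taken from the mapping above
CHICAGO_FIELD_MAP = {
    "Inspection ID": "inspection_id",
    "DBA Name": "facility_name",
    "AKA Name": "aka_name",
    "License #": "license_number",
    "Facility Type": "facility_type",
    "Risk": "risk_level",
    "Address": "address",
    "City": "city",
    "State": "state",
    "Zip": "zip_code",
    "Inspection Date": "inspection_date",
    "Inspection Type": "inspection_type",
    "Results": "results",
    "Violations": "violations",
    "Latitude": "latitude",
    "Longitude": "longitude",
}

def map_row(csv_row: dict) -> dict:
    """Rename CSV columns to database columns; unmapped columns are dropped."""
    return {db: csv_row[col] for col, db in CHICAGO_FIELD_MAP.items()
            if col in csv_row}
```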

Data Consolidation Logic (Two-Stage Process)

Stage 1: Bulk Insert

First, ALL rows from the CSV are inserted directly into the chicago_temp table without modification, preserving all original data.
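A minimal Stage 1 sketch using SQLite's executemany for the bulk insert. The real chicago_temp table has many more columns; a three-column subset is shown here for illustration, and the function name is hypothetical.

```python
import sqlite3

def bulk_insert(conn: sqlite3.Connection, rows: list) -> None:
    """Stage 1: insert every CSV row into chicago_temp unmodified."""
    conn.execute("""CREATE TABLE IF NOT EXISTS chicago_temp (
        facility_name   TEXT,
        inspection_date TEXT,
        violations      TEXT)""")
    conn.executemany(
        "INSERT INTO chicago_temp (facility_name, inspection_date, violations) "
        "VALUES (?, ?, ?)",
        [(r["facility_name"], r["inspection_date"], r["violations"])
         for r in rows],
    )
    conn.commit()
```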

Stage 2: Consolidation

After bulk insert, consolidate multiple violation records for the same inspection:

Important Distinction:
Consolidation Process:
1. Query for inspections with multiple violations:
   SELECT facility_name, inspection_date, COUNT(*)
   FROM chicago_temp
   GROUP BY facility_name, inspection_date
   HAVING COUNT(*) > 1
2. For each facility+date with multiple rows:
   - Combine ALL violations into raw_data as:
     violation_1:First violation text
     violation_2:Second violation text
     violation_3:Third violation text (even if same type)
   - Add metadata:
     total_violations:3
     combined_from_rows:3
     consolidation_date:timestamp
   - Update the first record with the combined data
   - DELETE the other records
   - Mark as consolidated: is_combined = true
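The same grouping logic can be sketched in memory, which is easier to test than the SQL form: group rows by (facility_name, inspection_date), keep the first row of each group as the survivor, and fold the rest into numbered violation fields. Field names follow the process above; the function name is hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone

def consolidate(rows: list) -> list:
    """Stage 2: merge multiple violation rows per (facility, date) into one
    record. Single-row inspections pass through unchanged."""
    groups = defaultdict(list)
    for r in rows:
        groups[(r["facility_name"], r["inspection_date"])].append(r)

    out = []
    for grp in groups.values():
        rec = dict(grp[0])  # first row survives; the rest would be DELETEd
        if len(grp) > 1:
            for i, r in enumerate(grp, start=1):
                rec[f"violation_{i}"] = r["violations"]
            rec["total_violations"] = len(grp)
            rec["combined_from_rows"] = len(grp)
            rec["consolidation_date"] = datetime.now(timezone.utc).isoformat()
            rec["is_combined"] = True
        out.append(rec)
    return out
```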

Example Consolidation

BEFORE (3 rows for Joe's Pizza on 2025-08-15):
Row 1: Temperature violation - Cold holding 45°F
Row 2: Temperature violation - Cold holding 48°F
Row 3: Handwashing station - No soap

AFTER (1 consolidated row):
raw_data: "facility:Joe's Pizza, date:2025-08-15, violation_1:Temperature violation - Cold holding 45°F, violation_2:Temperature violation - Cold holding 48°F, violation_3:Handwashing station - No soap, total_violations:3, combined_from_rows:3"
Critical: Every original violation row is preserved as its own violation_N entry, maintaining accurate violation counts and severity data for article generation.

Output Format

Standardized key:value pairs stored in raw_data field:

facility_name:Restaurant Name
inspection_date:2025-08-15
city:Chicago
state:IL
violations:Temperature violation|Handwashing station|Pest activity
total_points:30
result:Failed
inspector:John Smith
address:123 Main St
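Round-tripping the key:value format can be sketched with two small helpers. This assumes one pair per line (the consolidation example also shows a comma-separated variant, so the exact delimiter is an open detail); splitting on the first colon only keeps values like timestamps intact. The function names are hypothetical.

```python
def to_raw_data(record: dict) -> str:
    """Serialize a record to the key:value lines stored in raw_data."""
    return "\n".join(f"{k}:{v}" for k, v in record.items())

def from_raw_data(text: str) -> dict:
    """Parse raw_data back into a dict. Split on the first colon only,
    so values containing ':' (e.g. times) survive the round trip."""
    out = {}
    for line in text.strip().splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            out[key] = value
    return out
```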

Staging Table Structure (chicago_temp)

The temp table mirrors the main database structure with all extractable fields populated:

City-Specific Considerations: Each city's extraction module will need custom logic for handling their unique data format, violation codes, and scoring systems.

Module 2: Article Processing (Universal)

Purpose

Process normalized inspection data through Claude AI to generate news articles, then validate and prepare for publication. This module is universal and works with any city's normalized data.

Processing Workflow

Step 1: Duplicate Detection
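One plausible duplicate check, sketched against SQLite: look up the main Articles table by facility and inspection date. This is an assumption; the document does not specify the duplicate key, but facility_name + inspection_date is the same key the consolidation stage groups by.

```python
import sqlite3

def is_duplicate(conn: sqlite3.Connection,
                 facility_name: str, inspection_date: str) -> bool:
    """True if an article already exists for this facility and date.
    Records flagged here would be marked status='duplicate'."""
    row = conn.execute(
        "SELECT 1 FROM articles "
        "WHERE facility_name = ? AND inspection_date = ? LIMIT 1",
        (facility_name, inspection_date),
    ).fetchone()
    return row is not None
```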

Step 2: Claude AI Processing

Claude Integration:

Step 3: SEO Optimization

Step 4: Image Assignment

Step 5: Validation

Required Field Validation:

Staging Table Updates

After processing, the staging record is updated with:

Success Criteria: Article is ready for final review when all fields are populated, validated, and status='ready'
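The success criteria can be sketched as a required-field check that promotes the record to status='ready' only when nothing is missing. The exact required-field list is an assumption (article_body is a hypothetical field name; slug comes from the URL-slug completion criterion in Module 3).

```python
# Assumed required-field list; the real list lives with the validation step
REQUIRED_FIELDS = ("facility_name", "inspection_date", "city", "state",
                   "article_body", "slug")

def validate(record: dict) -> list:
    """Return the names of missing or empty required fields.
    An empty result means the record is promoted to 'ready'."""
    missing = [f for f in REQUIRED_FIELDS if not record.get(f)]
    if not missing:
        record["status"] = "ready"
    return missing
```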

Module 3: Database Insert & Publishing (Universal)

Purpose

Perform final validation, execute database insertion to Weaviate, and handle post-publication tasks. This module ensures data integrity and successful publication.

Pre-Insert Workflow

Step 1: Final Duplicate Check

Step 2: Data Preparation

Database Operations

Weaviate Insert Process

Article Object Structure:
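A sketch of assembling the Article object for Weaviate, separated from the network call so the mapping is testable. The property names are assumptions (the final schema is not fixed here), and the commented-out insert uses the Weaviate v4 Python client against a local instance.

```python
def build_article_object(record: dict) -> dict:
    """Assemble the properties dict for the Weaviate Article object.
    Property names are illustrative; adjust to the deployed schema."""
    return {
        "title": record["title"],
        "body": record["article_body"],
        "facilityName": record["facility_name"],
        "inspectionDate": record["inspection_date"],
        "city": record["city"],
        "slug": record["slug"],
    }

# Insert sketch (requires `pip install weaviate-client` and a running instance):
# import weaviate
# client = weaviate.connect_to_local()
# client.collections.get("Article").data.insert(build_article_object(record))
# client.close()
```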

Transaction Management

Post-Insert Tasks

Success Actions

Failure Handling

Monitoring & Reporting

Pipeline Metrics:
Completion: Article is live when status='completed' and accessible via its URL slug

Implementation Notes

Staging Table Benefits

Future Enhancements

Remember: Each module will be developed, tested, and refined iteratively. This document will be updated with specific implementation details and code examples as we build and test each component.

Implementation Progress

Chicago Pipeline Status

✅ Completed:
⏳ Next Steps:

Data Flow Summary

1. CSV Download → /data/chicago/chicago_food_inspections.csv
2. Bulk Insert → ALL rows to chicago_temp table
3. Consolidate → Combine multiple violations per inspection
4. Duplicate Check → Compare against main database
5. Process → Send to Claude for article generation
6. Validate → Check required fields
7. Publish → Move to main Articles table