# CleanKitchens Pipeline Update - August 2025
## Two-Phase Article Generation Strategy & Cost Analysis

---

## 🎯 Executive Summary

CleanKitchens has developed a cost-effective two-tier content generation strategy using Claude AI models:
- **Historical Data**: Process with Claude-3-Haiku (ultra low cost)
- **Current Content**: Generate with Claude-3.5-Sonnet (premium quality)
- **Total Investment**: Under $1,000 to process 295k records
- **Ongoing Costs**: ~$33/month for daily updates

---

## 📊 Cost Analysis

### Haiku Pricing (Historical Backfill)
```
Model: claude-3-haiku-20240307
Input: $0.25 per million tokens
Output: $1.25 per million tokens

Per Article Breakdown:
- Phase 1 (Metadata): ~500 input + 200 output tokens
- Phase 2 (Article): ~800 input + 1,000 output tokens
- Total Cost: $0.0018 per article

Full Dataset (295k records):
- If 100% have violations: $531
- Realistic (50% with violations): ~$265
```

### Sonnet Pricing (Premium Current Content)
```
Model: claude-3-5-sonnet-20241022
Input: $3 per million tokens
Output: $15 per million tokens

Per Article: ~$0.022 (2.2 cents)
Daily Processing (50 articles): $1.10
Monthly Cost: ~$33
```

### Cost Comparison Table
| Content Type | Model | Per Article | 10,000 Articles | 100,000 Articles |
|-------------|--------|------------|-----------------|------------------|
| Historical | Haiku | $0.0018 | $18 | $180 |
| Current | Sonnet | $0.022 | $220 | $2,200 |

---

## 🔄 Two-Phase Generation Process

### Phase 1: Metadata Analysis
Extracts structured data from inspection records:
- Cuisine type detection
- Neighborhood identification
- Violation categorization
- Risk scoring (0-100)
- Key concerns extraction

### Phase 2: Article Generation
Creates comprehensive news articles with:
- Local context (landmarks, transit)
- Complete violation details
- Educational content with government links
- SEO-optimized tags

---

## 📝 Article Structure (Implemented)

### Content Sections
1. **Location Context (2-3 paragraphs)**
   - Full address and neighborhood
   - 1-2 local landmarks (CTA stations, schools, attractions)
   - Establishment description

2. **Violations Detail (3-4 paragraphs)**
   - Complete inspection findings
   - Direct quotes from inspector comments
   - Specific temperatures and measurements
   - Citations issued

3. **Educational Context (2-3 paragraphs)**
   - Food safety implications
   - Hyperlinked government resources
   - Health risk explanations
   - Prevention guidelines

### Article Specifications
- **Haiku Output**: 300-500 words (2,000-3,000 chars)
- **Sonnet Output**: 1,500-2,000 words (10,000-12,000 chars)
- **SEO Tags**: Neighborhood, cuisine type, violation types
- **URL Format**: `/YYYY/MM/DD/article-title` (using inspection date)

---

## 🚀 Implementation Strategy

### Phase 1: Historical Backfill (Q3 2025)
```python
# Process all records before 2024
if inspection_date < '2024-01-01':
    model = "claude-3-haiku-20240307"
    max_tokens = 2500
```
- **Goal**: Generate 100,000+ articles quickly
- **Investment**: ~$265
- **Timeline**: 2-3 weeks at 5,000 articles/day
- **Purpose**: Build domain authority with volume

### Phase 2: Premium Current Content (Q4 2025+)
```python
# Process new inspections with premium model
if inspection_date >= '2024-01-01':
    model = "claude-3-5-sonnet-20241022"
    max_tokens = 4000
```
- **Goal**: High-quality current content
- **Investment**: $33/month ongoing
- **Output**: 50 articles/day
- **Purpose**: Compete for current search traffic

---

## 💡 Key Improvements Implemented

### Slug Format Fix
- **Old**: `article-title-202508160657`
- **New**: `2025/08/16/article-title`
- Uses inspection date for chronological organization

### Robust JSON Handling
- Retry logic with fallback
- JSON validation for required fields
- Manual extraction for malformed responses
- 100% success rate achieved

### Content Quality
- Neutral news reporting style
- Local landmark integration
- Complete violation reporting
- Educational value with government links

---

## 📈 ROI Projections

### Traffic Potential
- 100,000 indexed pages within 3 months
- Long-tail keyword coverage for local searches
- Fresh content daily for news cycle

### Monetization Breakeven
- At $0.10 RPM (low estimate)
- Need 330,000 pageviews/month to cover Sonnet costs
- Historical content drives organic growth
- Premium content captures current searches

---

## 🔧 Technical Configuration

### Script Locations
```
/production/scripts/test_two_phase_generation.py (active)
/data/chicago/haiku-article-processor.py (production ready)
/data/chicago/bulk-claude-process-v2.py (backup)
```

### Database Collections
- **ChicagoTemp**: Raw inspection data (295k records, consolidated)
- **Articles**: Generated content (ready for production)

### Key Functions
- `get_recent_failed_inspections()`: Fetches violations
- `phase1_analyze_metadata()`: Extracts structured data
- `phase2_generate_article()`: Creates article content
- `save_article_to_weaviate()`: Stores with correct slug

---

## 📋 Next Steps

1. **Immediate**
   - Run test batch of 1,000 historical records
   - Verify article quality and slug format
   - Monitor Anthropic API costs

2. **Week 1**
   - Process 25,000 historical records
   - Set up daily cron for new inspections
   - Implement Sonnet switch for 2024+ data

3. **Month 1**
   - Complete 100,000 article backfill
   - Launch sitemap generation
   - Begin monetization setup

---

## 🎯 Success Metrics

- **Content Volume**: 100,000+ articles in 30 days
- **Processing Cost**: Under $300 for historical
- **Success Rate**: 100% with retry logic
- **Article Quality**: 300+ words minimum
- **SEO Readiness**: Proper slugs, tags, and structure

---

*Last Updated: August 16, 2025*
*Status: Production Ready*