# CleanKitchens Knowledge Document
## Restaurant Health Inspection News Platform
### Created: August 14, 2025

---

## 🎯 System Overview

CleanKitchens is an automated restaurant health inspection news platform that transforms raw government inspection data into educational, SEO-optimized news articles. The system uses AI for content generation, semantic search for data retrieval, and advanced image optimization for fast web delivery.

### Key Capabilities
- **295,000+ Chicago inspection records** imported and vectorized
- **Sub-second page loads** with perfect PageSpeed Insights scores
- **4 valid structured data blocks** per article page (NewsArticle, BreadcrumbList, Organization, FAQPage)
- **Automated article generation** with educational focus
- **Multi-format image optimization** for all social platforms

---

## 📁 Directory Structure

```
/var/www/twin-digital-media/public_html/_sites/cleankitchens/
├── production/
│   ├── scripts/              # Main Python scripts
│   ├── logs/                 # Import and operation logs
│   └── data/                 # Import statistics
├── templates/                # PHP templates
├── includes/                 # PHP includes
├── assets/
│   └── images/
│       └── violations/
│           └── optimized/    # WebP and social media formats
├── functions*.php            # PHP functions
└── CLAUDE_PROGRESS.txt       # Development log

/home/chris/cleankitchens/
├── scripts/                  # Data import scripts
├── data/                     # Chicago CSV data
└── venv/                     # Python environment
```

---

## 🔧 Core Scripts & Functions

### 1. Article Generation Pipeline

#### `/production/scripts/test_article_generator.py`
**Purpose:** Complete article generation from inspection data
**Features:**
- Connects to Weaviate for data retrieval
- Uses Claude-3.5-Sonnet API for content generation
- Integrates educational references (FDA, CDC, USDA)
- Adds Chicago neighborhood context
- Selects appropriate violation images
- Generates structured data and meta tags

**Key Functions:**
- `get_inspection_data()` - Retrieves records from Weaviate
- `generate_article()` - Sends to Claude API with educational prompt
- `save_article_to_weaviate()` - Stores complete article
- `add_local_context()` - Adds neighborhood and transit info
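
As a rough illustration of how the four functions above could be chained, the sketch below wires them together as injected callables. The signatures, the record fields, and the order of the context and generation steps are assumptions; the real functions talk to Weaviate and the Claude API, so stand-ins are passed in here.

```python
from typing import Callable


def run_pipeline(
    get_inspection_data: Callable[[int], list],
    add_local_context: Callable[[dict], dict],
    generate_article: Callable[[dict], dict],
    save_article_to_weaviate: Callable[[dict], str],
    limit: int = 3,
) -> list:
    """Chain the four stages and return whatever the save step returns per article.

    Assumed order: fetch -> add context -> generate -> save.
    """
    saved = []
    for record in get_inspection_data(limit):
        enriched = add_local_context(record)      # neighborhood / transit info
        article = generate_article(enriched)      # LLM call in the real script
        saved.append(save_article_to_weaviate(article))
    return saved
```

Passing the stages in as arguments keeps the sketch runnable without network access and makes it easy to test the flow with 2-3 records before a real run.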

---

### 2. Image Processing System

#### `/production/scripts/image_optimizer.py`
**Purpose:** Multi-format image optimization
**Features:**
- WebP conversion for performance (quality setting 85)
- Social media sizes: OpenGraph (1200x630), Twitter (1200x675), Instagram (1080x1080)
- Smart cropping with PIL ImageOps.fit
- .htaccess rules generation for WebP negotiation
- Batch processing with progress tracking

**Output Formats:**
- `*_og.jpg` - OpenGraph (Facebook)
- `*_twitter.jpg` - Twitter Card
- `*_instagram.jpg` - Instagram square
- `*_thumb.webp` - Thumbnails (400x300)
- `*.webp` - Main article images
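
The sizes and file suffixes above can be captured in a small table. The sketch below is not the optimizer itself: it only shows the geometry of a centered crop to a target aspect ratio (the kind of region a centered `ImageOps.fit` call keeps before resizing) and the derivative file names. The helper names are hypothetical, and Pillow's exact rounding may differ.

```python
# Target sizes from the feature list; suffixes from the output formats.
SOCIAL_FORMATS = {
    "og": (1200, 630),
    "twitter": (1200, 675),
    "instagram": (1080, 1080),
}


def center_crop_box(src_w, src_h, dst_w, dst_h):
    """Return (left, top, right, bottom) of the largest centered region
    of the source that has the destination aspect ratio."""
    src_ratio = src_w / src_h
    dst_ratio = dst_w / dst_h
    if src_ratio > dst_ratio:
        # Source is too wide: keep full height, trim the sides.
        crop_w = round(src_h * dst_ratio)
        left = (src_w - crop_w) // 2
        return (left, 0, left + crop_w, src_h)
    # Source is too tall (or already matches): keep full width, trim top/bottom.
    crop_h = round(src_w / dst_ratio)
    top = (src_h - crop_h) // 2
    return (0, top, src_w, top + crop_h)


def output_names(stem):
    """Map an image stem to the derivative file names listed above."""
    names = {key: f"{stem}_{key}.jpg" for key in SOCIAL_FORMATS}
    names["thumb"] = f"{stem}_thumb.webp"
    names["main"] = f"{stem}.webp"
    return names
```

For a 1600x1200 source and the Instagram square, `center_crop_box(1600, 1200, 1080, 1080)` keeps the middle 1200x1200 region.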

#### `/production/scripts/image_selector.py`
**Purpose:** Intelligent image matching
**Algorithm:**
```
1. Check for closure keywords → closure images
2. Check for violation type keywords → specific violation images
3. Check for restaurant chain keywords → chain images
4. Fall back to general violation images
5. Use least-recently-used rotation
```
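
The steps above can be sketched in Python as follows. This is a minimal illustration, not the contents of `image_selector.py`: the keyword lists, category keys, and file names are hypothetical, and only a few categories are shown. Rules are checked in priority order, and each category keeps its images in a queue so the least recently used one is served next.

```python
from collections import deque

# Hypothetical keyword rules, checked in priority order:
# closure first, then violation types, then restaurant chains.
RULES = [
    ("closure", ["closed", "closure", "license suspended"]),
    ("rodent", ["rodent", "mice", "rat droppings"]),
    ("temperature", ["temperature", "cold holding", "hot holding"]),
    ("chain", ["mcdonald", "subway", "starbucks"]),
]


class ImageSelector:
    def __init__(self, images_by_category):
        # Each deque is ordered least-recently-used first.
        # A "general" pool must be present as the fallback.
        self._pools = {cat: deque(imgs) for cat, imgs in images_by_category.items()}

    def categorize(self, text):
        lowered = text.lower()
        for category, keywords in RULES:
            if category in self._pools and any(k in lowered for k in keywords):
                return category
        return "general"

    def select(self, text):
        pool = self._pools[self.categorize(text)]
        image = pool.popleft()   # take the least recently used image
        pool.append(image)       # it is now the most recently used
        return image
```

Plain substring matching is fragile: a bare keyword such as `rat` would also match "temperature", which is why the sketch uses longer phrases.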

**Violation Categories:**
- Rodent (4 images)
- Temperature (4 images)
- Cleanliness (4 images)
- Cross-contamination (4 images)
- Mold (4 images)
- Sewage (4 images)
- Structural (4 images)

---

### 3. Data Import & Vectorization

#### `/home/chris/cleankitchens/scripts/clean_and_import.py`
**Purpose:** Production data import with deduplication
**Process:**
1. Clean database (remove all collections)
2. Upload to temp collection without vectors
3. Check for duplicates by inspection_id
4. Add vectors after verification
5. Move to production collection
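
Step 3 can be illustrated with a small helper that separates first occurrences from repeats. This is a sketch under the assumption that each record is a dict carrying an `inspection_id` key; the function name is hypothetical.

```python
def split_duplicates(records, key="inspection_id"):
    """Return (unique, duplicates), keeping the first record seen per key."""
    seen = set()
    unique, duplicates = [], []
    for record in records:
        record_id = record[key]
        if record_id in seen:
            duplicates.append(record)
        else:
            seen.add(record_id)
            unique.append(record)
    return unique, duplicates
```

Returning the duplicates instead of silently dropping them makes it straightforward to write the count to the import log.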

**Statistics:**
- 295,254 total records
- ~450 records/second import speed
- Duplicate prevention active
- Progress tracking with ETA
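
At roughly 450 records/second, the full 295,254-record import takes about 656 seconds, or just under 11 minutes. A progress ETA of the kind mentioned above can be estimated from the average rate so far; the helper below is a hypothetical sketch, not the script's own code.

```python
def eta_seconds(done, total, elapsed_seconds):
    """Estimate the remaining seconds from the average rate so far.

    Returns None until there is enough information to compute a rate.
    """
    if done <= 0 or elapsed_seconds <= 0:
        return None
    rate = done / elapsed_seconds          # records per second
    return (total - done) / rate
```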

#### `/production/scripts/vectorize_all.py`
**Purpose:** Free local vectorization
**Features:**
- Sentence Transformers model (all-MiniLM-L6-v2)
- 384-dimension vectors
- Batch processing (100 records)
- ~30-45 minutes for full dataset
- Zero API costs
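
The batching pattern can be sketched independently of the model. In the example below, `encode` is any callable that maps a list of strings to a list of vectors; with the sentence-transformers package, `SentenceTransformer("all-MiniLM-L6-v2").encode` would be one such callable. The `text` field name and both helper names are assumptions.

```python
from itertools import islice


def batched(iterable, size=100):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def vectorize(records, encode, batch_size=100, text_key="text"):
    """Yield (record, vector) pairs, encoding one batch at a time."""
    for chunk in batched(records, batch_size):
        vectors = encode([r[text_key] for r in chunk])
        yield from zip(chunk, vectors)
```

Encoding in batches keeps memory bounded and lets progress be reported after every 100 records.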

---

### 4. PHP Functions

#### `functions_live.php`
**Key Functions:**

```php
getOptimizedImageUrl($imageUrl, $context)
// Returns optimized image URL based on context
// Contexts: 'page', 'og', 'twitter', 'instagram', 'thumb'

queryWeaviate($query)
// GraphQL query to Weaviate database
// Returns JSON decoded response

getFeaturedArticleFromDB()
// Gets most recent article for homepage feature

getArticleBySlugFromDB($slug)
// Retrieves full article by URL slug
```
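
To make the context mapping concrete, here is a Python sketch of the kind of lookup `getOptimizedImageUrl()` might perform. It is an illustration only, not a port of the PHP: it assumes derivatives live in an `optimized/` subdirectory beside the original and use the suffixes from the Image Processing section, and that an unknown context falls back to the original URL.

```python
import posixpath

# Context -> file suffix, following the optimizer's output formats.
CONTEXT_SUFFIXES = {
    "page": ".webp",
    "og": "_og.jpg",
    "twitter": "_twitter.jpg",
    "instagram": "_instagram.jpg",
    "thumb": "_thumb.webp",
}


def optimized_image_url(image_url, context="page"):
    """Return the URL of the derivative for `context` (assumed layout)."""
    suffix = CONTEXT_SUFFIXES.get(context)
    if suffix is None:
        return image_url  # unknown context: fall back to the original
    directory, filename = posixpath.split(image_url)
    stem, _ext = posixpath.splitext(filename)
    return posixpath.join(directory, "optimized", stem + suffix)
```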

#### `functions_with_images.php`
**Purpose:** Enhanced functions with image support
- Connects articles to appropriate violation images
- Handles image fallbacks
- Manages alt text and captions

---

## 🚀 Performance Optimizations

### Image Delivery
1. **WebP with fallback:** `<picture>` elements with srcset
2. **Lazy loading:** Images load on scroll
3. **Preload critical images:** Featured article image
4. **CDN-ready structure:** Optimized directory layout
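
The WebP-with-fallback markup can be generated with a small helper like the one below. It is a sketch: the function name is hypothetical, and using the native `loading="lazy"` attribute is an assumption about how lazy loading is implemented on the site.

```python
from html import escape


def picture_tag(webp_url, fallback_url, alt, lazy=True):
    """Build a <picture> element that offers WebP first, then a fallback."""
    loading = ' loading="lazy"' if lazy else ""
    return (
        "<picture>"
        f'<source srcset="{escape(webp_url)}" type="image/webp">'
        f'<img src="{escape(fallback_url)}" alt="{escape(alt)}"{loading}>'
        "</picture>"
    )
```

A preloaded, above-the-fold image such as the featured article image would typically be rendered with `lazy=False`.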

### Page Speed
1. **Inline critical CSS:** No render-blocking stylesheets
2. **Minimal JavaScript:** Only essential interactions
3. **Efficient HTML:** Semantic markup, minimal DOM
4. **Smart caching:** Browser and server-side

### Database
1. **Vector indexing:** Semantic search in milliseconds
2. **Batch operations:** 100-record chunks
3. **Connection pooling:** Reused Weaviate connections
4. **Query optimization:** GraphQL with specific field selection
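
Specific field selection means the GraphQL query names only the properties the page needs. The sketch below builds the JSON body for a POST to Weaviate's `/v1/graphql` endpoint without sending it. The property names (`title`, `slug`, `content`) are assumptions, and the filter keyword may need to be `valueString` instead of `valueText` depending on the Weaviate version and property type.

```python
import json


def article_by_slug_payload(slug, fields=("title", "slug", "content")):
    """Build the request body for fetching one article by slug.

    Only the listed fields are requested, which keeps responses small.
    """
    field_list = " ".join(fields)
    query = (
        "{ Get { Articles("
        # json.dumps gives a quoted, escaped string literal for the slug.
        f'where: {{path: ["slug"], operator: Equal, valueText: {json.dumps(slug)}}}, '
        "limit: 1"
        f") {{ {field_list} }} }} }}"
    )
    return json.dumps({"query": query})
```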

---

## 📊 Structured Data Implementation

### Article Pages Include:
1. **NewsArticle Schema**
   - Headline, author, publisher
   - Date published/modified
   - Article body with sections
   - Image with captions

2. **BreadcrumbList Schema**
   - Home → Category → Article
   - Proper navigation hierarchy

3. **Organization Schema**
   - Publisher information
   - Logo and contact details

4. **FAQPage Schema**
   - Educational Q&A pairs
   - Food safety information
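
As an illustration of the first schema, the sketch below builds a NewsArticle JSON-LD block. The property names come from schema.org; the function name, its arguments, and the choice to emit the script tag from Python are assumptions for the example.

```python
import json


def news_article_jsonld(headline, url, image_url, published, modified, publisher_name):
    """Return a <script type="application/ld+json"> tag for a NewsArticle."""
    data = {
        "@context": "https://schema.org",
        "@type": "NewsArticle",
        "headline": headline,
        "mainEntityOfPage": url,
        "image": [image_url],
        "datePublished": published,
        "dateModified": modified,
        "publisher": {"@type": "Organization", "name": publisher_name},
    }
    # Escape "</" so article text can never close the script tag early.
    body = json.dumps(data).replace("</", "<\\/")
    return '<script type="application/ld+json">' + body + "</script>"
```

The other three schemas (BreadcrumbList, Organization, FAQPage) follow the same pattern with a different `@type` and property set.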

---

## 🔐 API Keys & Configuration

### Environment Variables (stored in ~/.env)
- `ANTHROPIC_API_KEY` - Claude API for article generation
- `OPENAI_API_KEY` - DALL-E for image generation (optional)
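
A script can read these keys with a few lines of standard-library code. The sketch below assumes the file holds plain `KEY=VALUE` lines; both helper names are hypothetical, and a package such as python-dotenv could be used instead.

```python
import os
from pathlib import Path


def load_env_file(path):
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def require_key(name, env_values):
    """Prefer the process environment, then the parsed file; fail loudly if absent."""
    value = os.environ.get(name) or env_values.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set")
    return value
```

Failing at startup when `ANTHROPIC_API_KEY` is missing is usually preferable to discovering it midway through a batch.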

### Weaviate Configuration
- **Host:** localhost:8080
- **Collections:** RawInspection, Articles
- **Vectorizer:** Sentence Transformers (local)

---

## 📈 Cost Analysis

### Per Article Costs
- **Article Generation:** ~$0.0016 (Claude Haiku)
- **Vectorization:** $0 (local Sentence Transformers)
- **Image Selection:** $0 (pre-generated stock)
- **Total:** ~$0.0016 per article

### Full Dataset (295k records)
- **If all generated:** ~$472
- **Actual strategy:** Generate on-demand for high-impact violations
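
The full-dataset figure follows directly from the per-article estimate, as this short calculation shows (constants taken from the numbers in this document):

```python
COST_PER_ARTICLE = 0.0016   # USD, per-article estimate from above
TOTAL_RECORDS = 295_254     # imported Chicago inspection records


def generation_cost(article_count, cost_per_article=COST_PER_ARTICLE):
    """Estimated generation cost in USD for a number of articles."""
    return article_count * cost_per_article


print(f"${generation_cost(TOTAL_RECORDS):,.2f}")  # $472.41
```

Generating articles only for a subset, for example 5,000 high-impact violations, would cost about $8 at the same rate.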

---

## 🛠️ Maintenance Commands

### Check Weaviate Status
```bash
docker ps | grep weaviate
curl http://localhost:8080/v1/.well-known/ready
```

### Monitor Import Progress
```bash
tail -f /var/www/twin-digital-media/public_html/_sites/cleankitchens/production/logs/clean_import.log
```

### Test Article Generation
```bash
cd /var/www/twin-digital-media/public_html/_sites/cleankitchens/production/scripts
python3 test_article_generator.py
```

### Optimize New Images
```bash
python3 image_optimizer.py /path/to/images
```

---

## 🔄 Workflow

### Adding New Content
1. Import inspection data → `clean_and_import.py`
2. Vectorize for search → `vectorize_all.py`
3. Generate articles → `test_article_generator.py`
4. Select images → `image_selector.py`
5. Optimize images → `image_optimizer.py`
6. Publish to site → Automatic; the PHP front end reads articles directly from Weaviate

### Testing Changes
1. Use test scripts with 2-3 records first
2. Check logs for errors
3. Verify on staging URL
4. Deploy to production

---

## 📝 Recent Updates (Aug 14, 2025)

### Completed Today
✅ Set up Weaviate Docker container
✅ Imported 295k Chicago records
✅ Implemented free vectorization
✅ Created article generation pipeline
✅ Built image optimization system
✅ Fixed PHP template errors
✅ Achieved perfect PageSpeed scores

### Next Steps
- [ ] Implement cron job for daily imports
- [ ] Add more city data sources
- [ ] Create admin dashboard
- [ ] Set up monitoring alerts

---

## 🚨 Important Notes

1. **Always test with small batches first** (2-3 records)
2. **Monitor API costs** via Claude/OpenAI dashboards
3. **Check duplicate prevention** in import logs
4. **Verify image optimization** before bulk processing
5. **Keep vectors updated** when adding new records

---

## 📞 Support

For issues or questions about this system:
1. Check logs in `/production/logs/`
2. Review CLAUDE_PROGRESS.txt for development history
3. Test individual components with test scripts
4. Verify Weaviate is running: `docker ps`

---

*This document describes the complete CleanKitchens system as of August 14, 2025.*