🎯 System Overview
CleanKitchens is an automated restaurant health inspection news platform that transforms raw government inspection data into educational, SEO-optimized news articles. The system uses AI for content generation, semantic search for data retrieval, and advanced image optimization for fast web delivery.
✨ Key Features
- AI Content Generation: Claude-3.5-Sonnet powered article creation with educational focus
- Semantic Search: 384-dimension vectors for intelligent content retrieval
- Image Optimization: WebP conversion with social media formats
- SEO Excellence: 4 structured data types per article page
- Cost Efficiency: Free local vectorization saves $1,500+
- Educational Focus: FDA, CDC, USDA reference integration
📁 Directory Structure
/var/www/twin-digital-media/public_html/_sites/cleankitchens/
├── production/
│   ├── scripts/          # Main Python scripts
│   ├── logs/             # Import and operation logs
│   └── data/             # Import statistics
├── templates/            # PHP templates
├── includes/             # PHP includes
├── assets/
│   └── images/
│       └── violations/
│           └── optimized/    # WebP and social formats
├── functions*.php        # PHP functions
└── CLAUDE_PROGRESS.txt   # Development log
/home/chris/cleankitchens/
├── scripts/              # Data import scripts
├── data/                 # Chicago CSV data (326MB)
└── venv/                 # Python environment
🔧 Core Scripts Created on August 14, 2025
Article Generation Pipeline
| Script | Location | Purpose | Status |
|---|---|---|---|
| test_article_generator.py | /production/scripts/ | Complete article generation from inspection data | Active |
| image_optimizer.py | /production/scripts/ | Multi-format image optimization | Active |
| image_selector.py | /production/scripts/ | Intelligent violation-to-image matching | Active |
| clean_import.py | /production/scripts/ | Production data import with deduplication | Active |
| vectorize_all.py | /production/scripts/ | Free local vectorization | Active |
| clean_and_import.py | ~/cleankitchens/scripts/ | Main Chicago data importer | Active |
📝 Article Generator Details
/production/scripts/test_article_generator.py
Key Features:
- ✅ Connects to Weaviate for data retrieval
- ✅ Uses Claude-3.5-Sonnet API for content generation
- ✅ Integrates educational references (FDA, CDC, USDA)
- ✅ Adds Chicago neighborhood context and transit info
- ✅ Selects appropriate violation images
- ✅ Generates structured data and meta tags
Main Functions:
get_inspection_data(limit=2) # Retrieves records from Weaviate
generate_article(inspection) # Sends to Claude API with educational prompt
save_article_to_weaviate() # Stores complete article
add_local_context() # Adds neighborhood and transit info
select_appropriate_image() # Matches violation to image
🖼️ Image Processing System
Image Optimizer
/production/scripts/image_optimizer.py
Output Formats: WebP (85% quality) + Social Media Optimized JPEGs
| Format | Dimensions | Use Case | File Extension |
|---|---|---|---|
| OpenGraph | 1200x630 | Facebook sharing | *_og.jpg |
| Twitter Card | 1200x675 | Twitter/X sharing | *_twitter.jpg |
| Instagram | 1080x1080 | Instagram square | *_instagram.jpg |
| Thumbnail | 400x300 | Article cards | *_thumb.webp |
| Main Article | Original ratio | Article pages | *.webp |
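To make the format table concrete, here is a minimal Python sketch of how output filenames and crop regions could be derived from it. The names `FORMATS`, `output_name`, and `center_crop_box` are hypothetical, and the centered-crop strategy is an assumption; the source does not say how image_optimizer.py fits images to each aspect ratio.

```python
from pathlib import PurePosixPath

# Target sizes and filename suffixes taken from the format table above.
FORMATS = {
    "og": (1200, 630, "_og.jpg"),
    "twitter": (1200, 675, "_twitter.jpg"),
    "instagram": (1080, 1080, "_instagram.jpg"),
    "thumb": (400, 300, "_thumb.webp"),
}


def output_name(source: str, fmt: str) -> str:
    """Derive the output filename; 'page' keeps the plain .webp name."""
    stem = PurePosixPath(source).stem
    if fmt == "page":
        return f"{stem}.webp"
    return f"{stem}{FORMATS[fmt][2]}"


def center_crop_box(width: int, height: int, fmt: str) -> tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of the largest centered region
    with the target aspect ratio (assumed strategy, not confirmed by the source)."""
    target_w, target_h, _ = FORMATS[fmt]
    # Compare aspect ratios by cross-multiplication to stay in integers.
    if width * target_h > height * target_w:
        # Source is wider than the target: trim the sides.
        new_w = height * target_w // target_h
        left = (width - new_w) // 2
        return (left, 0, left + new_w, height)
    # Source is taller than (or equal to) the target: trim top and bottom.
    new_h = width * target_h // target_w
    top = (height - new_h) // 2
    return (0, top, width, top + new_h)
```

With Pillow, a box like this could be passed to `Image.crop()` and the result resized to the target dimensions before saving.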
Image Selection Algorithm
/production/scripts/image_selector.py
1. Check for closure keywords → closure images
2. Check for violation type keywords → specific violation images
3. Check for restaurant chain keywords → chain images
4. Fall back to general violation images
5. Use least-recently-used rotation
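The five steps above can be sketched as a priority-ordered keyword match followed by a rotating queue per category. This is an illustration only: the keyword lists, image filenames, and the `select_image` name are hypothetical, and the real rules live in image_selector.py.

```python
from collections import deque

# Rules are checked in priority order: closure, violation types, chains.
# Keywords and categories here are placeholders, not the production lists.
RULES = [
    ("closure", ["closed", "closure", "license suspended"]),
    ("rodent", ["rodent", "mice", "droppings"]),
    ("temperature", ["temperature", "cold holding", "thaw"]),
    ("chain", ["mcdonald", "subway", "starbucks"]),
]

# One queue per category; the front of each queue is the least recently used image.
POOLS = {
    "closure": deque(["closure_1.webp", "closure_2.webp"]),
    "rodent": deque(["rodent_1.webp", "rodent_2.webp"]),
    "temperature": deque(["temperature_1.webp"]),
    "chain": deque(["chain_1.webp"]),
    "general": deque(["general_1.webp", "general_2.webp"]),
}


def select_image(text: str) -> str:
    """Pick an image for the given inspection text."""
    lowered = text.lower()
    category = "general"  # step 4: fallback when no rule matches
    for name, keywords in RULES:
        if any(keyword in lowered for keyword in keywords):
            category = name  # first matching rule wins
            break
    pool = POOLS[category]
    image = pool.popleft()  # step 5: take the least recently used image
    pool.append(image)      # and move it to the back of the rotation
    return image
```

Keeping the rotation state in memory is the simplest option; a production script would presumably persist it between runs so the rotation survives restarts.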
🗄️ PHP Functions
functions_live.php
getOptimizedImageUrl($imageUrl, $context)
// Returns optimized image URL based on context
// Contexts: 'page', 'og', 'twitter', 'instagram', 'thumb'
queryWeaviate($query)
// GraphQL query to Weaviate database
// Returns JSON decoded response
getFeaturedArticleFromDB()
// Gets most recent article for homepage feature
getArticleBySlugFromDB($slug)
// Retrieves full article by URL slug
getRecentClosuresFromDB()
// Returns recent closure articles
getWeeklyArticlesFromDB()
// Returns weekly summary articles
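For orientation, this is roughly the kind of GraphQL request a lookup such as getArticleBySlugFromDB would send. The sketch uses Python, like the pipeline scripts, rather than PHP, and only shows the request shape. The `title` and `slug` property names are assumptions about the Articles collection, and `valueText` assumes a recent Weaviate version (older releases used `valueString`).

```python
import json


def build_article_by_slug_query(slug: str) -> str:
    """Build a Weaviate GraphQL query that fetches one article by slug."""
    # json.dumps produces a double-quoted, escaped literal that is also
    # a valid GraphQL string, so the slug cannot break out of the query.
    slug_literal = json.dumps(slug)
    return (
        "{ Get { Articles("
        f'where: {{path: ["slug"], operator: Equal, valueText: {slug_literal}}}, '
        "limit: 1) { title slug } } }"
    )


def build_request_body(slug: str) -> bytes:
    """Wrap the query in the JSON envelope expected by POST /v1/graphql."""
    return json.dumps({"query": build_article_by_slug_query(slug)}).encode("utf-8")
```

The resulting bytes would be posted to http://localhost:8080/v1/graphql with a `Content-Type: application/json` header.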
⚠️ Food Inspection Violation Code System
Comprehensive violation code lookup system based on FDA Food Code 2017/2022 and Chicago Department of Public Health standards.
🎯 System Overview
- File: violation_codes_lookup.py
- Purpose: Translate numerical violation codes into human-readable explanations
- Integration: Automatic in article generation and auto-tagging
- Coverage: Temperature, equipment, sanitation, plumbing, facilities violations
📋 Violation Code Database
| Code | Title | Category | Severity | Health Risk |
|---|---|---|---|---|
| 18 | Proper cooking time and temperature | Temperature | Priority | High |
| 35 | Approved thawing methods used | Temperature | Priority Foundation | Medium |
| 44 | Utensils, equipment properly stored | Equipment | Core | Low |
| 47 | Food contact surfaces cleanable | Equipment | Priority Foundation | Medium |
| 48 | Warewashing facilities maintained | Sanitization | Priority | High |
| 51 | Plumbing installed; proper backflow devices | Plumbing | Priority Foundation | Medium |
| 52 | Refuse, recyclables properly removed | Waste Management | Core | Low |
| 53 | Physical facilities maintained & clean | Facilities | Core | Low |
🏷️ Severity Classifications
Priority: 🚨 Critical - Immediate health hazard that must be corrected immediately
Priority Foundation: ⚠️ Serious - Potential health hazard that must be corrected within a set timeframe
Core: 📋 Minor - Does not pose immediate threat to public health
🔧 Usage in Articles
# Article generation automatically includes violation explanations
violation_explanations = violation_lookup.explain_violations_for_article(violations_text)
# Auto-tagging includes violation-specific tags
violation_tags = violation_lookup.get_violation_tags(violations_text)
# Example output:
🚨 **Proper cooking time and temperature** (Code 18): Food must be cooked to safe minimum internal temperatures to kill harmful bacteria. This is considered a high-risk violation that could pose immediate health dangers.
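A minimal sketch of how such a lookup might work is shown below. It is not the contents of violation_codes_lookup.py: the `VIOLATION_CODES` subset, the function names, and the assumed input shape (entries separated by `|`, each starting with a numeric code and a period) are all illustrative assumptions.

```python
import re

# Subset of the code table above: code -> (title, severity).
VIOLATION_CODES = {
    18: ("Proper cooking time and temperature", "Priority"),
    35: ("Approved thawing methods used", "Priority Foundation"),
    52: ("Refuse, recyclables properly removed", "Core"),
}

# Icons follow the severity classifications listed earlier.
SEVERITY_ICONS = {"Priority": "🚨", "Priority Foundation": "⚠️", "Core": "📋"}

# Assumed input shape: "18. <description> | 52. <description>".
CODE_PATTERN = re.compile(r"(?:^|\|)\s*(\d+)\.")


def extract_codes(violations_text: str) -> list[int]:
    """Return the violation codes found at the start of each entry."""
    return [int(code) for code in CODE_PATTERN.findall(violations_text)]


def explain_violations(violations_text: str) -> list[str]:
    """Turn raw violation text into one readable line per known code."""
    lines = []
    for code in extract_codes(violations_text):
        if code not in VIOLATION_CODES:
            continue  # unknown codes are skipped rather than guessed
        title, severity = VIOLATION_CODES[code]
        icon = SEVERITY_ICONS[severity]
        lines.append(f"{icon} **{title}** (Code {code}): {severity} violation")
    return lines
```

Anchoring the pattern to the start of each entry keeps numbers inside inspector comments (temperatures, counts) from being mistaken for violation codes.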
💡 Key Features
- Automatic Integration: Works seamlessly with article generation
- Educational Focus: Explains what violations mean for food safety
- Reader-Friendly: Converts codes like "52" into understandable explanations
- Severity Awareness: Helps readers understand violation importance
- Prevention Tips: Includes common causes and prevention methods
💰 Prompt Caching Cost Optimization
🎯 Implementation
Separated system prompts from data-specific prompts to enable caching and reduce costs by ~50%.
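# Cached system prompt (identical across all articles, so it can be cached)
self.cached_system_prompt = self._build_cached_system_prompt()
# Data-specific prompt (changes per article)
data_prompt = self._build_data_prompt(inspection_data)
# API call with caching: the system prompt is sent as a content block
# marked with cache_control, which is what tells the API to cache it
response = self.anthropic_client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=2048,  # required by the Messages API; this value is an assumption
    system=[{
        "type": "text",
        "text": self.cached_system_prompt,
        "cache_control": {"type": "ephemeral"},  # Cached
    }],
    messages=[{"role": "user", "content": data_prompt}],  # Variable
)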
💸 Cost Savings
| Without Caching | ~$0.002/article |
| With Caching | ~$0.001/article |
| Savings | ~50% cost reduction |
| 47,599 Articles | Saves ~$24 total |
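The arithmetic behind these figures can be made explicit with a short sketch. Taken at face value, the rounded per-article costs give about $47.60 in savings over 47,599 articles; the ~$24 total is what an unrounded cost closer to $0.0005 per article would produce, so the per-article values should be treated as approximate. Constant and function names are illustrative.

```python
ARTICLES = 47_599
COST_WITHOUT_CACHING = 0.002  # USD per article, rounded figure from the table
COST_WITH_CACHING = 0.001     # USD per article, rounded figure from the table


def total_cost(per_article: float, articles: int = ARTICLES) -> float:
    """Total cost in USD for a batch, rounded to cents."""
    return round(per_article * articles, 2)


# Absolute savings for the whole batch and the relative reduction per article.
savings = round(total_cost(COST_WITHOUT_CACHING) - total_cost(COST_WITH_CACHING), 2)
reduction = 1 - COST_WITH_CACHING / COST_WITHOUT_CACHING
```

Replacing the two constants with measured per-article costs from the API dashboard would give a figure that can be checked directly against the billed total.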
🏗️ System Architecture
Data Flow Pipeline
Flow: Government Data → Import → Vectorization → Article Generation → Image Selection → Optimization → Publishing
Technology Stack
Backend
- PHP 8.2
- Python 3.12
- Weaviate Vector DB
- Docker
AI/ML
- Claude-3.5-Sonnet
- Sentence Transformers
- DALL-E 3 (optional)
- all-MiniLM-L6-v2
Frontend
- Semantic HTML5
- Inline Critical CSS
- Progressive Enhancement
- WebP with fallbacks
📊 Structured Data Implementation
✅ All 4 Structured Data Types Validated
Article Pages Include:
- NewsArticle Schema
  - Headline, author, publisher
  - Date published/modified
  - Article body with sections
  - Image with captions
- BreadcrumbList Schema
  - Home → Category → Article
  - Proper navigation hierarchy
- Organization Schema
  - Publisher information
  - Logo and contact details
- FAQPage Schema
  - Educational Q&A pairs
  - Food safety information
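As an illustration of two of these schema types, the sketch below builds NewsArticle and BreadcrumbList JSON-LD as plain dictionaries and wraps them in a script tag. The function names and parameters are hypothetical, and the production templates are PHP; only the schema.org property names are standard.

```python
import json


def news_article_jsonld(headline, url, image_url, published, modified,
                        author_name, publisher_name):
    """Build a schema.org NewsArticle object as a plain dict."""
    return {
        "@context": "https://schema.org",
        "@type": "NewsArticle",
        "headline": headline,
        "image": [image_url],
        "datePublished": published,
        "dateModified": modified,
        "author": {"@type": "Organization", "name": author_name},
        "publisher": {"@type": "Organization", "name": publisher_name},
        "mainEntityOfPage": url,
    }


def breadcrumb_jsonld(crumbs):
    """crumbs: ordered (name, url) pairs, e.g. Home, Category, Article."""
    return {
        "@context": "https://schema.org",
        "@type": "BreadcrumbList",
        "itemListElement": [
            {"@type": "ListItem", "position": position, "name": name, "item": url}
            for position, (name, url) in enumerate(crumbs, start=1)
        ],
    }


def as_script_tag(data) -> str:
    """Serialize JSON-LD for embedding in HTML."""
    # Escape "</" so text inside the JSON cannot close the script element early.
    payload = json.dumps(data).replace("</", "<\\/")
    return f'<script type="application/ld+json">{payload}</script>'
```

Each article page would emit one tag per schema type, so the four types listed above would yield four script elements.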
🔐 Configuration & API Keys
⚠️ Security Note: API keys should be stored in environment variables
Environment Variables
# ~/.env
ANTHROPIC_API_KEY=sk-ant-api03-xxxxx # Claude API
OPENAI_API_KEY=sk-xxxxx # DALL-E (optional)
Weaviate Configuration
| Host | localhost:8080 |
| Collections | RawInspection, Articles |
| Vectorizer | Sentence Transformers (local) |
| Vector Dimensions | 384 |
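To show what the 384-dimension vectors are used for, here is a toy sketch of similarity ranking. Weaviate performs this search internally with an approximate index and uses cosine distance by default; the brute-force loop and the 3-dimension vectors below are simplifications for illustration, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)


def rank(query_vec, docs, top_k=3):
    """Return the ids of the top_k documents most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in docs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:top_k]]
```

In the real pipeline both the query and the stored records would be embedded with all-MiniLM-L6-v2, so every vector has 384 components instead of 3.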
🔄 Workflow Processes
Adding New Content
1. Import inspection data: python3 clean_and_import.py
2. Vectorize for search: python3 vectorize_all.py
3. Generate articles: python3 test_article_generator.py
4. Select images: python3 image_selector.py
5. Optimize images: python3 image_optimizer.py /path/to/images
6. Publish to site: automatic via Weaviate
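The manual steps above could be chained by a small runner that stops at the first failure. This is a sketch, not an existing script: `run_pipeline` and `STEPS` are hypothetical names, and because the scripts live in different directories (see the directory structure), a real runner would also need to set the working directory for each step.

```python
import subprocess

# Commands in the order listed above; publishing needs no command.
STEPS = [
    ["python3", "clean_and_import.py"],
    ["python3", "vectorize_all.py"],
    ["python3", "test_article_generator.py"],
    ["python3", "image_selector.py"],
    ["python3", "image_optimizer.py", "/path/to/images"],
]


def run_pipeline(steps=STEPS, dry_run=True, runner=subprocess.run):
    """Run each step in order and return the commands that were (or would be) run."""
    executed = []
    for cmd in steps:
        executed.append(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit,
            # so later steps never run against a failed import.
            runner(cmd, check=True)
    return executed
```

The default `dry_run=True` only lists the commands, which fits the advice to test with a few records before any bulk operation.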
Testing Changes
Best Practice: Always test with 2-3 records before bulk operations
- Use test scripts with 2-3 records first
- Check logs for errors
- Verify on staging URL
- Deploy to production
📈 Import Statistics
Chicago Data Import (August 14, 2025)
| Total Records | 295,254 |
| Import Speed | ~450 records/second |
| Vectorization Time | 30-45 minutes |
| Duplicates Prevented | Active deduplication |
| Vector Dimensions | 384 |
| Model Used | all-MiniLM-L6-v2 |
🛠️ Maintenance Commands
System Health Checks
# Check Weaviate Status
docker ps | grep weaviate
curl http://localhost:8080/v1/.well-known/ready
# Monitor Import Progress
tail -f /var/www/twin-digital-media/public_html/_sites/cleankitchens/production/logs/clean_import.log
# Test Article Generation
cd /var/www/twin-digital-media/public_html/_sites/cleankitchens/production/scripts
python3 test_article_generator.py
# Optimize New Images
python3 image_optimizer.py /path/to/images
# Check Vector Coverage
python3 test_and_fix_vectors.py
Database Operations
# Connect to Weaviate Console
docker exec -it [container_id] /bin/bash
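# Backup Weaviate Data (REST backup API; assumes the backup-filesystem module is enabled)
curl -X POST -H "Content-Type: application/json" -d "{\"id\": \"backup-$(date +%Y%m%d)\"}" http://localhost:8080/v1/backups/filesystem
# Check Collection Size (object count for a collection)
curl -s -X POST -H "Content-Type: application/json" -d '{"query": "{ Aggregate { RawInspection { meta { count } } } }"}' http://localhost:8080/v1/graphql | jq
# Inspect Collection Schema
curl http://localhost:8080/v1/schema | jq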
🚨 Important Notes
⚠️ Critical Reminders:
- Always test with small batches first (2-3 records)
- Monitor API costs via Claude/OpenAI dashboards
- Check duplicate prevention in import logs
- Verify image optimization before bulk processing
- Keep vectors updated when adding new records
Troubleshooting
| Issue | Solution |
|---|---|
| Weaviate not responding | docker restart [container_id] |
| Import fails | Check logs in /production/logs/ |
| Articles not displaying | Verify Weaviate query in functions_live.php |
| Images not optimized | Check PIL installation: pip install Pillow |
| Vectors missing | Run vectorize_all.py again |
📅 Recent Updates
Completed on August 14, 2025
- ✅ Set up Weaviate Docker container
- ✅ Imported 295k Chicago records
- ✅ Implemented free vectorization
- ✅ Created article generation pipeline
- ✅ Built image optimization system
- ✅ Fixed PHP template errors
- ✅ Achieved perfect PageSpeed scores
Completed on August 15, 2025
- ✅ Created comprehensive violation code lookup system
- ✅ Integrated FDA Food Code explanations for codes 18, 35, 44, 47, 48, 51, 52, 53
- ✅ Implemented prompt caching to reduce AI generation costs by ~50%
- ✅ Enhanced auto-tagging with violation-specific tags
- ✅ Added violation severity classification (priority, priority_foundation, core)
- ✅ Created comprehensive article generation with violation explanations
- ✅ Set up background processing with SSH disconnect protection
- ✅ Generated professional violation category images
Cost Optimization Features
- 💰 Prompt Caching: Separates system prompts from data prompts
- 💰 Haiku Model: Uses Claude 3 Haiku for ~$0.001/article generation
- 💰 Batch Processing: Optimized for processing the 47,599 records from 2023-2025
- 💰 Estimated Total Cost: ~$24.52 for all historical data processing
Next Steps
- ⏳ Process all 2023-2025 Chicago inspection data (47,599 records)
- ⏳ Fix CSV processing date grouping issue
- ⏳ Implement pattern detection for investigative stories
- ⏳ Create tag pages for auto-generated tags
- ⏳ Add more city data sources
- ⏳ Set up monitoring alerts