Automated Data Warehousing: Selenium Clusters and AI-Powered Content Enrichment
Building a cloud-native microservices platform that scrapes product data at scale, sanitizes it, enriches it with AI-generated SEO content, and delivers it via secure API—turning raw vendor data into market-ready product information.
A client in e-commerce faced a data quality nightmare: they needed to aggregate product information from hundreds of vendor websites, but the data was inconsistent, incomplete, and definitely not SEO-friendly. Vendor product descriptions were technical specs copy-pasted from PDFs. Images were low-resolution or watermarked. Pricing was scattered across multiple pages.
Manually collecting and cleaning this data would require a full-time team. Instead, we built an automated data warehousing platform that:
- Scrapes vendor websites at scale using Selenium clusters
- Sanitizes and normalizes data to consistent schema
- Enriches with AI to generate SEO-optimized descriptions and metadata
- Delivers via API for integration with their e-commerce platform
The system processes 50,000+ products daily, turning vendor chaos into clean, market-ready product data.
The Data Problem: Vendor Website Inconsistency
E-commerce aggregators depend on vendor data. The problem: vendors don't structure their data for you.
Common issues we encountered:
- No API available: Most vendors have websites, not data feeds
- Inconsistent structure: Each vendor's site has different HTML structure
- Dynamic content: JavaScript-rendered product information
- Anti-scraping measures: CAPTCHAs, rate limiting, IP blocking
- Poor data quality: Missing specifications, vague descriptions, inconsistent units
- No SEO value: Technical jargon that doesn't rank in search
Manual data entry doesn't scale. Public data feeds (if they exist) are expensive and limited. Web scraping was the only viable solution.
System Architecture: Cloud-Native Microservices
The platform is built as distributed microservices, each handling a specific stage of the data pipeline:
High-Level Pipeline
```
Vendor Websites
      ↓
[Scraper Cluster]    → Selenium Grid (10-50 Chrome nodes, up to 250 concurrent browsers)
      ↓
[Raw Data Queue]     → RabbitMQ
      ↓
[Sanitizer Service]  → Normalize schema, clean HTML, validate data
      ↓
[Enrichment Service] → AI-generated descriptions, SEO metadata
      ↓
[Data Warehouse]     → PostgreSQL with full-text search
      ↓
[API Gateway]        → Secure RESTful API for client access
```
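The stages are decoupled by the queue: scrapers publish raw records, and the sanitizer consumes them at its own pace. Here's a minimal sketch of the publish side, assuming a durable queue named `raw-products` (the queue name and record fields are illustrative, not the production schema):

```python
import json
import pika

# Publish one raw scraped record onto the queue the sanitizer consumes.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
channel = connection.channel()
channel.queue_declare(queue="raw-products", durable=True)

record = {"vendor_id": "acme", "sku": "w 123", "title": "Widget NEW!", "price": "19.99 EUR"}
channel.basic_publish(
    exchange="",
    routing_key="raw-products",
    body=json.dumps(record).encode(),
    properties=pika.BasicProperties(delivery_mode=2),  # persist messages to disk
)
connection.close()
```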
Technology Stack
Scraping Layer:
- Selenium Grid: Distributed browser automation
- Python: Scraping scripts using selenium and BeautifulSoup
- Kubernetes: Container orchestration for scraper pods
- Redis: Deduplication and job queue
Processing Layer:
- Node.js: Sanitizer and enrichment services
- OpenAI API: GPT-4 for content generation
- Tesseract OCR: Extract text from product images
- ImageMagick: Image processing and optimization
Storage and API:
- PostgreSQL: Primary data store with JSONB columns for flexible schema
- Elasticsearch: Full-text search and product discovery
- FastAPI: RESTful API gateway
- AWS S3: Product image storage
- CloudFront: CDN for image delivery
Scraping at Scale: Selenium Grid Cluster
Web scraping sounds simple until you need to do it at scale. We needed to scrape 200+ vendor sites, each with thousands of products, every 24 hours.
Distributed Browser Automation
We deployed Selenium Grid on Kubernetes with:
- 1 Hub: Coordinates scraping jobs across the cluster
- 10-50 Chrome nodes: Scale dynamically based on workload
- 5 browser instances per node: Up to 250 concurrent scrapers
Each vendor requires a custom scraper tailored to its site structure. The scraper connects to the Selenium Hub, which assigns it to an available Chrome node. It then navigates the vendor's site, waits for JavaScript to render the product details, and extracts (a trimmed sketch follows this list):
- Product title, SKU, and pricing
- Specifications from tables or lists
- Image URLs from galleries
- Availability status
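The sketch below shows that flow end to end; the hub address, product URL, and CSS selectors are illustrative, since every vendor scraper defines its own:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")

# The hub dispatches this session to whichever Chrome node has a free slot.
driver = webdriver.Remote(command_executor="http://selenium-hub:4444", options=options)

try:
    driver.get("https://vendor.example.com/products/pump-3hp")  # hypothetical URL
    wait = WebDriverWait(driver, 15)
    # Block until the JavaScript-rendered product details exist in the DOM.
    title = wait.until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "h1.product-title"))
    ).text
    price = driver.find_element(By.CSS_SELECTOR, ".price").text
    sku = driver.find_element(By.CSS_SELECTOR, ".sku").text
    # Specification tables vary per vendor; this assumes simple th/td rows.
    specs = {
        row.find_element(By.TAG_NAME, "th").text: row.find_element(By.TAG_NAME, "td").text
        for row in driver.find_elements(By.CSS_SELECTOR, "table.specs tr")
    }
    image_urls = [
        img.get_attribute("src")
        for img in driver.find_elements(By.CSS_SELECTOR, ".gallery img")
    ]
finally:
    driver.quit()
```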
Anti-Scraping Countermeasures
Vendors don't want automated scraping. We implemented several strategies, two of which are sketched in code below:
User-Agent Rotation: Cycle through realistic browser fingerprints to avoid detection.
Rate Limiting: Random delays between requests (typically 30 requests/minute per vendor) to appear human.
IP Rotation: Residential proxy services (BrightData) distribute requests across thousands of IPs, avoiding rate limits and IP blocks.
CAPTCHA Solving: Integration with 2Captcha API automatically solves CAPTCHAs when encountered, injecting the token back into the page.
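User-agent rotation and rate limiting are straightforward to show. A minimal sketch, assuming the 30 requests/minute budget above (the agent strings are placeholders, not a current fingerprint pool):

```python
import random
import time

from selenium import webdriver

# Placeholder pool; in production these track current browser releases.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
]

def new_options() -> webdriver.ChromeOptions:
    """Fresh ChromeOptions with a randomly chosen user agent per session."""
    options = webdriver.ChromeOptions()
    options.add_argument(f"user-agent={random.choice(USER_AGENTS)}")
    return options

def polite_delay(requests_per_minute: int = 30) -> None:
    """Sleep a jittered interval so traffic averages the per-vendor budget."""
    base = 60.0 / requests_per_minute
    time.sleep(random.uniform(0.5 * base, 1.5 * base))
```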
Data Sanitization: Normalizing Chaos
Raw scraped data is messy. The sanitizer service cleans and normalizes everything to a consistent schema:
Normalization tasks:
- Title cleanup: Remove promotional text (SALE!, NEW!, etc.)
- SKU standardization: Uppercase, no spaces, consistent format
- Price conversion: Convert all currencies to USD at current rates
- HTML stripping: Clean description text of markup and scripts
- Unit standardization: Convert measurements to metric (kg, cm, etc.)
- Image processing: Download, resize to 1200x1200, optimize to 85% JPEG quality, upload to S3/CloudFront
This ensures all products in the warehouse have consistent structure regardless of source vendor.
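The sanitizer itself runs on Node.js; the core transformations are sketched here in Python for consistency with the other examples. The promo pattern, rate table, and Pillow stand-in for the ImageMagick step are assumptions:

```python
import io
import re

from PIL import Image  # Pillow stands in here for the ImageMagick step

PROMO = re.compile(r"\b(?:SALE|NEW|HOT|CLEARANCE)\b!*", re.IGNORECASE)

def clean_title(raw: str) -> str:
    """Strip promotional noise and collapse leftover whitespace."""
    return re.sub(r"\s{2,}", " ", PROMO.sub("", raw)).strip()

def normalize_sku(raw: str) -> str:
    """Uppercase, with spaces/underscores collapsed to a single hyphen."""
    return re.sub(r"[\s_]+", "-", raw.strip()).upper()

def to_usd(amount: float, currency: str, rates: dict) -> float:
    """`rates` maps currency code -> USD per unit, refreshed from a rate feed."""
    return round(amount * rates[currency], 2)

def optimize_image(raw_bytes: bytes) -> bytes:
    """Resize to fit within 1200x1200 and re-encode at 85% JPEG quality."""
    img = Image.open(io.BytesIO(raw_bytes)).convert("RGB")
    img.thumbnail((1200, 1200))
    out = io.BytesIO()
    img.save(out, format="JPEG", quality=85, optimize=True)
    return out.getvalue()
```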
AI Enrichment: GPT-4 Content Generation
The most valuable addition: AI-generated SEO content.
Vendor descriptions like "Industrial pump, 3HP, stainless steel" don't rank in search engines or help customers. We use GPT-4 to transform technical specs into compelling, SEO-optimized content:
Content Generation Pipeline
For each product, we generate (see the sketch after this list):
- SEO-optimized descriptions (150-250 words, keyword-rich, benefit-focused)
- Meta title and description (optimized for search results)
- Feature bullets (5-7 compelling highlights)
- Comparison points (why this product vs. alternatives)
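A minimal sketch of one generation call, assuming the current `openai` Python client and an illustrative prompt (the production service is Node.js, and it batches products and retries on rate limits):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = """Write a 150-250 word SEO-optimized product description.
Focus on benefits and concrete use cases. Specs:
{specs}"""

def generate_description(specs: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are an e-commerce copywriter."},
            {"role": "user", "content": PROMPT.format(specs=specs)},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content

print(generate_description("3HP stainless steel centrifugal pump, 240V, max flow 100L/min"))
```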
Results
Before AI enrichment:
"3HP stainless steel centrifugal pump, 240V, max flow 100L/min"
After AI enrichment:
"Upgrade your industrial operations with this robust 3HP stainless steel centrifugal pump, engineered for demanding commercial and agricultural applications. Delivering impressive flow rates up to 100 liters per minute, this pump combines durability with exceptional performance. The corrosion-resistant stainless steel construction ensures longevity even in harsh environments, while the efficient 240V motor provides reliable power for continuous operation. Perfect for water transfer, irrigation systems, and industrial processes where consistent, high-volume pumping is essential. Built to commercial standards, this pump offers the reliability professionals demand at a price point small businesses can afford."
The AI-generated content includes:
- Natural keyword placement ("industrial operations", "commercial and agricultural")
- Benefit-focused language ("durability", "exceptional performance", "longevity")
- Use cases ("water transfer, irrigation, industrial processes")
- Conversion-focused copy ("reliability professionals demand at a price point small businesses can afford")
Data Warehouse: PostgreSQL + Elasticsearch
The processed data is stored in a hybrid system optimized for different query patterns:
PostgreSQL: Primary Data Store
- Stores complete product records with JSONB columns for flexible specifications
- Full-text search using generated `tsvector` columns (weighted: title > description > features)
- Indexed by SKU, vendor_id, and search vectors
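A sketch of the weighted search column, assuming a `products` table with plain-text `title`, `description`, and `features` columns; the migration uses a stored generated column, available in PostgreSQL 12+:

```python
import psycopg2

DDL = """
ALTER TABLE products ADD COLUMN search_vector tsvector
  GENERATED ALWAYS AS (
    setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
    setweight(to_tsvector('english', coalesce(description, '')), 'B') ||
    setweight(to_tsvector('english', coalesce(features, '')), 'C')
  ) STORED;
CREATE INDEX products_search_idx ON products USING GIN (search_vector);
"""

with psycopg2.connect("dbname=warehouse") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)  # weights A > B > C mirror title > description > features
```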
Elasticsearch: Advanced Search
- Synced from PostgreSQL for complex queries and faceted search
- Multi-field matching with boosted relevance (title³, description², features¹)
- Aggregations for vendor filters, price ranges, and availability facets
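A sketch of the boosted query with two of the facet aggregations, assuming the 8.x `elasticsearch` Python client and an index named `products` (endpoint and field names are illustrative):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # endpoint is deployment-specific

def search_products(query: str) -> dict:
    return es.search(
        index="products",
        query={
            "multi_match": {
                "query": query,
                # Boosts mirror the weighting above: title x3, description x2.
                "fields": ["title^3", "description^2", "features"],
            }
        },
        aggs={
            "vendors": {"terms": {"field": "vendor_id"}},
            "price_ranges": {
                "range": {
                    "field": "price",
                    "ranges": [{"to": 100}, {"from": 100, "to": 500}, {"from": 500}],
                }
            },
        },
    )
```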
This hybrid approach provides transactional consistency (PostgreSQL) with powerful search capabilities (Elasticsearch).
API Gateway: Secure Data Delivery
The final layer is a FastAPI gateway providing RESTful access with JWT authentication.
Key Features
- Token-based authentication: JWT tokens with expiration
- Advanced filtering: Search, vendor, price range, availability
- Pagination: Configurable limit/offset (max 100 per request)
- Rate limiting: Prevents API abuse
- Response caching: CDN-cached responses for popular queries
Clients integrate via simple HTTP requests, receiving clean JSON responses with products matching their criteria.
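A trimmed sketch of the gateway's shape, assuming PyJWT for token verification; `query_warehouse` is a stand-in for the real PostgreSQL/Elasticsearch lookup:

```python
import jwt  # PyJWT
from fastapi import Depends, FastAPI, HTTPException, Query
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer

app = FastAPI()
bearer = HTTPBearer()
SECRET = "change-me"  # loaded from a secrets manager in production

def current_client(creds: HTTPAuthorizationCredentials = Depends(bearer)) -> str:
    """Reject requests without a valid, unexpired JWT."""
    try:
        return jwt.decode(creds.credentials, SECRET, algorithms=["HS256"])["sub"]
    except jwt.PyJWTError:
        raise HTTPException(status_code=401, detail="Invalid or expired token")

def query_warehouse(**filters) -> dict:
    """Stand-in for the warehouse lookup; returns an empty page here."""
    return {"items": [], "filters": filters}

@app.get("/products")
def list_products(
    q: str | None = None,
    vendor: str | None = None,
    min_price: float | None = None,
    max_price: float | None = None,
    limit: int = Query(20, le=100),  # capped at 100 per request
    offset: int = Query(0, ge=0),
    client_id: str = Depends(current_client),
):
    return query_warehouse(
        q=q, vendor=vendor, min_price=min_price, max_price=max_price,
        limit=limit, offset=offset,
    )
```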
Real-World Performance
After 6 months in production:
Scale Metrics
- Products processed: 2.1 million total, 50,000/day average
- Vendors scraped: 240 active sources
- Scraping throughput: 15,000 products/hour during peak
- API requests: 500,000/day average
- Infrastructure cost: ~$800/month (AWS EC2, RDS, S3, Elasticsearch)
Data Quality Improvements
- Completeness: 95%+ of products have full specifications (vs. 40% raw)
- Image quality: 100% optimized and CDN-delivered (vs. mixed quality)
- SEO descriptions: 100% AI-generated (vs. 0% usable for SEO)
- Data freshness: 24-hour update cycle (vs. weeks/months manual)
Business Impact
- E-commerce conversion rate: +35% (better product information)
- Organic search traffic: +120% (SEO-optimized content)
- Manual data entry: Eliminated (was 3 FTE)
- Time-to-market: New vendors added in days vs. months
Need automated data collection and enrichment? We build custom scraping, processing, and API platforms that turn distributed web data into unified, enriched datasets. Contact us to discuss your data warehousing requirements.