Automated Data Warehousing: Selenium Clusters and AI-Powered Content Enrichment
Building a cloud-native microservices platform that scrapes product data at scale, sanitizes it, enriches it with AI-generated SEO content, and delivers it via secure API—turning raw vendor data into market-ready product information.
A client in e-commerce faced a data quality nightmare: they needed to aggregate product information from hundreds of vendor websites, but the data was inconsistent, incomplete, and definitely not SEO-friendly. Vendor product descriptions were technical specs copy-pasted from PDFs. Images were low-resolution or watermarked. Pricing was scattered across multiple pages.
Manually collecting and cleaning this data would require a full-time team. Instead, we built an automated data warehousing platform that:
- Scrapes vendor websites at scale using Selenium clusters
- Sanitizes and normalizes data to consistent schema
- Enriches with AI to generate SEO-optimized descriptions and metadata
- Delivers via API for integration with their e-commerce platform
The system processes 50,000+ products daily, turning vendor chaos into clean, market-ready product data.
The Data Problem: Vendor Website Inconsistency
E-commerce aggregators depend on vendor data. The problem: vendors don't structure their data for you.
Common issues we encountered:
- No API available: Most vendors have websites, not data feeds
- Inconsistent structure: Each vendor's site has different HTML structure
- Dynamic content: JavaScript-rendered product information
- Anti-scraping measures: CAPTCHAs, rate limiting, IP blocking
- Poor data quality: Missing specifications, vague descriptions, inconsistent units
- No SEO value: Technical jargon that doesn't rank in search
Manual data entry doesn't scale. Public data feeds (if they exist) are expensive and limited. Web scraping was the only viable solution.
System Architecture: Cloud-Native Microservices
The platform is built as distributed microservices, each handling a specific stage of the data pipeline:
High-Level Pipeline
```
Vendor Websites
      ↓
[Scraper Cluster]    → Selenium Grid (10-50 Chrome nodes, up to 250 concurrent browsers)
      ↓
[Raw Data Queue]     → RabbitMQ
      ↓
[Sanitizer Service]  → Normalize schema, clean HTML, validate data
      ↓
[Enrichment Service] → AI-generated descriptions, SEO metadata
      ↓
[Data Warehouse]     → PostgreSQL with full-text search
      ↓
[API Gateway]        → Secure RESTful API for client access
```
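The stages are decoupled by the queue: scrapers publish raw records, and the sanitizer consumes them at its own pace. Here's a minimal sketch of the publish side, assuming a durable queue named `raw-products` (the queue name and record fields are illustrative, not the production schema):

```python
import json
import pika

# Publish one raw scraped record onto the queue the sanitizer consumes.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
channel = connection.channel()
channel.queue_declare(queue="raw-products", durable=True)

record = {"vendor_id": "acme", "sku": "w 123", "title": "Widget NEW!", "price": "19.99 EUR"}
channel.basic_publish(
    exchange="",
    routing_key="raw-products",
    body=json.dumps(record).encode(),
    properties=pika.BasicProperties(delivery_mode=2),  # persist messages to disk
)
connection.close()
```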
Technology Stack
Scraping Layer:
- Selenium Grid: Distributed browser automation
- Python: Scraping scripts using selenium and BeautifulSoup
- Kubernetes: Container orchestration for scraper pods
- Redis: Deduplication and job queue
Processing Layer:
- Node.js: Sanitizer and enrichment services
- OpenAI API: GPT-4 for content generation
- Tesseract OCR: Extract text from product images
- ImageMagick: Image processing and optimization
Storage and API:
- PostgreSQL: Primary data store with JSONB columns for flexible schema
- Elasticsearch: Full-text search and product discovery
- FastAPI: RESTful API gateway
- AWS S3: Product image storage
- CloudFront: CDN for image delivery
Scraping at Scale: Selenium Grid Cluster
Web scraping sounds simple until you need to do it at scale. We needed to scrape 200+ vendor sites, each with thousands of products, every 24 hours.
Distributed Browser Automation
We deployed Selenium Grid on Kubernetes with:
- 1 Hub: Coordinates scraping jobs across the cluster
- 10-50 Chrome nodes: Scale dynamically based on workload
- 5 browser instances per node: Up to 250 concurrent scrapers
Each vendor requires a custom scraper tailored to its site structure. The scraper connects to the Selenium Hub, which assigns it to an available Chrome node. It then navigates the vendor's site, waits for JavaScript to render the product details, and extracts (a trimmed sketch follows this list):
- Product title, SKU, and pricing
- Specifications from tables or lists
- Image URLs from galleries
- Availability status
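The sketch below shows that flow end to end; the hub address, product URL, and CSS selectors are illustrative, since every vendor scraper defines its own:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")

# The hub dispatches this session to whichever Chrome node has a free slot.
driver = webdriver.Remote(command_executor="http://selenium-hub:4444", options=options)

try:
    driver.get("https://vendor.example.com/products/pump-3hp")  # hypothetical URL
    wait = WebDriverWait(driver, 15)
    # Block until the JavaScript-rendered product details exist in the DOM.
    title = wait.until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "h1.product-title"))
    ).text
    price = driver.find_element(By.CSS_SELECTOR, ".price").text
    sku = driver.find_element(By.CSS_SELECTOR, ".sku").text
    # Specification tables vary per vendor; this assumes simple th/td rows.
    specs = {
        row.find_element(By.TAG_NAME, "th").text: row.find_element(By.TAG_NAME, "td").text
        for row in driver.find_elements(By.CSS_SELECTOR, "table.specs tr")
    }
    image_urls = [
        img.get_attribute("src")
        for img in driver.find_elements(By.CSS_SELECTOR, ".gallery img")
    ]
finally:
    driver.quit()
```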
Anti-Scraping Countermeasures
Vendors don't want automated scraping. We implemented several strategies, two of which are sketched in code below:
User-Agent Rotation: Cycle through realistic browser fingerprints to avoid detection.
Rate Limiting: Random delays between requests (typically 30 requests/minute per vendor) to appear human.
IP Rotation: Residential proxy services (BrightData) distribute requests across thousands of IPs, avoiding rate limits and IP blocks.
CAPTCHA Solving: Integration with 2Captcha API automatically solves CAPTCHAs when encountered, injecting the token back into the page.
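User-agent rotation and rate limiting are straightforward to show. A minimal sketch, assuming the 30 requests/minute budget above (the agent strings are placeholders, not a current fingerprint pool):

```python
import random
import time

from selenium import webdriver

# Placeholder pool; in production these track current browser releases.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
]

def new_options() -> webdriver.ChromeOptions:
    """Fresh ChromeOptions with a randomly chosen user agent per session."""
    options = webdriver.ChromeOptions()
    options.add_argument(f"user-agent={random.choice(USER_AGENTS)}")
    return options

def polite_delay(requests_per_minute: int = 30) -> None:
    """Sleep a jittered interval so traffic averages the per-vendor budget."""
    base = 60.0 / requests_per_minute
    time.sleep(random.uniform(0.5 * base, 1.5 * base))
```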
Data Sanitization: Normalizing Chaos
Raw scraped data is messy. The sanitizer service cleans and normalizes everything to a consistent schema:
Normalization tasks:
- Title cleanup: Remove promotional text (SALE!, NEW!, etc.)
- SKU standardization: Uppercase, no spaces, consistent format
- Price conversion: Convert all currencies to USD at current rates
- HTML stripping: Clean description text of markup and scripts
- Unit standardization: Convert measurements to metric (kg, cm, etc.)
- Image processing: Download, resize to 1200x1200, optimize to 85% JPEG quality, upload to S3/CloudFront
This ensures all products in the warehouse have consistent structure regardless of source vendor.
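The sanitizer itself runs on Node.js; the core transformations are sketched here in Python for consistency with the other examples. The promo pattern, rate table, and Pillow stand-in for the ImageMagick step are assumptions:

```python
import io
import re

from PIL import Image  # Pillow stands in here for the ImageMagick step

PROMO = re.compile(r"\b(?:SALE|NEW|HOT|CLEARANCE)\b!*", re.IGNORECASE)

def clean_title(raw: str) -> str:
    """Strip promotional noise and collapse leftover whitespace."""
    return re.sub(r"\s{2,}", " ", PROMO.sub("", raw)).strip()

def normalize_sku(raw: str) -> str:
    """Uppercase, with spaces/underscores collapsed to a single hyphen."""
    return re.sub(r"[\s_]+", "-", raw.strip()).upper()

def to_usd(amount: float, currency: str, rates: dict) -> float:
    """`rates` maps currency code -> USD per unit, refreshed from a rate feed."""
    return round(amount * rates[currency], 2)

def optimize_image(raw_bytes: bytes) -> bytes:
    """Resize to fit within 1200x1200 and re-encode at 85% JPEG quality."""
    img = Image.open(io.BytesIO(raw_bytes)).convert("RGB")
    img.thumbnail((1200, 1200))
    out = io.BytesIO()
    img.save(out, format="JPEG", quality=85, optimize=True)
    return out.getvalue()
```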
AI Enrichment: GPT-4 Content Generation
The most valuable addition: AI-generated SEO content.
Vendor descriptions like "Industrial pump, 3HP, stainless steel" don't rank in search engines or help customers. We use GPT-4 to transform technical specs into compelling, SEO-optimized content:
Content Generation Pipeline
For each product, we generate (see the sketch after this list):
- SEO-optimized descriptions (150-250 words, keyword-rich, benefit-focused)
- Meta title and description (optimized for search results)
- Feature bullets (5-7 compelling highlights)
- Comparison points (why this product vs. alternatives)
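A minimal sketch of one generation call, assuming the current `openai` Python client and an illustrative prompt (the production service is Node.js, and it batches products and retries on rate limits):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = """Write a 150-250 word SEO-optimized product description.
Focus on benefits and concrete use cases. Specs:
{specs}"""

def generate_description(specs: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are an e-commerce copywriter."},
            {"role": "user", "content": PROMPT.format(specs=specs)},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content

print(generate_description("3HP stainless steel centrifugal pump, 240V, max flow 100L/min"))
```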
Results
Before AI enrichment:
"3HP stainless steel centrifugal pump, 240V, max flow 100L/min"
After AI enrichment:
"Upgrade your industrial operations with this robust 3HP stainless steel centrifugal pump, engineered for demanding commercial and agricultural applications. Delivering impressive flow rates up to 100 liters per minute, this pump combines durability with exceptional performance. The corrosion-resistant stainless steel construction ensures longevity even in harsh environments, while the efficient 240V motor provides reliable power for continuous operation. Perfect for water transfer, irrigation systems, and industrial processes where consistent, high-volume pumping is essential. Built to commercial standards, this pump offers the reliability professionals demand at a price point small businesses can afford."
The AI-generated content includes:
- Natural keyword placement ("industrial operations", "commercial and agricultural")
- Benefit-focused language ("durability", "exceptional performance", "longevity")
- Use cases ("water transfer, irrigation, industrial processes")
- Conversion-focused copy ("reliability professionals demand at a price point small businesses can afford")
Data Warehouse: PostgreSQL + Elasticsearch
The processed data is stored in a hybrid system optimized for different query patterns:
PostgreSQL: Primary Data Store
- Stores complete product records with JSONB columns for flexible specifications
- Full-text search using generated `tsvector` columns (weighted: title > description > features)
- Indexed by SKU, vendor_id, and search vectors
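A sketch of the weighted search column, assuming a `products` table with plain-text `title`, `description`, and `features` columns; the migration uses a stored generated column, available in PostgreSQL 12+:

```python
import psycopg2

DDL = """
ALTER TABLE products ADD COLUMN search_vector tsvector
  GENERATED ALWAYS AS (
    setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
    setweight(to_tsvector('english', coalesce(description, '')), 'B') ||
    setweight(to_tsvector('english', coalesce(features, '')), 'C')
  ) STORED;
CREATE INDEX products_search_idx ON products USING GIN (search_vector);
"""

with psycopg2.connect("dbname=warehouse") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)  # weights A > B > C mirror title > description > features
```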
Elasticsearch: Advanced Search
- Synced from PostgreSQL for complex queries and faceted search
- Multi-field matching with boosted relevance (title³, description², features¹)
- Aggregations for vendor filters, price ranges, and availability facets
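A sketch of the boosted query with two of the facet aggregations, assuming the 8.x `elasticsearch` Python client and an index named `products` (endpoint and field names are illustrative):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # endpoint is deployment-specific

def search_products(query: str) -> dict:
    return es.search(
        index="products",
        query={
            "multi_match": {
                "query": query,
                # Boosts mirror the weighting above: title x3, description x2.
                "fields": ["title^3", "description^2", "features"],
            }
        },
        aggs={
            "vendors": {"terms": {"field": "vendor_id"}},
            "price_ranges": {
                "range": {
                    "field": "price",
                    "ranges": [{"to": 100}, {"from": 100, "to": 500}, {"from": 500}],
                }
            },
        },
    )
```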
This hybrid approach provides transactional consistency (PostgreSQL) with powerful search capabilities (Elasticsearch).
API Gateway: Secure Data Delivery
The final layer is a FastAPI gateway providing RESTful access with JWT authentication.
Key Features
- Token-based authentication: JWT tokens with expiration
- Advanced filtering: Search, vendor, price range, availability
- Pagination: Configurable limit/offset (max 100 per request)
- Rate limiting: Prevents API abuse
- Response caching: CDN-cached responses for popular queries
Clients integrate via simple HTTP requests, receiving clean JSON responses with products matching their criteria.
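A trimmed sketch of the gateway's shape, assuming PyJWT for token verification; `query_warehouse` is a stand-in for the real PostgreSQL/Elasticsearch lookup:

```python
import jwt  # PyJWT
from fastapi import Depends, FastAPI, HTTPException, Query
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer

app = FastAPI()
bearer = HTTPBearer()
SECRET = "change-me"  # loaded from a secrets manager in production

def current_client(creds: HTTPAuthorizationCredentials = Depends(bearer)) -> str:
    """Reject requests without a valid, unexpired JWT."""
    try:
        return jwt.decode(creds.credentials, SECRET, algorithms=["HS256"])["sub"]
    except jwt.PyJWTError:
        raise HTTPException(status_code=401, detail="Invalid or expired token")

def query_warehouse(**filters) -> dict:
    """Stand-in for the warehouse lookup; returns an empty page here."""
    return {"items": [], "filters": filters}

@app.get("/products")
def list_products(
    q: str | None = None,
    vendor: str | None = None,
    min_price: float | None = None,
    max_price: float | None = None,
    limit: int = Query(20, le=100),  # capped at 100 per request
    offset: int = Query(0, ge=0),
    client_id: str = Depends(current_client),
):
    return query_warehouse(
        q=q, vendor=vendor, min_price=min_price, max_price=max_price,
        limit=limit, offset=offset,
    )
```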
Real-World Performance
After 6 months in production:
Scale Metrics
- Products processed: 2.1 million total, 50,000/day average
- Vendors scraped: 240 active sources
- Scraping throughput: 15,000 products/hour during peak
- API requests: 500,000/day average
- Infrastructure cost: ~$800/month (AWS EC2, RDS, S3, Elasticsearch)
Data Quality Improvements
- Completeness: 95%+ of products have full specifications (vs. 40% raw)
- Image quality: 100% optimized and CDN-delivered (vs. mixed quality)
- SEO descriptions: 100% AI-generated (vs. 0% usable for SEO)
- Data freshness: 24-hour update cycle (vs. weeks/months manual)
Business Impact
- E-commerce conversion rate: +35% (better product information)
- Organic search traffic: +120% (SEO-optimized content)
- Manual data entry: Eliminated (was 3 FTE)
- Time-to-market: New vendors added in days vs. months
Need automated data collection and enrichment? We build custom scraping, processing, and API platforms that turn distributed web data into unified, enriched datasets. Contact us to discuss your data warehousing requirements.