Building a Private ARAG AI System: Enterprise LLM Without the Cloud Risk
How we deployed a self-hosted AI system using OpenWebUI and Azure OpenAI with text embeddings, custom tools, and business data integration—giving enterprise intelligence without sacrificing data sovereignty.
A client came to us with a problem that's becoming increasingly common: they wanted the capabilities of modern AI systems like ChatGPT, but couldn't send sensitive business data through public cloud services. Their compliance requirements and data sovereignty concerns meant cloud-based LLMs were off the table, but they still needed the intelligence and productivity gains AI could provide.
We built them an Augmented Retrieval-Augmented Generation (ARAG) system—a private, self-hosted AI platform that combines local document retrieval with Azure OpenAI's intelligence, all while maintaining complete control over data flow.
The Architecture: Hybrid Intelligence
The solution needed to balance local data security with cloud AI capabilities. Here's how we structured it:
Core Components
- OpenWebUI: Self-hosted frontend providing ChatGPT-like interface
- Azure OpenAI: GPT-4 and GPT-3.5-turbo endpoints for intelligence
- Text Embedding Models: Local vector embeddings for document retrieval
- Custom Tools Framework: Business system integration layer
- PostgreSQL + pgvector: Vector database for semantic search
Data Flow Architecture
The critical design decision was where data lives versus where computation happens:
- Business documents never leave the network: All proprietary data stays on-premises
- Vector embeddings are local: Document embeddings generated and stored locally
- Only anonymized queries hit Azure: No sensitive context sent to cloud endpoints
- Response augmentation happens locally: AI responses are enriched with local data before display
This gives you the intelligence of GPT-4 while maintaining data sovereignty. Azure OpenAI only sees sanitized queries—it never has access to your document corpus.
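To make the "sanitized queries" guarantee concrete, here is a minimal sketch of the preprocessing step in Python. The redaction patterns are illustrative placeholders, not the production rule set; a real deployment would pair pattern scrubbing with allow-lists and review.

import re

# Illustrative patterns only; the real scrubber is policy-driven.
REDACTION_PATTERNS = [
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "[EMAIL]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\bCUST-\d+\b"), "[CUSTOMER_ID]"),  # hypothetical internal ID format
]

def sanitize_query(query: str) -> str:
    """Strip obvious identifiers before a query leaves the network."""
    for pattern, replacement in REDACTION_PATTERNS:
        query = pattern.sub(replacement, query)
    return query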
Retrieval-Augmented Generation: The Smart Part
Standard LLMs are trained on broad knowledge but know nothing about your specific business. RAG solves this by retrieving relevant context from your documents before generating responses.
The Retrieval Pipeline
When a user asks a question:
User Query → Text Embedding → Vector Similarity Search → Top-K Documents Retrieved
We use sentence-transformers for local embedding generation. The model converts both user queries and document chunks into high-dimensional vectors. Similar concepts cluster in vector space, so finding relevant documents becomes a nearest-neighbor search problem.
Why this matters: A user can ask "What's our refund policy for damaged goods?" and the system retrieves the exact policy section from internal documentation—even if the wording doesn't match exactly. Semantic understanding beats keyword matching.
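Here is a minimal sketch of that pipeline, assuming sentence-transformers with the all-MiniLM-L6-v2 model and a pgvector-backed document_chunks table. The model choice, connection string, and schema are illustrative, not the exact production setup.

from sentence_transformers import SentenceTransformer
import psycopg2

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def retrieve_top_k(query: str, k: int = 10) -> list[tuple]:
    # Encode the query into the same vector space as the document chunks
    vec = model.encode(query).tolist()
    vec_literal = "[" + ",".join(map(str, vec)) + "]"  # pgvector text format
    conn = psycopg2.connect("dbname=rag")  # placeholder DSN
    with conn, conn.cursor() as cur:
        # <=> is pgvector's cosine-distance operator; smallest distance first
        cur.execute(
            "SELECT chunk_id, content FROM document_chunks "
            "ORDER BY embedding <=> %s::vector LIMIT %s",
            (vec_literal, k),
        )
        return cur.fetchall()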
Augmentation Layer
Retrieved documents are injected into the prompt context:
System: You are an assistant with access to company documentation.
Context: [Retrieved Document Sections]
User: What's our refund policy for damaged goods?
The LLM generates responses grounded in your actual documentation rather than hallucinating policies.
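A minimal sketch of the augmentation call, assuming the openai v1 Python SDK against an Azure OpenAI deployment; the endpoint, key handling, and the deployment name gpt-4 are placeholders:

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://your-resource.openai.azure.com",  # placeholder
    api_key="...",  # in practice, pulled from a secrets store
    api_version="2024-02-01",
)

def answer(question: str, retrieved_chunks: list[str]) -> str:
    # Inject the retrieved sections into the system message, as shown above
    context = "\n\n".join(retrieved_chunks)
    response = client.chat.completions.create(
        model="gpt-4",  # Azure deployment name, not the base model ID
        messages=[
            {"role": "system", "content":
                "You are an assistant with access to company documentation.\n"
                f"Context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content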
Custom Tools: Business System Integration
The real power comes from connecting the AI to live business systems. We implemented a custom tools framework that lets the LLM query databases, check inventory, pull customer records, and trigger workflows.
Tool Definition Example
Tools are defined with JSON schemas that the LLM can reason about:
{
  "name": "check_inventory",
  "description": "Query current inventory levels for products",
  "parameters": {
    "sku": "string",
    "location": "string (optional)"
  }
}
When a user asks "Do we have SKU-12345 in stock?", the LLM:
- Recognizes this requires inventory data
- Calls check_inventory(sku="SKU-12345")
- Receives live data from your ERP
- Formulates a natural language response
No prompt engineering required from the user. The system transparently handles function calling and data integration.
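A sketch of the dispatch loop, assuming the OpenAI tools API (the abbreviated schema above expands to full JSON Schema on the wire) and a hypothetical local check_inventory helper standing in for the real ERP integration:

import json
from openai import AzureOpenAI

client = AzureOpenAI(azure_endpoint="https://your-resource.openai.azure.com",
                     api_key="...", api_version="2024-02-01")  # placeholders

def check_inventory(sku: str, location: str = "all") -> dict:
    # Hypothetical stand-in for the real ERP integration layer
    return {"sku": sku, "location": location, "on_hand": 42}

TOOLS = [{
    "type": "function",
    "function": {
        "name": "check_inventory",
        "description": "Query current inventory levels for products",
        "parameters": {
            "type": "object",
            "properties": {
                "sku": {"type": "string"},
                "location": {"type": "string"},
            },
            "required": ["sku"],
        },
    },
}]

def run(question: str) -> str:
    messages = [{"role": "user", "content": question}]
    response = client.chat.completions.create(
        model="gpt-4", messages=messages, tools=TOOLS)
    msg = response.choices[0].message
    while msg.tool_calls:  # the model may request one or more tool invocations
        messages.append(msg)
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            result = check_inventory(**args)  # dispatch to the local tool
            messages.append({"role": "tool", "tool_call_id": call.id,
                             "content": json.dumps(result)})
        response = client.chat.completions.create(
            model="gpt-4", messages=messages, tools=TOOLS)
        msg = response.choices[0].message
    return msg.content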
Tracking and Auditing: Compliance Requirements
Enterprise AI needs audit trails. We implemented comprehensive logging:
- Query logging: Every question asked, by whom, when
- Document access tracking: Which documents were retrieved for each query
- Tool invocation logs: What business systems were accessed
- Response tracking: Full conversation history with timestamps
- Data lineage: Trace any AI-generated answer back to source documents
This satisfies compliance requirements and provides visibility into how the AI is being used. You can audit exactly what data informed each response.
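As a sketch, each interaction can be persisted as a structured record along these lines; field names and the log path are illustrative, and the production schema is richer:

import json, time, uuid

def log_interaction(user_id: str, query: str, doc_ids: list[str],
                    tool_calls: list[dict], response: str) -> None:
    record = {
        "event_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "user_id": user_id,                 # who asked
        "query": query,                     # what was asked
        "retrieved_documents": doc_ids,     # data lineage for the answer
        "tool_invocations": tool_calls,     # which business systems were touched
        "response": response,               # what the AI said
    }
    with open("/var/log/arag/audit.jsonl", "a") as f:  # placeholder path
        f.write(json.dumps(record) + "\n")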
User Authentication and RBAC
OpenWebUI integrates with existing identity providers (LDAP, Azure AD, etc.). We implemented role-based access control (a retrieval-side sketch follows the list) so that:
- Department-specific document access: HR queries only see HR documents
- Tool permissions: Finance users can query accounting systems, others cannot
- Admin oversight: Audit logs accessible only to compliance team
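Enforcing document-level access at the retrieval layer can look like the following sketch: the same nearest-neighbor query as before, restricted to the departments the user's roles grant. The department column is an assumed schema detail, and cur is a psycopg2 cursor.

def retrieve_top_k_for_user(cur, vec_literal: str, departments: list[str],
                            k: int = 10) -> list[tuple]:
    # Documents outside the user's departments never enter the prompt context
    cur.execute(
        "SELECT chunk_id, content FROM document_chunks "
        "WHERE department = ANY(%s) "
        "ORDER BY embedding <=> %s::vector LIMIT %s",
        (departments, vec_literal, k),
    )
    return cur.fetchall()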
Custom Prompts: Shaping Behavior
Different business functions need different AI behavior. We created custom system prompts for various use cases:
Customer Support Persona
You are a customer support specialist. Be empathetic and solution-focused.
Always cite policy documents when making statements about company policy.
If you cannot find an answer in the documentation, say so clearly.
Never guess or make up policy details.
Technical Documentation Assistant
You are a technical documentation assistant. Provide precise, accurate answers.
Include relevant code snippets and configuration examples from docs.
When multiple approaches exist, present options with trade-offs.
Prompts are version-controlled and can be updated without code changes.
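One way to get that property, sketched below, is to keep one plain-text system prompt per persona in a git-tracked directory and load it at request time; the path and naming convention are assumptions:

from pathlib import Path

PROMPT_DIR = Path("/etc/arag/prompts")  # placeholder, git-tracked in practice

def load_system_prompt(persona: str) -> str:
    # e.g. "customer_support" -> /etc/arag/prompts/customer_support.txt
    return (PROMPT_DIR / f"{persona}.txt").read_text()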
Performance Characteristics
The system handles real-world load effectively:
- Query latency: 2-4 seconds including retrieval and generation
- Concurrent users: 50+ simultaneous without degradation
- Document corpus: 100,000+ pages indexed
- Embedding generation: 200 documents/second on a local GPU
- Vector search: Sub-100ms for top-10 retrieval from millions of vectors
Deployment Architecture
We deployed on-premises with HA configuration:
- Application tier: Docker Swarm cluster (3 nodes)
- Database tier: PostgreSQL with pgvector extension (primary + replica)
- GPU tier: NVIDIA GPU for local embedding generation
- Load balancer: HAProxy for request distribution
- Monitoring: Prometheus + Grafana for observability
Total infrastructure cost was significantly lower than equivalent SaaS AI solutions, with the added benefit of data control.
Real-World Impact
After deployment, the client saw measurable improvements:
- Support ticket resolution time: 35% faster (agents find answers immediately)
- Employee self-service: 60% of common questions resolved without human escalation
- Onboarding time: New employees get instant answers to policy questions
- Compliance: Full audit trail for regulatory requirements
- Cost: No per-query fees, unlike metered SaaS alternatives
The Technical Takeaway: Why ARAG Works
Traditional RAG retrieves documents and feeds them to an LLM. Augmented RAG adds:
- Tool integration: Live data, not just static documents
- Multi-source retrieval: Combine databases, APIs, and document stores
- Custom preprocessing: Sanitize queries before cloud endpoints
- Local post-processing: Enrich responses with real-time data
This architecture gives you enterprise AI that's actually useful—not just a chatbot with access to a wiki.
Need AI capabilities without cloud risk? We specialize in building private LLM systems that integrate with your existing infrastructure while maintaining complete data control. Contact us to discuss your requirements.