Building a Private ARAG AI System: Enterprise LLM Without the Cloud Risk
How we deployed a self-hosted AI system using OpenWebUI and Azure OpenAI with text embeddings, custom tools, and business data integration—giving enterprise intelligence without sacrificing data sovereignty.
A client came to us with a problem that's becoming increasingly common: they wanted the capabilities of modern AI systems like ChatGPT, but couldn't send sensitive business data through public cloud services. Their compliance requirements and data sovereignty concerns meant cloud-based LLMs were off the table, but they still needed the intelligence and productivity gains AI could provide.
We built them an Augmented Retrieval-Augmented Generation (ARAG) system—a private, self-hosted AI platform that combines local document retrieval with Azure OpenAI's intelligence, all while maintaining complete control over data flow.
The Architecture: Hybrid Intelligence
The solution needed to balance local data security with cloud AI capabilities. Here's how we structured it:
Core Components
- OpenWebUI: Self-hosted frontend providing ChatGPT-like interface
- Azure OpenAI: GPT-4 and GPT-3.5-turbo endpoints for intelligence
- Text Embedding Models: Local vector embeddings for document retrieval
- Custom Tools Framework: Business system integration layer
- PostgreSQL + pgvector: Vector database for semantic search
Data Flow Architecture
The critical design decision was where data lives versus where computation happens:
- Business documents never leave the network: All proprietary data stays on-premises
- Vector embeddings are local: Document embeddings generated and stored locally
- Only anonymized queries hit Azure: No sensitive context sent to cloud endpoints
- Response augmentation happens locally: AI responses are enriched with local data before display
This gives you the intelligence of GPT-4 while maintaining data sovereignty. Azure OpenAI only sees sanitized queries—it never has access to your document corpus.
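To make the "sanitized queries" guarantee concrete, here is a minimal sketch of the preprocessing step in Python. The redaction patterns are illustrative placeholders, not the production rule set; a real deployment would pair pattern scrubbing with allow-lists and review.

import re

# Illustrative patterns only; the real scrubber is policy-driven.
REDACTION_PATTERNS = [
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "[EMAIL]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\bCUST-\d+\b"), "[CUSTOMER_ID]"),  # hypothetical internal ID format
]

def sanitize_query(query: str) -> str:
    """Strip obvious identifiers before a query leaves the network."""
    for pattern, replacement in REDACTION_PATTERNS:
        query = pattern.sub(replacement, query)
    return query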
Retrieval-Augmented Generation: The Smart Part
Standard LLMs are trained on broad knowledge but know nothing about your specific business. RAG solves this by retrieving relevant context from your documents before generating responses.
The Retrieval Pipeline
When a user asks a question:
User Query → Text Embedding → Vector Similarity Search → Top-K Documents Retrieved
We use sentence-transformers for local embedding generation. The model converts both user queries and document chunks into high-dimensional vectors. Similar concepts cluster in vector space, so finding relevant documents becomes a nearest-neighbor search problem.
Why this matters: A user can ask "What's our refund policy for damaged goods?" and the system retrieves the exact policy section from internal documentation—even if the wording doesn't match exactly. Semantic understanding beats keyword matching.
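Here is a minimal sketch of that pipeline, assuming sentence-transformers with the all-MiniLM-L6-v2 model and a pgvector-backed document_chunks table. The model choice, connection string, and schema are illustrative, not the exact production setup.

from sentence_transformers import SentenceTransformer
import psycopg2

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def retrieve_top_k(query: str, k: int = 10) -> list[tuple]:
    # Encode the query into the same vector space as the document chunks
    vec = model.encode(query).tolist()
    vec_literal = "[" + ",".join(map(str, vec)) + "]"  # pgvector text format
    conn = psycopg2.connect("dbname=rag")  # placeholder DSN
    with conn, conn.cursor() as cur:
        # <=> is pgvector's cosine-distance operator; smallest distance first
        cur.execute(
            "SELECT chunk_id, content FROM document_chunks "
            "ORDER BY embedding <=> %s::vector LIMIT %s",
            (vec_literal, k),
        )
        return cur.fetchall()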
Augmentation Layer
Retrieved documents are injected into the prompt context:
System: You are an assistant with access to company documentation.
Context: [Retrieved Document Sections]
User: What's our refund policy for damaged goods?
The LLM generates responses grounded in your actual documentation rather than hallucinating policies.
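A minimal sketch of the augmentation call, assuming the openai v1 Python SDK against an Azure OpenAI deployment; the endpoint, key handling, and the deployment name gpt-4 are placeholders:

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://your-resource.openai.azure.com",  # placeholder
    api_key="...",  # in practice, pulled from a secrets store
    api_version="2024-02-01",
)

def answer(question: str, retrieved_chunks: list[str]) -> str:
    # Inject the retrieved sections into the system message, as shown above
    context = "\n\n".join(retrieved_chunks)
    response = client.chat.completions.create(
        model="gpt-4",  # Azure deployment name, not the base model ID
        messages=[
            {"role": "system", "content":
                "You are an assistant with access to company documentation.\n"
                f"Context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content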
Custom Tools: Business System Integration
The real power comes from connecting the AI to live business systems. We implemented a custom tools framework that lets the LLM query databases, check inventory, pull customer records, and trigger workflows.
Tool Definition Example
Tools are defined with JSON schemas that the LLM can reason about:
{
  "name": "check_inventory",
  "description": "Query current inventory levels for products",
  "parameters": {
    "sku": "string",
    "location": "string (optional)"
  }
}
When a user asks "Do we have SKU-12345 in stock?", the LLM:
- Recognizes this requires inventory data
- Calls check_inventory(sku="SKU-12345")
- Receives live data from your ERP
- Formulates a natural language response
No prompt engineering required from the user. The system transparently handles function calling and data integration.
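A sketch of the dispatch loop, assuming the OpenAI tools API (the abbreviated schema above expands to full JSON Schema on the wire) and a hypothetical local check_inventory helper standing in for the real ERP integration:

import json
from openai import AzureOpenAI

client = AzureOpenAI(azure_endpoint="https://your-resource.openai.azure.com",
                     api_key="...", api_version="2024-02-01")  # placeholders

def check_inventory(sku: str, location: str = "all") -> dict:
    # Hypothetical stand-in for the real ERP integration layer
    return {"sku": sku, "location": location, "on_hand": 42}

TOOLS = [{
    "type": "function",
    "function": {
        "name": "check_inventory",
        "description": "Query current inventory levels for products",
        "parameters": {
            "type": "object",
            "properties": {
                "sku": {"type": "string"},
                "location": {"type": "string"},
            },
            "required": ["sku"],
        },
    },
}]

def run(question: str) -> str:
    messages = [{"role": "user", "content": question}]
    response = client.chat.completions.create(
        model="gpt-4", messages=messages, tools=TOOLS)
    msg = response.choices[0].message
    while msg.tool_calls:  # the model may request one or more tool invocations
        messages.append(msg)
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            result = check_inventory(**args)  # dispatch to the local tool
            messages.append({"role": "tool", "tool_call_id": call.id,
                             "content": json.dumps(result)})
        response = client.chat.completions.create(
            model="gpt-4", messages=messages, tools=TOOLS)
        msg = response.choices[0].message
    return msg.content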
Tracking and Auditing: Compliance Requirements
Enterprise AI needs audit trails. We implemented comprehensive logging:
- Query logging: Every question asked, by whom, when
- Document access tracking: Which documents were retrieved for each query
- Tool invocation logs: What business systems were accessed
- Response tracking: Full conversation history with timestamps
- Data lineage: Trace any AI-generated answer back to source documents
This satisfies compliance requirements and provides visibility into how the AI is being used. You can audit exactly what data informed each response.
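As a sketch, each interaction can be persisted as a structured record along these lines; field names and the log path are illustrative, and the production schema is richer:

import json, time, uuid

def log_interaction(user_id: str, query: str, doc_ids: list[str],
                    tool_calls: list[dict], response: str) -> None:
    record = {
        "event_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "user_id": user_id,                 # who asked
        "query": query,                     # what was asked
        "retrieved_documents": doc_ids,     # data lineage for the answer
        "tool_invocations": tool_calls,     # which business systems were touched
        "response": response,               # what the AI said
    }
    with open("/var/log/arag/audit.jsonl", "a") as f:  # placeholder path
        f.write(json.dumps(record) + "\n")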
User Authentication and RBAC
OpenWebUI integrates with existing identity providers (LDAP, Azure AD, etc.). We implemented role-based access control (a retrieval-side sketch follows the list) so that:
- Department-specific document access: HR queries only see HR documents
- Tool permissions: Finance users can query accounting systems, others cannot
- Admin oversight: Audit logs accessible only to compliance team
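Enforcing document-level access at the retrieval layer can look like the following sketch: the same nearest-neighbor query as before, restricted to the departments the user's roles grant. The department column is an assumed schema detail, and cur is a psycopg2 cursor.

def retrieve_top_k_for_user(cur, vec_literal: str, departments: list[str],
                            k: int = 10) -> list[tuple]:
    # Documents outside the user's departments never enter the prompt context
    cur.execute(
        "SELECT chunk_id, content FROM document_chunks "
        "WHERE department = ANY(%s) "
        "ORDER BY embedding <=> %s::vector LIMIT %s",
        (departments, vec_literal, k),
    )
    return cur.fetchall()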
Custom Prompts: Shaping Behavior
Different business functions need different AI behavior. We created custom system prompts for various use cases:
Customer Support Persona
You are a customer support specialist. Be empathetic and solution-focused.
Always cite policy documents when making statements about company policy.
If you cannot find an answer in the documentation, say so clearly.
Never guess or make up policy details.
Technical Documentation Assistant
You are a technical documentation assistant. Provide precise, accurate answers.
Include relevant code snippets and configuration examples from docs.
When multiple approaches exist, present options with trade-offs.
Prompts are version-controlled and can be updated without code changes.
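One way to get that property, sketched below, is to keep one plain-text system prompt per persona in a git-tracked directory and load it at request time; the path and naming convention are assumptions:

from pathlib import Path

PROMPT_DIR = Path("/etc/arag/prompts")  # placeholder, git-tracked in practice

def load_system_prompt(persona: str) -> str:
    # e.g. "customer_support" -> /etc/arag/prompts/customer_support.txt
    return (PROMPT_DIR / f"{persona}.txt").read_text()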
Performance Characteristics
The system handles real-world load effectively:
- Query latency: 2-4 seconds including retrieval and generation
- Concurrent users: 50+ simultaneous without degradation
- Document corpus: 100,000+ pages indexed
- Embedding generation: 200 documents/second on a local GPU
- Vector search: Sub-100ms for top-10 retrieval from millions of vectors
Deployment Architecture
We deployed on-premises with HA configuration:
- Application tier: Docker Swarm cluster (3 nodes)
- Database tier: PostgreSQL with pgvector extension (primary + replica)
- GPU tier: NVIDIA GPU for local embedding generation
- Load balancer: HAProxy for request distribution
- Monitoring: Prometheus + Grafana for observability
Total infrastructure cost was significantly lower than equivalent SaaS AI solutions, with the added benefit of data control.
Real-World Impact
After deployment, the client saw measurable improvements:
- Support ticket resolution time: 35% faster (agents find answers immediately)
- Employee self-service: 60% of common questions resolved without human escalation
- Onboarding time: New employees get instant answers to policy questions
- Compliance: Full audit trail for regulatory requirements
- Cost: No per-query fees, unlike metered SaaS alternatives
The Technical Takeaway: Why ARAG Works
Traditional RAG retrieves documents and feeds them to an LLM. Augmented RAG adds:
- Tool integration: Live data, not just static documents
- Multi-source retrieval: Combine databases, APIs, and document stores
- Custom preprocessing: Sanitize queries before cloud endpoints
- Local post-processing: Enrich responses with real-time data
This architecture gives you enterprise AI that's actually useful—not just a chatbot with access to a wiki.
Need AI capabilities without cloud risk? We specialize in building private LLM systems that integrate with your existing infrastructure while maintaining complete data control. Contact us to discuss your requirements.