# ModelGate: Secure and Unified API Gateway for Local LLM Deployment

As local large language model (LLM) deployment becomes increasingly common, simply exposing inference services like Ollama or vLLM to external users introduces serious security, quota, and operational challenges. What developers need is not just a model server, but a secure, unified, and OpenAI-compatible gateway layer.

## The Challenges of Direct Local Model Exposure

Open-source tools such as Ollama, vLLM, and llama.cpp have significantly lowered the barrier for running large models locally. However, productionizing these services reveals several structural pain points.

๐Ÿ” Security Risks

Directly exposing an inference service to the public internet means:

  • Anyone can access your model endpoint.
  • There is no built-in access control.
  • Usage cannot be traced or audited reliably.

This is unacceptable in enterprise or multi-user environments.

### 👥 Multi-Tenant Management Complexity

In real-world deployments, you often need:

* Separate API Keys for different teams or applications.
* Distinct token quotas and rate limits.
* Usage tracking and cost allocation.

Implementing these directly inside model servers is complex, error-prone, and hard to maintain.

โš™๏ธ Operational Overhead

Without a unified gateway:

  • Each service must implement authentication and rate limiting independently.
  • No centralized management interface exists.
  • Observability and monitoring are fragmented.

Operational costs quickly escalate.

### 🔄 OpenAI API Compatibility

Many existing applications are already built on top of the OpenAI API.

To migrate them to local models, or to build hybrid deployments (local + OpenAI cloud), developers need a fully compatible API layer that enables zero-cost switching.


## The Solution: ModelGate

ModelGate is an OpenAI-compatible API gateway purpose-built for local LLM deployment.

Developed in Go, ModelGate provides a secure and controlled way to expose local or hybrid model services while maintaining full compatibility with the OpenAI API format.


## Core Capabilities

๐Ÿ” Security First

  • API Key authentication (SHA256 hashed storage)
  • Per-key IP whitelist support
  • HTTPS deployment support

### 📊 Fine-Grained Quota & Rate Control

* Token-based quota allocation per user
* Redis-based rate limiting (RPM / Burst)
* Full token consumption tracking per request
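The RPM / Burst semantics follow the classic token-bucket model. The sketch below is a simplified in-memory illustration of that model; ModelGate enforces its limits through Redis so they hold across gateway instances, and the class name here is hypothetical:

```python
import time

class RateLimiter:
    """Token bucket: refills at rpm/60 tokens per second, holds at most `burst` tokens.

    Illustrative in-memory version of RPM/Burst limiting; a production gateway
    (like ModelGate) keeps this state in Redis for multi-instance consistency."""

    def __init__(self, rpm: int, burst: int):
        self.rate = rpm / 60.0          # sustained refill rate, tokens per second
        self.burst = burst              # bucket capacity (max burst size)
        self.tokens = float(burst)      # start full
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill based on elapsed time, capped at the bucket capacity.
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0          # spend one token for this request
            return True
        return False

limiter = RateLimiter(rpm=60, burst=2)
assert limiter.allow() and limiter.allow()   # a burst of 2 goes through
assert not limiter.allow()                   # the third immediate call is rejected
```

The same bucket allows sustained traffic at the RPM rate while absorbing short spikes up to the burst size.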

๐ŸŒ Multi-Backend Support

ModelGate supports multiple inference backends:

  • Ollama (local inference)
  • vLLM (high-performance inference)
  • llama.cpp (lightweight runtime)
  • OpenAI (hybrid cloud deployment)
  • API3 and other third-party APIs

This allows seamless hybrid model strategies.

### 📈 Zero-Cost Migration from OpenAI

Existing applications only need to change the base_url:

```python
from openai import OpenAI

# Original OpenAI usage
client = OpenAI(
    api_key="xxx",
    base_url="https://api.openai.com/v1"
)

# Switch to ModelGate
client = OpenAI(
    api_key="your-modelgate-key",
    base_url="http://your-server:8080/v1"
)
```

No other changes required.

### 🎛 Flexible Management Interfaces

* Web Admin UI for visual management
* CLI tools for automation and scripting
* RESTful Admin APIs for deep integration

---

## Architecture Overview

```text
┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│   Client    │─────▶│  ModelGate  │─────▶│   Ollama    │
│  (OpenAI    │      │  (Gateway)  │      │  / vLLM     │
│   SDK)      │      │             │      │  / llama.cpp│
└─────────────┘      └──────┬──────┘      └─────────────┘
                            │
         ┌──────────────────┼──────────────────┐
         ▼                  ▼                  ▼
   ┌──────────┐       ┌──────────┐       ┌──────────┐
   │  SQLite  │       │  Redis   │       │  Admin   │
   │  (Data)  │       │  (Rate)  │       │  UI/API  │
   └──────────┘       └──────────┘       └──────────┘
```


### Adapter Pattern Design

ModelGate adopts an adapter pattern:

* Each backend (Ollama, vLLM, etc.) has its own adapter.
* Core gateway logic is decoupled from backend implementations.
* New model backends can be added easily.
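The adapter idea can be sketched as follows. This is a hypothetical illustration of the pattern, not ModelGate's Go source; the class names and the routing function are invented for the example, while the backend endpoint paths (`/api/chat` for Ollama, `/v1/chat/completions` for vLLM's OpenAI-compatible server) match those projects' public APIs:

```python
from abc import ABC, abstractmethod

class BackendAdapter(ABC):
    """Abstract interface the gateway core depends on."""
    @abstractmethod
    def chat_url(self) -> str: ...

class OllamaAdapter(BackendAdapter):
    # Ollama exposes its own native chat endpoint.
    def __init__(self, base_url: str):
        self.base_url = base_url
    def chat_url(self) -> str:
        return f"{self.base_url}/api/chat"

class VLLMAdapter(BackendAdapter):
    # vLLM serves an OpenAI-compatible endpoint, so the path differs.
    def __init__(self, base_url: str):
        self.base_url = base_url
    def chat_url(self) -> str:
        return f"{self.base_url}/v1/chat/completions"

def gateway_route(adapter: BackendAdapter) -> str:
    # Core gateway logic only sees the abstract interface; adding a new
    # backend means adding one adapter class, not touching the core.
    return adapter.chat_url()

assert gateway_route(OllamaAdapter("http://localhost:11434")).endswith("/api/chat")
```

Swapping backends then reduces to constructing a different adapter, which is exactly why new backends can be added without changing gateway logic.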

---

## Quick Start

### One-Click Deployment

```bash
# Recommended: Docker Compose
docker-compose up -d

# Or manual build
make build
./modelgate
```


### Configuration

Edit `configs/config.yaml`:

```yaml
server:
  port: 8080

admin:
  api_key: "your-admin-key"

adapters:
  ollama:
    base_url: http://localhost:11434
  vllm:
    base_url: http://localhost:8000
```


### Create a User API Key

```bash
./modelgate-cli key create -n "user1" -q 1000000 -r 60
```


### Send a Request

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer your-user-key" \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3:8b", "messages": [{"role": "user", "content": "Hello"}]}'
```


## Real-World Use Cases

1๏ธโƒฃ Enterprise Internal Model Governance

Background

Multiple AI teams need access to internally deployed models.

With ModelGate

  • Independent API Keys per team
  • QPS / daily call / token quotas
  • Internal IP restrictions
  • Real-time monitoring (usage, latency, failures)
  • Internal cost allocation

Value

Enables enterprise-grade governance and prevents resource abuse.
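Per-team token quotas boil down to a simple ledger: grant a budget per key, deduct each request's total token usage, and reject once the budget is spent. The sketch below is a hypothetical minimal version of that bookkeeping, not ModelGate's actual schema:

```python
class QuotaLedger:
    """Hypothetical per-key token quota ledger (illustrative, in-memory)."""

    def __init__(self):
        self.remaining = {}   # api_key -> tokens left

    def grant(self, key: str, tokens: int) -> None:
        self.remaining[key] = tokens

    def charge(self, key: str, prompt_tokens: int, completion_tokens: int) -> bool:
        """Deduct a request's total token usage; reject once the quota is exhausted."""
        cost = prompt_tokens + completion_tokens
        if self.remaining.get(key, 0) < cost:
            return False            # unknown key or insufficient quota
        self.remaining[key] -= cost
        return True

ledger = QuotaLedger()
ledger.grant("team-a", 1000)
assert ledger.charge("team-a", prompt_tokens=700, completion_tokens=200)    # 900 used
assert not ledger.charge("team-a", prompt_tokens=80, completion_tokens=40)  # over quota
```

Recording each charge alongside the key also yields the usage-tracking and cost-allocation data mentioned above for free.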


2๏ธโƒฃ SaaS Platform with Hybrid Deployment

Background

Tiered service model:

  • Free users โ†’ local models
  • Paid users โ†’ GPT-4 or cloud models

With ModelGate

  • Identity-based routing
  • Multi-model load balancing
  • Unified billing system
  • Unified monitoring
  • Consistent API abstraction

Value

Achieves hybrid cloud deployment with optimized cost-performance balance.
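Identity-based routing in this scenario is essentially a lookup from the caller's tier to a backend and model. A hypothetical minimal routing table (tier names and model choices are examples, not ModelGate configuration) might look like:

```python
# Hypothetical routing table: caller tier -> backend + model.
ROUTES = {
    "free": {"backend": "ollama", "model": "qwen3:8b"},   # local model
    "paid": {"backend": "openai", "model": "gpt-4"},      # cloud model
}

def route(user_tier: str) -> dict:
    """Map a caller's tier to a backend, defaulting unknown tiers to local."""
    return ROUTES.get(user_tier, ROUTES["free"])

assert route("paid")["backend"] == "openai"
assert route("unknown")["backend"] == "ollama"
```

Because both backends sit behind the same OpenAI-compatible API surface, the client code is identical for either tier; only the gateway-side routing decision changes.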


3๏ธโƒฃ Productizing a Vertical Model API

Background

A team trains a vertical model (medical, legal, financial) and wants to commercialize it.

With ModelGate

  • Full authentication and access control
  • OpenAI-compatible API packaging
  • Token-based or subscription billing
  • Usage analytics and reporting
  • Admin dashboard
  • Rate limiting and traffic protection

Value

Transforms internal models into a scalable, monetizable API platform.


4๏ธโƒฃ Local Multi-Instance OpenClaw Scheduling

Background

A researcher runs multiple OpenClaw instances and wants isolated quotas and scheduling control.

With ModelGate

  • Separate API Keys per instance
  • Load balancing across local models
  • Resource monitoring per instance
  • Strategy-based routing
  • Per-instance rate limits
  • Behavioral usage analysis

Value

Enables controlled multi-agent experimentation in local environments.


## Comparison

| Feature | ModelGate | OpenAI API | Nginx |
| --- | --- | --- | --- |
| Multi-Backend Support | ✅ | ❌ | ❌ |
| API Key Management | ✅ | ✅ | ❌ |
| Token Quota | ✅ | ✅ | ❌ |
| Rate Limiting | ✅ | ✅ | Basic |
| Usage Statistics | ✅ | ✅ | ❌ |
| Web Admin UI | ✅ | ✅ | ❌ |
| Deployment Complexity | Low | None | Medium |
| Open Source | ✅ | ❌ | ✅ |

## Roadmap

* Plugin system
* Monitoring and alerting
* Distributed deployment
* More backend integrations

## Conclusion

ModelGate addresses the security and governance challenges of local LLM deployment.

Instead of forcing developers to reinvent authentication, quota management, and rate limiting for each model service, ModelGate provides a unified and production-ready gateway layer.

Developers can now focus on building models and applications while ModelGate handles the infrastructure.

**Open Source Repository:** https://github.com/derekwin/ModelGate

Stars and contributions are welcome.