INFRASTRUCTURE

Model-Agnostic Architecture: Routing LLMs by Task, Cost, and Latency

OCT 07, 2025

12 MIN READ

334 Likes

When we started building Agentica, we made an early architectural decision that has proven to be one of the most valuable: treat LLM providers as interchangeable infrastructure components, not as fixed dependencies. Every agent in the system talks to an abstraction layer — the LLM factory — that can route requests to any configured provider without the agent needing to know which one it's talking to.

This wasn't a philosophical stance about vendor lock-in. It was a practical response to the rate of change in the LLM market: new models release quarterly, pricing changes frequently, and the best model for a given task today is not necessarily the best model six months from now. An architecture that requires touching agent code every time a new model becomes available is an architecture with a permanent maintenance tax.

The Abstraction Layer

The LLM factory in Agentica exposes a single interface: take a list of messages, a tool schema, and configuration parameters, and return a response. Behind this interface, providers are implemented as pluggable modules — currently supporting OpenAI, Anthropic, Google Gemini, Azure OpenAI, AWS Bedrock, Grok, NVIDIA NIM, and Ollama. Adding a new provider requires implementing the adapter interface and registering the provider; no other code changes are needed.

The critical property of this abstraction is that it normalizes the response format. Different providers return responses in different formats — different field names, different ways of representing tool calls, different metadata structures. The factory layer normalizes all of these to a common format before returning to the caller. Agents never need to handle provider-specific response formats.

Routing Strategy

With a provider abstraction in place, routing becomes a configuration decision. Agentica supports several routing strategies: static (always use this provider for this agent), cost-optimized (use the cheapest provider that meets minimum capability requirements), latency-optimized (use the fastest provider that meets quality requirements), and capability-based (use specific providers for specific task types).

The most practically useful is capability-based routing. Reasoning tasks — complex analysis, multi-step planning, nuanced judgment — route to frontier models with high reasoning capability. Classification tasks — intent detection, entity extraction, routing decisions — route to smaller, faster, cheaper models. Document processing tasks with large context requirements route to models with the largest context windows. This stratification typically reduces cost by 40-60% compared to using the same frontier model for all tasks, while maintaining quality on the tasks that require it.

Fallback and Resilience

Provider outages happen. Rate limits get hit. Network errors occur. A model-agnostic architecture naturally enables fallback routing: if the primary provider returns an error, the factory automatically retries with a configured fallback provider. This happens transparently to the agent — from its perspective, the LLM call either succeeded or failed; the retry logic is invisible.

Fallback routing requires some care: not all providers support the same tool schemas, and a fallback to a provider with different tool support may change agent behavior. Agentica handles this by maintaining capability profiles for each provider and only falling back to providers with compatible capabilities for the current task configuration.

Local Models

The ability to route to locally-hosted models via Ollama or NVIDIA NIM is particularly valuable for data-sensitive use cases. Some organizations have strict requirements about data leaving their network perimeter — routing sensitive queries to local models while using cloud providers for less sensitive tasks gives the cost and capability benefits of cloud LLMs where appropriate while maintaining data sovereignty requirements where required.

Deploy Strategic Intelligence

Schedule a technical briefing on multi-agent deployment patterns.

Contact Engineering

Similar Research

View All Logs

RETRIEVAL

Choosing a Vector Database for Production RAG: What Actually Matters

The vector database market has exploded with options. After evaluating six databases for Agentica's RAG infrastructure, here are the dimensions that actually matter in production — and the ones that are mostly marketing.

Analyze Report →

REASONING

Prompt Engineering for Production: What Works at Scale

Prompt engineering in demos is about getting impressive outputs. Prompt engineering in production is about consistency, reliability, and graceful degradation when the model doesn't cooperate. These are different skills.

Analyze Report →