I’ve been working with Azure OpenAI for the past year at Advania, building a custom AI application for one of our enterprise clients. Users can create folders, organize their chats, ask anything from simple questions to complex analysis, and upload Word or Excel files for document analysis. We’ve wired it up to Azure AI Search, Azure AI Agents with code interpreter, Fabric for data access, and Bing grounding for web search.
One pattern kept repeating: we’d deploy GPT-4 for everything because it was the safe choice, then watch our Azure bill climb while knowing that most queries were simple enough for GPT-4o-mini. Sound familiar?
The obvious solution is to add routing logic – classify the query, pick the right model. But then you’re maintaining yet another piece of infrastructure, dealing with edge cases, and second-guessing every decision. When Microsoft announced Model Router in May 2025, I was sceptical. Another abstraction to learn? But after three months using it in production, I’ve changed my mind.
This is the first in a series where I’ll share what I’ve learned about Model Router – not just how it works, but when it’s worth the trade-offs and when you should stick with what you have. This post covers the architecture and decision framework. Later posts will dig into implementation, cost optimization, and multi-agent patterns.
The Problem: Static Models Don’t Match Variable Workloads
Our app handles an incredibly diverse workload. Users ask simple questions like “What’s the weather in London?” (which we route through Bing grounding), upload Excel files for data analysis (code interpreter via Azure AI Agents), search through their company’s document repository (Azure AI Search), or query their data warehouse (Fabric integration). Then there are the complex queries – multi-step analysis, synthesizing information from multiple sources, reasoning through business scenarios.
Early on, we made a pragmatic choice: deploy GPT-4 for all of it. The reasoning was sound – consistent quality, one deployment to manage, no classification logic to maintain. But here’s what actually happened. We’d send queries like “What’s 15% of 2,400?” to GPT-4 (which the code interpreter could handle, but still), spending $0.03 per 1,000 input tokens on something GPT-4o-mini could easily answer. Meanwhile, we paid the same per-token rate whether a request was a trivial file upload or a genuinely complex synthesis across ten Excel spreadsheets – the work that actually justified GPT-4’s capabilities.
The math was brutal. Processing 100,000 queries per month at an average of 2,000 input tokens and 600 output tokens per query:
- GPT-4 across the board: ~$8,100/month
- Perfect routing (if we could somehow classify perfectly): ~$3,400/month
That $4,700 monthly difference funds a senior developer. But building and maintaining a classification system? That also costs a senior developer’s time, plus the operational overhead of yet another service to monitor and tune.
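To make that trade-off concrete, here’s a minimal sketch of the monthly-cost arithmetic. The per-1K-token rates below are illustrative placeholders, not current Azure pricing:

```python
def monthly_cost(queries: int, in_tokens: int, out_tokens: int,
                 in_rate: float, out_rate: float) -> float:
    """Monthly spend in USD; rates are per 1,000 tokens."""
    per_query = (in_tokens / 1000) * in_rate + (out_tokens / 1000) * out_rate
    return queries * per_query

# Illustrative rates only -- check the current Azure pricing page for real figures
flagship = monthly_cost(100_000, 2_000, 600, in_rate=0.03, out_rate=0.06)
small = monthly_cost(100_000, 2_000, 600, in_rate=0.0006, out_rate=0.0024)
print(f"flagship-only: ${flagship:,.0f}/month, small-only: ${small:,.0f}/month")
```

Plugging in your own measured token averages and current rates gives the ceiling on what routing can save you before you invest in any of it.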
How Model Router Actually Works
Model Router isn’t magic – it’s a fine-tuned small language model that Microsoft trained to classify query complexity and select from a pool of underlying models. When you send a request to a Model Router deployment, here’s what happens:
- Your prompt hits the router’s classification model
- The router analyses query complexity, context length requirements, and task type
- It selects the most cost-effective model that can handle the query
- Your request gets forwarded to that model
- The response comes back with a field showing which model was actually used
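From client code, that last step looks like an ordinary chat completion whose `model` field reports the router’s choice. A minimal sketch using the `openai` Python SDK – the helper name and deployment name are mine, not from Microsoft’s docs:

```python
from typing import Any

def ask_router(client: Any, deployment: str, question: str) -> str:
    """Send one chat turn to a Model Router deployment and return the
    name of the underlying model that actually served it."""
    resp = client.chat.completions.create(
        model=deployment,  # the router *deployment* name, e.g. "model-router"
        messages=[{"role": "user", "content": question}],
    )
    # The response's `model` field reports which model the router selected.
    return resp.model
```

With an `AzureOpenAI` client from the `openai` package, `ask_router(client, "model-router", "What's 15% of 2,400?")` would come back with something like `gpt-4o-mini` – logging that field is the simplest way to observe routing behaviour.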
The key insight is that Microsoft is treating model selection as an ML problem, not a rules-based one. You’re not writing `if complexity > threshold then use_gpt4` logic. You’re leveraging a model that’s been trained on massive amounts of query data to make these decisions.
The router currently (as of the November 2025 version) can select from 12+ models including GPT-4.1, GPT-5 series, o4-mini for reasoning tasks, and several smaller models like GPT-4o-mini and GPT-4.1-nano. It also includes third-party models like Claude Sonnet/Opus/Haiku and Grok-4 if you want that flexibility.
What’s interesting is how it handles reasoning models. If your query needs multi-step logical reasoning, the router can select from the o-series models (like o4-mini), which have chain-of-thought reasoning built in. For straightforward tasks, it sticks with standard chat models.
Three Routing Modes: Picking Your Optimization Target
Microsoft gives you three routing modes, and honestly, the names are self-explanatory but the implications aren’t obvious until you use them in production.
- Cost Saving Mode prioritizes the cheapest models that can handle the task. In practice, this means your queries get routed heavily toward GPT-4.1-nano and GPT-4o-mini. We tested this mode for general Q&A where users are asking straightforward questions or doing simple document searches through Azure AI Search – basic, high-volume stuff. The cost savings were real (about 60% compared to static GPT-5), but we did notice occasional quality drops on edge cases. For simple queries, that was acceptable. For complex data analysis with code interpreter? Not a chance.
- Quality Mode takes the opposite approach – when in doubt, use the more capable model. This routes more traffic to GPT-4.1, GPT-5, and reasoning models. Your costs go up, but so does the consistency of your outputs. We use this mode for scenarios where users are uploading multiple documents and asking for cross-document analysis, or when they’re querying Fabric and need sophisticated data interpretation.
- Balanced Mode is the default, and it’s where most production workloads should start. It tries to optimize the cost-quality trade-off dynamically. In our testing, it gave us about 45-50% cost savings versus static GPT-4 while maintaining quality that was statistically indistinguishable in user satisfaction surveys.
Here’s the catch: you set the routing mode at deployment time, not per request. If you want to A/B test different modes, you need multiple deployments. This matters for optimization, which I’ll cover later.
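Because the mode is fixed per deployment, A/B testing means splitting traffic across two router deployments in your own code. A sketch of one way to do it – deployment names are placeholders, and the hashing scheme is mine:

```python
import hashlib

def pick_deployment(user_id: str, split: dict[str, float]) -> str:
    """Deterministically bucket a user into one of several router
    deployments; `split` maps deployment name -> traffic share (sums to 1)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10_000 / 10_000  # stable value in [0, 1)
    cumulative = 0.0
    for name, share in split.items():
        cumulative += share
        if bucket < cumulative:
            return name
    return name  # guard against float rounding at the top boundary
```

Something like `pick_deployment(user, {"router-balanced": 0.9, "router-cost-saving": 0.1})` keeps each user pinned to the same mode across sessions, which keeps your quality comparisons clean.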
Model Subsets: The Control You Actually Need
The November 2025 update added something crucial: custom model subsets. Instead of letting the router choose from its entire pool of models, you can specify exactly which models are available for routing.
Why does this matter? Three real-world scenarios:
- Budget constraints: If you exclude GPT-5 and the reasoning models from your subset, you cap your maximum per-query cost. The router can’t surprise you with an expensive model selection. We did this for our general Q&A deployment where users are asking straightforward questions and cost predictability mattered more than maximum accuracy.
- Compliance requirements: Some regulated industries need to stick with specific model providers. You can create a subset that only includes Microsoft’s models, excluding Claude and Grok. This matters when you’re working with enterprise clients who have specific vendor requirements in their data processing agreements.
- Context window needs: This one bit us. The router’s effective context window is capped by the smallest model in your pool. If you include GPT-4.1-nano (which has a smaller context window), document analysis with large Excel files can fail whenever the router happens to select that model. Once we understood this, we created separate deployments with model subsets chosen for the document sizes we expected.
The subset feature gives you a way to guide the router’s decisions without building your own classification logic. You’re essentially saying “here are the models I trust for this workload” and letting the router optimize within those constraints.
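The context-window caveat above is easy to check mechanically before you deploy. The window sizes below are illustrative placeholders, not official limits – look up each model’s documented figure:

```python
def effective_context_window(subset: list[str], windows: dict[str, int]) -> int:
    """A router deployment's worst-case context budget is the smallest
    window among the models it is allowed to pick."""
    return min(windows[m] for m in subset)

# Illustrative numbers only -- substitute each model's documented limit
WINDOWS = {"gpt-4.1": 1_000_000, "gpt-4o-mini": 128_000, "gpt-4.1-nano": 32_000}
```

If `effective_context_window(subset, WINDOWS)` is smaller than your largest expected document plus prompt, that subset belongs on a different deployment.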
When Model Router Actually Makes Sense
After three months in production, here’s when I’d recommend Model Router and when I wouldn’t.
Use Model Router if:
- You have high query volume where cost actually matters. Below 10,000 queries/month, the cost difference probably doesn’t justify any optimization effort. Above 100,000 queries/month? You should be looking at this. Our app serves hundreds of users who create multiple chats per day – the volume adds up fast.
- Your workload has genuinely variable complexity. Our app handles everything from “What’s the weather?” (Bing grounding) to “Analyse these three Excel files and identify trends across all product lines” (code interpreter + reasoning). That’s a perfect fit. If all your queries are similar complexity, static deployment is simpler.
- You’re in the prototyping phase and don’t want to commit to a specific model yet. Model Router buys you time to understand your actual workload distribution before locking into infrastructure decisions. We used it to understand which features in our app actually needed GPT-4 versus where we could use cheaper models.
- You’re building multi-agent systems where different agents have different needs. This is where it really shines – when you have separate agents for document search (Azure AI Search), data analysis (code interpreter), web grounding (Bing), and synthesis, each agent can benefit from different models.
Stick with static deployments if:
- You need exact model version control for compliance or reproducibility. Model Router can update its underlying models when you enable auto-update, which might not be acceptable in regulated environments.
- Your latency budget is extremely tight. The router adds roughly 50-100ms to your request time. For most applications this doesn’t matter, but if you’re building real-time systems where every millisecond counts, the overhead might not be worth it.
- You have very predictable workloads where you already know the optimal model. If 95% of your queries need GPT-4’s capabilities, just deploy GPT-4. The router optimization won’t help much.
- You’re using specific model features that only certain models support. For example, if you’re heavily relying on function calling patterns that work differently across models, the routing might cause subtle inconsistencies.
The Cost Reality Check
Here’s actual production data from our AI app over 30 days:
- Total queries processed: 142,500
- Model Router (Balanced mode): $3,247
- Projected cost with static GPT-4: $7,200
- Actual savings: $3,953 (55%)
But that’s not the whole story. The router itself now costs money (as of November 2025). There’s a per-token charge for the routing decision, plus the underlying model costs. The full calculation looks like:
`Total Cost = (Input_Tokens × Router_Rate) + (Input_Tokens × Model_Input_Rate) + (Output_Tokens × Model_Output_Rate)`
The router rate is relatively small, but it’s not zero. In our case, it added about 3% to the total cost. Still a massive win, but worth accounting for in your projections.
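In code, that calculation might look like this (rates per 1,000 tokens; all figures are placeholders, not actual Azure pricing):

```python
def routed_query_cost(in_tokens: int, out_tokens: int,
                      router_rate: float,
                      model_in_rate: float, model_out_rate: float) -> float:
    """Total USD cost of one routed request: the router's fee on input
    tokens, plus the selected model's input and output token charges."""
    return ((in_tokens / 1000) * router_rate
            + (in_tokens / 1000) * model_in_rate
            + (out_tokens / 1000) * model_out_rate)
```

Summing this per request, with the model rate varying by what the router actually picked, is what makes projections honest – a flat per-query estimate hides the routing distribution.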
The other cost is operational: monitoring and understanding model selection patterns. We added telemetry to track which models were being selected for different document types. That’s extra Application Insights queries, dashboard maintenance, and alert tuning. Not huge, but it’s real effort.
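Our telemetry isn’t public, but a minimal version of that tracking is just one structured log line per request; the field names here are my own, not an Application Insights schema:

```python
import json
import logging

logger = logging.getLogger("router_telemetry")

def selection_event(deployment: str, served_model: str, doc_type: str,
                    prompt_tokens: int, completion_tokens: int) -> str:
    """Build and log one structured event recording which model the
    router picked, so selection patterns can be charted per doc type."""
    event = json.dumps({
        "deployment": deployment,
        "served_model": served_model,
        "doc_type": doc_type,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
    })
    logger.info(event)
    return event
```

Emitting this after every routed call gives you the raw data for the dashboards and alerts mentioned above, whatever sink you ship the logs to.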
Conclusion: The Router Isn’t Magic, But It Works
Here’s the reality: Model Router won’t revolutionize your AI application overnight. It adds complexity – another deployment type to understand, routing patterns to monitor, and model selection telemetry to track. The 50-100ms latency overhead is real. The per-token routing cost, while small, isn’t zero.
But after three months in production and $3,953 in monthly savings, I’m convinced it solves a genuine problem. Not because it’s technically impressive (though Microsoft’s ML-based classification approach is clever), but because it removes a decision you shouldn’t be making manually at scale.