Azure Foundry: scaling model deployments to handle peak requests
Azure Foundry has three deployment types you should know. For peak traffic, the pattern recommended by Microsoft is Global Provisioned (PTU) as your baseline, with spillover to Global Standard to absorb bursts.
Global Provisioned deployments use Azure's global infrastructure to dynamically route traffic to available datacenters. They provide reserved model processing capacity with guaranteed throughput, and combine global routing with lower, more consistent latency than Standard deployments.
More details here.
However, this also means data can be processed in any region globally, which may violate data sovereignty requirements.
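To make the baseline-plus-spillover idea concrete, here is a minimal application-side sketch. It does not use the Azure SDK: `Deployment`, `RateLimited`, and `make_capped_sender` are hypothetical names that simulate a provisioned deployment which throttles once its capacity is used up, so that the overflow path can be seen in isolation.

```python
from dataclasses import dataclass
from typing import Callable


class RateLimited(Exception):
    """Hypothetical error standing in for an HTTP 429 from a deployment."""


@dataclass
class Deployment:
    name: str
    send: Callable[[str], str]  # hypothetical transport: prompt -> completion


def call_with_spillover(primary: Deployment, overflow: Deployment, prompt: str):
    """Try the provisioned baseline first; if it throttles, use the overflow.

    Returns a (deployment_name, response) pair.
    """
    try:
        return primary.name, primary.send(prompt)
    except RateLimited:
        return overflow.name, overflow.send(prompt)


def make_capped_sender(capacity: int) -> Callable[[str], str]:
    """Simulate a provisioned deployment that throttles after `capacity` calls."""
    remaining = capacity

    def send(prompt: str) -> str:
        nonlocal remaining
        if remaining <= 0:
            raise RateLimited("provisioned capacity exhausted")
        remaining -= 1
        return f"ptu:{prompt}"

    return send


# Assumed setup: a baseline that can serve 2 requests, then bursts spill over.
ptu = Deployment("global-provisioned", make_capped_sender(capacity=2))
standard = Deployment("global-standard", lambda prompt: f"std:{prompt}")

served_by = [call_with_spillover(ptu, standard, f"req-{i}")[0] for i in range(4)]
print(served_by)
```

In this simulation the first two requests are served by the provisioned baseline and the remaining two by the standard deployment. In a real setup the same decision would typically live in the gateway rather than in application code.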
Typically, there are several SKUs (simplified here) that we can configure:
1. Global
2. Standard
3. Regional
4. Developer
More details on deployment types can be found here.
Here is a diagram that might provide a better understanding.
Depending on your workload requirements, the following table offers some guidance:
| If your workload is... | Recommended deployment |
|---|---|
| Prototyping or trying a new model | Instant Models (Preview) |
| Variable or bursty traffic | Standard / Global Standard |
| Consistently high traffic | Provisioned |
| Large batch jobs that are not time-sensitive | Global Batch / Data Zone Batch |
| Fine-tuned model testing and evaluation | Developer |
Limits
For quota, TPM and RPM limits are defined per region, per subscription, and per model. The highest level of quota restriction is scoped at the Azure subscription level, not the tenant level. Spreading deployments across regions via APIM load balancing effectively multiplies your available quota.
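As a rough illustration of how regional quota adds up, the sketch below sums per-region TPM and derives a weighted round-robin order proportional to each region's share. The TPM figures are made-up placeholders, not real quota values, and `weighted_schedule` is a hypothetical helper; in practice the gateway would apply the weights.

```python
from functools import reduce
from math import gcd

# Hypothetical per-region TPM quota for one model in one subscription.
REGION_TPM = {"eastus": 300_000, "westeurope": 300_000, "swedencentral": 150_000}


def total_tpm(quota: dict) -> int:
    """Aggregate TPM available when traffic is spread across all regions."""
    return sum(quota.values())


def weighted_schedule(quota: dict) -> list:
    """Build one round of a weighted round-robin, proportional to quota."""
    unit = reduce(gcd, quota.values())  # largest common slice of quota
    schedule = []
    for region, tpm in quota.items():
        schedule.extend([region] * (tpm // unit))
    return schedule


print(total_tpm(REGION_TPM))
print(weighted_schedule(REGION_TPM))
```

With these placeholder numbers, a region holding twice the quota receives twice as many slots per round, so no single region is pushed to its limit before the others.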
When we use Gateway API or APIM to route requests to Azure Foundry, traffic is normally directed to a project tied to a single subscription.
If your workload is too large for that subscription's quota, you may want to rethink how you structure your Azure Foundry projects accordingly.
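For a back-of-the-envelope check on whether one subscription is enough, the following sketch assumes the binding constraint is TPM quota per region, per subscription, for a single model. All numbers and the `subscriptions_needed` helper are hypothetical.

```python
import math


def subscriptions_needed(peak_tpm: int, tpm_per_region: int, regions_per_subscription: int) -> int:
    """Estimate how many subscriptions are required to cover a peak TPM target."""
    per_subscription = tpm_per_region * regions_per_subscription
    return math.ceil(peak_tpm / per_subscription)


# Assumed figures: 2M TPM peak, 300k TPM per region, 3 regions per subscription.
needed = subscriptions_needed(2_000_000, 300_000, 3)
print(needed)
```

Here each subscription contributes 900,000 TPM, so covering a 2,000,000 TPM peak takes three subscriptions. The gateway would then need to route across projects in all of them.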