Azure Foundry: scaling model deployments to handle peak requests

Azure Foundry has several deployment types you should know. For peak traffic, the pattern recommended by Microsoft is Global Provisioned (PTU) as your baseline, with spillover to Global Standard to absorb bursts.

Global Provisioned deployments use Azure's global infrastructure to dynamically route traffic to available datacenters. They provide reserved model processing capacity with guaranteed throughput, combining global routing with lower, more consistent latency than Standard deployments.

More details here.

However, global routing also means data can be processed outside your chosen region, which may violate data sovereignty requirements.
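As a rough illustration of the baseline-plus-spillover idea, the sketch below simulates it client-side in Python. The `Deployment` class, its per-minute token budgets, and the deployment names are all hypothetical; a real provisioned deployment would typically signal exhaustion with an HTTP 429 response rather than a boolean, and this is not the platform's own spillover mechanism.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    """Hypothetical stand-in for a model deployment with a per-minute token budget."""
    name: str
    capacity_per_minute: int
    used: int = 0

    def try_serve(self, tokens: int) -> bool:
        # A real endpoint would typically return HTTP 429 when capacity is exhausted.
        if self.used + tokens > self.capacity_per_minute:
            return False
        self.used += tokens
        return True


def route(tokens: int, baseline: Deployment, overflow: Deployment) -> str:
    """Send traffic to the baseline first, and spill over only when it is full."""
    if baseline.try_serve(tokens):
        return baseline.name
    if overflow.try_serve(tokens):
        return overflow.name
    return "rejected"


# Illustrative numbers only.
ptu = Deployment("global-provisioned", capacity_per_minute=1000)
std = Deployment("global-standard", capacity_per_minute=500)
for request_tokens in (800, 300, 300):
    print(request_tokens, "->", route(request_tokens, ptu, std))
```

The baseline absorbs steady traffic at a predictable cost, and the overflow deployment is only touched during bursts.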

Typically, to keep things simple, there are a few SKUs that we can configure:

1. global

2. standard

3. regional 

4. developer

More details on deployment types can be found here.

Here is a diagram that might provide a better understanding. 




Depending on your workload requirements, the following offers some guidance:

If your workload is... | Recommended deployment
--- | ---
Prototyping or trying a new model | Instant Models (Preview)
Variable or bursty traffic | Standard / Global Standard
Consistently high traffic | Provisioned
Large batch jobs that are not time-sensitive | Global Batch / Data Zone Batch
Fine-tuned model testing and evaluation | Developer

Limits 

For quota, TPM (tokens per minute) and RPM (requests per minute) limits are defined per region, per subscription, and per model. The highest level of quota restriction is scoped at the Azure subscription level, not the tenant level. Spreading deployments across regions via APIM load balancing effectively multiplies your available quota.
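To make the "multiplies your quota" point concrete, here is a small sketch that sums per-region TPM quota and derives proportional routing weights a load balancer could use. The quota figures are made up for illustration, and `aggregate_quota` and `routing_weights` are hypothetical helper names, not part of any Azure SDK.

```python
def aggregate_quota(regional_tpm: dict[str, int]) -> int:
    """Total TPM available when the same model is deployed in several regions."""
    return sum(regional_tpm.values())


def routing_weights(regional_tpm: dict[str, int]) -> dict[str, float]:
    """Share of traffic per region, proportional to that region's quota."""
    total = aggregate_quota(regional_tpm)
    return {region: tpm / total for region, tpm in regional_tpm.items()}


# Hypothetical per-region quota for one model in one subscription.
quota = {"eastus": 300_000, "westeurope": 300_000, "swedencentral": 150_000}

print(aggregate_quota(quota))
print(routing_weights(quota))
```

Weighting by quota keeps the smaller region from being throttled before the larger ones are fully used.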

When we use a Gateway API or APIM to route requests to Azure Foundry, it normally directs traffic to a project tied to a single subscription.

If your workload is too large for a single subscription's quota, then you may want to rethink how you structure your Azure Foundry projects accordingly.
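A back-of-the-envelope way to size this is to divide expected peak TPM, plus some headroom, by the quota one subscription can offer. The sketch below assumes a uniform quota per subscription and a 20% headroom default; both are illustrative assumptions, and `subscriptions_needed` is a hypothetical helper.

```python
def subscriptions_needed(peak_tpm: int, quota_per_subscription_tpm: int,
                         headroom_pct: int = 20) -> int:
    """Estimate how many subscriptions are needed to cover peak traffic.

    Integer arithmetic is used throughout to avoid floating-point rounding.
    """
    if quota_per_subscription_tpm <= 0:
        raise ValueError("quota_per_subscription_tpm must be positive")
    required = peak_tpm * (100 + headroom_pct)   # peak plus headroom, scaled by 100
    available = quota_per_subscription_tpm * 100
    return -(-required // available)             # ceiling division


# Hypothetical numbers: 2M TPM peak against 900K TPM per subscription.
print(subscriptions_needed(2_000_000, 900_000))
```

If the result is greater than one, the gateway needs backends in more than one subscription rather than a single project.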


 







