Problem Context

LLM calls are slow. A GPT-4o request takes 2–15 seconds depending on output length. When you put that in a synchronous API call, your users wait, your threads block, and your system's throughput collapses under load.

Most AI workloads don't need real-time responses. Document summarization, batch classification, content moderation, embedding generation, report creation — these can all run asynchronously. Event-driven architecture decouples the request from the processing, letting you scale AI workloads independently and handle failures gracefully.

🤔 Sound familiar?
  • Your API response times went from 200ms to 8 seconds after adding LLM calls to the request path
  • Users are staring at loading spinners while the backend waits for GPT-4o to finish
  • Your LLM rate limits kill throughput when 50 requests hit at once
  • A single failed LLM call causes the entire request to fail with no retry

This article shows you how to move AI workloads off the critical path — and why it fixes all four problems.

Concept Explanation

The core idea: instead of calling the LLM inline during a user request, publish an event ("document uploaded", "review requested") and let a background processor handle the AI work. The user gets an immediate acknowledgment; the AI result arrives when it's ready.


      flowchart LR
          subgraph request["Request Path (Fast)"]
              A["User Upload"] --> B["API - Validate & Store"]
              B --> C["Publish Event"]
              C --> D["Return 202 Accepted"]
          end
      
          subgraph processing["Processing Path (Async)"]
              E["Service Bus Queue"] --> F["AI Worker"]
              F --> G["Call LLM"]
              G --> H["Store Result"]
              H --> I["Notify User"]
          end
      
          C --> E
      
          style A fill:#059669,color:#fff,stroke:#047857
          style D fill:#059669,color:#fff,stroke:#047857
          style F fill:#4f46e5,color:#fff,stroke:#4338ca
          style G fill:#7c3aed,color:#fff,stroke:#6d28d9
      

When to Go Async

Use event-driven patterns when:

  • Latency tolerance > 5 seconds. If users can wait for a notification, don't block the request.
  • Batch processing. Embedding 10,000 documents, classifying a backlog, generating reports.
  • Fan-out workloads. One input triggers multiple AI tasks (summarize + classify + extract entities).
  • Retry requirements. LLM calls fail. Queues give you built-in retry with dead-letter handling.

Keep synchronous when:

  • Chat/conversational interfaces where the user expects to see tokens streaming back.
  • Sub-second classification that gates a UI decision (e.g., content safety check before showing content).

Azure Messaging Services for AI Pipelines

Azure Service Bus — Your primary choice for AI task queues. Supports sessions (ordered processing per entity), dead-letter queues (failed LLM calls), and scheduled delivery (rate-limit your LLM calls). Use queues for point-to-point processing, topics for fan-out.

Azure Event Grid — For event routing and triggering. When a blob is uploaded to storage, Event Grid triggers your processing pipeline. It's the router, not the queue — low latency, push-based, no message retention.

Azure Event Hubs — For high-throughput ingestion when you're processing thousands of AI requests per second (telemetry classification, log analysis). Overkill for most AI pipelines but essential at scale.

Implementation

Pattern 1: Document Processing Pipeline


      flowchart TD
          A["Blob Upload"] -->|"Event Grid"| B["Orchestrator Function"]
          B --> C["Service Bus: chunk-queue"]
          B --> D["Service Bus: metadata-queue"]
      
          C --> E["Chunking Worker"]
          E --> F["Service Bus: embed-queue"]
          F --> G["Embedding Worker"]
          G --> H["AI Search Index"]
      
          D --> I["Metadata Extractor"]
          I --> J["Cosmos DB"]
      
          H --> K["Notify: Ready for RAG"]
          J --> K
      
          style B fill:#4f46e5,color:#fff,stroke:#4338ca
          style G fill:#7c3aed,color:#fff,stroke:#6d28d9
          style I fill:#059669,color:#fff,stroke:#047857
      
// Azure Function: Service Bus triggered AI worker
      [Function("ProcessDocument")]
      public async Task Run(
          [ServiceBusTrigger("ai-tasks", Connection = "ServiceBusConnection")]
          ServiceBusReceivedMessage message,
          ServiceBusMessageActions messageActions)
      {
          try
          {
              // BinaryData helper — deserializes the message body directly
              var task = message.Body.ToObjectFromJson<AiTask>()
                  ?? throw new ArgumentException("Empty or invalid payload");

              var result = task.Type switch
              {
                  "summarize" => await _llmService.SummarizeAsync(task.Content),
                  "classify" => await _llmService.ClassifyAsync(task.Content),
                  "embed" => await _llmService.EmbedAsync(task.Content),
                  _ => throw new ArgumentException($"Unknown task: {task.Type}")
              };

              await _resultStore.SaveAsync(task.Id, result);
              await messageActions.CompleteMessageAsync(message);
          }
          catch (Azure.RequestFailedException ex) when (ex.Status == 429)
          {
              // Rate limited — abandon so the message is redelivered;
              // after MaxDeliveryCount attempts it lands in the dead-letter queue
              await messageActions.AbandonMessageAsync(message);
          }
          catch (ArgumentException ex)
          {
              // Bad payload or unknown task type: retrying won't help
              await messageActions.DeadLetterMessageAsync(message, deadLetterReason: ex.Message);
          }
      }
      
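The worker above assumes an AiTask message contract. A minimal version of that contract, with the producer-side serialization, might look like this (the record shape is a hypothetical stand-in, not a prescribed schema):

```csharp
using System.Text.Json;

// Hypothetical message contract matching what the worker reads:
// an ID for result storage, a task type, and the content to process.
public record AiTask(string Id, string Type, string Content);

public static class AiTaskPublisher
{
    // Produce the JSON body a producer would place in a ServiceBusMessage.
    public static string ToMessageBody(AiTask task) => JsonSerializer.Serialize(task);

    public static AiTask? FromMessageBody(string json) =>
        JsonSerializer.Deserialize<AiTask>(json);
}
```

Keeping the contract this small is deliberate: the producer only says what needs processing, and the worker owns everything LLM-specific.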

Pattern 2: Fan-Out / Fan-In with Durable Functions

When a single input needs multiple AI operations that converge into one result:

// Orchestrator: fan-out AI tasks, fan-in results
      [Function("AnalyzeDocument")]
      public async Task<AnalysisResult> RunOrchestrator(
          [OrchestrationTrigger] TaskOrchestrationContext context)
      {
          var document = context.GetInput<Document>();
      
          // Fan-out: run 3 AI tasks in parallel
          var summaryTask = context.CallActivityAsync<string>(
              "Summarize", document.Content);
          var entitiesTask = context.CallActivityAsync<List<Entity>>(
              "ExtractEntities", document.Content);
          var sentimentTask = context.CallActivityAsync<string>(
              "AnalyzeSentiment", document.Content);
      
          await Task.WhenAll(summaryTask, entitiesTask, sentimentTask);
      
          // Fan-in: combine results
          return new AnalysisResult
          {
              Summary = summaryTask.Result,
              Entities = entitiesTask.Result,
              Sentiment = sentimentTask.Result
          };
      }
      

Pattern 3: Rate-Controlled Batch Processing

When you need to process thousands of items but your LLM endpoint has TPM limits:

// Configure Service Bus processor with concurrency control
      var processor = client.CreateProcessor("embed-queue", new ServiceBusProcessorOptions
      {
          MaxConcurrentCalls = 5,  // Match your TPM budget
          AutoCompleteMessages = false,
          MaxAutoLockRenewalDuration = TimeSpan.FromMinutes(10)
      });
      

Set MaxConcurrentCalls based on your Azure OpenAI deployment's TPM (tokens-per-minute) limit: dividing TPM by your average tokens per request gives a requests-per-minute budget, and your average call latency tells you how many of those requests are in flight at once. This gives you natural backpressure without building custom rate limiting.
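That sizing rule can be written down as a small helper. A sketch, assuming you can estimate average tokens per request and average call latency (both inputs are yours to measure, not SDK values):

```csharp
using System;

public static class ConcurrencySizer
{
    // Rough sizing: requests/min = TPM / avgTokensPerRequest;
    // concurrency = requests/min * average latency in minutes.
    public static int MaxConcurrentCalls(
        int tpmLimit, int avgTokensPerRequest, double avgLatencySeconds)
    {
        double requestsPerMinute = (double)tpmLimit / avgTokensPerRequest;
        int concurrency = (int)Math.Floor(requestsPerMinute * (avgLatencySeconds / 60.0));
        return Math.Max(1, concurrency); // never stall the queue entirely
    }
}
```

For example, a 30,000 TPM deployment with ~1,500 tokens per request and ~15-second calls yields 20 requests/min × 0.25 min, i.e. about 5 concurrent calls — the value used in the processor options above.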

Pitfalls

โš ๏ธ Common Mistakes

1. No dead-letter handling

LLM calls fail for many reasons: rate limits, content policy violations, malformed inputs, model timeouts. If you don't configure dead-letter queues and monitor them, failed tasks silently disappear. Set MaxDeliveryCount to 3–5 and actively process the dead-letter queue.

2. Message lock expiration during long LLM calls

Service Bus default lock duration is 30 seconds. GPT-4o can take 30+ seconds for long outputs. If the lock expires, another consumer picks up the same message, and you get duplicate processing. Set lock duration to 5 minutes or use auto-lock-renewal.

3. Ordering assumptions

Service Bus queues don't guarantee ordering across consumers. If document chunks must be processed in order (e.g., for context-dependent summarization), use Service Bus sessions with a session ID per document.
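The session guarantee can be sketched without the SDK: messages sharing a session key (here, the document ID) are handled by one consumer in order, while different sessions interleave freely. With real Service Bus you'd set SessionId on each outgoing message; this in-memory stand-in just shows the grouping-and-ordering contract:

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical chunk message; DocumentId plays the role of the session ID.
public record ChunkMessage(string DocumentId, int ChunkIndex, string Text);

public static class SessionRouter
{
    // Group chunks by session key, preserving per-document chunk order.
    public static Dictionary<string, List<ChunkMessage>> BySession(
        IEnumerable<ChunkMessage> messages) =>
        messages
            .GroupBy(m => m.DocumentId)
            .ToDictionary(g => g.Key, g => g.OrderBy(m => m.ChunkIndex).ToList());
}
```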

4. Missing idempotency

Messages can be delivered more than once (at-least-once semantics). If your AI worker writes results without checking if they already exist, you'll get duplicate entries. Store a processing record keyed by message ID and check before processing.
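The check-before-processing pattern is small enough to sketch. Here a ConcurrentDictionary stands in for the durable processing record keyed by message ID (in production that would be a database with the same try-insert-first semantics):

```csharp
using System;
using System.Collections.Concurrent;

public static class IdempotentWorker
{
    // Stands in for a durable store of processed message IDs.
    private static readonly ConcurrentDictionary<string, bool> Processed = new();

    // Runs the AI work only on the first delivery of a message ID.
    // Returns false (and does nothing) for duplicate deliveries.
    public static bool TryProcess(string messageId, Action doAiWork)
    {
        if (!Processed.TryAdd(messageId, true))
            return false; // duplicate delivery, skip
        doAiWork();
        return true;
    }
}
```

The important detail is claiming the ID before doing the work, so a racing duplicate delivery loses the TryAdd and skips the LLM call entirely.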

5. Tight coupling through message schemas

If your event schema includes LLM-specific details (model name, temperature), you couple every producer to your AI implementation. Keep events domain-focused ("DocumentUploaded") and let the consumer decide which model to use.
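A domain-focused event carries only what happened, never how the consumer should call the LLM. A sketch of such a contract (field names are illustrative):

```csharp
using System;
using System.Text.Json;

// Domain event: no model name, no temperature, no prompt.
// Those decisions live inside the AI worker's own configuration.
public record DocumentUploadedEvent(
    string DocumentId, string BlobUri, DateTimeOffset UploadedAt);

public static class EventSerializer
{
    public static string ToJson(DocumentUploadedEvent evt) =>
        JsonSerializer.Serialize(evt);

    public static DocumentUploadedEvent FromJson(string json) =>
        JsonSerializer.Deserialize<DocumentUploadedEvent>(json)!;
}
```

With this shape, swapping GPT-4o for a cheaper model is a worker-side config change; no producer needs to be touched or redeployed.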

Practical Takeaways

✅ Key Lessons
  • Default to async for any AI workload that doesn't need streaming responses. The reliability and scalability benefits are immediate.
  • Use Service Bus for AI task queues — built-in retries, dead-letter, sessions, and scheduled delivery handle 90% of your failure modes.
  • Control LLM throughput with consumer concurrency, not custom rate limiters. MaxConcurrentCalls on the Service Bus processor is your throttle.
  • Durable Functions for fan-out/fan-in when one input needs multiple AI operations that converge. Don't build your own orchestration.
  • Monitor your dead-letter queue. It's the first place to check when "AI isn't working" — the answer is usually there.