Making AI Faster: The Story Behind Bhumi

Hey everyone, Rach here. Welcome back to the blog! Today, I want to share the journey behind Bhumi, a fast and efficient AI inference client I built.


Why I Built Bhumi

A while back, while working on finbro.ai, a project where we built AI-powered agents, I ran into a frustrating problem: latency. Every time we asked an AI to do something, it took forever to respond. And since we had multiple agents working together, the delays stacked up, making everything painfully slow.

I knew AI could be faster, but the existing solutions weren't cutting it. So I built Bhumi: a fast AI inference client focused on squeezing latency out of every request, letting AI models run as smoothly as possible.


Why Is AI Slow in the First Place?

Think of streaming a movie. You don't want to wait for the entire movie to download before you start watching, right? You just want it to start playing instantly while the rest loads in the background.

Most AI models don't work that way. Instead of "streaming" small chunks of information as they become available, they often wait to generate everything at once before showing you results. That's like downloading a whole movie before you can watch the first scene. Inefficient and frustrating.

Another problem? The tools that manage AI requests (like LiteLLM) weren't handling many concurrent requests efficiently, which added even more delay.


How Bhumi Makes AI Faster

Bhumi fixes these issues with three key optimizations:

1. Optimized Request Handling with MAP-Elites

Instead of fixed, one-size-fits-all HTTP request handling, Bhumi takes an adaptive approach. Using the MAP-Elites algorithm, it dynamically tunes buffer sizes and processing patterns along several dimensions (see the toy sketch after this list):

  • Provider-Specific Optimization: Different buffer sizes for different AI providers
  • Adaptive Processing: Buffer management that evolves based on performance data
  • Quality-Diversity Balance: Maintaining both speed and reliability
  • Continuous Improvement: Learning from each request to optimize future ones
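
To make that concrete, here's a toy MAP-Elites loop for buffer-size tuning. This is an illustrative sketch, not Bhumi's actual implementation: the niches, the mutation step, and the simulated measure_throughput benchmark are all stand-in assumptions.

import random

# Toy MAP-Elites loop (illustrative only). Each archive cell is a niche,
# here a (provider, payload-size bucket) pair, and keeps only its best
# elite: the buffer size with the highest measured throughput so far.

PROVIDERS = ["openai", "anthropic", "gemini"]
BUCKETS = ["small", "large"]

def measure_throughput(provider: str, bucket: str, buffer_size: int) -> float:
    """Stand-in for timing a real streaming request (chars/sec)."""
    ideal = 8_192 if bucket == "small" else 65_536
    return 1_400 - abs(buffer_size - ideal) / 50 + random.uniform(-20, 20)

# cell -> (buffer_size, throughput) for the best configuration seen so far
archive: dict[tuple[str, str], tuple[int, float]] = {}

for _ in range(200):
    cell = (random.choice(PROVIDERS), random.choice(BUCKETS))
    if cell in archive:
        # Mutate the cell's current elite to explore nearby buffer sizes.
        parent, _ = archive[cell]
        candidate = max(1_024, parent + random.randint(-8_192, 8_192))
    else:
        candidate = random.choice([4_096, 8_192, 16_384, 32_768, 65_536])
    score = measure_throughput(*cell, candidate)
    # Replace the elite only if the candidate is strictly better.
    if cell not in archive or score > archive[cell][1]:
        archive[cell] = (candidate, score)

for cell, (size, score) in sorted(archive.items()):
    print(cell, size, round(score))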

In our testing, throughput improved from roughly 600 to over 1,400 characters per second after just 15 optimization iterations.

2. Rust + Python Architecture

Bhumi's core is built in Rust for maximum performance, with a Python interface for ease of use. This hybrid approach (sketched below the list) delivers:

  • Native-speed processing with PyO3
  • Developer-friendly API
  • Minimal overhead
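
To show the shape of that split without requiring a compiled extension, here's a runnable sketch of the facade pattern. The class names are hypothetical, and the "core" below is a pure-Python stand-in for what, in Bhumi, would be a PyO3 class implemented in Rust.

class _CoreStreamReader:
    """Pure-Python stand-in for what would be a Rust/PyO3 class.

    In Bhumi the equivalent object lives in compiled code; here it just
    accumulates bytes and splits out complete lines so the sketch runs.
    """

    def __init__(self, buffer_size: int):
        self.buffer_size = buffer_size
        self._buf = bytearray()

    def feed(self, chunk: bytes) -> list[str]:
        self._buf.extend(chunk)
        *lines, rest = self._buf.split(b"\n")
        self._buf = bytearray(rest)
        return [line.decode() for line in lines]


class StreamReader:
    """Developer-facing facade: the only layer a user ever touches.

    Crossing the Python/compiled boundary once per chunk, rather than
    once per byte or per line, is what keeps the PyO3 overhead minimal.
    """

    def __init__(self, buffer_size: int = 8192):
        self._core = _CoreStreamReader(buffer_size)

    def feed(self, chunk: bytes) -> list[str]:
        return self._core.feed(chunk)


reader = StreamReader()
print(reader.feed(b"data: hel"))          # [] (no complete line yet)
print(reader.feed(b"lo\ndata: world\n"))  # ['data: hello', 'data: world']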

3. Optimized Validation with Satya

We replaced standard Pydantic validation with Satya, our custom validation library (see the sketch after this list), which:

  • Reduces memory overhead
  • Processes types faster
  • Maintains full type safety
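
For a feel of what that looks like in practice, here's a small sketch assuming Satya's Pydantic-style Model/Field interface; the exact field options may differ from the released API.

from satya import Model, Field  # assuming Satya's Pydantic-style interface

class WeatherReport(Model):
    location: str = Field(description="City and state, e.g. San Francisco, CA")
    temperature: float = Field(description="Reading in the requested unit")
    unit: str = Field(description="'c' or 'f'")

# Validate a decoded response payload before handing it back to the caller.
report = WeatherReport(location="San Francisco, CA", temperature=75.0, unit="f")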

Results and Impact

These optimizations deliver significant performance improvements:

Response Time Improvements

  • OpenAI: 2.5x faster than raw implementation, 1.9x faster than native
  • Gemini: 1.5x faster than raw, 1.6x faster than native
  • Anthropic: 1.8x faster than raw, 1.4x faster than native

Memory Efficiency

  • Only 1.1x the memory footprint of native implementations
  • Stable performance under load
  • Efficient resource utilization

Real-World Impact

These improvements translate to real-world benefits:

  • Faster response times for user interactions
  • More efficient resource utilization
  • Better scaling for multi-agent systems
  • Reduced operational costs

The combination of MAP-Elites optimization, Rust-based streaming, and efficient buffer management adds up to a solution that's not just incrementally better, but fundamentally more efficient. And we're just getting started; there's still plenty of room to optimize.


Supported AI Providers & Structured Outputs

Bhumi supports multiple AI providers and makes switching between them seamless: the provider prefix in the model string selects the backend (see the config sketch after the list). Currently supported providers include:

  • OpenAI (openai/{model_name})
  • Anthropic (anthropic/{model_name})
  • Gemini (gemini/{model_name})
  • Groq (groq/{model_name})
  • SambaNova (sambanova/{model_name})
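
Switching really is just a config change. The sketch below mirrors the tool-use example later in this post; the specific model names and environment-variable names are illustrative.

import os
from bhumi.base_client import BaseLLMClient, LLMConfig

# Same client class either way; only the "provider/model" string and the
# API key change. (Model names are examples; use what your account has.)
openai_client = BaseLLMClient(LLMConfig(
    api_key=os.getenv("OPENAI_API_KEY"),
    model="openai/gpt-4o-mini",
))

gemini_client = BaseLLMClient(LLMConfig(
    api_key=os.getenv("GEMINI_API_KEY"),
    model="gemini/gemini-1.5-flash",
))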

Bhumi also supports structured outputs and tool use, making it easy to integrate external functions into AI responses.


Using Bhumi for Tool Use & Structured Outputs

Bhumi lets AI models call external tools for richer interactivity. Here's an example that registers a weather tool and lets the model call it dynamically:

import asyncio
from bhumi.base_client import BaseLLMClient, LLMConfig
import os
import json
from dotenv import load_dotenv

load_dotenv()

# Example weather tool function
async def get_weather(location: str, unit: str = "f") -> str:
    result = f"The weather in {location} is 75°{unit}"
    print(f"\nTool executed: get_weather({location}, {unit}) -> {result}")
    return result

async def main():
    config = LLMConfig(
        api_key=os.getenv("OPENAI_API_KEY"),
        model="openai/gpt-4o-mini"
    )
    
    client = BaseLLMClient(config)
    
    # Register the weather tool
    client.register_tool(
        name="get_weather",
        func=get_weather,
        description="Get the current weather for a location",
        parameters={
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "The city and state e.g. San Francisco, CA"},
                "unit": {"type": "string", "enum": ["c", "f"], "description": "Temperature unit (c for Celsius, f for Fahrenheit)"}
            },
            "required": ["location", "unit"],
            "additionalProperties": False
        }
    )
    
    print("\nStarting weather query test...")
    messages = [{"role": "user", "content": "What's the weather like in San Francisco?"}]
    
    print(f"\nSending messages: {json.dumps(messages, indent=2)}")
    
    try:
        response = await client.completion(messages)
        print(f"\nFinal Response: {response['text']}")
    except Exception as e:
        print(f"\nError during completion: {e}")

if __name__ == "__main__":
    asyncio.run(main())

With Bhumi, AI models can generate structured responses and interact with external tools effortlessly.
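
Structured outputs work in the same spirit. Bhumi may offer a more direct hook for this, but as a minimal sketch using nothing beyond the completion call shown above, you can constrain the reply with a JSON Schema in the prompt and parse the result:

import asyncio
import json
import os
from bhumi.base_client import BaseLLMClient, LLMConfig

# JSON Schema the reply must satisfy (illustrative).
WEATHER_SCHEMA = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "temperature": {"type": "number"},
        "unit": {"type": "string", "enum": ["c", "f"]},
    },
    "required": ["city", "temperature", "unit"],
}

async def main():
    client = BaseLLMClient(LLMConfig(
        api_key=os.getenv("OPENAI_API_KEY"),
        model="openai/gpt-4o-mini",
    ))
    messages = [{
        "role": "user",
        "content": "Answer with only JSON matching this schema, no prose: "
                   f"{json.dumps(WEATHER_SCHEMA)}. "
                   "What's the weather in San Francisco?",
    }]
    response = await client.completion(messages)
    data = json.loads(response["text"])  # validate with Satya from here
    print(data)

if __name__ == "__main__":
    asyncio.run(main())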


Final Thoughts

Bhumi isn't just about speed—it's about flexibility and efficiency. Whether you need to switch AI providers on the fly or enable structured outputs and tool use, Bhumi makes it seamless.

Drop a comment below—I'd love to hear your thoughts! 🚀