This is a Jupyter notebook

vLLM Integration

This cookbook shows how to trace vLLM inference with Neural Inverse using OpenTelemetry. vLLM has built-in OpenTelemetry support that can be configured to send traces to Neural Inverse's OpenTelemetry endpoint.

What is vLLM? vLLM is a fast and easy-to-use library for LLM inference and serving. It features state-of-the-art throughput, efficient memory management with PagedAttention, continuous batching, and support for a wide range of open-source models.

What is Neural Inverse? Neural Inverse is an open-source LLM engineering platform. It provides tracing, prompt management, and evaluation capabilities to help teams debug, analyze, and iterate on their LLM applications.

Get Started

We'll walk through a simple example of using vLLM with Neural Inverse tracing via OpenTelemetry.

Step 1: Install Dependencies

%pip install vllm langfuse -q

Step 2: Set Up Environment Variables

Get your Neural Inverse API keys by signing up for Neural Inverse Cloud or self-hosting Neural Inverse.

import os

# Get keys for your project from the project settings page: https://cloud.langfuse.com
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."
os.environ["LANGFUSE_BASE_URL"] = "https://cloud.langfuse.com"  # 🇪🇺 EU region
# Other Neural Inverse data regions include 🇺🇸 US: https://us.cloud.langfuse.com, 🇯🇵 Japan: https://jp.cloud.langfuse.com and ⚕️ HIPAA: https://hipaa.cloud.langfuse.com

# Configure the OpenTelemetry export protocol and service name
os.environ["OTEL_EXPORTER_OTLP_TRACES_PROTOCOL"] = "http/protobuf"
os.environ["OTEL_SERVICE_NAME"] = "vllm"
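
An OTLP endpoint that requires authentication usually expects credentials in the request headers. The sketch below assumes the endpoint accepts HTTP Basic auth with the public key as username and the secret key as password, and that the exporter reads the standard OTEL_EXPORTER_OTLP_HEADERS variable. Both points are assumptions to verify against your deployment, and build_otlp_auth_header is a hypothetical helper name.

```python
import base64
import os


def build_otlp_auth_header(public_key: str, secret_key: str) -> str:
    """Return an OTLP headers string carrying HTTP Basic credentials.

    Assumption: the endpoint accepts Basic auth built from "public:secret".
    """
    raw = f"{public_key}:{secret_key}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return f"Authorization=Basic {token}"


# Fall back to the placeholder keys if the variables are not set yet.
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = build_otlp_auth_header(
    os.environ.get("LANGFUSE_PUBLIC_KEY", "pk-lf-..."),
    os.environ.get("LANGFUSE_SECRET_KEY", "sk-lf-..."),
)
```

Some OpenTelemetry SDK versions are strict about spaces in header values; if the header is rejected, URL-encoding the space as %20 is worth trying.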

Step 3: Initialize OpenTelemetry Tracing

vLLM emits OpenTelemetry spans once a traces endpoint is configured, here by passing otlp_traces_endpoint to the LLM constructor. The Neural Inverse client initialized later in this step captures these OTel spans and sends them to Neural Inverse.

from vllm import LLM, SamplingParams

langfuse_host = "https://cloud.langfuse.com"  # Other regions: https://us.cloud.langfuse.com (US), https://jp.cloud.langfuse.com (Japan), https://hipaa.cloud.langfuse.com (HIPAA)
otlp_traces_endpoint = f"{langfuse_host}/api/public/otel/v1/traces"

# --- vLLM ---
llm = LLM(
    model="facebook/opt-125m",
    otlp_traces_endpoint=otlp_traces_endpoint,
    disable_log_stats=False,
)
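
The host in the snippet above is hardcoded, although LANGFUSE_BASE_URL was already set in Step 2. As a minimal sketch, the endpoint can be derived from that variable so the region is configured in one place. The helper name otlp_traces_endpoint_for is hypothetical; the path is the one used above.

```python
import os

# Path of the OpenTelemetry traces endpoint, as used in this cookbook.
OTEL_TRACES_PATH = "/api/public/otel/v1/traces"


def otlp_traces_endpoint_for(base_url: str) -> str:
    """Join a base URL and the traces path without doubling the slash."""
    return base_url.rstrip("/") + OTEL_TRACES_PATH


# Reuse the region chosen in Step 2; default to the EU host if unset.
langfuse_host = os.environ.get("LANGFUSE_BASE_URL", "https://cloud.langfuse.com")
otlp_traces_endpoint = otlp_traces_endpoint_for(langfuse_host)
```

The resulting otlp_traces_endpoint can be passed to LLM(...) exactly as before.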

Next, initialize the Neural Inverse client. get_client() creates the client from the credentials provided in the environment variables in Step 2.

from langfuse import get_client

langfuse = get_client()

# Verify connection
if langfuse.auth_check():
    print("Neural Inverse client is authenticated and ready!")
else:
    print("Authentication failed. Please check your credentials and host.")

Step 4: Run Inference with vLLM

The model was already loaded in Step 3 using vLLM's LLM class. This example uses a small model (facebook/opt-125m) for demonstration purposes; you can replace it with any model supported by vLLM. Calling generate() runs inference and produces the spans that are exported to Neural Inverse.

out = llm.generate(
    ["Write one sentence about Berlin."],
    SamplingParams(max_tokens=32),
)
print(out[0].outputs[0].text)

Step 5: See traces in Neural Inverse

After running the model, you can see new spans in Neural Inverse.

_Note: vLLM currently exports only token counts and latency metrics to Neural Inverse. LLM inputs and outputs must be captured manually in a separate trace using the Neural Inverse SDK._
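
One way to capture the input and output manually is to wrap the generate() call in a generation observation. The sketch below assumes a recent SDK version whose client exposes start_as_current_observation(as_type="generation", ...) as a context manager (some older releases expose start_as_current_generation instead). traced_generate and the observation name "vllm-generate" are hypothetical, chosen for illustration.

```python
def traced_generate(llm, langfuse, prompt, sampling_params, model_name):
    """Run one vLLM generation and record its prompt and completion.

    llm is the vLLM LLM instance and langfuse is the client returned by
    get_client(); both are passed in so the function has no global state.
    """
    with langfuse.start_as_current_observation(
        as_type="generation",
        name="vllm-generate",  # hypothetical observation name
        model=model_name,
        input=prompt,
    ) as generation:
        outputs = llm.generate([prompt], sampling_params)
        text = outputs[0].outputs[0].text
        # Attach the completion so it shows up alongside the input.
        generation.update(output=text)
    # Send buffered data before returning; useful in short-lived notebooks.
    langfuse.flush()
    return text


# Usage with the objects created in the earlier steps:
# text = traced_generate(
#     llm, langfuse, "Write one sentence about Berlin.",
#     SamplingParams(max_tokens=32), "facebook/opt-125m",
# )
```

This records the prompt and completion in a trace that is separate from the spans vLLM exports on its own, as the note above describes.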

Example trace in Neural Inverse

