This is a Jupyter notebook

Automate Code Optimization with Weco and Neural Inverse

This notebook provides a step-by-step guide on integrating Weco with Neural Inverse to automatically optimize LLM application code using Neural Inverse datasets, code evaluators, and managed LLM-as-a-Judge evaluators.

What is Weco? Weco (GitHub) is a code optimization platform. Given a source file and an evaluation function, Weco's optimizer iteratively edits the code, re-evaluates, and keeps the version that scores best. It works with any measurable metric — accuracy, latency, cost, or a custom composite score.

What is Neural Inverse? Neural Inverse is an open-source LLM engineering platform. It offers tracing and monitoring capabilities for AI applications. Neural Inverse helps developers debug, analyze, and optimize their AI systems by providing detailed insights and integrating with a wide array of tools and frameworks through native integrations, OpenTelemetry, and dedicated SDKs.

How It Works

Weco connects to Neural Inverse as an evaluation backend. On each optimization step, Weco:

Edits your source code (e.g. prompts, parsing logic)
Runs your target function against every item in a Neural Inverse dataset
Collects scores from local code evaluators and/or managed evaluators (LLM-as-a-Judge) configured in the Neural Inverse UI
Combines scores into a single metric and keeps the best-performing version

Each iteration creates a new experiment run in Neural Inverse so you can compare all variants side-by-side in the Neural Inverse dashboard.
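
The loop described above can be pictured with a small, self-contained sketch. This is not Weco's implementation: the names evaluate_variant and optimize are hypothetical, a "variant" is reduced to a prompt string, and averaging scores across dataset items before applying the metric is an assumption made for illustration.

```python
from typing import Callable, Dict, List, Tuple


def evaluate_variant(
    variant: str,
    dataset: List[dict],
    evaluate_item: Callable[[str, dict], Dict[str, float]],
    metric: Callable[[Dict[str, float]], float],
) -> float:
    """Average each evaluator's score over the dataset, then combine them."""
    if not dataset:
        raise ValueError("dataset must contain at least one item")
    totals: Dict[str, float] = {}
    for item in dataset:
        for name, value in evaluate_item(variant, item).items():
            totals[name] = totals.get(name, 0.0) + value
    averages = {name: total / len(dataset) for name, total in totals.items()}
    return metric(averages)


def optimize(candidates, dataset, evaluate_item, metric) -> Tuple[str, float]:
    """Try every candidate variant and keep the best-scoring one."""
    best_variant, best_score = None, float("-inf")
    for variant in candidates:
        score = evaluate_variant(variant, dataset, evaluate_item, metric)
        if score > best_score:
            best_variant, best_score = variant, score
    return best_variant, best_score


def toy_evaluate(variant: str, item: dict) -> Dict[str, float]:
    # Toy evaluator (assumption): shorter prompts score higher.
    return {"brevity": 1.0 / (1 + len(variant.split()))}


toy_dataset = [{"question": "q1"}, {"question": "q2"}]
best, score = optimize(
    ["Be brief.", "Answer in great detail please."],
    toy_dataset,
    toy_evaluate,
    lambda scores: scores["brevity"],
)
print(best, round(score, 3))
```

In the real workflow, the candidates are code edits proposed by Weco and the per-item scores come from your evaluators, but the keep-the-best structure is the same idea.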

Getting Started

Let's walk through a practical example of using Weco with Neural Inverse to optimize a simple QA function.

Step 1: Install Dependencies

For this example, we'll install the Weco client in a virtual environment. For global installation instructions and usage with agent skills, refer to Weco's docs.

!pip install "weco[langfuse]" langfuse openai -q

Authenticate with Weco:

!weco login

Step 2: Configure Neural Inverse SDK

Set up your Neural Inverse API keys. You can get these keys by signing up for a free Neural Inverse Cloud account or by self-hosting Neural Inverse. These environment variables are essential for the Neural Inverse client to authenticate and send data to your Neural Inverse project.

import os

# Get keys for your project from the project settings page: https://cloud.langfuse.com
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-***"
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-***"
os.environ["LANGFUSE_BASE_URL"] = "https://us.cloud.langfuse.com"  # 🇺🇸 US region
# Other Neural Inverse data regions: 🇪🇺 EU https://cloud.langfuse.com, 🇯🇵 Japan https://jp.cloud.langfuse.com, ⚕️ HIPAA https://hipaa.cloud.langfuse.com

# Your OpenAI key
os.environ["OPENAI_API_KEY"] = "sk-proj-***"

With the environment variables set, we can now initialize the Neural Inverse client. get_client() reads the credentials from those environment variables.

from langfuse import get_client

langfuse = get_client()

# Verify connection
if langfuse.auth_check():
    print("Neural Inverse client is authenticated and ready!")
else:
    print("Authentication failed. Please check your credentials and host.")

Step 3: Create a Dataset in Neural Inverse

Weco evaluates your code against a Neural Inverse dataset. Each dataset item has an input dict and an optional expected_output dict. Let's create a small QA dataset.

dataset = langfuse.create_dataset(name="weco-demo-qa")

qa_pairs = [
    {
        "input": {"question": "What is the capital of France?"},
        "expected_output": {"expected_answer": "Paris"},
    },
    {
        "input": {"question": "What is the largest planet in our solar system?"},
        "expected_output": {"expected_answer": "Jupiter"},
    },
    {
        "input": {"question": "Who wrote Romeo and Juliet?"},
        "expected_output": {"expected_answer": "William Shakespeare"},
    },
    {
        "input": {"question": "What is the boiling point of water in Celsius?"},
        "expected_output": {"expected_answer": "100 degrees Celsius"},
    },
    {
        "input": {"question": "What year did the Berlin Wall fall?"},
        "expected_output": {"expected_answer": "1989"},
    },
]

for pair in qa_pairs:
    langfuse.create_dataset_item(
        dataset_name="weco-demo-qa",
        input=pair["input"],
        expected_output=pair["expected_output"],
    )

langfuse.flush()
print(f"Created dataset 'weco-demo-qa' with {len(qa_pairs)} items.")
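
Before uploading a larger dataset, it can help to check that every item has the shape described above: an input dict and an optional expected_output dict. The helper below is a hypothetical sketch (validate_items is not part of any SDK) and runs without a network connection.

```python
def validate_items(items):
    """Return a list of problems; an empty list means every item looks fine."""
    problems = []
    for index, item in enumerate(items):
        if not isinstance(item.get("input"), dict):
            problems.append(f"item {index}: 'input' must be a dict")
        expected = item.get("expected_output")
        if expected is not None and not isinstance(expected, dict):
            problems.append(
                f"item {index}: 'expected_output' must be a dict or omitted"
            )
    return problems


sample_items = [
    {
        "input": {"question": "What is the capital of France?"},
        "expected_output": {"expected_answer": "Paris"},
    },
    # Deliberately malformed: input is a plain string instead of a dict.
    {"input": "What year did the Berlin Wall fall?"},
]

for problem in validate_items(sample_items):
    print(problem)
```

Running validate_items(qa_pairs) on the list defined above should return an empty list.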

Step 4: Write the Target Function

The target function is the code Weco will optimize. It receives an inputs dict from the dataset and returns a dict of outputs. Neural Inverse calls this function once per dataset item during each evaluation run.

Save this as agent.py in your working directory:

%%writefile agent.py
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = "You are a helpful assistant. Answer the question concisely in one sentence."


def answer_question(inputs: dict) -> dict:
    question = inputs.get("question", "")
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
        temperature=0.0,
    )
    return {"answer": response.choices[0].message.content}

Weco will iteratively edit agent.py — modifying SYSTEM_PROMPT, response parsing, or other logic — to improve the evaluation metric.

Step 5: Write an Evaluator and Metric Function

Code evaluators are Python functions that score each target function output. They receive keyword arguments (input, output, and optionally expected_output) and return a Neural Inverse Evaluation object.

The metric function combines all evaluator scores into a single number that Weco optimizes.

Save this as evaluators.py:

%%writefile evaluators.py
from langfuse import Evaluation


def answer_quality(*, input, output, expected_output=None, **kwargs):
    """Check that the answer is non-empty and reasonably concise."""
    answer = (output or {}).get("answer", "")
    if not answer:
        return Evaluation(name="answer_quality", value=0.0, comment="Empty answer")
    word_count = len(answer.split())
    score = 1.0 if word_count <= 50 else max(0.0, 1.0 - (word_count - 50) / 100)
    return Evaluation(
        name="answer_quality",
        value=score,
        comment=f"{word_count} words",
    )


def qa_metric(scores: dict) -> float:
    """Combine evaluator scores into a single optimization target.

    Multiplies answer_quality by Correctness (a managed evaluator
    configured in the Neural Inverse UI) so that only correct, concise
    answers score well.
    """
    return scores.get("answer_quality", 0.0) * scores.get("Correctness", 0.0)
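
To sanity-check the scoring logic before starting a run, you can exercise the evaluator and metric locally. The sketch below repeats the logic of evaluators.py but swaps in a small stand-in Evaluation dataclass (an assumption, so that it runs without the SDK installed); the sample answers are made up.

```python
from dataclasses import dataclass


@dataclass
class Evaluation:
    """Stand-in for langfuse.Evaluation, for offline testing only."""
    name: str
    value: float
    comment: str = ""


def answer_quality(*, input, output, expected_output=None, **kwargs):
    # Same logic as the evaluator in evaluators.py.
    answer = (output or {}).get("answer", "")
    if not answer:
        return Evaluation(name="answer_quality", value=0.0, comment="Empty answer")
    word_count = len(answer.split())
    score = 1.0 if word_count <= 50 else max(0.0, 1.0 - (word_count - 50) / 100)
    return Evaluation(name="answer_quality", value=score, comment=f"{word_count} words")


def qa_metric(scores: dict) -> float:
    return scores.get("answer_quality", 0.0) * scores.get("Correctness", 0.0)


short = answer_quality(input={"question": "Capital of France?"}, output={"answer": "Paris."})
wordy = answer_quality(input={}, output={"answer": "word " * 100})
empty = answer_quality(input={}, output=None)

print(short.value, wordy.value, empty.value)
# A missing "Correctness" score makes the product 0, even for a good answer.
print(qa_metric({"answer_quality": short.value, "Correctness": 1.0}))
print(qa_metric({"answer_quality": short.value}))
```

Because the metric is a product, a zero or missing score from either evaluator drives the whole metric to 0.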

Step 6: Configure a Managed Evaluator in Neural Inverse

Managed evaluators are LLM-as-a-Judge evaluators that run server-side on experiment traces. Set one up in the Neural Inverse UI:

Go to your project in Neural Inverse
Navigate to Evaluation → LLM-as-a-Judge
Click + Set up evaluator

Create a Correctness evaluator:

Name: Correctness
Score: 0 or 1 (binary factual accuracy)
Variable mappings:
- {{input}} → $.input.question
- {{output}} → $.output.answer
- {{expected_output}} → $.expected_output.expected_answer
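
To see what these mappings extract, here is a rough local sketch. The actual resolution happens server-side in Neural Inverse; resolve_path is a hypothetical helper that handles only simple dotted paths such as $.input.question, and the record is a made-up example combining one dataset item with the target function's output.

```python
def resolve_path(record: dict, path: str):
    """Resolve a simple '$.a.b' path; return None if any key is missing."""
    if not path.startswith("$."):
        raise ValueError(f"unsupported path: {path}")
    current = record
    for key in path[2:].split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# Hypothetical record: one dataset item plus the output from agent.py.
record = {
    "input": {"question": "What is the capital of France?"},
    "output": {"answer": "Paris"},
    "expected_output": {"expected_answer": "Paris"},
}

mappings = {
    "input": "$.input.question",
    "output": "$.output.answer",
    "expected_output": "$.expected_output.expected_answer",
}

resolved = {name: resolve_path(record, path) for name, path in mappings.items()}
print(resolved)
```

If a mapping resolves to None here, the corresponding key name in your dataset or target output probably differs from the path configured in the UI.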

Important: Evaluator names are case-sensitive. The name in the Neural Inverse UI (e.g. Correctness) must exactly match the name passed to --langfuse-managed-evaluators and the key used in your metric function (scores.get("Correctness")).

Managed evaluators run asynchronously after each experiment. Weco automatically polls for their scores (up to 15 minutes by default). Adjust the timeout with --langfuse-managed-evaluator-timeout.
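
The polling behavior can be pictured as a loop with a deadline. The following is a generic sketch, not Weco's code: poll_for_score and its parameters are hypothetical, and the fetch callable stands in for whatever request retrieves a managed evaluator's score.

```python
import time
from typing import Callable, Optional


def poll_for_score(
    fetch: Callable[[], Optional[float]],
    timeout_s: float,
    interval_s: float = 1.0,
) -> Optional[float]:
    """Call fetch() until it returns a score or timeout_s has elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        score = fetch()
        if score is not None:
            return score
        # Stop if another wait would push us past the deadline.
        if time.monotonic() + interval_s > deadline:
            return None
        time.sleep(interval_s)


# Simulated evaluator: the score becomes available on the third attempt.
attempts = iter([None, None, 1.0])
result = poll_for_score(lambda: next(attempts), timeout_s=5, interval_s=0.01)
print(result)

# Simulated evaluator that never finishes: the poll gives up and returns None.
timed_out = poll_for_score(lambda: None, timeout_s=0.05, interval_s=0.01)
print(timed_out)
```

A score that arrives after the deadline is never seen by the poller, which is why a slow judge model may call for a longer timeout.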

Step 7: Run the Optimization

With the dataset, target function, evaluators, and metric function in place, run Weco from the command line. Weco uses Neural Inverse as the evaluation backend, running your target function against the dataset on each iteration and tracking progress as experiment runs in Neural Inverse.

!weco run --source agent.py \
  --eval-backend langfuse \
  --langfuse-dataset weco-demo-qa \
  --langfuse-target agent:answer_question \
  --langfuse-evaluators evaluators:answer_quality \
  --langfuse-managed-evaluators Correctness \
  --langfuse-metric-function evaluators:qa_metric \
  --metric qa_metric --goal maximize --steps 5 \
  --output plain  # For Jupyter-friendly formatting
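
If you prefer to launch the run from a Python script rather than a notebook cell, the same command can be assembled as an argument list. This is a minimal sketch: build_weco_command is a hypothetical helper, and it uses only the flags shown in the command above.

```python
import shlex


def build_weco_command(dataset: str, steps: int) -> list:
    """Assemble the weco invocation from this step as an argument list."""
    return [
        "weco", "run", "--source", "agent.py",
        "--eval-backend", "langfuse",
        "--langfuse-dataset", dataset,
        "--langfuse-target", "agent:answer_question",
        "--langfuse-evaluators", "evaluators:answer_quality",
        "--langfuse-managed-evaluators", "Correctness",
        "--langfuse-metric-function", "evaluators:qa_metric",
        "--metric", "qa_metric", "--goal", "maximize",
        "--steps", str(steps),
        "--output", "plain",
    ]


command = build_weco_command("weco-demo-qa", steps=5)
print(shlex.join(command))
# To launch it: import subprocess; subprocess.run(command, check=True)
```

Passing a list to subprocess.run avoids shell quoting issues if a dataset name contains spaces.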

Step 8: View Results

Each optimization step creates a new experiment run in Neural Inverse. Navigate to Datasets → weco-demo-qa → Runs to compare inputs, outputs, and evaluator scores across all variants side-by-side.

You can also track iteration-by-iteration progress in the Weco dashboard, which shows metric scores and the exact code changes Weco made at each step. When the run completes, you'll be prompted to apply the best-performing version to your source file.

Full Example: End-to-End QA Optimization

A complete working example with a richer dataset, multiple evaluators, and holdout validation is available in the Weco CLI repository:

git clone https://github.com/WecoAI/weco-cli.git
cd weco-cli/examples/langfuse-zeph-hr-qa

This example optimizes an HR QA agent against a dataset of policy questions, using both local code evaluators and managed LLM-as-a-Judge evaluators to measure correctness and helpfulness. See the full tutorial in the Weco docs for a detailed walkthrough including dataset setup, evaluator configuration, and holdout validation.

Troubleshooting

No experiment runs appear in Neural Inverse

Verify that LANGFUSE_SECRET_KEY, LANGFUSE_PUBLIC_KEY, and LANGFUSE_BASE_URL are exported in the same shell session where you run weco.
Check that the dataset name passed to --langfuse-dataset matches an existing dataset in your Neural Inverse project.
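
A quick way to confirm the first point is to list any required variables that are unset in the current process. missing_env_vars is a hypothetical helper; it only checks for presence and does not validate the keys themselves.

```python
import os

REQUIRED_VARS = ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY", "LANGFUSE_BASE_URL")


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("All required variables are set.")
```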

Managed evaluator scores never arrive

Confirm the evaluator name passed to --langfuse-managed-evaluators exactly matches the name in the Neural Inverse UI (case-sensitive).
Verify variable mappings in the evaluator setup by using the live preview on a historical trace.
Increase the polling timeout with --langfuse-managed-evaluator-timeout if evaluators are slow.

Metric stays at 0

Print intermediate evaluator outputs to check score scales and key names.
Confirm that the keys used in your metric function (scores.get("Correctness")) match the evaluator names exactly.
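
Since qa_metric multiplies its inputs, a single missing or zero score is enough to hold the metric at 0. The sketch below shows one way to inspect a scores dict for the usual causes; explain_metric is a hypothetical debugging helper, and the sample dict is made up to show a case mismatch.

```python
def explain_metric(scores: dict, expected_keys=("answer_quality", "Correctness")) -> list:
    """List reasons why a product of the expected scores would be 0."""
    findings = []
    lowered = {key.lower(): key for key in scores}
    for key in expected_keys:
        if key in scores:
            if scores[key] == 0:
                findings.append(f"'{key}' is present but scored 0")
        elif key.lower() in lowered:
            actual = lowered[key.lower()]
            findings.append(f"'{key}' missing; found '{actual}' (case mismatch)")
        else:
            findings.append(f"'{key}' missing from scores")
    return findings


findings = explain_metric({"answer_quality": 1.0, "correctness": 1.0})
for finding in findings:
    print(finding)
```

You could call a helper like this temporarily from inside your metric function while debugging, then remove it once the key names line up.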

Auth errors

Re-check API keys and confirm the base URL matches your region (EU vs US).
Verify project-level key permissions in Neural Inverse project settings.

Learn More

Weco documentation — full CLI reference and advanced configuration.
Weco CLI GitHub — source code and example projects.
Neural Inverse Datasets documentation — creating and managing datasets.
Neural Inverse Evaluation documentation — configuring evaluators and scores.
