Automate Code Optimization with Weco and Neural Inverse
This notebook provides a step-by-step guide on integrating Weco with Neural Inverse to automatically optimize LLM application code using Neural Inverse datasets, code evaluators, and managed LLM-as-a-Judge evaluators.
What is Weco? Weco (GitHub) is a code optimization platform. Given a source file and an evaluation function, Weco's optimizer iteratively edits the code, re-evaluates, and keeps the version that scores best. It works with any measurable metric: accuracy, latency, cost, or a custom composite score.
What is Neural Inverse? Neural Inverse is an open-source LLM engineering platform. It offers tracing and monitoring capabilities for AI applications. Neural Inverse helps developers debug, analyze, and optimize their AI systems by providing detailed insights and integrating with a wide array of tools and frameworks through native integrations, OpenTelemetry, and dedicated SDKs.
How It Works
Weco connects to Neural Inverse as an evaluation backend. On each optimization step, Weco:
- Edits your source code (e.g. prompts, parsing logic)
- Runs your target function against every item in a Neural Inverse dataset
- Collects scores from local code evaluators and/or managed evaluators (LLM-as-a-Judge) configured in the Neural Inverse UI
- Combines scores into a single metric and keeps the best-performing version
Each iteration creates a new experiment run in Neural Inverse so you can compare all variants side-by-side in the Neural Inverse dashboard.
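The loop described above can be sketched in plain Python. This is only an illustration of the edit, evaluate, keep-the-best idea, not Weco's implementation: the names evaluate and optimize, the fixed list of candidate functions (Weco generates its edits itself), the toy dataset, and the choice to average the metric per item are all assumptions made for this example.

```python
def evaluate(target, dataset, evaluators, metric):
    """Run the target on every item and average the combined metric."""
    per_item = []
    for item in dataset:
        output = target(item["input"])
        scores = {name: fn(item, output) for name, fn in evaluators.items()}
        per_item.append(metric(scores))
    return sum(per_item) / len(per_item)


def optimize(candidates, dataset, evaluators, metric):
    """Score each candidate version and keep the best one (goal: maximize)."""
    best, best_score = None, float("-inf")
    for candidate in candidates:
        score = evaluate(candidate, dataset, evaluators, metric)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


# Toy stand-ins for a dataset, two code variants, and one evaluator.
dataset = [
    {"input": {"question": "What is the capital of France?"}, "expected": "Paris"},
    {"input": {"question": "What year did the Berlin Wall fall?"}, "expected": "1989"},
]


def vague(inputs):
    return {"answer": "I am not sure."}


def lookup(inputs):
    known = {
        "What is the capital of France?": "Paris",
        "What year did the Berlin Wall fall?": "1989",
    }
    return {"answer": known.get(inputs["question"], "")}


def exact_match(item, output):
    return 1.0 if output["answer"] == item["expected"] else 0.0


best, best_score = optimize(
    [vague, lookup],
    dataset,
    evaluators={"exact_match": exact_match},
    metric=lambda scores: scores["exact_match"],
)
print(best.__name__, best_score)
```

In the real workflow, each call to evaluate corresponds to one experiment run in Neural Inverse.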
Getting Started
Let's walk through a practical example of using Weco with Neural Inverse to optimize a simple QA function.
Step 1: Install Dependencies
For this example, we'll install the weco client in a virtual environment. For global installation instructions and usage with agent skills, please refer to Weco's docs.
!pip install "weco[langfuse]" langfuse openai -q
Authenticate with Weco:
!weco login
Step 2: Configure Neural Inverse SDK
Set up your Neural Inverse API keys. You can get these keys by signing up for a free Neural Inverse Cloud account or by self-hosting Neural Inverse. These environment variables are essential for the Neural Inverse client to authenticate and send data to your Neural Inverse project.
import os
# Get keys for your project from the project settings page: https://cloud.langfuse.com
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-***"
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-***"
os.environ["LANGFUSE_BASE_URL"] = "https://us.cloud.langfuse.com"  # US region
# Other Neural Inverse data regions: EU https://cloud.langfuse.com, Japan https://jp.cloud.langfuse.com, HIPAA https://hipaa.cloud.langfuse.com
# Your OpenAI key
os.environ["OPENAI_API_KEY"] = "sk-proj-***"
With the environment variables set, we can now initialize the Neural Inverse client. get_client() reads the credentials from those environment variables.
from langfuse import get_client
langfuse = get_client()
# Verify connection
if langfuse.auth_check():
    print("Neural Inverse client is authenticated and ready!")
else:
    print("Authentication failed. Please check your credentials and host.")
Step 3: Create a Dataset in Neural Inverse
Weco evaluates your code against a Neural Inverse dataset. Each dataset item has an input dict and an optional expected_output dict. Let's create a small QA dataset.
dataset = langfuse.create_dataset(name="weco-demo-qa")
qa_pairs = [
    {
        "input": {"question": "What is the capital of France?"},
        "expected_output": {"expected_answer": "Paris"},
    },
    {
        "input": {"question": "What is the largest planet in our solar system?"},
        "expected_output": {"expected_answer": "Jupiter"},
    },
    {
        "input": {"question": "Who wrote Romeo and Juliet?"},
        "expected_output": {"expected_answer": "William Shakespeare"},
    },
    {
        "input": {"question": "What is the boiling point of water in Celsius?"},
        "expected_output": {"expected_answer": "100 degrees Celsius"},
    },
    {
        "input": {"question": "What year did the Berlin Wall fall?"},
        "expected_output": {"expected_answer": "1989"},
    },
]
for pair in qa_pairs:
    langfuse.create_dataset_item(
        dataset_name="weco-demo-qa",
        input=pair["input"],
        expected_output=pair["expected_output"],
    )
langfuse.flush()
print(f"Created dataset 'weco-demo-qa' with {len(qa_pairs)} items.")
Step 4: Write the Target Function
The target function is the code Weco will optimize. It receives an inputs dict from the dataset and returns a dict of outputs. Neural Inverse calls this function once per dataset item during each evaluation run.
Save this as agent.py in your working directory:
%%writefile agent.py
from openai import OpenAI
client = OpenAI()
SYSTEM_PROMPT = "You are a helpful assistant. Answer the question concisely in one sentence."
def answer_question(inputs: dict) -> dict:
    question = inputs.get("question", "")
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
        temperature=0.0,
    )
    return {"answer": response.choices[0].message.content}
Weco will iteratively edit agent.py (modifying SYSTEM_PROMPT, response parsing, or other logic) to improve the evaluation metric.
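Because Weco rewrites agent.py, it can help to check that the target still honors its contract: a dict goes in, and a dict with an answer key comes out. The sketch below is one possible check. The helper check_target_contract and the stub function are hypothetical names introduced for this example; the stub stands in for answer_question so the snippet runs without an OpenAI call.

```python
def check_target_contract(target, sample_inputs, required_keys=("answer",)):
    """Call the target once and verify the shape of what it returns."""
    output = target(sample_inputs)
    if not isinstance(output, dict):
        raise TypeError(f"target returned {type(output).__name__}, expected dict")
    missing = [key for key in required_keys if key not in output]
    if missing:
        raise KeyError(f"target output is missing keys: {missing}")
    return output


def stub_answer_question(inputs: dict) -> dict:
    """Offline stand-in with the same signature as answer_question."""
    question = inputs.get("question", "")
    return {"answer": f"Stub answer to: {question}"}


result = check_target_contract(
    stub_answer_question,
    {"question": "What is the capital of France?"},
)
print(result)
```

To check the real function instead, import answer_question from agent and pass it in place of the stub; that call does reach the OpenAI API.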
Step 5: Write an Evaluator and Metric Function
Code evaluators are Python functions that score each target function output. They receive keyword arguments and return a Neural Inverse Evaluation object.
The metric function combines all evaluator scores into a single number that Weco optimizes.
Save this as evaluators.py:
%%writefile evaluators.py
from langfuse import Evaluation
def answer_quality(*, input, output, expected_output=None, **kwargs):
    """Check that the answer is non-empty and reasonably concise."""
    answer = (output or {}).get("answer", "")
    if not answer:
        return Evaluation(name="answer_quality", value=0.0, comment="Empty answer")
    word_count = len(answer.split())
    score = 1.0 if word_count <= 50 else max(0.0, 1.0 - (word_count - 50) / 100)
    return Evaluation(
        name="answer_quality",
        value=score,
        comment=f"{word_count} words",
    )
def qa_metric(scores: dict) -> float:
    """Combine evaluator scores into a single optimization target.
    Multiplies answer_quality by Correctness (a managed evaluator
    configured in the Neural Inverse UI) so that only correct, concise
    answers score well.
    """
    return scores.get("answer_quality", 0.0) * scores.get("Correctness", 0.0)
Step 6: Configure a Managed Evaluator in Neural Inverse
Managed evaluators are LLM-as-a-Judge evaluators that run server-side on experiment traces. Set one up in the Neural Inverse UI:
- Go to your project in Neural Inverse
- Navigate to Evaluation → LLM-as-a-Judge
- Click + Set up evaluator
Create a Correctness evaluator:
- Name: Correctness
- Score: 0 or 1 (binary factual accuracy)
- Variable mappings:
  - {{input}} → $.input.question
  - {{output}} → $.output.answer
  - {{expected_output}} → $.expected_output.expected_answer
Important: Evaluator names are case-sensitive. The name in the Neural Inverse UI (e.g. Correctness) must exactly match the name passed to --langfuse-managed-evaluators and the key used in your metric function (scores.get("Correctness")).
Managed evaluators run asynchronously after each experiment. Weco automatically polls for their scores (up to 15 minutes by default). Adjust the timeout with --langfuse-managed-evaluator-timeout.
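Polling with a timeout follows a common pattern: ask for the result, return it if it is there, otherwise wait and retry until a deadline passes. The sketch below shows that general pattern only. It is not Weco's code, and poll_until and fake_fetch_scores are hypothetical names introduced for this example.

```python
import time


def poll_until(fetch, timeout_s, interval_s=1.0):
    """Call fetch() until it returns a non-None value or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        result = fetch()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no result after {timeout_s} seconds")
        time.sleep(interval_s)


# Simulated asynchronous evaluator: scores only appear on the third poll.
attempts = {"count": 0}


def fake_fetch_scores():
    attempts["count"] += 1
    if attempts["count"] >= 3:
        return {"Correctness": 1.0}
    return None


scores = poll_until(fake_fetch_scores, timeout_s=5.0, interval_s=0.0)
print(scores, "after", attempts["count"], "polls")
```

A longer timeout trades a slower failure for more tolerance of slow judge models, which is the trade-off the timeout flag exposes.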
Step 7: Run the Optimization
With the dataset, target function, evaluators, and metric function in place, run Weco from the command line. Weco uses Neural Inverse as the evaluation backend, running your target function against the dataset on each iteration and tracking progress as experiment runs in Neural Inverse.
!weco run --source agent.py \
--eval-backend langfuse \
--langfuse-dataset weco-demo-qa \
--langfuse-target agent:answer_question \
--langfuse-evaluators evaluators:answer_quality \
--langfuse-managed-evaluators Correctness \
--langfuse-metric-function evaluators:qa_metric \
--metric qa_metric --goal maximize --steps 5 \
--output plain  # For Jupyter-friendly formatting
Step 8: View Results
Each optimization step creates a new experiment run in Neural Inverse. Navigate to Datasets → weco-demo-qa → Runs to compare inputs, outputs, and evaluator scores across all variants side-by-side.
You can also track iteration-by-iteration progress in the Weco dashboard, which shows metric scores and the exact code changes Weco made at each step. When the run completes, you'll be prompted to apply the best-performing version to your source file.
Full Example: End-to-End QA Optimization
A complete working example with a richer dataset, multiple evaluators, and holdout validation is available in the Weco CLI repository:
git clone https://github.com/WecoAI/weco-cli.git
cd weco-cli/examples/langfuse-zeph-hr-qa
This example optimizes an HR QA agent against a dataset of policy questions, using both local code evaluators and managed LLM-as-a-Judge evaluators to measure correctness and helpfulness. See the full tutorial in the Weco docs for a detailed walkthrough including dataset setup, evaluator configuration, and holdout validation.
Troubleshooting
No experiment runs appear in Neural Inverse
- Verify that LANGFUSE_SECRET_KEY, LANGFUSE_PUBLIC_KEY, and LANGFUSE_BASE_URL are exported in the same shell session where you run weco.
- Check that the dataset name passed to --langfuse-dataset matches an existing dataset in your Neural Inverse project.
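A quick way to run the first check from Python is to list which of the three variables are unset or empty in the current process. The helper name missing_env_vars is hypothetical, introduced for this sketch; the variable names come from Step 2.

```python
import os

REQUIRED_VARS = ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY", "LANGFUSE_BASE_URL")


def missing_env_vars(environ=None, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All Neural Inverse environment variables are set.")
```

Run this in the same notebook kernel or shell session that launches weco, since variables set elsewhere are not visible to it.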
Managed evaluator scores never arrive
- Confirm the evaluator name passed to --langfuse-managed-evaluators exactly matches the name in the Neural Inverse UI (case-sensitive).
- Verify variable mappings in the evaluator setup by using the live preview on a historical trace.
- Increase the polling timeout with --langfuse-managed-evaluator-timeout if evaluators are slow.
Metric stays at 0
- Print intermediate evaluator outputs to check score scales and key names.
- Confirm that the keys used in your metric function (scores.get("Correctness")) match the evaluator names exactly.
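Both checks can be tried offline. The sketch below copies qa_metric from Step 5 and restates the length formula from answer_quality as a standalone function, length_score, which is a name introduced only for this example. It prints the score scale at a few word counts and shows how a key with the wrong case silently zeroes the metric.

```python
def length_score(word_count: int) -> float:
    """Same formula as answer_quality: full marks up to 50 words, then a linear penalty."""
    return 1.0 if word_count <= 50 else max(0.0, 1.0 - (word_count - 50) / 100)


def qa_metric(scores: dict) -> float:
    """Copied from evaluators.py: product of the two evaluator scores."""
    return scores.get("answer_quality", 0.0) * scores.get("Correctness", 0.0)


# Score scale: the penalty reaches zero at 150 words.
for words in (10, 50, 100, 150):
    print(words, "words ->", length_score(words))

# Matching key names give the expected product.
good = qa_metric({"answer_quality": 1.0, "Correctness": 1.0})

# A lowercase key is not found, so scores.get falls back to 0.0.
wrong_case = qa_metric({"answer_quality": 1.0, "correctness": 1.0})

print("matching keys:", good)
print("wrong case:", wrong_case)
```

Because the metric is a product, a single missing or misnamed score is enough to hold it at 0 regardless of the other evaluator.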
Auth errors
- Re-check API keys and confirm the base URL matches your region (EU vs US).
- Verify project-level key permissions in Neural Inverse project settings.
Learn More
- Weco documentation: full CLI reference and advanced configuration.
- Weco CLI GitHub: source code and example projects.
- Neural Inverse Datasets documentation: creating and managing datasets.
- Neural Inverse Evaluation documentation: configuring evaluators and scores.