While experimenting with evaluator testing in LangSmith, I ran into a subtle but impactful issue that highlights how small implementation details can break otherwise standard behavior.

The Setup

I created a “Correctness” evaluator for my model. The model requires a specific configuration, including an Authorization header passed through the Extra Headers field.

Pretty standard so far.

The Problem

When I added the header manually as:

Authorization: Bearer <token>

everything worked perfectly during testing.

However, after saving the evaluator, LangSmith automatically converted the header to lowercase:

authorization: Bearer <token>

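At first glance, this shouldn’t matter. The HTTP specification (RFC 9110) defines header field names as case-insensitive, so Authorization and authorization should be treated exactly the same. HTTP/2 goes further and requires lowercase field names on the wire, so lowercasing a header is legitimate on its own.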

But in practice, something unexpected happened.

The Behavior

  • Manual test with Authorization → works
  • Saved evaluator with authorization → fails
  • Same issue reproduced in Playground

This inconsistency suggests that somewhere in the request chain, header names are being compared case-sensitively.

Why This Matters

This kind of issue is tricky because:

  • It violates expectations based on HTTP standards
  • It’s easy to overlook during debugging
  • It creates inconsistent behavior between testing and saved configurations

For developers, this can lead to wasted time chasing what looks like a configuration or authentication issue, when the root cause is much more subtle.

Possible Causes

While I haven’t pinpointed the exact source, a few possibilities include:

  • A downstream service incorrectly treating headers as case-sensitive
  • Middleware normalizing headers inconsistently
  • A bug in how LangSmith serializes or sends saved configurations
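
To make the first possibility concrete, here is a minimal sketch of how a downstream service could end up case-sensitive without anyone intending it. This is an illustration only, not LangSmith's code or the code of any real service: the function names and the plain-dict representation of headers are assumptions of mine.

```python
from typing import Dict, Optional


def get_token_case_sensitive(headers: Dict[str, str]) -> Optional[str]:
    # Hypothetical buggy handler: an exact-match lookup on a plain dict
    # only finds the header when the casing matches "Authorization".
    return headers.get("Authorization")


def get_token_case_insensitive(headers: Dict[str, str]) -> Optional[str]:
    # Hypothetical fix: lowercase every header name before the lookup,
    # so any casing of "authorization" is found.
    normalized = {name.lower(): value for name, value in headers.items()}
    return normalized.get("authorization")


manual_headers = {"Authorization": "Bearer <token>"}  # as typed during testing
saved_headers = {"authorization": "Bearer <token>"}   # as stored after saving

for label, headers in (("manual", manual_headers), ("saved", saved_headers)):
    print(label, get_token_case_sensitive(headers), get_token_case_insensitive(headers))
```

With the exact-match lookup, the saved (lowercased) header is simply not found, which would look like a missing or invalid credential from the caller's side. That matches the symptoms above, though it does not prove this is the actual cause.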

Takeaways

  1. Don’t assume standards are always followed in practice. Even well-defined behaviors like case-insensitive headers can break in real systems.
  2. Compare raw requests when debugging. Small differences (like header casing) can have big effects.
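
As a rough sketch of the second takeaway: if you can capture the headers of a working and a failing request (for example from a proxy or a request-echo endpoint), a small helper can flag names that differ only by case. The function name and the dict inputs are hypothetical, and the sample headers are made up for illustration.

```python
from typing import Dict, List, Tuple


def diff_header_casing(
    working: Dict[str, str], failing: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Return (working_name, failing_name) pairs that match only when case is ignored."""
    failing_by_lower = {name.lower(): name for name in failing}
    pairs = []
    for name in working:
        other = failing_by_lower.get(name.lower())
        # Same header once lowercased, but spelled differently in each request.
        if other is not None and other != name:
            pairs.append((name, other))
    return pairs


working_request = {"Authorization": "Bearer <token>", "Content-Type": "application/json"}
failing_request = {"authorization": "Bearer <token>", "Content-Type": "application/json"}

for working_name, failing_name in diff_header_casing(working_request, failing_request):
    print(f"casing differs: {working_name!r} vs {failing_name!r}")
```

A difference like this is easy to miss when eyeballing two requests, so it helps to check for it explicitly before digging into tokens or permissions.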

If you’re working with evaluators, APIs, or authentication headers, it’s worth double-checking how your tools handle request formatting, especially when something “should just work,” but doesn’t.
