You can now create datasets, run experiments, and attach evaluations to experiments using the Phoenix TS/JS client.
Shoutout to @anthonypowell.me and @mikeldking.bsky.social for the work here!
This guide is a great primer on common approaches we see towards automated prompt optimization. If you've already read 100 "prompting tips and tricks" blogs but aren't yet a full DSPy contributor, then let this be your bridge!
Add GenAI tracing to your @arize-phoenix.bsky.social applications in just a few lines. Works great with Span Replay, so you can debug, tweak, and explore agent behavior in the Prompt Playground.
Check Notebook + docs below!👇
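For flavor, here's roughly what that setup looks like; the project name is a placeholder, and auto_instrument assumes you've installed an OpenInference instrumentor for your framework:

```python
from phoenix.otel import register

# register() points an OpenTelemetry tracer provider at Phoenix;
# auto_instrument=True picks up any installed OpenInference instrumentors.
tracer_provider = register(
    project_name="my-llm-app",  # placeholder project name
    auto_instrument=True,
)
```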
I’ve been really liking some of the eval tools from Pydantic's evals package.
Wanted to see if I could combine them with Phoenix's tracing and run Pydantic evals on traces captured in Phoenix.
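Here's a rough sketch of the idea, not the exact notebook code: pull spans out of Phoenix as a dataframe, wrap them as pydantic-evals Cases, and score them with a toy evaluator. The project name and the flattened column names are assumptions on my part.

```python
import phoenix as px
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import Evaluator, EvaluatorContext

# Pull the captured spans out of Phoenix as a dataframe.
spans = px.Client().get_spans_dataframe(project_name="my-llm-app")

class NonEmptyOutput(Evaluator[str, str]):
    """Toy evaluator: did the traced call produce any output at all?"""
    def evaluate(self, ctx: EvaluatorContext[str, str]) -> float:
        return 1.0 if ctx.output and ctx.output.strip() else 0.0

# Each traced call becomes a pydantic-evals Case; the "task" replays
# the recorded output for its input instead of re-calling the LLM.
recorded = dict(zip(spans["attributes.input.value"],
                    spans["attributes.output.value"]))
dataset = Dataset(
    cases=[Case(inputs=i) for i in recorded],
    evaluators=[NonEmptyOutput()],
)
report = dataset.evaluate_sync(lambda inputs: recorded.get(inputs, ""))
report.print()
```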
It's amazing to see how fast the discourse has moved from "just agents" to now multi-agent flows, optimized evals, and automated improvement strategies.
✔️ Trace agent decisions at every step
✔️ Offline and online evals using LLM-as-a-judge
If you're building agents, measuring them is essential.
Full vid and cookbook below
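As a taste of the offline flavor, here's a hedged sketch using Phoenix's built-in llm_classify with its hallucination template; in practice you'd export the input/reference/output columns from your traces rather than building them by hand:

```python
import pandas as pd
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# The hallucination template expects input / reference / output columns.
df = pd.DataFrame({
    "input": ["Who created Phoenix?"],
    "reference": ["Phoenix is an open-source project from Arize AI."],
    "output": ["Phoenix was created by Arize AI."],
})

rails = list(HALLUCINATION_PROMPT_RAILS_MAP.values())
evals = llm_classify(
    dataframe=df,
    model=OpenAIModel(model="gpt-4o"),
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=rails,                 # constrain the judge to allowed labels
    provide_explanation=True,    # keep the judge's reasoning for review
)
print(evals[["label", "explanation"]])
```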
Tag a function with `@tracer.llm` to automatically capture it as an @opentelemetry.io span.
- Automatically parses input and output messages
- Comes in decorator or context manager flavors
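A small sketch of both flavors, assuming a Phoenix server is already running; the function body and model are illustrative, and the automatic message parsing depends on what your function returns:

```python
from openai import OpenAI
from phoenix.otel import register

tracer = register(project_name="demo").get_tracer(__name__)
client = OpenAI()

# Decorator flavor: the call is captured as an LLM span, with input
# and output messages parsed automatically.
@tracer.llm
def invoke_llm(messages):
    return client.chat.completions.create(model="gpt-4o-mini", messages=messages)

invoke_llm([{"role": "user", "content": "Hello!"}])

# Context-manager flavor, for code you'd rather not wrap in a function;
# here you set the span's input and output explicitly.
with tracer.start_as_current_span("my-llm-call", openinference_span_kind="llm") as span:
    span.set_input("Hello!")
    span.set_output("Hi there!")
```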
My go-to way to test out these new models: grab a failed trace from a previous run, pull it into the playground, switch the model, and see if GPT-4.1 can succeed where GPT-4o failed.
Early signs are promising!
Flowise is fast, visual, and low-code — but what happens under the hood?
With the new Arize Phoenix integration, you can debug, inspect, and visualize your LLM applications and agent workflows with one configuration step, no code required.
Here's how you can easily set up a LangGraph agent, deploy it with Google ADK, and trace it with @arize-phoenix.bsky.social.
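The tracing piece is the smallest part; roughly this, assuming the openinference-instrumentation-langchain package (which also covers LangGraph):

```python
from phoenix.otel import register
from openinference.instrumentation.langchain import LangChainInstrumentor

tracer_provider = register(project_name="langgraph-agent")
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)
# From here, every LangGraph node and LLM call shows up as spans in Phoenix.
```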
Together you can:
✅ Evaluate performance with Ragas metrics
✅ Visualize and understand LLM behavior through traces & experiments in Arize or Phoenix
Dive into our docs & notebooks ⬇️
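If you want a feel for the eval half before opening the notebook, here's a hedged sketch using Ragas' classic question/answer/contexts schema with made-up data:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

# Toy rows standing in for data exported from your traces.
data = Dataset.from_dict({
    "question": ["What does Phoenix do?"],
    "answer":   ["Phoenix traces and evaluates LLM applications."],
    "contexts": [["Arize Phoenix is an open-source LLM observability tool."]],
})
result = evaluate(data, metrics=[faithfulness, answer_relevancy])
print(result)  # per-metric scores for the dataset
```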
In my new tutorial, learn techniques for optimizing your judge prompt to improve accuracy, fairness, and robustness while keeping cost in check
better prompts ➡️ better evals
📌Tag prompts in code and see those tags reflected in the UI
📌Tag prompt versions as development, staging, or production — or define your own
📌Add tag descriptions for more clarity
Manage your prompt lifecycles with confidence🚀
We’ve integrated @CleanlabAI’s Trustworthy Language Model (TLM) with Phoenix to help teams improve LLM reliability and performance.
🔗 Dive into the full implementation in our docs & notebook:
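The heart of the integration is the TLM call itself; a rough sketch based on the cleanlab-tlm package (exact setup and options are in the linked docs):

```python
from cleanlab_tlm import TLM

tlm = TLM()  # assumes a Cleanlab API key is configured in your environment
out = tlm.prompt("What year was Arize Phoenix open-sourced?")
print(out["response"], out["trustworthiness_score"])
# Low trustworthiness scores flag responses worth routing to review,
# and can be attached to Phoenix spans as evaluation annotations.
```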
Phoenix has changed a TON since its first iteration.
I'm constantly in awe of the execution speed and quality of this team. Here's to the next 5k and beyond!
I’ve been using Chain of Thought (CoT) prompting to help LLMs replicate logical step-by-step thinking.
For the next segment in my prompting series, I use @arize-phoenix.bsky.social to test the performance of various CoT methods.
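The simplest variant in the lineup is zero-shot CoT, where the only change is an appended "think step by step" instruction; a quick illustration (model choice is mine):

```python
from openai import OpenAI

client = OpenAI()
question = ("A bat and a ball cost $1.10 in total. The bat costs $1.00 "
            "more than the ball. How much does the ball cost?")

# Same question with and without the zero-shot CoT suffix.
for suffix in ["", "\nLet's think step by step."]:
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question + suffix}],
    )
    print(reply.choices[0].message.content)
```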
In my latest tutorial, I explore how few-shot prompting boosts accuracy without massive datasets or retraining—using @arize-phoenix.bsky.social prompts and experiments to break it down.
This kicks off my prompting series... more to come!
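For anyone new to the technique, few-shot prompting is just worked examples placed in the message history; a minimal sketch with a made-up sentiment task:

```python
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Classify the sentiment as positive or negative."},
        # Two in-context examples stand in for a labeled training set.
        {"role": "user", "content": "The battery lasts forever!"},
        {"role": "assistant", "content": "positive"},
        {"role": "user", "content": "Broke after two days."},
        {"role": "assistant", "content": "negative"},
        {"role": "user", "content": "Shipping was fast and the screen is gorgeous."},
    ],
)
print(response.choices[0].message.content)  # expected: "positive"
```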
We're hosting an in-person office hours tomorrow all around LLM and Agent Evals.
Join for the free snacks/drinks, stay for the heated discussions about the validity of Pokemon-based model evaluations ⚡️🐀
There's also a tutorial linked here where you can use Phoenix to compare the performance of different techniques. 👇
arize.com/blog/prompt-...
Aman combined our recent Agent Evaluation course with the latest prompt optimization techniques to automate the improvement process.
With just a few lines of code, you'll be ready to trace and run agents. Check it out to level up your agent monitoring.
Our docs have all the details for a quick setup! 👇
#OpenAI #Agents #Tracing
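Those few lines look roughly like this, assuming the openinference-instrumentation-openai-agents package alongside the openai-agents SDK:

```python
from phoenix.otel import register
from openinference.instrumentation.openai_agents import OpenAIAgentsInstrumentor

tracer_provider = register(project_name="openai-agents-demo")
OpenAIAgentsInstrumentor().instrument(tracer_provider=tracer_provider)

# After instrumenting, agent runs are traced automatically.
from agents import Agent, Runner  # openai-agents SDK

agent = Agent(name="assistant", instructions="Be concise.")
result = Runner.run_sync(agent, "What is OpenTelemetry?")
print(result.final_output)
```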
Forget manual prompt engineering: there are better (read: "more automatic") ways to improve your prompts.
This video and notebook break down these techniques.
Featuring:
- DSPy
- @arize-phoenix.bsky.social
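As a taste of the DSPy route, here's a compact, hedged sketch: define a program, hand it a metric and a couple of examples, and let an optimizer compile the prompt (the tiny trainset is just for illustration):

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# A declarative program: DSPy owns the prompt, we own the signature.
program = dspy.ChainOfThought("question -> answer")

def exact_match(example, prediction, trace=None):
    return example.answer.lower() == prediction.answer.lower()

trainset = [
    dspy.Example(question="2 + 2?", answer="4").with_inputs("question"),
    dspy.Example(question="Capital of France?", answer="Paris").with_inputs("question"),
]

# The optimizer bootstraps demonstrations into the prompt automatically.
optimized = BootstrapFewShot(metric=exact_match).compile(program, trainset=trainset)
print(optimized(question="3 + 4?").answer)
```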
Too often, teams are stuck using disconnected tools—one for evaluation, another for monitoring, and yet another for debugging.
So, we built a unified approach.
arize.com/blog/why-ai-...
Our newest blog post on @hf.co has you covered!
This post shows you how to use @arize-phoenix.bsky.social to trace and evaluate your smolagents.
Credit to @srichavali.bsky.social and @aymeric-roucher.bsky.social
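The setup boils down to a couple of lines; a sketch assuming the openinference-instrumentation-smolagents package:

```python
from phoenix.otel import register
from openinference.instrumentation.smolagents import SmolagentsInstrumentor

tracer_provider = register(project_name="smolagents-demo")
SmolagentsInstrumentor().instrument(tracer_provider=tracer_provider)
# Any CodeAgent / ToolCallingAgent runs after this point show up as traces in Phoenix.
```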