Watch this session as Anand Kannappan, Co-Founder and CEO of Patronus AI, addresses the unpredictability of Large Language Models (LLMs) and the most reliable automated methods for testing them at scale. He discusses why intrinsic evaluation metrics like perplexity often don't align with human judgments and why open-source LLM benchmarks may no longer be the best way to measure AI progress.
Intrinsic evaluation metrics like perplexity (see the sketch after this list) tend to be only weakly correlated with human judgments, so they shouldn't be relied on to evaluate LLMs post-training.
Creating test cases to measure LLM performance is as much an art as it is a science. Test sets should be diverse in distribution and cover as much of the use case scope as possible.
Open-source LLM benchmarks are no longer a trustworthy way to measure progress in AI, since most LLM developers have already trained on them.
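Perplexity is the exponential of a model's average negative log-likelihood on a text sample: a lower value means the model is less "surprised" by the text, but it says nothing directly about whether a human would judge an output helpful or correct. Below is a minimal sketch of how perplexity is typically computed, assuming the Hugging Face transformers library and GPT-2 as a stand-in model; neither is specific to this session or to Patronus AI's tooling.

```python
# Minimal sketch: compute perplexity of a causal LM on a short text sample.
# Assumes the Hugging Face `transformers` library; GPT-2 is a placeholder model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model, not specific to the session
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # With labels set to the input ids, the model returns the mean
    # cross-entropy loss over the predicted tokens.
    outputs = model(**inputs, labels=inputs["input_ids"])

# Perplexity is the exponential of the average negative log-likelihood.
perplexity = torch.exp(outputs.loss).item()
print(f"Perplexity: {perplexity:.2f}")
```

Two models can score similar perplexity on held-out text yet differ widely in human-rated quality, which is why the session argues for behavioral, use-case-specific test suites over intrinsic metrics alone.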
Testμ
Testμ (TestMu) is more than just a virtual conference. It is an immersive experience designed by the community, for the community! A 3-day gathering that unites testers, developers, community leaders, industry experts, and ecosystem partners in the testing and QA field, all converging under a single virtual roof.