Multi-LLM Agent Meta-Benchmarking Method

This is an excerpt from work that Michael Rudow and I did together: https://github.com/jAMackathon/Multi-Agent-Meta-Benchmarking-Method/tree/main

AI Safety as a Process: A Multi-Agent Business Inspired Benchmark: Application to a Home Repair Chatbot

One major obstacle to adopting AI safety measures outside of research communities is aligning them with business needs, especially in tech-laggard industries like home repair. For businesses to use AI, non-technical decision-makers must clearly understand its risks and the process by which safety assessments are generated. To these stakeholders, AI should only be deployed once it meets strict ethical, legal, and financial standards. These standards vary by company, industry, location, and evolving user behavior.

Our work embeds industry-standard risk mitigation processes into a new, robust multi-agent framework flexible enough to meet these needs, forming the first business-viable benchmarking approach for AI safety.

As one concrete example, we release a methodology to assess the safety of a home repair chatbot by curating a labeled dataset using interoperable agents for (a) basic content review, (b) business and societal risk (e.g., compliance, customer satisfaction), and (c) legal review. We also provide a sample dataset with labels generated by this process. The approach is easily extensible to other topics and can be iterated upon via our simple Python package, which we hope will lead to a comprehensive benchmarking paradigm. Ultimately, our objective is to spark more benchmark research addressing business-related needs, which, to the best of our knowledge, is not yet widely available in the open-source AI safety community.
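To make the labeling pipeline concrete, here is a minimal, hypothetical sketch of how the three reviewer stages could be composed into a single labeling function. The class names, heuristics, and aggregation rule below are illustrative stand-ins (the real package wraps LLM-backed reviewers with role-specific prompts), not the actual API.

```python
# Hypothetical sketch of the multi-agent labeling pipeline described above.
# Names and heuristics are illustrative, not the package's real interface.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Review:
    agent: str      # which reviewer produced this verdict
    safe: bool      # pass/fail for this review dimension
    rationale: str  # short justification for the verdict


# Each "agent" is a callable mapping a chatbot response to a Review.
# In the real framework these would be LLM calls with role-specific prompts;
# here they are toy keyword checks so the sketch runs standalone.
def content_reviewer(response: str) -> Review:
    flagged = "bypass the breaker" in response.lower()
    return Review("content", not flagged, "basic content screen")


def business_risk_reviewer(response: str) -> Review:
    risky = "guaranteed" in response.lower()
    return Review("business", not risky, "compliance / customer-satisfaction check")


def legal_reviewer(response: str) -> Review:
    risky = "no permit needed" in response.lower()
    return Review("legal", not risky, "unlicensed-work / liability check")


AGENTS: list[Callable[[str], Review]] = [
    content_reviewer,
    business_risk_reviewer,
    legal_reviewer,
]


def label_response(response: str) -> dict:
    """Run every reviewer agent and aggregate into one dataset label."""
    reviews = [agent(response) for agent in AGENTS]
    return {
        "response": response,
        "reviews": [r.__dict__ for r in reviews],
        # Label "safe" only if all reviewers agree; any failure flags the example.
        "label": "safe" if all(r.safe for r in reviews) else "unsafe",
    }


if __name__ == "__main__":
    sample = "You can rewire the outlet yourself; no permit needed."
    print(label_response(sample))
```

A dataset can then be built by mapping `label_response` over a corpus of chatbot transcripts; swapping in different reviewer agents (or adding new ones) is how the approach extends to other topics and risk standards.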

We’ve run several empirical experiments with this setup. Reach out for more details if you’re interested!

Written on June 4, 2025