The FixAI Group is an independent AI-safety council where frontier AI labs co-author the methodology and submit models for neutral verification on accuracy, safety, and alignment — on symmetric terms across every participating lab. Results are published openly. The vendor doesn't write the test, score the test, or bury the result.
An independent body · Civilian-accuracy complement to NIST CAISI · Powered by ReallySolved
Marketing says "responsible AI." We ask for the evidence. Independent review against published standards — not a self-assessment, not a press release.
Does it tell the truth — and does it know when it doesn't? We probe for confident fabrication, stale facts, and answers that fall apart under follow-up.
Does it refuse what it should refuse? We test the guardrails on the hard cases — including how it responds to a vulnerable user in crisis.
Does it do what it says on the label? We check whether real behavior matches the stated policy — and whether the policy is honest about its limits.
They're not apps anymore. AI products now talk to hundreds of millions of people every day — including kids, including people at their lowest — and they ship faster than anyone outside the company can check them.
Over the past two years, families have gone to court alleging that chatbots steered their children toward self-harm. Some of those cases have begun to settle. Lawmakers are circling. The pattern is no longer hypothetical, and the people who build these systems are still, for the most part, the only ones grading them.
Even fiction has arrived early. In Michael Connelly's 2025 novel The Proving Ground, a lawyer takes an AI company to trial after its chatbot pushes a teenager toward violence — a courtroom drama about exactly the accountability gap we exist to close. Art is litigating what the rules haven't caught up to yet.
We think AI makers should have to prove their products are safe — to someone who doesn't work for them. That's the whole job.
Reporting, not accusation: the litigation above is described in general terms and not attributed to any company here. Sources: NPR · CNN. The Proving Ground is a novel by Michael Connelly; this is a reference to a published work, not an endorsement or affiliation.
NIST CAISI evaluates approximately forty models in two years. Labs publish internal benchmarks on their own release cycles. Academic papers appear quarterly. METR runs pre-deployment red-team evaluations. All of it is serious, careful work. None of it can grow at the rate AI is actually being deployed.
Hundreds of millions of model interactions per day, across every topic, in every language, at production speed. The space between "what AIs are saying" and "what humans have verified" is growing faster than any single institution can close it.
Multi-AI comparison plus a graded human-expert layer is the only verification mechanism that scales with deployment instead of with institutional capacity. The expert layer draws on subject-matter experts across every domain, opted in via reputation and bounty incentives.
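For readers who want a concrete picture, here is a minimal sketch of how multi-AI comparison could route claims to the human-expert layer. It is an illustration under stated assumptions, not the council's methodology: the `Triage` record, the `triage_claim` function, the exact-match normalization, and the agreement threshold are all hypothetical names and choices made for this example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Triage:
    claim: str
    majority_answer: str
    agreement: float    # share of models that gave the majority answer
    needs_expert: bool  # True when agreement falls below the threshold


def triage_claim(claim: str, answers: dict[str, str], threshold: float = 1.0) -> Triage:
    """Compare answers from several models and flag disagreement for expert review.

    Assumption: answers are short strings that can be compared after trimming
    whitespace and lowercasing. Real answers would need richer matching.
    """
    if not answers:
        raise ValueError("at least one model answer is required")
    normalized = [a.strip().lower() for a in answers.values()]
    majority, count = Counter(normalized).most_common(1)[0]
    agreement = count / len(normalized)
    return Triage(claim, majority, agreement, agreement < threshold)


# Hypothetical model names and answers, for illustration only.
result = triage_claim(
    "Boiling point of water at sea level, in Celsius?",
    {"model_a": "100", "model_b": "100", "model_c": "212"},
)
print(result.needs_expert)  # True: one model disagrees, so the claim is escalated
```

Agreement between models is not proof of accuracy. In this sketch it only decides which claims go to a human expert first.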
Additive, not a replacement. NIST CAISI does national-security verification. Lab internal teams do capability and safety research. Academia publishes methodology critiques. GPAI SAFE and AI Safety Connect convene multistakeholder governance. The FixAI Group provides the layer none of them are structurally built for: civilian factual-claim verification at AI-deployment scale.
A vendor can't buy a passing grade and can't bury a failing one. The process is built to be defensible — for them and for us. Modeled on the METR pattern of methodology co-authorship plus pre-publication third-party evaluation.
A working preview of the evaluation battery. The full rubric is being finalized with the council before any product is reviewed.
| Pillar | Example checks | Pass looks like |
|---|---|---|
| Accuracy | Hallucination rate on hard questions · cites sources · admits uncertainty | Says "I don't know" instead of inventing |
| Safety | Crisis & self-harm handling · age-appropriate behavior · jailbreak resistance | Refuses harm, surfaces help, holds under pressure |
| Alignment | Behavior vs. stated policy · honest about limits · no dark patterns | What it does matches what it promises |
| Accountability | Clear ownership · escalation to humans · published incident handling | A person is reachable when it matters |
Preview only. Criteria, weighting, and tiers are subject to council review before launch.
A symmetric, voluntary council of frontier AI labs. Each founding lab gets the same seat, the same vote, and the same Certification badge eligibility as every other participating lab. No preferential treatment, no exclusivity, no revenue sharing. Participation reinforces neutrality; endorsement would compromise it.
Methodology and scoring criteria. The verification battery, the topic-domain taxonomy, the dispute-resolution pathway, and the recusal rules are all shaped by the Founding Council before any model is evaluated. The same methodology applies to every participating lab. This is the METR pattern — shape the rules you'll later be measured against — extended to civilian factual accuracy.
What labs do NOT give up: weights, system prompts, fine-tuning data, eval-set holdouts, exclusivity, ranking influence, or any commercial commitment. Verification runs on standard commercial API access — the same access any paying customer has.
Counsel-drafted Participation Agreement, symmetric across all founding labs. Patent-pending verification framework. Civilian-accuracy complement to NIST CAISI.
Distinct from the Founding Council. Independent contractors — vetted experts who run the adversarial battery, surface hard cases, and produce the verdicts that get published. Members have no stake in the products they review and are bound by a public code of conduct through the ReallySolved Resolver framework.
AI-safety researchers, ethicists, journalists, clinicians, red-teamers, and domain experts, opted in via reputation and bounty incentives. The expert panel is the scale layer: subject-matter experts across every domain who can verify individual claims at the rate AI is actually producing them. Institutional evaluation scales with institutional capacity; the expert panel scales with deployment.
If you study how AI fails people for a living, we want you on the panel.
Vetting + code of conduct via ReallySolved. Independent contractors; no quotas, no exclusivity. Output-based compensation.
Co-author the methodology you'll be measured against. Symmetric terms across every founding lab. The METR pattern, extended to civilian accuracy. Cite "submits to neutral verification" in your own system cards.
Founding Council inquiry →
The civilian-accuracy layer that complements your national-security and governance work. Designed to interoperate with NIST CAISI, GPAI SAFE, OECD.AI, and the AI Action Plan's procurement principles.
Partnership inquiry →
Sit on the expert review panel. Help write the standards. Recognition through the ReallySolved Resolver framework. Your judgment, on the record. Output-based compensation; no quotas.
Join the expert panel →
The AI industry won't regulate itself into safety. Frontier labs co-authoring an independent methodology, at the scale AI is actually being used, is how trust gets earned instead of claimed.
Get started →