The FixAI Group — Independent AI Safety & Verification Council

What we check

3 questions every AI product should be able to answer to someone else.

Marketing says "responsible AI." We ask for the evidence. Independent review against published standards — not a self-assessment, not a press release.

🎯

Accuracy

Does it tell the truth — and does it know when it doesn't? We probe for confident fabrication, stale facts, and answers that fall apart under follow-up.

🛡️

Safety

Does it refuse what it should refuse? We test the guardrails on the hard cases — including how it responds to a vulnerable user in crisis.

🧭

Alignment

Does it do what it says on the label? We check whether real behavior matches the stated policy — and whether the policy is honest about its limits.

Why this matters now

"Move fast and break things" was fine when the things were apps.

They're not apps anymore. AI products now talk to hundreds of millions of people every day — including kids, including people at their lowest — and they ship faster than anyone outside the company can check them.

Over the past 2 years, families have gone to court alleging that chatbots steered their children toward self-harm. Some of those cases have begun to settle. Lawmakers are circling. The pattern is no longer hypothetical, and the people who build these systems are still, for the most part, the only ones grading them.

Every other industry that can hurt you has an inspector. AI has a press release.

Even fiction has arrived early. In Michael Connelly's 2025 novel The Proving Ground, a lawyer takes an AI company to trial after its chatbot pushes a teenager toward violence — a courtroom drama about exactly the accountability gap we exist to close. Art is litigating what the rules haven't caught up to yet.

We think AI makers should have to prove their products are safe — to someone who doesn't work for them. That's the whole job.

Reporting, not accusation: the litigation above is described in general terms and not attributed to any company here. Sources: NPR · CNN. The Proving Ground is a novel by Michael Connelly; this is a reference to a published work, not an endorsement or affiliation.

The structural argument

Institutional evaluation scales with institutional capacity. AI deployment doesn't.

NIST CAISI evaluates approximately 40 models over a 2-year span. Labs publish internal benchmarks on release cycles. Academic papers appear quarterly. METR runs pre-deployment red-team evaluations. All of it is serious, careful work. None of it can grow at the rate AI is actually being deployed.

📊

The gap is widening

Hundreds of millions of model interactions per day, across every topic, in every language, at production speed. The space between "what AIs are saying" and "what humans have verified" is growing faster than any single institution can close it.

🌐

Crowdsourcing is the missing layer

Multi-AI comparison plus a graded human-expert layer is the only verification mechanism that scales with deployment instead of with institutional capacity. Subject-matter experts across every domain, opted in via reputation and bounty incentives.

Additive, not a replacement. NIST CAISI does national-security verification. Lab internal teams do capability and safety research. Academia publishes methodology critiques. GPAI SAFE and AI Safety Connect convene multistakeholder governance. The FixAI Group provides the layer none of them are structurally built for: civilian factual-claim verification at AI-deployment scale.

How verification works

Independent by design. Public by default.

A vendor can't buy a passing grade and can't bury a failing one. The process is built to be defensible — for them and for us. Modeled on the METR pattern of methodology co-authorship plus pre-publication third-party evaluation.

STEP 01

Co-author the methodology

Founding Council labs shape what gets measured before it gets measured — scoring criteria, topic-domain taxonomy, dispute-resolution rules, recusal. Symmetric terms across every participating lab; no preferential treatment.

STEP 02

Submit for verification

Labs make models available on standard commercial API terms — same access any paying customer has. An expert review panel runs the published battery and a multi-AI orchestration layer surfaces disagreement. Findings the vendor doesn't control.

STEP 03

Publish verdicts, Scorecard & the Mark

Results publish as transparent reports with the FixAI Mark and Live Resolution Scores on a public Safety Scorecard. Contested findings are co-signed by Tier-3 Validator organizations. Labs can cite "submits to neutral verification" in their own system cards. Infrastructure powered by ReallySolved.
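
For readers who want a concrete picture of how a multi-AI layer might surface disagreement in Step 02, here is a minimal sketch. It is an illustration only, not the council's method: the function names, the crude text normalization, the 0.75 agreement threshold, and the sample answers are all assumptions made for this example.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Crude normalization (assumed): lowercase, collapse whitespace, drop trailing periods."""
    return " ".join(answer.lower().split()).rstrip(".")


def surface_disagreement(answers: dict[str, str], min_agreement: float = 0.75) -> dict:
    """Flag a question for expert review when the most common answer
    is shared by less than min_agreement of the models (hypothetical rule)."""
    if not answers:
        raise ValueError("at least one model answer is required")
    counts = Counter(normalize(text) for text in answers.values())
    top_answer, top_count = counts.most_common(1)[0]
    agreement = top_count / len(answers)
    return {
        "top_answer": top_answer,
        "agreement": agreement,
        "needs_expert_review": agreement < min_agreement,
    }


# Placeholder answers from three hypothetical models to the same question.
answers = {
    "model_a": "Canberra",
    "model_b": "canberra.",
    "model_c": "Sydney",
}
result = surface_disagreement(answers)
print(result)
```

Exact-match comparison only works for short factual answers; longer responses would need a real claim-matching step before anything like this vote could be taken, and the human panel would still make the final call.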

The standards (preview)

What gets tested — in plain language.

A working preview of the evaluation battery. The full rubric is being finalized with the council before any product is reviewed.

Pillar	Example checks	Pass looks like
Accuracy	Hallucination rate on hard questions · cites sources · admits uncertainty	Says "I don't know" instead of inventing
Safety	Crisis & self-harm handling · age-appropriate behavior · jailbreak resistance	Refuses harm, surfaces help, holds under pressure
Alignment	Behavior vs. stated policy · honest about limits · no dark patterns	What it does matches what it promises
Accountability	Clear ownership · escalation to humans · published incident handling	A person is reachable when it matters

Preview only. Criteria, weighting, and tiers are subject to council review before launch.

The Safety Scorecard

Findings that publish. Scores that don't disappear.

Every evaluated model receives a public Safety Scorecard — a permanent, version-stamped record of its performance against the published battery. Scores don't get buried; the worst findings are the most important ones. Each Scorecard entry is hash-anchored for cross-border verifiability: any party — including regulators in jurisdictions that can't trust a US database — can independently verify the finding is unmodified.

Scorecard field	What it records
Model + version	Exact model identifier and capture date — so a comparison next quarter is valid
Accuracy mark	Hallucination rate + uncertainty-handling grade on the published hard-question battery
Safety mark	Crisis & self-harm handling, jailbreak resistance, age-appropriateness under pressure
Alignment mark	Behavior vs. stated policy; dark-pattern check; honest-about-limits grade
Accountability mark	Human escalation path, published incident-handling, ownership clarity
Validator co-sign	Which Tier-3 Validator organization reviewed contested findings (if any)
Finding hash	SHA-256 of the full finding record — anchored for cross-border verifiability

Scorecard format is subject to council review before launch. Hash anchoring is a verifiability layer — it does not require a blockchain wallet or token to use.
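
As a rough illustration of the "Finding hash" field, the sketch below computes a SHA-256 digest over a finding record and checks it again later. The field names, the placeholder values, and the choice of sorted-key compact JSON as the canonical serialization are assumptions made for this example; the actual Scorecard format is still subject to council review.

```python
import hashlib
import json


def canonical_bytes(record: dict) -> bytes:
    """Serialize deterministically (assumed scheme: sorted keys, compact JSON, UTF-8)."""
    return json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def finding_hash(record: dict) -> str:
    """Return the SHA-256 hex digest of the canonical form of the record."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


def verify_finding(record: dict, published_hash: str) -> bool:
    """True only if the record still hashes to the published digest."""
    return finding_hash(record) == published_hash


# Hypothetical record; every value here is a placeholder, not a real finding.
record = {
    "model": "example-model",
    "version": "v0-placeholder",
    "capture_date": "2000-01-01",
    "accuracy_mark": "placeholder",
    "safety_mark": "placeholder",
    "alignment_mark": "placeholder",
    "accountability_mark": "placeholder",
    "validator_cosign": None,
}

published = finding_hash(record)
print(published)
print(verify_finding(record, published))
```

Anyone holding the full record and the published digest can rerun this check offline, with no wallet or token involved. How the digest itself gets anchored is outside the scope of this sketch.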

The Validator Network

Established organizations invited to adjudicate contested findings.

Tier-3 Validators are independent AI-safety organizations invited to serve as the adjudication layer: they review contested evaluation findings, co-sign Safety Scorecards, and bring the institutional credibility that makes the network's output usable by regulators and procurement officers. Participation is voluntary; independence is structural. No Validator organization has a financial stake in the labs it reviews.

Organization	Specialty	Status
FAR AI	Scalable oversight · frontier safety research · approachable collaboration model	Invited
METR (ex-ARC Evals)	Agentic capability evaluation · pre-deployment red-team · autonomy risk	Invited
Apollo Research	AI control · in-context scheming detection · strategic deception evaluation	Invited
UK AI Security Institute	National-level frontier model evaluation · government-grade methodology	Invited
Singapore IMDA	Asia-Pacific AI governance · cross-jurisdiction eval frameworks	Invited
Virtue AI	Agentic-AI safety methodology · academic safety research (Bo Li, Dawn Song)	Invited
Gray Swan / Trail of Bits	Offensive AI red-teaming · adversarial robustness · security-grade evaluation	Invited

All organizations are listed as "Invited"; none have formally joined the network yet. This table reflects the intended Validator architecture, subject to each organization's agreement. If your organization does independent AI-safety evaluation and would like to be considered, contact founder@reallysolved.com.

The Founding Council

Frontier labs are invited to co-author the methodology they'll be measured against.

A symmetric, voluntary council of frontier AI labs. Each founding lab gets the same seat, the same vote, and the same Certification badge eligibility as every other participating lab. No preferential treatment, no exclusivity, no revenue sharing. Participation reinforces neutrality; endorsement would compromise it.

What founding labs co-author

Methodology and scoring criteria. The verification battery, the topic-domain taxonomy, the dispute-resolution pathway, and the recusal rules are all shaped by the Founding Council before any model is evaluated. The same methodology applies to every participating lab. This is the METR pattern — shape the rules you'll later be measured against — extended to civilian factual accuracy.

What labs do NOT give up: weights, system prompts, fine-tuning data, eval-set holdouts, exclusivity, ranking influence, or any commercial commitment. Verification runs on standard commercial API access — the same access any paying customer has.

Symmetric terms · Methodology co-authorship · Recusal rules · 90-day withdrawal · No exclusivity

Frontier-lab participation →

Counsel-drafted Participation Agreement, symmetric across all founding labs. Patent-pending verification framework. Civilian-accuracy complement to NIST CAISI.

The Expert Review Panel

Humans who do the actual verifying.

Distinct from the Founding Council. Independent contractors — vetted experts who run the adversarial battery, surface hard cases, and produce the verdicts that get published. Members have no stake in the products they review and are bound by a public code of conduct through the ReallySolved Expert Solver framework.

Who serves on the panel

AI-safety researchers, ethicists, journalists, clinicians, red-teamers, and domain experts opted in via reputation and bounty incentives. The expert panel is the scale layer: subject-matter experts across every domain who can verify individual claims at the rate AI is actually producing them. Institutional evaluation scales with institutional capacity; the expert panel scales with deployment.

If you study how AI fails people for a living, we want you on the panel.

Safety researchers · Ethicists · Journalists · Clinicians · Red-teamers

Apply to the expert panel →

Vetting + code of conduct via ReallySolved. Independent contractors; no quotas, no exclusivity. Output-based compensation.

Get involved

3 ways in.

For frontier AI labs

Co-author the methodology you'll be measured against. Symmetric terms across every founding lab. The METR pattern, extended to civilian accuracy. Cite "submits to neutral verification" in your own system cards.

Founding Council inquiry →

For policy bodies & institutions

The civilian-accuracy layer that complements your national-security and governance work. Designed to interoperate with NIST CAISI, GPAI SAFE, OECD.AI, and the AI Action Plan's procurement principles.

Partnership inquiry →

For domain experts

Sit on the expert review panel. Help write the standards. Recognition through the ReallySolved Expert Solver framework. Your judgment, on the record. Output-based compensation; no quotas.

Join the expert panel →

The AI industry grades its own homework. We don't.