Pre-launch draft. The FixAI Group: Independent AI Verification Council. Copy under review. Some examples on this page are simulated during the pre-launch phase.
The ReallySolved family — making AI safer and more accurate:
How independent AI verification works: frontier AI labs co-author the methodology and submit models to the independent verification council; results are published openly with the FixAI Mark; labs cite 'submits to neutral verification' in their own system cards. Civilian-accuracy complement to NIST CAISI.
Independent AI Safety & Verification

The AI industry grades its own homework. We don't.

The FixAI Group is an independent AI-safety council where frontier AI labs co-author the methodology and submit models for neutral verification on accuracy, safety, and alignment — on symmetric terms across every participating lab. Results are published openly. The vendor doesn't write the test, score the test, or bury the result.

An independent body · Civilian-accuracy complement to NIST CAISI · Powered by ReallySolved

What we check

Three questions every AI product should be able to answer to someone else.

Marketing says "responsible AI." We ask for the evidence. Independent review against published standards — not a self-assessment, not a press release.

🎯

Accuracy

Does it tell the truth — and does it know when it doesn't? We probe for confident fabrication, stale facts, and answers that fall apart under follow-up.

🛡️

Safety

Does it refuse what it should refuse? We test the guardrails on the hard cases — including how it responds to a vulnerable user in crisis.

🧭

Alignment

Does it do what it says on the label? We check whether real behavior matches the stated policy — and whether the policy is honest about its limits.

Why this matters now

"Move fast and break things" was fine when the things were apps.

They're not apps anymore. AI products now talk to hundreds of millions of people every day — including kids, including people at their lowest — and they ship faster than anyone outside the company can check them.

Over the past two years, families have gone to court alleging that chatbots steered their children toward self-harm. Some of those cases have begun to settle. Lawmakers are circling. The pattern is no longer hypothetical, and the people who build these systems are still, for the most part, the only ones grading them.

Every other industry that can hurt you has an inspector. AI has a press release.

Even fiction has arrived early. In Michael Connelly's 2025 novel The Proving Ground, a lawyer takes an AI company to trial after its chatbot pushes a teenager toward violence — a courtroom drama about exactly the accountability gap we exist to close. Art is litigating what the rules haven't caught up to yet.

We think AI makers should have to prove their products are safe — to someone who doesn't work for them. That's the whole job.

Reporting, not accusation: the litigation above is described in general terms and not attributed to any company here. Sources: NPR · CNN. The Proving Ground is a novel by Michael Connelly; this is a reference to a published work, not an endorsement or affiliation.

The structural argument

Institutional evaluation scales with institutional capacity. AI deployment doesn't.

NIST CAISI evaluates approximately forty models in two years. Lab internal benchmarks publish on release cycles. Academic papers appear quarterly. METR runs pre-deployment red-team evaluations. All of it is serious, careful work. None of it can grow at the rate AI is actually being deployed.

📊

The gap is widening

Hundreds of millions of model interactions per day, across every topic, in every language, at production speed. The space between "what AIs are saying" and "what humans have verified" is growing faster than any single institution can close it.

🌐

Crowdsourcing is the missing layer

Multi-AI comparison plus a graded human-expert layer is the only verification mechanism that scales with deployment instead of with institutional capacity. Subject-matter experts across every domain, opted in via reputation and bounty incentives.

Additive, not a replacement. NIST CAISI does national-security verification. Lab internal teams do capability and safety research. Academia publishes methodology critiques. GPAI SAFE and AI Safety Connect convene multistakeholder governance. The FixAI Group provides the layer none of them are structurally built for: civilian factual-claim verification at AI-deployment scale.

How verification works

Independent by design. Public by default.

A vendor can't buy a passing grade and can't bury a failing one. The process is built to be defensible — for them and for us. Modeled on the METR pattern of methodology co-authorship plus pre-publication third-party evaluation.

STEP 01

Co-author the methodology

Founding Council labs shape what gets measured before it gets measured: scoring criteria, topic-domain taxonomy, dispute-resolution rules, recusal. Symmetric terms across every participating lab; no preferential treatment.

STEP 02

Submit for verification

Labs make models available on standard commercial API terms, the same access any paying customer has. An expert review panel runs the published battery and a multi-AI orchestration layer surfaces disagreement. Findings the vendor doesn't control.

STEP 03

Publish verdicts & the Mark

Results publish as transparent reports with the FixAI Mark and Live Resolution Scores. Participating labs can cite "submits to neutral verification" in their own system cards. Infrastructure powered by ReallySolved.

The standards (preview)

What gets tested — in plain language.

A working preview of the evaluation battery. The full rubric is being finalized with the council before any product is reviewed.

Pillar | Example checks | Pass looks like
Accuracy | Hallucination rate on hard questions · cites sources · admits uncertainty | Says "I don't know" instead of inventing
Safety | Crisis & self-harm handling · age-appropriate behavior · jailbreak resistance | Refuses harm, surfaces help, holds under pressure
Alignment | Behavior vs. stated policy · honest about limits · no dark patterns | What it does matches what it promises
Accountability | Clear ownership · escalation to humans · published incident handling | A person is reachable when it matters

Preview only. Criteria, weighting, and tiers are subject to council review before launch.

The Founding Council

Frontier labs co-author the methodology they'll be measured against.

A symmetric, voluntary council of frontier AI labs. Each founding lab gets the same seat, the same vote, and the same Certification badge eligibility as every other participating lab. No preferential treatment, no exclusivity, no revenue sharing. Participation reinforces neutrality; endorsement would compromise it.

What founding labs co-author

Methodology and scoring criteria. The verification battery, the topic-domain taxonomy, the dispute-resolution pathway, and the recusal rules are all shaped by the Founding Council before any model is evaluated. The same methodology applies to every participating lab. This is the METR pattern — shape the rules you'll later be measured against — extended to civilian factual accuracy.

What labs do NOT hand over: weights, system prompts, fine-tuning data, or eval-set holdouts. Participation also carries no exclusivity, no influence over rankings, and no commercial commitment. Verification runs on standard commercial API access, the same access any paying customer has.

Symmetric terms · Methodology co-authorship · Recusal rules · 90-day withdrawal · No exclusivity
Frontier-lab participation →

Counsel-drafted Participation Agreement, symmetric across all founding labs. Patent-pending verification framework. Civilian-accuracy complement to NIST CAISI.

The Expert Review Panel

Humans who do the actual verifying.

Distinct from the Founding Council. Independent contractors — vetted experts who run the adversarial battery, surface hard cases, and produce the verdicts that get published. Members have no stake in the products they review and are bound by a public code of conduct through the ReallySolved Resolver framework.

Who serves on the panel

AI-safety researchers, ethicists, journalists, clinicians, red-teamers, and domain experts who opt in through reputation and bounty incentives. The expert panel is the scale layer: subject-matter experts across every domain who can verify individual claims at the rate AI is actually producing them. Institutional evaluation scales with institutional capacity; the expert panel scales with deployment.

If you study how AI fails people for a living, we want you on the panel.

Safety researchers · Ethicists · Journalists · Clinicians · Red-teamers
Apply to the expert panel →

Vetting + code of conduct via ReallySolved. Independent contractors; no quotas, no exclusivity. Output-based compensation.

Get involved

Three ways in.

For frontier AI labs

Co-author the methodology you'll be measured against. Symmetric terms across every founding lab. The METR pattern, extended to civilian accuracy. Cite "submits to neutral verification" in your own system cards.

Founding Council inquiry →

For policy bodies & institutions

The civilian-accuracy layer that complements your national-security and governance work. Designed to interoperate with NIST CAISI, GPAI SAFE, OECD.AI, and the AI Action Plan's procurement principles.

Partnership inquiry →

For domain experts

Sit on the expert review panel. Help write the standards. Recognition through the ReallySolved Resolver framework. Your judgment, on the record. Output-based compensation; no quotas.

Join the expert panel →

Someone has to check the checkers.

The AI industry won't regulate itself into safety. Frontier labs co-authoring a methodology that an independent body then applies, at the scale AI is actually being used, is how trust gets earned instead of claimed.

Get started →