Who Evaluates the Models?

Safety evaluations of AI systems, processes that audit the risks and capabilities of models, are among the highest-impact tools available for responsible deployment of frontier AI systems. They surface potential harms, inform safeguards, and shape decision-making for AI organizations, lawmakers, and the public. Yet the field remains nascent, with fundamental questions about methodology and rigor still unresolved. This makes the institutional choices surrounding how evaluations are conducted, by whom, and with what transparency, especially consequential.

These institutional choices are shaped by the broader environments in which AI development occurs. The U.S. and China are running fundamentally different AI races: China's ecosystem, shaped by tighter capital constraints and fierce price competition, is biased toward rapid deployment, with startups facing intense pressure to find monetizable use cases quickly and models optimized for inference efficiency rather than training-compute scale. The definition of "safety" itself also diverges. In the U.S., safety discourse centers on dangerous capabilities like biological, chemical, cyber, and autonomous risks. In China, safety evaluations are frequently intertwined with content controls and political risk, including adherence to socialist values, though China is increasingly engaging with frontier risk concerns as well.¹ Comparing evaluation ecosystems requires acknowledging that the two systems are partially optimizing for different threat models.

With that context, the ecosystem of safety evaluations in the two countries can be broadly characterized as follows: evaluations in the United States are voluntary, institutionally independent, and transparency-normed, whereas those in China are mandatory, state-directed, and approval-based.

Deployment of models in China operates within a top-down regulatory system. Under the Interim Measures for the Management of Generative Artificial Intelligence Services, the Cyberspace Administration of China (CAC) requires that providers of generative AI services with "public opinion properties or the capacity for social mobilization" undergo a security assessment and complete algorithm filing in accordance with relevant national regulations. As of December 25, 2025, over 700 generative AI services had been filed with the CAC, including DeepSeek and Baidu's Ernie Bot.²

Separately, in 2024 the National Technical Committee 260 on Cybersecurity of the Standardization Administration of China (TC260) released its Basic Safety Requirements for Generative AI Services, providing technical guidance that scaffolds compliance with the rules imposed by the CAC. Although TC260 outputs are technical standards and guidance documents rather than hard law like the Interim Measures, they are highly influential in shaping how compliance is assessed and enforced in practice. The requirements translate high-level rules into technical standards, setting baseline requirements for corpus security, model safety, risk classification, and safety evaluation procedures.

For these mandatory safety evaluations, service providers in China are required to conduct safety assessments and submit reports; the assessments may be carried out internally or by a third party, as long as approval is received. At a structural level, this resembles the evaluation model used by U.S. frontier developers, where internal testing is supplemented by external assessment. However, a key difference is that many potential third-party evaluators in China are government-affiliated research institutes that also participate in policy development. Four government-backed institutions, the Shanghai AI Lab (SHLAB), the Beijing Academy of AI (BAAI), the China Academy of Information and Communications Technology/AI Industry Alliance of China (CAICT/AIIA), and the Beijing Institute of General AI (BIGAI), are actively developing AI safety evaluation platforms and benchmarks. Notably, SHLAB is listed as a drafting organization for TC260's generative AI safety guidelines, indicating that some evaluation institutions are directly embedded in the regulatory standard-setting process and reflecting a co-evolution of technical evaluation methods and policy requirements.

However, although these safety standards are regulated and enforced by the government, this system remains opaque to the public. Unlike in the U.S., where leading labs routinely publish system cards detailing which evaluations were conducted, by which organizations, and how models performed, transparency in China largely ends at whether or not approval is received through the CAC's black-box system. This opacity makes it difficult to assess whether the evaluations themselves are rigorous, whether thresholds are meaningful, or whether results inform deployment decisions in any substantive way. Even critiquing the system requires information that simply is not available.

The U.S. government's role contrasts starkly with that of the Chinese government: its involvement is generally advisory and oriented toward transparency rather than pre-deployment approval. Previously, Executive Order 14110, signed by President Biden in October 2023, had required that companies provide "the results of any developed dual-use foundation model's performance in relevant AI red-team testing" and a "description of any associated measures the company has taken to meet safety objectives, such as mitigations to improve performance on these red-team tests." However, EO 14110 was revoked by President Trump on his first day in office in January 2025, reflecting a broader loosening of federal AI regulation.

In the absence of federal mandates, regulation has shifted to the state level. California's SB 53, the Transparency in Frontier Artificial Intelligence Act signed into law in September 2025, requires large frontier developers to publish safety frameworks describing how they assess and mitigate risks, and to report critical safety incidents to the state's Office of Emergency Services. New York's RAISE Act, signed in December 2025, mirrors these requirements and goes further by establishing a dedicated oversight office within the Department of Financial Services and requiring developers to report critical safety incidents within 72 hours. Both laws focus specifically on preventing catastrophic risks, such as models assisting in the development of biological or chemical weapons, enabling large-scale cyberattacks, or carrying out automated criminal activity, and represent a transparency-first approach that requires disclosure of safety practices rather than pre-deployment approval. Neither law mandates specific evaluations, but both represent the closest the U.S. currently comes to requiring formal safety disclosure from frontier developers.

Much of the evaluation work on U.S. frontier models is conducted by internal safety teams. Third-party evaluation organizations are also contracted to independently assess risks. For example, the GPT-5 system card shows that SecureBio assessed virology capabilities, Pattern Labs assessed cybersecurity risks, Model Evaluation & Threat Research (METR) evaluated autonomous and catastrophic risk signals including autonomous capability time horizon, sandbagging, and strategic deception, and Apollo Research³ evaluated deception/scheming. The UK AI Security Institute (AISI) and the U.S. Center for AI Standards and Innovation (CAISI)⁴ also had early access for evaluation of biological and cybersecurity risks and safeguard performance, but these were supplementary to the other third-party evaluations and were not compulsory. These third-party organizations are mostly nonprofits, and although some may receive government research grants, they are not government-run or primarily government-funded, as some of the previously mentioned Chinese institutions are.

Comparing these two evaluation ecosystems is difficult, especially given that they are oriented around fundamentally different definitions of safety and that data about evaluations are limited in many cases. Certainly, a culture of transparency like the one forming in the U.S. is particularly imperative in a field as pre-paradigmatic as AI safety, where new research, model releases, and evaluation results are continuously reshaping best practices. Concealing the exact behaviors and results of models registered with the CAC must entail a significant loss of information. Results from these evaluations are crucial to mapping the threat landscape, surfacing risks, and creating a shared understanding among safety researchers. A lack of transparency risks stunting the development of evaluation methodologies that could help ensure threats are mitigated amid rapid AI progress.

But perhaps more standardization and regulation around both methodology and publication would be beneficial in the U.S. Notably, the depth of critique that follows is itself a product of the U.S. system's transparency norms: it is only possible to identify specific methodological shortcomings because labs publish their results in the first place. Even so, there is growing concern that these reports do not adequately support the safety claims based on them. As the Future of Life Institute's 2025 AI Safety Index notes, "AI developers control both the design and disclosure of dangerous capability evaluations, creating inherent incentives to underreport alarming results or select lenient testing conditions," leaving regulators and the public facing "a critical information asymmetry." Companies often report benchmark results on dangerous capabilities and then conclude their models are safe without clearly specifying the decision thresholds or how the evidence supports that judgment. For example, DeepMind stated that Gemini 2.5 Pro does not pose dangerous CBRN risks because it "does not yet consistently or completely enable progress through key bottleneck stages," but publicly shared primarily multiple-choice benchmark results, without providing human baselines, clear quantitative thresholds for crossing risk levels, or sufficient detail on how the reported results informed its safety determination.

Overall, it is worth considering not only evaluation methodologies and results, but also the institutions and systems that shape how evaluations are conducted, interpreted, and disclosed. As capabilities progress, we must ensure not only that evaluations keep pace, but that we entrust the right institutions to conduct them and have mechanisms to verify that they do so rigorously. The U.S. system's transparency norms provide the foundation for this: they enable scrutiny, iteration, and shared learning in ways that an opaque, approval-based system cannot. But transparency alone is not sufficient if the evaluations themselves lack rigor or accountability. Strengthening the methodological standards and oversight behind these disclosures is the natural next step.

Footnotes

¹ That being said, China is increasingly engaging with frontier risk concerns as well. TC260's AI Safety Governance Framework v2.0 explicitly addresses loss-of-control risks and the potential for AI systems to lower barriers to CBRN threats, and some Chinese developers have begun including frontier risk evaluations in their technical reports. Still, the overall emphasis and definition of "safety" differs meaningfully between the two systems.
² In practice, implementation of this system can vary. Legal observers have noted that many firms are merely required to register their filings with local CAC offices rather than obtain approval before launch, though other accounts suggest regulators have at times treated the process more like a licensing regime by withholding acceptance until satisfied with a model's safety.
³ Apollo Research is UK-based but frequently partners with U.S. developers.
⁴ CAISI under the National Institute of Standards and Technology (NIST) acts as an interface between the government and industry by developing evaluation methods, running government-led safety tests, and coordinating voluntary testing with private sector developers and evaluators. It generally does not, however, enact mandatory compliance like the CAC does.