OSS Qualification Rubric

Status: Active — v1.1
Date: 2026-05-22
Author: Murali Raju / NotionAlpha OSS AI Lab
Applies to: Every capability layer in the reference architecture (docs/oss-ai-lab/reference-architecture.md)
Related: Per-layer evaluations (Task B3) apply this rubric; the running benchmark (Task B4) is a separate, heavyweight instrument built for one or two high-value layers only.

Purpose and scope

This document is the evaluation framework applied to OSS implementation candidates for every capability layer in the agent-native enterprise AI reference architecture. It is a lightweight qualification instrument — consistent criteria, scored per candidate, applied uniformly regardless of layer. Its job is to determine whether an OSS project clears the adoption bar.

It is not the running benchmark. The benchmark is a heavyweight, repeatable performance and correctness instrument built for one or two specific high-value layers (the Assurance layer first). The qualification rubric runs first, on every candidate; the benchmark runs only on projects that clear the rubric and are shortlisted for a layer where measurement precision matters enough to justify the build cost.

The distinction matters: the rubric can be applied to any candidate in a few hours using public signals. The benchmark requires weeks of tooling investment per layer. Collapsing the two would either make evaluation prohibitively expensive (if everything got a benchmark) or produce decisions without evidence (if the benchmark replaced per-candidate qualification).

The meta-test

Can an enterprise both adopt an implementation and leave it?

This is the question the rubric answers. The six criteria below cash it out into observable, scoreable signals. A project that scores well across all six can be adopted on open foundations and swapped out — either by replacing it with an alternative behind the same capability interface, or by forking it under its permissive license — without the enterprise being stranded. A project that scores poorly on one or more criteria carries a risk that must be explicitly acknowledged in the per-layer evaluation before it is recommended.

Scoring scheme

Each criterion is scored on a 0–3 scale. The scale is defined per criterion because the signals differ, but the labels are consistent across all six:

| Score | Label | Interpretation |
|-------|-------|----------------|
| 3 | Passes cleanly | The criterion is fully satisfied; no hedging required. |
| 2 | Passes with caveats | The criterion is substantially satisfied but one or two specific conditions fall short; the caveat must be stated explicitly. |
| 1 | Marginal | The project does not clearly satisfy the criterion; it may partially satisfy it, but the gap is material. A score of 1 requires a written rationale for why the project is still a candidate. |
| 0 | Fails | The criterion is not satisfied. A score of 0 on any criterion is a disqualifier unless overridden by the honest-tension provision (§ Honest tension) with explicit justification. |

Total score range: 0–18.

Interpretation guidelines:

| Total | Interpretation |
|-------|----------------|
| 15–18 | Strong candidate: proceed to per-layer recommendation. |
| 11–14 | Viable with caveats: document each caveat in the per-layer evaluation; confirm no single criterion scored 0. |
| 7–10 | Weak candidate: proceed only if no alternative exists and a mitigation plan is documented for each low-scoring criterion. |
| 0–6 | Disqualified: do not recommend without a fundamental change in the project's status. |

These thresholds are guidelines, not automatic gates. A project that scores 14 with a 0 on criterion 1 (license) is disqualified regardless of total; a project that scores 12 with every criterion at 2 is a viable candidate. Total score is a summary; per-criterion scores are the decision record.
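As a minimal sketch of how the bands and the zero-score disqualifier combine, the function below maps six per-criterion scores to a total and an interpretation label. The function name and the `zero_overridden` flag are hypothetical; the flag stands in for an explicitly justified use of the honest-tension provision.

```python
from __future__ import annotations

from typing import Sequence

# Total-score bands from the interpretation table: (low, high, label).
BANDS = [
    (15, 18, "Strong candidate"),
    (11, 14, "Viable with caveats"),
    (7, 10, "Weak candidate"),
    (0, 6, "Disqualified"),
]


def interpret(scores: Sequence[int], zero_overridden: bool = False) -> tuple[int, str]:
    """Return (total, interpretation) for six per-criterion scores.

    `zero_overridden` is a hypothetical flag: set it only when a 0 has been
    overridden with explicit, written justification.
    """
    if len(scores) != 6 or any(s not in (0, 1, 2, 3) for s in scores):
        raise ValueError("expected six scores, each in 0..3")
    total = sum(scores)
    # A 0 on any criterion disqualifies regardless of the total.
    if 0 in scores and not zero_overridden:
        return total, "Disqualified"
    label = next(name for low, high, name in BANDS if low <= total <= high)
    return total, label
```

For example, `interpret([0, 3, 3, 3, 3, 2])` yields a total of 14 but is still labelled "Disqualified", which is why the per-criterion scores, not the total, are the decision record.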

How to apply the rubric

1. Name the project, the specific component being evaluated, its version or commit, and the date of evaluation. Rubric scores are dated assessments, not permanent verdicts: an OSS project's governance, health, and license can change. A project may also ship multiple components under different licenses and postures (e.g., a permissively-licensed runtime alongside a source-available control plane); evaluate the specific component the architecture would actually use, not the project as an undifferentiated whole.
2. Score each criterion independently. Do not let overall enthusiasm or skepticism bleed across criteria.
3. For each criterion, write one to three sentences of evidence. A score without evidence is an opinion, not an evaluation. The evidence must be observable by anyone: a URL, a commit count, a license SPDX identifier, an OpenSSF Scorecard badge link.
4. State any caveats for scores of 1 or 2 explicitly. A caveat is a specific condition, not a vague hedge. "Active but recently lost a core maintainer" is a caveat; "somewhat active" is not.
5. Apply the honest-tension provision if needed (see § Honest tension).
6. Record the total and the per-criterion scores. The per-layer evaluation document (Task B3) carries the full scored rubric for each candidate.
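The steps above can be sketched as a small record type that refuses to call an evaluation complete when evidence or caveats are missing. All class and field names here are hypothetical, chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


@dataclass
class CriterionScore:
    criterion: str
    score: int          # 0..3
    evidence: str       # observable by anyone: URL, SPDX id, commit count
    caveat: str = ""    # required for scores of 1 or 2


@dataclass
class Evaluation:
    project: str
    component: str      # the specific component the architecture would use
    version: str        # release tag or commit
    evaluated_on: date  # scores are dated assessments
    scores: list[CriterionScore]

    def problems(self) -> list[str]:
        """List rule violations; an empty list means the record is complete."""
        found = []
        if len(self.scores) != 6:
            found.append("expected six criterion scores")
        for s in self.scores:
            if s.score not in (0, 1, 2, 3):
                found.append(f"{s.criterion}: score must be 0..3")
            if not s.evidence.strip():
                found.append(f"{s.criterion}: missing evidence")
            if s.score in (1, 2) and not s.caveat.strip():
                found.append(f"{s.criterion}: score {s.score} needs a caveat")
        return found
```

The check is deliberately mechanical: it cannot judge whether a caveat is specific enough, only whether one was written down.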

Criterion 1: Genuinely open license

Definition

The project is released under an OSI-approved, permissive license — specifically Apache-2.0 or MIT (or equivalent: BSD-2-Clause, BSD-3-Clause, ISC). Permissive means: no copyleft conditions, no network-use clauses, no additional conditions beyond attribution.

Licenses that do not satisfy this criterion:

Source-available licenses (BUSL/BSL, SSPL, Elastic License, Commons Clause) — source is readable but use is restricted by conditions outside OSI scope. An enterprise cannot freely adopt, fork, or redistribute without legal review of those conditions.
Open-core with load-bearing parts closed — the project presents itself as open-source, but the components an enterprise actually depends on for production use are proprietary or licensed under a non-OSI license. The open-source component is a top-of-funnel artifact, not the real product.
Weak copyleft (LGPL, MPL) or strong copyleft (GPL, AGPL) — these require legal review for enterprise use cases and may impose distribution conditions on enterprise code that links to or embeds the project. They are not automatic disqualifiers, but they score lower than permissive.

The license criterion is checked at the repository root (LICENSE or LICENSE.md) and at the level of any separately released packages or components. If the project publishes multiple packages under different licenses, each component used by the architecture must satisfy this criterion individually.

Scoring guidance

| Score | Condition |
|-------|-----------|
| 3 | Apache-2.0, MIT, BSD-2-Clause, BSD-3-Clause, or ISC at repository root; all distributed components under one of those same named licenses. No additional conditions. |
| 2 | LGPL or MPL: permissive enough for typical enterprise use as a linked library, but requires legal review for embedded or modified distribution. State the specific use pattern and confirm it does not trigger copyleft obligations. |
| 1 | Weak source-available (e.g., BUSL with a short conversion date that has already passed, converting the project to an open license) or GPL/AGPL with an enterprise exception clause that covers the deployment pattern. Requires explicit legal confirmation before recommendation. |
| 0 | SSPL, BUSL/BSL with active restrictions, Elastic License, Commons Clause, proprietary, or open-core with load-bearing parts closed. |
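A first-pass license check can be sketched from SPDX identifiers alone. The function names are hypothetical, and the mapping is an assumption layered on the table: it treats `BUSL-1.1` as still carrying active restrictions, and it returns `None` wherever the table calls for legal review or human judgment (for example GPL/AGPL, or any identifier it does not recognise).

```python
from __future__ import annotations

from typing import Iterable, Optional

PERMISSIVE = {"Apache-2.0", "MIT", "BSD-2-Clause", "BSD-3-Clause", "ISC"}
WEAK_COPYLEFT_PREFIXES = ("LGPL-", "MPL-")
# Assumption: BUSL-1.1 is treated as having active restrictions.
DISQUALIFYING = {"SSPL-1.0", "BUSL-1.1", "Elastic-2.0"}


def score_one(spdx_id: str) -> Optional[int]:
    """Baseline score for one component license; None means 'needs review'."""
    if spdx_id in PERMISSIVE:
        return 3
    if spdx_id.startswith(WEAK_COPYLEFT_PREFIXES):
        return 2
    if spdx_id in DISQUALIFYING:
        return 0
    return None


def baseline_license_score(spdx_ids: Iterable[str]) -> Optional[int]:
    """Each component must qualify individually, so the weakest one decides."""
    scores = [score_one(s) for s in spdx_ids]
    if not scores:
        return None
    if 0 in scores:
        return 0  # one disqualifying component is enough
    if None in scores:
        return None  # at least one license needs human or legal review
    return min(scores)
```

This only reads declared identifiers. It cannot detect a Commons Clause rider or closed load-bearing components, so the result is a starting point for the evaluator, not the score itself.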

Criterion 2: Forkability and vendor independence

Definition

If the maintaining organization or vendor walked away from the project — stopped cutting releases, stopped reviewing pull requests, archived the repository — could the community carry it forward?

This criterion measures structural resilience, not goodwill. A project whose future depends entirely on one vendor's continued engagement is a lock-in risk regardless of its license. The permissive license provides a legal backstop (anyone can fork), but fork-ability in practice requires more than legal permission: there must be a community capable of maintaining the fork, the codebase must be comprehensible to outside contributors, and the governance structure must allow the community to take control.

Five signals are examined:

Governance structure — who makes decisions about the project? A foundation-governed project (CNCF, Apache, Linux Foundation, OpenSSF) has explicit governance that survives any single organization's departure. A vendor-backed project with a public governance charter is better than one with no stated governance. A project where one vendor controls the main branch and all release decisions is the weakest governance signal.
Contributor distribution — over the trailing 12 months, what fraction of code contributions — counted as commits, pull request reviews, and merges together, not commits alone — come from contributors outside the maintaining organization? Counting all three resists gaming by bulk commits from a single vendor. A project where 95% of code contributions are from one company is de facto a vendor product with a public repository.
Codebase comprehensibility — is the codebase documented well enough for outside contributors to make meaningful contributions? Sparse or absent developer documentation, no contribution guide, and architecture that assumes insider knowledge all reduce practical fork-ability.
Fork history — has the project been forked and carried by outside parties before? A track record of successful community forks is the strongest evidence of practical fork-ability.
Upstream lineage and merge optionality — if the candidate is itself a derivative (a fork that adds capability on top of a base project), record the lineage: what is the upstream, how far has the derivative diverged, and is the derivative's added value being upstreamed or held as a private diff? In a fork-tree ecosystem a derivative whose additions are upstreamed (or upstreamable) is lower lock-in than one whose value lives only in an unmerged fork. The architecture maps to the capability expressed through the interface, not to a single node in a fork tree; an upstream and its derivatives are interchangeable candidates behind the same capability boundary, ranked in part by this signal.

Scoring guidance

| Score | Condition |
|-------|-----------|
| 3 | Foundation-governed (CNCF, Apache, Linux Foundation, OpenSSF, or equivalent neutral foundation); OR substantial, documented outside-contributor base (>30% of code contributions, counting commits, PR reviews, and merges together, from contributors outside the maintaining organization over the past 12 months) with a published governance charter. |
| 2 | Single-vendor-backed with a public governance charter that provides a credible path for outside parties to take over; OR foundation-hosted but with de facto single-vendor control. State which specific governance provisions provide the protection. |
| 1 | Single-vendor-backed with no governance charter; codebase is comprehensible and permissively licensed (providing the fork backstop) but the practical community capacity to carry a fork is unproven. |
| 0 | Single-vendor-backed, no governance charter, sparse developer documentation, and no evidence of an outside contributor community capable of sustaining a fork. |
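The outside-contributor share behind the >30% threshold can be sketched as below. The input shape (a list of `(kind, org)` pairs already restricted to the trailing 12 months) and the equal weighting of commits, reviews, and merges are assumptions; the rubric says only that the three are counted together.

```python
from __future__ import annotations

from typing import Iterable

COUNTED_KINDS = ("commit", "review", "merge")


def outside_share(events: Iterable[tuple[str, str]], maintaining_org: str) -> float:
    """Fraction of counted contribution events from outside the maintaining org.

    `events` is a hypothetical list of (kind, org) pairs. Kinds other than
    commit, review, and merge are ignored.
    """
    counted = [(kind, org) for kind, org in events if kind in COUNTED_KINDS]
    if not counted:
        raise ValueError("no commits, reviews, or merges in the window")
    outside = sum(1 for _, org in counted if org != maintaining_org)
    return outside / len(counted)
```

A result above 0.30 supports the outside-contributor route to a score of 3, but only together with a published governance charter.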

Criterion 3: Health and bus factor

Definition

The project is actively maintained and does not depend on a single person or a single team to sustain it.

"Active" means the project is responding to the current state of its problem domain — issues are triaged, pull requests are reviewed, releases are cut when warranted. A project can be stable with a low release frequency if its domain is mature and the codebase is correct; the concern is a project that is silent in the face of known issues or a changing problem space.

"Bus factor" is the number of contributors whose departure would halt or severely impair progress. A bus factor of one means one person leaving would effectively end the project. For enterprise adoption, a reasonable floor is a bus factor of three active, knowledgeable contributors, and five or more is preferable.

Four observable signals are used:

Release cadence — how frequently are new versions released? Appropriate cadence is domain-dependent: a security-adjacent tool should ship security fixes quickly; an infrastructure primitive might have quarterly releases. No release in 12+ months with open issues is a warning sign regardless of domain.
Maintainer count and distribution — how many individuals have commit rights or actively review and merge PRs? Are they concentrated in one organization?
Issue and PR responsiveness — what is the median time to first response on new issues? Median time to merge for non-trivial PRs? These are observable from public repository data.
Contributor trend — is the contributor count growing, stable, or declining over the past 12 months? A declining contributor base is a leading indicator of eventual abandonment.

Scoring guidance

| Score | Condition |
|-------|-----------|
| 3 | Active releases in the past 6 months (or documented rationale for a stable-and-deliberate low cadence); 5+ active maintainers across 2+ organizations; median issue first-response under 7 days; contributor count stable or growing. |
| 2 | Active releases in the past 12 months; 3–4 active maintainers; issues receive responses but may take 2–4 weeks; one organization dominates contributions but at least one outside maintainer exists. State specific caveats. |
| 1 | No release in 12–18 months but the repository is not archived and issues receive occasional responses; OR fewer than 3 active maintainers concentrated in one organization; OR recent departure of a key maintainer with no identified replacement. Requires documented rationale for why the health risk is acceptable. |
| 0 | Archived repository; no response to issues in 12+ months; single maintainer with no succession plan; or a public statement of abandonment or end-of-life. |
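The responsiveness signal (median time to first response) can be computed from public issue timestamps. The sketch below assumes the evaluator has already collected `(opened_at, first_response_at)` pairs, with `None` for issues that never received a response; the function name and input shape are hypothetical.

```python
from __future__ import annotations

from datetime import datetime
from statistics import median
from typing import Iterable, Optional


def first_response_summary(
    issues: Iterable[tuple[datetime, Optional[datetime]]],
) -> tuple[Optional[float], int]:
    """Return (median days to first response, count of unanswered issues).

    Unanswered issues are excluded from the median, so the count is
    returned alongside it to avoid flattering a silent project.
    """
    waits = []
    unanswered = 0
    for opened_at, first_response_at in issues:
        if first_response_at is None:
            unanswered += 1
        else:
            waits.append((first_response_at - opened_at).total_seconds() / 86400)
    return (median(waits) if waits else None), unanswered
```

A median under 7 days is one of several conditions for a score of 3; a large unanswered count alongside a low median is worth recording as a caveat.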

Criterion 4: Production adoption and security posture

Definition

The project has demonstrated that real organizations run it in production, and it manages its own security with the rigor that entails.

Production adoption is the strongest evidence that a project is battle-tested: real users encounter real edge cases, file real bug reports, and require real security fixes. A project used only in hobby environments or evaluated only in proofs-of-concept carries higher unknown-failure risk than one with documented enterprise deployments.

Security posture assesses whether the project treats security as an ongoing discipline, not a one-time effort:

OpenSSF Scorecard — the Open Source Security Foundation's automated scoring system evaluates projects on a fixed set of security practices (branch protection, dependency pinning, code review requirements, vulnerability disclosure). A Scorecard of 7 or above indicates solid baseline security hygiene; 5–6 is acceptable with noted gaps; below 5 indicates material gaps.
Security policy (SECURITY.md) — does the project provide a documented channel for responsible disclosure of vulnerabilities? Projects without a security policy leave researchers and users with no clear path to report issues privately.
CVE responsiveness — when vulnerabilities are reported and assigned CVEs, how quickly does the project ship fixes? A pattern of slow CVE response in a security-adjacent project is a material risk signal.
Evidence of production use — documented enterprise or production deployments (case studies, conference talks, integrators citing the project, package download counts, active commercial support offerings). Self-reported production use with no corroboration does not count.

Scoring guidance

| Score | Condition |
|-------|-----------|
| 3 | Multiple documented enterprise or production deployments; OpenSSF Scorecard ≥7; `SECURITY.md` present with a clear responsible-disclosure channel; CVE history (if any) shows patches shipped within 30 days of report. |
| 2 | At least one documented production deployment; OpenSSF Scorecard 5–6; `SECURITY.md` present; CVE response time is acceptable but may be slower (30–90 days). State specific caveats for any sub-dimension that is partial. |
| 1 | Production use is credibly implied (e.g., significant package downloads, active commercial adopters) but not documented; OpenSSF Scorecard 3–4 with specific gaps identified and not critical; `SECURITY.md` absent but issues are responsive. Requires documented rationale. |
| 0 | No credible evidence of production use; OpenSSF Scorecard <3 or not run; no security policy; history of slow or non-existent CVE response. |

Combined-score rule. This criterion covers two independent sub-dimensions: production adoption and security posture. Score each sub-dimension separately against the rows above; the criterion score is the lower of the two. In particular, if either sub-dimension would score 0 on its own (e.g., no credible evidence of production use, or Scorecard <3 with no security policy), the criterion scores 0 regardless of how the other sub-dimension scores. This keeps the combined score deterministic.
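The combined-score rule reduces to taking the minimum of the two sub-dimension scores, since a 0 on either side is already the lower value. A minimal sketch, with a hypothetical function name:

```python
def criterion_4_score(adoption: int, security: int) -> int:
    """Combine the two sub-dimension scores (each 0..3) into one score.

    The lower score wins, so a 0 on either sub-dimension yields 0.
    """
    for name, value in (("adoption", adoption), ("security", security)):
        if value not in (0, 1, 2, 3):
            raise ValueError(f"{name} score must be 0..3, got {value!r}")
    return min(adoption, security)
```

For example, strong documented adoption (3) paired with a Scorecard in the 5–6 range (2) gives a criterion score of 2, and the caveat should name the security gap.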

Criterion 5: Composability

Definition

The project exposes clean interfaces that make it genuinely swappable behind a capability boundary. It is not a monolith that requires wholesale adoption to get any value from it.

In the reference architecture, each capability layer is defined by its interfaces, not by any particular implementation. An implementation recommendation is only as durable as the ease with which a different implementation could replace it. A project that is technically permissively licensed and community-governed but which exposes a proprietary interface, requires deep integration into its internals, or bundles unrelated capabilities into an inseparable whole is not composable — replacing it would require re-architecting the surrounding system.
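To make "swappable behind a capability boundary" concrete, the sketch below defines a hypothetical capability interface and two unrelated implementations of it. `MemoryStore`, `DictStore`, and `AppendLogStore` are invented for illustration and do not correspond to any layer definition in the reference architecture.

```python
from __future__ import annotations

from typing import Optional, Protocol


class MemoryStore(Protocol):
    """Hypothetical capability interface: the only thing callers depend on."""

    def put(self, key: str, value: str) -> None: ...
    def get(self, key: str) -> Optional[str]: ...


class DictStore:
    """One implementation: a plain in-process dictionary."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)


class AppendLogStore:
    """A second implementation with different internals: an append-only log."""

    def __init__(self) -> None:
        self._log: list[tuple[str, str]] = []

    def put(self, key: str, value: str) -> None:
        self._log.append((key, value))

    def get(self, key: str) -> Optional[str]:
        # Latest write wins, so scan from the newest entry backwards.
        for k, v in reversed(self._log):
            if k == key:
                return v
        return None


def remember_and_recall(store: MemoryStore) -> Optional[str]:
    """Caller code is written against the interface, not a concrete class."""
    store.put("ticket", "open")
    store.put("ticket", "closed")
    return store.get("ticket")
```

Swapping `DictStore` for `AppendLogStore` changes nothing in `remember_and_recall`. A candidate that forces callers to reach into its internals would break this property, which is what the lower scores on this criterion describe.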

Three signals:

Defined, stable interface surface — the project exposes its functionality through a documented API, SDK, or protocol rather than requiring consumers to call internal functions. The interface has versioning semantics so consumers know what stability to expect.
Separation of concerns — the project does one capability well and does not bundle unrelated concerns that force co-adoption. A project that couples, say, orchestration and memory into an inseparable runtime imposes an architectural constraint that is hard to undo.
Plugin or adapter extensibility — the project provides extension points (plugin interfaces, adapters, drivers) that allow integration with multiple backends, transports, or protocols without forking the core.

This criterion covers internal structure only — interface stability, separation of concerns, and extensibility. The project's external protocol choices (whether it emits open standards such as OpenTelemetry or MCP) are scored separately under Criterion 6; do not score protocol choices here, to avoid double-counting.

Two-economies qualification. For faculties (the composable economy) this criterion is applied without qualification. For spine and control-plane candidates (the trust economy — runtime isolation, confidential execution, assurance, durable trajectories), composability is evaluated at the seam and evidence-format layer, not the trust-chain layer: an integrated cryptographic trust chain — attestation bound to sealed state bound to signed audit — is expected and is not penalized as "coupling." What is scored is whether the candidate's evidence formats and interfaces follow adopted seams (Criterion 6) so a different implementation can verify and consume them, even where the trust relationships are not swappable link-by-link. See reference-architecture.md § Two economies.

Scoring guidance

| Score | Condition |
|-------|-----------|
| 3 | Documented, versioned external API, SDK, or interface; clear separation of concerns (does one capability); extension points for adapters or backends. Replacing it behind the capability interface would require changes only at the integration boundary, not in the surrounding system. |
| 2 | External interface exists but is partially undocumented or has weak versioning semantics; OR concerns are slightly bundled but separable in practice (e.g., a core library plus optional extensions that can be excluded); OR extension points are limited. State specific caveats. |
| 1 | Internal coupling makes it difficult to use as a library or service without adopting a significant portion of its runtime or abstractions; OR the external interface is unstable (frequently breaking) or absent (requiring internal-function calls). |
| 0 | Monolithic, all-or-nothing adoption; no stable external interface; no extension points. |

Criterion 6: Open-standards alignment

Definition

The project emits and speaks open, widely-adopted standards that allow it to interoperate with the rest of the agent-native stack without requiring proprietary adapters.

This criterion is distinct from composability (criterion 5), which is about the project's internal structure. Open-standards alignment is about the project's external protocol choices — specifically, whether the signals it produces and the protocols it speaks can be consumed by any conformant tooling, or only by tooling written specifically for this project.

Two standards are first-class for the reference architecture:

OpenTelemetry (OTel) — agent conventions — the emerging standard for structured observability in agent systems. A project that emits traces, metrics, and logs in OTel format can be observed by any OTel-compatible backend (Jaeger, Grafana Tempo, Honeycomb, Datadog OTel receiver, etc.). A project that emits proprietary structured logs requires project-specific integrations to connect to the Durable Trajectories spine or to observability tooling.
Model Context Protocol (MCP) — the open protocol for tool-serving and tool-calling in agent systems, with growing support across frameworks. A project in the Tools & Effectors or Orchestration layers that speaks MCP can interoperate with any MCP-compatible agent framework without a bespoke adapter.

Additional standards to check by layer:

OpenAPI / AsyncAPI — for projects that expose REST or event-driven APIs; machine-readable specs enable client generation and contract testing.
SPIFFE/SPIRE, JWT, OAuth2 — for projects in the Identity & Delegation layer; these are the open standards for workload identity and authorization.
Agent-to-Agent (A2A) — for orchestration-layer projects; the emerging cross-framework agent interoperability protocol.
Attestation & runtime supply-chain formats — for Runtime Isolation & Governance and confidential-execution candidates: SEV-SNP report, TDX quote, DCAP collateral, Arm CCA token, and RA-TLS for attestation evidence; OSV, in-toto/SLSA, and CVSS for runtime supply-chain decisions. A candidate that emits attestation evidence a third party can verify without trusting the host scores higher than one with a proprietary evidence blob — this is the trust economy's equivalent of open-standards alignment.

A project that pre-dates a standard and has not adopted it is not automatically penalized — the evaluation notes whether adoption is roadmapped or whether the project has expressed intent. A project that has actively chosen a proprietary format in a domain where an open standard exists and has traction is penalized.

Scoring guidance

| Score | Condition |
|-------|-----------|
| 3 | Emits OTel-format signals (traces and/or metrics) with conformant agent-convention attributes; speaks or serves MCP where applicable to its layer; speaks layer-appropriate open standards (SPIFFE, A2A, OpenAPI) where relevant. Proprietary formats are used only for capabilities not yet covered by an open standard. |
| 2 | Partial OTel support (e.g., traces but not agent-specific semantic conventions; or OTel via a community plugin rather than built-in); MCP support roadmapped or available via adapter; OR the project pre-dates the relevant standards and has not yet adopted them but has a stated plan. State which standards are partial and what the adoption status is. |
| 1 | No OTel support; no MCP support; speaks proprietary formats only; no stated roadmap toward open standards. The project is usable but creates integration debt that accumulates as the open-standards ecosystem matures. A score of 1 on this criterion requires a documented integration-cost assessment in the per-layer evaluation (the specific adapter or conversion work an enterprise would have to build and maintain), parallel to the written-rationale requirement for other 1s. |
| 0 | Actively resists standard adoption (e.g., documented decision to use proprietary formats in preference to open standards with no stated intent to change); or emits signals in a format that is incompatible with the standard without a conversion path. |

Scoring template

Copy this template into each per-layer evaluation document (Task B3) for each candidate project.

## Qualification rubric — [Project Name] [version/commit] [date]

**Rubric version:** v1.1

| Criterion | Score (0–3) | Evidence | Caveats |
|-----------|-------------|----------|---------|
| 1. License | | | |
| 2. Forkability / vendor independence | | | |
| 3. Health & bus factor | | | |
| 4. Production adoption & security posture | | | |
| 5. Composability | | | |
| 6. Open-standards alignment | | | |
| **Total** | **/18** | | |

**Interpretation:** [Strong / Viable with caveats / Weak / Disqualified]

**Any criterion scored 0:** [Yes / No] — if yes, see disqualification rule.

**Honest-tension provision invoked:** [Yes / No] — if yes, see § Honest tension and state the specific justification.

**Recommendation:** [Recommended / Recommended with documented caveats / Not recommended]
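For evaluators who fill in many templates, the score table can be generated rather than typed. This is a minimal sketch with hypothetical names; it reproduces only the table rows shown in the template and assumes the evidence and caveat strings contain no `|` characters.

```python
from __future__ import annotations

CRITERIA = [
    "1. License",
    "2. Forkability / vendor independence",
    "3. Health & bus factor",
    "4. Production adoption & security posture",
    "5. Composability",
    "6. Open-standards alignment",
]


def render_table(rows: list[tuple[int, str, str]]) -> str:
    """Render the rubric table from six (score, evidence, caveat) tuples."""
    if len(rows) != len(CRITERIA):
        raise ValueError("expected one row per criterion")
    lines = [
        "| Criterion | Score (0–3) | Evidence | Caveats |",
        "|-----------|-------------|----------|---------|",
    ]
    for name, (score, evidence, caveat) in zip(CRITERIA, rows):
        lines.append(f"| {name} | {score} | {evidence} | {caveat} |")
    total = sum(score for score, _, _ in rows)
    lines.append(f"| **Total** | **{total}/18** | | |")
    return "\n".join(lines)
```

The interpretation, zero-score, honest-tension, and recommendation lines are left for the evaluator to write by hand, since each requires judgment.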

Honest tension

RAMPART (Microsoft's agent assurance toolkit, MIT-licensed) and OpenShell (NVIDIA's runtime isolation layer, Apache-2.0-licensed) are the current reference implementations for the Assurance and Runtime Isolation layers respectively. Both are single-vendor projects at the time of this writing — RAMPART is a Microsoft project and OpenShell is an NVIDIA project — and both score weaker on criterion 2 (forkability / vendor independence) than a foundation-governed project would.

This is an honest tension, and it is not resolved by pretending it does not exist. It is resolved by two structural mechanisms stated in the reference architecture (§ Defining principle and spec §5):

Mechanism 1 — capability-first, implementation-swappable. The reference architecture defines capabilities and interfaces; implementation recommendations are separate, dated, and replaceable. RAMPART and OpenShell are the recommended implementations for their respective layers at this point in time. If either project's status changes — the vendor disengages, the codebase is abandoned, a superior alternative emerges — the capability layer persists and a different implementation slots in behind the same interface. The architecture does not collapse with the implementation.

Mechanism 2 — permissive-license forkability backstop. Both RAMPART (MIT) and OpenShell (Apache-2.0) are genuinely permissively licensed. If Microsoft or NVIDIA were to disengage, the permissive license means anyone — including the enterprise using it, or the community — can fork the project and carry it forward without any legal barrier. The fork-ability is real, even if the practical community capacity to sustain a fork at launch is limited (criterion 2 caveat).

These two mechanisms together mean that single-vendor projects with permissive licenses can be recommended for a layer even when criterion 2 scores 1 or 2, provided:

The per-layer evaluation states the criterion 2 score and caveat explicitly.
The evaluation documents why no foundation-governed alternative exists or is sufficiently mature.
The capability-layer interface is defined precisely enough that a swap is operationally feasible.
The permissive license is confirmed (criterion 1 score of 3).

This honest-tension provision applies narrowly — it is not a general license to recommend vendor-controlled projects. It applies where a single-vendor project is genuinely the best available implementation for a capability, the license provides a real forkability backstop, and the per-layer evaluation documents the reasoning transparently.
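The four conditions of the provision can be expressed as a checklist. The sketch below uses hypothetical parameter names; the three boolean inputs record whether the per-layer evaluation actually contains the required statements, which only a reviewer can judge.

```python
from __future__ import annotations


def unmet_conditions(
    license_score: int,
    forkability_score: int,
    caveat_stated: bool,
    alternative_gap_documented: bool,
    interface_swap_feasible: bool,
) -> list[str]:
    """Return the provision's unmet conditions; an empty list means it applies."""
    unmet = []
    if forkability_score not in (1, 2):
        unmet.append("provision covers criterion 2 scores of 1 or 2")
    if not caveat_stated:
        unmet.append("criterion 2 score and caveat not stated explicitly")
    if not alternative_gap_documented:
        unmet.append("no documented reason a foundation-governed alternative is absent or immature")
    if not interface_swap_feasible:
        unmet.append("capability interface not precise enough for a swap")
    if license_score != 3:
        unmet.append("permissive license not confirmed (criterion 1 must be 3)")
    return unmet
```

Returning the list of gaps, not a bare yes or no, keeps the reasoning visible in the per-layer evaluation, in line with the requirement that the provision be invoked transparently.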

Relationship to the running benchmark

The qualification rubric and the running benchmark are complementary instruments at different positions in the evaluation workflow.

The qualification rubric is applied first, to every candidate for every layer. It answers the meta-test (can an enterprise adopt and leave this?) using public signals that any evaluator can verify in a few hours. It produces a score and a recommendation. It does not measure performance, correctness, or behavioral fidelity.

The running benchmark is built for one or two high-value layers where performance and correctness are the differentiating variables between candidates that have all cleared the rubric. It requires weeks of investment to build per layer and is not economically justified for every capability. The Assurance layer (where RAMPART provides the tooling itself) is the first benchmark target. Other layers are evaluated by rubric alone unless a specific, documented reason justifies the benchmark investment.

A project must clear the qualification rubric before being evaluated by the benchmark. A project that fails the rubric is not benchmarked — regardless of its performance characteristics, an implementation that cannot be adopted and left is not suitable for the reference architecture.

NotionAlpha OSS AI Lab — notionalpha.com