SHACL conformance: what we ship and why

A consumer integrates against a substrate. They read the documentation. They build their pipelines around the schema they were told to expect. A release later, a field is renamed - or, more subtly, a constraint they assumed is no longer being enforced upstream. The consumer’s pipeline keeps running. The data flowing through it is silently malformed. The first time anyone notices is when a downstream query starts returning shapes it shouldn’t.

The defence against this is not “trust the publisher”. The defence is conformance shapes shipped with every release - a machine-readable contract the consumer can validate against, on demand, without depending on the publisher’s word. We ship ours in SHACL. This note is about what that shipment actually contains, what it commits us to, and why we treat conformance not as an optional add-on but as a load-bearing part of every substrate release.

01 / The trust failure mode

A substrate without conformance shapes makes a category of promise it cannot keep. The promise is “the data conforms to this schema”. The problem is that the schema lives in documentation, the data lives in the release, and the relationship between the two is whatever the publisher’s discipline happened to enforce in that particular build. There is no external check. The consumer can verify nothing the publisher does not voluntarily expose to verification.

This is not abstract. We have all seen the failure mode. A field that was always present becomes sometimes-absent. A relationship that was always typed becomes sometimes-string-labelled. A provenance object that was always populated becomes sometimes-null. None of these break a query immediately. All of them break trust irreversibly the first time someone notices and asks how long it has been like this.

Documentation cannot solve this. Tests in the publisher’s codebase cannot solve this for the consumer. The only thing that does is a contract the consumer can run themselves, against the data they actually received, on the version they are actually consuming. That is what conformance shapes are.

02 / What conformance shapes do, in principle

SHACL (the Shapes Constraint Language) is the W3C standard for validating RDF graph data against a set of constraints. The constraints are written as shapes - each shape targets a class of nodes and specifies what is and is not permissible about them. A shape can require a field to be present, restrict it to a particular type, bound its cardinality, restrict its values to a controlled vocabulary, or assert relationships between nodes. The output of running a shape against data is a conformance report - a structured statement of where the data conforms and where it does not, with enough detail that a consumer can act on the result.
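To make the idea of a shape concrete, here is a minimal plain-Python sketch of the same mechanics: a shape that targets a class, a handful of property constraints, and a report of violations. It is an illustration only, not SHACL itself and not our shapes; the names `Shape`, `PropertyConstraint`, `Violation`, `validate` and the toy `Dataset` nodes are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PropertyConstraint:
    path: str                            # field the constraint applies to
    min_count: int = 0                   # lower cardinality bound
    max_count: Optional[int] = None      # upper cardinality bound (None = unbounded)
    datatype: Optional[type] = None      # required Python type of each value
    allowed: Optional[frozenset] = None  # controlled vocabulary, if any


@dataclass(frozen=True)
class Shape:
    name: str
    target_class: str   # class of nodes this shape applies to
    properties: tuple   # PropertyConstraint instances


@dataclass(frozen=True)
class Violation:
    node_id: str
    shape: str
    path: str
    message: str


def validate(nodes, shapes):
    """Return (conforms, violations) for a list of node dicts.

    Each node is a dict with "id", "type", and list-valued fields.
    """
    violations = []
    for shape in shapes:
        for node in nodes:
            if node.get("type") != shape.target_class:
                continue  # shape does not target this node
            for pc in shape.properties:
                values = node.get(pc.path, [])
                if len(values) < pc.min_count:
                    violations.append(Violation(
                        node["id"], shape.name, pc.path,
                        f"fewer than {pc.min_count} value(s)"))
                if pc.max_count is not None and len(values) > pc.max_count:
                    violations.append(Violation(
                        node["id"], shape.name, pc.path,
                        f"more than {pc.max_count} value(s)"))
                for value in values:
                    if pc.datatype is not None and not isinstance(value, pc.datatype):
                        violations.append(Violation(
                            node["id"], shape.name, pc.path,
                            f"{value!r} is not of type {pc.datatype.__name__}"))
                    if pc.allowed is not None and value not in pc.allowed:
                        violations.append(Violation(
                            node["id"], shape.name, pc.path,
                            f"{value!r} is outside the controlled vocabulary"))
    return (not violations, violations)


# Hypothetical shape: a Dataset needs exactly one string title
# and at least one language drawn from a small vocabulary.
dataset_shape = Shape(
    name="DatasetShape",
    target_class="Dataset",
    properties=(
        PropertyConstraint("title", min_count=1, max_count=1, datatype=str),
        PropertyConstraint("language", min_count=1, allowed=frozenset({"en", "fr"})),
    ),
)

nodes = [
    {"id": "n1", "type": "Dataset", "title": ["Registry"], "language": ["en"]},
    {"id": "n2", "type": "Dataset", "language": ["English"]},  # no title, bad language
]

conforms, violations = validate(nodes, [dataset_shape])
for v in violations:
    print(v.node_id, v.shape, v.path, v.message)
```

A real SHACL engine does the same job over RDF graphs, with a far richer constraint vocabulary; the point of the sketch is only the structure of targets, constraints, and a report that names the node, the field, and the rule.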

We use SHACL because it does this job correctly for graph-shaped data, which our substrate is. There are alternatives - JSON Schema for JSON documents, OWL reasoning for ontological consistency checks - but each is solving a different problem. SHACL is the right shape of test for the right shape of artefact.

03 / What categories of constraint we encode

The actual shapes evolve release-to-release. What does not evolve is the set of constraint categories we hold the substrate to. There are five.

Mandatory provenance. Every node and every relationship in the substrate carries provenance. Conformance shapes enforce this at the structural level - a node without provenance is not a substrate-valid node, and the shapes will flag it. This is Aletheia translated into a test.

Typed relationships. Every relationship between nodes carries a named type. A relationship without a type, or with a type outside the substrate’s vocabulary of relationship types, is non-conformant. This is the structural prohibition on the “knowledge graph that is really a taxonomy with unlabelled arrows” failure mode the ontology vs taxonomy piece named.

Identifier conformance. Every entity has an identifier in a stable, recognisable form. Identifiers from outside the substrate’s identifier space, or identifiers that do not resolve, are flagged.

Cross-reference integrity. When a node refers to another node, that referent has to exist within the same substrate version. Dangling references - the kind that produce silent partial results when queries traverse them - are caught at conformance time, not at query time.

Value-domain restrictions. Fields with constrained value spaces (jurisdictions in ISO 3166, languages in ISO 639, currencies in ISO 4217, the substrate’s own controlled vocabularies) are restricted to those value spaces. A field that should hold a country code but holds a country name is non-conformant.

These five categories are not all of conformance, but they are the ones whose violation is most likely to look fine for months and then quietly corrupt a downstream consumer’s view of the substrate. They are also the ones a consumer most needs to be able to test independently.
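The five categories above can be sketched as plain checks over a toy graph. This is a hedged illustration, not the substrate's shapes: the `sub:` identifier pattern, the relationship vocabulary, the three-entry country list, and the node and edge layout are all assumptions made up for the example, and the identifier check covers form only, not whether an identifier resolves.

```python
import re

# All three of these are hypothetical stand-ins for the example.
ID_PATTERN = re.compile(r"^sub:[a-z0-9-]+$")   # assumed identifier form
RELATION_TYPES = {"locatedIn", "issuedBy"}     # assumed relationship vocabulary
COUNTRY_CODES = {"DE", "FR", "GB"}             # tiny stand-in for ISO 3166 alpha-2


def check_release(nodes, edges):
    """Check a toy graph against the five constraint categories.

    nodes: {node_id: {field: value}}; edges: list of dicts.
    Returns a list of (category, subject, detail) findings.
    """
    findings = []
    for node_id, node in nodes.items():
        if not node.get("provenance"):
            findings.append(("mandatory-provenance", node_id,
                             "node has no provenance"))
        if not ID_PATTERN.match(node_id):
            findings.append(("identifier-conformance", node_id,
                             "identifier is outside the expected form"))
        country = node.get("country")
        if country is not None and country not in COUNTRY_CODES:
            findings.append(("value-domain", node_id,
                             f"{country!r} is not a known country code"))
    for i, edge in enumerate(edges):
        label = f"edge[{i}]"
        if not edge.get("provenance"):
            findings.append(("mandatory-provenance", label,
                             "relationship has no provenance"))
        if edge.get("type") not in RELATION_TYPES:
            findings.append(("typed-relationships", label,
                             f"type {edge.get('type')!r} is not in the vocabulary"))
        for end in ("source", "target"):
            if edge.get(end) not in nodes:
                findings.append(("cross-reference-integrity", label,
                                 f"{end} {edge.get(end)!r} does not exist in this release"))
    return findings


nodes = {
    "sub:org-1": {"provenance": "registry-2024", "country": "DE"},
    "sub:org-2": {"provenance": None, "country": "Germany"},  # no provenance, name not code
    "ORG_3": {"provenance": "registry-2024"},                 # identifier in the wrong form
}
edges = [
    {"source": "sub:org-1", "target": "sub:org-2",
     "type": "issuedBy", "provenance": "registry-2024"},
    {"source": "sub:org-1", "target": "sub:org-9",            # dangling target, untyped
     "type": None, "provenance": "registry-2024"},
]

findings = check_release(nodes, edges)
for category, subject, detail in findings:
    print(f"{category}: {subject}: {detail}")
```

Note that none of the flawed records in the toy data would make a simple lookup fail; each would only surface later, as a missing join or a partial result, which is the reason these categories are checked up front.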

04 / How a consumer uses the shapes

The intended flow is short, and is the same regardless of how the consumer is integrating.

The substrate release arrives, alongside the SHACL shapes for that release. The consumer runs the shapes against the data - using any SHACL-conformant validator; we do not require ours. The consumer receives a conformance report. If the report is clean, the release conforms to the schema it claims; the consumer can proceed with confidence that the structural contract is being honoured. If it is not clean, it tells them exactly which nodes, which relationships, and which fields are non-conformant, with the shape rule that was violated for each.
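The flow above can be sketched as a small ingestion gate. This is a minimal sketch under stated assumptions: the validator is passed in as a callable because any conformant engine will do, and `Report`, `Result`, `accept_release` and `stub_validator` are hypothetical names. The report fields loosely mirror what a SHACL validation report carries (whether the data conforms, and for each result the focus node, the path, and the source shape).

```python
from collections import defaultdict
from typing import Callable, NamedTuple


class Result(NamedTuple):
    focus_node: str  # node or relationship that failed
    path: str        # field that failed
    shape: str       # shape rule that was violated


class Report(NamedTuple):
    conforms: bool
    results: tuple


# Any engine that maps (data, shapes) to a Report can be plugged in.
Validator = Callable[[object, object], Report]


def accept_release(data, shapes, validator: Validator):
    """Run the shipped shapes against the received data.

    Returns (True, {}) when the report is clean, otherwise
    (False, {shape_name: [(focus_node, path), ...]}).
    """
    report = validator(data, shapes)
    if report.conforms:
        return True, {}
    by_shape = defaultdict(list)
    for result in report.results:
        by_shape[result.shape].append((result.focus_node, result.path))
    return False, dict(by_shape)


def stub_validator(data, shapes):
    """Stand-in for a real SHACL engine: flags records missing a required field."""
    results = tuple(
        Result(record["id"], required, shape_name)
        for shape_name, required in shapes
        for record in data
        if required not in record
    )
    return Report(conforms=not results, results=results)


data = [{"id": "n1", "provenance": "registry-2024"}, {"id": "n2"}]
shapes = [("ProvenanceShape", "provenance")]

ok, failures = accept_release(data, shapes, stub_validator)
if not ok:
    for shape_name, hits in failures.items():
        print(shape_name, hits)
```

Grouping failures by shape rule is one reasonable choice for a consumer: it shows at a glance whether a release has one systematic problem or many unrelated ones.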

The shapes are also useful in the other direction. A consumer producing data against the substrate - extending it with their own annotations, deriving new relationships, building a custom asset on top - can run the same shapes against their own derived data to verify they have produced something that integrates back cleanly. The conformance contract is symmetric.

05 / Why we ship them with every release

A specification that lives at the substrate’s “stable” layer does not solve the consumer’s problem. The consumer is integrating against a particular release, with a particular set of nodes, on a particular substrate version. The conformance shapes have to ride along with that release, frozen at the same version, so the validation result is meaningful for the data the consumer actually has.

This is the versioned commitment from the substrate’s six engineering principles, applied to the validation layer. Shapes pinned to a release are testable against that release. Shapes that float relative to releases are documentation, not contracts.

We also ship the diff between the shapes from the previous release and the current release. A consumer upgrading substrate versions can see exactly which constraints have tightened, which have loosened, and which have been added or removed - before they touch their pipeline. This is the reverse of the trust failure mode the note opened with: nothing changes silently, because the conformance surface itself is version-controlled.
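As an illustration of what such a diff has to classify, here is a minimal sketch that compares cardinality bounds between two releases. It is not the format of the diff we ship: the representation of a constraint as a `(shape, path)` key with `min_count` and `max_count`, and the function name `diff_shapes`, are assumptions for the example, and real shapes carry many more kinds of constraint than cardinality.

```python
import math


def _bounds(constraint):
    """Return the (low, high) cardinality interval; a missing max means unbounded."""
    low = constraint.get("min_count", 0)
    high = constraint.get("max_count")
    return low, math.inf if high is None else high


def diff_shapes(old, new):
    """Classify constraint changes between two releases.

    old, new: {(shape_name, path): {"min_count": int, "max_count": int or None}}
    """
    changes = {"added": [], "removed": [], "tightened": [], "loosened": [], "changed": []}
    for key in sorted(old.keys() | new.keys()):
        if key not in old:
            changes["added"].append(key)
        elif key not in new:
            changes["removed"].append(key)
        else:
            (old_low, old_high) = _bounds(old[key])
            (new_low, new_high) = _bounds(new[key])
            if (old_low, old_high) == (new_low, new_high):
                continue  # unchanged
            if new_low >= old_low and new_high <= old_high:
                changes["tightened"].append(key)  # new interval sits inside the old one
            elif new_low <= old_low and new_high >= old_high:
                changes["loosened"].append(key)   # new interval contains the old one
            else:
                changes["changed"].append(key)    # shifted: neither stricter nor laxer
    return changes


previous = {
    ("NodeShape", "provenance"): {"min_count": 0},
    ("NodeShape", "label"): {"min_count": 1, "max_count": 1},
    ("NodeShape", "alias"): {"max_count": 3},
}
current = {
    ("NodeShape", "provenance"): {"min_count": 1},  # now mandatory
    ("NodeShape", "label"): {"min_count": 1},       # upper bound dropped
    ("NodeShape", "country"): {"min_count": 1},     # new constraint
}

changes = diff_shapes(previous, current)
for kind, keys in changes.items():
    print(kind, keys)
```

For a consumer, the tightened and added entries are the ones to read first: data that passed under the previous release may fail under the new one, whereas loosened and removed constraints mean the consumer can no longer assume something the publisher used to guarantee.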

06 / What this commits us to

Shipping conformance shapes is the part of the discipline that costs the most and earns the most. It costs the most because it removes the publisher’s ability to be slightly wrong. The substrate has to actually conform to the shapes it ships - which means every loader, every refresh, every release pipeline has to run conformance on its own output before release. A non-conforming release is not a release.

It earns the most because it is the structural form of Parrhesia. We do not ask consumers to trust the substrate’s claims about its own structure. We give them the tools to test those claims, against the data they actually have, on the version they are actually consuming. The substrate’s honesty becomes something the consumer can verify, not something they have to take on faith.

The next note in this thread will return to the architectural arc the anchor essay opened: provenance at generation time - what it actually takes to record, as part of the response itself, which substrate elements grounded an AI inference, and why doing this after the fact is forensics, not provenance.