Schema data

Schema Designer provides rich metadata layer for defining data. It is used to define the data within the Data Hub product.

The schema definitions can also be used to specify JSON data sets. This data has a standard format, which we call "schema data", and can be processed according to the specification using Schema Data Script.

Schema data format

Schema data is represented in JSON. To allow for efficient streaming, it uses arrays to structure the data rather than objects.

The general structure of the data is:

  • The schema data is an array of entities.
  • Each entity is represented by an array.
  • The first item in the entity array if a reference for the entity.
  • The second item in the entity array is the data. This can be either:
    • An array of objects, representing data records.
    • A single object, representing a single data record.

There are two reserved entity references:

  • control - this is used to control processing. This includes the standard properties:
    • mode
    • entity
  • options - this is used for holding user-supplier options

If present, control and options must be the first entities in the list, and in that order.

Here is an example.

[
[
"control",
{
"mode": "source|target|source_entity",
"entity": "reference",
   }
],
[
"options",
{
}
],
[
"entity_reference",
[
{ .. record .. },
... more records
]
],
... more entities
]

Any entity can be represented by a single object or an array.

mode indicates the mode of the data:

  • source indicates that this is source data, using source data names. This is the default.
  • target indicates that this is target data, using target data names.
  • source_entity indicates that all the data is for a single entity. The reference of the entity is given in entity. The reference field on the entities is ignored.

options contains a set of options that apply to the dataset.

entities is an array of objects, each one of which represents data for one entity. The entity object contains a reference to the entity and a data array containing the records for the entity in JSON format.

The schema data can easily be mapped two or from a set of tables, such as HTML tables or tabs in an Excel spreadsheet. In this case, entities with an object are shown as a two-column table with the column headings "Property" and "Value". Entities with arrays are shown as data tables.

Conventionally, control and options are held as objects because a single object is expected, and other entities held as an array of objects. However, processes must allow for control and options to be held as an array of objects (only the first item is used), or for other entities to be held as a single object (which is equivalent to a single item array containing the object).

Child data

Schema data supports children by using numeric suffixes on fields. For example, if a product records has multiple prices of different types, you can encode a repeating group of two with columns like "Price Type 1", "Price 1", "Price Type 2", "Price 2". The numeric suffix must be separated by a space, not other punctuation, even when field names are standardised. So the standardised field might be "price_type 2", not "price_type_2". If you happen to have a non-child field that does end in a number, code it without a space, e.g. "contact_phone_1 2" would mean the "contact_phone_1" field on the second child.

Conversion of these to child records depends on them bring unambiguously identifiable in the source. Some manipulation of the source definition may be required to ensure uniqueness for complicated child data.

Only one level of child specification is supported.

Child data fields with suffixes are converted during validation.

Link fields

Link fields represent links from a child record to a parent record.

There are three ways of dealing with links.

  • You can populate a link field directly. The link field is passed on without validation.
  • You can use an alternative field definition to redefine a target link field as a field of another type (such as number). This permits field validation and makes sure that link field processing (see below) is not carried out.
  • If the link field is in the source entity definition but is not populated in the record, then link field processing is performed as part of validation.

Link field processing creates a link object that represents the values that are required to determine the parent. It works this way:

  • All the fields in link field's sourceLinkKey are retrieved. If any are missing (undefined, not just null), the key is assumed to be unpopulated and the link field is not generated.
  • Assuming all the fields are present, the link field is created as an object containing only those fields.

The sourceLinkKey is usually a single reference, but may be an array.

In this example addressing a parcel in the internal post, the "destination" link field requires a site and room number.

"sourceLinkKey": ["site","room_number"]

Source record before link field added:

{
"site": "HQ",
"room_number": "817",
"recipient": "Henry Jones"
}

After link field added:

{
"site": "HQ",
"room_number": "817",
"recipient": "Henry Jones",
"destination": {
"site": "HQ",
"room_number": "817"
 }
}

The link object is converted into suitable parent fields by the resolve() method.

You can create additional fields with a null target reference (enter the reference as "null") and use these to represent fields that are only used to hold link key fields. These source fields are removed from the record as part of the target() method, but not from the link object. If you have more than one link field to the same entity (for example, two links from a "project" entity to "person", one as a sponsor and one as a manager), you will need this approach to prevent target field duplication.

Options application

The options must be defined in an "options" source entity. This can be defined using an entity mapping with no target entity, but with the target entity reference set to "options".

In source and source_entity mode, options can also hold defaults for table data. These are applied to the entities as part of validation. The option properties and the table data properties should match exactly when applying option defaults.

Schema data methods

Scripts based on Schema Data Processor Script support a number of methods to manipulate schema data. See the script documentation for more details.

Schema Data Validate Processor Script

Checks that schema data conforms to the schema, and forces the data to validity.

Validity includes:

  • Entity and field references. These are looked up from names if required.
  • Mandatory fields are present.
  • Correct data types.
  • Application of options.
  • Application of repeating child data.
  • Creation of link field resolution objects

The validation is strict - it raises an error if there is unexpected data, to guard against misspellings.

Schema Data Join Processor Script

In source_entity mode, joins the data from all the entities to create a entity data, sets the reference, and turns the mode to source. Ignored if mode is not source_entity.

The data sets are joined in turn.

  • Where the data sets share the same columns, an inner join is used on those columns.
  • Where the data sets do not have the same columns, a cartesian join is used.
Schema Data Target Processor Script

Convert source data to target data, but do not perform resolution. This includes:

  • Change entities and properties to use target references, not source references.
  • Remove fields with null target references.

All the functions are run using the first mapping in mappings list for the source entity.