Validation Rules
Overview
When Schematic validates a manifest, it uses a data model. The data model contains a list of validation rules for each component (data type). This document describes all validation rules currently implemented by Schematic.
Rules change the validation behavior. They must be taken from the following pre-specified list, formatted in the indicated ways, and added to the data model to apply. An example data model using each rule is available for reference.
The column the rule refers to must be in the manifest for validation to happen.
Rules can optionally be configured to raise errors, which prevent manifest submission when invalid entries are found, or warnings, which allow submission despite invalid entries in the attribute the rule is set for. Validators will be notified of the invalid values in both cases. The default message level for each rule is listed below under that rule.
Attributes that are not required will raise warnings when invalid entries are identified. For attributes that are not required, if a user does not submit a response, no warning or error will be logged. To raise an error for missing entries, set Required to True in the data model.
Metadata validation is an explicit step in the submission pipeline and must be run either before or during manifest submission.
Validation Types
Validation rules are just one type of validation run by Schematic to ensure that submitted manifests conform to the expectations set by the data model.
This page details how to use Validation Rules, but please refer to this documentation to learn about the other types of validation.
Rule Implementation
Some validation rules are handled by Schematic itself, while others are handled by using the Great Expectations library.
Rule | In-House | Great Expectations (GX) | JSON Schema Validation |
---|---|---|---|
list | ✓ | | |
regex module | ✓ | | |
float | ✓ | ✓ | |
int | ✓ | ✓ | |
num | ✓ | ✓ | |
string | ✓ | ✓ | |
url | ✓ | | |
matchAtLeastOne | ✓ | | |
matchExactlyOne | ✓ | | |
matchNone | ✓ | | |
recommended | ✓ | | |
protectAges | ✓ | | |
unique | ✓ | | |
inRange | ✓ | | |
date | ✓ | | |
required | ✓ | | |
valid values | ✓ | | |
Rule Types and Details
List Validation Type
list
Use to parse the imported value to a list of values and (optionally) to verify that the user provided value was a comma separated list, depending on how strictly entries must conform to the list structure. Values can come from Valid Values.
Format:
list <conformity level> <raised message level>
list strict
Validates that entries are comma separated lists, and parses into list
Requires all attribute entries to be comma-delimited, even lists with only one element (lists with a trailing comma)
list like
Assume entries are either lists or like a list but do not verify that entries are comma separated lists, and attempt to parse into a list
Single values, or lists of length one, can be entered without a comma delimiter
The list rule can be used in conjunction with the regex rule to validate that the items in a list follow a specific pattern. See the list::regex rule below in Rule Combinations. All the values in the list need to follow the same pattern. This is ideal for when users need to provide a list of IDs.
Default behavior: raises
error
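As a rough illustration of how the two conformity levels differ, the sketch below parses a raw cell value under each one. The function name parse_list_entry and its error handling are hypothetical and are not Schematic's actual implementation; the sketch only mirrors the behavior described above.

```python
def parse_list_entry(value: str, conformity: str = "strict") -> list[str]:
    """Parse a manifest cell into a list (hypothetical helper, not Schematic's API)."""
    if conformity not in ("strict", "like"):
        raise ValueError(f"unknown conformity level: {conformity}")
    # 'strict' requires a comma even for a single element, e.g. "a,".
    if conformity == "strict" and "," not in value:
        raise ValueError(f"'{value}' is not a comma-separated list")
    # Split on commas, trim whitespace, and drop the empty item
    # left behind by a trailing comma.
    return [item.strip() for item in value.split(",") if item.strip()]


print(parse_list_entry("a,b,c", "strict"))  # ['a', 'b', 'c']
print(parse_list_entry("a,", "strict"))     # ['a']
print(parse_list_entry("a", "like"))        # ['a']
```

Under this sketch, a single value such as "a" is accepted by list like but rejected by list strict unless it carries a trailing comma.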
Regex Validation Type
regex
Use the regex validation rule when you want to require that a user input values in a specific format, e.g. an ID that follows a particular format.
Format:
regex <module> <regular_expression> <raised message level>
Module: the function from the Python re module that you want to use. A common one is search. Refer to the Python re documentation to find the most appropriate one to use.
Single spaces separate the three strings.
Example:
regex search [0-9]{4}\/[0-9]*
The regular expression defined above allows comparison to an expected format of a histological morphology code.
Default behavior: raises
error
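To make the rule format concrete, here is a minimal sketch that splits a rule string on single spaces and applies the named re function to a value. The helper regex_rule_passes is hypothetical, handles only search and match, and ignores the optional message level. It is not how Schematic itself is implemented.

```python
import re


def regex_rule_passes(rule: str, value: str) -> bool:
    """Apply a 'regex <module> <regular_expression>' rule to one value (illustrative only)."""
    # Single spaces separate the rule name, the re function, and the pattern.
    _, module, pattern = rule.split(" ")[:3]
    if module not in ("search", "match"):
        raise ValueError(f"unsupported module in this sketch: {module}")
    # re.search scans the whole string; re.match only matches at the start.
    return getattr(re, module)(pattern, value) is not None


rule = r"regex search [0-9]{4}\/[0-9]*"
print(regex_rule_passes(rule, "8140/3"))          # True
print(regex_rule_passes(rule, "adenocarcinoma"))  # False
print(regex_rule_passes(r"regex match [0-9]{4}\/[0-9]*", "code 8140/3"))  # False
```

The last call shows why the choice of module matters: match fails because the code does not appear at the start of the string, whereas search would find it.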
Note
regex101.com is a tool that can be used to build and validate the behavior of your regular expression.
If the module specified is match for a given attribute’s validation rule, regex match validation will be performed in Google Sheets (but not Excel) in real time during metadata entry.
The strict_validation parameter (in the config.yml file for CLI or in manifest generation REST API calls) sets whether to stop the user from entering incorrect information in a Google Sheets cell (strict_validation = true) or simply throw a warning (strict_validation = false). Default: true.
regex validation in Google Sheets is different than standard regex validation (for example, it does not support validation of digits). See this documentation for details on Google regex syntax. It is up to the user/modeler to validate that regex match is working in their manifests as intended. This is especially important if the strict_validation parameter is set to True, as users will be blocked from entering incorrect data. If you are using Google Sheets and do not want to use real-time validation, use regex search instead of regex match.
Type Validation Type
Format:
<type> <warning level>
The first parameter is the type and must be one of [float, int, num, str].
The second, optional parameter is the message level and must be one of [error, warning]; it defaults to error.
Examples: [str, str error, str warning]
float
Checks that the value is a float.
int
Checks that the value is an integer.
num
Checks that the value is either an integer or float.
str
Checks that the value is a string (not a number).
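The four type checks can be pictured with a short sketch. It assumes values have already been parsed into Python objects, and both the helper name type_rule_passes and the decision to treat booleans as non-integers are assumptions made for illustration; Schematic's own checks may differ.

```python
def type_rule_passes(rule_type: str, value: object) -> bool:
    """Check one already-parsed value against a type rule (hypothetical sketch)."""
    # bool is a subclass of int in Python, so exclude it explicitly (an assumption).
    is_int = isinstance(value, int) and not isinstance(value, bool)
    is_float = isinstance(value, float)
    checks = {
        "int": is_int,
        "float": is_float,
        "num": is_int or is_float,
        "str": isinstance(value, str),
    }
    if rule_type not in checks:
        raise ValueError(f"unknown type rule: {rule_type}")
    return checks[rule_type]


print(type_rule_passes("int", 5))    # True
print(type_rule_passes("int", 5.0))  # False
print(type_rule_passes("num", 5.5))  # True
print(type_rule_passes("str", 5))    # False
```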
URL Validation Type
url
Using the url rule implies the user should add a URL to a free text box as a string. This function will check that the user has provided a usable URL. It will check for any standard URL error and throw an error if one is found. Further additions to this rule can allow for checking that a specific type of URL is added. For example, if the user needs to ensure that the input contains a http://protocols.io URL string, http://protocols.io can be added after url to perform this check.
Format:
url <optional strings> <raised message level>
url must be specified first, then an arbitrary number of strings can be added after it (separated by spaces) to add additional levels of specificity.
Alternatively, it’s valid to pass only url to simply check if the input is a URL.
Examples:
url http://protocols.io
Will check that any input is a valid URL, and will also check that the URL contains the string http://protocols.io. If not, an error will be raised.
url dx.doi http://protocols.io
Will check that any input is a valid URL, and will also check that the URL contains the strings dx.doi and http://protocols.io. If not, an error will be raised.
Default behavior: raises
error
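The substring part of the url rule can be approximated offline as in the sketch below. It is only an illustration under stated assumptions: the helper url_rule_passes is hypothetical, it checks URL shape with urllib.parse rather than testing whether the URL is actually reachable, and it accepts only http and https schemes.

```python
from urllib.parse import urlparse


def url_rule_passes(value: str, required_strings: tuple[str, ...] = ()) -> bool:
    """Offline approximation of the url rule (illustrative; no request is made)."""
    parsed = urlparse(value)
    # Treat a value as URL-shaped only if it has an http(s) scheme and a host.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return False
    # Every additional string given after 'url' must appear in the value.
    return all(s in value for s in required_strings)


entry = "http://protocols.io/view/abc"
print(url_rule_passes(entry))                                     # True
print(url_rule_passes(entry, ("http://protocols.io",)))           # True
print(url_rule_passes(entry, ("dx.doi", "http://protocols.io")))  # False
print(url_rule_passes("not a url"))                               # False
```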
Required Validation Type
required
An attribute’s requirement is typically set using the required column (csv) or field (JSONLD) in the data model. A True value means a user must supply a value; False means they are allowed to skip providing a value.
Some users may want to use the same attribute across several manifests, but have different requirements based on the manifest/component. For example, say the data model contains an attribute called PatientID, and this attribute is used in manifests Biospecimen, Patient and Demographics. Say the modeler wants to require that PatientID be required in the Patient manifest but not Biospecimen or Demographics. In the standard Data Model format, there is only one requirement option per Attribute, so one would not be able to set requirements per component. But with the advent of component based rule settings, this can now be achieved.
Requirements can be specified per component by setting the required field in the data model to False, and using component based rule setting along with the required “rule”.
Note
This new required validation rule is not a traditional validation rule, but rather impacts the JSON validation schema. This means requirements propagate automatically to manifests as well.
When using the required validation rule, the Required column must be False in the CSV, or the Required field must be set to False in the JSONLD; otherwise the rule will not work as expected (i.e. components where the attribute is expected to not be required due to the validation rules will still be required).
Note
While using the CLI, a warning will be raised if discrepancies in requirements settings are found when running validation.
required can be used in conjunction with other rules, without restriction.
The messaging level, like all JSON validation checks, is always set at error, and is not modifiable.
required does not work with other rule modifiers, such as warning, error, etc. Though it will not throw an error if rule modifiers are added, it will not work as intended, and a warning will appear.
For example, if the rule ^^#Biospecimen required warning is added to the data model, a warning will be raised letting the user know that the rule modifier cannot be applied to required.
Using the required validation rule is the equivalent of putting True in the Required column of the CSV. If the Required column is False, and the required validation rule is used, the validation rule will override the Required column.
Controlling required through the validation rule will also impact manifest formatting (in terms of required column highlighting).
To verify that the required rule is working as expected, you can generate all impacted manifests; required columns should appear highlighted in light blue.
Examples:
#BiospecimenManifest required
For BiospecimenManifest manifests, if values are missing, an error will be raised. For all other manifests, filling out values for the attribute is optional.
#Demographics required^^#BiospecimenManifest required^^
For Demographics and BiospecimenManifest manifests, values are required to be supplied; if they are not supplied, an error will be raised. For all other manifests this attribute is not required.
Cross-manifest Validation Type
Use cross-manifest validation rules when you want to check the values of an attribute in the manifest being validated against an attribute in the manifest(s) of a different component. For example, a sample manifest may have a patient id attribute that you want to check against the id attribute of patient manifests.
The format for cross-validation is: <rule> <targetComponent>.<targetAttribute> <scope> <raised message level>
There are three rules that do cross-manifest validation: [matchAtLeastOne, matchExactlyOne, matchNone]
There are two scopes to choose from: [value, set]
Value Scope
When the value scope is used all values from the target attribute in all target manifests are combined. The values from the manifest being validated are compared to this combined list. In other words, there is no distinction between what values came from what target manifest.
matchAtLeastOne Value Scope
The manifest is validated if each value in the attribute being validated exists at least once in the combined values of the target attribute of the target manifests.
matchExactlyOne Value Scope
The manifest is validated if each value in the attribute being validated exists once, and only once, in the combined values of the target attribute of the target manifests.
matchNone Value Scope
The manifest is validated if each value in the attribute being validated does not exist in the combined values of the target attribute of the target manifests.
Example 1
Tested manifest: [“A”]
Target manifests: [“A”, “B”]
matchExactlyOne: passes
matchAtLeastOne: passes
matchNone: fails
because “A” is in the target manifest
Example 2
Tested manifest: [“A”, “C”]
Target manifests: [“A”, “B”]
matchExactlyOne: fails
because “C” is not in the target manifest
matchAtLeastOne: fails
because “C” is not in the target manifest
matchNone: fails
because “A” is in the target manifest
Example 3
Tested manifest: [“C”]
Target manifests: [“A”, “B”]
matchExactlyOne: fails
because “C” is not in the target manifest
matchAtLeastOne: fails
because “C” is not in the target manifest
matchNone: passes
Example 4
Tested manifest: [“A”, “A”]
Target manifests: [“A”, “B”]
matchExactlyOne: passes
matchAtLeastOne: passes
matchNone: fails
because “A” is in the target manifest
Example 5
Tested manifest: [“A”]
Target manifests: [“A”, “A”]
matchExactlyOne: fails
because “A” is in the target manifest twice
matchAtLeastOne: passes
matchNone: fails
because “A” is in the target manifest
Example 6
Tested manifest: [“A”]
Target manifests: [“A”], [“A”]
matchExactlyOne: fails
because “A” is in both target manifests
matchAtLeastOne: passes
matchNone: fails
because “A” is in the target manifest
Example 7
Tested manifest: [“A”]
Target manifests: [“A”, “B”], [“A”, “B”]
matchExactlyOne: fails
because “A” is in both target manifests
matchAtLeastOne: passes
matchNone: fails
because “A” is in the target manifest
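The value-scope semantics shown in the examples can be sketched in a few lines. The function value_scope_passes is a hypothetical helper written to reproduce the examples above. It is not Schematic's implementation and it ignores message levels.

```python
def value_scope_passes(rule: str, tested: list[str], targets: list[list[str]]) -> bool:
    """Evaluate a cross-manifest rule under the value scope (illustrative sketch)."""
    # Value scope: pool the target attribute's values from every target manifest.
    combined = [value for manifest in targets for value in manifest]
    if rule == "matchAtLeastOne":
        return all(combined.count(v) >= 1 for v in tested)
    if rule == "matchExactlyOne":
        return all(combined.count(v) == 1 for v in tested)
    if rule == "matchNone":
        return all(combined.count(v) == 0 for v in tested)
    raise ValueError(f"unknown rule: {rule}")


# Example 4: duplicates in the tested manifest do not matter.
print(value_scope_passes("matchExactlyOne", ["A", "A"], [["A", "B"]]))  # True
# Example 6: "A" appears once in each of two target manifests, so twice overall.
print(value_scope_passes("matchExactlyOne", ["A"], [["A"], ["A"]]))     # False
print(value_scope_passes("matchAtLeastOne", ["A"], [["A"], ["A"]]))     # True
```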
Set Scope
When the set scope is used, the values from the tested manifest are compared against each target manifest one at a time, and the number of matches is counted. The tested manifest matches a target manifest if the tested manifest values are a subset of the target manifest values. Imagine a target manifest whose values are [“A”, “B”, “C”]:
[ ], [“A”], [“A”, “A”], [“A”, “B”, “C”] are all subsets of the example target manifest.
[1], [“D”], [“D”, “D”], [“D”, “E”] are not subsets of the example target manifest.
matchAtLeastOne Set Scope
The manifest is validated if there is at least one set match between the tested manifest and the target manifests.
matchExactlyOne Set Scope
The manifest is validated if there is one and only one set match between the tested manifest and the target manifests.
matchNone Set Scope
The manifest is validated if there are no set matches between the tested manifest and the target manifests.
Example 1
Tested manifest: [“A”]
Target manifests: [“A”, “B”]
matchExactlyOne: passes
matchAtLeastOne: passes
matchNone: fails
because “A” is in the target manifest
Example 2
Tested manifest: [“A”]
Target manifests: [“A”, “B”], [“C”, “D”]
matchExactlyOne: passes
matchAtLeastOne: passes
matchNone: fails
because “A” is in at least one of the target manifests
Example 3
Tested manifest: [“A”]
Target manifests: [“A”, “B”], [“A”, “B”]
matchExactlyOne: fails
because “A” is in more than one target manifest
matchAtLeastOne: passes
matchNone: fails
because “A” is in at least one of the target manifests
Example 4
Tested manifest: [“C”]
Target manifests: [“A”, “B”]
matchExactlyOne: fails
because “C” is not in the target manifest
matchAtLeastOne: fails
because “C” is not in the target manifest
matchNone: passes
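The set scope differs from the value scope in that each target manifest is tested separately with a subset check and the matches are counted. The sketch below reproduces the set-scope examples above. The helper set_scope_passes is hypothetical and is not taken from Schematic's code.

```python
def set_scope_passes(rule: str, tested: list[str], targets: list[list[str]]) -> bool:
    """Evaluate a cross-manifest rule under the set scope (illustrative sketch)."""
    # A target manifest matches when the tested values are a subset of its values.
    matches = sum(set(tested) <= set(target) for target in targets)
    if rule == "matchAtLeastOne":
        return matches >= 1
    if rule == "matchExactlyOne":
        return matches == 1
    if rule == "matchNone":
        return matches == 0
    raise ValueError(f"unknown rule: {rule}")


# Example 2: only the first target manifest contains "A".
print(set_scope_passes("matchExactlyOne", ["A"], [["A", "B"], ["C", "D"]]))  # True
# Example 3: both target manifests contain "A", so there are two matches.
print(set_scope_passes("matchExactlyOne", ["A"], [["A", "B"], ["A", "B"]]))  # False
```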
Content Validation Type
Rules can be used to validate the contents of entries for an attribute.
recommended
Use to raise a warning when a manifest column is not required but empty. If an attribute is always necessary then required should be set to TRUE instead of using the recommended validation rule.
Format:
recommended <raised message level>
Examples:
recommended
Default behavior: raises
warning
protectAges
Use to ensure that patient ages under 18 and over 89 years of age are censored when uploading for sharing. If necessary, a censored version of the manifest will be created and uploaded along with the uncensored version. Uncensored versions will be uploaded as restricted and Terms of Use will need to be set. Please follow up with governance after upload to set the Terms of Use.
Format:
protectAges <raised message level>
Examples:
protectAges warning
Default behavior: raises
warning
unique
Use to ensure that attribute values are not duplicated within a column.
Format:
unique <raised message level>
Examples:
unique error
Default behavior: raises
error
inRange
Use to ensure that numerical data is within a specified range
Format:
inRange <lower range bound> <upper range bound> <raised message level>
Examples:
inRange 50 100 error
Default behavior: raises
error
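As a small illustration of the inRange format, the sketch below parses the bounds out of a rule string and checks a value against them. The helper in_range_rule_passes is hypothetical, and treating both bounds as inclusive is an assumption of this sketch that the text above does not confirm.

```python
def in_range_rule_passes(rule: str, value: float) -> bool:
    """Check a value against an 'inRange <lower> <upper> [level]' rule (sketch)."""
    parts = rule.split()
    if len(parts) < 3 or parts[0] != "inRange":
        raise ValueError(f"not an inRange rule: {rule}")
    lower, upper = float(parts[1]), float(parts[2])
    # Inclusive bounds are an assumption of this sketch.
    return lower <= value <= upper


print(in_range_rule_passes("inRange 50 100 error", 75))   # True
print(in_range_rule_passes("inRange 50 100 error", 101))  # False
```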
date
Use to ensure the value parses as a date
Uses dateutils to parse the value
Can parse many formats
YYYY-MM-DD format is recommended
Every value must be read as a string, so formats such as YYYYDDMM, which would be read in as an int, are not supported
Default behavior: raises
error
Filename Validation
This requires paths to be enabled for the Synapse master file view in use. Paths can be enabled by navigating to an existing view and selecting show view schema > edit schema > add default view columns > save. Paths are enabled on new views by default.
This should be used only with the Filename attribute in a data model and specified with Component Based Rule Setting
filenameExists
Used to validate that the filenames and paths as they exist in the metadata manifest match the paths that are in the Synapse master File View for the specified dataset
Conditions in which an error is raised:
missing entityId: The entityId field for a manifest row is null or an empty string
entityId does not exist: The entityId provided for a manifest row does not exist within the specified dataset’s file view
path does not exist: The Filename in the manifest row does not exist within the specified dataset’s file view
mismatched entityId: The entityId and Filename do not match the expected values from the specified dataset’s file view
Format
filenameExists <dataset scope> <raised message level>
Example
This sets the rule for the MockFilename component ONLY with the specified dataset scope syn61682648
#MockFilename filenameExists syn61682648^^
Default behavior: raises
error
Given this File View:
id,path
syn61682653,schematic - main/MockFilenameComponent/txt1.txt
syn61682659,schematic - main/MockFilenameComponent/txt4.txt
syn61682660,schematic - main/MockFilenameComponent/txt2.txt
syn61682662,schematic - main/MockFilenameComponent/txt3.txt
syn63141243,schematic - main/MockFilenameComponent/txt6.txt
We get the following results for this Manifest:
Component,Filename,entityId
MockFilename,schematic - main/MockFilenameComponent/txt1.txt,syn61682653 # Pass
MockFilename,schematic - main/MockFilenameComponent/txt2.txt,syn61682660 # Pass
MockFilename,schematic - main/MockFilenameComponent/txt3.txt,syn61682653 # mismatched entityId
MockFilename,schematic - main/MockFilenameComponent/this_file_does_not_exist.txt,syn61682653 # path does not exist
MockFilename,schematic - main/MockFilenameComponent/txt4.txt,syn6168265 # entityId does not exist
MockFilename,schematic - main/MockFilenameComponent/txt6.txt, # missing entityId
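The four error conditions can be reproduced against the file view and manifest above with a short sketch. The function check_row and the order in which it tests the conditions are assumptions chosen so that the results agree with the example; Schematic's own implementation queries Synapse and may be structured differently.

```python
from typing import Optional

# File view from the example above, keyed by entity id.
FILE_VIEW = {
    "syn61682653": "schematic - main/MockFilenameComponent/txt1.txt",
    "syn61682659": "schematic - main/MockFilenameComponent/txt4.txt",
    "syn61682660": "schematic - main/MockFilenameComponent/txt2.txt",
    "syn61682662": "schematic - main/MockFilenameComponent/txt3.txt",
    "syn63141243": "schematic - main/MockFilenameComponent/txt6.txt",
}


def check_row(filename: str, entity_id: Optional[str], file_view: dict[str, str]) -> str:
    """Classify one manifest row (hypothetical helper; check order is assumed)."""
    if not entity_id:
        return "missing entityId"
    if entity_id not in file_view:
        return "entityId does not exist"
    if filename not in file_view.values():
        return "path does not exist"
    if file_view[entity_id] != filename:
        return "mismatched entityId"
    return "pass"


prefix = "schematic - main/MockFilenameComponent/"
print(check_row(prefix + "txt1.txt", "syn61682653", FILE_VIEW))  # pass
print(check_row(prefix + "txt3.txt", "syn61682653", FILE_VIEW))  # mismatched entityId
print(check_row(prefix + "txt4.txt", "syn6168265", FILE_VIEW))   # entityId does not exist
print(check_row(prefix + "txt6.txt", "", FILE_VIEW))             # missing entityId
```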
Rule Combinations
Schematic allows certain combinations of existing validation rules to be used on a single attribute, where appropriate.
Note
The following are the tested and validated combinations; all other combinations are not officially supported.
isNa and required can be combined with all rules and rule combos.
Rule combinations: [list::regex, int::inRange, float::inRange, num::inRange, protectAges::inRange]
Format:
<rule 1> <applicable rule 1 arguments>::<rule 2> <applicable rule 2 arguments>
:: delimiter used to separate each rule
Example:
list :: regex search [HTAN][0-9]{1}_[0-9]{4}_[0-9]*
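One way to picture the list::regex combination is to split the rule on the :: delimiter, parse the cell into list items, and apply the regular expression to every item. The sketch below does this with a hypothetical helper, list_regex_passes. It is an illustration of the idea rather than Schematic's implementation, and it ignores conformity and message levels.

```python
import re


def list_regex_passes(combo: str, value: str) -> bool:
    """Apply a 'list :: regex <module> <pattern>' combination to one cell (sketch)."""
    # '::' separates the two rules; surrounding spaces are ignored.
    list_rule, regex_rule = (part.strip() for part in combo.split("::"))
    if list_rule.split()[0] != "list":
        raise ValueError("this sketch only handles list::regex")
    _, module, pattern = regex_rule.split(" ")[:3]
    if module not in ("search", "match"):
        raise ValueError(f"unsupported module in this sketch: {module}")
    matcher = getattr(re, module)
    items = [item.strip() for item in value.split(",") if item.strip()]
    # Every item in the list must follow the same pattern.
    return all(matcher(pattern, item) is not None for item in items)


combo = "list :: regex search [HTAN][0-9]{1}_[0-9]{4}_[0-9]*"
print(list_regex_passes(combo, "HTAN1_0001_1,HTAN2_0002_22"))  # True
print(list_regex_passes(combo, "HTAN1_0001_1,sample-7"))       # False
```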
Component-Based Rule Setting
Component-Based Rule Setting is a powerful feature in data modeling that enables users to create rules tailored to specific subsets of components or manifests. This functionality was developed to address scenarios where a data modeler needs to enforce uniqueness for certain attribute values within one manifest while allowing non-uniqueness in another.
Here’s how it works:
Rule Definition at Attribute Level: Rules are defined at the attribute level within the data model.
Manifest-Level Referencing: These rules can then be applied (or not) to specific manifests within the data model. This means that rules can be selectively enforced based on the manifest they’re associated with.
This feature offers flexibility and applicability beyond its original use case. The new Component-Based Rule Setting feature provides users with the following options:
Apply a Rule to All Manifests Except Specified Ones: Users can now define a rule that applies to all manifests within the data model except for those explicitly specified. In cases where exceptions are specified, users have the flexibility to define unique rules for these exceptions or opt not to apply any rule at all.
Specify a Rule for a Single Manifest: Alternatively, users can specify a rule that applies to a single manifest exclusively. This allows for fine-grained control over rule enforcement at the manifest level.
Unique Rules for Each Manifest: Users can also define unique rules for each manifest within the data model. This enables tailored rule enforcement based on the specific requirements and characteristics of each manifest.
By leveraging the enhanced Component-Based Rule Setting feature, data modelers can efficiently enforce rules across their data models with greater precision and flexibility, ensuring data integrity while accommodating diverse use cases and requirements.
Note
All restrictions to rule combos and implementation also apply to component based rules.
As always try the rule combos with mock data to ensure they are working as intended before using in production.
Format:
^^
Double carets indicate that Component-Based rules are being set
Use ^^ to separate component rule sets
#
In the first position (prior to the rule) to define the component/manifest to apply the rule to
The # character cannot be used without the ^^ to indicate component rule sets
Use case:
Apply rule to all manifests except the specified set.
validation_rule^^#ComponentA
validation_rule^^#ComponentA^^#ComponentB
Apply a unique rule to each manifest.
#ComponentA validation_rule_1^^#ComponentB validation_rule_2^^#ComponentC validation_rule_3
For the specified manifest, apply the given validation rule, but for all others, run a different rule
#ComponentA validation_rule_1^^validation_rule_2
validation_rule_2^^#ComponentA validation_rule_1
Apply the validation rule to only one manifest
#ComponentA validation_rule_1^^
Example Rules:
Test by adding these rules to the Patient ID attribute in the example.model.csv model, then run validation with the new rules against the example manifests.
Rule:
#Patient int::inRange 100 900 error^^#Biospecimen int::inRange 100 900 warning
For the Patient manifest, apply the combo rule int::inRange 100 900 at the error level.
The value provided must be an integer in the range of 100-900; if it does not fall in the range, throw an error.
For the Biospecimen manifest, apply the combo rule int::inRange 100 900 at the warning level.
The value provided must be an integer in the range of 100-900; if it does not fall in the range, throw a warning.
Rule:
#Patient int::inRange 100 900 error^^int::inRange 100 900 warning
For the Patient manifest, apply the rule int::inRange 100 900 at the error level.
For all other manifests, apply the rule int::inRange 100 900 at the warning level.
Rule:
#Patient^^int::inRange 100 900 warning
For all manifests except Patient, apply the rule int::inRange 100 900 at the warning level.
Rule:
int::inRange 100 900 error^^#Biospecimen
Apply the rule int::inRange 100 900 error to all manifests except Biospecimen.
Rule:
#Patient unique error^^
To the Patient manifest only, apply the unique validation rule at the error level.
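The example rules above can be read mechanically: split on ^^, treat segments starting with # as component-specific, and treat any remaining segment as the rule for all other components. The sketch below follows that reading. The function rule_for_component is a hypothetical parser written for illustration and is not Schematic's own rule parser.

```python
from typing import Optional


def rule_for_component(rule_string: str, component: str) -> Optional[str]:
    """Return the rule that applies to `component`, or None (hypothetical parser)."""
    default: Optional[str] = None
    per_component: dict[str, Optional[str]] = {}
    for part in rule_string.split("^^"):
        part = part.strip()
        if not part:
            continue  # a trailing ^^ leaves an empty segment
        if part.startswith("#"):
            # "#Component rule..." sets a rule; a bare "#Component" excludes it.
            name, _, rule = part[1:].partition(" ")
            per_component[name] = rule.strip() or None
        else:
            default = part  # applies to every component not named explicitly
    return per_component.get(component, default)


rule = "#Patient int::inRange 100 900 error^^int::inRange 100 900 warning"
print(rule_for_component(rule, "Patient"))      # int::inRange 100 900 error
print(rule_for_component(rule, "Biospecimen"))  # int::inRange 100 900 warning
print(rule_for_component("#Patient unique error^^", "Demographics"))  # None
```

Returning None for components with no applicable rule keeps the "apply to one manifest only" and "all except" cases distinct from the default-rule case.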