Prediction Engine
General Information
The prediction engine provides recommendations for mapping the fields of a dataset to the fields of a specific domain model. There are two main ways to invoke the matching prediction service:
- Standard-based matching, used when the user knows that the provided dataset conforms to a specific domain standard (one of those included in the domain model). In this case the service simply returns the model fields that correspond to the standard's fields, using the elastic-matching method (described below) to perform an exact match on the field names.
- Regular (not standard-based) matching, used when the input dataset doesn't conform to a particular standard. Here all three matching methods described below are used (if configured).
The recommendations are generated by three different methods:
- Elastic-Matching: matches the user input field to model fields (leaf nodes) using Elasticsearch queries over various text fields of the model's leaf nodes. Each prediction draws on the following information about the candidate model field: field name (complete version when standard-based matching is used, tokenized version otherwise), field description, tokenized terms of the field and its parent, related terms, and terms from the corresponding standards' fields (all tokenized).
- Fuzzy-Matching: matches the user input field to model fields (leaf nodes) based on the names and related terms of the leaf nodes, using Levenshtein distance.
- Sample-Matching: matches the user input field based on its sample data, returning model fields (leaf nodes) whose data contents appear similar. This method uses supervised machine learning, in particular classifiers from the flexmatcher Python package.
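To make the fuzzy-matching idea concrete, the following is a minimal sketch of ranking candidate leaf nodes by Levenshtein distance over their names and related terms. It is not the engine's implementation: the function names, the candidate structure (a dict from a model field id to its list of terms) and the normalisation of the distance into a 0..1 similarity are all assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def fuzzy_rank(user_field: str, candidates: dict[int, list[str]]) -> list[tuple[int, float]]:
    """Rank hypothetical model fields by their best similarity to the user field.

    `candidates` maps a model field id to its name plus related terms
    (an assumed structure, not the engine's own).
    """
    ranked = []
    for field_id, terms in candidates.items():
        best = 0.0
        for term in terms:
            longest = max(len(user_field), len(term)) or 1
            similarity = 1.0 - levenshtein(user_field.lower(), term.lower()) / longest
            best = max(best, similarity)
        ranked.append((field_id, best))
    return sorted(ranked, key=lambda item: item[1], reverse=True)
```

For example, `fuzzy_rank("ArrivalTime", {261: ["arrival_time"], 300: ["seats"]})` places field 261 first, since its term is one edit away from the lower-cased input.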
Endpoints
The prediction engine offers two main services:
- The matching-prediction service (/matching-prediction), which is the one invoked to get the mapping suggestions, as described above.
- The sample retraining service (/sample-retraining) which is used to retrain the model used by the sample-matching method.
Each of the three matching methods is implemented as a separate class and exposes its own endpoints (for testing and normal operation), so they can also be used independently. As of version 2 of the prediction engine, the /matching-prediction service allows selecting the method(s) to use, so the individual endpoints are not needed.
Version 1 input and output payloads are not compatible with those used by later versions. Version 1 will be progressively replaced everywhere and will eventually no longer be supported. When the input payload doesn't specify a version, version 1 is assumed.
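The version fallback described above could look roughly like the sketch below. The function name is hypothetical, and rejecting versions other than 1 and 2 is an assumption; the source only states that a missing version means version 1.

```python
def detect_version(payload: dict) -> int:
    """Return the payload version, assuming version 1 when none is given."""
    version = payload.get("version", 1)
    # Assumption: only versions 1 and 2 exist; anything else is rejected.
    if version not in (1, 2):
        raise ValueError(f"unsupported payload version: {version!r}")
    return version
```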
Background workflow
This section briefly presents the steps followed for each of the main ways to invoke the matching prediction service.
When using standard-based matching
- The /matching-prediction endpoint is invoked
- The request input is checked to identify whether version 1 or 2 of the service was invoked, so that the appropriate methods are used in the following steps
- For each field, the exact match is found (from elastic-matching method) and returned
- Results for all fields are formatted according to the expected output structure
- Response is returned
When the request does not provide a standard
- The /matching-prediction endpoint is invoked
- The request input is checked to identify whether version 1 or 2 of the service was invoked, so that the appropriate methods are used in the following steps
- The input "sample" field is processed to get the flattened structure used by the sample-matching method. In version 2, this step may be skipped depending on the input payload configuration, i.e. if sample-based matching is not used.
- For each user field:
  - The three matching methods (or the ones specified in v2 payloads) are used to provide their recommendations.
  - Each method returns its top X results (where X is set through an environment variable). Each method provides its best matches independently of the others.
  - A function combines the results and calculates scores/confidence
- The final results list for all fields is formatted according to expected output.
- Response is returned
By default the matching-prediction service searches for matches only within the fields of a specific model concept, which is provided in the request. It is possible to disable this and search across the complete model (through the "within_concept_only" flag in the request), but the results will be worse. Searching across the complete model and leveraging the model structure will be explored in future versions.
The way the confidence of the results is computed and the way results are combined (when coming from different methods) need to be further examined.
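The per-field loop of the regular workflow can be sketched as follows. This is only an illustration under stated assumptions: the environment variable name PREDICTION_TOP_X, the matcher signature (field title in, list of (target id, score) pairs out) and the averaging used in `combine` are all hypothetical. In particular, the averaging is a placeholder, since the actual combination and confidence computation are still to be examined.

```python
import os
from collections import defaultdict

# Hypothetical variable name; the source only says X comes from the environment.
TOP_X = int(os.environ.get("PREDICTION_TOP_X", "3"))


def top_x(results, x=TOP_X):
    """Keep the x best (target id, score) pairs of one method."""
    return sorted(results, key=lambda r: r[1], reverse=True)[:x]


def combine(per_method):
    """Placeholder combination: average each target's score over the methods run."""
    totals = defaultdict(float)
    for results in per_method.values():
        for target, score in results:
            totals[target] += score
    count = len(per_method) or 1
    averaged = [(target, total / count) for target, total in totals.items()]
    return sorted(averaged, key=lambda r: r[1], reverse=True)


def predict(field_titles, matchers):
    """Run every selected matcher on every field and keep the best combined match."""
    output = {}
    for title in field_titles:
        # Each method produces its best matches independently of the others.
        per_method = {name: top_x(match(title)) for name, match in matchers.items()}
        combined = combine(per_method)
        output[title] = combined[0] if combined else None
    return output
```

A field for which no method returns anything ends up as `None`, mirroring the null entries in the sample outputs.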
Samples for version 1
Input
For regular matching prediction
{
  "metadata": {
    "domain": 1,
    "standard": null,
    "concept": 1000
  },
  "sample": [
    {
      "SeatsPassenger": 340,
      "Transport_Model": "YRT098"
    }
  ],
  "configuration": {
    "within_concept_only": false
  }
}
For standard-based matching prediction
{
  "metadata": {
    "domain": 1,
    "standard": {
      "name": "ACRIS",
      "version": "1.1"
    },
    "concept": 1000
  },
  "sample": [
    {
      "SeatsPassenger": 340,
      "Transport_Model": "YRT098"
    }
  ]
}
Note that the "metadata" and "sample" fields are both required, whereas the "configuration" field is optional. If it is missing, "within_concept_only" defaults to true.
Output
{
  "SeatsPassenger": null,
  "Transport_Model": {
    "confidence": 1.0,
    "score": 1.0,
    "target": 1001
  }
}
Samples for version 2
Input
For regular matching prediction
{
  "metadata": {
    "domain": 1,
    "standard": null,
    "concept": 8
  },
  "sample": [
    {
      "cArrivalTime": 1,
      "dummynonexistent": 1,
      "doublenestedtest": [
        {
          "double.nested": {
            "ArrivalTime": 1,
            "CRSArrTime": 2
          }
        },
        {
          "double.nested": {
            "ArrivalTime": 3,
            "CRSArrTime": 4
          }
        }
      ]
    }
  ],
  "fields": [
    {
      "id": 1,
      "title": "ArrivalTime",
      "path": [
        "doublenestedtest[]",
        "double.nested"
      ]
    },
    {
      "id": 2,
      "title": "CRSArrTime",
      "path": [
        "doublenestedtest[]",
        "double.nested"
      ]
    },
    {
      "id": 3,
      "title": "cArrivalTime",
      "path": []
    },
    {
      "id": 4,
      "title": "dummynonexistent",
      "path": []
    }
  ],
  "configuration": {
    "within_concept_only": true,
    "method": {
      "matchers": [
        "es",
        "fz"
      ],
      "scoring": "default"
    },
    "es": {
      "use_model_paths": false,
      "use_input_paths": false
    }
  },
  "version": 2
}
The "metadata", "sample" and "fields" fields are required. The "version" field is also required, because version 1 is assumed and invoked when it is missing. The "configuration" field and its child fields are optional; the default configuration is used for any missing option. The available values for the "configuration.method.matchers" field are "fz", "es" and "sa", which correspond to fuzzy-, elastic- and sample-matching respectively. Note that sample matching can only be used after proper training has been performed. The training process is sensitive to the model field ids and should be further examined.
The "path" fields should correspond to the structure of the sample. The prediction engine assumes this and does not validate it; if a path does not match the sample, key errors are returned.
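As an illustration of how a "path" relates to the sample, the sketch below walks one sample record along a field's path and collects the values found under its title. It assumes, based only on the sample payload above, that a trailing "[]" marks a list-valued key and that a step such as "double.nested" is a single key containing a dot. The function name is hypothetical, and the engine's own flattening logic may differ.

```python
def resolve_values(record: dict, path: list[str], title: str) -> list:
    """Collect every value of `title` reached by following `path` in one sample record."""
    nodes = [record]
    for step in path:
        is_array = step.endswith("[]")  # assumed convention: "[]" marks a list
        key = step[:-2] if is_array else step
        next_nodes = []
        for node in nodes:
            child = node[key]  # raises KeyError when the path does not match the sample
            if is_array:
                next_nodes.extend(child)
            else:
                next_nodes.append(child)
        nodes = next_nodes
    return [node[title] for node in nodes]
```

With the sample record above, the path `["doublenestedtest[]", "double.nested"]` and the title "ArrivalTime" yield the values 1 and 3, while an empty path reads the title directly from the top level of the record.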
For standard-based matching prediction
The payload sample is the same as the previous one, but instead of a null value in the "metadata.standard" field, an actual standard is provided in the same way as in the version 1 payload.
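The version 2 input rules (required fields and allowed matcher names) could be checked along the lines of the sketch below. The function name is hypothetical. Falling back to all three matchers when none are listed is an assumption inferred from the workflow description, not a documented default.

```python
MATCHERS = {"fz": "fuzzy-matching", "es": "elastic-matching", "sa": "sample-matching"}
REQUIRED_V2 = ("metadata", "sample", "fields")


def selected_matchers(payload: dict) -> list[str]:
    """Validate a version 2 payload and return the matcher codes to run."""
    missing = [key for key in REQUIRED_V2 if key not in payload]
    if missing:
        raise ValueError(f"missing required field(s): {', '.join(missing)}")
    configuration = payload.get("configuration") or {}
    method = configuration.get("method") or {}
    matchers = method.get("matchers")
    if matchers is None:
        # Assumption: with no explicit selection, all three methods are used.
        return sorted(MATCHERS)
    unknown = [code for code in matchers if code not in MATCHERS]
    if unknown:
        raise ValueError(f"unknown matcher(s): {', '.join(unknown)}")
    return list(matchers)
```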
Output
{
  "1": {
    "id": 1,
    "matchings": {
      "confidence": 0.0,
      "score": 0.25,
      "target": 261
    },
    "path": [
      "doublenestedtest[]",
      "double.nested"
    ],
    "title": "ArrivalTime"
  },
  "2": {
    "id": 2,
    "matchings": {
      "confidence": 0.0,
      "score": 0.25,
      "target": 261
    },
    "path": [
      "doublenestedtest[]",
      "double.nested"
    ],
    "title": "CRSArrTime"
  },
  "3": {
    "id": 3,
    "matchings": {
      "confidence": 0.0,
      "score": 0.25,
      "target": 261
    },
    "path": [],
    "title": "cArrivalTime"
  },
  "4": {
    "id": 4,
    "matchings": null,
    "path": [],
    "title": "dummynonexistent"
  }
}
Model Management
There are specific endpoints that handle model management in Elasticsearch.
Endpoints
- /model: This endpoint is used to index a new model to Elasticsearch
Input
{
  "standards": [
    { "standard": "ACRIS", "version": "v2.0" },
    { "standard": "AIDX", "version": "v16.1" }
  ],
  "model": [
    {
      "dateAdded": "2022-06-27T06:44:53.130Z",
      "updatedAt": "2022-06-27T03:45:07.229Z",
      "id": 3805,
      "uid": "7d2404a5-db37-4f2b-886f-abb6f4b7081c",
      "majorVersion": 1,
      "name": "aviation1656312293130",
      "description": "A data model representing the aviation data exchanged by the aviation data value chain stakeholders.",
      "version": "1.0.0",
      "dateDeprecated": null,
      "type": null,
      "standardsMapping": [
        { "standard": "ACRIS", "version": "v2.0" },
        { "standard": "AIDX", "version": "v16.1" }
      ],
      "tokenizedTerms": null,
      "relatedTerms": null,
      "metadata": null,
      "parentId": null,
      "referenceConceptName": null,
      "referenceConceptId": null,
      "referencePrefix": null,
      "status": "stable",
      "domainUid": null
    }
  ]
}
Note that "standards" and "model" are required.
- /model-node: This endpoint is used to index a new concept to Elasticsearch
Input
{
  "dateAdded": "2022-06-27T08:02:02.038Z",
  "updatedAt": "2022-06-27T08:02:02.038Z",
  "name": "test",
  "description": "test",
  "type": "object",
  "standardsMapping": [
    { "name": "test", "standard": "test", "version": "1", "type": "test" }
  ],
  "relatedTerms": [],
  "metadata": null,
  "parentId": 3805,
  "referenceConceptName": null,
  "referenceConceptId": null,
  "referencePrefix": [],
  "domainUid": "7d2404a5-db37-4f2b-886f-abb6f4b7081c",
  "parent": {
    "dateAdded": "2022-06-27T06:44:53.130Z",
    "updatedAt": "2022-06-27T05:02:02.010Z",
    "id": 3805,
    "uid": "7d2404a5-db37-4f2b-886f-abb6f4b7081c",
    "majorVersion": 1,
    "name": "aviation1656312293130",
    "description": "A data model representing the aviation data exchanged by the aviation data value chain stakeholders.",
    "version": "1.3.0",
    "dateDeprecated": null,
    "type": null,
    "standardsMapping": [
      { "standard": "ACRIS", "version": "v2.0" },
      { "standard": "AIDX", "version": "v16.1" }
    ],
    "tokenizedTerms": null,
    "relatedTerms": null,
    "metadata": null,
    "parentId": null,
    "referenceConceptName": null,
    "referenceConceptId": null,
    "referencePrefix": null,
    "status": "stable",
    "domainUid": null
  },
  "version": "1.3.0",
  "majorVersion": 1,
  "dateDeprecated": null,
  "tokenizedTerms": null,
  "id": 5120,
  "uid": "dbcbd3da-ef23-4050-b79b-e79ee98eb80b",
  "status": "stable"
}
Note that non-leaf nodes (with "object" type), related nodes (with "object" type and "referenceConceptId") and leaf nodes (with type other than "object") are treated differently.
- /model-node/<nodeID>: This endpoint is used to update or delete a specific concept with <nodeID>
Input
{
  "description": "The body type of the aircraft. (i.e. W = Wide, N = Narrow). ",
  "standardsMapping": [
    { "name": "test", "standard": "test", "version": "1", "type": "test" },
    { "name": "df", "standard": "erdfrf", "version": "ref", "type": "ref" }
  ],
  "relatedTerms": ["test"],
  "metadata": {
    "encryption": true,
    "sensitive": false,
    "multiple": true,
    "ordered": false,
    "identifier": true,
    "spatialID": false,
    "spatial": false,
    "index": "both"
  }
}
Note that "description", "standardsMapping", "relatedTerms" and "metadata" are optional.
- /model-standard: This endpoint is used to add a new standard to the Elasticsearch mapping
Input
{
  "standards": [
    { "standard": "test", "version": "1" }
  ]
}
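The three node kinds that /model-node treats differently can be told apart from the payload roughly as in the sketch below. The function name and the returned labels are hypothetical, and how the engine handles each kind afterwards is not covered here.

```python
def node_kind(node: dict) -> str:
    """Classify a /model-node payload using the rules stated for that endpoint."""
    if node.get("type") != "object":
        return "leaf"      # any type other than "object"
    if node.get("referenceConceptId") is not None:
        return "related"   # "object" type with a referenceConceptId
    return "non-leaf"      # plain "object" type
```

Applied to the /model-node sample input above ("type" is "object" and "referenceConceptId" is null), this returns "non-leaf".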