Prediction Engine - Product

General Information

The prediction engine provides recommendations for mapping the fields of a dataset to the fields of a specific domain model. There are two main ways to invoke the matching prediction service:

  • Standard-based matching, used when the user knows that the provided dataset conforms to a specific domain standard (one of those included in the domain model). In this case the service simply returns the model fields that correspond to the standard's fields. The elastic-matching method is used to perform exact matching on the field names.
  • Regular (not standard-based) matching, used when the input dataset doesn't conform to a particular standard. Here the elastic-matching method is used again, but it searches the tokenized terms of the model fields (with fuzziness and abbreviation support) instead of performing exact matching on field names.

These recommendations are generated using the following method:

Elastic-Matching

Matches each user input field to model fields (leaf nodes) by running Elasticsearch queries against various text fields of the model's leaf nodes. Each prediction is based on the following information about the model field returned as a match: the field name (complete version when standard-based matching is used, tokenized version otherwise), the tokenized terms of the field and its parent, related terms, and terms from the corresponding standards' fields (all tokenized). Key points about the method:

  • The matching process is based on Elasticsearch queries.
  • Any special characters (reserved ES characters) in user field names are ignored
  • All searches are performed within the concept provided in the request
  • Tokenization of field names is performed by a custom function available in the utils and also by the ES analyzers used for indexing
  • EdgeNgrams (3-5 letters) are utilized to search for abbreviations in the titles
  • The search is performed with fuzziness to map terms that might have been misspelled by the user
  • When a path is provided, path terms are also included in the search queries with dynamically calculated weights (outermost term has the smallest weight and the innermost has the highest)
  • When a standard is provided, exact matching on standard field names is performed
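
Some of the points above can be illustrated with a short Python sketch. The helper names (strip_reserved, edge_ngrams, path_weights) are hypothetical and not taken from the engine's code. The linear weighting is an assumption: the description above only states that the outermost path term gets the smallest weight and the innermost the highest.

```python
# Characters reserved by the Elasticsearch query_string syntax.
ES_RESERVED = r'+-=&|><!(){}[]^"~*?:\/'


def strip_reserved(name: str) -> str:
    """Drop reserved ES characters from a user field name."""
    return "".join(ch for ch in name if ch not in ES_RESERVED)


def edge_ngrams(token: str, min_len: int = 3, max_len: int = 5) -> list:
    """Prefixes of length min_len..max_len, similar to an edge n-gram filter."""
    longest = min(max_len, len(token))
    return [token[:n] for n in range(min_len, longest + 1)]


def path_weights(path: list) -> list:
    """Assumed linear scheme: outermost term lowest, innermost term 1.0."""
    n = len(path)
    return [(i + 1) / n for i in range(n)]


print(strip_reserved("doublenestedtest[]"))   # doublenestedtest
print(edge_ngrams("arrival"))                 # ['arr', 'arri', 'arriv']
print(path_weights(["doublenestedtest", "double.nested"]))  # [0.5, 1.0]
```

The real tokenization and weighting live in the engine's utils and in the ES analyzers, so the exact values produced there may differ.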

Endpoints

The prediction engine offers one main service:

  • The matching-prediction service (/matching-prediction), which is the one invoked to get the mapping suggestions, as described above.

The elastic-matching method is implemented as a separate class and exposes its own endpoints (/matching-prediction for normal operations and /test-elastic-matching for testing), so it can also be used independently.

Background workflow

This section briefly presents the steps followed for each of the main ways to invoke the matching prediction service.

When using standard-based matching

  1. The /matching-prediction endpoint is invoked
  2. The request input is checked
  3. For each field, the exact match is found (according to the relevant standard) and returned with a perfect score
  4. Results for all fields are formatted according to the expected output structure
  5. Response is returned
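
The standard-based steps above could be sketched in Python as follows. This is a minimal illustration, not the service's implementation: match_by_standard and the standard_index lookup table are hypothetical, and the "perfect score" is assumed to be 1.0. The "confidence" value that appears in real responses is left out because its calculation is not described here.

```python
def match_by_standard(fields, standard_index):
    """Exact lookup of each field title in a standard-to-model index.

    standard_index is a hypothetical dict mapping a standard field name
    to a (model field id, model field title) pair.
    """
    results = {}
    for field in fields:
        hit = standard_index.get(field["title"])
        matchings = None
        if hit is not None:
            target, title = hit
            # Assumption: a "perfect score" is represented as 1.0.
            matchings = {"score": 1.0, "target": target, "title": title}
        results[str(field["id"])] = {
            "id": field["id"],
            "matchings": matchings,
            "path": field["path"],
            "title": field["title"],
        }
    return results


# Illustrative data only; the mapping below is made up for the example.
index = {"ArrivalTime": (211, "scheduledAircraftArrivalTime")}
fields = [
    {"id": 1, "title": "ArrivalTime", "path": []},
    {"id": 4, "title": "dummynonexistent", "path": []},
]
print(match_by_standard(fields, index))
```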

When the request does not provide a standard

  1. The /matching-prediction endpoint is invoked
  2. The request input is checked
  3. For each user field, the matching method is called (directly, not through the API); it calculates scores/confidence for potential matching fields and keeps the field with the top score as the prediction, if one exists
  4. The final results list for all fields is formatted according to the expected output
  5. Response is returned
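
Step 3 of the regular workflow amounts to keeping the top-scoring candidate per field. The Python sketch below shows that selection under the assumption that the matching method returns a list of candidate dicts with a "score" key. The names best_match, predict and the stub matcher are hypothetical.

```python
def best_match(candidates):
    """Return the highest-scoring candidate, or None when there are none."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: c["score"])


def predict(fields, matcher):
    """Call the matcher for each field and keep only its top candidate."""
    results = {}
    for field in fields:
        results[str(field["id"])] = {**field, "matchings": best_match(matcher(field))}
    return results


def stub_matcher(field):
    """Stand-in for the elastic-matching call; returns canned candidates."""
    canned = {
        "ArrivalTime": [
            {"score": 0.4, "target": 305, "title": "otherField"},
            {"score": 0.7, "target": 211, "title": "scheduledAircraftArrivalTime"},
        ]
    }
    return canned.get(field["title"], [])


sample = [
    {"id": 1, "title": "ArrivalTime", "path": []},
    {"id": 4, "title": "dummynonexistent", "path": []},
]
print(predict(sample, stub_matcher))
```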

By default, the matching-prediction service searches for matches only within the fields of a specific model concept, which is provided in the request. This restriction can be disabled through the "within_concept_only" flag in the request, so that the search runs across the complete model, but the results are currently worse. Searching across the complete model and leveraging the model structure will be explored in future versions.
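
One way such a flag could translate into an Elasticsearch query body is sketched below in Python. The queries the engine actually sends are not documented here, so this is an assumption throughout: build_query is a hypothetical helper, "conceptId" is a hypothetical index field, and "tokenizedTerms" is borrowed from the model node structure. Only the bool/match/term/fuzziness constructs are standard Elasticsearch query DSL.

```python
def build_query(title, concept_id, within_concept_only=True):
    """Build a fuzzy match query, optionally restricted to one concept."""
    query = {
        "bool": {
            "must": [
                {"match": {"tokenizedTerms": {"query": title, "fuzziness": "AUTO"}}}
            ]
        }
    }
    if within_concept_only:
        # Filter clauses restrict the result set without affecting scores.
        query["bool"]["filter"] = [{"term": {"conceptId": concept_id}}]
    return {"query": query}


print(build_query("arrival time", 6))
print(build_query("arrival time", 6, within_concept_only=False))
```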

The way the confidence of the results is computed needs to be further examined.

Samples

Input

For regular matching prediction

{
  "metadata": {
    "domain": 1,
    "standard": null,
    "concept": 6
  },
  "fields": [
    {
      "id": 1,
      "title": "ArrivalTime",
      "path": [
        "doublenestedtest[]",
        "double.nested"
      ]
    },
    {
      "id": 2,
      "title": "CRSArrTime",
      "path": [
        "doublenestedtest[]",
        "double.nested"
      ]
    },
    {
      "id": 3,
      "title": "cArrivalTime",
      "path": [

      ]
    },
    {
      "id": 4,
      "title": "dummynonexistent",
      "path": [

      ]
    }
  ],
  "configuration": {
    "within_concept_only": true,
    "es": {
      "use_model_paths": false,
      "use_input_paths": false
    }
  }
}

The "metadata" and "fields" fields are required. The "configuration" field and its child fields are optional. The default configuration is used for any missing option.

Default configuration values:

{
  "configuration": {
    "within_concept_only": true,
    "es": {
      "use_model_paths": false,
      "use_input_paths": true
    }
  }
}
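
Filling in missing options can be thought of as a recursive merge of the request's "configuration" over these defaults. The Python sketch below shows that idea. The with_defaults helper is hypothetical and the service may resolve defaults differently.

```python
import copy

DEFAULT_CONFIGURATION = {
    "within_concept_only": True,
    "es": {"use_model_paths": False, "use_input_paths": True},
}


def with_defaults(overrides, defaults=DEFAULT_CONFIGURATION):
    """Return defaults updated with overrides, merging nested dicts."""
    merged = copy.deepcopy(defaults)
    for key, value in (overrides or {}).items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = with_defaults(value, merged[key])
        else:
            merged[key] = value
    return merged


# Only one nested option is supplied; everything else falls back to defaults.
print(with_defaults({"es": {"use_input_paths": False}}))
```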

For standard-based matching prediction

{
  "metadata": {
    "domain": 1,
    "standard": {
      "standard": "ACRIS",
      "version": "1.1"
    },
    "concept": 6
  },
  "fields": [
    {
      "id": 1,
      "title": "ArrivalTime",
      "path": [
        "doublenestedtest[]",
        "double.nested"
      ]
    },
    {
      "id": 2,
      "title": "CRSArrTime",
      "path": [
        "doublenestedtest[]",
        "double.nested"
      ]
    },
    {
      "id": 3,
      "title": "cArrivalTime",
      "path": [

      ]
    },
    {
      "id": 4,
      "title": "dummynonexistent",
      "path": [

      ]
    }
  ],
  "configuration": {
    "within_concept_only": true,
    "es": {
      "use_model_paths": false,
      "use_input_paths": false
    }
  }
}

The payload sample is the same as the previous one, but instead of a null value in the "metadata.standard" field, an actual standard is provided.
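
The difference between the two payloads suggests a simple dispatch on "metadata.standard". The Python sketch below checks the required top-level fields and picks the matching mode. Both function names and the error messages are hypothetical, and the checks the service really performs may be stricter.

```python
def validate_request(payload):
    """Return a list of problems found in the request payload."""
    errors = []
    for key in ("metadata", "fields"):
        if key not in payload:
            errors.append(f'missing required field "{key}"')
    if "fields" in payload and not isinstance(payload["fields"], list):
        errors.append('"fields" must be a list')
    return errors


def matching_mode(payload):
    """'standard' when metadata.standard is set, otherwise 'regular'."""
    standard = payload.get("metadata", {}).get("standard")
    return "standard" if standard is not None else "regular"


regular = {"metadata": {"domain": 1, "standard": None, "concept": 6}, "fields": []}
standard = {
    "metadata": {
        "domain": 1,
        "standard": {"standard": "ACRIS", "version": "1.1"},
        "concept": 6,
    },
    "fields": [],
}
print(matching_mode(regular), matching_mode(standard))
```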

Output

{
	"1": {
		"id": 1,
		"matchings": {
			"confidence": 0.0,
			"score": 0.7,
			"target": 211,
			"title": "scheduledAircraftArrivalTime"
		},
		"path": [
			"doublenestedtest[]",
			"double.nested"
		],
		"title": "ArrivalTime"
	},
	"2": {
		"id": 2,
		"matchings": null,
		"path": [
			"doublenestedtest[]",
			"double.nested"
		],
		"title": "CRSArrTime"
	},
	"3": {
		"id": 3,
		"matchings": {
			"confidence": 0.0,
			"score": 0.7,
			"target": 211,
			"title": "scheduledAircraftArrivalTime"
		},
		"path": [],
		"title": "cArrivalTime"
	},
	"4": {
		"id": 4,
		"matchings": null,
		"path": [],
		"title": "dummynonexistent"
	}
}
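
A client consuming this response typically needs to separate the fields that received a prediction from those whose "matchings" is null. The Python sketch below does that for a response shaped like the sample. The split_predictions helper is hypothetical.

```python
def split_predictions(response):
    """Split a response into {title: target} matches and unmatched titles."""
    matched, unmatched = {}, []
    for entry in response.values():
        matchings = entry["matchings"]
        if matchings is None:
            unmatched.append(entry["title"])
        else:
            matched[entry["title"]] = matchings["target"]
    return matched, unmatched


# Trimmed copy of the sample output above.
response = {
    "1": {"id": 1, "title": "ArrivalTime", "path": [],
          "matchings": {"confidence": 0.0, "score": 0.7, "target": 211,
                        "title": "scheduledAircraftArrivalTime"}},
    "2": {"id": 2, "title": "CRSArrTime", "path": [], "matchings": None},
}
matched, unmatched = split_predictions(response)
print(matched)
print(unmatched)
```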

Model Management

Using the endpoints described below, a model can be indexed to Elasticsearch and retrieved in a tree format. Additionally, model nodes can be added, updated or deleted.

Model nodes

Model nodes included in a data model have the following structure.

export interface ModelNode {
    id: number;
    uid: string;
    name: string;
    description: string;
    version: string;
    dateAdded: Date; // dates are serialized as ISO 8601 strings in the JSON payloads
    dateDeprecated: Date | null;
    type: string | null; // null for the domain node, "object" or a data type otherwise
    standardsMapping: any[] | null;
    tokenizedTerms: string[] | null;
    relatedTerms: string[] | null;
    metadata: any | null;
    referenceConceptId: number | null; // set only for related nodes
    referenceConceptName: string | null;
    referencePrefix: any[] | null;
    parentId: number | null;
    majorVersion: number;
    status: string;
    updatedAt: Date;
    domainUid: string | null;
}

There are 4 types of model nodes:

  • Domain node: type is equal to null
  • High level node: type is equal to "object"
  • Related node: type is equal to "object" (*)
  • Leaf node: type is equal to any data type (including "boolean", "date", "datetime", "double", "integer", "string", "time", "binary")

(*) The fields referenceConceptId, referenceConceptName and referencePrefix have a value only if the node is "related" (which means that it references a high level node); otherwise they are null.
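
The four node types can be told apart from "type" and "referenceConceptId" alone. The Python sketch below encodes the rules listed above. The node_kind helper and its return labels are hypothetical.

```python
def node_kind(node):
    """Classify a model node dict as domain, high-level, related or leaf."""
    node_type = node.get("type")
    if node_type is None:
        return "domain"
    if node_type == "object":
        # Related nodes are objects that reference a high level node.
        if node.get("referenceConceptId") is not None:
            return "related"
        return "high-level"
    return "leaf"


print(node_kind({"type": None}))                              # domain
print(node_kind({"type": "object"}))                          # high-level
print(node_kind({"type": "object", "referenceConceptId": 6})) # related
print(node_kind({"type": "datetime"}))                        # leaf
```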

Endpoints

  • /model: This endpoint is used to index a new model to Elasticsearch

    Input

    {
        "standards": [
            {
                "standard": "ACRIS",
                "version": "v2.0"
            },
            {
                "standard": "AIDX",
                "version": "v16.1"
            }
        ],
        "model": [
            {
                "dateAdded": "2022-06-27T06:44:53.130Z",
                "updatedAt": "2022-06-27T03:45:07.229Z",
                "id": 3805,
                "uid": "7d2404a5-db37-4f2b-886f-abb6f4b7081c",
                "majorVersion": 1,
                "name": "aviation1656312293130",
                "description": "A data model representing the aviation data exchanged by the aviation data value chain stakeholders.",
                "version": "1.0.0",
                "dateDeprecated": null,
                "type": null,
                "standardsMapping": [
                    {
                        "standard": "ACRIS",
                        "version": "v2.0"
                    },
                    {
                        "standard": "AIDX",
                        "version": "v16.1"
                    }
                ],
                "tokenizedTerms": null,
                "relatedTerms": null,
                "metadata": null,
                "parentId": null,
                "referenceConceptName": null,
                "referenceConceptId": null,
                "referencePrefix": null,
                "status": "stable",
                "domainUid": null
            }
        ]
    }

    Note that "standards" and "model" are required. Also, alternative paths with depth 2 (for example, Parent.RelatedNode.LeafNode) are calculated for each related and leaf node.

  • /model-node: This endpoint is used to index a new concept to Elasticsearch

    Input

    {
        "dateAdded": "2022-06-27T08:02:02.038Z",
        "updatedAt": "2022-06-27T08:02:02.038Z",
        "name": "test",
        "description": "test",
        "type": "object",
        "standardsMapping": [
            {
                "name": "test",
                "standard": "test",
                "version": "1",
                "type": "test"
            }
        ],
        "relatedTerms": [],
        "metadata": null,
        "parentId": 3805,
        "referenceConceptName": null,
        "referenceConceptId": null,
        "referencePrefix": [],
        "domainUid": "7d2404a5-db37-4f2b-886f-abb6f4b7081c",
        "parent": {
            "dateAdded": "2022-06-27T06:44:53.130Z",
            "updatedAt": "2022-06-27T05:02:02.010Z",
            "id": 3805,
            "uid": "7d2404a5-db37-4f2b-886f-abb6f4b7081c",
            "majorVersion": 1,
            "name": "aviation1656312293130",
            "description": "A data model representing the aviation data exchanged by the aviation data value chain stakeholders.",
            "version": "1.3.0",
            "dateDeprecated": null,
            "type": null,
            "standardsMapping": [
                {
                    "standard": "ACRIS",
                    "version": "v2.0"
                },
                {
                    "standard": "AIDX",
                    "version": "v16.1"
                }
            ],
            "tokenizedTerms": null,
            "relatedTerms": null,
            "metadata": null,
            "parentId": null,
            "referenceConceptName": null,
            "referenceConceptId": null,
            "referencePrefix": null,
            "status": "stable",
            "domainUid": null
        },
        "version": "1.3.0",
        "majorVersion": 1,
        "dateDeprecated": null,
        "tokenizedTerms": null,
        "id": 5120,
        "uid": "dbcbd3da-ef23-4050-b79b-e79ee98eb80b",
        "status": "stable"
    }

    Note that non-leaf nodes (with "object" type), related nodes (with "object" type and "referenceConceptId") and leaf nodes (with type other than "object") are treated differently. Also, alternative paths are calculated accordingly.

  • /model-node/<nodeID>: This endpoint is used to update or delete a specific concept with <nodeID>

    Input

    {
        "description": "The body type of the aircraft. (i.e. W = Wide, N = Narrow). ",
        "standardsMapping": [
            {
                "name": "test",
                "standard": "test",
                "version": "1",
                "type": "test"
            },
            {
                "name": "df",
                "standard": "erdfrf",
                "version": "ref",
                "type": "ref"
            }
        ],
        "relatedTerms": ["test"],
        "metadata": {
            "encryption": true,
            "sensitive": false,
            "multiple": true,
            "ordered": false,
            "identifier": true,
            "spatialID": false,
            "spatial": false,
            "index": "both"
        }
    }

    Note that "description", "standardsMapping", "relatedTerms" and "metadata" are optional.

    Also, when a concept is deleted, the alternative paths of other concepts related to it are updated.

  • /model-standard: This endpoint is used to add a new standard to the Elasticsearch mapping

    Input

    {
        "standards": [
            {
                "standard": "test",
                "version": "1"
            }
        ]
    }

  • /model-search: This endpoint is used to search and return a specific model

    Input

    {
        "filter": {
            "model": 1,
            "concept": "99f8c57b-137b-4269-8f4d-b0cd0b0757bd"
        },
        "pagination": {
            "page": 1,
            "size": 10
        }
    }

    Note that "filter.model" is required, while "filter.concept" is optional and may be omitted or set to null. When "filter.concept" is given, the alternative paths are filtered based on the specific concept uid. "pagination.page" and "pagination.size" are also optional; when they are given, the returned concepts are paginated.
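
The optional pagination can be sketched in Python as a plain slice over the list of returned concepts. This assumes that "page" is 1-based (the sample uses page 1) and that omitting either value returns everything. The paginate helper is hypothetical.

```python
def paginate(items, page=None, size=None):
    """Return one page of items, or all items when pagination is omitted."""
    if page is None or size is None:
        return list(items)
    start = (page - 1) * size  # assumption: pages are numbered from 1
    return list(items[start:start + size])


concepts = [f"concept-{i}" for i in range(1, 26)]
print(paginate(concepts, page=1, size=10)[0])   # concept-1
print(len(paginate(concepts, page=3, size=10))) # 5
print(len(paginate(concepts)))                  # 25
```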