Search Engine
The search engine retrieves the datasets that best match a user's search. It currently searches the datasets' metadata and structure as well as their contents.
The search engine uses Elasticsearch to index the datasets' information. In addition to the core search service, it offers the services needed to add and update datasets' information in the Elasticsearch index.
Dataset-related services
At a high level, the information indexed in Elasticsearch about each dataset includes:
- core information, including the dataset's id, name and description, the dates on which it was created, updated and made available, its version and number of downloads, its status and type, the standard it follows (if any) and its creator
- basic metadata, including tags, distribution information (e.g. type, format, accessibility), spatiotemporal coverage, temporal resolution, license information and pricing
- structure information, i.e. the dataset's domain and its concepts (primary and other)
- schema information, which defines (a) the dataset schema, i.e. the fields it contains and how they are related (nested or at the same level), and (b) the minimum and maximum values of the numeric and date fields, as well as the unique values of the string fields
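As an illustration of the schema information in the last item, the following minimal sketch derives unique values for a string field and min/max for a numeric field. The function name `summarise_field` and the sample values are hypothetical; how the actual summaries are produced is not described here. Date values would need to be parsed into `datetime` objects first, otherwise they are treated as strings.

```python
from datetime import datetime
from numbers import Number


def summarise_field(values):
    """Summarise one field: unique values for strings, min/max otherwise."""
    present = [v for v in values if v is not None]
    if not present:
        return {}
    if all(isinstance(v, str) for v in present):
        # dict.fromkeys de-duplicates while keeping first-seen order
        return {"values": list(dict.fromkeys(present))}
    if all(isinstance(v, (Number, datetime)) for v in present):
        return {"min": min(present), "max": max(present)}
    raise TypeError("mixed or unsupported value types")


# Hypothetical raw column values for two fields of a dataset
space_ids = ["S0", "S1", "S0", "S2"]
max_occupants = [1, 4, 2, 3]

summary = {
    "spaceID": summarise_field(space_ids),
    "spaceMaxOccupants": summarise_field(max_occupants),
}
print(summary)
```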
The engine offers a POST and a PUT method at /dataset/[id]. The request payload is provided below.
No schema validation is performed, because dataset information becomes available at various stages and in no fixed order, e.g. metadata may become available before or after the schema. The engine therefore performs no checks. It assumes that once the dataset status is set to "Available", the dataset's information is correct and consistent, and it includes the dataset in searches. For any other status value the dataset is ignored, so its contents and consistency do not affect the engine.
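The status rule can be sketched as a single predicate. This is an illustration only: the function name `is_searchable` is hypothetical, and the case-insensitive comparison is an assumption made because the text writes "Available" while the sample payload uses "available".

```python
def is_searchable(dataset):
    """Return True only for datasets whose status marks them as available."""
    status = dataset.get("status")
    # Assumption: status comparison ignores letter case
    return isinstance(status, str) and status.lower() == "available"


print(is_searchable({"id": 3, "status": "available"}))  # True
print(is_searchable({"id": 4, "status": "draft"}))      # False
print(is_searchable({"id": 5}))                         # False
```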
Sample payload. It contains a superset of the information currently provided for datasets; as explained above, this information is usually not all provided at once.
{
"createdAt": "2020-03-16T09:53:20.723Z",
"updatedAt": "2020-05-17T08:10:27.429Z",
"availableAt": "2020-06-17T08:10:27.429Z",
"modifiedAt": "2020-06-17T08:10:27.429Z",
"version": "v2",
"downloads": 10,
"status": "available",
"id": 3,
"name": "Energy Data for BK",
"description": "Detailed information per registered apartment",
"assetTypeId": 1,
"standard": {
"standard": "test",
"version": "v1.0"
},
"createdBy": {
"firstName": "Test",
"lastName": "User1",
"organisationId": 2
},
"metadata": {
"general": {
"tags": [
"energy",
"apartments"
],
"reference": null
},
"distribution": {
"type": "Text",
"format": [
"JSON"
],
"velocity": "Batch",
"accessibility": [
"Through an API"
],
"accrualMethod": "Through an API",
"accrualPeriodicity": "Hourly",
"language": "English",
"volume": 1000
},
"extent": {
"temporalCoverage": {
"type": "custom",
"field": null,
"unit": "time period",
"min": "2018-06-16T09:53:20.723Z",
"max": "2019-06-16T09:53:20.723Z",
"value": null
},
"spatialCoverage": {
"unit": "Continent/Country",
"value": null,
"values": [
"Greece",
"Italy",
"Germany"
],
"field": null,
"coordinates": null
},
"temporalResolution": {
"unit": "Per Hour",
"value": 1
},
"spatialResolution": {
"unit": "Per Country"
}
},
"license": {
"accessLevel": "Public",
"license": "Custom",
"copyrightOwner": "Evmorfia Biliri",
"link": "",
"derivation": [
"Aggregate"
],
"attribution": "Required",
"reproduction": "Prohibited",
"distribution": "Prohibited",
"shareAlike": "Not applicable",
"reContext": "Prohibited",
"offlineRetention": "Prohibited",
"targetPurpose": [
"Non-Commercial"
]
},
"pricing": {
"cost": 100,
"currency": "DOL",
"paymentMethod": [
"Bank Transfer",
"Sth else"
],
"calculationScheme": "Custom"
}
},
"structure": {
"domain": {
"uid": "118",
"majorVersion": 1,
"name": "OccupancyProfile"
},
"primaryConcept": {
"uid": "120",
"name": "buildingSpace"
},
"otherConcepts": [
{
"uid": "122",
"name": "occupant"
},
{
"uid": "125",
"name": "meeting"
}
]
},
"schema": {
"buildingSpace": {
"spaceID": {
"values": [
"S0",
"S1",
"S2"
],
"_uid": "112"
},
"spaceDescription": {
"values": [
"Residence for 2 occupants",
"Private office",
"Residence for 1 occupant",
"Shared office for 3 occupants",
"Private office for 1 doctor",
"Residence for 4 occupants"
],
"_uid": "112"
},
"spaceType": {
"values": [
"ResidentialOwn",
"OfficeRented",
"ResidentialRented",
"OfficeShared",
"OfficePrivate"
],
"_uid": "112"
},
"spaceMaxOccupants": {
"_uid": "112",
"min": 1,
"max": 4
},
"visitingOccupant": {
"occupantID": {
"values": [
"O1",
"O2",
"O3",
"O4"
],
"_uid": "112"
}
},
"officeMeeting": {
"meetingDuration": {
"_uid": "112",
"min": 2,
"max": 120
},
"meetingStartTime": {
"_uid": "112",
"min": "2020-04-27T10:10:00+03:00",
"max": "2020-12-31T23:10:00+03:00"
}
}
}
}
}
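A payload like the one above could be prepared for the /dataset/[id] endpoint roughly as follows. This is a sketch under stated assumptions: `BASE_URL` is a hypothetical host and port, the JSON content type is assumed, and the request is only built, not sent. Because information may arrive in stages, the example sends just a partial payload.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # hypothetical host and port


def build_dataset_request(dataset_id, payload, method="POST"):
    """Build (but do not send) a POST or PUT request for /dataset/[id]."""
    if method not in ("POST", "PUT"):
        raise ValueError("only POST and PUT are documented for this endpoint")
    return urllib.request.Request(
        url=f"{BASE_URL}/dataset/{dataset_id}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed content type
        method=method,
    )


# Partial payload: only some of the fields from the sample are provided
partial = {"id": 3, "name": "Energy Data for BK", "status": "available"}
request = build_dataset_request(3, partial, method="PUT")
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would send it; omitted because the host is hypothetical
```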
Core search service
General information
Searching for datasets is the search engine's main functionality, so the search service is at its core. The service is, however, tightly coupled to the way search is implemented in the UI: it offers faceted search, and its response includes fields that are meant to facilitate the presentation of the information. The core ways to search for datasets, also presented here, are:
- Using free-text search, allowing the user to provide terms which are used to search across the datasets' information (in Elasticsearch). Currently the engine searches in the following dataset fields: name, description, metadata.general.tags, structure.domain.name, structure.concepts.name and structure.tokenized_concepts.
- Using a spatial search, i.e. searching for datasets that (a) have a specific field defined as an identifier of their spatial coverage and (b) have specific values in this spatial field.
- Using facets, i.e. applying filters. Currently supported fields to apply facets/filters on are: schema_fields, schema_fields_byId, structure.concepts.names.keyword, structure.domain.names.keyword, metadata.distribution.type, metadata.distribution.format, metadata.distribution.accessibility and metadata.distribution.language.
- Using data queries, allowing the user to provide conditions on selected concepts which are used to search in the datasets' actual data. The available conditions are: equals and not equals for all types of concepts; contains, starts with and ends with for string concepts; and greater than, greater than or equal to, less than and less than or equal to for numeric and datetime concepts. Conditions can be combined using AND/OR operators.
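The rule about which conditions apply to which concept types can be sketched as a lookup table. Only the spellings EQUALS and GREATER_THAN appear in the request samples; the other operator names, the type label "numeric" and the helper `is_valid_condition` are assumptions made for illustration.

```python
# Operators valid for every concept type
COMMON = {"EQUALS", "NOT_EQUALS"}
# Operators valid for string concepts only (assumed spellings)
STRING_ONLY = {"CONTAINS", "STARTS_WITH", "ENDS_WITH"}
# Operators valid for numeric and datetime concepts only (assumed spellings)
ORDERED_ONLY = {
    "GREATER_THAN",
    "GREATER_THAN_OR_EQUAL",
    "LESS_THAN",
    "LESS_THAN_OR_EQUAL",
}

ALLOWED = {
    "string": COMMON | STRING_ONLY,
    "numeric": COMMON | ORDERED_ONLY,
    "datetime": COMMON | ORDERED_ONLY,
}


def is_valid_condition(concept_type, operator_name):
    """Check whether an operator may be applied to a concept of this type."""
    # Unknown types fall back to the operators available to all concepts
    return operator_name in ALLOWED.get(concept_type, COMMON)


print(is_valid_condition("string", "CONTAINS"))        # True
print(is_valid_condition("datetime", "CONTAINS"))      # False
print(is_valid_condition("datetime", "GREATER_THAN"))  # True
```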
Some basic information about the search request is as follows:
- The filters and data queries can be used both on their own and in combination with free-text search or spatial search. When used on their own, the "query.text" value should be set to * to declare that all datasets should be considered.
- When multiple values are given for a filter, these are combined with "OR", i.e. the union of the results that match any of the values is considered.
- Free-text and spatial search cannot be used together. If (by mistake) values are given for both, then the spatial search configuration will be ignored and only free-text search will be applied.
- The results can be sorted by relevance (default), title, volume, date_available or date_modified.
- The search request may include any number of the supported facets or none. Not including a facet is equivalent to including it with a value set to [].
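The rules above can be captured in a small client-side helper that assembles a request body. This is a sketch, not part of the engine: the function `build_search_request` and its parameters are hypothetical, while the key names follow the request samples in this document.

```python
SORT_FIELDS = ("relevance", "title", "volume", "date_available", "date_modified")


def build_search_request(text=None, spatial_uid=None, spatial_values=(),
                         facets=None, conditions=(), combine="AND",
                         sort_field="relevance", ascending=True):
    """Assemble a search request body that respects the rules listed above."""
    if sort_field not in SORT_FIELDS:
        raise ValueError(f"unsupported sort field: {sort_field}")
    if text:
        # Free-text and spatial search cannot be combined; free-text wins
        query = {"text": text, "field": {"uid": None, "values": []}}
    elif spatial_uid is not None:
        query = {"text": None,
                 "field": {"uid": spatial_uid, "values": list(spatial_values)}}
    else:
        # Filters or data queries on their own: consider all datasets
        query = {"text": "*", "field": {"uid": None, "values": []}}
    return {
        "query": query,
        "sortBy": {"field": sort_field, "asc": ascending},
        # Omitting a facet is equivalent to sending it as [], so drop empty ones
        "facets": {k: list(v) for k, v in (facets or {}).items() if v},
        "dataQuery": {"conditions": list(conditions), "operant": combine},
        "joinedDatasets": {},
    }


filters_only = build_search_request(
    facets={"distribution.type": ["Text"], "distribution.format": []}
)
print(filters_only["query"]["text"])  # *
print(filters_only["facets"])         # {'distribution.type': ['Text']}
```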
Free-text search sample
The terms in the "query.text" field are used for the free-text search using a multi-match ES query.
{
"query": {
"text": "free text search content",
"field": {
"uid": null,
"values": [
]
}
},
"sortBy": {
"field": "volume",
"asc": true
},
"facets": {
"domainsByUid": [
],
"categoriesByUid": [
],
"distribution.accessibility": [
"Through an API"
],
"distribution.type": [
"Text"
],
"distribution.format": [
],
"distribution.language": [
],
"fieldsByUid": [
],
"fields": [
]
},
"dataQuery": {
"conditions": [
],
"operant": "AND"
},
"joinedDatasets": {
}
}
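For orientation, the Elasticsearch side of a free-text search might look like the sketch below, which builds a multi-match query body over the dataset fields listed earlier. The exact query the engine sends is not shown in this document, so the body is an assumption; in particular, mapping "*" to `match_all` is a guess at how "all datasets" could be expressed.

```python
import json

# Fields named in this document as targets of the free-text search
SEARCH_FIELDS = [
    "name",
    "description",
    "metadata.general.tags",
    "structure.domain.name",
    "structure.concepts.name",
    "structure.tokenized_concepts",
]


def free_text_query(text):
    """Build an Elasticsearch query body for the given free-text terms."""
    if text == "*":
        # Assumption: "*" means no text restriction at all
        return {"query": {"match_all": {}}}
    return {"query": {"multi_match": {"query": text, "fields": SEARCH_FIELDS}}}


body = free_text_query("free text search content")
print(json.dumps(body, indent=2))
```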
Spatial search sample
Here the information inside "query.field" is used: the search returns datasets that have the field with uid 3000 defined as their spatial coverage (i.e. "metadata.extent.spatialCoverage.field.uid"=3000) and the value of that spatial coverage (i.e. "metadata.extent.spatialCoverage.value") set to S0 or S1.
{
"query": {
"text": null,
"field": {
"uid": "3000",
"values": [
"S0",
"S1"
]
}
},
"sortBy": {
"field": "volume",
"asc": true
},
"facets": {
"fields": [
"buildingSpace.officeMeeting.meetingDuration",
"buildingSpace.spaceID"
],
"categoriesByUid": [
],
"domains": [
],
"fieldsByUid": [
"1111",
"112"
],
"distribution.accessibility": [
"Through an API"
],
"distribution.type": [
"Text"
],
"distribution.format": [
],
"distribution.language": [
]
},
"dataQuery": {
"conditions": [
],
"operant": "AND"
},
"joinedDatasets": {
}
}
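The two spatial conditions could be expressed in Elasticsearch as a bool filter, with a `term` clause for the field uid and a `terms` clause that matches any of the given values. This is a sketch only: the function `spatial_filter` is hypothetical, and it assumes both fields are indexed so that exact matching works (e.g. as keywords).

```python
import json


def spatial_filter(field_uid, values):
    """Build an Elasticsearch query body for the spatial search described above."""
    return {
        "query": {
            "bool": {
                "filter": [
                    # (a) the dataset declares this field as its spatial coverage
                    {"term": {"metadata.extent.spatialCoverage.field.uid": field_uid}},
                    # (b) the coverage value is any one of the requested values
                    {"terms": {"metadata.extent.spatialCoverage.value": list(values)}},
                ]
            }
        }
    }


body = spatial_filter("3000", ["S0", "S1"])
print(json.dumps(body, indent=2))
```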
Data query search sample
The information inside "dataQuery" is used: the search returns datasets that have a value in the "aircraft.certificateIssueDate" field greater than "2021-08-23T14:43:45.552Z" AND a value in the "carrier.carrierCode" field equal to "test".
{
"query": {
"text": "*",
"field": {
"uid": null,
"values": [
]
}
},
"facets": {
"categoriesByUid": [
],
"distribution.accessibility": [
],
"distribution.format": [
],
"distribution.language": [
],
"distribution.type": [
],
"domainsByUid": [
"cb2c8a25-4ab9-43e1-93db-c801aa4c0010"
],
"fieldsByUid": [
"a082ccd6-0a0b-4a4c-b8be-6ca2f913d387",
"a34a05fa-05ae-4e46-b28b-9cf90c94874c"
],
"fields": [
],
"categories": [
],
"domains": [
]
},
"sortBy": {
"field": "relevance",
"asc": true
},
"dataQuery": {
"operant": "AND",
"conditions": [
{
"concept": "aircraft.certificateIssueDate",
"operant": "GREATER_THAN",
"value": "2021-08-23T14:43:45.552Z",
"metadata": {
"id": 210,
"uid": "a082ccd6-0a0b-4a4c-b8be-6ca2f913d387",
"type": "datetime"
},
"conditionUid": "be9acf7d-f74f-4154-b43b-109d6c4a0f44"
},
{
"concept": "carrier.carrierCode",
"operant": "EQUALS",
"value": "test",
"metadata": {
"id": 170,
"uid": "a34a05fa-05ae-4e46-b28b-9cf90c94874c",
"type": "string"
},
"conditionUid": "d333d7aa-0667-4a56-b014-abb78e693b62"
}
]
},
"joinedDatasets": {
}
}
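To make the semantics of the two conditions concrete, the sketch below evaluates a "dataQuery" against a single data record. It is an illustration, not the engine's implementation: the record is assumed to be a flat dictionary keyed by concept path, only the two operators used in the sample are covered, and the names `matches` and `parse_value` are hypothetical.

```python
import operator
from datetime import datetime

# Only the operators that appear in the sample request
OPERATORS = {"EQUALS": operator.eq, "GREATER_THAN": operator.gt}


def parse_value(raw, concept_type):
    """Convert datetime strings so that they compare chronologically."""
    if concept_type == "datetime":
        return datetime.fromisoformat(raw.replace("Z", "+00:00"))
    return raw


def matches(record, data_query):
    """Return True if the record satisfies the combined conditions."""
    conditions = data_query["conditions"]
    if not conditions:
        return True  # no conditions means no restriction
    outcomes = []
    for cond in conditions:
        kind = cond["metadata"]["type"]
        if cond["concept"] not in record:
            outcomes.append(False)
            continue
        actual = parse_value(record[cond["concept"]], kind)
        expected = parse_value(cond["value"], kind)
        outcomes.append(OPERATORS[cond["operant"]](actual, expected))
    combine = all if data_query["operant"] == "AND" else any
    return combine(outcomes)


data_query = {
    "operant": "AND",
    "conditions": [
        {"concept": "aircraft.certificateIssueDate", "operant": "GREATER_THAN",
         "value": "2021-08-23T14:43:45.552Z", "metadata": {"type": "datetime"}},
        {"concept": "carrier.carrierCode", "operant": "EQUALS",
         "value": "test", "metadata": {"type": "string"}},
    ],
}
# Hypothetical data records
recent = {"aircraft.certificateIssueDate": "2022-01-01T00:00:00.000Z",
          "carrier.carrierCode": "test"}
older = {"aircraft.certificateIssueDate": "2020-01-01T00:00:00.000Z",
         "carrier.carrierCode": "test"}
print(matches(recent, data_query))  # True
print(matches(older, data_query))   # False
```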
Search response sample
{
"results": [
{
"id": 3,
"name": "Project 1 - KPIs",
"description": "KPI calculations for scenario 1 - project 1",
"score": null,
"schemaFields": [
"buildingSpace.spaceID",
"buildingSpace.spaceMaxOccupants",
"buildingSpace.visitingOccupant.occupantID",
"buildingSpace.spaceDescription",
"buildingSpace.spaceType"
],
"schemaFieldsById": [
],
"schemaSelection": null,
"modifiedAt": "2020-06-17T08:10:27.429Z",
"volume": null,
"copyrightOwner": "John Parkins",
"accessibility": [
"As a downloadable file"
],
"concepts": [
"kpi"
]
},
{
"id": 13,
"name": "Project 2 - occupancy",
"description": "Building-related occupancy data - Project 2",
"score": null,
"schemaFields": [
"buildingSpace.officeMeeting.meetingStartTime",
"buildingSpace.spaceDescription",
"buildingSpace.visitingOccupant.occupantID",
"buildingSpace.spaceID",
"buildingSpace.officeMeeting.meetingDuration",
"buildingSpace.spaceType",
"buildingSpace.officeMeeting.spaceID"
],
"schemaFieldsById": [
],
"schemaSelection": {
"buildingSpace.spaceID": "4321",
"buildingSpace.spaceDescription": "112",
"buildingSpace.spaceType": "121",
"buildingSpace.visitingOccupant.occupantID": "321",
"buildingSpace.officeMeeting.spaceID": "4321",
"buildingSpace.officeMeeting.meetingDuration": "1111",
"buildingSpace.officeMeeting.meetingStartTime": "876"
},
"modifiedAt": "2020-06-17T08:10:27.429Z",
"volume": null,
"copyrightOwner": "John Perkins",
"accessibility": [
"Through an API"
],
"concepts": [
"meeting",
"building"
]
}
],
"facets": {
"fields": [
{
"value": "buildingSpace.spaceDescription",
"count": 2,
"selected": false
},
{
"value": "buildingSpace.spaceID",
"count": 2,
"selected": true
},
{
"value": "buildingSpace.spaceType",
"count": 2,
"selected": false
},
{
"value": "buildingSpace.visitingOccupant.occupantID",
"count": 2,
"selected": false
},
{
"value": "buildingSpace.officeMeeting.meetingDuration",
"count": 1,
"selected": true
},
{
"value": "buildingSpace.officeMeeting.meetingStartTime",
"count": 1,
"selected": false
},
{
"value": "buildingSpace.officeMeeting.spaceID",
"count": 1,
"selected": false
},
{
"value": "buildingSpace.spaceMaxOccupants",
"count": 1,
"selected": false
}
],
"fieldsByUid": [
{
"value": "112",
"count": 2,
"selected": true
},
{
"value": "121",
"count": 1,
"selected": false
},
{
"value": "321",
"count": 1,
"selected": false
},
{
"value": "876",
"count": 1,
"selected": false
},
{
"value": "1111",
"count": 1,
"selected": true
},
{
"value": "1129",
"count": 1,
"selected": false
},
{
"value": "4321",
"count": 1,
"selected": false
},
{
"value": "9112",
"count": 1,
"selected": false
}
],
"categories": [
{
"value": "behavior",
"count": 1,
"selected": false
},
{
"value": "building",
"count": 1,
"selected": true
},
{
"value": "buildingSpace",
"count": 1,
"selected": false
},
{
"value": "kpi",
"count": 1,
"selected": true
},
{
"value": "meeting",
"count": 1,
"selected": true
},
{
"value": "need",
"count": 1,
"selected": false
},
{
"value": "project",
"count": 1,
"selected": false
},
{
"value": "scenario",
"count": 1,
"selected": false
}
],
"domains": [
{
"value": "KeyPerformanceIndicators",
"count": 1,
"selected": false
},
{
"value": "OccupancyProfile",
"count": 1,
"selected": false
}
],
"distribution.type": [
{
"value": "Text",
"count": 2,
"selected": false
}
],
"distribution.format": [
{
"value": "JSON",
"count": 2,
"selected": false
}
],
"distribution.accessibility": [
{
"value": "As a downloadable file",
"count": 1,
"selected": false
},
{
"value": "Through an API",
"count": 1,
"selected": false
}
],
"distribution.language": [
{
"value": "English",
"count": 2,
"selected": false
}
]
}
}
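On the client side, a response like the one above can be reduced to the facet values that are currently selected, for example to restore the state of filter checkboxes in a UI. The helper `selected_facets` is hypothetical and the response below is a trimmed, illustrative version of the sample.

```python
def selected_facets(response):
    """Map each facet name to the list of its currently selected values."""
    selected = {}
    for name, buckets in response.get("facets", {}).items():
        chosen = [bucket["value"] for bucket in buckets if bucket.get("selected")]
        if chosen:
            selected[name] = chosen
    return selected


# Trimmed response containing only a few facet entries
response = {
    "results": [],
    "facets": {
        "fields": [
            {"value": "buildingSpace.spaceID", "count": 2, "selected": True},
            {"value": "buildingSpace.spaceType", "count": 2, "selected": False},
        ],
        "fieldsByUid": [
            {"value": "112", "count": 2, "selected": True},
            {"value": "1111", "count": 1, "selected": True},
        ],
        "distribution.type": [
            {"value": "Text", "count": 2, "selected": False},
        ],
    },
}
print(selected_facets(response))
```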