Scope & Functionalities
The different reusable components are defined from a data-centric perspective and can collectively provide the functionalities of a Big Data Platform, a Data Marketplace, a Data Analytics/ML Platform or a Data Middleware, depending on which features are enabled. From a conceptual perspective, they allow the definition and flexible execution of different pipelines/jobs for data check-in and data analytics, but also extend to data search, retrieval and acquisition.
The functionalities supported by the different reusable components (mostly RC#1: S5-Collect and RC#2: S5-Search) currently include, at a high level:
- The configuration and execution of data check-in jobs, whose pipelines may include data harvesting, data mapping, data cleaning and data loading (for permanent data storage) steps.
- The profiling of data assets in terms of their associated metadata and the applicable access policies.
- The search and retrieval of data with the help of search queries that are configured by the users.
- Access for individual users, as well as support for organization-based access.
- The definition and management of data models.
- Notifications for the execution of data check-in jobs.
Data Harvesting
Batch, near real-time and streaming data can be collected in accordance with the preferences and configuration provided by a data provider in the Harvester step, which is executed by the Harvester Service:
- Option 1: Batch data files can be uploaded in different formats and are handled accordingly. If a file complies with a CSV/TSV, JSON or XML format, then a sample file containing representative data needs to be provided together with the full file in order to define the applicable configuration for all upcoming steps. If the file(s) follow other formats (e.g. images, IFC files, etc.), then they are handled as “objects” and their contents are not processed in any way.
- Option 2: Data can be ingested through APIs exposed by external, third-party systems. In order to allow for data harvesting from as many JSON and XML RESTful APIs as possible, the data provider needs to provide: (a) the API authentication details (ranging from different types of bearer token to custom authentication that allows logging in to a URL to retrieve the authentication token), (b) the method (GET/POST/PUT), the URL and the body of the request (whenever applicable), (c) the API pagination options (currently including page and offset pagination), (d) the request parameters that should be used (ranging from static to dynamic parameters for the query, pagination or authentication), (e) any extra headers that need to be part of the API calls, (f) the retrieval schedule (i.e. once, periodic retrieval on an hourly/daily/weekly/monthly schedule, polling), (g) the processing schedule (defined based on the retrieval schedule), (h) a basic error handling strategy for errors while harvesting the data, (i) the selection of the data sample (in terms of selecting the base path, the fields that should be stored and any query parameters that should be included in the data). A configuration sketch is provided after this list.
- Option 3: Data can be ingested through the RCs' APIs. In order to allow data providers to push data to the RCs (in case their applications/systems do not expose any APIs), the data provider needs to provide: (a) the processing schedule, (b) a basic error handling strategy for errors while posting the data, (c) an accurate sample of the data that will be uploaded. The data providers can then view the method (GET/POST), the URL and the body of the request (whenever applicable) that should be used by their applications, instructions on how to use the API endpoint (with an appropriate access token provided by the RCs) and the sample that has been selected.
- Option 4: Streaming data can be ingested through the RCs' Kafka streaming mechanism. In order for the data providers to be able to use the Kafka streaming service provided by the RCs, they need to define: (a) the format of the streaming data they intend to upload, (b) the retrieval settings (i.e. until when streaming data will be provided), (c) the processing schedule, (d) a basic error handling strategy for errors while posting the data, (e) an accurate sample of the data that will be uploaded. The data providers may view the connection details that they should use to upload streaming data (e.g. connection URL, topic, SASL mechanism, and credentials that are visible only the first time the harvesting step is configured and are then stored in an encrypted form without being visible again). A minimal producer sketch is provided at the end of this subsection.
- Option 5: Streaming data can be ingested through the Kafka streaming mechanism provided by external, third-party systems. TBA
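To make the Option 2 requirements more concrete, the following is a minimal sketch of how an API harvesting configuration could be expressed. The field names, values and structure are illustrative assumptions for readability and do not reflect the actual configuration schema used by the Harvester Service.

```python
# Illustrative sketch of an Option 2 (external API) harvester configuration.
# Field names and values are assumptions, not the RCs' actual schema.
api_harvester_config = {
    "authentication": {                      # (a) API authentication details
        "type": "bearer_token",              # could also be a custom login-to-URL flow
        "token": "<provided-by-data-provider>",
    },
    "request": {                             # (b) method, URL and body
        "method": "GET",                     # GET, POST or PUT
        "url": "https://api.example.com/v1/measurements",
        "body": None,
    },
    "pagination": {"type": "page", "page_param": "page", "size_param": "per_page"},  # (c)
    "parameters": {"from": "{{last_run_timestamp}}"},          # (d) static or dynamic parameters
    "headers": {"Accept": "application/json"},                 # (e) extra headers
    "retrieval_schedule": {"type": "periodic", "interval": "hourly"},  # (f)
    "processing_schedule": {"type": "after_retrieval"},                # (g)
    "error_handling": {"on_error": "retry", "max_retries": 3},         # (h)
    "sample_selection": {                    # (i) base path, fields to store, query parameters
        "base_path": "data.items",
        "fields": ["timestamp", "sensor_id", "value"],
    },
}
```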
Once the harvester configuration is finalized, data will start being collected (even if the next pre-processing steps have not yet been configured).
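For Options 4 and 5, the connection details displayed when the harvesting step is configured (connection URL, topic, SASL mechanism, credentials) map onto a standard Kafka client configuration. The sketch below uses the confluent-kafka Python client with placeholder broker, topic and credential values; it is an assumption of how a data provider's application could push streaming records, not the RCs' reference implementation.

```python
import json
from confluent_kafka import Producer

# Placeholder connection details: in practice these are the values shown once
# when the harvesting step is configured (broker URL, SASL mechanism, credentials).
producer = Producer({
    "bootstrap.servers": "kafka.example.com:9093",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "SCRAM-SHA-256",   # whichever SASL mechanism the RCs indicate
    "sasl.username": "<provided-username>",
    "sasl.password": "<provided-password>",
})

record = {"sensor_id": "A-17", "timestamp": "2023-05-04T10:15:00Z", "value": 21.4}

# Produce the record to the topic assigned to this data check-in job and make
# sure it is delivered before exiting.
producer.produce("assigned-topic-name", value=json.dumps(record).encode("utf-8"))
producer.flush()
```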
Data Mapping
The data mapping step is instrumental for data harmonization and semantic interoperability, as it maps the data (which may comply with any schema and standard) to an appropriate data model prior to applying any additional processing steps and storing the data. The data mapping leverages different data models that bring domain knowledge into the reusable components and are managed with the help of the Data Model Manager.
Once the configuration of the harvesting step is finalized and if the mapping step is enabled, the data providers are able to proceed with the configuration of the mapping step. Initially, they need to define the domain to which their data refer from the list of supported domains (i.e. those for which data models are already available), the standard with which their data comply (if applicable), and the basic concept to which the data refer, in order to facilitate the mapping process. Then, the data providers are presented with the Mapping Playground, in which mapping predictions to concepts of the selected data model are automatically provided along with their confidence level. The data providers can:
- Select a column/field in their data in order to remove the mapping prediction if it is not correct,
- Select a column/field in their data in order to provide additional mapping details for correct mappings, e.g. the measurement unit for their numeric data (so that the Mapping-Transformation Service can make the necessary unit transformation to the baseline measurement unit dictated by the data model), or the datetime format and time zone (so that the Mapping-Transformation Service can make the necessary datetime transformations for consistency of the stored data when a time zone is not already included in the data); these transformations are illustrated in the sketch at the end of this subsection,
- Select a column/field in their data that has no mapping in order to drag-and-drop an exact concept from the data model,
- Select one or multiple columns/fields in their data that have no mapping to define a related concept to which they refer in order to navigate to the relevant branch of the data model and provide the appropriate mapping.
At any moment, the data providers can save and validate the mapping rules that have been configured. Upon reviewing the mapping-transformation rules that have been defined, the data providers are able to finalize the mapping step. It needs to be noted that any columns/fields that do not have a mapping to the relevant data model are discarded from the data after the mapping step is executed. Once the mapping step has been successfully executed in the Mapping-Transformation Service, the data providers are able to navigate to a summary of the execution results in order to see what transformations took place. If the mapping step fails, the data providers view a summary of the execution results indicating which transformation/mapping rules are responsible for the failure.
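As an illustration of the transformations described above (unit conversion to the data model's baseline unit, datetime normalisation to a consistent time zone, and the discarding of unmapped fields), the following sketch applies a small, hypothetical set of mapping rules to one record. The rule structure, concept names and conversion factors are assumptions, not the Mapping-Transformation Service's internal format.

```python
from datetime import datetime, timezone

# Hypothetical mapping rules: source field -> target concept, with optional
# unit and datetime handling. Concept names and factors are illustrative only.
mapping_rules = [
    {"source": "temp_f", "concept": "temperature",
     "to_baseline": lambda v: (v - 32) * 5 / 9},          # baseline assumed to be Celsius
    {"source": "obs_time", "concept": "observationDateTime",
     "datetime_format": "%d/%m/%Y %H:%M", "source_tz": timezone.utc},
]

def apply_mapping(record: dict) -> dict:
    """Keep only mapped fields, converting units and datetimes on the way."""
    mapped = {}
    for rule in mapping_rules:
        if rule["source"] not in record:
            continue
        value = record[rule["source"]]
        if "to_baseline" in rule:
            value = rule["to_baseline"](value)
        if "datetime_format" in rule:
            dt = datetime.strptime(value, rule["datetime_format"])
            value = dt.replace(tzinfo=rule["source_tz"]).isoformat()
        mapped[rule["concept"]] = value
    return mapped  # unmapped fields (e.g. "station") are discarded, as described above

print(apply_mapping({"temp_f": 98.6, "obs_time": "04/05/2023 10:15", "station": "X1"}))
# -> {'temperature': 37.0, 'observationDateTime': '2023-05-04T10:15:00+00:00'}
```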
Data Cleaning
The data cleaning step is responsible for improving the data quality by detecting incomplete, incorrect or irrelevant parts of the data and replacing, modifying, or deleting such dirty or coarse data. Data Cleaning is configured in the cleaning step and is executed by the Cleaning Service. Taking into consideration that the mapping/transformation service already handles the data type transformations, measurement unit transformations and time zone transformations, the cleaning rules that are currently supported take into consideration the data type of each field/column that appears in a data asset and include:
- Range Constraints: typically, columns/fields with a numeric or datetime data type should fall within a certain range; otherwise, the values are considered outliers that should be handled.
- Mandatory Constraints: certain columns cannot have null/empty values; such values should either be dropped or filled in with a value (e.g. the previous value, the min/max/mean value, a certain value the user provides, etc.).
- Regular expression patterns: text fields must follow a certain pattern. For example, depending on the country, phone numbers are required to have a certain pattern.
- Cross-field validation: certain conditions that span multiple fields must hold. For example, the value of a column/field must always be greater/less than the value of another column/field in the same row/record of the data.
- Unique Constraints: a field, or a combination of fields, must be unique in each row/record in the dataset.
- Foreign-key constraints: as in relational databases, a foreign key column is not allowed to have a value that does not exist in the referenced column that acts as the primary key.
If exceptions or outliers are detected for any of the rules, the specific rows/records are dropped or replaced with a value, depending on the harvesting method (e.g. min/max/mean values are not allowed in cases where the data are frequently updated).
The data providers can define cleaning rules per column/field or for multiple columns/fields of the same data type. At any moment, the data providers can save the cleaning rules that have been configured. Upon reviewing the cleaning rules that have been defined, the data providers are able to finalize the cleaning step.
Once the cleaning step has been successfully executed in the Cleaning Service, the data providers are able to navigate to a summary of the execution results in order to see which cleaning rules were applied and what transformations took place. If the cleaning step fails, the data providers view a summary of the execution results indicating which cleaning rules are responsible for the step’s failure.
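The sketch below shows how a few of the rule types above (mandatory, range, regular expression and unique constraints) could be applied to a tabular sample with pandas. The rule values and the handling choices (fill vs drop) are illustrative assumptions, not the Cleaning Service's actual behaviour.

```python
import pandas as pd

df = pd.DataFrame({
    "temperature": [21.5, None, 250.0, 22.1],                           # range + mandatory constraints
    "phone": ["+30123456789", "+30123456788", "bad", "+30123456787"],   # regex pattern
    "record_id": ["r1", "r2", "r3", "r1"],                              # unique constraint
})

# Mandatory constraint: fill missing temperatures with the column mean
# (one of the fill strategies mentioned above).
df["temperature"] = df["temperature"].fillna(df["temperature"].mean())

# Range constraint: drop rows whose temperature falls outside an illustrative range.
df = df[df["temperature"].between(-50, 60)]

# Regular expression pattern: keep only phone numbers matching a simple pattern.
df = df[df["phone"].str.match(r"^\+\d{11}$")]

# Unique constraint: drop duplicated record identifiers, keeping the first occurrence.
df = df.drop_duplicates(subset=["record_id"], keep="first")

print(df)
```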
Data Loading
The data loading step manages the permanent storage of the data. The data providers need to define whether the data should be stored in a new data asset, along with the title and description of that data asset.
For data that have been processed (i.e. at least the mapping step is applied), the data loading step also handles the necessary extraction of metadata per field (for indexing purposes, but also for calculating the value of certain spatial-temporal metadata), as well as the applicable processing (e.g. cleaning) rules, and results in the storage of the data in the NoSQL database (MongoDB). Data that are not to be processed (e.g. other files, or data check-in jobs in which the mapping step is not enabled) are directly stored as assets in the object storage (MinIO).
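As a rough illustration of the loading behaviour for processed data, the sketch below stores mapped records in MongoDB and derives a simple temporal-extent metadata entry from a datetime field. The connection string, collection names, field names and metadata shape are assumptions, not the RCs' actual storage layout.

```python
from pymongo import MongoClient

# Placeholder connection string and database/collection names.
client = MongoClient("mongodb://localhost:27017")
db = client["data_platform"]

records = [
    {"observationDateTime": "2023-05-04T10:15:00+00:00", "temperature": 21.5},
    {"observationDateTime": "2023-05-04T11:15:00+00:00", "temperature": 22.1},
]

# Store the processed records as documents of a (hypothetical) data asset collection.
db["asset_1234"].insert_many(records)

# Derive a simple temporal-extent metadata entry from the datetime field,
# mirroring the automatic calculation of spatial-temporal metadata described above.
timestamps = [r["observationDateTime"] for r in records]
db["asset_metadata"].update_one(
    {"asset_id": "asset_1234"},
    {"$set": {"temporal_extent": {"from": min(timestamps), "to": max(timestamps)}}},
    upsert=True,
)
```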
Assets’ Metadata
In order to make their data assets available and facilitate their search, the data providers need to fill in a set of metadata according to the instructions provided. Such metadata are classified under the following categories:
- General information about the profile of the specific data asset, including its title, description, tags and reference data assets (stored as other files) to which it may be linked.
- Distribution Details that concern the availability of and access to the specific data asset, i.e. its format, type and language.
- Extent Details regarding the coverage and granularity of the data asset from a temporal and spatial perspective. They can be either manually provided by the data providers (in case of static values) or automatically calculated (and updated) based on the values of a specific column/field within the data.
- Licensing Information including the license and its associated terms under which a data asset is made available. Depending on the access level (public, private, confidential) selected by the data provider, the necessary licensing and pricing metadata vary while the access policies can be enabled/disabled.
- Pricing Details that concern the payment method and cost for 3rd parties (acting as data consumers) to acquire the specific data asset.
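Put together, the categories above could be captured in a metadata record along the following lines; the structure, field names and values are purely illustrative and do not reflect the exact metadata schema of the RCs.

```python
# Illustrative metadata record for a data asset; field names and values are assumptions.
asset_metadata = {
    "general": {
        "title": "Airport Ground Temperature Readings",
        "description": "Hourly temperature readings collected at the apron area.",
        "tags": ["temperature", "airport", "sensors"],
        "reference_assets": [],                                  # linked assets stored as other files
    },
    "distribution": {"format": "json", "type": "processed", "language": "en"},
    "extent": {
        "temporal": {"from": "2023-01-01", "to": "2023-06-30"},  # manual or auto-calculated
        "spatial": {"coverage": "Athens International Airport"},
    },
    "licensing": {"access_level": "private", "license": "custom-terms"},
    "pricing": {"payment_method": "one-off", "cost": 0},
}
```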
Data Search
The users (acting as data asset consumers) can search for data that their applications need and to which they have legitimate access (the access policies defined for each data asset by its data provider are enforced). The provided functionalities allow users to:
- Search for keywords in the title, description and tags of the datasets.
- Search based on a selected field that represents an identifier for the spatial coverage aspects of the data assets.
- Explore the faceted search functionalities so as to filter the search results based on different parameters, i.e. the domains and their concepts, the categories, the accessibility method, the type, the format and the language of the data asset.
- View and navigate to a search query they have already defined and saved.
The users can navigate to the profile (i.e. metadata and structure) of each data asset that appears in the results and can decide how they intend to acquire it, as described in the Data Retrieval section.
Once the data asset(s) they are interested in have been selected, the users save the specific query by providing its title and description.
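As an indication of what such a search could look like programmatically, the sketch below sends a keyword query with faceted filters to a hypothetical search endpoint of RC#2. The URL, parameter names and payload structure are assumptions for illustration only and do not represent the actual API contract.

```python
import requests

# Hypothetical RC#2 search endpoint and access token; not the actual API contract.
SEARCH_URL = "https://rc2.example.com/api/search"
headers = {"Authorization": "Bearer <access-token>"}

query = {
    "keywords": "temperature airport",            # matched against title, description, tags
    "filters": {                                  # faceted search parameters
        "domain": "aviation",
        "accessibility": "api",
        "format": "json",
        "language": "en",
    },
    "saved_query": {"title": "Airport temperature assets",
                    "description": "Assets feeding the apron dashboard"},
}

response = requests.post(SEARCH_URL, json=query, headers=headers, timeout=30)
for asset in response.json().get("results", []):
    print(asset["title"])
```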
Data Retrieval
Depending on the harvesting method that has been used for each data asset, different options for retrieving data assets are available:
- Add to Query Results: The user can select one or multiple data assets in case they have been fully processed and stored, and are thus available through the RC#2 API endpoints. They can select exactly which fields/columns of the data they need and define which of these fields/columns should also act as query parameters (for which a single value or a range of values is provided in each API request) in order to properly filter the expected API results. Based on the selection they have made, they can quickly test the API endpoint and preview the results they can get. Finally, they can view all related information on how to acquire the data: (a) the API endpoints, including the URL of a GET or POST API endpoint, and (b) the authentication and pagination guidelines.
- Download File(s): The user can select the data assets for which Other data files have been stored. At the moment, the files can only be downloaded directly through the RCs' user interface, but a future release will also make it possible to retrieve them through an API (full files or links to download the respective files).
- Subscribe to Stream: The user can subscribe to the topic that is related to the specific data asset in order to get real-time access to non-processed data. They can view the topic title, the sample of the data they will consume and the connection details that they should use (e.g. connection URL, topic, SASL mechanism, and credentials that are visible only the first time the query is configured and are then stored in an encrypted form without being visible again); a minimal consumer sketch is provided below. This is practically an alternative to acquisition through the RC#2 endpoints in case real-time access is required.
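For the Subscribe to Stream option, the connection details displayed when the query is configured map onto a standard Kafka consumer configuration. The following is a minimal sketch using the confluent-kafka Python client with placeholder broker, topic and credential values; it is not the RCs' reference client.

```python
import json
from confluent_kafka import Consumer

# Placeholder connection details: in practice these are the values shown once
# when the query is configured (broker URL, topic, SASL mechanism, credentials).
consumer = Consumer({
    "bootstrap.servers": "kafka.example.com:9093",
    "group.id": "data-consumer-app",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "SCRAM-SHA-256",
    "sasl.username": "<provided-username>",
    "sasl.password": "<provided-password>",
    "auto.offset.reset": "latest",
})
consumer.subscribe(["assigned-topic-name"])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        record = json.loads(msg.value())
        print(record)   # hand the record over to the consuming application
finally:
    consumer.close()
```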
Access Policies
Depending on the access level that data providers select, they are able to define authorization policies for permitting or denying access requests to any data asset available in the RCs, in real time.
Access to data assets is regulated through Attribute-Based Access Control (ABAC) policies, which allow the data providers to protect and share their data assets, even when they do not have any prior knowledge of the potential individual data consumers in the platform.
In principle, the data providers select the main “allow-all” or “deny-all” strategy and define the corresponding exceptions. Indicative access policies that can be expressed in this way are: “no airline can access the data asset”, “only companies X and Y can access the data asset”, “only airlines from Greece or Cyprus can access the data asset”, or “all organizations except for organizations of type Z can access the data”. Currently, only the user type and organization type (if organization-based access is enabled) are assessed as subject properties in the data access policies; once the profiles of the users and organizations are enriched (e.g. with the country of origin of the organization), the available properties for the access policies will also be enriched. A proper separation of concerns between policy definition (at the time the asset’s metadata are defined) and policy enforcement (any time a search for a specific data asset or a retrieval of a data slice from the specific data asset is performed) has been effectively ensured. The applicable policies may be updated and deleted only by the data provider (if the data provider is an organization, all users under the specific organization are authorized to edit/delete access policies) and they are immediately enforced.
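To make the allow-all/deny-all-with-exceptions model more tangible, the sketch below evaluates a request's subject attributes against a hypothetical policy reflecting the indicative example “only airlines from Greece or Cyprus can access the data asset”. The policy structure and attribute names are assumptions (the country attribute, in particular, is one of the future profile enrichments mentioned above), not the actual policy engine's format.

```python
# Hypothetical ABAC policy: deny-all strategy with explicit allow exceptions.
policy = {
    "strategy": "deny-all",
    "exceptions": [
        {"organization_type": "airline", "organization_country": ["GR", "CY"]},
    ],
}

def matches(exception: dict, subject: dict) -> bool:
    """An exception matches when every attribute it constrains is satisfied."""
    for attribute, allowed in exception.items():
        value = subject.get(attribute)
        if isinstance(allowed, list):
            if value not in allowed:
                return False
        elif value != allowed:
            return False
    return True

def is_access_allowed(policy: dict, subject: dict) -> bool:
    """Deny-all allows on a matching exception; allow-all denies on one."""
    matched = any(matches(e, subject) for e in policy["exceptions"])
    return matched if policy["strategy"] == "deny-all" else not matched

# A Greek airline is allowed; an organization of another type is denied.
print(is_access_allowed(policy, {"organization_type": "airline", "organization_country": "GR"}))  # True
print(is_access_allowed(policy, {"organization_type": "airport", "organization_country": "GR"}))  # False
```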