Data Check-in Overview

Data Check-in Pipelines

The execution services are responsible for performing the actual ETL process of the data collection. The extract stage is performed by the Harvester service, the transform stage is performed by the Transformer, Cleaner, Anonymizer and Encryption services, and the load stage is performed by the Loader service. These services are containerized with Docker and executed as Kubernetes Jobs in a Kubernetes cluster. The Execution Director service is responsible for creating and submitting these jobs to the Kubernetes cluster. Information about the development of the execution services can be found in the python-services GitHub repository.
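
For illustration, the sketch below shows how a service like the Execution Director might submit one ETL stage (here, the Harvester) as a Kubernetes Job using the official kubernetes Python client. The image name, namespace, and environment variable are assumptions for the sake of the example, not the actual implementation.

```python
# Illustrative sketch only: the image name, namespace and env var are assumptions.
from kubernetes import client, config

def submit_harvester_job(execution_id: str) -> None:
    """Create a Kubernetes Job that runs a single ETL stage (the Harvester)."""
    config.load_incluster_config()  # the Execution Director runs inside the cluster

    container = client.V1Container(
        name="harvester",
        image="registry.example.com/harvester:latest",  # hypothetical image
        env=[client.V1EnvVar(name="EXECUTION_ID", value=execution_id)],
    )
    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(name=f"harvester-{execution_id}"),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(restart_policy="Never", containers=[container])
            ),
            backoff_limit=0,
        ),
    )
    client.BatchV1Api().create_namespaced_job(namespace="execution", body=job)
```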

Sample Run Execution

General Information

In some cases, a Data Check-in Pipeline can perform a Sample Run execution up to a specific step. During a sample run, the backend spawns an execution of type sample, which runs only on the cloud and executes all of the data check-in's tasks on the sample data, up to and including an output task. The output task is the task at which the execution stops and returns its result in JSON format.
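
As a rough illustration, a sample execution configuration might look like the following. The field names and the MinIO path are assumptions, not the actual schema; the relevant points are the sample type and the output task at which processing stops.

```python
# Hypothetical shape of a sample execution configuration (field names are assumptions).
sample_execution = {
    "execution_type": "sample",                      # runs only on cloud
    "tasks": ["file-harvester", "mapping", "cleaning"],
    "output_task": "mapping",                        # execution stops here and returns JSON
    "sample_file": "minio://dc-samples/123/sample.parquet",  # hypothetical location
}
```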

Supported Steps

The Sample Run can only be executed on the following data check-in steps:

  • File Harvester (only for Parquet files)
  • Mapping
  • Cleaning

Sample Run Flow

A Sample Run execution has the following flow:

  1. During the configuration of the supported data check-in steps (see the section above), the user can trigger a Sample Run either by clicking the 'Run On Sample' button in the UI or by navigating to the next tab, in order to see what will happen to their sample data up to the current step with the current configuration.
  2. The backend prepares the sample execution's configuration and sends it to the appropriate RabbitMQ queue so that it can be executed by the execution service. Preparing the configuration means that the harvester task is always replaced with a File Harvester task, which reads the sample file from MinIO; all other tasks remain untouched (see the first sketch after this list).
  3. The execution service executes all data check-in tasks, up to the output task, on the sample data. At the end of the final task, the result is converted to unflattened JSON and sent to the appropriate RabbitMQ queue.
  4. The backend reads the RabbitMQ message containing the sample execution's result and saves it in the processed_sample column (in Postgres) of the specific data check-in step. Finally, it sends an SSE message notifying the frontend that the sample run has been completed (see the second sketch after this list).
  5. The frontend receives the 'completed' SSE message and fetches the step's processed sample from the database.
  6. The user is automatically navigated to the next tab in the UI, where the processed sample is presented.
  7. The user can navigate back to the previous step, make any changes to the configuration, and execute a new sample run at any time.
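
The first sketch below illustrates step 2: the backend swaps the original harvester for a File Harvester that reads the sample file from MinIO and publishes the resulting configuration to RabbitMQ with pika. The queue name, task structure, and MinIO path are assumptions made for the example.

```python
# Sketch of step 2 (assumed queue name, task structure and MinIO path).
import json
import pika

def publish_sample_execution(execution_config: dict, step_id: int) -> None:
    # Replace the harvester with a File Harvester pointing at the sample file;
    # all other tasks are left untouched.
    execution_config["tasks"][0] = {
        "type": "file-harvester",
        "source": f"minio://dc-samples/step-{step_id}/sample.parquet",  # hypothetical
    }

    connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
    channel = connection.channel()
    channel.queue_declare(queue="sample-executions", durable=True)
    channel.basic_publish(
        exchange="",
        routing_key="sample-executions",
        body=json.dumps(execution_config),
    )
    connection.close()
```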
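
The second sketch illustrates step 4: the backend consumes the sample result from RabbitMQ, stores it in the step's processed_sample column, and notifies the frontend over SSE. The queue name and message fields are assumptions, and the two helpers are placeholders for the real database and SSE logic.

```python
# Sketch of step 4 (assumed queue name and message fields; helpers are placeholders).
import json
import pika

def save_processed_sample(step_id: int, result: dict) -> None:
    """Placeholder: the real backend would UPDATE the step's processed_sample column in Postgres."""

def notify_frontend(step_id: int) -> None:
    """Placeholder: the real backend would push a 'completed' SSE event for this step."""

def on_sample_result(channel, method, properties, body) -> None:
    # The message carries the sample execution's result for a specific step.
    message = json.loads(body)
    save_processed_sample(message["step_id"], message["result"])  # hypothetical field names
    notify_frontend(message["step_id"])
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
channel = connection.channel()
channel.queue_declare(queue="sample-results", durable=True)
channel.basic_consume(queue="sample-results", on_message_callback=on_sample_result)
channel.start_consuming()
```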

Sample Run Errors

If the Sample Run execution was unsuccessful (Failed) for the Mapping or Cleaning steps, the fields or constraints that caused the execution to fail are highlighted in red and the reason for the failure is presented in the UI. However, if the Cleaning step failed because of an empty sample result (due to specific constraints), the user is asked to decide whether to continue with the current constraints/rules or to make additional changes to the cleaning configuration.