Data Check-in

This library provides the functionality for configuring data collection workflows: DataCheckinJobs (DCJs for short). Each workflow is a sequence of steps (DataCheckinJobSteps), each of which is responsible for a specific task.
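
For illustration, such a workflow could be modelled roughly as follows; the class and field names here are assumptions made for the sketch, not the library's actual API:

    from dataclasses import dataclass, field

    @dataclass
    class DataCheckinJobStep:
        """One step of a data check-in workflow."""
        step_type: str              # e.g. "harvester", "mapping", "loader"
        order: int                  # position in the execution sequence
        config: dict = field(default_factory=dict)

    @dataclass
    class DataCheckinJob:
        """A data check-in job: an ordered sequence of steps."""
        name: str
        description: str | None = None
        steps: list[DataCheckinJobStep] = field(default_factory=list)

        def ordered_steps(self) -> list[DataCheckinJobStep]:
            # Steps run in ascending order (harvester first, loader last).
            return sorted(self.steps, key=lambda s: s.order)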

The user can define step types (DataCheckinStepTypes), which may be optional or required, may depend on other steps, have a specific ordering, etc.

The minimum installation should include at least two steps: harvester and loader, both required and with orders of 0 and 100 respectively. This is because you need to harvest data from somewhere (e.g. an API) and load it somewhere else (e.g. a database or a file on disk).
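
As a minimal sketch, assuming a simple registry of step types (the field names are assumptions), the two required steps could be declared like this:

    # Hypothetical step-type registry for a minimal installation:
    # only the two required steps, harvester (order 0) and loader (order 100).
    STEP_TYPES = [
        {"name": "harvester", "order": 0,   "required": True},
        {"name": "loader",    "order": 100, "required": True},
        # Optional steps would slot in between, e.g.:
        # {"name": "mapping", "order": 20, "required": False, "depends_on": ["harvester"]},
    ]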

When creating a new data check-in job, the user provides a name and (optionally) a description, and selects which of the available steps need to be configured for the correct processing of the data. Currently, adding or removing steps is not allowed.
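
A job-creation request might then look roughly like the following; the endpoint and payload shape are assumptions for illustration only:

    import requests

    payload = {
        "name": "weather-observations",
        "description": "Hourly weather data from a public API",   # optional
        "steps": ["harvester", "mapping", "cleaner", "loader"],   # selected at creation time
    }
    response = requests.post(
        "https://example.org/api/data-checkin-jobs",   # hypothetical endpoint
        json=payload,
        timeout=30,
    )
    response.raise_for_status()
    job = response.json()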

If any part of the configuration is marked as sensitive, it is removed from the configuration, stored in Vault, and a reference to the Vault entry takes its place.
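
The redaction could work along these lines; store_in_vault stands in for whatever Vault client the deployment uses, and the reference format is an assumption:

    def store_in_vault(path: str, value: str) -> str:
        """Placeholder for the real Vault client call; returns a reference to the secret."""
        return f"vault:secret/data/{path}"

    def redact_sensitive(config: dict, sensitive_keys: set, job_id: str) -> dict:
        """Replace sensitive configuration values with references to Vault."""
        redacted = {}
        for key, value in config.items():
            if key in sensitive_keys:
                redacted[key] = store_in_vault(f"dcj/{job_id}/{key}", value)
            else:
                redacted[key] = value
        return redacted

    # Example: the API key never ends up in the stored configuration.
    clean = redact_sensitive(
        {"url": "https://api.example.org", "api_key": "s3cr3t"},
        sensitive_keys={"api_key"},
        job_id="42",
    )
    # clean == {"url": "https://api.example.org",
    #           "api_key": "vault:secret/data/dcj/42/api_key"}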

Available Steps

This module only provides the means for configuring these tasks. Execution is managed by the orchestrator library; you can read more about execution in the relevant section.

  1. Harvester is used for retrieving data from various sources, converting them to JSON and storing them in temporary storage. This might be done by:

    • accepting user-uploaded files,
    • retrieving data from external APIs (either open, or user-defined),
    • providing a DCJ-specific API, where the user can POST their data,

      • The user can upload either text data only (in JSON format) or text and binary data by sending a multipart POST request with two key-value pairs: one for the binary file (with key: _uploaded_file) and one for the actual data in JSON format (with key: data); see the multipart sketch after this list of steps.
    • providing the user with a Kafka broker to publish their data,
    • subscribing to a user-provided Kafka broker.

    This step should always be the first to be executed; thus its order is 0.

  2. Mapping (Transformer) is responsible for merging the files that the harvester produced, applying different transformations, renaming fields to match the common data model, and storing the output in temporary storage.
  3. Cleaner is responsible for detecting and correcting (or removing) corrupt or inaccurate records from a dataset. It depends on Mapping, as it uses information from the common data model.
  4. Anonymizer (PENDING)
  5. Loader is responsible for storing (or updating) the data in the user-selected database and/or storage.

    This step should always be the last to be executed; thus its order is 100.
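
For the DCJ-specific API mentioned in the Harvester item, a multipart POST with the _uploaded_file and data keys could be sent as sketched below; the endpoint is an assumption, and any required authentication is omitted:

    import json
    import requests

    url = "https://example.org/api/data-checkin-jobs/42/upload"   # hypothetical endpoint
    records = {"observations": [{"station": "A1", "temperature": 21.3}]}

    with open("measurements.zip", "rb") as binary_file:
        response = requests.post(
            url,
            files={"_uploaded_file": binary_file},   # the binary part
            data={"data": json.dumps(records)},      # the actual data, as JSON text
            timeout=60,
        )
    response.raise_for_status()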

Events

As the whole execution of the data check-in process is asynchronous, a set of events is defined and raised when needed:

Job Events

  • DCJ_CREATED is raised when a data check-in job is created,
  • DCJ_DELETED is raised when a data check-in job is deleted,

Step Events

  • DCJ_CONFIGURED is raised when a step configuration is finalised,
  • DCJ_STEP_UPDATED is raised when a step's configuration is updated,
  • DCJ_STEP_RESCHEDULE is raised by the scheduler to notify the orchestrator that a deferred step should be moved to the execution queue,
  • DCJ_STEP_DELETED is raised when a step is deleted,
  • DCJ_STEP_STARTED is raised by the orchestrator when a step's execution started,
  • DCJ_STEP_PROGRESS is raised by the orchestrator when a step's execution made progress (e.g. a file was processed/generated),
  • DCJ_STEP_CANCELLED is raised by the orchestrator when a step's execution was cancelled by the user,
  • DCJ_STEP_FAILED is raised by the orchestrator when the execution of a step has failed,
  • DCJ_STEP_COMPLETED is raised by the orchestrator when the execution of a step was completed successfully.
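
As an illustration of how these events might be consumed, the event names come from the lists above, but the payload shape below is an assumption:

    # Hypothetical payload of a step event as it might appear on the message bus.
    example_event = {
        "event": "DCJ_STEP_COMPLETED",
        "job_id": 42,
        "step_id": 7,
        "step_type": "harvester",
        "timestamp": "2024-01-01T12:00:00Z",
        "details": {"files_processed": 12},
    }

    def handle_event(event: dict) -> None:
        """Minimal dispatch on the event name."""
        if event["event"] == "DCJ_STEP_FAILED":
            print(f"Step {event['step_id']} of job {event['job_id']} failed")
        elif event["event"] == "DCJ_STEP_COMPLETED":
            print(f"Step {event['step_id']} of job {event['job_id']} completed")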

Execution Statuses

Based on the execution state of a step, its status can change to:

  • Configuration: (default) waiting for the user to finalise the configuration,
  • Idle: waiting for the orchestrator to forward it to the execution engine,
  • Queued: forwarded to the execution engine, queued for execution,
  • Running: execution is in progress,
  • Failed: the execution failed,
  • Cancelled: the user cancelled the execution,
  • Completed: the execution completed successfully.
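
One plausible reading of these statuses as a state machine is sketched below; the transitions are an assumption derived from the descriptions above, not a definitive specification:

    from enum import Enum

    class StepStatus(Enum):
        CONFIGURATION = "Configuration"
        IDLE = "Idle"
        QUEUED = "Queued"
        RUNNING = "Running"
        FAILED = "Failed"
        CANCELLED = "Cancelled"
        COMPLETED = "Completed"

    # Assumed transitions between statuses.
    ALLOWED_TRANSITIONS = {
        StepStatus.CONFIGURATION: {StepStatus.IDLE},
        StepStatus.IDLE: {StepStatus.QUEUED},
        StepStatus.QUEUED: {StepStatus.RUNNING, StepStatus.CANCELLED},
        StepStatus.RUNNING: {StepStatus.COMPLETED, StepStatus.FAILED, StepStatus.CANCELLED},
    }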

Locking

The following Data Check-in actions require locking:

  1. Edit Job (Configuration)
  2. Harvester
  3. Mapping (Transformer)
  4. Cleaner
  5. Anonymizer (PENDING)
  6. Loader

During the above actions, the DCJ is locked, which means that no other user can access these actions. If the action is already locked by another user, an informative message appears.

A locked DCJ is released when:

  1. the pre-defined expiration period elapses (the DCJ is released automatically), or
  2. the user that locked the DCJ navigates to any component unrelated to the above actions (e.g. Home).

Notes:

- The locking period is 1 hour. For Mapping, the 1-hour period is renewed whenever the user saves (see the sketch after these notes).

- When a DCJ is locked, it cannot be cloned by another user.

- Execution history remains visible even if the DCJ is locked.
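
A lock with a one-hour expiry that is renewed on save could be sketched as follows; this is an in-memory illustration with assumed names, not the library's implementation:

    from datetime import datetime, timedelta, timezone

    LOCK_DURATION = timedelta(hours=1)

    class DcjLockRegistry:
        """In-memory sketch of per-job locking with a 1-hour expiry."""

        def __init__(self) -> None:
            self._locks = {}   # job_id -> (user, expires_at)

        def acquire(self, job_id: int, user: str) -> bool:
            now = datetime.now(timezone.utc)
            holder = self._locks.get(job_id)
            if holder and holder[1] > now and holder[0] != user:
                return False   # locked by another user: show an informative message
            self._locks[job_id] = (user, now + LOCK_DURATION)
            return True

        def renew(self, job_id: int, user: str) -> None:
            # Called e.g. when the user saves the mapping, extending the hour.
            if self._locks.get(job_id, ("", None))[0] == user:
                self._locks[job_id] = (user, datetime.now(timezone.utc) + LOCK_DURATION)

        def release(self, job_id: int, user: str) -> None:
            # Called when the user navigates away from the locking actions.
            if self._locks.get(job_id, ("", None))[0] == user:
                del self._locks[job_id]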

Cloning

The cloning functionality facilitates the creation of a data check-in pipeline by copying almost the whole configuration of an existing pipeline to a new, cloned one.

Cloning behaviour for each available step:

Harvester: The harvesting option selected in the original pipeline is retained in the cloned pipeline.

  • File Upload: The sample file is copied from the original pipeline and remains the same. The file that contains the full data needs to be uploaded again.
  • Platform's API: The data sample remains the same as in the original pipeline, and a new upload Method & URL is provided.
  • Data Provider's API:

    Non-sensitive data: The configuration is copied from the original pipeline, including Response Format, Authentication Details, Method, URL & Body, Pagination, Requested Parameters, Extra Headers, and Retrieval Setting. The user can modify all these parameters.

    Sensitive data: The sensitive parts of the configuration are copied only for the owner of the pipeline. Other users do not have access to that data and need to reconfigure the corresponding fields.

Mapping (Transformer): The configuration remains the same as in the original pipeline. The user can add or remove fields in the cloned pipeline.

Cleaner: The configuration remains the same as in the original pipeline. The user can add or remove cleaning rules in the cloned pipeline.

Loader: Needs reconfiguration.
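
Put together, the cloning rules above could be sketched as follows; this is a hypothetical illustration of the described behaviour, not the library's actual code:

    import copy

    def is_owner(job: dict, user: str) -> bool:
        # Placeholder ownership check (assumption).
        return job.get("owner") == user

    def clone_job(original: dict, requesting_user: str) -> dict:
        """Clone a DCJ configuration following the per-step rules described above."""
        clone = copy.deepcopy(original)
        clone["name"] = f"{original['name']} (clone)"

        for step in clone["steps"]:
            if step["type"] == "harvester" and not is_owner(original, requesting_user):
                # Sensitive values are only carried over for the owner of the pipeline.
                for key in step.get("sensitive_keys", []):
                    step["config"].pop(key, None)
            elif step["type"] == "loader":
                # The loader always needs to be reconfigured in the clone.
                step["config"] = {}
            # Mapping and cleaner configurations are copied as-is
            # and can be edited in the cloned pipeline.
        return clone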