Data Check-in
This library provides functionality for the configuration of data collection workflows: the DataCheckinJobs (or DCJs for short). Each workflow is a sequence of steps (DataCheckinJobSteps), each of which is responsible for a specific task. The user can define step types (DataCheckinStepTypes), which might be optional or required, depend on other steps, have a specific ordering, etc.
The minimum installation should include at least two steps: harvester and loader, both required and with orders of 0 and 100 respectively. This is because you need to harvest data from somewhere (e.g. an API) and load it somewhere else (e.g. a database or a file on your disk).
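A minimal sketch of what such a step-type setup could look like; the dataclass and its field names are illustrative, not the library's actual models:

```python
from dataclasses import dataclass

@dataclass
class DataCheckinStepType:
    """Illustrative model of a step type; field names are assumptions."""
    name: str
    order: int       # execution order within the workflow
    required: bool   # whether every DCJ must configure this step

# The minimum installation: harvest first (order 0), load last (order 100).
MINIMUM_STEP_TYPES = [
    DataCheckinStepType(name="harvester", order=0, required=True),
    DataCheckinStepType(name="loader", order=100, required=True),
]
```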
When creating a new data check-in job, the user provides a name and an optional description, and selects which of the available steps they need to configure for the correct processing of their data. Currently, adding or removing a step is not allowed.
If any part of the configuration is marked as sensitive, it is removed from the configuration, stored in Vault, and a reference to the Vault entry takes its place.
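As an illustration, sensitive values could be split out of the configuration like this; the extract_sensitive helper, the store_secret callable and the secret-path format are assumptions for the sketch, not the library's actual Vault integration:

```python
from typing import Callable

def extract_sensitive(config: dict, sensitive_keys: set,
                      store_secret: Callable[[str, str], str]) -> dict:
    """Replace sensitive values with references to where they were stored.

    store_secret is a hypothetical callable that writes a value to Vault and
    returns a reference (e.g. a secret path) to put into the configuration.
    """
    cleaned = {}
    for key, value in config.items():
        if key in sensitive_keys:
            cleaned[key] = {"vault_ref": store_secret(key, value)}
        else:
            cleaned[key] = value
    return cleaned

# Example with a placeholder path format: the API key ends up in Vault,
# and only a reference to it remains in the stored configuration.
config = {"url": "https://api.example.com", "api_key": "s3cr3t"}
print(extract_sensitive(config, {"api_key"}, lambda key, value: f"secret/dcj/<job-id>/{key}"))
# {'url': 'https://api.example.com', 'api_key': {'vault_ref': 'secret/dcj/<job-id>/api_key'}}
```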
Available Steps
This module only provides the means for configuring these tasks. The execution is managed by the orchestrator library, and you can read more on execution in the relevant section.
- Harvester is used for retrieving data from various sources, converting them to JSON and storing them in temporary storage. This might be done by:
  - accepting user-uploaded files,
  - retrieving data from external APIs (either open or user-defined),
  - providing a DCJ-specific API, where the user can POST their data. The user can upload either only text data (in JSON format) or text and binary data by sending a multipart POST request with two key-value pairs: one for the binary file (with key: _uploaded_file) and one for the actual data in JSON format (with key: data); see the upload sketch after this list,
  - providing the user with a Kafka broker to publish their data,
  - subscribing to a user-provided Kafka broker.
  This step should always be the first to be executed, thus its order is 0.
- Mapping (Transformer) is responsible for merging the files that the harvester produced, applying different transformations, renaming fields to match the common data model and storing the output in temporary storage (a field-renaming sketch follows this list).
- Cleaner is responsible for detecting and correcting (or removing) corrupt or inaccurate records from a dataset. It depends on Mapping, as it uses information from the common data model.
- Anonymizer PENDING
- Loader is responsible for storing (or updating) the data in the user-selected database and/or storage.
  This step should always be the last to be executed, thus its order is 100.
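For the DCJ-specific API option, a multipart upload could look like the following sketch; the endpoint URL is a placeholder, and only the _uploaded_file and data keys come from the description above:

```python
import json

import requests

# Placeholder endpoint; the actual DCJ-specific URL is provided per job.
url = "https://example.com/api/dcj/<job-id>/upload"

# The JSON payload to check in, sent alongside an optional binary file.
payload = {"sensor_id": 7, "temperature": 21.4}

with open("measurements.bin", "rb") as binary_file:
    response = requests.post(
        url,
        files={"_uploaded_file": binary_file},  # the binary part
        data={"data": json.dumps(payload)},     # the text (JSON) part
    )

response.raise_for_status()
```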
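The Mapping step's field renaming essentially translates harvested field names into the common data model; a rough illustration (the mapping dictionary and record layout below are invented for the example, as the real transformations are configured through the step itself):

```python
# Hypothetical mapping from harvested field names to the common data model.
FIELD_MAP = {"temp": "temperature", "ts": "observation_timestamp"}

def rename_fields(record: dict) -> dict:
    """Rename keys according to FIELD_MAP, leaving unknown keys untouched."""
    return {FIELD_MAP.get(key, key): value for key, value in record.items()}

print(rename_fields({"temp": 21.4, "ts": "2024-01-01T00:00:00Z", "unit": "C"}))
# {'temperature': 21.4, 'observation_timestamp': '2024-01-01T00:00:00Z', 'unit': 'C'}
```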
Events
As the whole execution of the data check-in process is asynchronous, a set of events is defined and raised when needed:
Job Events
- DCJ_CREATED is raised when a data check-in job is created,
- DCJ_DELETED is raised when a data check-in job is deleted.
Step Events
- DCJ_CONFIGURED is raised when a step's configuration is finalised,
- DCJ_STEP_UPDATED is raised when a step's configuration is updated,
- DCJ_STEP_RESCHEDULE is raised by the scheduler to notify the orchestrator that a deferred step should be moved to the execution queue,
- DCJ_STEP_DELETED is raised when a step is deleted.
Execution related events
- DCJ_STEP_STARTED is raised by the orchestrator when a step's execution has started,
- DCJ_STEP_PROGRESS is raised by the orchestrator when a step's execution has made progress (i.e. a file was processed/generated),
- DCJ_STEP_CANCELLED is raised by the orchestrator when a step's execution was cancelled by the user,
- DCJ_STEP_FAILED is raised by the orchestrator when the execution of a step has failed,
- DCJ_STEP_COMPLETED is raised by the orchestrator when the execution of a step was completed successfully.
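Collected in one place, the event names above could be shared between producers and consumers as a single enumeration; a minimal sketch (the string values are illustrative, not the library's wire format):

```python
from enum import Enum

class DataCheckinEvent(str, Enum):
    """Event names from this section; the string values are illustrative."""
    # Job events
    DCJ_CREATED = "DCJ_CREATED"
    DCJ_DELETED = "DCJ_DELETED"
    # Step events
    DCJ_CONFIGURED = "DCJ_CONFIGURED"
    DCJ_STEP_UPDATED = "DCJ_STEP_UPDATED"
    DCJ_STEP_RESCHEDULE = "DCJ_STEP_RESCHEDULE"
    DCJ_STEP_DELETED = "DCJ_STEP_DELETED"
    # Execution-related events (raised by the orchestrator)
    DCJ_STEP_STARTED = "DCJ_STEP_STARTED"
    DCJ_STEP_PROGRESS = "DCJ_STEP_PROGRESS"
    DCJ_STEP_CANCELLED = "DCJ_STEP_CANCELLED"
    DCJ_STEP_FAILED = "DCJ_STEP_FAILED"
    DCJ_STEP_COMPLETED = "DCJ_STEP_COMPLETED"
```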
Execution Statuses
Based on the execution state of a step, its status can change to:
- Configuration: (default) waiting for the user to finalise the configuration,
- Idle: waiting for the orchestrator to forward it to the execution engine,
- Queued: forwarded to the execution engine, queued for execution,
- Running: the step is currently being executed,
- Failed: the execution of the step failed,
- Cancelled: the user cancelled the execution,
- Completed: the execution completed successfully.
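The execution-related events map naturally onto the last four statuses; the dictionary below is an inference from the two sections above, not a documented API:

```python
# Inferred correspondence between execution events and step statuses.
EVENT_TO_STATUS = {
    "DCJ_STEP_STARTED": "Running",
    "DCJ_STEP_CANCELLED": "Cancelled",
    "DCJ_STEP_FAILED": "Failed",
    "DCJ_STEP_COMPLETED": "Completed",
}

def next_status(event_name: str, current_status: str) -> str:
    """Return the status implied by an execution event, else keep the current one."""
    return EVENT_TO_STATUS.get(event_name, current_status)
```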
Locking
The following Data Check-in actions require locking:
- Edit Job (Configuration)
- Harvester
- Mapping (Transformer)
- Cleaner
- Anonymizer (PENDING)
- Loader
During the above actions the DCJ is locked, which means that no other user can access these actions. If the action is already locked by another user, an informative message appears.
A DCJ is released when:
- the pre-defined expiration period elapses (the DCJ is released automatically), or
- the user that locked the DCJ navigates to any component that is unrelated to the above actions (e.g. Home).
Notes:
- The locking period is 1 hour. For mapping, the 1-hour period is renewed every time the user saves the process.
- When a DCJ is locked, it cannot be cloned by another user.
- Execution history remains visible even if the DCJ is locked.
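As an illustration of how such a lock could behave, here is a minimal sketch; the DCJLock class and its methods are assumptions based on the notes above, not the library's actual API:

```python
from datetime import datetime, timedelta
from typing import Optional

LOCK_TTL = timedelta(hours=1)  # the locking period mentioned in the notes

class DCJLock:
    """Hypothetical in-memory lock for a data check-in job (DCJ)."""

    def __init__(self) -> None:
        self.owner: Optional[str] = None
        self.expires_at: Optional[datetime] = None

    def acquire(self, user: str, now: datetime) -> bool:
        # The lock can be taken if it is free or its expiration period has passed.
        if self.owner is None or (self.expires_at is not None and now >= self.expires_at):
            self.owner, self.expires_at = user, now + LOCK_TTL
            return True
        return self.owner == user  # the locking user keeps their own lock

    def renew(self, user: str, now: datetime) -> None:
        # e.g. called when the mapping configuration is saved, per the notes above.
        if self.owner == user:
            self.expires_at = now + LOCK_TTL

    def release(self, user: str) -> None:
        # e.g. called when the user navigates away from the locked actions.
        if self.owner == user:
            self.owner, self.expires_at = None, None
```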
Cloning
Cloning functionality facilitates the creation of a data check-in pipeline by copying almost the whole configuration of an existing pipeline to a new, cloned one.
Cloning behaviour for every available step:
- Harvester: The selected harvesting option in the original pipeline remains in the cloned pipeline.
  - File Upload: The sample file is copied from the original pipeline and remains the same. The file that contains the whole data needs to be uploaded.
  - Platform's API: The data sample remains the same as in the original pipeline, and a new uploading Method & URL is provided.
  - Data Provider's API:
    - Non-sensitive data: The configuration is copied from the original pipeline, including Response Format, Authentication Details, Method, URL & Body, Pagination, Requested Parameters, Extra Headers, and Retrieval Setting. The user can modify all these parameters.
    - Sensitive data: The configuration of the sensitive data is copied only for the owner of the pipeline. The rest of the users do not have access to that data and need to reconfigure the corresponding fields.
- Mapping (Transformer): The configuration remains the same as in the original pipeline. The user can add or remove fields in the cloned pipeline.
- Cleaner: The configuration remains the same as in the original pipeline. The user can add or remove cleaning rules in the cloned pipeline.
- Loader: Needs reconfiguration.
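A sketch of how cloning could treat sensitive harvester configuration and the loader; the clone_job function and the field names below are hypothetical, chosen only to illustrate the rules above:

```python
import copy

def clone_job(job: dict, requesting_user: str) -> dict:
    """Hypothetical cloning helper following the rules described above."""
    clone = copy.deepcopy(job)
    clone["name"] = f"{job['name']} (clone)"

    for step in clone["steps"]:
        if step["type"] == "harvester" and requesting_user != job["owner"]:
            # Sensitive harvester configuration is copied only for the owner;
            # other users must reconfigure those fields.
            step["config"] = {
                key: value
                for key, value in step["config"].items()
                if key not in step.get("sensitive_keys", [])
            }
        elif step["type"] == "loader":
            # The loader always needs reconfiguration in the cloned pipeline.
            step["config"] = {}

    return clone
```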