Cleaner

Introduction

The Cleaner service detects and corrects (or removes) corrupt or inaccurate records in an asset that has gone through the Transformer service. It identifies incomplete, incorrect, inaccurate, or irrelevant parts of the data and replaces or deletes them by applying user-defined cleaning rules and constraints. To run, it needs a JSON configuration file as input. As output, it stores the cleaned file in MinIO and reports the execution state as feedback to the Backend service through RabbitMQ. The Cleaner service is implemented in Python.

Requirements

A list of services that need to be deployed (running), in order for the Cleaner to be fully functional:

Functionality

The functionality of the Cleaner service is straightforward. First, it retrieves the data that the Transformer has stored in MinIO as a single JSON file. It then constructs a pandas DataFrame from the data and applies the cleaning rules and constraints that the user has defined in the UI. The cleaned file is stored in MinIO (/cleaner path). The cleaning stats are calculated and sent along with the feedback message through RabbitMQ. If a cleaning constraint fails, the step fails and the failure details are sent instead.
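
The core of this flow can be sketched as follows. This is a minimal illustration and not the service's actual code: the sample records, the clean function, the range rule, and the field names in the feedback message are all assumptions, and the MinIO download/upload and the RabbitMQ publish are left out so that the example runs on its own.

```python
import json

import pandas as pd

# Hypothetical records, standing in for the JSON file the Transformer stored in MinIO.
raw = json.dumps([
    {"id": 1, "temperature": 21.5},
    {"id": 2, "temperature": None},
    {"id": 3, "temperature": 480.0},
    {"id": 4, "temperature": 19.0},
])


def clean(records_json, column, min_value, max_value):
    """Drop rows whose `column` is missing or outside [min_value, max_value]."""
    df = pd.DataFrame(json.loads(records_json))
    total = len(df)
    # Series.between is inclusive on both ends; missing values evaluate to False.
    keep = df[column].between(min_value, max_value)
    cleaned = df[keep].reset_index(drop=True)
    # Hypothetical stats layout; the real feedback format is not described here.
    stats = {"total_rows": total, "dropped_rows": int(total - len(cleaned))}
    return cleaned, stats


cleaned, stats = clean(raw, "temperature", -50.0, 60.0)

# In the service, the cleaned file would be written to MinIO and this
# message would be published through RabbitMQ.
feedback = json.dumps({"status": "completed", "stats": stats})
```

Here the row with the missing temperature and the row with the out-of-range value are dropped, and the stats record how many rows were removed.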

Cleaner Flow Chart

Configuration File

The Cleaner requires a JSON configuration file. The requirements of the configuration file are described in the cleaner_schema.py file in the Cleaner project.
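
To give a feel for what loading and checking such a file might look like, here is a small sketch. Every key in it (input_path, output_path, rules and the fields inside each rule) is hypothetical, as are the paths; the authoritative structure is the one defined in cleaner_schema.py, which is not reproduced here.

```python
import json

# Hypothetical configuration; the real field names are defined in cleaner_schema.py.
config_text = """
{
  "input_path": "transformer/asset.json",
  "output_path": "cleaner/asset.json",
  "rules": [
    {"column": "temperature", "type": "range", "min": -50, "max": 60, "action": "drop"}
  ]
}
"""

# Assumed set of mandatory top-level keys, for illustration only.
REQUIRED_KEYS = {"input_path", "output_path", "rules"}


def load_config(text):
    """Parse the configuration and fail early if a required key is missing."""
    config = json.loads(text)
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise ValueError(f"missing configuration keys: {sorted(missing)}")
    if not isinstance(config["rules"], list):
        raise ValueError("'rules' must be a list")
    return config


config = load_config(config_text)
```

Rejecting an incomplete configuration before any data is read keeps the failure cheap and makes the error easy to report back.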

Available Cleaning Rules & Constraints

The available cleaning rules and constraints are described in the constraints.py and outlier_rules.py files in the Cleaner service.
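
The contents of those files are not listed here. As a purely hypothetical illustration of the two kinds of checks, the sketch below shows an interquartile-range outlier rule and a not-null constraint written with pandas. The function names, the 1.5 multiplier, and the sample values are assumptions and are not taken from constraints.py or outlier_rules.py.

```python
import pandas as pd


def iqr_outlier_mask(series, k=1.5):
    """Return True for values more than k * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


def check_not_null(series):
    """Hypothetical constraint: raise if the column contains missing values."""
    missing = int(series.isna().sum())
    if missing:
        raise ValueError(f"{missing} null value(s) in column '{series.name}'")


# Hypothetical sample column.
values = pd.Series([10.0, 11.0, 12.0, 13.0, 100.0], name="reading")

# An outlier rule marks values that can then be dropped or replaced.
mask = iqr_outlier_mask(values)
without_outliers = values[~mask]

# A constraint either passes silently or raises, which would fail the step.
check_not_null(without_outliers)
```

The difference in behaviour is the point of the sketch: a rule changes the data, while a failing constraint stops the step so that the failure details can be reported.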