Anonymiser
Introduction
The Anonymisation service is responsible for anonymising the data produced by the Transformer or Cleaner, applying the user-defined anonymisation techniques and storing the new file back to MinIO. To run, it needs a JSON configuration file as input. As output, it produces the anonymised file, stores it in MinIO and reports the execution state to the Backend service through RabbitMQ. The Anonymisation service is implemented in Python.
Requirements
A list of services that need to be deployed (running), in order for the Anonymiser to be fully functional:
Anonymisation background
We have implemented a k-anonymity algorithm to achieve anonymisation. To anonymise a dataset, we must classify its fields into four main categories. These categories are presented below, sorted by importance.
-
Identifiers
These fields can uniquely identify a person in the dataset on their own, so they should be removed from, or hidden in, the dataset completely.
-
Quasi-identifiers
These fields can identify a person in the dataset when combined with one or more other quasi-identifier fields. We have to generate a hierarchy for each quasi-identifier, so that the algorithm can reach the desired level of generalization during execution.
These hierarchies differ, based on the data type of each field and on the configuration the user defines in the anonymisation step.
-
Insensitive
These fields do not compromise any person's identity under any circumstances, so there is no need to alter them at all.
-
Sensitive
These fields will be used and explained when the l-diversity extension is implemented.
For the algorithm to execute successfully, the user must define the k-factor (minimum 2) and the data loss limit, i.e. how much information the anonymised dataset may lose in comparison to the initial dataset.
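The k-anonymity property described above can be illustrated with a minimal sketch. It is not the service's actual code: the function name `is_k_anonymous`, the sample rows and the field names are assumptions made for illustration, and plain dictionaries stand in for the dataframe to keep the example dependency-free. A dataset is k-anonymous when every combination of quasi-identifier values appears in at least k rows.

```python
from collections import Counter


def is_k_anonymous(rows, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    occurs in at least k rows (hypothetical helper)."""
    if k < 2:
        raise ValueError("the k-factor must be at least 2")
    # Count how many rows share each combination of quasi-identifier values.
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in counts.values())


# Illustrative data only: "age" and "zip" are quasi-identifiers,
# "diagnosis" is left out of the check.
rows = [
    {"age": "20-30", "zip": "115**", "diagnosis": "flu"},
    {"age": "20-30", "zip": "115**", "diagnosis": "cold"},
    {"age": "30-40", "zip": "116**", "diagnosis": "flu"},
]

all_rows_ok = is_k_anonymous(rows, ["age", "zip"], k=2)
first_two_ok = is_k_anonymous(rows[:2], ["age", "zip"], k=2)
```

Here the third row forms a group of size one, so the full dataset is not 2-anonymous, while the first two rows on their own are.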
Functionality
The Anonymiser service prompts the user to define the anonymisation "rules" for the fields and for the algorithm, as described in Anonymisation background.
It retrieves the data as a single JSON file that the Transformer or Cleaner has stored in MinIO and constructs a Pandas DataFrame. Based on this dataframe and the user configuration defined in the UI, it generates all the necessary hierarchies for each field.
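How a hierarchy might look for a single value can be sketched as follows. This is an assumption-laden illustration, not the service's implementation: the helper names `range_hierarchy` and `masking_hierarchy`, the interval widths and the final "*" level are all hypothetical. Each hierarchy is a list whose first entry is the original value and whose later entries are progressively more general.

```python
def range_hierarchy(value, widths=(10, 20, 40)):
    """Hypothetical hierarchy for a non-negative integer: the raw value,
    then intervals of growing width, then full suppression."""
    levels = [str(value)]
    for width in widths:
        low = (value // width) * width
        levels.append(f"{low}-{low + width}")
    levels.append("*")
    return levels


def masking_hierarchy(value, mask_char="*"):
    """Hypothetical hierarchy for a string: mask one more trailing
    character at each level until the whole value is obscured."""
    text = str(value)
    return [text[: len(text) - i] + mask_char * i for i in range(len(text) + 1)]


age_levels = range_hierarchy(27)
zip_levels = masking_hierarchy("11527")
```

With these assumed defaults, `age_levels` is `["27", "20-30", "20-40", "0-40", "*"]` and `zip_levels` runs from `"11527"` through `"115**"` to `"*****"`.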
Based on specific characteristics, the algorithm chooses the next field to generalize and, after each step, checks whether the dataframe is k-anonymous. Once the dataframe is k-anonymous, it checks the data loss limit (defined by the user), stores the anonymised file in MinIO (/anonymisation path) and responds accordingly (success or fail response).
If the maximum level of generalization is reached in all fields and the dataframe is still not k-anonymous, it responds with the appropriate fail message.
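The loop described above can be sketched roughly as below. Several parts are assumptions, since the text does not spell them out: the field-selection heuristic (here, the field with the most distinct values), the data loss metric (here, the share of available generalization steps that were used), the response strings, and all function names. Plain dictionaries again stand in for the dataframe.

```python
from collections import Counter


def is_k_anonymous(rows, fields, k):
    counts = Counter(tuple(row[f] for f in fields) for row in rows)
    return all(count >= k for count in counts.values())


def generalise(rows, hierarchies, levels):
    """Replace each quasi-identifier value with its entry at the current level;
    fields without a hierarchy are kept unchanged."""
    return [
        {f: (hierarchies[f][v][levels[f]] if f in hierarchies else v)
         for f, v in row.items()}
        for row in rows
    ]


def anonymise(rows, hierarchies, k, max_loss):
    """hierarchies maps field -> {raw value -> list of levels} (assumed layout)."""
    fields = list(hierarchies)
    depth = {f: len(next(iter(hierarchies[f].values()))) - 1 for f in fields}
    levels = {f: 0 for f in fields}
    while True:
        current = generalise(rows, hierarchies, levels)
        if is_k_anonymous(current, fields, k):
            # Assumed loss metric: fraction of generalization steps used.
            loss = sum(levels.values()) / (sum(depth.values()) or 1)
            if loss <= max_loss:
                return "success", current
            return "fail: data loss limit exceeded", None
        candidates = [f for f in fields if levels[f] < depth[f]]
        if not candidates:
            return "fail: maximum generalization reached", None
        # Assumed heuristic: generalize the field with the most distinct values.
        target = max(candidates, key=lambda f: len({row[f] for row in current}))
        levels[target] += 1


rows = [{"age": a, "city": "Athens"} for a in (23, 27, 34, 38)]
hierarchies = {"age": {23: ["23", "20-30", "*"], 27: ["27", "20-30", "*"],
                       34: ["34", "30-40", "*"], 38: ["38", "30-40", "*"]}}
status, result = anonymise(rows, hierarchies, k=2, max_loss=0.5)
```

In this toy run one generalization step on "age" is enough to reach 2-anonymity, so half of the available steps are used and the call succeeds with a loss limit of 0.5.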
Column name changes
In some cases, the anonymisation algorithm needs to rename some input columns by adding a suffix (keyword) that indicates the type of transformation that has occurred on each column. The available keywords are:
- :range
The values have changed to arithmetic intervals (e.g. "10-20").
- :group
The values have changed to group names (e.g. "adult").
- :masking
Some letters have been obscured by a masking character (e.g. "*").
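The renaming can be sketched as follows, assuming the keyword is appended directly to the original column name (for example "age" becomes "age:range"). The function name `rename_columns` and the `applied` mapping are hypothetical and only illustrate the convention.

```python
ALLOWED_KEYWORDS = {":range", ":group", ":masking"}


def rename_columns(columns, applied):
    """Append the transformation keyword to each transformed column.

    columns: original column names, in order.
    applied: mapping of column name -> keyword for transformed columns only.
    """
    renamed = []
    for name in columns:
        keyword = applied.get(name)
        if keyword is None:
            renamed.append(name)  # untouched column keeps its name
        elif keyword in ALLOWED_KEYWORDS:
            renamed.append(name + keyword)
        else:
            raise ValueError(f"unknown keyword {keyword!r} for column {name!r}")
    return renamed


new_names = rename_columns(
    ["age", "zip", "city"],
    {"age": ":range", "zip": ":masking"},
)
```

Consumers of the anonymised file can then tell from the column name alone which kind of transformation was applied.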
Configuration File
The Anonymiser requires a JSON configuration file. The requirements for this file are described in the schema.py file in the Anonymiser project.