Architecture
The reusable components follow a modular architecture, so that different libraries and services can be reused depending on each project’s needs. As depicted in the following figure, the architecture of S5-Collect (RC#1) and S5-Search (RC#2) leverages state-of-the-art technologies across its four conceptual layers:
- The Front-end Presentation Layer, built on VueJS and TailwindCSS, consists of user interfaces that are customized per project. The reusable components also have their own user interface, which is available in the demo front-end and is intended for internal use.
- The Backend Business Logic Layer, built on the NestJS (NodeJS) web framework and written in TypeScript, includes a set of libraries that wrap the functionalities provided by the reusable components:
  - the Common library, with all common classes;
  - the Core library, containing core functionality reused in all libraries;
  - the Data Check-in library, which manages the data collection at design time for the configuration of a data check-in job;
  - the Orchestrator library, which orchestrates the execution of the data check-in jobs in the Execution Business Logic Layer, according to the schedule defined in their configuration;
  - the Asset library, which handles the data assets into which the data collected by the data check-in jobs are conceptually organized;
  - the Access Control library, which handles attribute-based access control policies for the stored data assets;
  - the Kafka library, which handles streaming data collection through a Kafka PubSub mechanism;
  - the MinIO library, which is responsible for the object storage of the different data files;
  - the Vault library, which securely manages sensitive credentials;
  - the Data Model library, which allows for lifecycle management of the underlying data models per domain;
  - the Notifications library, which sends real-time notifications to the users.
  These libraries are complemented by two services built on the Flask micro-web framework and written in Python: the Predictor Engine (which predicts the concepts and fields of the applicable data model to which the data should be mapped) and the Search Engine (which manages the search queries and handles the retrieval of the respective “data slices” depending on the user’s preferences).
- The Execution Business Logic Layer includes the containerized services that the Execution Director runs on demand (upon triggering by the Orchestrator) in the execution cluster (in Kubernetes). The data check-in job execution services currently include the Harvester, the Mapper-Transformer, the Cleaner, and the Loader, and are built in a Python mono-repo.
- The Data Access Layer is responsible for persisting all data related to the functionality and operation of the reusable components. It includes a relational database (PostgreSQL) for the data check-in jobs and operational data (e.g. users, organizations), a NoSQL database (MongoDB) for the data assets, a search and indexing engine (Elasticsearch) for the metadata, a secure credentials storage service (Vault), and an object storage (MinIO) for files (i.e. temporary storage for the intermediate steps of the data check-in job execution, as well as permanent storage of other files).
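The flow of a data check-in job through the execution services can be pictured with a minimal sketch. Only the four service names and their order come from the description above. The record structure, the field mapping, the cleaning rule, and the in-memory store are hypothetical stand-ins chosen for illustration, and the real services run as separate containers rather than as in-process functions.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, Optional[object]]
Stage = Callable[[List[Record]], List[Record]]


def harvester(_records: List[Record]) -> List[Record]:
    # Hypothetical source data; a real Harvester would read from a file,
    # an API, or a stream defined in the job configuration.
    return [{"Temp": " 21.5 "}, {"Temp": None}, {"Temp": "19.0"}]


def mapper_transformer(records: List[Record]) -> List[Record]:
    # Assumed mapping from a source field name to a data-model field name.
    field_map = {"Temp": "temperature"}
    return [{field_map.get(k, k): v for k, v in r.items()} for r in records]


def cleaner(records: List[Record]) -> List[Record]:
    # Assumed cleaning rule: drop records with a missing value and
    # convert the remaining values to floats.
    cleaned: List[Record] = []
    for record in records:
        value = record.get("temperature")
        if value is None:
            continue
        cleaned.append({"temperature": float(str(value).strip())})
    return cleaned


def make_loader(store: List[Record]) -> Stage:
    # The list stands in for the data asset storage.
    def loader(records: List[Record]) -> List[Record]:
        store.extend(records)
        return records

    return loader


def run_job(stages: List[Stage]) -> List[Record]:
    # Run the stages in order, feeding each stage the previous output.
    records: List[Record] = []
    for stage in stages:
        records = stage(records)
    return records


if __name__ == "__main__":
    asset_store: List[Record] = []
    run_job([harvester, mapper_transformer, cleaner, make_loader(asset_store)])
    print(asset_store)
```

Modeling each service as a function from records to records keeps the stages independent of one another, which mirrors the idea that each one is a separately deployable service.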
Features and functionalities are enabled according to each project’s needs through different activation strategies, and are controlled by a feature toggling mechanism (built on Unleash). The front-end layer is fully customized per project, while the back-end layers remain essentially the same across project customizations.
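To illustrate the idea of per-project feature activation, the following is a minimal in-memory sketch. It does not use the Unleash SDK or reproduce its strategy model. The toggle names, the project identifiers, and the single "restrict to listed projects" strategy are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class Toggle:
    # Master switch for the feature.
    enabled: bool = False
    # Hypothetical activation strategy: if non-empty, the feature is
    # active only for the listed projects; if empty, for all projects.
    projects: FrozenSet[str] = frozenset()

    def is_enabled(self, project: str) -> bool:
        if not self.enabled:
            return False
        return not self.projects or project in self.projects


# Hypothetical toggle definitions; names are for illustration only.
TOGGLES: Dict[str, Toggle] = {
    "streaming-collection": Toggle(enabled=True, projects=frozenset({"project-a"})),
    "notifications": Toggle(enabled=True),
    "predictor": Toggle(enabled=False),
}


def is_enabled(feature: str, project: str) -> bool:
    # Unknown features are treated as disabled.
    toggle = TOGGLES.get(feature)
    return toggle is not None and toggle.is_enabled(project)


if __name__ == "__main__":
    for name in TOGGLES:
        print(name, is_enabled(name, "project-a"), is_enabled(name, "project-b"))
```

Treating unknown features as disabled is a conservative default in this sketch: a back-end shared across projects then exposes a capability only where it has been switched on explicitly.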
Versioning is applied at the service and library levels to ensure consistency across the different projects. Docker (for the containers of the different services and of the back-end and front-end apps), Docker Hub (as the registry of the different Docker images), and Sentry (for diagnosing issues in the code and optimizing its performance) are also used to facilitate the deployment and operational monitoring of the RCs.