Data Ingestion Process
The ingestion process applies progressively stricter constraints, starting from the uploaded CSV files and ending with a pseudonymised schema for research purposes.
Data submissions from individual sites will move through successive levels of quality checks (referred to as “Bronze” and “Silver”) before being merged into a final “Gold” schema. The “Gold” schema will exist both as a fully identifiable database with very limited access and as a parallel pseudonymised version.
Currently, the ingestion process only supports full uploads.
- CSV to Bronze: a two-step process that generates a local view of the uploaded CSV files, so that each site can inspect it and confirm that its data has been loaded into the database correctly:
  - First, unconstrained text data from the CSV files is copied into an initial schema, “Raw”.
  - Then, basic type and NOT NULL constraints are applied, resulting in a database where we can be confident that we have the information required for the following steps.
- Bronze to Silver: at this stage, the existing “Silver” data is truncated in favour of the newly ingested data, and we ensure that foreign key constraints are respected (e.g. all “concept_id” columns must match an existing value in the CONCEPT table). “Silver” represents the latest local version of a site’s data in an OMOP-compliant format.
- Silver to Gold: preparation of the final clean, identifiable, multi-site data from each site’s “Silver” schema. This involves:
  - Person-matching algorithms to merge data coming from different hospitals.
  - Data clean-up: removing children and patients who have opted out, and applying high-level constraints (e.g. Information Governance restrictions).
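The first two stages can be illustrated with a small, self-contained sketch. It is not the project's actual implementation: it uses an in-memory SQLite database, and the table names (raw_person, bronze_person, silver_person), the sample rows and the choice to reject individual rows rather than fail the whole upload are all assumptions made for illustration. Only the column names are borrowed from the OMOP PERSON table.

```python
import csv
import io
import sqlite3

# Hypothetical extract: row 3 uses an unknown concept, row 4 lacks a year of birth.
CSV_TEXT = """person_id,gender_concept_id,year_of_birth
1,8507,1980
2,8532,1975
3,9999,1990
4,8507,
"""


def to_int(text):
    """Cast a raw text field to int; an empty field becomes NULL (None)."""
    return int(text) if text != "" else None


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves foreign keys off by default
conn.executescript("""
    -- Raw: every column is unconstrained text, exactly as read from the CSV.
    CREATE TABLE raw_person (person_id TEXT, gender_concept_id TEXT, year_of_birth TEXT);
    -- Bronze: basic types and NOT NULL constraints.
    CREATE TABLE bronze_person (
        person_id INTEGER NOT NULL,
        gender_concept_id INTEGER NOT NULL,
        year_of_birth INTEGER NOT NULL
    );
    -- Silver: concept_id columns must reference the CONCEPT table.
    CREATE TABLE concept (concept_id INTEGER PRIMARY KEY);
    CREATE TABLE silver_person (
        person_id INTEGER PRIMARY KEY,
        gender_concept_id INTEGER NOT NULL REFERENCES concept (concept_id),
        year_of_birth INTEGER NOT NULL
    );
    INSERT INTO concept (concept_id) VALUES (8507), (8532);
""")

# Step 1 (CSV to Raw): copy the text as-is.
conn.executemany(
    "INSERT INTO raw_person VALUES (:person_id, :gender_concept_id, :year_of_birth)",
    list(csv.DictReader(io.StringIO(CSV_TEXT))),
)

# Step 2 (Raw to Bronze): cast to integers and let NOT NULL reject missing values.
rejected = []
for raw_row in conn.execute("SELECT * FROM raw_person").fetchall():
    try:
        conn.execute(
            "INSERT INTO bronze_person VALUES (?, ?, ?)",
            tuple(to_int(field) for field in raw_row),
        )
    except (ValueError, sqlite3.IntegrityError) as exc:
        rejected.append(("bronze", raw_row[0], str(exc)))

# Step 3 (Bronze to Silver): truncate Silver, then reload under foreign key checks.
conn.execute("DELETE FROM silver_person")
for bronze_row in conn.execute("SELECT * FROM bronze_person").fetchall():
    try:
        conn.execute("INSERT INTO silver_person VALUES (?, ?, ?)", bronze_row)
    except sqlite3.IntegrityError as exc:
        rejected.append(("silver", bronze_row[0], str(exc)))

silver_rows = conn.execute("SELECT * FROM silver_person ORDER BY person_id").fetchall()
print(silver_rows)                                        # [(1, 8507, 1980), (2, 8532, 1975)]
print([(stage, str(pid)) for stage, pid, _ in rejected])  # [('bronze', '4'), ('silver', '3')]
```

In this sketch, person 4 is stopped at Bronze by the NOT NULL constraint, and person 3 is stopped at Silver because its gender_concept_id has no match in the concept table.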
In addition, a pseudonymised schema will be automatically populated from “Gold”; this should be the default data source for most users working inside the UCL Data Safe Haven (DSH).
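This page does not describe how the pseudonymised schema is derived. As a purely illustrative sketch of one common approach (not necessarily the one used here), direct identifiers can be dropped and the person identifier replaced with a keyed hash. The key, the function names and the nhs_number and name columns below are hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would be stored outside the research schema.
SECRET_KEY = b"example-key-for-illustration-only"

# Hypothetical direct identifiers to drop from each row.
DIRECT_IDENTIFIERS = {"nhs_number", "name"}


def pseudonymise_id(person_id: int) -> str:
    """Map an identifier to a stable pseudonym using HMAC-SHA256."""
    message = str(person_id).encode("utf-8")
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def pseudonymise_row(gold_row: dict) -> dict:
    """Drop direct identifiers and replace person_id with its pseudonym."""
    row = {k: v for k, v in gold_row.items() if k not in DIRECT_IDENTIFIERS}
    row["person_id"] = pseudonymise_id(gold_row["person_id"])
    return row


example = pseudonymise_row(
    {"person_id": 42, "nhs_number": "0000000000", "name": "Example", "year_of_birth": 1980}
)
print(sorted(example))  # ['person_id', 'year_of_birth']
```

Because the hash is keyed and deterministic, the same person always receives the same pseudonym across tables, while the original identifier cannot be recomputed without the key.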
Persistence of deleted records
Under this process, when records are deleted locally at a site and a new extract is then processed in the DSH, the deleted records will no longer appear in Silver and Gold. However, previous instances of those records will still exist in Raw and Bronze, which store historical data.
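The behaviour described above can be reproduced with a minimal sketch, assuming that Raw appends every extract (tagged here with a hypothetical load_id column) while Silver is truncated and rebuilt from the latest extract only. Bronze is omitted for brevity, and the table names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_person (load_id INTEGER, person_id TEXT);  -- keeps every extract
    CREATE TABLE silver_person (person_id INTEGER PRIMARY KEY); -- latest extract only
""")


def ingest(load_id, person_ids):
    """Process one full extract: append to Raw, then rebuild Silver from it."""
    conn.executemany(
        "INSERT INTO raw_person VALUES (?, ?)",
        [(load_id, pid) for pid in person_ids],
    )
    conn.execute("DELETE FROM silver_person")  # truncate the existing Silver data
    conn.execute(
        "INSERT INTO silver_person "
        "SELECT CAST(person_id AS INTEGER) FROM raw_person WHERE load_id = ?",
        (load_id,),
    )


ingest(1, ["1", "2", "3"])  # first extract
ingest(2, ["1", "3"])       # person 2 was deleted at the site before the second extract

silver_ids = [r[0] for r in conn.execute(
    "SELECT person_id FROM silver_person ORDER BY person_id")]
raw_ids = [r[0] for r in conn.execute(
    "SELECT DISTINCT person_id FROM raw_person ORDER BY person_id")]
print(silver_ids)  # [1, 3]
print(raw_ids)     # ['1', '2', '3']
```

Person 2 disappears from Silver after the second extract but remains in Raw, which is the persistence this section describes.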