Data Ingestion Process
The ingestion process applies constraints iteratively, starting from the handling of raw CSV files and ultimately producing a pseudonymised schema for research purposes.
Data submissions from individual sites will move through different levels of quality checks (referred to as “bronze” and “silver”) before being merged into a final “gold” schema. The gold schema will exist as both a fully identifiable database with very limited access and a parallel pseudonymised version.
- CSV to Bronze: this is a 2-step process that generates a local view of the uploaded CSV files, so that each site can inspect it and confirm that we have been able to load their data into our database:
  a. Unconstrained text data from the CSV files is copied into the initial “raw” schema.
  b. Basic type and NOT NULL constraints are applied; the result is a database where we can be confident we have the required information for the following steps.
- Bronze to Silver: at this stage the site’s existing data is updated and merged with its newly ingested data. This means ensuring that foreign keys are valid, duplicates have been removed, and deletions and updates have been processed, so that the result represents the latest local version of the site’s data.
- Silver to Gold: preparation of the final clean and identifiable multi-site data using each site’s Silver data. This involves:
  a. Person-matching algorithms to merge data coming from different hospitals.
  b. Data clean-up: removing children and opted-out patients, and applying high-level constraints (e.g. Information Governance restrictions).
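The CSV-to-Bronze step above can be sketched as follows. This is a minimal illustration using SQLite; the table and column names (`raw_patients`, `bronze_patients`, `birth_year`) are assumptions for the example, not the actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Step a: copy unconstrained text data into the "raw" staging table.
# Every column is TEXT so that any CSV content loads without error.
conn.execute("CREATE TABLE raw_patients (patient_id TEXT, birth_year TEXT)")
conn.executemany(
    "INSERT INTO raw_patients VALUES (?, ?)",
    [("p1", "1980"), ("p2", None), ("p3", "1975")],  # p2 is missing a birth year
)

# Step b: apply basic type and NOT NULL constraints in the bronze table.
conn.execute(
    "CREATE TABLE bronze_patients ("
    " patient_id TEXT NOT NULL,"
    " birth_year INTEGER NOT NULL)"
)

# Rows satisfying the constraints are promoted; failures are collected so
# they can be reported back to the site rather than silently dropped.
ok, failed = [], []
for pid, year in conn.execute("SELECT patient_id, birth_year FROM raw_patients"):
    try:
        conn.execute("INSERT INTO bronze_patients VALUES (?, ?)", (pid, int(year)))
        ok.append(pid)
    except (TypeError, ValueError, sqlite3.IntegrityError):
        failed.append(pid)

print(ok)      # rows we can be confident about
print(failed)  # rows the site needs to correct
```

Keeping the raw layer all-text is what lets the load itself never fail; the constraints only bite when promoting rows to bronze.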
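The Bronze-to-Silver merge logic might look roughly like the following sketch. The record structure and field names (`visit_id`, `status`) are illustrative assumptions; the real pipeline operates on database tables:

```python
# Merge newly ingested rows into a site's existing silver data: updates
# replace older versions, deletions are processed, duplicates collapse,
# and rows with invalid foreign keys are rejected.
existing = {"v1": {"visit_id": "v1", "patient_id": "p1", "status": "active"}}
incoming = [
    {"visit_id": "v1", "patient_id": "p1", "status": "deleted"},  # deletion
    {"visit_id": "v2", "patient_id": "p2", "status": "active"},   # new row
    {"visit_id": "v2", "patient_id": "p2", "status": "active"},   # duplicate
    {"visit_id": "v3", "patient_id": "p9", "status": "active"},   # bad FK
]
known_patients = {"p1", "p2"}  # valid foreign-key targets

rejected = []
for row in incoming:
    if row["patient_id"] not in known_patients:  # foreign-key check
        rejected.append(row["visit_id"])
    elif row["status"] == "deleted":             # process deletions
        existing.pop(row["visit_id"], None)
    else:                                        # upsert: dedupes and updates
        existing[row["visit_id"]] = row

print(sorted(existing))  # latest local version of the site's data
```

After the merge, `existing` holds exactly one current version of each surviving row, which is what “latest local version of a site’s data” means in practice.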
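For illustration only, the person-matching step in Silver-to-Gold could be as simple as grouping records by a normalised key; the actual algorithms are not specified here, and this deterministic-matching sketch (matching on NHS number) is an assumption:

```python
from collections import defaultdict

# Records for the same person arriving from different hospitals.
records = [
    {"site": "hospital_a", "nhs_number": "111", "name": "Ada Lovelace"},
    {"site": "hospital_b", "nhs_number": "111", "name": "A. Lovelace"},
    {"site": "hospital_a", "nhs_number": "222", "name": "Alan Turing"},
]

# Deterministic matching: records sharing the same normalised key
# (here, NHS number) are merged under a single person.
persons = defaultdict(list)
for rec in records:
    persons[rec["nhs_number"].strip()].append(rec)

print({k: len(v) for k, v in persons.items()})  # {'111': 2, '222': 1}
```

Real person-matching typically also handles missing or mistyped identifiers with probabilistic rules, which this sketch deliberately omits.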
In addition, a pseudonymised schema will be automatically populated, which should be the default dataset for most users working inside the UCL Data Safe Haven.
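One common way to populate such a pseudonymised schema is to replace direct identifiers with keyed hashes. The sketch below is an assumption about technique, not the actual method used, and the key and field names are illustrative dummies:

```python
import hashlib
import hmac

SECRET_KEY = b"held-only-by-the-identifiable-side"  # illustrative placeholder

def pseudonymise(patient_id: str) -> str:
    # A keyed hash (HMAC) yields a stable pseudonym that cannot be
    # reversed or recomputed without the secret key.
    return hmac.new(SECRET_KEY, patient_id.encode(), hashlib.sha256).hexdigest()[:16]

# Dummy identifiable record; direct identifiers are dropped or replaced,
# research-safe fields are kept.
record = {"patient_id": "p1", "nhs_number": "0000000000", "age_band": "40-49"}
pseudo_record = {
    "pseudo_id": pseudonymise(record["patient_id"]),
    "age_band": record["age_band"],
}
print(pseudo_record)
```

Because the pseudonym is stable, the same patient maps to the same `pseudo_id` across refreshes, so researchers can still link records over time without access to identifiers.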