Data Ingestion Process
The ingestion process applies constraints iteratively, starting from the handling of raw CSV files and ultimately producing a pseudonymised schema for research purposes.
Data submissions from individual sites will move through different levels of quality checks (referred to as “bronze” and “silver”) before being merged into a final “gold” schema. The gold schema will exist as both a fully identifiable database with very limited access and a parallel pseudonymised version.
- CSV to Bronze: this is a 2-step process that generates a local view of the uploaded CSV files, so that each site can inspect it and confirm that we have been able to load their data into our database:
  a. Unconstrained text data from the CSV files is copied into the initial “raw” schema.
  b. Basic type and NOT NULL constraints are applied; the result is a database where we can be confident we have the required information for the following steps.
- Bronze to Silver: at this stage the site’s existing data is updated and merged with its newly ingested data. This means ensuring that foreign keys are valid, duplicates have been removed, and deletions and updates have been processed, so that the result represents the latest local version of the site’s data.
- Silver to Gold: preparation of the final clean and identifiable multi-site data using each site’s Silver data. This involves:
  a. Person-matching algorithms to merge data coming from different hospitals.
  b. Data clean-up: removing children and opted-out patients, and applying high-level constraints (e.g. Information Governance restrictions).
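The CSV-to-Bronze step above can be sketched as follows. This is a minimal illustration using SQLite; the table and column names (`raw_patients`, `bronze_patients`, `birth_year`) are assumptions for the example, not the actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Step a: copy unconstrained text data into the "raw" staging table.
# Every column is TEXT so that any CSV content loads without error.
conn.execute("CREATE TABLE raw_patients (patient_id TEXT, birth_year TEXT)")
conn.executemany(
    "INSERT INTO raw_patients VALUES (?, ?)",
    [("p1", "1980"), ("p2", None), ("p3", "1975")],  # p2 is missing a birth year
)

# Step b: apply basic type and NOT NULL constraints in the bronze table.
conn.execute(
    "CREATE TABLE bronze_patients ("
    " patient_id TEXT NOT NULL,"
    " birth_year INTEGER NOT NULL)"
)

# Rows satisfying the constraints are promoted; failures are collected so
# they can be reported back to the site rather than silently dropped.
ok, failed = [], []
for pid, year in conn.execute("SELECT patient_id, birth_year FROM raw_patients"):
    try:
        conn.execute("INSERT INTO bronze_patients VALUES (?, ?)", (pid, int(year)))
        ok.append(pid)
    except (TypeError, ValueError, sqlite3.IntegrityError):
        failed.append(pid)

print(ok)      # rows we can be confident about
print(failed)  # rows the site needs to correct
```

Keeping the raw layer all-text is what lets the load itself never fail; the constraints only bite when promoting rows to bronze.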
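The Bronze-to-Silver merge logic might look roughly like the following sketch. The record structure and field names (`visit_id`, `status`) are illustrative assumptions; the real pipeline operates on database tables:

```python
# Merge newly ingested rows into a site's existing silver data: updates
# replace older versions, deletions are processed, duplicates collapse,
# and rows with invalid foreign keys are rejected.
existing = {"v1": {"visit_id": "v1", "patient_id": "p1", "status": "active"}}
incoming = [
    {"visit_id": "v1", "patient_id": "p1", "status": "deleted"},  # deletion
    {"visit_id": "v2", "patient_id": "p2", "status": "active"},   # new row
    {"visit_id": "v2", "patient_id": "p2", "status": "active"},   # duplicate
    {"visit_id": "v3", "patient_id": "p9", "status": "active"},   # bad FK
]
known_patients = {"p1", "p2"}  # valid foreign-key targets

rejected = []
for row in incoming:
    if row["patient_id"] not in known_patients:  # foreign-key check
        rejected.append(row["visit_id"])
    elif row["status"] == "deleted":             # process deletions
        existing.pop(row["visit_id"], None)
    else:                                        # upsert: dedupes and updates
        existing[row["visit_id"]] = row

print(sorted(existing))  # latest local version of the site's data
```

After the merge, `existing` holds exactly one current version of each surviving row, which is what “latest local version of a site’s data” means in practice.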
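For illustration only, the person-matching step in Silver-to-Gold could be as simple as grouping records by a normalised key; the actual algorithms are not specified here, and this deterministic-matching sketch (matching on NHS number) is an assumption:

```python
from collections import defaultdict

# Records for the same person arriving from different hospitals.
records = [
    {"site": "hospital_a", "nhs_number": "111", "name": "Ada Lovelace"},
    {"site": "hospital_b", "nhs_number": "111", "name": "A. Lovelace"},
    {"site": "hospital_a", "nhs_number": "222", "name": "Alan Turing"},
]

# Deterministic matching: records sharing the same normalised key
# (here, NHS number) are merged under a single person.
persons = defaultdict(list)
for rec in records:
    persons[rec["nhs_number"].strip()].append(rec)

print({k: len(v) for k, v in persons.items()})  # {'111': 2, '222': 1}
```

Real person-matching typically also handles missing or mistyped identifiers with probabilistic rules, which this sketch deliberately omits.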
In addition, a pseudonymised schema will be automatically populated, which should be the default dataset for most users working inside the UCL Data Safe Haven.
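One common way to populate such a pseudonymised schema is to replace direct identifiers with keyed hashes. The sketch below is an assumption about technique, not the actual method used, and the key and field names are illustrative dummies:

```python
import hashlib
import hmac

SECRET_KEY = b"held-only-by-the-identifiable-side"  # illustrative placeholder

def pseudonymise(patient_id: str) -> str:
    # A keyed hash (HMAC) yields a stable pseudonym that cannot be
    # reversed or recomputed without the secret key.
    return hmac.new(SECRET_KEY, patient_id.encode(), hashlib.sha256).hexdigest()[:16]

# Dummy identifiable record; direct identifiers are dropped or replaced,
# research-safe fields are kept.
record = {"patient_id": "p1", "nhs_number": "0000000000", "age_band": "40-49"}
pseudo_record = {
    "pseudo_id": pseudonymise(record["patient_id"]),
    "age_band": record["age_band"],
}
print(pseudo_record)
```

Because the pseudonym is stable, the same patient maps to the same `pseudo_id` across refreshes, so researchers can still link records over time without access to identifiers.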