Alchemist for HIC Hearing Health

The Alchemist Project is part of the core HIC initiatives supported by UCL/UCLH.

Data Ingestion Process

The ingestion process applies progressively stricter constraints, starting from the uploaded CSV files and ending with a pseudonymised schema for research purposes.

Data submissions from individual sites will move through different levels of quality checks (referred to as “bronze” and “silver”) before being merged into a final “gold” schema. The gold schema will exist both as a fully identifiable database with very limited access and as a parallel pseudonymised version.

  1. CSV to Bronze: this is a 2-step process that generates a local view of the uploaded CSV files, so that each site can inspect it and confirm that we have been able to load its data into our database:

    a. Unconstrained text data from the CSV files is copied into the initial “raw” schema.

    b. Basic type and NOT NULL constraints are applied; the result is a database in which we can be confident that the information required for the following steps is present.

  2. Bronze to Silver: at this stage we update each site’s existing data and merge it with the newly ingested data. This means ensuring that foreign keys are valid, removing duplicates, and processing deletions and updates, so that the result represents the latest local version of a site’s data.

  3. Silver to Gold: preparation of the final clean and identifiable multi-site data using each site’s Silver data. This involves:

    a. Person-matching algorithms to merge data coming from different hospitals.

    b. Data clean-up: removing records of children and opted-out patients, and applying high-level constraints (e.g. Information Governance restrictions).
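The three stages above can be illustrated with a minimal Python sketch. It is not the project's implementation: the column names (`patient_id`, `date_of_birth`, `opted_out`), the function names, the "later row wins" merge rule and the adult age threshold of 18 are all assumptions made for illustration. Foreign-key validation and person-matching across hospitals are deliberately left out.

```python
from datetime import date

# Hypothetical raw rows: every value is unconstrained text, as read from a CSV file.
RAW_ROWS = [
    {"patient_id": "1", "date_of_birth": "1980-05-17", "opted_out": "false"},
    {"patient_id": "2", "date_of_birth": "", "opted_out": "false"},            # missing required value
    {"patient_id": "3", "date_of_birth": "2015-01-01", "opted_out": "false"},  # child
    {"patient_id": "4", "date_of_birth": "1975-11-02", "opted_out": "true"},   # opted out
    {"patient_id": "1", "date_of_birth": "1980-05-18", "opted_out": "false"},  # later update of patient 1
]

REQUIRED = ("patient_id", "date_of_birth", "opted_out")


def to_bronze(raw_rows):
    """Apply basic type and NOT NULL checks; return (accepted, rejected)."""
    accepted, rejected = [], []
    for row in raw_rows:
        try:
            if not all(row[key].strip() for key in REQUIRED):
                raise ValueError("required value is empty")
            flag = row["opted_out"].strip().lower()
            if flag not in ("true", "false"):
                raise ValueError("opted_out must be 'true' or 'false'")
            accepted.append({
                "patient_id": int(row["patient_id"]),
                "date_of_birth": date.fromisoformat(row["date_of_birth"].strip()),
                "opted_out": flag == "true",
            })
        except (KeyError, ValueError) as exc:
            rejected.append((row, str(exc)))
    return accepted, rejected


def to_silver(existing_rows, bronze_rows, deleted_ids=()):
    """Merge new rows into a site's existing data, keyed on patient_id."""
    merged = {row["patient_id"]: row for row in existing_rows}
    for row in bronze_rows:
        merged[row["patient_id"]] = row  # assumed rule: the later row wins, so duplicates collapse
    for patient_id in deleted_ids:
        merged.pop(patient_id, None)     # process deletions
    return list(merged.values())


def age_on(date_of_birth, today):
    """Age in whole years on the given day."""
    had_birthday = (today.month, today.day) >= (date_of_birth.month, date_of_birth.day)
    return today.year - date_of_birth.year - (0 if had_birthday else 1)


def to_gold(silver_rows, today, adult_age=18):
    """Drop children and opted-out patients (adult_age is an assumed threshold)."""
    return [
        row for row in silver_rows
        if not row["opted_out"] and age_on(row["date_of_birth"], today) >= adult_age
    ]


bronze, rejected = to_bronze(RAW_ROWS)
silver = to_silver([], bronze)
gold = to_gold(silver, today=date(2024, 1, 1))
```

Returning the rejected rows together with the reason for rejection mirrors the purpose of the bronze stage, which is to give sites something to inspect when their upload does not load cleanly.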

In addition, a pseudonymised schema will be populated automatically; this should be the default data source for most users working inside the UCL Data Safe Haven.
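How the pseudonymised schema is derived is not described here. As one common approach, shown purely as an assumption, direct identifiers can be replaced with a keyed hash (HMAC-SHA256) so that the same patient always maps to the same pseudonym without the original identifier being stored. The field names, the `pseudo_id` column and the secret key below are hypothetical.

```python
import hashlib
import hmac


def pseudonymise_id(identifier: str, secret_key: bytes) -> str:
    """Return a stable pseudonym for an identifier using HMAC-SHA256."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def pseudonymise_row(row, secret_key, id_field="patient_id",
                     drop_fields=("name", "date_of_birth")):
    """Copy a row, dropping direct identifiers and adding a pseudonymous key."""
    out = {
        key: value for key, value in row.items()
        if key != id_field and key not in drop_fields
    }
    out["pseudo_id"] = pseudonymise_id(str(row[id_field]), secret_key)
    return out


# Hypothetical example row and key; a real key would be kept outside the research environment.
example_key = b"replace-with-a-securely-stored-key"
example_row = {"patient_id": 1, "name": "Example Patient",
               "date_of_birth": "1980-05-17", "hearing_aid_fitted": True}
pseudo_row = pseudonymise_row(example_row, example_key)
```

A keyed hash is used in this sketch rather than a plain hash because, without the key, identifiers cannot be recovered simply by hashing candidate values and comparing the results.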