The challenges of infusing privacy and compliance technologies in the iToBoS project

In any data processing project that deals with personal information, there is an inherent tradeoff between safeguarding data subjects’ privacy and extracting useful, accurate insights from the data.

This problem is compounded on both fronts in healthcare projects. On one hand, health data is especially sensitive; it is even defined under Article 9 of the GDPR as a special category of personal data to which additional restrictions apply. On the other hand, when making medical decisions that can have a “life or death” effect on people, you need to be able to base those decisions on information and technology that is as accurate as possible.

Another dilemma is how to align with the “open research data” paradigm being strongly promoted in the EU when it comes to health data, and how to decide what data is safe to share outside the consortium. The iToBoS project, for example, is interested in fostering the development of cognitive assistant algorithms, such as the one developed in the project, within the wider scientific community. To this end, it plans to release some of the data collected in the project in the form of two open challenges for skin lesion analysis. However, for obvious reasons, any patient data shared publicly must be properly anonymized so that it cannot be re-identified by any reasonable means, even with external knowledge. This means that merely removing direct identifiers such as names and IDs is not enough; further measures, such as k-anonymity or differential privacy, must be applied to ensure proper anonymization of the released data.
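To make the first of these concrete: k-anonymity requires that every record be indistinguishable from at least k-1 others with respect to its quasi-identifiers. The following minimal sketch (with purely illustrative column names and values, not the project’s actual data) checks this property on a small table:

```python
import pandas as pd

# Hypothetical, already-generalized patient records; columns are illustrative only.
records = pd.DataFrame({
    "age_range": ["60-70", "60-70", "60-70", "70-80", "70-80"],
    "sex":       ["F",     "F",     "F",     "M",     "M"],
    "skin_type": ["II",    "II",    "II",    "III",   "III"],
})

def is_k_anonymous(df: pd.DataFrame, quasi_identifiers: list, k: int) -> bool:
    """A dataset is k-anonymous if every combination of quasi-identifier
    values is shared by at least k records."""
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool((group_sizes >= k).all())

qi = ["age_range", "sex", "skin_type"]
print(is_k_anonymous(records, qi, k=2))  # True: every group has >= 2 records
print(is_k_anonymous(records, qi, k=3))  # False: one group has only 2 records
```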

When applying privacy-preserving methods to datasets, small amounts of noise and/or generalizations are typically introduced to prevent any record from being identifiable, by helping it “blend into the crowd”. However, the smaller the dataset, the more difficult it is to create this blending, and the larger the amount of noise or generalization required to achieve it. Since the dataset in this project is relatively small (600 patients), this implies reduced utility of the published data.
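A small sketch of the Laplace mechanism from differential privacy illustrates this effect: for a bounded mean query, the noise scale is inversely proportional to the number of records, so a small cohort receives proportionally far more noise than a large one. The cohorts and bounds below are synthetic, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_mean(values: np.ndarray, lower: float, upper: float, epsilon: float) -> float:
    """Differentially private mean via the Laplace mechanism.
    The sensitivity of the mean of n values bounded in [lower, upper]
    is (upper - lower) / n, so the noise scale shrinks as n grows --
    equivalently, small cohorts need relatively more noise."""
    clipped = np.clip(values, lower, upper)
    sensitivity = (upper - lower) / len(clipped)
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return float(clipped.mean() + noise)

ages_small = rng.uniform(40, 80, size=60)    # small cohort
ages_large = rng.uniform(40, 80, size=6000)  # 100x larger cohort
print(dp_mean(ages_small, 18, 100, epsilon=1.0))  # noticeably noisier estimate
print(dp_mean(ages_large, 18, 100, epsilon=1.0))  # much closer to the true mean
```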

Further compounding these issues is the fact that this project is considering analyzing more data sources than have been used in previous studies, including image data, clinical data, genomic data, and family history. This makes it difficult to determine beforehand which parts of these data will be most useful for training the AI algorithms. Therefore, we cannot determine with certainty which fields can safely be removed, generalized, or otherwise perturbed without harming the models’ accuracy in a way that renders them useless for this type of application. We therefore plan to employ an iterative approach: initial versions of the AI models will be trained on (almost) all of the data collected during the prospective patient study. The importance or effect of each of the collected data points will then be analyzed to guide the decision to possibly discard some of the features and exclude them from the final training or challenges. For the remaining data that we do wish to use and include in the publicly released challenge data, we will attempt to create a tailored anonymization that takes the needs of the AI models into account and generalizes the data in a way that is least harmful to the fields most important for the analysis.
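One possible way to measure the importance of each collected field is permutation importance: train an initial model, then shuffle one feature at a time and observe how much performance drops. The sketch below illustrates the idea with scikit-learn; the features, labels, and model are hypothetical stand-ins, not the project’s actual data or algorithms:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical tabular features; names and values are illustrative only.
rng = np.random.default_rng(0)
n = 600  # roughly the size of the prospective study cohort
X = pd.DataFrame({
    "age":               rng.integers(20, 90, n),
    "naevus_count":      rng.integers(0, 200, n),
    "family_history":    rng.integers(0, 2, n),
    "uv_exposure_score": rng.random(n),
})
y = (X["naevus_count"] > 100).astype(int)  # synthetic label, for the sketch only

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Permutation importance: how much does shuffling each column hurt accuracy?
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
for name, score in sorted(zip(X.columns, result.importances_mean),
                          key=lambda t: -t[1]):
    print(f"{name:>20}: {score:.3f}")
```

Features whose shuffling barely affects accuracy are candidates for removal or heavier generalization, while the anonymization can be tuned to preserve the highly ranked ones.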

The fact that the data has multiple modalities and formats also poses some challenges. The project will employ the DICOM format to store, transfer, and publish the image data. The DICOM standard includes many header fields that store metadata about each image, such as the patient, study, and series it belongs to and the anatomical location it depicts, along with additional demographic and clinical data that may be relevant to the physician examining the image and trying to make a diagnosis. In the case of iToBoS, some of these fields overlap with the data collected in the REDCap system and exported as CSV files. To mask or anonymize both data formats consistently, the same masking techniques must be employed, with the same parameters and internal state, to yield coherent results. For example, the patient ID must be masked in the same way in both types of files so that all information belonging to a single patient can be cross-referenced. However, the standard tools for de-identifying or anonymizing DICOM files do not currently support masking tabular data, and vice versa.
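One way to achieve this consistency (sketched below; this is not the project’s actual pipeline) is to derive pseudonyms from a keyed hash of the original identifier and apply the same function to both the DICOM header and the CSV export. The file and column names are hypothetical; the DICOM attribute names follow the standard:

```python
import hashlib
import hmac

import pandas as pd
import pydicom

SECRET_KEY = b"replace-with-a-securely-stored-key"  # never released with the data

def pseudonymize(patient_id: str) -> str:
    """Deterministic keyed hash: the same input always maps to the same
    pseudonym, so DICOM and CSV records remain cross-referenceable."""
    digest = hmac.new(SECRET_KEY, patient_id.encode(), hashlib.sha256)
    return digest.hexdigest()[:16]

# DICOM side: mask the identifier, drop direct identifiers entirely.
ds = pydicom.dcmread("scan.dcm")
ds.PatientID = pseudonymize(str(ds.PatientID))
ds.PatientName = ""
ds.save_as("scan_masked.dcm")

# Tabular side: apply the exact same function to the exported CSV.
df = pd.read_csv("clinical_export.csv")
df["patient_id"] = df["patient_id"].astype(str).map(pseudonymize)
df.to_csv("clinical_export_masked.csv", index=False)
```

Because the hash is deterministic, re-running the pipeline over either format yields the same pseudonyms, while the secret key prevents anyone without it from reversing or reproducing the mapping.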

Moreover, tabular data is not the only kind of data that may contain identifying information. Skin images may contain identifying marks such as birthmarks, tattoos, or scars that, if included in the released data, could violate patients’ privacy. Image anonymization is a fairly new and as yet unsolved research problem, especially where very high resolution and level of detail are required, as in dermoscopic images for skin cancer screening. In this project we decided to have the technician operating the scanner manually identify and select the identifying areas on the images, so that they can be removed before the image is uploaded into the iToBoS system. One remaining challenge is how to ensure that the resulting image still looks smooth and natural enough to be processed by the system, including by the AI models.
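A standard technique for making such removals look natural is image inpainting, which fills the selected region from the surrounding texture. The sketch below uses OpenCV’s inpainting function as an illustration; the file names, and the assumption that the technician’s selection is available as a binary mask, are hypothetical:

```python
import cv2

# Load the skin image and a binary mask marking the regions the technician
# flagged as identifying (file names are hypothetical).
image = cv2.imread("skin_crop.png")
mask = cv2.imread("identifying_regions_mask.png", cv2.IMREAD_GRAYSCALE)

# Inpaint: fill the masked pixels from the surrounding texture so the
# result looks smooth enough for downstream processing.
anonymized = cv2.inpaint(image, mask, inpaintRadius=5, flags=cv2.INPAINT_TELEA)
cv2.imwrite("skin_crop_anonymized.png", anonymized)
```

Whether such classical inpainting preserves enough dermoscopic detail for the AI models is exactly the open question noted above.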

We also had to decide what to do with images of the facial area. On one hand, this part of the body receives among the most sun exposure and thus has a high probability of sun damage and related disease; on the other hand, these images are also potentially the most identifying. So as not to give up this important area entirely, we opted to use much smaller crops (1x1 cm) than those used for the rest of the body (6x4 cm).

Finally, in a project such as this, which introduces a new AI-based decision-support technology, explainability of the models is of utmost importance to increase adoption of the tool by physicians and patients. Many methods for explaining AI decisions reveal to the user specific samples (or characteristics of samples) from the data used to train the model. This again poses a privacy risk to patients whose data was used in training. We are therefore considering ways in which auxiliary (non-private) datasets can be used instead of the actual training data when generating explanations for the AI decisions.
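As a minimal sketch of this idea, an example-based explanation can retrieve the most similar cases from a public auxiliary dataset rather than from the private training set, so that no training sample is ever exposed. The embeddings below are synthetic placeholders for features extracted from such a public dataset:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Feature embeddings of a public, non-private auxiliary dataset
# (e.g., an existing open skin-lesion archive); values are synthetic here.
auxiliary_embeddings = rng.random((1000, 64))

# Embedding of the case the model just classified.
query_embedding = rng.random((1, 64))

# Explain the decision by showing the most similar *public* examples,
# so no sample from the private training set is ever revealed.
nn = NearestNeighbors(n_neighbors=3).fit(auxiliary_embeddings)
distances, indices = nn.kneighbors(query_embedding)
print("Similar public examples:", indices[0], "at distances", distances[0])
```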