Data masking is the process by which sensitive data is replaced with data that is unintelligible to the receiver, addressing data security and privacy requirements and regulations.
The process of masking may appear to be straightforward. If we are simply replacing an item with ‘***’, this can be achieved with a simple regular expression. However, the masking process becomes more complex and elaborate when addressing requirements such as multiple payload types, composite payloads, or advanced masking methods such as format preserving encryption (FPE) and format preserving tokenization (FPT). Although many masking tools exist, most of them address a particular payload type such as database tables, PDF files or DICOM images, to name a few. These tools fall short when the payloads are composite, when advanced masking is required for user-defined data classes (e.g., a proprietary organization ID), or when multiple payload types (e.g., DICOM and CSV files) are to be processed in a consistent manner.
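To illustrate the simple end of the spectrum, the following sketch shows regular-expression-based redaction. The 9-digit identifier pattern is an assumption chosen for illustration; real payloads such as DICOM images or database tables require format-aware tools rather than plain text substitution.

```python
import re

def redact(text: str) -> str:
    # Naive masking: replace anything that looks like a 9-digit
    # identifier with '***'. This destroys the value entirely, unlike
    # FPE/FPT, which would preserve the original format.
    return re.sub(r"\b\d{9}\b", "***", text)

print(redact("Patient 123456789 visited the clinic."))
# Patient *** visited the clinic.
```

Note that such a replacement is not reversible and does not preserve referential integrity across files, which motivates the more advanced methods discussed below.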
The iToBoS project has two requirements that need to be addressed. The first requirement is to maintain the ability to cross-reference clinical data related to the same patient regardless of whether the data is contained in a CSV file or a DICOM image. Usually, each data record, whether in CSV or DICOM, will contain a patient ID which can be used to identify all records or files belonging to the same patient. However, there are no standard tools able to mask both DICOM and CSV files. If multiple tools are used to mask the patient ID, the result may be inconsistent – effectively losing referential integrity. To summarize, there is a need to tokenize the patient ID consistently such that (1) identical patient IDs are always tokenized to the same token (referential integrity) and (2) the tokenized value is not reversible. One method to achieve this is to apply the same hashing function to all patient IDs across all the patient records.
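A minimal sketch of such consistent, non-reversible tokenization, using a keyed hash (HMAC-SHA256). The key name and token length here are illustrative assumptions; a keyed hash is used rather than a bare hash so that tokens cannot be recomputed, or brute-forced over the relatively small patient-ID space, without access to the key.

```python
import hmac
import hashlib

# Hypothetical secret key; in practice this would be provisioned and
# stored in a key-management system, not in source code.
SECRET_KEY = b"replace-with-a-managed-secret"

def tokenize_patient_id(patient_id: str) -> str:
    # Deterministic: the same ID always yields the same token, whether
    # it came from a CSV file or a DICOM tag (referential integrity).
    digest = hmac.new(SECRET_KEY, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # truncated for readability

# Identical IDs map to identical tokens; distinct IDs to distinct tokens.
assert tokenize_patient_id("P-0042") == tokenize_patient_id("P-0042")
assert tokenize_patient_id("P-0042") != tokenize_patient_id("P-0043")
```

Because the same function and key are applied across all payload types, cross-referencing by token remains possible while the original IDs cannot be recovered from the masked data.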
The second requirement relates to calendar dates. iToBoS patients may have multiple recorded visits (e.g., for scanning) which are recorded in both the DICOM and RedCap systems. The sequence of dates could pose a privacy concern, as combining it with external data may lead to re-identifying a person. However, the sequence of dates, and in particular the duration between visits, may provide necessary information for the AI algorithms. It is therefore necessary to process the dates in a way that addresses the privacy risk of re-identification while maintaining the required utility.

Several possible alternatives exist, each one representing a different tradeoff between privacy and utility. For example, one could process the sequence of dates while maintaining the exact duration between dates (e.g., 1/5, 10/5 can be replaced with 15/9, 24/9). This approach maintains much of the utility (duration) of the original sequence without divulging the exact dates, providing some privacy. However, as the sequences grow longer, they become more unique, and may be used, combined with other data, to facilitate re-identification. Alternatively, the sequence of dates could be processed to maintain the order while allowing for some variation in the duration between dates (e.g., 1/5, 10/5 can be replaced with 16/9, 20/9). This alternative does not provide the exact duration and, in some cases, could hurt utility. However, it provides improved privacy, as re-identification of a person based on a noisy sequence of dates becomes much harder. These alternatives, and possibly more, will be considered, implemented, and evaluated during the iToBoS project, as well as any additional fields that may require masking.
Micha Moffie. IBM.