Generative AI for Specialized Dataset Enhancement and Expansion

An important challenge for applying machine learning and deep learning methods in applications where data collection is difficult or costly is the limited amount of annotated data.

This is the motivation behind various methods for enhancing and extending existing datasets. Especially in the medical field, where data acquisition and annotation are expensive and time-consuming, the importance of enhancing the existing data is amplified.

An important way to extend annotated data that has recently gained popularity, especially but not exclusively in applications that deal with images, is the use of state-of-the-art generative AI models, such as Deep Image Prior and Denoising Diffusion Models. Denoising Diffusion Models, such as Stable Diffusion, are among the dominant paradigms for such tasks, because the generated samples can be conditioned on specific inputs, which can be provided as natural language among other forms; hence, a large part of the current literature focuses on them. The ability to add highly realistic yet synthetically generated samples to existing datasets opens new opportunities for further expanding the use of deep learning models in medical applications, not only for image classification and detection problems, but also for other data modalities, such as the clinical and even genetic data that the iToBoS project focuses on.
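As a minimal sketch of how such text-conditioned generation works in practice, the snippet below uses the Hugging Face diffusers library to sample images from a pretrained Stable Diffusion pipeline. The model identifier and the prompt are illustrative only; producing clinically realistic images would in practice require fine-tuning on domain data.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a pretrained Stable Diffusion pipeline (model id is illustrative).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# Condition generation on a natural-language prompt describing the kind
# of sample we would like to add to the dataset (prompt is hypothetical).
prompt = "dermoscopy image of a benign melanocytic nevus on fair skin"
result = pipe(prompt, num_images_per_prompt=4, num_inference_steps=50)

# Save the synthetic samples for later inclusion in the training set.
for i, image in enumerate(result.images):
    image.save(f"synthetic_nevus_{i}.png")
```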

Source: https://arxiv.org/abs/2210.04133

Generative AI models are also an excellent way of handling sensitive and scarce data. Some of the most prominent difficulties in medical data collection relate to patient privacy. This problem can be partially overcome using generative AI models, which can be employed, via prompting, to generate new data that follow the same distribution as the real collected samples. To ensure consistency in the generated instances, methodologies such as textual inversion can be applied to learn a token representing specific features of interest, thus allowing faithful replication. This approach enables data-driven methodologies, such as classification and localization, to be trained extensively with a significantly reduced amount of real-life, hard-to-acquire data.
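As a rough illustration of how a learned textual-inversion token might be used at generation time, the sketch below again relies on the diffusers library. The embedding file and the placeholder token <lesion-type> are hypothetical; they are assumed to come from a prior textual-inversion training run on a handful of real examples.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# Load an embedding learned via textual inversion; the file path and the
# placeholder token are hypothetical and would come from a prior training
# run that captured the specific feature of interest.
pipe.load_textual_inversion("./learned_embeds.bin", token="<lesion-type>")

# The learned token can now be used inside prompts, so the feature it
# encodes is replicated consistently across many synthetic samples.
image = pipe("dermoscopy image of a <lesion-type> lesion").images[0]
image.save("synthetic_lesion.png")
```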