Using machine learning for assessing melanoma risk based on genetic data

The natural progression of a disease depends on an individual's inherited genetic susceptibility and on several environmental factors. The human genome contains millions of genetic variants known as single nucleotide polymorphisms (SNPs).

Genome-wide association studies (GWAS) identify SNPs associated with specific diseases, so SNP data can be used to estimate an individual's risk of developing a disease.

Polygenic risk scoring and machine learning are the two main approaches to disease risk prediction. Polygenic risk scoring is the traditional statistical method, in which the risk score is calculated by aggregating the genetic markers that influence the disease. Genome-wide association studies estimate a weight coefficient for each SNP from a population, and the final risk score is the weighted sum of the associated alleles that an individual carries.
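The weighted sum described above can be sketched in a few lines of Python. This is only an illustration: the function name `polygenic_risk_score`, the three weights, and the genotype are hypothetical and do not come from any real study. Each genotype entry is assumed to be the number of copies (0, 1 or 2) of the associated allele at one SNP.

```python
def polygenic_risk_score(allele_counts, weights):
    """Return the weighted sum of associated-allele counts.

    allele_counts: copies of the associated allele per SNP (0, 1 or 2).
    weights: per-SNP effect sizes, assumed to come from an association study.
    """
    if len(allele_counts) != len(weights):
        raise ValueError("allele_counts and weights must have the same length")
    return sum(w * count for w, count in zip(weights, allele_counts))


# Hypothetical effect sizes for three SNPs (made-up values for illustration).
example_weights = [0.12, -0.05, 0.30]
# Hypothetical subject: two copies at SNP 1, one at SNP 2, none at SNP 3.
example_genotype = [2, 1, 0]

score = polygenic_risk_score(example_genotype, example_weights)
print(f"Polygenic risk score: {score:.2f}")
```

A negative weight in this sketch stands for an allele assumed to be protective. The raw score is only meaningful relative to the distribution of scores in a reference population.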

Machine learning and deep learning models make predictions by learning associations between SNPs and diseases from the collected data. They are powerful because they can extract patterns from complex, high-dimensional data. A model trained on data from a given population can then predict the risk score for a subject from the same population.
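As a rough sketch of the train-then-predict workflow, the example below fits a logistic regression by gradient descent on a tiny, entirely made-up dataset and then scores a new subject. Logistic regression is used here only because it fits in a few lines; it is itself a linear model, and real studies would use far more samples and typically richer models. The function names, the data, and the hyperparameters are all assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(genotypes, labels, learning_rate=0.1, epochs=500):
    """Learn per-SNP weights and a bias with batch gradient descent."""
    n_samples = len(genotypes)
    n_snps = len(genotypes[0])
    weights = [0.0] * n_snps
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_snps
        grad_b = 0.0
        for x, y in zip(genotypes, labels):
            predicted = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            error = predicted - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        # Step against the average gradient of the log-loss.
        weights = [w - learning_rate * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


def predict_risk(model, genotype):
    """Return the predicted probability of disease for one genotype."""
    weights, bias = model
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, genotype)))


# Made-up training data: allele counts (0, 1 or 2) at three SNPs per subject,
# with label 1 for affected and 0 for unaffected subjects.
train_genotypes = [
    [0, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1],
    [2, 1, 1], [1, 2, 2], [2, 2, 1], [2, 1, 2],
]
train_labels = [0, 0, 0, 0, 1, 1, 1, 1]

model = train_logistic_regression(train_genotypes, train_labels)
new_subject = [2, 2, 2]  # hypothetical subject from the same population
print(f"Predicted risk: {predict_risk(model, new_subject):.2f}")
```

Unlike a polygenic risk score, whose weights are fixed in advance, the weights here are learned from the training subjects, which is why the amount and the representativeness of the training data matter so much.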

Despite the recent proliferation of machine learning methods for addressing a wide range of real-world challenges, machine learning is not yet widely used in clinical decision making. There are two main reasons. Firstly, training a complex model that generalizes well to different subjects and populations requires a large number of samples. Since DNA data is highly sensitive and not easily accessible, very few available datasets provide both SNP information and the presence or absence of specific diseases. Secondly, because clinical decisions can have a significant impact on patients' lives, it is of paramount importance that decisions supported by machine learning systems are as transparent and explainable as possible, so that the clinicians who use these systems can properly assess their results. These features must be explicitly implemented in the models, and they also need to be verified in practice. Polygenic risk scores, on the other hand, can be interpreted by clinicians in a more straightforward way.

Nevertheless, the fixed-model approach of polygenic risk scoring assumes that the available data is uncorrelated and normally distributed, assumptions that may not hold and can lead to biased predictions. In addition, the linear regression modelling that polygenic risk scores incorporate cannot capture the more complex associations that machine learning models may learn.

The scientific community continues to explore the opportunities and confront the challenges that arise from both approaches. A machine learning model that inspires confidence and predicts accurately could prove very valuable for predicting and anticipating melanoma and other diseases with a genetic predisposition.