Selección de funciones de voz mediante algoritmos genéticos para la detección de la enfermedad de Parkinson

José Adrián Zambrano Miranda; Jorge Eduardo Correa Pillajo; Felipe Leonel Grijalva Arévalo; José David Vega Sánchez

Voice feature Selection using genetic algorithms for detecting Parkinson’s disease

José Adrián Zambrano Miranda jose.zambrano@epn.edu.ec

Escuela Politécnica Nacional, Ecuador

Jorge Eduardo Correa Pillajo jorge.correa@epn.edu.ec

Escuela Politécnica Nacional, Ecuador

Felipe Leonel Grijalva Arévalo felipe.grijalva@epn.edu.ec

Escuela Politécnica Nacional, Ecuador

Universidad de las Américas (UDLA), Ecuador

José David Vega Sánchez jose.vega01@epn.edu.ec

Escuela Politécnica Nacional, Ecuador

Revista de Investigación en Tecnologías de la Información

Universitat Politècnica de Catalunya, España

ISSN-e: 2387-0893

Periodicity: Biannual

vol. 10, no. 21, Esp., 2022

revista.riti@gmail.com

Received: June 06, 2022

Accepted: August 24, 2022

URL: http://portal.amelica.org/ameli/journal/368/3683473013/

DOI: https://doi.org/10.36825/RITI.10.21.013

This work is licensed under Creative Commons Attribution-NonCommercial 4.0 International.

Abstract: Today, Parkinson’s disease (PD) is one of the most common neurodegenerative diseases in the world after Alzheimer’s disease. About 6.2 million people have it, and the number of patients is estimated to double by 2040. PD reduces motor function, so patients suffer from decreased movement, stiffness, tremors, and even impaired voice and speech production, including breathing, articulation, and phonation. For this reason, the voice features of patients differ from those of people who do not have PD. We therefore seek a method that selects the voice features that most influence the patient's diagnosis. Several feature selection methods exist; in our case, we use genetic algorithms (GA). To validate our feature selection approach, we built an SVM classifier, which achieved its best accuracy of 88.54% with 8 features selected by the GA.

Keywords: Parkinson, Machine Learning Algorithms, Genetic Algorithm, SVM.

Resumen: Hoy en día, la enfermedad de Parkinson (EP) es una de las enfermedades neurodegenerativas más comunes en el mundo después de la enfermedad de Alzheimer. Cerca de 6,2 millones de personas la padecen y se estima que para 2040 el número de enfermos de Parkinson se duplicará. La EP reduce la función motora, por lo que los pacientes sufren disminución del movimiento, rigidez, temblores e incluso la producción de la voz y el habla, incluida la respiración, la articulación y la fonación. Por esta razón, las características de la voz de los pacientes varían en comparación con las personas que no tienen EP. Por ello, buscamos un método que nos permita seleccionar las características de la voz que más inciden en el diagnóstico del paciente. Existen varios métodos para la selección de características y, en nuestro caso, utilizamos algoritmos genéticos (GA). Para validar nuestro enfoque de selección de características, construimos un clasificador SVM en el que se logró la mejor precisión del 88,54 % con 8 características seleccionadas por GA.

Palabras clave: Parkinson, Algoritmos de Aprendizaje automático, Algoritmo Genético, SVM.

1. Introduction

Parkinson's disease (PD) is a complex degenerative disorder of the central nervous system and belongs to a group of conditions known as movement disorders. Its cause is unknown, although some cases are inherited and may result from genetic susceptibility or exposure to one or more environmental factors. The anatomopathological basis of the disease is the progressive loss of nerve cells in an area near the base of the brain known as the substantia nigra, where dopamine is produced [1]. The lack of dopamine manifests as problems with movement, speech production, and other functions that involve muscular control. Since the systems that regulate motor control are affected, the production of voice and speech is also altered, including breathing, articulation, and phonation. This results in monotonous, low-intensity, shaky speech with inappropriate pauses, and some people may even hesitate before speaking or slur their words [2].

Currently, there are no blood or laboratory tests to diagnose PD, so diagnoses are based on the medical history and neurological examination of the patient. By processing the voice signal, it is possible to select a set of parameters that help detect pathologies by comparing the voice features of healthy people with those of sick people [2]. Some features that represent the voice are: jitter, one of the parameters most affected by the lack of vibration in the vocal cords; shimmer, a voice feature associated with the emission of noise; and the harmonics-to-noise ratio (HNR) and noise-to-harmonics ratio (NHR), which quantify the amount of noise in a voice signal [3]. This study aims to contribute to solving the problem of PD detection using the voice. Since voice production is affected in the early stages of PD, voice-based diagnosis can also anticipate other detection methods and the onset of motor complications [2].

2. State-of-the-Art

Several voice analysis and feature selection methods have been used in PD research. For instance, the author of [4] chose two features using a fuzzy entropy measure selection method and a similarity classifier. Likewise, [5] picked 10 features using Dirichlet process mixture models. In another work, [6] took 10 features by applying the Rotation Forest method and the IBk classification method. The authors in [7] chose 10 features and employed a pre-selection filter method and exhaustive search in conjunction with an SVM classifier. The researchers in [8] proposed a multiple feature evaluation approach based on several learning techniques (e.g., Decision Tree, Neural Network (NN), Support Vector Machine (SVM)) to diagnose PD by testing multiple feature evaluation methods on patient data. Although the average accuracy of the classifiers used is good, it can still be improved with newer techniques. Some of the techniques given in [8] are used in [9], i.e., NN and SVM; there, the authors proposed an early diagnosis of PD consisting of two stages, namely, feature selection and classification. That work found that, among the classification techniques, SVM reached the best accuracy with the smallest number of voice features for PD. A deep learning-based diagnosis of PD using a convolutional neural network is presented in [10]. There, magnetic resonance (MR) imaging captures the changes in the brain that produce PD, and the authors classify the MR images by deep learning, obtaining better accuracy than traditional machine learning. Useful surveys of the different classification methods used to diagnose PD are presented in [11], [12]. However, these works do not consider genetic algorithms (GA), the approach proposed here.

Based on the above considerations and inspired by the promising potential of heuristic search algorithms (i.e., GA), we introduce two stages for diagnosing PD: selection and classification. To this end, we propose using GA to select the main features from the database acquired from [7]. Then, via an SVM classifier, we quantify how robust the features chosen by the GA are. Unlike previous works, our approach is easy to implement while providing a level of accuracy comparable to the conventional methods used in machine learning.

3. Materials and Methods

Figure 1 shows the stages of voice analysis to determine PD. Our paper focuses on feature selection. In Section 3.1, we describe the dataset used to find the features that best help to determine PD. In Section 3.2, we explain how the genetic algorithm works, the concept of mutual information, and the fitness function used in this study. In Section 3.3, we explain the Support Vector Machine (SVM) used to classify and to determine the accuracy of the features found by the GA. Finally, in Section 3.4, we describe the metrics employed to test our approach.

Figure 1.
Stages of voice analysis to determine PD.

3.1. Database

We used the Parkinson DataSet obtained in [7], a dataset created by Max Little of Oxford University in collaboration with “The center for voice and speech”, which was responsible for recording the voice signals. We chose this dataset, available from the UCI Machine Learning Repository, because it is used in several PD studies, which allows us to compare the results of this study with those of other works that use different feature selection and classification methods. The data include 22 voice features and 195 observations obtained from voice signal recordings, together with a label that indicates whether the subject is healthy or ill. The recordings were made from 31 male and female participants. Twenty-three participants have PD, and their ages range from 46 to 85 years. On average, six observations were recorded for each subject.

Figure 2 shows an example of recorded signals from a person with PD (b) and a healthy person (a) [7].

Figure 2.
Examples of voice signals from [7]. These are two examples of speech signals: (a) corresponds to a healthy person and (b) corresponds to a person with PD. The X axis represents time in seconds, and the Y axis is the signal amplitude.

Next, we describe in detail the features summarized in Table 1, where some definitions are formulated as:

Fo (Hz): It is the Mean Fundamental Frequency
Fhi (Hz): It is the Maximum Fundamental Frequency
Flo (Hz): It is the Minimum Fundamental Frequency
Jitter (%): It is the average absolute difference between consecutive periods of the fundamental frequency, divided by the average period. It is expressed as a percentage and given by [13]

$$\mathrm{Jitter}(\%) = \frac{\frac{1}{N-1}\sum_{i=1}^{N-1}\left|T_{i}-T_{i+1}\right|}{\frac{1}{N}\sum_{i=1}^{N}T_{i}}\times 100 \tag{1}$$

where $T_{i}$ is the fundamental frequency period of window number $i$, and $N$ is the total number of windows.

Table 1.

Features of Parkinson DataSet.

Feature	Description
1	Mean fundamental frequency (Fo)
2	Maximum fundamental frequency (Fhi)
3	Minimum fundamental frequency (Flo)
4	Jitter (%)
5	Jitter (Abs)
6	RAP (Relative Average Perturbation)
7	PPQ (five-point Period Perturbation Quotient)
8	Jitter (DDP)
9	Shimmer
10	Shimmer (dB)
11	Shimmer APQ3 (Three Point Amplitude Perturbation Quotient)
12	Shimmer APQ5 (Five Point Amplitude Perturbation Quotient)
13	APQ (Amplitude Perturbation Quotient)
14	Shimmer (DDA)
15	NHR (Noise-to-Harmonics Ratio)
16	HNR (Harmonics-to-Noise Ratio)
17	PPE (Pitch Period Entropy)
18	RPDE (Recurrence Period Density Entropy)
19	DFA (Detrended Fluctuation Analysis)
20	spread1 (Non-linear measure of fundamental frequency)
21	spread2 (Non-linear measure of fundamental frequency)
22	D2 (Correlation Dimension)

Own Elaboration.

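Jitter (Abs): It is the average absolute difference between consecutive periods of the fundamental frequency, in microseconds, and is given by

$$\mathrm{Jitter}(\mathrm{Abs}) = \frac{1}{N-1}\sum_{i=1}^{N-1}\left|T_{i}-T_{i+1}\right| \tag{2}$$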

Jitter (RAP): It is the relative average perturbation, i.e., the average absolute difference between a fundamental frequency period and the average of it and its two neighbors, divided by the average period [13].
PPQ: It is the five-point period perturbation quotient [7].
Jitter (DDP): It is the average absolute difference between consecutive differences between consecutive periods, divided by the average period.
Shimmer: It is the average absolute difference between amplitudes of consecutive periods, divided by the average amplitude. It is given by [13]

$$\mathrm{Shimmer} = \frac{\frac{1}{N-1}\sum_{i=1}^{N-1}\left|A_{i}-A_{i+1}\right|}{\frac{1}{N}\sum_{i=1}^{N}A_{i}} \tag{3}$$

where $A_{i}$ is the peak amplitude of window number $i$, and $N$ is the total number of windows.

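Shimmer (dB): It is the average absolute value of the base-10 logarithm of the ratio between the amplitudes of consecutive periods, multiplied by 20, and is given by [13]

$$\mathrm{Shimmer}(\mathrm{dB}) = \frac{1}{N-1}\sum_{i=1}^{N-1}\left|20\log_{10}\!\left(\frac{A_{i+1}}{A_{i}}\right)\right| \tag{4}$$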

Shimmer (APQ3): It is the three-point amplitude perturbation quotient, i.e., the average absolute difference between the amplitude of a period and the average of the amplitudes of its neighbors, divided by the average amplitude [13].
Shimmer (APQ5): It is the five-point amplitude perturbation quotient, i.e., the average absolute difference between the amplitude of a period and the average of the amplitudes of it and its four closest neighbors, divided by the average amplitude [13].
APQ: It is the eleven-point amplitude perturbation quotient [7].
Shimmer (DDA): It is the average absolute difference between consecutive differences between the amplitudes of consecutive periods [14].
HNR: Harmonics-to-noise ratio.
NHR: Noise-to-harmonics ratio.
PPE: It is a newer measure of dysphonia in PD, introduced because existing methods do not effectively characterize dysphonia in the presence of confounding factors such as gender and variable acoustic environments. It is a robust measure that is sensitive to the speech changes specific to PD.
RPDE: It is the recurrence period density entropy, which quantifies the periodicity of a signal, since randomness and noise are inherent to vocal production.
DFA: Detrended fluctuation analysis is a tool for analyzing non-linear time series. It is used because vocal production is a non-linear dynamic system, and changes caused by deficiencies in the muscles and nerves of the vocal organs affect the dynamics of the entire system.
Spread1: Non-linear measurement of fundamental frequency variation according to [15].
Spread2: Non-linear measurement of fundamental frequency variation according to [15].
D2: It is the correlation dimension, computed by first time-delay embedding the signal to reconstruct the phase space of the non-linear dynamical system assumed to generate the voice signal [7].
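
As an illustration of the jitter and shimmer definitions above, the following minimal Python sketch computes the four basic perturbation measures from a list of period lengths and a list of peak amplitudes. It is not the code used to build the dataset; the function names and the toy values are assumptions introduced only for this example, and Shimmer (dB) is taken here as the mean absolute value of 20·log10 of the ratio of consecutive amplitudes.

```python
import math

def jitter_abs(periods):
    """Mean absolute difference between consecutive periods (same unit as the input)."""
    diffs = [abs(a - b) for a, b in zip(periods, periods[1:])]
    return sum(diffs) / len(diffs)

def jitter_percent(periods):
    """Jitter (Abs) divided by the mean period, expressed as a percentage."""
    mean_period = sum(periods) / len(periods)
    return 100.0 * jitter_abs(periods) / mean_period

def shimmer(amplitudes):
    """Mean absolute difference between consecutive amplitudes over the mean amplitude."""
    diffs = [abs(a - b) for a, b in zip(amplitudes, amplitudes[1:])]
    return (sum(diffs) / len(diffs)) / (sum(amplitudes) / len(amplitudes))

def shimmer_db(amplitudes):
    """Mean absolute value of 20*log10 of the ratio of consecutive amplitudes."""
    terms = [abs(20.0 * math.log10(b / a)) for a, b in zip(amplitudes, amplitudes[1:])]
    return sum(terms) / len(terms)

# Toy values (hypothetical): four glottal periods in seconds and their peak amplitudes.
periods = [0.0050, 0.0052, 0.0049, 0.0051]
amplitudes = [1.00, 0.90, 1.10, 1.00]

print(jitter_abs(periods), jitter_percent(periods))
print(shimmer(amplitudes), shimmer_db(amplitudes))
```

A perfectly periodic signal with constant amplitude gives zero for all four measures, so larger values indicate a less stable voice.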

3.2. Genetic algorithms for feature selection

Genetic algorithms are part of evolutionary computing, which consists of computational models inspired by natural evolution. They are used in search and parameter optimization problems and are based on the principle of survival of the fittest [16]. The operation of the GA for feature selection is shown in Figure 3.

Before the use of the GA, we have to consider the concept of mutual information which is a general measure of association or dependence between variables. This gives us an idea of the relationship that exists between the features being analyzed $X$ and the output $Y$ (i.e., whether the person is ill or healthy). Mutual information is given by [16].

where H(X) is the entropy obtained by [16]

and H(Y|X) represents the uncertainty at the output, which is calculated as follows [16]
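
For illustration only, and not as the authors' implementation, the sketch below estimates mutual information from samples using the standard identities $I(X;Y) = H(Y) - H(Y|X)$, $H(Y) = -\sum_{y} p(y)\log_{2}p(y)$ and $H(Y|X) = \sum_{x} p(x)\,H(Y|X=x)$. The voice features are continuous, so the example assumes a simple equal-width discretization; the helper `discretize` and its default number of bins are hypothetical choices made for this example.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def conditional_entropy(y, x):
    """H(Y|X) = sum over x of p(x) * H(Y | X = x)."""
    n = len(x)
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(xi, []).append(yi)
    return sum(len(ys) / n * entropy(ys) for ys in groups.values())

def mutual_information(x, y):
    """I(X;Y) = H(Y) - H(Y|X) for discrete samples x and y."""
    return entropy(y) - conditional_entropy(y, x)

def discretize(values, bins=4):
    """Equal-width binning of a continuous feature (assumed preprocessing step)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0
    return [min(int((v - lo) / width), bins - 1) for v in values]

# Toy check: a feature that determines the label carries 1 bit about it,
# while a feature that is independent of the label carries none.
labels = [0, 0, 1, 1]
print(mutual_information([0, 0, 1, 1], labels))
print(mutual_information([0, 1, 0, 1], labels))
```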

Figure 3.
Block diagram for GA feature selection.

The value of mutual information is 0 when two random variables are completely independent, and greater than 0 if they are dependent. When the algorithm starts, we generate a population in which each individual, called a chromosome, encodes a set of features. The chromosomes are evaluated by the fitness function, which assigns a value to each of them. The criterion used for the fitness function is mRMR (minimum redundancy and maximum relevance), which favors the voice features that have the greatest influence on the determination of PD while avoiding redundancy among them.

The implemented fitness equation is defined by [13]

where V represents the amount of relevance between patterns and targets that is measured by each chromosome using [13]

where $I$ is the mutual information between features and targets. P is the amount of redundancy between patterns and features that are measured for each chromosome using [13]
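
To make the roles of the relevance term V and the redundancy term P concrete, here is a small sketch of an mRMR-style fitness. It assumes the commonly used formulation in which V is the mean mutual information between each selected feature and the target and P is the mean mutual information over all pairs of selected features, and it further assumes the combined form P - V so that lower values are better, which matches the later statement that the feature sets with the lowest fitness are kept. The exact expression used in the study may differ, and the mutual information tables below are invented toy numbers.

```python
def relevance(subset, mi_feature_target):
    """V: mean mutual information between each selected feature and the target."""
    return sum(mi_feature_target[i] for i in subset) / len(subset)

def redundancy(subset, mi_feature_feature):
    """P: mean mutual information over all ordered pairs of selected features."""
    k = len(subset)
    return sum(mi_feature_feature[i][j] for i in subset for j in subset) / (k * k)

def mrmr_fitness(subset, mi_feature_target, mi_feature_feature):
    """Assumed combination P - V: minimizing it rewards relevance and penalizes redundancy."""
    return redundancy(subset, mi_feature_feature) - relevance(subset, mi_feature_target)

# Toy tables for three hypothetical features: 0 and 1 are both relevant but
# nearly redundant with each other; 2 is weakly relevant but independent.
mi_ft = [0.30, 0.25, 0.05]
mi_ff = [[1.0, 0.8, 0.1],
         [0.8, 1.0, 0.1],
         [0.1, 0.1, 1.0]]

print(mrmr_fitness([0, 1], mi_ft, mi_ff))  # redundant pair
print(mrmr_fitness([0, 2], mi_ft, mi_ff))  # complementary pair, lower (better) fitness
```

With these toy numbers the complementary pair scores better than the two individually strongest features, which is the behavior the mRMR criterion is meant to produce.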

Subsequently, the algorithm continues with the crossover and mutation processes, so that a new population is generated. This process is repeated until the stop criterion set for the GA is met.

There are several stop criteria that can be used for GA such as:

1. When it reaches some maximum number of generations.
2. When there is no change in the best fitness value within a given time in seconds.
3. When there is no change in the best fitness value in a number of generations.
4. When the GA exceeds a maximum time limit.
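
The loop described in this section, together with stop criteria 1 and 3, can be sketched in plain Python as follows. This is a simplified stand-in and not the toolbox implementation used in the study: the operators (union-based crossover, single-gene mutation, elitism), the parameter names and their default values are all assumptions made for the example, and the toy fitness simply sums the selected indices.

```python
import random

def run_ga(fitness, n_features, k, pop_size=40, max_generations=30,
           stall_generations=10, mutation_rate=0.1, seed=0):
    """Minimize `fitness` over subsets of k distinct feature indices."""
    rng = random.Random(seed)
    all_features = list(range(n_features))

    def random_chromosome():
        return tuple(sorted(rng.sample(all_features, k)))

    def crossover(a, b):
        # The child draws k distinct genes from the union of both parents.
        pool = sorted(set(a) | set(b))
        return tuple(sorted(rng.sample(pool, k)))

    def mutate(chrom):
        # With a small probability, swap one gene for a feature not yet selected.
        genes = list(chrom)
        if rng.random() < mutation_rate:
            unused = [f for f in all_features if f not in genes]
            if unused:
                genes[rng.randrange(k)] = rng.choice(unused)
        return tuple(sorted(genes))

    population = [random_chromosome() for _ in range(pop_size)]
    best = min(population, key=fitness)
    history = [fitness(best)]
    stall = 0
    for _ in range(max_generations):            # criterion 1: generation limit
        ranked = sorted(population, key=fitness)
        parents = ranked[: pop_size // 2]       # the fittest half reproduces
        children = [mutate(crossover(rng.choice(parents), rng.choice(parents)))
                    for _ in range(pop_size - 1)]
        population = [ranked[0]] + children     # elitism keeps the current best
        candidate = min(population, key=fitness)
        if fitness(candidate) < fitness(best):
            best, stall = candidate, 0
        else:
            stall += 1
        history.append(fitness(best))
        if stall >= stall_generations:          # criterion 3: no improvement
            break
    return best, history

# Toy run: choose 8 of 22 zero-based feature indices with a placeholder fitness.
best, history = run_ga(lambda chrom: sum(chrom), n_features=22, k=8)
print(best, history[-1], len(history) - 1)
```

In practice the placeholder fitness would be replaced by the mRMR criterion, and the time-based criteria 2 and 4 could be added by checking a clock inside the loop.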

3.3. Classification Based on the SVM Method

Support vector machines are used when the data present two separable classes. Classification is performed by finding the hyperplane that divides the classes, in other words, the one that secures a maximum margin between the two classes, positive and negative [17] (see Figure 4). The margin is the distance separating the classes, and maximizing it yields the best hyperplane; the support vectors are the points that lie on the boundaries of the classes. SVMs are well suited to learning tasks where the number of features is large relative to the number of training instances.

Within the SVM, a correct selection of the kernel function is important, since it defines the feature space in which the instances of the training set will be classified [6].

Figure 4.
Support Vector Machine [17].

For the classification process, we used stratified k-fold cross-validation with k=8. It is worth mentioning that we randomly split the folds by subject and not by recording to avoid data contamination (i.e., at each iteration, a subject's recordings are present only in either the test fold or the training folds). To determine which features obtained from the GA experiments perform best, we developed an SVM classifier with a linear kernel, using Bayesian optimization for the hyperparameter search.
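
The subject-wise splitting described above can be sketched as follows. This is one possible way to build such folds and not necessarily the procedure used in the study: subjects are grouped by label, shuffled, and dealt round-robin into k folds, and every recording inherits the fold of its subject. The function name, the subject identifiers and the toy layout (six recordings for each of 31 subjects, 23 of them with PD) are assumptions for the example.

```python
import random
from collections import defaultdict

def subject_stratified_folds(subject_ids, labels, k=8, seed=0):
    """Return one fold index per recording; a subject's recordings share a fold."""
    subject_label = {}
    for s, y in zip(subject_ids, labels):
        subject_label[s] = y                     # each subject has a single label
    by_label = defaultdict(list)
    for s, y in subject_label.items():
        by_label[y].append(s)

    rng = random.Random(seed)
    fold_of_subject = {}
    offset = 0
    for y in sorted(by_label):
        subjects = sorted(by_label[y])
        rng.shuffle(subjects)
        for i, s in enumerate(subjects):         # deal subjects round-robin
            fold_of_subject[s] = (i + offset) % k
        offset += len(subjects)                  # continue where the last class stopped
    return [fold_of_subject[s] for s in subject_ids]

# Toy layout (hypothetical): 31 subjects with 6 recordings each, 23 with PD (label 1).
subject_ids = [f"S{n:02d}" for n in range(31) for _ in range(6)]
labels = [1 if int(s[1:]) < 23 else 0 for s in subject_ids]
folds = subject_stratified_folds(subject_ids, labels, k=8)

for f in range(8):
    test_subjects = sorted({s for s, g in zip(subject_ids, folds) if g == f})
    print(f, test_subjects)
```

With 8 healthy subjects and k=8, this scheme places exactly one healthy subject in each test fold, so every fold contains both classes.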

3.4. Performance Metrics

We employ the ROC (Receiver Operating Characteristic) curve, the accuracy, and the F₁ score to measure the performance of our methodology. ROC curves are a statistical method for determining the diagnostic accuracy of tests that use continuous scales, with the objective of evaluating the capacity of the diagnostic to discriminate between healthy and sick subjects. The ROC curve plots the proportion of true positives against the proportion of false positives (TPR vs. FPR). The AUC (area under the curve), in turn, reflects how well the test discriminates between patients with and without the disease. We use the ROC curve and the AUC, together with the accuracy, to check which set of features performs best.

Finally, we also use the F₁ score, defined as

$$F_{1} = \frac{2\,TP}{2\,TP + FP + FN}$$

where TP, FP and FN stand for True Positives, False Positives and False Negatives, respectively.
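
As a small worked illustration of the metrics in this section, the sketch below computes the accuracy and the F₁ score from confusion-matrix counts, and the AUC as the fraction of positive/negative pairs in which the positive sample receives the higher score (ties counted as one half), which is equivalent to the area under the ROC curve. The function names, counts and scores are toy assumptions, not results from the study.

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall: 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def roc_auc(scores, labels):
    """AUC as the probability that a positive sample outranks a negative one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy numbers (hypothetical classifier outputs).
print(accuracy(tp=8, fp=2, tn=8, fn=2))                # 0.8
print(f1_score(tp=8, fp=2, fn=2))                      # 0.8
print(roc_auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0]))     # 0.75
```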

4. Results and Discussions

We implemented our GA-based feature selection approach using MATLAB's Global Optimization Toolbox. We set the maximum number of generations to 30 because no change was evident after that number of generations, and we varied the number of features between 5 and 15.

Figure 5 shows an example of the value of the fitness function with respect to the number of generations for 8 features. Note that the fitness value for the 8 selected features shows no variation after 10 generations. For that reason, we use the third stop criterion for the GA explained in Section 3.2 (i.e., when there is no change in the best fitness value in a number of generations).

Table 2 summarizes the optimized features after the GA selection procedure. Each set of optimized features is the one with the lowest fitness value among 100 repetitions of the GA.

With respect to the SVM, we constructed a classifier for each set of features from Table 2 (i.e., 11 classifiers). Table 3 summarizes the accuracy, AUC and F₁ score obtained with the SVM classifier.

Figure 5.
Fitness value for the eight features selected.

Table 2.

Features Optimized using Genetic Algorithm

Number of Features	Optimized Features
5	3,5,10,19,21
6	2,3,5,12,19,21
7	2,3,5,13,15,19,21
8	2,3,5,13,15,17,19,21
9	2,3,5,10,18,19,20,21,22
10	2,3,5,8,13,17,18,19,21,22
11	2,3,5,7,9,15,17,18,19,21,22
12	2,3,5,7,9,13,15,17,18,19,21,22
13	2,3,5,8,10,12,13,15,18,19,20,21,22
14	1,2,3,5,7,8,12,14,15,18,19,20,21,22
15	1,2,3,5,6,7,11,12,13,15,18,19,20,21,22

Observe that we obtained the same accuracy with 6 and 8 features, but we select the 8 optimized features as our best result because their AUC is better than the one obtained with the 6 optimized features. Also, from Table 3, notice that our best accuracy is 88.54% using 8 features with a linear SVM.

On the other hand, Figure 6 shows the ROC curve for 8 optimized features where we can note the trade-off between false positive rate and true positive rate.

Table 3.

SVM classification results for each set of optimized features.

Number of Features	AUC	Accuracy (%)	F₁ Score
5	0.775	86.98	0.698
6	0.774	88.54	0.725
7	0.762	88.02	0.716
8	0.837	88.54	0.725
9	0.780	79.69	0.493
10	0.782	87.50	0.707
11	0.731	87.50	0.707
12	0.684	87.50	0.707
13	0.690	79.69	0.493
14	0.638	84.38	0.615
15	0.660	84.38	0.615
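
The selection rule described in the text (highest accuracy, with ties broken by AUC) can be checked directly against the values reported in Table 3. The snippet below is only a convenience for the reader; the tuple layout is an assumption made for the example.

```python
# Rows of Table 3: (number of features, AUC, accuracy in %, F1 score).
results = [
    (5, 0.775, 86.98, 0.698), (6, 0.774, 88.54, 0.725), (7, 0.762, 88.02, 0.716),
    (8, 0.837, 88.54, 0.725), (9, 0.780, 79.69, 0.493), (10, 0.782, 87.50, 0.707),
    (11, 0.731, 87.50, 0.707), (12, 0.684, 87.50, 0.707), (13, 0.690, 79.69, 0.493),
    (14, 0.638, 84.38, 0.615), (15, 0.660, 84.38, 0.615),
]

# Rank by accuracy first and use the AUC to break ties.
best = max(results, key=lambda row: (row[2], row[1]))
print(best)  # (8, 0.837, 88.54, 0.725)
```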

Figure 6.
ROC curve for 8 optimized features with linear SVM. The positive class is healthy people.

Table 4 summarizes a comparison between our study and other works. In order to compare, we looked for studies that use the same dataset.

Notice that our accuracy is among the top three. This confirms that our GA-based feature selection approach picks up features that are relevant for PD. Finally, Figure 7 shows a t-SNE (t-distributed Stochastic Neighbor Embedding) [18] visualization for the 8 optimized features.

From Figure 7, it is observed that even though the features are relevant, there is still confusion between PD patients and healthy subjects, which in turn stresses the need for better feature engineering and larger datasets.

Table 4.

Comparison of our approach to other studies that have used the “Parkinson’s dataset” [7].

Approach	Features selected	Experimental conditions	Accuracy (%)
[4]	2	Fuzzy entropy measure selection method and similarity classifier	85.03
[5]	10	Dirichlet process mixture and 5-fold cross validation	87.70
[6]	10	Rotation Forest method and IBk classification method	95.89
[7]	10	Pre-selection filter method and exhaustive search with SVM	91.40
Our approach	8	mRMR and SVM with 8-fold Cross Validation	88.54

Figure 7.
t-SNE visualization for 8 optimized features.

5. Conclusions

In this study, we selected the best features of the voice using genetic algorithms (GA). The GA optimizes the selection of features according to a chosen fitness function, in our case mRMR. We chose the mRMR criterion to identify the voice features that have the greatest influence in determining PD while avoiding redundancy. This criterion relies on the concept of mutual information, which is a general measure of association or dependence between variables.

We then developed an SVM classifier to evaluate the features provided by the GA. Based on our simulations, the highest accuracy is 88.54%, which was achieved with 8 features optimized by the GA and a linear SVM classifier. Moreover, we showed that as the number of features to be optimized decreases, the value of the fitness function also decreases; however, a lower fitness value does not necessarily lead to better classification performance.

For future research, we recommend the development of a database with a higher number of recordings and subjects. In addition to more recordings, the database should include more information about the evolution of the disease, the medication of the patients, and the degree of the disease. Developing this database will require multidisciplinary work involving engineers, doctors and volunteers. We also recommend using other feature selection methods, examining other fitness functions, and building a more robust classifier in order to improve the classification performance.

6. References

[1] Martínez-Fernández, R., Gasca-Salas, C., Sánchez-Ferro, A., Obeso, J. Á. (2016). Actualización en la Enfermedad de Parkinson. Revista Médica Clínica Los Condes, 27 (3), 363-379. https://doi.org/10.1016/j.rmclc.2016.06.010

[2] Delgado Hernández, J., Izquierdo Arteaga, L. M. (2016). Eficacia de la rehabilitación de la voz en etapas tempranas de la Enfermedad de Parkinson. Revista Discapacidad Clínica Neurociencias, 3 (1), 42–47. https://doi.org/10.14198/DCN.2016.3.1.04

[3] Martínez-Sánchez, F. (2010). Speech and voice disorders in Parkinson’s disease. Revista de Neurología, 51 (9), 542–550. https://doi.org/10.33588/rn.5109.2009509

[4] Luukka, P. (2011). Feature selection using fuzzy entropy measures with similarity classifier. Expert Systems with Applications, 38 (4), 4600–4607. https://doi.org/10.1016/j.eswa.2010.09.133

[5] Shahbaba, B., Neal, R. (2009). Nonlinear models using dirichlet process mixtures. Journal of Machine Learning Research, 10, 1829–1850. https://www.jmlr.org/papers/volume10/shahbaba09a/shahbaba09a.pdf

[6] Ozcift, A. (2012). SVM feature selection based rotation forest ensemble classifiers to improve computer-aided diagnosis of Parkinson disease. Journal of Medical Systems, 36 (4), 2141–2147. https://doi.org/10.1007/s10916-011-9678-1

[7] Little, M. A., McSharry, P. E., Hunter, E. J., Spielman, J., Ramig, L. O. (2009). Suitability of dysphonia measurements for telemonitoring of Parkinson’s disease. IEEE Transactions on Biomedical Engineering, 56 (4), 1015–1022. https://doi.org/10.1109/TBME.2008.2005954

[8] Mostafa, S. A., Mustapha, A., Mohammed, M. A., Hamed, R. I., Arunkumar, N., Ghani, M. K. A., Jaber, M. M., Khaleefah, S. H. (2019). Examining multiple feature evaluation and classification methods for improving the diagnosis of Parkinson’s disease. Cognitive Systems Research, 54, 90–99. https://doi.org/10.1016/j.cogsys.2018.12.004

[9] Karapinar Senturk, Z. (2020). Early diagnosis of Parkinson’s disease using machine learning algorithms. Medical Hypotheses, 138, 1-5. https://doi.org/10.1016/j.mehy.2020.109603

[10] Sivaranjini, S., Sujatha, C. M. (2020). Deep learning based diagnosis of Parkinson’s disease using convolutional neural network. Multimedia Tools and Applications, 79, 15467–15479. https://doi.org/10.1007/s11042-019-7469-8

[11] Das, R. (2010). A comparison of multiple classification methods for diagnosis of Parkinson disease. Expert Systems with Applications, 37 (2), 1568–1572. https://doi.org/10.1016/j.eswa.2009.06.040

[12] Mei, J., Desrosiers, C., Frasnelli, J. (2021). Machine Learning for the Diagnosis of Parkinson’s Disease: A Review of Literature. Frontiers in Aging Neuroscience, 13, 1–41. https://doi.org/10.3389/fnagi.2021.633752

[13] Arefi Shirvan, R., Tahami, E. (2011). Voice analysis for detecting Parkinson's disease using genetic algorithm and KNN classification method. 18th Iranian Conference on BioMedical Engineering (ICBME), Tehran, Iran. https://doi.org/10.1109/ICBME.2011.6168572

[14] Tsanas, A., Little, M. A., McSharry, P. E., Ramig, L. O. (2010). Enhanced classical dysphonia measures and sparse regression for telemonitoring of Parkinson’s disease progression. IEEE International Conference on Acoustics, Speech and Signal Processing, Dallas, TX, USA. https://doi.org/10.1109/ICASSP.2010.5495554

[15] Polat, K. (2012). Classification of Parkinson’s disease using feature weighting method on the basis of fuzzy C-means clustering. International Journal of Systems Science, 43 (4), 597–609. https://doi.org/10.1080/00207721.2011.581395

[16] Gestal, M., Rivero, D., Rabuñal, J. R., Dorado, J., Pazos, A. (2010). Introducción a los Algoritmos Genéticos y la Programación Genética. Universidade da Coruña.

[17] Aguado González, E. (2017). Detección Automática De Anomalías en Patrullaje Robotizado [Tesis de Grado]. Universidad Politécnica de Madrid. https://oa.upm.es/49198/

[18] van der Maaten, L., Hinton, G. (2008). Visualizing Data using t-SNE. Journal of Machine Learning Research, 9 (86), 2579-2605. https://www.jmlr.org/papers/v9/vandermaaten08a.html