Principal Component Analysis
10-01-2018
In a nutshell, principal component analysis finds the directions that contain most of the information in a dataset. It uses a matrix analysis to find the direction along which the data have the largest spread, i.e. variance, and uses this as the first basis vector, or axis, of a new coordinate system. The next axis is then chosen along the direction of the second largest variance, constrained to be orthogonal to the first axis. This procedure is repeated for all dimensions.
This gives a new set of basis vectors that can be used to represent the data. Each of these principal components consists of a linear combination of the original variables of the form
PC = l1 * var1 + l2 * var2 ...
where l1, l2 etc are called the loadings. These loadings essentially determine how much each original variable contributes to the component in question.
Mathematically, principal component analysis is based on an eigenvalue/eigenvector problem. The data table is used to calculate a covariance (or correlation) matrix, and one then finds the eigenvalues of this matrix and the associated eigenvectors. Each eigenvalue describes how much of the variance in the sample is captured by the corresponding eigenvector, and the eigenvectors constitute the actual principal components.
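As a concrete illustration, here is a minimal numpy sketch of that procedure. The array data is a hypothetical placeholder with one row per object and one column per variable:

import numpy as np

# Hypothetical data table: one row per object, one column per variable
data = np.random.rand(100, 3)

# Centre each variable on zero (full standardisation is discussed further down)
centered = data - data.mean(axis=0)

# Covariance matrix of the variables
cov = np.cov(centered, rowvar=False)

# Eigendecomposition; eigh is appropriate since the covariance matrix is symmetric
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Sort from largest to smallest eigenvalue
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Each column of eigenvectors is one principal component (its entries are the loadings),
# and each eigenvalue gives the variance along the corresponding component
print(eigenvalues / eigenvalues.sum())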
Perhaps the primary use of PCA is to take a complex multidimensional dataset and project it onto a smaller set of dimensions while still retaining as much 'information' as possible. It is important to note that in this context what is referred to as information means the variance in the sample, which may or may not be what you actually care about depending on the nature of your dataset.
In this context it is important to examine the eigenvalues closely since they tell you how large a fraction of the total variance present in the sample each principal component explains. You then have to make a choice of how many of the principal components you want to keep in the analysis by weighing the number of dimensions after reduction against the total variance explained. This is essentially an arbitrary choice, and how many components are needed to explain the major features of the data is entirely dependent on the nature of that data.
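Continuing the numpy sketch from above, inspecting the cumulative variance fraction and projecting onto the first few components could look like this (keeping two components is just an illustrative choice):

import numpy as np

# Cumulative fraction of the total variance explained by the first 1, 2, ... components
explained = np.cumsum(eigenvalues) / eigenvalues.sum()
print(explained)

# Keep, for example, the first two components and project the data onto them
k = 2
projected = centered @ eigenvectors[:, :k]   # shape: (number of objects, k)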
Fitting a plane
As we stated before, PCA finds the direction of largest variance in the data. This can be rephrased as minimizing the deviations of the points from the current principal component vector. This means that the principal components corresponding to the highest n eigenvalues will in fact describe the best-fit n-dimensional hyperplane to the data. If n=2 this will constitute a standard plane. Since n is somewhat arbitrary (between 1 and the total number of dimensions in your dataset) it can obviously be used to fit a line as well.
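As a minimal sketch of this, consider a three-dimensional point cloud: the two components with the largest eigenvalues span the best-fit plane, while the remaining eigenvector is its normal (the data here are made up purely for illustration):

import numpy as np

# Synthetic 3D point cloud: roughly the plane z = 0.5*x - 2*y plus some scatter
rng = np.random.default_rng(0)
xy = rng.normal(size=(200, 2))
z = 0.5 * xy[:, 0] - 2.0 * xy[:, 1] + rng.normal(scale=0.1, size=200)
points = np.column_stack([xy, z])

centered = points - points.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centered, rowvar=False))

# eigh returns eigenvalues in ascending order, so the first eigenvector
# (smallest variance) is the normal of the best-fit plane and the other two span it
normal = eigenvectors[:, 0]
plane_basis = eigenvectors[:, 1:]
print(normal)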
I found it instructive to look at how this procedure for fitting a line differs from fitting a line with an ordinary least squares method. An out-of-the-box least squares fit of a line minimizes the distances between the datapoints and the line measured parallel to the y-axis, i.e. the vertical offsets. Fitting the line using PCA instead minimizes the distances measured perpendicularly to the line itself. These two approaches do not in general produce the same fit, so it is worth considering which one is most useful for the purpose at hand.
Figure showing the direction of the distances minimized by a standard least squares approach. Image taken from Cerebral Mastication blog
Figure showing the direction of the distances minimized by a PCA. Image taken from Cerebral Mastication blog
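To make the difference concrete, here is a small sketch comparing the two slopes on synthetic 2D data (this is not the data from the figures above, just an illustration):

import numpy as np

# Synthetic data: y = 2*x plus scatter in the y direction only
rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(size=500)

# Ordinary least squares: minimizes the vertical offsets
slope_ols = np.polyfit(x, y, 1)[0]

# PCA / orthogonal fit: minimizes the offsets perpendicular to the line
data = np.column_stack([x, y])
centered = data - data.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centered, rowvar=False))
first_pc = eigenvectors[:, -1]            # direction of largest variance
slope_pca = first_pc[1] / first_pc[0]

# For this setup the PCA slope comes out steeper than the least squares slope
print(slope_ols, slope_pca)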
When actually performing a PCA, the first step is to take your raw data and transform it in such a way as to maximise the amount of information we can get out of the analysis. In practice this means putting all variables on a comparable scale: we standardise the data by subtracting the mean from each variable and 'whiten' the noise by dividing by the standard deviation:
standardized_var = (var - mean(var)) / std(var)
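In code this might look as follows, a minimal numpy sketch where data is again a hypothetical array with one row per object and one column per variable, and the result is the standardizedData used below:

import numpy as np

# Hypothetical raw data table: one row per object, one column per variable
data = np.random.rand(100, 5)

# Standardise: subtract the mean and divide by the standard deviation, column by column
standardizedData = (data - data.mean(axis=0)) / data.std(axis=0)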
The PCA itself can then conveniently be run with, for instance, scikit-learn:

from sklearn.decomposition import PCA

# Passing a float between 0 and 1 as n_components keeps as many components
# as are needed to explain that fraction of the total variance (here 95%)
pca_2c = PCA(n_components=0.95)
pca_res = pca_2c.fit_transform(standardizedData)

# Show how much variance is explained:
print(pca_2c.explained_variance_ratio_.sum())