Principal component analysis (PCA) on astronomical data: interstellar medium (Part I)

6 min readJan 16, 2021

There are lots of articles describing what principal component analysis (PCA) is, when should you use PCA, how to use PCA, and even a thorough walkthrough on the mathematical process behind PCA. What we are going to do in this series (yes, it would be more than one article) is what PCA is widely used for: find the most representative variables which characterize the whole data. In this case, it is the condition of the interstellar medium.

We are going to make an attempt to reproduce the work of Ensor et al. 2017 (Paper I hereafter) that explores the line-of-sight parameters toward stars in the Milky Way. The line-of-sight parameters are equivalent width of diffuse interstellar bands (DIBs), color excess, column density of atomic hydrogen, column density of molecular hydrogen, the fraction of molecular hydrogen, total depletion, and the ratio of the equivalent width of DIB5797 to equivalent width of DIB5780 in 30 line-of-sight.

Notes: In short, those are measurements that describe the environment of interstellar matter toward a star. You can treat them as variables that make you as a person (i.e. name, date of birth, birthplace, height, weight) for a more familiar understanding.

Diffuse interstellar bands (DIBs) is a set of weak and shallow absorptions originated in the interstellar medium. DIBs are observed toward medium- to high-resolution spectra of astronomical objects located behind interstellar clouds. Today, there are more than 500 identified DIBs in the optical and infrared bands. In Paper 1, the strength of 23 DIB species are used.

Diffuse interstellar bands (DIBs) with central wavelengths at around 5780 and 5797 Angstrom were observed toward stars in the Milky Way, M31, and LMC (Cordiner 2011).

All but DIBs equivalent widths are provided in Paper 1, so before we proceed to reproduce PCA for all of the variables, we will first demonstrate how PCA performs using only 2 parameters: color excess and hydrogen column densities. The data can be downloaded here and you can get the metadata here. The measurement published by the author of Paper I can be retrieved here.

Load the data

The cataloged data was transferred from paper 1 into CSV format and it consists of more than 30 x 23 columns since the data not only store the measurements but also the name of the object, the object’s coordinate, additional columns to store uncertainty (as it clear that several of the measurements have asymmetrical uncertainty), the reference from which the measurements are taken, and the source of the spectra.

import pandas as pd
df = pd.read_csv('ensor2017.csv', encoding='latin-1')

Calculate the column density of total hydrogen

The power of column density of molecular hydrogen value is stored in a separated column since it is different for each line of sight, therefore its value should be calculated first. Paper 1 used this conversion

to calculate the column density of total hydrogen.

df['n_h2'] = df['n_h2']*10**df['power']
df['n_h'] = (df['n_hi']*10**21) + (2*df['n_h2'])

Additional column of N(H) or the column density of total hydrogen.

Standardized the data

The paper illustrates how PCA works using two measurements, which were color excess and column density of total hydrogen.

Data standardization is an integral part of performing PCA analysis as the analysis is heavily affected by a scaling factor. StandardScaler from sklearn calculates the mean and the variance of the measurements and scales each value into:

where z is the standardized value, x is the original value, u is the mean, and s is the variance. All measurements that will be used in PCA analysis needs to be standardized.

from sklearn.preprocessing import StandardScaler
features = ['ebv','n_h']
x = df.loc[:, features].values
x = StandardScaler().fit_transform(x)

The measurements before and after standardization step.

PCA 2D projection

Most PCA practice is aiming to reduce the dimensionality of the data. In this tutorial, we will stick to what the paper has done: keep the dimension as it is.

import numpy as np
from sklearn.decomposition import PCApca = PCA(n_components=2)principalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(data = principalComponents,
                          columns = ['PC1', 'PC2'])

Principal component 1 and principal component 2

Find eigenvalues

The eigenvalues of the PCA analysis in Paper 1 were 1.813 for PC1 and 0.187 for PC2 meanwhile in this tutorial we get 1.659 for PC1 and 0.410 for PC2.

eigenvalues = pca.explained_variance_

Array[0] for PC1 and array[1] for PC2.

Find the variations

Variation is the fraction of total variation in the data for which each PC is responsible (Ensor et al. 2017). The result in the paper was 90.63 for PC 1 and 9.37 for PC2 meanwhile we get 80.16 for PC1 and 19.84 for PC2.

variation = pca.explained_variance_ratio_

Array[0] for PC1 and array[1] for PC2.

Find the transformation matrix

The transformation matrix is used to obtain transformed data points. The transformed data points are obtained by this notation:

where Y is the transformed data points, A is the transformation matrix, and X is the standardized measurements.

The eigenvector result in the paper is (0.707, 0.707) for PC 1 and (0.707, -0.707) whereas in this tutorial we obtained the same results. Therefore, the transformation matrix is

component = pca.components_

The transformation matrix.

Get the transformed data

Based on the previous equation, the transformed data thus:

principalDf = np.dot(component,x.T)

Visualization of standardized original data

Next, we plot the standardized parameters.

import matplotlib.pyplot as plt
fig = plt.figure(figsize = (8,8))
ax = fig.add_subplot(1,1,1)
ax.set_xlabel('standarized ebv', fontsize = 15)
ax.set_ylabel('standarized nh', fontsize = 15)
ax.set_ylim(-2,4)
ax.set_xlim(-2,4)
ax.scatter(x[:,0],x[:,1],s = 50)
ax.grid()

Scatter plot of standardized measurements.

Visualize 2D projection

Then we plot the transformed data. Note that we have not taken into account the uncertainty of the measurements involved in this PCA analysis.

import matplotlib.pyplot as plt
fig = plt.figure(figsize = (8,8))
ax = fig.add_subplot(1,1,1)
ax.set_xlabel('PC1', fontsize = 15)
ax.set_ylabel('PC2', fontsize = 15)
ax.set_xlim(-3,4.5)
ax.set_ylim(-3,3)
ax.scatter(principalDf.T[:,0],principalDf.T[:,1],s = 50)
ax.grid()

Scatter plot of the principal components.

Closing Statement

The next part will be about measuring the diffuse interstellar bands used in Paper 1 so will be able to reconstruct what the paper has done using all the measurements.

Update:
03/06/21

Adding the data used in Paper I primary source.
Fixing minor grammatical errors.
Readjust the paragraph for continuity.

Principal component analysis (PCA) on astronomical data: interstellar medium (Part I)

Written by Ade N Istiqomah