A Supplementary Discussion on Correlation Coefficients

Introduction

In my previous blog post, we examined covariance matrices, looking into their derivation, properties, and the geometric insights they offer for understanding multidimensional data. Covariance matrices provide a solid foundation for understanding how variables interact, yet they have certain limitations—especially when we want a clearer measure of the strength of these relationships.

This brings us to correlation and the correlation matrix, essential concepts for interpreting the strength of relationships between variables independently of their original units or scales. Given their importance, this follow-up post serves as a brief guide to correlation coefficients and the correlation matrix. Here, we’ll cover the basics of correlation, the structure and derivation of the correlation matrix, and some of its most useful properties and applications.


The Correlation Coefficient

In the previous discussion on covariance, we established a foundational understanding of how variables co-vary, yet covariance itself is sensitive to the original units of measurement, limiting its direct interpretability across different data scales. Here, the correlation coefficient, \(\rho_{X,Y}\), refines this measure by standardizing the relationship between two variables, allowing a comparison of their linear association independent of units.

Definition

The correlation coefficient \(\rho_{X,Y}\) between two random variables \(X\) and \(Y\) is formally defined as:

\[\rho_{X,Y} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}\]

where \(\text{Cov}(X, Y)\) is the covariance between \(X\) and \(Y\), and \(\sigma_X\) and \(\sigma_Y\) are the standard deviations of \(X\) and \(Y\), respectively.

This formula can be viewed as a “normalized” covariance, essentially adjusting the relationship between \(X\) and \(Y\) by their individual dispersions, or spreads. This normalization is the key: it removes the influence of scale, making \(\rho_{X,Y}\) a dimensionless quantity that ranges from -1 to 1.

Derivation from Covariance

To delve deeper, consider the covariance definition between two random variables \(X\) and \(Y\):

\[\text{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]\]

where \(\mu_X = E[X]\) and \(\mu_Y = E[Y]\) are the expected values (means) of \(X\) and \(Y\). Covariance, while useful, retains the units of \(X\) and \(Y\), which complicates comparisons across variables with differing units or scales.

The correlation coefficient refines this by dividing covariance by the product of the standard deviations of \(X\) and \(Y\):

\[\sigma_X = \sqrt{E[(X - \mu_X)^2]}, \quad \sigma_Y = \sqrt{E[(Y - \mu_Y)^2]}\]

Thus, correlation can be expressed as:

\[\rho_{X,Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}\]

This reformulation gives us a measure of association on a standardized scale. Specifically, if \(\rho_{X,Y} = 1\), there exists a perfect positive linear relationship between \(X\) and \(Y\): as \(X\) increases, \(Y\) increases proportionally. If \(\rho_{X,Y} = -1\), the variables exhibit a perfect negative linear relationship, where \(Y\) decreases as \(X\) increases. If \(\rho_{X,Y} = 0\), there is no linear relationship between \(X\) and \(Y\), although a nonlinear relationship may still exist.
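To make this concrete, here is a minimal NumPy sketch (with made-up data for two variables on very different scales) that computes \(\rho_{X,Y}\) as covariance divided by the product of the standard deviations and checks it against NumPy’s built-in np.corrcoef:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two related variables on different scales (illustrative data only).
x = rng.normal(loc=170, scale=10, size=1000)          # e.g. heights in cm
y = 0.5 * x + rng.normal(loc=0, scale=8, size=1000)   # linearly related to x, plus noise

# Correlation as "normalized" covariance: Cov(X, Y) / (sigma_X * sigma_Y).
cov_xy = np.cov(x, y)[0, 1]
rho = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

# NumPy's built-in Pearson correlation should agree.
print(rho, np.corrcoef(x, y)[0, 1])
```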

Other Properties

One of the most useful features of \(\rho_{X,Y}\) is its unit invariance. Unlike covariance, which scales with the units of \(X\) and \(Y\), correlation removes these effects by normalizing with standard deviations, making \(\rho_{X,Y}\) dimensionless. This quality enables comparisons across variables with different units or scales, ensuring that the strength of relationships is evaluated consistently, regardless of measurement. Symmetry is another noteworthy property: \(\rho_{X,Y} = \rho_{Y,X}\). It follows directly from the symmetry of the covariance term and reinforces that correlation measures a mutual relationship between variables, independent of which one is considered first. Together, these properties establish the correlation coefficient as a powerful, standardized metric for interpreting linear dependencies across diverse contexts.


The Correlation Matrix

When moving from a single pair of variables to multidimensional data, we naturally extend the concept of correlation to capture all pairwise relationships at once. This is where the correlation matrix \(R\) comes in, giving us a compact summary of the linear associations across an entire dataset. For a random vector \(X = [X_1, X_2, \dots, X_n]^T\), the correlation matrix \(R\) is an \(n \times n\) matrix, where each element \(R_{ij}\) represents the correlation coefficient between \(X_i\) and \(X_j\):

\[R_{ij} = \rho_{X_i, X_j} = \frac{\text{Cov}(X_i, X_j)}{\sigma_{X_i} \sigma_{X_j}}\]

Definition and Structure

In other words, if \(X\) is a random vector with a covariance matrix \(\Sigma\), then \(R\) is constructed element-by-element as:

\[R = \begin{bmatrix} \rho_{X_1, X_1} & \rho_{X_1, X_2} & \dots & \rho_{X_1, X_n} \\ \rho_{X_2, X_1} & \rho_{X_2, X_2} & \dots & \rho_{X_2, X_n} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{X_n, X_1} & \rho_{X_n, X_2} & \dots & \rho_{X_n, X_n} \end{bmatrix}.\]

Each diagonal entry \(R_{ii}\) equals 1, since \(\rho_{X_i, X_i} = \frac{\text{Cov}(X_i, X_i)}{\sigma_{X_i}^2} = 1\). This makes sense because a variable is perfectly correlated with itself. The off-diagonal elements \(R_{ij}\) (for \(i \neq j\)) give us the correlation between different variables, capturing their linear relationships in a standardized form that’s easy to interpret across the entire matrix.

Relationship to the Covariance Matrix

The correlation matrix \(R\) is derived directly from the covariance matrix \(\Sigma\) by a process of standardization. The idea is to convert the unit- and scale-dependent covariances into unit-free, comparable correlation values. To achieve this, we divide each covariance by the product of the standard deviations of the relevant variables, arriving at the expression:

\[R = D^{-1} \Sigma D^{-1}\]

where \(D\) is the diagonal matrix of standard deviations, given by:

\[D = \text{diag}(\sigma_{X_1}, \sigma_{X_2}, \dots, \sigma_{X_n}).\]

Expanding this calculation, let’s see how each entry in \(R\) is computed in terms of \(\Sigma\) and \(D\):

  1. Start with the covariance matrix element \(\Sigma_{ij} = \text{Cov}(X_i, X_j)\).
  2. Divide \(\Sigma_{ij}\) by \(\sigma_{X_i} \sigma_{X_j}\), where \(\sigma_{X_i} = \sqrt{\Sigma_{ii}}\) and \(\sigma_{X_j} = \sqrt{\Sigma_{jj}}\).
  3. This yields each element in \(R\) as:

    \[R_{ij} = \frac{\Sigma_{ij}}{\sigma_{X_i} \sigma_{X_j}}.\]

In matrix form, we achieve this by pre-multiplying and post-multiplying \(\Sigma\) with \(D^{-1}\), transforming the raw covariance entries into standardized, unitless correlation coefficients. This transformation is crucial when comparing variables measured on different scales, as it removes any units, letting us focus purely on the strength of the relationships.
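As a quick illustration, the following sketch (using randomly generated data as a stand-in for a real dataset) builds \(R\) from \(\Sigma\) via \(D^{-1} \Sigma D^{-1}\) and confirms that it matches the correlation matrix computed directly:

```python
import numpy as np

rng = np.random.default_rng(1)

# Sample data: 500 observations of 3 variables on very different scales (illustrative).
X = rng.normal(size=(500, 3)) * np.array([1.0, 10.0, 100.0])

Sigma = np.cov(X, rowvar=False)                  # covariance matrix (3 x 3)
D_inv = np.diag(1.0 / np.sqrt(np.diag(Sigma)))   # D^{-1}: inverse of the std-dev diagonal

R = D_inv @ Sigma @ D_inv                        # standardize: R = D^{-1} Sigma D^{-1}

# Should match NumPy's direct correlation computation.
print(np.allclose(R, np.corrcoef(X, rowvar=False)))
```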

Some Properties

The correlation matrix \(R\) possesses a few key properties that make it an elegant and powerful tool for analyzing multidimensional data. First, it is symmetric by nature, since \(\rho_{X_i, X_j} = \rho_{X_j, X_i}\) for any pair of variables \(X_i\) and \(X_j\). This symmetry ensures that each pairwise correlation is mutual and gives the matrix a balanced, mirror-like structure around its diagonal. This diagonal, in turn, consists entirely of ones, as each variable is perfectly correlated with itself—a subtle reminder that correlation is inherently a self-consistent measure.

In addition to its symmetry, \(R\) is positive semi-definite, meaning that for any vector \(z\), the quadratic form \(z^T R z\) is non-negative. This implies that all eigenvalues of \(R\) are non-negative, which is significant because \(z^T R z\) equals the variance of the linear combination \(z^T Z\) of the standardized variables, and a variance can never be negative. This property becomes particularly valuable in applications like principal component analysis (PCA), where the correlation matrix’s eigenvalues reflect the spread of the data along different directions.

Furthermore, every element in \(R\) falls within the interval \([-1, 1]\), a direct result of the correlation coefficient’s own bounded nature. This bounded range ensures that each entry in \(R\) is a pure, unitless indicator of linear association strength. Regardless of the scale or units of the original variables, the values in \(R\) give a consistent, standardized view of how variables align with each other.
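These three properties are easy to verify numerically. The sketch below, on randomly generated correlated columns, checks symmetry, the unit diagonal, non-negative eigenvalues, and the \([-1, 1]\) bounds (allowing a small tolerance for floating-point round-off):

```python
import numpy as np

rng = np.random.default_rng(2)

# Randomly generated data with correlated columns (purely illustrative).
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
R = np.corrcoef(X, rowvar=False)

print(np.allclose(R, R.T))                      # symmetric
print(np.allclose(np.diag(R), 1.0))             # ones on the diagonal
print(np.all(np.linalg.eigvalsh(R) >= -1e-10))  # positive semi-definite (up to round-off)
print(np.all(np.abs(R) <= 1 + 1e-12))           # every entry within [-1, 1]
```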


The Geometric Meaning

Geometric Interpretation: Angles and Alignments

Consider two random variables \(X_i\) and \(X_j\), each represented, after subtracting its mean, as a vector of observations in the data space (one coordinate per observation). The correlation coefficient \(\rho_{X_i, X_j}\) between \(X_i\) and \(X_j\) can then be understood as the cosine of the angle \(\theta\) between these two centered vectors. Formally, this relationship is given by:

\[\rho_{X_i, X_j} = \cos \theta_{ij}\]

where \(\theta_{ij}\) is the angle between the vectors corresponding to \(X_i\) and \(X_j\). When \(\rho_{X_i, X_j} = 1\), the vectors point in the same direction (\(\theta = 0^\circ\)), indicating a perfect positive linear relationship. Conversely, if \(\rho_{X_i, X_j} = -1\), the vectors point in opposite directions (\(\theta = 180^\circ\)), representing a perfect negative linear relationship. A correlation of zero corresponds to \(\theta = 90^\circ\), suggesting that the vectors are orthogonal and thus linearly uncorrelated.

This geometric interpretation gives a clear, visual sense of the relationships encoded in \(R\): the closer the angle between two variables’ vectors is to zero, the stronger and more positive their correlation; the closer the angle is to \(180^\circ\), the stronger and more negative the correlation. And when the vectors are perpendicular, they are uncorrelated in a linear sense, even though nonlinear relationships might still exist.
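This equivalence is easy to check numerically: for mean-centered observation vectors, the cosine of the angle between them coincides with the Pearson correlation. A small sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(size=500)   # positively related to x (illustrative)

# Center the observation vectors; correlation is the cosine of the angle between them.
xc, yc = x - x.mean(), y - y.mean()
cos_theta = np.dot(xc, yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))

print(cos_theta, np.corrcoef(x, y)[0, 1])   # the two values agree
```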

Eigenvalues and Eigenvectors: The Structure of \(R\)

The geometric story of \(R\) deepens when we consider its eigenvalues and eigenvectors. By performing an eigen-decomposition on \(R\), we can break it down as follows:

\[R = Q \Lambda Q^T\]

where \(Q\) is an orthogonal matrix whose columns are the eigenvectors of \(R\), and \(\Lambda\) is a diagonal matrix containing the eigenvalues of \(R\). Each eigenvalue \(\lambda_i\) in \(\Lambda\) represents the variance explained along the direction specified by the corresponding eigenvector \(q_i\) in \(Q\).

Geometrically, the eigenvectors of \(R\) indicate the principal directions of variance in the data. The eigenvalues, on the other hand, tell us the “lengths” or “strengths” of these directions. A large eigenvalue associated with an eigenvector indicates that the data has significant variance along that direction, meaning the variables exhibit strong alignment with each other in this space. Conversely, smaller eigenvalues correspond to directions with less variance, suggesting that the data is more tightly clustered or linearly dependent in those directions.

Since \(R\) is positive semi-definite, all eigenvalues are non-negative, and the sum of the eigenvalues equals the dimensionality of the space. Each eigenvalue-eigenvector pair offers a glimpse into the “shape” of the data cloud in terms of its stretch and orientation in different directions.
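A short numerical check of these two facts (non-negative eigenvalues, and eigenvalues summing to the dimension, i.e. the trace of \(R\)) on simulated data:

```python
import numpy as np

rng = np.random.default_rng(4)

# Five correlated variables (randomly generated, purely illustrative).
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))
R = np.corrcoef(X, rowvar=False)

eigvals = np.linalg.eigvalsh(R)               # eigenvalues of the symmetric matrix R

print(np.all(eigvals >= -1e-10))              # all non-negative (up to round-off)
print(np.isclose(eigvals.sum(), R.shape[0]))  # they sum to n, the trace of R
```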

PCA: Extracting Linear Patterns

The correlation matrix \(R\) is a natural tool for conducting PCA, a technique used to identify the main linear patterns in data by transforming it into a new coordinate system based on the eigenvectors of \(R\). In PCA, the data is projected onto the eigenvectors of \(R\), with each projection representing a principal component. The eigenvalue associated with each eigenvector measures the variance along that principal component, so ordering the eigenvalues from largest to smallest provides a ranking of the directions of greatest variability.

To perform PCA, we start by obtaining the eigen-decomposition of \(R\):

\[R = Q \Lambda Q^T\]

The columns of \(Q\), the eigenvectors, form the new basis in which the data is represented. Each eigenvector corresponds to a principal component, and the associated eigenvalue indicates the amount of variance captured by that component. For instance, the first principal component (corresponding to the largest eigenvalue) captures the direction of maximum variance in the data, providing the best one-dimensional summary of the data’s spread. Adding successive principal components gives progressively refined approximations of the data, capturing most of the structure with far fewer dimensions than the original dataset.

By using \(R\) in PCA, we effectively perform dimensionality reduction based on the strength of the linear relationships between variables, preserving the most informative aspects of the data while discarding redundancy.
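Putting the pieces together, here is a compact sketch of correlation-based PCA, assuming the data is an observations-by-variables NumPy array; pca_from_correlation is just an illustrative helper name, not a library function:

```python
import numpy as np

def pca_from_correlation(X):
    """PCA based on the correlation matrix: standardize, eigen-decompose, project.

    X is an (observations x variables) array. Returns the eigenvalues in descending
    order, the matching eigenvectors as columns, and the component scores.
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize each variable
    R = np.corrcoef(X, rowvar=False)                   # correlation matrix of the variables
    eigvals, eigvecs = np.linalg.eigh(R)               # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]                  # reorder: largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Z @ eigvecs                               # project standardized data onto components
    return eigvals, eigvecs, scores

# Randomly generated data with three correlated variables (illustrative).
rng = np.random.default_rng(5)
X = rng.normal(size=(400, 3)) @ rng.normal(size=(3, 3))

eigvals, eigvecs, scores = pca_from_correlation(X)
print(eigvals / eigvals.sum())   # fraction of total variance explained by each component
```

Here np.linalg.eigh is used rather than np.linalg.eig because \(R\) is symmetric, which guarantees real eigenvalues and an orthonormal set of eigenvectors.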

Demo

To illustrate how PCA leverages the correlation matrix \(R\) to extract meaningful linear patterns, let’s walk through a concrete example. Suppose we have a dataset with three variables—say, height, weight, and age—measured across a sample of individuals. Let’s assume that after standardizing the data, we calculate the correlation matrix \(R\) as follows:

\[R = \begin{bmatrix} 1 & 0.8 & 0.5 \\ 0.8 & 1 & 0.4 \\ 0.5 & 0.4 & 1 \end{bmatrix}\]

Each entry \(R_{ij}\) represents the correlation between pairs of variables. For example, the correlation between height and weight is 0.8, indicating a strong positive linear relationship, while the correlation between height and age is 0.5, a moderate positive association.

Step 1: Eigen-Decomposition of the Correlation Matrix

The first step in PCA is to perform an eigen-decomposition of \(R\) to identify the principal directions of variance. By finding the eigenvalues and eigenvectors of \(R\), we can understand the main directions in which the data varies most.

For our example, suppose the eigen-decomposition of \(R\) yields the following eigenvalues and eigenvectors (rounded, illustrative values chosen to keep the arithmetic simple):

  • Eigenvalue \(\lambda_1 = 1.8\), with eigenvector \(q_1 = \begin{bmatrix} 0.7 \\ 0.6 \\ 0.4 \end{bmatrix}\)
  • Eigenvalue \(\lambda_2 = 0.9\), with eigenvector \(q_2 = \begin{bmatrix} -0.5 \\ 0.7 \\ 0.5 \end{bmatrix}\)
  • Eigenvalue \(\lambda_3 = 0.3\), with eigenvector \(q_3 = \begin{bmatrix} 0.5 \\ -0.2 \\ 0.8 \end{bmatrix}\)

Each eigenvalue represents the variance explained by its corresponding eigenvector direction. The largest eigenvalue, \(\lambda_1 = 1.8\), tells us that the first principal component explains the most variance in the data—roughly \(\frac{1.8}{3} = 60\%\) of the total variance (since the sum of the eigenvalues is 3 in a three-variable system). The second eigenvalue \(\lambda_2 = 0.9\) accounts for \(30\%\) of the variance, and the third eigenvalue \(\lambda_3 = 0.3\) contributes \(10\%\).
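If you want to reproduce this step yourself, the decomposition can be computed directly; keep in mind that the eigenvalues and eigenvectors quoted above are rounded, illustrative values, so the exact output will differ somewhat:

```python
import numpy as np

# The correlation matrix from the demo.
R = np.array([[1.0, 0.8, 0.5],
              [0.8, 1.0, 0.4],
              [0.5, 0.4, 1.0]])

eigvals, eigvecs = np.linalg.eigh(R)       # ascending order for symmetric matrices
order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(eigvals)                   # exact eigenvalues of this R
print(eigvals / eigvals.sum())   # fraction of total variance explained by each component
print(eigvecs)                   # columns are the principal directions
```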

Step 2: Interpreting the Principal Components

Each eigenvector represents a direction in the original variable space (height, weight, age) along which there is a certain level of variability in the data. Let’s break down what each principal component reveals:

  1. First Principal Component (PC1): The first eigenvector \(q_1 = \begin{bmatrix} 0.7 \\ 0.6 \\ 0.4 \end{bmatrix}\) suggests that PC1 is a weighted combination of all three variables, with the largest weights on height and weight. This component captures the common variability between height and weight, reflecting a general “size” factor—individuals with larger heights tend to have larger weights. Since this component explains 60% of the variance, it’s the most informative single direction for summarizing the data.

  2. Second Principal Component (PC2): The second eigenvector \(q_2 = \begin{bmatrix} -0.5 \\ 0.7 \\ 0.5 \end{bmatrix}\) has significant contributions from all three variables, but with a negative sign for height and positive signs for weight and age. This suggests that PC2 captures a contrast between height and a combined weight-age factor, which might reflect a tendency where, for a given age, people with smaller heights have relatively higher weights.

  3. Third Principal Component (PC3): The third eigenvector \(q_3 = \begin{bmatrix} 0.5 \\ -0.2 \\ 0.8 \end{bmatrix}\) explains only 10% of the total variance. This component primarily reflects variability in age with some influence from height and a negative contribution from weight, capturing a pattern that is less pronounced in the data.

Step 3: Projecting the Data onto the Principal Components

With the principal components identified, we can project our data onto these components to transform it into a new coordinate system defined by \(q_1\), \(q_2\), and \(q_3\). Writing \(X\) for the standardized data (each variable centered and scaled to unit variance, since \(R\) was computed from standardized variables), this projection can be computed as:

\[\text{Projected Data} = Q^T X\]

where \(Q\) is the matrix of eigenvectors:

\[Q = \begin{bmatrix} 0.7 & -0.5 & 0.5 \\ 0.6 & 0.7 & -0.2 \\ 0.4 & 0.5 & 0.8 \end{bmatrix}\]

For each individual in the dataset, their original measurements in terms of height, weight, and age are transformed into scores along PC1, PC2, and PC3. These scores reflect the new, simplified representation of each individual in terms of the most significant patterns in the data. For example, projecting the original data onto PC1 (the direction of greatest variance) gives a one-dimensional summary of “size” variation across individuals, effectively condensing the information from three variables into a single informative score.
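As a small worked sketch, consider one hypothetical individual whose standardized measurements are 1.2, 0.8, and -0.5 (standard deviations from the mean for height, weight, and age; made-up numbers). The projection \(Q^T z\) turns these three values into scores along PC1, PC2, and PC3:

```python
import numpy as np

# Eigenvector matrix Q from the demo (columns are q1, q2, q3).
Q = np.array([[0.7, -0.5,  0.5],
              [0.6,  0.7, -0.2],
              [0.4,  0.5,  0.8]])

# One hypothetical individual, already standardized: height 1.2 SDs above the
# mean, weight 0.8 SDs above, age 0.5 SDs below (illustrative values).
z = np.array([1.2, 0.8, -0.5])

scores = Q.T @ z    # scores along PC1, PC2, PC3
print(scores)       # scores[0] is the one-dimensional "size" summary along PC1
```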

Step 4: Reducing Dimensionality

Since PC1 and PC2 together account for 90% of the total variance, we might decide to approximate our data using only these two components, discarding PC3, which contributes relatively little. By retaining only PC1 and PC2, we reduce the dimensionality of the dataset from three to two while preserving the bulk of the information.

This reduced representation is particularly useful for visualization: each individual can now be represented by a point in a two-dimensional plane defined by PC1 and PC2. The positions of these points reflect the primary structure and relationships in the original data, without the “noise” from minor variations that PC3 captures. Moreover, patterns and clusters within the data often become more apparent in this reduced-dimensional space, revealing insights that might not be obvious in the original three-dimensional view.
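A sketch of what this reduction looks like in code, using the demo eigenvectors and a few hypothetical standardized individuals (one per row of Z):

```python
import numpy as np

# Retain only the first two principal directions (columns of Q) and drop PC3.
Q = np.array([[0.7, -0.5,  0.5],
              [0.6,  0.7, -0.2],
              [0.4,  0.5,  0.8]])
Q2 = Q[:, :2]    # 3 x 2 basis for the reduced space

# Standardized measurements for a few hypothetical individuals (one per row).
Z = np.array([[ 1.2,  0.8, -0.5],
              [-0.3, -0.6,  1.1],
              [ 0.1,  0.4,  0.9]])

scores_2d = Z @ Q2   # each row becomes a point in the (PC1, PC2) plane
print(scores_2d)     # two coordinates per individual, ready for a 2-D scatter plot
```

Because Z stores observations as rows, the projection is written \(Z Q\) here; it is simply the transpose of the \(Q^T X\) form used above, where observations are columns.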

 

Summary

In this example, PCA has taken a dataset with three interrelated variables and transformed it into a new set of uncorrelated components that reveal the primary patterns of variability. By examining the eigenvalues and eigenvectors of the correlation matrix \(R\), we extracted principal components that provided both a compact and interpretable representation of the data.


Discussion

In conclusion, this post serves as a natural extension of our previous discussion on covariance, picking up the aspects of correlation that were left unexplored. While the covariance matrix lays the groundwork for understanding how variables interact, it lacks the standardization needed for clear, direct comparisons. The correlation matrix fills this gap, distilling complex relationships into a standardized, unit-free format that provides immediate insight into linear dependencies across variables. Each entry in the matrix reflects a pairwise relationship that can even be read geometrically, as the cosine of the angle between two variables.



