In statistics and data analysis, you often need to generate correlated random variables that follow a specific distribution. This comes up in Monte Carlo simulations, computational statistics, and numerical methods in general. In this tutorial, you will learn how to generate correlated random variables in Python using the NumPy and SciPy libraries.
Step 1: Import Necessary Libraries
First, we import the modules we need. In our case, these are NumPy and SciPy, two libraries that are fundamental for scientific computing in Python. You can install them via pip:
    pip install numpy scipy
Then import them in our script:
    import numpy as np
    from scipy import linalg
Step 2: Define Variables
For this tutorial, our task is to generate two standard normal random variables with a given correlation coefficient (ρ), which in our case is 0.6. Because both variables have unit variance, the correlation matrix is also their covariance matrix.
    num_vars = 2

    corr_mat = np.array([
        [1.0, 0.6],
        [0.6, 1.0]
    ])
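The Cholesky decomposition used in Step 4 only succeeds when the matrix is symmetric and positive definite, so it can be worth checking a hand-written correlation matrix first. The following is a minimal sketch; the helper name `is_valid_corr_matrix` and the tolerance are assumptions made for illustration, not part of NumPy or SciPy.

```python
import numpy as np


def is_valid_corr_matrix(mat, tol=1e-10):
    """Hypothetical helper: True if mat is square, symmetric,
    has a unit diagonal, and is positive definite."""
    mat = np.asarray(mat, dtype=float)
    if mat.ndim != 2 or mat.shape[0] != mat.shape[1]:
        return False
    symmetric = np.allclose(mat, mat.T, atol=tol)
    unit_diag = np.allclose(np.diag(mat), 1.0, atol=tol)
    if not (symmetric and unit_diag):
        return False
    # eigvalsh is meant for symmetric matrices; positive definite
    # means every eigenvalue is strictly positive
    return bool(np.all(np.linalg.eigvalsh(mat) > tol))


corr_mat = np.array([[1.0, 0.6], [0.6, 1.0]])
print(is_valid_corr_matrix(corr_mat))
```

A matrix such as [[1.0, 1.2], [1.2, 1.0]] fails this check, because a correlation above 1 makes one eigenvalue negative.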
Step 3: Generate Uncorrelated Random Variables
Here, we generate 10,000 samples of two uncorrelated standard normal random variables and store them in a two-dimensional array, one column per variable. The sample correlation between the two columns should be close to zero.
    x = np.random.normal(0, 1, size=(10000, num_vars))
    print("Correlation coefficient of original data: ", np.corrcoef(x, rowvar=False))
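The call above draws from NumPy's global random state, so the numbers change on every run. If repeatable results are wanted, one option is a seeded `Generator`. This is a sketch only, and the seed value 42 is an arbitrary assumption.

```python
import numpy as np

num_vars = 2
rng = np.random.default_rng(42)  # arbitrary seed, chosen only for illustration

# Same shape as in the tutorial: 10,000 rows, one column per variable
x = rng.standard_normal(size=(10000, num_vars))

# The off-diagonal entries should be close to 0 for independent draws
print(np.corrcoef(x, rowvar=False))
```

Re-running the script with the same seed reproduces the same samples, which makes the printed correlations easier to compare between experiments.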
Step 4: Cholesky Decomposition
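We then use the Cholesky decomposition, which factors a Hermitian, positive-definite matrix into the product of a lower triangular matrix and its conjugate transpose. Note that SciPy's linalg.cholesky returns the upper triangular factor U by default, so the correlation matrix equals the transpose of U multiplied by U. Multiplying the uncorrelated samples x by U gives new samples whose covariance is, up to sampling error, that same product, which is our desired correlation matrix.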
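For the two-variable case the factor can be written out by hand, which makes the effect of the transformation explicit. The symbols C, U, x_1, x_2, y_1 and y_2 are notation introduced here for illustration, with U taken to be the upper triangular factor.

```latex
\[
C = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} = U^{\mathsf T} U,
\qquad
U = \begin{pmatrix} 1 & \rho \\ 0 & \sqrt{1-\rho^2} \end{pmatrix}
\]
\[
y_1 = x_1, \qquad y_2 = \rho\, x_1 + \sqrt{1-\rho^2}\, x_2
\]
\[
\operatorname{Var}(y_2) = \rho^2 + (1-\rho^2) = 1,
\qquad
\operatorname{Cov}(y_1, y_2) = \rho
\]
```

With ρ = 0.6 this gives y_2 = 0.6 x_1 + 0.8 x_2.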
    upper_chol = linalg.cholesky(corr_mat)
    y = np.dot(x, upper_chol)
    print("Correlation coefficient of transformed data: ", np.corrcoef(y, rowvar=False))
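A quick way to confirm that the factor does what is expected is to multiply it back together. The sketch below assumes the same `corr_mat` as in this tutorial and also shows the `lower=True` variant, which is equivalent once transposed. The seed and the smaller sample size are arbitrary choices.

```python
import numpy as np
from scipy import linalg

corr_mat = np.array([[1.0, 0.6], [0.6, 1.0]])

upper_chol = linalg.cholesky(corr_mat)              # upper triangular by default
lower_chol = linalg.cholesky(corr_mat, lower=True)  # lower triangular factor

# Both factorizations reconstruct the target correlation matrix
print(np.allclose(upper_chol.T @ upper_chol, corr_mat))
print(np.allclose(lower_chol @ lower_chol.T, corr_mat))

# Multiplying by the upper factor is the same as multiplying
# by the transpose of the lower factor
x = np.random.default_rng(0).standard_normal(size=(1000, 2))
print(np.allclose(x @ upper_chol, x @ lower_chol.T))
```

This matters when switching to `np.linalg.cholesky`, which returns the lower factor by default and therefore needs the transpose in the multiplication.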
Your complete Python code is:
    import numpy as np
    from scipy import linalg

    # Number of variables and the target correlation matrix (rho = 0.6)
    num_vars = 2
    corr_mat = np.array([
        [1.0, 0.6],
        [0.6, 1.0]
    ])

    # 10,000 uncorrelated standard normal samples, one column per variable
    x = np.random.normal(0, 1, size=(10000, num_vars))
    print("Correlation coefficient of original data: ", np.corrcoef(x, rowvar=False))

    # Upper triangular Cholesky factor; x @ upper_chol has the target correlation
    upper_chol = linalg.cholesky(corr_mat)
    y = np.dot(x, upper_chol)
    print("Correlation coefficient of transformed data: ", np.corrcoef(y, rowvar=False))
Output
    Correlation coefficient of original data:  [[ 1.         -0.00479441]
     [-0.00479441  1.        ]]
    Correlation coefficient of transformed data:  [[1.         0.60020262]
     [0.60020262 1.        ]]

Your exact numbers will differ slightly, because the code does not fix a random seed.
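As a cross-check, NumPy can also draw correlated normal samples directly with `Generator.multivariate_normal`. This is an alternative to the Cholesky steps above, not the method this tutorial uses, and the seed and sample size below are arbitrary assumptions.

```python
import numpy as np

corr_mat = np.array([[1.0, 0.6], [0.6, 1.0]])
rng = np.random.default_rng(0)  # arbitrary seed for repeatability

# Zero means and unit variances, so the covariance matrix
# passed here is the same as the correlation matrix
samples = rng.multivariate_normal(mean=np.zeros(2), cov=corr_mat, size=10000)

sample_corr = np.corrcoef(samples, rowvar=False)
print(sample_corr)
```

The off-diagonal entries should again land near 0.6, in line with the transformed data above.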
Conclusion
In summary, generating correlated random variables in Python takes only a few lines of code with the NumPy and SciPy libraries.
Knowing how to generate correlated variables is useful for Monte Carlo simulations, uncertainty analysis, and machine learning problems, where variables often exhibit interdependence.