In this tutorial, we will learn how to build a distance matrix in Python. A distance matrix is a square matrix whose entry in row i and column j is the distance between elements i and j of a dataset, so its diagonal is all zeros. It is useful in many applications, such as genetics, data analysis, machine learning, and clustering. We’ll use two popular Python libraries, **scipy** and **sklearn** (installed as scikit-learn), to compute it.

### Step 1: Import Required Libraries

First, we need to import the required libraries for our code. We’ll use **numpy** for working with arrays, **pandas** for manipulating and displaying data, and **scipy** and **sklearn** for calculating distances.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist
from sklearn.metrics import pairwise_distances
```

Make sure you have all the necessary libraries installed. If not, you can install them using `pip`:

```bash
pip install numpy pandas scipy scikit-learn
```

### Step 2: Prepare the Data

For this tutorial, we’ll create a simple dataset to compute the distance matrix. In practice, you will likely have more complex data, such as a CSV file or a dataset retrieved from a database.

```python
data = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
    [10, 11, 12]
])

labels = ["A", "B", "C", "D"]
```

In this example, we have a 2D array called `data` containing four points in three dimensions, and a list called `labels` with one label per point.
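For data that lives in a CSV file, the same `data` and `labels` can be derived from a DataFrame. The following is a minimal sketch, assuming a hypothetical file layout with a `label` column followed by numeric feature columns `x`, `y`, `z`; the CSV text is inlined through `io.StringIO` so the example runs without a file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice you would pass a file path to pd.read_csv.
csv_text = """label,x,y,z
A,1,2,3
B,4,5,6
C,7,8,9
D,10,11,12
"""

# Use the "label" column as the index so it can later name the matrix rows and columns.
frame = pd.read_csv(io.StringIO(csv_text), index_col="label")

data = frame.to_numpy(dtype=float)  # shape (4, 3): four points, three features
labels = frame.index.tolist()       # one label per row of `data`

print(data.shape)
print(labels)
```

Only numeric columns should end up in `data`; non-numeric columns need to be dropped or encoded before computing distances.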

### Step 3: Calculate the Distance Matrix using SciPy

The **SciPy** library provides a function called `pdist` that calculates the pairwise distances between the rows of a 2D array. By default, the function uses the Euclidean distance, but you can specify other distance metrics through the `metric` argument.

```python
dist_matrix_scipy = pdist(data, metric="euclidean")
```
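To illustrate the `metric` argument, here is a short sketch that runs `pdist` on the same four points with three of SciPy's other built-in metric names. The variable names are chosen for this example only.

```python
import numpy as np
from scipy.spatial.distance import pdist

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])

# Manhattan (city block) distance: sum of absolute coordinate differences.
# For A and B this is 3 + 3 + 3 = 9.
dist_cityblock = pdist(data, metric="cityblock")

# Chebyshev distance: largest absolute coordinate difference.
# For A and B this is max(3, 3, 3) = 3.
dist_chebyshev = pdist(data, metric="chebyshev")

# Cosine distance: 1 minus the cosine of the angle between two vectors.
dist_cosine = pdist(data, metric="cosine")

print(dist_cityblock)
print(dist_chebyshev)
print(dist_cosine)
```

Which metric is appropriate depends on the data; cosine distance, for instance, ignores vector length and compares direction only.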

`pdist` returns a condensed distance matrix: a flat 1D array holding only the distances above the diagonal. We can convert it to a square distance matrix using the `squareform` function provided by the `scipy.spatial.distance` module.

```python
from scipy.spatial.distance import squareform

square_dist_matrix_scipy = squareform(dist_matrix_scipy)
```

Our SciPy distance matrix is now stored in the variable `square_dist_matrix_scipy`.
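The relationship between the condensed and square forms can be checked directly. This sketch, with variable names chosen for illustration, shows that four points produce six unique pairs and that the condensed array follows the upper triangle of the square matrix row by row.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
n = data.shape[0]

condensed = pdist(data, metric="euclidean")
square = squareform(condensed)

# n points give n * (n - 1) / 2 unique pairs: 6 for n = 4.
print(condensed.shape)  # (6,)
print(square.shape)     # (4, 4)

# The condensed entries follow the upper triangle, row by row:
# (0,1), (0,2), (0,3), (1,2), (1,3), (2,3)
rows, cols = np.triu_indices(n, k=1)
assert np.allclose(condensed, square[rows, cols])

# squareform also works in reverse: square matrix back to condensed form.
assert np.allclose(squareform(square), condensed)
```

The condensed form stores each distance once, which roughly halves memory use for large datasets.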

### Step 4: Calculate the Distance Matrix using scikit-learn

As an alternative, we can use the **scikit-learn** library to compute distance matrices. The `pairwise_distances` function from the `sklearn.metrics` module can be used to perform this task.

```python
dist_matrix_sklearn = pairwise_distances(data, metric="euclidean")
```

Here, we compute the distance matrix using the Euclidean distance as the metric. Unlike `pdist`, the `pairwise_distances` function returns a square distance matrix directly, so no conversion is needed. The result is stored in the variable `dist_matrix_sklearn`.
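As a sanity check, the two approaches should agree on the same input. The sketch below compares them with a tolerance and also shows the optional second argument of `pairwise_distances`, which computes distances between two different sets of points. The `queries` array is made up for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import pairwise_distances

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])

square_scipy = squareform(pdist(data, metric="euclidean"))
square_sklearn = pairwise_distances(data, metric="euclidean")

# Compare with a tolerance rather than ==, since the two libraries
# may differ in the last few floating-point digits.
print(np.allclose(square_scipy, square_sklearn))

# With a second array, pairwise_distances returns the distance from
# every row of the first array to every row of the second.
queries = np.array([[0, 0, 0], [5, 5, 5]])  # hypothetical query points
cross = pairwise_distances(queries, data, metric="euclidean")
print(cross.shape)  # (2, 4)
```

The two-array form yields a rectangular matrix, which is not a distance matrix in the strict sense but is handy for nearest-neighbor lookups.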

### Step 5: Display the Results

We can now display the distance matrices computed with SciPy and scikit-learn. A **pandas** DataFrame shows them in a more readable format, with the labels as row and column headers. Here we use the scikit-learn distance matrix, but the same approach works for the SciPy one (`square_dist_matrix_scipy`).

```python
df = pd.DataFrame(dist_matrix_sklearn, columns=labels, index=labels)
print(df)
```

```
           A          B          C          D
A   0.000000   5.196152  10.392305  15.588457
B   5.196152   0.000000   5.196152  10.392305
C  10.392305   5.196152   0.000000   5.196152
D  15.588457  10.392305   5.196152   0.000000
```
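The same display works for the SciPy result. This sketch wraps the square SciPy matrix in a DataFrame (named `df_scipy` here for illustration), looks up one pair by label, and checks that value by hand.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
labels = ["A", "B", "C", "D"]

square = squareform(pdist(data, metric="euclidean"))
df_scipy = pd.DataFrame(square, index=labels, columns=labels)

# Round for display only; the underlying values keep full precision.
print(df_scipy.round(3))

# Look up a single pair by label.
print(df_scipy.loc["A", "B"])

# Hand check: A and B differ by 3 in each of the 3 coordinates,
# so the Euclidean distance is sqrt(3**2 + 3**2 + 3**2) = sqrt(27).
assert np.isclose(df_scipy.loc["A", "B"], np.sqrt(27))
```

Label-based lookup with `.loc` avoids having to remember which row index belongs to which point.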

## Full Code

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import pairwise_distances

# Four points in three dimensions, one label per point
data = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
    [10, 11, 12]
])

labels = ["A", "B", "C", "D"]

# SciPy: condensed distances first, then converted to a square matrix
dist_matrix_scipy = pdist(data, metric="euclidean")
square_dist_matrix_scipy = squareform(dist_matrix_scipy)

# scikit-learn: square matrix directly
dist_matrix_sklearn = pairwise_distances(data, metric="euclidean")

# Display with row and column labels
df = pd.DataFrame(dist_matrix_sklearn, columns=labels, index=labels)
print(df)
```

## Conclusion

In this tutorial, we learned how to create a distance matrix in Python using the SciPy and scikit-learn libraries. We computed and displayed distance matrices for a simple dataset and showed the full code for doing so. You can adapt this code to more complex datasets, or switch to another distance metric available in these libraries by changing the `metric` argument.