How To Make A Distance Matrix In Python

In this tutorial, we will learn how to make a distance matrix in Python. A distance matrix is a square matrix whose entry in row i and column j is the distance between the i-th and j-th elements of a dataset, so it is symmetric and has zeros on the diagonal. It is helpful in many applications such as genetics, data analysis, machine learning, and clustering. We’ll be using popular Python libraries such as scipy and sklearn to achieve this.

Step 1: Import Required Libraries

First, we need to import the required libraries for our code. We’ll use numpy for working with arrays, pandas for manipulating and displaying data, and scipy and sklearn for calculating distances.
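A minimal sketch of these imports, assuming the usual aliases np and pd and the standard module paths for the distance functions used later in the tutorial:

```python
import numpy as np                                      # array handling
import pandas as pd                                     # tabular display
from scipy.spatial.distance import pdist, squareform   # SciPy distance helpers
from sklearn.metrics import pairwise_distances         # scikit-learn distance helper
```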

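Make sure you have all the necessary libraries installed. If you haven’t, you can install them using pip, for example by running pip install numpy pandas scipy scikit-learn. Note that the sklearn module is published on PyPI under the package name scikit-learn.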

Step 2: Prepare the Data

For this tutorial, we’ll create a simple dataset to compute the distance matrix. In practice, you will likely have more complex data, such as a CSV file or a dataset retrieved from a database.
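A small dataset of this kind could be set up as in the sketch below. The coordinate values are an assumption, chosen so that the resulting distances match the output shown in Step 5, and the variable name labels is illustrative:

```python
import numpy as np

# Four points in three dimensions (assumed values for illustration)
data = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
    [10, 11, 12],
])

# One label per point, in the same order as the rows of data
labels = ["A", "B", "C", "D"]
```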

In this example, we have a 2D array called data containing four points in three dimensions and a list of labels for each point.

Step 3: Calculate the Distance Matrix using Scipy

The SciPy library provides a function called pdist that calculates the pairwise distances between the rows of a given 2D array, treating each row as one observation. By default, the function uses the Euclidean distance, but you can specify other distance metrics through its metric argument.
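A minimal sketch of the call, assuming the four-point dataset described in Step 2. The variable name condensed_dist_matrix_scipy is illustrative, not taken from the original:

```python
import numpy as np
from scipy.spatial.distance import pdist

# Assumed example data: four points in three dimensions
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])

# Euclidean is the default metric; it is spelled out here for clarity
condensed_dist_matrix_scipy = pdist(data, metric="euclidean")
```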

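This command returns a condensed distance matrix: a one-dimensional array that stores only the distances above the diagonal, so n points yield n(n-1)/2 values. We can convert it to a square distance matrix using the squareform function provided by the scipy.spatial.distance module.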
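The conversion can be sketched as follows, again assuming the example data from Step 2. The name of the intermediate condensed array is illustrative:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Assumed example data: four points in three dimensions
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])

# Condensed form: 6 pairwise distances for 4 points
condensed_dist_matrix_scipy = pdist(data)

# Expand to a symmetric 4 x 4 matrix with zeros on the diagonal
square_dist_matrix_scipy = squareform(condensed_dist_matrix_scipy)
```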

Now our distance matrix using Scipy is stored in the variable square_dist_matrix_scipy.

Step 4: Calculate the Distance Matrix using Sklearn

As an alternative, we can use the scikit-learn library to compute distance matrices. The pairwise_distances function from the sklearn.metrics module can be used to perform this task.
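A minimal sketch of the scikit-learn version, assuming the same example data as in Step 2:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Assumed example data: four points in three dimensions
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])

# Passing a single array computes distances between all pairs of its rows
dist_matrix_sklearn = pairwise_distances(data, metric="euclidean")
```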

Here, we compute the distance matrix using the Euclidean distance as the metric. The pairwise_distances function returns a square distance matrix. The distance matrix using scikit-learn is stored in the variable dist_matrix_sklearn.

Step 5: Display the Results

We can now display the distance matrices we’ve computed using both Scipy and Sklearn. We can use pandas to create a DataFrame to display our distance matrices in a more readable format. Here we show how to do it with the scikit-learn distance matrix, but you can do the same with the Scipy distance matrix.
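One way to build and print such a DataFrame is sketched below. The data values and the labels list are the assumed ones from Step 2, and the name df is illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import pairwise_distances

# Assumed example data and labels
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
labels = ["A", "B", "C", "D"]

dist_matrix_sklearn = pairwise_distances(data, metric="euclidean")

# Use the labels for both rows and columns so each cell reads as "distance from row to column"
df = pd.DataFrame(dist_matrix_sklearn, index=labels, columns=labels)
print(df)
```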

           A          B          C          D
A   0.000000   5.196152  10.392305  15.588457
B   5.196152   0.000000   5.196152  10.392305
C  10.392305   5.196152   0.000000   5.196152
D  15.588457  10.392305   5.196152   0.000000

Full Code
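Putting the steps together, a complete script might look like the sketch below. The data values are assumed (chosen to reproduce the distances shown in Step 5), and names such as labels, condensed_dist_matrix_scipy, and df are illustrative:

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import pairwise_distances

# Step 2: prepare the data (assumed values: four points in three dimensions)
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
labels = ["A", "B", "C", "D"]

# Step 3: SciPy, condensed distances first, then the square matrix
condensed_dist_matrix_scipy = pdist(data, metric="euclidean")
square_dist_matrix_scipy = squareform(condensed_dist_matrix_scipy)

# Step 4: scikit-learn returns the square matrix directly
dist_matrix_sklearn = pairwise_distances(data, metric="euclidean")

# Step 5: display the result as a labeled table
df = pd.DataFrame(dist_matrix_sklearn, index=labels, columns=labels)
print(df)
```

Both libraries should give the same matrix for the same metric, which can be checked with np.allclose(square_dist_matrix_scipy, dist_matrix_sklearn).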

Conclusion

In this tutorial, we learned how to create a distance matrix in Python using the Scipy and Scikit-learn libraries. We demonstrated how to compute and display distance matrices for a simple dataset and showed the full code for doing so. You can easily adapt this code for more complex datasets or other distance metrics available in these libraries.