{ "cells": [ { "cell_type": "markdown", "id": "offensive-enterprise", "metadata": {}, "source": [ "# Working with sparse matrices in `scipy`\n", "\n", "In this notebook, we demonstrate some examples utilizing sparse matrices in `scipy`. Make sure you have `numpy` and `scipy` installed:\n", "\n", "```bash\n", "pip install --user numpy scipy # or: pip3 install --user numpy scipy\n", "```" ] }, { "cell_type": "code", "execution_count": 1, "id": "average-clinton", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<4x3 sparse matrix of type ''\n", "\twith 6 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "import scipy.sparse as sps\n", "\n", "# Method 1: create a scipy array directly from numpy / list\n", "A = [\n", " [1, 0, 0],\n", " [4, 0, 1],\n", " [0, 0, 2],\n", " [5, 4, 0]\n", "]\n", "A_sp = sps.csr_matrix(A) # use csr/csc directly if array is already built\n", "A_sp" ] }, { "cell_type": "code", "execution_count": 3, "id": "seven-cliff", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(array([1, 4, 1, 2, 5, 4]),\n", " array([0, 0, 2, 2, 0, 1], dtype=int32),\n", " array([0, 1, 3, 4, 6], dtype=int32))" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# we can access the fields of A_sp directly\n", "A_sp.data, A_sp.indices, A_sp.indptr # same interface for csc format" ] }, { "cell_type": "code", "execution_count": 16, "id": "annoying-wings", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Keys: dict_keys([(0, 0), (0, 2), (1, 1), (2, 0), (2, 2), (3, 1)]) Values: dict_values([1, 1, 1, 1, 1, 1])\n" ] }, { "data": { "text/plain": [ "<4x3 sparse matrix of type ''\n", "\twith 6 stored elements in Compressed Sparse Column format>" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Method 2: creating an array incrementally\n", "A_sp = sps.dok_matrix((4, 3), dtype=np.int)\n", "for i in range(4):\n", " for j in range(3):\n", " if (i + j) % 2 == 0:\n", " A_sp[i, j] = 1\n", "# dok matrices have similar interface to dict\n", "print(\"Keys:\", A_sp.keys(), \"Values:\", A_sp.values())\n", "\n", "# can convert to other formats\n", "A_sp.tocsc()" ] }, { "cell_type": "code", "execution_count": 19, "id": "addressed-cambodia", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[1, 0, 1],\n", " [0, 1, 0],\n", " [1, 0, 1],\n", " [0, 1, 0]])" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# can also convert to numpy (dense) array\n", "# do *NOT* use .todense(), use .toarray() instead\n", "A_sp.toarray()" ] }, { "cell_type": "code", "execution_count": 23, "id": "frozen-necklace", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[1, 0, 0],\n", " [4, 0, 1],\n", " [0, 0, 2],\n", " [5, 4, 0]])" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Method 3: create CSR / CSC directly\n", "\n", "# CSR: col_idx, row_ptr\n", "values = [1, 4, 1, 2, 5, 4]\n", "col_id = [0, 0, 2, 2, 0, 1]\n", "row_pt = [0, 1, 3, 4, len(values)]\n", "\n", "A_csr = sps.csr_matrix((values, col_id, row_pt), shape=(4, 3))\n", "A_csr.toarray()" ] }, { "cell_type": "code", "execution_count": 25, "id": "several-portsmouth", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[1, 0, 0],\n", " [4, 0, 1],\n", " [0, 0, 2],\n", " [5, 4, 0]])" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# CSC: row_idx, col_ptr\n", "values = [1, 4, 5, 4, 1, 2]\n", "row_id = [0, 1, 3, 3, 1, 2]\n", "col_pt = [0, 3, 4, len(values)]\n", "\n", "A_csc = sps.csc_matrix((values, row_id, col_pt), shape=(4, 3))\n", "A_csc.toarray()" ] }, { "cell_type": "code", "execution_count": 35, "id": "golden-custom", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(array([[1, 0, 0],\n", " [4, 0, 1],\n", " [0, 0, 2],\n", " [5, 4, 0]]),\n", " array([[1, 0, 0],\n", " [4, 0, 1],\n", " [0, 0, 2],\n", " [5, 4, 0]]))" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# another way: use (data, (i, j)) for either CSC/CSR format (convenience method)\n", "# note: values can appear in any order (not necessarily per row / column)\n", "data = [1, 5, 4, 1, 2, 4]\n", "rows = [0, 3, 1, 1, 2, 3]\n", "cols = [0, 0, 0, 2, 2, 1]\n", "B_csr = sps.csr_matrix((data, (rows, cols)), shape=(4, 3))\n", "B_csc = sps.csc_matrix((data, (rows, cols)), shape=(4, 3))\n", "B_csr.toarray(), B_csc.toarray()" ] }, { "cell_type": "markdown", "id": "certified-preserve", "metadata": {}, "source": [ "## Linear algebra in `scipy`\n", "\n", "Dedicated modules:\n", "1. `scipy.sparse.linalg` for linear algebra with sparse matrices\n", "1. `scipy.linalg` for numpy-like linear algebra (many common methods, more fine-grained control)\n", "\n", "**Note**: in `numpy`, `A * B` is elementwise multiplication (with broadcasting). But if `A` or `B` is a scipy sparse array, `*` stands for usual matrix-matrix multiplication. Use `A.multiply(B)` for elementwise multiplication instead." ] }, { "cell_type": "code", "execution_count": 40, "id": "fixed-second", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(array([[1, 0, 1],\n", " [0, 1, 0],\n", " [1, 0, 1],\n", " [0, 1, 0]]),\n", " array([[1, 0, 0],\n", " [4, 0, 1],\n", " [0, 0, 2],\n", " [5, 4, 0]]),\n", " array([[2, 0, 1],\n", " [4, 1, 1],\n", " [1, 0, 3],\n", " [5, 5, 0]]))" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A_sp.toarray(), B_csc.toarray(), (A_sp + B_csc).toarray()" ] }, { "cell_type": "code", "execution_count": 41, "id": "bizarre-impossible", "metadata": {}, "outputs": [ { "ename": "ValueError", "evalue": "dimension mismatch", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mA_sp\u001b[0m \u001b[0;34m*\u001b[0m \u001b[0mB_csc\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m~/.local/lib/python3.9/site-packages/scipy/sparse/base.py\u001b[0m in \u001b[0;36m__mul__\u001b[0;34m(self, other)\u001b[0m\n\u001b[1;32m 477\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0missparse\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mother\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 478\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mshape\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m!=\u001b[0m \u001b[0mother\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mshape\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 479\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mValueError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'dimension mismatch'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 480\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_mul_sparse_matrix\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mother\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 481\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mValueError\u001b[0m: dimension mismatch" ] } ], "source": [ "A_sp * B_csc # this will try matrix multiplication, but shapes don't match" ] }, { "cell_type": "code", "execution_count": 43, "id": "featured-favorite", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[1, 0, 0],\n", " [0, 0, 0],\n", " [0, 0, 2],\n", " [0, 4, 0]])" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# the below will work\n", "C = A_sp.multiply(B_csc) # or B_csc.multiply(A_sp)\n", "C.toarray()" ] }, { "cell_type": "code", "execution_count": 46, "id": "mathematical-satisfaction", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(array([0.2838 , 3.0328217 , 3.79524337, 1.68237336]),\n", " array([0.2838 , 3.0328217 , 3.79524337, 1.68237336]))" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# matrix-vector multiplication\n", "v = np.random.randn(3)\n", "A_csr @ v, A_csr * v # same results" ] }, { "cell_type": "code", "execution_count": 52, "id": "identical-cream", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(array([[1, 0, 0],\n", " [4, 1, 0],\n", " [0, 2, 0],\n", " [5, 0, 4]]),\n", " array([[1, 0, 0],\n", " [4, 1, 0],\n", " [0, 2, 0],\n", " [5, 0, 4]]))" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# matrix-matrix multiplication\n", "V = sps.csc_matrix([[1, 0, 0], [0, 0, 1], [0, 1, 0]])\n", "(A_csr @ V).toarray(), (A_csr * V).toarray() # same results with `@` and `*`" ] }, { "cell_type": "markdown", "id": "violent-calculation", "metadata": {}, "source": [ "The `scipy.sparse.linalg` module offers several implementations of iterative algorithms for solving systems of the form $Ax = b$, least-squares problems, and eigenvalue problems. A full list can be found [here](https://docs.scipy.org/doc/scipy/reference/sparse.linalg.html)." ] }, { "cell_type": "code", "execution_count": 115, "id": "irish-excerpt", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MINRES distance: 8.497616290366594e-06\n" ] } ], "source": [ "import scipy.sparse.linalg as sps_linalg\n", "\n", "# setup a sparse linear system\n", "A = np.random.randint(0, 2, (100, 100))\n", "A = sps.csr_matrix(A + A.T, dtype=np.float64)\n", "b = np.random.rand(100)\n", "\n", "# Example 1: solve using MINRES (for symmetric matrices)\n", "x_minres, status = sps_linalg.minres(A, b, tol=1e-8)\n", "print(\"MINRES distance:\", np.linalg.norm(A @ x_minres - b))" ] }, { "cell_type": "code", "execution_count": 116, "id": "analyzed-folder", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "LSQR error: 1.3225258756063518\n" ] } ], "source": [ "# Example 2: solve noisy system using least squares\n", "m, n = 250, 50\n", "A_ls = np.random.rand(250, 50)\n", "A_ls[A_ls < 0.5] = 0.0\n", "A_ls = sps.csr_matrix(A_ls)\n", "x = np.random.rand(50)\n", "b_ls = A_ls @ x + 0.1 * np.random.randn(250)\n", "\n", "x_ls = sps_linalg.lsqr(A_ls, b_ls)[0]\n", "print(\"LSQR error:\", np.linalg.norm(A_ls @ x_ls - b_ls))" ] }, { "cell_type": "code", "execution_count": 125, "id": "living-audience", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([-13.59685714, -13.02633564, -12.66742863, -11.95097689,\n", " -11.79172183, 12.02290524, 12.35624536, 13.53789273,\n", " 14.55214696, 102.60929514])" ] }, "execution_count": 125, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Example 3: Compute the eigenvectors and eigenvalues of a sparse matrix\n", "# Use which='LM' to get largest magnitude eigenvalues\n", "λ, V = sps_linalg.eigsh(A, k=10, which='LM') # k=6 by default\n", "λ" ] }, { "cell_type": "code", "execution_count": null, "id": "smooth-volunteer", "metadata": {}, "outputs": [], "source": [ "# use which='LA' to get algebraically largest eigenvalues\n", "λ_a, V_a = sps.linalg.eigsh(A, k=10, which='LA')\n", "λ_a" ] }, { "cell_type": "code", "execution_count": 129, "id": "bridal-hostel", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((100, 10), (100, 10))" ] }, "execution_count": 129, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# V and V_a have the 10 eigenvectors as their columns\n", "V.shape, V_a.shape" ] }, { "cell_type": "markdown", "id": "aquatic-ribbon", "metadata": {}, "source": [ "### Application: Spectral Clustering\n", "\n", "In spectral clustering, we want a partition a graph containing nodes and edges that connect each other to a set of distinct \"communities\". For simplicity, we treat the undirected setting.\n", "\n", "Formally, our graph is represented by an *adjacency matrix* $A \\in \\mathbb{R}^{n \\times n}$, such that\n", "\n", "$$\n", "A_{ij} = \\begin{cases}\n", "1, & \\text{ if node $i$ is connected to node $j$}, \\\\\n", "0, & \\text{ otherwise}\n", "\\end{cases}\n", "$$\n", "\n", "Here we are implicitly assuming that nodes are numbered from $1$ to $n$. Moreover, let $D$ be a diagonal matrix such that\n", "$D_{ii} = \\sum_{j=1}^n A_{ij}$, i.e., the number of neighbors of node $i$. We then form the *normalized adjacency matrix*:\n", "\n", "$$\n", "A_N = D^{-1/2} A D^{-1/2}.\n", "$$\n", "\n", "Here, $A_N$ is \"weighting\" each edge $(i, j)$ by the harmonic mean of the number of neighbors of $i$ and $j$. Spectral clustering computes the top-$k$ eigenvectors, $V \\in \\mathbb{R}^{n \\times k}$, and treats each row, $V_{i}$, as the $k$-dimensional \"feature\" of node $i$. **These features are then fed to a point clustering algorithm such as k-means.**" ] }, { "cell_type": "markdown", "id": "engaged-romantic", "metadata": {}, "source": [ "The graph we will build here is called a Stochastic Block Model (SBM) with $k$ communities. It is a random graph where each node is randomly assigned to one of the $k$ communities. Then, for each pair of nodes $i$ and $j$:\n", "\n", "- if $i$ and $j$ belong to the same community, they are connected with an edge with probability $p$.\n", "- if $i$ and $j$ belong to different communities, they are connected with an edge with probability $q \\ll p$.\n", "\n", "The functions below generate the (sparse) adjacency matrix of an instance of an SBM." ] }, { "cell_type": "code", "execution_count": 38, "id": "stretch-locking", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import random\n", "import scipy.sparse as sps\n", "\n", "def gen_sbm(n, k, p, q):\n", " \"\"\"\n", " Generate a k-community SBM where every node has the same probability\n", " of belonging to either community.\n", " \n", " Arguments\n", " ---------\n", " n : int\n", " The number of nodes\n", " k : int\n", " The number of communities\n", " p, q : float\n", " The probability of intra-group and inter-group edges, respectively.\n", " \n", " Returns\n", " -------\n", " communities : list\n", " A list containing the community of each node\n", " A : scipy.sparse.spmatrix\n", " The adjacency matrix of the graph\n", " \"\"\"\n", " A = sps.dok_matrix((n, n), dtype=np.int)\n", " communities = [0] * n\n", " # generate community memberships\n", " for i in range(n):\n", " communities[i] = random.randint(0, k-1)\n", " # generate edges\n", " # NOTE: random.random() <= x is event with probability x if 0 < x < 1\n", " for i in range(n):\n", " for j in range(i+1, n): # no self-loops\n", " if communities[i] == communities[j]:\n", " A[i, j] = random.random() <= p # automatically cast to 1 if True\n", " else:\n", " A[i, j] = random.random() <= q\n", " A[j, i] = A[i, j] # symmetry\n", " # return membership and adjacency matrix\n", " return communities, A.tocsr()" ] }, { "cell_type": "markdown", "id": "varying-headline", "metadata": {}, "source": [ "We now need some utility functions for forming the normalized adjacency matrix and computing its top-$k$ eigenvectors. Since we are dealing with sparse matrices, we will use the linear algebra routines found under the `sparse` module.\n", "\n", "Recall that for the normalized adjacency matrix we want to form $A_N = D^{-1/2} A D^{-1/2}$, where $D_{ii} = \\sum_{j=1}^n A_{ij}$." ] }, { "cell_type": "code", "execution_count": 130, "id": "ready-darkness", "metadata": {}, "outputs": [], "source": [ "def normalized_adjacency(A):\n", " \"\"\"\n", " Given an adjacency matrix `A`, convert it to a normalized adjacency matrix.\n", " \n", " Arguments\n", " ---------\n", " A : scipy.sparse.spmatrix\n", " The adjacency matrix\n", " \n", " Returns\n", " -------\n", " scipy.sparse.spmatrix\n", " The normalized adjacency matrix\n", " \"\"\"\n", " n = A.shape[0]\n", " d = A * np.ones(n, dtype=np.int) # sum_{j} A_{i,j} for each i\n", " Dinv = sps.diags(1 / np.sqrt(d)) # create D^(-1/2)\n", " return Dinv * A * Dinv # normalized adjacency matrix\n", "\n", "\n", "def top_k_evecs(A, k):\n", " \"\"\"\n", " Compute the top-k (largest in magnitude) eigenvectors of a symmetric matrix `A`.\n", " \"\"\"\n", " _, evecs = sps_linalg.eigsh(A, k)\n", " return evecs" ] }, { "cell_type": "markdown", "id": "healthy-wales", "metadata": {}, "source": [ "We now create an SBM with $k = 2$ communities and $n = 100$ nodes and use the k-means algorithm to classify each node to a community using its coefficients in the top-$k$ eigenvectors as node features. The k-means algorithm is also available in `scipy` in the `cluster.vq` module.\n", "\n", "**Note**: below we use `kmeans2`, which returns the list of assigned labels, instead of `kmeans`, which only returns the centroids found by the k-means algorithm." ] }, { "cell_type": "code", "execution_count": 131, "id": "appropriate-technology", "metadata": {}, "outputs": [], "source": [ "from scipy.cluster.vq import kmeans2\n", "\n", "communities, A = gen_sbm(100, 2, 0.25, 0.05)\n", "V = top_k_evecs(normalized_adjacency(A), 2)\n", "_, labels = kmeans2(V, 2)" ] }, { "cell_type": "markdown", "id": "ready-recipe", "metadata": {}, "source": [ "At this point, note that the labels we generated could be permuted arbitrarily (i.e. we could make all `0`s into `1`s and vice-versa) without changing the community structure. Therefore, when computing the accuracy of our cluster assignment, we should consider both versions. This is reflected in the evaluation function below." ] }, { "cell_type": "code", "execution_count": 132, "id": "incorrect-segment", "metadata": {}, "outputs": [], "source": [ "def cluster_acc(true_labels, pred_labels):\n", " return max(np.mean(true_labels == pred_labels), np.mean(true_labels == 1 - pred_labels))" ] }, { "cell_type": "code", "execution_count": 133, "id": "animal-cedar", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.0" ] }, "execution_count": 133, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cluster_acc(communities, labels)" ] }, { "cell_type": "markdown", "id": "loved-helmet", "metadata": {}, "source": [ "### References\n", "\n", "[Scipy Lecture Notes](" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.1" } }, "nbformat": 4, "nbformat_minor": 5 }