{ "cells": [ { "cell_type": "markdown", "id": "shaped-establishment", "metadata": {}, "source": [ "# Introduction\n", "\n", "This notebook presents a number of data science applications using Python. We start with a short intro to numpy." ] }, { "cell_type": "markdown", "id": "offensive-enterprise", "metadata": {}, "source": [ "## 1. Intro to numpy\n", "\n", "NumPy introduces a the `ndarray` object, an $n$-dimensional array implementation that allows for efficient linear algebra operations, \"fancy\" indexing, and MATLAB-like broadcasting. It also provides a number of utility functions for e.g. generating random vectors / arrays. Before proceeding, make sure you have Python 3.5+ installed, and install numpy using `pip` or `conda`:\n", "\n", "```bash\n", "pip install --user numpy # or: pip3 install --user numpy\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "average-clinton", "metadata": {}, "outputs": [], "source": [ "import numpy as np # standard way to import\n", "\n", "# ways to create vectors\n", "v1 = np.array([0, 1, 2, 3, 4]) # 1. wrap np.array around a list\n", "v2 = np.arange(5) # 2. make a range (0 - 4)\n", "v3 = np.linspace(0, 2 * np.pi, 50) # 3. 50 equispaced points between 0 and 2π\n", "\n", "# ways to create matrices\n", "A1 = np.array([\n", " [1, 2, 3, 4],\n", " [0, 3, 6, 9],\n", " [3, 4, 5, 6]]) # 1. explicitly from list\n", "A2 = np.ones((3, 4)) # 2. 3x4 ones matrix\n", "A3 = np.ones((3, 4), dtype=np.int) # 3. use dtype to force a type\n", "A4 = np.zeros((5, 4)) # 4. same with zeros\n", "A5 = np.eye(10) # 5. 10x10 identity matrix\n", "A1.T # transpose of A1" ] }, { "cell_type": "markdown", "id": "certified-preserve", "metadata": {}, "source": [ "The operators `+`, `-`, `*` and `/` in NumPy do **broadcasting**. If `A` and `B` have the same shape, `A * B` is the elementwise (Hadamard) product. If not, NumPy *attempts* to broadcast the smaller array across the larger array so that the operation \"makes sense\"." ] }, { "cell_type": "code", "execution_count": null, "id": "fixed-second", "metadata": {}, "outputs": [], "source": [ "A1 + A2" ] }, { "cell_type": "code", "execution_count": null, "id": "confused-illness", "metadata": {}, "outputs": [], "source": [ "A1 * A2" ] }, { "cell_type": "code", "execution_count": null, "id": "worst-thread", "metadata": {}, "outputs": [], "source": [ "# the example below will \"broadcast\" v across all rows of A1 and add them\n", "v = np.array([2, 4, 6, 8])\n", "A1 + v" ] }, { "cell_type": "code", "execution_count": null, "id": "complete-selection", "metadata": {}, "outputs": [], "source": [ "A1 / v, A1 * v" ] }, { "cell_type": "markdown", "id": "vulnerable-colors", "metadata": {}, "source": [ "Since `*` operates elementwise, matrix-vector and matrix-matrix products are implemented using the `np.dot` function, which can also be written using the `@` infix operator." ] }, { "cell_type": "code", "execution_count": null, "id": "greek-excuse", "metadata": {}, "outputs": [], "source": [ "# np.random.randn(m, n) creates an mxn matrix with i.i.d. Gaussian entries\n", "A = np.random.randn(4, 2)\n", "B = np.random.randn(2, 3)\n", "v = np.random.randn(2) # 2d vector\n", "\n", "A @ B, A @ v # or: np.dot(A, B), np.dot(A, v)" ] }, { "cell_type": "markdown", "id": "leading-ecuador", "metadata": {}, "source": [ "Vectors in NumPy have a single axis, interpreted as the \"row\" axis. To create a column vector, you must create a $k \\times 1$ array." 
] }, { "cell_type": "code", "execution_count": null, "id": "pregnant-tobago", "metadata": {}, "outputs": [], "source": [ "# method 1: direct\n", "v_col1 = np.array([[1], [2], [3]])\n", "v_col1" ] }, { "cell_type": "code", "execution_count": null, "id": "whole-avenue", "metadata": {}, "outputs": [], "source": [ "# method 2: create a vector and *reshape*\n", "v_col2 = np.array([1, 2, 3]).reshape(-1, 1)\n", "v_col2" ] }, { "cell_type": "code", "execution_count": null, "id": "first-conservation", "metadata": {}, "outputs": [], "source": [ "A1, v_col1, A1 + v_col1" ] }, { "cell_type": "markdown", "id": "institutional-dylan", "metadata": {}, "source": [ "### Loops vs. vectorization\n", "\n", "NumPy allows for operations to be **vectorized** via a concept called *universal functions* (`ufunc`s). A universal function is roughly a NumPy function that defers to compiled code, leading to much faster execution. Array arithmetic, as shown above, uses universal functions under the hood." ] }, { "cell_type": "code", "execution_count": null, "id": "great-prerequisite", "metadata": {}, "outputs": [], "source": [ "example_vec = np.random.rand(10000)\n", "\n", "%timeit -n 1000 sum(example_vec) # \"slow\" sum, Python version" ] }, { "cell_type": "code", "execution_count": null, "id": "double-alias", "metadata": {}, "outputs": [], "source": [ "%timeit -n 1000 np.sum(example_vec) # fast sum" ] }, { "cell_type": "code", "execution_count": null, "id": "welsh-discharge", "metadata": {}, "outputs": [], "source": [ "# another example\n", "from math import sin\n", "\n", "def slow_sin(v):\n", " return [sin(i) for i in v]\n", "\n", "def fast_sin(v):\n", " return np.sin(v)\n", "\n", "%timeit -n 1000 slow_sin(example_vec)" ] }, { "cell_type": "code", "execution_count": null, "id": "alive-technician", "metadata": {}, "outputs": [], "source": [ "%timeit -n 1000 fast_sin(example_vec)" ] }, { "cell_type": "code", "execution_count": null, "id": "noble-label", "metadata": {}, "outputs": [], "source": [ "# don't do this...\n", "def slow_func(v):\n", " res = [0] * len(v)\n", " for idx, elt in enumerate(v):\n", " res[idx] = elt ** 2 + 2 * elt + 1 - min(elt, 0.5)\n", " return res\n", "\n", "# ... but do this instead\n", "def fast_fun(v):\n", " q = np.array(v); return q ** 2 + 2 * q + 1 - np.minimum(q, 0.5)\n", "\n", "%timeit -n 100 slow_func(example_vec)" ] }, { "cell_type": "code", "execution_count": null, "id": "banner-variable", "metadata": {}, "outputs": [], "source": [ "%timeit -n 100 fast_fun(example_vec)" ] }, { "cell_type": "markdown", "id": "criminal-effect", "metadata": {}, "source": [ "### Indexing and Slicing\n", "\n", "One-dimensional NumPy arrays behave similarly to Python lists." ] }, { "cell_type": "code", "execution_count": null, "id": "chinese-nursing", "metadata": {}, "outputs": [], "source": [ "v1 = np.arange(10)\n", "v1[:3], v1[3:], v1[2:5], v1[2:8:2], v1[8:2:-1]" ] }, { "cell_type": "markdown", "id": "humanitarian-wonder", "metadata": {}, "source": [ "Multidimensional NumPy arrays have one index per axis. If an index is omitted, it is treated as a *complete* slice of the axis." 
] }, { "cell_type": "code", "execution_count": null, "id": "second-garlic", "metadata": {}, "outputs": [], "source": [ "A = np.random.randint(1, 10, (4, 5))\n", "A" ] }, { "cell_type": "code", "execution_count": null, "id": "equivalent-amendment", "metadata": {}, "outputs": [], "source": [ "A[:3, :2], A[2:], A[2, :], A[:, 3]" ] }, { "cell_type": "code", "execution_count": null, "id": "different-bristol", "metadata": {}, "outputs": [], "source": [ "# note: A[2, :] is the same as A[2,]\n", "print(A[2, :] - A[2,])\n", "# unfortunately A[,3] is not syntactically valid and cannot be used in place of A[:,3]" ] }, { "cell_type": "markdown", "id": "fantastic-treasurer", "metadata": {}, "source": [ "More advanced indexing and slicing techniques allow you to use e.g. logical conditions to index a matrix." ] }, { "cell_type": "code", "execution_count": null, "id": "drawn-tract", "metadata": {}, "outputs": [], "source": [ "A, A[A > 0.5], A[A != 0]" ] }, { "cell_type": "code", "execution_count": null, "id": "searching-milton", "metadata": {}, "outputs": [], "source": [ "X = np.random.randn(10, 5)\n", "Y = np.random.choice([-1, 1], (10,))\n", "X, Y" ] }, { "cell_type": "code", "execution_count": null, "id": "secure-grant", "metadata": {}, "outputs": [], "source": [ "X[Y > 0] # keep X[i] if Y[i] > 0" ] }, { "cell_type": "code", "execution_count": null, "id": "german-bailey", "metadata": {}, "outputs": [], "source": [ "X[Y < 0] = 0.0 # set all elements in rows i such that Y[i] < 0 to zero\n", "X" ] }, { "cell_type": "code", "execution_count": null, "id": "exposed-anthony", "metadata": {}, "outputs": [], "source": [ "X[Y < 0] = np.ones(5) # make X[i] = [1, ..., 1] if Y[i] < 0\n", "X, Y" ] }, { "cell_type": "markdown", "id": "interracial-timothy", "metadata": {}, "source": [ "### Reshaping and flattening\n", "\n", "You can also manipulate the shape of a numpy array, by passing a tuple indicating the desired shape to `np.reshape`. You can also use `-1` for an axis to let NumPy infer the shape for that axis.\n", "\n", "Similarly, you can flatten an NumPy array `A` using `A.flatten()` or `A.ravel()` (or `numpy.ravel(A)`). The difference between the two is that `A.ravel()` will return a *view* instead of creating a new object. An example demonstrating this difference is shown below." ] }, { "cell_type": "code", "execution_count": null, "id": "cubic-copper", "metadata": {}, "outputs": [], "source": [ "Br = np.arange(20).reshape(5, 4)\n", "Br" ] }, { "cell_type": "code", "execution_count": null, "id": "gross-southwest", "metadata": {}, "outputs": [], "source": [ "Br.reshape(4, 5)" ] }, { "cell_type": "code", "execution_count": null, "id": "helpful-apache", "metadata": {}, "outputs": [], "source": [ "Br.reshape(4, -1), Br.reshape(-1, 4)" ] }, { "cell_type": "code", "execution_count": null, "id": "unnecessary-shell", "metadata": {}, "outputs": [], "source": [ "# Example demonstrating difference between flatten() and ravel()\n", "\n", "Br1 = np.copy(Br)\n", "Br2 = np.copy(Br)\n", "\n", "flattened = Br1.flatten()\n", "raveled = Br2.ravel()\n", "\n", "flattened, raveled" ] }, { "cell_type": "code", "execution_count": null, "id": "subtle-emergency", "metadata": {}, "outputs": [], "source": [ "# flattened is a new object made from Br1. Editing it should not affect Br1\n", "flattened[1] = 1000\n", "Br1" ] }, { "cell_type": "code", "execution_count": null, "id": "normal-aluminum", "metadata": {}, "outputs": [], "source": [ "# raveled is a *view* into Br2. 
] }, { "cell_type": "markdown", "id": "interracial-timothy", "metadata": {}, "source": [ "### Reshaping and flattening\n", "\n", "You can manipulate the shape of a NumPy array by passing a tuple indicating the desired shape to `np.reshape`. You can also use `-1` for an axis to let NumPy infer the shape for that axis.\n", "\n", "Similarly, you can flatten a NumPy array `A` using `A.flatten()` or `A.ravel()` (or `numpy.ravel(A)`). The difference between the two is that `A.flatten()` always returns a new copy, while `A.ravel()` returns a *view* into the original array whenever possible. An example demonstrating this difference is shown below." ] }, { "cell_type": "code", "execution_count": null, "id": "cubic-copper", "metadata": {}, "outputs": [], "source": [ "Br = np.arange(20).reshape(5, 4)\n", "Br" ] }, { "cell_type": "code", "execution_count": null, "id": "gross-southwest", "metadata": {}, "outputs": [], "source": [ "Br.reshape(4, 5)" ] }, { "cell_type": "code", "execution_count": null, "id": "helpful-apache", "metadata": {}, "outputs": [], "source": [ "Br.reshape(4, -1), Br.reshape(-1, 4)" ] }, { "cell_type": "code", "execution_count": null, "id": "unnecessary-shell", "metadata": {}, "outputs": [], "source": [ "# Example demonstrating difference between flatten() and ravel()\n", "\n", "Br1 = np.copy(Br)\n", "Br2 = np.copy(Br)\n", "\n", "flattened = Br1.flatten()\n", "raveled = Br2.ravel()\n", "\n", "flattened, raveled" ] }, { "cell_type": "code", "execution_count": null, "id": "subtle-emergency", "metadata": {}, "outputs": [], "source": [ "# flattened is a new object made from Br1. Editing it should not affect Br1\n", "flattened[1] = 1000\n", "Br1" ] }, { "cell_type": "code", "execution_count": null, "id": "normal-aluminum", "metadata": {}, "outputs": [], "source": [ "# raveled is a *view* into Br2. Editing it affects Br2!\n", "raveled[1] = 1000\n", "Br2" ] }, { "cell_type": "markdown", "id": "otherwise-cherry", "metadata": {}, "source": [ "### Linear algebra in NumPy\n", "\n", "NumPy offers standard linear algebra methods, such as determinants, traces, and eigenvalue / singular value decompositions, among others." ] }, { "cell_type": "code", "execution_count": null, "id": "alone-spelling", "metadata": {}, "outputs": [], "source": [ "import numpy.linalg as linalg\n", "\n", "C = np.random.rand(5, 5) # create a random 5x5 matrix\n", "det_C = linalg.det(C)\n", "tr_C = C.trace() # or np.trace(C)\n", "\n", "# get eigenvalue decomposition of C (may be complex for nonsymmetric C)\n", "eval_C, evec_C = linalg.eig(C)\n", "\n", "# make a symmetric matrix:\n", "C_sym = C + C.T\n", "eval_Csym, evec_Csym = linalg.eigh(C_sym) # use eigh for symmetric matrices\n", "\n", "eval_C, eval_Csym" ] }, { "cell_type": "code", "execution_count": null, "id": "novel-standard", "metadata": {}, "outputs": [], "source": [ "# other types of decompositions\n", "\n", "C_rect = np.random.rand(5, 4)\n", "\n", "# 1. SVD (note: svd returns V transposed, conventionally called Vh)\n", "U_full, S, Vh_full = linalg.svd(C_rect)\n", "U_econ, _, Vh_econ = linalg.svd(C_rect, full_matrices=False)\n", "\n", "print(\"U shapes:\", U_full.shape, U_econ.shape)\n", "print(\"Vh shapes:\", Vh_full.shape, Vh_econ.shape)" ] }, { "cell_type": "code", "execution_count": null, "id": "literary-charm", "metadata": {}, "outputs": [], "source": [ "# 2. QR decomposition\n", "Q, R = linalg.qr(C_rect)\n", "print(\"Q.T @ Q == I:\", np.allclose(Q.T @ Q, np.eye(Q.shape[1])))\n", "print(\"R upper triangular:\", np.allclose(np.triu(R) - R, 0))" ] }, { "cell_type": "markdown", "id": "approved-nerve", "metadata": {}, "source": [ "### Further reading\n", "\n", "Advanced Indexing: https://numpy.org/doc/1.20/user/quickstart.html#advanced-indexing-and-index-tricks\n", "\n", "Linear Algebra in Numpy: https://numpy.org/doc/1.20/user/quickstart.html#linear-algebra\n", "\n", "NumPy for MATLAB users: https://numpy.org/doc/1.20/user/numpy-for-matlab-users.html" ] }, { "cell_type": "markdown", "id": "apparent-terrorist", "metadata": {}, "source": [ "## Application 1: Linear Regression\n", "\n", "Suppose you are given a set of observations (here $\\langle \\cdot, \\cdot \\rangle$ denotes the dot product):\n", "\n", "$$\n", "y_i = \\langle \\beta^{\\star}, x_i \\rangle + \\epsilon_i, \\; i = 1, \\dots, m,\n", "$$\n", "\n", "where $\\beta^{\\star} \\in \\mathbb{R}^d$ is some unknown parameter and $\\epsilon_i$ is noise, but $(y_i, x_i)$ are observed.\n", "\n", "If the $\\epsilon_i$ are i.i.d. Gaussian, maximum-likelihood estimation of $\\beta^{\\star}$ reduces to solving the following least-squares problem:\n", "\n", "$$\n", "\\min_{\\beta} \\frac{1}{2m} \\sum_{i=1}^m \\left(y_i - \\langle \\beta, x_i \\rangle\\right)^2 \\equiv\n", "\\min_{\\beta} \\frac{1}{2m} \\| X \\beta - y \\|_2^2,\n", "$$\n", "\n", "where $X$ is an $m \\times d$ matrix containing each $x_i$ as one of its rows and $y$ is an $m \\times 1$ vector containing all observations $y_i$. Below, we first walk through a tiny warm-up example, then set up a synthetic instance and solve it using tools from `numpy`."
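] }, { "cell_type": "markdown", "id": "lstsq-warmup", "metadata": {}, "source": [ "As a warm-up, a minimal sketch (on a hand-made $3 \\times 2$ instance; the names `X_small` and `y_small` are illustrative) of two standard ways to solve least squares in NumPy:" ] }, { "cell_type": "code", "execution_count": null, "id": "lstsq-warmup-code", "metadata": {}, "outputs": [], "source": [ "X_small = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])\n", "y_small = np.array([1.0, 2.0, 2.9])\n", "\n", "# 1. via lstsq (numerically stable; works even when X is rank-deficient)\n", "β_lstsq = linalg.lstsq(X_small, y_small, rcond=None)[0]\n", "\n", "# 2. via the normal equations X.T X β = X.T y\n", "β_normal = linalg.solve(X_small.T @ X_small, X_small.T @ y_small)\n", "\n", "β_lstsq, β_normal # should agree up to numerical error"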
] }, { "cell_type": "code", "execution_count": null, "id": "quantitative-remove", "metadata": {}, "outputs": [], "source": [ "# set up number of observations and dimension\n", "m, d = 1000, 100\n", "X = np.random.rand(m, d) # x_i's are uniform in [0, 1]\n", "β = np.random.choice([-1, 1], d) # β* is a vector in {-1, 1}^d" ] }, { "cell_type": "code", "execution_count": null, "id": "extensive-strengthening", "metadata": {}, "outputs": [], "source": [ "# setup the least squares problem and measure errors for varying levels of noise\n", "%matplotlib inline\n", "import numpy.linalg as linalg\n", "import matplotlib.pyplot as plt\n", "\n", "sq_errors = [0] * 10\n", "for idx, σ in enumerate(np.linspace(0.1, 1.0, 10)):\n", " y = X @ β + σ * np.random.randn(m)\n", " β_ls, res = linalg.lstsq(X, y, rcond=None)[:2] # solve for β_ls\n", " sq_errors[idx] = res / (2 * m)" ] }, { "cell_type": "code", "execution_count": null, "id": "physical-hacker", "metadata": {}, "outputs": [], "source": [ "plt.plot(np.linspace(0.1, 1.0, 10), sq_errors)\n", "plt.ylabel(\"MSE\")\n", "plt.xlabel(\"σ\")\n", "plt.title(\"MSE vs. noise for linear regression instances\")" ] }, { "cell_type": "markdown", "id": "geological-liberal", "metadata": {}, "source": [ "### Regularization\n", "\n", "When the matrix $X$ is ill-conditioned or the system is underdetermined (i.e. $m < d$), it is common to add a regularization term to the problem. Regularization can (i) improve the conditioning of the problem, and (ii) prevent overfitting (e.g. when there are less samples than features).\n", "\n", "Here, we discuss Tikhonov regularization, which amounts to adding a squared norm penalty to the least squares objective:\n", "\n", "$$\n", "\\min_{\\beta} \\frac{1}{2m} \\| X \\beta - y \\|_2^2 + \\lambda \\| \\Gamma \\beta \\|_2^2.\n", "$$\n", "\n", "In particular, when $\\Gamma = I_d$ we recover the well-known instance of *ridge regression*. Writing down the optimality conditions, we arrive at\n", "\n", "$$\n", "\\left( \\frac{1}{2m} X^{\\mathsf{T}} X + \\lambda I \\right) \\beta = \\frac{1}{2m} X^{\\mathsf{T}} y\n", "$$\n", "\n", "Below we show two ways to solve the regularized problem: one uses `numpy.linalg.solve` to solve the above system, which is guaranteed to have a unique solution when $\\lambda > 0$, while the other uses `scikit-learn`'s `Ridge` model. You will want to install `scikit-learn` for the second snippet:\n", "\n", "```bash\n", "pip install --user scikit-learn # or pip3 install...\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "elder-community", "metadata": {}, "outputs": [], "source": [ "# Step 1: setup train, test set\n", "m, d = 200, 2000\n", "X = np.random.rand(m, d)\n", "β = np.random.choice([-1, 1], d)\n", "y = X @ β + np.random.randn(m) # noise with variance 0.01" ] }, { "cell_type": "code", "execution_count": null, "id": "exposed-basic", "metadata": {}, "outputs": [], "source": [ "# Method 1: Pure Numpy\n", "λ = 1.0\n", "β_LS = linalg.solve(X.T @ X + λ * np.eye(d), X.T @ y)\n", "\n", "# Method 2: sklearn\n", "from sklearn.linear_model import Ridge\n", "clf = Ridge(alpha=λ, solver='cholesky')\n", "clf.fit(X, y)\n", "linalg.norm(clf.coef_ - β_LS)" ] }, { "cell_type": "markdown", "id": "residential-relative", "metadata": {}, "source": [ "## Application 2: Logistic regression (with `sklearn`)\n", "\n", "Logistic regression solves a *classification* problem. 
] }, { "cell_type": "markdown", "id": "residential-relative", "metadata": {}, "source": [ "## Application 2: Logistic regression (with `sklearn`)\n", "\n", "Logistic regression solves a *classification* problem. Consider a simple scenario, where we have\n", "\n", "$$\n", "y_i = f(\\beta_0 + \\langle \\beta^{\\star}, x_i \\rangle + \\epsilon_i), \\quad f: \\mathbb{R} \\to \\{0, 1\\}.\n", "$$\n", "\n", "When $y_i = 0$, we say that $x_i$ belongs to class $0$, and to class $1$ otherwise. In logistic regression, we assume that each $y_i$ is $1$ with some probability depending on $x_i$ and the parameters $\\theta = (\\beta_0, \\beta^{\\star})$ through the *logistic function*:\n", "\n", "$$\n", "\\mathbb{P}(Y_i = 1 \\mid X_i, \\theta) = \\frac{1}{1 + e^{-(\\beta_0 + \\langle \\beta^{\\star}, x_i \\rangle)}},\n", "$$\n", "\n", "and wish to maximize the log-likelihood assuming that all observations are independent and follow a Bernoulli distribution with parameter as above.\n", "\n", "Below, we will use `numpy` to set up a synthetic example and use the `scikit-learn` library's built-in logistic regression classifier." ] }, { "cell_type": "code", "execution_count": null, "id": "conservative-lounge", "metadata": {}, "outputs": [], "source": [ "# Step 1: set up dimensions and samples\n", "m, d = 1000, 100\n", "β = np.random.choice([-1, 1], d) # β is uniform in {-1, 1}^d\n", "X = np.random.rand(m, d) # each X_i is uniform in [0, 1]^d\n", "ϵ = np.random.randn(m)\n", "β_0 = 0.5 # offset" ] }, { "cell_type": "code", "execution_count": null, "id": "nasty-kinase", "metadata": {}, "outputs": [], "source": [ "# set up class indicators\n", "Y = np.zeros(m)\n", "ips = X @ β + ϵ\n", "Y[ips > β_0] = 1 # all other Y's are already 0" ] }, { "cell_type": "code", "execution_count": null, "id": "pointed-entry", "metadata": {}, "outputs": [], "source": [ "# set up logistic regression from scikit-learn\n", "from sklearn.linear_model import LogisticRegression\n", "\n", "# regularization, class weights, etc. are constructor options\n", "# note: newer scikit-learn versions (>= 1.2) use penalty=None instead of 'none'\n", "classifier = LogisticRegression(penalty='none')\n", "classifier.fit(X, Y) # .fit(X, Y) \"trains\" the classifier\n", "\n", "# mean accuracy on training set\n", "classifier.score(X, Y)" ] }, { "cell_type": "code", "execution_count": null, "id": "exterior-popularity", "metadata": {}, "outputs": [], "source": [ "# generate new test data (from the same distribution as the training data)\n", "X_test = np.random.rand(250, d)\n", "Y_test = np.zeros(250)\n", "Y_test[X_test @ β + np.random.randn(250) > β_0] = 1.0\n", "\n", "# mean accuracy on test data\n", "classifier.score(X_test, Y_test)" ] }, { "cell_type": "code", "execution_count": null, "id": "wound-absence", "metadata": {}, "outputs": [], "source": [ "# let's train a classifier with regularization (C is the *inverse* regularization strength)\n", "classifier_penalty = LogisticRegression(C=2.5)\n", "classifier_penalty.fit(X, Y)\n", "\n", "# mean accuracy on training and test set\n", "print(\"Mean accuracy on training set:\", classifier_penalty.score(X, Y))\n", "print(\"Mean accuracy on test set:\", classifier_penalty.score(X_test, Y_test))" ] }, { "cell_type": "markdown", "id": "useful-restriction", "metadata": {}, "source": [ "## Application 3: Using the QR decomposition for solving multiple linear systems\n", "\n", "Suppose you want to solve a sequence of linear systems of the form\n", "\n", "$$\n", "A x_t = b_t, \\quad t = 1, 2, \\dots, \\quad A \\in \\mathbb{R}^{n \\times n},\n", "$$\n", "\n", "where $A$ remains fixed for all $t$ but $b_t$ and the solution $x_t$ may change. Let $A = QR$ be the QR decomposition of $A$, where $Q^{\\mathsf{T}} Q = I$ and $R$ is upper triangular, and observe that\n", "\n", "$$\n", "A x_t = b_t \\Leftrightarrow Q R x_t = b_t \\Leftrightarrow Q^{\\mathsf{T}} Q R x_t = Q^{\\mathsf{T}} b_t\n", "\\Leftrightarrow R x_t = \\underbrace{Q^{\\mathsf{T}} b_t}_{\\tilde{b}_t}.\n", "$$\n", "\n", "Since $R$ is upper triangular, we can solve for $x_t$ by back substitution in $O(n^2)$ (instead of $O(n^3)$) time per system; the $O(n^3)$ cost of computing the QR decomposition is paid only once. To take advantage of this, we will use `scipy`, which provides a subroutine for solving triangular linear systems:\n", "\n", "```bash\n", "pip install --user scipy\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "intensive-german", "metadata": {}, "outputs": [], "source": [ "# scipy provides a solver for triangular systems\n", "from scipy.linalg import solve_triangular\n", "\n", "def solve_nocache(A):\n", "    for t in range(100):\n", "        b_t = np.random.rand(A.shape[0])\n", "        x_t = linalg.solve(A, b_t)\n", "\n", "\n", "def solve_cache(A):\n", "    Q, R = linalg.qr(A)\n", "    for t in range(100):\n", "        b_t = np.random.rand(A.shape[0])\n", "        x_t = solve_triangular(R, Q.T @ b_t)\n", "\n", "A = np.random.randn(250, 250) # full rank with high probability\n", "\n", "%timeit -n 25 solve_cache(A)" ] }, { "cell_type": "code", "execution_count": null, "id": "bridal-thousand", "metadata": {}, "outputs": [], "source": [ "%timeit -n 25 solve_nocache(A)"
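] }, { "cell_type": "markdown", "id": "qr-sanity-note", "metadata": {}, "source": [ "As a sanity check (a small verification sketch; `b` below is an arbitrary right-hand side), the cached QR route should produce the same solution as a direct solve:" ] }, { "cell_type": "code", "execution_count": null, "id": "qr-sanity-code", "metadata": {}, "outputs": [], "source": [ "b = np.random.rand(A.shape[0])\n", "Q, R = linalg.qr(A)\n", "\n", "x_qr = solve_triangular(R, Q.T @ b)\n", "x_direct = linalg.solve(A, b)\n", "\n", "np.allclose(x_qr, x_direct)"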
] }, { "cell_type": "markdown", "id": "nutritional-spectrum", "metadata": {}, "source": [ "## Application 4: Principal Components Analysis\n", "\n", "Principal Component Analysis (PCA) is a staple method in data science, used for exploratory data analysis, dimensionality reduction, etc. Suppose we are given $m$ samples in $\\mathbb{R}^d$, arranged in a matrix $X \\in \\mathbb{R}^{m \\times d}$. We are looking to project our samples to a lower-dimensional linear subspace $V$:\n", "\n", "* $\\mathrm{dim}(V) = r \\ll d$\n", "* the subspace $V$ is the \"best\" linear subspace, in the sense that it maximizes the variance of the projected points among all linear subspaces of the same dimension.\n", "\n", "Principal Component Analysis does the following:\n", "\n", "1. First, it subtracts the empirical mean from all samples, so that their empirical mean is $0$.\n", "1. Then, it \"standardizes\" the data by scaling each column so that its standard deviation is $1$.\n", "1. Next, it computes the eigenvectors corresponding to the $r$ largest eigenvalues of the empirical covariance matrix:\n", " \n", " $$\n", " \\Sigma_m = \\frac{1}{m - 1} \\tilde{X}^{\\mathsf{T}} \\tilde{X},\n", " $$\n", " \n", " where $\\tilde{X}$ is the resulting matrix after the first two preprocessing steps.\n", "1. Putting the eigenvectors in a matrix $V \\in \\mathbb{R}^{d \\times r}$, the projected dataset is $X \\mapsto X V$." ] }, { "cell_type": "markdown", "id": "ceramic-harvey", "metadata": {}, "source": [ "To demonstrate, we give a NumPy function below that automates all these steps." ] }, { "cell_type": "code", "execution_count": null, "id": "embedded-balloon", "metadata": {}, "outputs": [], "source": [ "def numpy_pca(X, r):\n", "    \"\"\"\n", "    Arguments\n", "    ---------\n", "    X : numpy.array\n", "        An `m x d` array containing the data\n", "    r : int\n", "        The number of principal components\n", "    \n", "    Returns\n", "    -------\n", "    numpy.array\n", "        The projected dataset.\n", "    \"\"\"\n", "    X_center = X - np.mean(X, axis=0) # use axis=0 to compute the mean of each column\n", "    X_std = X_center / np.std(X_center, axis=0) # scale each column to unit standard deviation\n", "    _, S, Vh = linalg.svd(X_std) # rows of Vh are the right singular vectors\n", "    # the top-r right singular vectors of X_std are the top-r eigenvectors of X_std.T @ X_std\n", "    Vproj = Vh.T[:, :r]\n", "    return X_std @ Vproj"
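] }, { "cell_type": "markdown", "id": "pca-smoke-note", "metadata": {}, "source": [ "A quick smoke test of `numpy_pca` on random data (the shapes below are arbitrary), checking that we indeed obtain an $r$-dimensional projection:" ] }, { "cell_type": "code", "execution_count": null, "id": "pca-smoke-code", "metadata": {}, "outputs": [], "source": [ "X_demo = np.random.rand(100, 10)\n", "proj_demo = numpy_pca(X_demo, 2)\n", "proj_demo.shape # expect (100, 2)"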
] }, { "cell_type": "code", "execution_count": null, "id": "embedded-balloon", "metadata": {}, "outputs": [], "source": [ "def numpy_pca(X, r):\n", " \"\"\"\n", " Arguments\n", " ---------\n", " X : numpy.array\n", " A `m x d` array containing the data\n", " r : int\n", " The number of principal components\n", " \n", " Returns\n", " -------\n", " numpy.array\n", " The projected dataset.\n", " \"\"\"\n", " X_center = X - np.mean(X, axis=0) # use axis=0 to compute mean for each column\n", " X_std = X_center / linalg.norm(X, axis=0) # use axis=0 to compute norm for each column\n", " _, S, V = linalg.svd(X_std)\n", " # keep the top-r eigenvectors of X.T @ X\n", " Vproj = V.T[:, :r]\n", " return X_std @ Vproj" ] }, { "cell_type": "markdown", "id": "commercial-invalid", "metadata": {}, "source": [ "In practice, we can use one of the built-in functions in `scikit-learn`, which automatically takes care of the preprocessing for us. Below, we project a set of handwritten digits to $2$ dimensions, and show their 2D projection." ] }, { "cell_type": "code", "execution_count": null, "id": "powerful-montgomery", "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import load_digits\n", "from sklearn.decomposition import PCA\n", "\n", "digits = load_digits()\n", "\n", "# plot them\n", "fig = plt.figure(figsize=(8, 8)) # figure size in inches\n", "fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)\n", "for i in range(64):\n", " ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])\n", " ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')" ] }, { "cell_type": "code", "execution_count": null, "id": "printable-mistake", "metadata": {}, "outputs": [], "source": [ "# take PCA\n", "pca = PCA(n_components=2, whiten='True')\n", "proj = pca.fit_transform(digits.data)\n", "plt.scatter(proj[:, 0], proj[:, 1], c=digits.target, cmap=\"Paired\")\n", "plt.colorbar()" ] }, { "cell_type": "markdown", "id": "aquatic-ribbon", "metadata": {}, "source": [ "## Application 5: Spectral Clustering\n", "\n", "Another application that relies on eigenvectors is spectral clustering. In spectral clustering, we want a partition a graph containing nodes and edges that connect each other to a set of distinct \"communities\". For simplicity, we treat the undirected setting.\n", "\n", "Formally, our graph is represented by an *adjacency matrix* $A \\in \\mathbb{R}^{n \\times n}$, such that\n", "\n", "$$\n", "A_{ij} = \\begin{cases}\n", "1, & \\text{ if node $i$ is connected to node $j$}, \\\\\n", "0, & \\text{ otherwise}\n", "\\end{cases}\n", "$$\n", "\n", "Here we are implicitly assuming that nodes are numbered from $1$ to $n$. Moreover, let $D$ be a diagonal matrix such that\n", "$D_{ii} = \\sum_{j=1}^n A_{ij}$, i.e., the number of neighbors of node $i$. We then form the *normalized adjacency matrix*:\n", "\n", "$$\n", "A_N = D^{-1/2} A D^{-1/2}.\n", "$$\n", "\n", "Here, $A_N$ is \"weighing\" each edge $(i, j)$ by the harmonic mean of the number of neighbors of $i$ and $j$. Spectral clustering computes the top-$k$ eigenvectors, $V \\in \\mathbb{R}^{n \\times k}$, and treats each row, $V_{i}$, as the $k$-dimensional \"feature\" of node $i$. These features are fed to a point clustering algorithm such as k-means.\n", "\n", "We will present an example using `scipy` and some of the sparse matrix formats we previously introduced in the next lecture." 
] }, { "cell_type": "code", "execution_count": null, "id": "exempt-dance", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.1" } }, "nbformat": 4, "nbformat_minor": 5 }