Learn the key skills and concepts from linear algebra, multivariable calculus, and probability & statistics that you need to know in order to understand and implement core machine learning algorithms.
This course will prepare you for a university-level machine learning course that covers topics such as gradient descent, neural networks and backpropagation, support vector machines, extensions of linear regression (e.g. logistic and lasso regression), naive Bayes classifiers, principal component analysis, matrix factorization methods, and Gaussian mixture models.

Formal mathematical notation is often used in machine learning textbooks and papers. While formal symbols are relatively simple to learn through direct instruction, their meaning can be difficult to pick up from context clues, causing them to become a source of bewilderment and intimidation if not understood beforehand.

- Construct sets using set-builder notation and demonstrate fluency with set operations and terminology.
- Write and interpret functions using arrow notation.
- Translate between formal and informal language using quantifiers.
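
As a rough illustration of how this notation carries over to code, the sketch below writes set-builder notation as Python set comprehensions, arrow notation as an ordinary function, and quantifiers as `all` and `any`. The universe, the function `f`, and all variable names are hypothetical choices made for the example.

```python
# Set-builder notation {x in U : P(x)} maps onto a set comprehension.
universe = set(range(1, 11))                  # U = {1, 2, ..., 10}
evens = {x for x in universe if x % 2 == 0}   # {x in U : x is even}
squares = {x * x for x in range(1, 4)}        # {x^2 : x in {1, 2, 3}}

# Common set operations.
union = evens | squares
intersection = evens & squares
difference = evens - squares
complement = universe - evens                 # complement of evens within U

# Arrow notation f: U -> Z, x |-> 2x + 1, written as a function.
def f(x: int) -> int:
    return 2 * x + 1

# Quantifiers: "for all" becomes all(), "there exists" becomes any().
every_image_is_odd = all(f(x) % 2 == 1 for x in universe)
some_image_exceeds_20 = any(f(x) > 20 for x in universe)
```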

Hyperbolic tangent is a common activation function in the context of neural networks.

- Evaluate and graph hyperbolic functions.
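
For a concrete feel, the following minimal sketch evaluates hyperbolic tangent from its exponential definition and compares it with the standard library's `math.tanh`. The helper name `tanh_from_definition` and the sample inputs are assumptions made for the example.

```python
import math

def tanh_from_definition(x: float) -> float:
    """tanh(x) = sinh(x) / cosh(x) = (e^x - e^-x) / (e^x + e^-x)."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

# Compare the definition with the library function at a few sample points.
for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    assert math.isclose(tanh_from_definition(x), math.tanh(x), abs_tol=1e-12)

# The output always lies strictly between -1 and 1, and tanh is odd,
# so tanh(-x) = -tanh(x).
value_at_two = tanh_from_definition(2.0)
```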

Many algorithms in machine learning rely on advanced matrix methods that in turn rely on more foundational linear algebra topics. For example, principal component analysis involves finding eigenvalues and eigenvectors, which requires the use of determinants and Gaussian elimination (respectively). Likewise, fitting a linear regression model involves using subspace projection to project the desired outputs onto the subspace of outputs that could possibly be generated by the model.

- Compute the determinant of an NxN matrix using Laplace expansions, and use properties of the determinant to simplify calculations.
- Use Gaussian elimination to solve systems of linear equations and compute inverse matrices.
- Find the projection of a vector onto a subspace formed by a span of other vectors.
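
The projection idea can be sketched numerically. The example below assumes NumPy is available and uses a small made-up matrix `A` (an intercept column and inputs 0, 1, 2) together with made-up outputs `b`. It solves the normal equations to project `b` onto the column space of `A`, which is the same computation that fits a straight line by least squares.

```python
import numpy as np

# Columns of A span the subspace; b is the vector to project.
# Hypothetical data: first column is an intercept, second holds inputs 0, 1, 2.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 2.0, 2.0])

# Normal equations: (A^T A) x = A^T b. The projection of b is then A x.
x = np.linalg.solve(A.T @ A, A.T @ b)
projection = A @ x

# The residual is orthogonal to every column of A.
residual = b - projection
```

Here `x` holds the intercept and slope of the fitted line, and `A.T @ residual` is zero up to rounding, which is what makes `projection` the closest point in the subspace to `b`.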

Machine learning systems such as recommender systems often utilize latent factor models to identify the most prominent patterns underlying individual records in a data set, sometimes with the additional goal of reducing the complexity of the data by discarding negligible patterns. This is often accomplished using advanced techniques from linear algebra: the eigenvectors of a square matrix represent independent patterns, their eigenvalues represent the prominence of those patterns, and singular values generalize the idea of eigenvalues to rectangular matrices.

- Understand eigenvalues/eigenvectors geometrically, calculate them algebraically, and use them to diagonalize a matrix.
- Compute and interpret properties of quadratic forms including definiteness and principal axes.
- Compute the singular values of a matrix, understand their relationship to the constrained optimization of quadratic forms, and find the singular value decomposition of a matrix.
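
A short numerical sketch, assuming NumPy, may help connect these ideas. It diagonalizes a small symmetric matrix `S` and then checks that the singular values of a rectangular matrix `A` are the square roots of the eigenvalues of `A.T @ A`. Both matrices are made up for the example.

```python
import numpy as np

# Symmetric matrix: eigh returns ascending eigenvalues and
# orthonormal eigenvectors as the columns of Q.
S = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigenvalues, Q = np.linalg.eigh(S)

# Diagonalization: S = Q diag(eigenvalues) Q^T.
reconstructed = Q @ np.diag(eigenvalues) @ Q.T

# Rectangular matrix: singular values come back in descending order.
A = np.array([[3.0, 0.0],
              [4.0, 5.0],
              [0.0, 0.0]])
singular_values = np.linalg.svd(A, compute_uv=False)

# Same numbers obtained from the eigenvalues of the square matrix A^T A
# (eigvalsh is ascending, so reverse to match the SVD ordering).
from_gram_matrix = np.sqrt(np.linalg.eigvalsh(A.T @ A))[::-1]
```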

Support vector machines classify data into two classes by finding the boundary that separates the classes with the widest possible margin. Even when data is not linearly separable, a kernel function can be used to implicitly map the data into a (typically higher-dimensional) inner product space where it can become linearly separable. This is known as the “kernel trick” and it tends to baffle those who are not familiar with inner product spaces. Knowledge of inner product spaces can also help provide intuition for similarity measures in clustering algorithms (many similarity measures, such as cosine similarity, are built directly from inner products).

- Compute dot products, norms, and distances between vectors in N-dimensional Euclidean space.
- Extend the concept of the dot product to the more general concept of an inner product, and use the inner product to compute norms and distances between vectors in abstract vector spaces.
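
To make the kernel trick less mysterious, here is a minimal sketch for one specific kernel, the degree-2 polynomial kernel k(x, y) = (x · y)^2 on 2D vectors. The feature map, function names, and sample vectors are illustrative assumptions. The point is that the kernel value computed in 2D equals an ordinary dot product in a 3D feature space.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    """Euclidean norm induced by the dot product."""
    return math.sqrt(dot(u, u))

def feature_map(x):
    """Explicit map from R^2 to R^3 matching the degree-2 polynomial kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(x, y):
    """k(x, y) = (x . y)^2, computed without ever leaving R^2."""
    return dot(x, y) ** 2

x, y = (1.0, 2.0), (3.0, -1.0)
kernel_value = poly_kernel(x, y)                         # (1*3 + 2*(-1))^2
feature_space_value = dot(feature_map(x), feature_map(y))

# Distance between x and y is the norm of their difference.
distance = norm(tuple(a - b for a, b in zip(x, y)))
```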

Gradient descent, the most popular family of optimization methods in machine learning, involves computing partial derivatives of multivariable functions. Likewise, in order to extend methods from probability and statistics to multivariate distributions (which are used in e.g. Gaussian mixture models), one must integrate multivariable functions. Finally, the concept of a “hyperplane” (which comes up frequently in the context of classification algorithms) can feel nebulous if one is not already familiar with equations of planes in 3D space.

- Construct equations of lines and planes in 3D space.
- Extend prior knowledge of single-variable derivative rules to compute partial derivatives of multivariable functions, including the chain rule.
- Compute the gradient of a multivariable function and interpret it geometrically as representing direction and magnitude of the function’s greatest rate of increase.
- Extend prior knowledge of single-variable integrals to evaluate double integrals as iterated integrals using the fundamental theorem of calculus.
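
As a hedged illustration of how partial derivatives drive gradient descent, the sketch below minimizes a simple made-up function f(x, y) = (x - 1)^2 + 2(y + 2)^2. The function, learning rate, and step count are assumptions chosen so the behavior is easy to verify by hand. Real training loops differ in many details.

```python
def f(x, y):
    """A bowl-shaped function with its minimum at (1, -2)."""
    return (x - 1) ** 2 + 2 * (y + 2) ** 2

def grad_f(x, y):
    """Gradient of f: the vector of partial derivatives (df/dx, df/dy)."""
    return (2 * (x - 1), 4 * (y + 2))

def gradient_descent(start, learning_rate=0.1, steps=200):
    """Repeatedly step against the gradient, the direction of steepest descent."""
    x, y = start
    for _ in range(steps):
        gx, gy = grad_f(x, y)
        x -= learning_rate * gx
        y -= learning_rate * gy
    return x, y

x_min, y_min = gradient_descent((0.0, 0.0))
```

With these settings the iterates approach (1, -2), where both partial derivatives vanish.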

To understand the advanced probability topics that appear in machine learning, one must be able to manipulate random variables and distributions. For example, the covariance matrix of multiple random variables is central to principal component analysis because the principal components are themselves the eigenvectors of the covariance matrix. Likewise, uniform and normal distributions are frequently used during parameter initialization – and the multivariate normal distribution is the central figure in the Gaussian mixture model, a popular clustering algorithm.

- Combine prior knowledge of discrete random variables and integration to compute the probability, mean, and variance of a continuous random variable.
- Generalize prior knowledge of univariate probability distributions to joint distributions.
- Compute the mean, variance, and covariance matrix for a given sample of observations.
- Demonstrate fluency with the uniform distribution and the multivariate normal distribution.
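
The sample covariance matrix can be computed directly from its definition. The sketch below uses plain Python and two short made-up samples. The function names are hypothetical, and the n - 1 divisor reflects the usual unbiased sample estimate.

```python
def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

def sample_covariance(xs, ys):
    """Unbiased sample covariance (divides by n - 1)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def covariance_matrix(columns):
    """Entry (i, j) is the covariance between variable i and variable j."""
    return [[sample_covariance(a, b) for b in columns] for a in columns]

# Made-up observations of two variables.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 9.0]
cov = covariance_matrix([xs, ys])
```

The diagonal entries are the sample variances and the matrix is symmetric, which is why its eigenvectors (the principal components) are orthogonal.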

Conditional probability and likelihood functions are central to machine learning models and algorithms such as naive Bayes classifiers and the expectation maximization algorithm (which is commonly used to fit Gaussian mixture models).

- Apply Bayes’ theorem to compute conditional probabilities and solve problems in real-world context.
- Combine Bayes’ theorem with knowledge of probability distributions to conceptualize, compute, and apply marginal distributions and conditional distributions.
- Understand the concept of a point estimator and what it means for an estimator to be unbiased.
- Compute likelihood functions and fit probability models to data using maximum likelihood estimation.
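
Maximum likelihood estimation can be sketched with the simplest case, a Bernoulli model for made-up coin-flip data. The example compares the closed-form estimate (the sample proportion) with a brute-force search over a grid of candidate parameters. The data and the grid resolution are assumptions for illustration only.

```python
import math

def log_likelihood(p, observations):
    """Bernoulli log-likelihood: log p for each 1, log(1 - p) for each 0."""
    return sum(math.log(p) if x == 1 else math.log(1 - p) for x in observations)

# Made-up data: 7 successes in 10 trials.
observations = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]

# Closed-form maximum likelihood estimate: the sample proportion.
p_hat = sum(observations) / len(observations)

# Numerical check: the best candidate on a grid of values 0.01, ..., 0.99.
grid = [i / 100 for i in range(1, 100)]
best_on_grid = max(grid, key=lambda p: log_likelihood(p, observations))
```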

Many machine learning models (e.g. logistic regression and lasso regression) are extensions of linear regression. Confidence intervals are often used to place bounds on the uncertainty of a model’s predictions or parameters. Models are usually trained on a sample of a population, and hypothesis testing can be used to determine whether there is sufficient evidence to draw a conclusion about the population as a whole.

- Carry out z-tests and t-tests: formulate null and alternative hypotheses, compute p-values, and reject or fail to reject the null hypothesis at a desired level of significance.
- Identify type I and type II errors and their consequences in modeling contexts.
- Apply subspace projection to fit linear, polynomial, and multiple linear regression models to data.
- Construct confidence intervals for statistical quantities including linear regression coefficients.
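
As one small example of these ideas, the sketch below runs a two-tailed z-test for a single mean with known population standard deviation and builds the matching confidence interval, using `statistics.NormalDist` from the Python standard library. The sample figures and function names are hypothetical.

```python
import math
from statistics import NormalDist

def two_tailed_z_test(sample_mean, mu0, sigma, n):
    """Return (z, p_value) for H0: mu = mu0 with known population sigma."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Confidence interval for the mean with known population sigma."""
    z_star = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z_star * sigma / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Made-up numbers: 25 observations with mean 52, testing H0: mu = 50.
z, p_value = two_tailed_z_test(sample_mean=52.0, mu0=50.0, sigma=5.0, n=25)
low, high = z_confidence_interval(sample_mean=52.0, sigma=5.0, n=25)

# Reject H0 at the 5% level when the p-value falls below 0.05.
reject_null = p_value < 0.05
```

In this example the 95% interval excludes 50, which agrees with rejecting the null hypothesis at the 5% significance level.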

1. Preliminaries (24 topics)

1.1. Sets

1.1.1. Special Sets

1.1.2. Set-Builder Notation

1.1.3. Equivalent Sets

1.1.4. Cardinality of Sets

1.1.5. Subsets

1.1.6. The Complement of a Set

1.1.7. The Difference of Sets

1.1.8. The Cartesian Product

1.1.9. Sets and Functions

1.1.10. Interior and Boundary Points

1.1.11. Interiors and Boundaries of Sets

1.1.12. Open and Closed Sets

1.2. Logical Quantifiers

1.2.1. Statements and Propositions

1.2.2. Universal and Existential Quantifiers

1.2.3. Formal and Informal Language

1.3. Vector Geometry

1.3.1. The Vector Equation of a Line

1.3.2. The Parametric Equations of a Line

1.3.3. The Cartesian Equation of a Line

1.3.4. The Vector Equation of a Plane

1.3.5. The Cartesian Equation of a Plane

1.3.6. The Parametric Equations of a Plane

1.3.7. The Intersection of Two Planes

1.4. The Hyperbolic Functions

1.4.1. The Hyperbolic Functions

1.4.2. Graphs of Hyperbolic Functions

2. Matrices (21 topics)

2.5. Determinants

2.5.1. The Determinant of an NxN Matrix

2.5.2. Finding Determinants Using Laplace Expansions

2.5.3. Basic Properties of Determinants

2.5.4. Further Properties of Determinants

2.6. Gaussian Elimination

2.6.1. Systems of Equations as Augmented Matrices

2.6.2. Row Echelon Form

2.6.3. Solving Systems of Equations Using Back Substitution

2.6.4. Elementary Row Operations

2.6.5. Creating Rows or Columns Containing Zeros Using Gaussian Elimination

2.6.6. Solving 2x2 Systems of Equations Using Gaussian Elimination

2.6.7. Solving 2x2 Singular Systems of Equations Using Gaussian Elimination

2.6.8. Solving 3x3 Systems of Equations Using Gaussian Elimination

2.6.9. Identifying the Pivot Columns of a Matrix

2.6.10. Solving 3x3 Singular Systems of Equations Using Gaussian Elimination

2.6.11. Reduced Row Echelon Form

2.6.12. Gaussian Elimination For NxM Systems of Equations

2.7. The Inverse of a Matrix

2.7.1. Finding the Inverse of a 2x2 Matrix Using Row Operations

2.7.2. Finding the Inverse of a 3x3 Matrix Using Row Operations

2.7.3. Matrices With Easy-to-Find Inverses

2.7.4. The Invertible Matrix Theorem in Terms of 2x2 Systems of Equations

2.7.5. Triangular Matrices

3. Vector Spaces (18 topics)

3.8. Vectors in N-Dimensional Space

3.8.1. Vectors in N-Dimensional Euclidean Space

3.8.2. Linear Combinations of Vectors in N-Dimensional Euclidean Space

3.8.3. Linear Span of Vectors in N-Dimensional Euclidean Space

3.8.4. Linear Dependence and Independence

3.9. Subspaces of N-Dimensional Space

3.9.1. Subspaces of N-Dimensional Space

3.9.2. Subspaces of N-Dimensional Space: Geometric Interpretation

3.9.3. The Column Space of a Matrix

3.9.4. The Null Space of a Matrix

3.10. Bases of N-Dimensional Space

3.10.1. Finding a Basis of a Span

3.10.2. Finding a Basis of the Column Space of a Matrix

3.10.3. Finding a Basis of the Null Space of a Matrix

3.10.4. Expressing the Coordinates of a Vector in a Given Basis

3.10.5. Writing Vectors in Different Bases

3.11. Dimension and Rank in N-Dimensional Space

3.11.1. The Dimension of a Span

3.11.2. The Rank of a Matrix

3.11.3. The Dimension of the Null Space of a Matrix

3.11.4. The Invertible Matrix Theorem in Terms of Dimension, Rank and Nullity

3.11.5. The Rank-Nullity Theorem

4. Diagonalization of Matrices (12 topics)

4.12. Eigenvectors and Eigenvalues

4.12.1. The Eigenvalues and Eigenvectors of a 2x2 Matrix

4.12.2. Calculating the Eigenvalues of a 2x2 Matrix

4.12.3. Calculating the Eigenvectors of a 2x2 Matrix

4.12.4. The Characteristic Equation of a Matrix

4.12.5. Calculating the Eigenvectors of a 3x3 Matrix With Distinct Eigenvalues

4.12.6. Calculating the Eigenvectors of a 3x3 Matrix in the General Case

4.13. Diagonalization

4.13.1. Diagonalizing a 2x2 Matrix

4.13.2. Diagonalizing a 3x3 Matrix With Distinct Eigenvalues

4.13.3. Diagonalizing a 3x3 Matrix in the General Case

4.13.4. Symmetric Matrices

4.13.5. Diagonalization of 2x2 Symmetric Matrices

4.13.6. Diagonalization of 3x3 Symmetric Matrices

5. Orthogonality & Projections (16 topics)

5.14. Inner Products

5.14.1. The Dot Product in N-Dimensional Euclidean Space

5.14.2. The Norm of a Vector in N-Dimensional Euclidean Space

5.14.3. Introduction to Abstract Vector Spaces

5.14.4. Defining Abstract Vector Spaces

5.14.5. Inner Product Spaces

5.15. Orthogonality

5.15.1. Orthogonal Vectors in Euclidean Spaces

5.15.2. The Cauchy-Schwarz Inequality and the Angle Between Two Vectors

5.15.3. Orthogonal Complements

5.15.4. Orthogonal Sets in Euclidean Spaces

5.15.5. Orthogonal Matrices and Linear Transformations

5.16. Orthogonal Projections

5.16.1. Projecting Vectors Onto One-Dimensional Subspaces

5.16.2. The Components of a Vector with Respect to an Orthogonal or Orthonormal Basis

5.16.3. Projecting Vectors Onto Subspaces in Euclidean Spaces (Orthogonal Bases)

5.16.4. Projecting Vectors Onto Subspaces in Euclidean Spaces (Arbitrary Bases)

5.16.5. Projecting Vectors Onto Subspaces in Euclidean Spaces (Arbitrary Bases): Applications

5.16.6. The Gram-Schmidt Process for Two Vectors

6. Singular Value Decomposition (10 topics)

6.17. Quadratic Forms

6.17.1. Bilinear Forms

6.17.2. Quadratic Forms

6.17.3. Positive-Definite and Negative-Definite Quadratic Forms

6.17.4. Constrained Optimization of Quadratic Forms

6.18. Singular Value Decomposition

6.18.1. The Singular Values of a Matrix

6.18.2. Computing the Singular Values of a Matrix

6.18.3. Singular Value Decomposition of 2x2 Matrices

6.18.4. Singular Value Decomposition of 2x2 Matrices With Zero or Repeated Eigenvalues

6.18.5. Singular Value Decomposition of Larger Matrices

6.18.6. Singular Value Decomposition and the Pseudoinverse Matrix

7. Applications of Linear Algebra (8 topics)

7.19. Principal Component Analysis

7.19.1. Introduction to Principal Component Analysis

7.19.2. Computing Principal Components

7.19.3. The Connection Between PCA and SVD

7.20. Linear Least-Squares Problems

7.20.1. The Least-Squares Solution of a Linear System (Without Collinearity)

7.20.2. The Least-Squares Solution of a Linear System (With Collinearity)

7.21. Linear Regression

7.21.1. Linear Regression

7.21.2. Polynomial Regression

7.21.3. Multiple Linear Regression

8. Multivariable Calculus (27 topics)

8.22. Partial Derivatives

8.22.1. Introduction to Multivariable Functions

8.22.2. Level Curves and Contour Plots

8.22.3. Limits and Continuity of Multivariable Functions

8.22.4. Introduction to Partial Derivatives

8.22.5. Computing Partial Derivatives Using the Rules of Differentiation

8.22.6. Geometric Interpretations of Partial Derivatives

8.22.7. Partial Differentiability of Multivariable Functions

8.22.8. Higher-Order Partial Derivatives

8.22.9. Equality of Mixed Partial Derivatives

8.22.10. The Multivariable Chain Rule

8.23. Vector-Valued Functions

8.23.1. Defining Vector-Valued Functions

8.23.2. The Gradient Vector

8.23.3. Directional Derivatives

8.23.4. The Multivariable Chain Rule in Vector Form

8.24. Approximating Volumes With Riemann Sums

8.24.1. Partitions of Intervals

8.24.2. Calculating Double Summations Over Partitions

8.24.3. Approximating Volumes Using Lower Riemann Sums

8.24.4. Approximating Volumes Using Upper Riemann Sums

8.24.5. Lower Riemann Sums Over General Rectangular Partitions

8.24.6. Upper Riemann Sums Over General Rectangular Partitions

8.24.7. Defining Double Integrals Using Lower and Upper Riemann Sums

8.25. Double Integrals

8.25.1. Double Integrals Over Rectangular Domains

8.25.2. Double Integrals Over Non-Rectangular Domains

8.25.3. Properties of Double Integrals

8.25.4. Type I and II Regions in Two-Dimensional Space

8.25.5. Double Integrals Over Type I Regions

8.25.6. Double Integrals Over Type II Regions

9. Probability & Random Variables (37 topics)

9.26. Probability

9.26.1. The Law of Total Probability (Extended)

9.26.2. Bayes' Theorem

9.26.3. Extending Bayes' Theorem

9.27. Random Variables

9.27.1. Probability Density Functions of Continuous Random Variables

9.27.2. Calculating Probabilities With Continuous Random Variables

9.27.3. Continuous Random Variables Over Infinite Domains

9.27.4. Cumulative Distribution Functions for Continuous Random Variables

9.27.5. Approximating Discrete Random Variables as Continuous

9.27.6. Simulating Random Observations

9.28. Transformations of Random Variables

9.28.1. One-to-One Transformations of Discrete Random Variables

9.28.2. Many-to-One Transformations of Discrete Random Variables

9.28.3. The Distribution Function Method

9.28.4. The Change-of-Variables Method for Continuous Random Variables

9.28.5. The Distribution Function Method With Many-to-One Transformations

9.29. Expectation

9.29.1. Expected Values of Discrete Random Variables

9.29.2. Properties of Expectation for Discrete Random Variables

9.29.3. Moments of Discrete Random Variables

9.29.4. Variance of Discrete Random Variables

9.29.5. Properties of Variance for Discrete Random Variables

9.29.6. Expected Values of Continuous Random Variables

9.29.7. Moments of Continuous Random Variables

9.29.8. Variance of Continuous Random Variables

9.29.9. The Rule of the Lazy Statistician

9.30. Discrete Probability Distributions

9.30.1. The Bernoulli Distribution

9.30.2. Mean and Variance of the Binomial Distribution

9.30.3. The Discrete Uniform Distribution

9.30.4. Modeling With Discrete Uniform Distributions

9.30.5. Mean and Variance of Discrete Uniform Distributions

9.30.6. The Poisson Distribution

9.30.7. Modeling With the Poisson Distribution

9.31. Continuous Probability Distributions

9.31.1. The Continuous Uniform Distribution

9.31.2. Mean and Variance of Continuous Uniform Distributions

9.31.3. Modeling With Continuous Uniform Distributions

9.31.4. The Gamma Function

9.31.5. The Chi-Square Distribution

9.31.6. The Student's T-Distribution

9.31.7. The Exponential Distribution

10. Combining Random Variables (29 topics)

10.32. Distributions of Two Discrete Random Variables

10.32.1. Double Summations

10.32.2. Joint Distributions for Discrete Random Variables

10.32.3. Marginal Distributions for Discrete Random Variables

10.32.4. Independence of Discrete Random Variables

10.32.5. Conditional Distributions for Discrete Random Variables

10.32.6. The Joint CDF of Two Discrete Random Variables

10.33. Distributions of Two Continuous Random Variables

10.33.1. Joint Distributions for Continuous Random Variables

10.33.2. Marginal Distributions for Continuous Random Variables

10.33.3. Independence of Continuous Random Variables

10.33.4. Conditional Distributions for Continuous Random Variables

10.33.5. The Joint CDF of Two Continuous Random Variables

10.33.6. Properties of the Joint CDF of Two Continuous Random Variables

10.34. Expectation for Joint Distributions

10.34.1. Expected Values of Sums and Products of Random Variables

10.34.2. Variance of Sums of Independent Random Variables

10.34.3. Computing Expected Values From Joint Distributions

10.34.4. Conditional Expectation for Discrete Random Variables

10.34.5. Conditional Variance for Discrete Random Variables

10.34.6. Conditional Expectation for Continuous Random Variables

10.34.7. Conditional Variance for Continuous Random Variables

10.34.8. The Rule of the Lazy Statistician for Two Random Variables

10.35. Covariance of Random Variables

10.35.1. The Covariance of Two Random Variables

10.35.2. Variance of Sums of Random Variables

10.35.3. The Correlation Coefficient for Two Random Variables

10.35.4. The Covariance Matrix

10.36. Normally Distributed Random Variables

10.36.1. Normal Approximations of Binomial Distributions

10.36.2. Combining Two Normally Distributed Random Variables

10.36.3. Combining Multiple Normally Distributed Random Variables

10.36.4. I.I.D Normal Random Variables

10.36.5. The Bivariate Normal Distribution

11. Parametric Inference (29 topics)

11.37. Point Estimation

11.37.1. The Sample Mean

11.37.2. Statistics and Sampling Distributions

11.37.3. Variance of Sample Means

11.37.4. The Sample Variance

11.37.5. Sample Means From Normal Populations

11.37.6. The Central Limit Theorem

11.37.7. Sampling Proportions From Finite Populations

11.37.8. Point Estimates of Population Proportions

11.37.9. The Sample Covariance Matrix

11.38. Maximum Likelihood

11.38.1. Product Notation

11.38.2. Logarithmic Differentiation

11.38.3. Likelihood Functions for Discrete Probability Distributions

11.38.4. Log-Likelihood Functions for Discrete Probability Distributions

11.38.5. Likelihood Functions for Continuous Probability Distributions

11.38.6. Log-Likelihood Functions for Continuous Probability Distributions

11.38.7. Maximum Likelihood Estimation

11.39. Hypothesis Testing

11.39.1. One-Tailed Hypothesis Tests

11.39.2. Two-Tailed Hypothesis Tests

11.39.3. Type I and Type II Errors in Hypothesis Testing

11.39.4. Hypothesis Tests for One Mean: Known Population Variance

11.39.5. Hypothesis Tests for One Mean: Unknown Population Variance

11.39.6. Hypothesis Tests for Two Means: Known Population Variances

11.40. Confidence Intervals

11.40.1. Confidence Intervals for One Mean: Known Population Variance

11.40.2. Confidence Intervals for One Mean: Unknown Population Variance

11.40.3. Confidence Intervals for Proportions

11.40.4. Confidence Intervals for Two Means: Known Population Variances

11.40.5. Confidence Intervals for Variances

11.40.6. Confidence Intervals for Slope Parameters in Linear Regression

11.40.7. Confidence Intervals for Intercept Parameters in Linear Regression