Normal Equation
Notes of Andrew Ng’s Machine Learning —— (4) Normal Equation
Normal Equation
Gradient descent gives one way of minimizing our cost function $J$. Let's discuss a second way of doing so: the Normal Equation.
The normal equation minimizes $J$ by explicitly taking its derivatives with respect to the $\theta_j$'s and setting them to zero. This allows us to find the optimal $\theta$ analytically, without resorting to iteration.
The normal equation formula is given below:
$$
\theta = (X^TX)^{-1}X^Ty
$$
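To see where this formula comes from, write the training examples as the design matrix $X$ (one example per row, with a leading column of ones for the intercept) and the targets as the vector $y$, so that $J(\theta) = \frac{1}{2m}\lVert X\theta - y\rVert^2$. Setting the gradient of $J$ to zero and solving for $\theta$ gives:

$$
\nabla_\theta J(\theta) = \frac{1}{m}X^T(X\theta - y) = 0
\;\;\Longrightarrow\;\;
X^TX\theta = X^Ty
\;\;\Longrightarrow\;\;
\theta = (X^TX)^{-1}X^Ty
$$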
There is no need to do feature scaling with the normal equation.
Example
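Here is a minimal Octave sketch of the formula in action, using a tiny made-up training set (the numbers are purely illustrative):

```octave
% Tiny made-up training set: house size and number of bedrooms as
% features, price as the target. Values are illustrative only.
X_raw = [2104 3;
         1416 2;
         1534 3;
          852 2];
y = [460; 232; 315; 178];

m = size(X_raw, 1);          % number of training examples
X = [ones(m, 1), X_raw];     % prepend a column of ones for theta_0

theta = pinv(X' * X) * X' * y   % normal equation: (X'X)^(-1) X'y
```

Note that the raw feature values go straight into $X$: as mentioned above, no feature scaling is needed, and there is no learning rate and no iteration involved.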
Normal Equation vs. Gradient Descent
Gradient Descent | Normal Equation |
---|---|
Need to choose $\alpha$ | No need to choose $\alpha$ |
Needs many iterations | No need to iterate |
Time cost: $O(kn^2)$ | Needs to calculate $(X^TX)^{-1}$, which costs $O(n^3)$ |
Works well when $n$ is large | Slow if $n$ is very large |
In practice, when $n > 10{,}000$ we tend to use gradient descent; otherwise, the normal equation usually performs better.
Normal Equation Noninvertibility
When implementing the normal equation in Octave, we want to use the **pinv** function rather than **inv**. The **pinv** function will give you a value of $\theta$ even if $X^TX$ is not invertible.
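A small Octave sketch (with made-up numbers) of why this matters: in the design matrix below, the third column is just the second column rescaled by a constant (the meters-vs-feet example), so the columns are linearly dependent and $X^TX$ is singular, yet **pinv** still returns a usable $\theta$:

```octave
% Made-up design matrix whose third column is a constant multiple of the
% second (the same feature measured in different units), so X' * X is
% singular.
X = [1 10 32.8;
     1 20 65.6;
     1 30 98.4];
y = [1; 2; 3];

theta = pinv(X' * X) * X' * y    % pinv copes with the singular matrix
% inv(X' * X) would typically warn that the matrix is singular and
% its result should not be trusted.
```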
If $X^TX$ is non-invertible, the common causes are:
- Redundant features, where two features are linearly dependent (e.g. the size of the house in feet^2 and the size of the house in meter^2, where 1 meter = 3.28 feet).
- Too many features ($m \le n$). In this case, delete some features or use "regularization".
Solutions to the above problems include deleting a feature that is linearly dependent on another, or deleting one or more features when there are too many.
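To illustrate the second cause, here is a minimal sketch (made-up numbers) with more features than training examples ($m \le n$), where a rank check confirms that $X^TX$ cannot be inverted:

```octave
% Made-up case with m = 2 examples and n = 3 features (plus intercept),
% so X' * X is 4x4 but has rank at most 2 and is therefore singular.
X = [1 3 5 7;
     1 4 6 8];
A = X' * X;
printf('rank = %d, size = %d\n', rank(A), size(A, 1));   % rank < size
```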