I'd be pleased to add this vector-vs-covector content if you see fit.
It's not just a point of theoretical pedantry: once students understand it, things like the coordinate dependence of gradient descent (said another way, the fact that the learning rate is an inverse Riemannian metric rather than a mere number) become manifest. From there, implicit regularization near a minimum becomes easy to analyze by dimensional analysis alone! We thus come to appreciate once again how an inner product (which says how "alike" its two inputs are) controls generalization behavior, just as with kernel SVMs, L2-regularized underdetermined linear regression, and more.
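If it helps, here is a minimal worked sketch of the first claim (notation mine, not from your draft). One step of gradient descent on $L(\theta)$ is
$$\theta \leftarrow \theta - \eta \, \nabla_\theta L .$$
Reparameterize by $\theta = A \tilde\theta$ with $A$ invertible. The gradient is a covector, so $\nabla_{\tilde\theta} L = A^\top \nabla_\theta L$, and a step taken in the new coordinates,
$$\tilde\theta \leftarrow \tilde\theta - \eta \, A^\top \nabla_\theta L ,$$
maps back to
$$\theta \leftarrow \theta - \eta \, A A^\top \nabla_\theta L$$
in the old ones: a different update unless $A A^\top = I$. The mismatch is absorbed exactly by reading the scalar $\eta$ as an inverse metric $\eta \, g^{-1}$, i.e. $\theta \leftarrow \theta - \eta \, g^{-1} \nabla_\theta L$, which transforms contravariantly in both indices and makes the step coordinate-free.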