### Where Is Noether’s Principle in Machine Learning?

*This post accompanies a poster to be presented at the MLSS 2024.*

Physics likes optimization! Subject to its boundary conditions, the time evolution of a physical system is a critical point for a quantity called an **action.** This point of view sets the stage for **Noether’s principle**, a remarkable correspondence between continuous invariances of the action and conservation laws of the system.

In machine learning, we often deal with discrete “processes” whose control parameters are chosen to minimize some quantity. For example, we can see a deep residual network as a process where the role of “time” is played by depth. We may ask:

- Does Noether’s theorem apply to these processes?
- Can we find meaningful conserved quantities?

Our answers: “yes,” and “not sure!”

## Noether’s Principle in Physics

In 1662, Fermat observed that the trajectory a beam of light takes through a lens is a path of least time. Formally, where $v(q)$ gives the velocity of light at a point $q,$ a trajectory $q(t)$ taken by a beam of light between two points $q(t_0)$ and $q(t_1)$ will be a stationary point for

$S[q] = \int_{t_0}^{t_1} \frac{1}{v(q)} \lVert \dot q \rVert \, dt,$

in the sense that

$\frac{d}{d \epsilon} \bigg|_{\epsilon = 0} S[q + \epsilon h] = 0$

for any “perturbation” $h$ satisfying $h(t_0) = h(t_1) = 0.$ Furthermore, this condition turns out to characterize the paths that light can take.
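To make this concrete, here is a small numerical sanity check (a numpy sketch of my own, not from the original post): in a uniform medium, where $v$ is constant, paths of least time are straight lines, and we can verify that a discretized version of the action is stationary at a straight path under a perturbation that vanishes at the endpoints.

```python
import numpy as np

rng = np.random.default_rng(0)

def action(q, v=1.0):
    # Discrete travel time: sum of segment lengths divided by the speed v.
    return np.sum(np.linalg.norm(np.diff(q, axis=0), axis=1) / v)

# A straight path between two points in a uniform medium.
t = np.linspace(0.0, 1.0, 101)[:, None]
q = (1 - t) * np.array([0.0, 0.0]) + t * np.array([3.0, 1.0])

# A perturbation h that vanishes at both endpoints.
h = np.sin(np.pi * t) * rng.normal(size=(1, 2))

eps = 1e-5
dS = (action(q + eps * h) - action(q - eps * h)) / (2 * eps)
print(abs(dS))  # ~ 0: the straight path is a stationary point of S
```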

In fact, all fundamental physical theories can be written in the form

$\delta S = 0$

for a suitable “action” $S,$ where $\delta$ denotes a (variational) derivative with respect to the process trajectory. Such an equation is called a **stationary action principle**.

Given a physical system, a function $F(q, \dot q, t, \dots)$ of the system state is called a conserved quantity when its value is constant over any given physical trajectory. For dynamics expressed by a stationary action principle, **Noether’s principle** gives a remarkable correspondence between **conserved quantities** and certain **invariances** of the action $S.$

For the purposes of this article, we will not give more details on Noether’s principle. The interested reader is invited to seek out Chapter 4 of Arnold’s *Mathematical Methods of Classical Mechanics* for a concise description from the point of view of Lagrangian mechanics. (For a much more complete reference, including the more subtle “off-shell” version applicable to local gauge transformations, see Chapter 5 of Olver’s *Applications of Lie Groups to Differential Equations*.) However, as a guide for the imagination, we illustrate two examples.

In physics, the famous two-body problem is described by the action

$S = \int \frac{1}{2} (m_1 \dot q_1^2 + m_2 \dot q_2^2) + \frac{m_1 m_2 G}{\lVert q_1 - q_2 \rVert} \, dt,$

where $q_1$ and $q_2$ are the positions of our two bodies and $m_1$ and $m_2$ are their masses. The integrand does not change if the trajectories $q_1(t)$ and $q_2(t)$ are translated or rotated, so these transformations are *symmetries* of $S.$ Applying Noether’s theorem gives us three conserved quantities—one for each degree of freedom in our group of transformations—which turn out to be horizontal, vertical, and angular momentum.

In the illustration above, we consider two bodies with equal mass. Conservation of linear momentum means that the sum of the arrows is constant, while conservation of angular momentum means that the sum of highlighted areas is constant.
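We can also check these conservation laws numerically. The following numpy sketch (my own construction, in units where $G = m_1 = m_2 = 1$) integrates the two-body problem with a leapfrog scheme and confirms that linear and angular momentum stay at their initial values:

```python
import numpy as np

# Leapfrog (velocity Verlet) integration of the planar two-body problem,
# in units where G = m1 = m2 = 1.
m1 = m2 = G = 1.0
q1 = np.array([-0.5, 0.0]); q2 = np.array([0.5, 0.0])
p1 = np.array([0.0, -0.4]); p2 = np.array([0.0, 0.4])

def force(q1, q2):
    # Gravitational force on body 1 (body 2 feels the opposite force).
    r = q2 - q1
    return G * m1 * m2 * r / np.linalg.norm(r) ** 3

def wedge(a, b):
    # Planar wedge product a ∧ b.
    return a[0] * b[1] - a[1] * b[0]

dt = 1e-3
for _ in range(10_000):
    f = force(q1, q2)
    p1 += 0.5 * dt * f; p2 -= 0.5 * dt * f    # half kick
    q1 += dt * p1 / m1; q2 += dt * p2 / m2    # drift
    f = force(q1, q2)
    p1 += 0.5 * dt * f; p2 -= 0.5 * dt * f    # half kick

P = p1 + p2                          # total linear momentum: stays (0, 0)
L = wedge(q1, p1) + wedge(q2, p2)    # total angular momentum: stays 0.4
print(P, L)
```

Here the conservation is exact up to floating-point roundoff: the kicks apply opposite impulses along $q_1 - q_2,$ and the drifts move each position along its own momentum, so neither step changes $p_1 + p_2$ or $q_1 \wedge p_1 + q_2 \wedge p_2.$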

For a less familiar example of a conservation law, consider a ray of light passing through a rotationally symmetric lens. Rotating a trajectory about the center of the lens does not affect the time needed to traverse it. What conserved quantity does this symmetry produce?

At each instant, let $p$ be a vector pointing in the direction of $\dot q$ with norm equal to $1/v(q).$ (We’ve drawn this vector in the widget above.) This vector can be seen as the gradient of $S$ with respect to $q;$ it gives the marginal price we would pay to move $q$ in a certain direction at a particular instant. Noether’s principle tells us that

$p \wedge (q - c)$

is conserved, where $c$ is the center of the lens. In our illustration, this conserved quantity is represented by the area of the triangle.
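This conservation law can be checked numerically as well. The sketch below (my own setup, with a hypothetical rotationally symmetric index profile $n(q) = 1 + e^{-\lVert q \rVert^2},$ where $n = 1/v$) integrates the standard ray equations $dq/ds = p/n,$ $dp/ds = \nabla n$ with a fourth-order Runge–Kutta step and tracks $p \wedge (q - c)$ with $c = 0$:

```python
import numpy as np

def n(q):
    # Refractive index 1/v: a rotationally symmetric "lens" centered at 0.
    return 1.0 + np.exp(-q @ q)

def grad_n(q):
    return -2.0 * q * np.exp(-q @ q)

def deriv(state):
    # Ray equations in arc length s, with p = n * dq/ds.
    q, p = state[:2], state[2:]
    return np.concatenate([p / n(q), grad_n(q)])

def rk4_step(state, h):
    k1 = deriv(state)
    k2 = deriv(state + 0.5 * h * k1)
    k3 = deriv(state + 0.5 * h * k2)
    k4 = deriv(state + h * k3)
    return state + (h / 6) * (k1 + 2 * k2 + 2 * k3 + k4)

def wedge(a, b):
    return a[0] * b[1] - a[1] * b[0]

# Launch a ray off-axis, heading rightward through the lens.
q = np.array([-3.0, 0.7])
p = n(q) * np.array([1.0, 0.0])      # |p| = n(q), pointing along dq/ds
state = np.concatenate([q, p])

L0 = wedge(state[2:], state[:2])     # conserved: p ∧ (q - c), with c = 0
for _ in range(6000):
    state = rk4_step(state, 1e-3)
L1 = wedge(state[2:], state[:2])
print(L0, L1)  # nearly equal
```

The invariance is visible in the equations themselves: along a ray, $\frac{d}{ds}(p \wedge q) = \nabla n \wedge q,$ which vanishes because $\nabla n$ is radial for a rotationally symmetric lens.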

## Noether’s Principle in ML: A Toy Example

In machine learning, we routinely deal with processes whose control parameters are chosen to minimize some quantity. For example, say we are trying to solve a supervised learning problem by choosing parameters $(\theta_1, \dots, \theta_N)$ so that a composition

$F = \varphi_N(\theta_N) \circ \dots \circ \varphi_2(\theta_2) \circ \varphi_1(\theta_1)$

minimizes some expected loss $E[L(F(x), y)]$ with respect to some random variables $x$ and $y.$

Let’s view the sequence of random variables

$\begin{align*}
x_0 &= x \\
x_1 &= \varphi_1(\theta_1)(x_0) \\
&\vdots \\
x_N &= \varphi_N(\theta_N)(x_{N - 1})
\end{align*}$

as the states of a discrete-time “process.” After optimization, the trajectory $(x_0, dots, x_N)$ is approximately a **critical point for a loss function**. As we saw above, physical trajectories are also characterized as critical points of certain functions. Can we think about our trajectory of intermediate values $x_i$ from a physical point of view?

Of course, several things are different. Most obviously, “time” is now discrete rather than continuous. Furthermore, our trajectory of intermediate values is constrained in a different way. In physics, we specified boundary conditions and allowed the intermediate states of the process to vary, but in machine learning we specify an initial condition and constrain the trajectory $(x_0, dots, x_N)$ to be driven by some parameters $theta_i.$ Fortunately, Noether’s principle continues to apply!

To see how, let’s consider a toy example that’s easy to visualize. First, we’ll define a family of “deformations” of the plane, as depicted below. (Click the screen or drag the slider to change the parameters of the deformation.)

Next, we’ll set up some simple optimization problems. In each case, we’ll try to find a sequence of small deformations that work together to send some “clusters” to their corresponding “destinations.” This is easy to do by initializing randomly and optimizing our parameters with gradient descent. In the following widget, a composition of 50 maps is able to swap the positions of three clusters.
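Since the widgets are interactive, here is a self-contained stand-in (my own toy, with small affine layers $x \mapsto (I + A_k)x + c_k$ in place of the deformation family above): we optimize a composition of maps by gradient descent, with the backward pass through the chain written out explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "optimized process": a chain of near-identity affine layers
# x_{k+1} = (I + A_k) x_k + c_k, trained so three points trade places.
P = np.array([[1.0, 0.0], [-0.5, 0.866], [-0.5, -0.866]])  # starting positions
T = np.roll(P, 1, axis=0)                                  # destinations

N, lr = 8, 0.02
As = [0.01 * rng.normal(size=(2, 2)) for _ in range(N)]
cs = [np.zeros(2) for _ in range(N)]

def forward(x):
    xs = [x]
    for A, c in zip(As, cs):
        xs.append(xs[-1] + xs[-1] @ A.T + c)
    return xs

for step in range(5000):
    xs = forward(P)
    g = 2 * (xs[-1] - T) / len(P)    # gradient of the mean squared error
    for k in reversed(range(N)):
        g_prev = g + g @ As[k]       # backpropagate: g_k = (I + A_k)^T g_{k+1}
        As[k] -= lr * g.T @ xs[k]    # dL/dA_k = sum_i g_{k+1,i} x_{k,i}^T
        cs[k] -= lr * g.sum(axis=0)  # dL/dc_k = sum_i g_{k+1,i}
        g = g_prev

loss = np.mean(np.sum((forward(P)[-1] - T) ** 2, axis=1))
print(loss)  # small after training
```

The intermediate lists `xs` and the backpropagated vectors `g` here play exactly the roles of the states $x_i$ and gradients $g_i$ discussed below.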

Now, observe that our family $\mathcal{F}$ of deformations is **invariant** under conjugation by translations and rotations. This means that, where $Q$ is a translation or rotation, we have an equality

$\mathcal{F} = \{ Q \circ f \circ Q^{-1} \mid f \in \mathcal{F} \}$

of sets. In other words, our family $\mathcal{F}$ “looks the same” from a translated or rotated coordinate system. Moreover, conjugating a map $f$ by a transformation $Q$ near the identity can be realized by making a small change to the parameters that define $f.$ As it turns out, this is the right notion of invariance to apply Noether’s principle. What conserved quantity do we get?

For each data point, let $(g_i)_{i = 1}^N$ be the backpropagated gradients of the loss function with respect to the intermediate positions $(x_i)_{i = 1}^N$ that this data point takes through our “process.” Applying a certain discrete Noether’s principle, detailed at the end of this post, will turn invariance of $\mathcal{F}$ under translations into conservation of the *expected value* $E[g_i]$ from one layer to the next. (You can display the individual vectors $g_i$ in the widget above.) Invariance under rotation gives a conserved quantity reminiscent of angular momentum, namely $E[g_i \wedge x_i].$
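To see the translation conservation law in a setting we can compute exactly, here is a numpy sketch (my own simplification: two affine layers standing in for the deformations, fit to an exact least-squares optimum, so the conserved layer means come out to zero):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 400, 2
X = rng.normal(size=(n, d))
Y = np.tanh(X @ rng.normal(size=(d, d))) + 0.1 * rng.normal(size=(n, d))

# Fit the composite affine map y = W x + b by least squares.
A = np.hstack([X, np.ones((n, 1))])
B, *_ = np.linalg.lstsq(A, Y, rcond=None)
W, b = B[:d].T, B[d]

# Split it into two affine layers; the factored parameters remain a
# stationary point of the loss.
W1 = rng.normal(size=(d, d)); b1 = rng.normal(size=d)
W2 = W @ np.linalg.inv(W1);   b2 = b - W2 @ b1

H = X @ W1.T + b1          # hidden activations x_1
P = H @ W2.T + b2          # outputs x_2 (equal to X @ W.T + b)

G2 = P - Y                 # per-sample output gradients g_2
G1 = G2 @ W2               # backpropagated to the hidden layer: g_1
G0 = G1 @ W1               # backpropagated to the input: g_0

# Invariance under conjugation by translations => E[g_i] is conserved
# from layer to layer (here its common value is 0).
print([np.abs(G.mean(axis=0)).max() for G in (G0, G1, G2)])
```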

In this particular example, it turns out that $E[g_i]$ is also approximately conserved (and nearly zero) over each individual *cluster*. In the next widget, we add some examples where the per-cluster gradient is not conserved. (Each cluster now has its average gradient drawn with a single arrow. For simplicity, we are not visualizing “angular gradient.”)

In the physical world, interactions are often accompanied by transfers of conserved quantities. Think, for example, of particles in a gas exchanging momentum and energy. Conversely, if two systems are not interacting with one another, then conservation laws will hold for each system in isolation. In mechanics, this is known as Newton’s first law.

In our first toy example, the subgoals of bringing each cluster to its respective destination could be achieved without compromise between clusters, meaning that each cluster acted like an independent subsystem. This lack of interaction explains the conservation of gradient within each cluster. (In other words, we can apply Newton’s first law.) In our next examples, “transfers” of average gradient indicate “interactions” between clusters, meaning that the trajectory of our process is no longer a stationary point for each subgoal considered in isolation. On the other hand, we can sometimes identify *subsets* of clusters over which gradient is approximately conserved.

In general, if the layers of a deep model have meaningful “Noether equivariances,” then we can build conserved quantities and perhaps study a physically-inspired notion of “interaction.” However, I don’t know if this idea works in practice! Can you think of interesting equivariances that we might find on real-world machine learning models? (Keep in mind that they might only be valid near the actual parameters of the model.)

## Proof of a Discrete Noether’s Principle

In the above, we’ve been intentionally vague about the details of Noether’s principle and how it applies to our toy example. We end this post with a general proof of our “discrete” Noether’s principle.

We’ll focus on a single step of an “optimized process.” Let $\varphi \colon \Theta \times X \to Y$ describe some family of functions parameterized over $\Theta,$ and define the action

$S = \langle g_x, x \rangle - \langle g_y, \varphi_\theta(x) - y \rangle$

where $x$ and $y$ are understood to be random variables, $g_x$ and $g_y$ are their associated loss gradients, and inner products are taken in expectation. In the typical context of supervised learning, $\varphi$ will parameterize a layer of a deep model, and $\langle g_x, x \rangle$ (for example) stands for a sum

$\frac{1}{n} \sum_{i = 1}^n \langle g_x(i), x(i) \rangle$

where $i$ indexes over a dataset and $(x(i), g_x(i))$ are activation and gradient vectors for the $i$th data point.

The “forward pass” $y = \varphi_\theta(x)$ and “backward pass” $g_x = (\partial \varphi / \partial x)^T g_y$ are encoded by the stationarity of $S$ with respect to $g_y$ and $x,$ since

$\frac{\partial S}{\partial g_y} = y - \varphi_\theta(x), \quad \frac{\partial S}{\partial x} = g_x - \left( \frac{\partial \varphi}{\partial x} \right)^T g_y.$

Similarly, first-order optimality of the parameter $\theta$ can be expressed as

$\frac{\partial S}{\partial \theta} = 0.$

In practice, this will hold only approximately, so our conservation law will also be approximate. We ignore this detail for simplicity.

Now, suppose that $(x, y, \theta)$ are made to depend on a real-valued parameter $\epsilon$ and that

$\frac{\partial}{\partial \epsilon} \langle g_y, \varphi_\theta(x) - y \rangle = 0. \tag{1}$

For example, when we have a Lie group $G$ acting on $X,$ $\Theta,$ and $Y$ such that

$\varphi_{g \cdot \theta}(g \cdot x) = g \cdot \varphi_\theta(x),$

then any one-parameter subgroup of $G$ gives rise to such a variation of $(x, y, \theta).$ This is the case for our toy example, where $G$ is the group $\operatorname{SE}(2)$ of translations and rotations of the plane. I’m calling this kind of symmetry a **Noether equivariance**.

Let’s compute $\partial S / \partial \epsilon.$ Applying $(1),$ we find

$\frac{\partial S}{\partial \epsilon} = \left\langle g_x, \frac{\partial x}{\partial \epsilon} \right\rangle.$

On the other hand, if $S$ is stationary with respect to $x$ and $\theta,$ then $\partial S / \partial \epsilon$ can be explained entirely by the variation of $y,$ which is

$\frac{\partial S}{\partial \epsilon} = \frac{\partial S}{\partial y} \frac{\partial y}{\partial \epsilon} = \left\langle g_y, \frac{\partial y}{\partial \epsilon} \right\rangle.$

Overall, we conclude that

$\left\langle g_x, \frac{\partial x}{\partial \epsilon} \right\rangle = \left\langle g_y, \frac{\partial y}{\partial \epsilon} \right\rangle.$

This is quite analogous to the conservation law

$\frac{d}{dt} \left\langle p, \frac{\partial q}{\partial \epsilon} \right\rangle = 0$

that Noether’s principle implies for a Hamiltonian system invariant under a family of canonical transformations.
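One concrete case where the conserved quantity is visibly nonzero: a chain of pure rotation layers (my own construction, not from the post). Rotations commute, so conjugation leaves the parameters fixed, and since rotations also preserve the wedge product, $g_i \wedge x_i$ is conserved identically here, even before any optimization.

```python
import numpy as np

rng = np.random.default_rng(2)

def rot(t):
    return np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])

Rs = [rot(a) for a in rng.normal(size=5)]   # five rotation layers
xs = [rng.normal(size=(100, 2))]            # a batch of inputs x_0
for R in Rs:
    xs.append(xs[-1] @ R.T)                 # forward: x_{k+1} = R_k x_k

target = rng.normal(size=(100, 2))
gs = [xs[-1] - target]                      # gradient of (1/2)||x_N - t||^2
for R in reversed(Rs):
    gs.append(gs[-1] @ R)                   # backward: g_k = R_k^T g_{k+1}
gs = gs[::-1]

# E[g_k ∧ x_k] takes the same (generally nonzero) value at every layer.
wedges = [np.mean(g[:, 0] * x[:, 1] - g[:, 1] * x[:, 0])
          for g, x in zip(gs, xs)]
print(wedges)
```

Because the group action here leaves $\theta$ fixed, no stationarity of the parameters is needed; the general principle above covers the harder case where the variation must also move $\theta.$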

One simple kind of Noether equivariance is a relationship like

$\varphi_{g \cdot \theta}(x) = g \cdot \varphi_\theta(x),$

meaning that our family of maps is expressive enough to be closed under the action of $G$ on its outputs. In this case, since the action on $x$ is trivial, $\partial x / \partial \epsilon = 0$ and the “conservation law” associated with any one-parameter subgroup of $G$ simply reads

$0 = \left\langle g_y, \frac{\partial y}{\partial \epsilon} \right\rangle.$

Indeed, this inner product is exactly the derivative of the loss with respect to the transformation of $y$ by a given one-parameter subgroup of $G.$ If variations to $\theta$ can produce these transformations and $\theta$ is chosen optimally, the inner product must be $0.$ (Informally, this explains why the average gradients in our toy example are approximately $0.$)

The same idea applies to a group acting only on $x$ and $\theta,$ in which case we get

$\left\langle g_x, \frac{\partial x}{\partial \epsilon} \right\rangle = 0.$

For example, if $x \in \mathbb{R}^n$ is the input to a linear layer, we get a “Noether equivariance” in the form of an action of $\operatorname{GL}(n)$ on $x$ and the layer parameters. Letting $\partial x / \partial \epsilon$ range over all linear functions of $x$ in the equation above—which we get from the action of one-parameter subgroups of $\operatorname{GL}(n)$—shows that the tensor product $g_x \otimes x$ will vanish in expectation if the loss is stationary with respect to the parameters of the linear layer. The general principle we proved above can be understood as the “two-sided” extension of these kinds of observations.
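As a numerical check on this last claim, the numpy sketch below (my own construction) fits a bias-free linear layer to an exact least-squares optimum and verifies that $E[g_x \otimes x]$ vanishes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 3
X = rng.normal(size=(n, d))                 # inputs x(i)
Y = np.tanh(X @ rng.normal(size=(d, d)))    # targets y(i)

# Fit a bias-free linear layer y = W x at an exact least-squares optimum.
B, *_ = np.linalg.lstsq(X, Y, rcond=None)
W = B.T

# Per-sample gradients of L = (1/2) ||W x - y||^2.
G_y = X @ W.T - Y    # gradient with respect to the layer output
G_x = G_y @ W        # backpropagated gradient with respect to the input

# E[g_x ⊗ x] vanishes: the normal equations give E[g_y ⊗ x] = 0,
# and g_x = W^T g_y.
M = G_x.T @ X / n
print(np.abs(M).max())  # ≈ 0 up to floating-point error
```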