Mode Connectivity

Let $w_1$ and $w_2$ be two sets of weights corresponding to two neural networks independently trained by minimizing any specified loss $\mathcal{L}(w)$ (like the cross-entropy loss), and the net space is net. Moreover, we could let $\phi_{\theta}(t):[0,1] \rightarrow$ net be a continuous piece-wise smooth parametric curve connecting $w_1$ and $w_2$, with parameters $\theta$, such that $\phi_{\theta}(0) = w_1$ and $\phi_{\theta}(1) = w_2$. To find a low-loss and high-accuracy path between $w_1$ and $w_2$, we can find the set of parameters $\theta$ that minimizes the expectation over a uniform distribution on the curve, $\hat \ell(\theta)$:

\begin{equation} \label{eq1} \hat \ell(\theta)=\int_{0}^{1}\mathcal{L}(\phi_{\theta}(t))q_{\theta}(t)dt=E_{t \sim q_{\theta}(t)}\mathcal{L}(\phi_{\theta}(t)), \end{equation}

where the $q_\theta(t)$ is the distribution for sampling the models on the path indexed by t. However, stochastic gradients of $\hat \ell(\theta)$ in \eqref{eq1} are generally intractable since $q_{\theta}(t)$ depends on $\theta$. Therefore one can choose a more computationally tractable loss

\begin{equation} \label{eq2} \ell(\theta)=\int_{0}^{1} \mathcal{L}(\phi_{\theta}(t))dt=E_{t \sim U(0,1)} \mathcal{L}(\phi_{\theta}(t)), \end{equation}

where $U(0,1)$ is the uniform distribution in the interval $[0,1]$. The difference between \eqref{eq1} and \eqref{eq2} is that the former is an expectation of the loss $\mathcal{L}(\phi_{\theta}(t))$ with respect to a uniform distribution on the curve, while \eqref{eq2} is an expectation with respect to an uniform distribution on $t\in[0,1]$.

To minimize $\ell(\theta)$, at each iteration one can sample $\hat t$ from the uniform distribution $U(0,1)$ and make a gradient step for $\theta$ with respect to the loss $\mathcal{L}(\phi_{\theta}(\hat t))$. This means that we would use $\nabla_\theta \mathcal{L}(\phi_{\theta}(\hat t))$ to estimate the true gradient of $\ell(\theta)$,

\begin{equation} \nabla_\theta \mathcal{L}(\phi_{\theta}(\hat t)) \backsimeq E_{t \sim U(0, 1)} \nabla_\theta \mathcal{L}(\phi_{\theta}(t))=\nabla_\theta E_{t \sim U(0, 1)} \mathcal{L}(\phi_{\theta}(t))=\nabla_\theta \ell(\theta). \end{equation}

We can choose the Bezier curve as the basic parametric function to characterize the parametric curve $\phi_\theta(t)$. And We could initialize $\theta$ with $\frac{1}{2}(w_p+w_c)$. A Bezier curve provides a convenient parametrization of smooth paths with given endpoints. For instance, a qadratic Bezier curve $\phi_{\theta}(t)$ with endpoints $w_1$ and $w_2$ is given by

\begin{equation} \phi_{\theta}(t) = (1 -t)^2 w_1 + 2t(1-t) \theta + t^2 w_2,~~ 0 \le t \le 1. \end{equation}