Mode Connectivity
Let \(w_1\) and \(w_2\) be the weights of two neural networks trained independently by minimizing a specified loss \(\mathcal{L}(w)\) (such as the cross-entropy loss), and denote the weight space of the network by \(\mathrm{net}\). Let \(\phi_{\theta}(t):[0,1] \rightarrow \mathrm{net}\) be a continuous, piecewise-smooth parametric curve with parameters \(\theta\) that connects \(w_1\) and \(w_2\), so that \(\phi_{\theta}(0) = w_1\) and \(\phi_{\theta}(1) = w_2\). To find a low-loss, high-accuracy path between \(w_1\) and \(w_2\), we look for the parameters \(\theta\) that minimize the expected loss under a uniform distribution on the curve, \(\hat \ell(\theta)\):
\begin{equation} \label{eq1} \hat \ell(\theta)=\int_{0}^{1}\mathcal{L}(\phi_{\theta}(t))q_{\theta}(t)dt=E_{t \sim q_{\theta}(t)}\mathcal{L}(\phi_{\theta}(t)), \end{equation}
where \(q_\theta(t)\) is the distribution used to sample models on the path, indexed by \(t\). However, stochastic gradients of \(\hat \ell(\theta)\) in \eqref{eq1} are generally intractable because \(q_{\theta}(t)\) depends on \(\theta\). One can therefore optimize a more computationally tractable loss
\begin{equation} \label{eq2} \ell(\theta)=\int_{0}^{1} \mathcal{L}(\phi_{\theta}(t))dt=E_{t \sim U(0,1)} \mathcal{L}(\phi_{\theta}(t)), \end{equation}
where \(U(0,1)\) is the uniform distribution on the interval \([0,1]\). The difference between \eqref{eq1} and \eqref{eq2} is that the former is an expectation of the loss \(\mathcal{L}(\phi_{\theta}(t))\) with respect to a uniform distribution on the curve, while the latter is an expectation with respect to a uniform distribution on \(t\in[0,1]\).
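To make the distinction concrete, one natural reading (an assumption here, since the density is not written out above) takes ``uniform on the curve'' to mean uniform with respect to arc length. Under that reading the density over \(t\) is
\begin{equation} q_{\theta}(t)=\frac{\|\phi_{\theta}'(t)\|}{\int_{0}^{1}\|\phi_{\theta}'(s)\|\,ds}, \end{equation}
so \(q_{\theta}\) reduces to \(U(0,1)\) exactly when the curve is traversed at constant speed, in which case \eqref{eq1} and \eqref{eq2} coincide. Both the numerator and the normalizing integral change with \(\theta\), which is why the sampling distribution in \eqref{eq1} cannot be treated as fixed during optimization.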
To minimize \(\ell(\theta)\), at each iteration one can sample \(\hat t\) from the uniform distribution \(U(0,1)\) and take a gradient step on \(\theta\) with respect to the loss \(\mathcal{L}(\phi_{\theta}(\hat t))\). That is, we use \(\nabla_\theta \mathcal{L}(\phi_{\theta}(\hat t))\) as a stochastic estimate of the true gradient of \(\ell(\theta)\),
\begin{equation} \nabla_\theta \mathcal{L}(\phi_{\theta}(\hat t)) \simeq E_{t \sim U(0, 1)} \nabla_\theta \mathcal{L}(\phi_{\theta}(t))=\nabla_\theta E_{t \sim U(0, 1)} \mathcal{L}(\phi_{\theta}(t))=\nabla_\theta \ell(\theta). \end{equation}
We can use a Bezier curve as the parametric family for \(\phi_\theta(t)\), since it provides a convenient parametrization of smooth paths with given endpoints, and we can initialize \(\theta\) at the midpoint of the endpoints, \(\frac{1}{2}(w_1+w_2)\). For instance, a quadratic Bezier curve \(\phi_{\theta}(t)\) with endpoints \(w_1\) and \(w_2\) is given by
\begin{equation} \phi_{\theta}(t) = (1 -t)^2 w_1 + 2t(1-t) \theta + t^2 w_2,~~ 0 \le t \le 1. \end{equation}
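The sampling procedure and the quadratic Bezier curve above can be combined in a minimal sketch. Everything specific in it is an assumption made for illustration: a two-dimensional toy loss that vanishes on the unit circle stands in for \(\mathcal{L}\), its gradient is written analytically instead of being obtained by backpropagation, and the names `toy_loss`, `toy_loss_grad`, `bezier`, `mean_curve_loss`, `fit_bend`, the step count, and the learning rate are all hypothetical. Because \(\partial \phi_{\theta}(t) / \partial \theta = 2t(1-t)I\), the chain rule gives \(\nabla_\theta \mathcal{L}(\phi_{\theta}(\hat t)) = 2\hat t(1-\hat t)\,\nabla_w \mathcal{L}(w)\) evaluated at \(w=\phi_{\theta}(\hat t)\).

```python
import random


def toy_loss(w):
    """Hypothetical stand-in for L(w): zero on the unit circle, positive elsewhere."""
    r2 = sum(x * x for x in w)
    return (r2 - 1.0) ** 2


def toy_loss_grad(w):
    """Analytic gradient of toy_loss with respect to w."""
    r2 = sum(x * x for x in w)
    return [4.0 * (r2 - 1.0) * x for x in w]


def bezier(theta, t, w1, w2):
    """Quadratic Bezier point (1-t)^2 w1 + 2t(1-t) theta + t^2 w2."""
    a, b, c = (1 - t) ** 2, 2 * t * (1 - t), t ** 2
    return [a * p + b * q + c * r for p, q, r in zip(w1, theta, w2)]


def mean_curve_loss(theta, w1, w2, n=201):
    """Grid approximation of l(theta) = E_{t ~ U(0,1)} L(phi_theta(t))."""
    return sum(toy_loss(bezier(theta, k / (n - 1), w1, w2)) for k in range(n)) / n


def fit_bend(w1, w2, steps=3000, lr=0.1, seed=0):
    """Plain SGD on theta, drawing one t_hat ~ U(0,1) per iteration."""
    rng = random.Random(seed)
    theta = [0.5 * (p + r) for p, r in zip(w1, w2)]  # midpoint initialization
    for _ in range(steps):
        t_hat = rng.random()
        grad_w = toy_loss_grad(bezier(theta, t_hat, w1, w2))
        scale = 2 * t_hat * (1 - t_hat)  # d phi / d theta = 2 t (1 - t) I
        theta = [th - lr * scale * g for th, g in zip(theta, grad_w)]
    return theta


# Two "modes" of the toy loss: both endpoints lie on the unit circle.
w1, w2 = [1.0, 0.0], [0.0, 1.0]
theta_init = [0.5 * (p + r) for p, r in zip(w1, w2)]
theta_fit = fit_bend(w1, w2)

loss_before = mean_curve_loss(theta_init, w1, w2)  # straight segment
loss_after = mean_curve_loss(theta_fit, w1, w2)    # trained curve
print("mean loss, midpoint init:", loss_before)
print("mean loss, after SGD:    ", loss_after)
```

With \(\theta\) at the midpoint, the quadratic Bezier curve reduces to the straight segment \((1-t)w_1 + t w_2\), so `loss_before` measures the loss barrier along linear interpolation in this toy setting, and `loss_after` measures how far the learned bend lowers it. In a real network one would replace `toy_loss_grad` with a minibatch gradient from automatic differentiation.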