From Discrete to Continuous Time
From discrete optimization methods to continuous-time dynamics at varying levels of precision.
Most optimization algorithms are introduced in discrete form: a recipe that updates parameters step by step.
But there’s another way to look at them.
Why Continuous Time?
By reinterpreting a discrete optimizer as a continuous-time dynamical system, hidden structure becomes visible.
This perspective often provides cleaner intuition, simpler analysis tools, and sometimes even inspires new algorithms.
The continuous-time view yields an ordinary differential equation (ODE) that describes how the parameters evolve over time.
In the ideal case, the trajectory of the resulting ODE matches the discrete iterates exactly at regularly spaced time points.
Standard low-resolution ODEs often collapse distinct algorithms into the same dynamics, hiding important differences.
To resolve this, we can use high-resolution differential equations (HRDEs).
With infinite precision, an HRDE perfectly matches the discrete method.
For practical analysis, we usually work with reduced resolution: we drop higher-order terms but keep the leading corrections in the step size.
This makes HRDEs more faithful than standard low-resolution ODEs, while still being mathematically tractable.
In this post, we’ll walk through a clear, step-by-step recipe for deriving ODEs—both low- and high-resolution—directly from a discrete update rule.
We’ll keep the discussion concrete, assume only basic ML knowledge, and avoid unnecessary math jargon.
1. Start with a Discrete Update Rule
Start from your favorite optimization method. In this post we will consider vanilla
gradient descent and optimistic gradient descent (sometimes called Popov's method).
Gradient Descent (GD) moves to the next iterate in the direction that locally decreases the loss:
\[
\mathbf{x}_{k+1} = \mathbf{x}_k - \gamma \ \nabla f(\mathbf{x}_k) \,.
\]
Here, \( \mathbf{x}_k \in \mathcal{R}^d\) are the model's parameters at step \(k\), \(f\colon \mathcal{R}^d \to \mathcal{R} \)
is a loss function we aim to minimize, and \(\gamma \in (0,1) \) is the step size.
This is a discrete recurrence: we move from \( \mathbf{x}_k \) to \( \mathbf{x}_{k+1} \) in steps of size \( \gamma \).
We can equivalently rewrite GD as follows:
\[ \tag{GD}
\frac{ \mathbf{x}_{k+1} - \mathbf{x}_k }{\gamma } = - \ \nabla f(\mathbf{x}_k) \,.
\]
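In code, one GD step is just this recurrence. Here is a minimal sketch (the quadratic loss and the step size below are illustrative assumptions, not part of the derivation):

```python
import numpy as np

def gd_step(x, grad_f, gamma):
    """One GD step: x_{k+1} = x_k - gamma * grad_f(x_k)."""
    return x - gamma * grad_f(x)

# Example: f(x) = ||x||^2 / 2, whose gradient is x itself.
x = np.ones(3)
for _ in range(100):
    x = gd_step(x, lambda z: z, gamma=0.1)
print(x)  # close to the minimizer 0 (each coordinate is 0.9**100)
```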
Optimistic Gradient Descent (OGD) is a popular method for multi-player problems,
such as \( \min_{x \in \mathcal{R}} \max_{y \in \mathcal{R}} \ x \cdot y \), where OGD converges while GD does not.
It applies a correction to the gradient by re-using the one computed at the previous iteration, \( \nabla f( \mathbf{x}_{k-1}) \), as follows:
\[
\mathbf{x}_{k+1} \;=\; \mathbf{x}_k \;-\; \gamma\, \Big(2\,\nabla f(\mathbf{x}_k)\;-\;\nabla f(\mathbf{x}_{k-1})\Big).
\]
Equivalently,
\[ \tag{OGD}
\frac{ \mathbf{x}_{k+1} - \mathbf{x}_k }{\gamma } = - \Big(2\,\nabla f(\mathbf{x}_k)\;-\;\nabla f(\mathbf{x}_{k-1})\Big) \,.
\]
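A matching sketch for OGD only needs to carry the previous gradient along. Initializing it with \( \nabla f(\mathbf{x}_0) \), i.e., treating \( \mathbf{x}_{-1} = \mathbf{x}_0 \), is a common convention and an assumption here:

```python
import numpy as np

def ogd_step(x, g_prev, grad_f, gamma):
    """One OGD step: x_{k+1} = x_k - gamma * (2 grad_f(x_k) - grad_f(x_{k-1})).

    Returns the new iterate and the gradient to reuse at the next step.
    """
    g = grad_f(x)
    return x - gamma * (2 * g - g_prev), g

# Same quadratic example as above: f(x) = ||x||^2 / 2.
x = np.ones(3)
g_prev = x.copy()  # convention: gradient at x_{-1} taken as gradient at x_0
for _ in range(100):
    x, g_prev = ogd_step(x, g_prev, lambda z: z, gamma=0.1)
print(x)
```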
2. Introduce a Time-Dependent Variable (in \(\mathcal{R}^d\))
To connect the iterates \(\{ \mathbf{x}_k\}_{k=1}^\infty \) to continuous-time dynamics, define a time-dependent variable \( X(\cdot)\):
\[
\mathbf{x}_k \approx X(k \cdot \delta) \,,
\]
where \( \delta \) is a fixed sampling interval.
Think of the discrete iterates \( \mathbf{x}_k \) as being sampled from some smooth trajectory \( X(t) \) at times \( k \cdot \delta \).
In our discrete-to-continuous modeling, we will soon fix the relation \( t = k \delta \) so that we “match” \( \mathbf{x}_k = X(t) \).
However, unlike \( k \in \mathcal{N} \), now \( t \in \mathcal{R}_+ \).
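For a quick, concrete sanity check of this ansatz (using the choice \( \delta = \gamma \) formalized in Step 4): running GD on \( f(x) = x^2/2 \) gives \( x_k = (1-\gamma)^k x_0 \), while the trajectory \( X(t) = e^{-t} x_0 \) of the ODE derived below, sampled at \( t = k\gamma \), gives
\[
X(k\gamma) = e^{-k\gamma} x_0 = \big( 1 - \gamma + \tfrac{\gamma^2}{2} - \dots \big)^k x_0 \,.
\]
The two agree up to step-size-dependent corrections, which are exactly the terms the remaining steps make explicit.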
3. Expand Terms Indexed at Times \( \neq k \) with Taylor
We next make each term's step-size dependence explicit through Taylor expansion (TE). This enables us to later drop terms beyond a given order.
Expand \(\mathbf{x}_{k+1} \).
Use a Taylor expansion of \(X\) around \(t \equiv k \cdot \delta \).
(Note that here we relate \(\mathbf{x}_{k+1} \) to \(\mathbf{x}_{k} \); hence the expansion is over time, and our notation for the step size in time is \( \delta \).)
\[ \mathbf{x}_{k+1} \approx X \big( (k+1) \cdot \delta \big) = X( t + \delta ) \approx X(t) + \delta\,\dot{X}(t) + \tfrac{\delta^2}{2} \ddot{X}(t) + \mathcal{O}(\delta^3) \,.
\]
Since \( \mathbf{x}_{k} \approx X( k \cdot \delta) \), for the numerator of the left-hand side of Eqs. (GD) and (OGD) we get:
\[ \tag{num-LHS}
\delta \big( \,\dot{X}(t) + \tfrac{\delta}{2} \ddot{X}(t) + \mathcal{O} (\delta^2) \big) \,.
\]
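This expansion is easy to sanity-check symbolically. A minimal sketch with sympy, where \( X \) is an arbitrary smooth scalar trajectory:

```python
import sympy as sp

t, delta = sp.symbols("t delta", positive=True)
X = sp.Function("X")

# x_{k+1} - x_k corresponds to X(t + delta) - X(t); expand in powers of delta:
num_lhs = sp.series(X(t + delta) - X(t), delta, 0, 3)
print(num_lhs)  # delta*X'(t) + delta**2/2*X''(t) + O(delta**3), in sympy's derivative notation
```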
Expand \(\nabla f( \mathbf{x}_{k-1} )\). Similarly, we get:
\[
\nabla f\big(X(k \delta -\delta)\big)
\approx \nabla f\big(X ( k\delta ) \big) \;-\; \delta \nabla^2 f\big(X(k\delta)\big) \dot{X}(k\delta) + \mathcal{O}(\delta^2) \,,
\]
where \(\dot{X} \) denotes the first time derivative of \( X \).
Hence, for the right-hand side of Eq. (OGD) we get:
\[ \tag{OGD-rhs}
- \nabla f\big(X ( k\delta ) \big) - \delta \nabla^2 f\big(X(k\delta)\big) \dot{X}(k\delta) + \mathcal{O}(\delta^2) \,.
\]
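The gradient expansion can be checked the same way in the scalar case (a sketch; the undefined function \( g \) stands in for \( \nabla f \), so its derivative plays the role of the Hessian \( \nabla^2 f \)):

```python
import sympy as sp

t, delta = sp.symbols("t delta", positive=True)
X = sp.Function("X")
g = sp.Function("g")  # stands in for the gradient of f (scalar case)

# grad f(x_{k-1}) corresponds to g(X(t - delta)); expand in powers of delta:
grad_prev = sp.series(g(X(t - delta)), delta, 0, 2)
print(grad_prev)  # g(X(t)) - delta*g'(X(t))*X'(t) + O(delta**2), in sympy's notation
```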
4. Choose the \(\gamma\)/\(\delta\) Ratio and the Resolution \( r \), and Take the Limit \( \gamma^{r+1} \to 0\)
Choose the \(\gamma\)/\(\delta\) ratio.
We tie the sampling interval \( \delta \) to the algorithm’s step size \( \gamma \):
pick a ratio \(\delta = f_\delta(\gamma)\).
(Multiple choices are possible and common; however, the same \( f_\delta \) should be used when discretizing back, for coherence of conclusions.)
A common simple choice is
\[ \delta = \gamma.\]
In the following, we will use this ratio and stick with \( \gamma \).
Choose the resolution \( r \).
Above we developed the Taylor expansions up to order \(\mathcal{O}(\gamma^2)\).
Choosing higher resolution means retaining more correction terms in \( \gamma\).
This makes the ODE more accurate but also more complex to analyze.
Typically, we select a resolution \( r \in \mathcal{N} \) and set \( \gamma^{r+1} \to 0 \); all higher-order terms vanish as well, since \(\gamma < 1\).
Here we consider:
- Low-resolution ODE, in which case \( \gamma \to 0 \) (\( r=0 \)).
- \( \mathcal{O}(\gamma) \)-HRDEs, in which case we set \( \gamma^{2} \to 0 \) and retain all terms up to linear order in \( \gamma \).
Take the limit.
Finally, take \( \gamma^{r+1}\rightarrow0\) and let \( k\delta\rightarrow t\).
Summary.
Intuitively, these choices sum up as follows: a movement of \(\gamma\) in parameter space takes \( \delta \) time,
which is a tiny interval (a small positive constant), since \( \gamma^2 \to 0 \).
5. Putting Everything Together
Gradient descent. Substituting the expansions above into Eq. (GD), we get:
\[ \dot{X}(t) + \tfrac{\gamma}{2}\,\ddot{X}(t) + \mathcal{O} (\gamma^2) = -\,\nabla f\!\big(X(t)\big) \,. \]
Optimistic gradient descent. Similarly, substituting into Eq. (OGD), we get:
\[ \dot{X}(t) + \tfrac{\gamma}{2}\,\ddot{X}(t) + \mathcal{O} (\gamma^2) =
- \nabla f\big(X ( t ) \big) - \gamma \nabla^2 f\big( X(t)\big) \dot{X}(t) + \mathcal{O}(\gamma^2) \,. \]
Hence, for the two selected resolutions we have the following ODE/HRDEs:
- Low-resolution ODE: setting \( \gamma \to 0\) gives the identical low-resolution ODE for both GD and OGD:
\[ \tag{(O)GD-ODE}
\dot{X}(t) = -\,\nabla f\!\big(X(t)\big) \,.
\]
- \( \mathcal{O}(\gamma) \)-HRDEs: setting \( \gamma^{2} \to 0 \) yields the following HRDEs:
\[ \tag{GD-HRDE}
\dot{X}(t) + \tfrac{\gamma}{2}\,\ddot{X}(t) = -\,\nabla f\!\big(X(t)\big) \,.
\]
\[ \tag{OGD-HRDE}
\dot{X}(t) + \tfrac{\gamma}{2}\,\ddot{X}(t) = -\,\nabla f\!\big(X(t)\big) - \gamma \nabla^2 f\big( X(t)\big) \dot{X}(t) \,.
\]
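To see how faithful each model is, we can compare the discrete iterates against numerically integrated solutions of the three continuous-time models above. The sketch below rests on assumptions of ours: the quadratic loss \( f(x) = x^2/2 \), \( \gamma = 0.1 \), initial conditions \( X(0) = x_0 \) and \( \dot{X}(0) = -\nabla f(x_0) \), and a fine forward-Euler integrator:

```python
import numpy as np

gamma, K, x0 = 0.1, 60, 1.0
grad = lambda x: x    # gradient of f(x) = x**2 / 2
hess = lambda x: 1.0  # Hessian of f (constant for this quadratic)

# Discrete iterates of GD and OGD (OGD convention: x_{-1} = x_0).
gd, ogd, g_prev = [x0], [x0], grad(x0)
for _ in range(K):
    gd.append(gd[-1] - gamma * grad(gd[-1]))
    g = grad(ogd[-1])
    ogd.append(ogd[-1] - gamma * (2 * g - g_prev))
    g_prev = g
gd, ogd = np.array(gd), np.array(ogd)

def integrate(accel, h=1e-4):
    """Forward-Euler integration of X'' = accel(X, X'), sampled at t = k*gamma."""
    xs, x, v = [x0], x0, -grad(x0)
    for _ in range(K):
        for _ in range(round(gamma / h)):
            x, v = x + h * v, v + h * accel(x, v)
        xs.append(x)
    return np.array(xs)

# Low-resolution ODE: X' = -grad f(X), solved exactly for this f.
low_res = x0 * np.exp(-gamma * np.arange(K + 1))
# GD-HRDE:  (gamma/2) X'' + X' = -grad f(X)
gd_hrde = integrate(lambda x, v: (2 / gamma) * (-grad(x) - v))
# OGD-HRDE: (gamma/2) X'' + X' = -grad f(X) - gamma * hess(X) * X'
ogd_hrde = integrate(lambda x, v: (2 / gamma) * (-grad(x) - gamma * hess(x) * v - v))

for name, traj in [("low-res ODE", low_res), ("GD-HRDE", gd_hrde), ("OGD-HRDE", ogd_hrde)]:
    print(f"{name:12s}  max|gd_k - X(k*gamma)|: {np.abs(gd - traj).max():.4f}"
          f"   max|ogd_k - X(k*gamma)|: {np.abs(ogd - traj).max():.4f}")
```

On this toy problem, each HRDE stays noticeably closer to its own discrete method than the shared low-resolution ODE does, which by construction cannot distinguish GD from OGD.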
✨ Recap of the Steps
For any discrete method with step size \(\gamma\):
- Write the update in a generic “update-per-step-size” form (make the \(\gamma\) dependence explicit).
- Introduce \(\mathbf{x}_k \approx X(k\delta)\) with a yet-unknown \(\delta=f_\delta(\gamma)\).
- Apply Taylor expansions to all terms indexed at times \(\neq k\) so only \(X(\cdot)\) at time \(t=k\delta\) remains.
- Choose \(\delta=f_\delta(\gamma)\) (e.g., \(\delta=\gamma\) or \(\delta=\sqrt{\gamma}\)), substitute, and collect powers of \(\gamma\).
- Pick a resolution (e.g., keep terms up to \(O(\gamma)\)), then let \(k\delta \to t\).
🧠 Discussions
Why go through this exercise? Broadly, there are two complementary ways to connect discrete optimization algorithms with continuous-time dynamics:
1. Discretization view.
In this perspective, one starts with an ODE and interprets different algorithms as different discretization schemes for it.
This approach often requires careful discretization techniques, and the quality of the correspondence can depend strongly on the problem at hand.
2. High-Resolution ODE View.
Here, we start directly from the discrete update rule and derive its continuous-time counterpart at infinite precision.
For tractability, we then choose a finite resolution, truncating higher-order terms.
At a certain level of approximation, distinct algorithms can collapse to the same ODE, highlighting their shared structure.
To distinguish between them, higher resolution may be needed.
Relation to Prior Work.
The idea of using refined ODE models is not new.
What is new here is a systematic, general recipe that extends the classical low-resolution derivation.
Compared to earlier efforts,
this approach applies Taylor expansions consistently to all terms
indexed at times other than \(k\), including those on the left-hand side of the update.
📝 Takeaway
Deriving an HRDE is a short, mechanical recipe: express the update, insert the time ansatz, expand, choose \(\delta=f_\delta(\gamma)\), keep terms up to your chosen resolution, and pass to the limit.
The result is a continuous-time model that mirrors the discrete method closely enough to analyze and compare behaviors that low-resolution ODEs can miss.
Acknowledgements
Based on joint work with M. Jordan and M. Zampetakis.