RNRI: Regularized Newton Raphson Inversion for Text-to-Image Diffusion Models

1OriginAI, 2Tel Aviv University, 3Bar Ilan University, 4NVIDIA Research

Image editing with RNRI inversion is significantly faster and yields higher quality than previous state-of-the-art methods. Results are shown for both latent diffusion models and fast latent consistency models.

Real Time Editing

The videos play at real-time speed. RNRI enables fast and accurate text-to-image editing using diffusion models.

Editing Examples

Various edits of the same input. Note that RNRI handles both subtle and extensive changes, as one would expect from the particular prompt change.


Abstract

Diffusion inversion is the problem of taking an image and a text prompt that describes it, and finding a noise latent that would generate the image. Most current inversion techniques operate by approximately solving an implicit equation, and may converge slowly or yield poor reconstructed images.

Here, we formulate the problem as finding the roots of an implicit equation and design a method to solve it efficiently. Our solution is based on Newton-Raphson (NR), a well-known technique in numerical analysis. A naive application of NR may be computationally infeasible and tends to converge to incorrect solutions. We describe an efficient regularized formulation that converges quickly to a solution that provides high-quality reconstructions. We also identify a source of inconsistency stemming from prompt conditioning during the inversion process, which significantly degrades the inversion quality. To address this, we introduce a prompt-aware adjustment of the encoding, effectively correcting this issue.

Our solution, Regularized Newton-Raphson Inversion, inverts an image within 0.5 sec for latent consistency models, opening the door for interactive image editing. We further demonstrate improved results in image interpolation and generation of rare objects.
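
For reference, the textbook Newton-Raphson update for a root-finding problem can be written as follows. The symbols \(r\), \(J_r\), \(D\) and \(k\) are notation introduced here for illustration and are not taken from this page; the regularized formulation used by RNRI is described in the paper and is not reproduced here.

\[
r(z) = 0, \qquad z^{k+1} = z^{k} - J_r(z^{k})^{-1}\, r(z^{k}), \qquad J_r(z) = \frac{\partial r}{\partial z}(z) \in \mathbb{R}^{D \times D}
\]

For a latent with \(D\) entries, each step of this naive form requires a \(D \times D\) Jacobian and a linear solve, which gives a sense of why a direct application of NR to image latents may be computationally infeasible.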


Inversion Pipeline

Newton-Raphson Inversion iterates over the objective function \(\mathcal{F}(z_t)\) at every time step in the inversion path. It starts with \(z_t^0=z_{t-1}\) and quickly converges (within 2 iterations) to \(z_t\). Each box denotes one inversion step; black circles correspond to intermediate latents in the denoising process; green circles correspond to intermediate Newton-Raphson iterations.
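
The per-step iteration can be illustrated with a minimal one-dimensional sketch. This is not the authors' implementation: the page does not define \(\mathcal{F}\), so the sketch assumes a hypothetical scalar implicit equation z = f(z), uses `tanh` as a stand-in for the network's noise prediction, picks the coefficients `a` and `b` arbitrarily, and omits the regularization term entirely. It only shows plain Newton-Raphson started from the previous latent, next to a simple fixed-point iteration for comparison.

```python
import math


def toy_noise_pred(z):
    """Hypothetical stand-in for the diffusion model's noise prediction."""
    return math.tanh(z)


def implicit_rhs(z, z_prev, a=1.02, b=0.3):
    """Right-hand side f(z) of the toy implicit equation z = f(z).

    a and b are arbitrary illustrative coefficients, not real scheduler values.
    """
    return a * z_prev + b * toy_noise_pred(z)


def residual(z, z_prev, a=1.02, b=0.3):
    """r(z) = z - f(z); the inverted latent is a root of r."""
    return z - implicit_rhs(z, z_prev, a, b)


def residual_grad(z, b=0.3):
    """Analytic derivative dr/dz = 1 - b * (1 - tanh(z)^2)."""
    return 1.0 - b * (1.0 - math.tanh(z) ** 2)


def newton_invert(z_prev, iters=2):
    """Plain Newton-Raphson on r(z), initialized at the previous latent."""
    z = z_prev
    for _ in range(iters):
        z = z - residual(z, z_prev) / residual_grad(z)
    return z


def fixed_point_invert(z_prev, iters=2):
    """Baseline: repeatedly substitute z <- f(z)."""
    z = z_prev
    for _ in range(iters):
        z = implicit_rhs(z, z_prev)
    return z


if __name__ == "__main__":
    z_prev = 0.5
    for name, solver in [("newton", newton_invert), ("fixed-point", fixed_point_invert)]:
        z = solver(z_prev, iters=2)
        print(f"{name}: z = {z:.6f}, |residual| = {abs(residual(z, z_prev)):.2e}")
```

In this toy setting, two Newton-Raphson iterations leave a much smaller residual than two fixed-point iterations. In the real high-dimensional setting the derivative is no longer a scalar, which is where an efficient formulation becomes necessary.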

PSNR & Run Time Results

(a) Convergence rate. Comparison of iterative methods in an image inversion-reconstruction task over the COCO validation set. The mean PSNR of reconstructed images is plotted against the number of iterations. The dashed line represents the upper bound on reconstruction quality determined by the VAE in Stable Diffusion. Mean convergence time (in seconds) is reported for each method. Our RNRI achieves a PSNR close to the upper limit and converges within only 1-2 iterations.
(b) Prior effect on convergence. Incorporating our prior not only aids in finding the correct solution but also accelerates convergence.
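
For readers unfamiliar with the metric, PSNR can be computed as in the short sketch below. It uses the standard definition, 10 · log10(MAX² / MSE), and assumes images are given as flat sequences of pixel values in [0, 1]; the function name and the `max_value` parameter are illustrative and do not come from the evaluation code behind the plots.

```python
import math


def psnr(original, reconstructed, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(original) != len(reconstructed) or len(original) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

A higher PSNR means the reconstruction is closer to the input image; a perfect reconstruction gives an infinite value.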

BibTeX

@misc{samuel2023regularized,
  author    = {Dvir Samuel and Barak Meiri and Nir Darshan and Shai Avidan and Gal Chechik and Rami Ben-Ari},
  title     = {Regularized Newton Raphson Inversion for Text-to-Image Diffusion Models},
  year      = {2023}
}