Awesome Instruction Editing

A Survey of Instruction-Guided Image and Media Editing in LLM Era


A curated collection of academic papers, published methods, and datasets on instruction-guided image and media editing.

A sortable version is available here: https://awesome-instruction-editing.github.io/

🔖 News!!!

📌 We are actively tracking the latest research and welcome contributions to our repository and survey paper. If your work is relevant, please open an issue or a pull request.

📰 2024-11-15: Our paper Instruction-Guided Editing Controls for Images and Multimedia: A Survey in LLM Era has been revised into version 1 with new methods and discussions.

🔍 Citation

If you find this work helpful in your research, please consider citing the paper and giving the repository a ⭐.

Please read and cite our paper: arXiv

Nguyen, T.T., Ren, Z., Pham, T., Huynh, T.T., Nguyen, P.L., Yin, H., and Nguyen, Q.V.H., 2024. Instruction-Guided Editing Controls for Images and Multimedia: A Survey in LLM Era. arXiv preprint arXiv:2411.09955.

@article{nguyen2024instruction,
  title={Instruction-Guided Editing Controls for Images and Multimedia: A Survey in LLM era},
  author={Thanh Tam Nguyen and Zhao Ren and Trinh Pham and Thanh Trung Huynh and Phi Le Nguyen and Hongzhi Yin and Quoc Viet Hung Nguyen},
  journal={arXiv preprint arXiv:2411.09955},
  year={2024}
}

Existing Surveys

| Paper Title | Venue | Year | Focus |
| --- | --- | --- | --- |
| A Survey of Multimodal Composite Editing and Retrieval | arXiv | 2024 | Media Retrieval |
| INFOBENCH: Evaluating Instruction Following Ability in Large Language Models | arXiv | 2024 | Text Editing |
| Multimodal Image Synthesis and Editing: The Generative AI Era | TPAMI | 2023 | X-to-Image Generation |
| LLM-driven Instruction Following: Progresses and Concerns | EMNLP | 2023 | Text Editing |

Pipeline

[Figure: pipeline]


Approaches

| Title | Year | Venue | Category | Code |
| --- | --- | --- | --- | --- |
| Guiding Instruction-based Image Editing via Multimodal Large Language Models | 2024 | ICLR | LLM-guided, Diffusion, Concise instruction loss, Supervised fine-tuning | Code |
| Hive: Harnessing human feedback for instructional visual editing | 2024 | CVPR | RLHF, Diffusion, Data augmentation | Code |
| InstructBrush: Learning Attention-based Instruction Optimization for Image Editing | 2024 | arXiv | Diffusion, Attention-based | Code |
| FlexEdit: Flexible and Controllable Diffusion-based Object-centric Image Editing | 2024 | arXiv | Controllable diffusion | Code |
| Pix2Pix-OnTheFly: Leveraging LLMs for Instruction-Guided Image Editing | 2024 | arXiv | on-the-fly, tuning-free, training-free | Code |
| EffiVED: Efficient Video Editing via Text-instruction Diffusion Models | 2024 | arXiv | Video editing, decoupled classifier-free | Code |
| Grounded-Instruct-Pix2Pix: Improving Instruction Based Image Editing with Automatic Target Grounding | 2024 | ICASSP | Diffusion, mask generation image editing | Code |
| TexFit: Text-Driven Fashion Image Editing with Diffusion Models | 2024 | AAAI | Fashion editing, region location, diffusion | Code |
| InstructGIE: Towards Generalizable Image Editing | 2024 | arXiv | Diffusion, context matching | Code |
| An Item is Worth a Prompt: Versatile Image Editing with Disentangled Control | 2024 | arXiv | Freestyle, Diffusion, Group attention | Code |
| Text-Driven Image Editing via Learnable Regions | 2024 | CVPR | Region generation, diffusion, mask-free | Code |
| ChartReformer: Natural Language-Driven Chart Image Editing | 2024 | ICDAR | chart editing | Code |
| GANTASTIC: GAN-based Transfer of Interpretable Directions for Disentangled Image Editing in Text-to-Image Diffusion Models | 2024 | arXiv | Hybrid, direction transfer | Code |
| StyleBooth: Image Style Editing with Multimodal Instruction | 2024 | arXiv | style editing, diffusion | Code |
| ZONE: Zero-Shot Instruction-Guided Local Editing | 2024 | CVPR | Local editing, localisation | Code |
| Inversion-Free Image Editing with Natural Language | 2024 | CVPR | Consistent models, unified attention | Code |
| Focus on Your Instruction: Fine-grained and Multi-instruction Image Editing by Attention Modulation | 2024 | CVPR | Diffusion, multi-instruction | Code |
| MoEController: Instruction-based Arbitrary Image Manipulation with Mixture-of-Expert Controllers | 2024 | arXiv | MoE, LLM-powered | Code |
| InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists | 2024 | ICLR | Diffusion, LLM-based, classifier-free | Code |
| Iterative Multi-Granular Image Editing Using Diffusion Models | 2024 | WACV | Diffusion, Iterative editing | |
| Dynamic Prompt Learning: Addressing Cross-Attention Leakage for Text-Based Image Editing | 2024 | NeurIPS | Diffusion, dynamic prompt | Code |
| Object-Aware Inversion and Reassembly for Image Editing | 2024 | ICLR | Diffusion, multi-object | Code |
| Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models | 2024 | arXiv | video editing, zero-shot | Code |
| Video-P2P: Video Editing with Cross-attention Control | 2024 | CVPR | Decoupled-guidance attention control, video editing | Code |
| NeRF-Insert: 3D Local Editing with Multimodal Control Signals | 2024 | arXiv | 3D Editing | |
| BlenderAlchemy: Editing 3D Graphics with Vision-Language Models | 2024 | arXiv | 3D Editing | Code |
| AudioScenic: Audio-Driven Video Scene Editing | 2024 | arXiv | audio-based instruction | |
| LocInv: Localization-aware Inversion for Text-Guided Image Editing | 2024 | CVPR-AI4CC | Localization-aware inversion | Code |
| SonicDiffusion: Audio-Driven Image Generation and Editing with Pretrained Diffusion Models | 2024 | arXiv | Audio-driven | Code |
| Exploring Text-Guided Single Image Editing for Remote Sensing Images | 2024 | arXiv | Remote sensing images | Code |
| GaussianVTON: 3D Human Virtual Try-ON via Multi-Stage Gaussian Splatting Editing with Image Prompting | 2024 | arXiv | Fashion editing | Code |
| TIE: Revolutionizing Text-based Image Editing for Complex-Prompt Following and High-Fidelity Editing | 2024 | arXiv | Chain of thought | |
| Unified Editing of Panorama, 3D Scenes, and Videos Through Disentangled Self-Attention Injection | 2024 | arXiv | Diffusion, Self-attention Injection | Code |
| Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning | 2024 | arXiv | Music editing, diffusion | Code |
| Text Guided Image Editing with Automatic Concept Locating and Forgetting | 2024 | arXiv | Diffusion, concept forgetting | |
| InstructPix2Pix: Learning To Follow Image Editing Instructions | 2023 | CVPR | Core paper, Diffusion | Code |
| Visual Instruction Inversion: Image Editing via Image Prompting | 2023 | NeurIPS | Diffusion, visual instruction | Code |
| Instruct-NeRF2NeRF: Editing 3D Scenes with Instructions | 2023 | ICCV | 3D scene editing | Code |
| Instruct 3D-to-3D: Text Instruction Guided 3D-to-3D conversion | 2023 | arXiv | 3D editing, Dynamic scaling | Code |
| InstructME: An Instruction Guided Music Edit And Remix Framework with Latent Diffusion Models | 2023 | arXiv | Music editing, diffusion | Code |
| EditShield: Protecting Unauthorized Image Editing by Instruction-guided Diffusion Models | 2023 | arXiv | authorized editing, diffusion | Code |
| Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis | 2023 | arXiv | Video editing, cross-time attention | Code |
| AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models | 2023 | NeurIPS | Audio, Diffusion | Code |
| InstructAny2Pix: Flexible Visual Editing via Multimodal Instruction Following | 2023 | arXiv | Refinement prior, instructional tuning | Code |
| Learning to Follow Object-Centric Image Editing Instructions Faithfully | 2023 | EMNLP | Diffusion, additional supervision | Code |
| StableVideo: Text-driven Consistency-aware Diffusion Video Editing | 2023 | ICCV | Diffusion, Video | Code |
| Vox-E: Text-Guided Voxel Editing of 3D Objects | 2023 | ICCV | Diffusion, 3D | Code |
| FICE: Text-Conditioned Fashion Image Editing With Guided GAN Inversion | 2023 | arXiv | GAN, fashion images | Code |
| NULL-Text Inversion for Editing Real Images Using Guided Diffusion Models | 2023 | CVPR | null-text embedding, Diffusion, CLIP | Code |
| Imagic: Text-based real image editing with diffusion models | 2023 | CVPR | Diffusion, embedding interpolation | Code |
| PhotoVerse: Tuning-Free Image Customization with Text-to-Image Diffusion Models | 2023 | arXiv | Diffusion, dual-branch concept | Code |
| InstructEdit: Improving Automatic Masks for Diffusion-based Image Editing With User Instructions | 2023 | arXiv | Diffusion, LLM-powered | Code |
| Instructdiffusion: A generalist modeling interface for vision tasks | 2023 | arXiv | Multi-task, multi-turn, Diffusion, LLM | Code |
| Emu Edit: Precise Image Editing via Recognition and Generation Tasks | 2023 | arXiv | Diffusion, multi-task, multi-turn | Code |
| Dialogpaint: A dialog-based image editing model | 2023 | arXiv | Dialog-based | |
| Inst-Inpaint: Instructing to Remove Objects with Diffusion Models | 2023 | arXiv | Scene Editing | Code |
| ImageBrush: Learning Visual In-Context Instructions for Exemplar-Based Image Manipulation | 2023 | NeurIPS | Example-based instruction | |
| SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models | 2023 | arXiv | MLLM, Diffusion | Code |
| ChatFace: Chat-Guided Real Face Editing via Diffusion Latent Space Manipulation | 2023 | arXiv | LLM, Diffusion | Code |
| iEdit: Localised Text-guided Image Editing with Weak Supervision | 2023 | arXiv | Localized diffusion | |
| Prompt-to-Prompt Image Editing with Cross Attention Control | 2023 | ICLR | Diffusion, Cross Attention | Code |
| Target-Free Text-Guided Image Manipulation | 2023 | AAAI | 3D Editing | Code |
| Paint by example: Exemplar-based image editing with diffusion models | 2023 | CVPR | Diffusion, example-based | Code |
| De-net: Dynamic text-guided image editing adversarial networks | 2023 | AAAI | GAN, multi-task | Code |
| Imagen editor and editbench: Advancing and evaluating text-guided image inpainting | 2023 | CVPR | Diffusion, benchmark, CLIP | Code |
| Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation | 2023 | CVPR | Diffusion, feature injection | Code |
| MasaCtrl: Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis and Editing | 2023 | ICCV | Diffusion, mutual self-attention | Code |
| Unitune: Text-driven image editing by fine tuning a diffusion model on a single image | 2023 | TOG | Diffusion, fine-tuning | Code |
| Dreamix: Video Diffusion Models are General Video Editors | 2023 | arXiv | Cascaded diffusion, video | Code |
| LDEdit: Towards Generalized Text Guided Image Manipulation via Latent Diffusion Models | 2022 | BMVC | latent diffusion | |
| StyleMC: Multi-Channel Based Fast Text-Guided Image Generation and Manipulation | 2022 | WACV | GAN, CLIP | Code |
| Blended Diffusion for Text-Driven Editing of Natural Images | 2022 | CVPR | Diffusion, CLIP, Blend | Code |
| VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance | 2022 | ECCV | GAN, CLIP | Code |
| StyleGAN-NADA: CLIP-guided domain adaptation of image generators | 2022 | TOG | GAN, CLIP | Code |
| DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation | 2022 | CVPR | Diffusion, CLIP, Noise combination | Code |
| GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models | 2022 | ICML | Diffusion, CLIP, Classifier-free guidance | Code |
| DiffEdit: Diffusion-based semantic image editing with mask guidance | 2022 | ICLR | Diffusion, DDIM, Mask generation | Code |
| Text2mesh: Text-driven neural stylization for meshes | 2022 | CVPR | 3D Editing | Code |
| Manitrans: Entity-level text-guided image manipulation via token-wise semantic alignment and generation | 2022 | CVPR | GAN, multi-entities | Code |
| Text2live: Text-driven layered image and video editing | 2022 | ECCV | GAN, CLIP, Video editing | Code |
| SpeechPainter: Text-conditioned Speech Inpainting | 2022 | Interspeech | Speech editing | Code |
| Talk-to-Edit: Fine-Grained Facial Editing via Dialog | 2021 | ICCV | GAN, dialog, semantic field | Code |
| Manigan: Text-guided image manipulation | 2020 | CVPR | GAN, affine combination | Code |
| SSCR: Iterative Language-Based Image Editing via Self-Supervised Counterfactual Reasoning | 2020 | EMNLP | GAN, Cross-task consistency | Code |
| Open-Edit: Open-Domain Image Manipulation with Open-Vocabulary Instructions | 2020 | ECCV | GAN | Code |
| Sequential Attention GAN for Interactive Image Editing | 2020 | MM | GAN, Dialog, Sequential Attention | |
| Lightweight generative adversarial networks for text-guided image manipulation | 2020 | NeurIPS | Light-weight GAN | Code |
| Tell, Draw, and Repeat: Generating and Modifying Images Based on Continual Linguistic Instruction | 2019 | ICCV | GAN | Code |
| Language-Based Image Editing With Recurrent Attentive Models | 2018 | CVPR | GAN, Recurrent Attention | Code |
| Text-Adaptive Generative Adversarial Networks: Manipulating Images with Natural Language | 2018 | NeurIPS | GAN, simple | Code |
| FreeEdit: Mask-free Reference-based Image Editing with Multi-modal Instruction | 2024 | arXiv | Diffusion, instruction-driven editing | Code |
| Revealing Directions for Text-guided 3D Face Editing | 2024 | arXiv | Text-guided 3D face editing | |
| Vision-guided and Mask-enhanced Adaptive Denoising for Prompt-based Image Editing | 2024 | arXiv | Text-to-image, editing, diffusion | |
| Hyper-parameter tuning for text guided image editing | 2024 | arXiv | Text Editing | Code |
| Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models | 2024 | arXiv | Text-guided Object Insertion | Code |

Other types of Editing

| Title | Year | Venue | Category | Code |
| --- | --- | --- | --- | --- |
| SGEdit: Bridging LLM with Text2Image Generative Model for Scene Graph-based Image Editing | 2024 | SIGGRAPH Asia | Diffusion, scene graph, image-editing | Code |
| Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition | 2024 | arXiv | Text-to-Audio, Multimodal | |
| AudioEditor: A Training-Free Diffusion-Based Audio Editing Framework | 2024 | arXiv | Diffusion-based text-to-audio | Code |
| Enabling Local Editing in Diffusion Models by Joint and Individual Component Analysis | 2024 | BMVC | Diffusion-based local image manipulation | Code |
| Steer-by-prior Editing of Symbolic Music Loops | 2024 | MML | Masked Language Modelling, music instruments | Code |
| Audio Prompt Adapter: Unleashing Music Editing Abilities for Text-to-Music with Lightweight Finetuning | 2024 | ISMIR | Diffusion-based text-to-audio | Code |
| GroupDiff: Diffusion-based Group Portrait Editing | 2024 | ECCV | Diffusion-based image editing | Code |
| RegionDrag: Fast Region-Based Image Editing with Diffusion Models | 2024 | ECCV | Diffusion-based image editing | Code |
| SyncNoise: Geometrically Consistent Noise Prediction for Text-based 3D Scene Editing | 2024 | arXiv | Multi-view consistency | |
| DreamCatalyst: Fast and High-Quality 3D Editing via Controlling Editability and Identity Preservation | 2024 | arXiv | Diffusion-based editing | Code |
| MEDIC: Zero-shot Music Editing with Disentangled Inversion Control | 2024 | arXiv | Audio editing | |
| 3DEgo: 3D Editing on the Go! | 2024 | ECCV | Monocular 3D Scene Synthesis | Code |
| MedEdit: Counterfactual Diffusion-based Image Editing on Brain MRI | 2024 | SASHIMI | Biomedical editing | |
| FlexiEdit: Frequency-Aware Latent Refinement for Enhanced Non-Rigid Editing | 2024 | ECCV | Image editing | |
| LEMON: Localized Editing with Mesh Optimization and Neural Shaders | 2024 | arXiv | Mesh editing | |
| Diffusion Brush: A Latent Diffusion Model-based Editing Tool for AI-generated Images | 2024 | arXiv | Image editing | |
| Streamlining Image Editing with Layered Diffusion Brushes | 2024 | arXiv | Image editing | |
| SEED-Data-Edit Technical Report: A Hybrid Dataset for Instructional Image Editing | 2024 | arXiv | Image Editing Dataset | Code |
| Environment Maps Editing using Inverse Rendering and Adversarial Implicit Functions | 2024 | arXiv | Inverse rendering, HDR editing | |
| HairDiffusion: Vivid Multi-Colored Hair Editing via Latent Diffusion | 2024 | arXiv | Hair editing, Diffusion models | |
| DiffuMask-Editor: A Novel Paradigm of Integration Between the Segmentation Diffusion Model and Image Editing to Improve Segmentation Ability | 2024 | arXiv | Synthetic Data Generation | |
| Taming Rectified Flow for Inversion and Editing | 2024 | arXiv | Image Inversion | Code |
| Pathways on the Image Manifold: Image Editing via Video Generation | 2024 | arXiv | video-based editing, Frame2Frame, Temporal Editing Caption | |

Datasets

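Type: General

| Dataset | #Items | #Papers Used | Link |
| --- | --- | --- | --- |
| Reason-Edit | 12.4M+ | 1 | Link |
| MagicBrush | 10K | 1 | Link |
| InstructPix2Pix | 500K | 1 | Link |
| EditBench | 240 | 1 | Link |

Type: Image Captioning

| Dataset | #Items | #Papers Used | Link |
| --- | --- | --- | --- |
| Conceptual Captions | 3.3M | 1 | Link |
| CoSaL | 22K+ | 1 | Link |
| ReferIt | 19K+ | 1 | Link |
| Oxford-102 Flowers | 8K+ | 1 | Link |
| LAION-5B | 5.85B+ | 1 | Link |
| MS-COCO | 330K | 2 | Link |
| DeepFashion | 800K | 2 | Link |
| Fashion-IQ | 77K+ | 1 | Link |
| Fashion200k | 200K | 1 | Link |
| MIT-States | 63K+ | 1 | Link |
| CIRR | 36K+ | 1 | Link |

Type: ClipArt

| Dataset | #Items | #Papers Used | Link |
| --- | --- | --- | --- |
| CoDraw | 58K+ | 1 | Link |

Type: VQA

| Dataset | #Items | #Papers Used | Link |
| --- | --- | --- | --- |
| i-CLEVR | 70K+ | 1 | Link |

Type: Semantic Segmentation

| Dataset | #Items | #Papers Used | Link |
| --- | --- | --- | --- |
| ADE20K | 27K+ | 1 | Link |

Type: Object Classification

| Dataset | #Items | #Papers Used | Link |
| --- | --- | --- | --- |
| Oxford-III-Pets | 7K+ | 1 | Link |

Type: Depth Estimation

| Dataset | #Items | #Papers Used | Link |
| --- | --- | --- | --- |
| NYUv2 | 408K+ | 1 | Link |

Type: Aesthetic-Based Editing

| Dataset | #Items | #Papers Used | Link |
| --- | --- | --- | --- |
| Laion-Aesthetics V2 | 2.4B+ | 1 | Link |

Type: Dialog-Based Editing

| Dataset | #Items | #Papers Used | Link |
| --- | --- | --- | --- |
| CelebA-Dialog | 202K+ | 1 | Link |
| Flickr-Faces-HQ | 70K | 2 | Link |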

Evaluation Metrics

| Category | Evaluation Metrics | Formula | Usage |
| --- | --- | --- | --- |
| Perceptual Quality | Learned Perceptual Image Patch Similarity (LPIPS) | $\text{LPIPS}(x, x') = \sum_l \lVert \phi_l(x) - \phi_l(x') \rVert^2$ | Measures perceptual similarity between images, with lower scores indicating higher similarity. |
| | Structural Similarity Index (SSIM) | $\text{SSIM}(x, x') = \frac{(2\mu_x\mu_{x'} + C_1)(2\sigma_{xx'} + C_2)}{(\mu_x^2 + \mu_{x'}^2 + C_1)(\sigma_x^2 + \sigma_{x'}^2 + C_2)}$ | Measures visual similarity based on luminance, contrast, and structure. |
| | Fréchet Inception Distance (FID) | $\text{FID} = \lVert \mu_r - \mu_g \rVert^2 + \text{Tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2})$ | Measures the distance between the real and generated image feature distributions. |
| | Inception Score (IS) | $\text{IS} = \exp(E_x D_{KL}(p(y \mid x) \parallel p(y)))$ | Evaluates image quality and diversity based on label distribution consistency. |
| Structural Integrity | Peak Signal-to-Noise Ratio (PSNR) | $\text{PSNR} = 10 \log_{10} \left( \frac{\text{MAX}^2}{\text{MSE}} \right)$ | Measures image quality based on pixel-wise errors, with higher values indicating better quality. |
| | Mean Intersection over Union (mIoU) | $\text{mIoU} = \frac{1}{N} \sum_{i=1}^{N} \frac{\lvert A_i \cap B_i \rvert}{\lvert A_i \cup B_i \rvert}$ | Assesses segmentation accuracy by comparing predicted and ground truth masks. |
| | Mask Accuracy | $\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{FP} + \text{FN} + \text{TN}}$ | Evaluates the accuracy of generated masks. |
| | Boundary Adherence | $\text{BA} = \frac{\lvert B_{\text{edit}} \cap B_{\text{target}} \rvert}{\lvert B_{\text{target}} \rvert}$ | Measures how well edits preserve object boundaries. |
| Semantic Alignment | Edit Consistency | $\text{EC} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}[E_i = E_{\text{ref}}]$ | Measures the consistency of edits across similar prompts. |
| | Target Grounding Accuracy | $\text{TGA} = \frac{\text{Correct Targets}}{\text{Total Targets}}$ | Evaluates how well edits align with specified targets in the prompt. |
| | Embedding Space Similarity | $\text{CosSim}(v_x, v_{x'}) = \frac{v_x \cdot v_{x'}}{\lVert v_x \rVert \lVert v_{x'} \rVert}$ | Measures similarity between the edited and reference images in feature space. |
| | Decomposed Requirements Following Ratio (DRFR) | $\text{DRFR} = \frac{1}{N} \sum_{i=1}^{N} \frac{\text{Requirements Followed}}{\text{Total Requirements}}$ | Assesses how closely the model follows decomposed instructions. |
| User-Based Metrics | User Study Ratings | | Captures user feedback through ratings of image quality. |
| | Human Visual Turing Test (HVTT) | $\text{HVTT} = \frac{\text{Real Judgements}}{\text{Total Judgements}}$ | Measures the ability of users to distinguish between real and generated images. |
| | Click-through Rate (CTR) | $\text{CTR} = \frac{\text{Clicks}}{\text{Total Impressions}}$ | Tracks user engagement by measuring image clicks. |
| Diversity and Fidelity | Edit Diversity | $\text{Diversity} = \frac{1}{N} \sum_{i=1}^{N} D_{KL}(p_i \parallel p_{\text{mean}})$ | Measures the variability of generated images. |
| | GAN Discriminator Score | $\text{GDS} = \frac{1}{N} \sum_{i=1}^N D_{\text{GAN}}(x_i)$ | Assesses the authenticity of generated images using a GAN discriminator. |
| | Reconstruction Error | $\text{RE} = \lVert x - \hat{x} \rVert$ | Measures the error between the original and generated images. |
| | Edit Success Rate | $\text{ESR} = \frac{\text{Successful Edits}}{\text{Total Edits}}$ | Quantifies the success of applied edits. |
| Consistency and Cohesion | Scene Consistency | $\text{SC} = \frac{1}{N} \sum_{i=1}^{N} \text{Sim}(I_{\text{edit}}, I_{\text{orig}})$ | Measures how edits maintain overall scene structure. |
| | Color Consistency | $\text{CC} = \frac{1}{N} \sum_{i=1}^{N} \frac{\lvert C_{\text{edit}} \cap C_{\text{orig}} \rvert}{\lvert C_{\text{orig}} \rvert}$ | Measures color preservation between edited and original regions. |
| | Shape Consistency | $\text{ShapeSim} = \frac{1}{N} \sum_{i=1}^{N} \text{IoU}(S_{\text{edit}}, S_{\text{orig}})$ | Quantifies how well shapes are preserved during edits. |
| | Pose Matching Score | $\text{PMS} = \frac{1}{N} \sum_{i=1}^{N} \text{Sim}(\theta_{\text{edit}}, \theta_{\text{orig}})$ | Assesses pose consistency between original and edited images. |
| Robustness | Noise Robustness | $\text{NR} = \frac{1}{N} \sum_{i=1}^{N} \lVert x_i - x_{i,\text{noisy}} \rVert$ | Evaluates model robustness to noise. |
| | Perceptual Quality | $\text{PQ} = \frac{1}{N} \sum_{i=1}^{N} \text{Score}(x_i)$ | A subjective quality metric based on human judgment. |
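
Several of the closed-form metrics listed above (PSNR, Mask Accuracy, mIoU, and Embedding Space Similarity) are simple enough to sketch in plain Python. The snippet below is a minimal illustration, not a reference implementation from the survey: the function names are hypothetical, images and embeddings are assumed to be flat sequences of numbers, and masks are assumed to be flat sequences of 0/1 values. Returning 1.0 for the IoU of two empty masks is also an assumption made here for convenience.

```python
import math
from typing import Sequence


def psnr(original: Sequence[float], edited: Sequence[float],
         max_value: float = 255.0) -> float:
    """PSNR = 10 * log10(MAX^2 / MSE); higher means the images are closer."""
    if len(original) != len(edited) or len(original) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(original, edited)) / len(original)
    if mse == 0:
        return math.inf  # identical inputs: the ratio is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


def mask_accuracy(pred: Sequence[int], truth: Sequence[int]) -> float:
    """(TP + TN) / (TP + FP + FN + TN): fraction of pixels labelled correctly."""
    if len(pred) != len(truth) or len(pred) == 0:
        raise ValueError("masks must be non-empty and of equal length")
    correct = sum(1 for p, t in zip(pred, truth) if bool(p) == bool(t))
    return correct / len(pred)


def mean_iou(pred_masks: Sequence[Sequence[int]],
             truth_masks: Sequence[Sequence[int]]) -> float:
    """Average of |A_i ∩ B_i| / |A_i ∪ B_i| over N mask pairs."""
    if len(pred_masks) != len(truth_masks) or len(pred_masks) == 0:
        raise ValueError("need the same non-zero number of masks on both sides")
    ious = []
    for pred, truth in zip(pred_masks, truth_masks):
        inter = sum(1 for p, t in zip(pred, truth) if p and t)
        union = sum(1 for p, t in zip(pred, truth) if p or t)
        # Assumption: two empty masks count as a perfect match.
        ious.append(inter / union if union else 1.0)
    return sum(ious) / len(ious)


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """CosSim = (u . v) / (||u|| ||v||) for two embedding vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


if __name__ == "__main__":
    print(psnr([0, 255], [255, 0]))                    # 0.0 (MSE equals MAX^2)
    print(mask_accuracy([1, 0, 1, 1], [1, 0, 0, 1]))   # 0.75
    print(mean_iou([[1, 1, 0, 0]], [[1, 0, 0, 0]]))    # 0.5
    print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0
```

In practice these would be computed on image arrays with a numerical library, and learned metrics such as LPIPS or FID additionally require a pretrained feature extractor, which this sketch deliberately leaves out.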

Disclaimer

Feel free to contact us if you have any queries or exciting news. In addition, we welcome all researchers to contribute to this repository and further contribute to the knowledge of this field.

If you have other related references, please feel free to create a GitHub issue with the paper information. We will gladly update the repository according to your suggestions. (You can also create pull requests, but it might take some time for us to merge them.)
