Categorized by Task

Posted by 夜雨声烦 on October 26, 2022

Paper Reading Annotation Conventions

  • Red: key points
  • Yellow: where I left off reading
  • Green: parts I did not understand

Generation

Captioning

  1. EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Embedding Matching
  2. Quantifying Societal Bias Amplification in Image Captioning
  3. It is Okay to Not Be Okay: Overcoming Emotional Bias in Affective Image Captioning by Contrastive Data Collection
  4. Open-Domain, Content-based, Multi-modal Fact-checking of Out-of-Context Images via Online Resources
  5. NOC-REK: Novel Object Captioning with Retrieved Vocabulary from External Knowledge
  6. End-to-end Generative Pretraining for Multimodal Video Captioning
  7. Injecting Semantic Concepts into End-to-End Image Captioning
  8. Hierarchical Modular Network for Video Captioning
  9. Scaling Up Vision-Language Pre-training for Image Captioning
  10. Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning
  11. SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning

Text-to-Image Generation

  1. Not All Relations are Equal: Mining Informative Labels for Scene Graph Generation
  2. AnyFace: Free-style Text-to-Face Synthesis and Manipulation
  3. Scene Graph Expansion for Semantics-Guided Image Outpainting
  4. CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation
  5. HairCLIP: Design Your Hair by Text and Reference Image
  6. StyleT2I: Toward Compositional and High-Fidelity Text-to-Image Synthesis
  7. RU-Net: Regularized Unrolling Network for Scene Graph Generation
  8. Text2Mesh: Text-Driven Neural Stylization for Meshes
  9. Neural Texture Extraction and Distribution for Controllable Person Image Synthesis
  10. Towards Implicit Text-Guided 3D Shape Generation
  11. The Devil is in the Labels: Noisy Label Correction for Robust Scene Graph Generation
  12. Global Context with Discrete Diffusion in Vector Quantised Modelling for Image Generation
  13. LAFITE: Towards Language-Free Training for Text-to-Image Generation
  14. CLIPstyler: Image Style Transfer with a Single Text Condition
  15. ManiTrans: Entity-Level Text-Guided Image Manipulation via Token-wise Semantic Alignment and Generation
  16. 3D-aware Image Synthesis via Learning Structural and Textural Representations
  17. Aesthetic Text Logo Synthesis via Content-aware Layout Inferring
  18. Text to Image Generation with Semantic-Spatial Aware GAN
  19. Reduce Information Loss in Transformers for Pluralistic Image Inpainting
  20. FlexIT: Towards Flexible Semantic Image Translation
  21. Predict, Prevent, and Evaluate: Disentangled Text-Driven Image Manipulation Empowered by Pre-Trained Vision-Language Model
  22. L-Verse: Bidirectional Generation Between Image and Text
  23. DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting
  24. Blended Diffusion for Text-driven Editing of Natural Images

Understanding

Visual Question Answering

  1. ScanQA: 3D Question Answering for Spatial Scene Understanding
  2. Grounding Answers for Visual Questions Asked by Visually Impaired People
  3. Measuring Compositional Consistency for Video Question Answering
  4. From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering
  5. WebQA: Multihop and Multimodal QA
  6. V-Doc: Visual questions answers with Documents
  7. Dual-Key Multimodal Backdoors for Visual Question Answering
  8. SwapMix: Diagnosing and Regularizing the Over-Reliance on Visual Context in Visual Question Answering
  9. LaTr: Layout-Aware Transformer for Scene-Text VQA
  10. Learning to Answer Questions in Dynamic Audio-Visual Scenarios
  11. Invariant Grounding for Video Question Answering
  12. SimVQA: Exploring Simulated Environments for Visual Question Answering

Visual Reasoning

  1. Self-Supervised Material and Texture Representation Learning for Remote Sensing Tasks
  2. REX: Reasoning-aware and Grounded Explanation
  3. 3MASSIV: Multilingual, Multimodal and Multi-Aspect dataset of Social Media Short Videos
  4. A Simple Multi-Modality Transfer Learning Baseline for Sign Language Translation
  5. Revisiting the “Video” in Video-Language Understanding
  6. UTC: A Unified Transformer with Inter-Task Contrastive Learning for Visual Dialog
  7. MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound
  8. Open-Vocabulary One-Stage Detection with Hierarchical Visual-Language Knowledge Distillation
  9. NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks

Retrieval

Image-Text Retrieval

  1. Pushing the Performance Limit of Scene Text Recognizer without Human Annotation
  2. Towards End-to-End Unified Scene Text Detection and Layout Analysis
  3. Open-Set Text Recognition via Character-Context Decoupling
  4. Learning Program Representations for Food Images and Cooking Recipes
  5. Cross Modal Retrieval with Querybank Normalisation
  6. Towards Weakly-Supervised Text Spotting using a Multi-Task Transformer
  7. GEN-VLKT: Simplify Association and Enhance Interaction Understanding for HOI Detection
  8. Disentangling visual and written concepts in CLIP
  9. SimAN: Exploring Self-Supervised Representation Learning of Scene Text via Similarity-Aware Normalization
  10. Few Could Be Better Than All: Feature Sampling and Grouping for Scene Text Detection
  11. SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition
  12. Vision-Language Pre-Training for Boosting Scene Text Detectors
  13. A Text Attention Network for Spatial Deformation Robust Scene Text Image Super-resolution
  14. Knowledge Mining with Scene Text for Fine-Grained Recognition
  15. ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval
  16. Object-aware Video-language Pre-training for Retrieval

Text-Image Retrieval

  1. LAVT: Language-Aware Vision Transformer for Referring Image Segmentation
  2. Cross Language Image Matching for Weakly Supervised Semantic Segmentation
  3. Sign Language Video Retrieval with Free-Form Textual Queries
  4. Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation
  5. Language as Queries for Referring Video Object Segmentation
  6. Language-Bridged Spatial-Temporal Interaction for Referring Video Object Segmentation
  7. End-to-End Referring Video Object Segmentation with Multimodal Transformers

Grounding

  1. One Step at a Time: Long-Horizon Vision-and-Language Navigation with Milestones
  2. 3D-SPS: Single-Stage 3D Visual Grounding via Referred Point Progressive Selection
  3. MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions
  4. Less is More: Generating Grounded Navigation Instructions from Landmarks
  5. ENVEDIT: Environment Editing for Vision-and-Language Navigation
  6. Compositional Temporal Grounding with Structured Variational Cross-Graph Correspondence Learning
  7. Shifting More Attention to Visual Backbone: Query-modulated Refinement Networks for End-to-End Visual Grounding
  8. Text2Pos: Text-to-Point-Cloud Cross-Modal Localization
  9. Reinforced Structured State-Evolution for Vision-Language Navigation
  10. TubeDETR: Spatio-Temporal Video Grounding with Transformers
  11. HOP: History-and-Order Aware Pre-training for Vision-and-Language Navigation
  12. ADAPT: Vision-Language Navigation with Modality-Aligned Action Prompts
  13. Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning
  14. Bridging the Gap Between Learning in Discrete and Continuous Environments for Vision-and-Language Navigation
  15. Multi-View Transformer for 3D Visual Grounding
  16. Cross-modal Map Learning for Vision and Language Navigation
  17. Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation
  18. Counterfactual Cycle-Consistent Learning for Instruction Following and Generation in Vision-Language Navigation
  19. Pseudo-Q: Generating Pseudo Language Queries for Visual Grounding

Vision-Language Models

  1. Explaining Deep Convolutional Neural Networks via Latent Visual-Semantic Filter Attention
  2. Are Multimodal Transformers Robust to Missing Modality?
  3. Signing at Scale: Learning to Co-Articulate Signs for Large-Scale Photo-Realistic Sign Language Production
  4. Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language
  5. VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers
  6. Robust Cross-Modal Representation Learning with Progressive Self-Distillation
  7. CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields
  8. Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment
  9. Video-Text Representation Learning via Differentiable Weak Temporal Alignment
  10. On Guiding Visual Attention with Language Specification
  11. On the Integration of Self-Attention and Convolution
  12. Masked Autoencoders Are Scalable Vision Learners
  13. FLAVA: A Foundational Language And Vision Alignment Model
  14. Dual Cross-Attention Learning for Fine-Grained Visual Categorization and Object Re-Identification
  15. Unsupervised Vision-Language Parsing: Seamlessly Bridging Visual Scene Graphs with Language Structures via Dependency Relationships
  16. RegionCLIP: Region-based Language-Image Pretraining
  17. CRIS: CLIP-Driven Referring Image Segmentation
  18. Vision Transformer with Deformable Attention
  19. An Empirical Study of Training End-to-End Vision-and-Language Transformers
  20. Vision-Language Pre-Training with Triple Contrastive Learning
  21. COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval
  22. Conditional Prompt Learning for Vision-Language Models
  23. Integrating Language Guidance into Vision-based Deep Metric Learning
  24. Align and Prompt: Video-and-Language Pre-training with Entity Prompts
  25. Multi-modal Alignment using Representation Codebook
  26. CLIP-Event: Connecting Text and Images with Event Structures
  27. Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions
  28. LiT: Zero-Shot Transfer with Locked-image Text Tuning
  29. Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality
  30. VL-ADAPTER: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks
  31. Grounded Language-Image Pre-training
  32. DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting

Other

  1. Sub-word Level Lip Reading With Visual Attention
  2. P3IV: Probabilistic Procedure Planning from Instructional Videos with Weak Supervision
  3. More than Words: In-the-Wild Visually-Driven Prosody for Text-to-Speech
  4. Habitat-Web: Learning Embodied Object-Search Strategies from Human Demonstrations at Scale
  5. VALHALLA: Visual Hallucination for Machine Translation
  6. XYLayoutLM: Towards Layout-Aware Multimodal Networks For Visually-Rich Document Understanding
  7. GroupViT: Semantic Segmentation Emerges from Text Supervision

Vision-Language Models (model design, joint representation, understanding, pre-training)

Adding images to NLP tasks, or adding language to CV tasks

137 papers in total.