Paper reading annotation conventions
- Red: key points
- Yellow: reading breakpoints (where reading stopped)
- Green: points not yet understood
Generation
Captioning
- EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Embedding Matching
- Quantifying Societal Bias Amplification in Image Captioning
- It is Okay to Not Be Okay: Overcoming Emotional Bias in Affective Image Captioning by Contrastive Data Collection
- Open-Domain, Content-based, Multi-modal Fact-checking of Out-of-Context Images via Online Resources
- NOC-REK: Novel Object Captioning with Retrieved Vocabulary from External Knowledge
- End-to-end Generative Pretraining for Multimodal Video Captioning
- Injecting Semantic Concepts into End-to-End Image Captioning
- Hierarchical Modular Network for Video Captioning
- Scaling Up Vision-Language Pre-training for Image Captioning
- Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning
- SWINBERT: End-to-End Transformers with Sparse Attention for Video Captioning
Text-to-Image Generation
- Not All Relations are Equal: Mining Informative Labels for Scene Graph Generation
- AnyFace: Free-style Text-to-Face Synthesis and Manipulation
- Scene Graph Expansion for Semantics-Guided Image Outpainting
- CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation
- HairCLIP: Design Your Hair by Text and Reference Image
- StyleT2I: Toward Compositional and High-Fidelity Text-to-Image Synthesis
- RU-Net: Regularized Unrolling Network for Scene Graph Generation
- Text2Mesh: Text-Driven Neural Stylization for Meshes
- Neural Texture Extraction and Distribution for Controllable Person Image Synthesis
- Towards Implicit Text-Guided 3D Shape Generation
- The Devil is in the Labels: Noisy Label Correction for Robust Scene Graph Generation
- Global Context with Discrete Diffusion in Vector Quantised Modelling for Image Generation
- LAFITE: Towards Language-Free Training for Text-to-Image Generation
- CLIPstyler: Image Style Transfer with a Single Text Condition
- ManiTrans: Entity-Level Text-Guided Image Manipulation via Token-wise Semantic Alignment and Generation
- 3D-aware Image Synthesis via Learning Structural and Textural Representations
- Aesthetic Text Logo Synthesis via Content-aware Layout Inferring
- Text to Image Generation with Semantic-Spatial Aware GAN
- Reduce Information Loss in Transformers for Pluralistic Image Inpainting
- FlexIT: Towards Flexible Semantic Image Translation
- Predict, Prevent, and Evaluate: Disentangled Text-Driven Image Manipulation Empowered by Pre-Trained Vision-Language Model
- L-Verse: Bidirectional Generation Between Image and Text
- DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting
- Blended Diffusion for Text-driven Editing of Natural Images
Understanding
Visual Question Answering
- ScanQA: 3D Question Answering for Spatial Scene Understanding
- Grounding Answers for Visual Questions Asked by Visually Impaired People
- Measuring Compositional Consistency for Video Question Answering
- From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering
- WebQA: Multihop and Multimodal QA
- V-Doc: Visual questions answers with Documents
- Dual-Key Multimodal Backdoors for Visual Question Answering
- SwapMix: Diagnosing and Regularizing the Over-Reliance on Visual Context in Visual Question Answering
- LaTr: Layout-Aware Transformer for Scene-Text VQA
- Learning to Answer Questions in Dynamic Audio-Visual Scenarios
- Invariant Grounding for Video Question Answering
- SimVQA: Exploring Simulated Environments for Visual Question Answering
Visual Reasoning
- Self-Supervised Material and Texture Representation Learning for Remote Sensing Tasks
- REX: Reasoning-aware and Grounded Explanation
- 3MASSIV: Multilingual, Multimodal and Multi-Aspect dataset of Social Media Short Videos
- A Simple Multi-Modality Transfer Learning Baseline for Sign Language Translation
- Revisiting the “Video” in Video-Language Understanding
- UTC: A Unified Transformer with Inter-Task Contrastive Learning for Visual Dialog
- MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound
- Open-Vocabulary One-Stage Detection with Hierarchical Visual-Language Knowledge Distillation
- NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks
Retrieval
Image-text Retrieval
- Pushing the Performance Limit of Scene Text Recognizer without Human Annotation
- Towards End-to-End Unified Scene Text Detection and Layout Analysis
- Open-Set Text Recognition via Character-Context Decoupling
- Learning Program Representations for Food Images and Cooking Recipes
- Cross Modal Retrieval with Querybank Normalisation
- Towards Weakly-Supervised Text Spotting using a Multi-Task Transformer
- GEN-VLKT: Simplify Association and Enhance Interaction Understanding for HOI Detection
- Disentangling visual and written concepts in CLIP
- SimAN: Exploring Self-Supervised Representation Learning of Scene Text via Similarity-Aware Normalization
- Few Could Be Better Than All: Feature Sampling and Grouping for Scene Text Detection
- SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition
- Vision-Language Pre-Training for Boosting Scene Text Detectors
- A Text Attention Network for Spatial Deformation Robust Scene Text Image Super-resolution
- Knowledge Mining with Scene Text for Fine-Grained Recognition
- ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval
- Object-aware Video-language Pre-training for Retrieval
Text-image Retrieval
- LAVT: Language-Aware Vision Transformer for Referring Image Segmentation
- Cross Language Image Matching for Weakly Supervised Semantic Segmentation
- Sign Language Video Retrieval with Free-Form Textual Queries
- Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation
- Language as Queries for Referring Video Object Segmentation
- Language-Bridged Spatial-Temporal Interaction for Referring Video Object Segmentation
- End-to-End Referring Video Object Segmentation with Multimodal Transformers
Grounding
- One Step at a Time: Long-Horizon Vision-and-Language Navigation with Milestones
- 3D-SPS: Single-Stage 3D Visual Grounding via Referred Point Progressive Selection
- MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions
- Less is More: Generating Grounded Navigation Instructions from Landmarks
- ENVEDIT: Environment Editing for Vision-and-Language Navigation
- Compositional Temporal Grounding with Structured Variational Cross-Graph Correspondence Learning
- Shifting More Attention to Visual Backbone: Query-modulated Refinement Networks for End-to-End Visual Grounding
- Text2Pos: Text-to-Point-Cloud Cross-Modal Localization
- Reinforced Structured State-Evolution for Vision-Language Navigation
- TubeDETR: Spatio-Temporal Video Grounding with Transformers
- HOP: History-and-Order Aware Pre-training for Vision-and-Language Navigation
- ADAPT: Vision-Language Navigation with Modality-Aligned Action Prompts
- Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning
- Bridging the Gap Between Learning in Discrete and Continuous Environments for Vision-and-Language Navigation
- Multi-View Transformer for 3D Visual Grounding
- Cross-modal Map Learning for Vision and Language Navigation
- Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation
- Counterfactual Cycle-Consistent Learning for Instruction Following and Generation in Vision-Language Navigation
- Pseudo-Q: Generating Pseudo Language Queries for Visual Grounding
Vision-Language Models
- Explaining Deep Convolutional Neural Networks via Latent Visual-Semantic Filter Attention
- Are Multimodal Transformers Robust to Missing Modality?
- Signing at Scale: Learning to Co-Articulate Signs for Large-Scale Photo-Realistic Sign Language Production
- Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language
- VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers
- Robust Cross-Modal Representation Learning with Progressive Self-Distillation
- CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields
- Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment
- Video-Text Representation Learning via Differentiable Weak Temporal Alignment
- On Guiding Visual Attention with Language Specification
- On the Integration of Self-Attention and Convolution
- Masked Autoencoders Are Scalable Vision Learners
- FLAVA: A Foundational Language And Vision Alignment Model
- Dual Cross-Attention Learning for Fine-Grained Visual Categorization and Object Re-Identification
- Unsupervised Vision-Language Parsing: Seamlessly Bridging Visual Scene Graphs with Language Structures via Dependency Relationships
- RegionCLIP: Region-based Language-Image Pretraining
- CRIS: CLIP-Driven Referring Image Segmentation
- Vision Transformer with Deformable Attention
- An Empirical Study of Training End-to-End Vision-and-Language Transformers
- Vision-Language Pre-Training with Triple Contrastive Learning
- COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval
- Conditional Prompt Learning for Vision-Language Models
- Integrating Language Guidance into Vision-based Deep Metric Learning
- Align and Prompt: Video-and-Language Pre-training with Entity Prompts
- Multi-modal Alignment using Representation Codebook
- CLIP-Event: Connecting Text and Images with Event Structures
- Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions
- LiT: Zero-Shot Transfer with Locked-image Text Tuning
- Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality
- VL-ADAPTER: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks
- Grounded Language-Image Pre-training
- DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting
Other
- Sub-word Level Lip Reading With Visual Attention
- P3IV: Probabilistic Procedure Planning from Instructional Videos with Weak Supervision
- More than Words: In-the-Wild Visually-Driven Prosody for Text-to-Speech
- Habitat-Web: Learning Embodied Object-Search Strategies from Human Demonstrations at Scale
- VALHALLA: Visual Hallucination for Machine Translation
- XYLayoutLM: Towards Layout-Aware Multimodal Networks For Visually-Rich Document Understanding
- GroupViT: Semantic Segmentation Emerges from Text Supervision
Vision-language models (model design, joint representation, understanding, pre-training)
NLP tasks augmented with images, or CV tasks augmented with language
137 papers in total