TechcraftingAI Computer Vision

By: Brad Edwards
  • Summary

  • TechcraftingAI Computer Vision brings you summaries of the latest arXiv research daily. Research is read by your virtual host, Sage. The podcast is produced by Brad Edwards, an AI Engineer from Vancouver, BC, and a graduate student of computer science studying AI at the University of York. Thank you to arXiv for use of its open access interoperability.
Episodes
  • Ep. 247 - Part 3 - June 13, 2024
    Jun 15 2024

    ArXiv Computer Vision research for Thursday, June 13, 2024.


    00:21: LRM-Zero: Training Large Reconstruction Models with Synthesized Data

    01:56: Scale-Invariant Monocular Depth Estimation via SSI Depth

    03:08: GGHead: Fast and Generalizable 3D Gaussian Heads

    04:55: Multiagent Multitraversal Multimodal Self-Driving: Open MARS Dataset

    06:34: Towards Vision-Language Geo-Foundation Model: A Survey

    08:11: SimGen: Simulator-conditioned Driving Scene Generation

    09:44: Exploring the Spectrum of Visio-Linguistic Compositionality and Recognition

    11:03: Sagiri: Low Dynamic Range Image Enhancement with Generative Diffusion Prior

    12:32: LLAVIDAL: Benchmarking Large Language Vision Models for Daily Activities of Living

    13:56: WonderWorld: Interactive 3D Scene Generation from a Single Image

    15:21: Modeling Ambient Scene Dynamics for Free-view Synthesis

    16:29: Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA

    17:50: Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms

    19:39: Real-Time Deepfake Detection in the Real-World

    21:17: OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation

    23:02: Yo'LLaVA: Your Personalized Language and Vision Assistant

    24:30: MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations

    26:26: Instruct 4D-to-4D: Editing 4D Scenes as Pseudo-3D Scenes Using 2D Diffusion

    28:03: Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models

    29:59: ConsistDreamer: 3D-Consistent 2D Diffusion for High-Fidelity Scene Editing

    31:24: 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities

    33:16: Towards Evaluating the Robustness of Visual State Space Models

    34:57: Data Attribution for Text-to-Image Models by Unlearning Synthesized Images

    36:09: CodedEvents: Optimal Point-Spread-Function Engineering for 3D-Tracking with Event Cameras

    37:37: Scene Graph Generation in Large-Size VHR Satellite Imagery: A Large-Scale Dataset and A Context-Aware Approach

    40:02: MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

    41:40: Explore the Limits of Omni-modal Pretraining at Scale

    42:46: Interpreting the Weight Space of Customized Diffusion Models

    43:58: Depth Anything V2

    45:12: An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels

    46:23: Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models

    48:11: Rethinking Score Distillation as a Bridge Between Image Distributions

    49:44: VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding

    52 mins
  • Ep. 247 - Part 2 - June 13, 2024
    Jun 15 2024

    ArXiv Computer Vision research for Thursday, June 13, 2024.


    00:21: INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance

    02:11: Large-Scale Evaluation of Open-Set Image Classification Techniques

    03:43: PC-LoRA: Low-Rank Adaptation for Progressive Model Compression with Knowledge Distillation

    05:00: MMRel: A Relation Understanding Dataset and Benchmark in the MLLM Era

    06:41: Auto-Vocabulary Segmentation for LiDAR Points

    07:30: AdaRevD: Adaptive Patch Exiting Reversible Decoder Pushes the Limit of Image Deblurring

    08:43: EMMA: Your Text-to-Image Diffusion Model Can Secretly Accept Multi-Modal Prompts

    10:23: Fine-Grained Domain Generalization with Feature Structuralization

    12:03: SR-CACO-2: A Dataset for Confocal Fluorescence Microscopy Image Super-Resolution

    14:13: ReMI: A Dataset for Reasoning with Multiple Images

    15:41: A Large-scale Universal Evaluation Benchmark For Face Forgery Detection

    17:26: Thoracic Surgery Video Analysis for Surgical Phase Recognition

    18:58: Reducing Task Discrepancy of Text Encoders for Zero-Shot Composed Image Retrieval

    20:40: Adaptive Slot Attention: Object Discovery with Dynamic Slot Number

    22:26: CLIP-Driven Cloth-Agnostic Feature Learning for Cloth-Changing Person Re-Identification

    24:22: Enhanced Object Detection: A Study on Vast Vocabulary Object Detection Track for V3Det Challenge 2024

    25:21: Optimizing Visual Question Answering Models for Driving: Bridging the Gap Between Human and Machine Attention Patterns

    26:30: WildlifeReID-10k: Wildlife re-identification dataset with 10k individual animals

    27:44: MGRQ: Post-Training Quantization For Vision Transformer With Mixed Granularity Reconstruction

    29:28: Comparison Visual Instruction Tuning

    30:51: MirrorCheck: Efficient Adversarial Defense for Vision-Language Models

    32:14: Deep Transformer Network for Monocular Pose Estimation of Ship-Based UAV

    33:10: Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos

    34:33: Neural Assets: 3D-Aware Multi-Object Scene Synthesis with Image Diffusion Models

    36:04: StableMaterials: Enhancing Diversity in Material Generation via Semi-Supervised Learning

    37:30: Parameter-Efficient Active Learning for Foundational models

    38:31: Toffee: Efficient Million-Scale Dataset Construction for Subject-Driven Text-to-Image Generation

    40:22: Common and Rare Fundus Diseases Identification Using Vision-Language Foundation Model with Knowledge of Over 400 Diseases

    42:38: Towards AI Lesion Tracking in PET/CT Imaging: A Siamese-based CNN Pipeline applied on PSMA PET/CT Scans

    44:36: Memory-Efficient Sparse Pyramid Attention Networks for Whole Slide Image Analysis

    46:19: Instance-level quantitative saliency in multiple sclerosis lesion segmentation

    48:37: CMC-Bench: Towards a New Paradigm of Visual Signal Compression

    50:05: Needle In A Video Haystack: A Scalable Synthetic Framework for Benchmarking Video MLLMs

    52:05: CLIPAway: Harmonizing Focused Embeddings for Removing Objects via Diffusion Models

    53 mins
  • Ep. 247 - Part 1 - June 13, 2024
    Jun 15 2024

    ArXiv Computer Vision research for Thursday, June 13, 2024.


    00:21: FouRA: Fourier Low Rank Adaptation

    01:41: Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation

    03:18: Few-Shot Anomaly Detection via Category-Agnostic Registration Learning

    04:57: Skim then Focus: Integrating Contextual and Fine-grained Views for Repetitive Action Counting

    06:46: ToSA: Token Selective Attention for Efficient Vision Transformers

    08:00: Computer vision-based model for detecting turning lane features on Florida's public roadways

    09:08: Improving Adversarial Robustness via Feature Pattern Consistency Constraint

    10:52: Research on Deep Learning Model of Feature Extraction Based on Convolutional Neural Network

    12:10: NeRF Director: Revisiting View Selection in Neural Volume Rendering

    13:36: Conceptual Learning via Embedding Approximations for Reinforcing Interpretability and Transparency

    15:03: Rethinking Human Evaluation Protocol for Text-to-Video Models: Enhancing Reliability, Reproducibility, and Practicality

    16:40: COVE: Unleashing the Diffusion Feature Correspondence for Consistent Video Editing

    18:16: Fusion of regional and sparse attention in Vision Transformers

    19:26: Zoom and Shift are All You Need

    20:17: EgoExo-Fitness: Towards Egocentric and Exocentric Full-Body Action Understanding

    21:49: The Penalized Inverse Probability Measure for Conformal Classification

    23:24: OpenMaterial: A Comprehensive Dataset of Complex Materials for 3D Reconstruction

    24:47: Blind Super-Resolution via Meta-learning and Markov Chain Monte Carlo Simulation

    26:30: Computer Vision Approaches for Automated Bee Counting Application

    27:17: Dual Attribute-Spatial Relation Alignment for 3D Visual Grounding

    28:16: A Label-Free and Non-Monotonic Metric for Evaluating Denoising in Event Cameras

    29:43: Multiple Prior Representation Learning for Self-Supervised Monocular Depth Estimation via Hybrid Transformer

    31:25: Neural NeRF Compression

    32:29: Preserving Identity with Variational Score for General-purpose 3D Editing

    33:50: AirPlanes: Accurate Plane Estimation via 3D-Consistent Embeddings

    34:51: Adaptive Temporal Motion Guided Graph Convolution Network for Micro-expression Recognition

    36:10: Enhancing Cross-Modal Fine-Tuning with Gradually Intermediate Modality Generation

    37:34: AMSA-UNet: An Asymmetric Multiple Scales U-net Based on Self-attention for Deblurring

    38:49: Cross-Modal Learning for Anomaly Detection in Fused Magnesium Smelting Process: Methodology and Benchmark

    40:45: A PCA based Keypoint Tracking Approach to Automated Facial Expressions Encoding

    42:02: Steganalysis on Digital Watermarking: Is Your Defense Truly Impervious?

    43:28: FacEnhance: Facial Expression Enhancing with Recurrent DDPMs

    45:11: How structured are the representations in transformer-based vision encoders? An analysis of multi-object representations in vision-language models

    47:08: Suitability of KANs for Computer Vision: A preliminary investigation

    48 mins
