Exploring and enhancing the visual robustness of Vision-Language Models (VLMs) within complex, real-world physical environments
Bridging Image Quality Assessment and Visual Question Answering for robust embodied intelligence
Vision-Language Models (VLMs) act as the "eyes" and "brains" of modern embodied agents. However, the physical world is filled with uncertainties: motion blur, poor lighting, sensor noise, and transmission loss.
An agent's ability to maintain stable visual understanding under these harsh conditions is critical for achieving safe and reliable embodied intelligence. This challenge bridges Image Quality Assessment (IQA) and Visual Question Answering (VQA) to measure the performance degradation of models under various levels of visual distortion.
Evolution from controlled simulation environments to real-world robotic deployment
Correlate quality metrics with ground-truth VLM performance scores
30 distinct distortion types across 5 severity levels
Absolute Score, Relative Score & Real-Robot evaluation tracks
The competition is organized into three distinct tracks, spanning from synthetic robustness evaluation to real-world robotic deployment
Participants must predict the absolute accuracy (0.0 to 1.0) of VLMs on a large-scale dataset containing 12,400 images with controlled distortions. The goal is to accurately estimate a model's task-solving capability under various distortions.
Participants are tasked with predicting the relative performance drop of VLMs compared to their baseline on the original, clean images. This track focuses on the sensitivity of metrics to increasing levels of image degradation.
Starting May 1, 2026, top-performing models will be evaluated on physical robot platforms.
Prizes will be awarded to the Top 3 winners of each individual track (Track 1, Track 2, and Track 3).
The final evaluation for paper acceptance and official competition reports will be based on the comprehensive performance across all three tracks.
12,400 images with controlled distortions for robust evaluation
| Split | Original Images | Distorted Images | Total | Role |
|---|---|---|---|---|
| Train | 280 (img_000 - img_279) | 8,400 (img_000_noise01 - img_279_noise30) | 8,680 | Model Training & Fine-tuning |
| Val | 40 (img_280 - img_319) | 1,200 (img_280_noise01 - img_319_noise30) | 1,240 | Model Selection & Debugging |
| Test | 80 (img_320 - img_399) | 2,400 (img_320_noise01 - img_399_noise30) | 2,480 | Competition Evaluation |
| Total | 400 | 12,000 | 12,400 | - |
30 distinct types of distortions simulating real-world visual degradation
5 severity levels (Level 4 to Level 8) for each distortion type
Each original image has 30 distorted variants (noise01 - noise30) for comprehensive analysis; a sketch of the naming convention follows below
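The split boundaries and file naming in the table above are regular enough to enumerate programmatically. Below is a minimal Python sketch; the zero-padded IDs come from the table, while the `.png` extension and flat directory layout are assumptions to be checked against the released archives:

```python
# Enumerate the dataset filenames implied by the split table above.
# Assumption (not specified by the organizers): .png extension, flat layout.
SPLITS = {
    "train": range(0, 280),    # img_000 - img_279
    "val":   range(280, 320),  # img_280 - img_319
    "test":  range(320, 400),  # img_320 - img_399
}

def filenames(split: str):
    """Yield each original image followed by its 30 distorted variants."""
    for i in SPLITS[split]:
        yield f"img_{i:03d}.png"                       # clean original
        for noise in range(1, 31):                     # noise01 - noise30
            yield f"img_{i:03d}_noise{noise:02d}.png"  # distorted variant

# Sanity check against the table: 80 originals + 2,400 distorted = 2,480.
assert sum(1 for _ in filenames("test")) == 2480
```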
Submissions are evaluated based on correlation with Ground Truth scores
Pearson Linear Correlation Coefficient (PLCC): measures the linear relationship between the predicted scores and the ground truth.
Spearman Rank Correlation Coefficient (SRCC): measures the monotonic relationship (ranking consistency) between the predicted scores and the ground truth.
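Both criteria can be sanity-checked locally with SciPy, as in the minimal sketch below; the organizers' official evaluation script may differ in details such as tie handling:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Toy example: four predicted quality scores vs. ground-truth VLM scores.
predicted    = np.array([0.91, 0.55, 0.32, 0.78])
ground_truth = np.array([0.88, 0.60, 0.25, 0.80])

plcc, _ = pearsonr(predicted, ground_truth)   # linear agreement
srcc, _ = spearmanr(predicted, ground_truth)  # rank (monotonic) consistency
print(f"PLCC = {plcc:.4f}, SRCC = {srcc:.4f}")
```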
Predicting the mean performance level (Mean Opinion Score) of models on a specific distorted image.
Measuring the ratio of performance on a distorted image compared to its original, clean counterpart (Ratio of Means).
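To make the two targets concrete, here is a hedged sketch of how they relate, assuming per-question binary correctness records for a VLM on one image; the organizers' exact aggregation over questions and models is not spelled out here:

```python
import numpy as np

# Hypothetical per-question correctness (1 = correct, 0 = wrong).
clean_correct     = np.array([1, 1, 0, 1, 1])  # answers on a clean image
distorted_correct = np.array([1, 0, 0, 1, 0])  # same questions, distorted image

# Track 1 target: mean accuracy on the distorted image (MOS-style).
absolute_score = distorted_correct.mean()

# Track 2 target: performance on the distorted image relative to the
# clean counterpart (ratio of means).
relative_score = distorted_correct.mean() / clean_correct.mean()

print(f"absolute = {absolute_score:.4f}, relative = {relative_score:.4f}")
```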
Ensure the index column matches the Test Set exactly. Any missing rows or mismatched indices will result in a submission error.
Do not include Base64 image data in your submission; only the predicted scores are required.
We recommend providing scores with at least 4 decimal places for better ranking granularity.
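A minimal sketch of assembling a submission that satisfies this checklist is shown below. The column names `index` and `score`, the CSV format, and the placeholder predictor are assumptions for illustration; defer to the format specification on the CodaBench page:

```python
import pandas as pd

def my_model_predict(ids):
    # Placeholder: replace with your quality metric's predicted scores.
    return [0.5 for _ in ids]

# Build the test index exactly as released (img_320 - img_399).
# Whether clean originals are scored may depend on the track.
test_ids = []
for i in range(320, 400):
    test_ids.append(f"img_{i:03d}")  # clean original
    test_ids.extend(f"img_{i:03d}_noise{n:02d}" for n in range(1, 31))

submission = pd.DataFrame({"index": test_ids, "score": my_model_predict(test_ids)})
submission["score"] = submission["score"].map(lambda s: f"{s:.4f}")  # >= 4 decimals
submission.to_csv("submission.csv", index=False)  # scores only, no image data
```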
All submissions are handled through the CodaBench platform
All competition submissions are managed through the CodaBench platform. Please visit the competition page for detailed submission instructions and format requirements.
Go to CodaBench
Training and validation datasets become available for all participants
Test dataset becomes available for all participants
Beginning of the real-world robot competition phase
Announcement of final competition results
Deadline for workshop paper submissions
Notification of paper acceptance decisions
Deadline for camera-ready paper submissions