MetaSpatial leverages reinforcement learning to enhance 3D spatial reasoning in vision-language models (VLMs), enabling more structured, realistic, and adaptive scene generation for applications in the metaverse, AR/VR, and game development.
MetaSpatial enhances spatial reasoning in VLMs using RL, internalizing 3D spatial reasoning to enable real-time 3D scene generation without hard-coded optimizations. By incorporating physics-aware constraints and rendering image evaluations, our framework optimizes layout coherence, physical consistency, and aesthetics. We provide an adaptive, efficient solution for applications in the metaverse, AR/VR, digital twins, and game development.
Our goal is to enable 3D spatial reasoning in the metaverse, where perfect ground truth does not exist; this makes traditional SFT ineffective, while RL naturally fits the task by learning from physics-aware constraints and adaptive rewards.
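To make the reward idea concrete, here is a minimal, self-contained sketch (not the exact MetaSpatial reward) of a physics-aware layout score that penalizes overlapping or out-of-bounds objects:

```python
# A minimal physics-aware layout score: penalize objects that overlap or fall
# outside the room bounds. Illustrative only; the actual reward also uses
# format checks and rendering-based evaluation.
from itertools import combinations

def physics_score(objects, room_w, room_d):
    """objects: list of dicts with x, y (center) and w, d (footprint), in meters."""
    violations = 0
    for o in objects:  # out-of-bounds check
        if o["x"] - o["w"] / 2 < 0 or o["x"] + o["w"] / 2 > room_w \
           or o["y"] - o["d"] / 2 < 0 or o["y"] + o["d"] / 2 > room_d:
            violations += 1
    for a, b in combinations(objects, 2):  # pairwise overlap check (axis-aligned boxes)
        if abs(a["x"] - b["x"]) < (a["w"] + b["w"]) / 2 and \
           abs(a["y"] - b["y"]) < (a["d"] + b["d"]) / 2:
            violations += 1
    max_checks = len(objects) + len(objects) * (len(objects) - 1) // 2
    return 1.0 - violations / max_checks if max_checks else 1.0

# Example: a bed and a nightstand in a 4m x 3m room.
layout = [
    {"x": 1.0, "y": 1.0, "w": 2.0, "d": 1.6},
    {"x": 2.5, "y": 0.5, "w": 0.5, "d": 0.5},
]
print(physics_score(layout, room_w=4.0, room_d=3.0))  # 1.0: no violations
```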
cd IDesign
conda create -n idesign python=3.9
conda activate idesign
pip install -r requirements.txt
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
pip install -U git+https://github.com/NVIDIA/MinkowskiEngine
conda install -c dglteam/label/cu113 dgl
#sudo apt install -y libgl1-mesa-glx
pip install "numpy<2"
# conda install sqlite=3.35.5
# In IDesign/utils.py, comment out the im.show() call in create_empty_image_with_boxes if you run the code on a server without a GUI display.
pip install transformers objaverse accelerate
huggingface-cli login
# enter your own Hugging Face access token when prompted
git clone https://huggingface.co/OpenShape/openshape-demo-support
cd openshape-demo-support
pip install -e .
cd ..
For rendering, we use Blender:
a. install blender-4.3.2-linux-x64
b. create a python3.11 environment and activate it:
conda create -n blender python=3.11
conda activate blender
c. install the blender python api:
pip install bpy==4.3.0
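As a quick smoke test of the bpy installation, a minimal headless render might look like the following (paths and resolution are placeholders):

```python
# Minimal headless render with the bpy module installed above.
# Paths are placeholders; adjust resolution and engine to your setup.
import bpy

blend_path = "room_0/scene.blend"   # placeholder .blend produced earlier in the pipeline
output_path = "room_0/render.png"   # placeholder output image

bpy.ops.wm.open_mainfile(filepath=blend_path)   # load the scene
scene = bpy.context.scene
scene.render.resolution_x = 1024
scene.render.resolution_y = 1024
scene.render.filepath = output_path
bpy.ops.render.render(write_still=True)         # write the still image to disk
```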
For 3D asset retrieval, we use OpenShape inside the IDesign repository; please follow the instructions in IDesign to install it.
For R1 training, we use the EasyR1 repository; please follow its installation instructions: https://github.com/hiyouga/EasyR1
Remember to set OPENAI_API_KEY in your system environment variables.
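Before running the data-curation scripts below, you can quickly verify that the key is visible to Python:

```python
# Quick sanity check that the OpenAI key is visible to the scripts below.
import os
assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY is not set in the environment"
```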
a. Change the OpenAI API key in the script (line 7).
b. Change the output file path in the script (line 64).
c. Run the script to produce generated_room_descriptions.json.
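The generation script itself lives in the repo; the sketch below only illustrates the general pattern (OpenAI client call, then dump to JSON), with the model name, prompt, and paths as placeholder assumptions:

```python
# Sketch of the description-generation pattern (the repo's script may differ):
# call the OpenAI API per room and dump the results to JSON.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

descriptions = []
for i in range(3):  # small illustrative number of rooms
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder; use whatever model the script specifies
        messages=[{"role": "user", "content": "Describe a realistic indoor room and its furniture in one paragraph."}],
    )
    descriptions.append({"room_id": i, "description": resp.choices[0].message.content})

with open("generated_room_descriptions.json", "w") as f:  # placeholder output path (script line 64)
    json.dump(descriptions, f, indent=2)
```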
a. Change the JSON file path of the generated room descriptions in the script (line 41).
b. Change the curated_data_path in the script (line 42). (This folder will store all rooms' data as individual folders named room_0, room_1, etc.)
c. Run the script to generate the room-only blend image in each room's folder.
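For reference, the curated_data_path produced by this step has one folder per room (room_0, room_1, ...). The sketch below shows that layout with hypothetical file names; the real script additionally renders the room-only blend image into each folder:

```python
# Illustrative layout of curated_data_path: one folder per room, each holding
# that room's data. File names here are hypothetical, not the script's exact output.
import json, os

curated_data_path = "curated_data"                     # placeholder (script line 42)
with open("generated_room_descriptions.json") as f:    # placeholder (script line 41)
    rooms = json.load(f)

for i, room in enumerate(rooms):
    room_dir = os.path.join(curated_data_path, f"room_{i}")
    os.makedirs(room_dir, exist_ok=True)
    with open(os.path.join(room_dir, "description.json"), "w") as f:
        json.dump(room, f, indent=2)
    # The real script also renders and saves the room-only blend image here.
```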
In this step, we will use IDesign to generate the scene graph for each room.
a. Change the JSON file path of the generated room descriptions in the Python file (line 14).
b. Change the curated_data_path in the Python file (line 15).
c1. You can run the Python file directly to generate the scene graph in each room's folder.
c2. You can also run "data-curation-utils/data_curation_batch_generation.py" to generate sbatch scripts that build the scene graphs in parallel.
c2.1 Change the start and end of the room range, and how many rooms each sbatch job handles, in lines 18 - 20.
c2.2 Change the path of data_curation.py in **line 14** of "data_curation_batch_generation.py".
c2.3 Change the output file path in **line 23** of "data_curation_batch_generation.py".
c2.4 Run "data_curation_batch_generation.py" to generate the sbatch scripts.
c2.5 Change the script_dir in **line 4** of "data-curation-utils/data_curation_scripts_execute.sh" to the path of the sbatch scripts.
c2.6 Run "data-curation-utils/data_curation_scripts_execute.sh" to execute all the scripts automatically.
(Note: IDesign occasionally gets stuck on some rooms, so we set a time limit for each room. If a room is stuck for more than 5 minutes, the script skips it and continues to the next one. As a result, some rooms' folders will not contain a scene graph after this step; a sketch of this timeout logic is shown below.)
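The per-room time limit can be implemented with a subprocess timeout. This is a minimal sketch of the idea, not necessarily the exact repo code; the CLI arguments are hypothetical:

```python
# Minimal sketch of the 5-minute per-room time limit described above,
# implemented with a subprocess timeout.
import subprocess

for room_id in range(0, 10):  # illustrative room range
    try:
        subprocess.run(
            ["python", "data_curation.py", "--room", str(room_id)],  # hypothetical CLI arguments
            timeout=300,  # skip the room if IDesign is stuck for more than 5 minutes
            check=True,
        )
    except subprocess.TimeoutExpired:
        print(f"room_{room_id} timed out after 5 minutes, skipping")
    except subprocess.CalledProcessError as e:
        print(f"room_{room_id} failed with exit code {e.returncode}, skipping")
```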
a. Change the folder path in line 108 of retrieve.py. This folder contains all rooms' folders.
b. Change the retrieve.py file path in line 15 of "retriever_batch_generation.py".
c. Change the start and end of the room range, and how many rooms each sbatch job handles, in lines 18 - 20 of "retriever_batch_generation.py".
d. Change the folder path for the generated sbatch scripts in line 22 of "retriever_batch_generation.py".
e. Run "retriever_batch_generation.py" to generate the sbatch scripts.
f. Run all the sbatch scripts and wait.
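Conceptually, the batch-generation helpers split a room range into chunks and write one sbatch script per chunk. The sketch below shows that idea with placeholder SBATCH directives and a hypothetical CLI for retrieve.py:

```python
# Conceptual sketch of what the *_batch_generation.py helpers do: split a room
# range into chunks and write one sbatch script per chunk. Directives and paths
# below are placeholders, not the repo's actual values.
import os

start, end, rooms_per_job = 0, 100, 10   # corresponds to lines 18 - 20 of the helper scripts
script_dir = "sbatch_scripts"            # placeholder output folder for the generated scripts
retrieve_py = "/path/to/retrieve.py"     # placeholder path to retrieve.py

os.makedirs(script_dir, exist_ok=True)
for chunk_start in range(start, end, rooms_per_job):
    chunk_end = min(chunk_start + rooms_per_job, end)
    body = (
        "#!/bin/bash\n"
        f"#SBATCH --job-name=retrieve_{chunk_start}_{chunk_end}\n"
        "#SBATCH --time=04:00:00\n"
        f"python {retrieve_py} --start {chunk_start} --end {chunk_end}\n"  # hypothetical CLI
    )
    with open(os.path.join(script_dir, f"retrieve_{chunk_start}_{chunk_end}.sbatch"), "w") as f:
        f.write(body)
```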
a. Change the room_description file path in line 10 of "hf_data_curation.py".
b. Change the curated_data_path in line 11 of "hf_data_curation.py".
c. Change the dataset name and Hugging Face account in lines 93, 96, 97, and 98 of "hf_data_curation.py".
d. Run "hf_data_curation.py" to summarize all the data and generate the dataset to upload to Hugging Face.
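For the general shape of this step, the sketch below assembles per-room records and pushes them to the Hub with the `datasets` library; the field names, file names, and dataset id are placeholders rather than the exact schema in hf_data_curation.py:

```python
# Minimal sketch of pushing the curated data to the Hugging Face Hub.
# Field names, file names, and the dataset id are placeholders.
import json, os
from datasets import Dataset

curated_data_path = "curated_data"   # placeholder (hf_data_curation.py line 11)
records = []
for room in sorted(os.listdir(curated_data_path)):
    desc_file = os.path.join(curated_data_path, room, "description.json")
    if not os.path.isfile(desc_file):
        continue  # rooms skipped earlier in the pipeline have no data
    with open(desc_file) as f:
        records.append({"room": room, "description": json.load(f)["description"]})

dataset = Dataset.from_list(records)
dataset.push_to_hub("your-username/metaspatial-rooms")  # placeholder dataset name (lines 93 - 98)
```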
bash examples/run_qwen2_5_vl_7b_3d_reasoning.sh
We plan to further improve and expand MetaSpatial with the following updates:
Hyperparameter Optimization: Due to computational constraints, our current results have not undergone thorough parameter tuning or hyperparameter selection. We will conduct additional experiments to identify the most effective configurations for training.
Comparative Analysis of RL Algorithms: We will provide more experimental results comparing different RL algorithms in terms of their effectiveness and stability, offering deeper insights into their impact on spatial reasoning.
Multi-Turn Rollout Implementation: Our current release only supports single-turn rollouts. We aim to enhance our framework by incorporating multi-turn reasoning, where the VLM refines its outputs iteratively by analyzing the rendered scene from generated coordinates, improving spatial coherence over multiple iterations.
Comprehensive Report Paper: We will soon release a more detailed report summarizing our findings, experimental results, and technical insights into MetaSpatial.
Training Speed Optimization: Due to time and resource constraints, development has been handled by a small number of contributors. There are several potential optimizations to accelerate training, such as replacing GPT-4o scoring with a locally deployed high-performance VLM, exploring faster rendering alternatives to Blender, and refining the overall pipeline for more efficient computation.
Stay tuned for updates as we continue to refine and expand the capabilities of MetaSpatial!
This work builds upon the codebases of VERL and EasyR1, extending their foundations to enhance 3D spatial reasoning in VLMs. We sincerely appreciate the EasyR1 development team for their generous support and insightful guidance throughout our implementation process. Additionally, this project was inspired by RAGEN, which provided valuable insights into the adaptation of PPO to new tasks; their work contributed to our understanding of reinforcement learning strategies in this context. We would also like to express our gratitude to Zhuo Liu for the valuable discussions that contributed to this work. Furthermore, we sincerely appreciate IDesign for providing an excellent code framework that facilitated the generation of our dataset.
If you find this work useful, please consider giving us a star! Also, please consider citing:
@article{pan2025metaspatial,
title={MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse},
author={Pan, Zhenyu and Liu, Han},
journal={arXiv preprint arXiv:2503.18470},
year={2025}
}