GRPO multi-node multi-GPU training cannot connect to rollout #6810

@dhhcj1

Describe the bug
What the bug is, and how to reproduce, better with screenshots
[Image]
The rollout test ([Image]) connects, but accessing the rollout server does not get through.

Your hardware and system info
Write your system info like CUDA version/system/GPU/torch version here
A100, 8 × 2

Additional context
Add any other context about the problem here

Two machines: one launches the rollout on 4 GPUs, and the other launches training on 8 GPUs.

The test launch script is as follows:
CUDA_VISIBLE_DEVICES=0,1,2,3 swift rollout --model /home/search/houchangjian/cephfs-houchangjian/ms-swift-main/output/SFT/SFT-lora/v0-20251127-155534/checkpoint-199-merged --port 8002 --model_type qwen2_5_vl --vllm_tensor_parallel_size 4
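
Once the rollout command above is running, one way to separate a network problem from a server problem is to check from the training machine whether the rollout port accepts TCP connections at all. The following is a minimal sketch using only the Python standard library; the helper name `port_is_reachable` is hypothetical and is not part of ms-swift.

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper for illustration; it only tests the TCP handshake,
    not whether the rollout server has finished loading the model.
    """
    try:
        # create_connection resolves the host and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Calling `port_is_reachable("10.178.143.49", 8002)` on the training machine (host and port taken from `--vllm_server_host` and `--vllm_server_port`) returning `False` would point to a firewall, routing, or bind-address issue rather than to the training configuration. A `True` result only shows that the port is open, so the server may still be initializing.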

The training script is as follows:

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,7 \
MAX_PIXELS=691200 \
NPROC_PER_NODE=8 \
swift rlhf \
    --rlhf_type grpo \
    --model /home/search/houchangjian/cephfs-houchangjian/ms-swift-main/output/SFT/SFT-lora/v0-20251127-155534/checkpoint-199-merged \
    --external_plugins /home/search/houchangjian/cephfs-houchangjian/ms-swift/examples/train/grpo/plugin/plugin.py \
    --reward_funcs table_edit_distance_strict \
    --use_vllm true \
    --vllm_mode server \
    --vllm_server_host 10.178.143.49 \
    --vllm_server_port 8002 \
    --train_type full \
    --torch_dtype bfloat16 \
    --dataset /home/search/houchangjian/cephfs-houchangjian/data/tablerecg/train/0217_TSR_Real_all_converted.json \
    --load_from_cache_file false \
    --max_completion_length 2048 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 4 \
    --learning_rate 1e-6 \
    --gradient_accumulation_steps 1 \
    --save_strategy steps \
    --eval_strategy steps \
    --eval_steps 500 \
    --save_steps 500 \
    --save_total_limit 10 \
    --logging_steps 1 \
    --output_dir output/DAPO_CLEVR_COUNTDOWN_test \
    --warmup_ratio 0.01 \
    --dataloader_num_workers 4 \
    --num_generations 1 \
    --generation_batch_size 20 \
    --temperature 1.0 \
    --deepspeed zero3 \
    --log_completions false \
    --num_iterations 1 \
    --async_generate False \
    --beta 0.001 \
    --model_type qwen2_5_vl \
    --max_pixels 691200 \
    --loss_type dapo \
    --dynamic_sample true \
    --max_resample_times 3
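
One detail in the training script may be worth checking: `CUDA_VISIBLE_DEVICES` lists seven device IDs (6 is absent) while `NPROC_PER_NODE` is 8. Whether that is a slip in the post or present in the real script is not clear from the report. The sketch below shows a simple pre-launch sanity check; the function names are hypothetical and are not part of ms-swift.

```python
def count_visible_devices(cuda_visible_devices: str) -> int:
    """Count the non-empty, comma-separated entries of a CUDA_VISIBLE_DEVICES value."""
    return len([d for d in cuda_visible_devices.split(",") if d.strip()])


def enough_devices(cuda_visible_devices: str, nproc_per_node: int) -> bool:
    """Return True if there is at least one visible device per launched process.

    Assumption: one training process per GPU, as is typical for this kind of launch.
    """
    return count_visible_devices(cuda_visible_devices) >= nproc_per_node


# Values copied from the training script in this report.
print(count_visible_devices("0,1,2,3,4,5,7"))  # 7
print(enough_devices("0,1,2,3,4,5,7", 8))      # False
```

If the mismatch is real, aligning the two values (for example by listing all eight devices) would rule it out as a contributing factor before looking further at the rollout connection.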

Currently the rollout server does not respond at all.
