[BUG] exits with return code = -11 #4063 (Closed). DogeWatch opened on Jul 31, 2023: I ran this repo https://github.com/yangjianxin1/Firefly on my own data. The ds config is like { "gradient_accumulation_steps": "auto", "gradient_clipping": "auto", "steps_per_print...
[2023-07-20 08:31:15,767] [ERROR] [launch.py:321:sigkill_handler] ['/root/anaconda3/envs/deepspeed/bin/python', '-u', 'main.py', '--local_rank=3', '--data_path', 'Dahoas/rm-static', 'Dahoas/full-hh-rlhf', 'Dahoas/synthetic-instruct-gptj-pairwise', 'yitingxie/rlhf-r...
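On a single DGX node with 8 NVIDIA A100-40G GPUs, DeepSpeed-Chat can train a 13-billion-parameter ChatGPT-style model in 13.6 hours. On a multi-GPU, multi-node system (a cloud environment), for example 8 DGX nodes with 8 NVIDIA A100 GPUs each, DeepSpeed-Chat can train a 66-billion-parameter ChatGPT-style model in under 9 hours. Finally, it achieves a 15x speedup over existing RLHF systems, and...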
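To compare the two quoted configurations on equal footing, the wall-clock figures can be converted to GPU-hours. The sketch below is only arithmetic on the numbers quoted above; the helper name `gpu_hours` is hypothetical, and because the multi-node run is stated as "under 9 hours", its result is an upper bound rather than a measurement.

```python
def gpu_hours(nodes: int, gpus_per_node: int, wall_clock_hours: float) -> float:
    """Total GPU-hours consumed by a run (hypothetical helper)."""
    return nodes * gpus_per_node * wall_clock_hours


# 13B model: one node, 8 GPUs, 13.6 hours (figures quoted in the text).
single_node = gpu_hours(nodes=1, gpus_per_node=8, wall_clock_hours=13.6)

# 66B model: 8 nodes, 8 GPUs each, "under 9 hours", so this is an upper bound.
multi_node_upper_bound = gpu_hours(nodes=8, gpus_per_node=8, wall_clock_hours=9.0)

if __name__ == "__main__":
    print(f"13B run: {single_node:.1f} GPU-hours")
    print(f"66B run: at most {multi_node_upper_bound:.1f} GPU-hours")
```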
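We currently do not support the 70B llama-2 model (its architecture differs from the smaller llama-2 variants). We are working to add support as soon as possible!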
3b'] exits with return code = 1. The message says that accelerate must be version >=0.20.3, while the installed version is 0.19.0.
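A version mismatch like this can be caught before launching a long job. The following is a minimal sketch that reads the installed version with the standard library's `importlib.metadata` and compares it against a minimum; the helpers `parse_version`, `meets_minimum`, and `check_package` are hypothetical names, and the parser is deliberately simplified (numeric release parts only, ignoring pre-release tags and local suffixes such as `+cu118`).

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text: str) -> tuple:
    """Extract leading numeric components, e.g. '2.3.1+cu118' -> (2, 3, 1)."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two versions after padding them to the same length."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_package(name: str, minimum: str) -> str:
    """Return a one-line status for an installed package."""
    try:
        installed = version(name)
    except PackageNotFoundError:
        return f"{name} is not installed"
    status = "OK" if meets_minimum(installed, minimum) else "too old"
    return f"{name} {installed} (need >= {minimum}): {status}"


if __name__ == "__main__":
    # The minimum below is the one quoted in the error message above.
    print(check_package("accelerate", "0.20.3"))
```

If the check reports the package as too old, upgrading it in the same environment the launcher uses (for example with `pip install -U accelerate`) is the usual remedy.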
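Are there any suggestions for this kind of problem? I seem to be missing the multi-GPU details in the documentation, because on the same server there is no associated...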
I. Model: facebook/opt-350m
II. Training server configuration
1. CPU: 8 cores | Memory: 32 GB
2. GPU: 16 GB+ | 8+ TFLOPS SP
3. Operating system: Ubuntu 20.04
4. Python 3.8
5. CUDA 12.0
6. cuDNN 8
7. pytorch 2.3.1+cu118
III. Development tools: Visual …
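A configuration list like the one above is easier to verify when it is read from the running environment rather than typed by hand. The sketch below collects the Python, OS, and (if installed) PyTorch details; `collect_environment` is a hypothetical helper, and the PyTorch import is optional so the script still runs without it. Note that `torch.version.cuda` reports the CUDA version PyTorch was built against, which can differ from the system-wide CUDA toolkit.

```python
import platform


def collect_environment() -> dict:
    """Gather version details relevant to a training setup (hypothetical helper)."""
    info = {
        "python": platform.python_version(),
        "os": platform.platform(),
        "torch": None,
    }
    try:
        import torch  # optional: only reported when PyTorch is installed
    except ImportError:
        return info
    info["torch"] = torch.__version__
    info["torch_cuda_build"] = torch.version.cuda  # CUDA version of the PyTorch build
    info["cuda_available"] = torch.cuda.is_available()
    info["cudnn"] = torch.backends.cudnn.version()
    if info["cuda_available"]:
        info["gpu_count"] = torch.cuda.device_count()
    return info


if __name__ == "__main__":
    for key, value in collect_environment().items():
        print(f"{key}: {value}")
```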
_steps', '1', '--lr_scheduler_type', 'cosine', '--num_warmup_steps', '0', '--seed', '1234', '--zero_stage', '2', '--deepspeed', '--output_dir', '/data_turbo/home/zhangxiaoyu/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b'] exits with return code =...
[2023-10-26 17:54:48,155] [ERROR] [launch.py:321:sigkill_handler] ['/data_new/sjy98/polyglot-ko/data-parallel/deepspeed-venv/bin/python3', '-u', 'deepspeed-trainer.py', '--local_rank=3', '--deepspeed', 'deepspeed_config_2.json'] exits with return code = -9 ...
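The return codes in these log lines are easier to act on once decoded. The sketch below assumes the launcher reports return codes using the Python `subprocess` convention on POSIX, where a negative value `-N` means the child process was terminated by signal `N`; `describe_return_code` is a hypothetical helper, not part of DeepSpeed.

```python
import signal


def describe_return_code(code: int) -> str:
    """Translate a subprocess-style return code into a readable description."""
    if code == 0:
        return "exited normally"
    if code > 0:
        return f"exited with error status {code}"
    try:
        # Negative code: the process was terminated by signal number -code.
        name = signal.Signals(-code).name
    except ValueError:
        name = f"unknown signal {-code}"
    return f"killed by {name}"


if __name__ == "__main__":
    # The three codes that appear in the reports above.
    for code in (-11, -9, 1):
        print(code, "->", describe_return_code(code))
```

Under this reading, -9 is SIGKILL, which on Linux is often (though not always) sent by the kernel's out-of-memory killer when host RAM runs out, and -11 is SIGSEGV, a segmentation fault that typically originates in native code. A return code of 1 is an ordinary Python-level failure, so the traceback printed earlier in the log is the place to look.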
bigscience/T0 multi-gpu inference exits with return code -9 #16616 (Closed). archieCanada commented Sep 8, 2022: Hello, I was able to run the code stas00 mentioned above. Though my task is along the same lines, it is a little more demanding. And I struggle to make the adjust...