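When reproducing the results on the DOTAv1 dataset with a single 4090 GPU, running python ./tools/train.py ./configs/strip_rcnn/strip_rcnn_s_fpn_1x_dota_le90.py fails with the error below. After changing SyncBN to BN in strip_rcnn/strip_rcnn_s_fpn_1x_dota_le90.py, training ran without errors, but the result was only 54%. I then trained on two GPUs with ./tools/dist_train.sh ./configs/strip_rcnn/strip_rcnn_s_fpn_1x_dota_le90.py 2, with the config changed back to SyncBN, and the result reached 85.9%. After that I ran distributed training on a single GPU: ./tools/dist_train.sh ./configs/strip_rcnn/strip_rcnn_s_fpn_1x_dota_le90.py 1. I trained for only one epoch, but the result had already reached 66.7%.
Error message: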
Traceback (most recent call last):
File "./tools/train.py", line 197, in <module>
main()
File "./tools/train.py", line 186, in main
train_detector(
File "/root/autodl-tmp/Strip-R-CNN/mmrotate/apis/train.py", line 141, in train_detector
runner.run(data_loaders, cfg.workflow)
File "/root/miniconda3/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 136, in run
epoch_runner(data_loaders[i], **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 53, in train
self.run_iter(data_batch, train_mode=True, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 31, in run_iter
outputs = self.model.train_step(data_batch, self.optimizer,
File "/root/miniconda3/lib/python3.8/site-packages/mmcv/parallel/data_parallel.py", line 77, in train_step
return self.module.train_step(*inputs[0], **kwargs[0])
File "/root/miniconda3/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 248, in train_step
losses = self(**data)
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 149, in new_func
output = old_func(*new_args, **new_kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 172, in forward
return self.forward_train(img, img_metas, **kwargs)
File "/root/autodl-tmp/Strip-R-CNN/mmrotate/models/detectors/two_stage.py", line 127, in forward_train
x = self.extract_feat(img)
File "/root/autodl-tmp/Strip-R-CNN/mmrotate/models/detectors/two_stage.py", line 67, in extract_feat
x = self.backbone(img)
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/autodl-tmp/Strip-R-CNN/mmrotate/models/backbones/stripnet.py", line 208, in forward
x = self.forward_features(x)
File "/root/autodl-tmp/Strip-R-CNN/mmrotate/models/backbones/stripnet.py", line 198, in forward_features
x, H, W = patch_embed(x)
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/autodl-tmp/Strip-R-CNN/mmrotate/models/backbones/stripnet.py", line 114, in forward
x = self.norm(x)
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/batchnorm.py", line 732, in forward
world_size = torch.distributed.get_world_size(process_group)
File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 845, in get_world_size
return _get_group_size(group)
File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 306, in _get_group_size
default_pg = _get_default_group()
File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 410, in _get_default_group
raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
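The traceback ends inside the SyncBN layer's call to torch.distributed.get_world_size, which needs a process group that plain tools/train.py (non-distributed) never initializes. For reference, the SyncBN to BN change described above can be sketched roughly as follows. This is only a minimal sketch: the helper name replace_sync_bn, the sample dict, and the backbone type name "StripNet" are hypothetical stand-ins, and it assumes the config stores its norm layers as nested dict(type='SyncBN', ...) entries, which I have not checked against every part of the real config file.

```python
import copy


def replace_sync_bn(cfg):
    """Return a deep copy of a nested config with every type='SyncBN' set to 'BN'."""
    cfg = copy.deepcopy(cfg)

    def _walk(node):
        # Recurse through nested dicts and lists/tuples, rewriting norm entries in place.
        if isinstance(node, dict):
            if node.get("type") == "SyncBN":
                node["type"] = "BN"
            for value in node.values():
                _walk(value)
        elif isinstance(node, (list, tuple)):
            for item in node:
                _walk(item)

    _walk(cfg)
    return cfg


# Hypothetical fragment standing in for the real config file (names are assumptions).
model = dict(
    backbone=dict(type="StripNet", norm_cfg=dict(type="SyncBN", requires_grad=True)),
    neck=dict(type="FPN"),
)

single_gpu_model = replace_sync_bn(model)
print(single_gpu_model["backbone"]["norm_cfg"])
```

The copy is deliberate: the original dict keeps SyncBN for the dist_train.sh runs, and only the single-GPU variant uses BN.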