Error details: RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use). The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).
This error occurs when using torch.nn.parallel.DistributedDataParallel
to train models in parallel. I launched program A with python -m torch.distributed.launch --nproc_per_node=2 trainA.py
and it worked fine. Then, while A was still running, I tried to launch program B with python -m torch.distributed.launch --nproc_per_node=2 trainB.py
and ended up with the error above.
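It turns out that the issue is the port rather than the training code. As the error reports, port 29500 is already in use: it is the default master port of torch.distributed.launch, and program A is still holding it. Giving program B a different, unused port should therefore work. So I used the command python -m torch.distributed.launch --nproc_per_node=2 --master_port='29501' trainB.py
and program B started alongside A.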
Problem solved!
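If 29501 might also be taken, the port can be picked programmatically instead of guessed. The following is only a minimal sketch using the Python standard library; the helper names port_is_free and find_free_port are my own for illustration and are not part of PyTorch.

```python
import socket


def port_is_free(port):
    """Return True if a TCP socket can currently be bound to `port`."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind(("", port))
        except OSError:
            # Typically errno 98 (Address already in use) on Linux.
            return False
        return True


def find_free_port():
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]


if __name__ == "__main__":
    # Print a port number that can be passed to --master_port.
    print(find_free_port())
```

Saved under a hypothetical name such as find_port.py, the output could be passed on the command line, for example python -m torch.distributed.launch --nproc_per_node=2 --master_port=$(python find_port.py) trainB.py. Note that the port is released before the launcher binds it, so another process could still grab it in between; this only makes a collision unlikely.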