@KleinYuan

This PR addresses two issues:

  1. The Docker image cannot run on GPUs that require CUDA 11, i.e. all Ampere-architecture GPUs such as the RTX 3070 and 3080.
  2. The documented docker run command has a problem: starting the container with /bin/sh makes pip unavailable.
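
As a rough illustration of the first issue, the sketch below checks whether a PyTorch wheel tag such as cu101 or cu111 is new enough for an Ampere GPU. It is a minimal sketch and not part of this PR: the helper names and the lookup table are hypothetical, and it encodes only the fact that compute capability 8.x needs CUDA 11.0 or newer (native sm_86 targeting for the RTX 30 series arrived in CUDA 11.1). A new enough toolkit is necessary but not sufficient, since the wheel must also have been built for the GPU's architecture.

```python
import re

# Hypothetical lookup: minimum CUDA toolkit (major, minor) per compute
# capability major version. Only Ampere (8.x) is listed; extend as needed.
MIN_CUDA_FOR_CAPABILITY_MAJOR = {
    8: (11, 0),  # Ampere, e.g. A100 (8.0) and RTX 3070/3080 (8.6)
}


def cuda_version_from_tag(tag):
    """Parse a wheel tag such as 'cu101' or 'cu111' into (major, minor)."""
    match = re.fullmatch(r"cu(\d+)(\d)", tag)
    if match is None:
        raise ValueError(f"unrecognized CUDA tag: {tag!r}")
    return int(match.group(1)), int(match.group(2))


def tag_supports_capability(tag, capability):
    """Return True if the toolkit named by `tag` can target `capability`.

    `capability` is a (major, minor) tuple, the same shape that
    torch.cuda.get_device_capability() returns.
    """
    required = MIN_CUDA_FOR_CAPABILITY_MAJOR.get(capability[0])
    if required is None:
        raise KeyError(f"no entry for compute capability {capability}")
    return cuda_version_from_tag(tag) >= required


if __name__ == "__main__":
    rtx_3070 = (8, 6)
    for tag in ("cu101", "cu111"):
        print(tag, tag_supports_capability(tag, rtx_3070))
```

With the old cu101 wheels the check fails for an RTX 3070, while the cu111 wheels used in the new Dockerfile pass.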

This PR was fully tested on an RTX 3070 machine, where training runs successfully:

[01/27 22:31:39 fastreid.utils.checkpoint]: No checkpoint found. Training model from scratch
[01/27 22:31:39 fastreid.engine.train_loop]: Starting training from epoch 0
[01/27 22:32:24 fastreid.utils.events]:  eta: 1:21:55  epoch/iter: 0/199  total_loss: 7.745  loss_cls: 6.461  loss_triplet: 1.292  time: 0.2043  data_time: 0.0013  lr: 6.60e-05  max_mem: 4862M
[01/27 22:32:24 fastreid.utils.events]:  eta: 1:21:55  epoch/iter: 0/201  total_loss: 7.726  loss_cls: 6.445  loss_triplet: 1.26  time: 0.2043  data_time: 0.0010  lr: 6.63e-05  max_mem: 4862M
[01/27 22:33:08 fastreid.utils.events]:  eta: 1:23:00  epoch/iter: 1/399  total_loss: 5.311  loss_cls: 4.884  loss_triplet: 0.4171  time: 0.2082  data_time: 0.0010  lr: 9.75e-05  max_mem: 4862M
[01/27 22:33:09 fastreid.utils.events]:  eta: 1:23:00  epoch/iter: 1/403  total_loss: 5.273  loss_cls: 4.852  loss_triplet: 0.4111  time: 0.2085  data_time: 0.0010  lr: 9.82e-05  max_mem: 4862M
[01/27 22:33:58 fastreid.utils.events]:  eta: 1:23:21  epoch/iter: 2/599  total_loss: 3.677  loss_cls: 3.44  loss_triplet: 0.227  time: 0.2194  data_time: 0.0007  lr: 1.29e-04  max_mem: 4862M

This PR makes the following changes:

  1. Add a CUDA 11 Dockerfile.
  2. Move the Dockerfile to the root folder.
  3. Update the docker command documentation.
  4. Remove the user management, which is not necessary.

RUN pip install tensorboard cmake # cmake from apt-get is too old
RUN pip install torch==1.10 torchvision==0.11.1 -f https://download.pytorch.org/whl/cu111/torch_stable.html
# RUN pip install torch==1.6.0+cu101 torchvision==0.7.0+cu101 -f https://download.pytorch.org/whl/cu101/torch_stable.html
# RUN pip install -i https://pypi.tuna.tsinghua.edu.cn/simple tensorboard opencv-python cython yacs termcolor scikit-learn tabulate gdown gpustat faiss-gpu ipdb h5py

@KleinYuan KleinYuan Jan 27, 2023

I removed the https://pypi.tuna.tsinghua.edu.cn/simple index because it kept timing out. Do we need it? @L1aoXingyu
