Docker [Deployment 05]: Pitfalls When Installing tensorflow-gpu in Docker and Calling the GPU



Building wheels for collected packages: tensorflow-gpu
  Building wheel for tensorflow-gpu (setup.py): started
  Building wheel for tensorflow-gpu (setup.py): finished with status 'error'
  Running setup.py clean for tensorflow-gpu
  error: subprocess-exited-with-error

  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [18 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-i6frcfa8/tensorflow-gpu_2cea358528754cc596c541f9c2ce45ca/setup.py", line 37, in <module>
          raise Exception(TF_REMOVAL_WARNING)

      The "tensorflow-gpu" package has been removed!

      Please install "tensorflow" instead.

      Other than the name, the two packages have been identical
      since TensorFlow 2.1, or roughly since Sep 2019. For more
      information, see: pypi.org/project/tensorflow-gpu
      =========================================================
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for tensorflow-gpu
Failed to build tensorflow-gpu

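"Other than the name, the two packages have been identical since TensorFlow 2.1": in other words, the plain tensorflow package, version 2.1 or later, already ships with GPU support, so install tensorflow instead of tensorflow-gpu.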
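Because the error message says only the package name differs, a requirements file that still lists tensorflow-gpu can be rewritten mechanically. The snippet below is a minimal sketch and not part of the original setup: the helper name `rename_tensorflow_gpu` is hypothetical, and it only handles simple requirement lines (a name followed by an optional version specifier).

```python
import re

# Matches a leading "tensorflow-gpu" (or "tensorflow_gpu") only when it is
# followed by the end of the line or the start of a specifier/marker/comment.
_TF_GPU = re.compile(r"^\s*tensorflow[-_]gpu(?=\s*($|[=<>!~;\[#]))", re.IGNORECASE)


def rename_tensorflow_gpu(line: str) -> str:
    """Hypothetical helper: swap the removed package name for "tensorflow".

    Any version pin after the name is left untouched; lines that do not
    start with tensorflow-gpu are returned unchanged.
    """
    return _TF_GPU.sub("tensorflow", line)


if __name__ == "__main__":
    for requirement in ["tensorflow-gpu", "tensorflow-gpu>=2.1", "flask>=2.0"]:
        print(rename_tensorflow_gpu(requirement))
```

Running the same substitution over every line of a requirements.txt before `pip install -r` avoids the failed wheel build shown above.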



[root@localhost ~]# nvidia-smi
# Query result
| NVIDIA-SMI 460.27.04    Driver Version: 460.27.04    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
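The driver and CUDA versions in this header matter for every error that follows, so it can help to extract them programmatically. A minimal sketch, assuming the header keeps the `Driver Version: ... CUDA Version: ...` layout shown above; the function name `parse_nvidia_smi_header` is made up for illustration. Note that the CUDA version printed by nvidia-smi is the highest version the driver supports, not necessarily the toolkit that is installed.

```python
import re
from typing import Optional, Tuple

# Assumes the header layout printed by nvidia-smi in the output above.
HEADER_RE = re.compile(
    r"Driver Version:\s*(?P<driver>[\d.]+)\s+CUDA Version:\s*(?P<cuda>[\d.]+)"
)


def parse_nvidia_smi_header(text: str) -> Optional[Tuple[str, str]]:
    """Return (driver_version, cuda_version), or None if no header is found."""
    match = HEADER_RE.search(text)
    if match is None:
        return None
    return match.group("driver"), match.group("cuda")


if __name__ == "__main__":
    sample = "| NVIDIA-SMI 460.27.04    Driver Version: 460.27.04    CUDA Version: 11.2     |"
    print(parse_nvidia_smi_header(sample))
```

In practice the text would come from something like `subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout` on a machine where the command exists.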

2.1 Could not find cuda drivers

# Error
I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.


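  1. Check the CUDA version: first confirm that CUDA is correctly installed on the host. Run nvcc --version on the host to check the CUDA version.
  2. Use an NVIDIA Docker image: NVIDIA provides preconfigured Docker images that already include CUDA and the other required libraries. These can serve as the base image in the Dockerfile.
  3. Set environment variables: in the Dockerfile, use the ENV instruction to set environment variables. For example, if CUDA is installed under /usr/local/cuda, add this line to the Dockerfile: ENV PATH /usr/local/cuda/bin:$PATH
  4. Use nvidia-docker: nvidia-docker is a tool for running GPU-accelerated Docker containers.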




# Add the CUDA environment variables
-e PATH=/usr/local/cuda-11.2/bin:$PATH
-e LD_LIBRARY_PATH=/usr/local/cuda-11.2/lib64:$LD_LIBRARY_PATH
# Start command
nvidia-docker run --name deepface --privileged=true --restart=always --net="host" -e PATH=/usr/local/cuda-11.2/bin:$PATH -e LD_LIBRARY_PATH=/usr/local/cuda-11.2/lib64:$LD_LIBRARY_PATH -v /root/.deepface/weights/:/root/.deepface/weights/ -v /usr/local/cuda-11.2/:/usr/local/cuda-11.2/ -d deepface_image
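A long one-line command like this is easy to break: every -e and -v flag needs its own space-separated argument. One way to keep it manageable is to assemble the argument list in code. The following is only a sketch under assumptions: `build_run_command` is a hypothetical helper, and the host's PATH and LD_LIBRARY_PATH are read in Python to mimic what the shell does when it expands `$PATH` in the command above.

```python
import os
import shlex
from typing import Dict, List

CUDA_HOME = "/usr/local/cuda-11.2"  # CUDA location used in the command above


def build_run_command(
    image: str, name: str, env: Dict[str, str], volumes: Dict[str, str]
) -> List[str]:
    """Hypothetical helper: assemble the nvidia-docker argument list."""
    cmd = [
        "nvidia-docker", "run", "--name", name,
        "--privileged=true", "--restart=always", "--net=host",
    ]
    for key, value in env.items():
        cmd += ["-e", f"{key}={value}"]  # one -e flag per variable
    for host_path, container_path in volumes.items():
        cmd += ["-v", f"{host_path}:{container_path}"]  # one -v flag per mount
    cmd += ["-d", image]
    return cmd


if __name__ == "__main__":
    env = {
        # Expanded here, because no shell will expand $PATH for us.
        "PATH": f"{CUDA_HOME}/bin:{os.environ.get('PATH', '')}",
        "LD_LIBRARY_PATH": f"{CUDA_HOME}/lib64:{os.environ.get('LD_LIBRARY_PATH', '')}",
    }
    volumes = {
        "/root/.deepface/weights/": "/root/.deepface/weights/",
        f"{CUDA_HOME}/": f"{CUDA_HOME}/",
    }
    # Print the command for review instead of executing it.
    print(shlex.join(build_run_command("deepface_image", "deepface", env, volumes)))
```

The list can be passed to `subprocess.run` as-is, which sidesteps shell quoting altogether.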

2.2 was unable to find libcuda.so DSO

I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:168] retrieving CUDA diagnostic information for host: localhost.localdomain
I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:175] hostname: localhost.localdomain
I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:199] libcuda reported version is: NOT_FOUND: was unable to find libcuda.so DSO loaded into this program
I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:203] kernel reported version is: 460.27.4


# 1. Locate the libcuda.so files
find / -name "libcuda.so*"
# Search results


# 3. Copy the 64-bit libcuda.so.460.27.04 into the LD_LIBRARY_PATH directory (libcuda.so and libcuda.so.1 are both symlinks)
cp /usr/lib64/libcuda.so.460.27.04 /usr/local/cuda-11.2/lib64/

# 4. Create the symlinks (run inside /usr/local/cuda-11.2/lib64/, since the link targets are relative)
ln -s libcuda.so.460.27.04 libcuda.so.1
ln -s libcuda.so.1 libcuda.so
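After copying the library and creating the symlinks, it is worth confirming from inside the container that the dynamic loader can actually open libcuda before starting TensorFlow. A minimal sketch using only the standard library; `first_loadable` is a hypothetical helper name. Keep in mind that LD_LIBRARY_PATH is read when the process starts, so it has to be set before Python is launched.

```python
import ctypes
from typing import Iterable, Optional


def first_loadable(
    names: Iterable[str] = ("libcuda.so.1", "libcuda.so"),
) -> Optional[str]:
    """Return the first library name the dynamic loader can open, else None."""
    for name in names:
        try:
            ctypes.CDLL(name)  # raises OSError if the library cannot be loaded
        except OSError:
            continue
        return name
    return None


if __name__ == "__main__":
    found = first_loadable()
    if found is None:
        print("libcuda could not be loaded; check LD_LIBRARY_PATH and the symlinks")
    else:
        print(f"loaded {found}")
```

If this prints a library name but TensorFlow still reports NOT_FOUND, the container environment that runs TensorFlow probably differs from the shell used for the check.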

2.3 Could not find TensorRT && Cannot dlopen some GPU libraries

I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.

W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
W tensorflow/core/common_runtime/gpu/gpu_device.cc:1960] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...




RUN pip install tensorrt -i https://pypi.tuna.tsinghua.edu.cn/simple


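# 1. Look up the container ID
docker ps
# 2. Enter the container while it is running
docker exec -it ContainerID /bin/bash

# 3. Install the package
pip install tensorrt -i https://pypi.tuna.tsinghua.edu.cn/simple

# 4. Commit a new image (the new image can then be exported and used elsewhere)
docker commit ContainerID imageName:version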


root@localhost:/app# python
Python 3.8.18 (default, Sep 20 2023, 11:41:31)
[GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

# Importing tensorflow on its own triggers the warning
>>> import tensorflow as tf
2023-10-09 10:15:55.482545: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-09 10:15:56.498608: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT

# Import tensorrt first, then tensorflow
>>> import tensorrt as tr
>>> import tensorflow as tf
>>> tf.test.is_gpu_available()
WARNING:tensorflow:From <stdin>:1: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
2023-10-09 10:16:41.452672: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /device:GPU:0 with 11389 MB memory:  -> device: 0, name: Tesla T4, pci bus id: 0000:2f:00.0, compute capability: 7.5




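# Import tensorrt before tensorflow so that TF-TRT can find it
import tensorrt as tr
import tensorflow as tf

if __name__ == "__main__":
    # An empty list means TensorFlow sees no GPU
    available = tf.config.list_physical_devices('GPU')
    print(f"available:{available}")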


# 3rd party dependencies
import tensorrt as tr
import tensorflow as tf
from flask import Flask
from routes import blueprint


# NOTE: the function header is assumed; deepface's app.py names this factory create_app
def create_app():
    available = tf.config.list_physical_devices('GPU')
    print(f"available:{available}")
    app = Flask(__name__)
    app.register_blueprint(blueprint)
    return app


nvidia-docker run --name deepface --privileged=true --restart=always --net="host" -e PATH=/usr/local/cuda-11.2/bin:$PATH -e LD_LIBRARY_PATH=/usr/local/cuda-11.2/lib64:$LD_LIBRARY_PATH -v /root/.deepface/weights/:/root/.deepface/weights/ -v /usr/local/cuda-11.2/:/usr/local/cuda-11.2/ -v /opt/xinan-facesearch-service-public/deepface/api/app.py:/app/app.py -d deepface_image

2.4 Could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED

E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:437] Could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:441] Memory usage: 1100742656 bytes free, 15843721216 bytes total.
E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:451] Possibly insufficient driver version: 460.27.4
W tensorflow/core/framework/op_kernel.cc:1828] OP_REQUIRES failed at conv_ops_impl.h:770 : UNIMPLEMENTED: DNN library is not found.
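The log itself points at two possible causes: very little free GPU memory, and a driver that may be too old. To judge the first, the byte counts in the "Memory usage" line can be turned into a fraction. This is a small sketch that only parses the log line; `free_memory_fraction` is a hypothetical name, and it does not determine which of the two causes applies.

```python
import re
from typing import Optional, Tuple

# Matches the "Memory usage" line emitted by cuda_dnn.cc in the log above.
MEMORY_RE = re.compile(r"Memory usage:\s*(\d+) bytes free,\s*(\d+) bytes total")


def free_memory_fraction(log_line: str) -> Optional[Tuple[int, int, float]]:
    """Return (free_bytes, total_bytes, free / total), or None if no match."""
    match = MEMORY_RE.search(log_line)
    if match is None:
        return None
    free, total = int(match.group(1)), int(match.group(2))
    return free, total, free / total


if __name__ == "__main__":
    line = "Memory usage: 1100742656 bytes free, 15843721216 bytes total."
    free, total, fraction = free_memory_fraction(line)
    gib = 1024 ** 3
    print(f"{free / gib:.2f} GiB free of {total / gib:.2f} GiB ({fraction:.1%})")
```

For the line above this works out to roughly 1 GiB free out of about 14.8 GiB, so it is worth checking with nvidia-smi whether other processes are already holding most of the GPU memory.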


2.5 CuDNN library needs to have matching major version and equal or higher minor version


E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:425] Loaded runtime CuDNN library: 8.1.1 but source was compiled with: 8.6.0.  CuDNN library needs to have matching major version and equal or higher minor version. If using a binary install, upgrade your CuDNN library.  If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
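The rule stated in this message (same major version, runtime minor version equal or higher) is easy to check by hand, and just as easy to encode. A minimal sketch of that rule as written in the log; the function names are made up for illustration, and the check assumes plain dotted numeric versions such as 8.1.1.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as "8.6.0" into (8, 6, 0)."""
    return tuple(int(part) for part in version.split("."))


def cudnn_runtime_ok(loaded: str, compiled: str) -> bool:
    """Apply the rule from the log: major must match, minor must be >= compiled."""
    loaded_v, compiled_v = parse_version(loaded), parse_version(compiled)
    return loaded_v[0] == compiled_v[0] and loaded_v[1] >= compiled_v[1]


if __name__ == "__main__":
    # Versions taken from the error message above.
    print(cudnn_runtime_ok("8.1.1", "8.6.0"))  # False: minor 1 is lower than 6
```

Here the loaded cuDNN 8.1.1 fails the check against the 8.6.0 that TensorFlow was built with, so the cuDNN visible to the container needs to be 8.6 or a later 8.x release.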
