System: the Ubuntu 24.04 LTS install failed, so I switched back to ubuntu-22.04.5-live-server-amd64 to run the scripts.

Reference blog

Installing ROCm on Debian 12

Download the Ubuntu driver installer from the AMD website:
https://www.amd.com/zh-hans/support/linux-drivers
I downloaded: amdgpu-install_6.2.60204-1_all.deb

sudo apt install ./amdgpu-install_6.2.60204-1_all.deb

The installation may complain about a missing libpython3.10. It cannot be installed directly with apt because no such package exists; work around it as follows:

## First, build and install Python 3.10 from source
wget https://www.python.org/ftp/python/3.10.0/Python-3.10.0.tgz
sudo tar -xvf Python-3.10.0.tgz -C /usr/local/
sudo apt update

sudo apt install -y build-essential zlib1g-dev libncurses5-dev libgdbm-dev libnss3-dev libssl-dev libreadline-dev libffi-dev libsqlite3-dev wget libbz2-dev
cd /usr/local/Python-3.10.0

./configure --prefix=/usr/local/python3.10
make
sudo make install
## Symlink the new interpreter onto the PATH
sudo ln -s /usr/local/python3.10/bin/python3.10 /usr/local/bin/python3.10

## Use equivs to build a dummy libpython3.10 package
## (only equivs itself actually needs to be installed)
sudo apt-get install equivs
equivs-control libpython3.10.control
cat >libpython3.10.control <<EOF
Package: libpython3.10
Version: 3.10.1-1
Description: Dummy package for manually installed libpython3.10
 This is a dummy package created to satisfy dependencies for libpython3.10.
EOF

equivs-build libpython3.10.control
sudo dpkg -i libpython3.10_3.10.1-1_all.deb
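
If the amdgpu installer was left half-configured by the missing dependency, apt can re-resolve it once the dummy package is registered. A quick check:

## Confirm the dummy package is installed, then let apt finish any broken installs
dpkg -l | grep libpython3.10
sudo apt-get -f install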

Install ROCm

amdgpu-install --usecase=dkms,opencl,hip,rocm
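
To confirm the driver and runtime came up, the standard ROCm tools can be queried (both ship with the rocm usecase); reboot first so the dkms module loads:

## Sanity check: the GPU should show up as an agent with its gfx ISA
rocminfo | grep -i 'gfx\|Marketing'
rocm-smi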

The same procedure also works inside a Debian 12 based container; the container configuration is attached below.

version: '3.9'

services:
  code-server:
    image: registry.cn-hangzhou.aliyuncs.com/tuon-pub/code-server:latest
    container_name: code-server
    volumes:
      - ./data:/home/coder
      - ./data_root:/root
    user: root
    environment:
      DOCKER_USER: root
      HTTP_PROXY: http://192.168.1.6:10809
      HTTPS_PROXY: http://192.168.1.6:10809
    ports:
      - 7081:8080
    restart: unless-stopped
    cap_add:
      - ALL
    devices:
      - /dev/dri/
    group_add:
      - video
    ipc: host
    shm_size: 8G
    security_opt:
      - seccomp=unconfined
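
One caveat, based on my understanding of ROCm in containers rather than something the setup above exercised: compute workloads need /dev/kfd passed through in addition to /dev/dri. A minimal docker run equivalent of the compose file, with that assumption applied:

## Sketch only; /dev/kfd is assumed to be required for ROCm compute
## (the compose file above maps only /dev/dri)
docker run -d --name code-server \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video \
  --ipc=host --shm-size=8g \
  --security-opt seccomp=unconfined \
  -p 7081:8080 \
  registry.cn-hangzhou.aliyuncs.com/tuon-pub/code-server:latest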

vllm-rocm install script

## AMD ROCm is built against Ubuntu 22.04, so 24.04 lacks some dependency packages; add the jammy repository
sudo add-apt-repository -y -s "deb http://security.ubuntu.com/ubuntu jammy main universe"
## One-shot deployment script
curl -L https://vllm.9700001.xyz/install.sh -o install.sh && chmod +x install.sh && bash install.sh

Current status

  • Installing…, eventually abandoned

Problem log

  • Missing numpy package
error: subprocess-exited-with-error
  
  × Getting requirements to build editable did not run successfully.
  │ exit code: 1
  ╰─> [22 lines of output]
      /tmp/pip-build-env-s6y2y8ao/overlay/lib/python3.10/site-packages/torch/_subclasses/functional_tensor.py:295: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:84.)
        cpu = _conversion_method_template(device=torch.device("cpu"))
      No ROCm runtime is found, using ROCM_HOME='/opt/rocm'
      Traceback (most recent call last):
        File "/data/vllmenv/lib/python3.10/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 363, in <module>
          main()
        File "/data/vllmenv/lib/python3.10/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 345, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
        File "/data/vllmenv/lib/python3.10/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 144, in get_requires_for_build_editable
          return hook(config_settings)
        File "/tmp/pip-build-env-s6y2y8ao/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 473, in get_requires_for_build_editable
          return self.get_requires_for_build_wheel(config_settings)
        File "/tmp/pip-build-env-s6y2y8ao/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 331, in get_requires_for_build_wheel
          return self._get_build_requires(config_settings, requirements=[])
        File "/tmp/pip-build-env-s6y2y8ao/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 301, in _get_build_requires
          self.run_setup()
        File "/tmp/pip-build-env-s6y2y8ao/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 317, in run_setup
          exec(code, locals())
        File "<string>", line 606, in <module>
        File "<string>", line 475, in get_vllm_version
        File "<string>", line 428, in get_nvcc_cuda_version
      AssertionError: CUDA_HOME is not set
      [end of output]
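
Since the build was abandoned I never verified a fix, but the output shows two symptoms with a usual workaround: numpy is missing from pip's isolated build environment, and setup.py falls back to the CUDA path because no ROCm runtime is visible. An unverified sketch:

## Pre-install numpy, then build against the already-installed ROCm torch
## instead of pip's isolated build environment
pip install numpy
export ROCM_HOME=/opt/rocm   ## must point at a working ROCm runtime
pip install -e . --no-build-isolation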

GPU passthrough under PVE hangs the machine. This card does not work with passthrough under PVE, or even under KVM-based VMs in general; just run it on bare metal.

Prebuilt Docker image

vllm-rocm-gcn5

## This image is huge (16.18 GB), so the pull takes quite a while
docker pull btbtyler09/vllm-rocm-gcn5:0.8.5
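
I never got as far as running this image, so the following invocation is hypothetical: it assumes the image behaves like the upstream vLLM ROCm images (needs /dev/kfd and /dev/dri, serves an OpenAI-compatible API on port 8000).

## Hypothetical run command; devices, port, and entrypoint behavior are assumptions
docker run -it --rm \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video --ipc=host \
  -p 8000:8000 \
  btbtyler09/vllm-rocm-gcn5:0.8.5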

Ollama

After getting nowhere with the vllm-rocm build, I switched to Ollama; the one-line install script from the official site is all it takes.
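
For reference, the official install command as published on ollama.com:

curl -fsSL https://ollama.com/install.sh | sh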

Environment variables

  • OLLAMA_HOST=http://0.0.0.0:11434
  • OLLAMA_MODELS=/data/ollama/.ollama
  • OLLAMA_KEEP_ALIVE=10m
  • OLLAMA_NUM_PARALLEL=1
  • OLLAMA_MAX_LOADED_MODELS=3
  • OLLAMA_FLASH_ATTENTION=1
  • OLLAMA_CONTEXT_LENGTH=8192
systemd
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin"
Environment="OLLAMA_HOST=http://0.0.0.0:11434"
Environment="OLLAMA_MODELS=/data/ollama/.ollama"
Environment="OLLAMA_KEEP_ALIVE=10m"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_MAX_LOADED_MODELS=3"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_DEBUG=1"
Environment="OLLAMA_CONTEXT_LENGTH=8192"

[Install]
WantedBy=default.target
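
After editing the unit (or a drop-in), reload and restart so the variables take effect:

sudo systemctl daemon-reload
sudo systemctl restart ollama
journalctl -u ollama -f   ## follow the log to confirm the GPU is picked up
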
Windows

Just set the same variables as Windows system environment variables.

Problem log

FastGPT integration: after running for a while everything seemed to hang and Ollama became unresponsive.
FastGPT knowledge-base integration: with a long system prompt, the system message had no effect:

Setting OLLAMA_CONTEXT_LENGTH=8192 fixed it; the default context window is presumably too short, so the knowledge-base part of the prompt was truncated and effectively ignored.
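
The same limit can also be raised per request through Ollama's API via options.num_ctx, which is a quick way to test whether context length really is the culprit (the model name here is just a placeholder):

## Per-request context override; "qwen2.5:7b" is an example model name
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5:7b",
  "prompt": "hello",
  "options": { "num_ctx": 8192 }
}'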

Runtime error: CUDA error: HIPBLAS_STATUS_INTERNAL_ERROR

Probably the PyTorch version was too new; the error only went away with python 3.10.12 + torch 2.4.0 + ROCm 6.1:

pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/rocm6.1

The system was Debian 12 on PVE 8.x.
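
A quick way to confirm the ROCm build of torch is actually in use: torch.version.hip is only populated on ROCm builds, and torch.cuda.is_available() returns True when the HIP device is usable.

python3 -c "import torch; print(torch.__version__, torch.version.hip, torch.cuda.is_available())"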

HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION: The agent attempted to access me

Never found the root cause; after a reboot it worked again. AMD GPUs, especially the older generations, are not worth the pain.

Fan control with fancontrol

Install pwmconfig and fancontrol (on Debian, pwmconfig ships in the fancontrol package).

cat > /etc/fancontrol <<EOF
# Configuration file generated by pwmconfig, changes will be lost
INTERVAL=10
DEVPATH=hwmon0=devices/pci0000:00/0000:00:02.0/0000:02:00.0/0000:03:00.0/0000:04:00.0 hwmon3=devices/pci0000:00/0000:00:1f.3/i2c-0/0-002d
DEVNAME=hwmon0=amdgpu hwmon3=nct7904
FCTEMPS=hwmon3/pwm2=hwmon0/temp2_input
FCFANS= hwmon3/pwm2=hwmon3/fan4_input
MINTEMP=hwmon3/pwm2=50
MAXTEMP=hwmon3/pwm2=80
MINSTART=hwmon3/pwm2=150
MINSTOP=hwmon3/pwm2=0
MAXPWM=hwmon3/pwm2=220
EOF
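
fancontrol runs as a systemd service, so after writing the config it needs to be (re)started. A quick check that it is actually driving the PWM output (hwmon numbering can change across boots, which is what the DEVPATH/DEVNAME lines guard against):

sudo systemctl enable --now fancontrol
systemctl status fancontrol
## current duty cycle (0-255) being written by fancontrol
cat /sys/class/hwmon/hwmon3/pwm2
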
Ollama fails to run a model on Debian 12
May 31 12:23:52 pve ollama[32127]: ggml_cuda_compute_forward: RMS_NORM failed
May 31 12:23:52 pve ollama[32127]: ROCm error: invalid device function
May 31 12:23:52 pve ollama[32127]:   current device: 0, in function ggml_cuda_compute_forward at //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:2362
May 31 12:23:52 pve ollama[32127]:   err
May 31 12:23:52 pve ollama[32127]: //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:75: ROCm error
May 31 12:23:53 pve ollama[32127]: ptrace: Operation not permitted.
May 31 12:23:53 pve ollama[32127]: No stack.
May 31 12:23:53 pve ollama[32127]: The program is not being run.
May 31 12:23:53 pve ollama[32127]: SIGABRT: abort
May 31 12:23:53 pve ollama[32127]: PC=0x7f4b3c3a7eec m=14 sigcode=18446744073709551610
May 31 12:23:53 pve ollama[32127]: signal arrived during cgo execution

Adding an environment variable fixed it:

### The value should match the AMD GPU's GFX version; GFX906 = MI50 ?? a blind guess, but it did work
HSA_OVERRIDE_GFX_VERSION=9.0.6
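
Rather than guessing, the card's ISA can be read back with rocminfo, and the variable persisted into the Ollama service with a systemd drop-in:

## gfx906 corresponds to HSA_OVERRIDE_GFX_VERSION=9.0.6
rocminfo | grep -o 'gfx[0-9a-f]*' | sort -u
## persist it for the service: systemctl edit opens a drop-in; add
##   [Service]
##   Environment="HSA_OVERRIDE_GFX_VERSION=9.0.6"
sudo systemctl edit ollama
sudo systemctl restart ollama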