AMD MI50 LLM Tinkering: Installing ROCm/amdgpu on Debian
System: Ubuntu 22.04 LTS. (Installation on Ubuntu 24.04 LTS failed; I switched back to ubuntu-22.04.5-live-server-amd64 to run the scripts.)
Reference blog: Installing ROCm on Debian 12
Download the Ubuntu 20.04 driver installer from the AMD website:
https://www.amd.com/zh-hans/support/linux-drivers
I downloaded: amdgpu-install_6.2.60204-1_all.deb
sudo apt install ./amdgpu-install_6.2.60204-1_all.deb
The installer may complain about a missing libpython3.10, which cannot be installed directly with apt because no such package is available. Work around it as follows:
## First, build and install Python 3.10 from source
wget https://www.python.org/ftp/python/3.10.0/Python-3.10.0.tgz
tar -xvf Python-3.10.0.tgz -C /usr/local/
sudo apt update
sudo apt install -y build-essential zlib1g-dev libncurses5-dev libgdbm-dev libnss3-dev libssl-dev libreadline-dev libffi-dev libsqlite3-dev wget libbz2-dev
cd /usr/local/Python-3.10.0
./configure --prefix=/usr/local/python3.10
make -j"$(nproc)"
sudo make install
## Put the interpreter on PATH
sudo ln -s /usr/local/python3.10/bin/python3.10 /usr/local/bin/python3.10
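## Quick sanity check that the interpreter resolves (assumes the build above succeeded)
python3.10 --version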
## Use equivs to build a dummy libpython3.10 package
## (only equivs itself actually needs to be installed)
sudo apt-get install equivs
equivs-control libpython3.10.control
cat >libpython3.10.control <<EOF
Package: libpython3.10
Version: 3.10.1-1
Description: Dummy package for manually installed libpython3.10
 This is a dummy package created to satisfy dependencies for libpython3.10.
EOF
equivs-build libpython3.10.control
sudo dpkg -i libpython3.10_3.10.1-1_all.deb
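To confirm apt now treats the dependency as satisfied, a standard dpkg query works:
dpkg -s libpython3.10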
Install ROCm
amdgpu-install --usecase=dkms,opencl,hip,rocm
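After a reboot, the ROCm utilities installed by the rocm usecase should see the card; on an MI50 the reported architecture is gfx906:
rocminfo | grep gfx    # expect gfx906 for the MI50
rocm-smi               # temperatures, clocks, VRAM usage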
Installation inside a Debian 12-based container works the same way; the container configuration is attached:
version: '3.9'
services:
  code-server:
    image: registry.cn-hangzhou.aliyuncs.com/tuon-pub/code-server:latest
    container_name: code-server
    volumes:
      - ./data:/home/coder
      - ./data_root:/root
    user: root
    environment:
      DOCKER_USER: root
      HTTP_PROXY: http://192.168.1.6:10809
      HTTPS_PROXY: http://192.168.1.6:10809
    ports:
      - 7081:8080
    restart: unless-stopped
    cap_add:
      - ALL
    devices:
      # ROCm compute needs /dev/kfd in addition to the render nodes
      - /dev/kfd
      - /dev/dri/
    group_add:
      - video
    ipc: host
    shm_size: 8G
    security_opt:
      - seccomp=unconfined
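Assuming the file is saved as docker-compose.yml, bring the container up with:
docker compose up -d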
vllm-rocm install script
## AMD ROCm is built on Ubuntu 22.04, so 24.04 lacks some dependency packages; add the jammy repository
sudo add-apt-repository -y -s "deb http://security.ubuntu.com/ubuntu jammy main universe"
## One-click deployment script
curl -L https://vllm.9700001.xyz/install.sh -o install.sh && chmod +x install.sh && bash install.sh
Current status
- Installing… gave up.
Issue log
- Missing numpy package
error: subprocess-exited-with-error
× Getting requirements to build editable did not run successfully.
│ exit code: 1
╰─> [22 lines of output]
/tmp/pip-build-env-s6y2y8ao/overlay/lib/python3.10/site-packages/torch/_subclasses/functional_tensor.py:295: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:84.)
cpu = _conversion_method_template(device=torch.device("cpu"))
No ROCm runtime is found, using ROCM_HOME='/opt/rocm'
Traceback (most recent call last):
File "/data/vllmenv/lib/python3.10/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 363, in <module>
main()
File "/data/vllmenv/lib/python3.10/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 345, in main
json_out['return_val'] = hook(**hook_input['kwargs'])
File "/data/vllmenv/lib/python3.10/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 144, in get_requires_for_build_editable
return hook(config_settings)
File "/tmp/pip-build-env-s6y2y8ao/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 473, in get_requires_for_build_editable
return self.get_requires_for_build_wheel(config_settings)
File "/tmp/pip-build-env-s6y2y8ao/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 331, in get_requires_for_build_wheel
return self._get_build_requires(config_settings, requirements=[])
File "/tmp/pip-build-env-s6y2y8ao/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 301, in _get_build_requires
self.run_setup()
File "/tmp/pip-build-env-s6y2y8ao/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 317, in run_setup
exec(code, locals())
File "<string>", line 606, in <module>
File "<string>", line 475, in get_vllm_version
File "<string>", line 428, in get_nvcc_cuda_version
AssertionError: CUDA_HOME is not set
[end of output]
GPU passthrough on PVE locks up the host; this card does not work with passthrough under PVE or even other KVM-based VMs. Just run it on bare metal.
Prebuilt Docker image
## This image is large (16.18 GB), so the pull will take quite a while
docker pull btbtyler09/vllm-rocm-gcn5:0.8.5
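A typical ROCm container invocation for a quick test follows the standard ROCm-in-Docker pattern (the exact entrypoint and serve arguments are image-specific; check the image's README):
docker run -it --rm \
  --device /dev/kfd --device /dev/dri \
  --group-add video --ipc=host --shm-size 8G \
  btbtyler09/vllm-rocm-gcn5:0.8.5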
Ollama
After getting nowhere building vllm-rocm, I switched to Ollama; the one-line install script from the official site is all that's needed.
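For reference, the official one-line installer is:
curl -fsSL https://ollama.com/install.sh | sh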
Environment variables
- OLLAMA_HOST=http://0.0.0.0:11434
- OLLAMA_MODELS=/data/ollama/.ollama
- OLLAMA_KEEP_ALIVE=10m
- OLLAMA_NUM_PARALLEL=1
- OLLAMA_MAX_LOADED_MODELS=3
- OLLAMA_FLASH_ATTENTION=1
- OLLAMA_CONTEXT_LENGTH=8192
systemd
[Unit]
Description=Ollama Service
After=network-online.target
[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin"
Environment="OLLAMA_HOST=http://0.0.0.0:11434"
Environment="OLLAMA_MODELS=/data/ollama/.ollama"
Environment="OLLAMA_KEEP_ALIVE=10m"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_MAX_LOADED_MODELS=3"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_DEBUG=1"
Environment="OLLAMA_CONTEXT_LENGTH=8192"
[Install]
WantedBy=default.target
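After editing the unit file, the usual systemd reload applies:
sudo systemctl daemon-reload
sudo systemctl enable --now ollama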
Windows
Just set the same variables as system environment variables.
Issue log
With FastGPT attached, everything seems to lock up after running for a while, and Ollama misbehaves.
FastGPT knowledge-base integration: with a long system prompt, the system message has no effect.
Set OLLAMA_CONTEXT_LENGTH=8192; the default context window is probably too short, so the knowledge-base prompt gets truncated and ignored.
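The context window can also be raised per request through Ollama's API; a minimal check (the model name here is only an example):
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5:7b",
  "prompt": "hello",
  "options": { "num_ctx": 8192 }
}'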
Runtime error: CUDA error: HIPBLAS_STATUS_INTERNAL_ERROR
Probably the PyTorch version was too new; it only stopped erroring with Python 3.10.12 + torch 2.4.0 + ROCm 6.1:
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/rocm6.1
The system is Debian 12 on PVE 8.x.
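A quick sanity check that this PyTorch build sees the GPU (on ROCm builds the torch.cuda API maps to HIP):
python3 -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.get_device_name(0))"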
HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION: The agent attempted to access me
Never found the root cause; it worked again after a reboot. Steer clear of AMD GPUs, especially the older ones.
Fan control with fancontrol
Install pwmconfig and fancontrol.
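On Debian both tools ship in the fancontrol package; lm-sensors helps enumerate the hwmon devices first. A typical sequence:
sudo apt install fancontrol lm-sensors
sudo sensors-detect    # detect hwmon chips
sudo pwmconfig         # interactive wizard that can generate /etc/fancontrol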
cat > /etc/fancontrol<<EOF
# Configuration file generated by pwmconfig, changes will be lost
INTERVAL=10
DEVPATH=hwmon0=devices/pci0000:00/0000:00:02.0/0000:02:00.0/0000:03:00.0/0000:04:00.0 hwmon3=devices/pci0000:00/0000:00:1f.3/i2c-0/0-002d
DEVNAME=hwmon0=amdgpu hwmon3=nct7904
FCTEMPS=hwmon3/pwm2=hwmon0/temp2_input
FCFANS= hwmon3/pwm2=hwmon3/fan4_input
MINTEMP=hwmon3/pwm2=50
MAXTEMP=hwmon3/pwm2=80
MINSTART=hwmon3/pwm2=150
MINSTOP=hwmon3/pwm2=0
MAXPWM=hwmon3/pwm2=220
EOF
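Then enable the service (Debian ships a fancontrol systemd unit):
sudo systemctl enable --now fancontrol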
Debian 12: Ollama fails to run a model
May 31 12:23:52 pve ollama[32127]: ggml_cuda_compute_forward: RMS_NORM failed
May 31 12:23:52 pve ollama[32127]: ROCm error: invalid device function
May 31 12:23:52 pve ollama[32127]: current device: 0, in function ggml_cuda_compute_forward at //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:2362
May 31 12:23:52 pve ollama[32127]: err
May 31 12:23:52 pve ollama[32127]: //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:75: ROCm error
May 31 12:23:53 pve ollama[32127]: ptrace: Operation not permitted.
May 31 12:23:53 pve ollama[32127]: No stack.
May 31 12:23:53 pve ollama[32127]: The program is not being run.
May 31 12:23:53 pve ollama[32127]: SIGABRT: abort
May 31 12:23:53 pve ollama[32127]: PC=0x7f4b3c3a7eec m=14 sigcode=18446744073709551610
May 31 12:23:53 pve ollama[32127]: signal arrived during cgo execution
Add an environment variable
### The value corresponds to the AMD GPU architecture: gfx906 for the MI50. A guess, but it did work
HSA_OVERRIDE_GFX_VERSION=9.0.6
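To make this persistent for the service, add it next to the other Environment= lines in the Ollama unit, then reload and restart:
Environment="HSA_OVERRIDE_GFX_VERSION=9.0.6"
sudo systemctl daemon-reload && sudo systemctl restart ollama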