AMD Instinct MI25 | Machine Learning Setup on the Cheap!

Hello!

Recently I’ve been tasked with building a machine learning environment on AMD hardware using ROCm.

I’ve never done this before… to the lab!

The Intro


I was able to purchase a second-hand AMD MI25 card for $100 USD (as of this writing). This device has plenty of horsepower for ML / AI lab work. Getting it tested and configured was an interesting journey. Hopefully this helps someone. 🙂


This MI25 was installed in a Dell R720, and before you comment: no, this is not supported at ALL. I ran BOTH 8-pin PCIe GPU power outputs from the PCIe risers into the one card. I do NOT recommend doing this, but it worked for me during this testing.

The specs of the Dell R720 I used were:

2x E5-2640v2 | 128GB RAM | 2x 1100W PSUs | 2x 146GB R1 15k SAS HDD + 1.6TB SAS SSD | H710

The Card

What is the MI25 from AMD?

The AMD Radeon Instinct MI25 is a compute accelerator card designed for machine learning and artificial intelligence workloads. It is based on the Vega architecture and delivers roughly 25 teraflops (24.6 peak) of half-precision floating-point performance, which makes it well suited to tasks such as training deep neural networks. The MI25 also has 16GB of HBM2 memory, which provides fast access to the data that machine learning algorithms need. It is intended for servers and other high-performance computing systems and can be used for applications including image recognition, natural language processing, and video analytics. At $100 USD on the used market, it can be a great entry point for a HomeLab that wants to add a high-memory card to its compute stack.
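To put the "25 teraflops" and "16GB" figures in context, here is a rough back-of-the-envelope sketch. The hardware numbers are AMD's published MI25 specs (4096 stream processors, 1.5 GHz peak engine clock); the helper names and the example model sizes are made up for illustration, and the memory estimate counts model weights only, so real workloads will need more headroom.

```python
# Back-of-the-envelope numbers for the MI25.
# Hardware figures are AMD's published specs; helper names are hypothetical.

STREAM_PROCESSORS = 4096      # 64 compute units x 64 lanes
PEAK_CLOCK_HZ = 1.5e9         # peak engine clock
VRAM_GIB = 16                 # HBM2 capacity


def peak_tflops(lanes, clock_hz, flops_per_lane_per_clock):
    """Theoretical peak throughput in TFLOPS."""
    return lanes * clock_hz * flops_per_lane_per_clock / 1e12


def weights_gib(params, bytes_per_param):
    """Memory needed for the model weights alone, in GiB."""
    return params * bytes_per_param / 2**30


# A fused multiply-add counts as 2 FLOPs; Vega's packed FP16 doubles that to 4.
fp32 = peak_tflops(STREAM_PROCESSORS, PEAK_CLOCK_HZ, 2)
fp16 = peak_tflops(STREAM_PROCESSORS, PEAK_CLOCK_HZ, 4)
print(f"peak FP32: {fp32:.1f} TFLOPS, peak FP16: {fp16:.1f} TFLOPS")

# Illustrative model sizes (assumptions, not benchmarks).
for label, params, nbytes in [
    ("7B  @ FP16", 7e9, 2),
    ("13B @ FP16", 13e9, 2),
    ("13B @ 4-bit", 13e9, 0.5),
]:
    need = weights_gib(params, nbytes)
    verdict = "fits" if need < VRAM_GIB else "does not fit"
    print(f"{label}: {need:.1f} GiB of weights, {verdict} in {VRAM_GIB} GiB")
```

These are theoretical peaks and weight-only footprints; activations, caches, and framework overhead all eat into the 16GB in practice.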

The Driver

First, I installed the ROCm drivers as specified in the documentation for my distribution (Ubuntu 20.04 LTS HWE):

Drivers & Source Material:
Instinct™ MI25 Drivers & Support | AMD

#Get the Driver
wget https://repo.radeon.com/amdgpu-install/22.20/ubuntu/focal/amdgpu-install_22.20.50200-1_all.deb

#Install the driver
sudo apt-get install ./amdgpu-install_22.20.50200-1_all.deb

#Bootstrap the driver with the required profile
amdgpu-install --accept-eula --usecase=workstation -y --vulkan=pro --opencl=rocr,legacy

#Reboot, yes you MUST reboot for this to apply
sudo reboot now
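
After the reboot, it can be worth confirming that the kernel driver actually came up before moving on. Below is a minimal sketch, not an official AMD tool: it checks for the /dev/kfd and /dev/dri device nodes (the same ones handed to Docker later) and whether the current user belongs to the video and render groups. The function name driver_preflight is made up for this example, and the render group check is an assumption based on how Ubuntu 20.04 usually assigns the render nodes.

```python
import grp
import os


def driver_preflight(kfd="/dev/kfd", dri="/dev/dri", groups=("video", "render")):
    """Report whether the GPU device nodes exist and group membership is set."""
    report = {
        "kfd_present": os.path.exists(kfd),   # ROCm compute interface
        "dri_present": os.path.isdir(dri),    # directory holding card*/renderD* nodes
    }
    # Group IDs the current process belongs to.
    my_gids = set(os.getgroups()) | {os.getegid()}
    for name in groups:
        try:
            report[f"in_{name}_group"] = grp.getgrnam(name).gr_gid in my_gids
        except KeyError:
            # The group is not defined on this system at all.
            report[f"in_{name}_group"] = False
    return report


if __name__ == "__main__":
    for check, ok in driver_preflight().items():
        print(f"{check}: {'ok' if ok else 'MISSING'}")
```

If either device node is missing, the output of dmesg filtered for amdgpu is usually the next place to look. A missing group entry matters less when running as root.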


The Software Stack

Next, for my use case I installed Docker. The AMD ROCm team publishes a prebuilt Ubuntu-based Docker image with the ROCm user-space stack already set up, which was all I needed to run ROCm workloads.

Software & Source Material:
Enhance your ML research with AMD ROCm™ 5.1 and Py… – AMD Community


First, install Docker:

sudo apt install docker.io -y

Obtain a base Docker image with the correct user-space ROCm version installed from https://hub.docker.com/r/rocm/dev-ubuntu-20.04, or download a base OS Docker image and install ROCm following the installation directions. In this example, ROCm 5.1.1 is installed, as supported by the installation matrix on the pytorch.org website.

docker pull rocm/dev-ubuntu-20.04:5.1.1

Start the Docker container.

docker run -it --device=/dev/kfd --device=/dev/dri --group-add video rocm/dev-ubuntu-20.04:5.1.1
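
For anyone scripting this step, the flags in the command above can be spelled out one by one. The sketch below only rebuilds the same command as an argument list and prints it; the function name rocm_docker_cmd is hypothetical, and the per-flag comments reflect my understanding of what each one is for.

```python
import shlex

IMAGE = "rocm/dev-ubuntu-20.04:5.1.1"  # the tag pulled in the previous step


def rocm_docker_cmd(image=IMAGE):
    """Build the argument list for an interactive ROCm container."""
    return [
        "docker", "run",
        "-it",                    # interactive session with a terminal
        "--device=/dev/kfd",      # ROCm compute interface (kernel fusion driver)
        "--device=/dev/dri",      # GPU card and render nodes
        "--group-add", "video",   # supplementary group for device access
        image,
    ]


# Print the command instead of running it, so this is safe to try anywhere.
print(shlex.join(rocm_docker_cmd()))
```

The same list can be passed to subprocess.run to launch the container from a script.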

Inside the container, install the dependencies needed for the wheels, then install PyTorch.

apt update -y && apt install -y libjpeg-dev python3-dev nano git curl wget htop

pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/rocm5.2/



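Lastly, open a python3 shell and verify that PyTorch can see the GPU through ROCm. PyTorch's ROCm builds reuse the torch.cuda API, so the calls look the same as on an NVIDIA card.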

>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count()
1
>>> torch.cuda.current_device()
0
>>> torch.cuda.device(0)
<torch.cuda.device at 0x7efce0b03be0>
>>> torch.cuda.get_device_name(0)
'AMD MI25 Instinct'
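
The checks above confirm that the device is visible, but not that it can actually compute anything. A small smoke test along the following lines can close that gap. This is a sketch under assumptions: the function name rocm_smoke_test and the matrix size are made up for this example, and the tolerances are loose guesses for comparing FP32 results between GPU and CPU.

```python
def rocm_smoke_test(size=1024):
    """Return a small report on whether PyTorch can use the GPU via ROCm."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}

    report = {
        "torch_installed": True,
        # torch.version.hip is a version string on ROCm builds, None otherwise.
        "hip_version": getattr(torch.version, "hip", None),
        "gpu_available": torch.cuda.is_available(),
    }
    if report["gpu_available"]:
        a = torch.randn(size, size, device="cuda")
        b = torch.randn(size, size, device="cuda")
        c = a @ b
        torch.cuda.synchronize()  # wait for the GPU kernel to finish
        # Recompute the same product on the CPU and compare.
        expected = a.cpu() @ b.cpu()
        report["matmul_ok"] = bool(
            torch.allclose(c.cpu(), expected, rtol=1e-3, atol=1e-2)
        )
        report["device_name"] = torch.cuda.get_device_name(0)
    return report


if __name__ == "__main__":
    for key, value in rocm_smoke_test().items():
        print(f"{key}: {value}")
```

If hip_version comes back as None, the installed wheel is probably not a ROCm build, which is worth ruling out before blaming the driver.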

8 Comments

  1. Hello,
    Currently I have a Ryzen 5 2400G, a B450M Bazooka2 motherboard and 16GB of RAM. I would like to run vicuna/llama.cpp relatively smoothly. Would you recommend a card (Mi25, P40, K80…) to add to my current computer, or a second-hand configuration?

    1. Hi Fred!

      If you do not have a graphics card right now and need something to get started with, I’d first look at what kind of workload you’re going to be running.

      The Mi25, P40, and K80 are good choices but can be hard to cool / power depending on your system.

      I’d do some research on what you’re planning on running, as in workload type. This Reddit post had some great information:

      https://www.reddit.com/r/LocalLLaMA/wiki/models/

      Regardless, if you’re looking for a SIMPLE answer without a workload consideration or what OS you’re using, I’d recommend the P40. 🙂

  2. Thanks Tyler,
    I would like to do mostly inference and maybe a little bit of fine-tuning with 13b and 30b models.

    I bought a P40 and will soon receive it for testing.

    Thanks

  3. I heard AMD dropped support for the Mi25 in the latest ROCm. Are you able to confirm this is the case? I’m also wondering if “load_in_8bit=True” from Hugging Face Transformers would work on this card.

  4. Heya!

    I’ve been trying to use this GPU on Ubuntu 20.04 HWE.
    I ran into multiple errors before I managed to install the drivers and ROCm.
    But now I cannot load the drivers, as it always results in:

    [ 0.000000] Linux version 5.15.0-86-generic (buildd@lcy02-amd64-062) (gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #96~20.04.1-Ubuntu SMP Thu Sep 21 13:23:37 UTC 2023 (Ubuntu 5.15.0-86.96~20.04.1-generic 5.15.122)
    [ 5.399426] [drm] amdgpu kernel modesetting enabled.
    [ 5.399561] amdgpu: CRAT table not found
    [ 5.399565] amdgpu: Virtual CRAT table created for CPU
    [ 5.399589] amdgpu: Topology: Add CPU node
    [ 5.399715] amdgpu 0000:0c:00.0: enabling device (0000 -> 0003)
    [ 5.399838] amdgpu 0000:0c:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
    [ 5.399937] amdgpu 0000:0c:00.0: BAR 6: can't assign [??? 0x00000000 flags 0x20000000] (bogus alignment)
    [ 5.647827] amdgpu 0000:0c:00.0: amdgpu: Fetched VBIOS from ROM
    [ 5.647831] amdgpu: ATOM BIOS: 113-D0513500-N00
    [ 6.724253] amdgpu 0000:0c:00.0: amdgpu: MEM ECC is active.
    [ 6.724254] amdgpu 0000:0c:00.0: amdgpu: SRAM ECC is not presented.
    [ 6.724390] amdgpu 0000:0c:00.0: amdgpu: VRAM: 16368M 0x000000F400000000 - 0x000000F7FEFFFFFF (16368M used)
    [ 6.724392] amdgpu 0000:0c:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
    [ 6.724393] amdgpu 0000:0c:00.0: amdgpu: AGP: 267419648M 0x000000F800000000 - 0x0000FFFFFFFFFFFF
    [ 6.724431] [drm] amdgpu: 16368M of VRAM memory ready
    [ 6.724431] [drm] amdgpu: 11892M of GTT memory ready.
    [ 6.724824] amdgpu 0000:0c:00.0: amdgpu: PSP runtime database doesn't exist
    [ 6.726769] amdgpu: hwmgr_sw_init smu backed is vega10_smu
    [ 7.105297] [drm:psp_hw_start [amdgpu]] *ERROR* PSP load sos failed!
    [ 7.105482] [drm:psp_hw_init [amdgpu]] *ERROR* PSP firmware loading failed
    [ 7.105594] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* hw_init of IP block failed -22
    [ 7.105685] amdgpu 0000:0c:00.0: amdgpu: amdgpu_device_ip_init failed
    [ 7.105687] amdgpu 0000:0c:00.0: amdgpu: Fatal error during GPU init
    [ 7.105715] amdgpu 0000:0c:00.0: amdgpu: amdgpu: finishing device.
    [ 7.107532] amdgpu: probe of 0000:0c:00.0 failed with error -22
    [ 7.108004] amdgpu_uvd_suspend+0x202/0x2d0 [amdgpu]
    [ 7.108166] uvd_v7_0_sw_fini+0x26/0x120 [amdgpu]
    [ 7.108285] amdgpu_device_fini_sw+0x113/0x410 [amdgpu]
    [ 7.108379] amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
    [ 7.108528] amdgpu_init+0x7c/0x1000 [amdgpu]
    [ 7.108674] Modules linked in: hid_generic usbhid hid amdgpu(+) nouveau(+) iommu_v2 gpu_sched mxm_wmi i2c_algo_bit drm_ttm_helper ttm drm_kms_helper syscopyarea crct10dif_pclmul sysfillrect crc32_pclmul sysimgblt ghash_clmulni_intel fb_sys_fops cec aesni_intel rc_core crypto_simd r8169 cryptd i2c_i801 i2c_smbus drm ahci realtek xhci_pci libahci xhci_pci_renesas wmi video pinctrl_tigerlake

    Do you have any idea how to fix this?
    I have also tried recompiling the Ubuntu kernel with some changes, as described in this post:
    https://superuser.com/questions/1747738/amd-radeon-instinct-mi25-fails-to-initialize-drmamdgpu-device-fw-loading-amd

    but so far nothing…

    1. Hello!

      I ran this card in a Dell R720 and did not have any trouble with it. I ran BOTH PCIe power plugs from both risers to the single card, but an R730 might have a different power configuration. The fans did run faster, and the server wasn’t comfortable to work next to. The performance was fine and as expected.

      Hope this helps!

      -Tyler
