Tuesday, October 7, 2025

How to execute AMD System Level Test Tool: AGFHC Tool

 

AGFHC (AMD GPU Field Health Check) is AMD's system-level test tool.

 

System requirements:

System: MiTAC G8825Z5

OS: Ubuntu 22.04 Server

AMD AGFHC: agfhc v1.21.2

 

Command:

$ /opt/amd/agfhc/agfhc -r all_lvl5 -o /tmp

## Runs the Level 5 recipe (approximately 2 hours) and stores the log output files under /tmp.
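As a minimal sketch, the same command can be wrapped so that each run gets its own log directory. Only the -r and -o options come from the command above; AGFHC_BIN, run_recipe, and the directory naming are hypothetical names chosen for illustration, and the script assumes (without confirmation from the tool's documentation) that agfhc returns a non-zero exit status when a run does not pass.

```shell
#!/usr/bin/env bash
# Sketch: run one AGFHC recipe and keep its logs in a per-run directory.
# AGFHC_BIN and run_recipe are illustrative names, not part of the tool.

AGFHC_BIN="${AGFHC_BIN:-/opt/amd/agfhc/agfhc}"

run_recipe() {
    local recipe="$1"
    local log_root="${2:-/tmp}"
    local run_dir

    # One directory per run, e.g. /tmp/agfhc_all_lvl5_20251007_093000
    run_dir="${log_root}/agfhc_${recipe}_$(date +%Y%m%d_%H%M%S)"
    mkdir -p "$run_dir" || return 1

    # Assumption: a non-zero exit status means the run did not pass.
    if "$AGFHC_BIN" -r "$recipe" -o "$run_dir"; then
        echo "PASS ${recipe} logs=${run_dir}"
    else
        echo "FAIL ${recipe} logs=${run_dir}"
        return 1
    fi
}
```

With the function loaded, `run_recipe all_lvl5 /tmp` would start the Level 5 run and print where the logs were written.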

 

Comparison of Differences Between Check Levels

The check levels (Level1 through Level5) are ordered by total duration: approximately 5 minutes (Level1), 10 minutes (Level2), 30 minutes (Level3), 1 hour (Level4), and 2 hours (Level5). The differences lie primarily in the scope, diversity, and duration of the memory exercisers and stress tests.
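Because the levels differ mainly in how long they run, one way to choose between them is by the time available. The sketch below maps a time budget in minutes to the longest recipe that fits, using the approximate durations given for each level in this post. The function name pick_recipe is hypothetical, and actual run times may differ from these estimates.

```shell
# Sketch: pick the longest check level that fits a time budget (minutes).
# Thresholds are the approximate durations quoted for Level1..Level5.
pick_recipe() {
    local budget_min="$1"

    if   [ "$budget_min" -ge 120 ]; then echo all_lvl5
    elif [ "$budget_min" -ge 60 ];  then echo all_lvl4
    elif [ "$budget_min" -ge 30 ];  then echo all_lvl3
    elif [ "$budget_min" -ge 10 ];  then echo all_lvl2
    elif [ "$budget_min" -ge 5 ];   then echo all_lvl1
    else
        echo "no recipe fits in ${budget_min} minutes" >&2
        return 1
    fi
}
```

For example, `pick_recipe 45` selects all_lvl3, since Level4 needs about an hour.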

 

1. Foundational vs. Initial Expansion (Level1 vs. Level2)

• Level1 (all_lvl1): Focuses on fundamental connectivity and bandwidth tests, including PCIe link validation, bidirectional PCIe peak bandwidth, XGMI a2a bandwidth, HBM bandwidth, and the basic gfx_dgemm GEMM test.

• Level2 (all_lvl2): Introduces two key areas of testing:

    • Unidirectional PCIe: Adds pcie_unidi_peak validation.

    • Performance and Exercisers: Adds gfx_maxpower to ensure GPUs can hit full performance, and introduces the first HBM memory exercisers (hbm_ds and hbm_remix2), both lasting 2 minutes.

 

2. Intermediate Breadth (Level3)

• Level3 (all_lvl3): Significantly expands the test types, especially in floating-point operations and initial stress testing.

    • Memory Exercisers: Introduces the hbm exerciser for 5 minutes, and the mall exerciser.

    • GEMM Tests: The "More gemm tests" section starts here, adding Tensor Float (TF) tests (gfx_bf16tf, gfx_fp16tf, gfx_fp8tf), each running for 1 minute.

    • ACF Stress: Introduces additional ACF stress tests (rochpl, hbm_ds_ntd) and the athub test.

 

3. High Intensity and Endurance (Level4 vs. Level5)

• Level4 (all_lvl4): Expands the variety of HBM exercisers and adds Reduced Integrity (RI) GEMM tests. Memory exercisers (hbm_ds, hbm_remix2, hbm_s16_ds, hbm_s16, and mall) are all set to run for 4 minutes. It introduces the sprites stress test for 8 minutes.

• Level5 (all_lvl5): Focuses on maximizing duration and verifying thermal integrity.

    • Thermal Verification: Includes a "thermal recipe" to drive max power, consisting of 10 minutes of hbm_ds followed by 10 minutes of gfx_maxpower.

    • Memory Duration: The core memory exercisers (including hbm_ds, hbm_remix2, hbm_s16_ds, hbm_s16, and mall) are run for 5 minutes each.

    • Stress Duration: Stress tests are significantly extended: rochpl (15 minutes), hbm_ds_ntd (15 minutes), and sprites (20 minutes).
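A quick tally of the Level5 durations listed above shows how much of the roughly 2-hour run these named components account for. The variable names are illustrative; only the minute values come from the text, and the remaining time presumably goes to tests whose durations are not listed here.

```shell
# Sketch: add up the Level5 durations that are stated explicitly (minutes).
thermal=$((10 + 10))        # hbm_ds + gfx_maxpower thermal recipe
memory=$((5 * 5))           # hbm_ds, hbm_remix2, hbm_s16_ds, hbm_s16, mall
stress=$((15 + 15 + 20))    # rochpl, hbm_ds_ntd, sprites

total=$((thermal + memory + stress))
echo "Listed Level5 components: ${total} minutes"
```

The listed components sum to 95 minutes.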

 

4. Performance Test Suite

The all_perf script is defined as "A recipe to execute all performance based tests once". It includes every test designed to measure bandwidth or performance (all PCIe, XGMI, HBM BW, gfx_maxpower, and all eight variations of gfx GEMM tests). This script focuses purely on performance metrics and does not include any of the extended memory exercisers or stress components found in Level1-Level5.
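Since all_perf leaves out the extended exercisers and stress components, one possible use is as a gate before a longer level run. The sketch below assumes all_perf is selected with the same -r option shown earlier for all_lvl5, and that a non-zero exit status signals a failed run; neither is confirmed here. AGFHC_BIN and perf_then_level are hypothetical names.

```shell
# Sketch: run all_perf first, then a longer level only if it succeeds.
# Assumes "-r all_perf" works like "-r all_lvl5" in the command shown earlier.

AGFHC_BIN="${AGFHC_BIN:-/opt/amd/agfhc/agfhc}"

perf_then_level() {
    local level_recipe="$1"
    local out_root="${2:-/tmp}"

    # Separate output directories so the two runs do not share log files.
    mkdir -p "${out_root}/all_perf" "${out_root}/${level_recipe}" || return 1

    if ! "$AGFHC_BIN" -r all_perf -o "${out_root}/all_perf"; then
        echo "all_perf failed; skipping ${level_recipe}" >&2
        return 1
    fi

    "$AGFHC_BIN" -r "$level_recipe" -o "${out_root}/${level_recipe}"
}
```

For instance, `perf_then_level all_lvl5 /tmp` would only commit to the long run after the performance tests complete.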

 


 


| Test Name | Level1 (~5m) | Level2 (~10m) | Level3 (~30m) | Level4 (~1h) | Level5 (~2h) | Perf Test | Purpose/Description |
|---|---|---|---|---|---|---|---|
| pcie_link_status | V | V | V | V | V | V | A test that validates PCIe link speed and link width |
| pcie_bidi_peak | V | V | V | V | V | V | A test that validates PCIe bandwidth values in bidirectional peak mode |
| xgmi_a2a | V | V | V | V | V | V | A test that validates XGMI bandwidth values in a2a mode |
| hbm_bw | V | V | V | V | V | V | HBM BW test / HBM bandwidth / HBM memory BW tests |
| gfx_dgemm | V | N/A | V | V | V | V | Run the GEMM test / Part of More GEMM tests |
| pcie_unidi_peak | N/A | V | V | V | V | V | A test that validates PCIe bandwidth values in unidirectional peak mode |
| gfx_maxpower | N/A | V | V | V | V | V | To ensure GPUs can hit full perf / Make sure we can hit max power / Used for thermal |
| hbm_ds | N/A | V | N/A | V | V | N/A | Run the HBM memory exercisers. Used to drive max power for thermal integrity (10m duration in L5). |
| hbm_remix2 | N/A | V | N/A | V | V | N/A | Run the HBM memory exercisers |
| hbm | N/A | N/A | V | N/A | N/A | N/A | Run the HBM exerciser |
| mall | N/A | N/A | V | V | V | N/A | MALL exerciser |
| gfx_bf16tf | N/A | N/A | V | V | V | V | Part of More GEMM tests |
| gfx_fp16tf | N/A | N/A | V | V | V | V | Part of More GEMM tests |
| gfx_fp8tf | N/A | N/A | V | V | V | V | Part of More GEMM tests |
| athub | N/A | N/A | V | V | V | N/A | athub |
| rochpl | N/A | N/A | V | V | V | N/A | Additional ACF stress |
| hbm_ds_ntd | N/A | N/A | V | V | V | N/A | Additional ACF stress |
| hbm_s16_ds | N/A | N/A | N/A | V | V | N/A | Run the HBM memory exercisers |
| hbm_s16 | N/A | N/A | N/A | V | V | N/A | Run the HBM memory exercisers |
| gfx_sgemm | N/A | N/A | N/A | V | V | V | Part of More GEMM tests |
| gfx_bf16ri | N/A | N/A | N/A | V | V | V | Part of More GEMM tests |
| gfx_fp16ri | N/A | N/A | N/A | V | V | V | Part of More GEMM tests |
| gfx_fp8ri | N/A | N/A | N/A | V | V | V | Part of More GEMM tests |
| sprites | N/A | N/A | N/A | V | V | N/A | Stress test |
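The table can also be queried programmatically. The sketch below encodes it as a small matrix (1 for "V", 0 for "N/A") and lists the tests that a given recipe includes. The function names agfhc_matrix and tests_in_recipe are hypothetical, and the data simply mirrors the table in this post rather than anything read from the tool itself.

```shell
# Sketch: inclusion matrix copied from the table (1 = V, 0 = N/A).
# Columns: test lvl1 lvl2 lvl3 lvl4 lvl5 perf
agfhc_matrix() {
    cat <<'EOF'
pcie_link_status 1 1 1 1 1 1
pcie_bidi_peak 1 1 1 1 1 1
xgmi_a2a 1 1 1 1 1 1
hbm_bw 1 1 1 1 1 1
gfx_dgemm 1 0 1 1 1 1
pcie_unidi_peak 0 1 1 1 1 1
gfx_maxpower 0 1 1 1 1 1
hbm_ds 0 1 0 1 1 0
hbm_remix2 0 1 0 1 1 0
hbm 0 0 1 0 0 0
mall 0 0 1 1 1 0
gfx_bf16tf 0 0 1 1 1 1
gfx_fp16tf 0 0 1 1 1 1
gfx_fp8tf 0 0 1 1 1 1
athub 0 0 1 1 1 0
rochpl 0 0 1 1 1 0
hbm_ds_ntd 0 0 1 1 1 0
hbm_s16_ds 0 0 0 1 1 0
hbm_s16 0 0 0 1 1 0
gfx_sgemm 0 0 0 1 1 1
gfx_bf16ri 0 0 0 1 1 1
gfx_fp16ri 0 0 0 1 1 1
gfx_fp8ri 0 0 0 1 1 1
sprites 0 0 0 1 1 0
EOF
}

# Print the tests included in a recipe, one per line.
tests_in_recipe() {
    local col
    case "$1" in
        all_lvl1) col=2 ;;
        all_lvl2) col=3 ;;
        all_lvl3) col=4 ;;
        all_lvl4) col=5 ;;
        all_lvl5) col=6 ;;
        all_perf) col=7 ;;
        *) echo "unknown recipe: $1" >&2; return 1 ;;
    esac
    agfhc_matrix | awk -v c="$col" '$c == 1 { print $1 }'
}
```

For example, `tests_in_recipe all_lvl1` prints the five Level1 tests, and `tests_in_recipe all_perf | grep -c gfx_` counts the gfx tests in the performance suite.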


Example:

[Figure: Level5 test]

[Figure: Perf test]
