How to execute AMD System Level Test
Tool: AGFHC (AMD GPU Field Health Check)
System requirements:
Platform: MiTAC G8825Z5
OS: Ubuntu 22.04 Server
AMD AGFHC: agfhc v1.21.2
Command:
$ /opt/amd/agfhc/agfhc -r all_lvl5 -o /tmp
## Runs the Level5 recipe (about 2 hours) and stores the log output files in /tmp.
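For unattended runs, the same invocation can be wrapped in a small script. The following is a minimal Python sketch and not part of AGFHC: the binary path and the -r and -o options are taken from the command above, while the helper names build_agfhc_command and run_recipe are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import List

AGFHC_BIN = "/opt/amd/agfhc/agfhc"  # install path used in this post


def build_agfhc_command(recipe: str, output_dir: str) -> List[str]:
    """Return the argument list for one AGFHC recipe run (hypothetical helper)."""
    return [AGFHC_BIN, "-r", recipe, "-o", output_dir]


def run_recipe(recipe: str = "all_lvl5", output_dir: str = "/tmp") -> int:
    """Run the recipe and return the process exit code (hypothetical helper)."""
    # Make sure the log directory exists before the tool writes to it.
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    completed = subprocess.run(build_agfhc_command(recipe, output_dir))
    return completed.returncode


if __name__ == "__main__":
    # Only print the command here; call run_recipe() on a system with AGFHC installed.
    print(" ".join(build_agfhc_command("all_lvl5", "/tmp")))
```

Passing the arguments as a list avoids shell quoting issues if the output directory contains spaces.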
Comparison of Differences Between Check Levels
The check levels (Level1 through Level5) are titled to reflect their total duration, ranging from approximately 5 minutes (Level1) up to approximately 2 hours (Level5). The differences lie primarily in the scope, diversity, and duration of the memory exercisers and stress tests.
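Because the levels differ mainly in run time, one practical use of these figures is picking the longest recipe that fits a maintenance window. The sketch below assumes the approximate durations quoted in this post (5, 10, 30, 60 and 120 minutes); the function name longest_recipe_within is hypothetical and actual run times may differ.

```python
# Approximate durations in minutes, as quoted for each level in this post.
APPROX_MINUTES = {
    "all_lvl1": 5,
    "all_lvl2": 10,
    "all_lvl3": 30,
    "all_lvl4": 60,
    "all_lvl5": 120,
}


def longest_recipe_within(budget_minutes):
    """Return the longest recipe whose approximate duration fits the budget, or None."""
    fitting = [(minutes, name) for name, minutes in APPROX_MINUTES.items()
               if minutes <= budget_minutes]
    return max(fitting)[1] if fitting else None


if __name__ == "__main__":
    for budget in (3, 45, 120):
        print(budget, longest_recipe_within(budget))
```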
1. Foundational vs. Initial Expansion (Level1 vs. Level2)
• Level1 (all_lvl1): Focuses on fundamental connectivity and bandwidth tests, including PCIe link validation, bidirectional PCIe peak bandwidth, XGMI a2a bandwidth, HBM bandwidth, and the basic gfx_dgemm GEMM test.
• Level2 (all_lvl2): Introduces two key areas of testing:
| Area | Added in Level2 |
| --- | --- |
| Unidirectional PCIe | Adds pcie_unidi_peak validation. |
| Performance and Exercisers | Adds gfx_maxpower to ensure GPUs can hit full performance, and introduces the first HBM memory exercisers (hbm_ds and hbm_remix2), both lasting 2 minutes. |
2. Intermediate Breadth (Level3)
• Level3 (all_lvl3): Significant expansion of test types, especially in floating-point operations and initial stress testing.
| Area | Added in Level3 |
| --- | --- |
| Memory Exercisers | Introduces the hbm exerciser for 5 minutes, and the mall exerciser. |
| GEMM Tests | The "More gemm tests" section starts here, adding Tensor Float (TF) tests (gfx_bf16tf, gfx_fp16tf, gfx_fp8tf), each running for 1 minute. |
| ACF Stress | Introduces additional ACF stress tests (rochpl, hbm_ds_ntd) and the athub test. |
3. High Intensity and Endurance (Level4 vs. Level5)
• Level4 (all_lvl4): Expands the variety of HBM exercisers and adds Reduced Integrity (RI) GEMM tests. Memory exercisers (hbm_ds, hbm_remix2, hbm_s16_ds, hbm_s16, and mall) are all set to run for 4 minutes. It introduces the sprites stress test for 8 minutes.
• Level5 (all_lvl5): Focuses on maximizing duration and verifying thermal integrity.
| Area | Level5 behavior |
| --- | --- |
| Thermal Verification | Includes a "thermal recipe" to drive max power, consisting of 10 minutes of hbm_ds followed by 10 minutes of gfx_maxpower. |
| Memory Duration | The core memory exercisers (including hbm_ds, hbm_remix2, hbm_s16_ds, hbm_s16, and mall) are run for 5 minutes each. |
| Stress Duration | Stress tests are significantly extended: rochpl (15 minutes), hbm_ds_ntd (15 minutes), and sprites (20 minutes). |
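As a rough cross-check of the Level5 figures, the durations named in this section can be added up. This is a sketch under two assumptions: each named test runs once, and the 10-minute hbm_ds step of the thermal recipe is separate from the 5-minute hbm_ds exerciser run. The dictionary keys are illustrative labels, not AGFHC identifiers.

```python
# Durations in minutes, taken from the Level5 description in this post.
# Keys prefixed with "thermal/" mark the two steps of the thermal recipe.
LEVEL5_STATED_MINUTES = {
    "thermal/hbm_ds": 10,
    "thermal/gfx_maxpower": 10,
    "hbm_ds": 5,
    "hbm_remix2": 5,
    "hbm_s16_ds": 5,
    "hbm_s16": 5,
    "mall": 5,
    "rochpl": 15,
    "hbm_ds_ntd": 15,
    "sprites": 20,
}

APPROX_LEVEL5_TOTAL = 120  # "approximately 2 hours"


def stated_total(durations):
    """Sum the durations (in minutes) of the tests named in the text."""
    return sum(durations.values())


if __name__ == "__main__":
    total = stated_total(LEVEL5_STATED_MINUTES)
    print(f"stated: {total} min, remainder: {APPROX_LEVEL5_TOTAL - total} min")
```

Under these assumptions the named exercisers and stress tests account for about 95 of the roughly 120 minutes, which leaves the remainder for the bandwidth, GEMM, and other tests.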
4. Performance Test Suite
The all_perf recipe is defined as "A recipe to execute all performance based tests once". It includes every test designed to measure bandwidth or performance (all PCIe, XGMI, HBM BW, gfx_maxpower, and all eight variations of gfx GEMM tests). This recipe focuses purely on performance metrics and does not include any of the extended memory exercisers or stress components found in Level1-Level5.
| Test Name | Level1 (~5m) | Level2 (~10m) | Level3 (~30m) | Level4 (~1h) | Level5 (~2h) | Perf Test | Purpose/Description |
| --- | --- | --- | --- | --- | --- | --- | --- |
| pcie_link_status | V | V | V | V | V | V | A test that validates PCIe link speed and link width |
| pcie_bidi_peak | V | V | V | V | V | V | A test that validates PCIe bandwidth values in bidirectional peak mode |
| xgmi_a2a | V | V | V | V | V | V | A test that validates XGMI bandwidth values in a2a mode |
| hbm_bw | V | V | V | V | V | V | HBM memory bandwidth (BW) test |
| gfx_dgemm | V | N/A | V | V | V | V | Run the GEMM test / Part of More GEMM tests |
| pcie_unidi_peak | N/A | V | V | V | V | V | A test that validates PCIe bandwidth values in unidirectional peak mode |
| gfx_maxpower | N/A | V | V | V | V | V | To ensure GPUs can hit full perf / Make sure we can hit max power / Used for thermal |
| hbm_ds | N/A | V | N/A | V | V | N/A | Run the HBM memory exercisers. Used to drive max power for thermal integrity (10m duration in L5). |
| hbm_remix2 | N/A | V | N/A | V | V | N/A | Run the HBM memory exercisers |
| hbm | N/A | N/A | V | N/A | N/A | N/A | Run the HBM exerciser |
| mall | N/A | N/A | V | V | V | N/A | MALL exerciser |
| gfx_bf16tf | N/A | N/A | V | V | V | V | Part of More GEMM tests |
| gfx_fp16tf | N/A | N/A | V | V | V | V | Part of More GEMM tests |
| gfx_fp8tf | N/A | N/A | V | V | V | V | Part of More GEMM tests |
| athub | N/A | N/A | V | V | V | N/A | athub |
| rochpl | N/A | N/A | V | V | V | N/A | Additional ACF stress |
| hbm_ds_ntd | N/A | N/A | V | V | V | N/A | Additional ACF stress |
| hbm_s16_ds | N/A | N/A | N/A | V | V | N/A | Run the HBM memory exercisers |
| hbm_s16 | N/A | N/A | N/A | V | V | N/A | Run the HBM memory exercisers |
| gfx_sgemm | N/A | N/A | N/A | V | V | V | Part of More GEMM tests |
| gfx_bf16ri | N/A | N/A | N/A | V | V | V | Part of More GEMM tests |
| gfx_fp16ri | N/A | N/A | N/A | V | V | V | Part of More GEMM tests |
| gfx_fp8ri | N/A | N/A | N/A | V | V | V | Part of More GEMM tests |
| sprites | N/A | N/A | N/A | V | V | N/A | Stress test |
Example:
Level5 test
Perf test