How to run the MiTAC G8825Z5 burn-in test with the AMD AGFHC (AMD GPU Field Health Check) tool
System requirements:
MiTAC G8825Z5
OS: Ubuntu 22.04 Server
AMD AGFHC: agfhc v1.21.2
Command:
$ /opt/amd/agfhc/agfhc -r all_burnin_24h -o /tmp
## Runs the burn-in test for 24 hours and stores the log output files in the /tmp directory.
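A 24-hour run is long enough that it can help to wrap the command in a small script. The following is a minimal sketch and not part of AGFHC: the helper names `burnin_cmd` and `run_burnin` and the timestamped directory layout are made up for illustration, and only the `-r` and `-o` options from the command above are used. It also assumes that agfhc returns a non-zero exit status when a run fails, which should be confirmed against the tool's own documentation.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and directory layout are hypothetical.
AGFHC_BIN="${AGFHC_BIN:-/opt/amd/agfhc/agfhc}"   # path taken from the command above

# Print the agfhc command line for a given plan and output directory.
burnin_cmd() {
    local plan="$1" outdir="$2"
    printf '%s -r %s -o %s\n' "$AGFHC_BIN" "$plan" "$outdir"
}

# Run a plan, writing logs to a fresh timestamped directory under a base dir.
run_burnin() {
    local plan="$1" base="${2:-/tmp}"
    local outdir
    outdir="$base/${plan}_$(date +%Y%m%d_%H%M%S)"
    if [ ! -x "$AGFHC_BIN" ]; then
        echo "agfhc not found at $AGFHC_BIN" >&2
        return 127
    fi
    mkdir -p "$outdir" || return 1
    echo "Logs will be written to $outdir"
    # Assumption: a non-zero exit status means the run failed.
    "$AGFHC_BIN" -r "$plan" -o "$outdir"
}
```

For example, `run_burnin all_burnin_24h /tmp` would start the 24-hour plan. Running it inside `tmux` or `screen` keeps the test alive if the terminal session disconnects.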
The summaries and table below compare the parameter values (test item and duration/status) defined in the system burn-in configuration files for the approximately 4-hour, 12-hour, and 24-hour checks.
all_burnin_4h (4-Hour Check)
- Concentrated Thermal Stress: Dedicates a long, 60-minute run of gfx_maxpower to check for immediate heat issues.
- Basic Memory Stress: Includes initial runs of memory exercisers.
- Lacks Validation Checks: Omits the entire suite of performance validation checks (e.g., pcie_link_status, hbm_bw) found in the longer tests.
This is a foundational check designed to catch major, obvious problems that appear quickly, such as a faulty cooling system or a critical memory flaw.
all_burnin_12h (12-Hour Check)
- Increased Memory Duration: Significantly extends the time spent on memory exercisers like hbm_ds (from 45m to 155m).
- Adds Validation Checks: Introduces performance validation tests (e.g., pcie_link_status, hbm_bw) that run at the start and end of the entire sequence.
This level of testing goes further. It verifies that the system is not only stable during a long work session but also that its performance (such as PCIe speed) is just as good at the end as it was at the beginning.
all_burnin_24h (24-Hour Check)
- Massively Extended Durations: Drastically increases the runtime of key stress exercisers like gfx_dgemm (from 30m to 180m).
- Adds a Unique Stress Test: Includes a dedicated 60-minute xgmi_a2a run to specifically "create xgmi traffic," a stress test not present in the other plans.
- Increased Validation Frequency: Runs validation checks at the start, midpoint, and end of the test.
This is the most rigorous test. It is designed to find subtle issues that only appear after very long periods of continuous operation and to confirm that system performance does not degrade over that extended time.
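Because the shorter plans are meant to catch obvious faults sooner, one possible workflow is to run the three plans in increasing length and stop at the first failure. The sketch below is an assumption about how that could be scripted, not an AGFHC feature: `run_plans`, `PLANS`, and `DRY_RUN` are hypothetical names, the per-plan output subdirectories are an illustrative choice, and the script assumes a non-zero exit status signals a failed run. Note that running all three plans back to back takes roughly 40 hours.

```shell
#!/usr/bin/env bash
# Sketch only: names below are hypothetical, not part of AGFHC.
AGFHC_BIN="${AGFHC_BIN:-/opt/amd/agfhc/agfhc}"
PLANS="all_burnin_4h all_burnin_12h all_burnin_24h"

# Run each plan in order and stop as soon as one fails.
# With DRY_RUN=1 the commands are printed instead of executed.
run_plans() {
    local base="${1:-/tmp}" plan outdir
    for plan in $PLANS; do
        outdir="$base/$plan"
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "$AGFHC_BIN -r $plan -o $outdir"
            continue
        fi
        mkdir -p "$outdir" || return 1
        echo "Starting $plan (logs in $outdir)"
        # Assumption: a non-zero exit status means the plan failed.
        if ! "$AGFHC_BIN" -r "$plan" -o "$outdir"; then
            echo "$plan failed; skipping the longer plans" >&2
            return 1
        fi
    done
}
```

Calling `DRY_RUN=1 run_plans /tmp` prints the three commands without starting any test, which is a cheap way to check the paths before committing to a long run.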
Comparison Table of System Burn-in Test Parameters
Test Item | 4h Check (Total Duration/Status) | 12h Check (Total Duration/Status) | 24h Check (Total Duration/Status) | Purpose/Notes |
gfx_maxpower | 60m | 30m | 30m | Used for thermal stress. |
hbm_ds (Total Duration) | 45m | 155m (5m + 30m + 120m) | 275m (5m + 30m + 180m + 60m) | Runs OBLEX exerciser on GPUs. Includes 5m checks at the start to catch fast fails. |
hbm_remix2 (Total Duration) | 30m | 125m (5m + 120m) | 245m (5m + 180m + 60m) | Includes 5m checks at the start to catch fast fails. |
hbm (Total Duration) | N/A | 65m (5m + 60m) | 65m (5m + 60m) | Included in the combined 5-hour or 7-hour HBM runs. Includes 5m fast fail check. |
gfx_dgemm | N/A | 30m | 180m (60m + 120m) | Runs gfx stress. Additional time is added for the dgemm screen. |
sprites | 45m | 140m | 300m | Part of Additional ACF stress. |
rochpl | 45m | 120m | 200m | Part of Additional ACF stress. |
hbm_ds_ntd | 15m | 30m | 30m | Additional ACF stress. |
mall (MALL exerciser) | N/A | 10m | 20m | MALL exerciser. Duration is doubled in the 24h check. |
athub | N/A | 10m | 10m | Athub test. |
xgmi_a2a (Traffic) | N/A | N/A | 60m | Specific run to create xgmi traffic. Only present in the 24h check. |
pcie_link_status | N/A | Validation Check (Start/End) | Validation Check (Start/Mid/End) | Validates PCIe link speed and link width. |
hbm_bw | N/A | Validation Check (Start/End) | Validation Check (Start/Mid/End) | HBM memory BW tests. |
pcie_unidi_peak | N/A | Validation Check (Start/End) | Validation Check (Start/Mid/End) | Validates PCIe bandwidth in unidirectional peak mode. |
pcie_bidi_peak | N/A | Validation Check (Start/End) | Validation Check (Start/Mid/End) | Validates PCIe bandwidth in bidirectional peak mode. |
xgmi_a2a (BW Validation) | N/A | Validation Check (Start/End) | Validation Check (Start/Mid/End) | Validates XGMI bandwidth values in a2a mode. |
gfx_bf16tf | N/A | Validation Check (Start/End) | Validation Check (Start/Mid/End) | RVS bench based tests. |
gfx_fp16tf | N/A | Validation Check (Start/End) | Validation Check (Start/Mid/End) | RVS bench based tests. |
gfx_fp8tf | N/A | Validation Check (Start/End) | Validation Check (Start/Mid/End) | RVS bench based tests. |
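As a sanity check on the table, the stress durations listed for each plan can be added up and compared with the plan's nominal length. The sketch below copies the minute values from the table; the helper names `sum_minutes` and `fmt_hm` are made up for illustration. The validation checks have no duration listed, so they are not counted, which presumably accounts for the 12h and 24h totals landing slightly under the nominal length.

```shell
#!/usr/bin/env bash
# Durations in minutes, copied from the comparison table (N/A entries omitted).

# Add up any number of minute values.
sum_minutes() {
    local total=0 m
    for m in "$@"; do
        total=$((total + m))
    done
    echo "$total"
}

# Format a minute count as "XhYYm".
fmt_hm() {
    printf '%dh%02dm\n' $(( $1 / 60 )) $(( $1 % 60 ))
}

# Order: gfx_maxpower hbm_ds hbm_remix2 sprites rochpl hbm_ds_ntd
plan_4h=$(sum_minutes 60 45 30 45 45 15)
# Order: gfx_maxpower hbm_ds hbm_remix2 hbm gfx_dgemm sprites rochpl hbm_ds_ntd mall athub
plan_12h=$(sum_minutes 30 155 125 65 30 140 120 30 10 10)
# Same order as the 12h plan, followed by the 60m xgmi_a2a traffic run
plan_24h=$(sum_minutes 30 275 245 65 180 300 200 30 20 10 60)

echo "4h plan:  ${plan_4h}m ($(fmt_hm "$plan_4h"))"     # 240m (4h00m)
echo "12h plan: ${plan_12h}m ($(fmt_hm "$plan_12h"))"   # 715m (11h55m)
echo "24h plan: ${plan_24h}m ($(fmt_hm "$plan_24h"))"   # 1415m (23h35m)
```

The same helper confirms the per-row breakdowns, for example `sum_minutes 5 30 180 60` gives 275 for hbm_ds in the 24h check.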
Key Differences Between the Configurations:
1. Test Coverage: The 4h check focuses primarily on thermal stress (gfx_maxpower duration: 60m) and initial memory/ACF stress runs. It does not include the validation checks for PCIe or HBM bandwidth found in the longer tests.
2. Duration Scaling: Most of the long-running stress tests are significantly extended in the 24h check compared to the 12h check. For example, gfx_dgemm increases from 30m (in 12h) to a total of 180m (in 24h).
3. Validation Frequency: The 12h check executes validation tests (like pcie_link_status and hbm_bw) at the beginning and the end, whereas the 24h check includes these validation sequences at the beginning, the approximate midpoint, and the end.
4. Unique 24h Component: The 24h check specifically includes a 60m test for xgmi_a2a dedicated to creating XGMI traffic.
Note:
ACF Stress: Accelerated Compute Function Stress.
RVS: ROCm Validation Suite.