Tuesday, October 7, 2025

How to Run a Burn-in Test on the MiTAC G8825Z5 with the AMD AGFHC Tool

 

AGFHC is the AMD GPU Field Health Check tool.

 

System requirements:

MiTAC G8825Z5

OS: Ubuntu 22.04 Server

AMD AGFHC: agfhc v1.21.2

 

Command:

$ /opt/amd/agfhc/agfhc -r all_burnin_24h -o /tmp

## Runs the 24-hour burn-in test and stores the log output files in the /tmp directory.
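When several burn-in runs are made on the same machine, it may help to keep each run's logs in a separate directory instead of writing directly to /tmp. The following is a minimal sketch, assuming a bash shell. The function names and the directory naming scheme are hypothetical; only the agfhc path and the -r and -o options are taken from the command above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers. Only the agfhc path and the -r / -o options
# come from the command shown in this post.

AGFHC_BIN="/opt/amd/agfhc/agfhc"

# Print a per-run log directory path,
# e.g. /tmp/agfhc_all_burnin_24h_20251007-093000.
# Arguments: base directory, plan name, optional timestamp (defaults to now).
make_log_dir_name() {
    local base="$1" plan="$2" stamp="${3:-$(date +%Y%m%d-%H%M%S)}"
    printf '%s/agfhc_%s_%s\n' "${base%/}" "$plan" "$stamp"
}

# Print the command line that would be run, without executing it.
build_agfhc_cmd() {
    local plan="$1" outdir="$2"
    printf '%s -r %s -o %s\n' "$AGFHC_BIN" "$plan" "$outdir"
}

# Create the log directory, then start the selected burn-in plan.
# Defaults: all_burnin_24h, logs under /tmp.
run_burnin() {
    local plan="${1:-all_burnin_24h}" base="${2:-/tmp}" outdir
    outdir="$(make_log_dir_name "$base" "$plan")"
    mkdir -p "$outdir" || return 1
    "$AGFHC_BIN" -r "$plan" -o "$outdir"
}
```

For example, `run_burnin all_burnin_4h /tmp` would start the 4-hour plan with its logs in a timestamped directory under /tmp, while `build_agfhc_cmd all_burnin_4h /tmp` only prints the command so it can be reviewed first.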

 

The sections below, followed by a comparison table, summarize the test items and their durations or status as defined in the system burn-in configuration files for the approximately 4-hour, 12-hour, and 24-hour checks.

all_burnin_4h (4-Hour Check)

➢   Concentrated Thermal Stress: Dedicates a long, 60-minute run of gfx_maxpower to check for immediate heat issues.

➢   Basic Memory Stress: Includes initial runs of memory exercisers.

➢   Lacks Validation Checks: Omits the entire suite of performance validation checks (e.g., pcie_link_status, hbm_bw) found in longer tests.

This is a foundational check designed to catch major, obvious problems that appear quickly, such as a faulty cooling system or a critical memory flaw.

 

all_burnin_12h (12-Hour Check)

➢   Increased Memory Duration: Significantly extends the time spent on memory exercisers like hbm_ds (from 45m to 155m).

➢   Adds Validation Checks: Introduces performance validation tests (e.g., pcie_link_status, hbm_bw) that run at the start and end of the entire sequence.

This level of testing goes further. It verifies that the system is not only stable during a long work session but that its performance (like PCIe speed) is just as good at the end as it was at the beginning.

 

all_burnin_24h (24-Hour Check)

➢   Massively Extended Durations: Drastically increases the runtime of key stress exercisers like gfx_dgemm (from 30m to 180m).

➢   Adds a Unique Stress Test: Includes a dedicated 60-minute xgmi_a2a run to specifically "create xgmi traffic," a stress test not present in the other plans.

➢   Increased Validation Frequency: Runs validation checks at the start, midpoint, and end of the test.

This is the most rigorous test. It is designed to find subtle issues that only appear after very long periods of continuous operation and to confirm that system performance does not degrade over that extended time.

Comparison Table of System Burn-in Test Parameters

| Test Item | 4h Check (Total Duration/Status) | 12h Check (Total Duration/Status) | 24h Check (Total Duration/Status) | Purpose/Notes |
|---|---|---|---|---|
| gfx_maxpower | 60m | 30m | 30m | Used for thermal stress. |
| hbm_ds (Total Duration) | 45m | 155m (5m + 30m + 120m) | 275m (5m + 30m + 180m + 60m) | Runs OBLEX exerciser on GPUs. Includes 5m checks at the start to catch fast fails. |
| hbm_remix2 (Total Duration) | 30m | 125m (5m + 120m) | 245m (5m + 180m + 60m) | Includes 5m checks at the start to catch fast fails. |
| hbm (Total Duration) | N/A | 65m (5m + 60m) | 65m (5m + 60m) | Included in the combined 5-hour or 7-hour HBM runs. Includes 5m fast fail check. |
| gfx_dgemm | N/A | 30m | 180m (60m + 120m) | Runs gfx stress. Additional time is added for the dgemm screen. |
| sprites | 45m | 140m | 300m | Part of Additional ACF stress. |
| rochpl | 45m | 120m | 200m | Part of Additional ACF stress. |
| hbm_ds_ntd | 15m | 30m | 30m | Additional ACF stress. |
| mall (MALL exerciser) | N/A | 10m | 20m | MALL exerciser. Duration is doubled in the 24h check. |
| athub | N/A | 10m | 10m | Athub test. |
| xgmi_a2a (Traffic) | N/A | N/A | 60m | Specific run to create xgmi traffic. Only present in the 24h check. |
| pcie_link_status | N/A | Validation Check (Start/End) | Validation Check (Start/Mid/End) | Validates PCIe link speed and link width. |
| hbm_bw | N/A | Validation Check (Start/End) | Validation Check (Start/Mid/End) | HBM memory BW tests. |
| pcie_unidi_peak | N/A | Validation Check (Start/End) | Validation Check (Start/Mid/End) | Validates PCIe bandwidth in unidirectional peak mode. |
| pcie_bidi_peak | N/A | Validation Check (Start/End) | Validation Check (Start/Mid/End) | Validates PCIe bandwidth in bidirectional peak mode. |
| xgmi_a2a (BW Validation) | N/A | Validation Check (Start/End) | Validation Check (Start/Mid/End) | Validates XGMI bandwidth values in a2a mode. |
| gfx_bf16tf | N/A | Validation Check (Start/End) | Validation Check (Start/Mid/End) | RVS bench based tests. |
| gfx_fp16tf | N/A | Validation Check (Start/End) | Validation Check (Start/Mid/End) | RVS bench based tests. |
| gfx_fp8tf | N/A | Validation Check (Start/End) | Validation Check (Start/Mid/End) | RVS bench based tests. |
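The segment sums in the table can be checked with a little shell arithmetic. The sketch below uses two hypothetical helper functions and only the minute values listed in the table. The per-plan totals cover the timed stress items only, since the table gives no durations for the validation checks.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for checking the table arithmetic.

# Sum a list of minute values, e.g. the segments of one test item.
sum_minutes() {
    local total=0 m
    for m in "$@"; do
        total=$((total + m))
    done
    printf '%s\n' "$total"
}

# Format a number of minutes as hours and minutes, e.g. 275 -> 4h35m.
fmt_hm() {
    printf '%dh%02dm\n' $(( $1 / 60 )) $(( $1 % 60 ))
}

# Per-item segment sums from the table.
sum_minutes 5 30 120        # hbm_ds, 12h check     -> 155
sum_minutes 5 30 180 60     # hbm_ds, 24h check     -> 275
sum_minutes 5 120           # hbm_remix2, 12h check -> 125
sum_minutes 5 180 60        # hbm_remix2, 24h check -> 245
sum_minutes 60 120          # gfx_dgemm, 24h check  -> 180

# Timed stress items per plan (validation checks excluded).
fmt_hm "$(sum_minutes 60 45 30 45 45 15)"                          # 4h check  -> 4h00m
fmt_hm "$(sum_minutes 30 155 125 65 30 140 120 30 10 10)"          # 12h check -> 11h55m
fmt_hm "$(sum_minutes 30 275 245 65 180 300 200 30 20 10 60)"      # 24h check -> 23h35m
```

The timed items alone add up to 4h00m, 11h55m, and 23h35m, which is consistent with the plans being described as approximately 4-hour, 12-hour, and 24-hour checks.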

Key Differences Between the Three Checks:


1. Test Coverage: The 4h check focuses primarily on thermal (gfx_maxpower duration: 60m) and initial memory/ACF stress runs, and does not include the validation checks for PCIe or HBM bandwidth found in the longer tests.

2. Duration Scaling: Most timed stress tests run significantly longer in the 24h check than in the 12h check. For example, gfx_dgemm increases from 30m (in 12h) to a total of 180m (in 24h).

3. Validation Frequency: The 12h check executes validation tests (like pcie_link_status and hbm_bw) at the beginning and the end, whereas the 24h check runs these validation sequences at the beginning, the approximate midpoint, and the end.

4. Unique 24h Component: The 24h check specifically includes a 60m duration test for xgmi_a2a dedicated to creating XGMI traffic.

 

Note:

ACF Stress: Accelerated Compute Function Stress.

RVS: ROCm Validation Suite.
