
drm/xe: Introduce the dev_coredump infrastructure.

Rodrigo Vivi requested to merge rodrigovivi/drm-xe:devcoredump into drm-xe-next

The goal is to use the devcoredump infrastructure to report error states captured at crash time.

The error state will contain information useful for debugging GPU hangs, such as the INSTDONE registers and the buffers being executed at the time, as well as any other information that helps user space and allows the error to be replayed later.

The proposal here is to avoid an Xe-only error_state like the one in i915 and instead use the standard dev_coredump infrastructure to expose the error state.

In our case, the data is only useful if it is a snapshot taken at the moment the GPU crash happened, since we reset the GPU immediately afterwards and the registers might have changed by then. So the proposal here is to keep an internal snapshot that is printed out later.

Also, a subsequent GPU hang is usually just a consequence of the initial one, so we only save the 'first' hang. dev_coredump has a delayed work queue that removes the coredump and frees all the data within a few moments of the error. When that happens we also reset our capture state and allow further snapshots.
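The capture policy described above (snapshot at hang time, keep only the first hang, re-arm once the dump is freed) can be sketched as a small user-space model. This is only an illustration under stated assumptions, not the code in this merge request: `struct hang_snapshot`, `struct coredump_state`, `coredump_capture()`, `coredump_print()` and `coredump_free()` are hypothetical names, and the print and free helpers merely stand in for the read and free callbacks a driver would hand to dev_coredump.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical snapshot: only the hang time, as in this first version. */
struct hang_snapshot {
	time_t hang_time;
};

/* Hypothetical per-device capture state. */
struct coredump_state {
	bool captured;			/* true while a dump is pending */
	struct hang_snapshot snapshot;	/* data frozen at hang time */
};

/*
 * Called when a hang is detected, before the GPU reset.
 * Returns true if this call took the snapshot, false if an earlier
 * hang is still pending and this one was ignored.
 */
bool coredump_capture(struct coredump_state *cd, time_t now)
{
	assert(cd != NULL);

	if (cd->captured)
		return false;	/* likely a consequence of the first hang */

	cd->snapshot.hang_time = now;
	cd->captured = true;
	return true;
}

/*
 * Formats the stored snapshot, not live state, into buf.
 * Stands in for the read callback. Returns the snprintf() result,
 * or 0 if there is nothing to print.
 */
int coredump_print(const struct coredump_state *cd, char *buf, size_t len)
{
	assert(cd != NULL && buf != NULL);

	if (!cd->captured) {
		if (len > 0)
			buf[0] = '\0';
		return 0;
	}

	return snprintf(buf, len, "hang time: %lld\n",
			(long long)cd->snapshot.hang_time);
}

/*
 * Stands in for the free callback: drops the snapshot and re-arms
 * the capture so the next hang can be recorded.
 */
void coredump_free(struct coredump_state *cd)
{
	assert(cd != NULL);
	memset(cd, 0, sizeof(*cd));
}
```

In this model a second `coredump_capture()` call made before `coredump_free()` returns false and leaves the first snapshot untouched. A real driver would also need locking around the `captured` flag, which the sketch leaves out.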

Right now this infrastructure only prints out the time of the hang. More information will be migrated here in subsequent work. Also, in order to organize the dump better, the goal is to propose changes to dev_coredump itself to allow multiple files and different controls. But for now we start using it in Xe without any dependency on dev_coredump core changes.

Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
