1 job for multiple VM binds (!356) · Merge requests · drm / xe / xe kernel driver

Matthew Brost requested to merge mbrost/xe-kernel-driver-new-error-handling:drm-xe-next into drm-xe-next Oct 03, 2023

Looking for some quick feedback before working on this more, based on top of [1]. Last 20 patches are new. I suggest briefly looking at the patches but the end result is what is important.

Very high-level pseudocode for the new VM bind flow follows. The key point is that everything is now based on xe_vma_ops, which are created at the IOCTL level and passed down into the VM, PT, and MIGRATE layers to create 1 job (per tile) no matter how many VMA operations there are. If an error occurs at any time in the flow, everything is unwound all the way back to the IOCTL (unwind is WIP, but the code is structured to do this). Rebinds (from exec, the preempt rebind worker, and page faults) all use a dummy xe_vma_ops to hook into the VM bind code.

vm_bind_ioctl()
        for each VM IOCTL operation
                create and parse into VMA operations

        while drm exec loop
                for each VMA operation
                        lock and validate each VMA operation

                fence = xe_vma_ops_execute()

                install fence into VM dma-resv slots
                for each VMA operation
                        install fence into external BO slots
                signal out fences

        return
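
As a rough illustration of the IOCTL-level flow above, here is a minimal userspace sketch in C. It is not the xe code: every `sketch_*` name, the fixed-size array, and the `jobs_submitted` counter are hypothetical stand-ins, and the drm exec retry loop and fence installation are left out. It only shows the two properties the description stresses: any number of operations is collected into one container that is executed once, and a failure unwinds everything back to the caller.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SKETCH_MAX_OPS 8	/* hypothetical limit, for the sketch only */

enum sketch_op_type { SKETCH_OP_MAP, SKETCH_OP_UNMAP };

/* Hypothetical stand-in for one VMA operation. */
struct sketch_vma_op {
	enum sketch_op_type type;
	int locked;		/* set once the op is locked and validated */
};

/* Hypothetical stand-in for the xe_vma_ops container. */
struct sketch_vma_ops {
	struct sketch_vma_op ops[SKETCH_MAX_OPS];
	size_t count;
	int jobs_submitted;	/* how many jobs the whole IOCTL produced */
};

/* Append one operation to the container. */
static int sketch_ops_add(struct sketch_vma_ops *vops, enum sketch_op_type type)
{
	if (vops->count == SKETCH_MAX_OPS)
		return -ENOSPC;
	vops->ops[vops->count].type = type;
	vops->ops[vops->count].locked = 0;
	vops->count++;
	return 0;
}

/* Undo everything in reverse order of creation. */
static void sketch_ops_unwind(struct sketch_vma_ops *vops)
{
	while (vops->count > 0) {
		vops->count--;
		vops->ops[vops->count].locked = 0;
	}
}

/* One submission covers every operation in the container. */
static int sketch_ops_execute(struct sketch_vma_ops *vops)
{
	vops->jobs_submitted++;
	return 0;
}

int sketch_vm_bind(struct sketch_vma_ops *vops,
		   const enum sketch_op_type *types, size_t n)
{
	size_t i;
	int err;

	assert(vops && vops->count == 0);

	/* for each VM IOCTL operation: create and parse into VMA operations */
	for (i = 0; i < n; i++) {
		err = sketch_ops_add(vops, types[i]);
		if (err)
			goto unwind;
	}

	/* lock and validate each VMA operation (always succeeds here) */
	for (i = 0; i < vops->count; i++)
		vops->ops[i].locked = 1;

	/* a single execute call, regardless of the number of operations */
	err = sketch_ops_execute(vops);
	if (err)
		goto unwind;

	return 0;

unwind:
	sketch_ops_unwind(vops);
	return err;
}
```

Calling `sketch_vm_bind()` with three operations leaves `jobs_submitted` at 1, and passing more operations than the container can hold returns `-ENOSPC` with the container emptied again.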

xe_vma_ops_execute()
        for each tile
                setup PT arguments

        for each tile
                prepare PT operations

        for each tile
                fence = run PT operations in 1 job

        for each tile
                commit PT operations

        create composite fence

        return composite fence

error_cleanup:
        for each tile
                abort PT operations

        return err
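
The phased structure of xe_vma_ops_execute() can be sketched the same way. The C below is a userspace model under stated assumptions, not the driver implementation: the `sketch_*` names, the per-tile flags, and the `fail_prepare` / `fail_run` knobs are all hypothetical, and the composite fence is reduced to a plain return code. It also assumes an abort is always possible, whereas the merge request notes that proper error handling after the 'prepare PT operations' step is still to be written.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define SKETCH_MAX_TILES 2	/* hypothetical tile count for the sketch */

/* Hypothetical per-tile state; not the real xe structures. */
struct sketch_tile_ops {
	bool prepared;		/* PT operations staged, not yet permanent */
	bool committed;		/* PT operations made permanent */
	bool aborted;		/* staged PT operations were rolled back */
	int job_count;		/* jobs run on this tile */
	int fail_prepare;	/* knob: positive errno to fail prepare with */
	int fail_run;		/* knob: positive errno to fail the job with */
};

static int sketch_prepare(struct sketch_tile_ops *t)
{
	if (t->fail_prepare)
		return -t->fail_prepare;
	t->prepared = true;
	return 0;
}

static int sketch_run(struct sketch_tile_ops *t)
{
	if (t->fail_run)
		return -t->fail_run;
	t->job_count++;		/* all PT operations go out in 1 job */
	return 0;
}

static void sketch_commit(struct sketch_tile_ops *t)
{
	t->committed = true;
}

static void sketch_abort(struct sketch_tile_ops *t)
{
	/* Only tiles that actually staged something need rolling back. */
	if (t->prepared) {
		t->prepared = false;
		t->aborted = true;
	}
}

int sketch_vma_ops_execute(struct sketch_tile_ops *tiles, int num_tiles)
{
	int i, err;

	assert(num_tiles > 0 && num_tiles <= SKETCH_MAX_TILES);

	/* Phase 1: prepare PT operations on every tile. */
	for (i = 0; i < num_tiles; i++) {
		err = sketch_prepare(&tiles[i]);
		if (err)
			goto error_cleanup;
	}

	/* Phase 2: run the PT operations, one job per tile. */
	for (i = 0; i < num_tiles; i++) {
		err = sketch_run(&tiles[i]);
		if (err)
			goto error_cleanup;
	}

	/* Phase 3: commit only after every tile has succeeded. */
	for (i = 0; i < num_tiles; i++)
		sketch_commit(&tiles[i]);

	return 0;	/* the real flow returns a composite fence here */

error_cleanup:
	for (i = 0; i < num_tiles; i++)
		sketch_abort(&tiles[i]);

	return err;
}
```

Keeping each phase in its own loop over the tiles means nothing is committed until all tiles have prepared and run, so a failure on a later tile can still roll back the earlier ones.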

State of the code:

Equivalent functionality to what is in place (the VM is just killed on error after the 'prepare PT operations' step) but with 1 job per IOCTL
Code structured so proper error handling can be implemented

Next steps:

Implement proper error handling after the 'prepare PT operations' step
Rework trace points
Add prefetch suppression of unnecessary rebinds
Add CPU bind path in run_job()
Rebind worker gets queued more often than needed
Multi-GT needs some work (fence install, corking of jobs, etc.)

[1] https://patchwork.freedesktop.org/series/123729/

Edited Oct 03, 2023 by Matthew Brost
