@provos provos commented Nov 30, 2025

Adds MPS device support for both image and video predictors on Apple Silicon.

Changes:

  • Add get_default_device() utility that detects MPS availability
  • Fix device mismatches (coords cache, freqs_cis cache)
  • Add MPS workaround for complex tensor repeat() in RoPE
  • Make torch._assert_async conditional on CUDA
  • Fix MPS memory leak in video predictor via synchronization points
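A minimal sketch of the device detection described in the list above. The actual `get_default_device()` in this PR may differ: the priority order (CUDA, then MPS, then CPU), the string return type, and the `pick_device` helper are assumptions for illustration.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return the preferred device name given which backends are usable.

    Assumed priority: CUDA first, then MPS, then CPU.
    """
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def get_default_device() -> str:
    """Query PyTorch for available backends and pick a device name."""
    # Imported lazily so pick_device() stays usable without PyTorch installed.
    import torch

    # Older PyTorch builds may not ship the MPS backend module at all.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = mps_backend is not None and mps_backend.is_available()
    return pick_device(torch.cuda.is_available(), mps_ok)
```

Separating the decision (`pick_device`) from the hardware query keeps the fallback order testable on machines that have neither CUDA nor MPS.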

Performance of the Video predictor:

  • ~3x faster than CPU
  • Peak memory is ~38 GB, which stems from how MPS caches graphs. Before the synchronization points were added, running the video predictor consumed all available memory.
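The synchronization points mentioned above could look roughly like the sketch below. It is not the PR's code: `flush_mps`, `run_with_sync_points`, and the `every` parameter are hypothetical names, and where the real predictor places its flushes is not stated here. Only `torch.mps.synchronize()` and `torch.mps.empty_cache()` are existing PyTorch calls.

```python
def flush_mps() -> None:
    """Wait for pending MPS work, then release cached allocations."""
    # Imported lazily so the loop helper below can be exercised without PyTorch.
    import torch

    if torch.backends.mps.is_available():
        torch.mps.synchronize()
        torch.mps.empty_cache()


def run_with_sync_points(frames, step, flush=flush_mps, every=1):
    """Apply `step` to each frame, calling `flush` after every `every` frames.

    `step` and `flush` are plain callables (hypothetical interface).
    """
    results = []
    for idx, frame in enumerate(frames, start=1):
        results.append(step(frame))
        if idx % every == 0:
            flush()
    return results
```

Flushing after each frame trades some throughput for bounded memory; a larger `every` would flush less often.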

This has prebuilt wheels for Apple Silicon. Bump numpy from 1.26 to 1.26.4 to meet the dependency requirements of decord2.
Allows systems without CUDA to fall back to CPU.
The pin_memory() optimization is only available for CUDA backends.
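One way to guard the `pin_memory()` call is sketched below. The helper names `should_pin_memory` and `move_to_device` are hypothetical and not taken from this PR; the sketch only assumes the tensor exposes the usual `pin_memory()` and `to()` methods.

```python
def should_pin_memory(device_type: str) -> bool:
    """Pinned host memory only helps host-to-device copies on CUDA."""
    # Accept both "cuda" and indexed forms such as "cuda:0".
    return device_type.split(":")[0] == "cuda"


def move_to_device(tensor, device_type: str):
    """Move a tensor to the target device, pinning memory only for CUDA."""
    if should_pin_memory(device_type):
        # Pinned memory allows an asynchronous copy to the GPU.
        return tensor.pin_memory().to(device_type, non_blocking=True)
    # MPS and CPU: plain transfer, no pinning.
    return tensor.to(device_type)
```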
CUDA handles this internally, but we need to handle it directly for CPU.
…or cpu

Introduce workarounds for torch operations that are not available on MPS, such as repeat() on complex tensors.
Forcefully flush pending operations with synchronize and empty the cache.
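A possible shape for the complex `repeat()` workaround is sketched below. This is an assumption, not necessarily what the PR does: it reinterprets the complex tensor as a real tensor with a trailing dimension of size 2, repeats that, and converts back. `repeat_complex` is a hypothetical name.

```python
def repeat_complex(t, *sizes):
    """Repeat a complex tensor without calling repeat() on complex data."""
    # Imported lazily; the function only needs PyTorch when it is called.
    import torch

    # Shape (..., 2): real and imaginary parts as the last dimension.
    as_real = torch.view_as_real(t)
    # Repeat the leading dimensions as requested and leave the
    # (real, imag) pair untouched with a trailing factor of 1.
    repeated = as_real.repeat(*sizes, 1)
    # repeat() returns a contiguous tensor, so the last dimension has
    # stride 1, as view_as_complex requires.
    return torch.view_as_complex(repeated)
```

On backends where complex `repeat()` works, the result should match `t.repeat(*sizes)`, so the helper can be applied unconditionally or only when the tensor lives on MPS.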
@meta-cla meta-cla bot added the CLA Signed label Nov 30, 2025