@lv0325-dz

Previously, the code automatically tried to use cache files whenever it ran in DDP mode, that is, whenever is_ddp() returned True. is_ddp() detects DDP mode by the presence of the WORLD_SIZE environment variable, which torchrun sets. As a result, running under torchrun without existing cache files failed, because the program tried to load cache files that did not exist.

With this change, cache usage is determined solely by the use_cache configuration parameter, whether or not the code is running in DDP mode. This ensures that:

  1. When use_cache=False, cache files are not used, even in DDP mode
  2. Running with python or with torchrun gives consistent behavior

This prevents torchrun from failing when cache files are not present, making the cache behavior predictable and controlled by explicit configuration rather than implicit environment variables.
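
To make the difference concrete, here is a minimal sketch of the decision before and after this change. The function names `use_cache_before` and `use_cache_after` are hypothetical, and the `is_ddp` shown here is only an illustrative stand-in based on the description above (it checks for WORLD_SIZE); the actual code in the repository may be structured differently.

```python
import os


def is_ddp(environ=None):
    """Illustrative stand-in: treat the run as DDP if WORLD_SIZE is set.

    `environ` is a hypothetical parameter added so the sketch can be
    exercised without touching the real process environment.
    """
    env = os.environ if environ is None else environ
    return "WORLD_SIZE" in env


def use_cache_before(use_cache, environ=None):
    # Assumed old behavior: the cache is used if requested OR if running
    # under DDP, even when no cache files have been written yet.
    return use_cache or is_ddp(environ)


def use_cache_after(use_cache, environ=None):
    # Behavior proposed in this comment: only the explicit config decides.
    return use_cache


torchrun_env = {"WORLD_SIZE": "2"}  # what a torchrun launch would look like
print(use_cache_before(False, torchrun_env))
print(use_cache_after(False, torchrun_env))
```

Under the old rule, a torchrun launch with use_cache=False still selects the cache path; under the new rule it does not.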

@Alan-LanFeng
Collaborator

Hi, thanks for the PR.

I agree the code here needs improvement. However, using the cache is recommended in DDP mode: if use_cache is set to False under DDP, every process caches the data on its own, which may cause out-of-memory (OOM) errors.

@lv0325-dz changed the title from "Remove is_ddp() condition to prevent torchrun failures when cache is missing" to "Update cache loading logic to check file existence before use" on Aug 21, 2025
@lv0325-dz
Author

Indeed, simply removing the is_ddp() check was not appropriate. I have now changed the code to check whether the cache exists before using it, so that it never tries to load a non-existent cache.

Previously, the code would automatically attempt to use cache files based on the is_ddp() condition and use_cache configuration. This led to failures when running with torchrun without existing cache files, as the program would try to load non-existent cache files in DDP mode.

With this change, we modify the cache loading behavior to:
1. First check if cache files exist at the expected path
2. Only use cache files if they actually exist, regardless of DDP mode
3. Generate new cache files if they don't exist, even when use_cache=True
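
The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: `load_or_build_cache`, `cache_path`, and `build_fn` are hypothetical names, and pickle is used purely as a placeholder storage format; the repository's actual cache layout and loader are not shown in this conversation.

```python
import os
import pickle
import tempfile


def load_or_build_cache(cache_path, build_fn):
    """Load the cache if the file exists; otherwise build and save it.

    Hypothetical helper: `build_fn` stands in for whatever expensive
    preprocessing produces the cached data.
    """
    # Steps 1 and 2: only read the cache when the file is actually there.
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)

    # Step 3: no cache yet, so generate the data and write a new cache file.
    data = build_fn()
    os.makedirs(os.path.dirname(cache_path) or ".", exist_ok=True)
    with open(cache_path, "wb") as f:
        pickle.dump(data, f)
    return data


# Usage: the first call builds the cache, the second call reuses it.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cache", "data.pkl")
    first = load_or_build_cache(path, lambda: {"samples": [1, 2, 3]})
    second = load_or_build_cache(path, lambda: {"samples": []})
    print(first == second)
```

Note that this sketch does not coordinate between DDP processes; if several ranks find the cache missing at the same time, each would build it, so a real implementation would likely need some form of synchronization.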

This prevents torchrun from failing when cache files are not present, while still allowing efficient cache reuse when available. The behavior is now more predictable and robust, as it depends on actual file existence rather than implicit environment variables or assumptions about cache availability.