Conversation

@amirafzali (Member) commented:

Differential Revision: D87329523

The meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) on Nov 25, 2025.
The meta-codesync bot commented on Nov 25, 2025:

@amirafzali has exported this pull request. If you are a Meta employee, you can view the originating Diff in D87329523.

Inline comment on the diff hunk:

    try:
        from torchcomms._comms_ncclx import RdmaMemory, RdmaTransport, RdmaRemoteBuffer
@amirafzali (Member, Author) commented:


There was feedback about including this in a separate file; not sure I entirely grokked it.

A contributor replied:


could we please create a transport/torchcomms_transport_buffer.py class, which contains all torchcomms related buffer code?
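
For illustration, a minimal sketch of what such a transport/torchcomms_transport_buffer.py could look like; the TorchCommsTransportBuffer class and the TORCHCOMMS_AVAILABLE flag are placeholders, only the torchcomms._comms_ncclx import appears in this PR's diff:

    # Hedged sketch; the class name and availability flag are hypothetical, only the
    # torchcomms._comms_ncclx import comes from this PR's diff.
    try:
        from torchcomms._comms_ncclx import RdmaMemory, RdmaTransport, RdmaRemoteBuffer

        TORCHCOMMS_AVAILABLE = True
    except ImportError:
        # torchcomms may not be installed everywhere; callers can check the flag.
        TORCHCOMMS_AVAILABLE = False


    class TorchCommsTransportBuffer:
        """Collects all torchcomms-specific buffer/transport code in one module."""

        def __init__(self, device: int = 0) -> None:
            if not TORCHCOMMS_AVAILABLE:
                raise RuntimeError("torchcomms._comms_ncclx is not available")
            self.device = device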

@LucasLLC (Contributor) left a comment:


Not too shabby at all!!!! Tysm for all the changes. I have one or two more nits. Do you think we should take care of them before we ship?

Btw, my idea for the Pipe API is to make it extremely simple, and instead move all transport logic into the buffer. This way, transport buffers don't need to agree on a basic algorithm for moving data, just that the data will be moved. I can follow up after this with a refactor.
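
For illustration, a rough sketch of that buffer-centric shape; the Pipe methods and the transport buffer's send()/recv() are placeholder names, not the current torchstore API:

    # Hypothetical sketch of a minimal Pipe that delegates all data movement to the
    # transport buffer; send()/recv() are assumed names, not an existing API.
    class Pipe:
        """Agrees only that data will be moved; the transport buffer decides how."""

        def __init__(self, transport_buffer) -> None:
            self.transport_buffer = transport_buffer

        async def put(self, key, tensor) -> None:
            # RDMA vs. TCP vs. shared memory is entirely the buffer's concern.
            await self.transport_buffer.send(key, tensor)

        async def get(self, key):
            return await self.transport_buffer.recv(key)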

Inline comment on the diff hunk:

    async def handshake(
        self, peer_addr: bytes
    ) -> bytes:
        transport, addr = self.transport_context.get_rdma_transport_cache().put(peer_addr, device=0)
@LucasLLC (Contributor) commented:


This breaks encapsulation -- can we do something like:

    async def handshake(self, transport_buffer):
        await transport_buffer.recv_handshake()

The idea here is that the transport_buffer class contains its own handshake logic, so that the storage volume does not have to understand how the handshake is accomplished, and each transport method encapsulates all of its 'setup' logic.

Also, curious why device==0?

@amirafzali (Member, Author) replied:


Good call, that's a much better design.

We assume the memory store is always CPU memory. The RDMA transport still needs a GPU for NIC discovery, so I default to CUDA device 0. This is fine, right @LucasLLC?
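
For illustration, a hedged sketch of what the encapsulated handshake on the buffer side might look like under those assumptions; apart from get_rdma_transport_cache().put(), the names are hypothetical:

    # Illustrative only: combines the encapsulation suggestion with the CPU-memory /
    # device-0 assumption; everything except get_rdma_transport_cache().put() is a
    # hypothetical name.
    class TorchCommsTransportBuffer:
        def __init__(self, transport_context, device: int = 0) -> None:
            # MemoryStore data is assumed to live in CPU memory, but the RDMA
            # transport still needs a CUDA device for NIC discovery, hence device=0.
            self.transport_context = transport_context
            self.device = device

        async def recv_handshake(self, peer_addr: bytes) -> bytes:
            # All setup logic lives here, so the storage volume never needs to know
            # how the RDMA transport is established.
            transport, addr = self.transport_context.get_rdma_transport_cache().put(
                peer_addr, device=self.device
            )
            return addr

With something like this, the storage volume's handshake reduces to awaiting transport_buffer.recv_handshake(peer_addr) without knowing anything about RDMA.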

Commit summaries:

- Pull Request resolved: meta-pytorch#79 (Differential Revision: D87804517)
- Pull Request resolved: meta-pytorch#80 (Differential Revision: D87804516)
- Pull Request resolved: meta-pytorch#81 (Differential Revision: D87329523)