run unit tests with torchcomms RDMA #81
base: main
Conversation
@amirafzali has exported this pull request. If you are a Meta employee, you can view the originating Diff in D87329523.
torchstore/transport/buffers.py
Outdated
try:
    from torchcomms._comms_ncclx import RdmaMemory, RdmaTransport, RdmaRemoteBuffer
There was feedback about including this in a separate file; not sure I entirely grokked it.
Could we please create a transport/torchcomms_transport_buffer.py class, which contains all torchcomms-related buffer code?
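For illustration, such a module might look roughly like the sketch below. Only the guarded torchcomms import mirrors the diff above; the module layout, class name, and availability flag are assumptions, not the actual torchstore code.

```python
# transport/torchcomms_transport_buffer.py -- illustrative sketch only,
# not the actual torchstore implementation.
try:
    from torchcomms._comms_ncclx import RdmaMemory, RdmaRemoteBuffer, RdmaTransport

    TORCHCOMMS_RDMA_AVAILABLE = True
except ImportError:
    # Keep torchstore importable when torchcomms/ncclx is not installed.
    RdmaMemory = RdmaRemoteBuffer = RdmaTransport = None
    TORCHCOMMS_RDMA_AVAILABLE = False


class TorchCommsTransportBuffer:
    """Hypothetical home for all torchcomms-related buffer logic.

    The rest of torchstore would import this class instead of reaching
    into torchcomms._comms_ncclx directly.
    """

    def __init__(self) -> None:
        if not TORCHCOMMS_RDMA_AVAILABLE:
            raise RuntimeError(
                "torchcomms RDMA support is not available; this transport "
                "requires torchcomms built with ncclx"
            )
```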
LucasLLC left a comment:
Not too shabby at all!!!! Tysm for all the changes. I have one or two more nits. Do you think we should take care of them before we ship?
Btw my idea for Pipe API is to make it extremely simple, and instead move all transport logic into the buffer. This way, transport buffers don't need to agree on a basic algorithm for moving data, just agree that the data will be moved. I can follow up after this with a refactor.
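For illustration, that direction could look something like the sketch below. All of these names are hypothetical and do not come from the torchstore codebase; the point is only that the Pipe stays a thin dispatcher while each buffer owns its own movement logic.

```python
# Illustrative only: a minimal Pipe that delegates all transport logic
# to the buffer implementations.
from abc import ABC, abstractmethod


class TransportBuffer(ABC):
    """Each transport (RDMA, plain tensors, ...) implements its own data movement."""

    @abstractmethod
    async def send(self, destination: str) -> None: ...

    @abstractmethod
    async def recv(self, source: str) -> None: ...


class Pipe:
    """The Pipe does not prescribe how bytes move, only that they will be moved."""

    async def push(self, buffer: TransportBuffer, destination: str) -> None:
        await buffer.send(destination)

    async def pull(self, buffer: TransportBuffer, source: str) -> None:
        await buffer.recv(source)
```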
torchstore/storage_volume.py
Outdated
async def handshake(
    self, peer_addr: bytes
) -> bytes:
    transport, addr = self.transport_context.get_rdma_transport_cache().put(peer_addr, device=0)
This breaks encapsulation -- can we do something like:

async def handshake(self, transport_buffer):
    return await transport_buffer.recv_handshake()

The idea here is that the transport_buffer class contains its own handshake logic, so that the storage volume does not have to understand how the handshake is accomplished, and each transport method encapsulates all 'setup' logic.
Also, curious why device==0?
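For illustration, that shape might look roughly like the sketch below. Only the get_rdma_transport_cache().put(peer_addr, device=0) call comes from the diff above; the class names and the recv_handshake signature are assumptions.

```python
# Illustrative sketch only: the buffer owns its handshake, the volume delegates.
class RdmaTransportBuffer:
    def __init__(self, transport_context) -> None:
        self.transport_context = transport_context

    async def recv_handshake(self, peer_addr: bytes) -> bytes:
        # All RDMA-specific setup (transport cache, device selection, address
        # exchange) stays inside the buffer implementation.
        transport, addr = self.transport_context.get_rdma_transport_cache().put(
            peer_addr, device=0
        )
        return addr


class StorageVolume:
    async def handshake(self, transport_buffer, peer_addr: bytes) -> bytes:
        # The volume no longer needs to know how the handshake is accomplished.
        return await transport_buffer.recv_handshake(peer_addr)
```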
Good call, that's a much better design.

We assume the memory store is always CPU memory. RDMA transport still needs a GPU for NIC discovery, so I default to CUDA device 0. This is fine, right @LucasLLC?
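Purely as an illustration of keeping that choice visible, a named default could replace the hard-coded literal. The helper name and signature below are assumptions; only the cache's put(peer_addr, device=...) call comes from the diff above.

```python
# Illustrative only: surface the NIC-discovery device as an overridable
# default instead of a hard-coded 0. Store data itself stays in CPU memory.
DEFAULT_RDMA_DISCOVERY_DEVICE = 0


def rdma_handshake(transport_cache, peer_addr: bytes, device: int = DEFAULT_RDMA_DISCOVERY_DEVICE):
    # The GPU is only used so the RDMA transport can discover a NIC;
    # tensors in the memory store remain in CPU memory.
    transport, addr = transport_cache.put(peer_addr, device=device)
    return transport, addr
```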
Summary: Pull Request resolved: meta-pytorch#79 Differential Revision: D87804517
Summary: Pull Request resolved: meta-pytorch#80 Differential Revision: D87804516
Summary: Pull Request resolved: meta-pytorch#81 Differential Revision: D87329523
32f798a to fc63586
Differential Revision: D87329523