run unit tests with torchcomms RDMA #81
base: main
Conversation
@amirafzali has exported this pull request. If you are a Meta employee, you can view the originating Diff in D87329523.
torchstore/transport/buffers.py
Outdated
try:
    from torchcomms._comms_ncclx import RdmaMemory, RdmaTransport, RdmaRemoteBuffer
There was feedback about including this in a separate file; not sure I entirely grokked it.
Could we please create a transport/torchcomms_transport_buffer.py class, which contains all torchcomms-related buffer code?
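For illustration, such a module might look roughly like the sketch below. Only the guarded torchcomms import mirrors the diff above; the module layout, class name, and availability flag are assumptions, not the actual torchstore code.

```python
# transport/torchcomms_transport_buffer.py -- illustrative sketch only,
# not the actual torchstore implementation.
try:
    from torchcomms._comms_ncclx import RdmaMemory, RdmaRemoteBuffer, RdmaTransport

    TORCHCOMMS_RDMA_AVAILABLE = True
except ImportError:
    # Keep torchstore importable when torchcomms/ncclx is not installed.
    RdmaMemory = RdmaRemoteBuffer = RdmaTransport = None
    TORCHCOMMS_RDMA_AVAILABLE = False


class TorchCommsTransportBuffer:
    """Hypothetical home for all torchcomms-related buffer logic.

    The rest of torchstore would import this class instead of reaching
    into torchcomms._comms_ncclx directly.
    """

    def __init__(self) -> None:
        if not TORCHCOMMS_RDMA_AVAILABLE:
            raise RuntimeError(
                "torchcomms RDMA support is not available; this transport "
                "requires torchcomms built with ncclx"
            )
```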
LucasLLC left a comment:
Not too shabby at all!!!! Tysm for all the changes. I have one or two more nits. Do you think we should take care of them before we ship?
Btw my idea for Pipe API is to make it extremely simple, and instead move all transport logic into the buffer. This way, transport buffers don't need to agree on a basic algorithm for moving data, just agree that the data will be moved. I can follow up after this with a refactor.
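For illustration, that direction could look something like the sketch below. All of these names are hypothetical and do not come from the torchstore codebase; the point is only that the Pipe stays a thin dispatcher while each buffer owns its own movement logic.

```python
# Illustrative only: a minimal Pipe that delegates all transport logic
# to the buffer implementations.
from abc import ABC, abstractmethod


class TransportBuffer(ABC):
    """Each transport (RDMA, plain tensors, ...) implements its own data movement."""

    @abstractmethod
    async def send(self, destination: str) -> None: ...

    @abstractmethod
    async def recv(self, source: str) -> None: ...


class Pipe:
    """The Pipe does not prescribe how bytes move, only that they will be moved."""

    async def push(self, buffer: TransportBuffer, destination: str) -> None:
        await buffer.send(destination)

    async def pull(self, buffer: TransportBuffer, source: str) -> None:
        await buffer.recv(source)
```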
torchstore/storage_volume.py
Outdated
async def handshake(
    self, peer_addr: bytes
) -> bytes:
    transport, addr = self.transport_context.get_rdma_transport_cache().put(peer_addr, device=0)
This breaks encapsulation -- can we do something like:

async def handshake(self, transport_buffer):
    return await transport_buffer.recv_handshake()

The idea here is that the transport_buffer class contains its own handshake logic, so that the storage volume does not have to understand how the handshake is accomplished, and each transport method encapsulates all 'setup' logic.
Also, curious why device==0?
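For illustration, that shape might look roughly like the sketch below. Only the get_rdma_transport_cache().put(peer_addr, device=0) call comes from the diff above; the class names and the recv_handshake signature are assumptions.

```python
# Illustrative sketch only: the buffer owns its handshake, the volume delegates.
class RdmaTransportBuffer:
    def __init__(self, transport_context) -> None:
        self.transport_context = transport_context

    async def recv_handshake(self, peer_addr: bytes) -> bytes:
        # All RDMA-specific setup (transport cache, device selection, address
        # exchange) stays inside the buffer implementation.
        transport, addr = self.transport_context.get_rdma_transport_cache().put(
            peer_addr, device=0
        )
        return addr


class StorageVolume:
    async def handshake(self, transport_buffer, peer_addr: bytes) -> bytes:
        # The volume no longer needs to know how the handshake is accomplished.
        return await transport_buffer.recv_handshake(peer_addr)
```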
Good call, that's a much better design.

We assume the memory store is always CPU memory. RDMA transport still needs a GPU for NIC discovery, so I default to CUDA device 0. This is fine, right @LucasLLC?
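Purely as an illustration of keeping that choice visible, a named default could replace the hard-coded literal. The helper name and signature below are assumptions; only the cache's put(peer_addr, device=...) call comes from the diff above.

```python
# Illustrative only: surface the NIC-discovery device as an overridable
# default instead of a hard-coded 0. Store data itself stays in CPU memory.
DEFAULT_RDMA_DISCOVERY_DEVICE = 0


def rdma_handshake(transport_cache, peer_addr: bytes, device: int = DEFAULT_RDMA_DISCOVERY_DEVICE):
    # The GPU is only used so the RDMA transport can discover a NIC;
    # tensors in the memory store remain in CPU memory.
    transport, addr = transport_cache.put(peer_addr, device=device)
    return transport, addr
```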
Summary: Pull Request resolved: meta-pytorch#79 Differential Revision: D87804517
Summary: Pull Request resolved: meta-pytorch#80 Differential Revision: D87804516
Summary: Pull Request resolved: meta-pytorch#81 Differential Revision: D87329523
32f798a to fc63586
Differential Revision: D87329523