|
15 | 15 | "source": [ |
16 | 16 | "In this tutorial we'll go over the basics you need to know to start using Opacus in your distributed model training pipeline. As the state-of-the-art models and datasets get bigger, multi-GPU training became the norm and Opacus comes with seamless, out-of-the box support for Distributed Data Parallel (DDP).\n", |
17 | 17 | "\n", |
18 | | - "This tutorial requires basic knowledge of Opacus and DDP. If you're knew to either of these tools, we suggest to start with the following tutorials: [Building an Image Classifier with Differential Privacy](https://opacus.ai/tutorials/building_image_classifier) and [Getting Started with Distributed Data Parallel](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html)\n", |
| 18 | + "This tutorial requires basic knowledge of Opacus and DDP. If you're new to either of these tools, we suggest to start with the following tutorials: [Building an Image Classifier with Differential Privacy](https://opacus.ai/tutorials/building_image_classifier) and [Getting Started with Distributed Data Parallel](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html)\n", |
19 | 19 | "\n", |
20 | 20 | "In Chapter 1 we'll start with a mininmal working example to demonstrate what exactly do you need to do in order to make Opacus work in a distributed setting. This should be enough to get started for most common scenarios.\n", |
21 | 21 | "\n", |
|
35 | 35 | "id": "1089c8c1", |
36 | 36 | "metadata": {}, |
37 | 37 | "source": [ |
38 | | - "Before we begin, thre are few things we need to mention.\n", |
| 38 | + "Before we begin, there are a few things we need to mention.\n", |
39 | 39 | "\n", |
40 | 40 | "First, this tutorial is written to be executed on a single Linux machine with at least 2 GPUs. The general principles remain the same for Windows environment and/or multi-node training, but you'll need to slightly modify the DDP code to make it work.\n", |
41 | 41 | "\n", |
|
409 | 409 | "id": "e8703467", |
410 | 410 | "metadata": {}, |
411 | 411 | "source": [ |
412 | | - "And, finally, running then script. Notice, that we've initialized our `DataLoader` with `batch_size=200`, which is equivalent to 300 batches on the full dataset (60000 images). \n", |
| 412 | + "And, finally, running the script. Notice, that we've initialized our `DataLoader` with `batch_size=200`, which is equivalent to 300 batches on the full dataset (60000 images). \n", |
413 | 413 | "\n", |
414 | 414 | "After passing it to `make_private` on each worker we have a data loader with `batch_size=100` each, but each data loader still goes over 300 batches." |
415 | 415 | ] |
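To make the batch-size arithmetic concrete, here is a minimal sketch of that part of the worker function. It assumes the setup from the example above (process group already initialized with 2 workers, model already wrapped for distributed training) and uses hypothetical names (`model`, `optimizer`, `train_dataset`) plus placeholder privacy parameters; it is not the notebook's exact code.

```python
# Minimal sketch, not the full tutorial script. Assumes it runs inside each DDP worker
# (process group initialized, world_size=2, model already wrapped for distributed training).
# `model`, `optimizer`, `train_dataset` are hypothetical names for the objects built above.
from torch.utils.data import DataLoader
from opacus import PrivacyEngine

# Logical batch size of 200 over 60000 MNIST images -> 300 batches per epoch.
data_loader = DataLoader(train_dataset, batch_size=200)
print(len(data_loader))  # 300

privacy_engine = PrivacyEngine()
model, optimizer, data_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=data_loader,
    noise_multiplier=1.0,  # placeholder value
    max_grad_norm=1.0,     # placeholder value
)

# With 2 workers, each worker now draws Poisson-sampled batches of ~100 examples
# (the expected per-worker batch size), but still iterates over 300 batches per epoch.
print(len(data_loader))  # 300
```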
|
458 | 458 | "id": "c86099a0", |
459 | 459 | "metadata": {}, |
460 | 460 | "source": [ |
461 | | - "**Note**: The following two chapters disucss the advanced usage of Opacus and its implementation details. We strongly recommend to read the tutorial on [Advanced Features of Opacus](https://opacus.ai/tutorials/intro_to_advanced_features) before proceeding.\n", |
| 461 | + "**Note**: The following two chapters discuss the advanced usage of Opacus and its implementation details. We strongly recommend to read the tutorial on [Advanced Features of Opacus](https://opacus.ai/tutorials/intro_to_advanced_features) before proceeding.\n", |
462 | 462 | "\n", |
463 | 463 | "Now let's look inside `make_private` method and see what it does to enable DDP processing. And we'll start with the modifications made to the `DataLoader`.\n", |
464 | 464 | "\n", |
|
477 | 477 | "- Distributed, non-private\n", |
478 | 478 | "- Distributed, private (with Poisson sampling)\n", |
479 | 479 | "\n", |
480 | | - "All three are initialized so that the logical batch size is 64. Note, that `DPDataLoader` is initialized with a non" |
| 480 | + "All three are initialized so that the logical batch size is 64." |
481 | 481 | ] |
482 | 482 | }, |
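As a rough illustration of the comparison above, here is one way the three loaders might be constructed. This is a sketch under assumptions (an initialized process group with `world_size=2` and a hypothetical map-style `dataset`), not the notebook's exact cells.

```python
# Sketch only: assumes torch.distributed is initialized with world_size=2
# and `dataset` is a map-style dataset (hypothetical name).
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from opacus.data_loader import DPDataLoader

LOGICAL_BATCH_SIZE = 64

# 1. Non-distributed, non-private: a single loader over the whole dataset.
vanilla_loader = DataLoader(dataset, batch_size=LOGICAL_BATCH_SIZE)

# 2. Distributed, non-private: DistributedSampler splits the data between workers,
#    so each worker uses batch_size = 64 / world_size = 32 to keep the logical batch at 64.
distributed_loader = DataLoader(
    dataset,
    batch_size=LOGICAL_BATCH_SIZE // 2,
    sampler=DistributedSampler(dataset),
)

# 3. Distributed, private: Poisson sampling with an *expected* per-worker batch size
#    of 32, built from the non-distributed, non-private loader.
dp_loader = DPDataLoader.from_data_loader(vanilla_loader, distributed=True)
```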
483 | 483 | { |
|
585 | 585 | "source": [ |
586 | 586 | "Let's see what happens when we run it - and what exactly does `from_data_loader` factory did.\n", |
587 | 587 | "\n", |
588 | | - "Notice, that our private DataLoader was initialized with a non-distributed, non-private data loader. And all the basic parameters (per GPU batch size and number of examples per GPU) matches with distributed, non-private data loader." |
| 588 | + "Notice, that our private DataLoader was initialized with a non-distributed, non-private data loader. And all the basic parameters (per GPU batch size and number of examples per GPU) match with distributed, non-private data loader." |
589 | 589 | ] |
590 | 590 | }, |
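One quick way to see this for yourself, continuing the hypothetical `distributed_loader` and `dp_loader` from the sketch in the previous chapter, is to compare what each worker actually receives. Only standard `DataLoader` behaviour is used here; the batch unpacking assumes an image/label dataset such as MNIST.

```python
# Continues the hypothetical loaders from the sketch above; run on each worker
# after the process group is initialized.
import torch.distributed as dist

rank = dist.get_rank()

x, _ = next(iter(distributed_loader))
print(f"rank={rank} non-private: per-GPU batch={x.shape[0]}, batches per epoch={len(distributed_loader)}")

x, _ = next(iter(dp_loader))
# With Poisson sampling the realized batch size fluctuates around the expected value (32 here).
print(f"rank={rank} private: per-GPU batch~{x.shape[0]}, batches per epoch={len(dp_loader)}")
```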
591 | 591 | { |
|
613 | 613 | "id": "d938f572", |
614 | 614 | "metadata": {}, |
615 | 615 | "source": [ |
616 | | - "## Chapter 3: Syncronisation" |
| 616 | + "## Chapter 3: Synchronisation" |
617 | 617 | ] |
618 | 618 | }, |
619 | 619 | { |
620 | 620 | "cell_type": "markdown", |
621 | 621 | "id": "491a3ed8", |
622 | 622 | "metadata": {}, |
623 | 623 | "source": [ |
624 | | - "One significant difference between `DDP` and `DPDDP` is how it approaches syncronisation.\n", |
| 624 | + "One significant difference between `DDP` and `DPDDP` is how it approaches synchronisation.\n", |
625 | 625 | "\n", |
626 | | - "Normally with Distributed Data Parallel forward and backward passes are syncronisation points, and `DDP` wrapper ensures that the gradients are syncronised across workers at the end of the backward pass.\n", |
| 626 | + "Normally with Distributed Data Parallel forward and backward passes are synchronisation points, and `DDP` wrapper ensures that the gradients are synchronised across workers at the end of the backward pass.\n", |
627 | 627 | "\n", |
628 | | - "Opacus, however, need a later syncronisation point. Before we can use the gradients, we need to clip them add noise. This is done in the optimizer, which moves the syncronisation point from the backward pass to the optimization step.\n", |
| 628 | + "Opacus, however, need a later synchronisation point. Before we can use the gradients, we need to clip them add noise. This is done in the optimizer, which moves the synchronisation point from the backward pass to the optimization step.\n", |
629 | 629 | "Additionally, to simplify the calculations, we only add noise on worker with `rank=0`, and use the noise scale calibrated to the combined batch across all workers." |
630 | 630 | ] |
631 | 631 | }, |
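To make the difference tangible, here is a self-contained sketch of a single DP-DDP training step with the synchronisation points marked in comments. It builds the wrappers by hand rather than through `make_private` (which is what the tutorial does), so treat the toy model, the rendezvous settings, and the parameter values as illustrative assumptions rather than the canonical recipe.

```python
# Hedged sketch of one DP-DDP training step; assumes a single machine with 2 GPUs.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
from opacus import GradSampleModule
from opacus.distributed import DifferentiallyPrivateDistributedDataParallel as DPDDP
from opacus.optimizers import DistributedDPOptimizer


def worker(rank, world_size=2):
    # Hypothetical rendezvous settings; adjust to your environment.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    # DPDDP broadcasts the weights from rank 0 once, at construction time;
    # forward and backward passes do not communicate.
    model = GradSampleModule(DPDDP(nn.Linear(8, 1).cuda(rank)))
    optimizer = DistributedDPOptimizer(
        torch.optim.SGD(model.parameters(), lr=0.1),
        noise_multiplier=1.0,    # noise is generated on rank 0 only
        max_grad_norm=1.0,       # clipping happens locally on every worker
        expected_batch_size=16,  # per-worker expected batch size (placeholder)
    )

    torch.manual_seed(rank)  # make sure each worker sees different toy data
    x = torch.randn(16, 8).cuda(rank)
    y = torch.randn(16, 1).cuda(rank)

    loss = F.mse_loss(model(x), y)
    loss.backward()    # <- only local per-sample gradients; no cross-worker communication yet
    optimizer.step()   # <- clip locally, add noise on rank 0, all-reduce, then update weights
    optimizer.zero_grad()

    dist.destroy_process_group()


if __name__ == "__main__":
    torch.multiprocessing.spawn(worker, nprocs=2)
```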
|
698 | 698 | "source": [ |
699 | 699 | "Now we've initialized `DifferentiallyPrivateDistributedDataParallel` model and `DistributedDPOptimizer` let's see how they work together.\n", |
700 | 700 | "\n", |
701 | | - "`DifferentiallyPrivateDistributedDataParallel` is a no-op: we only perform model syncronisation on initialization and do nothing on forward and backward passes.\n", |
| 701 | + "`DifferentiallyPrivateDistributedDataParallel` is a no-op: we only perform model synchronisation on initialization and do nothing on forward and backward passes.\n", |
702 | 702 | "\n", |
703 | 703 | "`DistributedDPOptimizer`, on the other hand does all the heavy lifting:\n", |
704 | 704 | "- It does gradient clipping on each worker independently\n", |
|
804 | 804 | "id": "a9c95972", |
805 | 805 | "metadata": {}, |
806 | 806 | "source": [ |
807 | | - "When we run the code, notice that the gradients are not syncronised after `loss.backward()`, but only after `optimizer.step()`. For this example we've set privacy parameters to effectively disable noise and clipping, so the syncronised gradient is indeed the average between individual worker's gradients." |
| 807 | + "When we run the code, notice that the gradients are not synchronised after `loss.backward()`, but only after `optimizer.step()`. For this example we've set privacy parameters to effectively disable noise and clipping, so the synchronised gradient is indeed the average between individual worker's gradients." |
808 | 808 | ] |
809 | 809 | }, |
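If you want to verify this behaviour yourself, a small all-gather check makes it visible. The helper below is purely illustrative (not part of Opacus) and reuses the `model`, `optimizer`, `x`, `y` placeholders from the Chapter 3 sketch; run it inside that worker function, before the process group is destroyed.

```python
# Illustrative check, run on every worker inside the Chapter 3 sketch's worker function.
import torch
import torch.distributed as dist
import torch.nn.functional as F


def grads_match_across_workers(param) -> bool:
    g = param.grad.detach().clone()
    gathered = [torch.zeros_like(g) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, g)
    return all(torch.allclose(gathered[0], other) for other in gathered)


param = next(p for p in model.parameters() if p.requires_grad)

loss = F.mse_loss(model(x), y)
loss.backward()
print(grads_match_across_workers(param))  # expected False: each worker still holds its local gradient

optimizer.step()
print(grads_match_across_workers(param))  # expected True: gradients were reduced during the step
```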
810 | 810 | { |
|