fixed typos in parallelisation section
JamesFergusson committed Nov 16, 2018
1 parent cfc3165 commit 97a30e9
Showing 1 changed file with 31 additions and 47 deletions.
78 changes: 31 additions & 47 deletions 8_Parallelisation.ipynb
@@ -6,17 +6,17 @@
"source": [
"# Parallelisation\n",
"\n",
"Sometimes projects you work on will be too large for even a highly optimised code to run on your desktop/laptop. In this case the only way to get performance benefits is to write the code to take advantage of multiple CPUs. To do this can be tricky and is something you should consider in the prototyping stage as often serial code can prove impossible to paralleise efficiently without wholesale changes.\n",
"Sometimes projects you work on will be too large or slow for even a highly optimised code to run on your desktop/laptop. In this case the only way to get performance benefits is to write the code to take advantage of multiple CPUs. To do this can be tricky and is something you should consider in the prototyping stage as often serial code can prove impossible to paralleise efficiently without wholesale changes.\n",
"\n",
"Parallelisation is becoming more important as in since 2005 serial performance of CPU's has not improved but the parallel performance has increased by 32x per core. Also machines are getting more CPU's as standard.\n",
"Parallelisation is becoming more important as since early 2000 serial performance of CPU's has not improved but the parallel performance has increased by 32x per core. This is the only way we are keeeping up with Moores law and curreltly cutting edge CPUs are being built with 70+ cores which will filter down to normal computing soon so parallelisation is becomming more and more important.\n",
"\n",
"![](Plots/CPUClock.png)\n",
"\n",
"The parallelisation model you settle on for your code should also take into account of the likely computer architecture you want to run on and you will need to consider tradeoffs between memory constrants and communication constraints. In this section we will have a look at some of the issues we face when we try to think parallel\n",
"The parallelisation model you settle on for your code should also take into account of the likely computer architecture you want to run on and you will need to consider tradeoffs between memory constrants and communication constraints. In this section we will have a look at some of the issues we face when we try to think parallel.\n",
"\n",
"## Parallelisation levels\n",
"\n",
"There are several levels in which you can paralleise your code\n",
"There are multiple levels in which you can paralleise your code\n",
"\n",
"0. Multiple serial jobs. The first option for any code as it has perfect scaling and almost no overhead\n",
"1. CPU level parallelisim like SIMD (single instruction multiple data) or Vectorisation. Works on single CPU, mostly done by compilers\n",
@@ -29,7 +29,9 @@
"\n",
"The 2nd, thread based parallelism, can be done in python with the `threading` or `multiprocessing` packages. This creates multiple threads which share the same data and can all operate on it. Unfortunatly this is usually pointless due to the Global Interpreter Lock (GIL). The GIL blocks more than one thread from accessing the interpreter at a time so only one thread can do anything at one time. As a result threaded code in python seldom runs any faster (there are some exceptions). This is a general problem with python in that as it is designed to be interactive it is inherently opposed to parallelism. Numpy does (sometimes) release the GIL so you can get a benefit with multi-threads, otherwise see https://scipy-cookbook.readthedocs.io/items/ParallelProgramming.html. We will revisit this when we come to Cython which supports `OpenMP` which is the industry standard for thread parallelisim and here we can do it much more easily. You can however utilise packages which support threading as they are compiled in C so avoid the GIL.\n",
"\n",
"The third, process based parallelisim, can be done in python as it creates multiple interpreters so the GIL is not a problem. In this case we make multiple seperate instances of your code which run independently and can communicate with each other. This is the case we will look at here with the package `mpi4py` which wraps the industry standard Message Passing Interface (MPI) used in high performance computing."
"The third, process based parallelisim, can be done in python as it creates multiple interpreters so the GIL is not a problem. In this case we make multiple seperate instances of your code which run independently and can communicate with each other. This is the case we will look at here with the package `mpi4py` which wraps the industry standard Message Passing Interface (MPI) used in high performance computing.\n",
"\n",
"MPI is arguably the hardest so if you can master it then the rest should be easy(-er) to pick up."
]
},
{
@@ -55,14 +57,6 @@
"\n",
"The main point is that it would be ridiculoues to keep the paint and to try to pass the walls instead. Similarly the goal of parallelisation is to complete the process with the smallest amount of communication possible. This is because communication is usually expensive, somewhere between reading from memory and reading from disk (the one it's closer to is very system dependent). As such 'distributed' usually applies to the largest data objects.\n",
"\n",
"Simple models of using MPI in code would look like:\n",
"\n",
"1. Take a list of tasks then break it into sections which are distributed to each rank (process). Each process completes its tasks independently then sends the results back to rank 0 which collates the answers.\n",
"\n",
"2. Evolve a large simulation which is distributed across multiple ranks. Before each time step each rank sends the state of it's cells to its neighbours. Now the time step can be carried out on each section and the new state can be communicated and we keep repeating this sequence.\n",
"\n",
"3. Rank 0 has a long list of seperate tasks that need completion. It distributres an initial task each of the other ranks. Then when they return the result to rank 0 it passes them the next task in the list. This is a master-slave system which is useful for large numbers of ranks as no processes wait for messages. It's unsuitable for small numbers as obviously rank 0 does nothing other than coordinate so is wasted.\n",
"\n",
"### Basic commands\n",
"\n",
"Let's try some simple examples to see what MPI commands look like:\n",
@@ -120,20 +114,9 @@
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([8, 6, 7, 3])"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data_s = np.random.randint(1,9,(4))\n",
"data_r = np.zeros((4), dtype='int')\n",
@@ -264,7 +247,7 @@
"source": [
"We also have `Allreduce` which is a `Reduce` combined with a `Bcast`. One other usefull command is `Barrier` which has no arguments and simply holds all processes until all reach it so forces syncronisation. Don't be tempted to use this unless you have to as it just slows down your code.\n",
"\n",
"All the above commands are blocking commands in that the process doesn't move on until they are complete. This means that we always have to wait until comunications have completed before we can continue with our code. Communication is usually slow so these are also non-blocking versions of `send` and `recv` we can use. This allows us to send information then continue calculation while we wait for it to arrive. The non-blocking versions are called `Isend` and `Irecv` and work like this:\n",
"All the above commands are blocking commands in that the process doesn't move on until they are complete (in theory). This means that we always have to wait until comunications have completed before we can continue with our code. Communication is usually slow so there are also non-blocking versions of `send` and `recv` we can use. This allows us to send information then continue calculation while we wait for it to arrive. The non-blocking versions are called `Isend` and `Irecv` and work like this:\n",
"\n",
"**Example 7:**\n",
"Use non-blocking commands so we can do stuff while we wait"
@@ -302,9 +285,9 @@
"source": [
"### Deadlocks\n",
"\n",
"One key thing to watch our for when using MPI is deadlocks. This is when you have one rank stuck waiting for a message that another rank can't send. This usually happens when you get you send\\recv out of order which is easier to do that you think as you never know the order the ranks will meet the code. It is possible you already met this as some of the examples are not safe. The first and second example can both deadlock if they happen to be executed at almost exactly the same time. \n",
"One key thing to watch our for when using MPI is deadlocks. This is when you have one rank stuck waiting for a message that another rank can't send. This usually happens when you get you send\\recv out of order which is easier to do that you think as you never know the order the ranks will meet the code. It is possible you already met this as some of the examples are not safe. The first and second example can both deadlock if they happen to have large data that they can't buffer. \n",
"\n",
"What happens if all ranks enter the `send` at the same time? As the sends are blocking sends they won't complete until the data is recieved by `recv`. But as all ranks are held in the `send` command, none make it to the `recv` and the code stalls forever. Here this is unlikely as the message is short and the communication is fast and all we need is for the first rank to exit `send` before the last starts to avoid a deadlock. This is the danger with parallel programming. Small tests like this can be fine but if I then ported the code to a large machine with slow communication and I was sending large amounts of data this may happen most of the time. This is why we have `sendrecv` which both sends and waits for data so can't deadlock. This solution forces the code to syncronise which can slow the code down. The alternative is to use non-blocking communication which is more efficent but you have to check the messages complete before you try to use the data sent. Collectives can also block is not all ranks can get to them. For example this will stall forever:"
"What happens if all ranks enter the `send` at the same time? As the sends are blocking sends they won't complete until the data is recieved by `recv`. But as all ranks are held in the `send` command, none make it to the `recv` and the code stalls forever. Here this is unlikely as the message short so is sent to buffer alowing the send to more onto the recieve. This is the danger with parallel programming. Small tests like this can be fine but if I then ported the code to a large machine with slow communication and I was sending large amounts of data this may happen most of the time. Also every time I ran a small test to see what was going on it would work fine. This is why we have `sendrecv` which both sends and waits for data so can't deadlock. This solution forces the code to syncronise which can slow the code down. The alternative is to use non-blocking communication which is more efficent but you have to check the messages complete before you try to use the data sent. Collectives can also block if not all ranks can get to them. For example this code which to new people sort of looks sensible will stall forever:"
]
},
{
@@ -328,25 +311,25 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The best idea it to try to use collectives wherever possible then non-blocking then blocking\n",
"The best idea it to try to use collectives wherever possible, then non-blocking, then blocking.\n",
"\n",
"You should note that collectives can be either blocking or non-blocking depending on the implementation so don't rely on them to sync your ranks. Also make sure all ranks see the same collectives in the same order as this may deadlock the code too (basicly don't use them in if statements like above)\n",
"\n",
"## Parallel patterns\n",
"\n",
"There are lots of different approaches to parallelising code. The best one for your code is quite problem specific. Here we will look as some basic \"patterns\" used to parallelise code. First there are a few things to consider.\n",
"There are lots of different approaches to parallelising code. The best one for your code is quite problem specific. Here we will look as some basic \"patterns\" that are commonly used. First there are a few things to consider.\n",
"\n",
"- How much of my code can be parallelised? If 30% of your code is serial then the maximum speedup you can achieve is $1/0.3 = 3.3$x. This gives you a benchmark to work against and whether the effort will be worth it.\n",
"\n",
"- How will my code scale? If you plan to use you code for large problems but only use a small test data set you need to check how you approach will work on the full solution. Suppose you have two approaches the first is faster on the small problem but scales as $n^2$ (eg gaussian smoothing in 2D) the second is slower but scales as $n\\ln(n)$ (eg gaussian smoothing via fft then weighting modes then ifft back). Here the second would be best for the large problem\n",
"- How will my code scale? If you plan to use you code for large problems but only use a small test data set you need to check how you approach will work on the full solution. Suppose you have two approaches the first is faster on the small problem but scales as $O(N^2)$ (eg gaussian smoothing in 2D) the second is slower but scales as $O(N\\ln(N))$ (eg gaussian smoothing via fft then weighting modes then ifft back). Here the first would be faster from small problems but the second would be best for the large problem.\n",
"\n",
"- Load balance. How will you ensure that each CPU has the same amount of work to do?\n",
"- Load balance. How will you ensure that each CPU has the roughly same amount of work to do?\n",
"\n",
"Parallelisation also has costs\n",
"\n",
"- Efficency (Sometimes we will chose algorithms that paralleise well but are slower for serial)\n",
"- Simplicity (Code get harder to read and maintain)\n",
"- Portability (parallel code is often writn with a particulare system in mind so it can be hard to migrate to new machines)\n",
"- Efficency (Sometimes we will chose algorithms that paralleise well but are slower for serial also communication has significant costs)\n",
"- Simplicity (Code gets harder to read and maintain as we will see)\n",
"- Portability (parallel code is often written with a particular system in mind so it can be hard to migrate to new machines)\n",
"\n",
"\n",
"With this lets look at some simple paralleisation patterns (in order of complexity)\n",
@@ -423,19 +406,11 @@
},
{
"cell_type": "code",
"execution_count": 28,
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"495\n"
]
}
],
"outputs": [],
"source": [
"imax = 10\n",
"x=0\n",
@@ -805,6 +780,15 @@
" print(listall)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Pipelines\n",
"\n",
"The final pattern we will meet is pipelines here the data flows through the ranks with different operations being applied at each stage. This can work well for heavy compute algoritms or ones that use specilist chips like GPU's. You may want to input a stream of vectors to be multiplied by several matricies with different operations inbetween so rather than moving the matricies you pass the vectors along a production line where the matricies remain fixed. This is just like a production line in a car plant for example. Here the communication is just simple non-blocking send and recv's between the ranks. I currently don't have a good data analysis example but I will try to find one and add it later."
]
},
{
"cell_type": "markdown",
"metadata": {},
