Parallel-Matrix-Multiplication-FOX-Algorithm

☕Implementation of Parallel Matrix Multiplication Using the FOX Algorithm on Peking University's High-Performance Computing System

Yes We Code

Contents

  1. Reference Documents
    • Thomas Anastasio, Example of Matrix Multiplication by Fox Method
    • Jaeyoung Choi, A New Parallel Matrix Multiplication Algorithm on Distributed-Memory Concurrent Computers
    • Ned Nedialkov, Communicators and Topologies: Matrix Multiplication Example
  2. Source Codes
  3. Code Tests
    • Dell XPS8900
      • Code Test on Dell XPS8900 Workstation (Intel® Core™ i7-6700K Processor)
      • Analyzing MPI Performance Using Intel Trace Analyzer
    • PKU-HPC
      • Lenovo X8800 Supercomputer Platform
      • Code Performance Tests on X8800 Supercomputer Platform's CPU Node (Intel® Xeon® Processor E5-2697A v4)
      • Code Performance Tests on X8800 Supercomputer Platform's MIC Node (Intel® Xeon Phi™ Processor 7250)
    • Code Tests' Contents
  4. Reports
    • 1801111621_洪瑶_并行程序报告.pdf
    • 并行程序报告.docx
    • 洪瑶_1801111621并行程序设计报告.pptx
    • Parallel FOX Algorithm Project Report.pptx (will be added in the future)
    • Parallel FOX Algorithm Project Report Paper.tex (will be added in the future)
    • Parallel FOX Algorithm Project Report Paper.pdf (will be added in the future)
    • Reports' Contents
  5. Images
    • FOX.png
    • FOX Stage Whole.JPG
    • FOX Stage Loading Balance.png

Brief Introduction to Parallel Matrix Multiplication FOX Algorithm

Basic Concepts

  • Reduction (规约计算)
  • Owner-Computes Rule (拥有者计算原则)
  • Pipeline Parallelism (流水并行):
    • On each process, the matrix computation is divided into P stages (P supercomputing steps per process)
  • Data Parallelism (数据并行):
    • The local matrix products are computed on every process at the same computing step

Serial Matrix Multiplication

  • Mathematical Modeling of Matrix Multiplication

    • C_{ij}=\sum_{k=0}^{K-1} A_{ik}B_{kj}; \quad (i=0,N-1), \quad (j=0,M-1)
  • Time Complexity

    • O\left ( N^{3} \right )
  • Storage Complexity

    • O\left ( N^{2} \right )
  • Example Implementation in C Language

for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        for (k = 0; k < n; k++)
            C(i,j) = C(i,j) + A(i,k)*B(k,j);

Parallel Computing Modeling Design

  1. Basic Flow
  • Matrix \mathbf{A}'s dimension is M \times K, and matrix \mathbf{B}'s dimension is K \times N.
  • Compute matrix \mathbf{C} = \mathbf{A}\mathbf{B} in parallel.
  • Let p be the number of processors, and let q=\sqrt{p} be an integer that divides M, K, and N.
  • Create a Cartesian topology with process mesh P_{ij}, i=0..q-1, j=0..q-1.
  • Denote \hat{M} = \frac{M}{q}, \hat{K}=\frac{K}{q}, \hat{N}=\frac{N}{q}.
  • Distribute \mathbf{A} and \mathbf{B} by blocks over the p processes so that A_{ij} is an \hat{M} \times \hat{K} block and B_{ij} is a \hat{K} \times \hat{N} block, both stored on process P_{ij}.
  2. Details
  • Partitions of matrices A, B, and C (mathematical index convention: indices start from 1)

    • Matrix A

    • Matrix B

    • Matrix C

  • Data distribution on the 2-D Cartesian topology process mesh (mathematical index convention: indices start from 1)

    • Data Mapping

      • Data mesh mapped onto the process mesh
    • The partition may not be perfect, i.e., not every sub-matrix comes out square. That is not a problem for correctness, but it does cause load imbalance across the processes!

    • Unbalanced Partition

      • Data partition and process mesh (see figures)
    • Mathematical modeling of sub-matrix multiplication

Parallel Algorithm Design on BSP

Parallelism type: data parallelism combined with pipeline parallelism

  1. Rewrite the sub-matrix multiplication formula as q supercomputing steps (all block indices taken modulo q):

    • Stage 0: \mathbf{C_{ij}}=\mathbf{A_{i,i}}\mathbf{B_{i,j}}
      Stage 1: \mathbf{C_{ij}}=\mathbf{C_{ij}}+\mathbf{A_{i,i+1}}\mathbf{B_{i+1,j}}
      Stage 2: \mathbf{C_{ij}}=\mathbf{C_{ij}}+\mathbf{A_{i,i+2}}\mathbf{B_{i+2,j}}
      ...
      Stage t: \mathbf{C_{ij}}=\mathbf{C_{ij}}+\mathbf{A_{i,i+t}}\mathbf{B_{i+t,j}}
      ...
      Stage q-1: \mathbf{C_{ij}}=\mathbf{C_{ij}}+\mathbf{A_{i,i-1}}\mathbf{B_{i-1,j}}
    • Data parallelism: the local matrix multiplication runs on every process in each supercomputing step.

  2. Parallel algorithm operations at each step:

    • Stage 0:
      1. Process P_{ij} holds \mathbf{A_{ij}} and \mathbf{B_{ij}} but needs \mathbf{A_{ii}} (for each row index i).
      2. Process P_{ii} broadcasts \mathbf{A_{ii}} across process mesh row i.
      3. Process P_{ij} computes \mathbf{C_{ij}=A_{ii}B_{ij}}.
    • Stage 1:
      1. P_{ij} holds \mathbf{A_{ij}} and \mathbf{B_{ij}} but needs \mathbf{A_{i,i+1}} and \mathbf{B_{i+1,j}}.
        1.1 Shift the j-th block column of \mathbf{B} up by one block (block 0 wraps around to block q-1, periodic).
        1.2 P_{i,i+1} broadcasts \mathbf{A_{i,i+1}} across process mesh row i.
      2. Process P_{ij} computes \mathbf{C_{ij}=C_{ij}+A_{i,i+1}B_{i+1,j}}.
    • Stage t (t = 2, ..., q-1), with all indices modulo q:
      1. Shift the block columns of \mathbf{B} up by one block (periodic), and P_{i,i+t} broadcasts \mathbf{A_{i,i+t}} across process mesh row i.
      2. Process P_{ij} computes \mathbf{C_{ij}=C_{ij}+A_{i,i+t}B_{i+t,j}}.
    • Pipeline parallelism: the q computing steps (0 \to q-1) on each process P_{ij}.

Algorithm Analysis

  1. Algorithm analysis at each supercomputing step

    • Stage 0:
      Operations: process P_{ii} broadcasts \mathbf{A_{ii}} across process mesh row i; process P_{ij} computes \mathbf{C_{ij}=A_{ii}B_{ij}}.
      Communication in the broadcast operation:
      \left( q-1 \right) \times q \times \frac{M}{q} \times \frac{K}{q}
      Computing on each process:
      \frac{M}{q} \times \frac{K}{q} \times \frac{N}{q}
      Computing in total:
      \left( \frac{M}{q} \times \frac{K}{q} \times \frac{N}{q} \right) \times q \times q
    • Stage t (t = 1, ..., q-1), with all indices modulo q:
      Operations: shift the block columns of \mathbf{B} up by one block (periodic); P_{i,i+t} broadcasts \mathbf{A_{i,i+t}} across process mesh row i; P_{ij} computes \mathbf{C_{ij}=C_{ij}+A_{i,i+t}B_{i+t,j}}.
      Communication in the shift operation:
      \left( q \times q \right) \times \left( \frac{K}{q} \times \frac{N}{q} \right)
      Communication in the broadcast operation:
      \left[\left( q-1 \right) \times q\right] \times \left( \frac{M}{q} \times \frac{K}{q}\right)
      Communication in total:
      \left( q \times q \right) \times \left( \frac{K}{q} \times \frac{N}{q}\right) + \left[ \left( q-1 \right)\times q \right] \times \left( \frac{M}{q} \times \frac{K}{q}\right)
      Computing on each process:
      \frac{M}{q} \times \frac{K}{q} \times \frac{N}{q}
      Computing in total:
      \left( \frac{M}{q} \times \frac{K}{q} \times \frac{N}{q} \right) \times q \times q
  2. Communication in total (summed over all q stages): \left( q-1 \right) shifts of KN elements plus broadcasts totaling \left( q-1 \right) MK elements, i.e. \left( q-1 \right)\left( KN + MK \right) matrix elements.

  3. Computing in total: q stages \times q^{2} processes \times \frac{M}{q} \times \frac{K}{q} \times \frac{N}{q} = MKN multiply-add operations.

FOX Kernel in the Parallel MPI-C Program

    n_bar = n/grid->q;                     /* order of each local block */
    Set_to_zero(local_C);

    /* each stage, the B blocks shift one step up the mesh column */
    source = (grid->my_row + 1) % grid->q;
    dest   = (grid->my_row + grid->q - 1) % grid->q;

    temp_A = Local_matrix_allocate(n_bar);

    for (stage = 0; stage < grid->q; stage++) {
        /* the diagonal-offset process owns this stage's A block */
        bcast_root = (grid->my_row + stage) % grid->q;
        if (bcast_root == grid->my_col) {
            MPI_Bcast(local_A, 1, local_matrix_mpi_t,
                      bcast_root, grid->row_comm);
            Local_matrix_multiply(local_A, local_B, local_C);
        } else {
            MPI_Bcast(temp_A, 1, local_matrix_mpi_t,
                      bcast_root, grid->row_comm);
            Local_matrix_multiply(temp_A, local_B, local_C);
        }
        /* circular shift of B up the column, ready for the next stage */
        MPI_Sendrecv_replace(local_B, 1, local_matrix_mpi_t,
                             dest, 0, source, 0, grid->col_comm, &status);
    }

Analysis

  • Intel® Trace Analyzer statistics collected while the FOX kernel executes
  • Intel® Trace Analyzer load-balance analysis while the FOX kernel executes

Warranty

There may be mistakes in both the documents and the code, owing to the limits of our knowledge and resources. As a result: THESE DOCUMENTS AND CODES ARE PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND. I MAKE NO WARRANTIES, EXPRESS OR IMPLIED, THAT THEY ARE FREE OF ERROR.

Copyright

You may use and copy these works for any academic purpose, except merely copying them to complete your homework or republishing them without properly crediting the original author.
