Add new features and fixes to Sparse Update #47

Status: Open. Wants to merge 95 commits into base: main.

Commits (95):
5e914fe
Fix minors for sparse update and add train loss monitoring during epochs
dnadalini Apr 3, 2024
15ce4e4
Fix code generation (minors)
dnadalini Apr 3, 2024
81cdd47
Enhance GM for Conv2D test (FP32)
dnadalini Apr 3, 2024
dcd195e
Merge branch 'main' of github.com:pulp-platform/pulp-trainlib into pr…
dnadalini Apr 4, 2024
ec037d1
Merge branch 'main' of github.com:pulp-platform/pulp-trainlib into pr…
dnadalini Apr 4, 2024
3aaae79
Fix code generation for sparse update in TrainLib_Deployer without si…
dnadalini Apr 4, 2024
438716b
Introduce sparse update in single buffer more (code generator is bugg…
dnadalini Apr 4, 2024
bd8d8cb
Merge branch 'main' of github.com:pulp-platform/pulp-trainlib into pr…
dnadalini Apr 5, 2024
d0a981e
Merge branch 'main' of github.com:pulp-platform/pulp-trainlib into pr…
dnadalini Apr 5, 2024
d8ebd50
Fix merge issues in TrainLib_Deployer
dnadalini Apr 5, 2024
1c29032
Merge branch 'main' of github.com:pulp-platform/pulp-trainlib into pr…
dnadalini Apr 8, 2024
29d07a3
Merge branch 'main' of github.com:pulp-platform/pulp-trainlib into pr…
dnadalini Apr 8, 2024
b11257a
Partially fix single buffering with sparse update
dnadalini Apr 9, 2024
315fb17
Fix partial update with single buffering mode
dnadalini Apr 9, 2024
bde9f48
Fix single buffer mode (no sparse update)
dnadalini Apr 9, 2024
35c8dc0
Optimize memory occupation and fix single buffer mode with sparse update
dnadalini Apr 9, 2024
a84f8ba
Fix sparse update (still existing bugs) with no buffer and single buf…
dnadalini Apr 9, 2024
793d864
Add error in case of double buffering selection (currently unavailable)
dnadalini Apr 9, 2024
4dd71f8
Fix memory bug with single buffering
dnadalini Apr 10, 2024
b566118
Merge pull request #3 from pulp-platform/main
diaconuccalin Apr 10, 2024
2fe4089
Fix functional bug with memory allocation for ReLU backpropagation wi…
dnadalini Apr 11, 2024
417ae28
Add testing setup for sparse update
dnadalini Apr 11, 2024
c95b43a
Add optimized naive conv2d algorithms for 5x5 kernels with stride 2 a…
dnadalini Apr 11, 2024
e5d34dc
Fix InstanceNorm primitives in both forward and backward (FP32, FP16)
dnadalini Apr 15, 2024
d2e4bd1
Merge with main
dnadalini Apr 15, 2024
216520e
Merge pull request #4 from pulp-platform/main
diaconuccalin Apr 16, 2024
c08ef3a
Merge branch 'main' of github.com:pulp-platform/pulp-trainlib into pr…
dnadalini Apr 17, 2024
f5c6621
Merge pull request #6 from pulp-platform/main
diaconuccalin Jun 6, 2024
8f55d8c
Refactor code for the data assignment between blobs (no and single bu…
dnadalini Jun 25, 2024
378ce6b
Merge pull request #8 from pulp-platform/main
diaconuccalin Jul 1, 2024
3100433
Fix bug of forward pass in backward step. Clean GM.py and net.c files…
diaconuccalin Jul 3, 2024
5dffe02
Merge main into pr/SparseUpdate
dnadalini Jul 3, 2024
115a772
Fix minors in double buffering
dnadalini Jul 3, 2024
bf929cf
Set up experiments to try out sparse update
dnadalini Jul 3, 2024
019d363
Fix issues with the backward step (single buffering) with sparse update
dnadalini Jul 9, 2024
84ec014
Fix several bugs in backward with residual connections
dnadalini Jul 9, 2024
984325a
Fix pointer setup in single buffer mode
dnadalini Jul 9, 2024
60b02e0
Enhance code generation (visually)
dnadalini Jul 9, 2024
caf2aa2
Remove several bugs for residual connections in single buffer mode
dnadalini Jul 9, 2024
83b6554
Add new test setup for the TrainLib_Deployer frontend
dnadalini Jul 9, 2024
d61bf44
Enhance code generation in single buffer mode
dnadalini Jul 9, 2024
184a96f
Enhance features description in main README.md
dnadalini Jul 10, 2024
734882a
Add Leaky ReLU to PULP-TrainLib and add related test
dnadalini Jul 10, 2024
99f5b93
Add Sigmoid and Leaky ReLU to TrainLib_Deployer
dnadalini Jul 10, 2024
8cfd099
Add LeakyReLU and Sigmoid to single buffer mode
dnadalini Jul 10, 2024
c74c81d
Fixed backward step for softmax. Fixed backward step for mhsa. Replac…
diaconuccalin Jul 12, 2024
2594913
Completed comments with explanations. Other clean-ups
diaconuccalin Jul 13, 2024
bb3b82a
Added more debug messages
diaconuccalin Jul 13, 2024
7a68c23
Fix errors
diaconuccalin Jul 13, 2024
5cbb139
Fixed the softmax activation to work on heights that are different fr…
diaconuccalin Jul 15, 2024
9cf9a36
Replicated backward step of mhsa from pulp primitive to the GM PyTorc…
diaconuccalin Jul 16, 2024
a79f739
Parallelized backward pass of softmax
diaconuccalin Jul 16, 2024
4ed09d9
Add pseudo-random number generator and related test
dnadalini Jul 16, 2024
026f010
Partial backward softmax bug fix
diaconuccalin Jul 17, 2024
bda2ba6
Fix in activations test
diaconuccalin Jul 17, 2024
ca16ced
Ported softmax to fp16 for activation test. Ported fast exp softmax t…
diaconuccalin Jul 24, 2024
e754d70
Brought fp32 implementation of mhsa to fp16. NOT yet adapted
diaconuccalin Jul 25, 2024
1b7f8c3
Adapted fp16 to fp32 for mhsa. Other fixes
diaconuccalin Jul 25, 2024
75cdcd2
Add L1Loss in both FP32 and FP16
dnadalini Jul 25, 2024
ce0fd65
Add C implementation of berHu loss, still without test
dnadalini Jul 25, 2024
9c4f745
Switched to relative error check for the mhsa fp16 test. Writing of t…
diaconuccalin Jul 25, 2024
af7c2d6
Add test for berHu loss, loss computation not perfectly matching pytorch
dnadalini Jul 26, 2024
9cdeacf
Made temp buffer size computation dynamic. Removed unused h_buffer. O…
diaconuccalin Jul 27, 2024
5bd75ba
Merge pull request #1 from pulp-platform/pr/SparseUpdate
Dequino Jul 30, 2024
859af2a
Enhance defaults for losses
dnadalini Jul 30, 2024
ae3cf78
Added dropout kernels for FP32 and FP16
Aug 1, 2024
aa4ada7
Changed the mhsa fp32 implementation to replace the previous big line…
diaconuccalin Aug 1, 2024
c75cce6
Changed the mhsa fp16 implementation to replace the previous big line…
diaconuccalin Aug 1, 2024
959cf24
Modified makefile
Aug 5, 2024
6802ead
Merge branch 'mhsa_fix' of github.com:diaconuccalin/pulp-trainlib int…
Aug 5, 2024
4fd1fd3
Merge branch 'diaconuccalin-mhsa_fix' into pulp-trainlib-dev
Aug 5, 2024
778a830
Included biases for input projection layers for the forward fp32 step…
diaconuccalin Aug 6, 2024
6a1e311
Fully included biases for input projection layers for fp32 and fp64. …
diaconuccalin Aug 6, 2024
0db8db7
Add first version of fp32 transposed convolution 2D (incomplete, to b…
dnadalini Aug 27, 2024
78136ef
Merge branch 'mhsa_fix' of github.com:diaconuccalin/pulp-trainlib int…
Sep 6, 2024
95a4f92
Merge branch 'diaconuccalin-mhsa_fix' into pulp-trainlib-dev
Sep 6, 2024
983f336
Merge branch 'pr/SparseUpdate' into pulp-trainlib-dev
Dequino Sep 6, 2024
4249fba
Merge pull request #48 from Dequino/pulp-trainlib-dev
dnadalini Sep 12, 2024
6349d73
Add forward step of transposed convolution 2d (FP32, no optimization)
dnadalini Oct 2, 2024
da982d3
Merge branch 'pr/SparseUpdate' of github.com:pulp-platform/pulp-train…
dnadalini Oct 2, 2024
de5d70f
Fix pulp_train.h
dnadalini Oct 2, 2024
4b44e2b
Fix forward step of transposed conv2d (FP32, no opt)
dnadalini Oct 2, 2024
9ff2b60
Add (partially tested) weight grad of transposed conv2d (FP32, no opt)
dnadalini Oct 2, 2024
d32f938
Add input grad (no padding) for FP32 transposed convolution 2d (no opt)
dnadalini Oct 3, 2024
482de08
Add first working (but unoptimized) transposed convolution 2D kernels…
dnadalini Oct 10, 2024
0cac308
Optimize the weight grad step of FP32 naive transposed conv2d
dnadalini Oct 15, 2024
5ddab65
Parallelize and optimize the forward and backward grad steps of the t…
dnadalini Oct 15, 2024
f36fff1
Add parallelization and code optimizations for all 3 steps of transpo…
dnadalini Oct 15, 2024
1e49fd0
Add fp16 transposed convolution (not optimized more than fp32) and re…
dnadalini Oct 15, 2024
95ea4a0
Add bilinear and nearest neighbour interpolations (fp32, parallel)
dnadalini Oct 16, 2024
cf1b8e7
Add fp16 version of interpolations and fix test to support fp16
dnadalini Oct 16, 2024
662c9bd
Fix minors in test_losses
dnadalini Oct 29, 2024
7f3fbb8
Fix minors in test_losses
dnadalini Oct 29, 2024
1cc482a
Fix test for activations and sizes of convolutions for testing
dnadalini Nov 9, 2024
69d17dd
Fix test for the activation functions
dnadalini Nov 11, 2024
28 changes: 25 additions & 3 deletions README.md
@@ -151,13 +151,15 @@ PULP-TrainLib's repository is organized with these branches:

> Note: checked are complete, unchecked are ongoing

PULP-TrainLib:

- [X] Forward passes for DepthWise, PointWise Convolutions and Conv2D, Fully-Connected (FP32, FP16)
- [X] Weight gradients for DepthWise, PointWise Convolutions and Conv2D, Fully-Connected (FP32, FP16)
- [X] Input gradients for DepthWise, PointWise Convolutions and Conv2D, Fully-Connected (FP32, FP16)
- [X] CWH data layout for DepthWise, PointWise and 2D Convolutions (FP32, FP16)
- [X] HWC data layout for PointWise Convolution (FP32, FP16) and 2D Convolutions (FP32, FP16)
- [X] stride and padding (only naive 2D Convolutions, without im2col+mm optimization)
- [X] ReLU, Sigmoid activation functions (FP32, FP16)
- [X] Stride and Padding (only naive 2D Convolutions, without im2col+mm optimization)
- [X] ReLU, Leaky ReLU, Sigmoid activation functions (FP32, FP16)
- [X] Gradient Descent optimizer (FP32, FP16)
- [X] Max and Average Pooling (FP32, FP16)
- [X] RNN training primitives (FP32)
@@ -173,7 +175,23 @@ PULP-TrainLib's repository is organized with these branches:
- [ ] Biases for DepthWise and PointWise Convolutions (FP32, FP16)
- [ ] Sparse Update (layer-wise) in TrainLib_Deployer
- [ ] Partial Im2Col / Im2Row for Conv2D (FP32, FP16)
- [ ] Integration of biases in TrainLib_Deployer (Conv2D)

TrainLib_Deployer:

- [X] No Buffer and Single Buffer mode, supporting layer-wise execution (tiling not supported)
- [X] Conv2D, PointWise, DepthWise Convolutions, Fully-Connected support (FP32, FP16)
- [X] Average and Max Pooling (FP32, FP16)
- [X] ReLU, LeakyReLU, Sigmoid Activations (FP32, FP16)
- [X] InstanceNorm (FP32, FP16)
- [X] Residual Connections (FP32, FP16, only no buffer mode)
- [ ] Residual Connections (FP32, FP16, single buffer mode)
- [X] SGD Optimizer (FP32, FP16)
- [ ] FP32-FP16 Layer-Wise Mixed Precision Mode
- [X] Layer-Wise Sparse Update
- [X] CHW Data Layout
- [ ] HWC Data Layout
- [X] Online Learning (batch size = 1)
- [ ] Mini-Batch Learning (batch size > 1)

# Known bugs / issues (open for contributions)

@@ -185,6 +203,10 @@ PULP-TrainLib's repository is organized with these branches:
- Missing integration of sigmoid function in TrainLib_Deployer
- Performance of the FP16 sigmoid may need to be optimized with an FP16 exponential (e.g., https://github.com/0xBYTESHIFT/fp16/blob/master/include/half/half.hpp)

TrainLib_Deployer:
- Training does not converge in DNNs generated with TrainLib_Deployer if the last layer is not updated
- With no single/double buffering, not updating a PW layer in a sparse update results in wrong backward computation


# Contributors

86 changes: 68 additions & 18 deletions lib/include/pulp_act_fp16.h
@@ -12,24 +12,33 @@
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*
* Authors: Davide Nadalini, Leonardo Ravaglia, Calin Diaconu
*
* Activation functions configuration structure
*/

/**
* Authors: Davide Nadalini, Leonardo Ravaglia
*/

/**
* Activation functions configuration structure
* @brief Structure for activation functions
* @param input blob structure for the input data of the activation layer
* @param output blob structure for the output data of the activation layer
*/
struct act_args_fp16 {
struct blob_fp16 * input;
struct blob_fp16 * output;
};


/**
* @brief Structure for activation functions
* @brief Structure for leaky relu activation functions
* @param input blob structure for the input data of the activation layer
* @param output blob structure for the output data of the activation layer
*/
struct act_args_fp16 {
struct leakyrelu_args_fp16 {
struct blob_fp16 * input;
struct blob_fp16 * output;
fp16 negative_slope;
};

/**
@@ -39,17 +48,22 @@ struct act_args_fp16 {
* @param output pointer to output vector
* @param sum final sum value of all exponentials
*/
struct softmax_args_fp16{
struct blob_fp16 * input;
struct blob_fp16 * output;
int L;
int n_heads;
fp16 * maxes;
fp16 * sums;
struct softmax_args_fp16 {
fp16 *input_data;        // input activations
fp16 *input_diff;        // gradient w.r.t. the input activations
fp16 *output_data;       // output activations
fp16 *output_diff;       // gradient w.r.t. the output activations
int H;                   // height (rows) of the softmax input
int W;                   // width (columns) of the softmax input
int L;                   // sequence length
int n_heads;             // number of attention heads
fp16 *global_max;        // support buffer for the max reduction
fp16 *partial_exp_sum;   // support buffer for the partial sums of exponentials
fp16 *maxes;             // row-wise maxima (numerically stable softmax)
fp16 *sums;              // row-wise sums of exponentials
};



/**
* Activation functions, both FW and BW
**/
@@ -62,54 +76,88 @@ struct softmax_args_fp16{
*/
void pulp_sigmoid_fp16_fw_cl( void * act_args );


/**
* @brief Backward pass function.
* @param input Input for sigmoid.
* @param output Output of sigmoid.
*/
void pulp_sigmoid_fp16_bw_cl( void * act_args );


/**
* @brief Core function to implement the forward of sigmoid (allows parallelization, parallelize with pi_cl_team_fork(NUM_CORES, sigmoid_core_fw_fp16, &args)).
* @param act_args Input and output data (data only will be used)
*/
void sigmoid_core_fw_fp16( void * act_args );


/**
* @brief Core function to implement the backward of sigmoid (allows parallelization, parallelize with pi_cl_team_fork(NUM_CORES, sigmoid_core_bw_fp16, &args)).
* @param act_args Input and output data (gradients only will be used)
*/
void sigmoid_core_bw_fp16( void * act_args );
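/*
 * Usage sketch (added for illustration, not part of this header): configure an
 * act_args_fp16 structure and run the forward sigmoid on the cluster. The
 * blob_fp16 field names (data, dim) and the example function itself are
 * assumptions based on the rest of PULP-TrainLib, not definitions from this PR;
 * the pi_cl_team_fork() call follows the pattern suggested in the comments above.
 */
static inline void example_sigmoid_fw_fp16(fp16 *in_buf, fp16 *out_buf, int size)
{
    struct blob_fp16 in_blob, out_blob;
    struct act_args_fp16 args;

    // Wrap the raw activation buffers into blobs (only data and dim are used here)
    in_blob.data  = in_buf;   in_blob.dim  = size;
    out_blob.data = out_buf;  out_blob.dim = size;

    args.input  = &in_blob;
    args.output = &out_blob;

    // Parallelize the core forward kernel over the cluster cores
    // (alternatively, call pulp_sigmoid_fp16_fw_cl(&args) directly)
    pi_cl_team_fork(NUM_CORES, sigmoid_core_fw_fp16, &args);
}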



/**
* @brief Forward pass function. Configure and pass a act_args structure pointer as argument.
* @param input Input for relu.
* @param output Output of relu.
*/
void pulp_relu_fp16_fw_cl( void * act_args_fp16 );


/**
* @brief Bakcward pass function.
* @brief Backward pass function.
* @param input Input for relu.
* @param output Output of relu.
*/
void pulp_relu_fp16_bw_cl( void * act_args_fp16 );


/**
* @brief Core function to implement the forward of ReLU (allows parallelization, parallelize with pi_cl_team_fork(NUM_CORES, relu_core_fw_fp16, &args)).
* @param act_args Input and output data (data only will be used)
*/
void relu_core_fw_fp16( void * act_args_fp16 );


/**
* @brief Core function to implement the backward of ReLU (allows parallelization, parallelize with pi_cl_team_fork(NUM_CORES, relu_core_bw_fp16, &args)).
* @param act_args Input and output data (gradients only will be used)
*/
void relu_core_bw_fp16( void * act_args_fp16 );


/**
* @brief Forward pass function. Configure and pass a leakyrelu_args structure pointer as argument.
* @param input Input for leaky relu.
* @param output Output of leaky relu.
*/
void pulp_leakyrelu_fp16_fw_cl( void * leakyrelu_args_fp16 );

/**
* @brief Backward pass function.
* @param input Input for leaky relu.
* @param output Output of leaky relu.
*/
void pulp_leakyrelu_fp16_bw_cl( void * leakyrelu_args_fp16 );

/**
* @brief Core function to implement the forward of Leaky ReLU (allows parallelization, parallelize with pi_cl_team_fork(NUM_CORES, leakyrelu_core_fw_fp16, &leakyrelu_args)).
* @param leakyrelu_args_fp16 Input and output data (data only will be used)
*/
void leakyrelu_core_fw_fp16( void * leakyrelu_args_fp16 );

/**
* @brief Core function to implement the backward of Leaky ReLU (allows parallelization, parallelize with pi_cl_team_fork(NUM_CORES, leakyrelu_core_bw_fp16, &leakyrelu_args)).
* @param leakyrelu_args_fp16 Input and output data (gradients only will be used)
*/
void leakyrelu_core_bw_fp16( void * leakyrelu_args_fp16 );
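/*
 * Usage sketch (added for illustration, not part of this header): the Leaky ReLU
 * entry points take a leakyrelu_args_fp16 structure, which extends the plain
 * activation arguments with the negative_slope coefficient. The blob_fp16 field
 * names (data, dim), the example function and the chosen slope value are
 * assumptions, not definitions from this PR.
 */
static inline void example_leakyrelu_fw_fp16(fp16 *in_buf, fp16 *out_buf, int size)
{
    struct blob_fp16 in_blob, out_blob;
    struct leakyrelu_args_fp16 args;

    in_blob.data  = in_buf;   in_blob.dim  = size;
    out_blob.data = out_buf;  out_blob.dim = size;

    args.input          = &in_blob;
    args.output         = &out_blob;
    args.negative_slope = (fp16) 0.01f;   // slope applied to negative inputs

    // Forward pass entry point (the core kernel can also be forked manually:
    //   pi_cl_team_fork(NUM_CORES, leakyrelu_core_fw_fp16, &args); )
    pulp_leakyrelu_fp16_fw_cl(&args);
}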





/**
* @brief Forward pass function.
@@ -118,16 +166,18 @@ void relu_core_bw_fp16( void * act_args_fp16 );
*/
void pulp_softmax_fp16_fw_cl( void * act_args_fp16 );


/**
* @brief Bakcward pass function.
* @brief Backward pass function.
* @param input Input for softmax.
* @param output Output of softmax.
*/
void pulp_softmax_fp16_bw_cl( void * act_args_fp16 );


/**
* @brief Forward pass function. Configure and pass a act_args structure pointer as argument.
* @param input Input for gelu.
* @param output Output of gelu.
*/
void pulp_gelu_fp16_fw_cl( void* act_args_fp16);
void pulp_gelu_fp16_fw_cl( void* act_args_fp16);