[WIP] support dataset rows more then max(int32_t) #5454

junpeng0715 · 2022-08-31T05:57:40Z

Description

This is a WIP PR to allow more than MAX(int32_t) of dataset rows.

change data_size_t to int64_t
change int32_t to int64_t for related places
modify test code to allow int64 dataset.

change unmatch type in c_api.h(int) and c_api.cpp(int32_t) for ncol

Lines 1221 to 1231 in 2b8fe8b

    
           LIGHTGBM_C_EXPORT int LGBM_BoosterPredictForMatSingleRow(BoosterHandle handle, 
        
                                                                    const void* data, 
        
                                                                    int data_type, 
        
                                                                    int ncol, 
        
                                                                    int is_row_major, 
        
                                                                    int predict_type, 
        
                                                                    int start_iteration, 
        
                                                                    int num_iteration, 
        
                                                                    const char* parameter, 
        
                                                                    int64_t* out_len, 
        
                                                                    double* out_result);

change the thread run logic in Threading.h, when nblock = 0, set left_cnt = 0

LightGBM/include/LightGBM/utils/threading.h

Lines 162 to 168 in 2b8fe8b

    
           left_write_pos_[0] = 0; 
        
           right_write_pos_[0] = 0; 
        
           for (int i = 1; i < nblock; ++i) { 
        
             left_write_pos_[i] = left_write_pos_[i - 1] + left_cnts_[i - 1]; 
        
             right_write_pos_[i] = right_write_pos_[i - 1] + right_cnts_[i - 1]; 
        
           } 
        
           data_size_t left_cnt = left_write_pos_[nblock - 1] + left_cnts_[nblock - 1];

Additional Comments

Need some experienced developers to review this PR, and make some suggestions.
When I build in java wrapper(swig), I got two errors,

  ubuntu: 20.04
  SWIG:4.0.2
  gcc: GNU 9.4.0
  G++: GNU 9.4.0
  JVM: 1.8.0_342

It seem that SWIG will convert the parameter which defined as const int64_t* to long long const*, could you give some suggestions for this error?

 [ 96%] Building CXX object CMakeFiles/_lightgbm_swig.dir/java/lightgbmlibJAVA_wrap.cxx.o
/home/ubuntu/projects/LightGBM/build/java/lightgbmlibJAVA_wrap.cxx: In function ‘jint Java_com_microsoft_ml_lightgbm_lightgbmlibJNI_LGBM_1DatasetCreateFromSampledColumn(JNIEnv*, jclass, jlong, jlong, jint, jlong, jlong, jlong, jlong, jstring, jlong)’:
/home/ubuntu/projects/LightGBM/build/java/lightgbmlibJAVA_wrap.cxx:1285:68: error: invalid conversion from ‘const long long int*’ to ‘const int64_t*’ {aka ‘const long int*’} [-fpermissive]
 1285 |   result = (int)LGBM_DatasetCreateFromSampledColumn(arg1,arg2,arg3,(long long const *)arg4,arg5,arg6,arg7,(char const *)arg8,arg9);
      |                                                                    ^~~~~~~~~~~~~~~~~~~~~~~
      |                                                                    |
      |                                                                    const long long int*
In file included from /home/ubuntu/projects/LightGBM/build/java/lightgbmlibJAVA_wrap.cxx:239:
/home/ubuntu/projects/LightGBM/include/../include/LightGBM/c_api.h:130:74: note:   initializing argument 4 of ‘int LGBM_DatasetCreateFromSampledColumn(double**, int**, int32_t, const int64_t*, int64_t, int64_t, int64_t, const char*, void**)’
  130 |                                                           const int64_t* num_per_col,
      |                                                           ~~~~~~~~~~~~~~~^~~~~~~~~~~
/home/ubuntu/projects/LightGBM/build/java/lightgbmlibJAVA_wrap.cxx: In function ‘jint Java_com_microsoft_ml_lightgbm_lightgbmlibJNI_LGBM_1DatasetGetSubset(JNIEnv*, jclass, jlong, jlong, jlong, jstring, jlong)’:
/home/ubuntu/projects/LightGBM/build/java/lightgbmlibJAVA_wrap.cxx:1561:44: error: invalid conversion from ‘const long long int*’ to ‘const int64_t*’ {aka ‘const long int*’} [-fpermissive]
 1561 |   result = (int)LGBM_DatasetGetSubset(arg1,(long long const *)arg2,arg3,(char const *)arg4,arg5);
      |                                            ^~~~~~~~~~~~~~~~~~~~~~~
      |                                            |
      |                                            const long long int*
In file included from /home/ubuntu/projects/LightGBM/build/java/lightgbmlibJAVA_wrap.cxx:239:
/home/ubuntu/projects/LightGBM/include/../include/LightGBM/c_api.h:316:60: note:   initializing argument 2 of ‘int LGBM_DatasetGetSubset(DatasetHandle, const int64_t*, int64_t, const char*, void**)’
  316 |                                             const int64_t* used_row_indices,

guolinke · 2022-09-01T15:07:13Z

will this affect the training speed? any benchmarks?

junpeng0715 · 2022-09-02T01:36:28Z

will this affect the training speed? any benchmarks?

@guolinke

I found a benchmarks in you repo, boosting_tree_benchmarks, is it fine to confirm training speed?
Can you give me some feedback on the question for SWIG that I raised? Or who is the main developer for SWIG part to ask for advice.

guolinke · 2022-09-02T03:15:43Z

yeah, it is fine to benchmark.

@imatiach-msft , @AlbertoEAF, @svotaw can you help for SWIG ?

StrikerRUS · 2022-09-04T15:37:04Z

It seem that SWIG will convert the parameter which defined as const int64_t* to long long const*, could you give some suggestions for this error?

I believe this is related: #4091.

svotaw · 2022-09-04T17:54:32Z

include/LightGBM/c_api.h

-                                                          int32_t num_local_row,
+                                                          const int64_t* num_per_col,
+                                                          int64_t num_sample_row,
+                                                          int64_t num_local_row,


Do you need to support int64_t for sample data and data local to 1 machine? Not opposing, just making sure you want to do that. If so, you prob need to change the sample_indices to int64_t as well?

Do you need to support int64_t for sample data and data local to 1 machine? Not opposing, just making sure you want to do that. If so, you prob need to change the sample_indices to int64_t as well?

@svotaw Very happy to receive feedbacks.

Not just using 1 machine. I'm not sure, int/int32_t is enough for Worker machines, is it enough to Driver machine?

svotaw · 2022-09-04T17:55:21Z

include/LightGBM/c_api.h

@@ -91,10 +91,10 @@ LIGHTGBM_C_EXPORT int LGBM_GetSampleCount(int32_t num_total_row,
 * \param[out] out_len Number of indices
 * \return 0 when succeed, -1 when failure happens
 */
-LIGHTGBM_C_EXPORT int LGBM_SampleIndices(int32_t num_total_row,
+LIGHTGBM_C_EXPORT int LGBM_SampleIndices(int64_t num_total_row,


all these signature changes are breaking changes. Just making sure you consider that.

junpeng0715 · 2022-09-05T02:35:19Z

yeah, it is fine to benchmark.

@imatiach-msft , @AlbertoEAF, @svotaw can you help for SWIG ?

@guolinke Tested Higgs Dataset in AWS
Instrance : r5.2xlarge (vCPUs 8/Memory 64GiB)

#	Master Branch	Modified
Accuracy	310.827840 seconds, iteration 500, AUC:0.845874	427.753441 seconds, iteration 500, AUC:0.845874
Speed	294.647278 seconds, iteration 500	303.744985 seconds, iteration 500

junpeng0715 · 2022-09-05T03:15:47Z

It seem that SWIG will convert the parameter which defined as const int64_t* to long long const*, could you give some suggestions for this error?

I believe this is related: #4091.

LightGBM/swig/ChunkedArray_API_extensions.i

Lines 16 to 21 in 2e54d5f

    
           /* Unfortunately, for the time being, 
        
            * SWIG has issues generating the overloads to coalesce_to() 
        
            * for larger integral types 
        
            * so we won't support that for now: 
        
            */ 
        
           //%template(int64ChunkedArray) ChunkedArray<int64_t>;

Yes @StrikerRUS, I saw this source comment. I'm finding a way to avoid the SWIG issue.

Is there any tasks are working on this issue?
I use SynapseML to run LightGBM. First build lightgbm JAVA wrapper once, then modify the error in built source from long long const* to long int const*, then rebuild, it will be success. SynapseML can use the built package to run, I am not sure it is a temp solution for test. Could you raise some suggesion for this issue?

guolinke · 2022-09-05T13:53:06Z

@junpeng0715 given the performance loss, I think a better solution is to make the long indices an option, e.g. make it a flag (define in CMake) in the compilation.

junpeng0715 · 2022-09-05T15:08:52Z

@junpeng0715 given the performance loss, I think a better solution is to make the long indices an option, e.g. make it a flag (define in CMake) in the compilation.

@guolinke Re-ran the Higgs benchmark, and got a new result below.
It is faster this time.It seems the instance is not stable. Anyway, I also agree with you, it's fine to add an option.

#	Master Branch	Modified
Accuracy	309.241788 seconds, iteration 500, AUC:0.845874	318.976365 seconds, iteration 500, AUC:0.845874
Speed	399.465431 seconds, iteration 500	304.222913 seconds, iteration 500

e.g. make it a flag (define in CMake) in the compilation.

It feels like changing the modified int64_t to data_size_t , then add a CMAKE option to switch data_size_t .
But how to handle the APIs(C, Python, etc)?

junpeng0715 · 2022-09-06T01:41:08Z

It seem that SWIG will convert the parameter which defined as const int64_t* to long long const*, could you give some suggestions for this error?

I believe this is related: #4091.

LightGBM/swig/ChunkedArray_API_extensions.i

Lines 16 to 21 in 2e54d5f

/* Unfortunately, for the time being,

* SWIG has issues generating the overloads to coalesce_to()

* for larger integral types

* so we won't support that for now:

*/

//%template(int64ChunkedArray) ChunkedArray<int64_t>;

Yes @StrikerRUS, I saw this source comment. I'm finding a way to avoid the SWIG issue.

Is there any tasks are working on this issue?

I use SynapseML to run LightGBM. First build lightgbm JAVA wrapper once, then modify the error in built source from long long const* to long int const*, then rebuild, it will be success. SynapseML can use the built package to run, I am not sure it is a temp solution for test. Could you raise some suggesion for this issue?

Hi @imatiach-msft , @AlbertoEAF, @svotaw , could you help for this?

guolinke · 2022-09-07T14:28:34Z

cc @shiyu1994 to confirm the performance test.

junpeng0715 · 2022-09-11T15:46:22Z

It seem that SWIG will convert the parameter which defined as const int64_t* to long long const*, could you give some suggestions for this error?

I believe this is related: #4091.

LightGBM/swig/ChunkedArray_API_extensions.i

Lines 16 to 21 in 2e54d5f

/* Unfortunately, for the time being,

* SWIG has issues generating the overloads to coalesce_to()

* for larger integral types

* so we won't support that for now:

*/

//%template(int64ChunkedArray) ChunkedArray<int64_t>;

Yes @StrikerRUS, I saw this source comment. I'm finding a way to avoid the SWIG issue.

Is there any tasks are working on this issue?

I use SynapseML to run LightGBM. First build lightgbm JAVA wrapper once, then modify the error in built source from long long const* to long int const*, then rebuild, it will be success. SynapseML can use the built package to run, I am not sure it is a temp solution for test. Could you raise some suggesion for this issue?

Hi @imatiach-msft , @AlbertoEAF, @svotaw , could you help for this?

Hi @imatiach-msft , @AlbertoEAF, @svotaw, @guolinke
Could you explain my doubt for SWIG part?

LightGBM is using the SWIG's stdint.i for int64_t,

LightGBM/swig/lightgbmlib.i

Line 21 in 952458a

%include "stdint.i"

the defination for int64_t in SWIG need to enable SWIGWORDSIZE64 when Word size is 64-bit，
https://github.com/swig/swig/blob/ad1688055dea4f07cd4023d5e0301d97f7759151/Lib/stdint.i#L19-L24

So for default, in LightGBM's swig part, int64_t is being long long(SWIGWORDSIZE64 not defined),
Is that expected?Or unsuitable?

junpeng0715 · 2022-09-21T12:14:54Z

Hi guys, Is there any update on this ticket?

ghost · 2022-09-22T05:54:55Z

Thank you for your submission, we really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

❌ junpeng0715 sign now
_{You have signed the CLA already but the status is still pending? Let us recheck it.}

guolinke · 2022-10-09T05:06:40Z

@junpeng0715 sorry for the delay. I check with @shiyu1994 , and confirmed the int64_t would affect the speed of the data partition algorithm. So for now, the solution is "make it a flag (define in CMake) in the compilation.", and users need to recompile if they need the int64_t version.

junpeng0715 · 2022-10-11T08:28:02Z

@junpeng0715 sorry for the delay. I check with @shiyu1994 , and confirmed the int64_t would affect the speed of the data partition algorithm. So for now, the solution is "make it a flag (define in CMake) in the compilation.", and users need to recompile if they need the int64_t version.

Thank you so mush! Ok, will do that!

junpeng0715 · 2022-10-13T07:14:08Z

I raised a new ticket for "make it a flag (define in CMake) in the compilation." #5540

jameslamb · 2022-10-18T03:08:17Z

Based on #5540 (comment), #5540 replaces this PR. I'm going to close this one.

github-actions · 2023-11-15T00:20:36Z

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

junpeng0715 requested review from guolinke, shiyu1994, StrikerRUS, jameslamb and jmoralez as code owners August 31, 2022 05:57

microsoft-github-policy-service bot deleted a comment Aug 31, 2022

jameslamb added the in progress label Sep 2, 2022

microsoft-github-policy-service bot deleted a comment Sep 4, 2022

StrikerRUS added the breaking label Sep 4, 2022

svotaw reviewed Sep 4, 2022

View reviewed changes

microsoft-github-policy-service bot deleted a comment Sep 5, 2022

junpeng0715 closed this Sep 5, 2022

junpeng0715 force-pushed the dev_0830 branch from e1b07b8 to 649ef60 Compare September 5, 2022 10:02

[microsoft#5454]support dataset rows more then max(int32_t)

930d4e1

junpeng0715 reopened this Sep 5, 2022

microsoft-github-policy-service bot deleted a comment Sep 5, 2022

Fix cpp test code microsoft#5454

cd8fe08

microsoft-github-policy-service bot deleted a comment Sep 5, 2022

Empty commit to trigger ci

3d1ff46

junpeng.li and others added 3 commits October 7, 2022 09:48

modify GPU data type to fit int64

131abfc

Update histogram256.cl

1146436

modify GPU data type to fit int64

ac27c5f

junpeng0715 pushed a commit to junpeng0715/LightGBM that referenced this pull request Oct 12, 2022

[microsoft#5454]support dataset rows more then max(int32_t)

087793d

junpeng0715 pushed a commit to junpeng0715/LightGBM that referenced this pull request Oct 12, 2022

Fix cpp test code microsoft#5454

b32427a

junpeng0715 mentioned this pull request Oct 13, 2022

support dataset rows more then max(int32_t) (add CMAKE option) #5540

Open

jameslamb closed this Oct 18, 2022

jameslamb mentioned this pull request Feb 9, 2023

crash when data size exceeds int32 #5705

Closed

jameslamb removed the in progress label Aug 13, 2023

github-actions bot locked as resolved and limited conversation to collaborators Nov 15, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] support dataset rows more then max(int32_t) #5454

[WIP] support dataset rows more then max(int32_t) #5454

junpeng0715 commented Aug 31, 2022 •

edited

Loading

guolinke commented Sep 1, 2022

junpeng0715 commented Sep 2, 2022 •

edited

Loading

guolinke commented Sep 2, 2022 •

edited

Loading

StrikerRUS commented Sep 4, 2022

svotaw Sep 4, 2022

junpeng0715 Sep 5, 2022

svotaw Sep 4, 2022

junpeng0715 commented Sep 5, 2022 •

edited

Loading

junpeng0715 commented Sep 5, 2022

guolinke commented Sep 5, 2022

junpeng0715 commented Sep 5, 2022

junpeng0715 commented Sep 6, 2022

guolinke commented Sep 7, 2022

junpeng0715 commented Sep 11, 2022

junpeng0715 commented Sep 21, 2022

ghost commented Sep 22, 2022

guolinke commented Oct 9, 2022

junpeng0715 commented Oct 11, 2022

junpeng0715 commented Oct 13, 2022 •

edited

Loading

jameslamb commented Oct 18, 2022

github-actions bot commented Nov 15, 2023

	LIGHTGBM_C_EXPORT int LGBM_BoosterPredictForMatSingleRow(BoosterHandle handle,
	const void* data,
	int data_type,
	int ncol,
	int is_row_major,
	int predict_type,
	int start_iteration,
	int num_iteration,
	const char* parameter,
	int64_t* out_len,
	double* out_result);

	left_write_pos_[0] = 0;
	right_write_pos_[0] = 0;
	for (int i = 1; i < nblock; ++i) {
	left_write_pos_[i] = left_write_pos_[i - 1] + left_cnts_[i - 1];
	right_write_pos_[i] = right_write_pos_[i - 1] + right_cnts_[i - 1];
	}
	data_size_t left_cnt = left_write_pos_[nblock - 1] + left_cnts_[nblock - 1];

[WIP] support dataset rows more then max(int32_t) #5454

[WIP] support dataset rows more then max(int32_t) #5454

Conversation

junpeng0715 commented Aug 31, 2022 • edited Loading

Description

Additional Comments

guolinke commented Sep 1, 2022

junpeng0715 commented Sep 2, 2022 • edited Loading

guolinke commented Sep 2, 2022 • edited Loading

StrikerRUS commented Sep 4, 2022

svotaw Sep 4, 2022

Choose a reason for hiding this comment

junpeng0715 Sep 5, 2022

Choose a reason for hiding this comment

svotaw Sep 4, 2022

Choose a reason for hiding this comment

junpeng0715 commented Sep 5, 2022 • edited Loading

junpeng0715 commented Sep 5, 2022

guolinke commented Sep 5, 2022

junpeng0715 commented Sep 5, 2022

junpeng0715 commented Sep 6, 2022

guolinke commented Sep 7, 2022

junpeng0715 commented Sep 11, 2022

junpeng0715 commented Sep 21, 2022

ghost commented Sep 22, 2022

guolinke commented Oct 9, 2022

junpeng0715 commented Oct 11, 2022

junpeng0715 commented Oct 13, 2022 • edited Loading

jameslamb commented Oct 18, 2022

github-actions bot commented Nov 15, 2023

junpeng0715 commented Aug 31, 2022 •

edited

Loading

junpeng0715 commented Sep 2, 2022 •

edited

Loading

guolinke commented Sep 2, 2022 •

edited

Loading

junpeng0715 commented Sep 5, 2022 •

edited

Loading

junpeng0715 commented Oct 13, 2022 •

edited

Loading