[WIP] support dataset rows more than max(int32_t) #5454
Will this affect the training speed? Are there any benchmarks?
Yeah, it is fine to benchmark. @imatiach-msft, @AlbertoEAF, @svotaw, can you help with SWIG?
I believe this is related: #4091.
int32_t num_local_row,
const int64_t* num_per_col,
int64_t num_sample_row,
int64_t num_local_row,
Do you need to support int64_t for sample data and data local to 1 machine? Not opposing, just making sure you want to do that. If so, you prob need to change the sample_indices to int64_t as well?
@svotaw Thanks for the feedback.
This is not only about using 1 machine. I'm not sure: even if int/int32_t is enough for the worker machines, is it enough for the driver machine?
@@ -91,10 +91,10 @@ LIGHTGBM_C_EXPORT int LGBM_GetSampleCount(int32_t num_total_row,
  * \param[out] out_len Number of indices
  * \return 0 when succeed, -1 when failure happens
  */
-LIGHTGBM_C_EXPORT int LGBM_SampleIndices(int32_t num_total_row,
+LIGHTGBM_C_EXPORT int LGBM_SampleIndices(int64_t num_total_row,
All of these signature changes are breaking changes. Just making sure you have considered that.
@guolinke Tested Higgs Dataset in AWS
LightGBM/swig/ChunkedArray_API_extensions.i Lines 16 to 21 in 2e54d5f
Yes @StrikerRUS, I saw this source comment. I'm finding a way to avoid the SWIG issue.
Force-pushed: e1b07b8 to 649ef60
@junpeng0715 given the performance loss, I think a better solution is to make the long indices an option, e.g. make it a flag (define in CMake) in the compilation. |
@guolinke Re-ran the
It feels like changing the modified |
Hi @imatiach-msft, @AlbertoEAF, @svotaw, could you help with this?
cc @shiyu1994 to confirm the performance test. |
Hi @imatiach-msft, @AlbertoEAF, @svotaw, @guolinke. LightGBM uses SWIG's stdint.i for int64_t (Line 21 in 952458a).
SWIG's definition of int64_t requires SWIGWORDSIZE64 to be defined when the word size is 64-bit. So by default, in LightGBM's SWIG layer, int64_t is long long (SWIGWORDSIZE64 is not defined),
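The type mismatch described here can be illustrated in plain C++. The sketch below is only an illustration and is not LightGBM or SWIG code: `SumCounts` is a hypothetical stand-in for a C API function that takes `const int64_t*`. It shows that `long` and `long long` are distinct types even when both are 64 bits wide, and that which one `int64_t` aliases depends on the platform.

```cpp
#include <cstdint>
#include <type_traits>

// long and long long are distinct types in C++, even on platforms
// where both happen to be 64 bits wide.
constexpr bool kLongIsLongLong = std::is_same<long, long long>::value;

// Which of the two std::int64_t aliases is platform-dependent
// (typically long on 64-bit Linux, long long on Windows).
constexpr bool kInt64IsLong = std::is_same<std::int64_t, long>::value;
constexpr bool kInt64IsLongLong =
    std::is_same<std::int64_t, long long>::value;

// Hypothetical stand-in for a C API function taking const int64_t*.
inline std::int64_t SumCounts(const std::int64_t* counts, int n) {
  std::int64_t total = 0;
  for (int i = 0; i < n; ++i) {
    total += counts[i];
  }
  return total;
}
```

On a platform where `int64_t` is `long`, a wrapper that hands `SumCounts` a `const long long*` does not compile without a cast, because pointers to distinct types do not convert implicitly.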
Hi guys, is there any update on this ticket?
@junpeng0715 Sorry for the delay. I checked with @shiyu1994 and confirmed that int64_t would affect the speed of the data partition algorithm. So for now, the solution is to "make it a flag (define in CMake) in the compilation.", and users will need to recompile if they need the int64_t version.
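A compile-time flag of this kind could look roughly like the sketch below. The macro name `USE_INT64_ROW_INDEX`, the alias `row_index_t`, and the helper `CountLeft` are all hypothetical and are not taken from LightGBM. The assumption is that CMake would pass the definition to the compiler (for example with `target_compile_definitions`), so that the default build keeps 32-bit indices.

```cpp
#include <cstdint>

// USE_INT64_ROW_INDEX is a hypothetical macro name; a CMake option
// would be expected to define it for builds that need 64-bit row indices.
#ifdef USE_INT64_ROW_INDEX
using row_index_t = std::int64_t;
#else
using row_index_t = std::int32_t;  // default: keep the faster 32-bit index
#endif

// Hypothetical partition-style helper: counts the rows whose value falls
// at or below the threshold, using the selected index type throughout.
inline row_index_t CountLeft(const float* values, row_index_t n,
                             float threshold) {
  row_index_t left = 0;
  for (row_index_t i = 0; i < n; ++i) {
    if (values[i] <= threshold) {
      ++left;
    }
  }
  return left;
}
```

With this layout, only users who recompile with the flag get the wider index type, and the default build is unchanged.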
Thank you so much! OK, will do that!
I raised a new ticket for "make it a flag (define in CMake) in the compilation." #5540 |
Based on #5540 (comment), #5540 replaces this PR. I'm going to close this one. |
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this. |
Description
This is a WIP PR to allow more than MAX(int32_t) of dataset rows.
c_api.h (int) and c_api.cpp (int32_t) for ncol
LightGBM/include/LightGBM/c_api.h Lines 1221 to 1231 in 2b8fe8b
Threading.h, when nblock = 0, set left_cnt = 0
LightGBM/include/LightGBM/utils/threading.h Lines 162 to 168 in 2b8fe8b
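For context on the row limit named in the description, the largest value an int32_t can hold is 2147483647. The sketch below is a minimal illustration and not LightGBM code: `RowCountFitsInt32` is a hypothetical helper that checks whether a 64-bit row count could be stored in a 32-bit index without overflow.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: true if a non-negative 64-bit row count can be
// represented by a 32-bit signed index without overflow.
inline bool RowCountFitsInt32(std::int64_t num_rows) {
  return num_rows >= 0 &&
         num_rows <= std::numeric_limits<std::int32_t>::max();
}
```

A dataset with 2147483648 rows or more fails this check, which is the case this PR aims to support.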
Additional Comments
It seems that SWIG converts a parameter defined as const int64_t* to long long const*. Could you give some suggestions for this error?