feat: 810 feature add huggingface hubutilstelemetry #916
Conversation
Documentation for this PR has been built. You can view it at: https://distilabel.argilla.io/pr-916/
CodSpeed Performance Report: Merging #916 will not alter performance.
Instead of adding the calls to add_step and add_edge, could we send the information from pipeline.dump() on BasePipeline.run? That way we would rely on the same relevant information we already use for serialization/caching, and it would be a single call containing all the information from the pipeline.
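A minimal sketch of that idea, assuming the telemetry is sent through huggingface_hub's send_telemetry helper; the topic name and the keys picked from the dump are illustrative assumptions, not the PR's actual code:

```python
# Sketch only: report the serialized pipeline once from BasePipeline.run,
# instead of calling the telemetry client on every add_step / add_edge.
from huggingface_hub.utils import send_telemetry


def track_pipeline_run(pipeline) -> None:
    dump = pipeline.dump()  # the same payload already used for serialization/caching
    # The key path below is an assumption about the dump layout; in practice the
    # dump would be trimmed to the fields we actually want to report.
    payload = {"num_steps": len(dump.get("pipeline", {}).get("steps", []))}
    # send_telemetry fails silently and respects the user's telemetry opt-out.
    send_telemetry(topic="distilabel/pipeline", user_agent=payload)
```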
I think it could be useful to track the exceptions raised from the pipeline, as you mention. Maybe it would be easier in the _StepWrapper, as that's the container in charge of running the steps in each process.
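If exceptions were tracked there, it could look roughly like the following; the wrapper class and the payload fields are hypothetical stand-ins for the real _StepWrapper:

```python
# Illustrative only: wrap the step execution and report the exception type.
from huggingface_hub.utils import send_telemetry


class StepWrapperSketch:
    """Stand-in for _StepWrapper, which runs a step inside its own process."""

    def __init__(self, step):
        self.step = step

    def run(self):
        try:
            return self.step.process()  # placeholder for the real execution loop
        except Exception as exc:
            # Only the step class and the exception type are reported, never user data.
            send_telemetry(
                topic="distilabel/exception",
                user_agent={
                    "step": type(self.step).__name__,
                    "exception": type(exc).__name__,
                },
            )
            raise
```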
Co-authored-by: Agus <[email protected]>
Logging on dump: the pipeline dump would be an option, but I think the underlying way to write to HF telemetry would be less convenient, also because we would eventually still need custom code to transform the dump and extract the info in the right shape.
Logging exceptions: what do you see as interesting usage exceptions? I assumed it would be more high-level usage, like DAG validation or parameter-passing errors, instead of the low-level exceptions within the _StepWrapper.
I'm not sure how we are going to access the data afterwards, I guess it depends, but accessing the dump of the pipeline and cleaning out the content we don't want seems easier, at least from that point of view. For the exceptions I'm really not sure what we should track; I guess everything that is a user error, so you are right not to track them at the _StepWrapper level.
@plaguss, I will do the pipeline tracking based on the dump and make individual calls from the telemetry clients instead. @gabrielmbmb, are there any specific exceptions you would like to see captured? I capture the RuntimeErrors, but perhaps some user errors would be interesting too; I'm not sure what we can capture cleanly without filling the code up with try-excepts. We can also start with this and consider adding more at a later stage, in case we feel we miss anything.
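One way to capture those RuntimeErrors in a single place, rather than sprinkling try-excepts, would be a thin wrapper around the run call; the helper below is a hedged sketch, not the code in this PR:

```python
# Sketch: a single try/except around the whole run instead of many scattered ones.
from huggingface_hub.utils import send_telemetry


def run_with_error_tracking(pipeline, **kwargs):
    try:
        return pipeline.run(**kwargs)
    except RuntimeError as exc:
        # Report only the exception type before re-raising.
        send_telemetry(
            topic="distilabel/exception",
            user_agent={"exception": type(exc).__name__},
        )
        raise
```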
@plaguss, I updated the code. I decided to leave the batching within the run because I think it is also good to know how much data is being generated. We lost some tracking differentiation between generator, normal and global steps, but if we need it we would be able to back-track.
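The batch-level tracking could be as simple as aggregating counts and reporting them in one telemetry call at the end of the run; the function and its input are assumptions about how batches are exposed, not the PR's implementation:

```python
# Sketch: aggregate how much data a run produced and report it in one call.
from huggingface_hub.utils import send_telemetry


def track_generated_data(batches) -> None:
    """`batches` is assumed to be an iterable of lists of rows."""
    num_batches = 0
    num_rows = 0
    for batch in batches:
        num_batches += 1
        num_rows += len(batch)
    send_telemetry(
        topic="distilabel/run",
        user_agent={"batches": num_batches, "rows": num_rows},
    )
```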
Closes #810
run method of pipeline, for example? @plaguss)