-
-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor ClearML NMT build job #101
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 66 of 66 files at r1, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @ddaspit)
Damien - can you explain what is happening here? Modifying the SMT models is not specified in the commit messages. |
Won't this break all existing builds in the Mongo DB? Should just clear everything out or do we need a migration plan? |
These functions do nothing. Should they be removed? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 14 of 66 files at r1, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @ddaspit)
Are these supposed to do more than just log? |
Previously, johnml1135 (John Lambert) wrote…
Could this be defined once in RecurrentTask and only overridden if necessary? |
Can this be thread limited to 1.5 threads? Or configurable from app settings as a float? |
Can you add a issue for completing this Python code? If it is incomplete, it should have an attached issue. |
957a85d
to
e3dbe61
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 54 of 66 files reviewed, 6 unresolved discussions (waiting on @Enkidu93 and @johnml1135)
src/SIL.Machine.AspNetCore/SIL.Machine.AspNetCore.csproj
line 37 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Can you add a issue for completing this Python code? If it is incomplete, it should have an attached issue.
Done.
src/SIL.Machine.AspNetCore/Configuration/IMachineBuilderExtensions.cs
line 72 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Damien - can you explain what is happening here? Modifying the SMT models is not specified in the commit messages.
This is just a refactoring of the configuration for the Thot SMT model to make it more consistent with the way that other services are configured.
src/SIL.Machine.AspNetCore/Models/TranslationEngine.cs
line 9 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Won't this break all existing builds in the Mongo DB? Should just clear everything out or do we need a migration plan?
That is a good question. Migrating would be simple. I think it would only require us to remove the BuildState
and IsCanceled
properties from all engines.
src/SIL.Machine.AspNetCore/Services/ClearMLAuthenticationService.cs
line 44 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
These functions do nothing. Should they be removed?
Done.
src/SIL.Machine.AspNetCore/Services/ClearMLMonitorService.cs
line 32 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Could this be defined once in RecurrentTask and only overridden if necessary?
Done.
src/SIL.Machine.AspNetCore/Services/NmtPreprocessBuildJob.cs
line 23 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Can this be thread limited to 1.5 threads? Or configurable from app settings as a float?
You can configure the number of workers that are available for Hangfire. The default is Math.Min(Environment.ProcessorCount * 5, MaxDefaultWorkerCount)
.
Codecov ReportAttention:
... and 4 files with indirect coverage changes 📢 Thoughts on this report? Let us know!. |
So, you have the new metrics code - does it do the same thing? Where is all the parsing of the results? I can't find it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 54 of 66 files reviewed, 7 unresolved discussions (waiting on @Enkidu93 and @johnml1135)
src/SIL.Machine.AspNetCore/Services/ClearMLService.cs
line 169 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
So, you have the new metrics code - does it do the same thing? Where is all the parsing of the results? I can't find it.
There is a GetMetric
method in the ClearMLMonitorService
class that reads a metric from a task.
How is this creating an engine? |
Previously, johnml1135 (John Lambert) wrote…
Ah it's for clearml - it doesn't exist for hangfire. |
Should we first check to see if we need to cancel any jobs? Is that done somewhere else? |
Previously, ddaspit (Damien Daspit) wrote…
sillsdev/serval#154 - opened an issue to see if this will be a problem. |
We need to finish this here: #105. |
There is no code coverage in CI for this. Probably a test or two could be added to cover deleting an engine with a build running. |
There is no CI coverage for this function. |
I know there are almost no changes, but there are no CI tests for training a segment while the engine is building. |
There are no tests to cover this function. |
Previously, johnml1135 (John Lambert) wrote…
Though it should be covered in CancelBuildAsync. What gives? |
Previously, johnml1135 (John Lambert) wrote…
Nevermind - it's covered in DeleteAsync when a job is currently running, is cancelled and then we need to wait for it to finish up. A test covering that should suffice. |
Any particular reason for the change in logic here? Why current build is null or jobstate is pending from just not active? |
This should be overridden in either development app settings or in env variables. I would remove queue and docker image - let the program fail if they are not defined. |
There should be a test to delete an engine after a build has started. |
Previously, ddaspit (Damien Daspit) wrote…
I started an issue here - #105. Where is your issue? |
Previously, ddaspit (Damien Daspit) wrote…
Or, would the states just be ignored? Should they be transitioned? Maybe keep the old parameters and check at initialization if the depreciated fields are there - and then transition them? |
Previously, ddaspit (Damien Daspit) wrote…
Whatever it is, it looks more elegant. I am assuming it will get tested in E2E testing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 42 of 66 files at r1, 12 of 12 files at r2, all commit messages.
Reviewable status: all files reviewed, 12 unresolved discussions (waiting on @ddaspit)
e3dbe61
to
dfcce0e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 58 of 66 files reviewed, 12 unresolved discussions (waiting on @Enkidu93 and @johnml1135)
src/SIL.Machine.AspNetCore/SIL.Machine.AspNetCore.csproj
line 37 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
I started an issue here - #105. Where is your issue?
The issue is #103.
src/SIL.Machine.AspNetCore/Models/TranslationEngine.cs
line 9 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Or, would the states just be ignored? Should they be transitioned? Maybe keep the old parameters and check at initialization if the depreciated fields are there - and then transition them?
I don't think it is worth migrating them. They are only used when a build is running. When we upgrade, we will shut everything down, so there won't be any running builds. If we don't remove them, the Mongo driver will throw an exception when it tries to deserialize a translation engine doc into a model. I added SetIgnoreExtraElements(true)
to the TranslationEngine
map. This will cause the Mongo driver to ignore the old properties. We won't need to migrate or remove the properties.
src/SIL.Machine.AspNetCore/Services/NmtEngineService.cs
line 64 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Should we first check to see if we need to cancel any jobs? Is that done somewhere else?
Done.
src/SIL.Machine.AspNetCore/Services/SmtTransferEngineService.cs
line 69 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
There is no code coverage in CI for this. Probably a test or two could be added to cover deleting an engine with a build running.
Done.
src/SIL.Machine.AspNetCore/Services/SmtTransferEngineService.cs
line 118 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
There is no CI coverage for this function.
Done.
src/SIL.Machine.AspNetCore/Services/SmtTransferEngineService.cs
line 150 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
I know there are almost no changes, but there are no CI tests for training a segment while the engine is building.
Done.
src/SIL.Machine.AspNetCore/Services/SmtTransferEngineService.cs
line 223 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Nevermind - it's covered in DeleteAsync when a job is currently running, is cancelled and then we need to wait for it to finish up. A test covering that should suffice.
Done.
src/SIL.Machine.AspNetCore/Services/SmtTransferEngineStateService.cs
line 51 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Any particular reason for the change in logic here? Why current build is null or jobstate is pending from just not active?
This was updated, because of the addition of the canceling state.
src/SIL.Machine.Serval.EngineServer/appsettings.json
line 21 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
And remove s3 uri - that changes based upon environment.
Done.
src/SIL.Machine.Serval.JobServer/appsettings.json
line 22 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Remove queue, docker image and sharedfile.
Done.
src/SIL.Machine.Serval.JobServer/appsettings.json
line 23 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Actually, it will throw an exception when making the periodic timer: https://learn.microsoft.com/en-us/dotnet/api/system.threading.periodictimer.-ctor?view=net-7.0
This disables the ClearMLMonitorService
on the job server. I put in a check for a zero TimeSpan. There is no need for it to run on both the engine server and the job server.
tests/SIL.Machine.AspNetCore.Tests/Services/NmtEngineServiceTests.cs
line 49 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
There should be a test to delete an engine after a build has started.
Done.
e2ab711
to
ecc166f
Compare
If I am correct, this should allow it to not crash with upgrading - though it will still remove all builds... |
I am confused why it is broken up. Would we ever cancel a build job instead of a build? What does cancelling a build do over and above just cancelling the job? |
Now CreateAsync is showing as not covered by tests - weird: https://app.codecov.io/gh/sillsdev/machine/blob/refactor-clearml-job/src%2FSIL.Machine.AspNetCore%2FServices%2FSmtTransferEngineService.cs |
Ok the same question for breaking out the cancel build and cancel build job. The only difference is the lock and updating the state? Why? Why would I want to call one or the other? |
Previously, ddaspit (Damien Daspit) wrote…
Can we have a parameter "enable polling" defaulted to false and instead place this on the engine server? I believe that would be clearer. |
Previously, ddaspit (Damien Daspit) wrote…
Sorry to be finicky - can the test be called DeleteWhileBuildingAsync or similar? |
DeleteWhileBuildingAsync? |
Previously, ddaspit (Damien Daspit) wrote…
How did you get around this function (to be able to delete it)? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 9 of 9 files at r3, 2 of 2 files at r4, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @ddaspit)
ecc166f
to
a34ec44
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 59 of 66 files reviewed, 7 unresolved discussions (waiting on @Enkidu93 and @johnml1135)
src/SIL.Machine.AspNetCore/Configuration/IMachineBuilderExtensions.cs
line 224 at r3 (raw file):
Previously, johnml1135 (John Lambert) wrote…
If I am correct, this should allow it to not crash with upgrading - though it will still remove all builds...
Yes, Mongo won't throw an exception, because an engine doc contains unrecognized properties. The only engines that will be in a bad state are those that had running builds when the servers were restarted. We should ensure that there aren't any active builds when we perform the upgrade, and we should be fine.
src/SIL.Machine.AspNetCore/Services/NmtEngineService.cs
line 151 at r3 (raw file):
Previously, johnml1135 (John Lambert) wrote…
I am confused why it is broken up. Would we ever cancel a build job instead of a build? What does cancelling a build do over and above just cancelling the job?
This private method allows me to reuse the same code in the CancelBuildAsync
and DeleteAsync
methods.
src/SIL.Machine.AspNetCore/Services/SmtTransferEngineService.cs
line 223 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
How did you get around this function (to be able to delete it)?
The distributed lock ensures that we are in a good state to delete the engine, so we don't need to wait until the build is completely finished. I never really liked waiting, since it could cause the Serval delete request to hang.
src/SIL.Machine.AspNetCore/Services/SmtTransferEngineService.cs
line 39 at r3 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Now CreateAsync is showing as not covered by tests - weird: https://app.codecov.io/gh/sillsdev/machine/blob/refactor-clearml-job/src%2FSIL.Machine.AspNetCore%2FServices%2FSmtTransferEngineService.cs
Done.
src/SIL.Machine.AspNetCore/Services/SmtTransferEngineService.cs
line 197 at r3 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Ok the same question for breaking out the cancel build and cancel build job. The only difference is the lock and updating the state? Why? Why would I want to call one or the other?
This private method allows me to reuse the same code in the CancelBuildAsync
and DeleteAsync
methods.
src/SIL.Machine.Serval.JobServer/appsettings.json
line 23 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Can we have a parameter "enable polling" defaulted to false and instead place this on the engine server? I believe that would be clearer.
Done.
tests/SIL.Machine.AspNetCore.Tests/Services/NmtEngineServiceTests.cs
line 49 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Sorry to be finicky - can the test be called DeleteWhileBuildingAsync or similar?
Done.
tests/SIL.Machine.AspNetCore.Tests/Services/SmtTransferEngineServiceTests.cs
line 98 at r3 (raw file):
Previously, johnml1135 (John Lambert) wrote…
DeleteWhileBuildingAsync?
Done.
Previously, ddaspit (Damien Daspit) wrote…
Ah, one is private, one is public. I see. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 7 of 7 files at r5, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @ddaspit)
Some failing builds - https://github.com/sillsdev/machine/actions/runs/6434169750/job/17472827308. We should really run e2e on pull requests. |
- add support for multiple build stages - add support for running build jobs on Hangfire or ClearML - add BuildJobService - categorize build jobs into CPU or GPU jobs - decouple build job runners from translation engines - fix issues with S3FileStorage - fix issues with ClearMLService
a34ec44
to
c8a27cc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r6, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @ddaspit)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I figured out why the E2E tests are failing. The appsettings changed and need to be updated in the Serval repo. In the future, it would be great if we could find a way to keep the appsettings for Machine in the Machine repo.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @ddaspit)
This change is