Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Timer's misleading behaviour when epoch completion time calculated #2157

Open
Priyansi opened this issue Aug 11, 2021 · 7 comments · May be fixed by #2471
Open

Timer's misleading behaviour when epoch completion time calculated #2157

Priyansi opened this issue Aug 11, 2021 · 7 comments · May be fixed by #2471
Assignees

Comments

@Priyansi
Copy link
Contributor

When I tried calculating the time taken to complete a single epoch via Timer, the handlers attached to trainer before Timer were executed first, and thus their time also got recorded by the Timer too for a single epoch. Therefore the true time taken for epoch completion, provided by trainer.state.times, is less than what Timer calculated. This can be misleading. More clarification on how this actually works in the docs would be appreciated. Or the Timer's functionality can be enhanced to step before all other handlers attached to an event could also be helpful. Notebook to quickly verify this here.

@sdesrozis
Copy link
Contributor

The logic is here

def _internal_run(self) -> State:

The state timers should return the times for specific events. However, if an handler, during a given event, tries accessing the state timer of this event, the time is not yet computed. Yet the value in the timer is the time spent in the previous events rather than an undefined value.

I agree that it should be explained, or modified. For instance, the state timer of an event could be lazily updated each time it is reached by the user in the event. It could be done but not sure it really worth…

@FarehaNousheen
Copy link
Contributor

Hi, I would like to work on this issue, please assign this to me. Also, please provide examples to understand the issue better.

@vfdev-5
Copy link
Collaborator

vfdev-5 commented Oct 18, 2021

Also, please provide examples to understand the issue better.

@FarehaNousheen please read attentively #2157 (comment). There is a notebook with a concrete example provided by Priyansi

@vfdev-5 vfdev-5 added the PyDataGlobal PyData Global 2020 Sprint label Oct 21, 2021
@vfdev-5
Copy link
Collaborator

vfdev-5 commented Nov 8, 2021

Things to do for this issue:

Time taken by a single epoch calculated by Timer: 197.58
Time Taken for single epoch calculated by State of engine : 170.89924907684326
  • Modify the notebook such that timer is attached before running the validation:
    Current notebook
trainer = create_supervised_trainer(model, optimizer, criterion, device=device)
evaluator = create_supervised_evaluator(
    model, metrics={"accuracy": Accuracy(), "loss": Loss(criterion)}, device=device
)

@trainer.on(Events.EPOCH_COMPLETED)
def log_training_results(trainer):
    evaluator.run(train_loader)
    metrics = evaluator.state.metrics
    print(
        f"Training Results - Epoch[{trainer.state.epoch}] Avg accuracy: {metrics['accuracy']:.2f} Avg loss: {metrics['loss']:.2f}"
    )
    print(
    f"Time taken by a single epoch calculated by Timer: {timer.value():.2f}"
    )
    print(
        f"Time Taken for single epoch calculated by State of engine : {trainer.state.times['EPOCH_COMPLETED']}"
    )


@trainer.on(Events.EPOCH_COMPLETED)
def log_validation_results(trainer):
    evaluator.run(val_loader)
    metrics = evaluator.state.metrics
    print(
        f"Validation Results - Epoch[{trainer.state.epoch}] Avg accuracy: {metrics['accuracy']:.2f} Avg loss: {metrics['loss']:.2f}"
    )
    print(
    f"Time taken by a single epoch calculated by Timer: {timer.value():.2f}"
    )
    print(
        f"Time Taken for single epoch calculated by State of engine : {trainer.state.times['EPOCH_COMPLETED']}"
    )

timer = Timer(average=True)
timer.attach(trainer,
             start=Events.EPOCH_STARTED,
             resume=Events.EPOCH_STARTED,
             pause=Events.EPOCH_COMPLETED,
             step=Events.EPOCH_COMPLETED)

Expected modfication:

trainer = create_supervised_trainer(model, optimizer, criterion, device=device)
evaluator = create_supervised_evaluator(
    model, metrics={"accuracy": Accuracy(), "loss": Loss(criterion)}, device=device
)

timer = Timer(average=True)
timer.attach(trainer,
             start=Events.EPOCH_STARTED,
             resume=Events.EPOCH_STARTED,
             pause=Events.EPOCH_COMPLETED,
             step=Events.EPOCH_COMPLETED)

@trainer.on(Events.EPOCH_COMPLETED)
def log_training_results(trainer):
    evaluator.run(train_loader)
    metrics = evaluator.state.metrics
    print(
        f"Training Results - Epoch[{trainer.state.epoch}] Avg accuracy: {metrics['accuracy']:.2f} Avg loss: {metrics['loss']:.2f}"
    )
    print(
    f"Time taken by a single epoch calculated by Timer: {timer.value():.2f}"
    )
    print(
        f"Time Taken for single epoch calculated by State of engine : {trainer.state.times['EPOCH_COMPLETED']}"
    )


@trainer.on(Events.EPOCH_COMPLETED)
def log_validation_results(trainer):
    evaluator.run(val_loader)
    metrics = evaluator.state.metrics
    print(
        f"Validation Results - Epoch[{trainer.state.epoch}] Avg accuracy: {metrics['accuracy']:.2f} Avg loss: {metrics['loss']:.2f}"
    )
    print(
    f"Time taken by a single epoch calculated by Timer: {timer.value():.2f}"
    )
    print(
        f"Time Taken for single epoch calculated by State of engine : {trainer.state.times['EPOCH_COMPLETED']}"
    )
  • Re-run the notebook and report here the results for the time measurements for a single epoch
    • put the logs here
    • put the link on the colab here and make sure that colab is accessible to everyone

@FarehaNousheen
Copy link
Contributor

That's a really detailed description. Thank you for sharing it. I'll work on it tomorrow.

@FarehaNousheen
Copy link
Contributor

FarehaNousheen commented Nov 24, 2021

Hi all, I have modified the notebook with the timer attached before running the validation.
Sharing the link of Google colab here
https://colab.research.google.com/drive/1SJYqgmzYvGivruq9OAS0q6MADjBFqI13

@FarehaNousheen
Copy link
Contributor

I executed the notebook with the Timer attached before the second code snippet and removed the cell where the Timer is attached after validation. In doing so I got the result that is updated on the copy of the Colab shared above. In short sharing results below.
Training Results - Epoch[1] Avg accuracy: 0.95 Avg loss: 0.16
Time taken by a single epoch calculated by Timer: 73.73
Time Taken for single epoch calculated by State of engine : 73.72484469413757
Validation Results - Epoch[1] Avg accuracy: 0.95 Avg loss: 0.15
Time taken by a single epoch calculated by Timer: 73.73
Time Taken for single epoch calculated by State of engine : 73.72484469413757
Training Results - Epoch[2] Avg accuracy: 0.97 Avg loss: 0.13
Time taken by a single epoch calculated by Timer: 73.51
Time Taken for single epoch calculated by State of engine : 73.5108630657196
Validation Results - Epoch[2] Avg accuracy: 0.96 Avg loss: 0.13
Time taken by a single epoch calculated by Timer: 73.51
Time Taken for single epoch calculated by State of engine : 73.5108630657196

@vfdev-5 vfdev-5 linked a pull request May 2, 2022 that will close this issue
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants