Change Log

Full Changelog

Features and improvements:

  • Pod stuck in unknown status when kubernetes node is down #720
  • [proposal] cleanup jobs after finished #718
  • [v1alpha2] Remove redundant code about status #713
  • [v1alpha2] Invalid job spec not reported in TFJob status #707
  • [v1alpha2] Error when host name is not svc.cluster.local #703
  • how to upgrade smoothly from v1alpha1 to v1alpha2? #697
  • [v1alpha2] Validate the TFJob converted from unstructured #682
  • [feature] Add Cleanup Policy to TFJob Spec #536
  • Enable kube-arbitrator as scheduler for tensorflow #349

Fixed bugs:

  • Unable to check logs in TFJob ui for v1alpha2 #723
  • Pod stuck in unknown status when kubernetes node is down #720
  • [v1alpha2] Invalid Job Status #712
  • [v1alpha2] Error when host name is not svc.cluster.local #703
  • latest.Status.StartTime is nil:invalid memory address or nil pointer dereference #608
  • tf-operator throws runtime error: invalid memory address or nil pointer dereference #596

Closed issues:

  • Ability to prefer using all gpus on a single node #781
  • test_runner.py is using wrong util module for JobTimeoutError #780
  • [Test Flake] Intermittent test failures: tensorflow.python.framework.errors_impl.UnavailableError: OS Error #778
  • Latest docker Image on wrong commit #775
  • PS still running after tfjob is complete #774
  • TF_CONFIG in tf-operator:v20180724-13863edf missing Environment: cloud #772
  • TF_CONFIG cluster spec has wrong FQDN name #770
  • Error syncing tfjob: Failed to found the port #768
  • Events don't show up in kubectl describe tfjobs #763
  • v1alpha2 doesn't work TF.estimator for TF <= 1.6 ; need to add environment:cloud to TF_CONFIG #761
  • Update and move README.md to website #760
  • Scope TFJob operator to only claim jobs in a given namespace #759
  • TFJobs UI returns 500s and json parse errors displaying pod information or creating job #754
  • [v1alpha2] Job should be marked completed when worker 0 exits but other workers are still running #751
  • v1 and v2 E2E tests appear to be stomping on each other #748
  • [Test Flake] tf_job_client.py needs to handle case where conditions is none #744
  • tf-dashboard show workers of all the tfjobs when querying a specific tfjob #737
  • [build] Delete build/images/tf_operator/build_and_push.py #736
  • tf-operator synPdb failed when enable-gang-scheduler #729
  • not proper log message #727
  • [v1alpha2] Invalid Job spec crashes operator #706
  • unable to create a tfjob in the UI; namespace not set #701
  • Wrong comment when setting default CleanPodPolicy #698
  • file_cache is unavailable when using oauth2client >= 4.0.0 #696
  • [v1alpha2] CreatedCondition is not set #680
  • Make it easier to debug/develop E2E tests #655
  • [v1alpha2][log] Use logrus instead of glog in service_control #635
  • [v1alpha2] Add PDB of TFReplicaSet for gang scheduling by kube-arbitrator #575
  • Update releaser to use Argo. #400

Merged pull requests:

  • Typo in TTLSecondsAfterFinished json field #799 (jian-he)
  • Avoid logging for non-TFJob pod #798 (jian-he)
  • Job completion time is not set for job FAILED state #797 (jian-he)
  • reconcileTFJobs is always triggered even with no update #796 (jian-he)
  • Restructuring common utility functions #792 (johnugeorge)
  • Mark TFJob succeeded if worker 0 completed. #791 (ScorpioCPH)
  • Rename the async keyword argument to async_req #790 (ojarjur)
  • Scope tf-operator to a namespace #789 (ankushagarwal)
  • Fix the name of the "JobTime(out)Error" class #788 (ojarjur)
  • Add SchedulerName in V1alpha2 #787 (ChanYiLin)
  • OWNERS: Add ChanYiLin as reviewers #784 (ChanYiLin)
  • Add retries to deal with test flakes related to UnavailableError. #779 (jlewi)
  • Renaming TFJobController to TFController #777 (johnugeorge)
  • Shared implementation of operator code #773 (johnugeorge)
  • WaitForJob should use conditions for v1alpha2. #771 (jlewi)
  • OpenAPI: update openapi_generated.go to support TTLSecondsAfterFinished #769 (JetMuffin)
  • Refactoring TF operator code #767 (johnugeorge)
  • TFCONFIG needs to set environment:cloud to support older versions. #766 (jlewi)
  • Improve meta information in log messages to make it easier to debug jobs #765 (jlewi)
  • Replace contents of README.md with a link to kubeflow.org #764 (jlewi)
  • linter: Rename gas to gosec and fix linting errors #758 (gaocegege)
  • Cleanup tf-job after a configured TTL #753 (ccding)
  • Name Error: mv tj_job_mnist.yaml to tf_job_mnist.yaml #752 (xieydd)
  • Prevent multiple versions of an E2E test from clobbering each other. #749 (jlewi)
  • Revert "cleanup jobs after finished (#725)" #746 (ccding)
  • Fix a test flake caused by conditions being None #745 (jlewi)
  • OWNER: Add cheyang, Remove mitake #741 (gaocegege)
  • build: Remove the useless script #740 (gaocegege)
  • fix list all the pods of tfjob #738 (cheyang)
  • test fix: None type check #735 (kunmingg)
  • fix not proper log message #734 (ChanYiLin)
  • Generate api information in OpenAPI model and register types to scheme #733 (JetMuffin)
  • Add err msg for TFJob from Unstructured #732 (xychu)
  • Add retrying to log_status function #728 (ankushagarwal)
  • Update developer_guide.md for v1alpha2 #726 (lovejoy)
  • cleanup jobs after finished #725 (ccding)
  • Fix a log function issue #724 (lovejoy)
  • delete pdb when tfjob is terminated #721 (ChanYiLin)
  • [v1alpha2] Add PDB of TFReplicaSet for gang scheduling by kube-arbitrator #717 (codeflitting)
  • [v1alpha2]Remove redundant code about status and fix bug of invalid job status #715 (codeflitting)
  • OWNERS: Add yph152 and codeflitting as reviewers #714 (gaocegege)
  • [v1alpha2] Add more validation of TFJobSpec #711 (codeflitting)
  • Fix sub domain issue #704 (ScorpioCPH)
  • add ValidateAlphaTwoTFJobSpec to check v1alpha2.TFJobSpec is valid #702 (codeflitting)
  • [v1alpha2] controller_pod_test: Add test cases for evaluator #700 (codeflitting)
  • Wrong comment when setting default CleanPodPolicy #699 (jian-he)
  • Use logrus instead of glog in service_control #695 (codeflitting)
  • [v1alpha2]update tfjob condition for Created #694 (yph152)
  • setup cors for redirects under iap #688 (kkasravi)
  • define cleanup policy #685 (cheyang)

v0.2.0-rc1 (2018-06-21)

Full Changelog

Features and improvements:

  • [v1alpha2] Set event for tfjob when spec is not valid #620
  • [enhancement] Fix the gofmt support #586
  • [go] Use dep instead of glide to reduce the size of vendor #556
  • [v1alpha2] Enhance the logic about sync #547
  • [v1alpha2] Use structured log #537
  • [log] investigate zap #534
  • [v1alpha2] Try to not to always claim pods #533
  • [v1alpha2] Support customized port #532
  • [v1alpha2] start using kubeconfig #522
  • v1alpha2 integration #521
  • TFJob operator surface queue metrics #503
  • [api] Remove pending pods from active pods #484
  • [enhancement] Set StartTime for TFJob status #475
  • [Feature] Support "eval" worker in tf-operator #444
  • Add appropriate logging fields to the tf-operator log messages #424
  • [enhancement] Refactor docs #379
  • Deprecate TfPort and set default port for users #327
  • [enhancement] Add e2e test cases for recorder #317
  • Make the TfJob controller more event driven #314
  • Potential data race, maybe #302
  • Don't leave pods running just to get logs #128
  • Add hyperparameter tuning? #112
  • Use headless services for Training jobs #40
  • More validation of TfJob #25

Fixed bugs:

  • [v1alpha2] RealServiceControl does not set owner reference #616
  • TfJob operator stops working on invalid spec #561
  • [v1alpha2]tfjob restartPolicy for Never #555
  • [v1alpha2] Potential bugs when there is one worker succeeded #538
  • [v1alpha2][test] Avoid potential data race problem #530
  • Phase is wrong unexpected TfJob phase: Done #110

Closed issues:

  • [v1alpha2] Make restart policy a pointer #692
  • [v1alpha2] Need conditions Succeeded and Failed indicating when job is done #673
  • [v1alpha2] add pod label with job name (without namespace) #672
  • [v1alpha2] Pods not deleted when job finishes #671
  • [v1alpha2] conditions not updated #668
  • [v1alpha2] Move control interface to separate package #665
  • [v1alpha2] Move test util to separate package #664
  • Speedup E2E test by running build and setup cluster in parallel #659
  • In TFjob, when the workers Completed, i want the ps Completed too, how can i do? #657
  • [v1alpha2] service names are prefixed with namespace #654
  • [v1alpha2] Create a simple python server to be used for E2E tests of controller behavior #653
  • dep ensure give warning on k8s.io/apiserver #647
  • [v1alpha2] pod names don't include random salt #644
  • [v1alpha2]Unable to create pod #641
  • GPU tests failing; ks env doesn't exist #640
  • TFJob not marked as success when master exits but not workers #634
  • v1alpha2 - pod names don't include replica type #633
  • tensorflow on kubernetes how to pass in worker_host and ps_host to container if I use tf-operator #630
  • tf_job_client blocks forever #606
  • [v1alpha2] Need to add the v1alpha2 binaries to our Docker image #600
  • [v1alpha2] Need ksonnet package #599
  • Support deploying v1alpha2 and v1alpha1 controllers simultaneously #598
  • [v1alpha2] Remove controller_utils.go #591
  • [v1alpha2] Add CI test #589
  • [question] dist_mnist example failed to run #588
  • can not set labels #580
  • v1alpha2 should use headless services #574
  • TFJob operator should pass through annotations to the pod #573
  • [test] Test failed because of ImagePullBackOff #567
  • Servable not found for request: Latest(mnist) #552
  • [v1alpha2] The state of distributed model training. #544
  • [test] copy labels and annotations to pod from tfjob #543
  • Unable to deploy the example TfJob in the user guide #535
  • [v1alpha2] Do not set default to always for restartpolicy #524
  • E2E test steps should exit with non zero exit code if test fails #514
  • [v1alpha2] Sync commits with v1alpha1 #490
  • Use OpenAPI validation for CRDs in k8s 1.9 #437
  • default install of kubeflow no longer install tf-job-dashboard #435
  • Use DAG functionality of Argo in our E2E tests #422
  • Post submits are failing with Argo #370
  • tf-job-operator pod hangs and doesn't restart if it can't delete one of the TfJob pods #366
  • Refactor TFJobStatus in CRD API #333
  • Deprecate the TfImage field #330
  • [discussion] Differences between tensorflow/k8s and caicloud/kubeflow-controller #283
  • Does TfJob controller need to do master election? #263
  • Setup Prow PR Dashboard #255
  • API: some comments about API changes from PR #215 review #249
  • e2e test for the case that the chief is not master #235
  • Use conditions instead of phase #223
  • Submitted tfjobs cease to start running under unknown conditions #203
  • Tutorials #195
  • Copy chart to kubernetes/charts #93
  • Create a web page to list releases #70
  • tensorflow 1.4 and estimator support #61
  • Set a default value for restartPolicy #55

Merged pull requests:

v0.1.0 (2018-03-29)

Features and improvements:

  • [v1alpha2] Implement condition update #502
  • [v1alpha2] TF_CONFIG should be configurable by user #499
  • [test] All log is 404 in argo #496
  • [test] Add unit test for pkg/controller #455
  • [enhancement] Rename the cmd/tf_operator to cmd/tf-operator #363
  • Get rid of TensorBoard replica #347
  • Deprecate the ENV MY_POD_NAMESPACE and MY_POD_NAME #341
  • [feature] Does tfJob support setting different label/envVar for each worker(replicas >1)? #340
  • Manage Pods directly instead of using Job controllers #325
  • TfJobs dashboard doesn't work with K8s API server proxy or envoy proxy #323
  • E2E test delete and recreate job with same name #310
  • Use Kubeflow & ksonnet to install TfJob #239
  • [CRD] Request for input and output dirs in TFJobSpec #224
  • Make default TfImage configurable by users #207
  • refactor the TfJob to use Informer and Controller #206
  • Clean up examples; don't require cloning the repo #68
  • UI / Kubernetes Dashboard Integration #57
  • Use K8s Garbage Collection #42
  • Structured Logging For the operator #24
  • Setup continuous build of containers #19
  • Run TensorFlow server for parameter servers by default #16
  • Create Pod instead of Job #344 (ScorpioCPH)

Fixed bugs:

  • E2E tests timing out; job appears to remain in running state even though job is done. #500
  • Presubmit shows succeeded, but some test actually failed. #479
  • Tide is misconfigured for this repository. #433
  • Local releaser fails due to version_tag #360
  • Helm test failure not reported to gubernator #355
  • TfJob should be marked as failed if setup fails #218
  • Handle the case where grpcServerFilePath is the empty string #188
  • func c.findAllTfJobs() in controller.go will never reach #41
  • Permanent errors don't cause job failure #28
  • Operator Log Spam; replicas.go:287] No container named: tensorflow found for pod; assuming POD is running #23
  • TfJobRestClient.Create doesn't set kind appropriately #5

Closed issues:

  • Waiting pods start too long #461
  • Create a suitable OWNERS file in /dashboard #443
  • CI failed to setup the cluster #420
  • [docs] Add dashboard readme #411
  • Make coverall results advisory and not report as failure #406
  • Presubmits failing due to lint #404
  • [enhancement] Fix go vet errors which not caught by the compilers #395
  • User facing website for Kubeflow that details how to choose a stack #371
  • [discussion] How to set clusterspec #369
  • [discussion] Whether to create CRD in helm charts #353
  • Should resourcelock be in the same namespace as controller? #352
  • Helm test tf-job does not pass validation #351
  • Move tensorflow/k8s to kubeflow/tf-operator #350
  • Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs #346
  • [Discussion] Time to start tagging releases for the TF operator? #339
  • [discussion] Should group name be tensorflow.org or kubeflow.io or kubeflow.org? #337
  • dashboard silent error during calling non-existent tfjob #335
  • in dashboard, silent error when nonexistent namespace is specified #334
  • Deprecate the IsDefaultPS field #329
  • [Convention] Replace Tf with TF in CRD #328
  • Standardise labels for issues and PRs #326
  • TfJobs dashboard not showing jobs #324
  • Recreating a failed/successful job with same name doesn't work #322
  • Releaser incorrectly tags images as "dirty" #321
  • Reenable the releaser #320
  • E2E tests are not isolated #318
  • Need to mark prow job as failed if any tests fail #315
  • Remove outdated branch wbuchwalter-patch-1 #311
  • TrainingJob.reconcile not called periodically #309
  • rename master to chief #306
  • Assign resource quota for TensorBoard #304
  • Jobs evicted for lack of memory, potentially add resource field to tf-job prototype #301
  • [Discussion] Operators vs. controller pattern #300
  • [bug] Add a default pod template for PS #297
  • Bunch of pylint error messages #294
  • Fix Head #293
  • Operator deployment fails post-v20180108-190394d #292
  • Promote last known good release #290
  • [bug] metadata.ownerReferences.apiVersion is not set #288
  • fail to run example job. invalid job spec: tfReplicaSpec.TfPort can't be nil #284
  • [bug] Build log 404 in https://prow.k8s.io/?repo=tensorflow%2Fk8s #282
  • [feature] Separate the CRD and controller #281
  • Gaps in test coverage #280
  • Regression in flag name: controller-config-file #279
  • [bug] glog before flag.Parse() #275
  • build new code to new image and find some problem #274
  • Fix the releaser so we can build new images #270
  • deploy.py gives gcloud api error '... Version "1.8.1-gke.1" is invalid.' #268
  • Pods terminated without waiting #267
  • Attach appropriate header (copyright) to go files #266
  • suppose i've install the tfjob in my k8s cluster #265
  • what's the folder pkg for? #264
  • Build failing because of lint issues #256
  • what's the main change between version 0.2 and version 0.3? #247
  • SetupCluster failures unexpected keyword argument 'client_configuration' #242
  • GPU test marked as succeeded but airflow step is failing #240
  • tf_smoke.py distributed computing doesn't work on minikube #238
  • example-job can not work in private k8s cluster #233
  • Test failures aren't properly reported in Gubernator #229
  • panic: runtime error: invalid memory address or nil pointer dereference can not run in k8s 1.8.5 #212
  • Rethink the TFJob CRD #209
  • ksonnet configs for deploying the TfJob CRD & Controller #208
  • Use Argo workflow engine for CI/CD or releases #205
  • Potential issue with Tensorboard / value of simple best-practices example with tboard #202
  • Investigate using buildah to build our images #201
  • E2E tests pre & postsubmits are failing #196
  • Publishing a client to pypi #193
  • Don't require a master or chief #192
  • Make cloning the repo and building the artifacts separate commands in py/release.py #189
  • Make Airflow logs accessible #185
  • Complement docs for Python 3rd party dependencies #181
  • Helm Test fails because grpcServerFilePath is the empty string #179
  • Helm should only set --controller_config_file conditionally #175
  • Troubleshooting Guide: no matches for tensorflow.org/, Kind=TfJob #174
  • no matches for tensorflow.org/, Kind=TfJob #173
  • Failed to build TFOperator #171
  • E2E test for GPUs #164
  • TfJob doesn't work on minikube #160
  • Deleted jobs re-starting #156
  • Use coveralls.io to report and check code coverage #155
  • Clarify scope of tensorflow/k8s #150
  • After init helm, install chart failed #149
  • Helm test; insufficient permissions on RBAC clusters #135
  • Need to trim trailing slash of host string in TfJobRestClient.Watch() #130
  • results of lint test aren't reported in junit file used by gubernator #126
  • Collaborators need to be K8s members to trigger tests #122
  • Extend Test Infrastructure to run multiple E2E tests in parallel #120
  • initResource() failed; findAllTfJobs returned error: #118
  • Latest tag on gcr.io is not up to date #116
  • duplicate #115
  • postsubmit results aren't showing up in testgrid #113
  • TensorBoard replica set not deleted when job deleted. #107
  • helm permission issue on 1.8.1 #106
  • Run python unittests as part of pre/post/periodic tests #101
  • E2E tests are failing #96
  • E2E Test log should capture output from helm-test #95
  • Rename TfJob kind to remove mlkube.io #89
  • Setup travis for tensorflow/k8s #88
  • Update repo to use its new location tensorflow/k8s #86
  • mlkube.io -> tensorflow/k8s #85
  • Update prow to use repo tensorflow/k8s #84
  • periodic test is failing #83
  • runner.py needs to create build-log.txt with stdout/stderr of test #82
  • E2E tests leaking GKE clusters #80
  • No results show up if you click on mlkube-build-periodic #76
  • No results show up in prow test grid for presubmit jobs #75
  • Include TfJob name in labels #72
  • Simplify/Clarify Accelerators config #71
  • How to create TF Jobs from the user side? #67
  • Change version from beta -> alpha #65
  • API Review #64
  • Setup release process for CRD #63
  • Post submit jobs don't correctly upload artifacts to GCS #62
  • presubmit test(bootstrap.py) doesn't properly check out PRs #59
  • E2E Test for default PS server #58
  • E2E test for GPUs #54
  • Integrate with Prow for Continuous Testing #46
  • Consider how we manage replicas (stateful sets, managing pods directly) #45
  • Rename project #34
  • Structured (Json) logging for Tf Processes #32
  • If handling Add event fails, TfJob should be marked as failed with appropriate error #26
  • Provide a default value for TfPort, replicas, and tfReplicaType #22
  • Should this be converted to a Custom Resource Definition (CRD) in anticipation of 1.7 #17
  • TensorBoard Integration #13
  • Dependency management #7
  • Better GPU support #6
  • Add a creationTimestamp #4

Merged pull requests:

* This Change Log was automatically generated by github_changelog_generator