Releases: d2iq-archive/marathon
v1.8.222: Remove strict validation of external volume name (#7024)
Changes from 1.8.218 to 1.8.222
External Volume Validation changes
Relaxed name validation
Because some external volume providers require options to be encoded in the volume name, the strict validation of external volume names has been removed.
Because the uniqueness check is based on the volume name, this may lead to some inconsistencies: for the purpose of the uniqueness check, the following volumes are distinct:
"volumes": [
  {
    "external": {
      "name": "name=volumename,option1=value"
    }
  }
]
"volumes": [
  {
    "external": {
      "name": "option1=value,name=volumename"
    }
  }
]
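To make the inconsistency concrete, here is a minimal Python sketch. It assumes, as the text above implies, that the uniqueness check compares the raw name strings; the helpers volume_key and parsed_options, and the "key=value,key=value" parsing, are hypothetical and only serve the illustration.

```python
def volume_key(volume: dict) -> str:
    """Key assumed to be used by the uniqueness check: the raw name string."""
    return volume["external"]["name"]


def parsed_options(name: str) -> dict:
    """Parse a hypothetical provider-style name "k1=v1,k2=v2" into a dict."""
    return dict(part.split("=", 1) for part in name.split(","))


first = {"external": {"name": "name=volumename,option1=value"}}
second = {"external": {"name": "option1=value,name=volumename"}}

# Different strings, so no uniqueness conflict is detected...
print(volume_key(first) == volume_key(second))  # False
# ...even though both names carry exactly the same options.
print(parsed_options(volume_key(first)) == parsed_options(volume_key(second)))  # True
```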
Optional uniqueness check
Previously, Marathon would validate that an external volume with a given name is used only once across all apps. This was because the initial implementation focused on Rexray+EBS. However, multiple external volume providers now allow shared access to mounted volumes, so we introduced a way to disable the uniqueness check:
A new field, container.volumes[n].external.shared, has been added; it defaults to false. If it is set to true, the same volume name can be used by multiple containers. The shared flag has to be set to true on all external volumes with the same name; otherwise, a conflict is reported on any volume without the shared=true flag.
"container": {
  "type": "MESOS",
  "volumes": [
    {
      "external": {
        "size": 5,
        "name": "volumename",
        "provider": "dvdi",
        "shared": true,
        "options": {
          "dvdi/driver": "pxd",
          "dvdi/shared": "true"
        }
      },
      "mode": "RW",
      "containerPath": "/mnt/nginx"
    }
  ]
}
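The sharing rule above can be sketched in Python as follows. This is an illustrative model, not Marathon's actual validator (which is written in Scala): find_conflicts and the layout of the apps dictionary are hypothetical, and only the semantics of the shared flag come from these notes.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def find_conflicts(apps: Dict[str, List[dict]]) -> List[Tuple[str, str]]:
    """Return sorted (app_id, volume_name) pairs that violate the rule.

    A name used more than once is only allowed if every use sets
    external.shared to true; each use without shared=true is a conflict.
    """
    uses = defaultdict(list)  # volume name -> [(app_id, shared flag), ...]
    for app_id, volumes in apps.items():
        for vol in volumes:
            ext = vol.get("external")
            if ext is None:
                continue  # not an external volume
            # Compare with `is True` so only a real boolean true counts.
            uses[ext["name"]].append((app_id, ext.get("shared", False) is True))

    conflicts = []
    for name, entries in uses.items():
        if len(entries) < 2:
            continue  # a single use can never conflict
        conflicts.extend((app_id, name) for app_id, shared in entries if not shared)
    return sorted(conflicts)


apps = {
    "/a": [{"external": {"name": "volumename", "shared": True}}],
    "/b": [{"external": {"name": "volumename", "shared": True}}],
    "/c": [{"external": {"name": "volumename"}}],  # shared defaults to false
}
print(find_conflicts(apps))  # [('/c', 'volumename')]
```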
v1.8.218
Changes from 1.8.194 to 1.8.218
Revive and Suppress Refactoring
The revive and suppress logic was unified. In the past, Marathon would keep reviving when an instance with a reservation was expunged (case 1), or it would revive when an instance should be started (case 2). When no instance should be started, Marathon would suppress offers, which could conflict with case 1. With the refactoring, a single piece of logic decides whether to revive or suppress, and thus avoids the conflict. The change also required changing the default --min_revive_offers_interval to thirty seconds. This should avoid overriding revive calls with a suppress call too quickly. The --disable_suppress_offers flag can switch off suppress calls altogether. This should be used when Marathon fails to clean up reservations, which requires offers to be sent.
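A minimal sketch of a single revive/suppress decision point, assuming a simplified view of Marathon's state: OfferDemand, ReviveSuppressDecider and the returned labels are hypothetical names, and Marathon's real implementation differs in detail. It shows why one decision function cannot suppress while reservations still need offers, and how a minimum interval (modeled on --min_revive_offers_interval) throttles repeated revives.

```python
from dataclasses import dataclass, field


@dataclass
class OfferDemand:
    """Hypothetical summary of what currently needs offers from Mesos."""
    instances_to_launch: int = 0         # case 2: instances to start
    reservations_to_clean_up: int = 0    # case 1: expunged reserved instances


@dataclass
class ReviveSuppressDecider:
    """One place that decides between revive and suppress."""
    min_revive_interval: float = 30.0    # mirrors the new 30 s default
    suppress_enabled: bool = True        # False models --disable_suppress_offers
    last_revive: float = field(default=float("-inf"))

    def decide(self, demand: OfferDemand, now: float) -> str:
        wants_offers = (demand.instances_to_launch > 0
                        or demand.reservations_to_clean_up > 0)
        if wants_offers:
            if now - self.last_revive >= self.min_revive_interval:
                self.last_revive = now
                return "revive"
            return "wait"  # revived recently; do not spam Mesos
        # Nothing needs offers, so suppressing cannot starve either case.
        return "suppress" if self.suppress_enabled else "noop"
```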
Fixed issues
- DCOS-54927 - Fixed an issue where two independent deployments could interfere with each other, resulting in too many tasks being launched and/or a stuck deployment.
Changes from 1.8.180 to 1.8.194
Fixed issues
- DCOS_OSS-5212 - Fixed an issue that prevented reserved instances created by older Marathon versions from being restarted.
- MARATHON-8623 - Fixed an issue that could cause /v2/deployments to become stale.
- MARATHON-8624 - Fixed an issue where the presence of a TASK_UNKNOWN status could cause an API failure.
- DCOS-51375 - Fixed an issue where deployment cancellation could leak instances.
- DCOS_OSS-5211 - The initial support for volume profiles would match disk resources with a profile even if no profile was required. This behavior has been adjusted so that disk resources with profiles are only used when those profiles are required, and are not used if the service for which we are matching offers does not require a disk with that profile.
- MARATHON-8631 - In order to prepare for the general availability of the DC/OS Storage Service (DSS), Marathon will now default to disk type Mount if a persistent volume profileName is configured by the user without specifying the desired disk type. Services like DSS will populate this field to allow users to select the volumes they previously created. Mesos Root disks will not have a profileName set, so the default for persistent volumes that do not specify a profileName is still Root.
- MARATHON-8422 - Marathon now kills unreachable tasks that come back. Previously, Marathon could get stuck waiting for terminal events without ever issuing a kill.
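The two disk-related rules above (DCOS_OSS-5211 and MARATHON-8631) can be expressed as small predicates. This Python sketch is hypothetical: the function names are illustrative, and it ignores everything else Marathon considers when matching disk resources (size, roles, and so on).

```python
from typing import Optional


def default_disk_type(profile_name: Optional[str], disk_type: Optional[str]) -> str:
    """MARATHON-8631: an explicit type wins; otherwise a profile implies Mount."""
    if disk_type is not None:
        return disk_type
    return "Mount" if profile_name is not None else "Root"


def disk_matches(required_profile: Optional[str], offered_profile: Optional[str]) -> bool:
    """DCOS_OSS-5211: a profiled disk is only used when that profile is required."""
    if required_profile is None:
        return offered_profile is None  # do not consume profiled disks
    return offered_profile == required_profile
```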
v1.5.15
Introduce global throttling to Marathon health checks
Marathon health checks are a deprecated feature, and customers are strongly encouraged to switch to Mesos health checks for scalability reasons. However, we have seen a number of issues where an excessive number of Marathon health checks (HTTP and TCP) would overload parts of Marathon. Hence we introduced a new parameter, --max_concurrent_marathon_health_checks, that defines the maximum number (256 by default) of Marathon health checks (HTTP/S and TCP) that can be executed concurrently at any given moment. Note that setting a large value here while using many services with Marathon health checks will overload Marathon, leading to internal timeouts and unstable behavior.
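Global throttling of this kind is commonly implemented with a counting semaphore. The following Python sketch uses asyncio.Semaphore to cap in-flight checks; run_health_checks is a hypothetical helper, not Marathon code, and the default of 256 simply mirrors --max_concurrent_marathon_health_checks.

```python
import asyncio


async def run_health_checks(checks, max_concurrent: int = 256):
    """Run async health-check callables with at most max_concurrent in flight.

    Returns (results, peak) where peak is the highest observed concurrency.
    """
    semaphore = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0

    async def guarded(check):
        nonlocal in_flight, peak
        async with semaphore:  # wait for a free slot before checking
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await check()
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(guarded(c) for c in checks))
    return results, peak
```

Checks beyond the limit simply wait for a slot instead of failing, which trades latency for protection against overload.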
Fixed Issues
- MARATHON-8596 Introduced global throttling to Marathon health checks
- MARATHON-8575 Fixed a broken migration for app definitions with port mappings protocol "tcp,udp" which is no longer valid and should be "udp,tcp"
- MARATHON-8566 Fixed a rare bug where a deployment was sometimes not immediately visible through the /v2/deployments endpoint after creation.
Note: The previous 1.5.14 release introduced a regression where an unhealthy instance would not be killed. This no longer happens, and we do not recommend using the 1.5.14 release if you use health checks.
v1.5.13
Marathon 1.4 Compatible /v2/tasks
Marathon 1.5 included a major overhaul of networking, which resulted in an unintended change to the reporting of ports in /v2/tasks. In text/plain form, this information is used to configure load balancers and routers. When the container is in host mode, the reported port is the running host port. When the container is in bridge mode, the reported port is the dynamically created host port that bridges to the internal container port. In USER mode, Marathon 1.4 reported the container port. Regardless of its correctness, this behavior was relied on by some customers and needed to be forward-ported. Marathon 1.5.13 now provides it through the compatibilityMode query parameter on the /v2/tasks endpoint. If compatibilityMode is not specified, the 1.5 rendering is returned. If /v2/tasks?compatibilityMode=1.4 is used, the previous Marathon 1.4 rendering is returned.
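As a sketch of how a client might request the 1.4-style rendering, the Python snippet below only builds the request with the standard library; the base URL is a placeholder and tasks_request is a hypothetical helper. Sending it is a matter of passing the result to urllib.request.urlopen.

```python
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request


def tasks_request(base_url: str, compatibility_mode: Optional[str] = None) -> Request:
    """Build (but do not send) a text/plain GET request for /v2/tasks.

    compatibility_mode="1.4" asks for the Marathon 1.4 port rendering;
    None keeps the default 1.5 rendering.
    """
    url = base_url.rstrip("/") + "/v2/tasks"
    if compatibility_mode is not None:
        url += "?" + urlencode({"compatibilityMode": compatibility_mode})
    return Request(url, headers={"Accept": "text/plain"}, method="GET")


req = tasks_request("http://marathon.example:8080", "1.4")
print(req.full_url)  # http://marathon.example:8080/v2/tasks?compatibilityMode=1.4
```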
App name restrictions (breaking change)
From now on, app IDs that end with "restart", "tasks", or "versions" are no longer valid. Such apps already had broken behavior (for example, it was not possible to use the GET /v2/apps endpoint with them), so we made this constraint explicit. Existing apps with such names will continue to work; however, all operations on them (except deletion) will result in an error. Please rename them before upgrading Marathon.
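A client-side pre-check to run before upgrading could look like the sketch below. It assumes that "ends with" refers to the last path segment of the ID (so /prod/tasks is rejected while /prod/mytasks is accepted); that interpretation and the is_valid_app_id helper are assumptions, not Marathon's own validator.

```python
RESERVED_LAST_SEGMENTS = ("restart", "tasks", "versions")


def is_valid_app_id(app_id: str) -> bool:
    """Reject IDs whose last path segment is a reserved word (assumed rule)."""
    segments = [s for s in app_id.strip("/").split("/") if s]
    return bool(segments) and segments[-1] not in RESERVED_LAST_SEGMENTS
```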
Fixed Issues
- MARATHON-8493 Fixed precision bug associated with summing pod resources.
- MARATHON-8498 Fixed the secrets validator when changing secret environment variables.
- MARATHON-8466 Prohibit the use of reserved words in app and pod ids
- COPS-4483 Provide a backward-compatible way to produce container ports, consistent with Marathon 1.4, for text/plain GET requests against /v2/tasks when using USER networking.
- MARATHON-8566 - We fixed a race condition that caused /v2/deployments to not contain a confirmed deployment after an HTTP 200/201 response was returned.
v1.7.189
Changes to 1.7.189
Marathon Supports Java 9+
Marathon build tools and dependencies have been adjusted so that it can be compiled and run with Java 9, and we regularly build and test with Java 11. We currently still target Java 8 binary compatibility.
Fixed Issues
- MARATHON-8539 - Marathon now returns the Mesos source field exactly as it is received, fixing an issue where vendor information was lost.
- MARATHON-8466 - Marathon restricts the use of "reserved" words as IDs. The following is a list of restricted words: "restart", "tasks", "versions"
- MARATHON-8498 - Marathon now validates the full app when partial updates are applied.
- MARATHON-8493 - Fixed a precision bug associated with summing resource requirements for pods.
- MARATHON-8453 - Marathon now respects the --kill_retry_timeout timeout.
- MARATHON-8452 - Reduced logging; Marathon only logs zero-value offers if a scalar value is set.
- MARATHON-8413 - Fix for broken versioning of Apps and Pods associated with changes in Java Timestamps.
v1.7.174
Changes to 1.7.174
Marathon framework ID generation is now very conservative
Previously, Marathon would automatically request a new framework ID from Mesos if the old one was marked as torn down in Mesos, or if the framework ID record was removed from Zookeeper. This has caused more trouble than it has solved. The new behavior is:
- If Marathon's framework ID has been torn down in Mesos, or if the failover timeout has been exceeded, Marathon will crash on launch with a clear message.
- If Marathon's framework ID record was deleted from Zookeeper or is otherwise inaccessible, and there are instances defined, Marathon will refuse to create a new framework ID and will crash.
For more information, refer to the framework id docs page.
Minimum Mesos version requirement has been increased to 1.5.0
In previous Marathon versions, we monitored offers as a surrogate terminal task status signal for resident tasks, in order to work around a Mesos issue in which we would not receive terminal task status updates for agents that restarted. As of Mesos 1.4.0, this has been resolved, and we have removed this workaround.
There are still some edge cases where Mesos agent metadata is wiped (manually, by an operator) in a way that changes the agent ID but preserves reservations. In these cases, Mesos will report the resident tasks as perpetually unreachable. Operators should use the MARK_AGENT_GONE call in such cases to get Mesos to mark the associated resident tasks as terminal, thereby signaling to Marathon that it should try to relaunch them. This call was introduced in Mesos 1.5.0.
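For reference, MARK_AGENT_GONE is a call in the Mesos v1 operator API, sent as a JSON POST to the master's /api/v1 endpoint. The sketch below only builds the request; the master URL and agent ID are placeholders and mark_agent_gone_request is a hypothetical helper. Check the Mesos operator API documentation for your version before relying on the exact payload.

```python
import json
from urllib.request import Request


def mark_agent_gone_request(master_url: str, agent_id: str) -> Request:
    """Build (but do not send) a MARK_AGENT_GONE operator API call."""
    body = {
        "type": "MARK_AGENT_GONE",
        "mark_agent_gone": {"agent_id": {"value": agent_id}},
    }
    return Request(
        master_url.rstrip("/") + "/api/v1",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Marking an agent as gone is irreversible in Mesos, so it should only be issued for agents that will not come back with their old ID.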
Native Packages
We have stopped publishing native packages for operating system versions that are past their end-of-life:
- Ubuntu Yakkety
- Ubuntu Wily
- Ubuntu Vivid
Additionally, we have added support for Debian Stretch.
Docker image now allows user nobody; default user has been changed
Previously, the Marathon Docker container would only run as user root. The packaging has been updated so that the container is now run, by default, as the user nobody.
NOTE: This is a breaking change! If you previously did not specify MARATHON_MESOS_USER, and did not specify the container user nobody when launching Marathon in a container, then add the environment variable MARATHON_MESOS_USER=root to the containerized Marathon.
Non-leader/standby Marathon instances respond to /v2/events with a redirect, rather than proxy
Previously, Marathon standby instances would proxy the event stream. This caused an unnecessary increase in event stream drops, as the connection would terminate if either the leader or the standby restarted. Further, there were occasional buffering issues.
Now, when a standby Marathon instance is asked for /v2/events, it responds with an HTTP 302 redirect that points the client to the /v2/events resource of the current leader. Clients that consume the event stream should be updated to follow redirect responses.
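A client that needs explicit control over redirects can resolve the leader's URL before opening the stream. This Python sketch is hypothetical: fetch stands in for whatever HTTP call your client makes and returns (status, headers). Many HTTP clients, including Python's urllib.request.urlopen, already follow redirects for GET requests by default.

```python
from urllib.parse import urljoin

REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def resolve_event_stream_url(url, fetch, max_hops=5):
    """Follow redirects until a non-redirect response is returned.

    fetch(url) -> (status, headers) is a stand-in for a real HTTP call.
    """
    for _ in range(max_hops):
        status, headers = fetch(url)
        if status not in REDIRECT_STATUSES:
            return url
        url = urljoin(url, headers["Location"])  # Location may be relative
    raise RuntimeError("too many redirects")
```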
Event-proxying has the following deprecation schedule:
- 1.7.x - Standby Marathon instances return redirect responses. The old behavior of proxying event streams can be brought back with the command-line argument --deprecated_features=proxy_events.
- 1.8.x - Event stream proxying logic will be completely removed. If --deprecated_features=proxy_events is still specified, Marathon will refuse to launch, with an error.
Default for "max-open-connections" increased for asynchronous standby proxy, now configurable
In some clusters with heavy standby-proxy usage, a limit of 32 max-open-connections was too small. This default has been increased to 64. In addition, the flag --leader_proxy_max_open_connections has been introduced to tune the value further, if needed.
Maintenance Mode Support Production Ready, Now Default
Marathon now declines offers for agents with scheduled maintenance.
Previously, this behavior was enabled by --enable_features maintenance_mode. Operators should remove maintenance_mode from the --enable_features value list, as it now has no effect. In Marathon 1.8.x, including the term maintenance_mode in the --enable_features list will be considered an error.
The flag --disable_maintenance_mode has been introduced. To revert to the maintenance-mode behavior of Marathon 1.6.x and earlier (ignoring scheduled maintenance), operators can specify --disable_maintenance_mode.
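The offer-declining rule can be sketched as a predicate over a Mesos offer. This is an assumption-laden model: it treats an offer as a dict mirroring the optional unavailability field of the Mesos Offer message (a start and an optional duration, in nanoseconds), declines whenever that window has not yet ended, and uses maintenance_enabled=False to stand in for --disable_maintenance_mode. Marathon's exact policy may differ.

```python
def should_decline(offer: dict, now_ns: int, maintenance_enabled: bool = True) -> bool:
    """Decline offers from agents whose maintenance window has not ended.

    Assumed shape: offer["unavailability"] = {"start": {"nanoseconds": ...},
    "duration": {"nanoseconds": ...}}, where duration is optional.
    """
    if not maintenance_enabled:
        return False  # models --disable_maintenance_mode: ignore schedules
    window = offer.get("unavailability")
    if window is None:
        return False  # no maintenance scheduled for this agent
    start = window["start"]["nanoseconds"]
    duration = window.get("duration")
    if duration is None:
        return True  # open-ended maintenance never ends
    return now_ns < start + duration["nanoseconds"]
```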
Fixed Issues
- MARATHON-8409 - You can now launch Marathon in Docker as a non-root user.
- MARATHON-8017 - Fixed various issues when posting groups with relative ids.
- MARATHON-7568 - We now redact any Zookeeper credentials from the /v2/info response endpoint.
- MARATHON-8326 - Pods can be deleted together with persistent volumes, using a new wipe=true query parameter.
- Updated version of Marathon UI to 1.3.1:
  - MARATHON-8255 - Marathon UI now properly shows fetch URLs in the edit dialog.
New Exit Codes
Marathon will indicate with an exit code why it stopped itself. See the docs page for a list of all codes and their meanings.
Marathon 1.5.12
Changes from 1.5.11 to 1.5.12
Default for "kill_retry_timeout" was increased to 30 seconds
Sending frequent kill requests to an agent can in certain cases lead to overloading the Docker daemon (if the tasks are docker containers run by the Docker containerizer). Thirty seconds seems to be a more sensible default here.
Marathon framework ID generation is now very conservative
Previously, Marathon would automatically request a new framework ID from Mesos if the old one was marked as torn down in Mesos, or if the framework ID record was removed from Zookeeper. This has caused more trouble than it has solved. The new behavior is:
- If Marathon's framework ID has been torn down in Mesos, or if the failover timeout has been exceeded, Marathon will crash on launch with a clear message.
- If Marathon's framework ID record was deleted from Zookeeper or is otherwise inaccessible, and there are instances defined, Marathon will refuse to create a new framework ID and will crash.
For more information, refer to the framework id docs page.
Docker image now allows user nobody
Previously, the Marathon Docker container would only run as user root. The packaging has been updated so that the container can be run as the user nobody. The default user for running the container (and, consequently, the default value for --mesos_user) has not been changed.
Docker image upgraded to Debian Stretch
The Docker image for Marathon now uses Debian Stretch as a base OS, since Debian Jessie is no longer receiving security updates.
Native Packages
We have stopped publishing native packages for operating system versions that are past their end-of-life:
- Ubuntu Yakkety
- Ubuntu Wily
- Ubuntu Vivid
Additionally, we have added support for Debian Stretch.
Fixed Issues
- MARATHON-7568 We now redact any Zookeeper credentials from the /v2/info response endpoint.
- MARATHON-8413 Fixed a bug where versions feature did not work if Marathon was launched using Java 9.
- MARATHON-8095 Fixed a bug where proxying the PATCH call was impossible due to Java limitations.
- MARATHON-8430 Readiness checks now work with self-signed certificates.
- Updated version of Marathon UI to 1.3.1:
  - MARATHON-8255 Marathon UI now properly shows fetch URLs in the edit dialog.
- MARATHON-7941 Default for unreachable strategy on PUT /apps now matches POST requests.
- MARATHON-8084 Fix issue in which POST /v2/apps/{app_id}/restart would not proxy properly.
- MARATHON-7390 Fix issue in which Marathon would become unresponsive for a long time if Zookeeper DNS cannot be resolved at launch.
- Fixed a data migration issue in which UNIQUE constraint value was stripped when empty.
v1.6.549
Changes from 1.6.352 to 1.6.549
New Exit Codes
Marathon will indicate with an exit code why it stopped itself. See the docs page for a list of all codes and their meanings.
Native Packages
We have stopped publishing native packages for operating system versions that are past their end-of-life:
- Ubuntu Yakkety
- Ubuntu Wily
- Ubuntu Vivid
Additionally, we have added support for Debian Stretch.
Limit maximum number of running deployments
New command line flag --max_running_deployments was added to limit the maximum number of concurrently running deployments. The default value is 100. If a user tries to submit more updates than this flag allows, an HTTP 403 error is returned with an explanatory error message. We introduced this flag because a large number of running deployments can significantly degrade performance in the failover scenario, during the Marathon initialization phase. Note that if you reach the maximum number of deployments, you will have to use the ?force=true parameter to cancel an existing deployment.
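The interaction between the limit and ?force=true can be modeled with a small Python class. The model assumes that a regular cancellation submits a rollback deployment that needs its own slot while the original still counts, whereas a forced cancellation simply drops the deployment; DeploymentGate and its methods are hypothetical, and only the 100 default and the 403 response come from the notes above.

```python
from typing import Optional, Set, Tuple


class DeploymentGate:
    """Hypothetical model of --max_running_deployments admission control."""

    def __init__(self, max_running: int = 100) -> None:
        self.max_running = max_running
        self.running: Set[str] = set()
        self._counter = 0

    def submit(self) -> Tuple[int, Optional[str]]:
        """Start a deployment: (201, id), or (403, None) at the limit."""
        if len(self.running) >= self.max_running:
            return 403, None
        self._counter += 1
        deployment_id = f"d{self._counter}"
        self.running.add(deployment_id)
        return 201, deployment_id

    def cancel(self, deployment_id: str, force: bool = False) -> bool:
        """Return True if the cancellation was accepted."""
        if deployment_id not in self.running:
            return False
        if force:
            self.running.discard(deployment_id)  # no rollback deployment
            return True
        status, _ = self.submit()  # assumed: rollback needs its own slot
        if status != 201:
            return False  # rejected just like any other update
        self.running.discard(deployment_id)
        return True
```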
Zookeeper storage compaction interval
New command line flag --storage_compaction_interval was added to set the Zookeeper storage compaction interval, in seconds. The default value is 30 seconds.
Deprecation Mechanism
Marathon has gained a new feature flag: --deprecated_features. For more information, see the docs.
Non-blocking API and Leader Proxying
Previously, when under substantial load, Marathon would time out a deployment-initiating request (such as modifying an app) after some time, with "futures timed out". The timeout was not very helpful, because Marathon would perform the requested work regardless. This timeout has been removed. However, note that the client will still time out if configured to do so.
To handle the potential increase in concurrent connections, deployment operations and leader request proxying now use nonblocking I/O. The nonblocking I/O proxying logic may have some subtle differences in how responses are handled, including more aggressive rejection of malformed HTTP requests. In the unlikely event that this causes an issue in your cluster, the old behavior can be restored with the command line flag --deprecated_features=sync_proxy. sync_proxy is scheduled to be removed in Marathon 1.8.0.
Improved environment variable to command line argument mapping
As part of the fix for MARATHON-8254, the logic for receiving command-line options from environment variables has been reworked. "*" is now properly propagated (previously, the glob-expanded result was passed), and spaces and newlines are now preserved.
There is a small change in behavior for environments in which the launcher script is sourced rather than executed: unexported environment variables will not be converted into parameters.
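The environment-to-argument mapping can be sketched as follows, assuming Marathon's convention that MARATHON_FOO_BAR=value becomes --foo_bar value and that MARATHON_APP_* variables are skipped (see MARATHON-8310 in the 1.5.11 notes). env_to_args is a hypothetical helper, not the launcher script itself. Because each value stays a single argv element and no shell is involved, "*" is not glob-expanded and spaces and newlines survive.

```python
import os
from typing import List, Mapping, Optional

PREFIX = "MARATHON_"


def env_to_args(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Turn MARATHON_* variables into ["--flag", "value", ...] (sorted by name)."""
    env = os.environ if env is None else env
    args: List[str] = []
    for key in sorted(env):
        if not key.startswith(PREFIX) or key.startswith("MARATHON_APP_"):
            continue  # unrelated variable, or task metadata set by Marathon
        name = key[len(PREFIX):]
        if not name:
            continue  # a bare "MARATHON_" maps to no flag
        args.extend(["--" + name.lower(), env[key]])  # value kept verbatim
    return args
```

Only exported variables are visible in os.environ of a child process, which is why a sourced launcher script cannot pick up unexported ones.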
Optionally allow offer suppress
Marathon can now be configured to suppress offers from Mesos by specifying the flag --suppress_offers. This can mitigate offer starvation in larger clusters, at the cost of reservations taking longer to be destroyed. This is off by default.
New Metrics
Several new metrics have been added to improve detection of load-scenarios known to degrade Marathon's performance:
mesosphere.marathon.api.HTTPMetricsFilter.gzippedBytesWritten
mesosphere.marathon.api.HTTPMetricsFilter.bytesRead
mesosphere.marathon.api.HTTPMetricsFilter.bytesWritten
mesosphere.marathon.core.deployment.impl.DeploymentManagerActor.currentDeploymentCount
mesosphere.marathon.core.deployment.impl.DeploymentManagerActor.deploymentCount
mesosphere.marathon.core.flow.impl.ReviveOffersActor.reviveCount
mesosphere.marathon.core.flow.impl.ReviveOffersActor.suppressCount
mesosphere.marathon.core.group.impl.GroupManagerImpl.dismissedDeployments
mesosphere.marathon.core.group.impl.GroupManagerImpl.queueSize
mesosphere.marathon.core.matcher.base.util.OfferOperationFactory.launchOperationCount
mesosphere.marathon.core.matcher.base.util.OfferOperationFactory.launchGroupOperationCount
mesosphere.marathon.core.matcher.base.util.OfferOperationFactory.reserveOperationCount
Deprecated Features
/v2/schemas
The route /v2/schemas
has been deprecated in favor of the RAML specifications. Clients that need to perform local validation of requests can access the RAML specifications under the /public/api prefix. For example, to get the RAML definition for the apps resource, GET http://marathon:8080/public/api/v2/apps.raml.
The route /v2/schemas
has the following deprecation schedule:
- 1.6.x - /v2/schemas will continue to function as normal.
- 1.7.x - The API will stop responding to /v2/schemas; requests to it will be met with a 404 response. The route can be re-enabled with the command-line argument --deprecated_features=json_schemas_resource.
- 1.8.x - /v2/schemas is scheduled to be completely removed. If --deprecated_features=json_schemas_resource is still specified, Marathon will refuse to launch, with an error.
/v2/events
The default response format of /v2/events is deprecated and will be switched to the format returned by /v2/events?plan-format=light in the first 1.7.x release. The following deprecation schedule is planned for this endpoint:
- 1.6.x - /v2/events will continue to function as normal.
- 1.7.x - The default /v2/events format will be switched to "light". You will still be able to use the command-line argument --deprecated_features=api_heavy_events to re-enable the heavy event response.
- 1.8.x - The /v2/events format will be permanently switched to "light". If --deprecated_features=api_heavy_events is still specified, Marathon will refuse to launch, with an error.
Deprecation Details
The "lightweight" plan format can already be seen by using the ?plan-format=light argument. In summary, this format drops the following fields from the deployment-related events in the event stream accessed via /v2/events:
- plan.original - The current state of the root group
- plan.target - The target state of the root group
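A minimal sketch of the difference between the two formats, assuming a simplified event shape (the eventType value and the other plan fields are illustrative); only plan.original and plan.target are taken from the list above, and to_light_event is a hypothetical helper.

```python
import copy


def to_light_event(event: dict) -> dict:
    """Return a copy of a deployment event without plan.original/plan.target."""
    light = copy.deepcopy(event)  # leave the caller's event untouched
    plan = light.get("plan")
    if isinstance(plan, dict):
        plan.pop("original", None)  # current state of the root group
        plan.pop("target", None)    # target state of the root group
    return light
```

Dropping two full copies of the root group is what makes the light format much smaller for large clusters.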
Fixed Issues
- MARATHON-7568 - We now redact any Zookeeper credentials from the /v2/info response endpoint.
- Updated version of Marathon UI to 1.3.1:
  - MARATHON-8255 - Marathon UI now properly shows fetch URLs in the edit dialog.
- MARATHON-8124 Fix issue in which reservations lacking a persistent volume would not be destroyed.
- MARATHON-7940 Fix connection-pool overflow issues with Marathon HTTP health checks by disabling connection pooling for them.
- MARATHON-8136 Fix issues involving headers and URI filtering with Marathon HTTP healthchecks.
- MARATHON-8083 Fix issue with datadog / graphite metric reporters in which several parameters were ignored.
- MARATHON-8110 Fix issue in which Marathon would fail to accept offers for some resources from newer versions of Mesos.
- MARATHON-2683 Deployments for run-specs with multiple health-checks now wait for all health checks to succeed.
- MARATHON-8148 Pod last-failure-reason is now exposed via the API, as is done for apps.
- MARATHON-8216 Mesos HTTP health checks now work for non-host networking modes with containerPort=0.
- MARATHON-8064 Fix migration issue when store caching is disabled
- MARATHON-8159 Fix migration issue which introduced erroneous taskKillGracePeriodSeconds values
- MARATHON-8304 Fix rare bug in which Marathon would become unresponsive while connecting to Mesos.
- MARATHON-7568 Zookeeper credentials are now redacted from logs and the /v2/info response.
- MARATHON-7390 Fix issue in which Marathon would become unresponsive for a long time if Zookeeper DNS cannot be resolved at launch.
- MARATHON-8084 Fix issue in which POST /v2/apps/{app_id}/restart would not proxy properly.
- MARATHON-8326 Pod instances with persistent volumes can now be destroyed.
- MARATHON-8095 Fix issue in which PATCH HTTP requests were not properly proxied.
- Fix an issue in which resident tasks sometimes wouldn't be restarted.
v1.4.13
Fixed issues
- MARATHON-8397 Get rid of unbounded concurrency in the migration code
v1.5.11
Fixed issues
- MARATHON-8311 Metrics for suppress operations.
- MARATHON-8304 Activate HeartbeatMonitor before Marathon is connected to Mesos.
- MARATHON-8310 Ignore MARATHON_APP_ env vars.