Release 1.1.1 - 2nd attempt #54

Merged
merged 62 commits into from
Dec 13, 2024
62 commits
cab50d7
Merge pull request #13 from carsdotcom/main
Macr0Nerd Sep 29, 2022
4457a2d
Split normalize_config into two functions (#27)
donheshanthaka Oct 4, 2022
00efd0b
Added new badges to the Readme (#30)
npatel-cars Oct 11, 2022
6b4e1e6
Move EC2 pricing calls to single function. (#29)
jagmoreira Oct 12, 2022
eb20716
Added more clear SSH error message for improper credentials
Macr0Nerd Oct 12, 2022
3fc2211
Updated changelog
Macr0Nerd Oct 12, 2022
b0ed1dd
Fixed changelog
Macr0Nerd Oct 12, 2022
e467d99
Updated SSH credential error message
Macr0Nerd Oct 14, 2022
a2c8224
Merge pull request #31 from carsdotcom/better_ssh_error
Macr0Nerd Oct 17, 2022
53d8218
Add tests for ssh.py module.
jagmoreira Oct 19, 2022
73bae52
Add coverage as a test dependency.
jagmoreira Oct 19, 2022
873ec58
Update changelog and fix style.
jagmoreira Oct 19, 2022
ba6fcff
Merge pull request #32 from carsdotcom/ssh-tests
Macr0Nerd Oct 19, 2022
f5a9913
Add unittests for rsync module. (#33)
jagmoreira Oct 26, 2022
b05428b
Add tests for yaml_loader.py to increase coverage (#34)
Alig1493 Oct 26, 2022
daa8c34
moved function outside for better testing (#35)
Alig1493 Oct 27, 2022
1ace9df
Added venv to .gitignore
Macr0Nerd Oct 27, 2022
6954867
Merge pull request #36 from carsdotcom/ignore_venv
Macr0Nerd Oct 27, 2022
971b1a0
bump version (#37)
npatel-cars Oct 27, 2022
6c56b7e
increase test coverage for forge/destory.py (#39)
Alig1493 Oct 28, 2022
10482dc
Add a configurable spot strategy
Macr0Nerd Jan 10, 2024
594ffc3
Updated tests
Macr0Nerd Jan 10, 2024
84b6a98
Fixed tests for multi-az
Macr0Nerd Jan 12, 2024
023f542
Updated documentation for multi-az
Macr0Nerd Jan 12, 2024
81f4c2b
Add multi-az functionality
Macr0Nerd Jan 12, 2024
d31bb08
Add spot retries and failover
Macr0Nerd Jan 19, 2024
48a164f
Update documentation
Macr0Nerd Feb 7, 2024
d28b7a7
Merge pull request #42 from carsdotcom/spot_retries
Macr0Nerd Feb 7, 2024
0159d2d
Merge pull request #41 from carsdotcom/multi_az
Macr0Nerd Feb 7, 2024
91f3802
Merge branch 'dev' into configurable_spot_strategy
Macr0Nerd Feb 7, 2024
98a738c
Merge pull request #40 from carsdotcom/configurable_spot_strategy
Macr0Nerd Feb 7, 2024
af259e3
Bumped version
Macr0Nerd Feb 7, 2024
dcd9133
Bumped version to 1.1.0
Macr0Nerd Feb 7, 2024
97c15ba
Updated maintainers
Macr0Nerd Feb 7, 2024
ac432c3
Fixed region bug in create.py
Macr0Nerd Feb 8, 2024
e1f8653
1.1.1
Macr0Nerd Feb 8, 2024
da71ef4
Merge pull request #44 from carsdotcom/region_patch
Macr0Nerd Feb 8, 2024
1ee83a9
Fixed automatic multi-worker allocation bug
Macr0Nerd Feb 9, 2024
a4f3bfd
Updated dependencies
Macr0Nerd Feb 20, 2024
fb83884
Added destroy_on_create
Macr0Nerd Feb 21, 2024
e123076
Fixed potential bug
Macr0Nerd Feb 23, 2024
ee6f9ac
Moved to get_nlist()
Macr0Nerd Feb 23, 2024
6bbbee6
Merge branch 'version_bump' into forge_patches
Macr0Nerd Feb 26, 2024
d1f44c0
Version 1.2.0
Macr0Nerd Feb 26, 2024
751eea2
Add create_timeout configuration option
Macr0Nerd Feb 28, 2024
891ce4b
Remove default create_timeout setting
Macr0Nerd Feb 28, 2024
4c276d8
Fix create_timeout check
Macr0Nerd Feb 28, 2024
d43ee3c
Reduce version to 1.1.0
Macr0Nerd Mar 1, 2024
f503e30
Remove Hacktoberfest 2022 branding
Macr0Nerd Apr 5, 2024
72b69d8
Merge pull request #46 from carsdotcom/forge_patches
Macr0Nerd Apr 8, 2024
3f20991
GPU Fix (#47)
jagmoreira May 9, 2024
6a79dc5
Add error reporting for RAM/CPU misconfigurations
Macr0Nerd May 31, 2024
666d3f3
Merge pull request #48 from carsdotcom/better-ram-errors
Macr0Nerd Jun 14, 2024
dacfe6b
Add retries and return code to rsync
Macr0Nerd Oct 2, 2024
22314b5
Bump version to 1.1.1
Macr0Nerd Oct 2, 2024
8b19ae1
Merge pull request #49 from carsdotcom/return_code_fix
Macr0Nerd Nov 13, 2024
6f11710
Add minute timer after create to engine
Macr0Nerd Nov 13, 2024
0862ee9
Merge pull request #50 from carsdotcom/rsync_timer
Macr0Nerd Nov 14, 2024
e842c4b
Add msg in engine to inform user of Rsync delay (#51)
jagmoreira Dec 9, 2024
57e6a02
Resolving merge conflicts (#53)
jagmoreira Dec 11, 2024
0da29de
Merge branch 'main' into dev
jagmoreira Dec 12, 2024
17c11f2
Update release date.
jagmoreira Dec 12, 2024
2 changes: 1 addition & 1 deletion .bumpversion.cfg
@@ -1,5 +1,5 @@
[bumpversion]
current_version = 1.0.2
current_version = 1.1.1
commit = True
tag = False
parse = (?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)
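For reference, the `parse` pattern above uses named groups that Python's `re` module also understands. A minimal sketch of how such a pattern splits a version string; `parse_version` is a hypothetical helper for illustration and is not part of bump2version or Forge:

```python
import re

# Same pattern as the `parse` setting in .bumpversion.cfg.
VERSION_RE = re.compile(r"(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)")


def parse_version(version):
    """Return (major, minor, patch) as ints, or raise ValueError.

    Hypothetical helper, shown only to illustrate the named groups.
    """
    match = VERSION_RE.fullmatch(version)
    if match is None:
        raise ValueError(f"not a major.minor.patch version: {version!r}")
    return tuple(int(match[group]) for group in ("major", "minor", "patch"))


print(parse_version("1.1.1"))
# Tuples compare element by element, so the old version sorts first.
print(parse_version("1.0.2") < parse_version("1.1.1"))
```

Comparing the parsed tuples rather than the raw strings avoids ordering mistakes such as `"1.10.0" < "1.9.0"`.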
2 changes: 1 addition & 1 deletion .github/workflows/python-publish.yml
@@ -22,7 +22,7 @@ jobs:
- name: Set up Python
uses: actions/setup-python@v3
with:
python-version: '3.7'
python-version: '3.9'

- name: Install dependencies
run: |
2 changes: 1 addition & 1 deletion .github/workflows/run_tests.yaml
@@ -23,7 +23,7 @@ jobs:
- name: Set up Python
uses: actions/setup-python@v3
with:
python-version: '3.7'
python-version: '3.9'

- name: Install dependencies
run: |
35 changes: 33 additions & 2 deletions CHANGELOG.md
@@ -4,7 +4,35 @@ All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]
## [1.1.1] - 2024-12-12

### Changed
- **Python Version** - Bump minimum Python version to 3.9.
- **Rsync** - Properly trigger the retry sequence.
- **Rsync** - Now gives a return code.

### Fixed
- **Create** - Fix GPU AMI not being selected.
- **Parser** - Fix GPU flag not being passed properly to the config dict.
- **Create** - Better error reporting regarding RAM and CPU misconfigurations.


## [1.1.0] - 2024-02-26

### Added
- **Create** - Added `destroy_on_create`
- **Create** - Added `create_timeout` option
- **Common** - Moved all `n_list` functions to `get_nlist()`
- **Dependencies** - Updated dependencies and tested on latest versions
- **Create** - Set default boto3 session at beginning of create to resolve region bug
- **Create**
- Multi-AZ functionality
- Spot retries
- On-demand Failover

### Changed
- **Create** - Configurable spot strategy
- **Documentation** - Updated with new changes


## [1.0.2] - 2022-10-27
@@ -33,12 +61,15 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
- **GitHub** - Update action to build and publish package only when version is bumped.
- **Forge** - Added automatic tag `forge-name` to allow `Name` tag to be changed.


## [1.0.0] - 2022-09-27

### Added
- **Initial commit** - Forge source code, unittests, docs, pyproject.toml, README.md, and LICENSE files.

[unreleased]: https://github.com/carsdotcom/cars-forge/compare/v1.0.2...HEAD
[unreleased]: https://github.com/carsdotcom/cars-forge/compare/v1.1.1...HEAD
[1.1.1]: https://github.com/carsdotcom/cars-forge/compare/v1.1.0...v1.1.1
[1.1.0]: https://github.com/carsdotcom/cars-forge/compare/v1.0.2...v1.1.0
[1.0.2]: https://github.com/carsdotcom/cars-forge/compare/v1.0.1...v1.0.2
[1.0.1]: https://github.com/carsdotcom/cars-forge/compare/v1.0.0...v1.0.1
[1.0.0]: https://github.com/carsdotcom/cars-forge/releases/tag/v1.0.0
2 changes: 1 addition & 1 deletion README.md
@@ -2,10 +2,10 @@

[![GitHub license](https://img.shields.io/github/license/carsdotcom/cars-forge?color=navy&label=License&logo=License&style=flat-square)](https://github.com/carsdotcom/cars-forge/blob/main/LICENSE)
[![PyPI](https://img.shields.io/pypi/v/cars-forge?color=navy&style=flat-square)](https://pypi.org/project/cars-forge/)
![hacktoberfest](https://img.shields.io/github/issues/carsdotcom/cars-forge?color=orange&label=Hacktoberfest%202022&style=flat-square&?labelColor=black)
![PyPI - Downloads](https://img.shields.io/pypi/dm/cars-forge?color=navy&style=flat-square)
![GitHub Workflow Status (branch)](https://img.shields.io/github/workflow/status/carsdotcom/cars-forge/Publish%20Package/main?color=navy&style=flat-square)
![GitHub contributors](https://img.shields.io/github/contributors/carsdotcom/cars-forge?color=navy&style=flat-square)

---

## About
16 changes: 13 additions & 3 deletions docs/environmental_yaml.md
@@ -59,10 +59,18 @@ https://github.com/carsdotcom/cars-forge/blob/main/examples/env_yaml_example/exa
constraints: [2.3, 3.0, 3.1]
error: "Invalid Spark version. Only 2.3, 3.0, and 3.1 are supported."
```
- **aws_az** - The [AWS availability zone](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html) where Forge will create the EC2 instance. Currently, Forge can run only in one AZ
- **aws_profile** - [AWS CLI profile](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-profiles.html) to use
- **aws_az** - The [AWS availability zone](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html) where Forge will create the EC2 instance. If set, multi-az placement will be disabled.
- **aws_region** - The AWS region for Forge to run in
- **aws_profile** - [AWS CLI profile](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-profiles.html) to use
- **aws_security_group** - [AWS Security Group](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-security-groups.html) for the instance
- **aws_subnet** - [AWS subnet](https://docs.aws.amazon.com/vpc/latest/userguide/configure-subnets.html) where the EC2s will run
- **aws_subnet** - [AWS subnet](https://docs.aws.amazon.com/vpc/latest/userguide/configure-subnets.html) where the EC2s will run
- **aws_multi_az** - A mapping of availability zone to [AWS subnet](https://docs.aws.amazon.com/vpc/latest/userguide/configure-subnets.html) where the EC2s will run. Takes precedence over `aws_subnet` if both are set.
- E.g.
```yaml
aws_multi_az:
us-east-1a: subnet-aaaaaaaaaaaaaaaaa
us-east-1b: subnet-bbbbbbbbbbbbbbbbb
us-east-1c: subnet-ccccccccccccccccc
```
- **default_ratio** - Override the default ratio of RAM to CPU if the user does not provide one. Must be a list of the minimum and maximum.
- default is [8, 8]
- **ec2_amis** - A dictionary of dictionaries to store [AMI](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html) info.
@@ -95,6 +103,8 @@ https://github.com/carsdotcom/cars-forge/blob/main/examples/env_yaml_example/exa
```
- **forge_env** - Name of the Forge environment. The user will refer to this in their yaml.
- **forge_pem_secret** - The secret name where the `ec2_key` is stored
- **on_demand_failover** - If using engine mode and all spot attempts (market: spot + spot retries) have failed, run a final attempt using on-demand.
- **spot_retries** - If using engine mode, sets the number of times to retry a spot instance. Only retries if either market is spot.
- **tags** - [Tags](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Using_Tags.html) to apply to instances created by Forge. Follows the AWS tag format.
- Forge also exposes all string, numeric, and some extra variables from the combined user and environmental configs that will be replaced at runtime by the matching values (e.g. `{name}` for job name, `{date}` for job date, etc.) See the [variables](variables.md) page for more details.
- E.g.
3 changes: 3 additions & 0 deletions docs/yaml.md
@@ -43,6 +43,7 @@ Each forge command certain parameters. A yaml file with all the parameters can b
```
- If running via the command line, a range of values is passed as: ``--market on-demand spot``.
- **name** - Name of the instance/cluster
- **on_demand_failover** - If using engine mode and all spot attempts (market: spot + spot retries) have failed, run a final attempt using on-demand.
- **ram** - Minimum amount of RAM required. Can be a range e.g. [16, 32].
- If using a cluster, you must specify both the master and worker. Master first, worker second.
```yaml
@@ -76,5 +77,7 @@ Each forge command certain parameters. A yaml file with all the parameters can b
- Use the `--all` flag to run the script on all the instances in a cluster.
- E.g. `run_cmd: scripts/run.sh {env} {date} {ip}`
- **service** - `cluster` or `single`
- **spot_strategy** - Select the [spot allocation strategy](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/ec2/client/create_fleet.html).
- **spot_retries** - If using engine mode, sets the number of times to retry a spot instance. Only retries if either market is spot.
- **user_data** - Custom script passed to instance. Will be run only once when the instance starts up.
- **valid_time** - How many hours the fleet will stay up. After this time, all EC2s will be destroyed. The default is 8.
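
The `spot_retries` and `on_demand_failover` options documented above can be read as an ordered list of attempts. The sketch below is only an illustration of that reading; `plan_attempts` is a hypothetical helper, not Forge's engine code. It assumes that the first spot attempt is counted separately from the retries and that the failover attempt switches every market to on-demand:

```python
def plan_attempts(market, spot_retries=0, on_demand_failover=False):
    """Return the ordered list of markets an engine run would try.

    Hypothetical helper: it only illustrates the documented options.
    `market` is a list such as ['spot'] or ['on-demand', 'spot'].
    """
    if "spot" not in market:
        # Retries and failover apply only when at least one market is spot.
        return [list(market)]

    # One initial attempt plus `spot_retries` retries with the same markets.
    attempts = [list(market) for _ in range(1 + spot_retries)]

    if on_demand_failover:
        # Assumption: the final attempt runs everything on-demand.
        attempts.append(["on-demand"] * len(market))
    return attempts


for attempt in plan_attempts(["on-demand", "spot"], spot_retries=2,
                             on_demand_failover=True):
    print(attempt)
```

With `spot_retries: 2` and `on_demand_failover: true`, this reading gives three spot attempts followed by one on-demand attempt.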
29 changes: 20 additions & 9 deletions pyproject.toml
@@ -5,11 +5,20 @@
name = "cars-forge"
description = "Create an on-demand/spot fleet of single or cluster EC2 instances."
readme = "README.md"
requires-python = ">=3.7"
requires-python = ">=3.9"
license = "Apache-2.0"
authors = [
{name = "Nikhil Patel", email = "[email protected]"}
{name = "Nikhil Patel", email = "[email protected]"},
{name = "Gabriele Ron", email = "[email protected]"},
{name = "Joao Moreira", email = "[email protected]"}
]

maintainers = [
{name = "Nikhil Patel", email = "[email protected]"},
{name = "Gabriele Ron", email = "[email protected]"},
{name = "Joao Moreira", email = "[email protected]"}
]

keywords = [
"AWS",
"EC2",
@@ -19,6 +28,7 @@ keywords = [
"Cluster",
"Jupyter"
]

classifiers = [
"Development Status :: 5 - Production/Stable",
"Environment :: Console",
@@ -28,24 +38,25 @@ classifiers = [
"Operating System :: Unix",
"Programming Language :: Python",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3.7",
"Programming Language :: Python :: 3.8",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
]

dynamic = ["version"]

dependencies = [
"boto3~=1.19.0",
"pyyaml~=5.3.0",
"schema~=0.7.0",
"boto3",
"pyyaml",
"schema",
]

[project.optional-dependencies]
test = [
"pytest~=7.1.0",
"pytest-cov~=4.0"
"pytest",
"pytest-cov"
]

dev = [
"bump2version~=1.0",
]
5 changes: 3 additions & 2 deletions src/forge/__init__.py
@@ -1,4 +1,4 @@
__version__ = "1.0.2"
__version__ = "1.1.1"

# Default values for forge's essential arguments
DEFAULT_ARG_VALS = {
@@ -11,7 +11,8 @@
'destroy_after_failure': True,
'default_ratio': [8, 8],
'valid_time': 8,
'ec2_max': 768
'ec2_max': 768,
'spot_strategy': 'price-capacity-optimized'
}

# Required arguments for each Forge job
56 changes: 53 additions & 3 deletions src/forge/common.py
@@ -14,6 +14,7 @@
from botocore.exceptions import ClientError, NoCredentialsError

from . import DEFAULT_ARG_VALS, ADDITIONAL_KEYS
from .exceptions import ExitHandlerException

logger = logging.getLogger(__name__)

@@ -117,7 +118,8 @@ def ec2_ip(n, config):
'instance_type': i.get('InstanceType'),
'state': i.get('State').get('Name'),
'launch_time': i.get('LaunchTime'),
'fleet_id': check_fleet_id(n, config)
'fleet_id': check_fleet_id(n, config),
'az': i.get('Placement')['AvailabilityZone']
}
details.append(x)
logger.debug('ec2_ip details is %s', details)
@@ -142,6 +144,35 @@ def get_ip(details, states):
return [(i['ip'], i['id']) for i in list(filter(lambda x: x['state'] in states, details))]


def get_nlist(config):
    """get list of instance names based on service

    Parameters
    ----------
    config : dict
        Forge configuration data

    Returns
    -------
    list
        List of instance names
    """
    date = config.get('date', '')
    market = config.get('market', DEFAULT_ARG_VALS['market'])
    name = config['name']
    service = config['service']

    n_list = []
    if service == "cluster":
        n_list.append(f'{name}-{market[0]}-{service}-master-{date}')
        if config.get('rr_all'):
            n_list.append(f'{name}-{market[-1]}-{service}-worker-{date}')
    elif service == "single":
        n_list.append(f'{name}-{market[0]}-{service}-{date}')

    return n_list
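
To make the naming scheme concrete, here is a small runnable sketch of `get_nlist()` with sample configs. The function body is a condensed copy of the one above, with the nesting of the `rr_all` check inferred from the `elif`. `DEFAULT_ARG_VALS` is a stand-in whose market value is an assumption, since the real default is not shown in this diff. The job name and date are made up:

```python
# Stand-in for forge.DEFAULT_ARG_VALS; the real default market is not shown
# in this diff, so this value is an assumption.
DEFAULT_ARG_VALS = {'market': ['on-demand', 'on-demand']}


def get_nlist(config):
    """Condensed copy of get_nlist() for illustration."""
    date = config.get('date', '')
    market = config.get('market', DEFAULT_ARG_VALS['market'])
    name = config['name']
    service = config['service']

    n_list = []
    if service == "cluster":
        # The master uses the first market entry.
        n_list.append(f'{name}-{market[0]}-{service}-master-{date}')
        if config.get('rr_all'):
            # Workers use the last market entry and are listed only with rr_all.
            n_list.append(f'{name}-{market[-1]}-{service}-worker-{date}')
    elif service == "single":
        n_list.append(f'{name}-{market[0]}-{service}-{date}')
    return n_list


cluster = {'name': 'etl', 'service': 'cluster', 'date': '2024-12-12',
           'market': ['on-demand', 'spot'], 'rr_all': True}
single = {'name': 'etl', 'service': 'single', 'date': '2024-12-12',
          'market': ['spot']}

print(get_nlist(cluster))
print(get_nlist(single))
```

A service value other than `cluster` or `single` yields an empty list, so callers that iterate over the result simply do nothing in that case.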


@contextlib.contextmanager
def key_file(secret_id, region, profile):
"""Safely retrieve a secret file from AWS for temporary use.
@@ -320,6 +351,14 @@ def normalize_config(config):
    if config.get('aws_az'):
        config['region'] = config['aws_az'][:-1]

    if config.get('aws_subnet') and not config.get('aws_multi_az'):
        config['aws_multi_az'] = {config.get('aws_az'): config.get('aws_subnet')}
    elif config.get('aws_subnet') and config.get('aws_multi_az'):
        logger.warning('Both aws_multi_az and aws_subnet exist, defaulting to aws_multi_az')

    if config.get('aws_region'):
        config['region'] = config['aws_region']

    if not config.get('ram') and not config.get('cpu') and config.get('ratio'):
        DEFAULT_ARG_VALS['default_ratio'] = config.pop('ratio')
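
The region and subnet handling in this hunk can be exercised in isolation. The sketch below restates the same branches as two small functions; `resolve_region` and `resolve_subnets` are hypothetical names used only for illustration and do not exist in Forge:

```python
import logging

logger = logging.getLogger(__name__)


def resolve_region(config):
    """Hypothetical helper mirroring the region logic in the hunk above."""
    region = config.get('region')
    if config.get('aws_az'):
        # Drop the trailing zone letter: 'us-east-1a' -> 'us-east-1'.
        region = config['aws_az'][:-1]
    if config.get('aws_region'):
        # An explicit region is applied last, so it wins over the AZ.
        region = config['aws_region']
    return region


def resolve_subnets(config):
    """Hypothetical helper mirroring the aws_multi_az fallback above."""
    subnet = config.get('aws_subnet')
    multi_az = config.get('aws_multi_az')
    if subnet and not multi_az:
        # A single AZ/subnet pair is folded into a one-entry mapping.
        return {config.get('aws_az'): subnet}
    if subnet and multi_az:
        logger.warning('Both aws_multi_az and aws_subnet exist, defaulting to aws_multi_az')
    return multi_az


legacy = {'aws_az': 'us-east-1a', 'aws_subnet': 'subnet-aaaaaaaaaaaaaaaaa'}
print(resolve_region(legacy))
print(resolve_subnets(legacy))
```

Folding the legacy pair into the mapping means later code only needs to handle the `aws_multi_az` shape.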

@@ -492,8 +531,8 @@ def get_ec2_pricing(ec2_type, market, config):
float
Hourly price of given EC2 type in given market.
"""
region = config.get('region')
az = config.get('aws_az')
region = config['region']
az = config['aws_az']

if market == 'spot':
client = boto3.client('ec2')
@@ -529,3 +568,14 @@ def get_ec2_pricing(ec2_type, market, config):
price = float(price)

return price


def exit_callback(config, exit: bool = False):
    if config['job'] == 'engine' and (config.get('spot_retries') or (config.get('on_demand_failover') or config.get('market_failover'))):
        logger.error('Error occurred, bubbling up error to handler.')
        raise ExitHandlerException

    if exit:
        sys.exit(1)

    pass
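
A minimal sketch of how a caller might react to `exit_callback()`. The `ExitHandlerException` class here is a stand-in, since its definition in `forge/exceptions.py` is not part of this diff, and `run_with_handler` and `failing_step` are hypothetical names. The point is only that, in engine mode with retries or failover configured, the error surfaces as an exception the caller can catch instead of terminating the process:

```python
import sys


class ExitHandlerException(Exception):
    """Stand-in for forge.exceptions.ExitHandlerException (assumption)."""


def exit_callback(config, exit=False):
    # Same decision as the function above, without the logging call.
    if config['job'] == 'engine' and (
        config.get('spot_retries')
        or config.get('on_demand_failover')
        or config.get('market_failover')
    ):
        raise ExitHandlerException
    if exit:
        sys.exit(1)


def failing_step(config):
    # Simulates a step that hits an error and defers to exit_callback.
    exit_callback(config, exit=True)


def run_with_handler(config, step):
    """Hypothetical caller: report whether the step asked for a retry."""
    try:
        step(config)
    except ExitHandlerException:
        return 'retry'
    return 'ok'


print(run_with_handler({'job': 'engine', 'spot_retries': 2}, failing_step))
```

For any other job, or an engine job without retries or failover configured, the same step falls through to `sys.exit(1)`.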
23 changes: 18 additions & 5 deletions src/forge/configure.py
@@ -5,7 +5,7 @@
import sys

import yaml
from schema import Schema, And, Optional, SchemaError
from schema import Schema, And, Optional, Or, SchemaError, Use

from .common import set_config_dir

@@ -50,19 +50,32 @@ def check_env_yaml(env_yaml):
"""
schema = Schema({
'forge_env': And(str, len, error='Invalid Environment Name'),
'aws_az': And(str, len, error='Invalid AWS availability zone'),
Optional('aws_region'): And(str, len, error='Invalid AWS region'),
Optional('aws_az'): And(str, len, error='Invalid AWS availability zone'),
Optional('aws_subnet'): And(str, len, error='Invalid AWS Subnet'),
'ec2_amis': And(dict, len, error='Invalid AMI Dictionary'),
'aws_subnet': And(str, len, error='Invalid AWS Subnet'),
Optional('aws_multi_az'): And(dict, len, error='Invalid AWS Subnet'),
'ec2_key': And(str, len, error='Invalid AWS key'),
'aws_security_group': And(str, len, error='Invalid AWS Security Group'),
Optional('aws_security_group'): And(str, len, error='Invalid AWS Security Group'),
'forge_pem_secret': And(str, len, error='Invalid Name of Secret'),
Optional('aws_profile'): And(str, len, error='Invalid AWS profile'),
Optional('ratio'): And(list, len, error='Invalid default ratio'),
Optional('user_data'): And(dict, len, error='Invalid Create Scripts'),
Optional('tags'): And(list, len, error="Invalid AWS tags"),
Optional('excluded_ec2s'): And(list),
Optional('additional_config'): And(list),
Optional('ec2_max'): And(int)
Optional('ec2_max'): And(int),
Optional('spot_strategy'): And(str, len,
Or(
'lowest-price',
'diversified',
'capacity-optimized',
'capacity-optimized-prioritized',
'price-capacity-optimized'),
error='Invalid spot allocation strategy'),
Optional('on_demand_failover'): And(bool),
Optional('spot_retries'): And(Use(int), lambda x: x > 0),
Optional('create_timeout'): And(Use(int), lambda x: x > 0),
})
try:
validated = schema.validate(env_yaml)
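
The new optional keys in this schema can be summarized with a dependency-free sketch. `check_spot_options` is a hypothetical plain-Python approximation of the rules for `spot_strategy`, `on_demand_failover`, `spot_retries`, and `create_timeout`. It does not use the `schema` package and is not the validation Forge runs:

```python
SPOT_STRATEGIES = {
    'lowest-price',
    'diversified',
    'capacity-optimized',
    'capacity-optimized-prioritized',
    'price-capacity-optimized',
}


def check_spot_options(env):
    """Hypothetical approximation of the new optional schema keys."""
    out = dict(env)

    strategy = out.get('spot_strategy')
    if 'spot_strategy' in out and (
        not isinstance(strategy, str) or strategy not in SPOT_STRATEGIES
    ):
        raise ValueError('Invalid spot allocation strategy')

    if 'on_demand_failover' in out and not isinstance(out['on_demand_failover'], bool):
        raise ValueError('on_demand_failover must be a boolean')

    for key in ('spot_retries', 'create_timeout'):
        if key in out:
            # Like Use(int): coerce first (e.g. '3' -> 3), then check the sign.
            out[key] = int(out[key])
            if out[key] <= 0:
                raise ValueError(f'{key} must be greater than 0')
    return out


print(check_spot_options({'spot_strategy': 'diversified', 'spot_retries': '3'}))
```

Because the values are coerced before the sign check, a quoted number in the yaml file is accepted and returned as an integer.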