Add device memory #2565

Closed
wants to merge 54 commits into from
Changes from 53 commits
Commits
8f08deb
Fixed fetch_azure.py to also fetch Device_Memory
jc9123 Sep 17, 2023
5acdcb1
Added device memory to fetch_azure
jc9123 Sep 17, 2023
78324fd
fixed formatting
jc9123 Sep 17, 2023
3d0c20e
fixed more formatting
jc9123 Sep 17, 2023
d8c07c8
fixed more format error
jc9123 Sep 17, 2023
e0e13bc
fixed format
jc9123 Sep 17, 2023
8a21f28
fixed format
jc9123 Sep 19, 2023
f7d3403
Added device memory to fetch_gcp
jc9123 Sep 25, 2023
99148c7
fixed device memory conflict
jc9123 Sep 25, 2023
b1cb7ab
updated conflict
jc9123 Sep 25, 2023
ccd750f
[TPU] Update default runtime version for TPU node (#2601)
Michaelvll Sep 25, 2023
96036de
[GCP] Avoid dumping cachetools for backward compatibilty (#2604)
Michaelvll Sep 25, 2023
7b00f1c
[GCP] Update tpu runtime version for doc (#2602)
Michaelvll Sep 25, 2023
84609be
[Docs] Add debugging container instructions to `contributing.md` (#2606)
romilbhardwaj Sep 25, 2023
d232a82
[Catalog] Fix aws catalog fetcher for removed offerings (#2610)
Michaelvll Sep 27, 2023
a97cd62
[LLM] Use `--ip` CLI to get cluster IPs (#2614)
Michaelvll Sep 27, 2023
0f29a5f
News: Mistral 7B (#2615)
concretevitamin Sep 27, 2023
f14245a
[Core] Fix version requirements for `google-api-python-client` (#2577)
cblmemo Sep 27, 2023
077911c
News: tweak width. (#2616)
concretevitamin Sep 28, 2023
e814a30
Minor: Update CONTRIBUTING.md (#2624)
concretevitamin Sep 28, 2023
3304a31
[UX] Print useful message when image id not found (#2535)
cblmemo Sep 28, 2023
608871c
[Provisioner] New provisioner with AWS support (#1702)
suquark Sep 29, 2023
da33d4f
UX: Fix spot launch hint. (#2630)
concretevitamin Sep 30, 2023
a05cd33
[Docs] Add falcon to list of LLMs (#2637)
romilbhardwaj Oct 2, 2023
202e663
Fix nits (#2633)
suquark Oct 2, 2023
a04a309
[Provisioner] Get rid of ray dependency locally for aws (#2625)
Michaelvll Oct 2, 2023
d27df87
[Dependency] install typin_extensions for all versions (#2642)
Michaelvll Oct 2, 2023
13e1533
README: update news and LLM list. (#2643)
concretevitamin Oct 3, 2023
14fe045
[Core] Fix type of `job_id` when querying job status (#2541)
cblmemo Oct 3, 2023
a322b96
[Provisioner] Fix ports on GCP for TPU VM and cluster launched before…
cblmemo Oct 3, 2023
eb6f490
[CoreWeave] Adding CoreWeave label to identify GPUs in K8s (#2650)
rtalaricw Oct 4, 2023
ad9f13f
[Core] Fix optimizer for dag when some resources provided are not fea…
Michaelvll Oct 4, 2023
3ae9b72
skip v5 catalog fetching (#2656)
infwinston Oct 4, 2023
55b4b2e
Add more details in exception (#2654)
Michaelvll Oct 4, 2023
dbae488
[k8s] Multi-node support for Kubernetes (#2609)
romilbhardwaj Oct 4, 2023
f56c2b1
[CLI] Restore `sky logs <cluster> * --sync-down` (#2660)
cblmemo Oct 4, 2023
65ad5db
UX: Allow inferring cloud from region or zone. (#2632)
concretevitamin Oct 4, 2023
753585f
[Core] Pin remote dependency for ray job (#2659)
Michaelvll Oct 4, 2023
6c9d215
[CLI] Fix bugs when specify job_id in `sky logs` (#2662)
cblmemo Oct 5, 2023
b15d29c
Fix creation of `~/sky_logging/{timestamp}` under dir running `sky` c…
cblmemo Oct 5, 2023
163e4b5
[GCP] Add retry for transient error during launching GCP clusters (#2…
Michaelvll Oct 6, 2023
c6ee6bc
Fix usage by checking detailed_reason (#2672)
Michaelvll Oct 6, 2023
4a0f8f3
Add sky show-gpus support for Kubernetes (#2638)
hemildesai Oct 7, 2023
92bec55
[Core] Fix the NoOp for rich status (#2678)
Michaelvll Oct 9, 2023
9177ec5
[Spot] Fix OOM for long running spot controller (#2675)
Michaelvll Oct 9, 2023
e00a3f9
refactor: 💡 update faq and add more detailed error message (#2594)
sunny0826 Oct 10, 2023
6d3ada5
[Provisioner] Avoid backward compatibility issue with provisioner (#2…
Michaelvll Oct 10, 2023
2f43fd7
[Docs] Add AI assistant for docs (#2688)
romilbhardwaj Oct 10, 2023
940d864
fixed branch
jc9123 Oct 11, 2023
b2732ff
add device memory
jc9123 Oct 11, 2023
a814e8c
fixed branch
jc9123 Oct 11, 2023
8d4e9bd
updated branch
jc9123 Oct 11, 2023
45f02f3
Merge branch 'skypilot-org:master' into AddDeviceMemory
jc9123 Oct 11, 2023
2d621e2
fixed gpu mapping
jc9123 Oct 13, 2023
74 changes: 71 additions & 3 deletions sky/clouds/service_catalog/data_fetchers/fetch_azure.py
@@ -58,10 +58,10 @@ def get_regions() -> List[str]:
# We have to manually remove it.
DEPRECATED_FAMILIES = ['standardNVSv2Family']

-USEFUL_COLUMNS = [
+USEFUL_COLUMNS = {
'InstanceType', 'AcceleratorName', 'AcceleratorCount', 'vCPUs', 'MemoryGiB',
-'GpuInfo', 'Price', 'SpotPrice', 'Region', 'Generation'
-]
+'GpuInfo', 'Price', 'SpotPrice', 'Region', 'Generation', 'DeviceMemory'
+}


def get_pricing_url(region: Optional[str] = None) -> str:
@@ -244,11 +244,79 @@ def get_additional_columns(row):
axis='columns',
)

def create_gpu_map(df):
# Map of Azure GPU instance types to their corresponding device memory.
# The values are hard-coded because Azure's API does not return this info;
# they may become outdated, so the map needs to be maintained.
Member

Could you add a reference link showing where this information was found? Also, how did we make sure we cover all the instance types?

Author

I looked through the Azure documentation to ensure that all instance types are included. I also ran the script with --all-regions, and the result looked fine to me.
However, I think the approach you mention below makes more sense: map instance type -> GPU name, then GPU name -> GPU memory. The script already has a mapping from instance type -> GPU name. Assuming that mapping is complete, we can easily map each GPU name to its corresponding memory.

gpu_map = {
Member

Could we map instance type -> GPU name first and then calculate the resulting device memory afterwards? This two-level approach might be cleaner.

Author

I agree this approach is much cleaner, as it needs far less hard-coding and reuses information that is already fetched. I will change the script to use it.

'Standard_NC6': 12,
'Standard_NC12': 24,
'Standard_NC24': 48,
'Standard_NC24r*': 48,
'Standard_NC6s_v2': 16,
'Standard_NC12s_v2': 32,
'Standard_NC24s_v2': 64,
'Standard_NC24rs_v2*': 64,
'Standard_NC6s_v3': 16,
'Standard_NC12s_v3': 32,
'Standard_NC24s_v3': 32,
'Standard_NC4as_T4_v3': 16,
'Standard_NC8as_T4_v3': 16,
'Standard_NC16as_T4_v3': 16,
'Standard_NC64as_T4_v3': 64,
'Standard_NC24ads_A100_v4': 80,
'Standard_NC48ads_A100_v4': 160,
'Standard_NC96ads_A100_v4': 320,
'Standard_ND96asr_v4': 40,
'Standard_ND96amsr_A100_v4': 80,
'Standard_ND6s': 24,
'Standard_ND12s': 48,
'Standard_ND24s': 96,
'Standard_ND24rs*': 96,
Member

Why is there a *?

Author

jc9123 Oct 13, 2023

It's one of the instance types offered by Azure.
https://learn.microsoft.com/en-us/azure/virtual-machines/nd-series

'Standard_ND40rs_v2': 32,
'Standard_NG8ads_V620_v1': 8,
'Standard_NG16ads_V620_v1': 16,
'Standard_NG32ads_V620_v1': 32,
'Standard_NG32adms_V620_v1': 32,
'Standard_NV6': 8,
'Standard_NV12': 16,
'Standard_NV24': 32,
'Standard_NV12s_v3': 8,
'Standard_NV24s_v3': 16,
'Standard_NV48s_v3': 32,
'Standard_NV4as_v4': 2,
'Standard_NV8as_v4': 4,
'Standard_NV16as_v4': 8,
'Standard_NV32as_v4': 16,
'Standard_NV6ads_A10_v5': 4,
'Standard_NV12ads_A10_v5': 8,
'Standard_NV18ads_A10_v5': 12,
'Standard_NV36ads_A10_v5': 24,
'Standard_NV36adms_A10_v5': 24,
'Standard_NV72ads_A10_v5': 48,
'Standard_NV6_Promo': 16,
'Standard_NV12_Promo': 32,
'Standard_NV24_Promo': 48
}

all_instance = df.InstanceType.unique()

for instance in all_instance:
if instance not in gpu_map:
gpu_map[instance] = ''
return gpu_map

def map_device_memory(row, dic):
return dic[row]
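The two-level approach agreed on in the review thread above (instance type -> GPU name, then GPU name -> memory) could look roughly like the sketch below. It is plain Python rather than the fetcher's pandas code. `INSTANCE_TO_GPU`, `GPU_MEMORY_GB`, and `device_memory_gb` are hypothetical names, and the three instance entries are illustrative only; in the real script the GPU name and count would come from the mapping it already builds.

```python
from typing import Dict, Optional, Tuple

# Hypothetical first level: instance type -> (GPU name, GPU count).
# Illustrative entries only; the fetcher already derives this information.
INSTANCE_TO_GPU: Dict[str, Tuple[str, int]] = {
    'Standard_NC6s_v3': ('V100', 1),
    'Standard_NC12s_v3': ('V100', 2),
    'Standard_NC4as_T4_v3': ('T4', 1),
}

# Hypothetical second level: GPU name -> memory per device.
# Values match the per-GPU numbers used in the GCP map of this PR.
GPU_MEMORY_GB: Dict[str, int] = {'V100': 16, 'T4': 16, 'K80': 12}


def device_memory_gb(instance_type: str) -> Optional[int]:
    """Return total GPU memory for an instance type, or None if unknown."""
    gpu = INSTANCE_TO_GPU.get(instance_type)
    if gpu is None:
        # CPU-only or unmapped instance type.
        return None
    name, count = gpu
    per_device = GPU_MEMORY_GB.get(name)
    if per_device is None:
        # GPU name is known but its memory size has not been recorded.
        return None
    return per_device * count


print(device_memory_gb('Standard_NC12s_v3'))
print(device_memory_gb('Standard_D2s_v3'))
```

With this layout only the short per-GPU table is hard-coded, so a new instance size needs no new entry as long as its GPU name is already listed.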

before_drop_len = len(df_ret)
df_ret.dropna(subset=['InstanceType'], inplace=True, how='all')
after_drop_len = len(df_ret)
print(f'Dropped {before_drop_len - after_drop_len} duplicated rows')

df_ret['DeviceMemory'] = df_ret.InstanceType.apply(
map_device_memory, args=(create_gpu_map(df_ret),))
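The keys ending in `*` (for example 'Standard_ND24rs*') may deserve a second look. On the Azure size pages the asterisk appears to be a footnote marker (RDMA capable) rather than part of the size name; if that reading is right, those keys would never match an InstanceType value during the lookup above. A minimal sketch, under that assumption, of normalizing the keys first; `normalize_keys` is a hypothetical helper and not part of the fetcher:

```python
from typing import Dict


def normalize_keys(gpu_map: Dict[str, int]) -> Dict[str, int]:
    """Return a copy of gpu_map with any trailing '*' stripped from keys."""
    return {name.rstrip('*'): memory for name, memory in gpu_map.items()}


# Two entries copied from the map in this diff.
raw_map = {'Standard_ND24s': 96, 'Standard_ND24rs*': 96}
clean_map = normalize_keys(raw_map)
print(sorted(clean_map))
```

Removing the asterisks from the literal itself would be simpler; the helper only makes the intent explicit.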

# Filter out deprecated families
df_ret = df_ret.loc[~df_ret['family'].isin(DEPRECATED_FAMILIES)]
df_ret = df_ret[USEFUL_COLUMNS]
17 changes: 17 additions & 0 deletions sky/clouds/service_catalog/data_fetchers/fetch_gcp.py
@@ -513,6 +513,23 @@ def get_catalog_df(region_prefix: str) -> pd.DataFrame:
# Round the prices.
df['Price'] = df['Price'].round(PRICE_ROUNDING)
df['SpotPrice'] = df['SpotPrice'].round(PRICE_ROUNDING)
gpu_map = {
Member

Could you add a reference link in a comment above?

'L4': 24,
'A100': 40,
'A100-80GB': 80,
'A100-40GB': 40,
'T4': 16,
'P4': 8,
'V100': 16,
'P100': 16,
'K80': 12,
'': ''
Member

Sorry, why this ''?

Author

This was a mistake on my end; I will remove it.

}

df['DeviceMemory'] = df.apply(
lambda row: gpu_map[row['AcceleratorName']] * row['AcceleratorCount']
if pd.notnull(row['AcceleratorName']) else np.nan,
axis=1)
return df
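The `'': ''` entry questioned in the review is not needed if rows without an accelerator are handled before the lookup. The sketch below shows that idea in plain Python rather than pandas. `total_device_memory` is a hypothetical name, the memory values are copied from the map in this diff, and returning NaN for an unknown GPU name is an assumption made here for illustration (the code in the diff would raise a KeyError instead).

```python
import math
from typing import Any

# Per-device memory, copied from the gpu_map in this diff ('' entry dropped).
GPU_MEMORY_GB = {
    'L4': 24, 'A100': 40, 'A100-80GB': 80, 'A100-40GB': 40,
    'T4': 16, 'P4': 8, 'V100': 16, 'P100': 16, 'K80': 12,
}


def total_device_memory(accelerator_name: Any, accelerator_count: float) -> float:
    """Total GPU memory for one catalog row, or NaN without a known GPU."""
    # Missing accelerators show up as None or NaN, so accept only strings.
    if not isinstance(accelerator_name, str):
        return math.nan
    per_device = GPU_MEMORY_GB.get(accelerator_name)
    if per_device is None:
        # Assumption: unknown names yield NaN instead of raising KeyError.
        return math.nan
    return per_device * accelerator_count


print(total_device_memory('A100-80GB', 8))
print(total_device_memory(None, 0))
```

Raising on unknown names instead would surface a newly added GPU type at fetch time; which behavior fits better depends on how the catalog is consumed.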

