ROCm support #252
Open
88Ocelot wants to merge 4 commits into smallcloudai:main from 88Ocelot:feature/rocm
@@ -0,0 +1,31 @@
version: "3.9"
services:
  refact_self_hosted:
    # TODO: figureout how to pass gpu to docker builds, so there is no need to install deepspeed at runtime
    command: >
      /bin/bash -c 'pip install deepspeed --no-cache-dir
      && python -m self_hosting_machinery.watchdog.docker_watchdog'
    image: refact_self_hosting_rocm
    build:
      dockerfile: rocm.Dockerfile
    shm_size: "32gb"
    devices:
      - "/dev/kfd"
      - "/dev/dri"
    group_add:
      - "video"
    security_opt:
      - seccomp:unconfined
    volumes:
      - perm_storage:/perm_storage
    ports:
      - 8008:8008
  nginx:
    image: nginx
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/conf.d/default.conf:ro

volumes:
  perm_storage:
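The compose file above passes `/dev/kfd` and `/dev/dri` into the container, so both nodes have to exist on the host before the service can start. A minimal sketch of a pre-flight check, assuming it runs on the Docker host; the function name `missing_device_nodes` is hypothetical and not part of this PR:

```python
from pathlib import Path

# Device nodes listed under `devices:` in the compose file above.
ROCM_DEVICE_NODES = ("/dev/kfd", "/dev/dri")


def missing_device_nodes(nodes=ROCM_DEVICE_NODES):
    """Return the entries of `nodes` that do not exist on this host."""
    return [node for node in nodes if not Path(node).exists()]


if __name__ == "__main__":
    missing = missing_device_nodes()
    if missing:
        print("missing device nodes:", ", ".join(missing))
    else:
        print("all listed device nodes are present")
```

This only checks that the nodes exist; it does not verify permissions or membership in the `video` group.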
@@ -0,0 +1,68 @@
FROM ocelot88/rocm-pytorch-slim:rocm-5.7.1-dev-torch-2.3
RUN apt-get update
RUN DEBIAN_FRONTEND="noninteractive" apt-get install -y \
    curl \
    git \
    htop \
    tmux \
    file \
    vim \
    expect \
    mpich \
    libmpich-dev \
    python3 python3-pip \
    && rm -rf /var/lib/{apt,dpkg,cache,log}


RUN update-alternatives --install /usr/bin/python python /usr/bin/python3 1

# linguist requisites
RUN apt-get update
RUN DEBIAN_FRONTEND=noninteractive TZ=Etc/UTC apt-get install -y \
    expect \
    ruby-full \
    ruby-bundler \
    build-essential \
    cmake \
    pkg-config \
    libicu-dev \
    zlib1g-dev \
    libcurl4-openssl-dev \
    libssl-dev
RUN git clone https://github.com/smallcloudai/linguist.git /tmp/linguist \
    && cd /tmp/linguist \
    && bundle install \
    && rake build_gem

ENV PATH="${PATH}:/tmp/linguist/bin"

RUN DEBIAN_FRONTEND=noninteractive TZ=Etc/UTC apt-get install -y python3-packaging

ENV INSTALL_OPTIONAL=TRUE
ENV BUILD_CUDA_EXT=1
ENV USE_ROCM=1
ENV GITHUB_ACTIONS=true
ENV AMDGPU_TARGETS="gfx1030"
ENV FLASH_ATTENTION_FORCE_BUILD=TRUE
ENV MAX_JOBS=8
COPY . /tmp/app
RUN pip install --upgrade pip ninja packaging
RUN DEBIAN_FRONTEND=noninteractive apt-get install python3-mpi4py -y
ENV PYTORCH_ROCM_ARCH="gfx1030"
ENV ROCM_TARGET="gfx1030"
ENV ROCM_HOME=/opt/rocm-5.7.1
# TODO: https://github.com/TimDettmers/bitsandbytes/pull/756 remove this layer, when this pr merged
RUN git clone https://github.com/arlo-phoenix/bitsandbytes-rocm-5.6 && \
    cd bitsandbytes-rocm-5.6 && \
    make hip && pip install . && \
    cd .. && rm -rf bitsandbytes-rocm-5.6
RUN pip install /tmp/app -v --no-build-isolation && rm -rf /tmp/app
RUN ln -s ${ROCM_HOME} /opt/rocm
ENV REFACT_PERM_DIR "/perm_storage"
ENV REFACT_TMP_DIR "/tmp"
ENV RDMAV_FORK_SAFE 0
ENV RDMAV_HUGEPAGES_SAFE 0

EXPOSE 8008

CMD ["python", "-m", "self_hosting_machinery.watchdog.docker_watchdog"]
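Since the image is built on a ROCm PyTorch base, a quick sanity check inside the running container is whether the installed `torch` is actually a HIP build and can see the GPU. The sketch below is an illustration, not part of this PR: the helper name `describe_torch_backend` is hypothetical, and it relies only on `torch.version.hip` (a version string on ROCm builds, `None` otherwise) and `torch.cuda.is_available()`, which ROCm builds of PyTorch reuse for AMD GPUs.

```python
import importlib


def describe_torch_backend(torch_module):
    """Summarise whether `torch_module` is a ROCm build with a visible GPU."""
    # `torch.version.hip` is None (or absent) on non-ROCm builds.
    hip_version = getattr(torch_module.version, "hip", None)
    # ROCm builds expose AMD GPUs through the `torch.cuda` namespace.
    gpu_visible = bool(torch_module.cuda.is_available())
    return {"hip_version": hip_version, "gpu_visible": gpu_visible}


if __name__ == "__main__":
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        print("torch is not installed in this environment")
    else:
        print(describe_torch_backend(torch))
```

If `hip_version` is `None` or `gpu_visible` is `False`, the problem likely lies in the base image or device passthrough rather than in the application code, though this check alone cannot tell those two apart.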
This was the only issue I found with this build so far :D I am testing it right now, just waiting for the models to download.
After some building and testing, I have encountered a big issue.
After some testing today, I can say that, sadly, we need to wait longer to make this happen. For example, flash_attention will probably work starting from ROCm 5.7, once it gets a stable release. I saw that you tried some workarounds, but I believe they did not work due to ROCm library differences.
So far, even when it built and started, most of the time I just got a timeout error and the model was not loaded properly.