Bump Olah to v0.2.0 #15

Merged
merged 12 commits into from
Aug 14, 2024
19 changes: 19 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,19 @@
# How can I contribute to Olah?

Everyone is welcome to contribute, and we value everybody's contribution. Code contributions are not the only way to help the community. Answering questions, helping others, and improving the documentation are also immensely valuable.

It also helps us if you spread the word! Reference the library in blog posts about the awesome projects it made possible, shout out on Twitter every time it has helped you, or simply ⭐️ the repository to say thank you.

However you choose to contribute, please be mindful and respect our code of conduct.

## Ways to contribute

There are lots of ways you can contribute to Olah:
* Submitting issues on GitHub to report bugs or make feature requests
* Fixing outstanding issues with the existing code
* Implementing new features
* Contributing to the examples or to the documentation

*All are equally valuable to the community.*

#### This guide was heavily inspired by the awesome [transformers guide to contributing](https://github.com/huggingface/transformers/blob/master/CONTRIBUTING.md)
6 changes: 4 additions & 2 deletions README.md
@@ -102,7 +102,10 @@ Or you can specify the host address and listening port:
```bash
python -m olah.server --host localhost --port 8090
```
Please remember to change the `--mirror-url` and `--mirror-lfs-url` to the actual URLs of the mirror site while modifying the host and port.
**Note: When you change the host and port, also set `--mirror-netloc` and `--mirror-lfs-netloc` to the mirror site's actual network location (`host:port`).**
```bash
python -m olah.server --host 192.168.1.100 --port 8090 --mirror-netloc 192.168.1.100:8090
```

The default mirror cache path is `./repos`; you can change it with the `--repos-path` parameter:
```bash
@@ -185,7 +188,6 @@ allow = false

## Future Work

* Authentication
* Administrator and user system
* OOS backend support
* Mirror Update Schedule Task
38 changes: 22 additions & 16 deletions README_zh.md
@@ -4,7 +4,7 @@
<p align="center">
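<b>Self-hosted Lightweight HuggingFace Mirror Service</b>

Olah is a self-hosted lightweight HuggingFace mirror service. `Olah` comes from Hilichurlian, in which it means `hello`.
Olah is an open-source, self-hosted lightweight HuggingFace mirror service. `Olah` comes from Hilichurlian, in which it means `hello`.
Olah truly implements `mirroring` of huggingface resources, rather than being just a simple `reverse proxy`.
Olah does not mirror the entire huggingface site up front; instead, it mirrors (or, one could say, caches) resources at the file-block level as users download them.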

@@ -102,7 +102,11 @@ python -m olah.server
```bash
python -m olah.server --host localhost --port 8090
```
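Remember to change `--mirror-url` and `--mirror-lfs-url` to the actual URLs of the mirror site when modifying the host and port.
**Note: Remember to change `--mirror-netloc` and `--mirror-lfs-netloc` to the actual URLs of the mirror site when modifying the host and port.**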

```bash
python -m olah.server --host 192.168.1.100 --port 8090 --mirror-netloc 192.168.1.100:8090
```

The default mirror cache path is `./repos`; you can change it with the `--repos-path` parameter:
```bash
@@ -137,17 +141,19 @@ mirror-netloc = "localhost:8090"
mirror-lfs-netloc = "localhost:8090"
mirrors-path = ["./mirrors_dir"]
```
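host: the host address Olah listens on
port: the port Olah listens on
ssl-key and ssl-cert: paths to the key and cert files, passed in when HTTPS needs to be enabled
repos-path: the directory used to store cached data
hf-scheme: the network protocol of the official huggingface site (usually no need to change)
hf-netloc: the network location of the official huggingface site (usually no need to change)
hf-lfs-netloc: the network location of LFS files on the official huggingface site (usually no need to change)
mirror-scheme: the network protocol of the Olah mirror site (should match the settings above; change it to https when ssl-key and ssl-cert are provided)
mirror-netloc: the network location of the Olah mirror site (should match the host and port settings)
mirror-lfs-netloc: the network location of the Olah mirror site's LFS (should match the host and port settings)
mirrors-path: additional mirror file directories. If you have already cloned some git repositories, you can place them in this directory to make them available for download. The example directory here is `./mirrors_dir`; to add the dataset `Salesforce/wikitext`, place the git repository at `./mirrors_dir/datasets/Salesforce/wikitext`. Likewise, models go under `./mirrors_dir/models/organization/repository`.

- host: the host address Olah listens on
- port: the port Olah listens on
- ssl-key and ssl-cert: paths to the key and cert files, passed in when HTTPS needs to be enabled
- repos-path: the directory used to store cached data
- hf-scheme: the network protocol of the official huggingface site (usually no need to change)
- hf-netloc: the network location of the official huggingface site (usually no need to change)
- hf-lfs-netloc: the network location of LFS files on the official huggingface site (usually no need to change)
- mirror-scheme: the network protocol of the Olah mirror site (should match the settings above; change it to https when ssl-key and ssl-cert are provided)
- mirror-netloc: the network location of the Olah mirror site (should match the host and port settings)
- mirror-lfs-netloc: the network location of the Olah mirror site's LFS (should match the host and port settings)
- mirrors-path: additional mirror file directories. If you have already cloned some git repositories, you can place them in this directory to make them available for download. The example directory here is `./mirrors_dir`; to add the dataset `Salesforce/wikitext`, place the git repository at `./mirrors_dir/datasets/Salesforce/wikitext`. Likewise, models go under `./mirrors_dir/models/organization/repository`.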


The second part can be used to restrict accessibility:
```toml
@@ -180,9 +186,9 @@ allow = true
repo = "adept/fuyu-8b"
allow = false
```
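offline: sets whether the Olah mirror site enters offline mode, in which it no longer sends requests to the official huggingface site to update data; repositories that are already cached can still be downloaded
proxy: sets whether a repository may be proxied; all repositories are allowed by default. `repo` matches the repository name, using either a regular expression or a wildcard pattern; `use_re` controls whether a regular expression is used (wildcards are the default); `allow` controls whether the rule allows or denies proxying.
cache: sets whether a repository will be cached; all repositories are allowed by default. `repo` matches the repository name, using either a regular expression or a wildcard pattern; `use_re` controls whether a regular expression is used (wildcards are the default); `allow` controls whether the rule allows or denies caching.
- offline: sets whether the Olah mirror site enters offline mode, in which it no longer sends requests to the official huggingface site to update data; repositories that are already cached can still be downloaded
- proxy: sets whether a repository may be proxied; all repositories are allowed by default. `repo` matches the repository name, using either a regular expression or a wildcard pattern; `use_re` controls whether a regular expression is used (wildcards are the default); `allow` controls whether the rule allows or denies proxying.
- cache: sets whether a repository will be cached; all repositories are allowed by default. `repo` matches the repository name, using either a regular expression or a wildcard pattern; `use_re` controls whether a regular expression is used (wildcards are the default); `allow` controls whether the rule allows or denies caching.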
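
As a rough illustration of how `proxy` and `cache` rules of this shape could be evaluated, here is a minimal Python sketch. Only the `repo`, `use_re`, and `allow` fields and the allow-by-default behavior come from the README text; the `Rule` class, the `is_allowed` helper, the `example-org` pattern, and the choice that the last matching rule wins are assumptions made for this example, not Olah's actual implementation.

```python
import fnmatch
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Rule:
    """Hypothetical in-memory form of one proxy/cache rule."""

    repo: str
    allow: bool = True
    use_re: bool = False  # wildcard matching is the documented default

    def matches(self, repo_name: str) -> bool:
        if self.use_re:
            # Treat `repo` as a regular expression over the whole name.
            return re.fullmatch(self.repo, repo_name) is not None
        # Otherwise treat `repo` as a shell-style wildcard pattern.
        return fnmatch.fnmatchcase(repo_name, self.repo)


def is_allowed(rules: List[Rule], repo_name: str) -> bool:
    allowed = True  # everything is allowed unless a rule says otherwise
    for rule in rules:
        if rule.matches(repo_name):
            allowed = rule.allow  # assumption: later rules override earlier ones
    return allowed


rules = [
    Rule(repo="adept/*", allow=True),
    Rule(repo="adept/fuyu-8b", allow=False),
    Rule(repo=r"example-org/.*-private", allow=False, use_re=True),
]
```

With these rules, `is_allowed(rules, "adept/fuyu-8b")` is `False` because the second rule overrides the first, while a repository that matches no rule falls through to the default and is allowed.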

## License

8 changes: 8 additions & 0 deletions docs/en/main.md
@@ -0,0 +1,8 @@
<h1 align="center">Olah Documentation</h1>

<p align="center">
<b>Self-hosted Lightweight Huggingface Mirror Service</b>

Olah is a self-hosted, lightweight huggingface mirror service. `Olah` means `hello` in Hilichurlian.
Olah implements true `mirroring` of huggingface resources rather than acting as a simple `reverse proxy`.
Olah does not mirror the entire huggingface website up front; instead, it mirrors (or, put differently, caches) resources at the file-block level as users download them.
Empty file added docs/en/quickstart.md
Empty file.
9 changes: 9 additions & 0 deletions docs/zh/main.md
@@ -0,0 +1,9 @@
<h1 align="center">Olah Documentation</h1>


<p align="center">
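<b>Self-hosted Lightweight HuggingFace Mirror Service</b>

Olah is an open-source, self-hosted lightweight HuggingFace mirror service. `Olah` comes from Hilichurlian, in which it means `hello`.
Olah truly implements `mirroring` of huggingface resources, rather than being just a simple `reverse proxy`.
Olah does not mirror the entire huggingface site up front; instead, it mirrors (or, one could say, caches) resources at the file-block level as users download them.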
Empty file added docs/zh/quickstart.md
Empty file.
Empty file added olah/auth/__init__.py
Empty file.
12 changes: 8 additions & 4 deletions olah/configs.py
@@ -1,6 +1,6 @@
# coding=utf-8
# Copyright 2024 XiaHan
#
#
# Use of this source code is governed by an MIT-style
# license that can be found in the LICENSE file or at
# https://opensource.org/licenses/MIT.
@@ -88,9 +88,13 @@ def __init__(self, path: Optional[str] = None) -> None:
self.hf_netloc: str = "huggingface.co"
self.hf_lfs_netloc: str = "cdn-lfs.huggingface.co"

self.mirror_scheme: str = "http"
self.mirror_netloc: str = "localhost:8090"
self.mirror_lfs_netloc: str = "localhost:8090"
self.mirror_scheme: str = "http" if self.ssl_key is None else "https"
self.mirror_netloc: str = (
f"{self.host if self.host != '0.0.0.0' else 'localhost'}:{self.port}"
)
self.mirror_lfs_netloc: str = (
f"{self.host if self.host != '0.0.0.0' else 'localhost'}:{self.port}"
)

self.mirrors_path: List[str] = []
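
The change to the mirror defaults in this hunk can be read in isolation as a small pure function. The sketch below restates that logic outside the config class; the function name `default_mirror_settings` and its signature are hypothetical and exist only for illustration, while the two expressions mirror the ones in the diff.

```python
from typing import Optional, Tuple


def default_mirror_settings(
    host: str, port: int, ssl_key: Optional[str] = None
) -> Tuple[str, str]:
    """Return (scheme, netloc) following the new defaults in the diff."""
    # HTTPS is assumed whenever an SSL key has been configured.
    scheme = "http" if ssl_key is None else "https"
    # 0.0.0.0 is a bind-all address rather than a hostname to advertise,
    # so the diff substitutes localhost in the default network location.
    advertised_host = host if host != "0.0.0.0" else "localhost"
    return scheme, f"{advertised_host}:{port}"
```

Note that the `localhost` fallback is only reachable from the machine running the server, which is presumably why the README asks for `--mirror-netloc` to be set explicitly when the mirror serves other hosts.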

4 changes: 3 additions & 1 deletion olah/constants.py
@@ -1,6 +1,6 @@
# coding=utf-8
# Copyright 2024 XiaHan
#
#
# Use of this source code is governed by an MIT-style
# license that can be found in the LICENSE file or at
# https://opensource.org/licenses/MIT.
@@ -11,6 +11,8 @@

DEFAULT_LOGGER_DIR = "./logs"

ORIGINAL_LOC = "oriloc"

from huggingface_hub.constants import (
REPO_TYPES_MAPPING,
HUGGINGFACE_CO_URL_TEMPLATE,
8 changes: 6 additions & 2 deletions olah/errors.py
@@ -1,5 +1,9 @@


# coding=utf-8
# Copyright 2024 XiaHan
#
# Use of this source code is governed by an MIT-style
# license that can be found in the LICENSE file or at
# https://opensource.org/licenses/MIT.

from fastapi import Response
from fastapi.responses import JSONResponse
8 changes: 7 additions & 1 deletion olah/mirror/meta.py
@@ -1,3 +1,9 @@
# coding=utf-8
# Copyright 2024 XiaHan
#
# Use of this source code is governed by an MIT-style
# license that can be found in the LICENSE file or at
# https://opensource.org/licenses/MIT.


class RepoMeta(object):
@@ -18,7 +24,7 @@ def __init__(self) -> None:
self.cardData = None
self.siblings = None
self.createdAt = None

def to_dict(self):
return {
"_id": self._id,
42 changes: 24 additions & 18 deletions olah/mirror/repos.py
@@ -1,6 +1,6 @@
# coding=utf-8
# Copyright 2024 XiaHan
#
#
# Use of this source code is governed by an MIT-style
# license that can be found in the LICENSE file or at
# https://opensource.org/licenses/MIT.
@@ -15,6 +15,8 @@
import yaml

from olah.mirror.meta import RepoMeta


class LocalMirrorRepo(object):
def __init__(self, path: str, repo_type: str, org: str, repo: str) -> None:
self._path = path
@@ -23,21 +25,21 @@ def __init__(self, path: str, repo_type: str, org: str, repo: str) -> None:
self._repo = repo

self._git_repo = Repo(self._path)

def _sha256(self, text: Union[str, bytes]) -> str:
if isinstance(text, bytes) or isinstance(text, bytearray):
bin = text
elif isinstance(text, str):
bin = text.encode('utf-8')
bin = text.encode("utf-8")
else:
raise Exception("Invalid sha256 param type.")
sha256_hash = hashlib.sha256()
sha256_hash.update(bin)
hashed_string = sha256_hash.hexdigest()
return hashed_string

def _match_card(self, readme: str) -> str:
pattern = r'\s*---(.*?)---'
pattern = r"\s*---(.*?)---"

match = re.match(pattern, readme, flags=re.S)

@@ -46,22 +48,23 @@ def _match_card(self, readme: str) -> str:
return card_string
else:
return ""

def _remove_card(self, readme: str) -> str:
pattern = r'\s*---(.*?)---'
pattern = r"\s*---(.*?)---"
out = re.sub(pattern, "", readme, flags=re.S)
return out

def _get_readme(self, commit: Commit) -> str:
if "README.md" not in commit.tree:
return ""
else:
out: bytes = commit.tree["README.md"].data_stream.read()
return out.decode()

def _get_description(self, commit: Commit) -> str:
readme = self._get_readme(commit)
return self._remove_card(readme)

def _get_entry_files(self, tree, include_dir=False) -> List[str]:
out_paths = []
for entry in tree:
@@ -75,7 +78,6 @@ def _get_entry_files(self, tree, include_dir=False) -> List[str]:

def _get_tree_files(self, commit: Commit) -> List[str]:
return self._get_entry_files(commit.tree)


def _get_earliest_commit(self) -> Commit:
earliest_commit = None
@@ -96,12 +98,14 @@ def get_meta(self, commit_hash: str) -> Dict[str, Any]:
except gitdb.exc.BadName:
return None
meta = RepoMeta()

meta._id = self._sha256(f"{self._org}/{self._repo}/{commit.hexsha}")
meta.id = f"{self._org}/{self._repo}"
meta.author = self._org
meta.sha = commit.hexsha
meta.lastModified = self._git_repo.head.commit.committed_datetime.strftime("%Y-%m-%dT%H:%M:%S.%fZ")
meta.lastModified = self._git_repo.head.commit.committed_datetime.strftime(
"%Y-%m-%dT%H:%M:%S.%fZ"
)
meta.private = False
meta.gated = False
meta.disabled = False
@@ -110,9 +114,13 @@ def get_meta(self, commit_hash: str) -> Dict[str, Any]:
meta.paperswithcode_id = None
meta.downloads = 0
meta.likes = 0
meta.cardData = yaml.load(self._match_card(self._get_readme(commit)), Loader=yaml.CLoader)
meta.cardData = yaml.load(
self._match_card(self._get_readme(commit)), Loader=yaml.CLoader
)
meta.siblings = [{"rfilename": p} for p in self._get_tree_files(commit)]
meta.createdAt = self._get_earliest_commit().committed_datetime.strftime("%Y-%m-%dT%H:%M:%S.%fZ")
meta.createdAt = self._get_earliest_commit().committed_datetime.strftime(
"%Y-%m-%dT%H:%M:%S.%fZ"
)
return meta.to_dict()
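
`get_meta` fills `cardData` from the README front matter via `_match_card`, and `_remove_card` strips that block to produce the description. The standalone sketch below shows the same regular expression at work using only the standard library. The helper name `split_card` is hypothetical, and it assumes the card text is the first capture group, since that part of `_match_card` is elided in the diff.

```python
import re
from typing import Tuple

# Same pattern as in the diff: optional leading whitespace, then a block
# delimited by two `---` markers; re.S lets `.` span multiple lines.
FRONT_MATTER = re.compile(r"\s*---(.*?)---", flags=re.S)


def split_card(readme: str) -> Tuple[str, str]:
    """Return (card_yaml, description) for a README string."""
    match = FRONT_MATTER.match(readme)
    if match is None:
        # No front matter at the start: the whole README is description.
        return "", readme
    card = match.group(1)
    description = readme[match.end():]
    return card, description
```

One difference worth naming: this sketch removes only the leading block, whereas the `re.sub` call in `_remove_card` replaces every `---`-delimited span it finds, which could also consume text between two horizontal rules later in a README.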

def _contain_path(self, path: str, tree: Tree) -> bool:
@@ -149,7 +157,7 @@ def get_file(self, commit_hash: str, path: str) -> Optional[OStream]:
commit = self._git_repo.commit(commit_hash)
except gitdb.exc.BadName:
return None

def stream_wrapper(file_bytes: bytes):
file_stream = io.BytesIO(file_bytes)
while True:
@@ -158,10 +166,8 @@ def stream_wrapper(file_bytes: bytes):
break
else:
yield chunk

if not self._contain_path(path, commit.tree):
return None
else:
return stream_wrapper(commit.tree[path].data_stream.read())
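
The chunked generator used by `get_file` can be sketched on its own as follows. The read call and chunk size are elided in the diff, so the 4096-byte default and the name `iter_chunks` are assumptions for illustration only.

```python
import io
from typing import Iterator


def iter_chunks(data: bytes, chunk_size: int = 4096) -> Iterator[bytes]:
    """Yield `data` in pieces of at most `chunk_size` bytes."""
    stream = io.BytesIO(data)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            # An empty read means the buffer is exhausted.
            break
        yield chunk
```

Because the caller in the diff passes the result of `data_stream.read()`, the whole blob is already in memory before chunking starts; a generator like this bounds the size of each piece handed to the response, not the peak memory used for the file.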

