add datasets
Signed-off-by: Zhiyuan Chen <[email protected]>
ZhiyuanChen committed Sep 10, 2024
1 parent 144027a commit 32efa04
Showing 60 changed files with 2,089 additions and 56 deletions.
4 changes: 4 additions & 0 deletions .codespell-whitelist.txt
@@ -1 +1,5 @@
ser
marz
manuel
wass
gir
3 changes: 3 additions & 0 deletions .gitignore
@@ -390,6 +390,9 @@ FodyWeavers.xsd
!.vscode/extensions.json
*.code-workspace

# JetBrains
.idea/

# Local History for Visual Studio Code
.history/

1 change: 1 addition & 0 deletions README.md
@@ -13,6 +13,7 @@ Welcome to MultiMolecule (浦原), a foundational library designed to accelerate
We understand that AI4Science is a broad field, with researchers from different disciplines employing various practices. Therefore, MultiMolecule is designed with low coupling in mind, meaning that while it offers a full suite of functionalities, each module can be used independently. This allows you to integrate only the components you need into your existing workflows without adding unnecessary complexity. The key functionalities that MultiMolecule provides include:

- [`data`](data): Efficient data handling and preprocessing capabilities to streamline the ingestion and transformation of scientific datasets.
- [`datasets`](datasets): A collection of widely-used datasets across different scientific domains, providing a solid foundation for training and evaluation.
- [`module`](module): Modular components designed to provide flexibility and reusability across various machine learning tasks.
- [`models`](models): State-of-the-art model architectures optimized for scientific research applications, ensuring high performance and accuracy.
- [`tokenisers`](tokenisers): Advanced tokenization methods to effectively handle complex scientific text and data representations.
1 change: 1 addition & 0 deletions README.zh.md
@@ -13,6 +13,7 @@ date: 2024-05-04 00:00:00
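We understand that AI4Science is a broad field in which researchers from different disciplines follow a variety of practices. MultiMolecule is therefore designed with low coupling in mind: although it provides a complete suite of functionality, each module can be used independently. This lets you integrate only the components you need into your existing workflow without adding unnecessary complexity. The main features MultiMolecule provides include:

- [`data`](data): Efficient data handling and preprocessing to streamline the ingestion and transformation of scientific datasets.
- [`datasets`](datasets): A collection of widely used datasets across scientific domains, providing a solid foundation for training and evaluation.
- [`module`](module): Modular components designed for flexibility and reusability across a variety of machine learning tasks.
- [`models`](models): State-of-the-art model architectures optimized for scientific research applications, ensuring high performance and accuracy.
- [`tokenisers`](tokenisers): Advanced tokenization methods that effectively handle complex scientific text and data representations.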
19 changes: 19 additions & 0 deletions demo/data/huggingface-datasets.py
@@ -0,0 +1,19 @@
# MultiMolecule
# Copyright (C) 2024-Present MultiMolecule

# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# any later version.

# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Affero General Public License for more details.

# You should have received a copy of the GNU Affero General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.

from multimolecule.data import Dataset

data = Dataset("multimolecule/bprna-spot", split="train", pretrained="multimolecule/rna")
19 changes: 19 additions & 0 deletions demo/datasets/huggingface-datasets.py
@@ -0,0 +1,19 @@
# MultiMolecule
# Copyright (C) 2024-Present MultiMolecule

# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# any later version.

# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Affero General Public License for more details.

# You should have received a copy of the GNU Affero General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.

from datasets import load_dataset

data = load_dataset("multimolecule/bprna-spot")
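As a rough illustration of what one might do with the object returned by `load_dataset`, the sketch below summarizes sequence lengths over any iterable of record dictionaries. It is not part of this commit: the helper `summarize_sequences` and the `"sequence"` column name are assumptions made for illustration, so print the loaded dataset to confirm the actual column names before relying on them. The toy records stand in for real rows so the block runs without network access.

```python
from collections import Counter
from typing import Any, Iterable, Mapping


def summarize_sequences(
    records: Iterable[Mapping[str, Any]], field: str = "sequence"
) -> dict:
    """Summarize string lengths and character usage for one text field.

    `field` defaults to "sequence", which is an ASSUMED column name.
    """
    lengths = []
    alphabet = Counter()
    for record in records:
        text = record[field]
        lengths.append(len(text))
        alphabet.update(text)  # count each character of the sequence
    if not lengths:
        return {"count": 0, "min_len": 0, "max_len": 0, "alphabet": {}}
    return {
        "count": len(lengths),
        "min_len": min(lengths),
        "max_len": max(lengths),
        "alphabet": dict(alphabet),
    }


# Toy records standing in for dataset rows (hypothetical data).
toy = [{"sequence": "ACGU"}, {"sequence": "GGC"}]
summary = summarize_sequences(toy)
print(summary["count"], summary["min_len"], summary["max_len"])  # prints: 2 3 4

# With the demo above (needs the `datasets` package and network access;
# the "sequence" column is an assumption to verify by printing `data`):
#   from datasets import load_dataset
#   data = load_dataset("multimolecule/bprna-spot")
#   print(summarize_sequences(data["train"]))
```

Because the helper accepts any iterable of mappings, the same function should work on a Hugging Face split or on plain Python lists, which keeps the summary logic easy to check in isolation.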
109 changes: 109 additions & 0 deletions docs/docs/about/license-faq.md
@@ -0,0 +1,109 @@
# License FAQ

This License FAQ explains the terms and conditions under which you may use the data, models, code, configuration, documentation, and weights provided by the DanLing Team (also known as DanLing) ('we', 'us', or 'our').
It serves as an addendum to our _[License](license.md)_.

## 0. Summary of Key Points

This summary highlights the key points of our license. For more detail on any topic, follow the link after each key point or read the full license.

<div class="grid cards" markdown>

!!! question "What constitutes the 'source code' in MultiMolecule?"

    We consider everything in our repositories to be source code, including data, models, code, configuration, and documentation.

    [:octicons-arrow-right-24: What constitutes the 'source code' in MultiMolecule?](#1-what-constitutes-the-source-code-in-multimolecule)

!!! question "Can I publish research papers using MultiMolecule?"

    It depends.

    You can publish research papers in fully open access journals and conferences, or on preprint servers, following the terms of the *[License](license.md)*.

    You must obtain a separate license from us to publish research papers in closed access journals and conferences.

    [:octicons-arrow-right-24: Can I publish research papers using MultiMolecule?](#2-can-i-publish-research-papers-using-multimolecule)

!!! question "Can I use MultiMolecule for commercial purposes?"

    Yes, you can use MultiMolecule for commercial purposes under the terms of the *[License](license.md)*.

    [:octicons-arrow-right-24: Can I use MultiMolecule for commercial purposes?](#3-can-i-use-multimolecule-for-commercial-purposes)

!!! question "Do people affiliated with certain organizations have specific license terms?"

    Yes, people affiliated with certain organizations have specific license terms.

    [:octicons-arrow-right-24: Do people affiliated with certain organizations have specific license terms?](#4-do-people-affiliated-with-certain-organizations-have-specific-license-terms)

</div>

## 1. What constitutes the "source code" in MultiMolecule?

We consider everything in our repositories to be source code.

The training process of machine learning models is viewed similarly to the compilation process of traditional software.
As such, the model, code, configuration, documentation, and data used for training are all part of the source code, while the trained model weights are part of the object code.

We also consider research papers and manuscripts a special form of documentation, which are also part of the source code.

## 2. Can I publish research papers using MultiMolecule?

Since research papers are considered a form of source code, a publisher that publishes a paper using MultiMolecule is legally required to open-source all materials on its servers to comply with the _[License](license.md)_. This is impractical for most publishers.

As a special exemption under section 7 of the _[License](license.md)_, we grant permission to publish research papers using MultiMolecule in fully open access journals and conferences, or on preprint servers, provided all published manuscripts are made available under the [GNU Free Documentation License (GFDL)](https://www.gnu.org/licenses/fdl.html), a [Creative Commons license](https://creativecommons.org), or an [OSI-approved license](https://opensource.org/licenses) that permits the sharing of manuscripts.

For publishing in closed access journals or conferences, you must obtain a separate license from us. This typically involves co-authorship, a fee to support the project, or both. Contact us at [[email protected]](mailto:[email protected]) for more information.

While not mandatory, we recommend citing the MultiMolecule project in your research papers.

## 3. Can I use MultiMolecule for commercial purposes?

Yes, MultiMolecule can be used for commercial purposes under the _[License](license.md)_. However, you must open-source any modifications to the source code and make them available under the _[License](license.md)_.

If you prefer to use MultiMolecule for commercial purposes without open-sourcing your modifications, you must obtain a separate license from us. This typically involves a fee to support the project. Contact us at [[email protected]](mailto:[email protected]) for further details.

## 4. Do people affiliated with certain organizations have specific license terms?

Yes.

If you are affiliated with an organization that has a separate license agreement with us, you may be subject to different license terms.
Please consult your organization's legal department to determine if you are subject to a separate license agreement.

Members of the following organizations automatically receive a non-transferable, non-sublicensable, and non-distributable [MIT License](https://mit-license.org/) to use MultiMolecule:

- [Microsoft Research AI for Science](https://www.microsoft.com/en-us/research/lab/microsoft-research-ai-for-science/)
- [DP Technology](https://dp.tech/)

This special license is considered an additional term under section 7 of the _[License](license.md)_.
It is not redistributable, and you are prohibited from creating any independent derivative works.
Any modifications or derivative works based on this license are automatically considered derivative works of MultiMolecule and must comply with all the terms of the _[License](license.md)_.
This ensures that third parties cannot bypass the license terms or create separate licenses from derivative works.

## 5. How can I use MultiMolecule if my organization forbids the use of code under the AGPL License?

Some organizations, such as [Google](https://opensource.google/documentation/reference/using/agpl-policy), have policies that prohibit the use of code under the AGPL License.

If you are affiliated with an organization that forbids the use of AGPL-licensed code, you must obtain a separate license from us.
Contact us at [[email protected]](mailto:[email protected]) for more information.

## 6. Can I use MultiMolecule if I am a federal employee of the United States Government?

No.

Works prepared by federal employees of the United States Government as part of their official duties, including code, are not protected by copyright in the United States under [17 U.S. Code § 105](https://www.law.cornell.edu/uscode/text/17/105).

As a result, federal employees of the United States Government cannot comply with the terms of the _[License](license.md)_.

## 7. Do we make updates to this FAQ?

!!! tip "In Short"

    Yes, we will update this FAQ as necessary to stay compliant with relevant laws.

We may update this license FAQ from time to time.
The updated version will be indicated by an updated 'Last Revised Time' at the bottom of this license FAQ.
If we make any material changes, we will notify you by posting the new license FAQ on this page.
We are unable to notify you directly as we do not collect any contact information from you.
We encourage you to review this license FAQ frequently to stay informed of how you can use our data, models, code, configuration, documentation, and weights.
115 changes: 115 additions & 0 deletions docs/docs/about/license-faq.zh.md
@@ -0,0 +1,115 @@
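!!! warning "Translation"

    This document is a translation provided for the convenience of users.
    We have made every effort to ensure that the translation is accurate.
    Please note, however, that the translation may contain errors and is for reference only.
    The English [original](https://multimolecule.danling.org/about/license) takes precedence.

    For compliance and enforcement purposes, any inaccuracy or ambiguity in the translated document is not binding and has no legal effect.

# License FAQ

This License FAQ explains the conditions under which you may use the data, models, code, configuration, documentation, and weights provided by the DanLing Team (also known as DanLing) ("we" or "our").
It serves as an addendum to our _[License](license.zh.md)_.

## 0. Summary of Key Points

This summary provides the key points of this FAQ. For more detail, follow the link after each key point or use the table of contents to find the section you are looking for.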

<div class="grid cards" markdown>

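!!! question "What constitutes the 'source code' in MultiMolecule?"

    We consider everything in our repositories to be source code, including data, models, code, configuration, and documentation.

    [:octicons-arrow-right-24: What constitutes the 'source code' in MultiMolecule?](#1-multimolecule)

!!! question "Can I publish research papers using MultiMolecule?"

    It depends.

    You can publish research papers in fully open access journals and conferences, or on preprint servers, following the terms of the *[License](license.zh.md)*.

    To publish research papers in closed access journals and conferences, you must obtain a separate license from us.

    [:octicons-arrow-right-24: Can I publish research papers using MultiMolecule?](#2multimolecule)

!!! question "Can I use MultiMolecule for commercial purposes?"

    Yes, you can use MultiMolecule for commercial purposes under the terms of the *[License](license.zh.md)*.

    [:octicons-arrow-right-24: Can I use MultiMolecule for commercial purposes?](#3-multimolecule)

!!! question "Do people affiliated with certain organizations have specific license terms?"

    Yes, people affiliated with certain organizations have specific license terms.

    [:octicons-arrow-right-24: Do people affiliated with certain organizations have specific license terms?](#4)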

</div>

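## 1. What constitutes the "source code" in MultiMolecule?

We consider everything in our repositories to be source code.

The training process of a machine learning model is viewed as analogous to the compilation process of traditional software. The model, code, configuration, documentation, and training data are therefore all considered part of the source code, while the trained model weights are considered part of the object code.

We also regard research papers and manuscripts as a special form of documentation, which is likewise part of the source code.

## 2. Can I publish research papers using MultiMolecule?

Since research papers are considered a form of source code, a publisher that publishes a paper using MultiMolecule must open-source all materials on its servers to comply with the _[License](license.zh.md)_. This is impractical for most publishers.

As a special exemption under section 7 of the _[License](license.zh.md)_, we permit research papers using MultiMolecule to be published in fully open access journals and conferences, or on preprint servers, provided all published manuscripts are made available under the [GNU Free Documentation License (GFDL)](https://www.gnu.org/licenses/fdl.html), a [Creative Commons license](https://creativecommons.org), or an [OSI-approved license](https://opensource.org/licenses) that permits the sharing of manuscripts.

To publish papers in closed access journals or conferences, you must obtain a separate license from us. This typically involves co-authorship, a fee to support the project, or both. Contact us at [[email protected]](mailto:[email protected]) for more information.

While not mandatory, we recommend citing the MultiMolecule project in your research papers.

## 3. Can I use MultiMolecule for commercial purposes?

Yes, you can use MultiMolecule for commercial purposes under the _[License](license.zh.md)_. However, you must open-source any modifications to the source code and make them available under the _[License](license.zh.md)_.

If you wish to use MultiMolecule for commercial purposes without open-sourcing your modifications, you must obtain a separate license from us. This typically involves a fee to support the project. Contact us at [[email protected]](mailto:[email protected]) for further details.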

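## 4. Do people affiliated with certain organizations have specific license terms?

Yes!

If you are affiliated with an organization that has a separate license agreement with us, you may be subject to different license terms. Please consult your organization's legal department to determine whether you are subject to a separate license agreement.

Members of the following organizations automatically receive a non-transferable, non-sublicensable, and non-distributable [MIT License](https://mit-license.org/) to use MultiMolecule:

- [Microsoft Research AI for Science](https://www.microsoft.com/en-us/research/lab/microsoft-research-ai-for-science/)
- [DP Technology](https://dp.tech/)

This special license is considered an additional term under section 7 of the _[License](license.zh.md)_.
It is not redistributable, and you are prohibited from creating any independent derivative works.
Any modifications or derivative works based on this license are automatically considered derivative works of MultiMolecule and must comply with all the terms of the _[License](license.zh.md)_.
This ensures that third parties cannot bypass the license terms or create separate licenses from derivative works.

## 5. How can I use MultiMolecule if my organization forbids the use of code under the AGPL License?

Some organizations, such as [Google](https://opensource.google/documentation/reference/using/agpl-policy), have policies that prohibit the use of code under the AGPL License.

If you are affiliated with an organization that forbids the use of AGPL-licensed code, you must obtain a separate license from us. Contact us at [[email protected]](mailto:[email protected]) for further details.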

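## 6. Can I use MultiMolecule if I am a federal employee of the United States Government?

No.

Under [17 U.S. Code § 105](https://www.law.cornell.edu/uscode/text/17/105), code written by federal employees of the United States Government is not protected by copyright.

As a result, federal employees of the United States Government cannot comply with the terms of the _[License](license.zh.md)_.

## 7. Do we make updates to this FAQ?

!!! tip "In Short"

    Yes, we will update this FAQ as necessary to stay consistent with relevant laws.

We may update this license FAQ from time to time.
The updated version will be indicated by an updated "Last Revised Time" at the bottom of this page.
If we make any material changes, we will notify you by posting the new license FAQ on this page.
We are unable to notify you directly because we do not collect any contact information from you.
We encourage you to review this license FAQ frequently to stay informed of how you can use our data, models, code, configuration, documentation, and weights.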