Commit
1 parent cc39be3
commit 7624a51
Showing 8 changed files with 266 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,63 @@ | ||
name: Deploy Sphinx Documentation | ||
|
||
on: | ||
  push: | ||
    branches: | ||
      - main | ||
  workflow_dispatch: | ||
|
||
permissions: | ||
  contents: write | ||
|
||
jobs: | ||
  build-and-deploy: | ||
    runs-on: ubuntu-latest | ||
    steps: | ||
      - uses: actions/checkout@v3 | ||
|
||
      - name: Set up Python | ||
        uses: actions/setup-python@v4 | ||
        with: | ||
          python-version: '3.x' | ||
|
||
      - name: Install dependencies | ||
        run: | | ||
          python -m pip install --upgrade pip | ||
          pip install sphinx sphinx_rtd_theme myst-parser | ||
          # Add any additional dependencies your docs need | ||
          # pip install -r txt2dataset/docs/requirements.txt | ||
      - name: Clean and Build Documentation | ||
        run: | | ||
          cd txt2dataset/docs | ||
          # More aggressive cleaning | ||
          rm -rf build/ | ||
          rm -rf source/_build/ | ||
          rm -rf _build/ | ||
          git rm -rf --cached build/ || true | ||
          git rm -rf --cached _build/ || true | ||
          make clean | ||
          make html | ||
          ls -la | ||
          ls -la build/ || true | ||
      - name: Check Build Directory | ||
        run: | | ||
          pwd | ||
          ls -la txt2dataset/docs/build/html || true | ||
      - name: Deploy to GitHub Pages | ||
        uses: peaceiris/actions-gh-pages@v3 | ||
        with: | ||
          github_token: ${{ secrets.GITHUB_TOKEN }} | ||
          publish_dir: ./txt2dataset/docs/build/html | ||
          force_orphan: true # This ensures a fresh history | ||
          enable_jekyll: false | ||
          user_name: 'github-actions[bot]' | ||
          user_email: 'github-actions[bot]@users.noreply.github.com' | ||
          commit_message: 'Deploy Sphinx documentation [skip ci]' | ||
          full_commit_message: | | ||
            Deploy Sphinx documentation | ||
            Build from ${{ github.sha }} | ||
            Triggered by ${{ github.event_name }} |
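
The workflow builds the HTML inside the directory it changes into with `cd`, and publishes from the path given in `publish_dir`, so the two have to agree. A minimal sketch of a sanity check for that, in Python; the function name `publishes_from_build_root` and the example paths are illustrative assumptions, not part of this commit:

```python
from pathlib import PurePosixPath


def publishes_from_build_root(build_root: str, publish_dir: str) -> bool:
    """Return True if publish_dir lies inside build_root (both repo-relative)."""
    root = PurePosixPath(build_root)
    target = PurePosixPath(publish_dir)
    # PurePosixPath drops a leading "./", so "./a/b" and "a/b" compare equal.
    return root == target or root in target.parents


if __name__ == "__main__":
    # Hypothetical values copied by hand from a workflow file.
    print(publishes_from_build_root("txt2dataset/docs", "./txt2dataset/docs/build/html"))
```

A check like this could run locally before pushing; it only compares path strings and does not touch the file system.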
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
*.egg-info |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,23 @@ | ||
# Minimal makefile for Sphinx documentation | ||
# | ||
# You can set these variables from the command line, and also | ||
# from the environment for the first two. | ||
SPHINXOPTS ?= | ||
SPHINXBUILD ?= sphinx-build | ||
SOURCEDIR = source | ||
BUILDDIR = build | ||
|
||
# Put it first so that "make" without argument is like "make help". | ||
help: | ||
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) | ||
|
||
.PHONY: help Makefile clean | ||
|
||
# Add clean target | ||
clean: | ||
	rm -rf $(BUILDDIR)/* | ||
|
||
# Catch-all target: route all unknown targets to Sphinx using the new | ||
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS). | ||
%: Makefile | ||
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,35 @@ | ||
@ECHO OFF | ||
|
||
pushd %~dp0 | ||
|
||
REM Command file for Sphinx documentation | ||
|
||
if "%SPHINXBUILD%" == "" ( | ||
set SPHINXBUILD=sphinx-build | ||
) | ||
set SOURCEDIR=source | ||
set BUILDDIR=build | ||
|
||
%SPHINXBUILD% >NUL 2>NUL | ||
if errorlevel 9009 ( | ||
echo. | ||
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx | ||
echo.installed, then set the SPHINXBUILD environment variable to point | ||
echo.to the full path of the 'sphinx-build' executable. Alternatively you | ||
echo.may add the Sphinx directory to PATH. | ||
echo. | ||
echo.If you don't have Sphinx installed, grab it from | ||
echo.https://www.sphinx-doc.org/ | ||
exit /b 1 | ||
) | ||
|
||
if "%1" == "" goto help | ||
|
||
%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O% | ||
goto end | ||
|
||
:help | ||
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O% | ||
|
||
:end | ||
popd |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,130 @@ | ||
|
||
Dataset Builder | ||
=============== | ||
|
||
Transforms unstructured text into structured datasets using the Gemini API. You can get a free API key from `Google AI Studio <https://aistudio.google.com/app/apikey>`_, which is limited to 15 requests per minute (rpm). For higher rate limits, you can set up Google's $300 free credit trial, which lasts 90 days. | ||
|
||
Requirements | ||
------------ | ||
|
||
Input CSV must contain ``accession_number`` and ``text`` columns. | ||
|
||
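
Because the builder expects both columns, it can help to check the header before starting a long run. A minimal sketch using only the standard library; `missing_columns` is a hypothetical helper written for this note, not a function of txt2dataset:

```python
import csv
import io

# Column names taken from the requirement stated above.
REQUIRED_COLUMNS = {"accession_number", "text"}


def missing_columns(csv_file) -> set:
    """Return the required column names that are absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    header = set(reader.fieldnames or [])  # fieldnames is None for an empty file
    return REQUIRED_COLUMNS - header


if __name__ == "__main__":
    # In-memory sample; for a real file use open(path, newline="", encoding="utf-8").
    sample = io.StringIO("accession_number,text\n0001-24-000001,Some filing text\n")
    print(missing_columns(sample))
```

An empty result means the header is usable; any returned name is a column to add or rename before calling `build()`.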
Methods | ||
------- | ||
|
||
set_api_key(api_key) | ||
    Sets the Google Gemini API key used for authentication. | ||
|
||
set_paths(input_path, output_path, failed_path) | ||
    Sets the input CSV path, the output path, and the path of the failed-records log. | ||
|
||
set_base_prompt(prompt) | ||
    Sets the prompt template sent to the Gemini API. | ||
|
||
set_response_schema(schema) | ||
    Sets the expected JSON schema used for validation. | ||
|
||
set_model(model_name) | ||
    Sets the Gemini model (default: 'gemini-1.5-flash-8b'). | ||
|
||
set_rpm(rpm) | ||
    Sets the API rate limit in requests per minute (default: 1500). A free API key is limited to 15 rpm, so lower this value accordingly. | ||
|
||
set_save_frequency(frequency) | ||
    Sets the save interval in records (default: 100). | ||
|
||
build() | ||
    Processes the input CSV and generates the dataset. | ||
|
||
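
To make the meaning of an rpm limit concrete: a limit of `rpm` requests per minute implies a gap of `60 / rpm` seconds between requests, which is 4 seconds at 15 rpm and 0.04 seconds at 1500 rpm. The sketch below shows one simple way to pace calls under that assumption. `RequestPacer` is a hypothetical class written for this note; how txt2dataset throttles internally is not shown in this commit.

```python
import time


class RequestPacer:
    """Spaces calls so that at most `rpm` requests start per minute (illustrative)."""

    def __init__(self, rpm, clock=time.monotonic, sleep=time.sleep):
        if rpm <= 0:
            raise ValueError("rpm must be positive")
        self.interval = 60.0 / rpm  # seconds between consecutive requests
        self._clock = clock         # injectable so the pacer can be tested
        self._sleep = sleep
        self._next_allowed = None   # earliest time the next request may start

    def wait(self):
        """Block until the next request may be sent, then reserve the following slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.interval


if __name__ == "__main__":
    pacer = RequestPacer(rpm=15)
    print(pacer.interval)  # seconds to leave between requests on a free key
```

Calling `pacer.wait()` before each API request keeps a loop under the configured limit; the clock and sleep functions are parameters only so the behaviour can be checked without real delays.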
Usage | ||
----- | ||
|
||
.. code-block:: python | ||
|
||
    from txt2dataset import DatasetBuilder | ||
    import os | ||
    builder = DatasetBuilder() | ||
    # Set API key | ||
    builder.set_api_key(os.environ["GOOGLE_API_KEY"]) | ||
    # Set required configurations | ||
    builder.set_paths( | ||
        input_path="data/item502.csv", | ||
        output_path="data/bod.csv", | ||
        failed_path="data/failed_accessions.txt" | ||
    ) | ||
    builder.set_base_prompt("""Extract Director or Principal Officer info to JSON format. | ||
    Provide the following information: | ||
    - start_date (YYYYMMDD) | ||
    - end_date (YYYYMMDD) | ||
    - name (First Middle Last) | ||
    - title | ||
    Return null if info unavailable.""") | ||
    builder.set_response_schema({ | ||
        "type": "ARRAY", | ||
        "items": { | ||
            "type": "OBJECT", | ||
            "properties": { | ||
                "start_date": {"type": "STRING", "description": "Start date in YYYYMMDD format"}, | ||
                "end_date": {"type": "STRING", "description": "End date in YYYYMMDD format"}, | ||
                "name": {"type": "STRING", "description": "Full name (First Middle Last)"}, | ||
                "title": {"type": "STRING", "description": "Official title/position"} | ||
            }, | ||
            "required": ["start_date", "end_date", "name", "title"] | ||
        } | ||
    }) | ||
    # Optional configurations | ||
    builder.set_rpm(1500) | ||
    builder.set_save_frequency(100) | ||
    builder.set_model('gemini-1.5-flash-8b') | ||
    # Build the dataset | ||
    builder.build() | ||
|
||
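
The schema in the usage example asks for an array of objects with four required fields and dates in YYYYMMDD form. As a rough illustration of what checking a response against those expectations could look like, here is a standard-library sketch. The helpers `is_yyyymmdd` and `invalid_records` are hypothetical, and the choice to accept `null` dates follows the prompt text ("Return null if info unavailable") rather than anything confirmed about the library's own validation.

```python
import json
from datetime import datetime

# Field names copied from the "required" list in the usage example.
REQUIRED_FIELDS = ["start_date", "end_date", "name", "title"]


def is_yyyymmdd(value) -> bool:
    """True if value is a string holding a real calendar date as YYYYMMDD."""
    if not isinstance(value, str) or len(value) != 8 or not value.isdigit():
        return False
    try:
        datetime.strptime(value, "%Y%m%d")
    except ValueError:
        return False
    return True


def invalid_records(raw_json: str) -> list:
    """Return the records that lack a required field or carry a malformed date."""
    records = json.loads(raw_json)
    bad = []
    for record in records:
        has_all = all(field in record for field in REQUIRED_FIELDS)
        # A null date is treated as "unavailable", not as an error (assumption).
        dates_ok = has_all and all(
            record[key] is None or is_yyyymmdd(record[key])
            for key in ("start_date", "end_date")
        )
        if not dates_ok:
            bad.append(record)
    return bad
```

Records returned by `invalid_records` could be logged alongside the accession number for manual review instead of being written to the output file.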
API Key Setup | ||
------------- | ||
|
||
1. Get API Key: | ||
   Visit `Google AI Studio <https://aistudio.google.com/app/apikey>`_ to generate your API key. | ||
|
||
2. Set API Key as Environment Variable: | ||
|
||
   Windows (Command Prompt): | ||
   :: | ||
|
||
       setx GOOGLE_API_KEY your-api-key | ||
|
||
   Windows (PowerShell): | ||
   :: | ||
|
||
       [System.Environment]::SetEnvironmentVariable('GOOGLE_API_KEY', 'your-api-key', 'User') | ||
|
||
   macOS/Linux (bash): | ||
   :: | ||
|
||
       echo 'export GOOGLE_API_KEY="your-api-key"' >> ~/.bash_profile | ||
       source ~/.bash_profile | ||
|
||
   macOS (zsh): | ||
   :: | ||
|
||
       echo 'export GOOGLE_API_KEY="your-api-key"' >> ~/.zshrc | ||
       source ~/.zshrc | ||
|
||
   Note: Replace 'your-api-key' with your actual API key. On Windows, open a new terminal window afterwards so that the variable is visible. | ||
|
||
|
||
Alternative API Key Setup | ||
------------------------- | ||
|
||
You can also set the API key directly in your Python code, though this is not recommended for production: | ||
|
||
.. code-block:: python | ||
|
||
    api_key = "your-api-key"  # Replace with your actual API key | ||
    builder.set_api_key(api_key) |
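
A middle ground between the two setups is to keep the key in the environment but fail early with a readable message when it is missing, rather than letting `os.environ[...]` raise a bare `KeyError`. A minimal sketch; `read_api_key` is a hypothetical helper, not part of txt2dataset:

```python
import os


def read_api_key(env=os.environ, name="GOOGLE_API_KEY"):
    """Return the API key from the environment, failing with a clear message."""
    key = env.get(name, "").strip()  # tolerate stray whitespace from copy-paste
    if not key:
        raise RuntimeError(
            f"{name} is not set; export it before running the builder."
        )
    return key
```

With a `builder` created as in the Usage section, the call would then read `builder.set_api_key(read_api_key())`.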
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
======================================= | ||
Welcome to txt2dataset's documentation! | ||
======================================= | ||
|
||
A Python package to convert text into structured datasets. | ||
|
||
Navigation | ||
========== | ||
|
||
.. toctree:: | ||
   :maxdepth: 2 | ||
   :caption: Contents: | ||
|
||
   dataset_builder |
Binary file not shown.
Binary file not shown.