Commit: push
john-friedman committed Jan 9, 2025
1 parent cc39be3 commit 7624a51
Showing 8 changed files with 266 additions and 0 deletions.
63 changes: 63 additions & 0 deletions .github/workflows/deploy-docs.yml
@@ -0,0 +1,63 @@
name: Deploy Sphinx Documentation

on:
  push:
    branches:
      - main
  workflow_dispatch:

permissions:
  contents: write

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.x'

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install sphinx sphinx_rtd_theme myst-parser
          # Add any additional dependencies your docs need
          # pip install -r txt2dataset/docs/requirements.txt

      - name: Clean and Build Documentation
        run: |
          cd txt2dataset/docs
          # Remove any stale build artifacts
          rm -rf build/
          rm -rf source/_build/
          rm -rf _build/
          git rm -rf --cached build/ || true
          git rm -rf --cached _build/ || true
          make clean
          make html
          ls -la
          ls -la build/ || true

      - name: Check Build Directory
        run: |
          pwd
          ls -la txt2dataset/docs/build/html || true

      - name: Deploy to GitHub Pages
        uses: peaceiris/actions-gh-pages@v3
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: ./txt2dataset/docs/build/html
          force_orphan: true  # Start from a fresh history on each deploy
          enable_jekyll: false
          user_name: 'github-actions[bot]'
          user_email: 'github-actions[bot]@users.noreply.github.com'
          commit_message: 'Deploy Sphinx documentation [skip ci]'
          full_commit_message: |
            Deploy Sphinx documentation
            Build from ${{ github.sha }}
            Triggered by ${{ github.event_name }}
1 change: 1 addition & 0 deletions .gitignore
@@ -0,0 +1 @@
*.egg-info
23 changes: 23 additions & 0 deletions txt2dataset/docs/Makefile
@@ -0,0 +1,23 @@
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS    ?=
SPHINXBUILD   ?= sphinx-build
SOURCEDIR     = source
BUILDDIR      = build

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile clean

# Add clean target
clean:
	rm -rf $(BUILDDIR)/*

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
35 changes: 35 additions & 0 deletions txt2dataset/docs/make.bat
@@ -0,0 +1,35 @@
@ECHO OFF

pushd %~dp0

REM Command file for Sphinx documentation

if "%SPHINXBUILD%" == "" (
	set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=source
set BUILDDIR=build

%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
	echo.
	echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
	echo.installed, then set the SPHINXBUILD environment variable to point
	echo.to the full path of the 'sphinx-build' executable. Alternatively you
	echo.may add the Sphinx directory to PATH.
	echo.
	echo.If you don't have Sphinx installed, grab it from
	echo.https://www.sphinx-doc.org/
	exit /b 1
)

if "%1" == "" goto help

%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
goto end

:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%

:end
popd
130 changes: 130 additions & 0 deletions txt2dataset/docs/source/dataset_builder.rst
@@ -0,0 +1,130 @@

Dataset Builder
===============

Transforms unstructured text data into structured datasets using the Gemini API. You can get a free API key from `Google AI Studio <https://aistudio.google.com/app/apikey>`_ with a limit of 15 requests per minute. For higher rate limits, you can then set up Google's 90-day $300 free-credit trial.

Requirements
------------

Input CSV must contain ``accession_number`` and ``text`` columns.
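As a sketch of what a conforming input file looks like, the following writes a minimal CSV with the two required columns. The filename, accession numbers, and text values here are invented placeholders, not real filings:

```python
import csv

# Rows with the two required columns; values are illustrative only.
rows = [
    {"accession_number": "0000000000-25-000001",
     "text": "On January 2, 2025, Jane Q. Smith was appointed Chief Financial Officer."},
    {"accession_number": "0000000000-25-000002",
     "text": "John Doe resigned from the board of directors effective March 1, 2025."},
]

with open("item502.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["accession_number", "text"])
    writer.writeheader()
    writer.writerows(rows)
```

Any extra columns are simply ignored by the builder; only the two named columns are required.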

Methods
-------

set_api_key(api_key)
    Sets the Google Gemini API key for authentication.

set_paths(input_path, output_path, failed_path)
    Sets the input CSV path, output path, and failed-records log path.

set_base_prompt(prompt)
    Sets the prompt template sent to the Gemini API.

set_response_schema(schema)
    Sets the expected JSON schema for validation.

set_model(model_name)
    Sets the Gemini model (default: ``'gemini-1.5-flash-8b'``).

set_rpm(rpm)
    Sets the API rate limit in requests per minute (default: 1500).

set_save_frequency(frequency)
    Sets the save interval in records (default: 100).

build()
    Processes the input CSV and generates the dataset.

Usage
-----

.. code-block:: python

    from txt2dataset import DatasetBuilder
    import os

    builder = DatasetBuilder()

    # Set API key
    builder.set_api_key(os.environ["GOOGLE_API_KEY"])

    # Set required configurations
    builder.set_paths(
        input_path="data/item502.csv",
        output_path="data/bod.csv",
        failed_path="data/failed_accessions.txt"
    )

    builder.set_base_prompt("""Extract Director or Principal Officer info to JSON format.
    Provide the following information:
    - start_date (YYYYMMDD)
    - end_date (YYYYMMDD)
    - name (First Middle Last)
    - title
    Return null if info unavailable.""")

    builder.set_response_schema({
        "type": "ARRAY",
        "items": {
            "type": "OBJECT",
            "properties": {
                "start_date": {"type": "STRING", "description": "Start date in YYYYMMDD format"},
                "end_date": {"type": "STRING", "description": "End date in YYYYMMDD format"},
                "name": {"type": "STRING", "description": "Full name (First Middle Last)"},
                "title": {"type": "STRING", "description": "Official title/position"}
            },
            "required": ["start_date", "end_date", "name", "title"]
        }
    })

    # Optional configurations
    builder.set_rpm(1500)
    builder.set_save_frequency(100)
    builder.set_model('gemini-1.5-flash-8b')

    # Build the dataset
    builder.build()

API Key Setup
-------------

1. Get API Key:
Visit `Google AI Studio <https://aistudio.google.com/app/apikey>`_ to generate your API key.

2. Set API Key as Environment Variable:

Windows (Command Prompt):
::

setx GOOGLE_API_KEY your-api-key

Windows (PowerShell):
::

[System.Environment]::SetEnvironmentVariable('GOOGLE_API_KEY', 'your-api-key', 'User')

macOS/Linux (bash):
::

echo 'export GOOGLE_API_KEY="your-api-key"' >> ~/.bash_profile
source ~/.bash_profile

macOS (zsh):
::

echo 'export GOOGLE_API_KEY="your-api-key"' >> ~/.zshrc
source ~/.zshrc

Note: Replace 'your-api-key' with your actual API key.
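After setting the variable, a quick sanity check (a sketch, not part of the package) confirms it is visible to Python in a fresh shell:

```python
import os

# Read the key from the environment; None means the variable is not set
# in this process (e.g. the terminal was not restarted after setx).
api_key = os.environ.get("GOOGLE_API_KEY")

if api_key is None:
    print("GOOGLE_API_KEY is not set - open a new terminal or re-run the export step.")
else:
    # Avoid printing the key itself; report only that it exists.
    print(f"GOOGLE_API_KEY found ({len(api_key)} characters).")
```

On Windows, ``setx`` affects only new terminals, so run the check in a freshly opened window.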


Alternative API Key Setup
-------------------------

You can also set the API key directly in your Python code, though this is not recommended for production:

.. code-block:: python

    api_key = "your-api-key"  # Replace with your actual API key
    builder.set_api_key(api_key)
14 changes: 14 additions & 0 deletions txt2dataset/docs/source/index.rst
@@ -0,0 +1,14 @@
=======================================
Welcome to txt2dataset's documentation!
=======================================

A Python package to convert text into structured datasets.

Navigation
==========

.. toctree::
:maxdepth: 2
:caption: Contents:

dataset_builder
2 binary files not shown.
