How to categorize data for indexing #2132

Open
mrisahoo1 opened this issue Nov 8, 2024 · 2 comments

@mrisahoo1

Please provide us with the following information:

This issue is for a: (mark with an x)

- [ ] bug report -> please search issues before submitting
- [ ] feature request
- [x] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

scripts/prepdocs.ps1 --category ExampleCategory

Any log messages given by the failure

Expected/desired behavior

OS and Version?

Windows 7, 8 or 10. Linux (which distribution). macOS (Yosemite? El Capitan? Sierra?)
Windows 11

azd version?

run azd version and copy paste here.
azd version 1.10.4

Versions

Mention any other details that might be useful

I want to know how to categorize the data if I have three types of documents.
I understand that adding the extra --category flag will enable me to do that.
Let's say the categories include HR and Finance.
How do I add them under the data folder? For example:
data/HR/file1.pdf
data/HR/file2.pdf
and
data/Finance/file1.pdf
data/Finance/file2.pdf

I will run
scripts/prepdocs.ps1 --category HR
It goes through the steps, ingests the data into the vector index in AI Search, and adds the category to the metadata fields.
But the script keeps running and enters the next folder on its own. Since it switches folders, does it automatically pick up the next category?

Or

Should I interrupt the script after all the HR documents are done, and then run
scripts/prepdocs.ps1 --category Finance
so that it starts the ingestion process again?

Please help me understand what the ideal approach is.


Thanks! We'll be in touch soon.

@pamelafox
Collaborator

cc @bnodir who has worked more with categories and may have advice.

@bnodir
Contributor

bnodir commented Nov 19, 2024

@mrisahoo1 - To categorize your documents for indexing, you can run the ingestion script for each category separately as you already figured out, or you can modify the .\scripts\prepdocs.ps1 script as follows for your needs:

./scripts/load_python_env.ps1

$venvPythonPath = "./.venv/scripts/python.exe"
if (Test-Path -Path "/usr") {
  # fallback to Linux venv path
  $venvPythonPath = "./.venv/bin/python"
}

Write-Host 'Running "prepdocs.py"'

$cwd = (Get-Location)
$dataDir = "$cwd/data"

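# Capture the script-level arguments before the loop; inside the
# ForEach-Object block, $args refers to the block's own arguments
# rather than the script's.
$scriptArgs = $args

# Iterate over directories in ./data
Get-ChildItem -Path $dataDir -Directory | ForEach-Object {
  $dir = $_.FullName
  $CONTENT_CATEGORY = $_.Name
  $dataArg = "`"$dir/*`""
  $additionalArgs = ""
  if ($scriptArgs) {
    $additionalArgs = "$scriptArgs"
  }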

  $argumentList = "./app/backend/prepdocs.py $dataArg --verbose --category $CONTENT_CATEGORY $additionalArgs"

  Write-Host ">>> !!! Processing category: $CONTENT_CATEGORY !!! <<<"
  Start-Process -FilePath $venvPythonPath -ArgumentList $argumentList -Wait -NoNewWindow
}

This script iterates over each subdirectory in the data directory and processes the files within it, passing the folder name as the category. Note that folder names (which in this case become category names) should not include spaces, as category names are not allowed to have any.
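
For anyone who would rather drive this from Python, here is a minimal sketch of the same per-folder loop. It is an illustration, not part of the repo: `build_prepdocs_commands` and `run_all` are hypothetical helper names, and the sketch assumes `prepdocs.py` accepts the same glob argument, `--verbose`, and `--category` flags used in the PowerShell script. It also rejects folder names containing whitespace up front, since those cannot be used as category names.

```python
from pathlib import Path
import subprocess


def build_prepdocs_commands(data_dir, python_exe="python", extra_args=()):
    """Return one prepdocs.py command (as an argument list) per subfolder.

    Hypothetical helper: each subfolder name of data_dir becomes the
    category, mirroring the PowerShell loop.
    """
    commands = []
    for folder in sorted(Path(data_dir).iterdir()):
        if not folder.is_dir():
            continue  # skip loose files sitting directly under data/
        category = folder.name
        if any(ch.isspace() for ch in category):
            raise ValueError(
                f"Category folder name contains whitespace: {category!r}"
            )
        commands.append([
            python_exe,
            "./app/backend/prepdocs.py",
            str(folder / "*"),  # passed as a literal glob, not shell-expanded
            "--verbose",
            "--category",
            category,
            *extra_args,
        ])
    return commands


def run_all(data_dir, python_exe="python", extra_args=()):
    """Run the commands one after another, stopping at the first failure."""
    for cmd in build_prepdocs_commands(data_dir, python_exe, extra_args):
        print("Processing category:", cmd[5])
        subprocess.run(cmd, check=True)
```

Calling `run_all("data")` from the repo root would then ingest `data/HR` with category `HR` and `data/Finance` with category `Finance`, one folder at a time. Building the commands separately from running them makes it easy to print and review them first.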
