
Feedback / Bug Reporting for Local Environment and Dependencies Issues in Synthetic Data Generator #15

Open
LumiWasTaken opened this issue Dec 16, 2024 · 13 comments

@LumiWasTaken

Description

Current Behavior

The package has several issues that make local development and usage with local LLMs challenging:

  1. Path Dependencies Issue
# Current workaround needed:
git clone https://github.com/argilla-io/synthetic-data-generator
ln -s synthetic-data-generator/src src

The demo requires the src directory to sit next to main.py, forcing users to create symbolic links.

  2. Local LLM Integration Failures
    When attempting to use local LLMs (e.g., Ollama API), the following issues occur:
# Local API configuration fails
BASE_URL = "http://127.0.0.1:11434/v1/"
MODEL = "custom-model"
  3. Hugging Face Dependencies
    Even when configured for local usage, the system attempts to contact Hugging Face:
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://api-inference.huggingface.co/status/gemma2-9b-q4_k_m:latest
  4. Workaround Required for Local Usage
    Currently requires manual modification of constants.py:
SFT_AVAILABLE = True              # re-enable the SFT flow for non-Qwen/Llama models
MAGPIE_PRE_QUERY_TEMPLATE = None  # drop the Magpie pre-query template

Error Messages

UserWarning: Since the `base_url=http://localhost:11434/v1/` is available and either one of `model_id` or `endpoint_name` is also provided, the `base_url` will either be ignored or overwritten with the one generated from either of those args, for serverless or dedicated inference endpoints, respectively.

Suggested Solutions

  1. Local-First Approach

    • Implement a true local-first option that doesn't require Hugging Face dependencies
    • Add configuration option to explicitly disable Hugging Face API calls
  2. Demo Restructuring

    • Provide a standalone demo file that doesn't depend on external src directory
    • Create separate demos for local LLM usage vs. Hugging Face integration
  3. Package Structure

    • Reorganize package structure to eliminate need for symbolic links
    • Make local development setup more straightforward
  4. Configuration Management

    • Add explicit configuration for local-only mode (a rough sketch follows after this list)
    • Remove or make optional the Hugging Face dependency checks
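
As a rough illustration of the local-only mode suggested above (LOCAL_ONLY and the helper below are hypothetical names for illustration, not anything that exists in the package today), the constants module could gate all Hugging Face calls behind a single switch:

import os

# Hypothetical sketch of a local-only switch; LOCAL_ONLY and
# hf_calls_allowed are illustrative names, not existing package config.
LOCAL_ONLY = os.getenv("LOCAL_ONLY", "false").lower() == "true"
BASE_URL = os.getenv("BASE_URL", "http://127.0.0.1:11434/v1/")
MODEL = os.getenv("MODEL", "custom-model")

def hf_calls_allowed() -> bool:
    """Return False when the generator should never contact Hugging Face."""
    return not LOCAL_ONLY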

Additional Context

Local LLM support should be a primary consideration for testing and development. The current implementation makes it difficult to use the package in environments without Hugging Face access or with local LLM setups.

@davidberenstein1957
Member

@LumiWasTaken thank you for the feedback. I will have a look now.

To get you started with some things already: the tool has been packaged up, so you can simply follow the development guidelines to install it, or you can use pip install synthetic-dataset-generator. I believe this should avoid the need for symbolic links, or am I missing something?

Also, we rely on the InferenceEndpointsLLM implementation and its tokenization to be able to work with the Magpie paper, which is why it is currently the default.
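
For reference, distilabel also has an OpenAI-compatible LLM client that can be pointed at a local server; a rough sketch below (double-check the exact class and parameter names against your distilabel version):

from distilabel.llms import OpenAILLM

# Rough sketch of pointing distilabel's OpenAI-compatible client at a local
# server; treat the class and parameter names as something to verify.
llm = OpenAILLM(
    model="custom-model",                   # model served locally, e.g. via Ollama
    base_url="http://127.0.0.1:11434/v1/",  # OpenAI-compatible endpoint
    api_key="not-needed-for-local",         # local servers typically ignore this
)
llm.load()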

@davidberenstein1957
Member

I believe something like this should work. I will add it to the example directory.

pip install synthetic-dataset-generator

import os

from synthetic_dataset_generator.app import demo

os.environ["BASE_URL"] = "https://api.openai.com/v1/"
os.environ["API_KEY"] = os.getenv("OPENAI_API_KEY")
os.environ["MODEL"] = "gpt-4o"

if __name__ == "__main__":
    demo.launch()

@LumiWasTaken
Author

To get you started with some things already: the tool has been packaged up, so you can simply follow the development guidelines to install it, or you can use pip install synthetic-dataset-generator. I believe this should avoid the need for symbolic links, or am I missing something?

Actually no, that is the main reason why I posted this issue.
When you install the pip package and run the demo file, it complains about the "src" module missing.
Normal people (like me) would go "oh, surely it's pip install src", but that didn't work, and when I looked at the code a bit deeper it ACTUALLY expected a FOLDER "src" to be present next to the main folder.
Try it yourself!
We have a dedicated post in our community for those workarounds, since we thought it would support local LLMs.
[image]

I just saw your addition, and the code you sent above did not work. In our case we'd like to use our own locally trained models.
We use local OpenAI-compatible APIs, which did not work, and from what we've tried so far it doesn't seem like the OpenAI example would work either.

As I've shown above, you need to manually edit constants.py in order to have it working for other models.

@LumiWasTaken
Author

I totally understand that it's based off that paper, but given that you had instructions on how to run it with custom URLs, we just thought it would work.
From what we've seen it looks very promising; we hope that in the future we can use it without having to rely on external services.

@davidberenstein1957
Member

Hi @LumiWasTaken, thanks for the additional feedback. Sorry for the misunderstanding, I thought you were not installing the package and that I had fixed all of the imports. My recent changes have fixed that:

  • fixed the src import within the main package
  • added additional comments about the Magpie pre-query template and pointed towards an implementation option
  • allowed the user to manually set MAGPIE_PRE_QUERY_TEMPLATE to enforce llama3 or qwen2 when working with fine-tunes of those models (a rough sketch of this override follows below)
  • added an examples directory to test deployments
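
A minimal sketch of how that manual override could look in a launch script, assuming MAGPIE_PRE_QUERY_TEMPLATE is read from the environment the same way BASE_URL and MODEL are (the model name below is just a placeholder for your fine-tune):

import os

# Set these before importing the app so the constants module picks them up
# (an assumption about import order). The model name is a placeholder.
os.environ["BASE_URL"] = "http://localhost:11434/v1/"
os.environ["MODEL"] = "my-llama3-finetune:latest"
os.environ["MAGPIE_PRE_QUERY_TEMPLATE"] = "llama3"  # or "qwen2" for Qwen fine-tunes

from synthetic_dataset_generator.app import demo

if __name__ == "__main__":
    demo.launch()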

Let me know what you think, then I will publish a new version.

@davidberenstein1957
Member

@LumiWasTaken also, what custom model fine-tunes are you currently using?

@LumiWasTaken
Author

LumiWasTaken commented Dec 17, 2024

Sure, let me check! I just came back home.

@LumiWasTaken also, what custom model fine-tunes are you currently using?

We are testing some privately within our community for potential use in future models. Nothing big or enterprise.
I personally use gemma2 at the moment, hoping to get some synthetic data out of it thanks to its creative vocabulary.

@LumiWasTaken
Author

LumiWasTaken commented Dec 17, 2024

Alright, so I did some testing with the newest version (dev install).

The setup worked better, BUT we still have odd behaviour:

Logs:

python main.py 
/home/lunix/git_stuff/argilla-synthetic-data-generator/synthetic-data-generator-repo/src/synthetic_dataset_generator/constants.py:78: UserWarning: ARGILLA_API_URL or ARGILLA_API_KEY is not set or is empty
  warnings.warn("ARGILLA_API_URL or ARGILLA_API_KEY is not set or is empty")
/home/lunix/git_stuff/argilla-synthetic-data-generator/venv/lib/python3.12/site-packages/gradio/oauth.py:163: UserWarning: Gradio does not support OAuth features outside of a Space environment. To help you debug your app locally, the login and logout buttons are mocked with your profile. To make it work, your machine must be logged in to Huggingface.
  warnings.warn(
* Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
/home/lunix/git_stuff/argilla-synthetic-data-generator/synthetic-data-generator-repo/src/synthetic_dataset_generator/__init__.py:28: UserWarning: Since the `base_url=https://api-inference.huggingface.co/v1/` is available and either one of `model_id` or `endpoint_name` is also provided, the `base_url` will either be ignored or overwritten with the one generated from either of those args, for serverless or dedicated inference endpoints, respectively.
  warnings.warn(  # type: ignore
Step 'None' hasn't received a pipeline, and it hasn't been created within a `Pipeline` context. Please, use `with Pipeline() as pipeline:` and create the step within the context.
/home/lunix/git_stuff/argilla-synthetic-data-generator/synthetic-data-generator-repo/src/synthetic_dataset_generator/__init__.py:28: UserWarning: Since the `base_url=https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3.1-8B-Instruct` is available and either one of `model_id` or `endpoint_name` is also provided, the `base_url` will either be ignored or overwritten with the one generated from either of those args, for serverless or dedicated inference endpoints, respectively.
  warnings.warn(  # type: ignore
⚠ Received no response using Inference Client (model: 'meta-llama/Meta-Llama-3.1-8B-Instruct'). Finish reason was: 400, message='Bad Request', url='https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3.1-8B-Instruct'
Traceback (most recent call last):
  File "/home/lunix/git_stuff/argilla-synthetic-data-generator/venv/lib/python3.12/site-packages/gradio/queueing.py", line 625, in process_events
    response = await route_utils.call_process_api(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lunix/git_stuff/argilla-synthetic-data-generator/venv/lib/python3.12/site-packages/gradio/route_utils.py", line 322, in call_process_api
    output = await app.get_blocks().process_api(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lunix/git_stuff/argilla-synthetic-data-generator/venv/lib/python3.12/site-packages/gradio/blocks.py", line 2047, in process_api
    result = await self.call_function(
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lunix/git_stuff/argilla-synthetic-data-generator/venv/lib/python3.12/site-packages/gradio/blocks.py", line 1594, in call_function
    prediction = await anyio.to_thread.run_sync(  # type: ignore
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lunix/git_stuff/argilla-synthetic-data-generator/venv/lib/python3.12/site-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lunix/git_stuff/argilla-synthetic-data-generator/venv/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 2505, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "/home/lunix/git_stuff/argilla-synthetic-data-generator/venv/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 1005, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lunix/git_stuff/argilla-synthetic-data-generator/venv/lib/python3.12/site-packages/gradio/utils.py", line 869, in wrapper
    response = f(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^
  File "/home/lunix/git_stuff/argilla-synthetic-data-generator/synthetic-data-generator-repo/src/synthetic_dataset_generator/apps/textcat.py", line 65, in generate_system_prompt
    data = json.loads(result)
           ^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not NoneType
Step 'None' hasn't received a pipeline, and it hasn't been created within a `Pipeline` context. Please, use `with Pipeline() as pipeline:` and create the step within the context.
Step 'None' hasn't received a pipeline, and it hasn't been created within a `Pipeline` context. Please, use `with Pipeline() as pipeline:` and create the step within the context.
⚠ Received no response using Inference Client (model: 'meta-llama/Meta-Llama-3.1-8B-Instruct'). Finish reason was: 400, message='Bad Request', url='https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3.1-8B-Instruct/v1/chat/completions'
⚠ Received no response using Inference Client (model: 'meta-llama/Meta-Llama-3.1-8B-Instruct'). Finish reason was: 400, message='Bad Request', url='https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3.1-8B-Instruct/v1/chat/completions'
⚠ Received no response using Inference Client (model: 'meta-llama/Meta-Llama-3.1-8B-Instruct'). Finish reason was: 400, message='Bad Request', url='https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3.1-8B-Instruct/v1/chat/completions'
⚠ Received no response using Inference Client (model: 'meta-llama/Meta-Llama-3.1-8B-Instruct'). Finish reason was: 400, message='Bad Request', url='https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3.1-8B-Instruct/v1/chat/completions'
⚠ Received no response using Inference Client (model: 'meta-llama/Meta-Llama-3.1-8B-Instruct'). Finish reason was: 400, message='Bad Request', url='https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3.1-8B-Instruct/v1/chat/completions'
⚠ Received no response using Inference Client (model: 'meta-llama/Meta-Llama-3.1-8B-Instruct'). Finish reason was: 400, message='Bad Request', url='https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3.1-8B-Instruct/v1/chat/completions'
⚠ Received no response using Inference Client (model: 'meta-llama/Meta-Llama-3.1-8B-Instruct'). Finish reason was: 400, message='Bad Request', url='https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3.1-8B-Instruct/v1/chat/completions'
⚠ Received no response using Inference Client (model: 'meta-llama/Meta-Llama-3.1-8B-Instruct'). Finish reason was: 400, message='Bad Request', url='https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3.1-8B-Instruct/v1/chat/completions'
⚠ Received no response using Inference Client (model: 'meta-llama/Meta-Llama-3.1-8B-Instruct'). Finish reason was: 400, message='Bad Request', url='https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3.1-8B-Instruct/v1/chat/completions'
⚠ Received no response using Inference Client (model: 'meta-llama/Meta-Llama-3.1-8B-Instruct'). Finish reason was: 400, message='Bad Request', url='https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3.1-8B-Instruct/v1/chat/completions'
⚠ Received no response using Inference Client (model: 'meta-llama/Meta-Llama-3.1-8B-Instruct'). Finish reason was: 400, message='Bad Request', url='https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3.1-8B-Instruct'
⚠ Received no response using Inference Client (model: 'meta-llama/Meta-Llama-3.1-8B-Instruct'). Finish reason was: 400, message='Bad Request', url='https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3.1-8B-Instruct'
⚠ Received no response using Inference Client (model: 'meta-llama/Meta-Llama-3.1-8B-Instruct'). Finish reason was: 400, message='Bad Request', url='https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3.1-8B-Instruct'
⚠ Received no response using Inference Client (model: 'meta-llama/Meta-Llama-3.1-8B-Instruct'). Finish reason was: 400, message='Bad Request', url='https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3.1-8B-Instruct'
⚠ Received no response using Inference Client (model: 'meta-llama/Meta-Llama-3.1-8B-Instruct'). Finish reason was: 400, message='Bad Request', url='https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3.1-8B-Instruct'
⚠ Received no response using Inference Client (model: 'meta-llama/Meta-Llama-3.1-8B-Instruct'). Finish reason was: 400, message='Bad Request', url='https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3.1-8B-Instruct'
⚠ Received no response using Inference Client (model: 'meta-llama/Meta-Llama-3.1-8B-Instruct'). Finish reason was: 400, message='Bad Request', url='https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3.1-8B-Instruct'
⚠ Received no response using Inference Client (model: 'meta-llama/Meta-Llama-3.1-8B-Instruct'). Finish reason was: 400, message='Bad Request', url='https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3.1-8B-Instruct'
⚠ Received no response using Inference Client (model: 'meta-llama/Meta-Llama-3.1-8B-Instruct'). Finish reason was: 400, message='Bad Request', url='https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3.1-8B-Instruct'
⚠ Received no response using Inference Client (model: 'meta-llama/Meta-Llama-3.1-8B-Instruct'). Finish reason was: 400, message='Bad Request', url='https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3.1-8B-Instruct'

This happens when we fill out "Labels" and "Dataset description"

The "main.py" looks like this:

from synthetic_dataset_generator.app import demo
import os

os.environ["BASE_URL"] = "http://localhost:11434/v1/" #OpenAi Compatible API
os.environ["MODEL"] = "gemma2-9b-q4_k_m:latest"


demo.launch()

BUT!
Setting this OUTSIDE of the main.py like this:

(venv) [lunix@nix argilla-synthetic-data-generator]$ export BASE_URL=http://localhost:11434/v1/
(venv) [lunix@nix argilla-synthetic-data-generator]$ export MODEL=gemma2-9b-q4_k_m:latest

Throws a different chain of errors when clicking "Create":

(venv) [lunix@nix argilla-synthetic-data-generator]$ python main.py 
/home/lunix/git_stuff/argilla-synthetic-data-generator/synthetic-data-generator-repo/src/synthetic_dataset_generator/constants.py:61: UserWarning: `SFT_AVAILABLE` is set to `False` because the model is not a Qwen or Llama model.
  warnings.warn(
/home/lunix/git_stuff/argilla-synthetic-data-generator/synthetic-data-generator-repo/src/synthetic_dataset_generator/constants.py:78: UserWarning: ARGILLA_API_URL or ARGILLA_API_KEY is not set or is empty
  warnings.warn("ARGILLA_API_URL or ARGILLA_API_KEY is not set or is empty")
/home/lunix/git_stuff/argilla-synthetic-data-generator/venv/lib/python3.12/site-packages/gradio/oauth.py:163: UserWarning: Gradio does not support OAuth features outside of a Space environment. To help you debug your app locally, the login and logout buttons are mocked with your profile. To make it work, your machine must be logged in to Huggingface.
  warnings.warn(
* Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
/home/lunix/git_stuff/argilla-synthetic-data-generator/synthetic-data-generator-repo/src/synthetic_dataset_generator/__init__.py:28: UserWarning: Since the `base_url=http://localhost:11434/v1/` is available and either one of `model_id` or `endpoint_name` is also provided, the `base_url` will either be ignored or overwritten with the one generated from either of those args, for serverless or dedicated inference endpoints, respectively.
  warnings.warn(  # type: ignore
Step 'None' hasn't received a pipeline, and it hasn't been created within a `Pipeline` context. Please, use `with Pipeline() as pipeline:` and create the step within the context.
Traceback (most recent call last):
  File "/home/lunix/git_stuff/argilla-synthetic-data-generator/venv/lib/python3.12/site-packages/huggingface_hub/utils/_http.py", line 406, in hf_raise_for_status
    response.raise_for_status()
  File "/home/lunix/git_stuff/argilla-synthetic-data-generator/venv/lib/python3.12/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://api-inference.huggingface.co/status/gemma2-9b-q4_k_m:latest

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/lunix/git_stuff/argilla-synthetic-data-generator/venv/lib/python3.12/site-packages/gradio/queueing.py", line 625, in process_events
    response = await route_utils.call_process_api(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lunix/git_stuff/argilla-synthetic-data-generator/venv/lib/python3.12/site-packages/gradio/route_utils.py", line 322, in call_process_api
    output = await app.get_blocks().process_api(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lunix/git_stuff/argilla-synthetic-data-generator/venv/lib/python3.12/site-packages/gradio/blocks.py", line 2047, in process_api
    result = await self.call_function(
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lunix/git_stuff/argilla-synthetic-data-generator/venv/lib/python3.12/site-packages/gradio/blocks.py", line 1594, in call_function
    prediction = await anyio.to_thread.run_sync(  # type: ignore
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lunix/git_stuff/argilla-synthetic-data-generator/venv/lib/python3.12/site-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lunix/git_stuff/argilla-synthetic-data-generator/venv/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 2505, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "/home/lunix/git_stuff/argilla-synthetic-data-generator/venv/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 1005, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lunix/git_stuff/argilla-synthetic-data-generator/venv/lib/python3.12/site-packages/gradio/utils.py", line 869, in wrapper
    response = f(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^
  File "/home/lunix/git_stuff/argilla-synthetic-data-generator/synthetic-data-generator-repo/src/synthetic_dataset_generator/apps/textcat.py", line 53, in generate_system_prompt
    generate_description = get_prompt_generator()
                           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lunix/git_stuff/argilla-synthetic-data-generator/synthetic-data-generator-repo/src/synthetic_dataset_generator/pipelines/textcat.py", line 79, in get_prompt_generator
    prompt_generator.load()
  File "/home/lunix/git_stuff/argilla-synthetic-data-generator/venv/lib/python3.12/site-packages/distilabel/steps/tasks/text_generation.py", line 219, in load
    super().load()
  File "/home/lunix/git_stuff/argilla-synthetic-data-generator/venv/lib/python3.12/site-packages/distilabel/steps/tasks/base.py", line 112, in load
    self.llm.load()
  File "/home/lunix/git_stuff/argilla-synthetic-data-generator/venv/lib/python3.12/site-packages/distilabel/llms/huggingface/inference_endpoints.py", line 255, in load
    status = client.get_model_status()
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lunix/git_stuff/argilla-synthetic-data-generator/venv/lib/python3.12/site-packages/huggingface_hub/inference/_client.py", line 3155, in get_model_status
    hf_raise_for_status(response)
  File "/home/lunix/git_stuff/argilla-synthetic-data-generator/venv/lib/python3.12/site-packages/huggingface_hub/utils/_http.py", line 477, in hf_raise_for_status
    raise _format(HfHubHTTPError, str(e), response) from e
huggingface_hub.errors.HfHubHTTPError: 404 Client Error: Not Found for url: https://api-inference.huggingface.co/status/gemma2-9b-q4_k_m:latest (Request ID: JJxi3SPaA1RwIQVWoSQks)

Model gemma2-9b-q4_k_m:latest does not exist
Step 'None' hasn't received a pipeline, and it hasn't been created within a `Pipeline` context. Please, use `with Pipeline() as pipeline:` and create the step within the context.
Traceback (most recent call last):
  File "/home/lunix/git_stuff/argilla-synthetic-data-generator/venv/lib/python3.12/site-packages/huggingface_hub/utils/_http.py", line 406, in hf_raise_for_status
    response.raise_for_status()
  File "/home/lunix/git_stuff/argilla-synthetic-data-generator/venv/lib/python3.12/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://api-inference.huggingface.co/status/gemma2-9b-q4_k_m:latest

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/lunix/git_stuff/argilla-synthetic-data-generator/venv/lib/python3.12/site-packages/gradio/queueing.py", line 625, in process_events
    response = await route_utils.call_process_api(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lunix/git_stuff/argilla-synthetic-data-generator/venv/lib/python3.12/site-packages/gradio/route_utils.py", line 322, in call_process_api
    output = await app.get_blocks().process_api(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lunix/git_stuff/argilla-synthetic-data-generator/venv/lib/python3.12/site-packages/gradio/blocks.py", line 2047, in process_api
    result = await self.call_function(
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lunix/git_stuff/argilla-synthetic-data-generator/venv/lib/python3.12/site-packages/gradio/blocks.py", line 1594, in call_function
    prediction = await anyio.to_thread.run_sync(  # type: ignore
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lunix/git_stuff/argilla-synthetic-data-generator/venv/lib/python3.12/site-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lunix/git_stuff/argilla-synthetic-data-generator/venv/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 2505, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "/home/lunix/git_stuff/argilla-synthetic-data-generator/venv/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 1005, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lunix/git_stuff/argilla-synthetic-data-generator/venv/lib/python3.12/site-packages/gradio/utils.py", line 869, in wrapper
    response = f(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^
  File "/home/lunix/git_stuff/argilla-synthetic-data-generator/synthetic-data-generator-repo/src/synthetic_dataset_generator/apps/textcat.py", line 74, in generate_sample_dataset
    dataframe = generate_dataset(
                ^^^^^^^^^^^^^^^^^
  File "/home/lunix/git_stuff/argilla-synthetic-data-generator/synthetic-data-generator-repo/src/synthetic_dataset_generator/apps/textcat.py", line 101, in generate_dataset
    textcat_generator = get_textcat_generator(
                        ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lunix/git_stuff/argilla-synthetic-data-generator/synthetic-data-generator-repo/src/synthetic_dataset_generator/pipelines/textcat.py", line 101, in get_textcat_generator
    textcat_generator.load()
  File "/home/lunix/git_stuff/argilla-synthetic-data-generator/venv/lib/python3.12/site-packages/distilabel/steps/tasks/improving_text_embeddings.py", line 131, in load
    super().load()
  File "/home/lunix/git_stuff/argilla-synthetic-data-generator/venv/lib/python3.12/site-packages/distilabel/steps/tasks/base.py", line 112, in load
    self.llm.load()
  File "/home/lunix/git_stuff/argilla-synthetic-data-generator/venv/lib/python3.12/site-packages/distilabel/llms/huggingface/inference_endpoints.py", line 255, in load
    status = client.get_model_status()
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lunix/git_stuff/argilla-synthetic-data-generator/venv/lib/python3.12/site-packages/huggingface_hub/inference/_client.py", line 3155, in get_model_status
    hf_raise_for_status(response)
  File "/home/lunix/git_stuff/argilla-synthetic-data-generator/venv/lib/python3.12/site-packages/huggingface_hub/utils/_http.py", line 477, in hf_raise_for_status
    raise _format(HfHubHTTPError, str(e), response) from e
huggingface_hub.errors.HfHubHTTPError: 404 Client Error: Not Found for url: https://api-inference.huggingface.co/status/gemma2-9b-q4_k_m:latest (Request ID: gG5Ro6le_e7t1Q6dEUNDG)

Model gemma2-9b-q4_k_m:latest does not exist

@davidberenstein1957
Member

@LumiWasTaken thanks for the code. I see what is happening and will fix it to work with Gemma today. FYI, this will not add SFT/chat data support because of the required update to distilabel, but I encourage you to open a PR there, people will be happy :)

@LumiWasTaken
Author

@LumiWasTaken thanks for the code. I see what is happening and will fix it to work with Gemma today. FYI, this will not add SFT/chat data support because of the required update to distilabel, but I encourage you to open a PR there, people will be happy :)

No rush, take your time and do as you please; I'll be napping now anyway. I just thought it could give you some insight before things get more complicated later on, hehe.

The issue is that we've just been exploring it, and I haven't fully understood the knowledge base yet as I'm running on 2h of sleep at the moment. I'm not really sure what distilabel's role is, etc.
I just saw from the errors that it's constantly falling back to the Hugging Face API, and we can't exactly figure out why it would need that, since technically it should all be offline, right?

@davidberenstein1957
Member

@LumiWasTaken yes, sloppy project set-up for defining the constants. We started from a place where this was just going to be a Space on Hugging Face, but decided it would be more valuable as a shareable tool, hence we packaged it up with some issues related to the switch. I know where the problems are, so a fix is easy :)

Sleep well.

@davidberenstein1957
Member

@LumiWasTaken I fixed some things w.r.t. the Ollama and OpenAI implementations, but will add a deeper Magpie integration for Ollama and llama.cpp. If possible, you would also be able to use vLLM with Magpie support.

@davidberenstein1957
Member

davidberenstein1957 commented Dec 19, 2024

@LumiWasTaken WIP, but this should soon work with things like llama.cpp, Ollama and perhaps some other APIs that serve specific models.

argilla-io/distilabel#1084
