Feedback / Bug Reporting for Local Environment and Dependencies Issues in Synthetic Data Generator #15
@LumiWasTaken thank you for the feedback. I will have a look now. To get you started with some things already: the tool has been packaged up, so you can simply follow the development guidelines to install it, or you can use `pip install synthetic-dataset-generator`. Also, we rely on the InferenceEndpointsLLM implementation and its tokenization to be able to work with the Magpie paper, which is why it is currently the default.

I believe something like this should work. I will add it to the example directory.

```python
import os

from synthetic_dataset_generator.app import demo

os.environ["BASE_URL"] = "https://api.openai.com/v1/"
os.environ["API_KEY"] = os.getenv("OPENAI_API_KEY")
os.environ["MODEL"] = "gpt-4o"

if __name__ == "__main__":
    demo.launch()
```
Actually no, that is the main reason why I posted this issue. I just saw your addition, and the code you sent above did not work. In our case we'd like to use our own locally trained models. As I've shown above, you need to manually edit `constants.py` in order to have it working with other models.

I totally understand that it's based on that paper, but given that you had instructions on how to run it with custom URLs, we just thought it would work.
Hi @LumiWasTaken thanks for the additional feedback. Sorry for the misunderstanding, I thought you were not installing the package and I had fixed all of the imports. My recent changes have fixed that.
Let me know what you think, then I will publish a new version.
@LumiWasTaken also, what custom model fine-tunes are you currently using?
Sure, let me check! I just came back home.
We are testing some privately within our community for potential use in future models. Nothing big or enterprise.
Alright, so I did some testing with the newest version (dev install). The setup worked better, BUT we still have odd behaviour.

Logs:

This happens when we fill out "Labels" and "Dataset description". The `main.py` looks like this:

```python
from synthetic_dataset_generator.app import demo
import os

os.environ["BASE_URL"] = "http://localhost:11434/v1/"  # OpenAI-compatible API
os.environ["MODEL"] = "gemma2-9b-q4_k_m:latest"

demo.launch()
```

BUT!

```shell
(venv) [lunix@nix argilla-synthetic-data-generator]$ export BASE_URL=http://localhost:11434/v1/
(venv) [lunix@nix argilla-synthetic-data-generator]$ export MODEL=gemma2-9b-q4_k_m:latest
```

This throws a different chain of errors when clicking "Create":
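One plausible explanation for why the shell-export setup and the `main.py` setup behave differently (an assumption about the package internals, not a confirmed diagnosis): if a `constants.py` snapshots `os.environ` at import time, then variables exported before launching Python are visible, while variables assigned in `main.py` after the import are not. A self-contained sketch of that pitfall, using a toy stand-in rather than the real package:

```python
import os

# Toy stand-in for a constants.py that snapshots the environment at import time.
def load_constants():
    return {"base_url": os.environ.get("BASE_URL", "https://huggingface.co")}

# Case 1: variable set BEFORE the snapshot (like `export BASE_URL=...` in the shell).
os.environ["BASE_URL"] = "http://localhost:11434/v1/"
constants = load_constants()
seen_before = constants["base_url"]

# Case 2: variable changed AFTER the snapshot (like assigning os.environ
# after `from synthetic_dataset_generator.app import demo`).
os.environ["BASE_URL"] = "http://changed-too-late:11434/v1/"
seen_after = constants["base_url"]  # still the value captured at snapshot time

print(seen_before, seen_after)
```

If the package works this way, setting the environment variables before the first import of the package should make both setups equivalent.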
@LumiWasTaken thanks for the code. I see what is happening and will fix it to work with Gemma today. FYI, this will not add SFT/chat data support because of the required update to distilabel, but I encourage you to open a PR there; people will be happy :)
No rush, take your time and do as you please; I'll be napping now anyway. I just thought it could give you some insight before things get more complicated later on, hehe. The issue is that we've just been exploring it, and I haven't fully understood the knowledge base yet, as I'm running on 2h of sleep at the moment. I'm not really sure what distilabel's role is, etc.
@LumiWasTaken yes, sloppy project set-up for defining the constants. We started from a place where this was just going to be a Space on Hugging Face but decided it would be more valuable as a sharable tool, hence we packaged it up with some issues related to the switch. I know where the problems are so a fix is easy :) Sleep well. |
@LumiWasTaken I fixed some things w.r.t. the Ollama and OpenAI implementations, but will add a more core Magpie integration for Ollama and llama.cpp. If possible, you would also be able to use vLLM with Magpie support.
@LumiWasTaken WIP but should soon work with things like llama-cpp, ollama and perhaps some other APIs that are serving specific models. |
Description
Current Behavior
The package has several issues that make local development and usage with local LLMs challenging:
```shell
# Current workaround needed:
git clone https://github.com/argilla-io/synthetic-data-generator
ln -s synthetic-data-generator/src src
```

The demo requires `src` to be in the same path as `main.py`, forcing users to create symbolic links.

When attempting to use local LLMs (e.g., the Ollama API), the following issues occur:

- Even when configured for local usage, the system attempts to contact Hugging Face.
- Currently requires manual modification of `constants.py`.

Error Messages
Suggested Solutions

- Local-First Approach
- Demo Restructuring: remove the dependency on a local `src` directory
- Package Structure
- Configuration Management
Additional Context
Local LLM support should be a primary consideration for testing and development. The current implementation makes it difficult to use the package in environments without Hugging Face access or with local LLM setups.