Embeddings VIA API? #51

Open

drewskidang opened this issue Apr 6, 2025 · 12 comments

Comments

@drewskidang

I don't have the hardware to run the embeddings locally. Is there a way to configure it to use an API?

@PatrickMer

I tried defining the embedding model as part of the model list, but it still ran locally. I'm also looking for an API solution for embeddings.

```yaml
model_list:
  - model_name: deepseek/deepseek-chat-v3-0324:free
    provider: null
    api_key: $OPENROUTER_TOKEN
    base_url: "https://openrouter.ai/api/v1"
    max_concurrent_requests: 4
  - model_name: intfloat/multilingual-e5-large-instruct
    provider: hf-inference
    api_key: $HF_TOKEN
    base_url: null
    max_concurrent_requests: 4

model_roles:
  ingestion:
    - deepseek/deepseek-chat-v3-0324:free
  summarization:
    - deepseek/deepseek-chat-v3-0324:free
  chunking:
    - intfloat/multilingual-e5-large-instruct # your sentence-level chunking model
  single_shot_question_generation:
    - deepseek/deepseek-chat-v3-0324:free
  multi_hop_question_generation:
    - deepseek/deepseek-chat-v3-0324:free
```

@drewskidang
Author

@PatrickMer you need to change chunking.py; the way it runs now is set up for local inference, not an API.
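
For reference, a minimal sketch of that kind of change, assuming chunking.py currently embeds sentences with a locally loaded model; the function name and model id below are illustrative, not the repo's actual symbols:

```python
# Hypothetical sketch: embed sentences via the HF Inference API instead of a
# locally loaded model. Function name and model id are illustrative only.
import os
from huggingface_hub import InferenceClient

client = InferenceClient(
    model="intfloat/multilingual-e5-large-instruct",  # assumed embedding model
    token=os.environ["HF_TOKEN"],
)

def embed_sentences(sentences: list[str]) -> list[list[float]]:
    # One request per sentence keeps local memory flat. Depending on how the
    # model is served, feature_extraction may return token-level vectors that
    # still need pooling before use.
    return [client.feature_extraction(s).tolist() for s in sentences]
```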

@drewskidang
Author

@PatrickMer did you get the repo running??

@PatrickMer

PatrickMer commented Apr 6, 2025

Yeah, I got the repo running on the example locally and then with OpenRouter, but when trying a larger dataset (25 files, 75 MB), the embedding step uses all of my RAM (~14 GB) and isn't feasible.

What changes did you make to chunking.py?

@drewskidang
Author

I didn't make any changes lol. It's taking way too long to chunk; it's stuck at the chunking stage.

@PatrickMer

@drewskidang I tried changing chunking.py to use CUDA instead of CPU for the semantic model. It worked, and chunking started to run without using too much RAM, but I still ran out of RAM and got a memory error about 5 minutes in.

Do you think using a smaller model could work? Or is the memory issue coming from loading the data into memory?
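
For what it's worth, a minimal sketch of the GPU-plus-batching approach described above, assuming chunking.py loads the embedding model with sentence-transformers (the actual repo code may differ):

```python
# Hypothetical sketch: load the chunking model on the GPU and encode in small
# batches so peak RAM stays bounded. Assumes sentence-transformers is used.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "intfloat/multilingual-e5-large-instruct",  # assumed model
    device="cuda",
)

def embed_in_batches(sentences, batch_size=32):
    # encode() batches internally; a small batch_size keeps VRAM and the
    # intermediate buffers modest.
    return model.encode(sentences, batch_size=batch_size, show_progress_bar=True)
```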

@sumukshashidhar
Member

@PatrickMer @drewskidang have you tried using the fast chunking mode? the quality difference for most tasks should be negligible
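
For anyone looking for where that lives in the config, a hypothetical snippet is below; the exact key names are assumptions and may differ between versions, so check the repo's example configs:

```yaml
# Hypothetical: exact keys may differ between versions of the repo.
pipeline:
  chunking:
    run: true
    chunking_configuration:
      chunking_mode: fast_chunking  # instead of semantic_chunking
```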

@PatrickMer

PatrickMer commented Apr 6, 2025

> Yeah, I got the repo running on the example locally and then with OpenRouter, but when trying a larger dataset (25 files, 75 MB), the embedding step uses all of my RAM (~14 GB) and isn't feasible.

@sumukshashidhar This was using fast chunking mode; I only tried semantic chunking thinking that using VRAM for the model might reduce the program's RAM usage. Have you tried running larger datasets, and if so, on what hardware?

@drewskidang
Author

@PatrickMer I redid the entire chunking to use LlamaIndex for data loading lol, and used the Together API for embeddings.
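
A rough sketch of what that kind of setup could look like, assuming LlamaIndex's SimpleDirectoryReader for loading and the Together embeddings endpoint (not the actual code referenced in the thread; the model id and data path are placeholders):

```python
# Hypothetical sketch: load documents with LlamaIndex, chunk them, and embed
# the chunks via the Together API instead of a local model.
import os
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from together import Together

# Load and chunk the documents.
docs = SimpleDirectoryReader("data/").load_data()
nodes = SentenceSplitter(chunk_size=512, chunk_overlap=64).get_nodes_from_documents(docs)
texts = [n.get_content() for n in nodes]

# Embed remotely; for very large corpora, send the texts in batches.
client = Together(api_key=os.environ["TOGETHER_API_KEY"])
resp = client.embeddings.create(
    model="togethercomputer/m2-bert-80M-8k-retrieval",  # assumed embedding model
    input=texts,
)
embeddings = [item.embedding for item in resp.data]
```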

@PatrickMer

@drewskidang haha, did it work? Could you share that?

@drewskidang
Author

@PatrickMer will do tomorrow lol, remind me if I don't.

@sumukshashidhar
Member

@PatrickMer I did try it with large datasets, but on 8xH100 machines 😅, which was an oversight. I'll investigate this part further. In the meantime, @drewskidang, it would be great if you could share the embedding API implementation / make a PR 😄
