Add pip install datasets to RAG example #371

sanjeed5 · 2025-01-27T12:04:17Z

`datasets` is used later in the RAG example, and if it is not installed the import throws an error in the code below:

```python
import datasets
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.retrievers import BM25Retriever

knowledge_base = datasets.load_dataset("m-ric/huggingface_doc", split="train")
knowledge_base = knowledge_base.filter(lambda row: row["source"].startswith("huggingface/transformers"))

source_docs = [
    Document(page_content=doc["text"], metadata={"source": doc["source"].split("/")[1]})
    for doc in knowledge_base
]

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    add_start_index=True,
    strip_whitespace=True,
    separators=["\n\n", "\n", ".", " ", ""],
)
docs_processed = text_splitter.split_documents(source_docs)
```
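
To see ahead of time which of these imports would fail, a small check like the one below might help. It is only a sketch: the `REQUIRED` mapping and the `missing_packages` helper are hypothetical names introduced here, and the pip names are assumed from the imports in the snippet, not taken from the example's actual install line.

```python
import importlib.util

# Import name -> pip distribution name.
# Assumption: this mapping is inferred from the imports in the snippet above.
REQUIRED = {
    "datasets": "datasets",
    "langchain": "langchain",
    "langchain_community": "langchain-community",
}


def missing_packages(required):
    """Return the pip names of packages whose top-level module cannot be found."""
    return [
        pip_name
        for module, pip_name in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages, install with: pip install " + " ".join(missing))
    else:
        print("All required packages are importable.")
```

Running this before the example reports every missing package at once, so you don't hit one `ModuleNotFoundError` at a time.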


albertvillanova

Thanks!

HuggingFaceDocBuilderDev · 2025-01-27T15:26:51Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

albertvillanova approved these changes Jan 27, 2025


albertvillanova changed the title ~~add pip install datasets~~ Add pip install datasets to RAG example Jan 27, 2025

albertvillanova merged commit 5edf940 into huggingface:main Jan 27, 2025
4 checks passed

sanjeed5 deleted the patch-2 branch February 20, 2025 13:48
