
Commit b3aa03e

Merge pull request #884 from ScrapeGraphAI/829-languagecountry-selection
Merge pull request #883 from ScrapeGraphAI/main
2 parents: 3108793 + 37c07c8 · commit b3aa03e

145 files changed (+1006, -706 lines)


.github/FUNDING.yml (+1, -1)

@@ -12,4 +12,4 @@ lfx_crowdfunding: # Replace with a single LFX Crowdfunding project-name e.g., cl
 polar: # Replace with a single Polar username
 buy_me_a_coffee: # Replace with a single Buy Me a Coffee username
 thanks_dev: # Replace with a single thanks.dev username
-custom:
+custom:

.github/ISSUE_TEMPLATE/custom.md (-2)

@@ -6,5 +6,3 @@ labels: ''
 assignees: ''

 ---
-
-

.github/workflows/release.yml (+13, -13)

@@ -19,21 +19,21 @@ jobs:
 uses: actions/setup-python@v5
 with:
 python-version: '3.10'
-
+
 - name: Install uv
 uses: astral-sh/setup-uv@v3
-
+
 - name: Install Node Env
 uses: actions/setup-node@v4
 with:
 node-version: 20
-
+
 - name: Checkout
 uses: actions/[email protected]
 with:
 fetch-depth: 0
 persist-credentials: false
-
+
 - name: Build and validate package
 run: |
 uv venv
@@ -44,10 +44,10 @@ jobs:
 uv build
 uv pip install --upgrade pkginfo==1.12.0 twine==6.0.1 # Upgrade pkginfo and install twine
 python -m twine check dist/*
-
+
 - name: Debug Dist Directory
 run: ls -al dist
-
+
 - name: Cache build
 uses: actions/cache@v3
 with:
@@ -59,7 +59,7 @@ jobs:
 runs-on: ubuntu-latest
 needs: build
 environment: development
-if: >
+if: >
 github.event_name == 'push' && (github.ref == 'refs/heads/main' || github.ref == 'refs/heads/pre/beta') ||
 (github.event_name == 'pull_request' && github.event.action == 'closed' && github.event.pull_request.merged &&
 (github.event.pull_request.base.ref == 'main' || github.event.pull_request.base.ref == 'pre/beta'))
@@ -74,23 +74,23 @@ jobs:
 with:
 fetch-depth: 0
 persist-credentials: false
-
+
 - name: Restore build artifacts
 uses: actions/cache@v3
 with:
 path: ./dist
 key: ${{ runner.os }}-build-${{ github.sha }}
-
+
 - name: Semantic Release
 uses: cycjimmy/[email protected]
 with:
 semantic_version: 23
 extra_plugins: |
 semantic-release-pypi@3
-@semantic-release/git
-@semantic-release/commit-analyzer@12
-@semantic-release/release-notes-generator@13
-@semantic-release/github@10
+@semantic-release/git
+@semantic-release/commit-analyzer@12
+@semantic-release/release-notes-generator@13
+@semantic-release/github@10
 @semantic-release/changelog@6
 conventional-changelog-conventionalcommits@7
 env:

.readthedocs.yaml (+2, -1)

@@ -1,3 +1,4 @@
+
 # Read the Docs configuration file for Sphinx projects
 # See https://docs.readthedocs.io/en/stable/config-file/v2.html for details

@@ -32,4 +33,4 @@ sphinx:
 # See https://docs.readthedocs.io/en/stable/guides/reproducible-builds.html
 # python:
 # install:
-# - requirements: docs/requirements.txt
+# - requirements: docs/requirements.txt

.releaserc.yml (-1)

@@ -53,4 +53,3 @@ branches:
 channel: "dev"
 prerelease: "beta"
 debug: true
-

Dockerfile (+1, -1)

@@ -6,4 +6,4 @@ RUN pip install --no-cache-dir scrapegraphai
 RUN pip install --no-cache-dir scrapegraphai[burr]

 RUN python3 -m playwright install-deps
-RUN python3 -m playwright install
+RUN python3 -m playwright install

LICENSE (+1, -1)

@@ -4,4 +4,4 @@ Permission is hereby granted, free of charge, to any person obtaining a copy of

 The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

-THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

README.md (+1, -1)

@@ -182,7 +182,7 @@ The Official API Documentation can be found [here](https://docs.scrapegraphai.co
 </a>
 </div>

-## 📈 Telemetry
+## 📈 Telemetry
 We collect anonymous usage metrics to enhance our package's quality and user experience. The data helps us prioritize improvements and ensure compatibility. If you wish to opt-out, set the environment variable SCRAPEGRAPHAI_TELEMETRY_ENABLED=false. For more information, please refer to the documentation [here](https://scrapegraph-ai.readthedocs.io/en/latest/scrapers/telemetry.html).
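For reference, a minimal sketch of the opt-out described in the README excerpt above. Setting SCRAPEGRAPHAI_TELEMETRY_ENABLED=false in the shell is the documented route; the in-process variant below assumes the flag is read when the library is imported.

```python
import os

# Opt out of anonymous telemetry before importing the library
# (assumption: the flag is checked at import time; exporting it in the
# shell before launching Python works in any case).
os.environ["SCRAPEGRAPHAI_TELEMETRY_ENABLED"] = "false"

import scrapegraphai  # noqa: E402  (import placed after setting the flag)
```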

SECURITY.md (-1)

@@ -3,4 +3,3 @@
 ## Reporting a Vulnerability

 For reporting a vulnerability contact directly [email protected]
-

docs/README.md (+1, -1)

@@ -55,7 +55,7 @@ markmap:
 - Use Selenium or Playwright to take screenshots
 - Use LLM to asses if it is a block-like page, paragraph-like page, etc.
 - [Issue #88](https://github.com/VinciGit00/Scrapegraph-ai/issues/88)
-
+
 ## **Long-Term Goals**

 - Automatic generation of scraping pipelines from a given prompt

docs/requirements-dev.txt (+7)

@@ -0,0 +1,7 @@
+sphinx>=7.1.2
+sphinx-rtd-theme>=1.3.0
+myst-parser>=2.0.0
+sphinx-copybutton>=0.5.2
+sphinx-design>=0.5.0
+sphinx-autodoc-typehints>=1.25.2
+sphinx-autoapi>=3.0.0

docs/russian.md (+1, -1)

@@ -228,4 +228,4 @@ ScrapeGraphAI лицензирован под MIT License. Подробнее с
 ## Благодарности

 - Мы хотели бы поблагодарить всех участников проекта и сообщество с открытым исходным кодом за их поддержку.
-- ScrapeGraphAI предназначен только для исследования данных и научных целей. Мы не несем ответственности за неправильное использование библиотеки.
+- ScrapeGraphAI предназначен только для исследования данных и научных целей. Мы не несем ответственности за неправильное использование библиотеки.

docs/source/conf.py (+9, -10)

@@ -12,31 +12,30 @@
 import sys

 # import all the modules
-sys.path.insert(0, os.path.abspath('../../'))
+sys.path.insert(0, os.path.abspath("../../"))

-project = 'ScrapeGraphAI'
-copyright = '2024, ScrapeGraphAI'
-author = 'Marco Vinciguerra, Marco Perini, Lorenzo Padoan'
+project = "ScrapeGraphAI"
+copyright = "2024, ScrapeGraphAI"
+author = "Marco Vinciguerra, Marco Perini, Lorenzo Padoan"

 html_last_updated_fmt = "%b %d, %Y"

 # -- General configuration ---------------------------------------------------
 # https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration

-extensions = ['sphinx.ext.autodoc', 'sphinx.ext.napoleon']
+extensions = ["sphinx.ext.autodoc", "sphinx.ext.napoleon"]

-templates_path = ['_templates']
+templates_path = ["_templates"]
 exclude_patterns = []

 # -- Options for HTML output -------------------------------------------------
 # https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output

-html_theme = 'furo'
+html_theme = "furo"
 html_theme_options = {
 "source_repository": "https://github.com/VinciGit00/Scrapegraph-ai/",
 "source_branch": "main",
 "source_directory": "docs/source/",
-'navigation_with_keys': True,
-'sidebar_hide_name': False,
+"navigation_with_keys": True,
+"sidebar_hide_name": False,
 }
-

docs/source/getting_started/examples.rst (+1, -1)

@@ -84,4 +84,4 @@ After that, you can run the following code, using only your machine resources br
 result = smart_scraper_graph.run()
 print(result)

-To find out how you can customize the `graph_config` dictionary, by using different LLM and adding new parameters, check the `Scrapers` section!
+To find out how you can customize the `graph_config` dictionary, by using different LLM and adding new parameters, check the `Scrapers` section!
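For context, a minimal sketch of the `graph_config` customization this docs page points to. The config keys mirror the snippets quoted elsewhere in this commit; the model identifier and URL are illustrative placeholders, not values from the diff.

```python
from scrapegraphai.graphs import SmartScraperGraph

# Illustrative configuration; swap in the LLM and parameters you need.
graph_config = {
    "llm": {
        "model": "ollama/llama3",  # placeholder model identifier
        "temperature": 0,
    },
    "verbose": True,  # print detailed logs while the pipeline runs
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List the projects on this page with their descriptions",
    source="https://example.com/projects",  # placeholder URL
    config=graph_config,
)

result = smart_scraper_graph.run()
print(result)
```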

docs/source/getting_started/installation.rst (+2, -4)

@@ -22,7 +22,7 @@ The library is available on PyPI, so it can be installed using the following com
 pip install scrapegraphai

 .. important::
-
+
 It is higly recommended to install the library in a virtual environment (conda, venv, etc.)

 If your clone the repository, it is recommended to use a package manager like `uv <https://github.com/astral-sh/uv>`_.
@@ -35,7 +35,7 @@ To install the library using uv, you can run the following command:
 uv build

 .. caution::
-
+
 **Rye** must be installed first by following the instructions on the `official website <https://github.com/astral-sh/uv>`_.

 Additionally on Windows when using WSL
@@ -46,5 +46,3 @@ If you are using Windows Subsystem for Linux (WSL) and you are facing issues wit
 .. code-block:: bash

 sudo apt-get -y install libnss3 libnspr4 libgbm1 libasound2
-
-

docs/source/index.rst (+1, -1)

@@ -43,4 +43,4 @@ Indices and tables

 * :ref:`genindex`
 * :ref:`modindex`
-* :ref:`search`
+* :ref:`search`

docs/source/introduction/overview.rst (+10, -28)

@@ -3,46 +3,23 @@
 :width: 50%
 :alt: ScrapegraphAI

-Overview
+Overview
 ========

 ScrapeGraphAI is an **open-source** Python library designed to revolutionize **scraping** tools.
-In today's data-intensive digital landscape, this library stands out by integrating **Large Language Models** (LLMs)
+In today's data-intensive digital landscape, this library stands out by integrating **Large Language Models** (LLMs)
 and modular **graph-based** pipelines to automate the scraping of data from various sources (e.g., websites, local files etc.).

 Simply specify the information you need to extract, and ScrapeGraphAI handles the rest, providing a more **flexible** and **low-maintenance** solution compared to traditional scraping tools.

 For comprehensive documentation and updates, visit our `website <https://scrapegraphai.com>`_.

-Key Features
------------
-
-* **Just One Prompt Away**: Transform any website into clean, organized data for AI agents and Data Analytics
-* **Save Time**: No more writing complex code or dealing with manual extraction
-* **Save Money**: High-quality data extraction at a fraction of the cost of traditional scraping services
-* **AI Powered**: State-of-the-art AI technologies for fast, accurate, and dependable results
-
-Community Impact
---------------
-
-Our open-source technology is continuously enhanced by a global community of developers:
-
-* **+17K** stars on Github
-* **7,000,000+** extracted webpages
-* **250k+** unique users
-
-Services
---------
-
-* **Markdownify**: Convert webpage to markdown format (2 credits/page)
-* **Smart Scraper**: Structured AI web scraping given a URL (5 credits/page)
-* **Local Scraper**: Structured AI scraping given your local HTML (10 credits/page)

 Why ScrapegraphAI?
 ==================

 Traditional web scraping tools often rely on fixed patterns or manual configuration to extract data from web pages.
-ScrapegraphAI, leveraging the power of LLMs, adapts to changes in website structures, reducing the need for constant developer intervention.
+ScrapegraphAI, leveraging the power of LLMs, adapts to changes in website structures, reducing the need for constant developer intervention.
 This flexibility ensures that scrapers remain functional even when website layouts change.

 We support many LLMs including **GPT, Gemini, Groq, Azure, Hugging Face** etc.
@@ -187,13 +164,13 @@ FAQ
 - Check your internet connection. Low speed or unstable connection can cause the HTML to not load properly.

 - Try using a proxy server to mask your IP address. Check out the :ref:`Proxy` section for more information on how to configure proxy settings.
-
+
 - Use a different LLM model. Some models might perform better on certain websites than others.

 - Set the `verbose` parameter to `True` in the graph_config to see more detailed logs.

 - Visualize the pipeline graphically using :ref:`Burr`.
-
+
 If the issue persists, please report it on the GitHub repository.

 6. **How does ScrapeGraphAI handle the context window limit of LLMs?**
@@ -226,3 +203,8 @@ Sponsors
 :width: 11%
 :alt: Scrapedo
 :target: https://scrape.do
+
+.. image:: ../../assets/scrapegraph_logo.png
+:width: 11%
+:alt: ScrapegraphAI
+:target: https://scrapegraphai.com

docs/source/modules/modules.rst (-1)

@@ -7,4 +7,3 @@ scrapegraphai
 scrapegraphai

 scrapegraphai.helpers.models_tokens
-

docs/source/modules/scrapegraphai.helpers.models_tokens.rst (+1, -1)

@@ -25,4 +25,4 @@ Example usage:
 else:
 print(f"{model_name} not found in the models list")

-This information is crucial for users to understand the capabilities and limitations of different AI models when designing their scraping pipelines.
+This information is crucial for users to understand the capabilities and limitations of different AI models when designing their scraping pipelines.
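A short, hedged sketch of the lookup this docs page describes; it assumes `models_tokens` is the nested provider-to-model dictionary that the quoted example usage iterates over.

```python
from scrapegraphai.helpers.models_tokens import models_tokens

model_name = "gpt-4o-mini"  # illustrative model name

# Search each provider's table for the requested model.
for provider, models in models_tokens.items():
    if model_name in models:
        print(f"{model_name} ({provider}): {models[model_name]} max tokens")
        break
else:
    print(f"{model_name} not found in the models list")
```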

docs/source/scrapers/llm.rst (+5, -6)

@@ -133,11 +133,11 @@ We can also pass a model instance for the chat model and the embedding model. Fo
 openai_api_version="AZURE_OPENAI_API_VERSION",
 )
 # Supposing model_tokens are 100K
-model_tokens_count = 100000
+model_tokens_count = 100000
 graph_config = {
 "llm": {
 "model_instance": llm_model_instance,
-"model_tokens": model_tokens_count,
+"model_tokens": model_tokens_count,
 },
 "embeddings": {
 "model_instance": embedder_model_instance
@@ -198,7 +198,7 @@ We can also pass a model instance for the chat model and the embedding model. Fo
 Other LLM models
 ^^^^^^^^^^^^^^^^

-We can also pass a model instance for the chat model and the embedding model through the **model_instance** parameter.
+We can also pass a model instance for the chat model and the embedding model through the **model_instance** parameter.
 This feature enables you to utilize a Langchain model instance.
 You will discover the model you require within the provided list:

@@ -208,7 +208,7 @@ You will discover the model you require within the provided list:
 For instance, consider **chat model** Moonshot. We can integrate it in the following manner:

 .. code-block:: python
-
+
 from langchain_community.chat_models.moonshot import MoonshotChat

 # The configuration parameters are contingent upon the specific model you select
@@ -221,8 +221,7 @@ For instance, consider **chat model** Moonshot. We can integrate it in the follo
 llm_model_instance = MoonshotChat(**llm_instance_config)
 graph_config = {
 "llm": {
-"model_instance": llm_model_instance,
+"model_instance": llm_model_instance,
 "model_tokens": 5000
 },
 }
-
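Pulling the Moonshot fragments above together, a hedged end-to-end sketch. The API key and source URL are placeholders, and the exact `MoonshotChat` parameters depend on your langchain-community version.

```python
from langchain_community.chat_models.moonshot import MoonshotChat
from scrapegraphai.graphs import SmartScraperGraph

# The configuration parameters are contingent upon the specific model you select.
llm_instance_config = {
    "model": "moonshot-v1-8k",
    "base_url": "https://api.moonshot.cn/v1",
    "moonshot_api_key": "YOUR_MOONSHOT_API_KEY",  # placeholder
}

llm_model_instance = MoonshotChat(**llm_instance_config)

graph_config = {
    "llm": {
        "model_instance": llm_model_instance,
        "model_tokens": 5000,
    },
}

graph = SmartScraperGraph(
    prompt="Extract the page title",
    source="https://example.com",  # placeholder URL
    config=graph_config,
)
print(graph.run())
```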

examples/ScrapegraphAI_cookbook.ipynb (+1, -1)

@@ -912,4 +912,4 @@
 },
 "nbformat": 4,
 "nbformat_minor": 0
-}
+}

examples/code_generator_graph/.env.example (+1, -1)

@@ -11,4 +11,4 @@ DEFAULT_LANGUAGE=python
 GENERATE_TESTS=true
 ADD_DOCUMENTATION=true
 CODE_STYLE=pep8
-TYPE_CHECKING=true
+TYPE_CHECKING=true

examples/code_generator_graph/README.md (+1, -1)

@@ -27,4 +27,4 @@ code = graph.generate("code specification")
 ## Environment Variables

 Required environment variables:
-- `OPENAI_API_KEY`: Your OpenAI API key
+- `OPENAI_API_KEY`: Your OpenAI API key
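As a usage note for the variable listed above, a brief sketch of loading it from a local `.env` file; it assumes python-dotenv is installed, which is not stated anywhere in this commit.

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv (assumed)

load_dotenv()  # reads a local .env file such as the .env.example above

openai_api_key = os.getenv("OPENAI_API_KEY")
if not openai_api_key:
    raise RuntimeError("OPENAI_API_KEY is not set")
```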
