
Commit 612722a

Merge pull request #364 from VikParuchuri/dev
Surya OCR 3
2 parents b68afd0 + 407a01e commit 612722a


80 files changed: +6221 −3005 lines changed

.github/workflows/benchmarks.yml

Lines changed: 0 additions & 4 deletions
@@ -25,10 +25,6 @@ jobs:
         run: |
           poetry run python benchmark/detection.py --max_rows 2
           poetry run python benchmark/utils/verify_benchmark_scores.py results/benchmark/det_bench/results.json --bench_type detection
-      - name: Run inline detection benchmark
-        run: |
-          poetry run python benchmark/inline_detection.py --max_rows 5
-          poetry run python benchmark/utils/verify_benchmark_scores.py results/benchmark/inline_math_bench/results.json --bench_type inline_detection
       - name: Run recognition benchmark test
         run: |
           poetry run python benchmark/recognition.py --max_rows 2

.pre-commit-config.yaml

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+repos:
+  - repo: https://github.com/astral-sh/ruff-pre-commit
+    # Ruff version.
+    rev: v0.9.10
+    hooks:
+      # Run the linter.
+      - id: ruff
+        types_or: [ python, pyi ]
+        args: [ --fix ]
+      # Run the formatter.
+      - id: ruff-format
+        types_or: [ python, pyi ]

README.md

Lines changed: 35 additions & 31 deletions
@@ -96,11 +96,11 @@ surya_ocr DATA_PATH
 ```
 
 - `DATA_PATH` can be an image, pdf, or folder of images/pdfs
-- `--langs` is an optional (but recommended) argument that specifies the language(s) to use for OCR. You can comma separate multiple languages. Use the language name or two-letter ISO code from [here](https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes). Surya supports the 90+ languages found in `surya/languages.py`.
-- `--lang_file` if you want to use a different language for different PDFs/images, you can optionally specify languages in a file. The format is a JSON dict with the keys being filenames and the values as a list, like `{"file1.pdf": ["en", "hi"], "file2.pdf": ["en"]}`.
+- `--task_name` will specify which task to use for predicting the lines. `ocr_with_boxes` is the default, which will format text and give you bboxes. If you get bad performance, try `ocr_without_boxes`, which will give you potentially better performance but no bboxes. For blocks like equations and paragraphs, try `block_without_boxes`.
 - `--images` will save images of the pages and detected text lines (optional)
 - `--output_dir` specifies the directory to save results to instead of the default
 - `--page_range` specifies the page range to process in the PDF, specified as a single number, a comma separated list, a range, or comma separated ranges - example: `0,5-10,20`.
+- `--disable_math` - by default, surya will recognize math in text. This can lead to false positives - you can disable this with this flag.
 
 The `results.json` file will contain a json dictionary where the keys are the input filenames without extensions. Each value will be a list of dictionaries, one per page of the input document. Each page dictionary contains:

@@ -109,7 +109,18 @@ The `results.json` file will contain a json dictionary where the keys are the in
 - `confidence` - the confidence of the model in the detected text (0-1)
 - `polygon` - the polygon for the text line in (x1, y1), (x2, y2), (x3, y3), (x4, y4) format. The points are in clockwise order from the top left.
 - `bbox` - the axis-aligned rectangle for the text line in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner.
-- `languages` - the languages specified for the page
+- `chars` - the individual characters in the line
+  - `text` - the text of the character
+  - `bbox` - the character bbox (same format as line bbox)
+  - `polygon` - the character polygon (same format as line polygon)
+  - `confidence` - the confidence of the model in the detected character (0-1)
+  - `bbox_valid` - if the character is a special token or math, the bbox may not be valid
+- `words` - the individual words in the line (computed from the characters)
+  - `text` - the text of the word
+  - `bbox` - the word bbox (same format as line bbox)
+  - `polygon` - the word polygon (same format as line polygon)
+  - `confidence` - mean character confidence
+  - `bbox_valid` - if the word is a special token or math, the bbox may not be valid
 - `page` - the page number in the file
 - `image_bbox` - the bbox for the image in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner. All line bboxes will be contained within this bbox.
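For orientation, here is a minimal sketch of walking the new per-line `chars` and `words` fields from a `results.json` file with the schema described in the hunk above. The output path and the `text_lines` container key are assumptions for illustration; this hunk only documents the fields themselves.

```python
import json

# Illustrative sketch only. The results.json path and the "text_lines" container
# key are assumed; the field names below match the README schema in this hunk.
with open("results/surya/example/results.json") as f:
    results = json.load(f)

for filename, pages in results.items():
    for page in pages:
        print(f"{filename} page {page['page']}, image bbox {page['image_bbox']}")
        for line in page.get("text_lines", []):  # assumed key name
            print(f"  line {line['bbox']}: {line['text']!r} (conf {line['confidence']:.2f})")
            for word in line.get("words", []):
                # Word boxes are derived from character boxes and may be flagged
                # invalid for special tokens or math.
                if word.get("bbox_valid", True):
                    print(f"    word {word['bbox']}: {word['text']!r}")
```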

@@ -125,33 +136,12 @@ from surya.recognition import RecognitionPredictor
 from surya.detection import DetectionPredictor
 
 image = Image.open(IMAGE_PATH)
-langs = ["en"] # Replace with your languages or pass None (recommended to use None)
 recognition_predictor = RecognitionPredictor()
 detection_predictor = DetectionPredictor()
 
-predictions = recognition_predictor([image], [langs], detection_predictor)
+predictions = recognition_predictor([image], det_predictor=detection_predictor)
 ```
 
-### Compilation
-
-The following models have support for compilation. You will need to set the following environment variables to enable compilation:
-
-- Recognition: `COMPILE_RECOGNITION=true`
-- Detection: `COMPILE_DETECTOR=true`
-- Layout: `COMPILE_LAYOUT=true`
-- Table recognition: `COMPILE_TABLE_REC=true`
-
-Alternatively, you can also set `COMPILE_ALL=true` which will compile all models.
-
-Here are the speedups on an A10 GPU:
-
-| Model             | Time per page (s) | Compiled time per page (s) | Speedup (%) |
-| ----------------- | ----------------- | -------------------------- | ----------- |
-| Recognition       | 0.657556          | 0.56265                    | 14.43314334 |
-| Detection         | 0.108808          | 0.10521                    | 3.306742151 |
-| Layout            | 0.27319           | 0.27063                    | 0.93707676  |
-| Table recognition | 0.0219            | 0.01938                    | 11.50684932 |
-
 
 ## Text line detection
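The hunk above changes the Python API: the `langs` list disappears and the detection predictor is now passed as the `det_predictor` keyword argument. A minimal end-to-end sketch of the updated call follows; the image path is a placeholder, and the `text_lines`/`text`/`bbox` attribute names on the returned objects are assumptions inferred from the output schema documented earlier, not something this diff confirms.

```python
from PIL import Image

from surya.detection import DetectionPredictor
from surya.recognition import RecognitionPredictor

# Build the predictors once and reuse them across images.
recognition_predictor = RecognitionPredictor()
detection_predictor = DetectionPredictor()

image = Image.open("page.png")  # placeholder path

# Updated call: no language list; detection is passed as a keyword argument.
predictions = recognition_predictor([image], det_predictor=detection_predictor)

# One result object per input image. The attribute names below are assumed
# from the results.json schema above (text lines with text/bbox/confidence).
for page in predictions:
    for line in page.text_lines:
        print(line.text, line.bbox, line.confidence)
```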

@@ -300,11 +290,7 @@ surya_latex_ocr DATA_PATH
 - `--output_dir` specifies the directory to save results to instead of the default
 - `--page_range` specifies the page range to process in the PDF, specified as a single number, a comma separated list, a range, or comma separated ranges - example: `0,5-10,20`.
 
-The `results.json` file will contain a json dictionary where the keys are the input filenames without extensions. Each value will be a list of dictionaries, one per page of the input document. Each page dictionary contains:
-
-- `text` - the detected LaTeX text - it will be in KaTeX compatible LaTeX, with `<math display="block">...</math>` and `<math>...</math>` as delimiters.
-- `confidence` - the prediction confidence from 0-1.
-- `page` - the page number in the file
+The `results.json` file will contain a json dictionary where the keys are the input filenames without extensions. Each value will be a list of dictionaries, one per page of the input document. See the OCR section above for the format of the output.
 
 ### From python

@@ -327,12 +313,30 @@ pip install streamlit==1.40 streamlit-drawable-canvas-jsretry
 texify_gui
 ```
 
+## Compilation
+
+The following models have support for compilation. You will need to set the following environment variables to enable compilation:
+
+- Detection: `COMPILE_DETECTOR=true`
+- Layout: `COMPILE_LAYOUT=true`
+- Table recognition: `COMPILE_TABLE_REC=true`
+
+Alternatively, you can also set `COMPILE_ALL=true` which will compile all models.
+
+Here are the speedups on an A10 GPU:
+
+| Model             | Time per page (s) | Compiled time per page (s) | Speedup (%) |
+| ----------------- | ----------------- | -------------------------- | ----------- |
+| Detection         | 0.108808          | 0.10521                    | 3.306742151 |
+| Layout            | 0.27319           | 0.27063                    | 0.93707676  |
+| Table recognition | 0.0219            | 0.01938                    | 11.50684932 |
+
 # Limitations
 
 - This is specialized for document OCR. It will likely not work on photos or other images.
 - It is for printed text, not handwriting (though it may work on some handwriting).
 - The text detection model has trained itself to ignore advertisements.
-- You can find language support for OCR in `surya/languages.py`. Text detection, layout analysis, and reading order will work with any language.
+- You can find language support for OCR in `surya/recognition/languages.py`. Text detection, layout analysis, and reading order will work with any language.
 
 ## Troubleshooting
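The re-added Compilation section lists the environment variables that enable model compilation. A small sketch of turning them on from Python follows; the variable names come from the section above, while setting them before constructing the predictors (so they are seen at model load time) is an assumption.

```python
import os

# Enable compilation for selected models. Variable names are from the README
# section above; setting them before the predictors are constructed is an
# assumption about when they are read.
os.environ["COMPILE_DETECTOR"] = "true"
os.environ["COMPILE_LAYOUT"] = "true"
os.environ["COMPILE_TABLE_REC"] = "true"
# Or compile everything:
# os.environ["COMPILE_ALL"] = "true"

from surya.detection import DetectionPredictor  # noqa: E402

detection_predictor = DetectionPredictor()  # the first batch pays the compile cost
```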

benchmark/inline_detection.py

Lines changed: 0 additions & 107 deletions
This file was deleted.
