README.md: 35 additions & 31 deletions
@@ -96,11 +96,11 @@ surya_ocr DATA_PATH
 ```
 
 - `DATA_PATH` can be an image, pdf, or folder of images/pdfs
-- `--langs` is an optional (but recommended) argument that specifies the language(s) to use for OCR. You can comma separate multiple languages. Use the language name or two-letter ISO code from [here](https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes). Surya supports the 90+ languages found in `surya/languages.py`.
-- `--lang_file` if you want to use a different language for different PDFs/images, you can optionally specify languages in a file. The format is a JSON dict with the keys being filenames and the values as a list, like `{"file1.pdf": ["en", "hi"], "file2.pdf": ["en"]}`.
+- `--task_name` will specify which task to use for predicting the lines. `ocr_with_boxes` is the default, which will format text and give you bboxes. If you get bad performance, try `ocr_without_boxes`, which will give you potentially better performance but no bboxes. For blocks like equations and paragraphs, try `block_without_boxes`.
 - `--images` will save images of the pages and detected text lines (optional)
 - `--output_dir` specifies the directory to save results to instead of the default
 - `--page_range` specifies the page range to process in the PDF, specified as a single number, a comma separated list, a range, or comma separated ranges - example: `0,5-10,20`.
+- `--disable_math` - by default, surya will recognize math in text. This can lead to false positives - you can disable this with this flag.
 
 The `results.json` file will contain a json dictionary where the keys are the input filenames without extensions. Each value will be a list of dictionaries, one per page of the input document. Each page dictionary contains:
 
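The `--page_range` syntax in the hunk above (`0,5-10,20`) can be illustrated with a small parser. This is a sketch under assumptions, not Surya's own implementation: the function name `parse_page_range` is hypothetical, and ranges are assumed to be inclusive on both ends.

```python
def parse_page_range(spec: str) -> list[int]:
    """Expand a spec such as "0,5-10,20" into a sorted list of page indices.

    Assumption: ranges are inclusive on both ends, so "5-10" covers 5 through 10.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(parse_page_range("0,5-10,20"))
```

Collecting pages in a set before sorting means overlapping ranges such as `0-3,2-5` do not yield duplicate pages.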
@@ -109,7 +109,18 @@ The `results.json` file will contain a json dictionary where the keys are the in
   - `confidence` - the confidence of the model in the detected text (0-1)
   - `polygon` - the polygon for the text line in (x1, y1), (x2, y2), (x3, y3), (x4, y4) format. The points are in clockwise order from the top left.
   - `bbox` - the axis-aligned rectangle for the text line in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner.
-  - `languages` - the languages specified for the page
+  - `chars` - the individual characters in the line
+    - `text` - the text of the character
+    - `bbox` - the character bbox (same format as line bbox)
+    - `polygon` - the character polygon (same format as line polygon)
+    - `confidence` - the confidence of the model in the detected character (0-1)
+    - `bbox_valid` - if the character is a special token or math, the bbox may not be valid
+  - `words` - the individual words in the line (computed from the characters)
+    - `text` - the text of the word
+    - `bbox` - the word bbox (same format as line bbox)
+    - `polygon` - the word polygon (same format as line polygon)
+    - `confidence` - mean character confidence
+    - `bbox_valid` - if the word is a special token or math, the bbox may not be valid
 - `page` - the page number in the file
 - `image_bbox` - the bbox for the image in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner. All line bboxes will be contained within this bbox.
 
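The `chars` and `words` entries added in the hunk above might be consumed as in the following sketch. The sample `line` dictionary is made up for illustration and is not real Surya output, and the helper names are hypothetical. Only keys listed above (`text`, `bbox`, `confidence`, `bbox_valid`) are used.

```python
from statistics import mean

# Made-up sample line; the values are illustrative, not real Surya output.
line = {
    "chars": [
        {"text": "H", "bbox": [0, 0, 10, 12], "confidence": 1.0, "bbox_valid": True},
        {"text": "i", "bbox": [10, 0, 14, 12], "confidence": 0.5, "bbox_valid": True},
    ],
    "words": [
        {"text": "Hi", "bbox": [0, 0, 14, 12], "confidence": 0.75, "bbox_valid": True},
        {"text": "<math>", "bbox": [0, 0, 0, 0], "confidence": 0.9, "bbox_valid": False},
    ],
}


def mean_char_confidence(chars):
    """Average the per-character confidences; 0.0 for an empty list."""
    return mean(c["confidence"] for c in chars) if chars else 0.0


def words_with_valid_boxes(line):
    """Return (text, bbox) pairs for words whose bbox is marked valid."""
    return [
        (w["text"], tuple(w["bbox"]))
        for w in line.get("words", [])
        if w.get("bbox_valid", True)
    ]


print(mean_char_confidence(line["chars"]))
print(words_with_valid_boxes(line))
```

Filtering on `bbox_valid` before using a box for cropping or highlighting avoids relying on coordinates that the description above says may not be valid for special tokens or math.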
@@ -125,33 +136,12 @@ from surya.recognition import RecognitionPredictor
 from surya.detection import DetectionPredictor
 
 image = Image.open(IMAGE_PATH)
-langs = ["en"] # Replace with your languages or pass None (recommended to use None)
 - `--output_dir` specifies the directory to save results to instead of the default
 - `--page_range` specifies the page range to process in the PDF, specified as a single number, a comma separated list, a range, or comma separated ranges - example: `0,5-10,20`.
 
-The `results.json` file will contain a json dictionary where the keys are the input filenames without extensions. Each value will be a list of dictionaries, one per page of the input document. Each page dictionary contains:
-
-- `text` - the detected LaTeX text - it will be in KaTeX compatible LaTeX, with `<math display="block">...</math>` and `<math>...</math>` as delimiters.
-- `confidence` - the prediction confidence from 0-1.
-- `page` - the page number in the file
+The `results.json` file will contain a json dictionary where the keys are the input filenames without extensions. Each value will be a list of dictionaries, one per page of the input document. See the OCR section above for the format of the output.
 - This is specialized for document OCR. It will likely not work on photos or other images.
 - It is for printed text, not handwriting (though it may work on some handwriting).
 - The text detection model has trained itself to ignore advertisements.
-- You can find language support for OCR in `surya/languages.py`. Text detection, layout analysis, and reading order will work with any language.
+- You can find language support for OCR in `surya/recognition/languages.py`. Text detection, layout analysis, and reading order will work with any language.