Commit 26fba49

wonjuleee and vinnamkim authored
Fix validator and add notebooks and document for level-up validator (#933)
<!-- Contributing guide: https://github.com/openvinotoolkit/datumaro/blob/develop/CONTRIBUTING.md -->

### Summary

<!-- Resolves #111 and #222. Depends on #1000 (for series of dependent commits).

This PR introduces this capability to make the project better in this and that.

- Added this feature
- Removed that feature
- Fixed the problem #1234 -->

### How to test

<!-- Describe the testing procedure for reviewers, if changes are not fully covered by unit tests or manual testing can be complicated. -->

### Checklist

<!-- Put an 'x' in all the boxes that apply -->
- [ ] I have added unit tests to cover my changes.
- [ ] I have added integration tests to cover my changes.
- [ ] I have added the description of my changes into [CHANGELOG](https://github.com/openvinotoolkit/datumaro/blob/develop/CHANGELOG.md).
- [ ] I have updated the [documentation](https://github.com/openvinotoolkit/datumaro/tree/develop/docs) accordingly.

### License

- [ ] I submit _my code changes_ under the same [MIT License](https://github.com/openvinotoolkit/datumaro/blob/develop/LICENSE) that covers the project. Feel free to contact the maintainers if that's a concern.
- [ ] I have updated the license header for each file (see an example below).

```python
# Copyright (C) 2023 Intel Corporation
#
# SPDX-License-Identifier: MIT
```

---------

Signed-off-by: Kim, Vinnam <[email protected]>
Co-authored-by: Vinnam Kim <[email protected]>
1 parent 57ccba7 commit 26fba49

File tree: 9 files changed, +952 -457


.github/workflows/publish_sdist_to_pypi.yml

Lines changed: 1 addition & 1 deletion
```diff
@@ -43,7 +43,7 @@ jobs:
         uses: actions-ecosystem/action-regex-match@v2
         with:
           text: ${{ github.ref }}
-          regex: '^refs/tags/v[0-9]+\.[0-9]+\.[0-9]+$'
+          regex: '^refs/tags/v[0-9]+\.[0-9]+\.[0-9]+(rc[0-9]+)?$'
       - name: Publish package distributions to PyPI
         if: ${{ steps.check-tag.outputs.match != '' }}
         uses: pypa/[email protected]
```
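
The only behavioral change above is that release-candidate tags such as `refs/tags/v1.3.0rc1` now pass the tag check, so pre-releases can be published. A minimal sketch for checking the pattern locally, assuming Python's `re` engine treats this simple pattern the same way as the action's JavaScript regex:

```python
# Local sanity check of the updated tag pattern from publish_sdist_to_pypi.yml.
# Assumption: Python `re` and the action's JS regex agree for this simple pattern.
import re

TAG_RE = re.compile(r"^refs/tags/v[0-9]+\.[0-9]+\.[0-9]+(rc[0-9]+)?$")

refs = [
    "refs/tags/v1.3.0",      # final release      -> matches
    "refs/tags/v1.3.0rc1",   # release candidate  -> matches (new)
    "refs/tags/v1.3.0-rc1",  # hyphenated rc      -> does not match
    "refs/heads/develop",    # branch push        -> does not match
]
for ref in refs:
    print(f"{ref}: {bool(TAG_RE.match(ref))}")
```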

datumaro/plugins/validators.py

Lines changed: 349 additions & 400 deletions
Large diffs are not rendered by default.

docs/source/docs/level-up/basic_skills/03_dataset_import_export.rst

Lines changed: 9 additions & 9 deletions
```diff
@@ -1,22 +1,22 @@
-=============
+===============================
 Level 3: Data Import and Export
-=============
+===============================

 Datumaro is a tool that supports public data formats across a wide range of tasks such as
 classification, detection, segmentation, pose estimation, or visual tracking.
 To facilitate this, Datumaro provides assistance with data import and export via both Python API and CLI.
 This makes it easier for users to work with various data formats using Datumaro.

 Prepare dataset
-============
+===============

 For the segmentation task, we here introduce the Cityscapes, which collects road scenes from 50
 different cities and contains 5K fine-grained pixel-level annotations and 20K coarse annotations.
 More detailed description is given by :ref:`here <Cityscapes>`.
 The Cityscapes dataset is available for free `download <https://www.cityscapes-dataset.com/downloads/>`_.

 Convert data format
-============
+===================

 Users sometimes needs to compare, merge, or manage various kinds of public datasets in a unified
 system. To achieve this, Datumaro not only has `import` and `export` funcionalities, but also
@@ -59,32 +59,32 @@ We now convert the Cityscapes data into the MS-COCO format, which is described i

 .. code-block:: bash

-    datum create -o <path/to/project>
+    datum project create -o <path/to/project>

 We now import Cityscapes data into the project through

 .. code-block:: bash

-    datum import --format cityscapes -p <path/to/project> <path/to/cityscapes>
+    datum project import --format cityscapes -p <path/to/project> <path/to/cityscapes>

 (Optional) When we import a data, the change is automatically commited in the project.
 This can be shown through `log` as

 .. code-block:: bash

-    datum log -p <path/to/project>
+    datum project log -p <path/to/project>

 (Optional) We can check the imported dataset information such as subsets, number of data, or
 categories through `info`.

 .. code-block:: bash

-    datum info -p <path/to/project>
+    datum project info -p <path/to/project>

 Finally, we export the data within the project with MS-COCO format as

 .. code-block:: bash

-    datum export --format coco -p <path/to/project> -o <path/to/save> -- --save-media
+    datum project export --format coco -p <path/to/project> -o <path/to/save> -- --save-media

 For a data with an unknown format, we can detect the format in the :ref:`next level <Level 4: Detect Data Format from an Unknown Dataset>`!
```
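
For reference, the same Cityscapes-to-COCO flow can also be written directly with the Python API instead of the project-based CLI. This sketch is not part of the commit, and the paths are placeholders:

```python
# Rough Python-API equivalent of the CLI walkthrough above (paths are placeholders).
from datumaro.components.dataset import Dataset

# Import the Cityscapes dataset from disk
dataset = Dataset.import_from("/path/to/cityscapes", "cityscapes")

# Inspect subsets, item counts, and categories before converting
print(dataset)

# Export to MS-COCO, copying the images alongside the annotations
dataset.export("/path/to/save", "coco", save_media=True)
```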

docs/source/docs/level-up/basic_skills/04_detect_data_format.rst

Lines changed: 3 additions & 3 deletions
```diff
@@ -1,14 +1,14 @@
-=============
+===================================================
 Level 4: Detect Data Format from an Unknown Dataset
-=============
+===================================================

 Datumaro provides a function to detect the format of a dataset before importing data. This can be
 useful in cases where information about the original format of the data has been lost or is unclear.
 With this function, users can easily identify the format and proceed with appropriate data
 handling processes.

 Detect data format
-============
+==================

 .. tabbed:: CLI

```
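
The Python side of this level goes through `Environment.detect_dataset`, the same call the new validation page below uses. A minimal sketch with a placeholder path, not part of the commit:

```python
# Sketch of detecting a dataset's format before import (placeholder path).
from datumaro.components.dataset import Dataset
from datumaro.components.environment import Environment

data_path = "/path/to/unknown_dataset"

env = Environment()
detected_formats = env.detect_dataset(data_path)
print("Candidate formats:", detected_formats)  # e.g. ['coco_instances']

# Import using the most likely candidate
dataset = Dataset.import_from(data_path, detected_formats[0])
```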

docs/source/docs/level-up/intermediate_skills/08_data_refinement.md

Lines changed: 0 additions & 28 deletions
This file was deleted.

docs/source/docs/level-up/intermediate_skills/08_data_validate.rst

Lines changed: 73 additions & 0 deletions

```rst
===========================
Level 8: Dataset Validation
===========================


When creating a dataset, it is natural for imbalances to occur between categories, and sometimes
there may be very few data points for the minority class. In addition, inconsistent annotations may
be produced by annotators or over time. When training a model with such data, more attention should
be paid, and sometimes it may be necessary to filter or correct the data in advance. Datumaro provides
data validation functionality for this purpose.

More detailed descriptions of validation errors and warnings are given :ref:`here <Validate>`.
A Python example of using the validator is given `here <https://github.com/openvinotoolkit/datumaro/blob/develop/notebooks/11_validate.ipynb>`_.


.. tab-set::

   .. tab-item:: Python

      .. code-block:: python

         from datumaro.components.environment import Environment
         from datumaro.components.dataset import Dataset

         data_path = '/path/to/data'

         env = Environment()

         detected_formats = env.detect_dataset(data_path)

         dataset = Dataset.import_from(data_path, detected_formats[0])

         from datumaro.plugins.validators import DetectionValidator

         validator = DetectionValidator()  # Or ClassificationValidator or SegmentationValidator

         reports = validator.validate(dataset)

   .. tab-item:: ProjectCLI

      With the project-based CLI, we first need to create a project by

      .. code-block:: bash

         datum project create -o <path/to/project>

      We now import MS-COCO validation data into the project through

      .. code-block:: bash

         datum project import --format coco_instances -p <path/to/project> <path/to/coco>

      (Optional) When we import data, the change is automatically committed in the project.
      This can be shown through `log` as

      .. code-block:: bash

         datum project log -p <path/to/project>

      (Optional) We can check the imported dataset information such as subsets, number of data, or
      categories through `info`.

      .. code-block:: bash

         datum project dinfo -p <path/to/project>

      Finally, we validate the data within the project as

      .. code-block:: bash

         datum validate --task-type <classification/detection/segmentation> --subset <subset_name> -p <path/to/project>

      We now have the validation report named validation-report-<subset_name>.json.
```
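
Both routes end in the same kind of report. As a follow-on sketch, continuing from the `validator` and `dataset` objects in the Python tab above; the keys used here are an assumption based on the `validation-report-<subset_name>.json` file the CLI writes, not something this page documents:

```python
# Digesting the result of validator.validate(dataset).
# Assumption: the report is a JSON-serializable dict whose "validation_reports"
# entry is a list of anomalies, each carrying a "severity" field, mirroring the
# CLI's validation-report-<subset_name>.json.
import json
from collections import Counter

reports = validator.validate(dataset)

severity_counts = Counter(r["severity"] for r in reports["validation_reports"])
print(severity_counts)  # e.g. Counter({'warning': 12, 'error': 1})

# Persist the full report, similar to what the CLI produces
with open("validation-report.json", "w") as fp:
    json.dump(reports, fp, indent=4)
```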

docs/source/docs/level-up/intermediate_skills/index.rst

Lines changed: 3 additions & 3 deletions
```diff
@@ -31,12 +31,12 @@ Intermediate Skills

    ---

-      .. link-button:: 08_data_refinement
+      .. link-button:: 08_data_validate
         :type: ref
-         :text: Level 08: Dataset Refinement
+         :text: Level 08: Dataset Validate
         :classes: btn-outline-primary btn-block stretched-link

-      :badge:`CLI,badge-info`
+      :badge:`ProjectCLI,badge-primary`
      :badge:`Python,badge-warning`

    ---
```

notebooks/11_validate.ipynb

Lines changed: 500 additions & 0 deletions
Large diffs are not rendered by default.

tests/unit/test_validator.py

Lines changed: 14 additions & 13 deletions
```diff
@@ -405,10 +405,13 @@ def test_check_missing_attribute(self):

     @mark_requirement(Requirements.DATUM_GENERAL_REQ)
     def test_check_undefined_label(self):
-        label_name = "unittest"
-        label_stats = {"items_with_undefined_label": [(1, "unittest")]}
+        label_name = "cat0"
+        item_id = 1
+        item_subset = "unittest"
+        label_stats = {label_name: {"items_with_undefined_label": [(item_id, item_subset)]}}
+        stats = {"label_distribution": {"undefined_labels": label_stats}}

-        actual_reports = self.validator._check_undefined_label(label_name, label_stats)
+        actual_reports = self.validator._check_undefined_label(stats)

         self.assertTrue(len(actual_reports) == 1)
         self.assertIsInstance(actual_reports[0], UndefinedLabel)
@@ -455,14 +458,12 @@ def test_check_only_one_label(self):
         self.assertIsInstance(actual_reports[0], OnlyOneLabel)

     @mark_requirement(Requirements.DATUM_GENERAL_REQ)
-    def test_check_only_one_attribute_value(self):
+    def test_check_only_one_attribute(self):
         label_name = "unit"
         attr_name = "test"
         attr_dets = {"distribution": {"mock": 1}}

-        actual_reports = self.validator._check_only_one_attribute_value(
-            label_name, attr_name, attr_dets
-        )
+        actual_reports = self.validator._check_only_one_attribute(label_name, attr_name, attr_dets)

         self.assertTrue(len(actual_reports) == 1)
         self.assertIsInstance(actual_reports[0], OnlyOneAttributeValue)
@@ -897,7 +898,7 @@ def test_validate_annotations_detection(self):
         self.assertEqual(actual_stats["items_with_negative_length"], {})
         self.assertEqual(actual_stats["items_with_invalid_value"], {})

-        bbox_dist_by_label = actual_stats["bbox_distribution_in_label"]
+        bbox_dist_by_label = actual_stats["point_distribution_in_label"]
         label_prop_stats = bbox_dist_by_label["label_1"]["width"]
         self.assertEqual(label_prop_stats["items_far_from_mean"], {})
         self.assertEqual(label_prop_stats["mean"], 3.5)
@@ -906,7 +907,7 @@ def test_validate_annotations_detection(self):
         self.assertEqual(label_prop_stats["max"], 4.0)
         self.assertEqual(label_prop_stats["median"], 3.5)

-        bbox_dist_by_attr = actual_stats["bbox_distribution_in_attribute"]
+        bbox_dist_by_attr = actual_stats["point_distribution_in_attribute"]
         attr_prop_stats = bbox_dist_by_attr["label_0"]["a"]["1"]["width"]
         self.assertEqual(attr_prop_stats["items_far_from_mean"], {})
         self.assertEqual(attr_prop_stats["mean"], 2.0)
@@ -915,7 +916,7 @@ def test_validate_annotations_detection(self):
         self.assertEqual(attr_prop_stats["max"], 3.0)
         self.assertEqual(attr_prop_stats["median"], 2.0)

-        bbox_dist_item = actual_stats["bbox_distribution_in_dataset_item"]
+        bbox_dist_item = actual_stats["point_distribution_in_dataset_item"]
         self.assertEqual(sum(bbox_dist_item.values()), 8)

         with self.subTest("Test of validation reports", i=1):
@@ -948,7 +949,7 @@ def test_validate_annotations_segmentation(self):
         self.assertEqual(len(actual_stats["items_missing_annotation"]), 1)
         self.assertEqual(actual_stats["items_with_invalid_value"], {})

-        mask_dist_by_label = actual_stats["mask_distribution_in_label"]
+        mask_dist_by_label = actual_stats["point_distribution_in_label"]
         label_prop_stats = mask_dist_by_label["label_1"]["area"]
         self.assertEqual(label_prop_stats["items_far_from_mean"], {})
         areas = [12, 4, 8]
@@ -958,7 +959,7 @@ def test_validate_annotations_segmentation(self):
         self.assertEqual(label_prop_stats["max"], np.max(areas))
         self.assertEqual(label_prop_stats["median"], np.median(areas))

-        mask_dist_by_attr = actual_stats["mask_distribution_in_attribute"]
+        mask_dist_by_attr = actual_stats["point_distribution_in_attribute"]
         attr_prop_stats = mask_dist_by_attr["label_0"]["a"]["1"]["area"]
         areas = [12, 4]
         self.assertEqual(attr_prop_stats["items_far_from_mean"], {})
@@ -968,7 +969,7 @@ def test_validate_annotations_segmentation(self):
         self.assertEqual(attr_prop_stats["max"], np.max(areas))
         self.assertEqual(attr_prop_stats["median"], np.median(areas))

-        mask_dist_item = actual_stats["mask_distribution_in_dataset_item"]
+        mask_dist_item = actual_stats["point_distribution_in_dataset_item"]
         self.assertEqual(sum(mask_dist_item.values()), 9)

         with self.subTest("Test of validation reports", i=1):
```
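
In short, `_check_undefined_label` now consumes the full statistics dictionary rather than a single `(label_name, label_stats)` pair, and the geometry statistics are keyed uniformly as `point_distribution_in_*` for both bounding boxes and masks. A small illustrative sketch of the new shape, assuming `validator` is an instantiated task validator such as `DetectionValidator`, standing in for the test fixture's `self.validator`:

```python
# Illustration of the reshaped stats these tests now exercise (not from the commit itself).
from datumaro.plugins.validators import DetectionValidator

validator = DetectionValidator()  # assumed stand-in for the test fixture's self.validator

stats = {
    "label_distribution": {
        "undefined_labels": {
            "cat0": {"items_with_undefined_label": [(1, "unittest")]},
        }
    }
}

# The checker now walks label_distribution -> undefined_labels itself
reports = validator._check_undefined_label(stats)  # expect one UndefinedLabel report

# Per-task keys such as "bbox_distribution_in_label" / "mask_distribution_in_label"
# are replaced by the shared "point_distribution_in_label" and its siblings.
```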
