Commit 26fba49

wonjuleee and vinnamkim authored
Fix validator and add notebooks and document for level-up validator (#933)
<!-- Contributing guide: https://github.com/openvinotoolkit/datumaro/blob/develop/CONTRIBUTING.md -->

### Summary

<!-- Resolves #111 and #222. Depends on #1000 (for series of dependent commits).

This PR introduces this capability to make the project better in this and that.

- Added this feature
- Removed that feature
- Fixed the problem #1234 -->

### How to test

<!-- Describe the testing procedure for reviewers, if changes are not fully covered by unit tests or manual testing can be complicated. -->

### Checklist

<!-- Put an 'x' in all the boxes that apply -->
- [ ] I have added unit tests to cover my changes.
- [ ] I have added integration tests to cover my changes.
- [ ] I have added the description of my changes into [CHANGELOG](https://github.com/openvinotoolkit/datumaro/blob/develop/CHANGELOG.md).
- [ ] I have updated the [documentation](https://github.com/openvinotoolkit/datumaro/tree/develop/docs) accordingly.

### License

- [ ] I submit _my code changes_ under the same [MIT License](https://github.com/openvinotoolkit/datumaro/blob/develop/LICENSE) that covers the project. Feel free to contact the maintainers if that's a concern.
- [ ] I have updated the license header for each file (see an example below).

```python
# Copyright (C) 2023 Intel Corporation
#
# SPDX-License-Identifier: MIT
```

---------

Signed-off-by: Kim, Vinnam <[email protected]>
Co-authored-by: Vinnam Kim <[email protected]>
1 parent 57ccba7 commit 26fba49

File tree: 9 files changed, +952 -457


.github/workflows/publish_sdist_to_pypi.yml

Lines changed: 1 addition & 1 deletion
```diff
@@ -43,7 +43,7 @@ jobs:
         uses: actions-ecosystem/action-regex-match@v2
         with:
           text: ${{ github.ref }}
-          regex: '^refs/tags/v[0-9]+\.[0-9]+\.[0-9]+$'
+          regex: '^refs/tags/v[0-9]+\.[0-9]+\.[0-9]+(rc[0-9]+)?$'
       - name: Publish package distributions to PyPI
         if: ${{ steps.check-tag.outputs.match != '' }}
         uses: pypa/[email protected]
```
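
The only behavioral change above is that release-candidate tags such as `refs/tags/v1.3.0rc1` now pass the tag check, so pre-releases can be published. A minimal sketch for checking the pattern locally, assuming Python's `re` engine treats this simple pattern the same way as the action's JavaScript regex:

```python
# Local sanity check of the updated tag pattern from publish_sdist_to_pypi.yml.
# Assumption: Python `re` and the action's JS regex agree for this simple pattern.
import re

TAG_RE = re.compile(r"^refs/tags/v[0-9]+\.[0-9]+\.[0-9]+(rc[0-9]+)?$")

refs = [
    "refs/tags/v1.3.0",      # final release      -> matches
    "refs/tags/v1.3.0rc1",   # release candidate  -> matches (new)
    "refs/tags/v1.3.0-rc1",  # hyphenated rc      -> does not match
    "refs/heads/develop",    # branch push        -> does not match
]
for ref in refs:
    print(f"{ref}: {bool(TAG_RE.match(ref))}")
```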

datumaro/plugins/validators.py

Lines changed: 349 additions & 400 deletions
Large diffs are not rendered by default.

docs/source/docs/level-up/basic_skills/03_dataset_import_export.rst

Lines changed: 9 additions & 9 deletions
```diff
@@ -1,22 +1,22 @@
-=============
+===============================
 Level 3: Data Import and Export
-=============
+===============================

 Datumaro is a tool that supports public data formats across a wide range of tasks such as
 classification, detection, segmentation, pose estimation, or visual tracking.
 To facilitate this, Datumaro provides assistance with data import and export via both Python API and CLI.
 This makes it easier for users to work with various data formats using Datumaro.

 Prepare dataset
-============
+===============

 For the segmentation task, we here introduce the Cityscapes, which collects road scenes from 50
 different cities and contains 5K fine-grained pixel-level annotations and 20K coarse annotations.
 More detailed description is given by :ref:`here <Cityscapes>`.
 The Cityscapes dataset is available for free `download <https://www.cityscapes-dataset.com/downloads/>`_.

 Convert data format
-============
+===================

 Users sometimes needs to compare, merge, or manage various kinds of public datasets in a unified
 system. To achieve this, Datumaro not only has `import` and `export` funcionalities, but also
@@ -59,32 +59,32 @@ We now convert the Cityscapes data into the MS-COCO format, which is described i

 .. code-block:: bash

-    datum create -o <path/to/project>
+    datum project create -o <path/to/project>

 We now import Cityscapes data into the project through

 .. code-block:: bash

-    datum import --format cityscapes -p <path/to/project> <path/to/cityscapes>
+    datum project import --format cityscapes -p <path/to/project> <path/to/cityscapes>

 (Optional) When we import a data, the change is automatically commited in the project.
 This can be shown through `log` as

 .. code-block:: bash

-    datum log -p <path/to/project>
+    datum project log -p <path/to/project>

 (Optional) We can check the imported dataset information such as subsets, number of data, or
 categories through `info`.

 .. code-block:: bash

-    datum info -p <path/to/project>
+    datum project info -p <path/to/project>

 Finally, we export the data within the project with MS-COCO format as

 .. code-block:: bash

-    datum export --format coco -p <path/to/project> -o <path/to/save> -- --save-media
+    datum project export --format coco -p <path/to/project> -o <path/to/save> -- --save-media

 For a data with an unknown format, we can detect the format in the :ref:`next level <Level 4: Detect Data Format from an Unknown Dataset>`!
```
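
For reference, the same Cityscapes-to-COCO flow can also be written directly with the Python API instead of the project-based CLI. This sketch is not part of the commit, and the paths are placeholders:

```python
# Rough Python-API equivalent of the CLI walkthrough above (paths are placeholders).
from datumaro.components.dataset import Dataset

# Import the Cityscapes dataset from disk
dataset = Dataset.import_from("/path/to/cityscapes", "cityscapes")

# Inspect subsets, item counts, and categories before converting
print(dataset)

# Export to MS-COCO, copying the images alongside the annotations
dataset.export("/path/to/save", "coco", save_media=True)
```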

docs/source/docs/level-up/basic_skills/04_detect_data_format.rst

Lines changed: 3 additions & 3 deletions
```diff
@@ -1,14 +1,14 @@
-=============
+===================================================
 Level 4: Detect Data Format from an Unknown Dataset
-=============
+===================================================

 Datumaro provides a function to detect the format of a dataset before importing data. This can be
 useful in cases where information about the original format of the data has been lost or is unclear.
 With this function, users can easily identify the format and proceed with appropriate data
 handling processes.

 Detect data format
-============
+==================

 .. tabbed:: CLI

```
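
The Python side of this level goes through `Environment.detect_dataset`, the same call the new validation page below uses. A minimal sketch with a placeholder path, not part of the commit:

```python
# Sketch of detecting a dataset's format before import (placeholder path).
from datumaro.components.dataset import Dataset
from datumaro.components.environment import Environment

data_path = "/path/to/unknown_dataset"

env = Environment()
detected_formats = env.detect_dataset(data_path)
print("Candidate formats:", detected_formats)  # e.g. ['coco_instances']

# Import using the most likely candidate
dataset = Dataset.import_from(data_path, detected_formats[0])
```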

docs/source/docs/level-up/intermediate_skills/08_data_refinement.md

Lines changed: 0 additions & 28 deletions
This file was deleted.

docs/source/docs/level-up/intermediate_skills/08_data_validate.rst

Lines changed: 73 additions & 0 deletions

```rst
===========================
Level 8: Dataset Validation
===========================


When creating a dataset, it is natural for imbalances to occur between categories, and sometimes
there may be very few data points for the minority class. In addition, inconsistent annotations may
be produced by annotators or over time. When training a model with such data, more attention should
be paid, and sometimes it may be necessary to filter or correct the data in advance. Datumaro provides
data validation functionality for this purpose.

More detailed descriptions of validation errors and warnings are given :ref:`here <Validate>`.
A Python example of using the validator is given `here <https://github.com/openvinotoolkit/datumaro/blob/develop/notebooks/11_validate.ipynb>`_.


.. tab-set::

   .. tab-item:: Python

      .. code-block:: python

         from datumaro.components.environment import Environment
         from datumaro.components.dataset import Dataset

         data_path = '/path/to/data'

         env = Environment()

         detected_formats = env.detect_dataset(data_path)

         dataset = Dataset.import_from(data_path, detected_formats[0])

         from datumaro.plugins.validators import DetectionValidator

         validator = DetectionValidator()  # Or ClassificationValidator or SegmentationValidator

         reports = validator.validate(dataset)

   .. tab-item:: ProjectCLI

      With the project-based CLI, we first need to create a project by

      .. code-block:: bash

         datum project create -o <path/to/project>

      We now import MS-COCO validation data into the project through

      .. code-block:: bash

         datum project import --format coco_instances -p <path/to/project> <path/to/coco>

      (Optional) When we import data, the change is automatically committed in the project.
      This can be shown through `log` as

      .. code-block:: bash

         datum project log -p <path/to/project>

      (Optional) We can check the imported dataset information such as subsets, number of data, or
      categories through `info`.

      .. code-block:: bash

         datum project dinfo -p <path/to/project>

      Finally, we validate the data within the project as

      .. code-block:: bash

         datum validate --task-type <classification/detection/segmentation> --subset <subset_name> -p <path/to/project>

      We now have the validation report named validation-report-<subset_name>.json.
```
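
Both routes end in the same kind of report. As a follow-on sketch, continuing from the `validator` and `dataset` objects in the Python tab above; the keys used here are an assumption based on the `validation-report-<subset_name>.json` file the CLI writes, not something this page documents:

```python
# Digesting the result of validator.validate(dataset).
# Assumption: the report is a JSON-serializable dict whose "validation_reports"
# entry is a list of anomalies, each carrying a "severity" field, mirroring the
# CLI's validation-report-<subset_name>.json.
import json
from collections import Counter

reports = validator.validate(dataset)

severity_counts = Counter(r["severity"] for r in reports["validation_reports"])
print(severity_counts)  # e.g. Counter({'warning': 12, 'error': 1})

# Persist the full report, similar to what the CLI produces
with open("validation-report.json", "w") as fp:
    json.dump(reports, fp, indent=4)
```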

docs/source/docs/level-up/intermediate_skills/index.rst

Lines changed: 3 additions & 3 deletions
```diff
@@ -31,12 +31,12 @@ Intermediate Skills

    ---

-      .. link-button:: 08_data_refinement
+      .. link-button:: 08_data_validate
         :type: ref
-         :text: Level 08: Dataset Refinement
+         :text: Level 08: Dataset Validate
         :classes: btn-outline-primary btn-block stretched-link

-      :badge:`CLI,badge-info`
+      :badge:`ProjectCLI,badge-primary`
      :badge:`Python,badge-warning`

    ---
```

notebooks/11_validate.ipynb

Lines changed: 500 additions & 0 deletions
Large diffs are not rendered by default.

tests/unit/test_validator.py

Lines changed: 14 additions & 13 deletions
```diff
@@ -405,10 +405,13 @@ def test_check_missing_attribute(self):

     @mark_requirement(Requirements.DATUM_GENERAL_REQ)
     def test_check_undefined_label(self):
-        label_name = "unittest"
-        label_stats = {"items_with_undefined_label": [(1, "unittest")]}
+        label_name = "cat0"
+        item_id = 1
+        item_subset = "unittest"
+        label_stats = {label_name: {"items_with_undefined_label": [(item_id, item_subset)]}}
+        stats = {"label_distribution": {"undefined_labels": label_stats}}

-        actual_reports = self.validator._check_undefined_label(label_name, label_stats)
+        actual_reports = self.validator._check_undefined_label(stats)

         self.assertTrue(len(actual_reports) == 1)
         self.assertIsInstance(actual_reports[0], UndefinedLabel)
@@ -455,14 +458,12 @@ def test_check_only_one_label(self):
         self.assertIsInstance(actual_reports[0], OnlyOneLabel)

     @mark_requirement(Requirements.DATUM_GENERAL_REQ)
-    def test_check_only_one_attribute_value(self):
+    def test_check_only_one_attribute(self):
         label_name = "unit"
         attr_name = "test"
         attr_dets = {"distribution": {"mock": 1}}

-        actual_reports = self.validator._check_only_one_attribute_value(
-            label_name, attr_name, attr_dets
-        )
+        actual_reports = self.validator._check_only_one_attribute(label_name, attr_name, attr_dets)

         self.assertTrue(len(actual_reports) == 1)
         self.assertIsInstance(actual_reports[0], OnlyOneAttributeValue)
@@ -897,7 +898,7 @@ def test_validate_annotations_detection(self):
         self.assertEqual(actual_stats["items_with_negative_length"], {})
         self.assertEqual(actual_stats["items_with_invalid_value"], {})

-        bbox_dist_by_label = actual_stats["bbox_distribution_in_label"]
+        bbox_dist_by_label = actual_stats["point_distribution_in_label"]
         label_prop_stats = bbox_dist_by_label["label_1"]["width"]
         self.assertEqual(label_prop_stats["items_far_from_mean"], {})
         self.assertEqual(label_prop_stats["mean"], 3.5)
@@ -906,7 +907,7 @@ def test_validate_annotations_detection(self):
         self.assertEqual(label_prop_stats["max"], 4.0)
         self.assertEqual(label_prop_stats["median"], 3.5)

-        bbox_dist_by_attr = actual_stats["bbox_distribution_in_attribute"]
+        bbox_dist_by_attr = actual_stats["point_distribution_in_attribute"]
         attr_prop_stats = bbox_dist_by_attr["label_0"]["a"]["1"]["width"]
         self.assertEqual(attr_prop_stats["items_far_from_mean"], {})
         self.assertEqual(attr_prop_stats["mean"], 2.0)
@@ -915,7 +916,7 @@ def test_validate_annotations_detection(self):
         self.assertEqual(attr_prop_stats["max"], 3.0)
         self.assertEqual(attr_prop_stats["median"], 2.0)

-        bbox_dist_item = actual_stats["bbox_distribution_in_dataset_item"]
+        bbox_dist_item = actual_stats["point_distribution_in_dataset_item"]
         self.assertEqual(sum(bbox_dist_item.values()), 8)

         with self.subTest("Test of validation reports", i=1):
@@ -948,7 +949,7 @@ def test_validate_annotations_segmentation(self):
         self.assertEqual(len(actual_stats["items_missing_annotation"]), 1)
         self.assertEqual(actual_stats["items_with_invalid_value"], {})

-        mask_dist_by_label = actual_stats["mask_distribution_in_label"]
+        mask_dist_by_label = actual_stats["point_distribution_in_label"]
         label_prop_stats = mask_dist_by_label["label_1"]["area"]
         self.assertEqual(label_prop_stats["items_far_from_mean"], {})
         areas = [12, 4, 8]
@@ -958,7 +959,7 @@ def test_validate_annotations_segmentation(self):
         self.assertEqual(label_prop_stats["max"], np.max(areas))
         self.assertEqual(label_prop_stats["median"], np.median(areas))

-        mask_dist_by_attr = actual_stats["mask_distribution_in_attribute"]
+        mask_dist_by_attr = actual_stats["point_distribution_in_attribute"]
         attr_prop_stats = mask_dist_by_attr["label_0"]["a"]["1"]["area"]
         areas = [12, 4]
         self.assertEqual(attr_prop_stats["items_far_from_mean"], {})
@@ -968,7 +969,7 @@ def test_validate_annotations_segmentation(self):
         self.assertEqual(attr_prop_stats["max"], np.max(areas))
         self.assertEqual(attr_prop_stats["median"], np.median(areas))

-        mask_dist_item = actual_stats["mask_distribution_in_dataset_item"]
+        mask_dist_item = actual_stats["point_distribution_in_dataset_item"]
         self.assertEqual(sum(mask_dist_item.values()), 9)

         with self.subTest("Test of validation reports", i=1):
```
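
In short, `_check_undefined_label` now consumes the full statistics dictionary rather than a single `(label_name, label_stats)` pair, and the geometry statistics are keyed uniformly as `point_distribution_in_*` for both bounding boxes and masks. A small illustrative sketch of the new shape, assuming `validator` is an instantiated task validator such as `DetectionValidator`, standing in for the test fixture's `self.validator`:

```python
# Illustration of the reshaped stats these tests now exercise (not from the commit itself).
from datumaro.plugins.validators import DetectionValidator

validator = DetectionValidator()  # assumed stand-in for the test fixture's self.validator

stats = {
    "label_distribution": {
        "undefined_labels": {
            "cat0": {"items_with_undefined_label": [(1, "unittest")]},
        }
    }
}

# The checker now walks label_distribution -> undefined_labels itself
reports = validator._check_undefined_label(stats)  # expect one UndefinedLabel report

# Per-task keys such as "bbox_distribution_in_label" / "mask_distribution_in_label"
# are replaced by the shared "point_distribution_in_label" and its siblings.
```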
