
Commit 583cb93

vertex-sdk-bot authored and copybara-github committed
docs: Update the documentation for the image_dataset class
PiperOrigin-RevId: 642377218
1 parent fe15b18 commit 583cb93

File tree

1 file changed: +98 −64

google/cloud/aiplatform/datasets/text_dataset.py

@@ -27,7 +27,35 @@


 class TextDataset(datasets._Dataset):
-    """Managed text dataset resource for Vertex AI."""
+    """A managed text dataset resource for Vertex AI.
+
+    Use this class to work with a managed text dataset. To create a managed
+    text dataset, you need a datasource file in CSV format and a schema file in
+    YAML format. A schema is optional for a custom model. The CSV file and the
+    schema are accessed in Cloud Storage buckets.
+
+    Use text data for the following objectives:
+
+    * Classification. For more information, see
+      [Prepare text training data for classification](https://cloud.google.com/vertex-ai/docs/text-data/classification/prepare-data).
+    * Entity extraction. For more information, see
+      [Prepare text training data for entity extraction](https://cloud.google.com/vertex-ai/docs/text-data/entity-extraction/prepare-data).
+    * Sentiment analysis. For more information, see
+      [Prepare text training data for sentiment analysis](https://cloud.google.com/vertex-ai/docs/text-data/sentiment-analysis/prepare-data).
+
+    The following code shows you how to create and import a text dataset with
+    a CSV datasource file and a YAML schema file. The schema file you use
+    depends on whether your text dataset is used for single-label
+    classification, multi-label classification, entity extraction, or sentiment analysis.
+
+    ```py
+    my_dataset = aiplatform.TextDataset.create(
+        display_name="my-text-dataset",
+        gcs_source=['gs://path/to/my/text-dataset.csv'],
+        import_schema_uri='gs://path/to/my/schema.yaml',
+    )
+    ```
+    """

     _supported_metadata_schema_uris: Optional[Tuple[str]] = (
         schema.dataset.metadata.text,
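
The docstring example above passes a hand-written YAML schema file by URI. As an aside, the SDK also ships prebuilt schema constants for each text objective; the following is a minimal sketch (the constant names come from the `aiplatform.schema` helper module, not from this commit, so treat them as an assumption to verify against your SDK version):

```py
from google.cloud import aiplatform

# Prebuilt import schemas for text objectives; each can be passed as
# `import_schema_uri` in place of a YAML file stored in Cloud Storage.
TEXT_SCHEMAS = {
    "single_label_classification": aiplatform.schema.dataset.ioformat.text.single_label_classification,
    "multi_label_classification": aiplatform.schema.dataset.ioformat.text.multi_label_classification,
    "extraction": aiplatform.schema.dataset.ioformat.text.extraction,
    "sentiment": aiplatform.schema.dataset.ioformat.text.sentiment,
}

# Example: a single-label classification dataset (the bucket path is a placeholder).
ds = aiplatform.TextDataset.create(
    display_name="my-text-dataset",
    gcs_source="gs://my-bucket/dataset.csv",
    import_schema_uri=TEXT_SCHEMAS["single_label_classification"],
)
```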
@@ -49,91 +77,97 @@ def create(
         sync: bool = True,
         create_request_timeout: Optional[float] = None,
     ) -> "TextDataset":
-        """Creates a new text dataset and optionally imports data into dataset
-        when source and import_schema_uri are passed.
+        """Creates a new text dataset.
+
+        Optionally imports data into this dataset when a source and
+        `import_schema_uri` are passed in. The following is an example of how
+        this method is used:

-        Example Usage:
-            ds = aiplatform.TextDataset.create(
-                display_name='my-dataset',
-                gcs_source='gs://my-bucket/dataset.csv',
-                import_schema_uri=aiplatform.schema.dataset.ioformat.text.multi_label_classification
-            )
+        ```py
+        ds = aiplatform.TextDataset.create(
+            display_name='my-dataset',
+            gcs_source='gs://my-bucket/dataset.csv',
+            import_schema_uri=aiplatform.schema.dataset.ioformat.text.multi_label_classification,
+        )
+        ```

         Args:
             display_name (str):
-                Optional. The user-defined name of the Dataset.
-                The name can be up to 128 characters long and can be consist
-                of any UTF-8 characters.
+                Optional. The user-defined name of the dataset. The name must
+                contain 128 or fewer UTF-8 characters.
             gcs_source (Union[str, Sequence[str]]):
-                Google Cloud Storage URI(-s) to the
-                input file(s).
-
-                Examples:
-                    str: "gs://bucket/file.csv"
-                    Sequence[str]: ["gs://bucket/file1.csv", "gs://bucket/file2.csv"]
+                Optional. The URI to one or more Google Cloud Storage buckets
+                that contain your datasets. For example, `str:
+                "gs://bucket/file.csv"` or `Sequence[str]:
+                ["gs://bucket/file1.csv", "gs://bucket/file2.csv"]`.
             import_schema_uri (str):
-                Points to a YAML file stored on Google Cloud
-                Storage describing the import format. Validation will be
-                done against the schema. The schema is defined as an
-                `OpenAPI 3.0.2 Schema
-                Object <https://tinyurl.com/y538mdwt>`__.
+                Optional. A URI for a YAML file stored in Cloud Storage that
+                describes the import schema used to validate the
+                dataset. The schema is an
+                [OpenAPI 3.0.2 Schema](https://tinyurl.com/y538mdwt) object.
             data_item_labels (Dict):
-                Labels that will be applied to newly imported DataItems. If
-                an identical DataItem as one being imported already exists
-                in the Dataset, then these labels will be appended to these
-                of the already existing one, and if labels with identical
-                key is imported before, the old label value will be
-                overwritten. If two DataItems are identical in the same
-                import data operation, the labels will be combined and if
-                key collision happens in this case, one of the values will
-                be picked randomly. Two DataItems are considered identical
-                if their content bytes are identical (e.g. image bytes or
-                pdf bytes). These labels will be overridden by Annotation
-                labels specified inside index file referenced by
-                ``import_schema_uri``,
-                e.g. jsonl file.
+                Optional. A dictionary of label information. Each dictionary
+                item contains a label and a label key. Each item in the dataset
+                includes one dictionary of label information. If a data item is
+                added or merged into a dataset, and that data item contains
+                content that's identical to content that's already in the
+                dataset, then the data items are merged. If two identical labels
+                are detected during the merge, each with a different label key,
+                then one of the label and label key dictionary items is randomly
+                chosen to be included in the merged data item. Data items are
+                compared using their binary data (bytes), not their content.
+                If annotation labels are referenced in a schema specified by the
+                `import_schema_uri` parameter, then the labels in the
+                `data_item_labels` dictionary are overridden by the annotations.
             project (str):
-                Project to upload this dataset to. Overrides project set in
-                aiplatform.init.
+                Optional. The name of the Google Cloud project to which this
+                `TextDataset` is uploaded. This overrides the project that
+                was set by `aiplatform.init`.
             location (str):
-                Location to upload this dataset to. Overrides location set in
-                aiplatform.init.
+                Optional. The Google Cloud region where this dataset is uploaded. This
+                region overrides the region that was set by `aiplatform.init`.
             credentials (auth_credentials.Credentials):
-                Custom credentials to use to upload this dataset. Overrides
-                credentials set in aiplatform.init.
+                Optional. The credentials that are used to upload the `TextDataset`.
+                These credentials override the credentials set by
+                `aiplatform.init`.
             request_metadata (Sequence[Tuple[str, str]]):
-                Strings which should be sent along with the request as metadata.
+                Optional. Strings that contain metadata that's sent with the request.
             labels (Dict[str, str]):
-                Optional. Labels with user-defined metadata to organize your Tensorboards.
-                Label keys and values can be no longer than 64 characters
-                (Unicode codepoints), can only contain lowercase letters, numeric
-                characters, underscores and dashes. International characters are allowed.
-                No more than 64 user labels can be associated with one Tensorboard
-                (System labels are excluded).
-                See https://goo.gl/xmQnxf for more information and examples of labels.
-                System reserved label keys are prefixed with "aiplatform.googleapis.com/"
-                and are immutable.
+                Optional. Labels with user-defined metadata to organize your
+                Vertex AI Tensorboards. The maximum length of a key and of a
+                value is 64 Unicode characters. Labels and keys can contain only
+                lowercase letters, numeric characters, underscores, and dashes.
+                International characters are allowed. No more than 64 user
+                labels can be associated with one Tensorboard (system labels are
+                excluded). For more information and examples of using labels, see
+                [Using labels to organize Google Cloud Platform resources](https://goo.gl/xmQnxf).
+                System reserved label keys are prefixed with
+                `aiplatform.googleapis.com/` and are immutable.
             encryption_spec_key_name (Optional[str]):
                 Optional. The Cloud KMS resource identifier of the customer
-                managed encryption key used to protect the dataset. Has the
-                form:
-                ``projects/my-project/locations/my-region/keyRings/my-kr/cryptoKeys/my-key``.
+                managed encryption key that's used to protect the dataset. The
+                format of the key is
+                `projects/my-project/locations/my-region/keyRings/my-kr/cryptoKeys/my-key`.
                 The key needs to be in the same region as where the compute
                 resource is created.

-                If set, this Dataset and all sub-resources of this Dataset will be secured by this key.
+                If `encryption_spec_key_name` is set, this `TextDataset` and
+                all of its sub-resources are secured by this key.

-                Overrides encryption_spec_key_name set in aiplatform.init.
-            create_request_timeout (float):
-                Optional. The timeout for the create request in seconds.
+                This `encryption_spec_key_name` overrides the
+                `encryption_spec_key_name` set by `aiplatform.init`.
             sync (bool):
-                Whether to execute this method synchronously. If False, this method
-                will be executed in concurrent Future and any downstream object will
-                be immediately returned and synced when the Future has completed.
+                If `True`, the `create` method creates a text dataset
+                synchronously. If `False`, the `create` method creates a text
+                dataset asynchronously.
+            create_request_timeout (float):
+                Optional. The number of seconds for the timeout of the create
+                request.

         Returns:
             text_dataset (TextDataset):
-                Instantiated representation of the managed text dataset resource.
+                An instantiated representation of the managed `TextDataset`
+                resource.
         """
         if not display_name:
             display_name = cls._generate_display_name()
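
For reference, a fuller call that exercises the optional parameters documented above might look like the following sketch. The project, region, bucket paths, and label values are placeholders rather than values from this commit:

```py
from google.cloud import aiplatform

# Placeholder project and region; override per-call values are also possible.
aiplatform.init(project="my-project", location="us-central1")

ds = aiplatform.TextDataset.create(
    display_name="my-text-dataset",
    gcs_source=["gs://my-bucket/file1.csv", "gs://my-bucket/file2.csv"],
    import_schema_uri=aiplatform.schema.dataset.ioformat.text.single_label_classification,
    data_item_labels={"source": "crawl-2024"},  # applied to newly imported data items
    labels={"team": "nlp"},  # resource labels: lowercase keys and values
    sync=False,  # return immediately; the resource syncs when the Future completes
    create_request_timeout=600.0,  # seconds
)

ds.wait()  # block until the asynchronous creation finishes
print(ds.resource_name)
```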
