class TextDataset(datasets._Dataset):
- """Managed text dataset resource for Vertex AI."""
+ """A managed text dataset resource for Vertex AI.
+
+ Use this class to work with a managed text dataset. To create a managed
+ text dataset, you need a datasource file in CSV format and a schema file in
+ YAML format. A schema is optional for a custom model. The CSV file and the
+ schema are stored in Cloud Storage buckets.
+
+ Use text data for the following objectives:
+
+ * Classification. For more information, see
+ [Prepare text training data for classification](https://cloud.google.com/vertex-ai/docs/text-data/classification/prepare-data).
+ * Entity extraction. For more information, see
+ [Prepare text training data for entity extraction](https://cloud.google.com/vertex-ai/docs/text-data/entity-extraction/prepare-data).
+ * Sentiment analysis. For more information, see
+ [Prepare text training data for sentiment analysis](https://cloud.google.com/vertex-ai/docs/text-data/sentiment-analysis/prepare-data).
+
+ The following code shows you how to create and import a text dataset with
+ a CSV datasource file and a YAML schema file. The schema file you use
+ depends on whether your text dataset is used for single-label
+ classification, multi-label classification, entity extraction, or
+ sentiment analysis.
+
+ ```py
+ my_dataset = aiplatform.TextDataset.create(
+     display_name="my-text-dataset",
+     gcs_source=['gs://path/to/my/text-dataset.csv'],
+     import_schema_uri='gs://path/to/my/schema.yaml',
+ )
+ ```
+ """

_supported_metadata_schema_uris: Optional[Tuple[str]] = (
    schema.dataset.metadata.text,
@@ -49,91 +77,97 @@ def create(
    sync: bool = True,
    create_request_timeout: Optional[float] = None,
) -> "TextDataset":
- """Creates a new text dataset and optionally imports data into dataset
- when source and import_schema_uri are passed.
+ """Creates a new text dataset.
+
+ Optionally imports data into this dataset when a source and
+ `import_schema_uri` are passed in. The following is an example of how
+ this method is used:

- Example Usage:
-     ds = aiplatform.TextDataset.create(
-         display_name='my-dataset',
-         gcs_source='gs://my-bucket/dataset.csv',
-         import_schema_uri=aiplatform.schema.dataset.ioformat.text.multi_label_classification
-     )
+ ```py
+ ds = aiplatform.TextDataset.create(
+     display_name='my-dataset',
+     gcs_source='gs://my-bucket/dataset.csv',
+     import_schema_uri=aiplatform.schema.dataset.ioformat.text.multi_label_classification
+ )
+ ```

Args:
display_name (str):
- Optional. The user-defined name of the Dataset.
- The name can be up to 128 characters long and can be consist
- of any UTF-8 characters.
+ Optional. The user-defined name of the dataset. The name must
+ contain 128 or fewer UTF-8 characters.
gcs_source (Union[str, Sequence[str]]):
- Google Cloud Storage URI(-s) to the
- input file(s).
-
- Examples:
-     str: "gs://bucket/file.csv"
-     Sequence[str]: ["gs://bucket/file1.csv", "gs://bucket/file2.csv"]
+ Optional. The URI of one or more Google Cloud Storage files that
+ contain your data. For example, `str: "gs://bucket/file.csv"` or
+ `Sequence[str]: ["gs://bucket/file1.csv", "gs://bucket/file2.csv"]`.
import_schema_uri (str):
- Points to a YAML file stored on Google Cloud
- Storage describing the import format. Validation will be
- done against the schema. The schema is defined as an
- `OpenAPI 3.0.2 Schema
- Object <https://tinyurl.com/y538mdwt>`__.
+ Optional. A URI for a YAML file stored in Cloud Storage that
+ describes the import schema used to validate the
+ dataset. The schema is an
+ [OpenAPI 3.0.2 Schema](https://tinyurl.com/y538mdwt) object.
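+ Instead of a YAML file that you host yourself, you can pass one
+ of the prebuilt schema constants that the SDK exposes under
+ `aiplatform.schema.dataset.ioformat.text`, as in this minimal
+ sketch (the datasource path is a placeholder):
+
+ ```py
+ ds = aiplatform.TextDataset.create(
+     display_name='my-dataset',
+     gcs_source='gs://my-bucket/dataset.csv',
+     # Prebuilt schema for single-label text classification.
+     import_schema_uri=aiplatform.schema.dataset.ioformat.text.single_label_classification,
+ )
+ ```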
data_item_labels (Dict):
- Labels that will be applied to newly imported DataItems. If
- an identical DataItem as one being imported already exists
- in the Dataset, then these labels will be appended to these
- of the already existing one, and if labels with identical
- key is imported before, the old label value will be
- overwritten. If two DataItems are identical in the same
- import data operation, the labels will be combined and if
- key collision happens in this case, one of the values will
- be picked randomly. Two DataItems are considered identical
- if their content bytes are identical (e.g. image bytes or
- pdf bytes). These labels will be overridden by Annotation
- labels specified inside index file referenced by
- ``import_schema_uri``,
- e.g. jsonl file.
+ Optional. A dictionary of label information. Each dictionary
+ item contains a label key and a label value. Each item in the
+ dataset includes one dictionary of label information. If a data
+ item is added or merged into a dataset, and that data item
+ contains content that's identical to content that's already in
+ the dataset, then the data items are merged. If two labels with
+ the same key but different values collide during the merge, then
+ one of the label values is randomly chosen for the merged data
+ item. Data items are considered identical if their binary data
+ (bytes) match. If annotation labels are referenced in a schema
+ specified by the `import_schema_uri` parameter, then the labels
+ in the `data_item_labels` dictionary are overridden by the
+ annotations.
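+ As an illustration, the following sketch applies one hypothetical
+ label to every newly imported data item (the key and value are
+ placeholders):
+
+ ```py
+ ds = aiplatform.TextDataset.create(
+     display_name='my-dataset',
+     gcs_source='gs://my-bucket/dataset.csv',
+     import_schema_uri=aiplatform.schema.dataset.ioformat.text.single_label_classification,
+     # Every imported data item is tagged with this label.
+     data_item_labels={"source": "news-articles"},
+ )
+ ```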
project (str):
- Project to upload this dataset to. Overrides project set in
- aiplatform.init.
+ Optional. The name of the Google Cloud project to which this
+ `TextDataset` is uploaded. This overrides the project that
+ was set by `aiplatform.init`.
location (str):
- Location to upload this dataset to. Overrides location set in
- aiplatform.init.
+ Optional. The Google Cloud region where this dataset is uploaded. This
+ region overrides the region that was set by `aiplatform.init`.
credentials (auth_credentials.Credentials):
- Custom credentials to use to upload this dataset. Overrides
- credentials set in aiplatform.init.
+ Optional. The credentials that are used to upload the `TextDataset`.
+ These credentials override the credentials set by
+ `aiplatform.init`.
request_metadata (Sequence[Tuple[str, str]]):
- Strings which should be sent along with the request as metadata.
+ Optional. Strings that contain metadata that's sent with the request.
labels (Dict[str, str]):
- Optional. Labels with user-defined metadata to organize your Tensorboards.
- Label keys and values can be no longer than 64 characters
- (Unicode codepoints), can only contain lowercase letters, numeric
- characters, underscores and dashes. International characters are allowed.
- No more than 64 user labels can be associated with one Tensorboard
- (System labels are excluded).
- See https://goo.gl/xmQnxf for more information and examples of labels.
- System reserved label keys are prefixed with "aiplatform.googleapis.com/"
- and are immutable.
+ Optional. Labels with user-defined metadata to organize your
+ datasets. The maximum length of a key and of a value is 64
+ Unicode characters. Label keys and values can contain only
+ lowercase letters, numeric characters, underscores, and dashes.
+ International characters are allowed. No more than 64 user
+ labels can be associated with one dataset (system labels are
+ excluded). For more information and examples of using labels, see
+ [Using labels to organize Google Cloud Platform resources](https://goo.gl/xmQnxf).
+ System reserved label keys are prefixed with
+ `aiplatform.googleapis.com/` and are immutable.
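+ For example, a sketch that attaches hypothetical organizational
+ labels (keys and values are placeholders):
+
+ ```py
+ ds = aiplatform.TextDataset.create(
+     display_name='my-dataset',
+     gcs_source='gs://my-bucket/dataset.csv',
+     labels={"team": "research", "env": "dev"},
+ )
+ ```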
encryption_spec_key_name (Optional[str]):
Optional. The Cloud KMS resource identifier of the customer
- managed encryption key used to protect the dataset. Has the
- form:
- ``projects/my-project/locations/my-region/keyRings/my-kr/cryptoKeys/my-key``.
+ managed encryption key that's used to protect the dataset. The
+ format of the key is
+ `projects/my-project/locations/my-region/keyRings/my-kr/cryptoKeys/my-key`.
The key needs to be in the same region as where the compute
resource is created.

- If set, this Dataset and all sub-resources of this Dataset will be secured by this key.
+ If `encryption_spec_key_name` is set, this `TextDataset` and
+ all of its sub-resources are secured by this key.

- Overrides encryption_spec_key_name set in aiplatform.init.
- create_request_timeout (float):
-     Optional. The timeout for the create request in seconds.
+ This `encryption_spec_key_name` overrides the
+ `encryption_spec_key_name` set by `aiplatform.init`.
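+ For example, a sketch with a hypothetical customer-managed key
+ (every path segment is a placeholder):
+
+ ```py
+ ds = aiplatform.TextDataset.create(
+     display_name='my-dataset',
+     gcs_source='gs://my-bucket/dataset.csv',
+     encryption_spec_key_name=(
+         "projects/my-project/locations/us-central1/"
+         "keyRings/my-kr/cryptoKeys/my-key"
+     ),
+ )
+ ```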
sync (bool):
- Whether to execute this method synchronously. If False, this method
- will be executed in concurrent Future and any downstream object will
- be immediately returned and synced when the Future has completed.
+ If `True`, the `create` method creates a text dataset
+ synchronously. If `False`, the method returns immediately; the
+ dataset is created in a concurrent Future, and any downstream
+ object is synced when creation completes.
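+ A sketch of asynchronous creation; `wait()` is the resource's
+ blocking helper that returns once creation completes:
+
+ ```py
+ ds = aiplatform.TextDataset.create(
+     display_name='my-dataset',
+     gcs_source='gs://my-bucket/dataset.csv',
+     sync=False,
+ )
+ # Do other work here, then block until the dataset is ready.
+ ds.wait()
+ ```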
+ create_request_timeout (float):
+ Optional. The number of seconds for the timeout of the create
+ request.

Returns:
text_dataset (TextDataset):
- Instantiated representation of the managed text dataset resource.
+ An instantiated representation of the managed `TextDataset`
+ resource.
"""
if not display_name:
    display_name = cls._generate_display_name()