class TabularDataset(datasets._ColumnNamesDataset):
- """Managed tabular dataset resource for Vertex AI."""
+ """A managed tabular dataset resource for Vertex AI.
+
+ Use this class to work with tabular datasets. You can use a CSV file, BigQuery, or a pandas
+ [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)
+ to create a tabular dataset. For more information about paging through
+ BigQuery data, see [Read data with BigQuery API using
+ pagination](https://cloud.google.com/bigquery/docs/paging-results). For more
+ information about tabular data, see [Tabular
+ data](https://cloud.google.com/vertex-ai/docs/training-overview#tabular_data).
+
+ The following code shows you how to create and import a tabular
+ dataset with a CSV file.
+
+ ```py
+ my_dataset = aiplatform.TabularDataset.create(
+ display_name="my-dataset", gcs_source=['gs://path/to/my/dataset.csv'])
+ ```
+
+ The following code shows you how to create and import a tabular
+ dataset in two distinct steps.
+
+ ```py
+ my_dataset = aiplatform.TextDataset.create(
+ display_name="my-dataset")
+
+ my_dataset.import_data(
+ gcs_source=['gs://path/to/my/dataset.csv'],
+ import_schema_uri=aiplatform.schema.dataset.ioformat.text.multi_label_classification
+ )
+ ```
+
+ If you create a tabular dataset with a pandas
+ [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html),
+ you need to use a BigQuery table to stage the data for Vertex AI:
+
+ ```py
+ my_dataset = aiplatform.TabularDataset.create_from_dataframe(
+ df_source=my_pandas_dataframe,
+ staging_path=f"bq://{bq_dataset_id}.table-unique"
+ )
+ ```
+
+ """

_supported_metadata_schema_uris: Optional[Tuple[str]] = (
schema.dataset.metadata.tabular,
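
The class docstring above links to the BigQuery pagination guide for reading tabular data back out of BigQuery. As a rough, hypothetical illustration (not part of this diff), the following sketch pages through a staging table with the `google-cloud-bigquery` client; the table ID is a placeholder.

```py
from google.cloud import bigquery

client = bigquery.Client()

# Iterate over the staged table one page at a time instead of loading
# every row into memory at once. The table ID below is a placeholder.
rows = client.list_rows("my-project.my_bq_dataset.my_table", page_size=500)

for page in rows.pages:
    for row in page:
        print(dict(row))  # replace with your own row handling
```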
@@ -54,66 +96,68 @@ def create(
sync: bool = True,
create_request_timeout: Optional[float] = None,
) -> "TabularDataset":
- """Creates a new tabular dataset.
+ """Creates a tabular dataset.
Args:
display_name (str):
- Optional. The user-defined name of the Dataset.
- The name can be up to 128 characters long and can be consist
- of any UTF-8 characters.
+ Optional. The user-defined name of the dataset. The name must
+ contain 128 or fewer UTF-8 characters.
gcs_source (Union[str, Sequence[str]]):
- Google Cloud Storage URI(-s) to the
- input file(s).
-
- Examples:
- str: "gs://bucket/file.csv"
- Sequence[str]: ["gs://bucket/file1.csv", "gs://bucket/file2.csv"]
+ The URI to one or more Google Cloud Storage files that contain
+ your dataset. For example, `str: "gs://bucket/file.csv"` or
+ `Sequence[str]: ["gs://bucket/file1.csv",
+ "gs://bucket/file2.csv"]`.
bq_source (str):
- BigQuery URI to the input table.
- example:
- "bq://project.dataset.table_name"
+ The URI to a BigQuery table that's used as an input source. For
+ example, `bq://project.dataset.table_name`.
project (str):
- Project to upload this dataset to. Overrides project set in
- aiplatform.init.
+ The name of the Google Cloud project to which this
+ `TabularDataset` is uploaded. This overrides the project that
+ was set by `aiplatform.init`.
location (str):
- Location to upload this dataset to. Overrides location set in
- aiplatform.init.
+ The Google Cloud region where this dataset is uploaded. This
+ region overrides the region that was set by `aiplatform.init`.
120
credentials (auth_credentials.Credentials):
82
- Custom credentials to use to upload this dataset. Overrides
83
- credentials set in aiplatform.init.
121
+ The credentials that are used to upload the `TabularDataset`.
122
+ These credentials override the credentials set by
123
+ `aiplatform.init`.
84
124
request_metadata (Sequence[Tuple[str, str]]):
85
- Strings which should be sent along with the request as metadata .
125
+ Strings that contain metadata that's sent with the request.
labels (Dict[str, str]):
- Optional. Labels with user-defined metadata to organize your Tensorboards.
- Label keys and values can be no longer than 64 characters
- (Unicode codepoints), can only contain lowercase letters, numeric
- characters, underscores and dashes. International characters are allowed.
- No more than 64 user labels can be associated with one Tensorboard
- (System labels are excluded).
- See https://goo.gl/xmQnxf for more information and examples of labels.
- System reserved label keys are prefixed with "aiplatform.googleapis.com/"
- and are immutable.
+ Optional. Labels with user-defined metadata to organize your
+ datasets. The maximum length of a key and of a
+ value is 64 Unicode characters. Keys and values can contain only
+ lowercase letters, numeric characters, underscores, and dashes.
+ International characters are allowed. No more than 64 user
+ labels can be associated with one dataset (system labels are
+ excluded). For more information and examples of using labels, see
+ [Using labels to organize Google Cloud Platform resources](https://goo.gl/xmQnxf).
+ System reserved label keys are prefixed with
+ `aiplatform.googleapis.com/` and are immutable.
encryption_spec_key_name (Optional[str]):
Optional. The Cloud KMS resource identifier of the customer
- managed encryption key used to protect the dataset. Has the
- form:
- ``projects/my-project/locations/my-region/keyRings/my-kr/cryptoKeys/my-key``.
+ managed encryption key that's used to protect the dataset. The
+ format of the key is
+ `projects/my-project/locations/my-region/keyRings/my-kr/cryptoKeys/my-key`.
The key needs to be in the same region as where the compute
resource is created.
- If set, this Dataset and all sub-resources of this Dataset will be secured by this key.
+ If `encryption_spec_key_name` is set, this `TabularDataset` and
+ all of its sub-resources are secured by this key.
- Overrides encryption_spec_key_name set in aiplatform.init.
+ This `encryption_spec_key_name` overrides the
+ `encryption_spec_key_name` set by `aiplatform.init`.
sync (bool):
- Whether to execute this method synchronously. If False, this method
- will be executed in concurrent Future and any downstream object will
- be immediately returned and synced when the Future has completed.
+ If `true`, the `create` method creates a tabular dataset
+ synchronously. If `false`, the `create` method creates a tabular
+ dataset asynchronously.
create_request_timeout (float):
- Optional. The timeout for the create request in seconds.
+ Optional. The number of seconds for the timeout of the create
+ request.
Returns:
tabular_dataset (TabularDataset):
- Instantiated representation of the managed tabular dataset resource.
+ An instantiated representation of the managed `TabularDataset` resource.
"""
if not display_name:
display_name = cls._generate_display_name()
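
For reference, here is a minimal, hypothetical call to `create` that exercises the parameters documented in this hunk; the project, table, and label values are placeholders rather than values taken from the diff.

```py
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Create the dataset from a BigQuery table; `labels` and
# `create_request_timeout` are optional, as described above.
my_dataset = aiplatform.TabularDataset.create(
    display_name="my-bq-dataset",
    bq_source="bq://my-project.my_bq_dataset.my_table",
    labels={"team": "forecasting"},
    create_request_timeout=300.0,
)
```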
@@ -162,44 +206,49 @@ def create_from_dataframe(
location: Optional[str] = None,
credentials: Optional[auth_credentials.Credentials] = None,
) -> "TabularDataset":
- """Creates a new tabular dataset from a Pandas DataFrame.
+ """Creates a new tabular dataset from a pandas `DataFrame`.
Args:
168
212
df_source (pd.DataFrame):
169
- Required. Pandas DataFrame containing the source data for
170
- ingestion as a TabularDataset. This method will use the data
171
- types from the provided DataFrame when creating the dataset.
213
+ Required. A pandas
214
+ [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)
215
+ containing the source data for ingestion as a `TabularDataset`.
216
+ This method uses the data types from the provided `DataFrame`
217
+ when the `TabularDataset` is created.
staging_path (str):
- Required. The BigQuery table to stage the data
- for Vertex. Because Vertex maintains a reference to this source
- to create the Vertex Dataset, this BigQuery table should
- not be deleted. Example: `bq://my-project.my-dataset.my-table`.
- If the provided BigQuery table doesn't exist, this method will
- create the table. If the provided BigQuery table already exists,
+ Required. The BigQuery table used to stage the data for Vertex
+ AI. Because Vertex AI maintains a reference to this source to
+ create the `TabularDataset`, you shouldn't delete this BigQuery
+ table. For example: `bq://my-project.my-dataset.my-table`.
+ If the specified BigQuery table doesn't exist, then the table is
+ created for you. If the provided BigQuery table already exists,
and the schemas of the BigQuery table and your DataFrame match,
- this method will append the data in your local DataFrame to the table.
- The location of the provided BigQuery table should conform to the location requirements
- specified here: https://cloud.google.com/vertex-ai/docs/general/locations#bq-locations.
+ then the data in your local `DataFrame` is appended to the table.
+ The location of the BigQuery table must conform to the
+ [BigQuery location requirements](https://cloud.google.com/vertex-ai/docs/general/locations#bq-locations).
bq_schema (Optional[Union[str, bigquery.SchemaField]]):
- Optional. If not set, BigQuery will autodetect the schema using your DataFrame's column types.
- If set, BigQuery will use the schema you provide when creating the staging table. For more details,
- see: https://cloud.google.com/python/docs/reference/bigquery/latest/google.cloud.bigquery.job.LoadJobConfig#google_cloud_bigquery_job_LoadJobConfig_schema
+ Optional. If not set, BigQuery autodetects the schema using the
+ column types of your `DataFrame`. If set, BigQuery uses the
+ schema you provide when the staging table is created. For more
+ information, see the BigQuery
+ [`LoadJobConfig.schema`](https://cloud.google.com/python/docs/reference/bigquery/latest/google.cloud.bigquery.job.LoadJobConfig#google_cloud_bigquery_job_LoadJobConfig_schema)
+ property.
display_name (str):
- Optional. The user-defined name of the Dataset.
- The name can be up to 128 characters long and can be consist
- of any UTF-8 charact
+ Optional. The user-defined name of the `Dataset`. The name must
+ contain 128 or fewer UTF-8 characters.
project (str):
- Optional. Project to upload this dataset to. Overrides project set in
- aiplatform.init.
+ Optional. The project to upload this dataset to. This overrides
+ the project set using `aiplatform.init`.
location (str):
- Optional. Location to upload this dataset to. Overrides location set in
- aiplatform.init.
+ Optional. The location to upload this dataset to. This overrides
+ the location set using `aiplatform.init`.
credentials (auth_credentials.Credentials):
- Optional. Custom credentials to use to upload this dataset. Overrides
- credentials set in aiplatform.init.
+ Optional. The custom credentials used to upload this dataset.
+ This overrides credentials set using `aiplatform.init`.
Returns:
tabular_dataset (TabularDataset):
- Instantiated representation of the managed tabular dataset resource.
+ An instantiated representation of the managed `TabularDataset` resource.
"""
if staging_path.startswith("bq://"):
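
For reference, here is a minimal, hypothetical `create_from_dataframe` call that passes an explicit `bq_schema`. The project, dataset, table, and column names are placeholders, and the list-of-`SchemaField` form mirrors BigQuery's `LoadJobConfig.schema` property linked above; treat that format as an assumption, since the annotation in this hunk only names `str` and `bigquery.SchemaField`.

```py
import pandas as pd
from google.cloud import aiplatform, bigquery

# Placeholder source data; in practice this is your own DataFrame.
my_pandas_dataframe = pd.DataFrame(
    {"sepal_length": [5.1, 4.9], "species": ["setosa", "setosa"]}
)

my_dataset = aiplatform.TabularDataset.create_from_dataframe(
    df_source=my_pandas_dataframe,
    staging_path="bq://my-project.my_bq_dataset.iris_staging",  # staging table; don't delete it afterwards
    bq_schema=[
        bigquery.SchemaField("sepal_length", "FLOAT64"),
        bigquery.SchemaField("species", "STRING"),
    ],
    display_name="iris-from-dataframe",
)
```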