
Commit 6f40f1b

vertex-sdk-bot authored and copybara-github committed
docs: Update the documentation for the tabular_dataset class
PiperOrigin-RevId: 580329801
1 parent 91e985a commit 6f40f1b

File tree

1 file changed (+115 −66 lines)


google/cloud/aiplatform/datasets/tabular_dataset.py

+115 −66
@@ -33,7 +33,49 @@
 
 
 class TabularDataset(datasets._ColumnNamesDataset):
-    """Managed tabular dataset resource for Vertex AI."""
+    """A managed tabular dataset resource for Vertex AI.
+
+    Use this class to work with tabular datasets. You can use a CSV file, BigQuery, or a pandas
+    [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)
+    to create a tabular dataset. For more information about paging through
+    BigQuery data, see [Read data with BigQuery API using
+    pagination](https://cloud.google.com/bigquery/docs/paging-results). For more
+    information about tabular data, see [Tabular
+    data](https://cloud.google.com/vertex-ai/docs/training-overview#tabular_data).
+
+    The following code shows you how to create and import a tabular
+    dataset with a CSV file.
+
+    ```py
+    my_dataset = aiplatform.TabularDataset.create(
+        display_name="my-dataset", gcs_source=['gs://path/to/my/dataset.csv'])
+    ```
+
+    The following code shows you how to create and import a tabular
+    dataset in two distinct steps.
+
+    ```py
+    my_dataset = aiplatform.TextDataset.create(
+        display_name="my-dataset")
+
+    my_dataset.import_data(
+        gcs_source=['gs://path/to/my/dataset.csv'],
+        import_schema_uri=aiplatform.schema.dataset.ioformat.text.multi_label_classification
+    )
+    ```
+
+    If you create a tabular dataset with a pandas
+    [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html),
+    you need to use a BigQuery table to stage the data for Vertex AI:
+
+    ```py
+    my_dataset = aiplatform.TabularDataset.create_from_dataframe(
+        df_source=my_pandas_dataframe,
+        staging_path=f"bq://{bq_dataset_id}.table-unique"
+    )
+    ```
+
+    """
 
     _supported_metadata_schema_uris: Optional[Tuple[str]] = (
         schema.dataset.metadata.tabular,
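
The updated class docstring lists BigQuery as a supported source but only shows CSV and `DataFrame` examples. The following sketch covers the BigQuery path using the `bq_source` argument documented in this diff; the project, table, and display names are placeholders, and `aiplatform.init` is assumed to run with valid credentials.

```py
from google.cloud import aiplatform

# Placeholder project and location; assumes application-default credentials.
aiplatform.init(project="my-project", location="us-central1")

# Create a tabular dataset directly from an existing BigQuery table.
my_dataset = aiplatform.TabularDataset.create(
    display_name="my-bq-dataset",
    bq_source="bq://my-project.my_dataset.my_table",
)
```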
@@ -54,66 +96,68 @@ def create(
         sync: bool = True,
         create_request_timeout: Optional[float] = None,
     ) -> "TabularDataset":
-        """Creates a new tabular dataset.
+        """Creates a tabular dataset.
 
         Args:
             display_name (str):
-                Optional. The user-defined name of the Dataset.
-                The name can be up to 128 characters long and can be consist
-                of any UTF-8 characters.
+                Optional. The user-defined name of the dataset. The name must
+                contain 128 or fewer UTF-8 characters.
             gcs_source (Union[str, Sequence[str]]):
-                Google Cloud Storage URI(-s) to the
-                input file(s).
-
-                Examples:
-                    str: "gs://bucket/file.csv"
-                    Sequence[str]: ["gs://bucket/file1.csv", "gs://bucket/file2.csv"]
+                The URI to one or more Google Cloud Storage buckets that contain
+                your datasets. For example, `str: "gs://bucket/file.csv"` or
+                `Sequence[str]: ["gs://bucket/file1.csv",
+                "gs://bucket/file2.csv"]`.
             bq_source (str):
-                BigQuery URI to the input table.
-                example:
-                    "bq://project.dataset.table_name"
+                The URI to a BigQuery table that's used as an input source. For
+                example, `bq://project.dataset.table_name`.
             project (str):
-                Project to upload this dataset to. Overrides project set in
-                aiplatform.init.
+                The name of the Google Cloud project to which this
+                `TabularDataset` is uploaded. This overrides the project that
+                was set by `aiplatform.init`.
             location (str):
-                Location to upload this dataset to. Overrides location set in
-                aiplatform.init.
+                The Google Cloud region where this dataset is uploaded. This
+                region overrides the region that was set by `aiplatform.init`.
             credentials (auth_credentials.Credentials):
-                Custom credentials to use to upload this dataset. Overrides
-                credentials set in aiplatform.init.
+                The credentials that are used to upload the `TabularDataset`.
+                These credentials override the credentials set by
+                `aiplatform.init`.
             request_metadata (Sequence[Tuple[str, str]]):
-                Strings which should be sent along with the request as metadata.
+                Strings that contain metadata that's sent with the request.
             labels (Dict[str, str]):
-                Optional. Labels with user-defined metadata to organize your Tensorboards.
-                Label keys and values can be no longer than 64 characters
-                (Unicode codepoints), can only contain lowercase letters, numeric
-                characters, underscores and dashes. International characters are allowed.
-                No more than 64 user labels can be associated with one Tensorboard
-                (System labels are excluded).
-                See https://goo.gl/xmQnxf for more information and examples of labels.
-                System reserved label keys are prefixed with "aiplatform.googleapis.com/"
-                and are immutable.
+                Optional. Labels with user-defined metadata to organize your
+                Vertex AI Tensorboards. The maximum length of a key and of a
+                value is 64 Unicode characters. Label keys and values can
+                contain only lowercase letters, numeric characters, underscores,
+                and dashes. International characters are allowed. No more than
+                64 user labels can be associated with one Tensorboard (system
+                labels are excluded). For more information and examples of using labels, see
+                [Using labels to organize Google Cloud Platform resources](https://goo.gl/xmQnxf).
+                System reserved label keys are prefixed with
+                `aiplatform.googleapis.com/` and are immutable.
             encryption_spec_key_name (Optional[str]):
                 Optional. The Cloud KMS resource identifier of the customer
-                managed encryption key used to protect the dataset. Has the
-                form:
-                ``projects/my-project/locations/my-region/keyRings/my-kr/cryptoKeys/my-key``.
+                managed encryption key that's used to protect the dataset. The
+                format of the key is
+                `projects/my-project/locations/my-region/keyRings/my-kr/cryptoKeys/my-key`.
                 The key needs to be in the same region as where the compute
                 resource is created.
 
-                If set, this Dataset and all sub-resources of this Dataset will be secured by this key.
+                If `encryption_spec_key_name` is set, this `TabularDataset` and
+                all of its sub-resources are secured by this key.
 
-                Overrides encryption_spec_key_name set in aiplatform.init.
+                This `encryption_spec_key_name` overrides the
+                `encryption_spec_key_name` set by `aiplatform.init`.
             sync (bool):
-                Whether to execute this method synchronously. If False, this method
-                will be executed in concurrent Future and any downstream object will
-                be immediately returned and synced when the Future has completed.
+                If `True`, the `create` method creates a tabular dataset
+                synchronously. If `False`, the `create` method creates a tabular
+                dataset asynchronously.
             create_request_timeout (float):
-                Optional. The timeout for the create request in seconds.
+                Optional. The number of seconds for the timeout of the create
+                request.
 
         Returns:
             tabular_dataset (TabularDataset):
-                Instantiated representation of the managed tabular dataset resource.
+                An instantiated representation of the managed `TabularDataset` resource.
         """
         if not display_name:
             display_name = cls._generate_display_name()
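
The `create` docstring above documents several optional arguments (`labels`, `encryption_spec_key_name`, `sync`, `create_request_timeout`) but no example combines them. A minimal sketch, assuming `aiplatform.init` has already been called; every resource name and the KMS key path are placeholders, and the trailing `wait()` call follows the SDK's future-based pattern for `sync=False`.

```py
from google.cloud import aiplatform

# Sketch of create() with the optional arguments documented above.
# All names and the KMS key path below are placeholders.
my_dataset = aiplatform.TabularDataset.create(
    display_name="my-dataset",
    gcs_source=["gs://my-bucket/data.csv"],
    labels={"team": "forecasting", "env": "dev"},
    encryption_spec_key_name=(
        "projects/my-project/locations/us-central1/"
        "keyRings/my-kr/cryptoKeys/my-key"
    ),
    sync=False,  # return immediately and create the dataset in the background
    create_request_timeout=600.0,  # timeout for the create request, in seconds
)

# Block until the asynchronous creation has finished before using the resource.
my_dataset.wait()
```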
@@ -162,44 +206,49 @@ def create_from_dataframe(
         location: Optional[str] = None,
         credentials: Optional[auth_credentials.Credentials] = None,
     ) -> "TabularDataset":
-        """Creates a new tabular dataset from a Pandas DataFrame.
+        """Creates a new tabular dataset from a pandas `DataFrame`.
 
         Args:
             df_source (pd.DataFrame):
-                Required. Pandas DataFrame containing the source data for
-                ingestion as a TabularDataset. This method will use the data
-                types from the provided DataFrame when creating the dataset.
+                Required. A pandas
+                [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)
+                containing the source data for ingestion as a `TabularDataset`.
+                This method uses the data types from the provided `DataFrame`
+                when the `TabularDataset` is created.
             staging_path (str):
-                Required. The BigQuery table to stage the data
-                for Vertex. Because Vertex maintains a reference to this source
-                to create the Vertex Dataset, this BigQuery table should
-                not be deleted. Example: `bq://my-project.my-dataset.my-table`.
-                If the provided BigQuery table doesn't exist, this method will
-                create the table. If the provided BigQuery table already exists,
+                Required. The BigQuery table used to stage the data for Vertex
+                AI. Because Vertex AI maintains a reference to this source to
+                create the `TabularDataset`, you shouldn't delete this BigQuery
+                table. For example: `bq://my-project.my-dataset.my-table`.
+                If the specified BigQuery table doesn't exist, then the table is
+                created for you. If the provided BigQuery table already exists,
                 and the schemas of the BigQuery table and your DataFrame match,
-                this method will append the data in your local DataFrame to the table.
-                The location of the provided BigQuery table should conform to the location requirements
-                specified here: https://cloud.google.com/vertex-ai/docs/general/locations#bq-locations.
+                then the data in your local `DataFrame` is appended to the table.
+                The location of the BigQuery table must conform to the
+                [BigQuery location requirements](https://cloud.google.com/vertex-ai/docs/general/locations#bq-locations).
             bq_schema (Optional[Union[str, bigquery.SchemaField]]):
-                Optional. If not set, BigQuery will autodetect the schema using your DataFrame's column types.
-                If set, BigQuery will use the schema you provide when creating the staging table. For more details,
-                see: https://cloud.google.com/python/docs/reference/bigquery/latest/google.cloud.bigquery.job.LoadJobConfig#google_cloud_bigquery_job_LoadJobConfig_schema
+                Optional. If not set, BigQuery autodetects the schema using the
+                column types of your `DataFrame`. If set, BigQuery uses the
+                schema you provide when the staging table is created. For more
+                information, see the BigQuery
+                [`LoadJobConfig.schema`](https://cloud.google.com/python/docs/reference/bigquery/latest/google.cloud.bigquery.job.LoadJobConfig#google_cloud_bigquery_job_LoadJobConfig_schema)
+                property.
             display_name (str):
-                Optional. The user-defined name of the Dataset.
-                The name can be up to 128 characters long and can be consist
-                of any UTF-8 characters.
+                Optional. The user-defined name of the `Dataset`. The name must
+                contain 128 or fewer UTF-8 characters.
             project (str):
-                Optional. Project to upload this dataset to. Overrides project set in
-                aiplatform.init.
+                Optional. The project to upload this dataset to. This overrides
+                the project set using `aiplatform.init`.
             location (str):
-                Optional. Location to upload this dataset to. Overrides location set in
-                aiplatform.init.
+                Optional. The location to upload this dataset to. This overrides
+                the location set using `aiplatform.init`.
             credentials (auth_credentials.Credentials):
-                Optional. Custom credentials to use to upload this dataset. Overrides
-                credentials set in aiplatform.init.
+                Optional. The custom credentials used to upload this dataset.
+                This overrides credentials set using `aiplatform.init`.
         Returns:
             tabular_dataset (TabularDataset):
-                Instantiated representation of the managed tabular dataset resource.
+                An instantiated representation of the managed `TabularDataset` resource.
         """
 
         if staging_path.startswith("bq://"):
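
The `create_from_dataframe` docstring explains `bq_schema`, but none of the examples in this change show it. A minimal sketch, assuming `google-cloud-bigquery` is installed and that a list of `SchemaField` objects is accepted in the same form as the linked `LoadJobConfig.schema` property (an assumption); all project, table, and column names are placeholders.

```py
import pandas as pd

from google.cloud import aiplatform, bigquery

# A small in-memory frame standing in for real training data.
df = pd.DataFrame(
    {
        "age": [34, 51, 27],
        "income": [72000.0, 89000.0, 51000.0],
        "churned": ["no", "yes", "no"],
    }
)

my_dataset = aiplatform.TabularDataset.create_from_dataframe(
    df_source=df,
    # Staging table; Vertex AI keeps a reference to it, so don't delete it.
    staging_path="bq://my-project.my_dataset.churn_staging",
    # Explicit schema instead of BigQuery autodetection (assumed to follow the
    # LoadJobConfig.schema convention linked in the docstring above).
    bq_schema=[
        bigquery.SchemaField("age", "INTEGER"),
        bigquery.SchemaField("income", "FLOAT"),
        bigquery.SchemaField("churned", "STRING"),
    ],
    display_name="churn-dataframe-dataset",
)
```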
