Create & Delete Datasets in Vertex AI with Ease

Google Cloud's Vertex AI makes it easier than ever for organizations to apply machine learning (ML) to their most complex business challenges. However, successfully leveraging the power of ML requires proper dataset management. Below, we will walk you through best practices for creating and deleting datasets in Vertex AI.

 

Understanding the Basics of Vertex AI

Vertex AI is Google's integrated machine learning platform, built to accelerate the development and deployment of AI across teams. It provides the necessary tools, services, and infrastructure to empower data scientists, ML engineers, and other AI practitioners to build custom models or leverage Google's state-of-the-art pre-trained models.

An Overview of Google’s Vertex AI

Key capabilities of Vertex AI include:

  • AutoML for no-code model development
  • Vertex AI Experiments, Vertex AI Vizier, and Neural Architecture Search (NAS) to accelerate advanced ML
  • Vertex Pipelines for MLOps and automation
  • Pre-trained APIs like Vision and Natural Language

Use cases span predictive analytics, speech/text/image analytics, search ranking, content recommendations, time series forecasting, data extraction, and more in industries from healthcare to manufacturing and the public sector.

Importance of Dataset Management in Vertex AI

To develop accurate ML models that provide business value, you need high-quality, representative datasets. Vertex AI includes specialized dataset resources designed specifically to support the ML workflow from flexible data ingestion through model deployment.

Properly structuring Vertex AI datasets is crucial since models can ultimately only be as good as the data they learn patterns from. Flawed or biased data leads to flawed model performance. That's why taking advantage of robust dataset capabilities lays the groundwork for success.

With Vertex AI datasets, you can:

  • Import image, tabular, text, or video data from Cloud Storage.
  • Organize data for different models into separate datasets.
  • Add labels if data is not already annotated.
  • Control access to sensitive data.
  • Export datasets as backup or to share with others.
  • Link datasets to model training jobs.
  • Monitor data usage across teams.
  • Delete unused datasets to reduce costs.


Step-by-Step Guide to Creating Datasets in Vertex AI

Now that you understand the pivotal role datasets play in Vertex AI, let’s walk through the process of creating them. While simple in principle, proper dataset configuration saves tremendous time downstream when executing ML training jobs and inference. This step-by-step guide will equip you to structure Vertex AI datasets like a pro.

Preparing Your Data for Vertex AI

The first step in creating a quality Vertex AI dataset is preparing your source training data. Start by uploading your data files to a regional Cloud Storage bucket in the same region where you will use Vertex AI. Supported data formats include CSV, JSON Lines, and TFRecord. Match your data schema to the ML task, whether that is classification, object detection, text entity extraction, or another objective.

For supervised learning models, ensure the data is already labeled or annotated with the categories the model will be trained to recognize. If the raw data doesn’t include labels, use the data labeling tools in the Vertex AI console to tag the images, text, or videos. Take time to review the data carefully, because the model, its predictions, and the resulting business value can only be as good as these real-world examples.
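As an illustration of the preparation step, labeled image data is commonly described to Vertex AI through an import file that lists one Cloud Storage URI and its label per row. The sketch below builds such a CSV with the Python standard library only. The bucket name, file names, labels, and the helper name build_import_csv are all hypothetical, and the exact import file layout for your data type and objective should be checked against the Vertex AI documentation.

```python
import csv
import io


def build_import_csv(examples):
    """Return CSV text with one 'gcs_uri,label' row per labeled example."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for gcs_uri, label in examples:
        # Catch unlabeled rows and local paths before they reach the import job.
        if not gcs_uri.startswith("gs://"):
            raise ValueError(f"expected a Cloud Storage URI, got: {gcs_uri}")
        if not label:
            raise ValueError(f"missing label for {gcs_uri}")
        writer.writerow([gcs_uri, label])
    return buffer.getvalue()


# Hypothetical bucket and labels, for illustration only.
examples = [
    ("gs://my-bucket/flowers/img001.jpg", "daisy"),
    ("gs://my-bucket/flowers/img002.jpg", "tulip"),
]
import_csv = build_import_csv(examples)
print(import_csv)
```

The resulting text would then be uploaded to the same bucket as the images and referenced as the import file when the dataset is created.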

Lastly, while not required, consider whether TFRecords could provide performance optimization over raw CSV/JSON formats. Once source data is configured, you’re ready to create the dataset in Vertex AI.

Using Google Cloud Console and Cloud Shell in Dataset Creation

Navigate to the Datasets page in the Vertex AI section of the Google Cloud Console. Click "Create dataset" and provide a name. Select the data type (tabular, image, and so on), then choose the ML objective, such as "Image classification".

Specify the Cloud Storage path of the training data files you want to import into the dataset, that is, the bucket and folder containing the source CSVs, images, or other files to be used for machine learning. If you do not plan to divide the data programmatically later, also configure how the source data is split into training, validation, and test sets.
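If you prefer to fix the split yourself rather than in the console, one option is to assign each example to a set before import. The sketch below is a minimal, assumption-laden example: it splits each label separately so every set keeps roughly the same label mix, and emits rows whose first column names the set. The function name assign_ml_use, the 80/10/10 ratios, the bucket path, and the set names "training", "validation", and "test" are illustrative; confirm the values your import file format accepts before relying on them.

```python
import random
from collections import Counter, defaultdict


def assign_ml_use(examples, train=0.8, validation=0.1, seed=42):
    """Return (ml_use, gcs_uri, label) rows, split label by label."""
    by_label = defaultdict(list)
    for gcs_uri, label in examples:
        by_label[label].append(gcs_uri)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rows = []
    for label in sorted(by_label):
        uris = sorted(by_label[label])
        rng.shuffle(uris)
        n_train = int(len(uris) * train)
        n_val = int(len(uris) * validation)
        for i, uri in enumerate(uris):
            if i < n_train:
                ml_use = "training"
            elif i < n_train + n_val:
                ml_use = "validation"
            else:
                ml_use = "test"  # whatever remains after the first two sets
            rows.append((ml_use, uri, label))
    return rows


# Hypothetical data: 10 images for each of two labels.
examples = [
    (f"gs://my-bucket/img/{label}_{i:02d}.jpg", label)
    for label in ("daisy", "tulip")
    for i in range(10)
]
rows = assign_ml_use(examples)
split_counts = Counter((label, ml_use) for ml_use, _, label in rows)
print(split_counts)
```

Splitting per label rather than over the whole pool is what keeps rare labels from landing entirely in one set.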

You can also create annotation sets to categorize the data. Annotation sets apply labels to datasets for model training and are tied to specific data types and objectives. For example, select an image dataset then create an annotation set for image classification.

Vertex AI automatically creates the first annotation set using imported labels. To add more in the console, choose the dataset, click "Create annotation set", name it, pick the objective, and click Create.

Alternatively, Cloud Shell provides a command-line environment. From there you can create the dataset programmatically, for example with the Vertex AI SDK for Python or the Vertex AI REST API, referencing the source data paths in Cloud Storage for the import.
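Dataset creation can also be scripted. The following is a minimal sketch, assuming the google-cloud-aiplatform Python package is installed, credentials are already configured, and the data is for single-label image classification. The project ID, region, display name, and import file path are placeholders, and dataset_create_args is a hypothetical helper added here only to validate inputs before any API call is made.

```python
def dataset_create_args(display_name, import_file_uri):
    """Validate inputs and return keyword arguments for dataset creation."""
    if not display_name.strip():
        raise ValueError("display_name must not be empty")
    if not import_file_uri.startswith("gs://"):
        raise ValueError("import file must be a Cloud Storage URI (gs://...)")
    return {"display_name": display_name, "gcs_source": import_file_uri}


def create_image_dataset(project_id, location, display_name, import_file_uri):
    """Create an image dataset and import data listed in the import file."""
    # Imported lazily so the validation helper works without the SDK installed.
    from google.cloud import aiplatform

    aiplatform.init(project=project_id, location=location)
    return aiplatform.ImageDataset.create(
        import_schema_uri=(
            aiplatform.schema.dataset.ioformat.image.single_label_classification
        ),
        **dataset_create_args(display_name, import_file_uri),
    )


# Placeholder values, for illustration only.
args = dataset_create_args("flowers", "gs://my-bucket/flowers/import.csv")
print(args)
```

Calling create_image_dataset("my-project", "us-central1", "flowers", "gs://my-bucket/flowers/import.csv") would then start the create and import operation; other data types have their own dataset classes and import schemas.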

Essential Tips on Effective Dataset Creation

When importing data source files to create Vertex AI datasets, how you configure datasets greatly impacts model training performance later. Follow these best practices for the greatest success:

  • Place each dataset in a separate Cloud Storage folder.
  • Have ~1000 quality examples per label for image/text data.
  • Stratify data across training, validation, and test sets.
  • Match datasets to intended model objective.
  • Grant team members the IAM permissions they need to access the data.
  • Create datasets in the same region as ML jobs.
  • Use datasets for both AutoML and custom jobs.

Properly structuring datasets saves time when training models later.

 

A Guide to Deleting Datasets in Vertex AI

Datasets fuel ML initiatives, but data needs evolve over time. Projects get discontinued, revised, or rebuilt from scratch.

This section covers recommended scenarios for deleting datasets, step-by-step instructions for both console and command line dataset removal, and considerations around data integrity to ensure stability.

Situations Requiring Dataset Deletion in Vertex AI

There are a few scenarios where deleting Vertex AI datasets makes sense:

  • Remove temporary project datasets not needed longer term.
  • Eliminate unused datasets wasting resources.
  • Streamline the datasets list by pruning obsolete ones.
  • Refresh a dataset by deleting and recreating from new data.
  • Retire datasets that are no longer accessed, archiving an export first if needed.

Deleting datasets that won't be reused helps optimize your environment.

The Process of Deleting Datasets Using Google Cloud Console & Cloud Shell

In the Cloud Console, navigate to the Datasets page and select the checkbox beside each dataset you want to remove. Click "Delete dataset" in the upper right and confirm the deletion.

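To delete a dataset from the command line, for example in Cloud Shell, send a DELETE request to the Vertex AI REST API, replacing LOCATION, PROJECT_ID, and DATASET_ID with your own values:

curl -X DELETE -H "Authorization: Bearer $(gcloud auth print-access-token)" "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasets/DATASET_ID"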

If you need the dataset again, create a new dataset and re-import the source data from Cloud Storage.
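Deletion can be scripted as well. The sketch below assumes the google-cloud-aiplatform package is installed and credentials are configured; it uses the lower-level DatasetServiceClient so that it works regardless of the dataset's data type. The project ID, region, and dataset ID shown are placeholders, and dataset_resource_name is a small helper introduced here for illustration.

```python
def dataset_resource_name(project_id, location, dataset_id):
    """Build the full resource name that identifies a Vertex AI dataset."""
    return f"projects/{project_id}/locations/{location}/datasets/{dataset_id}"


def delete_dataset(project_id, location, dataset_id):
    """Delete the managed dataset; source files in Cloud Storage are untouched."""
    # Imported lazily so the helper above works without the SDK installed.
    from google.cloud import aiplatform_v1

    client = aiplatform_v1.DatasetServiceClient(
        client_options={"api_endpoint": f"{location}-aiplatform.googleapis.com"}
    )
    operation = client.delete_dataset(
        name=dataset_resource_name(project_id, location, dataset_id)
    )
    operation.result()  # block until the long-running delete finishes


# Placeholder identifiers, for illustration only.
name = dataset_resource_name("my-project", "us-central1", "1234567890")
print(name)
```

The client is pointed at the regional endpoint because Vertex AI resources live in a specific region.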

Ensuring Data Quality & Integrity After Dataset Deletion

Importantly, deleting a Vertex AI dataset only removes the managed resource in that project, not the training data itself stored in Cloud Storage. So, source data remains intact for re-import later.

Before deletion, be sure to export datasets to Cloud Storage as a precaution. Never delete a dataset still linked to active models without considering implications. With the flexibility of datasets, you can confidently create, reuse, archive, and purge them as needs evolve.
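The export-before-delete precaution can be sketched as a single routine. This example assumes the google-cloud-aiplatform package, configured credentials, and an image dataset; the bucket name, the dataset-backups folder layout, and the function names are hypothetical choices made for illustration, not a prescribed convention.

```python
from datetime import datetime, timezone


def backup_prefix(bucket, dataset_id, when):
    """Return a timestamped Cloud Storage folder for one dataset export."""
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"gs://{bucket}/dataset-backups/{dataset_id}/{stamp}/"


def export_then_delete(project_id, location, dataset_id, bucket):
    """Export the dataset to Cloud Storage, then delete the managed resource."""
    # Imported lazily so backup_prefix works without the SDK installed.
    from google.cloud import aiplatform

    aiplatform.init(project=project_id, location=location)
    dataset = aiplatform.ImageDataset(dataset_id)
    # If the export raises, the delete below is never reached.
    exported_files = dataset.export_data(
        output_dir=backup_prefix(bucket, dataset_id, datetime.now(timezone.utc))
    )
    dataset.delete()
    return exported_files


# Fixed timestamp so the example output is reproducible.
prefix = backup_prefix(
    "my-backups", "1234567890", datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
)
print(prefix)
```

Ordering the calls this way means a failed export leaves the dataset in place, which is the safer failure mode.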

 

Make the Most of Vertex AI With Promevo

If you need help rolling out Vertex AI and optimizing dataset best practices specific to your ML initiatives, Promevo is here. As a Google-certified partner, we provide end-to-end support with all things Google Cloud, from infrastructure to analytics, migration to machine learning, and beyond.

Our team helps you harness the robust capabilities of Google's data and AI tools to reinvent the way you do business. We are proud to be a 100% Google-focused partner helping you succeed wherever you are in your Google Cloud or Vertex AI journey. Contact us today to get started.

 

FAQs: Creating & Deleting Datasets in Vertex AI

What is a Vertex AI dataset?

A Vertex AI dataset holds the structured or unstructured data used for training ML models in Google Cloud's Vertex AI platform. Datasets provide a centralized way to ingest, organize, manage access, and control versions of your ML data.

How do you create a dataset in Vertex AI?

To create a dataset in Vertex AI, upload source data files to Cloud Storage, then use the Vertex AI console or API to import that data. Define key parameters like the data type (text, image), ML task (classification), and Cloud Storage path of data files. Vertex AI handles the rest!

Why delete Vertex AI datasets?

It's best practice to delete Vertex AI datasets that are obsolete or unused. This eliminates wasted resources, reduces clutter, and simplifies dataset management as your needs evolve over time. The source data remains intact in Cloud Storage for future re-import if ever needed again.

 
