7 min read
How Predictive AI with Google Cloud Is Shaping the Future
Artificial intelligence (AI) is transforming businesses across every industry by enhancing efficiency, personalization, and automation. As a...
Google Cloud's Vertex AI makes it easier than ever for organizations to apply machine learning (ML) to their most complex business challenges. However, successfully leveraging the power of ML requires proper dataset management. Below, we will walk you through best practices for creating and deleting datasets in Vertex AI.
Vertex AI is Google's integrated machine learning platform, built to accelerate the development and deployment of AI across teams. It provides the necessary tools, services, and infrastructure to empower data scientists, ML engineers, and other AI practitioners to build custom models or leverage Google's state-of-the-art pre-trained models.
Key capabilities of Vertex AI include:
Use cases span predictive analytics, speech/text/image analytics, search ranking, content recommendations, time series forecasting, data extraction, and more in industries from healthcare to manufacturing and the public sector.
To develop accurate ML models that provide business value, you need high-quality, representative datasets. Vertex AI includes specialized dataset resources designed specifically to support the ML workflow — from flexible data ingestion through model deployment.
Properly structuring Vertex AI datasets is crucial since models can ultimately only be as good as the data they learn patterns from. Flawed or biased data leads to flawed model performance. That's why taking advantage of robust dataset capabilities lays the groundwork for success.
With Vertex AI datasets, you can:
Now that you understand the pivotal role datasets play in Vertex AI, let’s walk through the process of creating them. While simple in principle, proper dataset configuration saves tremendous time downstream when executing ML training jobs and inference. This step-by-step guide will equip you to structure Vertex AI datasets like a pro.
The first step in creating a quality Vertex AI dataset is preparing your source training data. Start by uploading your data files to a regional Cloud Storage bucket in the same region where you will use Vertex AI. Supported data formats include CSV, JSON Lines, and TFRecord. Match your data schema to the ML task, whether that is classification, object detection, text entity extraction, or another objective.
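As an illustration of what a prepared import file can look like, here is a minimal sketch that builds a CSV for single-label image classification, with one Cloud Storage URI and one label per line. The bucket name, file names, labels, and the helper `build_import_csv` are hypothetical, and the two-column layout is an assumption to check against the import file format documented for your data type and objective.

```python
import csv
import io


def build_import_csv(rows):
    """Return CSV text with one `gcs_uri,label` line per labeled image."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for uri, label in rows:
        # Import files reference data that already lives in Cloud Storage.
        if not uri.startswith("gs://"):
            raise ValueError(f"expected a Cloud Storage URI, got {uri!r}")
        writer.writerow([uri, label])
    return buf.getvalue()


# Hypothetical bucket and labels, for illustration only.
example_rows = [
    ("gs://example-bucket/images/cat_001.jpg", "cat"),
    ("gs://example-bucket/images/dog_001.jpg", "dog"),
]
import_csv_text = build_import_csv(example_rows)
print(import_csv_text)
```

The resulting text would then be uploaded to the same bucket and referenced when importing data into the dataset.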
For supervised learning models, ensure the data is already labeled or annotated with the categories the model will be trained to recognize. If the raw data doesn't include labels, use the data labeling tools in the Vertex AI console to tag the images, text, or videos accordingly. Take time to review the data carefully, because models, predictions, and business value will only be as good as these real-world examples.
Lastly, while not required, consider whether TFRecords could provide performance optimization over raw CSV/JSON formats. Once source data is configured, you’re ready to create the dataset in Vertex AI.
Navigate to the Datasets page in the Vertex AI section of the Google Cloud console. Click "Create dataset" and provide a name. Select the data type (tabular, image, and so on), then choose the ML objective, such as "Image classification".
Specify the Cloud Storage path of the training data files you want to import into the dataset, identifying the bucket and folder that contain your source CSVs, images, or other files. If you do not plan to divide the data programmatically later, configure how the source data is split into training, validation, and test sets.
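If you prefer to divide the data programmatically, one option is to assign each item to a split deterministically, so the same file always lands in the same set. The sketch below is one way to do that, not a Vertex AI feature: the helper `assign_ml_use`, the 80/10/10 proportions, and the lowercase split names are assumptions made for illustration.

```python
import hashlib


def assign_ml_use(key, train=0.8, validation=0.1):
    """Map a stable key (such as a file URI) to 'training', 'validation', or 'test'."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the hash as a fraction in [0, 1).
    fraction = int.from_bytes(digest[:8], "big") / 2**64
    if fraction < train:
        return "training"
    if fraction < train + validation:
        return "validation"
    return "test"


# Hypothetical URIs, for illustration only.
for uri in ("gs://example-bucket/images/cat_001.jpg",
            "gs://example-bucket/images/dog_001.jpg"):
    print(uri, assign_ml_use(uri))
```

Hashing the URI instead of drawing random numbers keeps the assignment reproducible across runs, which helps when a dataset is re-imported later.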
You can also create annotation sets to categorize the data. Annotation sets apply labels to datasets for model training and are tied to specific data types and objectives. For example, select an image dataset then create an annotation set for image classification.
Vertex AI automatically creates the first annotation set from the imported labels. To add more, open the dataset in the console, click "Create annotation set", name it, pick the objective, and click Create.
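Alternatively, you can create datasets programmatically, for example from Cloud Shell. The gcloud ai-platform command group belongs to the legacy AI Platform product and does not manage Vertex AI datasets, so use the Vertex AI SDK (such as the Python client library) or the REST API instead. Either way, you define the dataset parameters, reference the source data paths in Cloud Storage, and import.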
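For a programmatic route, here is a minimal sketch using the Vertex AI SDK for Python (the `google-cloud-aiplatform` package). It assumes an image dataset for single-label classification and an import file already in Cloud Storage. The wrapper `create_image_dataset` and every project, region, and bucket value are hypothetical, and the call requires authenticated Google Cloud credentials.

```python
def create_image_dataset(project, location, display_name, import_file_uri):
    """Create a managed image dataset and import single-label classification data."""
    if not import_file_uri.startswith("gs://"):
        raise ValueError("import_file_uri must be a Cloud Storage URI (gs://...)")

    # Imported lazily so this module loads even where the SDK is not installed.
    from google.cloud import aiplatform

    aiplatform.init(project=project, location=location)
    return aiplatform.ImageDataset.create(
        display_name=display_name,
        gcs_source=import_file_uri,
        import_schema_uri=aiplatform.schema.dataset.ioformat.image.single_label_classification,
    )


# Example call (hypothetical values; needs credentials and network access):
# dataset = create_image_dataset(
#     "my-project", "us-central1", "flowers", "gs://example-bucket/import.csv"
# )
# print(dataset.resource_name)
```

Other data types have their own classes in the same SDK (for example `TabularDataset` and `TextDataset`), each with its own import options.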
When importing data source files to create Vertex AI datasets, how you configure datasets greatly impacts model training performance later. Follow these best practices for the greatest success:
Properly structuring datasets saves time when training models later.
As you create datasets to fuel ML initiatives, your data needs will evolve over time. Projects get discontinued, revised, or rebuilt from scratch.
This section covers recommended scenarios for deleting datasets, step-by-step instructions for both console and command line dataset removal, and considerations around data integrity to ensure stability.
There are a few scenarios where deleting Vertex AI datasets makes sense:
Deleting datasets that won't be reused helps optimize your environment.
In the Google Cloud console, navigate to the Datasets page and select the checkbox beside each dataset you want to remove. Click "Delete dataset" in the upper right and confirm the deletion.
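To delete from the command line, call the Vertex AI REST API (the gcloud ai command group does not appear to include a datasets subcommand). Replace LOCATION, PROJECT_ID, and DATASET_ID with your own values:
curl -X DELETE -H "Authorization: Bearer $(gcloud auth print-access-token)" "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasets/DATASET_ID"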
Re-run a dataset import job if you need to recreate it.
Importantly, deleting a Vertex AI dataset only removes the managed resource in that project, not the training data itself stored in Cloud Storage. So, source data remains intact for re-import later.
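Removing the managed resource can also be scripted. The sketch below uses the dataset service client that ships with the Vertex AI SDK for Python. The helpers `dataset_resource_name` and `delete_dataset` are hypothetical names, the timeout is an arbitrary choice, and the call requires authenticated credentials. Note that it expects the numeric dataset ID, not the display name.

```python
def dataset_resource_name(project, location, dataset_id):
    """Build the full resource name that the Vertex AI API expects."""
    return f"projects/{project}/locations/{location}/datasets/{dataset_id}"


def delete_dataset(project, location, dataset_id, timeout=300):
    """Delete the managed dataset resource; files in Cloud Storage are untouched."""
    # Imported lazily so this module loads even where the SDK is not installed.
    from google.cloud import aiplatform

    client = aiplatform.gapic.DatasetServiceClient(
        client_options={"api_endpoint": f"{location}-aiplatform.googleapis.com"}
    )
    operation = client.delete_dataset(
        name=dataset_resource_name(project, location, dataset_id)
    )
    # Deletion is a long-running operation; block until it finishes.
    operation.result(timeout=timeout)


# Hypothetical values, for illustration only.
print(dataset_resource_name("my-project", "us-central1", "1234567890"))
```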
Before deletion, be sure to export datasets to Cloud Storage as a precaution. Never delete a dataset still linked to active models without considering implications. With the flexibility of datasets, you can confidently create, reuse, archive, and purge them as needs evolve.
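The export-first precaution can be enforced in code so that a dataset is removed only after its export succeeds. This sketch assumes `dataset` is an object from the Vertex AI SDK for Python, such as an `aiplatform.ImageDataset`, that provides `export_data` and `delete`. Export may not be available for every dataset type, and `export_then_delete` is a hypothetical helper.

```python
def export_then_delete(dataset, export_dir):
    """Export a dataset to Cloud Storage, then delete the managed resource."""
    if not export_dir.startswith("gs://"):
        raise ValueError("export_dir must be a Cloud Storage URI (gs://...)")

    exported_files = dataset.export_data(output_dir=export_dir)
    if not exported_files:
        # Keep the dataset if nothing was written, so no data is lost.
        raise RuntimeError("export produced no files; dataset was not deleted")

    dataset.delete()
    return list(exported_files)


# Example call (hypothetical values; needs credentials and network access):
# from google.cloud import aiplatform
# ds = aiplatform.ImageDataset("1234567890")
# print(export_then_delete(ds, "gs://example-bucket/exports/flowers"))
```

A check for models that still depend on the dataset is left out here and would need to be handled separately.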
If you need help rolling out Vertex AI and optimizing dataset best practices specific to your ML initiatives, Promevo is here. As a Google-certified partner, we provide end-to-end support with all things Google Cloud, from infrastructure to analytics, migration to machine learning, and beyond.
Our team helps you harness the robust capabilities of Google's data and AI tools to reinvent the way you do business. We are proud to be a 100% Google-focused partner helping you succeed wherever you are in your Google Cloud or Vertex AI journey. Contact us today to get started.
A Vertex AI dataset holds the structured or unstructured data used for training ML models in Google Cloud's Vertex AI platform. Datasets provide a centralized way to ingest, organize, manage access, and control versions of your ML data.
To create a dataset in Vertex AI, upload source data files to Cloud Storage, then use the Vertex AI console or API to import that data. Define key parameters like the data type (text, image), ML task (classification), and Cloud Storage path of data files. Vertex AI handles the rest!
It's best practice to delete Vertex AI datasets that are obsolete or unused. This eliminates wasted resources, reduces clutter, and simplifies dataset management as your needs evolve over time. The source data remains intact in Cloud Storage for future re-import if ever needed again.
Meet the Author
Promevo is a Google Premier Partner that offers comprehensive support and custom solutions across the entire Google ecosystem, including Google Cloud Platform, Google Workspace, ChromeOS, and everything in between. We also help users harness Google Workspace's robust capabilities through our proprietary gPanel® software.