In the realm of machine learning, datasets serve as the foundation for building and training effective models. They provide the raw material that algorithms can analyze to extract patterns, make predictions, and generate insights.
Vertex AI, Google Cloud's unified machine learning platform, offers tools for managing and using datasets throughout the ML lifecycle. Whether you're a novice or an experienced ML practitioner, this guide covers the fundamentals of dataset creation and usage within Vertex AI.
Introduction to Vertex AI
Vertex AI simplifies the process of building, deploying, and managing ML models. It provides tools for the entire ML lifecycle, including data preparation, model training, deployment, and monitoring.
Features of Vertex AI
- Simplified ML Workflow: Vertex AI provides a streamlined and unified approach for building, training, and deploying ML models. It connects data sources, tools, and processes into a single platform, eliminating the need for users to manage multiple disparate tools and infrastructure components.
- Automated Data Preprocessing: Vertex AI automates many of the data preprocessing tasks that ML typically requires, such as data cleaning, normalization, and feature engineering. This reduces the manual effort needed to prepare data for model training and makes it easier to experiment with different models and datasets.
- Built-in ML Models: Vertex AI provides a collection of pre-trained ML models that can be used for a variety of tasks, including image classification, object detection, natural language processing, and anomaly detection. This makes it easy for users to get started with ML and to develop solutions quickly.
- Deployment and Monitoring: Vertex AI makes it easy to deploy ML models to production and to monitor their performance. It provides built-in tools for tracking model accuracy, latency, and other metrics and for identifying and addressing any issues that may arise.
Benefits of Using Vertex AI
- Increased Productivity: This platform can help users develop ML models more quickly and easily, which can lead to increased productivity and innovation.
- Improved Accuracy: Vertex AI's automated data preprocessing and built-in ML models can help to improve the accuracy of ML models.
- Reduced Costs: Vertex AI can help to reduce the cost of developing and deploying ML models by automating many of the tasks that are typically manual.
Who Should Use Vertex AI?
Vertex AI is a good choice for anyone who wants to build, deploy, and manage ML models. It is particularly well-suited for businesses that:
- Need to develop ML models quickly and easily.
- Need to improve the accuracy of their ML models.
- Want to reduce the cost of developing and deploying ML models.
What Are Datasets in Vertex AI?
Datasets are the building blocks of machine learning in Vertex AI. They serve as the repositories of data that ML algorithms utilize to learn, make predictions, and generate insights. Without high-quality datasets, it is impossible to train accurate and reliable ML models.
Vertex AI supports a wide range of data types, including:
- Tabular Datasets: Tabular datasets consist of rows and columns of structured data, typically imported from CSV files or BigQuery tables. They are commonly used for classification, regression, and recommendation tasks, such as analyzing customer transactions, predicting customer behavior, and personalizing product recommendations.
- Image Datasets: Image datasets contain images that can be used for tasks such as image classification, object detection, and image segmentation. Image datasets can be used to identify objects in images, analyze images for patterns and anomalies, and generate creative content such as images and videos.
- Text Datasets: Text datasets consist of written text that can be used for tasks like text classification, sentiment analysis, and natural language processing (NLP). Text datasets can be used to analyze customer reviews, understand customer sentiment, and generate text content such as summaries, reports, and creative writing.
- Video Datasets: Video datasets contain videos that can be used for tasks like video classification, object tracking, and action recognition. Video datasets can be used to analyze videos for events, identify objects in videos, and generate creative content such as videos and animated GIFs.
Creating Datasets in Vertex AI
Using a managed dataset, you can provide the source data used to train AutoML and custom models on Vertex AI. A managed dataset is required for AutoML but optional for custom training.
How to Create a Managed Dataset for AutoML Models
You can create managed datasets for AutoML models using the Google Cloud Console or the Vertex AI API. The process varies with your data type and model objective, so this example focuses on one case: preparing image training data for single-label classification.
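For image classification, the import step usually starts from an import file that pairs each image with its label. The sketch below builds such a file as CSV text, assuming the two-column layout (Cloud Storage path, then label) described in Google's documentation; the function name and the bucket paths are hypothetical, and the exact format is worth confirming against the current docs before you import.

```python
import csv
import io


def build_import_csv(rows):
    """Return CSV text with one "gs://path,label" line per image.

    `rows` is an iterable of (gcs_uri, label) pairs. The two-column layout
    is an assumption based on Google's import-file documentation.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for gcs_uri, label in rows:
        # Import files reference images stored in Cloud Storage.
        if not gcs_uri.startswith("gs://"):
            raise ValueError(f"expected a Cloud Storage URI, got: {gcs_uri}")
        writer.writerow([gcs_uri, label])
    return buffer.getvalue()


# Hypothetical bucket and file names, for illustration only.
import_csv_text = build_import_csv([
    ("gs://my-bucket/images/cat1.jpg", "cat"),
    ("gs://my-bucket/images/dog1.png", "dog"),
])
```

You would then upload the resulting text to a Cloud Storage bucket and point the dataset import at that file.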
Data Requirements
First, let's review the data requirements:
- Training Data: The following image formats are supported when training your model. After the Vertex AI API preprocesses the imported images, they serve as the data used to train a model. The maximum file size is 30MB:
  - JPEG
  - GIF
  - PNG
  - BMP
  - ICO
- Prediction Data: The following image formats are supported when requesting a prediction from (querying) your model. The maximum file size is 1.5MB:
  - JPEG
  - GIF
  - PNG
  - WEBP
  - BMP
  - TIFF
  - ICO
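Checking files against these limits before upload can save a failed import. The following is a minimal sketch, not an official validator: it judges format by file extension only, treats "MB" as decimal megabytes, and uses hypothetical names throughout (check_image, PREDICTION_EXTENSIONS). All three choices are assumptions.

```python
from pathlib import PurePosixPath

# Extensions for the prediction formats listed above (JPEG, GIF, PNG, WEBP,
# BMP, TIFF, ICO). Mapping formats to extensions is an assumption.
PREDICTION_EXTENSIONS = {
    ".jpeg", ".jpg", ".gif", ".png", ".webp", ".bmp", ".tiff", ".tif", ".ico",
}
TRAINING_MAX_BYTES = 30 * 10**6      # 30MB, read as decimal megabytes
PREDICTION_MAX_BYTES = 1_500_000     # 1.5MB, read as decimal megabytes


def check_image(name, size_bytes, allowed_extensions, max_bytes):
    """Return a list of problems found for one image (empty if none)."""
    problems = []
    suffix = PurePosixPath(name).suffix.lower()
    if suffix not in allowed_extensions:
        problems.append(f"unsupported format: {suffix or 'no extension'}")
    if size_bytes > max_bytes:
        problems.append(
            f"file too large: {size_bytes} bytes > {max_bytes} bytes"
        )
    return problems
```

For training data, pass the set of training formats and TRAINING_MAX_BYTES instead. A stricter check would inspect the file contents rather than trusting the extension.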
Optimizing Image Data for AutoML Models
To ensure the best performance of AutoML models, it's crucial to follow these best practices when preparing image data:
- Real-World Representation: AutoML models are specifically designed to analyze and classify images that depict real-world objects. Therefore, the training data should closely resemble the types of images the model will encounter during real-world usage. This means replicating factors such as image resolution, background variations, and object angles.
- Human-Labeling Feasibility: If a human can't accurately identify an object in an image within 1-2 seconds, it's unlikely that an AutoML model can be trained to do so.
- Image Quantity: Aim for about 1,000 training images per label. The minimum per label is 10. For models with multiple labels per image, consider increasing the image count per label.
- Class Balance: Maintain a balanced distribution of images among different labels. Ideally, the number of images for the most common label should not exceed 100 times the number of images for the least common label. Remove very low-frequency labels to maintain balance.
- None-of-the-Above Label: Consider including a "None-of-the-above" label and images that don't match any of your defined labels. This helps the model recognize and handle instances that fall outside the established categories.
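The quantity and balance guidelines above are easy to check mechanically once you have a list of labels. Here is one way to sketch that check; the function name and message wording are hypothetical, and the thresholds simply restate the numbers from the list (10 minimum, about 1,000 recommended, at most 100 times more images for the most common label).

```python
from collections import Counter

MIN_PER_LABEL = 10
RECOMMENDED_PER_LABEL = 1000
MAX_IMBALANCE_RATIO = 100


def review_label_counts(labels):
    """Return human-readable findings for a list of per-image labels."""
    counts = Counter(labels)
    findings = []
    for label, n in sorted(counts.items()):
        if n < MIN_PER_LABEL:
            findings.append(
                f"{label}: {n} images, below the minimum of {MIN_PER_LABEL}"
            )
        elif n < RECOMMENDED_PER_LABEL:
            findings.append(
                f"{label}: {n} images, below the recommended {RECOMMENDED_PER_LABEL}"
            )
    if counts:
        most, least = max(counts.values()), min(counts.values())
        # Flag datasets where the most common label dwarfs the rarest one.
        if most > MAX_IMBALANCE_RATIO * least:
            findings.append(
                f"imbalance: most common label has {most} images, "
                f"least common has {least}"
            )
    return findings
```

An empty result means the label counts meet all three guidelines. It says nothing about whether the images themselves are representative.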
Creating and optimizing images for a Vertex AI dataset is just one of many capabilities of this platform.
Using Datasets in Vertex AI
Managed datasets in Vertex AI provide a streamlined and centralized approach for managing and utilizing datasets throughout the machine learning lifecycle. They offer several advantages over traditional methods, including:
- Simplified Dataset Management: Managed datasets are stored and managed within Vertex AI, eliminating the need for separate data storage and management infrastructure. This simplifies dataset organization, access, and version control.
- Automated Data Preprocessing: Vertex AI provides built-in data preprocessing capabilities for managed datasets, automatically handling tasks like data cleaning, normalization, and feature engineering. This reduces the manual effort required to prepare data for model training.
- Integrated Model Training: Managed datasets can be seamlessly integrated into model training pipelines within Vertex AI, enabling users to directly utilize their datasets for training without additional setup or data preparation.
- Enhanced Data Quality and Governance: Vertex AI enforces data quality and governance policies for managed datasets, ensuring that data is consistent, reliable, and compliant with organizational standards.
To use managed datasets in Vertex AI, follow these steps:
- Create a Dataset: Navigate to the Datasets section in Vertex AI and click "Create Dataset." Select the appropriate data type (Tabular, Image, Text, or Video) and provide a name for the dataset.
- Choose a Storage Location: Specify the storage location for the dataset. You can import data from local files, Cloud Storage buckets, or BigQuery tables.
- Define Data Schema (Optional): For tabular datasets, define the schema of the data, including column names and data types. This helps Vertex AI understand the structure of the data and perform automated preprocessing.
- Upload or Import Data: Upload your data files or configure the import settings to retrieve data from the specified source.
- Preview and Validate Data: Review the dataset contents to ensure data quality and consistency. Vertex AI provides data profiling tools for examining data distribution, identifying missing values, and detecting outliers.
- Create Dataset Version: Once you're satisfied with the dataset, create a version to lock in the current state of the data. This allows you to track changes and revert to previous versions if necessary.
- Use Dataset for Model Training: Select the managed dataset when creating a training pipeline in Vertex AI. The dataset will be automatically prepared and utilized for training your machine learning model.
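The same flow can also be scripted rather than clicked through. Below is a rough sketch using the Vertex AI SDK for Python (the google-cloud-aiplatform package), covering only the create-and-import and training steps for single-label image classification. The wrapper function names, project ID, region, bucket path, and training budget are all placeholders or assumptions, and running it requires the SDK, credentials, and a billable project.

```python
def create_image_dataset(project, location, display_name, import_file_uri):
    """Create a managed image dataset and import data from an import file."""
    if not import_file_uri.startswith("gs://"):
        raise ValueError("import file must be a Cloud Storage URI (gs://...)")

    # Imported here so this module still loads without the SDK installed.
    from google.cloud import aiplatform

    aiplatform.init(project=project, location=location)
    return aiplatform.ImageDataset.create(
        display_name=display_name,
        gcs_source=import_file_uri,
        import_schema_uri=(
            aiplatform.schema.dataset.ioformat.image.single_label_classification
        ),
    )


def train_automl_classifier(dataset, display_name, budget_milli_node_hours=8000):
    """Train an AutoML single-label image classifier on a managed dataset."""
    from google.cloud import aiplatform

    job = aiplatform.AutoMLImageTrainingJob(
        display_name=display_name,
        prediction_type="classification",
        multi_label=False,
    )
    # The budget value is an illustrative assumption; training incurs cost.
    return job.run(
        dataset=dataset,
        model_display_name=display_name,
        budget_milli_node_hours=budget_milli_node_hours,
    )


# Example call with placeholder values (not executed here):
# dataset = create_image_dataset(
#     "my-project", "us-central1", "pets", "gs://my-bucket/import.csv"
# )
# model = train_automl_classifier(dataset, "pets-classifier")
```

Scripting the steps makes them repeatable, which helps when you rebuild a dataset after relabeling or adding images.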
Managed datasets in Vertex AI streamline the data management and preparation process, enabling users to focus on building and training effective machine learning models.
Find Google Assistance from Promevo
If your team uses Vertex AI to deploy ML models or you're considering integrating the platform into your system, Promevo is here to help. We are a certified Google partner providing end-to-end support for all things Google, from ChromeOS devices to Workspace subscriptions.
Whether you need help managing datasets in Vertex AI or are looking to start your Google journey, we're here to help. Our team helps you harness the capabilities of Google to reinvent the way you do business and accelerate your company's growth.
We are proud to be a 100% Google-focused partner helping you succeed. Contact us today to get started.
FAQs: Datasets in Vertex AI
What is a managed dataset in Vertex AI?
A managed dataset in Vertex AI is a centralized repository for your machine learning data. It provides a structured way to manage, prepare, and access your data for training and serving machine learning models. Managed datasets are designed to be easy to use, scalable, and secure. Key features include:
- Centralized Management: Managed datasets are stored in a central location within Vertex AI, providing a single source of truth for your machine-learning data.
- Data Preparation: Managed datasets support various data preparation tasks, including data cleansing, transformation, and encoding. This simplifies the process of preparing your data for training and ensures that your models are trained on high-quality, consistent data.
- Scalability: Managed datasets are designed to scale horizontally to accommodate large amounts of data.
How do I export a dataset from Vertex AI?
You can use the Google Cloud Console or the Vertex AI API to export a dataset. Google's Vertex AI documentation provides specific instructions for either option.
How do AI datasets work?
AI datasets are organized collections of data that represent specific aspects of the world or of a problem domain.
To illustrate how AI datasets work, consider the example of training a machine learning model to classify images of cats and dogs. The dataset for this task would consist of a collection of images of cats and dogs, each labeled with the corresponding animal category. The machine learning model would then analyze these images, identifying patterns and features that distinguish between cats and dogs.
During training, the model would be presented with labeled images, allowing it to continuously improve its ability to correctly classify new images.
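To make the idea concrete, here is a toy sketch of learning from labeled examples. It is not how Vertex AI or AutoML trains image models: it swaps in a simple nearest-centroid classifier, and the two numeric features per "image" (imagine ear pointiness and snout length) are invented for illustration.

```python
import math
from collections import defaultdict


def train(examples):
    """Average the feature vectors of each label into one centroid."""
    sums = {}
    counts = defaultdict(int)
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        for i, value in enumerate(features):
            sums[label][i] += value
        counts[label] += 1
    return {
        label: [total / counts[label] for total in totals]
        for label, totals in sums.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest to the given features."""
    return min(
        centroids,
        key=lambda label: math.dist(centroids[label], features),
    )


# Invented feature values standing in for labeled images.
labeled_examples = [
    ((0.9, 0.2), "cat"),
    ((0.8, 0.3), "cat"),
    ((0.3, 0.8), "dog"),
    ((0.2, 0.9), "dog"),
]
centroids = train(labeled_examples)
guess = predict(centroids, (0.85, 0.25))
```

Real image models learn their own features from pixels, but the loop is the same: labeled examples go in, and the model's predictions on new inputs come out.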