ML Feature Stores
FEAST
- Open source feature store
- Serves features in production
- Prevents feature leakage by building training datasets from batch data
- Automates process of loading and serving features in an online feature store
- Supports feature store management without Kubernetes, Spark, or self-managed infrastructure.
- Feature stores require access to compute layers and to offline and online databases, and must interface directly with production systems.
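The leakage-prevention point above comes down to a point-in-time join: when building a training dataset, each label may only see feature values observed at or before the label's timestamp. A minimal pure-Python sketch of the idea (the names and data layout are illustrative, not Feast's actual API):

```python
from datetime import datetime

# Historical (offline) feature values: (entity_id, timestamp, value)
feature_log = [
    ("user_1", datetime(2024, 1, 1), 0.2),
    ("user_1", datetime(2024, 1, 5), 0.7),
    ("user_2", datetime(2024, 1, 3), 0.4),
]

def point_in_time_lookup(entity_id, as_of):
    """Return the latest feature value at or before `as_of` (None if absent).

    Restricting the lookup to values already known at label time is what
    prevents feature leakage into the training dataset.
    """
    candidates = [
        (ts, v) for eid, ts, v in feature_log
        if eid == entity_id and ts <= as_of
    ]
    return max(candidates)[1] if candidates else None

# A label observed on Jan 3 must not see user_1's Jan 5 value.
print(point_in_time_lookup("user_1", datetime(2024, 1, 3)))  # 0.2
```

Feast performs this join against batch sources at scale; the sketch only shows the correctness rule being enforced.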
Vertex AI Feature Store:
Challenges in ML Workflows:
- Difficulty in sharing and reusing features across different ML workflows and projects.
- Need to serve features reliably in production with low latency.
- Addressing feature value skew between training and serving.
Main Capabilities:
- Sharing features across the organization.
- Reducing duplicate effort by eliminating the need to re-engineer the same features for each project.
- Providing a central repository for storing and serving features.
- Offering search and filter capabilities based on feature name and entity name.
- Offering managed serving infrastructure that reduces latency and operational overhead.
- Mitigating training-serving skew by computing feature values once and serving the same values in both contexts, and by using point-in-time lookups to avoid data leakage.
- Continuous monitoring of data distribution to detect drift.
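One way to make the drift-monitoring bullet concrete is a distribution-distance check between training-time and serving-time feature distributions. The sketch below uses the Population Stability Index (PSI) over binned counts, a common choice; the binning, smoothing constant, and 0.2 threshold are illustrative conventions, not what Vertex AI uses internally:

```python
import math

def psi(expected_counts, actual_counts):
    """Population Stability Index between two binned distributions.

    A PSI above ~0.2 is a common rule of thumb for significant drift.
    """
    total_e = sum(expected_counts)
    total_a = sum(actual_counts)
    score = 0.0
    for e, a in zip(expected_counts, actual_counts):
        # Smooth empty bins to avoid division by zero and log(0).
        pe = max(e / total_e, 1e-6)
        pa = max(a / total_a, 1e-6)
        score += (pa - pe) * math.log(pa / pe)
    return score

baseline = [50, 30, 20]       # binned training-time distribution
identical = [500, 300, 200]   # same shape, larger sample
shifted = [20, 30, 50]        # serving-time distribution has drifted

print(psi(baseline, identical))  # ~0.0: no drift
print(psi(baseline, shifted))    # well above 0.2: flag drift
```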
Data Model:
- Utilizes a time series data model to store features.
- Organizes resources hierarchically: Feature Store > Entity Type > Feature.
- Feature Store: A container for storing and managing features, shared across teams.
- Entity Type: Groups related features for one kind of entity (e.g., users or movies).
- Specifies entity type name, description, and features.
- Entity IDs are used to identify entities uniquely within a feature store.
- Entity IDs must be of type string.
- Feature Value: Captured at specific time points, allowing multiple values for a feature for a single entity.
- Associated with an identifier (Entity ID, Feature ID, Timestamp) for retrieval during serving.
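The hierarchy and the (Entity ID, Feature ID, Timestamp) identifier above can be sketched as a nested mapping with timestamped value lists; the names and layout are illustrative, not the Vertex AI API:

```python
from datetime import datetime

# Feature Store > Entity Type > Entity ID > Feature ID > [(timestamp, value)]
featurestore = {
    "movies": {
        "movie_01": {                         # entity IDs are strings
            "average_rating": [
                (datetime(2024, 1, 1), 4.2),
                (datetime(2024, 2, 1), 4.5),  # multiple values over time
            ],
        },
    },
}

def read_feature(entity_type, entity_id, feature_id, at=None):
    """Return the value at the latest timestamp <= `at` (default: latest)."""
    series = featurestore[entity_type][entity_id][feature_id]
    if at is not None:
        series = [(ts, v) for ts, v in series if ts <= at]
    return max(series)[1]

print(read_feature("movies", "movie_01", "average_rating"))  # 4.5
```

The same three-part key supports both online reads (latest value) and historical, point-in-time reads.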
Storage Types:
- Two storage methods: Online storage and Offline storage.
- Online storage retains only the values with the latest timestamp, for low-latency online serving.
- Offline storage retains historical values until the retention limit is reached or the data is deleted.
- You can control offline storage costs by configuring the retention limit.
- Online serving nodes are virtual machines for serving feature values online.
- They offer low latency, scalability, and reliability.
- The number of online serving nodes depends on online serving requests and ingestion jobs.
- Options for configuring online serving nodes: Auto-scaling or allocating a fixed node count.
- Auto-scaling considerations include the time needed to re-balance data across nodes and handling errors during scaling events.
- Option to set the number of online serving nodes to zero, which avoids charges without deleting stored data.
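The online/offline split described above can be sketched as two stores fed by the same ingestion path: the online store keeps only the newest value per key, while the offline store appends every value and prunes rows past a retention limit. The key format and retention policy here are illustrative assumptions:

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=90)  # illustrative offline retention limit

online_store = {}    # key -> (timestamp, value): latest only, low latency
offline_store = []   # append-only history: (key, timestamp, value)

def ingest(key, ts, value):
    """Write to both stores, as a feature-store ingestion job would."""
    offline_store.append((key, ts, value))
    if key not in online_store or ts > online_store[key][0]:
        online_store[key] = (ts, value)  # keep only the newest timestamp

def prune_offline(now):
    """Drop offline rows older than the retention limit to control cost."""
    cutoff = now - RETENTION
    offline_store[:] = [row for row in offline_store if row[1] >= cutoff]

ingest("user_1/clicks", datetime(2024, 1, 1), 10)
ingest("user_1/clicks", datetime(2024, 4, 1), 25)
prune_offline(datetime(2024, 4, 2))
print(online_store["user_1/clicks"][1])  # 25: latest value served online
print(len(offline_store))                # 1: the January row aged out
```

Setting serving nodes to zero corresponds to shutting down the online path while the offline history remains intact.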