ML Feature Store

Feast

  • Open source feature store
  • Serves features in production
  • Prevents data leakage by building point-in-time correct training datasets from batch data
  • Automates the process of loading features into an online feature store and serving them
  • Supports feature store management without Kubernetes, Spark, or self-managed infrastructure
  • Feature stores require access to compute layers and to offline and online databases, and they need to interface directly with production systems
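
The leakage-prevention point can be made concrete with a small point-in-time join. The sketch below is plain Python and is not Feast's API; the `feature_log` data, the `value_as_of` helper, and the `driver_1` entity are hypothetical names chosen for illustration. Each training row only receives the feature value that was known at or before its label timestamp.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical batch data: (event time, value) pairs per entity, sorted by time.
feature_log = {
    "driver_1": [
        (datetime(2024, 1, 1), 0.80),
        (datetime(2024, 1, 5), 0.90),
        (datetime(2024, 1, 9), 0.40),
    ],
}


def value_as_of(entity_id, as_of):
    """Return the latest value recorded at or before `as_of`, or None."""
    rows = feature_log.get(entity_id, [])
    timestamps = [ts for ts, _ in rows]
    idx = bisect_right(timestamps, as_of)  # count of rows with ts <= as_of
    return rows[idx - 1][1] if idx else None


# Label events: (entity ID, label timestamp, label).
label_events = [
    ("driver_1", datetime(2024, 1, 6), 1),
    ("driver_1", datetime(2023, 12, 31), 0),
]

# Join each label with the feature value visible at label time, never later.
training_rows = [
    (entity_id, ts, value_as_of(entity_id, ts), label)
    for entity_id, ts, label in label_events
]
```

The first label (January 6) is paired with the January 5 value rather than the later January 9 value, and the second label predates all feature data, so it gets no value at all.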

Vertex AI Feature Store:

Challenges in ML Workflows:

  • Difficulty in sharing and reusing features across different ML workflows and projects.
  • Need to serve features reliably in production with low latency.
  • Addressing feature value skew between training and serving.

Main Capabilities:

  • Sharing features across the organization.
  • Reducing duplicated effort by removing the need to re-engineer features that already exist.
  • Providing a central repository for storing and serving features.
  • Offering search and filter capabilities based on feature name and entity name.
  • Offering a managed service for online serving, which reduces latency and management overhead.
  • Mitigating training-serving skew through several measures, including computing feature values only once and using point-in-time lookups to avoid data leakage.
  • Continuously monitoring feature value distributions to detect drift.
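
As an illustration of drift monitoring, the sketch below compares the distribution of a categorical feature at training time with its distribution at serving time. It is a minimal example, not the method Vertex AI uses: the L-infinity distance, the `DRIFT_THRESHOLD` value, and the sample data are all assumptions made for this example.

```python
from collections import Counter


def category_distribution(values):
    """Map each category to its relative frequency."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def linf_distance(baseline, current):
    """Largest absolute difference in frequency across all categories."""
    p = category_distribution(baseline)
    q = category_distribution(current)
    categories = set(p) | set(q)
    return max((abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories), default=0.0)


DRIFT_THRESHOLD = 0.3  # hypothetical alerting threshold

training_genres = ["drama", "drama", "comedy", "action"]
serving_genres = ["action", "action", "action", "drama"]

distance = linf_distance(training_genres, serving_genres)
drift_detected = distance > DRIFT_THRESHOLD
```

Here the share of "action" moves from 0.25 to 0.75, so the distance exceeds the threshold and the feature would be flagged.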

Data Model:

  • Utilizes a time series data model to store features.
  • Organizes resources hierarchically: Feature Store > Entity Type > Feature.
  • Feature Store: A container for storing and managing features, shared across teams.
  • Entity Type: A collection of related features; for example, movies and users can each be an entity type.
  • An entity type specifies a name, a description, and its features.
  • Entity IDs uniquely identify entities within a feature store and must be of type string.
  • Feature Value: Captured at a specific point in time, so a single entity can have multiple values for the same feature.
  • Each feature value is associated with the identifier tuple (Entity ID, Feature ID, Timestamp), which is used for retrieval during serving.
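
The hierarchy and the (Entity ID, Feature ID, Timestamp) identifier can be sketched with two small classes. This is an in-memory illustration only, not the Vertex AI SDK; the class names, method names, and the `movies` example data are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class EntityType:
    name: str
    description: str
    features: List[str]
    # (entity_id, feature_id) -> list of (timestamp, value) pairs
    values: dict = field(default_factory=dict)

    def write(self, entity_id, feature_id, timestamp, value):
        if not isinstance(entity_id, str):
            raise TypeError("entity IDs must be strings")
        if feature_id not in self.features:
            raise KeyError(f"unknown feature: {feature_id}")
        self.values.setdefault((entity_id, feature_id), []).append((timestamp, value))

    def read(self, entity_id, feature_id, timestamp):
        """Return the value stored for the exact identifier tuple, or None."""
        for ts, value in self.values.get((entity_id, feature_id), []):
            if ts == timestamp:
                return value
        return None


@dataclass
class FeatureStore:
    name: str
    entity_types: dict = field(default_factory=dict)

    def add_entity_type(self, entity_type):
        self.entity_types[entity_type.name] = entity_type
        return entity_type


store = FeatureStore("movie_prediction")
movies = store.add_entity_type(
    EntityType("movies", "Movie catalog entries", ["average_rating"])
)
# Two values for the same entity and feature, captured at different times.
movies.write("movie_01", "average_rating", datetime(2024, 1, 1), 4.1)
movies.write("movie_01", "average_rating", datetime(2024, 2, 1), 4.3)
```

Both values are kept side by side, which is what makes the model a time series: the timestamp is part of the key.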

Storage Types:

  • Two storage methods: Online storage and Offline storage.
  • Online storage retains only the feature values with the latest timestamps, for use in online serving.
  • Offline storage retains data until it reaches the retention limit or is deleted.
  • You can control offline storage costs by setting retention limits.
  • Online serving nodes are virtual machines for serving feature values online.
  • They offer low latency, scalability, and reliability.
  • The number of online serving nodes you need depends on the volume of online serving requests and ingestion jobs.
  • Online serving nodes can be configured with auto-scaling or with a fixed node count.
  • Auto-scaling comes with considerations of its own, including data re-balancing and error handling.
  • The number of online serving nodes can also be set to zero to prevent charges and data deletion.
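
The difference between the two storage types can be sketched as follows. This is a simplified in-memory model under stated assumptions, not how Vertex AI implements storage; `TwoTierStorage`, its methods, and the sample feature `clicks_7d` are hypothetical names.

```python
from datetime import datetime, timedelta


class TwoTierStorage:
    def __init__(self, retention):
        self.retention = retention
        self.offline = []  # every (entity_id, feature_id, timestamp, value) row
        self.online = {}   # (entity_id, feature_id) -> (timestamp, value)

    def ingest(self, entity_id, feature_id, timestamp, value):
        # Offline storage keeps the full history.
        self.offline.append((entity_id, feature_id, timestamp, value))
        # Online storage keeps only the value with the latest timestamp.
        key = (entity_id, feature_id)
        current = self.online.get(key)
        if current is None or timestamp >= current[0]:
            self.online[key] = (timestamp, value)

    def expire_offline(self, now):
        """Drop offline rows older than the retention limit."""
        cutoff = now - self.retention
        self.offline = [row for row in self.offline if row[2] >= cutoff]

    def serve_online(self, entity_id, feature_id):
        entry = self.online.get((entity_id, feature_id))
        return entry[1] if entry else None


storage = TwoTierStorage(retention=timedelta(days=30))
storage.ingest("user_7", "clicks_7d", datetime(2024, 1, 1), 12)
storage.ingest("user_7", "clicks_7d", datetime(2024, 3, 1), 20)
storage.ingest("user_7", "clicks_7d", datetime(2024, 2, 1), 15)  # late arrival
storage.expire_offline(now=datetime(2024, 3, 10))
```

The late-arriving February value is recorded offline but does not overwrite the newer March value online, and after expiry only rows within the 30-day retention window remain offline.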