ML Feature Stores
FEAST
- Open source feature store
- Serves features in production
- Prevents feature leakage by building training datasets from batch data
- Automates process of loading and serving features in an online feature store
- Supports feature store management without Kubernetes, Spark, or self-managed infrastructure.
- Feature stores require access to compute layers and to offline and online databases, and must interface directly with production systems.
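The leakage-prevention point above comes down to a point-in-time join: when building a training dataset, each label may only see feature values observed at or before the label's timestamp. A minimal pure-Python sketch of the idea (the names and data layout are illustrative, not Feast's actual API):

```python
from datetime import datetime

# Historical (offline) feature values: (entity_id, timestamp, value)
feature_log = [
    ("user_1", datetime(2024, 1, 1), 0.2),
    ("user_1", datetime(2024, 1, 5), 0.7),
    ("user_2", datetime(2024, 1, 3), 0.4),
]

def point_in_time_lookup(entity_id, as_of):
    """Return the latest feature value at or before `as_of` (None if absent).

    Restricting the lookup to values already known at label time is what
    prevents feature leakage into the training dataset.
    """
    candidates = [
        (ts, v) for eid, ts, v in feature_log
        if eid == entity_id and ts <= as_of
    ]
    return max(candidates)[1] if candidates else None

# A label observed on Jan 3 must not see user_1's Jan 5 value.
print(point_in_time_lookup("user_1", datetime(2024, 1, 3)))  # 0.2
```

Feast performs this join against batch sources at scale; the sketch only shows the correctness rule being enforced.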
Vertex AI Feature Store:
Challenges in ML Workflows:
- Difficulty in sharing and reusing features across different ML workflows and projects.
- Need to serve features reliably in production with low latency.
- Addressing feature value skew between training and serving.
Main Capabilities:
- Sharing features across the organization.
- Reducing duplicate effort by eliminating the need to re-engineer the same features for each project.
- Providing a central repository for storing and serving features.
- Offering search and filter capabilities based on feature name and entity name.
- Offering managed serving infrastructure that reduces latency and operational overhead.
- Mitigating training-serving skew by computing feature values once and serving the same values in both contexts, and by using point-in-time lookups to avoid data leakage.
- Continuous monitoring of data distribution to detect drift.
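One way to make the drift-monitoring bullet concrete is a distribution-distance check between training-time and serving-time feature distributions. The sketch below uses the Population Stability Index (PSI) over binned counts, a common choice; the binning, smoothing constant, and 0.2 threshold are illustrative conventions, not what Vertex AI uses internally:

```python
import math

def psi(expected_counts, actual_counts):
    """Population Stability Index between two binned distributions.

    A PSI above ~0.2 is a common rule of thumb for significant drift.
    """
    total_e = sum(expected_counts)
    total_a = sum(actual_counts)
    score = 0.0
    for e, a in zip(expected_counts, actual_counts):
        # Smooth empty bins to avoid division by zero and log(0).
        pe = max(e / total_e, 1e-6)
        pa = max(a / total_a, 1e-6)
        score += (pa - pe) * math.log(pa / pe)
    return score

baseline = [50, 30, 20]       # binned training-time distribution
identical = [500, 300, 200]   # same shape, larger sample
shifted = [20, 30, 50]        # serving-time distribution has drifted

print(psi(baseline, identical))  # ~0.0: no drift
print(psi(baseline, shifted))    # well above 0.2: flag drift
```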
Data Model:
- Utilizes a time series data model to store features.
- Organizes resources hierarchically: Feature Store > Entity Type > Feature.
- Feature Store: A container for storing and managing features, shared across teams.
- Entity Type: Groups related features for one kind of entity (e.g., users or movies).
- Specifies entity type name, description, and features.
- Entity IDs are used to identify entities uniquely within a feature store.
- Entity IDs must be of type string.
- Feature Value: Captured at specific time points, allowing multiple values for a feature for a single entity.
- Associated with an identifier (Entity ID, Feature ID, Timestamp) for retrieval during serving.
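The hierarchy and the (Entity ID, Feature ID, Timestamp) identifier above can be sketched as a nested mapping with timestamped value lists; the names and layout are illustrative, not the Vertex AI API:

```python
from datetime import datetime

# Feature Store > Entity Type > Entity ID > Feature ID > [(timestamp, value)]
featurestore = {
    "movies": {
        "movie_01": {                         # entity IDs are strings
            "average_rating": [
                (datetime(2024, 1, 1), 4.2),
                (datetime(2024, 2, 1), 4.5),  # multiple values over time
            ],
        },
    },
}

def read_feature(entity_type, entity_id, feature_id, at=None):
    """Return the value at the latest timestamp <= `at` (default: latest)."""
    series = featurestore[entity_type][entity_id][feature_id]
    if at is not None:
        series = [(ts, v) for ts, v in series if ts <= at]
    return max(series)[1]

print(read_feature("movies", "movie_01", "average_rating"))  # 4.5
```

The same three-part key supports both online reads (latest value) and historical, point-in-time reads.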
Storage Types:
- Two storage methods: Online storage and Offline storage.
- Online storage retains only the values with the latest timestamp, for low-latency online serving.
- Offline storage retains historical values until the retention limit is reached or the data is deleted.
- You can control offline storage costs by configuring the retention limit.
- Online serving nodes are virtual machines for serving feature values online.
- They offer low latency, scalability, and reliability.
- The number of online serving nodes depends on online serving requests and ingestion jobs.
- Options for configuring online serving nodes: Auto-scaling or allocating a fixed node count.
- Auto-scaling considerations include the time needed to re-balance data across nodes and handling errors during scaling events.
- Option to set the number of online serving nodes to zero, which avoids charges without deleting stored data.
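The online/offline split described above can be sketched as two stores fed by the same ingestion path: the online store keeps only the newest value per key, while the offline store appends every value and prunes rows past a retention limit. The key format and retention policy here are illustrative assumptions:

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=90)  # illustrative offline retention limit

online_store = {}    # key -> (timestamp, value): latest only, low latency
offline_store = []   # append-only history: (key, timestamp, value)

def ingest(key, ts, value):
    """Write to both stores, as a feature-store ingestion job would."""
    offline_store.append((key, ts, value))
    if key not in online_store or ts > online_store[key][0]:
        online_store[key] = (ts, value)  # keep only the newest timestamp

def prune_offline(now):
    """Drop offline rows older than the retention limit to control cost."""
    cutoff = now - RETENTION
    offline_store[:] = [row for row in offline_store if row[1] >= cutoff]

ingest("user_1/clicks", datetime(2024, 1, 1), 10)
ingest("user_1/clicks", datetime(2024, 4, 1), 25)
prune_offline(datetime(2024, 4, 2))
print(online_store["user_1/clicks"][1])  # 25: latest value served online
print(len(offline_store))                # 1: the January row aged out
```

Setting serving nodes to zero corresponds to shutting down the online path while the offline history remains intact.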