DataIku Certification Quiz
Overview:
Q1. What is generative vs discriminative learning?
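Soln: A generative model learns the joint distribution P(x, y) (typically as P(y) * P(x|y)) and obtains P(y|x) through Bayes' theorem; a discriminative model learns P(y|x), or the decision boundary, directly. Naive Bayes is generative whereas logistic regression is discriminative.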
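To make the contrast concrete, here is a minimal sketch on a toy dataset. The data, the function names, and the hyperparameters (`lr`, `steps`) are all illustrative assumptions: a count-based Bernoulli naive Bayes stands in for the generative side, and a hand-rolled logistic regression trained by plain gradient descent stands in for the discriminative side.

```python
import math

# Toy data (hypothetical): two binary features per example, binary label.
X = [(1, 1), (1, 0), (1, 1), (0, 0), (0, 1), (0, 0)]
y = [1, 1, 1, 0, 0, 0]

def fit_naive_bayes(X, y):
    """Generative: estimate P(y) and P(x_j = 1 | y) by counting (Laplace-smoothed)."""
    model = {}
    for c in (0, 1):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = len(rows) / len(X)
        theta = [(sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                 for j in range(len(X[0]))]
        model[c] = (prior, theta)
    return model

def nb_posterior(model, x):
    """P(y = 1 | x), derived from the joint P(x, y) via Bayes' theorem."""
    joint = {}
    for c, (prior, theta) in model.items():
        p = prior
        for xj, t in zip(x, theta):
            p *= t if xj else (1 - t)
        joint[c] = p
    return joint[1] / (joint[0] + joint[1])

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, steps=500):
    """Discriminative: fit P(y = 1 | x) = sigmoid(w . x + b) directly."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(steps):
        # Prediction error for every example under the current parameters.
        errors = [sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) - label
                  for x, label in zip(X, y)]
        # Gradient step on the average log-loss.
        w = [wj - lr * sum(e * x[j] for e, x in zip(errors, X)) / n
             for j, wj in enumerate(w)]
        b -= lr * sum(errors) / n
    return w, b

def lr_posterior(params, x):
    w, b = params
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

nb_model = fit_naive_bayes(X, y)
lr_model = fit_logistic(X, y)
for x in [(1, 0), (0, 0)]:
    print(x, round(nb_posterior(nb_model, x), 3), round(lr_posterior(lr_model, x), 3))
```

Both models end up producing an estimate of P(y = 1 | x), but only the naive Bayes model carries an explicit P(x | y) that could also be used to reason about how the features are distributed within each class.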
Q2. What is Bayes Theorem? Derive it.
Soln: Bayes' theorem is given as follows:
\begin{equation} P(A|B) = \frac{P (A) * P (B|A)} {P(B)} \end{equation}
This can be derived as follows:
\begin{equation} \tag{1} \text{P(A and B)} = P(A) * P(B|A) \text{ (product rule; holds whether or not the events are independent)} \end{equation}
\begin{equation} \tag{2} \text{P(B and A)} = P(B) * P(A|B) \end{equation}
\begin{equation} \text{P(A and B)} = \text{P(B and A)} \text{ (the intersection is commutative)} \end{equation}
\begin{equation} P(A) * P(B|A) = P(B) * P (A|B) \end{equation}
\begin{equation} \tag{3} \boxed{\therefore{ P(A|B)} = \frac{P(A) * P(B|A)}{P(B)}} \end{equation}
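As a quick numeric sanity check of the derivation, the sketch below computes P(A|B) twice from a small joint distribution: once directly from the definition of conditional probability, and once through the boxed formula. The joint probabilities are made-up values chosen only for illustration.

```python
import math

# Hypothetical joint distribution over two binary events A and B.
# Keys are (A occurred, B occurred); the values sum to 1.
joint = {
    (True, True): 0.2,
    (True, False): 0.1,
    (False, True): 0.3,
    (False, False): 0.4,
}

# Marginals: sum the joint over the other event.
p_a = sum(p for (a, _), p in joint.items() if a)
p_b = sum(p for (_, b), p in joint.items() if b)

# Conditionals straight from the definition P(X|Y) = P(X and Y) / P(Y).
p_b_given_a = joint[(True, True)] / p_a
p_a_given_b_direct = joint[(True, True)] / p_b

# The same quantity through Bayes' theorem.
p_a_given_b_bayes = p_a * p_b_given_a / p_b

print(p_a_given_b_direct, p_a_given_b_bayes)
```

The two values agree up to floating-point rounding, which is exactly what equations (1) through (3) state.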
Q1. What is the five-step process of ML model development?
- Problem Definition
- Metrics (Online metrics related to business and offline metrics related to model)
- High level architecture
- Offline Model Training
- Deep dive - embeddings
Q1. Fraud detection example in Airbnb
Soln:
- The label is created much later, possibly months after Airbnb has paid the money to the host.
- Since the label is outdated, the host can play a new trick next time and the model is always catching up.
- Requirements:
- What should the model do? Model inference (prediction) should happen before the money leaves Airbnb for the host.
The model is trained offline; the focus here is on model inference/prediction.
Batch Prediction:
- Generate model predictions periodically
- Predictions are stored somewhere, e.g. in a SQL table
- Retrieve them as needed
- Allows more complex models
- Some features are available only at check-in time (for example, the messages sent right before check-in, such as where the key is or how to do something)
- For Netflix recommendations, batch prediction makes sense because recommendations need to be shown without any latency when the app is used
- A third component, such as a constant unsupervised component, can run as batch processing.
- The model should output a risk score for downstream execution
- Extra verification
- Human agent review
- Identify fraud transaction trends
- Unseen fraud trends (the assumption is that a fraudster may make more than one transaction somewhere)
- Human agent review
- QPS (queries per second)? The number of transactions per day divided by the number of seconds per day (86,400).
- Relatively low QPS (< 1)
- Latency? Should it be sub-millisecond?
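The batch-prediction flow and the risk-score routing described above can be sketched as follows. This is only an illustration under stated assumptions: `toy_risk_score` is a hypothetical stand-in for the offline-trained model, the `risk_scores` table name and the two thresholds are invented for the example, and an in-memory SQLite database plays the role of the SQL table.

```python
import sqlite3

# Hypothetical thresholds; real values would be tuned against offline metrics.
VERIFY_THRESHOLD = 0.5
REVIEW_THRESHOLD = 0.8

def toy_risk_score(txn):
    """Stand-in for an offline-trained model: returns a score in [0, 1]."""
    score = 0.1
    if txn["amount"] > 1000:
        score += 0.5
    if txn["new_host"]:
        score += 0.3
    return min(score, 1.0)

def run_batch(conn, transactions):
    """Periodic job: score pending transactions and store the results."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS risk_scores "
        "(txn_id TEXT PRIMARY KEY, score REAL)"
    )
    rows = [(t["txn_id"], toy_risk_score(t)) for t in transactions]
    conn.executemany("INSERT OR REPLACE INTO risk_scores VALUES (?, ?)", rows)
    conn.commit()

def lookup_score(conn, txn_id):
    """Retrieve a precomputed score just before the payout is released."""
    row = conn.execute(
        "SELECT score FROM risk_scores WHERE txn_id = ?", (txn_id,)
    ).fetchone()
    return None if row is None else row[0]

def route(score):
    """Map a risk score to a downstream action."""
    if score is None:
        return "no_score"  # assumption: caller decides what to do here
    if score >= REVIEW_THRESHOLD:
        return "human_review"
    if score >= VERIFY_THRESHOLD:
        return "extra_verification"
    return "release_payout"

conn = sqlite3.connect(":memory:")
run_batch(conn, [
    {"txn_id": "t1", "amount": 200, "new_host": False},
    {"txn_id": "t2", "amount": 1500, "new_host": False},
    {"txn_id": "t3", "amount": 2500, "new_host": True},
])
for txn_id in ("t1", "t2", "t3", "t4"):
    print(txn_id, route(lookup_score(conn, txn_id)))
```

The lookup for `t4` returns no score, which mirrors the limitation noted earlier: a transaction (or a feature) that only appears at check-in time is invisible to a job that ran before it existed.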
Metrics:
- Three actors:
- Guests
- Hosts
- Credit card company
- Offline metrics
- Precision @ x%
- Recall @ x%
- Cluster-based
- Entropy reduction
- Silhouette coefficient
- Online Metrics:
- Monetary loss prevention ($, %) Time for detection d
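For the offline metrics, one common reading of "precision @ x%" and "recall @ x%" is to flag the top x% of transactions by risk score and measure precision and recall on that flagged set. The notes do not say which definition is intended, so the sketch below is an assumption, and the function name, scores, and labels are made up for illustration.

```python
def precision_recall_at_top(scores, labels, top_percent):
    """Precision and recall when the top `top_percent`% by score are flagged.

    scores: risk scores, higher means more suspicious.
    labels: 1 for confirmed fraud, 0 otherwise.
    top_percent: integer percentage of transactions to flag.
    """
    n = len(scores)
    # Integer ceiling of top_percent% of n, flagging at least one item.
    k = max(1, (top_percent * n + 99) // 100)
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = sum(label for _, label in ranked[:k])
    total_fraud = sum(labels)
    precision = true_positives / k
    recall = true_positives / total_fraud if total_fraud else 0.0
    return precision, recall

# Hypothetical scores and labels for ten transactions.
scores = [0.95, 0.90, 0.70, 0.40, 0.35, 0.30, 0.20, 0.15, 0.10, 0.05]
labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]

precision, recall = precision_recall_at_top(scores, labels, top_percent=20)
print(precision, recall)
```

Flagging the top 20% here selects two transactions, one of which is fraud, out of three fraud cases overall. Raising `top_percent` generally trades precision for recall, which is the tension between review workload and missed fraud.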