DataIku Certification Quiz

Overview:

Q1. What is generative vs discriminative learning?

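Soln: A generative model learns the joint distribution P(x, y), i.e. the class prior P(y) and the class-conditional P(x|y), and classifies through Bayes’ theorem. A discriminative model learns P(y|x), or the decision boundary, directly. Naive Bayes is generative, whereas logistic regression is discriminative.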
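
To make the distinction concrete, here is a minimal sketch on a toy dataset with one binary feature. The data and the function names (`generative_posterior`, `fit_logistic`) are made up for illustration. The generative route estimates P(y) and P(x|y) from counts and applies Bayes' rule; the discriminative route fits P(y|x) directly with a hand-rolled logistic regression.

```python
import math

# Toy data (hypothetical): pairs of (binary feature x, binary label y).
DATA = [(1, 1), (1, 1), (1, 0), (0, 0), (0, 0), (0, 1), (1, 1), (0, 0)]

def generative_posterior(data, x):
    """P(y=1 | x) via the joint: estimate P(y) and P(x | y), then normalize."""
    n = len(data)
    joint = {}
    for y in (0, 1):
        n_y = sum(1 for _, label in data if label == y)
        n_xy = sum(1 for feat, label in data if label == y and feat == x)
        prior = n_y / n                 # P(y)
        likelihood = n_xy / n_y         # P(x | y)
        joint[y] = prior * likelihood   # P(x, y)
    return joint[1] / (joint[0] + joint[1])

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(data, lr=0.5, steps=2000):
    """Fit P(y=1 | x) = sigmoid(w*x + b) directly by gradient descent on log loss."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in data:
            err = sigmoid(w * x + b) - y
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / len(data)
        b -= lr * grad_b / len(data)
    return w, b

p_generative = generative_posterior(DATA, x=1)
w, b = fit_logistic(DATA)
p_discriminative = sigmoid(w * 1 + b)
print(p_generative, p_discriminative)
```

On this tiny dataset both routes land on essentially the same P(y=1 | x=1). The difference is in what gets modelled along the way, not in the final number.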

Q2. What is Bayes Theorem? Derive it.

Soln: Bayes’ theorem is given as follows:

\begin{equation} P(A|B) = \frac{P (A) * P (B|A)} {P(B)} \end{equation}

This can be derived as follows:

\begin{equation} \tag{1} \text{P(A and B)} = P(A) * P(B|A) \quad \text{(general product rule; independence is not assumed)} \end{equation}

\begin{equation} \tag{2} \text{P(B and A)} = P(B) * P(A|B) \end{equation}

\begin{equation} \text{P(A and B)} = \text{P(B and A)} \quad \text{(the relationship is commutative)} \end{equation}

\begin{equation} P(A) * P(B|A) = P(B) * P (A|B) \end{equation}

\begin{equation} \tag{3} \boxed{\therefore{ P(A|B)} = \frac{P(A) * P(B|A)}{P(B)}} \end{equation}
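
A quick numeric check of equation (3), as a minimal sketch. The function name `bayes_posterior` and all the numbers are hypothetical. P(B) is expanded with the law of total probability, P(B) = P(B|A) P(A) + P(B|not A) P(not A), and the division assumes P(B) > 0.

```python
def bayes_posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Return P(A|B) = P(A) * P(B|A) / P(B), with P(B) from total probability."""
    evidence = p_b_given_a * prior_a + p_b_given_not_a * (1.0 - prior_a)  # P(B)
    if evidence <= 0.0:
        raise ValueError("P(B) must be positive to condition on B")
    return prior_a * p_b_given_a / evidence

# Hypothetical numbers: A = "transaction is fraud", B = "a rule fires".
# 1% of transactions are fraud; the rule fires on 90% of fraud and 5% of non-fraud.
posterior = bayes_posterior(prior_a=0.01, p_b_given_a=0.90, p_b_given_not_a=0.05)
print(round(posterior, 4))
```

Even with a rule that catches 90% of fraud, the posterior stays low because the prior P(A) is small.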

Q1. What is the 5-step process of ML model development?

  1. Problem Definition
  2. Metrics (online metrics tied to the business, offline metrics tied to the model)
  3. High level architecture
  4. Offline Model Training
  5. Deep dive - embeddings

Q1. Fraud detection example in Airbnb

Soln:

  • The label is created much later, possibly months after Airbnb has paid the money to the ‘host’.
  • Since the label is outdated, the host can play a new trick next time, and the model is always catching up.
  • Requirements:
    • What should the model do? Model inference (prediction) should happen before the money leaves Airbnb for the host.
    • The model is trained offline; the focus is on model inference/prediction.
    • Batch prediction:
      • Generate model predictions periodically
      • Store the predictions somewhere, e.g. in a SQL table
      • Retrieve them as needed
      • Allows more complex models
  • Some features are available only at check-in time (for example, the messages sent right before check-in, such as where the key is or how to do something).
  • In Netflix recommendations, batch prediction makes sense because we need to show recommendations without any latency when the app is used.
  • A third component, such as a constantly running unsupervised component, can run as batch processing.
  • The model should output a risk score for downstream execution
    • Extra verification
    • Human agent review
  • Identify fraud transaction trends
    • Unseen fraud trends (the assumption is that a fraudster may make more than one transaction somewhere)
    • Human agent review
  • QPS (queries per second)? The number of transactions per day divided by the number of seconds in a day (86,400).
    • Relatively low QPS (< 1)
  • Latency? Should it be sub-millisecond?
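
The batch-prediction flow in the requirements above (score periodically, store in a SQL table, retrieve as needed, then act on the risk score) can be sketched as follows. This is only an illustration using Python's built-in `sqlite3`. The table name `risk_scores`, the stand-in scoring rule in `score_transaction`, and the thresholds in `route` are all assumptions, not Airbnb's actual design.

```python
import sqlite3

def score_transaction(txn):
    # Hypothetical stand-in for a model trained offline; returns a risk score in [0, 1].
    return min(1.0, txn["amount"] / 10_000)

def run_batch(conn, transactions):
    """Periodic batch job: score each transaction and upsert it into a SQL table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS risk_scores (txn_id TEXT PRIMARY KEY, score REAL)"
    )
    rows = [(t["txn_id"], score_transaction(t)) for t in transactions]
    conn.executemany(
        "INSERT OR REPLACE INTO risk_scores (txn_id, score) VALUES (?, ?)", rows
    )
    conn.commit()

def get_score(conn, txn_id):
    """Look up a precomputed score before payout; None if not scored yet."""
    row = conn.execute(
        "SELECT score FROM risk_scores WHERE txn_id = ?", (txn_id,)
    ).fetchone()
    return row[0] if row else None

def route(score, verify_at=0.5, review_at=0.9):
    # Hypothetical thresholds mapping a risk score to a downstream action.
    if score is None:
        return "not_scored"
    if score >= review_at:
        return "human_agent_review"
    if score >= verify_at:
        return "extra_verification"
    return "release_payout"

conn = sqlite3.connect(":memory:")
run_batch(conn, [{"txn_id": "t1", "amount": 250}, {"txn_id": "t2", "amount": 9_500}])
print(route(get_score(conn, "t1")), route(get_score(conn, "t2")))
```

Because lookups are simple key reads, the scoring model itself can be arbitrarily heavy. The trade-off is that features which only appear at check-in time are missing from the stored score.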

Metrics:

  • Three actors:
    • Guests
    • Hosts
    • Credit card company
  • Offline metrics
    • Precision @ x%
    • Recall @ x%
    • Cluster-based
      • Entropy reduction
      • Silhouette coefficient
  • Online Metrics:
    • Monetary loss prevention ($, %)
    • Time for detection d