Dataiku Core Designer Certification Prep

Terms and Definitions:

Data catalog




## What is the Data Catalog?

  • The Data Catalog is the recommended way to search for datasets
  • no
  • self.sound = “meow”
  • wuff

## How do you configure the homepage?

  • Visit profile & settings and select Design homepage layout
  • Homepage cannot be reconfigured

## Where can you find project artifacts, the searchable catalog, and environment-level administration and security settings in Dataiku?

  • App Menu
  • Home Page
  • Options

## What are some examples of Dataiku items?

  • Datasets, recipes, projects
  • Datasets, features, workspace
  • Recipes, formulae, projects

## What are some of the ways to search for datasets in Dataiku?

  • Select Data Catalog from the App menu
  • Global search
  • Search DSS Items from the App menu

## What are some of the collaboration options in Dataiku?

  • Wiki pages
  • Discussions within a project
  • To-do lists
  • Tags to group and search for similar projects
  • Dashboards that share project status with read-only users

## Summarize the concept of a dataset in Dataiku.

  • A dataset is any piece of data in tabular format. Dataiku decouples the data processing logic from the underlying storage infrastructure of the dataset, and so provides a uniform approach to processing any data.
  • Creating a dataset means the user tells Dataiku where the data is located; the data is not copied into Dataiku.

## Who can create connections and control settings such as credentials, security settings, naming rules, and usage parameters in Dataiku?

  • Only an admin can manage connections
  • Anyone who has project permissions can create a connection

## What are the data types in Dataiku?

  • Dataiku has two kinds of type: storage type and meaning. The storage type indicates how the backend should store a value, such as integer, string, float, boolean, or date. The meaning is Dataiku-specific and gives a rich semantic label to the data, such as IP address, URL, or country.
  • Each meaning is able to validate a cell value. Therefore each cell can be valid or invalid for a given meaning.
  • You can use the meaning to enable column transformations, measure the data quality of a column, and make specific values easier to find.
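
The valid/invalid idea can be sketched in plain Python. This is a minimal illustration, not Dataiku's implementation: the helper names `is_valid_ip` and `validity_report` are hypothetical, and Python's standard `ipaddress` module stands in for an "IP address" meaning. The column is stored as strings (the storage type), and each cell is checked against the semantic rule (the meaning).

```python
import ipaddress


def is_valid_ip(value: str) -> bool:
    """Return True if the string parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


def validity_report(cells):
    """Count valid, invalid, and empty cells for a hypothetical 'IP address' meaning."""
    report = {"valid": 0, "invalid": 0, "empty": 0}
    for cell in cells:
        if cell is None or cell.strip() == "":
            report["empty"] += 1
        elif is_valid_ip(cell.strip()):
            report["valid"] += 1
        else:
            report["invalid"] += 1
    return report


# Every cell is stored as a string, but only some satisfy the meaning.
column = ["192.168.0.1", "not-an-ip", "", "::1", "256.1.1.1"]
print(validity_report(column))
```

The same per-cell check is what makes a column-level data quality measure possible: the counts are simply aggregated over the cells.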

## The default sample in Dataiku is the first 10,000 rows. What are the potential trade-offs with other sampling methods such as random, stratified, or class-rebalancing sampling?

  • The trade-off is that Dataiku may have to make one or at most two full passes over the entire dataset to select the sample, which can take additional time
  • No trade-off
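
To see why a random sample costs a full pass while a head sample does not, here is a small Python sketch. It does not reproduce Dataiku's sampling engine; reservoir sampling is used only as a familiar stand-in for "uniform random sample from a stream", and `CountingSource` is a hypothetical helper that records how many rows were read.

```python
import itertools
import random


class CountingSource:
    """Iterable that records how many rows were actually read from it."""

    def __init__(self, total):
        self.total = total
        self.rows_read = 0

    def __iter__(self):
        for i in range(self.total):
            self.rows_read += 1
            yield i


def first_n(rows, n):
    """Head sample: stops reading as soon as n rows are collected."""
    return list(itertools.islice(rows, n))


def reservoir_sample(rows, n, seed=0):
    """Uniform random sample of n rows; every row must be read once."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < n:
            sample.append(row)
        else:
            j = rng.randint(0, i)  # inclusive bounds
            if j < n:
                sample[j] = row
    return sample


head_src = CountingSource(1000)
head = first_n(head_src, 10)

rand_src = CountingSource(1000)
rand = reservoir_sample(rand_src, 10)

print(head_src.rows_read, rand_src.rows_read)
```

The head sample touches only the rows it returns, whereas the random sample reads the whole source, which is the extra time the answer above refers to.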

## What is the purpose of the Analyze window in Dataiku?

  • It shows a data quality report with the number of valid, invalid, and empty values, as well as values that appear only once. Numeric columns get a histogram and a boxplot of the distribution. Categorical columns get a bar chart, sorted by the most frequent observations.
  • Analyze model behavior
  • All of the above
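
As a rough illustration of the categorical part of such a report, the following Python sketch counts empty cells, distinct values, values that appear only once, and the frequency ordering a bar chart would use. The function name `categorical_summary` and the sample data are hypothetical; this is not how Dataiku computes the report.

```python
from collections import Counter


def categorical_summary(values):
    """Summarize a categorical column: empties, distinct values, singletons, frequencies."""
    non_empty = [v for v in values if v not in (None, "")]
    counts = Counter(non_empty)
    return {
        "empty": len(values) - len(non_empty),
        "distinct": len(counts),
        "appear_once": sorted(v for v, c in counts.items() if c == 1),
        # Sorted from most to least frequent, like the bar chart described above.
        "most_frequent": counts.most_common(),
    }


colors = ["red", "blue", "red", "", "green", "red", "blue", None]
summary = categorical_summary(colors)
print(summary)
```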

## Which part of the dataset is used to compute the summary statistics in the Analyze window in Dataiku?

  • The sample

    This is the default behavior

  • The complete dataset

## What are some of the features of the Charts tab in Dataiku?

  • With time series, you can zoom in on different periods, change the aggregated date interval, explore multiple series within the same chart, examine them side by side in subcharts, or create basic animations.
  • When working with large numbers of groups of categorical data, you can easily control the number of displayed values by grouping less-prevalent categories into an “other” bucket.
  • You can also select an execution engine when working with certain types of datasets, namely those stored in SQL databases. Such a chart can be executed in-database to improve performance.
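
The “other” bucket mentioned above can be sketched in a few lines of Python. This is an illustration of the general idea only; the function `bucket_rare`, its parameters, and the sample country codes are assumptions, not Dataiku's chart engine.

```python
from collections import Counter


def bucket_rare(values, max_categories, other_label="other"):
    """Keep the most frequent categories and merge the rest into one bucket."""
    counts = Counter(values)
    keep = {v for v, _ in counts.most_common(max_categories)}
    bucketed = Counter()
    for value, count in counts.items():
        bucketed[value if value in keep else other_label] += count
    return dict(bucketed)


countries = ["FR"] * 5 + ["US"] * 4 + ["DE"] * 2 + ["IT", "ES"]
chart_data = bucket_rare(countries, max_categories=2)
print(chart_data)
```

The total count is preserved; only the number of displayed groups shrinks.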

## Explain recipes in Dataiku.

  • Recipes contain the transformation steps and processing logic.
  • Keeping logic separate from data helps with managing different data storage technologies and with data lineage.
  • The color of a recipe indicates its category (visual, code, or plugin).
  • Visual recipes are predefined recipes.

## What is the Prepare recipe?

  • It is a visual recipe for cleaning, normalizing, and enriching data interactively.
  • This is achieved by assembling a series of transformation steps from the processor library.
  • Steps can be added from the processor library, from a column's context menu, or from the Analyze window.
  • Steps can be applied to more than one column at a time.
  • While you design the recipe, steps are previewed on the sample rather than the actual dataset; the full dataset is processed only when the recipe is run.

## What are the methods for managing the growing complexity of steps?

  • Disable steps.
  • Organize individual steps into groups of steps.
  • Add colors and comments to steps in order to send reminders to yourself and colleagues.
  • Copy and paste steps within the same recipe or to another recipe, even if that recipe is in another project or another Dataiku instance.
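
A step list with disabled and commented steps can be modeled as follows. This is a hypothetical Python sketch of the concept, with an invented `Step` dataclass and `run_steps` function; it does not reflect how Dataiku stores or executes Prepare recipe steps.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    apply: Callable[[dict], dict]
    enabled: bool = True   # a disabled step stays in the list but is skipped
    comment: str = ""      # note for yourself or colleagues


def run_steps(rows, steps):
    """Apply each enabled step to every row, in order."""
    out = []
    for row in rows:
        row = dict(row)  # work on a copy so the input rows are untouched
        for step in steps:
            if step.enabled:
                row = step.apply(row)
        out.append(row)
    return out


steps = [
    Step("trim name", lambda r: {**r, "name": r["name"].strip()}),
    Step("upper name", lambda r: {**r, "name": r["name"].upper()},
         enabled=False, comment="disabled while debugging"),
    Step("add length", lambda r: {**r, "name_len": len(r["name"])}),
]

result = run_steps([{"name": "  ada "}], steps)
print(result)
```

Disabling a step instead of deleting it keeps the logic available for later, which is what makes it useful for debugging long step lists.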

## Formula

  • From the processor library, add a Formula step and give the output column a name.
  • There are two ways to write a formula: the expression box and the editor.
  • The editor suggests code completions.
  • Formulas support common mathematical functions, such as round, sum, and max; comparison operators, such as >, <, >=, and <=; logical operators, such as AND and OR; tests for missing values, such as isBlank() or isNULL(); string operations, with functions like contains(), length(), and startsWith(); and conditional if-then statements.
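
The kind of logic a formula expresses (a missing-value test, rounding, a comparison, a logical AND, a string test, and a conditional) can be mirrored in Python. The snippet below is a Python analogue only, not Dataiku formula syntax; `is_blank`, `price_band`, and the column names `price` and `sku` are hypothetical.

```python
def is_blank(value):
    """Python stand-in for a missing-value test: None or whitespace-only."""
    return value is None or str(value).strip() == ""


def price_band(row):
    """Derive a new column value from existing columns using conditional logic."""
    if is_blank(row.get("price")):
        return "unknown"
    price = round(float(row["price"]), 2)
    # Comparison combined with a string test via a logical AND.
    if price >= 100 and row["sku"].startswith("PRO"):
        return "premium"
    return "standard"


rows = [
    {"sku": "PRO-1", "price": "129.999"},
    {"sku": "BAS-1", "price": "129.99"},
    {"sku": "PRO-2", "price": ""},
]
bands = [price_band(r) for r in rows]
print(bands)
```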

## Worksheets

  • A worksheet provides a visual summary of exploratory data analysis (EDA).
  • A worksheet can be created from the Statistics tab.
  • A card is used to perform a specific EDA task. Worksheets organize the cards.

## What can you do from the configuration menu of a card?

  • Configure tests
  • Duplicate the card
  • View the JSON payloads and responses for use with the public API
  • Publish the card
  • Delete the card
  • Some cards also contain multiple sections, with each section having its own configuration menu.
  • The Split by menu in a card is useful for grouping your dataset by a specified variable. This allows the card to perform computations on each data subgroup.
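
What "performing computations on each data subgroup" means can be shown with a short Python sketch. The function `split_by` and the `region`/`sales` columns are hypothetical, and the statistics computed here (count and mean) are just examples of what a card might report per subgroup.

```python
from collections import defaultdict
from statistics import mean


def split_by(rows, key, value):
    """Group rows by `key`, then compute the same statistics for each subgroup."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {g: {"count": len(v), "mean": mean(v)} for g, v in groups.items()}


sales = [
    {"region": "EU", "sales": 10},
    {"region": "EU", "sales": 20},
    {"region": "US", "sales": 5},
]
per_region = split_by(sales, key="region", value="sales")
print(per_region)
```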
