Building Data pipelines in GCP
Table of Contents
AWS ML Certification
Exam Format
Exam Details:
- 180 minutes, ~65 questions
- Multiple-choice and multiple-response questions.
- No partial credit for multiple-response questions.
- Scores range from 100 to 1000; the minimum passing score is 750.
- Scaled scoring model
Blueprint
- Four domains
- Domain 1: Data Engineering - 20%
- Domain 2: Exploratory Data Analysis - 24%
- Domain 3: Modeling - 36%
- Domain 4: Machine Learning Implementation and Operations - 20%
Data Collection
Amazon SageMaker Ground Truth
Data stores
AWS provides the following data stores:
S3
- Objects can range from 0 bytes to 5 TB in size.
- Total storage is unlimited.
- Bucket names must be globally unique.
- Endpoints can be path-style or virtual-hosted-style.
- bucketname.s3.amazonaws.com is the virtual-hosted style, which has been the adopted format since 2020.
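The two endpoint styles differ only in where the bucket name appears in the URL. The following is a minimal sketch, assuming regional endpoints of the form s3.<region>.amazonaws.com; the function name, bucket, key, and region are hypothetical values chosen for illustration.

```python
from urllib.parse import quote


def s3_object_urls(bucket: str, key: str, region: str = "us-east-1") -> dict:
    """Build both URL styles for the same S3 object (illustrative helper)."""
    host = f"s3.{region}.amazonaws.com"
    # Percent-encode the key but keep "/" so the object path stays readable.
    encoded_key = quote(key, safe="/")
    return {
        # Virtual-hosted style: the bucket name is part of the host name.
        "virtual_hosted": f"https://{bucket}.{host}/{encoded_key}",
        # Path style: the bucket name is the first path segment.
        "path": f"https://{host}/{bucket}/{encoded_key}",
    }


# Hypothetical bucket and key, used only to show the two shapes.
urls = s3_object_urls("example-bucket", "data/train.csv", "us-west-2")
for style, url in urls.items():
    print(f"{style}: {url}")
```

Because the bucket name becomes part of a DNS host name in the virtual-hosted style, this is one practical reason bucket names have to be globally unique.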
RDS
- RDS is a fully managed relational database service that supports several engines: Amazon Aurora, MySQL, PostgreSQL, MariaDB, Oracle, and SQL Server.
DynamoDB
- NoSQL data store built on key-value pairs.
- Suited to unstructured or semi-structured data.
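To make the key-value model concrete, the sketch below converts a plain Python dictionary into the typed attribute-value format that the low-level DynamoDB API uses (S for string, N for number, BOOL, NULL, L for list, M for map). The helper name and the sample item are hypothetical, and binary and set types are left out for brevity.

```python
def to_attribute_value(value):
    """Convert a Python value to a DynamoDB-style typed attribute (simplified)."""
    if value is None:
        return {"NULL": True}
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return {"BOOL": value}
    if isinstance(value, (int, float)):
        # DynamoDB transmits numbers as strings.
        return {"N": str(value)}
    if isinstance(value, str):
        return {"S": value}
    if isinstance(value, list):
        return {"L": [to_attribute_value(v) for v in value]}
    if isinstance(value, dict):
        return {"M": {k: to_attribute_value(v) for k, v in value.items()}}
    raise TypeError(f"Unsupported type: {type(value).__name__}")


# Hypothetical semi-structured item: attributes can vary from item to item.
item = {
    "user_id": "u-123",
    "age": 31,
    "tags": ["ml", "aws"],
    "profile": {"active": True},
}
typed_item = {name: to_attribute_value(v) for name, v in item.items()}
print(typed_item)
```

Only the key attribute (here assumed to be user_id) needs to be present on every item; the remaining attributes are free-form, which is what makes the store a fit for semi-structured data.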
Amazon Redshift
- Amazon Redshift is a fully managed, clustered, petabyte-scale data warehousing solution that consolidates data from other sources such as S3, DynamoDB, and more.
- It can store large volumes of structured or semi-structured data.
- Once data is in Redshift, SQL client tools, business intelligence tools, or other analytics tools can be used to query it.
- From the console, you can launch a Redshift cluster and choose the number of nodes, their storage size, and other parameters.
- Redshift Spectrum lets you run queries from your Redshift cluster directly against data stored in S3, without loading it into the cluster first. You can then use tools like QuickSight to build charts and graphs that visualize the data.
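Querying S3 data through Redshift Spectrum typically involves three SQL statements: an external schema, an external table that points at an S3 location, and an ordinary SELECT. The sketch below only assembles those statements as strings; it does not connect to a cluster. The schema, table, column names, IAM role ARN, and S3 path are all hypothetical placeholders.

```python
def spectrum_statements(schema, catalog_db, iam_role_arn, table, s3_location):
    """Return the SQL statements for a basic Redshift Spectrum setup (sketch)."""
    create_schema = (
        f"CREATE EXTERNAL SCHEMA {schema}\n"
        f"FROM DATA CATALOG\n"
        f"DATABASE '{catalog_db}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"CREATE EXTERNAL DATABASE IF NOT EXISTS;"
    )
    # The column list is an assumption; it must match the files in S3.
    create_table = (
        f"CREATE EXTERNAL TABLE {schema}.{table} (\n"
        f"    event_id BIGINT,\n"
        f"    event_time TIMESTAMP,\n"
        f"    amount DOUBLE PRECISION\n"
        f")\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{s3_location}';"
    )
    # The external table is queried like any other Redshift table.
    query = (
        f"SELECT COUNT(*) AS n_events, SUM(amount) AS total\n"
        f"FROM {schema}.{table};"
    )
    return [create_schema, create_table, query]


statements = spectrum_statements(
    schema="spectrum_demo",
    catalog_db="demo_catalog",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleSpectrumRole",
    table="events",
    s3_location="s3://example-bucket/events/",
)
print("\n\n".join(statements))
```

The data stays in S3 throughout; only the table definition lives in the data catalog, so the same files remain available to other tools.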
Amazon Timestream
- Amazon Timestream is a fully managed time series database service. It lets you plug in business intelligence tools and run SQL-like queries on your time series data.
Amazon DocumentDB
- Introduced in 2019.
- MongoDB-compatible; intended as a migration target for MongoDB data.
- Designed to provide better performance and scalability.
Data Migration Tools