AWS ML Specialty Certification
Table of Contents
Data Engineering with AWS Machine Learning
Data type characteristics help determine which AWS repository to use.
- Structured data:
- Relational
- Has a predefined schema
- Relationships
- Supports complex queries
- Amazon Relational Database Service (Amazon RDS): supports the Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle, and SQL Server database engines. AWS also places the Amazon Redshift data warehouse in this category.
- Semi-structured data:
- Partially structured, such as JSON or XML
- Key-value databases on AWS, such as Amazon DynamoDB, support this type
- Unstructured data:
- No schema at all
- Heterogeneous object storage
- Amazon S3 supports this type
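The mapping above can be written down as a small lookup, which may help when memorizing which service goes with which kind of data. This is only a sketch of these notes: `DataKind`, `REPOSITORY_OPTIONS`, and `suggest_repositories` are hypothetical names invented here, not part of any AWS SDK, and the lists are limited to the services named in the notes.

```python
from enum import Enum


class DataKind(Enum):
    STRUCTURED = "structured"
    SEMI_STRUCTURED = "semi-structured"
    UNSTRUCTURED = "unstructured"


# Illustrative mapping taken from the notes; not an exhaustive list of AWS services.
REPOSITORY_OPTIONS = {
    DataKind.STRUCTURED: ["Amazon RDS", "Amazon Aurora", "Amazon Redshift"],
    DataKind.SEMI_STRUCTURED: ["Amazon DynamoDB"],
    DataKind.UNSTRUCTURED: ["Amazon S3"],
}


def suggest_repositories(kind: DataKind) -> list[str]:
    """Return the candidate AWS repositories for one kind of data."""
    return REPOSITORY_OPTIONS[kind]


if __name__ == "__main__":
    for kind in DataKind:
        print(f"{kind.value}: {', '.join(suggest_repositories(kind))}")
```

A real choice would also weigh access patterns, cost, and latency; the lookup only captures the schema dimension discussed here.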
Batch and stream processing characteristics:
- Batch processing:
- Scope: queries or processing over all or most of the dataset
- Data size: large batches
- Latency: from minutes to hours
- Analysis: complex analytics, such as OLAP-style queries and string processing
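To make the batch idea concrete, here is a minimal sketch in plain Python: the job reads the whole dataset in fixed-size batches and produces one aggregate at the end, rather than reacting to each record as it arrives. The `orders` data and the helper names `batches` and `total_by_customer` are made up for illustration, and the batch size is kept tiny so the example is easy to follow.

```python
from itertools import islice
from typing import Iterable, Iterator


def batches(records: Iterable, size: int) -> Iterator[list]:
    """Yield successive lists of at most `size` records."""
    it = iter(records)
    while chunk := list(islice(it, size)):
        yield chunk


def total_by_customer(records: Iterable[dict], batch_size: int = 2) -> dict[str, float]:
    """Aggregate over the entire dataset, one batch at a time."""
    totals: dict[str, float] = {}
    for batch in batches(records, batch_size):
        for row in batch:
            totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    return totals


# Hypothetical input data.
orders = [
    {"customer": "a", "amount": 10.0},
    {"customer": "b", "amount": 5.0},
    {"customer": "a", "amount": 2.5},
]

print(total_by_customer(orders))  # {'a': 12.5, 'b': 5.0}
```

The result is only available once every batch has been read, which is why batch latencies are measured in minutes to hours on large datasets.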
Book: AWS Certified Machine Learning Study Guide by Shreyas Subramanian, Stefan Natu
Chapter 5: Data Collection
This chapter covers AWS ML Objectives:
- Domain 1.0: Data Engineering
- 1.1 Create data repositories for machine learning
- 1.2 Identify and implement a data ingestion solution
- Domain 2.0: Exploratory Data Analysis
- 2.1 Sanitize and prepare data for modeling
- Domain 1.0: Data Engineering
Types of data:
Structured data:
- Has a well-defined schema and metadata
Unstructured data:
- No well-defined structural properties or schema
Semi-structured data:
- No precise schema.
- Formats like JSON and XML are in this category
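The three types are easier to tell apart with a tiny sample of each. The sketch below uses only the Python standard library; the sample values and the helper `field_sets` are invented for illustration. It shows that every CSV row shares the same columns, JSON records can carry different fields, and a binary blob exposes no fields at all.

```python
import csv
import io
import json

# Structured: every row follows the same predefined columns.
csv_text = "id,name,age\n1,Ana,34\n2,Raj,41\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: records describe themselves, and fields may differ per record.
json_text = '[{"id": 1, "name": "Ana"}, {"id": 2, "tags": ["new"], "address": {"city": "Pune"}}]'
docs = json.loads(json_text)

# Unstructured: raw bytes with no schema (here, the 8-byte PNG file signature as a placeholder).
blob = b"\x89PNG\r\n\x1a\n"


def field_sets(records):
    """Return the set of field names used by each record."""
    return [set(record.keys()) for record in records]


print(field_sets(rows))   # identical field sets for every row
print(field_sets(docs))   # field sets differ between records
print(type(blob).__name__, len(blob))
```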