AWS ML Specialty Certification

Data Engineering with AWS Machine Learning

Data type characteristics help determine which AWS repository to use.

  • Structured data:
    • Relational
    • Has a predefined schema
    • Relationships
    • Supports complex queries
    • AWS repositories: Amazon Relational Database Service (Amazon RDS), with the Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle, and SQL Server database engines. AWS also places the Amazon Redshift data warehouse in this category.
  • Semi-structured data:
    • Partially structured, such as JSON or XML
    • Key-value databases on AWS, such as Amazon DynamoDB, support this type
  • Unstructured data:
    • No schema at all
    • Heterogeneous object storage
    • Amazon S3 supports this type
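
The mapping from data type to repository can be written down as a small lookup. The sketch below is only a study aid: `REPOSITORIES` and `suggest_repositories` are hypothetical names, not part of any AWS SDK, and the entries simply restate the services named in these notes.

```python
# Hypothetical lookup table built from the notes above; this is not an AWS API.
REPOSITORIES = {
    "structured": ["Amazon RDS", "Amazon Aurora", "Amazon Redshift"],
    "semi-structured": ["Amazon DynamoDB"],
    "unstructured": ["Amazon S3"],
}


def suggest_repositories(data_type: str) -> list[str]:
    """Return the candidate AWS repositories for a data type named in the notes."""
    # Normalize spellings such as "Semi Structured" or "semi_structured".
    key = data_type.strip().lower().replace(" ", "-").replace("_", "-")
    if key not in REPOSITORIES:
        raise ValueError(f"unknown data type: {data_type!r}")
    return REPOSITORIES[key]


if __name__ == "__main__":
    for kind in REPOSITORIES:
        print(kind, "->", ", ".join(suggest_repositories(kind)))
```

In practice the choice also depends on access patterns, cost, and latency, so treat the table as a first filter only.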

Batch and stream processing characteristics:

  • Batch processing:
    • Scope: querying or processing over all or most of the dataset
    • Data size: large batches
    • Performance: latencies range from minutes to hours
    • Analyses: complex, such as OLAP-style analytics and string processing
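
To make the batch idea concrete, here is a minimal sketch that scans a whole dataset in one pass and produces an aggregate at the end. The `ORDERS` records and the `batch_totals` function are made up for illustration; a real batch job would read a much larger dataset from storage instead of an in-memory list.

```python
from collections import defaultdict

# Made-up sample records standing in for a full dataset at rest.
ORDERS = [
    {"region": "eu", "amount": 120.0},
    {"region": "us", "amount": 80.0},
    {"region": "eu", "amount": 30.0},
    {"region": "us", "amount": 20.0},
]


def batch_totals(records):
    """Scan every record once and return the total amount per region."""
    totals = defaultdict(float)
    for record in records:
        totals[record["region"]] += record["amount"]
    return dict(totals)


if __name__ == "__main__":
    # The result is only available after the whole batch has been processed.
    print(batch_totals(ORDERS))
```

The point of the sketch is the shape of the work: the job touches all of the data and the answer arrives only once the full pass is done, which is why batch latencies are measured in minutes or hours.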

Book: AWS Certified Machine Learning Study Guide by Shreyas Subramanian, Stefan Natu

Chapter 5: Data Collection

  • This chapter covers AWS ML Objectives:

    • Domain 1.0: Data Engineering
      • 1.1 Create data repositories for machine learning
      • 1.2 Identify and implement a data ingestion solution
    • Domain 2.0: Exploratory Data Analysis
      • 2.1 Sanitize and prepare data for modeling
  • Types of data:

    • Structured data:

      • Has a well-defined schema and metadata
    • Unstructured data:

      • No well-defined structural properties or schema
    • Semi-structured data:

      • No precise schema.
      • Formats like JSON and XML are in this category
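
A short sketch of what "no precise schema" can look like in practice: every record below is valid JSON, yet the fields vary from one record to the next. The sample data and the `field_names` helper are made up for illustration.

```python
import json

# Made-up records: each is valid JSON, but the fields differ between records.
RAW = """
[
  {"id": 1, "name": "sensor-a", "reading": 21.5},
  {"id": 2, "name": "sensor-b", "tags": ["indoor", "lab"]},
  {"id": 3, "location": {"city": "Oslo"}}
]
"""


def field_names(raw_json: str) -> list[list[str]]:
    """Return the sorted field names of each record in a JSON array."""
    return [sorted(record) for record in json.loads(raw_json)]


if __name__ == "__main__":
    for names in field_names(RAW):
        print(names)
```

A structured table would reject the second and third records unless its schema were changed first, while a semi-structured store accepts all three as they are.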