Terminology

  • Serverless vs Fully managed
    • Serverless: No visibility into the machines / No server management / Pay only for what is used
    • Fully managed: Visibility and control of machines / Managed & automated / Pay for machine runtime

IAM (Identity and Access Management)

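  • Access management system: controls who can do what on which resources
    • Policy: a document stating which effect (Allow / Deny), actions, and resources apply
    • Role: an identity with attached policies that a user or service can assume temporarily
    • User: a person or application; policies can be attached to a user, and a user can assume roles
    • Group: a collection of users; policies attached to a group apply to all of its members
  • Each user / role (identified by its ARN) should have the correct level of access to the data / service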
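
A minimal sketch of what a policy looks like in practice: a JSON document whose statements each name an effect, a list of actions, and a list of resources. The helper `make_policy` and the bucket name `example-bucket` are hypothetical and exist only for illustration.

```python
import json

def make_policy(effect, actions, resources):
    """Build an IAM policy document with a single statement."""
    return {
        "Version": "2012-10-17",  # current IAM policy language version
        "Statement": [
            {
                "Effect": effect,       # "Allow" or "Deny"
                "Action": actions,      # API actions the statement covers
                "Resource": resources,  # ARNs the actions apply to
            }
        ],
    }

# Read-only access to one (hypothetical) bucket and the objects in it.
read_only = make_policy(
    "Allow",
    ["s3:GetObject", "s3:ListBucket"],
    ["arn:aws:s3:::example-bucket", "arn:aws:s3:::example-bucket/*"],
)
policy_json = json.dumps(read_only, indent=2)
print(policy_json)
```

Attaching a document like this to a role, user, or group is what grants the access; the document alone does nothing.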

ARN (Amazon Resource Name)

  • Uniquely identifies an AWS resource; required wherever a resource must be specified unambiguously, such as in
    • IAM policies
    • Amazon Relational Database Service (Amazon RDS) tags
    • API calls
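
An ARN follows the general layout `arn:partition:service:region:account-id:resource`, where some fields may be empty and the resource part may itself contain `:` or `/`. The sketch below splits an ARN into those fields; `parse_arn` and the sample ARNs are illustrative only.

```python
from typing import NamedTuple

class Arn(NamedTuple):
    partition: str
    service: str
    region: str
    account_id: str
    resource: str

def parse_arn(arn: str) -> Arn:
    """Split an ARN into its five fields after the literal 'arn' prefix."""
    # Split at most 5 times so colons inside the resource part are preserved.
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not a valid ARN: {arn!r}")
    return Arn(*parts[1:])

# IAM is a global service, so the region field is empty.
user = parse_arn("arn:aws:iam::123456789012:user/alice")
# The RDS resource part contains a colon of its own.
db = parse_arn("arn:aws:rds:us-east-1:123456789012:db:mydb")
print(user)
print(db)
```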

S3 (Simple Storage Service)

  • Object storage, commonly used as a data lake

Lambda

  • Event-driven, serverless computing platform
  • Features
    • Similar to a cron job in Linux: code can be scheduled to run at a specific time, or run in response to a trigger (event)
  • Advantage
    • No need to configure servers, just write code
    • Auto scaling: resources are allocated based on workload
    • Pay only for run time
  • Example: process images uploaded to S3
    • Trigger: S3 event notification (e.g., an object is created in the bucket)
    • Workflow: Create S3 buckets -> Assign IAM roles -> Add trigger to Lambda -> Add script to Lambda -> Test
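
The "add script to Lambda" step could look roughly like the handler below. It only extracts the bucket and key of each uploaded object from the S3 event; the actual image processing is left as a comment. The event is a trimmed, hand-written sample, and the bucket and key names are hypothetical.

```python
from urllib.parse import unquote_plus

def lambda_handler(event, context):
    """Collect the bucket and key of each object named in an S3 event."""
    uploaded = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        uploaded.append({"bucket": bucket, "key": key})
        # Image processing (resize, thumbnail, ...) would go here.
    return {"processed": uploaded}

# Local smoke test with a trimmed-down sample event.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "example-uploads"},
                "object": {"key": "photos/my+cat.jpg"},
            }
        }
    ]
}
result = lambda_handler(sample_event, None)
print(result)
```

Calling the handler locally with a sample event, as above, is a cheap way to cover the "Test" step before wiring up the real trigger.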

Glue

  • Serverless data integration (ETL) service
  • Features
    • Schema inference: extracts the schema and stores it in the metadata catalog
    • Auto-generates ETL scripts
  • AWS Lambda vs AWS Glue
    • Glue automates more of the ETL work (e.g., schema inference, auto-generated ETL scripts); Lambda runs arbitrary code, so it allows more custom data manipulation
    • Lambda is suitable for smaller, short-running jobs (15-minute maximum per invocation); Glue is suitable for larger jobs / distributed processing
    • Glue has more attached services (e.g., Data Catalog, Crawler, DataBrew, Elastic Views)
  • Concepts
    • Data catalog: central, persistent metadata store
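    • Database / table: schema definitions held in the Data Catalog; the data itself still resides in the original store
    • Partitions: folders (key prefixes) in S3 that give the data a logical structure, e.g., year=2024/month=01/
    • Crawler: scans a data store, detects the schema automatically, and loads it into the Data Catalog
    • Athena: queries the data in S3 with SQL, using the table definitions in the Glue Data Catalog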
    • Connection: contains properties that are required to connect to a particular data store (e.g., RDS)
  • Example: combine multiple data sources
    • Data: Ad click log in JSON in S3 + user profile in RDS
    • Glue: flatten JSON -> delete columns -> output processed Ad click data
    • Athena: query to combine processed Ad click data with user profile
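
To make the "flatten JSON -> delete columns" step concrete, here is a plain-Python stand-in for the two transforms. It is not Glue code (a Glue job would use its own transforms on distributed data); the record layout and the column names are assumptions made up for illustration.

```python
def flatten(record, parent_key="", sep="_"):
    """Flatten nested dicts, e.g. {"a": {"b": 1}} becomes {"a_b": 1}."""
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat

def drop_columns(record, columns):
    """Return a copy of the record without the named columns."""
    return {k: v for k, v in record.items() if k not in columns}

# Hypothetical ad click record as it might appear in the JSON log.
click = {
    "user_id": 42,
    "ad": {"id": "a-17", "campaign": "spring"},
    "debug": {"trace": "xyz"},
}
processed = drop_columns(flatten(click), {"debug_trace"})
print(processed)
```

The flattened output keeps `user_id` as a plain column, which is what lets the later query join it against the user profile table.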

SNS (Simple Notification Service)

  • Message publishing and processing service
  • Features
    • Multiple delivery protocols: email, HTTP/HTTPS endpoint, Amazon SQS, SMS (text message)
    • Fully managed: no need to worry about infrastructure
    • Durable: messages are stored redundantly so that they are not lost before delivery
    • Auto-scaling: can fan out to millions of subscribers
    • Application-to-person or application-to-application
  • Why Application-to-application?
    • Upstream publishers need no knowledge of the downstream services that consume their messages
    • One message fans out to all subscribers, so there is no need to set up a sequential pipeline
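
The decoupling argument can be illustrated with a tiny in-memory publish / subscribe sketch. This is a stand-in for the pattern only, not the SNS API; the `Topic` class, the handlers, and the message fields are all hypothetical.

```python
class Topic:
    """In-memory stand-in for a pub/sub topic (not the SNS API)."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, handler):
        self._subscribers.append(handler)

    def publish(self, message):
        # The publisher never names its consumers; it only knows the topic.
        for handler in self._subscribers:
            handler(message)
        return len(self._subscribers)

analytics_inbox = []
email_inbox = []

orders = Topic()
orders.subscribe(analytics_inbox.append)                     # consumer 1
orders.subscribe(lambda msg: email_inbox.append(msg["event"]))  # consumer 2

delivered = orders.publish({"event": "order_placed", "order_id": 1})
print(delivered, analytics_inbox, email_inbox)
```

Adding a third consumer is one more `subscribe` call; the publishing code does not change.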

ECS (Elastic Container Service)

  • Fully managed container orchestration service
  • Features
    • Serverless (Fargate) or Fully managed (EC2)
    • Auto-scaling
  • Workflow
    • Build Docker images from the required Dockerfiles
    • Push the images to Amazon ECR: a fully managed Docker container registry for storing, sharing, and deploying container images
    • Define tasks: how to spin up containers, e.g., number of containers, port mapping, resources, etc.
    • Define cluster: which cluster the tasks are launched in
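
As a rough sketch of the last two steps, a task definition and a run request can be written as plain dictionaries. The field names follow the ECS task definition format as I understand it; the family, image URI, cluster name, and sizes are hypothetical, and a real Fargate task would need more (for example an execution role and network configuration).

```python
import json

# Task definition: how to spin up the container(s).
task_definition = {
    "family": "web-app",
    "networkMode": "awsvpc",                 # required for Fargate tasks
    "requiresCompatibilities": ["FARGATE"],  # serverless launch type
    "cpu": "256",                            # task-level CPU units
    "memory": "512",                         # task-level memory in MiB
    "containerDefinitions": [
        {
            "name": "web",
            # Hypothetical image pushed to ECR in the previous step.
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/web-app:latest",
            "essential": True,
            "portMappings": [{"containerPort": 80, "protocol": "tcp"}],
        }
    ],
}

# Run request: which cluster to launch the task in, and how many copies.
run_request = {
    "cluster": "demo-cluster",
    "taskDefinition": task_definition["family"],
    "count": 2,
    "launchType": "FARGATE",
}

print(json.dumps(task_definition, indent=2))
print(json.dumps(run_request, indent=2))
```

With boto3, dictionaries like these would typically be passed as keyword arguments to the ECS client's `register_task_definition` and `run_task` calls.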

Step Functions

  • Visual workflow service / workflow automation / Orchestration
  • Why Step Functions?
    • Allows an individual step to wait / fail / retry without rerunning the whole sequence of events
    • The workflow is written as a script (a JSON state machine definition) and rendered as a flow chart; each task refers to its resource by ARN
  • Examples
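
One possible example, sketched under assumptions: a three-state workflow in which the first task retries on failure, a wait state pauses, and a final task ends the run. The structure follows the Amazon States Language as I understand it; the state names and Lambda ARNs are hypothetical, and `undefined_targets` is a small local sanity check, not part of any AWS tooling.

```python
import json

state_machine = {
    "Comment": "Process an image, wait, then notify",
    "StartAt": "ProcessImage",
    "States": {
        "ProcessImage": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-image",
            # Only this step is retried; earlier results are not recomputed.
            "Retry": [
                {
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 5,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }
            ],
            "Next": "WaitBeforeNotify",
        },
        "WaitBeforeNotify": {"Type": "Wait", "Seconds": 10, "Next": "Notify"},
        "Notify": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:notify",
            "End": True,
        },
    },
}

def undefined_targets(machine):
    """Return names referenced by StartAt / Next that are not defined states."""
    states = machine["States"]
    targets = [machine["StartAt"]]
    targets += [s["Next"] for s in states.values() if "Next" in s]
    return [t for t in targets if t not in states]

definition = json.dumps(state_machine, indent=2)
print(definition)
print("undefined targets:", undefined_targets(state_machine))
```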

SageMaker

  • Fully managed ML service
  • Continuous ML lifecycle
    • Data collection, cleaning, preprocessing
    • Build, Train, Evaluate ML models
    • Model deployment, monitoring
  • Features
    • SageMaker Studio: an extension of the JupyterLab interface; can launch notebooks / kernels with a specified environment
    • Resource sharing: the kernels of a group of authorized users can share a pool of compute resources (CPU, GPU)
    • SageMaker image: create customized environment
    • A collection of tools
      • SageMaker Data Wrangler: create preprocessing pipelines with a set of summary statistics and visualization tools
      • SageMaker Feature Store: a managed repository to store and retrieve ML features
      • SageMaker Pipelines: CI/CD service for ML / Automation
      • SageMaker Autopilot: AutoML service
    • Connect to other AWS services
      • Amazon DynamoDB: NoSQL (key-value / document) data storage
      • AWS Batch: offline batch processing
      • Amazon Kinesis: real-time processing
  • Multiple levels of abstraction for training and deploying ML models
    • Highest level of abstraction: pre-trained ML models that can be deployed as-is
    • Lowest level of abstraction: full ML lifecycle from scratch