AWS | Martin Jiang

Serverless vs Fully managed
- Serverless: No visibility into the machines / No server management / Pay only for what is used
- Fully managed: Visibility and control of machines / Provisioning and maintenance are managed and automated / Pay for machine runtime

IAM (Identity and Access Management): credential management system
- Policy: which effect (allow / deny), actions, and resources are permitted
- Role: an identity with attached policies that a user or service can assume
- User: policies can be attached to a user, and a user can assume a role
- Group: policies attached to a group apply to every user in the group
Each user / ARN should have the correct level of access to the data / service
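
A minimal sketch of the effect / action / resource structure of a policy, built as a plain Python dict in the JSON shape IAM policy documents use. The bucket name and the helper `make_read_only_policy` are hypothetical, chosen only for illustration.

```python
import json

# Hypothetical bucket name, used only for illustration.
BUCKET = "example-image-bucket"


def make_read_only_policy(bucket: str) -> dict:
    """Build a policy document granting read-only access to one S3 bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",  # allow or deny
                "Action": ["s3:GetObject", "s3:ListBucket"],  # what may be done
                "Resource": [  # which resources, identified by ARN
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
            }
        ],
    }


policy = make_read_only_policy(BUCKET)
print(json.dumps(policy, indent=2))
```

The resulting JSON is what would be attached to a user, group, or role.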

ARN (Amazon Resource Name): uniquely identifies an AWS resource; required wherever a resource must be specified unambiguously, such as in
- IAM policies
- Amazon Relational Database Service (Amazon RDS) tags
- API calls
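
An ARN follows the general form `arn:partition:service:region:account-id:resource`. The sketch below splits one into its fields; `parse_arn` and the example ARNs (account ID, function name, bucket name) are hypothetical and only illustrate the format.

```python
from typing import NamedTuple


class Arn(NamedTuple):
    partition: str
    service: str
    region: str
    account_id: str
    resource: str


def parse_arn(arn: str) -> Arn:
    """Split an ARN into its five fields after the leading 'arn' prefix."""
    # Split on at most 5 colons so colons inside the resource part are kept.
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not a valid ARN: {arn!r}")
    return Arn(*parts[1:])


# Hypothetical ARNs, for illustration only.
lambda_arn = parse_arn("arn:aws:lambda:us-east-1:123456789012:function:resize-image")
bucket_arn = parse_arn("arn:aws:s3:::example-image-bucket")
print(lambda_arn.service, lambda_arn.resource)
print(bucket_arn.service, bucket_arn.resource)
```

Some services, such as S3, leave the region and account fields empty, which is why the bucket ARN contains consecutive colons.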

AWS Lambda: event-driven, serverless computing platform
Features
- Similar to a cron job in Linux: runs on a schedule or in response to a trigger
Advantage
- No need to configure servers, just write code
- Auto scaling: resources are allocated based on workload
- Pay only for run time
Example: process images uploaded to S3
- Trigger: S3 event notification (e.g., a new object is created in the bucket)
- Workflow: Create S3 buckets -> Assign IAM roles -> Add trigger to Lambda -> Add script to Lambda -> Test
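
The "add script to Lambda" step might look roughly like the handler below. It is a sketch that only extracts the bucket and key from an S3 event notification; the actual image processing is left out, and the bucket name, key, and sample event are made up for local testing.

```python
from urllib.parse import unquote_plus


def lambda_handler(event, context):
    """Collect the bucket and key of every object named in an S3 event."""
    uploaded = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 events (spaces become '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        uploaded.append({"bucket": bucket, "key": key})
        # Image processing (resize, thumbnail, ...) would go here.
    return {"processed": len(uploaded), "objects": uploaded}


# Hand-written sample event with only the fields the handler reads.
sample_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "example-image-bucket"},
                "object": {"key": "uploads/my+cat.jpg"},
            },
        }
    ]
}

result = lambda_handler(sample_event, None)
print(result)
```

Calling the handler locally with a sample event is a cheap way to cover the "Test" step before wiring up the real trigger.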

AWS Glue: serverless data integration (ETL) service
Features
- Schema inference: extracts the schema and stores it in the metadata catalog
- Automatic generation of ETL scripts
AWS Lambda vs AWS Glue
- Glue performs ETL with a higher level of automation (e.g., schema inference, automatic ETL script generation); Lambda runs arbitrary code, so it can perform more custom data manipulation
- Lambda is suitable for smaller jobs; Glue is suitable for larger jobs / distributed processing
- Glue has more attached services (e.g., Data Catalog, Crawler, DataBrew, Elastic Views)
Concepts
- Data catalog: central, persistent metadata store
- Database / table: a representation of the schema in Glue; the data still resides in the original store
- Partitions: folders in S3 that create a logical structure for the data
- Crawler: automatically detects the schema and loads it into the Data Catalog
- Athena: queries tables registered in the Glue Data Catalog using SQL
- Connection: contains properties that are required to connect to a particular data store (e.g., RDS)
Example: combine multiple data sources
- Data: Ad click log in JSON in S3 + user profile in RDS
- Glue: flatten JSON -> delete columns -> output processed Ad click data
- Athena: query to combine processed Ad click data with user profile
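
The "flatten JSON -> delete columns" step can be illustrated in plain Python. This is a stand-in for the transformation logic only, not Glue's own API; the record fields and the helpers `flatten` and `drop_columns` are hypothetical.

```python
def flatten(record: dict, parent_key: str = "", sep: str = ".") -> dict:
    """Turn nested dicts into a single level with dotted column names."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


def drop_columns(record: dict, columns: set) -> dict:
    """Return a copy of the record without the named columns."""
    return {k: v for k, v in record.items() if k not in columns}


# Made-up ad click record, for illustration only.
raw_click = {
    "user_id": 42,
    "ad": {"id": "ad-7", "campaign": "spring"},
    "device": {"os": "ios", "ip": "203.0.113.5"},
}

processed = drop_columns(flatten(raw_click), {"device.ip"})
print(processed)
```

After this step every record is a flat row keyed by `user_id`, which is the column a later SQL join against the user profile table would use.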

Amazon SNS (Simple Notification Service): message publishing and processing service
Features
- Multiple delivery protocols: email, HTTP/HTTPS endpoint, Amazon SQS, SMS text messages
- Fully managed: no need to worry about infrastructure
- Durable: messages are stored redundantly so that they are not lost
- Auto-scaling: scales to millions of consumers
- Application-to-person or application-to-application
Why Application-to-application?
- Upstream services do not need any knowledge of downstream services
- No need to set up a sequential pipeline
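
The decoupling argument can be shown with a tiny in-memory publish / subscribe broker. This is a conceptual sketch, not the SNS API: the `Broker` class, the topic name, and the handlers are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class Broker:
    """Minimal in-memory topic broker: publishers never see subscribers."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[str], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: str) -> int:
        """Deliver the message to every subscriber; return the delivery count."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(message)
        return len(handlers)


# Two independent downstream consumers of the same topic.
email_outbox: List[str] = []
work_queue: List[str] = []

broker = Broker()
broker.subscribe("image-uploaded", lambda m: email_outbox.append(f"notify: {m}"))
broker.subscribe("image-uploaded", work_queue.append)

# The upstream publisher only knows the topic name.
delivered = broker.publish("image-uploaded", "uploads/cat.jpg")
print(delivered, email_outbox, work_queue)
```

Adding a third consumer requires only another `subscribe` call; the publishing code does not change.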

Amazon ECS (Elastic Container Service): fully managed container orchestration service
Features
- Serverless (Fargate) or Fully managed (EC2)
- Auto-scaling
Workflow
- Collect the required Dockerfiles
- Amazon ECR: fully managed Docker container registry for storing, sharing, and deploying container images
- Define tasks: how to spin up containers, e.g., number of containers, port mapping, resources, etc.
- Define cluster: the cluster in which to launch the tasks
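
A task definition is a JSON document; the sketch below builds a small one as a Python dict for a Fargate launch. The account ID, region, repository, family, and container names are placeholders, and a real task pulling from ECR would typically also need an execution role, which is omitted here.

```python
import json

# Placeholder values, for illustration only.
ACCOUNT_ID = "123456789012"
REGION = "us-east-1"
IMAGE_URI = f"{ACCOUNT_ID}.dkr.ecr.{REGION}.amazonaws.com/example-web:latest"

task_definition = {
    "family": "example-web-task",
    "requiresCompatibilities": ["FARGATE"],
    "networkMode": "awsvpc",  # network mode used by Fargate tasks
    "cpu": "256",  # task-level CPU units
    "memory": "512",  # task-level memory in MiB
    "containerDefinitions": [
        {
            "name": "web",
            "image": IMAGE_URI,  # image stored in ECR
            "essential": True,
            "portMappings": [{"containerPort": 80, "protocol": "tcp"}],
        }
    ],
}

task_definition_json = json.dumps(task_definition, indent=2)
print(task_definition_json)
```

How many copies of the task run is usually set separately, as the desired count of the service that launches the task in a cluster.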

AWS Step Functions: visual workflow service / workflow automation / orchestration
Why Step Functions?
- Allows an individual step to wait / fail / retry without rerunning the whole sequence of events
- A script (the state machine definition) generates the flow chart, referencing each step's resource by ARN
Examples
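
A minimal sketch of a state machine definition in the Amazon States Language, written as a Python dict: one task step with a retry policy, a wait step, and a success state. The Lambda ARN and state names are hypothetical, and `check_transitions` is only a local sanity check, not part of any AWS tooling.

```python
import json

# Hypothetical Lambda function ARN, for illustration only.
LAMBDA_ARN = "arn:aws:lambda:us-east-1:123456789012:function:resize-image"

state_machine = {
    "Comment": "Resize an image, wait, then finish",
    "StartAt": "ResizeImage",
    "States": {
        "ResizeImage": {
            "Type": "Task",
            "Resource": LAMBDA_ARN,
            # Only this step is retried on failure, not the whole workflow.
            "Retry": [
                {
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 5,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }
            ],
            "Next": "WaitBeforeDone",
        },
        "WaitBeforeDone": {"Type": "Wait", "Seconds": 10, "Next": "Done"},
        "Done": {"Type": "Succeed"},
    },
}


def check_transitions(definition: dict) -> bool:
    """Check that StartAt and every Next point at a defined state."""
    states = definition["States"]
    targets = [definition["StartAt"]]
    targets += [s["Next"] for s in states.values() if "Next" in s]
    return all(t in states for t in targets)


print(check_transitions(state_machine))
print(json.dumps(state_machine, indent=2))
```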

Amazon SageMaker: fully managed ML service
Continuous ML lifecycle
- Data collection, cleaning, preprocessing
- Build, Train, Evaluate ML models
- Model deployment, monitoring
Features
- SageMaker Studio: an extension of the JupyterLab interface that can launch notebooks / kernels with a specified environment
- Resource sharing: the kernels of a group of authorized users can share a pool of compute resources (CPU, GPU)
- SageMaker image: create a customized environment
- A collection of Tools
  - SageMaker Data Wrangler: create preprocessing pipelines with a set of summary statistics and visualization tools
  - SageMaker Feature Store: a managed repository to store and retrieve ML features
  - SageMaker Pipelines: CI/CD service for ML / Automation
  - SageMaker Autopilot: AutoML service
- Connect to other AWS services
  - Amazon DynamoDB: NoSQL key-value data storage
  - AWS Batch: offline batch processing
  - Amazon Kinesis: real-time processing
Multiple levels of abstraction for training and deploying ML models
- Highest level of abstraction: pre-trained ML models that can be deployed as-is
- Lowest level of abstraction: full ML lifecycle from scratch