Terminology
- Serverless vs Fully managed
- Serverless: No visibility into the machines / No server management / Pay for what you use
- Fully managed: Visibility and control of machines / Managed & automated / Pay for machine runtime
IAM (Identity and Access Management)
- Credential management system
- Policy: what effect / action / resource is allowed
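- Role: an identity with attached policies that a user or service can assume
- User: an individual identity; can have policies attached and can assume roles
- Group: a collection of users; policies attached to the group apply to all of its users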
- User / ARN should have correct level of access to the data / service
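A minimal sketch of the effect / action / resource structure of a policy, written as a Python dict that serializes to the JSON policy document format. The bucket name and the helper `read_only_s3_policy` are hypothetical and chosen only for illustration.

```python
import json

# Hypothetical bucket name, used only for illustration.
BUCKET = "example-bucket"


def read_only_s3_policy(bucket: str) -> dict:
    """Build a policy document that allows read-only access to one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",                            # what effect
                "Action": ["s3:GetObject", "s3:ListBucket"],  # what actions
                "Resource": [                                 # on which resources
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
            }
        ],
    }


policy = read_only_s3_policy(BUCKET)
print(json.dumps(policy, indent=2))
```

The bucket ARN covers `s3:ListBucket`, while the `/*` form covers the objects read with `s3:GetObject`.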
ARN (Amazon Resource Name)
- Uniquely identifies an AWS resource; required wherever a resource must be specified unambiguously, such as in
- IAM policies
- Amazon Relational Database Service (Amazon RDS) tags
- API calls
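An ARN follows the general layout `arn:partition:service:region:account-id:resource`. The sketch below splits an ARN into those fields; the helper `parse_arn` and the sample ARNs (account ID, function and bucket names) are hypothetical.

```python
from typing import NamedTuple


class Arn(NamedTuple):
    partition: str
    service: str
    region: str
    account_id: str
    resource: str


def parse_arn(arn: str) -> Arn:
    """Split an ARN into its fields."""
    # Split into at most 6 parts so colons inside the resource part survive.
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not a valid ARN: {arn!r}")
    return Arn(*parts[1:])


lambda_arn = parse_arn("arn:aws:lambda:us-east-1:123456789012:function:resize-image")
s3_arn = parse_arn("arn:aws:s3:::example-bucket/photo.jpg")
print(lambda_arn)
print(s3_arn)
```

Some services leave fields empty: the S3 ARN above has no region or account ID.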
S3 (Simple Storage Service)
Lambda
- Event-driven, serverless computing platform
- Features
- Similar to a cron job in Linux: runs on a schedule or in response to a trigger
- Advantage
- No need to configure servers, just write code
- Auto scaling: allocates resources based on workload
- Pay only for run time
- Example: process images uploaded to S3
- Trigger: S3 event (e.g., an object is created in the bucket)
- Workflow: Create S3 buckets -> Assign IAM roles -> Add trigger to Lambda -> Add script to Lambda -> Test
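A minimal sketch of the "script" step for the image example: a Python handler that reads the bucket and key from each record of the S3 event. The actual image processing is left as a placeholder, and the bucket and key in the sample event are hypothetical.

```python
from urllib.parse import unquote_plus


def lambda_handler(event, context):
    """Collect the S3 location of every object named in the triggering event."""
    processed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 events (spaces become '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        # Placeholder: download the object and process the image here.
        processed.append(f"s3://{bucket}/{key}")
    return {"processed": processed}


# Local dry run with a hand-written event that mimics the S3 notification shape.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "example-bucket"},
                "object": {"key": "uploads/my+photo.jpg"}}}
    ]
}
result = lambda_handler(sample_event, None)
print(result)
```

The IAM role assigned to the function would need read access to the source bucket for the placeholder step to work.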
Glue
- Serverless data integration (ETL) service
- Features
- Schema inference: extracts the schema and stores it in the metadata catalog
- Auto-generates ETL scripts
- AWS Lambda vs AWS Glue
- Glue performs ETL with a higher level of automation (e.g., schema inference, auto-generated ETL scripts); Lambda runs arbitrary code, so it can perform more custom data manipulation
- Lambda is suitable for smaller jobs; Glue is suitable for larger jobs / distributed processing
- Glue has more attached services (e.g., Data Catalog, Crawler, DataBrew, Elastic Views)
- Concepts
- Data catalog: central, persistent metadata store
- Database / table: a representation of the schema in Glue; the data still resides in the original store
- Partitions: folders in S3 that create a logical structure for the data
- Crawler: automatically detects the schema and loads it into the Data Catalog
- Athena: queries the data with SQL, using the table definitions in the Glue Data Catalog
- Connection: contains properties that are required to connect to a particular data store (e.g., RDS)
- Example: combine multiple data sources
- Data: Ad click log in JSON in S3 + user profile in RDS
- Glue: flatten JSON -> delete columns -> output processed Ad click data
- Athena: query to combine processed Ad click data with user profile
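The "flatten JSON -> delete columns" step can be illustrated in plain Python. This is only a stand-in for the logic: a real Glue job would run it on Spark over the whole dataset, and the field names in the sample record are hypothetical.

```python
def flatten(record: dict, parent: str = "", sep: str = ".") -> dict:
    """Flatten nested dicts into one level, joining key names with `sep`."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


def drop_columns(row: dict, columns: set) -> dict:
    """Return a copy of `row` without the named columns."""
    return {k: v for k, v in row.items() if k not in columns}


# Hypothetical ad click record, as it might appear in the JSON log.
click = {
    "user_id": 42,
    "ad": {"id": "a-1", "campaign": "spring"},
    "device": {"os": "ios", "ua": "Mozilla/5.0"},
}
row = drop_columns(flatten(click), {"device.ua"})
print(row)
```

After this step each click is a flat row keyed by `user_id`, which is the shape needed to join it with the user profile table in a SQL query.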
SNS (Simple Notification Service)
- Message publishing and processing service
- Features
- Multiple delivery protocols: email, HTTP/HTTPS endpoint, AWS SQS, SMS
- Fully managed: no need to worry about infrastructure
- Durable: messages are stored redundantly so that they are not lost
- Auto-scaling: can cover millions of consumers
- Application-to-person or application-to-application
- Why Application-to-application?
- Upstream services do not need any knowledge of downstream services
- No need to set up a sequential pipeline
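A minimal sketch of the application-to-application case: the upstream service only publishes to a topic and never names its consumers. The function takes the client as a parameter, so it works with `boto3.client("sns")` in AWS or with the local stub shown here. The topic ARN, the event fields, and `StubSnsClient` are hypothetical.

```python
import json


def publish_event(sns_client, topic_arn: str, event: dict) -> str:
    """Publish one event to an SNS topic and return the message ID."""
    response = sns_client.publish(TopicArn=topic_arn, Message=json.dumps(event))
    return response["MessageId"]


class StubSnsClient:
    """Local stand-in for boto3.client("sns"); records calls instead of sending."""

    def __init__(self):
        self.published = []

    def publish(self, TopicArn, Message):
        self.published.append((TopicArn, Message))
        return {"MessageId": f"stub-{len(self.published)}"}


TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:orders"  # hypothetical topic
stub = StubSnsClient()
message_id = publish_event(stub, TOPIC_ARN, {"order_id": 1, "status": "created"})
print(message_id, stub.published)
```

Adding a new downstream consumer then means adding a subscription to the topic, with no change to the publishing code.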
ECS (Elastic Container Service)
- Fully managed container orchestration service
- Features
- Serverless (Fargate) or Fully managed (EC2)
- Auto-scaling
- Workflow
- Collect required docker files
- AWS ECR: fully managed Docker container registry for storing, sharing, and deploying container images
- Define tasks: how to spin up containers, e.g., number of containers, port mapping, resources, etc.
- Define cluster: which cluster to launch the tasks in
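The "define tasks" step can be sketched as a task definition built in Python, using the field names of the ECS `RegisterTaskDefinition` API. The family name, image URI, port, and sizes are hypothetical, and a real Fargate task that pulls from ECR would typically also need an execution role, which is omitted here.

```python
def build_task_definition(family: str, image: str, container_port: int,
                          cpu: str = "256", memory: str = "512") -> dict:
    """Build a minimal Fargate task definition with a single container."""
    return {
        "family": family,
        "requiresCompatibilities": ["FARGATE"],
        "networkMode": "awsvpc",  # Fargate tasks use the awsvpc network mode
        "cpu": cpu,               # task-level CPU units, given as a string
        "memory": memory,         # task-level memory in MiB, given as a string
        "containerDefinitions": [
            {
                "name": family,
                "image": image,   # image stored in ECR
                "essential": True,
                "portMappings": [
                    {"containerPort": container_port, "protocol": "tcp"}
                ],
            }
        ],
    }


task_def = build_task_definition(
    family="web-app",  # hypothetical
    image="123456789012.dkr.ecr.us-east-1.amazonaws.com/web-app:latest",
    container_port=8080,
)
# With AWS credentials configured, this dict could be registered with:
#   boto3.client("ecs").register_task_definition(**task_def)
print(task_def)
```

The number of containers to run is chosen later, when the task or service is launched in a cluster.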
Step Functions
- Visual workflow service / workflow automation / Orchestration
- Why Step Functions?
- Allows an individual step to wait / fail / retry without rerunning the whole sequence of events
- The workflow (state machine) is defined in a JSON script whose steps reference resources by ARN; the flow chart is rendered from it
- Examples
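A small illustrative state machine in Amazon States Language, built as a Python dict and serialized to JSON. It shows one task with a retry policy, a wait step, and a failure branch. The Lambda ARN and the state names are hypothetical.

```python
import json

# Hypothetical Lambda function ARN used as the task resource.
LAMBDA_ARN = "arn:aws:lambda:us-east-1:123456789012:function:process-image"

definition = {
    "Comment": "Process an image, retrying the task on failure",
    "StartAt": "ProcessImage",
    "States": {
        "ProcessImage": {
            "Type": "Task",
            "Resource": LAMBDA_ARN,
            # Retry only this step: up to 3 attempts with exponential backoff.
            "Retry": [{
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": 5,
                "MaxAttempts": 3,
                "BackoffRate": 2.0,
            }],
            # If retries are exhausted, route to the failure state.
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "ReportFailure"}],
            "Next": "WaitBeforeFinish",
        },
        "WaitBeforeFinish": {"Type": "Wait", "Seconds": 10, "Next": "Done"},
        "Done": {"Type": "Succeed"},
        "ReportFailure": {
            "Type": "Fail",
            "Error": "ProcessingFailed",
            "Cause": "Image processing failed after retries",
        },
    },
}

definition_json = json.dumps(definition, indent=2)
print(definition_json)
```

The retry and catch rules are attached to the single `ProcessImage` step, so a failure there does not restart the states that already completed.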
SageMaker
- Fully managed ML service
- Continuous ML lifecycle
- Data collection, cleaning, preprocessing
- Build, Train, Evaluate ML models
- Model deployment, monitoring
- Features
- SageMaker Studio: an extension of the JupyterLab interface; can launch a notebook / kernel with a specified environment
- Resource sharing: the kernels of a group of authorized users can share a pool of compute resources (CPU, GPU)
- SageMaker image: create customized environment
- A collection of Tools
- SageMaker Data Wrangler: create preprocessing pipelines with a set of summary statistics and visualization tools
- SageMaker Feature Store: a managed repository to store and retrieve ML features
- SageMaker Pipelines: CI/CD service for ML / Automation
- SageMaker Autopilot: AutoML service
- Connect to other AWS services
- Amazon DynamoDB database: structured data storage
- AWS Batch: offline batch processing
- Amazon Kinesis: real-time processing
- Multiple levels of abstraction for training and deploying ML models
- Highest level of abstraction: pre-trained ML models that can be deployed as-is
- Lowest level of abstraction: full ML lifecycle from scratch
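Whatever level of abstraction is used to build it, a deployed model is typically called through an endpoint. The sketch below shows such a call with the `invoke_endpoint` operation of the `sagemaker-runtime` client. The endpoint name, the JSON payload shape, and the stub client are assumptions: the real request and response formats depend on the deployed model.

```python
import io
import json


def predict(runtime_client, endpoint_name: str, features: list) -> dict:
    """Send one feature vector to a deployed endpoint and decode the JSON reply."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps({"instances": [features]}),  # assumed payload shape
    )
    # The response body is a stream, so it is read before decoding.
    return json.loads(response["Body"].read())


class StubRuntimeClient:
    """Local stand-in for boto3.client("sagemaker-runtime")."""

    def invoke_endpoint(self, EndpointName, ContentType, Body):
        n = len(json.loads(Body)["instances"])
        reply = json.dumps({"predictions": [0.5] * n}).encode()
        return {"Body": io.BytesIO(reply)}


result = predict(StubRuntimeClient(), "example-endpoint", [1.0, 2.0, 3.0])
print(result)
```

Passing the client in as a parameter keeps the function runnable locally with the stub and unchanged when a real client is supplied.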