AWS offers multiple services for analytics, ranging from simple S3 Select to Amazon EMR, which is managed Hadoop. In this article we will look at each of these offerings and understand their differences and use cases.
- Amazon S3 Select and S3 Glacier Select support only the SELECT SQL command.
- Data in object storage has traditionally been accessed as whole entities, meaning that when you ask for a 5 gigabyte object you get all 5 gigabytes. Select for S3 and Glacier lets you use simple SQL expressions to pull out only the bytes you need from those objects.
- This partial data retrieval ability is especially useful for serverless applications built with AWS Lambda.
- Amazon Athena, Amazon Redshift, and Amazon EMR, as well as partners like Cloudera, Databricks, and Hortonworks, will all support S3 Select.
SELECT d.dir_name, d.files FROM S3Object[*] d
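As a rough illustration of how an expression like the one above might be sent from Python, here is a minimal sketch built around boto3's `select_object_content` call. The helper names `build_select_request` and `run_select` are hypothetical, and the sketch assumes the object is a single JSON document; the client is passed in so the code does not depend on credentials being configured.

```python
def build_select_request(bucket, key, expression):
    """Build keyword arguments for boto3's S3 `select_object_content` call."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        # Assumption: the object is one JSON document, not JSON Lines.
        "InputSerialization": {"JSON": {"Type": "DOCUMENT"}},
        "OutputSerialization": {"JSON": {}},
    }


def run_select(s3_client, bucket, key, expression):
    """Run the query and return the selected bytes decoded as UTF-8."""
    response = s3_client.select_object_content(
        **build_select_request(bucket, key, expression)
    )
    chunks = []
    # The payload is an event stream; only "Records" events carry data.
    for event in response["Payload"]:
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
    return b"".join(chunks).decode("utf-8")


# Usage (needs boto3 and AWS credentials; bucket and key are hypothetical):
#   import boto3
#   text = run_select(boto3.client("s3"), "example-bucket", "listing.json",
#                     "SELECT d.dir_name, d.files FROM S3Object[*] d")
```

Only the selected fields travel over the network, which is the point of the feature: the caller never downloads the whole object.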
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. You don’t even need to load your data into Athena; it works directly with data stored in S3.
- Supports only S3
- Serverless. Zero infrastructure. Zero administration.
- Easy to query, just use standard SQL
- Pay per query
- Integrated with AWS Glue
- Amazon Athena integrates with Amazon QuickSight for easy visualization.
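To make the "pay per query, no infrastructure" model concrete, the following sketch starts an Athena query and polls until it finishes, using boto3's `start_query_execution` and `get_query_execution` calls. The function name `run_athena_query` and its parameters are hypothetical; the database name and the S3 output location are assumptions you would replace with your own.

```python
import time


def run_athena_query(athena_client, sql, database, output_location,
                     poll_seconds=1.0):
    """Start a query, wait for a terminal state, return (query_id, state)."""
    started = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        # Athena writes result files to this S3 location.
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    while True:
        execution = athena_client.get_query_execution(
            QueryExecutionId=query_id
        )
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return query_id, state
        time.sleep(poll_seconds)


# Usage (needs boto3 and AWS credentials; all names are hypothetical):
#   import boto3
#   run_athena_query(boto3.client("athena"), "SELECT 1",
#                    "example_db", "s3://example-bucket/athena-results/")
```

The query runs asynchronously, which is why the sketch polls for a terminal state instead of expecting rows back from the first call.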
Amazon QuickSight is a scalable, serverless, embeddable, machine learning-powered business intelligence (BI) service built for the cloud. It can automatically scale to tens of thousands of users without any infrastructure to manage or capacity to plan for.
- A BI tool comparable to Power BI
- Data sources: file upload (CSV, Excel), S3, Redshift, RDS, Salesforce, Athena
- Built on SPICE, a Super-fast, Parallel, In-memory Calculation Engine
- Scale from tens to tens of thousands of users
- Embed BI dashboards in your applications
- Ask questions of your data, receive answers
- Pay-per-session pricing
AWS Glue is a fully managed ETL service. Glue has three main components:
- The AWS Glue Data Catalog
  - The Data Catalog is your persistent metadata store.
  - It is a managed service that lets you store, annotate, and share metadata in the AWS Cloud in the same way you would in an Apache Hive metastore.
  - It is an index to the location, schema, and runtime metrics of your data.
  - It contains references to data that is used as sources and targets of your extract, transform, and load (ETL) jobs in AWS Glue.
  - It is compatible with the Apache Hive metastore and can serve as a replacement for it.
- AWS Glue crawlers and classifiers
  - Crawlers scan data in all kinds of repositories, classify it, extract schema information from it, and store the metadata automatically in the AWS Glue Data Catalog.
- Fully managed ETL
  - A fully managed ETL service that allows you to transform and move data to various destinations.
AWS Glue provides both visual and code-based interfaces to make data integration easier.
- AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development.
- Crawlers that infer schema
- Auto-generated ETL scripts
- AWS Glue provides a console and API operations to set up and manage your extract, transform, and load (ETL) workload.
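The crawler-to-catalog flow described above can be sketched with boto3's Glue client. This is a minimal illustration, not a complete setup: the helper names `crawl_s3_path` and `list_table_schemas` are hypothetical, the IAM role and S3 path are assumptions, and `list_table_schemas` reads only the first page of `get_tables` results.

```python
def crawl_s3_path(glue_client, crawler_name, role_arn, database, s3_path):
    """Create a crawler for one S3 path and start it."""
    glue_client.create_crawler(
        Name=crawler_name,
        Role=role_arn,            # assumed IAM role with access to the path
        DatabaseName=database,    # Data Catalog database to write tables to
        Targets={"S3Targets": [{"Path": s3_path}]},
    )
    glue_client.start_crawler(Name=crawler_name)


def list_table_schemas(glue_client, database):
    """Return {table_name: [(column, type), ...]} from the Data Catalog.

    Only the first page of results is read; a fuller version would
    follow the pagination token.
    """
    response = glue_client.get_tables(DatabaseName=database)
    schemas = {}
    for table in response["TableList"]:
        columns = table.get("StorageDescriptor", {}).get("Columns", [])
        schemas[table["Name"]] = [
            (col["Name"], col.get("Type", "")) for col in columns
        ]
    return schemas


# Usage (needs boto3 and AWS credentials; all names are hypothetical):
#   import boto3
#   glue = boto3.client("glue")
#   crawl_s3_path(glue, "example-crawler",
#                 "arn:aws:iam::123456789012:role/example-glue-role",
#                 "example_db", "s3://example-bucket/raw/")
#   print(list_table_schemas(glue, "example_db"))
```

Once the crawler has populated the catalog, the same table definitions are what Athena queries rely on for schema information.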
- Amazon Redshift is a cloud data warehouse
- Deepest integration with your data lake and AWS services
- Best performance
- Most scalable
- Best Value
- Easy to manage
- Most secure and compliant
- It allows you to run complex analytic queries against terabytes to petabytes of structured and semi-structured data, using sophisticated query optimization, columnar storage on high-performance storage, and massively parallel query execution.
- Amazon Redshift also includes Amazon Redshift Spectrum, allowing you to run SQL queries directly against exabytes of unstructured data in Amazon S3 data lakes.
- With Redshift Spectrum, you can analyze large amounts of data in its native format without having to load the data into Redshift first.
- AQUA (Advanced Query Accelerator) is a new distributed and hardware-accelerated cache that enables Redshift to run up to 10x faster than any other enterprise cloud data warehouse.
- You can load data into Amazon Redshift from a range of data sources including Amazon S3, Amazon RDS, Amazon DynamoDB, Amazon EMR, AWS Glue, AWS Data Pipeline, or any SSH-enabled host on Amazon EC2 or on-premises.
- There are two types of snapshots: automated and manual. Amazon Redshift stores these snapshots internally in Amazon S3 by using an encrypted Secure Sockets Layer (SSL) connection.
Amazon EMR is the industry-leading cloud big data platform for processing vast amounts of data using open source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto.
- Easy to use
- Low cost
- You can deploy your workloads to EMR using Amazon EC2, Amazon Elastic Kubernetes Service (EKS), or on-premises AWS Outposts.
- Amazon EMR lets you focus on transforming and analyzing your data without having to worry about managing compute capacity or open-source applications, and saves you money. Using EMR, you can instantly provision as much or as little capacity as you like on Amazon EC2 and set up scaling rules to manage changing compute demand.
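One way to express the scaling rules mentioned above is EMR managed scaling, where you give the cluster a minimum and maximum capacity and EMR resizes within those limits. The sketch below uses boto3's `put_managed_scaling_policy` call; the helper names are hypothetical and the cluster ID is an assumption. Managed scaling is only one of the scaling mechanisms EMR offers, chosen here because it needs the least configuration.

```python
_UNIT_TYPES = ("Instances", "InstanceFleetUnits", "VCPU")


def managed_scaling_policy(min_units, max_units, unit_type="Instances"):
    """Build the ManagedScalingPolicy structure for an EMR cluster."""
    if unit_type not in _UNIT_TYPES:
        raise ValueError(f"unknown unit type: {unit_type!r}")
    if not 0 < min_units <= max_units:
        raise ValueError("expected 0 < min_units <= max_units")
    return {
        "ComputeLimits": {
            "UnitType": unit_type,
            "MinimumCapacityUnits": min_units,
            "MaximumCapacityUnits": max_units,
        }
    }


def apply_managed_scaling(emr_client, cluster_id, min_units, max_units):
    """Attach the policy to a running cluster."""
    emr_client.put_managed_scaling_policy(
        ClusterId=cluster_id,
        ManagedScalingPolicy=managed_scaling_policy(min_units, max_units),
    )


# Usage (needs boto3 and AWS credentials; the cluster ID is hypothetical):
#   import boto3
#   apply_managed_scaling(boto3.client("emr"), "j-EXAMPLE", 2, 10)
```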
How these work together
Below is a sample architecture diagram showing how these different services can work together.
| | S3 Select | Athena | QuickSight | Glue | Redshift | EMR |
|---|---|---|---|---|---|---|
| Keyword | Partial file fetch | Ad-hoc queries | Dashboards | ETL | Data warehouse | Big data (Hadoop) |
| Input | S3 | S3 | File upload (CSV, Excel), S3, Redshift, RDS, Salesforce, Athena | RDS, Redshift, DynamoDB, S3, MySQL | Amazon S3, Amazon RDS, Amazon DynamoDB, Amazon EMR, AWS Glue, AWS Data Pipeline | |
| Purpose | Fetch selected data from a file (avoid loading the whole file) | Query data in S3 for analytics without loading it | Load data and display dashboards and analytics | ETL | Data warehouse | Big data processing |