How to Integrate and Query Data From AWS S3 Data Using Athena

Posted By :Shivani Chaudhary |30th July 2020

Integrate and Query Data From AWS S3 Data Using Athena

 

 

AWS or Amazon Web Services are acting as significant virtual storage, compute, and network platform. The data powerhouse, AWS S3 is gaining traction among developers for simplifying web-scale computing. On the other hand, Athena is complementary service that streamlines data analysis in S3. Both these services are integral parts of the machine learning tech stack. We, at Oodles, as a provider of AI Development Services, discuss how to integrate and query AWS S3 data using Athena and Quicksight Visualization.

 

 

AWS S3

Amazon Simple Storage Service is storage for the net. AWS is designed to make web-scale cloud computing easier for developers.

Amazon S3 features a simple web services interface that you simply can use to store and retrieve any amount of knowledge, at any time, from anywhere on the online. It provides developer entrance to an equivalent highly scalable, reliable, fast, inexpensive data storage infrastructure that Amazon handles to run its global network of web sites. 

 

The service proposes to optimize machine learning development for data-driven solutions. 

 

Pre-requisites for AWS S3 Data Analysis Using Athena 

 

  1. Global Upload Settings – (Optional) Use this element to specify import settings for the Amazon S3 files, like field delimiters. If this element is not specified, Amazon QuickSight uses the default values for the fields in this section.

 

  1. Data Format – (Optional) Define the format of the files to be denoted. Valid formats are CSV, JSON, CLF, ELF, and TSV. The default format value is CSV.

 

  1. Delimiter – (Optional) Define the file field delimiter. This requirement map to the file type placed out in the format field. Original formats are commas (,) for .csv files and tabs (t) for .tsv files. The default format value is the comma (,).

 

Important: The Data should be stored in a bucket inside a folder. The single record of data should be stored in a single line and the next record start with the next line. The JSON fields end with a comma(,) then the next record should not contain any space.

 

Bucket Example: ProjectNameData -> folderName -> .json files or file

 

AWS Athena

Amazon Athena maintains a subset of Data Definition Language (DDL) statements and ANSI SQL functions and operators to define. The query external tables where data resides in Amazon Simple Storage Service.

While we create a database and table in Athena, we need to specify the schema and the location of the data (S3 bucket location), making the data in the table ready for real-time querying.

 

Example:

Database Name: testDatabase

Table Name: testTable

Location: S3://ProjectNameData/folderName/                                    // S3 path S3://bucket/folder/ 

Row format: 'org.openx.data.jsonserde.JsonSerDe'          // for SERDE to read JSON data it should be valid single object type (Other formats will not work)

 

S3 to Athena Steps:

  1. Create a database and then decide Run Query, or press Ctrl+ENTER

Query: create database testDatabase;

  1. Create a table with fields.

Query: CREATE EXTERNAL TABLE IF NOT EXISTS `testDatabase`.testTable (

         `id` int,

         `dob` timestamp,

         `name` string,

         `address` string,

         `email` string,

         `company` STRUCT < `userId`: string,                                                                              // This is for Nested data

         `userRole`: array<string> >

ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'

 WITH SERDEPROPERTIES ('ignore.malformed.json' = 'true')

LOCATION 'S3://ProjectNameData/folderName/'

 

  1. View result with 

Query: select * from ‘testdatabase'.testtable;


 

 


 

At Oodles, we are an experiential team of AI software architects, developers, and data analysts. Our tech stack for machine learning is packed with robust tools and technologies like AWS, Azure, Google Cloud, and IBM Watson. We build dynamic AI solutions and applications that automate critical business operations effectively.

 

Join hands with our AI Development team to explore more about our AI capabilities.

 

 


About Author

Shivani Chaudhary

Shivani is a quick learner, self problem-solving, quickly grasp new things and hard-working Java Developer. She has good knowledge about Java, Spring MVC, Spring boot, Spring Data JPA, Hibernate, REST Web Services. Her hobbies are travelling and learning

Request For Proposal

[contact-form-7 404 "Not Found"]

Ready to innovate ? Let's get in touch

Chat With Us