Setting Up a Data Lake in AWS with a Simple PoC Using an RDS Database: A Step-by-Step Guide

Nov 8, 2024

Setting up a Proof of Concept (PoC) for a data lake in AWS involves several steps, from configuring databases to transforming and cataloging data using AWS Glue. This guide outlines the process, focusing on efficient configuration, security, and creating a robust ETL pipeline.

Step 1: Set Up an RDS Database Instance

Create an RDS Database with Free Tier Options:

In AWS RDS, create a database instance using the free tier plan.
Select the necessary configurations based on your workload requirements, such as database engine (e.g., MySQL, PostgreSQL).
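Enable public access so the database can be reached from external applications, including local SQL tools. Public access only gives the instance a reachable endpoint; the security group rules still decide who can actually connect.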
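
The console choices above can also be expressed as parameters for the RDS CreateDBInstance API. The following is a minimal sketch, not a definitive setup: the instance identifier, username, and password are hypothetical placeholders, and the instance class and storage size are assumptions about what is free-tier eligible in your account. The client passed to create_instance is expected to be boto3.client("rds").

```python
def free_tier_db_params(identifier, username, password, engine="mysql"):
    """Build keyword arguments for the RDS CreateDBInstance call."""
    return {
        "DBInstanceIdentifier": identifier,
        "Engine": engine,                  # e.g. "mysql" or "postgres"
        "DBInstanceClass": "db.t3.micro",  # assumption: free-tier-eligible class
        "AllocatedStorage": 20,            # GiB; assumption: free-tier allowance
        "MasterUsername": username,
        "MasterUserPassword": password,
        "PubliclyAccessible": True,        # lets local SQL tools reach the endpoint
    }


def create_instance(rds_client, params):
    """Create the instance; rds_client is expected to be boto3.client("rds")."""
    return rds_client.create_db_instance(**params)


# Hypothetical values for illustration only.
params = free_tier_db_params("poc-datalake-db", "poc_admin", "change-me-please")
print(params["Engine"], params["DBInstanceClass"])
```

Keeping the parameters in a plain dictionary makes them easy to review before anything is created.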

Configure Security and Inbound Rules:

Update the RDS security group to allow inbound traffic from your IP address on the database port (3306 for MySQL, 5432 for PostgreSQL), so that SQL tools such as TablePlus or DBeaver can connect.
Self-Referencing for Glue Access: Add an inbound rule whose source is the security group itself (a self-referencing rule) covering all TCP ports. AWS Glue requires such a rule on the security group it uses in order to establish connections with the RDS database.
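
The two inbound rules can be sketched as the IpPermissions structure accepted by the EC2 AuthorizeSecurityGroupIngress API. The IP address and security group ID below are hypothetical placeholders, and the port assumes MySQL. The client passed to apply_rules is expected to be boto3.client("ec2").

```python
def inbound_rules(my_ip, security_group_id, db_port=3306):
    """Return the two inbound rules described above as IpPermissions entries."""
    return [
        {   # Rule 1: a single workstation may reach the database port.
            "IpProtocol": "tcp",
            "FromPort": db_port,
            "ToPort": db_port,
            "IpRanges": [{"CidrIp": f"{my_ip}/32", "Description": "local SQL client"}],
        },
        {   # Rule 2: self-referencing rule, all TCP ports, source is the group itself.
            "IpProtocol": "tcp",
            "FromPort": 0,
            "ToPort": 65535,
            "UserIdGroupPairs": [{"GroupId": security_group_id}],
        },
    ]


def apply_rules(ec2_client, security_group_id, rules):
    """Attach the rules; ec2_client is expected to be boto3.client("ec2")."""
    return ec2_client.authorize_security_group_ingress(
        GroupId=security_group_id, IpPermissions=rules
    )


# Hypothetical values for illustration only.
rules = inbound_rules("203.0.113.10", "sg-0123456789abcdef0")
print(rules[0]["IpRanges"][0]["CidrIp"])  # → 203.0.113.10/32
```

Restricting the first rule to a /32 address keeps the publicly accessible instance closed to everyone except the one workstation.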

Create Sample Tables:

Use SQL commands to create sample tables, such as customers and orders, in the database. Tools like TablePlus, DBeaver, or MySQL Workbench are useful for interacting with the RDS instance.
Define these tables using DDL (Data Definition Language) commands, preferring common column types (integers, decimals, strings, dates, timestamps) that map cleanly to AWS Glue data types and schemas.
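
As an illustration, the sample tables might look like the DDL below. The column names are assumptions, since the guide only names the tables. So that the sketch runs anywhere, it executes the DDL against an in-memory SQLite database as a local stand-in; against the RDS instance you would run the same statements from your SQL tool or a MySQL/PostgreSQL driver, adjusting types to the engine.

```python
import sqlite3

# Hypothetical schema: only the table names come from the guide.
DDL = """
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    full_name   VARCHAR(100) NOT NULL,
    email       VARCHAR(255),
    created_at  TIMESTAMP
);
CREATE TABLE orders (
    order_id     INTEGER PRIMARY KEY,
    customer_id  INTEGER NOT NULL,
    order_date   DATE NOT NULL,
    total_amount DECIMAL(10, 2),
    FOREIGN KEY (customer_id) REFERENCES customers (customer_id)
);
"""

# SQLite is used here only as a local stand-in for the RDS database.
conn = sqlite3.connect(":memory:")
conn.executescript(DDL)

tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(tables)  # → ['customers', 'orders']
```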

Step 2: Establish AWS Glue Connections

Create a Glue Connection to the RDS Database:

In the AWS Glue Console, set up a connection to the RDS instance, choosing the appropriate database engine and configuration.
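Configure the connection’s Virtual Private Cloud (VPC) settings: choose the VPC and a subnet that can reach the RDS instance, together with the security group that carries the self-referencing rule, so that AWS Glue and RDS can communicate over the network.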
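
For reference, the same connection can be described as a ConnectionInput structure for the Glue CreateConnection API. This is a sketch under assumptions: the connection name, JDBC URL, credentials, subnet, security group, and availability zone are all hypothetical placeholders. The client passed to create_connection is expected to be boto3.client("glue").

```python
def jdbc_connection_input(name, jdbc_url, username, password,
                          subnet_id, security_group_id, availability_zone):
    """Build the ConnectionInput for a JDBC connection to the RDS instance."""
    return {
        "Name": name,
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": jdbc_url,
            "USERNAME": username,
            "PASSWORD": password,
        },
        # VPC settings: where Glue places its network interfaces.
        "PhysicalConnectionRequirements": {
            "SubnetId": subnet_id,
            "SecurityGroupIdList": [security_group_id],
            "AvailabilityZone": availability_zone,
        },
    }


def create_connection(glue_client, connection_input):
    """Create the connection; glue_client is expected to be boto3.client("glue")."""
    return glue_client.create_connection(ConnectionInput=connection_input)


# Hypothetical values for illustration only.
connection_input = jdbc_connection_input(
    name="poc-rds-connection",
    jdbc_url="jdbc:mysql://poc-datalake-db.example.us-east-1.rds.amazonaws.com:3306/poc",
    username="poc_admin",
    password="change-me-please",
    subnet_id="subnet-0123456789abcdef0",
    security_group_id="sg-0123456789abcdef0",
    availability_zone="us-east-1a",
)
print(connection_input["ConnectionType"])  # → JDBC
```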

Set Up an S3 Endpoint:

Create a gateway VPC endpoint for Amazon S3 in the VPC used by the Glue connection. The S3 buckets reached through it will store transformed data in various layers (bronze, silver, and gold).
This endpoint lets Glue jobs and crawlers running inside the VPC reach S3 privately, without routing through the public internet, when transferring data between Glue and S3.
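
A gateway endpoint is defined by the VPC, the regional S3 service name, and the route tables that should receive the S3 route. The sketch below builds those parameters for the EC2 CreateVpcEndpoint API; the VPC ID, region, and route table ID are hypothetical placeholders. The client passed to create_endpoint is expected to be boto3.client("ec2").

```python
def s3_gateway_endpoint_params(vpc_id, region, route_table_ids):
    """Build keyword arguments for a gateway VPC endpoint to Amazon S3."""
    return {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.s3",
        "RouteTableIds": list(route_table_ids),
    }


def create_endpoint(ec2_client, params):
    """Create the endpoint; ec2_client is expected to be boto3.client("ec2")."""
    return ec2_client.create_vpc_endpoint(**params)


# Hypothetical values for illustration only.
endpoint_params = s3_gateway_endpoint_params(
    "vpc-0123456789abcdef0", "us-east-1", ["rtb-0123456789abcdef0"]
)
print(endpoint_params["ServiceName"])  # → com.amazonaws.us-east-1.s3
```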

Test the Glue Connection:

Run a quick test on the Glue connection to verify successful integration with the RDS instance.

Step 3: Run Glue Crawlers to Populate the Data Catalog

Create and Run Crawlers:

In AWS Glue, create crawlers to scan the RDS database, specifically the customers and orders tables.
Crawlers automatically populate metadata in the Glue Data Catalog, creating schemas for the tables and ensuring data accessibility in the Glue environment.
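
A crawler over a JDBC source needs an IAM role, a target catalog database, and the Glue connection created earlier. The following sketch builds the parameters for the Glue CreateCrawler API; the crawler name, role ARN, catalog database, connection name, and source database are hypothetical. The include path uses the database/% form, which suits MySQL; engines with schemas, such as PostgreSQL, typically use database/schema/% instead. The client passed to create_and_start is expected to be boto3.client("glue").

```python
def crawler_params(name, role_arn, catalog_database, connection_name, source_database):
    """Build keyword arguments for a crawler that scans every table in a JDBC database."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": catalog_database,  # Glue Data Catalog database to populate
        "Targets": {
            "JdbcTargets": [
                {
                    "ConnectionName": connection_name,
                    "Path": f"{source_database}/%",  # % matches all tables
                }
            ]
        },
    }


def create_and_start(glue_client, params):
    """Create the crawler and start it; glue_client is expected to be boto3.client("glue")."""
    glue_client.create_crawler(**params)
    glue_client.start_crawler(Name=params["Name"])


# Hypothetical values for illustration only.
crawler = crawler_params(
    name="poc-rds-crawler",
    role_arn="arn:aws:iam::123456789012:role/poc-glue-role",
    catalog_database="poc_catalog",
    connection_name="poc-rds-connection",
    source_database="poc",
)
print(crawler["Targets"]["JdbcTargets"][0]["Path"])  # → poc/%
```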

Verify the Glue Data Catalog:

After the crawler runs, confirm that the database and tables are accurately represented in the Glue Data Catalog, with appropriate schema details for data discovery.
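
One way to check the catalog outside the console is to summarize the response of the Glue GetTables API into table names and column types. The sketch below works on a response-shaped dictionary; the sample response is invented for illustration and is not real crawler output. With a real client, the input would come from boto3.client("glue").get_tables(DatabaseName=...), and this sketch reads a single page only.

```python
def summarize_tables(get_tables_response):
    """Map each catalog table name to a {column name: column type} dictionary."""
    summary = {}
    for table in get_tables_response.get("TableList", []):
        columns = table.get("StorageDescriptor", {}).get("Columns", [])
        summary[table["Name"]] = {col["Name"]: col["Type"] for col in columns}
    return summary


# Invented sample shaped like a GetTables response (illustration only).
sample_response = {
    "TableList": [
        {
            "Name": "poc_customers",
            "StorageDescriptor": {
                "Columns": [
                    {"Name": "customer_id", "Type": "int"},
                    {"Name": "full_name", "Type": "string"},
                ]
            },
        },
        {
            "Name": "poc_orders",
            "StorageDescriptor": {
                "Columns": [
                    {"Name": "order_id", "Type": "int"},
                    {"Name": "total_amount", "Type": "decimal(10,2)"},
                ]
            },
        },
    ]
}

catalog = summarize_tables(sample_response)
print(sorted(catalog))  # → ['poc_customers', 'poc_orders']
```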

Step 4: Build and Manage ETL Jobs with Glue

Create Glue Jobs for Data Transformation:

Design Glue jobs to handle extraction, transformation, and loading (ETL) processes according to the specific use-case or problem statement.
Configure different jobs for each transformation stage, using S3 buckets to store data in different stages:
Bronze Layer: Store raw data in Parquet format. This layer contains unprocessed data directly from the source.
Silver Layer: Apply initial transformations, such as excluding personally identifiable information (PII) or merging data through joins.
Gold Layer: Further transform and enrich the data to make it analytics-ready, suitable for data analysts or scientists.
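
To make the layers concrete, the transformation logic can be sketched in plain Python on small in-memory records. This is only an illustration of the logic, not Glue code: a real Glue job would express the same steps with Spark DataFrames or Glue DynamicFrames and write Parquet to S3. The column names, the choice of PII columns, and the gold-layer metric (total spend per customer) are assumptions.

```python
PII_COLUMNS = {"email", "phone"}  # assumption: which columns count as PII


def drop_pii(rows, pii_columns=PII_COLUMNS):
    """Silver step 1: remove PII columns from every record."""
    return [{k: v for k, v in row.items() if k not in pii_columns} for row in rows]


def join_orders_to_customers(orders, customers):
    """Silver step 2: inner join of orders to customers on customer_id."""
    by_id = {c["customer_id"]: c for c in customers}
    joined = []
    for order in orders:
        customer = by_id.get(order["customer_id"])
        if customer is not None:
            joined.append({**customer, **order})
    return joined


def spend_per_customer(silver_rows):
    """Gold step: aggregate total order amount per customer."""
    totals = {}
    for row in silver_rows:
        cid = row["customer_id"]
        totals[cid] = totals.get(cid, 0) + row["total_amount"]
    return totals


# Bronze: raw records as they arrive from the source (invented sample data).
bronze_customers = [
    {"customer_id": 1, "full_name": "Ada", "email": "ada@example.com"},
    {"customer_id": 2, "full_name": "Lin", "email": "lin@example.com"},
]
bronze_orders = [
    {"order_id": 10, "customer_id": 1, "total_amount": 25.0},
    {"order_id": 11, "customer_id": 1, "total_amount": 15.0},
    {"order_id": 12, "customer_id": 2, "total_amount": 40.0},
]

silver = join_orders_to_customers(bronze_orders, drop_pii(bronze_customers))
gold = spend_per_customer(silver)
print(gold)  # → {1: 40.0, 2: 40.0}
```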

Organize Data in S3 Buckets:

Partition the data in S3 for efficient querying and access. Separate folders for each layer (bronze, silver, and gold) allow for clear organization and streamlined access patterns for users.
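
One common layout, shown here as an assumption rather than the only option, puts each layer and table under its own prefix and partitions by date using Hive-style key=value folder names, which Glue crawlers and query engines such as Athena recognize as partitions. The bucket name below is a hypothetical placeholder.

```python
from datetime import date

LAYERS = ("bronze", "silver", "gold")


def layer_prefix(bucket, layer, table, day):
    """Return the S3 prefix for one table, in one layer, for one day."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer!r}")
    return (
        f"s3://{bucket}/{layer}/{table}/"
        f"year={day.year}/month={day.month:02d}/day={day.day:02d}/"
    )


# Hypothetical bucket name for illustration only.
prefix = layer_prefix("poc-datalake", "silver", "orders", date(2024, 11, 8))
print(prefix)  # → s3://poc-datalake/silver/orders/year=2024/month=11/day=08/
```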

Step 5: Validate and Optimize

Test the ETL Pipeline:

Run each Glue job and monitor the job logs for any issues. Validate that the data in each stage aligns with the transformation goals and business requirements.
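
Running and monitoring a job can also be scripted. The sketch below starts a job run and polls its state until it reaches a terminal state, using the Glue StartJobRun and GetJobRun APIs. The job name is whatever you called your job, the polling interval is an arbitrary choice, and the client is expected to be boto3.client("glue").

```python
import time

# States after which a run will not change any further.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def run_job_and_wait(glue_client, job_name, poll_seconds=30, sleep=time.sleep):
    """Start a Glue job run and block until it finishes.

    Returns (run_id, final_state, error_message); error_message is None
    when the run reports no error.
    """
    run_id = glue_client.start_job_run(JobName=job_name)["JobRunId"]
    while True:
        run = glue_client.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
        state = run["JobRunState"]
        if state in TERMINAL_STATES:
            return run_id, state, run.get("ErrorMessage")
        sleep(poll_seconds)
```

A call such as run_job_and_wait(boto3.client("glue"), "my-bronze-job") would then return once the run has ended; the detailed logs remain in CloudWatch, as linked from the job run in the Glue console.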

Optimize Performance and Cost:

Review configurations in Glue and S3 to optimize storage costs, especially by leveraging partitioning and columnar formats (e.g., Parquet).
Regularly assess Glue job execution time and adjust configurations as needed for efficient data processing.
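
To assess execution time over several runs, the job run records returned by the Glue GetJobRuns API can be summarized. Each record carries an ExecutionTime in seconds and a JobRunState. The sketch below works on a list shaped like the JobRuns field of that response; the sample records are invented for illustration.

```python
def execution_stats(job_runs):
    """Summarize execution time (seconds) over successful runs only."""
    times = [
        run["ExecutionTime"]
        for run in job_runs
        if run.get("JobRunState") == "SUCCEEDED"
    ]
    if not times:
        return None
    return {
        "runs": len(times),
        "avg_seconds": sum(times) / len(times),
        "max_seconds": max(times),
    }


# Invented sample shaped like the JobRuns list of a GetJobRuns response.
sample_runs = [
    {"JobRunState": "SUCCEEDED", "ExecutionTime": 120},
    {"JobRunState": "SUCCEEDED", "ExecutionTime": 180},
    {"JobRunState": "FAILED", "ExecutionTime": 30},
]

stats = execution_stats(sample_runs)
print(stats)  # → {'runs': 2, 'avg_seconds': 150.0, 'max_seconds': 180}
```

Tracking these numbers before and after a configuration change gives a simple basis for judging whether the change helped.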