By Jitesh Bhattacharjee, Delivery Partner – TCS
By Nicolas Weydert, Chief Architect – TCS
By Sanjay Gupta, Sr. Partner Solutions Architect – AWS
Many organizations use unstructured data collected from social media feeds, stock streaming, and clickstreams to gain insights about the needs of their customers. They use this information to customize their products and improve customer experience using data lake solutions.
A Lake House architecture is defined by a central repository (data lake) that allows ingestion of unstructured, structured, and real-time data, which is then consumed by various processes such as analytics engines, data warehouses, machine learning (ML) models, and visualization tools.
The EZ Lake Access (EZLA) solution developed by Tata Consultancy Services (TCS), an AWS Premier Consulting Partner, centralizes and simplifies access management of the Data Lake House by codifying most of the enterprise access controls in the form of a rule engine. This provides increased efficiencies and easy adoption of the Data Lake House.
In this post, we’ll describe the Lake House ecosystem, complexities, and common challenges. We’ll also discuss the TCS EZLA solution overview, architecture, and functions, and review the benefits of the solution as a case study from a large life science enterprise.
Enterprise Lake House Ecosystem
Most organizations looking to build Lake House-based solutions today choose Amazon Web Services (AWS) because of the availability of custom solutions and the depth and breadth of AWS offerings.
For instance, Amazon Simple Storage Service (Amazon S3) offers durability, availability, performance, security, and virtually unlimited scalability at low cost. This makes Amazon S3 a great choice for a data lake, which is a core component of AWS Lake House implementations.
Figure 1 – Enterprise Lake House architecture.
In a Lake House approach, various stakeholders access the data through the Access Management Layer for purposes that range from developing insights and ensuring security to creating and training ML models.
These stakeholders form the top layer of the Lake House and create value by the extraction of meaningful data and inferences, while Amazon S3 provides a central repository.
Increased demand to extract information and knowledge from the data lake has made services like Amazon SageMaker, Amazon Redshift, AWS Glue, and Amazon Athena an integral part of the Lake House ecosystem as well.
Lake House Access Management Complexities
As a Lake House comprises a central repository for data—which is a collection from various sources—organizations need a complex set of rules to manage their Lake House access.
In creating these rules, they often face challenges:
- Compliance requirements and sensitivity of data.
- Demand to log approval processes and track every change for audit purposes.
- Granting complex roles and permissions to business users on diverse datasets.
- Complexity is increased by introducing various user personas, data subject areas (data classification), type of employees (full time versus contractors), and more.
Persona-Based Access Scenario:
Security teams are always looking to govern the process with stringent enforcement. Most enterprises either tag data with their subject areas or create independent S3 buckets.
Some of the rules applied are created by the security team to ensure each role has exactly the level of access it requires to accomplish the task at hand. Some of the rules are also in place due to business needs.
The table below explains various dimensions of such complexity:
|Subject area/persona/department||Persona A (Department 1)||Persona B (Department 2)||Persona C (Department 3)|
|Claims data||Read (landing data); read/write (summarized data); Amazon Athena + Amazon QuickSight||Amazon EMR cluster||Amazon SageMaker Notebook|
|Finance data||Deny||Deny||Read only; Amazon Athena + Amazon QuickSight|
|Sales data||Read only|
- Subject area is a classification of the data within the data lake.
- Persona defines the profile of the end users who may require a specific list of AWS services.
- Services define the permissions as templates, which can be converted into an inline policy or a customer-managed policy. AWS-managed policies can also be managed through pre-defined service roles, such as those for Amazon EMR and Amazon SageMaker.
- Departments represent the different lines of business in a company.
The business rules depicted in the table illustrate the complexity of access management. For instance, a user with “Persona A” who belongs to “Department 1” should NOT have access to any financial data but gets full access to claims data. Along with access to claims data, the user should also get Amazon Athena and Amazon QuickSight access.
On the other hand, a user with “Persona C” who belongs to “Department 3” should NOT have access to sales data but should get full access to claims data with their personal Amazon SageMaker Notebook instance. Rules like these are generally heuristic and difficult to implement consistently.
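The persona rules described above can be sketched as a lookup table with a default-deny fallback. This is a minimal Python illustration and not the EZLA rule engine itself: `Grant`, `ACCESS_RULES`, `resolve_access`, and the permission labels are hypothetical names, and only the combinations the text states explicitly are encoded.

```python
from typing import NamedTuple


class Grant(NamedTuple):
    permission: str        # e.g. "full", "read-only", or "deny"
    services: tuple = ()   # AWS services provisioned together with the grant


DENY = Grant("deny")

# Hypothetical encoding of the rules described in the text.
# Keys are (persona, department, subject area).
ACCESS_RULES = {
    ("Persona A", "Department 1", "claims"): Grant(
        "full", ("Amazon Athena", "Amazon QuickSight")),
    ("Persona A", "Department 1", "finance"): DENY,
    ("Persona C", "Department 3", "claims"): Grant(
        "full", ("Amazon SageMaker Notebook",)),
    ("Persona C", "Department 3", "sales"): DENY,
}


def resolve_access(persona: str, department: str, subject_area: str) -> Grant:
    """Return the grant for a combination; anything not listed is denied."""
    return ACCESS_RULES.get((persona, department, subject_area), DENY)


if __name__ == "__main__":
    print(resolve_access("Persona A", "Department 1", "claims"))
    print(resolve_access("Persona B", "Department 2", "finance"))
```

Keeping the rules as data rather than as scattered conditionals is one way to make them reviewable and auditable, which is the consistency problem the paragraph above points at.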
Lake House Access Management Common Challenges
Persona-based access management boils down to a few problem statements and common challenges:
- Create and manage data subject areas by Amazon S3 buckets and prefixes (subfolders) up to ‘n’ level, business rules to govern user personas, and user personas such as data scientist, data engineer, and report readers.
- Launch and manage AWS services like AWS Glue, Amazon Athena, and Amazon SageMaker, and integrate them with user personas and business logic for user and service roles creation and their maintenance.
- Create and manage departments and integrate them with various roles and user personas.
- Create and manage user roles and required service roles for Amazon SageMaker Notebooks, Amazon EMR clusters, and more.
Let’s see how TCS can help with the EZLA solution.
TCS EZLA Solution Overview
The TCS EZLA solution sits in the Access Management Layer of the Lake House approach. It provides an automated way for business and IT staff to manage access and peripheral resource provisioning.
Figure 2 – TCS EZLA solution overview.
The TCS EZLA solution consists of five major configurable components, which are designed with flexibility, security, and compliance in mind. To achieve this goal, a business rule engine codifies requirements and infrastructure provisioning.
A batch process is executed periodically and invokes AWS APIs to collect relevant AWS resource names and their properties. This data is stored in persistent metadata storage as reference data, and it is updated dynamically as AWS resource configurations change.
The metadata storage uses Amazon Aurora MySQL to store and manage all artifacts used by the solution. Access policies are optimized and decoupled into separate documents that adhere to AWS Identity and Access Management (IAM) character limits. For example, access policies for S3 and AWS Glue are segregated to avoid the character limit on customer-managed policies.
A robust data model has been built on Amazon Aurora MySQL to store and manage all artifacts used by the solution. This repository also contains optimization mechanisms that process IAM access policy documents so they comply with IAM best practices. Intelligence is built in to split various parts of the aggregated access into multiple IAM policy documents for better manageability.
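The policy-splitting idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the EZLA optimizer: it assumes the 6,144-character quota for a customer-managed policy document, measures compact JSON as a close approximation of how IAM counts characters, and packs statements in order. The names `split_policy` and `MANAGED_POLICY_LIMIT` are hypothetical. Grouping statements by service first (for example S3 versus AWS Glue) could be layered on top.

```python
import json

# IAM limits a customer-managed policy document to 6,144 characters
# (whitespace excluded), so the sketch measures compact JSON.
MANAGED_POLICY_LIMIT = 6144


def _size(statements: list) -> int:
    """Length of the compact JSON policy document holding these statements."""
    document = {"Version": "2012-10-17", "Statement": statements}
    return len(json.dumps(document, separators=(",", ":")))


def split_policy(statements: list, limit: int = MANAGED_POLICY_LIMIT) -> list:
    """Pack statements, in order, into consecutive documents that fit the limit."""
    documents, current = [], []
    for statement in statements:
        if _size([statement]) > limit:
            raise ValueError("a single statement exceeds the policy size limit")
        if current and _size(current + [statement]) > limit:
            # The next statement would overflow: close the current document.
            documents.append({"Version": "2012-10-17", "Statement": current})
            current = []
        current.append(statement)
    if current:
        documents.append({"Version": "2012-10-17", "Statement": current})
    return documents
```

Each returned document could then be registered as its own managed policy and attached to the same role.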
Business Rule Engine:
This uses serverless services to execute actions on microservices. AWS Step Functions is used for asynchronous actions that are executed according to a logic flow. The rule engine allows customers to enforce their naming conventions for IAM roles, sandboxes, Amazon Athena workgroups, AWS Glue databases, Amazon QuickSight assignments, Amazon SageMaker lifecycle configurations, and Amazon Redshift external schemas.
The rule engine also codifies enterprise-level rules on access management to ensure each persona gets the level of access required by their job function.
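Naming-convention enforcement of this kind might look like the sketch below. The templates, the `org` label, and the function names are all hypothetical; the source does not describe EZLA's actual naming patterns, so treat this only as an illustration of rendering resource names from a configurable template.

```python
import re

# Hypothetical naming templates; a real deployment would load its own.
NAMING_TEMPLATES = {
    "iam_role": "{org}-{department}-{persona}-role",
    "athena_workgroup": "{org}-{department}-{persona}-wg",
    "glue_database": "{org}_{subject_area}_db",
}


def _slug(value: str, separator: str) -> str:
    """Lowercase a label and collapse non-alphanumeric runs into the separator."""
    return re.sub(r"[^a-z0-9]+", separator, value.lower()).strip(separator)


def resource_name(kind: str, **labels: str) -> str:
    """Render the name for one resource kind from the configured template."""
    template = NAMING_TEMPLATES[kind]
    separator = "_" if "_" in template else "-"
    return template.format(**{k: _slug(v, separator) for k, v in labels.items()})


if __name__ == "__main__":
    print(resource_name("iam_role", org="Acme",
                        department="Department 1", persona="Data Scientist"))
    print(resource_name("glue_database", org="Acme", subject_area="Claims Data"))
```

Because every resource name passes through one function, a change to the convention is a configuration change rather than a code change in each provisioning step.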
Graphical User Interface (GUI):
This is a web interface built with AngularJS, HTML5, and CSS3 that delivers an intuitive user experience (UX).
Single Sign-On (SSO):
Amazon Cognito is used for SAML federation with any corporate identity provider. Profiles of end users are stored in the user pool as metadata so the GUI can be filtered accordingly. The JSON Web Token (JWT) is managed within the application.
TCS EZLA Solution Architecture
The TCS EZLA solution can be configured to manage AWS services beyond the scope of the basic S3-based data lake. The solution is deployed through AWS CloudFormation templates, which create the necessary roles and access policies. Other resources (such as Amazon Athena workgroups, Amazon QuickSight assignments, and AWS Glue databases) are created dynamically based on the initial configuration set by the administrator.
Backend, UX, and Automation are the core modules of the TCS EZLA solution. The high-level architecture diagram below outlines the events flow for each module.
Figure 3 – High-level architecture of the TCS EZLA solution.
The information below explains the events flow for the Backend, UX, and Automation modules.
A. Backend:
- A batch process runs frequently; an AWS Batch job is invoked by Amazon CloudWatch.
- The job creates an inventory by reading S3 buckets and subfolders/prefixes.
- It persistently stores the relevant details in the inventory database (Amazon Aurora MySQL).
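The inventory step can be illustrated with a small function that derives the distinct prefixes, up to a configurable depth, from a list of object keys. This is a sketch under assumptions: the function name is hypothetical, and in a real batch job the keys would come from an S3 listing (for example, paginated `list_objects_v2` calls in boto3) and the result would be written to the inventory database.

```python
def prefixes_up_to(keys, max_depth):
    """Collect every distinct prefix ("subfolder") up to max_depth levels deep."""
    found = set()
    for key in keys:
        parts = key.split("/")[:-1]   # drop the object name itself
        for depth in range(1, min(len(parts), max_depth) + 1):
            found.add("/".join(parts[:depth]) + "/")
    return sorted(found)


if __name__ == "__main__":
    sample_keys = [
        "claims/landing/2021/file1.csv",
        "claims/summarized/report.parquet",
        "finance/ledger.csv",
        "readme.txt",
    ]
    for prefix in prefixes_up_to(sample_keys, max_depth=2):
        print(prefix)
```

Capping the depth keeps the inventory at the granularity that subject areas are actually defined on, instead of recording every partition folder.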
B. User Experience:
- End user connects to TCS EZLA through a web browser.
- Amazon CloudFront caches the content at edge locations for better performance.
- Any interaction with the GUI calls a REST API in Amazon API Gateway.
- Amazon API Gateway triggers an AWS Lambda function directly, or
- Amazon API Gateway triggers AWS Step Functions.
- AWS Step Functions invokes a Lambda function.
AWS Lambda functions can:
- Create new IAM roles.
- Create IAM access policies.
- Attach them to other resources and services based on business rules and configuration.
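A Lambda function performing these three actions could be sketched as below. This is an assumption-laden example rather than the EZLA code: the event fields (`role_name`, `service`, `bucket`, `prefix`), the policy naming, and the read-only scope are hypothetical. The boto3 calls used (`create_role`, `create_policy`, `attach_role_policy`) are standard IAM client operations, and the client can be injected so the logic is testable without AWS credentials.

```python
import json


def read_only_policy(bucket: str, prefix: str) -> dict:
    """Policy document granting read-only access to one S3 prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": f"{prefix}*"}},
            },
            {
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
        ],
    }


def trust_policy(service: str) -> dict:
    """Trust policy letting one AWS service principal assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": service},
            "Action": "sts:AssumeRole",
        }],
    }


def handler(event, context=None, iam=None):
    """Create a role, create a managed policy, and attach the policy to the role."""
    if iam is None:            # build a real client only when none is injected
        import boto3
        iam = boto3.client("iam")
    role_name = event["role_name"]
    iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(trust_policy(event["service"])),
    )
    policy = iam.create_policy(
        PolicyName=f"{role_name}-s3-read",
        PolicyDocument=json.dumps(
            read_only_policy(event["bucket"], event["prefix"])),
    )
    policy_arn = policy["Policy"]["Arn"]
    iam.attach_role_policy(RoleName=role_name, PolicyArn=policy_arn)
    return {"role": role_name, "policy_arn": policy_arn}
```

An event such as `{"role_name": "ds-role", "service": "sagemaker.amazonaws.com", "bucket": "example-lake", "prefix": "claims/"}` would produce a role that Amazon SageMaker can assume with read-only access to the claims prefix.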
Real-World Use Case in Life Sciences Vertical
A global life sciences enterprise is seeking to customize an access management solution for its Lake House platform. Data scientists must work within an Amazon SageMaker Notebook to be able to start an Amazon EMR cluster for executing an operation on specific data (classified through a particular subject area and stored in S3).
Sometimes, such users will also need Amazon Athena for simpler and ad-hoc queries on the same data set. Once done, users must save their queries and visualize results within Amazon QuickSight dashboards.
The TCS EZLA solution allows administrators to create and define a persona “data scientist” and associate it with the services listed above. They can also review the “new role” creation request, or update an existing role around the subject areas, layers, or sandbox access. They can then map the subject areas within S3 subfolders and create any new departments, if needed.
TCS EZLA also allows product owners to request the creation of a new role for a persona and a list of selected departments. They can view the history of the request and all changes on a role, generate a report on the role access backlogs, and export the access coverage of all roles managed within this solution in CSV format.
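The CSV export of access coverage might be sketched as follows. The column names and the shape of the role records are hypothetical, since the source does not describe the actual report layout; the example only shows one row per role and subject area being written with Python's standard `csv` module.

```python
import csv
import io


def export_coverage(roles) -> str:
    """Render one CSV row per (role, subject area) pair."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["role", "persona", "department", "subject_area", "permission"])
    for role in roles:
        # Sort subject areas so the report is stable between runs.
        for subject_area, permission in sorted(role["access"].items()):
            writer.writerow([role["name"], role["persona"], role["department"],
                             subject_area, permission])
    return buffer.getvalue()


if __name__ == "__main__":
    print(export_coverage([{
        "name": "ds-role",
        "persona": "Persona A",
        "department": "Department 1",
        "access": {"finance": "deny", "claims": "full"},
    }]))
```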
- Completely serverless and cloud-native solution on AWS ensures no fixed cost.
- Automation via infrastructure as code (IaC) and approval workflow for each user request.
- Report on role backlogs to track changes and coverage of an IAM role.
- Built-in security, governance, and tagging mechanism.
- Completely configurable audit trails and segregation of duties.
The TCS EZLA solution reduces cost by automating manual efforts. It frees security engineers and AWS administrators from mundane work so they can spend more time building.
The business rule engine ensures all heuristics are coded and actions converted into IaC. Meanwhile, AWS resources created using the solution always adhere to an organization’s naming convention, tagging standards, and other security best practices. It also provides seamless integration with AWS services, peace of mind, and low friction between IT, security, and business.
TCS EZLA promotes accelerated Lake House adoption and creates levers for business transformation.
A Lake House approach is one of the emerging and fast-growing areas for organizations to unlock hidden insights, but its implementation can be hindered by multiple rounds of security approvals due to the customized access requirement.
The TCS EZ Lake Access (EZLA) solution accelerates the Lake House implementation and provides security guardrails to ensure alignment with industry compliance requirements and least-privilege access.
TCS has a proven record of delivering industry-leading solutions for customers, with associates who are trained and certified in implementing AWS services. Contact TCS for more details and implementation of the EZLA solution.
The content and opinions in this blog are those of the third-party author and AWS is not responsible for the content or accuracy of this post.
TCS – AWS Partner Spotlight
TCS is an AWS Premier Consulting Partner and Managed Service Provider (MSP). An IT services, consulting, and business solutions organization, TCS has been partnering with many of the world’s largest businesses in their transformation journeys for the last 50 years.