yelp

Yelp connects people with great local businesses. Since its launch in 2004, Yelp has grown from offering services for just one city—its headquarters home of San Francisco—to a multinational presence spanning major metros across more than 30 countries. The company’s performance-based advertising and transactional business model led to revenues of more than $500 million during 2015, a 46% increase over the prior year. Yelp has evolved into a mobile-centric company, with more than 70% of searches, and more than 58% of content originating from mobile devices.

Yelp uses Amazon Redshift to analyze mobile app usage data and ad data on customer cohorts, auctions, and ad metrics. Yelp has immediately benefited by the new Amazon Redshift schema storage quota feature.

“Amazon Redshift is a managed data warehouse service that allows Yelp to focus on data analytics without spending time on database administration,” says Steven Moy, Lead Engineer for Yelp’s Metrics Platform. The Metrics Platform provides long-term persistent data storage and SQL-on-anything query capabilities for Yelp’s Engineering teams. “A key strategy for our data warehouse users to iterate quickly is to have a writable schema called ‘tmp’ for users to prototype various table schema. However, we occasionally faced challenges when there was not enough free space during a query execution, degrading the entire data warehouse query operation. With the new schema quota feature, we can provision a storage quota ceiling on the ‘tmp’ schema to safeguard runaway storage issues. We look forward to all the autonomous features coming from Amazon Redshift.”

Many organizations are moving toward self-service analytics, where different personas create their own insights on the evolved volume, variety, and velocity of data to keep up with the acceleration of business. This data democratization creates the need to enforce data governance, control cost, and prevent data mismanagement. Controlling the storage quota of different personas is a significant challenge for data governance and data storage operation. This post shows you how to set up Amazon Redshift storage quotas by different personas.

Prerequisites

Before starting this walkthrough, you must have the following:

  • An Amazon Redshift cluster. The US East (N. Virginia) Region is preferred because you need to load data from Amazon Simple Storage Service (Amazon S3) in us-east-1.
  • A database user with superuser permission.

Setting up and testing the schema quota

To set up the environment and implement the use case, complete the following steps:

  1. Connect to your Amazon Redshift cluster using your preferred SQL client as a superuser or user with CREATE SCHEMA privileges.
  1. Create the user sales with the following code:
CREATE USER sales WITH password 'Abcd1234!';

The user in the preceding code is just for the purposes of this walkthrough. You should use identity federation in AWS Identity and Access Management (IAM) to use a corporate user rather than creating one manually. For more information, see Authentication.

  1. Set up a schema quota with the CREATE SCHEMA See the following code:
CREATE SCHEMA [ IF NOT EXISTS ] schema_name [ AUTHORIZATION username ] [ QUOTA {quota [MB | GB | TB] | UNLIMITED} ] [ schema_element [ ... ] CREATE SCHEMA AUTHORIZATION username[ QUOTA {quota [MB | GB | TB] | UNLIMITED} ] [ schema_element [ ... ] ]

The QUOTA parameters define the maximum amount of disk space that the specified schema can use.

  1. Create the schema sales_schema with a quota of 2 GB for the user sales. See the following code:
CREATE SCHEMA sales_schema AUTHORIZATION sales QUOTA 2 GB; 

  1. Impersonate the user sales with the following code:
SET SESSION AUTHORIZATION 'sales';
SELECT CURRENT_USER;

  1. Create the tables region and lineitem with the following code:
CREATE TABLE sales_schema.region ( r_regionkey int4, r_name char(25), r_comment varchar(152) ) DISTSTYLE EVEN; CREATE TABLE sales_schema.lineitem ( l_orderkey int8, l_partkey int8, l_suppkey int4, l_linenumber int4, l_quantity numeric(12,2), l_extendedprice numeric(12,2), l_discount numeric(12,2), l_tax numeric(12,2), l_returnflag char(1), l_linestatus char(1), l_shipdate date, l_commitdate date, l_receiptdate date, l_shipinstruct char(25), l_shipmode char(10), l_comment varchar(44)
) DISTSTYLE EVEN;

Single statement transaction

To test single statement transaction use case, complete the following steps:

  1. Load data into the table region with the following code:
COPY sales_schema.region FROM 's3://redshift-downloads/TPC-H/10GB/region/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';

 <Your-IAM-Role> is the ARN of your IAM role with the necessary permissions associated with your Amazon Redshift cluster.

 The COPY command should complete with five rows loaded. See the following code:

dev=> COPY sales_schema.region FROM 's3://redshift-downloads/TPC-H/10GB/region/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';
INFO: Load into table 'region' completed, 5 record(s) loaded successfully.
COPY

  1. Check the quota and disk usage for the schema from system view svv_schema_quota_state. See the following code:
SELECT TRIM(SCHEMA_NAME) "schema_name", QUOTA, disk_usage, disk_usage_pct FROM svv_schema_quota_state WHERE SCHEMA_NAME = 'sales_schema';

The statement should show a 2,048 MB quota with 30 MB disk usage on your four-node DC2.large cluster. Disk usage varies with different instance types and number of nodes. See the following code:

dev=> SELECT TRIM(SCHEMA_NAME) "schema_name", QUOTA, disk_usage, disk_usage_pct FROM svv_schema_quota_state WHERE SCHEMA_NAME = 'sales_schema'; schema_name | quota | disk_usage | disk_usage_pct --------------+-------+------------+---------------- sales_schema | 2048 | 30 | 1.46
(1 row)

  1. Load data into the table lineitem with the following code:
COPY sales_schema.lineitem FROM 's3://redshift-downloads/TPC-H/10GB/lineitem/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';

You should get an error that the transaction is aborted due to exceeding the disk space quota. See the following code:

dev=> COPY sales_schema.lineitem FROM 's3://redshift-downloads/TPC-H/10GB/lineitem/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';
INFO: Load into table 'lineitem' completed, 59986052 record(s) loaded successfully.
ERROR: Transaction 40895 is aborted due to exceeding the disk space quota in schema(s): (Schema: sales_schema, Quota: 2048, Current Disk Usage: 2798).
Free up disk space or request increased quota for the schema(s).

Amazon Redshift checks each transaction for quota violations before committing the transaction. The size limit can exceed the quota temporarily within a transaction before it’s committed because the quota violation check occurs at the end of a transaction. That’s why you initially see the message loading to table lineitem completed successfully. When a transaction exceeds the quota, Amazon Redshift aborts the transaction, prohibits subsequent ingestions, and reverts all changes until you free up disk space.

The quota violation occurrence is stored in the system table stl_schema_quota_violations.

  1. Run the following code to see the violation:
SELECT userid, TRIM(SCHEMA_NAME) "schema_name", quota, disk_usage, disk_usage_pct, timestamp FROM stl_schema_quota_violations WHERE SCHEMA_NAME = 'sales_schema' ORDER BY timestamp DESC;

It should show the quota, disk usage, disk usage percentage out of the configured quota, and the timestamp of when the violation happened. See the following code:

dev=> SELECT userid, TRIM(SCHEMA_NAME) "schema_name", quota, disk_usage, disk_usage_pct, timestamp FROM stl_schema_quota_violations WHERE SCHEMA_NAME = 'sales_schema' ORDER BY timestamp DESC; userid | schema_name | quota | disk_usage | disk_usage_pct | timestamp --------+--------------+-------+------------+----------------+---------------------------- 104 | sales_schema | 2048 | 2798 | 136.62 | 2020-04-20 20:09:25.494723
(1 row)

  1. Change the schema quota to 4 GB using the ALTER SCHEMA See the following code:
RESET SESSION AUTHORIZATION; ALTER SCHEMA sales_schema QUOTA 4 GB;
SET SESSION AUTHORIZATION 'sales';
SELECT CURRENT_USER;

You have to be a database superuser or user with CREATE SCHEMA permissions to set a schema quota during schema creation. However, only a superuser can change a schema quota. You use RESET SESSION AUTHORIZATION to reset the user impersonation back to the original superuser.

  1. Check the quota and disk usage for the schema with the following code:
SELECT TRIM(SCHEMA_NAME) "schema_name", QUOTA, disk_usage, disk_usage_pct FROM svv_schema_quota_state WHERE SCHEMA_NAME = 'sales_schema';

The statement should return a 4,096 MB quota with 30 MB disk usage on your four-node DC2.large cluster. The failed COPY statement earlier automatically reclaims disk space when the transaction is aborted for a pristine table. See the following code:

dev=> SELECT TRIM(SCHEMA_NAME) "schema_name", QUOTA, disk_usage, disk_usage_pct FROM svv_schema_quota_state WHERE SCHEMA_NAME = 'sales_schema'; schema_name | quota | disk_usage | disk_usage_pct --------------+-------+------------+---------------- sales_schema | 4096 | 30 | 0.73
(1 row)

  1. Rerun the COPY command to load data into the table lineitem with the following code:
COPY sales_schema.lineitem FROM 's3://redshift-downloads/TPC-H/10GB/lineitem/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';

It should return the following message:

Load into table 'lineitem' completed, 59986052 record(s) loaded successfully

  1. Check the quota and disk usage for the schema with the following code:
SELECT TRIM(SCHEMA_NAME) "schema_name", QUOTA, disk_usage, disk_usage_pct FROM svv_schema_quota_state WHERE SCHEMA_NAME = 'sales_schema';

The statement should return a 4,096 MB quota and 2798 MB disk usage on a four-node DC2.large cluster. Disk usage varies with different instance types and number of nodes. See the following code:

dev=> SELECT TRIM(SCHEMA_NAME) "schema_name", QUOTA, disk_usage, disk_usage_pct FROM svv_schema_quota_state WHERE SCHEMA_NAME = 'sales_schema'; schema_name | quota | disk_usage | disk_usage_pct --------------+-------+------------+---------------- sales_schema | 4096 | 2798 | 68.31
(1 row)

  1. Rerun the COPY command to load a large amount of data into the table lineitem to exceed the quota. See the following code:
COPY sales_schema.lineitem FROM 's3://redshift-downloads/TPC-H/10GB/lineitem/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';

It should return an error that the transaction is aborted due to exceeding the disk space quota. See the following code:

dev=# COPY sales_schema.lineitem FROM 's3://redshift-downloads/TPC-H/10GB/lineitem/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';
INFO: Load into table 'lineitem' completed, 59986052 record(s) loaded successfully.
ERROR: Transaction 86438 is aborted due to exceeding the disk space quota in schema(s): (Schema: sales_schema, Quota: 4096, Current Disk Usage: 5486).
Free up disk space or request increased quota for the schema(s).

  1. Execute the following code to perform subsequent small COPY/INSERT/UPDATE after the quota violation:
COPY sales_schema.region FROM 's3://redshift-downloads/TPC-H/10GB/region/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';
INSERT INTO sales_schema.region VALUES(100, 'Test','This is a test');
UPDATE sales_schema.region SET r_name = 'Fail' WHERE r_regionkey = 100;

All statements should return that the error transaction is aborted due to exceeding disk space quota. See the following code:

dev=# COPY sales_schema.region FROM 's3://redshift-downloads/TPC-H/10GB/region/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';
INFO: Load into table 'region' completed, 5 record(s) loaded successfully.
ERROR: Transaction 86478 is aborted due to exceeding the disk space quota in schema(s): (Schema: sales_schema, Quota: 4096, Current Disk Usage: 5486).
Free up disk space or request increased quota for the schema(s).
dev=# INSERT INTO sales_schema.region VALUES(100, 'Test','This is a test');
ERROR: Transaction 86479 is aborted due to exceeding the disk space quota in schema(s): (Schema: sales_schema, Quota: 4096, Current Disk Usage: 5486).
Free up disk space or request increased quota for the schema(s).
dev=# UPDATE sales_schema.region SET r_name = 'Fail' WHERE r_regionkey = 100;
ERROR: Transaction 86483 is aborted due to exceeding the disk space quota in schema(s): (Schema: sales_schema, Quota: 4096, Current Disk Usage: 5486).
Free up disk space or request increased quota for the schema(s).

For non-pristine tables, although the offending transaction that exceeded the quota was aborted (for example, COPY to lineitem), the transaction still consumes space and the schema is blocked until disk space is reclaimed. That is why the subsequent COPY/INSERT/UPDATE to the region table is aborted as well.

  1. Double-check the quota violation from system table with the following code:
SELECT userid, TRIM(SCHEMA_NAME) "schema_name", quota, disk_usage, disk_usage_pct, timestamp FROM stl_schema_quota_violations WHERE SCHEMA_NAME = 'sales_schema' ORDER BY timestamp DESC;

It should return disk usage of 5,486 MB with a four-node DC2.large cluster. See the following code:

dev=> SELECT userid, TRIM(SCHEMA_NAME) "schema_name", quota, disk_usage, disk_usage_pct, timestamp FROM stl_schema_quota_violations WHERE SCHEMA_NAME = 'sales_schema' ORDER BY timestamp DESC; userid | schema_name | quota | disk_usage | disk_usage_pct | timestamp --------+--------------+-------+------------+----------------+---------------------------- 104 | sales_schema | 4096 | 5486 | 133.94 | 2020-04-20 21:27:29.653489 104 | sales_schema | 4096 | 5486 | 133.94 | 2020-04-20 21:27:29.469287 104 | sales_schema | 4096 | 5486 | 133.94 | 2020-04-20 21:27:29.197434 104 | sales_schema | 4096 | 5486 | 133.94 | 2020-04-20 20:28:01.344333 104 | sales_schema | 2048 | 2798 | 136.62 | 2020-04-20 20:09:25.494723
(5 rows)

Single statement shrink-only transaction

A shrink-only transaction is a transaction composed solely of statements that has the potential to directly or indirectly reduce the amount of disk space used, such as: DELETE, TRUNCATE, DROP TABLE, VACUUM, etc.

To test single statement shrink-only transaction use case, complete the following steps:

  1. Delete all data from the lineitem table with the following code:
DELETE sales_schema.lineitem;
SELECT COUNT(*) FROM sales_schema.lineitem;

It should return a record count of zero:

dev=> DELETE sales_schema.lineitem;
DELETE 59986052
dev=> SELECT COUNT(*) FROM sales_schema.lineitem; count ------- 0
(1 row)

Although the schema quota was exceeded, the DELETE transaction is committed because the shrink-only transaction is an exception to the quota violation check.

The TRUNCATE command is much more efficient than DELETE to delete all data and doesn’t require a VACUUM and ANALYZE command. The preceding code is for illustration purposes only.

  1. Rerun the COPY command to load data into the table lineitem with the following code:
COPY sales_schema.lineitem FROM 's3://redshift-downloads/TPC-H/10GB/lineitem/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';

It should return the following error message:

Transaction is aborted due to exceeding the disk space quota

See the following code:

dev=> COPY sales_schema.lineitem FROM 's3://redshift-downloads/TPC-H/10GB/lineitem/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';
INFO: Load into table 'lineitem' completed, 59986052 record(s) loaded successfully.
ERROR: Transaction 87024 is aborted due to exceeding the disk space quota in schema(s): (Schema: sales_schema, Quota: 4096, Current Disk Usage: 8150).
Free up disk space or request increased quota for the schema(s).

Because DELETE doesn’t automatically free up disk space, the preceding COPY command exceeds the quota and the transaction is aborted.

  1. Double-check the quota violation from the system table with the following code:
SELECT userid, TRIM(SCHEMA_NAME) "schema_name", quota, disk_usage, disk_usage_pct, timestamp FROM stl_schema_quota_violations WHERE SCHEMA_NAME = 'sales_schema' ORDER BY timestamp DESC;

The latest row should return a disk usage of 8,150 MB with a four-node DC2.large cluster. This shows that disk usage increases with the aborted transaction. See the following code:

dev=> SELECT userid, TRIM(SCHEMA_NAME) "schema_name", quota, disk_usage, disk_usage_pct, timestamp FROM stl_schema_quota_violations WHERE SCHEMA_NAME = 'sales_schema' ORDER BY timestamp DESC; userid | schema_name | quota | disk_usage | disk_usage_pct | timestamp --------+--------------+-------+------------+----------------+---------------------------- 104 | sales_schema | 4096 | 8150 | 198.97 | 2020-04-20 21:30:54.354669 104 | sales_schema | 4096 | 5486 | 133.94 | 2020-04-20 21:27:29.653489 104 | sales_schema | 4096 | 5486 | 133.94 | 2020-04-20 21:27:29.469287 104 | sales_schema | 4096 | 5486 | 133.94 | 2020-04-20 21:27:29.197434 104 | sales_schema | 4096 | 5486 | 133.94 | 2020-04-20 20:28:01.344333 104 | sales_schema | 2048 | 2798 | 136.62 | 2020-04-20 20:09:25.494723
(6 rows)

  1. Run VACUUM to free up disk space:
VACUUM sales_schema.lineitem;

The schema quota is exceeded, but because Amazon Redshift disregards the quota violation and commits transactions in certain cases like VACUUM, this completes without error.

Amazon Redshift now automatically runs the VACUUM DELETE operation in the background to reclaim disk space.

  1. Rerun the COPY command to load data into the table lineitem with the following code:
COPY sales_schema.lineitem FROM 's3://redshift-downloads/TPC-H/10GB/lineitem/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';

It should return the following message:

Load into table 'lineitem' completed, 59986052 record(s) loaded successfully

Because VACUUM reclaimed the disk space, the COPY command ran successfully.

CREATE TABLE AS (CTAS), ALTER TABLE APPEND statement and transactions on temporary schema

To test CREATE TABLE AS (CTAS), ALTER TABLE APPEND statement and transactions on temporary schema use case, complete the following steps:

  1. Execute the CTAS statement to create another table using a query. See the following code:
CREATE TABLE sales_schema.lineitem_ctas AS SELECT * FROM sales_schema.lineitem;

It should return that the transaction is aborted due to exceeding the disk space quota. CTAS is subject to similar quota checks in the same schema. See the following code:

dev=> CREATE TABLE sales_schema.lineitem_ctas AS SELECT * FROM sales_schema.lineitem;
ERROR: Transaction 112229 is aborted due to exceeding the disk space quota in schema(s): (Schema: sales_schema, Quota: 4096, Current Disk Usage: 5624).
Free up disk space or request increased quota for the schema(s).

  1. Create a temp table and populate data with the following code:
CREATE TEMP TABLE lineitem_temp AS SELECT * FROM sales_schema.lineitem;
SELECT COUNT(*) from lineitem_temp; 

It should return 59,986,052 rows. The schema quota doesn’t consider temporary tables created as part of a temporary namespace or schema.

  1. Create the same table lineitem in the public schema and load data into it using the following code:
CREATE TABLE public.lineitem_ctas AS SELECT * FROM sales_schema.lineitem;
SELECT COUNT(*) FROM lineitem_ctas;

It should return 59,986,052 rows.

Ingestion into different schemas has no effect and the transaction is committed.

  1. Append data into the table lineitem in sales_schema from another table using the ALTER TABLE APPEND See the following code:
ALTER TABLE sales_schema.lineitem APPEND FROM lineitem_ctas;

It should return an error that the transaction is aborted due to exceeding the disk space quota. Quota violation is applicable to the full schema.

  1. Append data from the sales_schema lineitem table to another table in the public schema with the following code:
ALTER TABLE lineitem_ctas APPEND FROM sales_schema.lineitem;
SELECT COUNT(*) FROM sales_schema.lineitem;
SELECT TRIM(SCHEMA_NAME) "schema_name", QUOTA, disk_usage, disk_usage_pct FROM svv_schema_quota_state WHERE SCHEMA_NAME = 'sales_schema';

It should return a record count of zero from the SELECT COUNT statement and the disk usage drop. The ALTER TABLE APPEND statement empties the table immediately and reclaims disk space when the transaction is complete. See the following code:

dev=> ALTER TABLE lineitem_ctas APPEND FROM sales_schema.lineitem;
INFO: ALTER TABLE APPEND "lineitem_ctas" from "lineitem" is complete.
ALTER TABLE APPEND and COMMIT TRANSACTION
dev=> SELECT COUNT(*) FROM sales_schema.lineitem; count ------- 0
(1 row) dev=> SELECT TRIM(SCHEMA_NAME) "schema_name", QUOTA, disk_usage, disk_usage_pct FROM svv_schema_quota_state WHERE SCHEMA_NAME = 'sales_schema'; schema_name | quota | disk_usage | disk_usage_pct --------------+-------+------------+---------------- sales_schema | 4096 | 30 | 0.73
(1 row)

Concurrent transactions

To test concurrent transactions use case, complete the following steps:

  1. Increase the quota to 6 GB using the following code:
RESET SESSION AUTHORIZATION;
ALTER SCHEMA sales_schema QUOTA 6 GB;
SET SESSION AUTHORIZATION 'sales';

You need two separate SQL client sessions connected to the Amazon Redshift cluster and run the code concurrently for the following test.

  1. On the first session, enter the following code:
CREATE TABLE sales_schema.lineitem_txn1 AS SELECT * FROM lineitem_ctas;

  1. While the first session is still running, on the second session, enter the following code:
CREATE TABLE sales_schema.lineitem_txn2 AS SELECT * FROM lineitem_ctas;

The statement from the first session should return a warning message:

Schema(s) close to reaching their allocated disk space quotas

This is a warning message that the schema is close to its quota, but the statement is still complete. See the following code:

dev=> CREATE TABLE sales_schema.lineitem_txn1 AS SELECT * FROM lineitem_ctas;
WARNING: Schema(s) close to reaching their allocated disk space quotas: (Schema: sales_schema, Quota: 6144, Current Disk Usage: 5606).
Free up disk space or request increased quota for the schema(s).
SELECT

The statement from the second session should return an error that the transaction is aborted due to exceeding the disk space quota. Because the first transaction from the first session was committed first, the disk space usage increased to close to the quota. The second transaction checks the quota violation before committing the transaction and thus aborts the transaction.

Reset quota below existing disk space usage

To test resetting quota below existing disk space usage use case, complete the following:

  1. Run the following code to reset the quota to 2 GB:
RESET SESSION AUTHORIZATION;
ALTER SCHEMA sales_schema QUOTA 2 GB;
SET SESSION AUTHORIZATION 'sales';
CREATE TABLE sales_schema.test AS SELECT 1;

You can set the quota below the existing disk space usage; however, all future ingestion transactions are aborted until you free disk space.  See the following code:

dev=> RESET SESSION AUTHORIZATION;
RESET
dev=# ALTER SCHEMA sales_schema QUOTA 2 GB;
ALTER SCHEMA
dev=# SET SESSION AUTHORIZATION 'sales';
SET
dev=> CREATE TABLE sales_schema.test AS SELECT 1;
ERROR: Transaction 112557 is aborted due to exceeding the disk space quota in schema(s): (Schema: sales_schema, Quota: 2048, Current Disk Usage: 5610).
Free up disk space or request increased quota for the schema(s).

Multi-statement transactions

To test multi-statement transaction use case, complete the following steps:

  1. Enter the following code to test a multi-statement transaction:
BEGIN; DELETE FROM sales_schema.region WHERE r_regionkey = 0; INSERT INTO sales_schema.region VALUES(100, 'Test','This is a test');
COMMIT;
BEGIN; DELETE FROM sales_schema.region WHERE r_regionkey = 0; UPDATE sales_schema.region SET r_name = 'Fail' WHERE r_regionkey = 1;
COMMIT;
BEGIN; DELETE FROM sales_schema.region WHERE r_regionkey = 0; COPY sales_schema.region FROM 's3://redshift-downloads/TPC-H/10GB/region/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';
COMMIT;

All INSERT/UPDATE/COPY transactions should be aborted because the quota is exceeded. See the following code:

dev=> BEGIN;
BEGIN
dev=> DELETE FROM sales_schema.region WHERE r_regionkey = 0;
DELETE 0
dev=> INSERT INTO sales_schema.region VALUES(100, 'Test','This is a test');
INSERT 0 1
dev=> COMMIT;
ERROR: Transaction 114291 is aborted due to exceeding the disk space quota in schema(s): (Schema: sales_schema, Quota: 2048, Current Disk Usage: 5606).
Free up disk space or request increased quota for the schema(s).
dev=> BEGIN;
BEGIN
dev=> DELETE FROM sales_schema.region WHERE r_regionkey = 0;
DELETE 0
dev=> UPDATE sales_schema.region SET r_name = 'Fail' WHERE r_regionkey = 1;
UPDATE 1
dev=> COMMIT;
ERROR: Transaction 114295 is aborted due to exceeding the disk space quota in schema(s): (Schema: sales_schema, Quota: 2048, Current Disk Usage: 5606).
Free up disk space or request increased quota for the schema(s).
dev=> BEGIN;
BEGIN
dev=> DELETE FROM sales_schema.region WHERE r_regionkey = 0;
DELETE 0
dev=> COPY sales_schema.region FROM 's3://redshift-downloads/TPC-H/10GB/region/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';
INFO: Load into table 'region' completed, 5 record(s) loaded successfully.
COPY
dev=> COMMIT;
ERROR: Transaction 114303 is aborted due to exceeding the disk space quota in schema(s): (Schema: sales_schema, Quota: 2048, Current Disk Usage: 5624).
Free up disk space or request increased quota for the schema(s).

  1. Enter the following code to test a multi-statement transaction with DROP TABLE at the same transaction:
BEGIN; INSERT INTO sales_schema.region VALUES(100, 'Test','This is a test'); UPDATE sales_schema.region SET r_name = 'Fail' WHERE r_regionkey = 100; COPY sales_schema.region FROM 's3://redshift-downloads/TPC-H/10GB/region/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1'; DROP TABLE sales_schema.region;
COMMIT;

The transaction should be committed successfully. The COPY/INSERT/DELETE transaction into a table that gets dropped at the same time the transaction is committed. See the following code:

dev=> BEGIN;
BEGIN
dev=> INSERT INTO sales_schema.region VALUES(100, 'Test','This is a test');
INSERT 0 1
dev=> UPDATE sales_schema.region SET r_name = 'Fail' WHERE r_regionkey = 100;
UPDATE 3
dev=> COPY sales_schema.region FROM 's3://redshift-downloads/TPC-H/10GB/region/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';
INFO: Load into table 'region' completed, 5 record(s) loaded successfully.
COPY
dev=> DROP TABLE sales_schema.region;
DROP TABLE
dev=> COMMIT;
COMMIT

  1. Run shrink-only statements in a multi-statement transaction with the following code:
BEGIN; DELETE FROM sales_schema.lineitem WHERE l_orderkey = 53924162; DROP TABLE sales_schema.lineitem;
COMMIT;

The transaction should be committed successfully. A transaction containing shrink-only statements should succeed. See the following code:

dev=> BEGIN;
BEGIN
dev=> DELETE FROM sales_schema.lineitem WHERE l_orderkey = 53924162;
DELETE 0
dev=> DROP TABLE sales_schema.lineitem;
DROP TABLE
dev=> COMMIT;
COMMIT

Best practices

These following recommended practices can help you optimize your workload using storage quotas:

  • The schema feature works best when users are aligned with schemas – Use schemas to logically segregate the users, for example, the MAIN schema for the base and aggregated tables owned by data engineers, and the ANALYST schema controlled with storage quota owned by data analysts that perform their own aggregation by querying the base data from the MAIN schema into the ANALYST The data analyst only has read permissions on the MAIN schema but read and write permissions on the ANALYST schema.
  • Revoke usage on public schema – All users have CREATE and USAGE privileges on the PUBLIC schema of a database by default. Because storage quota isn’t applicable to the PUBLIC schema by default, all users should be disallowed from creating objects in the PUBLIC schema of a database. They should instead use their own aligned schemas as per the previous best practice. Use the REVOKE command to remove that privilege.
  • Use physical tables instead of temporary tables – Because temporary tables aren’t subject to storage quota, use physical tables instead—but remember to drop the physical tables at the end of the transaction. An alternative is to create a separate schema to cater to temporary physical tables with storage quota, so instead of using temporary tables, create physical tables in the dedicated schema. This would ease administration and you can perform an automation to clean up tables in this dedicated schema on a regular basis.
  • Use data shrinking operations – Use data shrinking operations like DELETE, VACUUM DELETE, DROP, and TRUNCATE to reclaim disk space, especially if it’s nearing the quota.
  • Use workload management (WLM) with query priority to control resources for different users – Data ingestion and aggregation performed by different users might be resource-intensive and impact the throughput of other mission-critical loads or queries running on the cluster. Amazon Redshift Advisor automatically analyzes the current WLM usage and can make recommendations to get more throughput from your cluster. Periodically reviewing the suggestions from Advisor helps you get the best performance.

Monitoring

You should periodically monitor for any storage quota violations so that you can take corrective action and provide business continuity. The system table stl_schema_quota_violations is a good resource to get quota violation events. To query the violation information, enter the following code:

SELECT userid, TRIM(SCHEMA_NAME) "schema_name", quota, disk_usage, disk_usage_pct, timestamp FROM stl_schema_quota_violations WHERE SCHEMA_NAME = 'sales_schema' ORDER BY timestamp DESC;

You can automate the execution of the preceding query and send notifications to users if any violations occur using the following solutions:

Cleaning up

When you’re done with testing, you can remove the objects and users with the following code:

RESET SESSION AUTHORIZATION;
DROP TABLE IF EXISTS sales_schema.lineitem_txn1; DROP TABLE IF EXISTS lineitem_temp;
DROP TABLE IF EXISTS lineitem_ctas;
DROP SCHEMA sales_schema;
DROP USER IF EXISTS sales;

Summary

Amazon Redshift supports stringent compliance and security requirements with no extra cost, which makes it ideal for highly regulated industries. Data storage operation with quotas is part of an important data governance framework that provides compliance and operation efficiency. A simple QUOTA parameter in a CREATE/ALTER SCHEMA statement enables disk usage quotas on an Amazon Redshift schema. You can assign Amazon Redshift database users who represent personas in the organization to different schemas with quotas to control the amount of disk space they can consume. This is crucial for data governance and cost control, and avoids running into operation issues like disk full errors, especially if they keep creating objects and ingesting data that might halt other mission-critical analytic operations.

This post showed you how easy it is to control the storage quota for Amazon Redshift using schemas. You went through the steps to create and alter the quota on a schema and authorize specific users to the schema. You also explored different use cases of ingesting data into the schema and monitoring the different outcomes and disk usage using the system view. You can adopt this feature to support your business needs.

 


About the Authors

BPYauBP Yau is a Data Warehouse Specialist Solutions Architect at AWS His role is to help customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate the Oracle Data Warehouse to Amazon Redshift and built the next generation big data analytics platform using AWS technologies.

 

 

 

AhmadBarghoutAhmad Barghout is a Software Development Engineer on the Amazon Redshift storage team. He has worked on cluster resize optimizations and currently works on ease-of-use features such as schema quota. Before that, he was a full stack developer at Cisco. He holds a master’s degree from The University of Ottawa.

 

 

 

himarajaHimanshu Raja is a Principal Product Manager for Amazon Redshift. Himanshu loves solving hard problems with data and cherishes moments when data goes against intuition. In his spare time, Himanshu enjoys cooking Indian food and watching action movies.