Guest post by Plamen Ivanov, Data Engineer, South Gate Tech
Tide is a U.K.-based FinTech startup with offices in London, Sofia, and Hyderabad. It is one of the first, and the largest business banking platform in the UK, with over 150,000 SME members. As of 2019, one of Tide’s main focuses is to be data driven. This resulted in the forming of a Data Science and Analytics Team with Hendrik Brackmann at its head.
My name is Plamen Ivanov, and I’m a data engineer at South Gate Tech. I am consulting at Tide and am deeply involved in redesigning and implementing its analytics platform in AWS. In this article, I’ll share the journey that we have been on, architecting the platform in a way to best serve Tide’s needs. I’ll share the platform’s past — what we faced when the team was formed back in November 2019, the platform’s present — what we achieved in 6 months (Nov 2019 — Apr 2020), and the platform’s future — what we are planning to achieve by the end of the year (2020).
When I joined Tide in November 2019, the analytics platform was almost nonexistent. It was more of a link between transactional databases (MySQL and Amazon Aurora) and Looker (used for data visualization).
The data flow looked something like this:
● AWS Database Migration Service (DMS) tasks moving data from transactional databases (MySQL and Amazon Aurora) to Amazon Redshift
● Stitch moving external data (Zendesk, Facebook, Google) to Redshift
● Looker dashboards reading source data from Redshift
We identified a number of issues that we wanted to fix with the redesign of the analytics platform:
● Redshift contained only source data and all the heavy computations were done on the fly by Looker. This can be quite bad for Redshift because if the data is not persisted in some way, Looker refreshes any open dashboard every 5 min. Our Redshift disk usage went from 60% to 100% several times just because some colleagues have forgotten a heavy Looker dashboard refreshing in the background and the constant execution of complex queries resulted in disk spills. Our decision was to introduce integration and presentation layers in Redshift that would contain cleaned and aggregated data that Looker can use.
● We had no data lake and Redshift contained only current data (no history). This prevented us from doing any historic analysis and to keep track of how our data changes. Our decision was to create a data lake in S3 where the data would be immutable.
● Some of the reports had a bigger delay than desired by the business units and the analysts were not able to find all the data that they needed. The underlying reason was that we had limited control over the transfer of external data when using Stitch. Our decision was to bring these external transfers in-house and reduce the usage of Stitch. By doing this we would gain full control over what is transferred and how often it is transferred.
In the next 6 months, we focused on these 3 points.
Processing layers and dbt
In order to both speed up report creation and reduce errors, we separated our processing into multiple layers. Specifically, we introduced a cleaned integration and presentation layers in our data-warehouse. The cleaned layer includes all data that is ingested from external sources and from the data-lake. This data is then transformed into business entities and event tables, which live in the integration layer. Lastly, we use the presentation layer in order to optimize the speed of our queries.
At Tide, we believe in enabling analysts to have the best knowledge of what good business entities constitute, so we wanted to give them the possibility to contribute to our transformation process. After investigating multiple tools, we found that dbt would allow us to manage relatively complex transformations and tests using SQL, a tool that most analysts are already familiar with.
After we decided that dbt would be used, the question was how to deploy it and schedule it. For orchestration, we decided to go with Airflow because, based on our previous experience, this is the best tool for the job. We could also use it for the transfer of external data, and we had to deploy both dbt and Airflow. The DevOps team had already built infrastructure for Tide’s microservices using Amazon ECS. We decided to use the same infrastructure for a quick start, so I deployed dbt and Airflow in containers and linked them together so that Airflow could run dbt models on a schedule. The downside of this setup was that Airflow was deployed to only one container, and we were not able to take advantage of its Celery Executor. This was not such a big issue , however, as the number of Airflow DAGs is low and the memory of the container was sufficient.
Once we had dbt in place, the data analysts were trained how to use it, and they started building models that clean and aggregate the data for Looker. These models were scheduled via Airflow. Some of them would run once per day, some would run every 12 hours, and others would update tables every 30 min. We also added data quality and data latency checks with alerts integrated into Slack so that everyone in the channel would get alerted in case of data issues.
The concept of a data lake is pretty simple — store everything, because storage is cheap while data is valuable. It was clear that we should use S3 for the data lake, but two questions remained: How could we actually store our data on S3, and how could we link the data lake to Redshift?
We started with migrating some Zendesk data from Stitch to bringing it in-house with Airflow. The Airflow DAG would read the data from Zendesk’s API, add the new data to a daily partition in S3 in parquet format, and then copy the data from S3 to Redshift and deduplicate the old Redshift records so that we only had the current data.
The historic data is not frequently used and simultaneously takes more space than the current data. Even so, we wanted to enable the data analysts to query historic data in case they ever needed it. Our options were either Amazon Redshift Spectrum or Amazon Athena. The main factor for our decision was that with Spectrum, the data becomes visible in Redshift and could be used in a dbt model, while Athena still doesn’t have a dbt integration. So we went with Spectrum . I created a data catalogue in AWS Glue and external Spectrum tables in Redshift.
It was clear how to proceed with data that would go through Airflow, but we still didn’t know how to proceed with the transactional tables that use DMS which writes the current data directly to Redshift. After some investigation on best practices, we found that DMS can use S3 as endpoint and write logs about inserts, updates, and deletes of the data. Then, using Delta Lake, we could transform these logs into parquet data (both historic and current) stored on S3. And, using Spectrum, we could read the S3 data in Redshift. There was just one problem — Delta Lake requires Spark. And at this point, we still didn’t have Spark (or any other means for distributed data processing) in our analytics platform. This became one of the main points for the future development of Tide’s analytics platform.
Transfer of External Data
As I’ve already mentioned, we started migrating the transfer of external data from Stitch to Airflow. This allowed us to reduce the data latency to as low as 5 minutes, transfer some custom fields that are missing in the generic integrations, and decide where (S3/Redshift) and what (historic/current data) to transfer.
The plan for the Redshift layers and the migration of external data is to continue with what we’ve already started. We’ve already completed proofs of concept in these areas, but the complete migration will take months.
The main focus in the coming months will be on building Tide’s data lake, ensuring that Airflow can take the increased workload, and enabling data scientists to process large amounts of data in memory. For all this, we have to set up one additional piece of infrastructure for the analytics platform — an Amazon EMR cluster.
An EMR cluster will enable several things on Tide’s analytics platform:
● We will be able to install Spark on the EMR cluster. Spark is a very powerful analytics engine. It will be useful both for our plan to use Delta Lake together with DMS and for the need of data scientists and data engineers to process large amounts of data in-memory in distributed fashion.
● We will be able to move Airflow to the EMR cluster and use Celery executor. This way, we can scale Airflow up without the risk of it running out of memory with the increase of the number of jobs.
● We will be able to install Dask on the EMR cluster. A lot of data scientists are not familiar with Spark and are used to the pandas library. With Dask the learning curve won’t be that steep and they will be able to use pandas in distributed fashion.
Initially we considered using AWS Glue jobs with Spark instead of setting up an EMR cluster but there were several factors that tipped the scales:
● We want to keep using Airflow for orchestration and we want to scale it
● Considering the frequency of our jobs (every 5 min) an EMR cluster is more cost effective than going serverless with AWS Glue
● With AWS Glue we are limited to only Spark, while on a cluster we can have Spark, Dask and whatever else we decide in the future
This is how Tide’s analytics platform would look like after it’s redesign. It is a nice mixture of open source software that has proven its value over the years and AWS services that are the backbone of the platform. It is one of the many possible designs for an analytics platform. It serves the specific needs of the company and will enable data analysts and data scientists to make data driven decisions.
If you would like to find out more about how AWS can help your FinTech startup launch and grow, please fill out this brief form, and we’ll be in touch.