Tech Talks: From zero to Modern Data Stack

June 29, 2022 | 8 minute read

As part of our engineering team’s Tech Talks blog series, Phlo’s Data Scientist, Nacho Valdelvira, describes the evolution of Phlo’s data platform, from an early hand-rolled v1 to a scalable Modern Data Stack.

As a digital service, data is at the centre of our operations. We collect a range of data every day, and this gives us opportunities to do something useful with it. We can understand our patients’ profiles and behaviours more deeply, whilst also using the data to predict stock demand and patient churn.

To take action on these opportunities we need a solid Data Architecture that allows us to easily access any data generated by the business. Our Data Architecture should meet three (apparently) simple requirements: data should be easy to query, reliable and easy to discover.

Let’s define what this means in practice:

  • Data is easy to query → We need a single source of truth, a place where data from disparate sources converges, so we know that we’ll find everything we need in a single place.
  • Data is reliable → We need data testing to ensure we are using reliable data in our analysis and predictions.
  • Data is easy to discover → There should be no ambiguity. All tables should be self-explanatory. When this isn’t possible, we need a Data Catalog — an index page where all data sources and metrics are documented and properly defined. We need a translation from technical to business definitions to reduce the number of blockers when someone unfamiliar with the data sources wants to dig into the data. Additionally, it should be easy to interpret insights and create stories based on them, so we need a data visualisation tool.

We have now defined our requirements, so which tools can meet our needs?

The Modern Data Stack

In the last few years, many new tools and technologies have emerged into the data ecosystem, allowing us to retrieve and extract value from our data more easily than ever before. As a result, a new stack of tools has emerged, the Modern Data Stack (MDS).

Our Modern Data Stack

The MDS (as defined by Fivetran) is a suite of tools used for data integration. In order of how the data flows, these tools include:

  • A fully managed ELT (Extract-Load-Transform) data pipeline
  • A cloud-based columnar warehouse or data lake as a destination
  • A data transformation tool
  • A business intelligence or data visualisation platform

The MDS is hosted in the cloud and requires little technical configuration by the user. These characteristics promote end-user accessibility as well as scalability to quickly meet growing data needs without the costly and lengthy downtime associated with scaling local server instances.

Having done our research, we knew that the MDS was what we required. However, implementing the MDS takes time and resources, and the data tasks backlog doesn’t wait.

Data Architecture v1

We first started implementing a simplified v1 Data Architecture to keep the business moving. This consisted of:

  • Version-controlled SQL queries running on a GCP cron job to extract only the data we wanted without interfering with production operations (see the sketch after this list).
  • Extracts which were enriched with Scheduled Queries to get them ready for analysis and visualisation with Google Data Studio.
  • A data lake mounted using gcsfuse as a Linux file system on the same server that ran our SQL queries, so we could automatically extract data into our data lake each day without any fear that we’d exceed storage limits.
  • A consistently-structured “latest” daily dataset which we could then materialise automatically as tables in BigQuery to support operational reporting.
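
To make that concrete, here is a minimal sketch of the kind of daily extract job v1 ran. The table, columns, paths and connection URL are hypothetical placeholders, and it assumes pandas and SQLAlchemy plus a GCS bucket mounted at /mnt/datalake via gcsfuse.

```python
# Minimal sketch of a v1-style daily extract (all names, paths and the
# connection URL are hypothetical placeholders). A version-controlled SQL
# query runs against the production database, and the result is written
# into the data lake via the gcsfuse mount; a cron job on a GCP VM would
# invoke a script like this once a day.
import os
from datetime import date

import pandas as pd
from sqlalchemy import create_engine

# e.g. a read replica, so we don't interfere with production operations
ENGINE = create_engine(os.environ["PROD_REPLICA_URL"])

EXTRACT_SQL = """
    SELECT order_id, patient_id, created_at, status
    FROM orders
    WHERE created_at >= CURRENT_DATE - INTERVAL '1 day'
"""

def run_daily_extract() -> None:
    # Pull only the columns and rows we actually need.
    frame = pd.read_sql(EXTRACT_SQL, ENGINE)

    # Writing to the mounted path lands the file in the GCS data lake.
    out_path = f"/mnt/datalake/orders/orders_{date.today().isoformat()}.csv"
    frame.to_csv(out_path, index=False)

if __name__ == "__main__":
    run_daily_extract()
```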

That was our first architecture, so why did we build it that way?

It allowed us to provide value quickly while buying us time to investigate the MDS. Once v1 was operational, we started looking for our new sets of tools and iterating over the Data Architecture, keeping a few principles in mind:

  • The main purpose of the data stack is to provide business value. We’ll implement the tools we need to solve our problems and refine our tooling as new requirements emerge. While our v1 cron job architecture was good enough for the moment, it wasn’t scalable and met few of the requirements we defined at the start of this article.
  • We’ll use open source solutions and not get tied to any proprietary programming language.
  • We know that tools come and go, so we should be tool agnostic and make it easy to couple/decouple different parts of our architecture.
  • Keep it simple.

Our tool research journey

Starting our tool research journey, a key priority was finding a Data Warehouse solution to act as our single source of truth. A place to bring together data from disparate sources.

We chose BigQuery as our Data Warehouse.

There are good alternatives like Snowflake, AWS Redshift or Azure Synapse Analytics (formerly Azure SQL Data Warehouse), but as an existing Google Cloud Platform user, it felt sensible to remain in the same ecosystem. Our experience with BigQuery is that it’s an easy-to-use and cost-effective tool, which is important to consider in a start-up environment.

Next up — Airbyte

The Data Warehouse is our destination, but we needed a tool to ingest data from different sources into it, and Airbyte offers what we need.

No-code tools are on the rise. In the end, the MDS is about empowering anyone to create a data stack without requiring a team of data engineers. Airbyte is on a mission to make data integration pipelines a commodity, connecting with most of the sources we required (Facebook Ads, Intercom, Google Ads, Google Analytics and more) with the click of a button. Airbyte continues to move quickly and is planning to add many more connectors in the future.

Airbyte also offers Incremental Append functionality. As table sizes grow, we won’t be able to dump whole database tables into BigQuery every night without affecting our database’s performance; we want to add to the existing table only what has changed. We also want to refresh the data more frequently, so Change Data Capture (CDC) was a natural next step.
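
To illustrate the idea behind Incremental Append (this is the general pattern, not Airbyte’s implementation, and the names are hypothetical): each sync keeps a cursor per stream and only appends the rows that changed since the previous run.

```python
# Conceptual sketch of an incremental append sync. This shows the general
# pattern rather than Airbyte's implementation; `source`, `destination` and
# the column names are hypothetical stand-ins.
def incremental_append(source, destination, stream: str, state: dict) -> dict:
    """Append only the rows updated since the stored cursor."""
    cursor = state.get(stream)  # e.g. the max updated_at seen so far

    query = f"SELECT * FROM {stream}"
    if cursor is not None:
        query += f" WHERE updated_at > '{cursor}'"  # only what has changed

    new_rows = source.fetch(query)          # hypothetical source client
    destination.append(stream, new_rows)    # add to the existing table

    # Advance the cursor so the next run starts where this one stopped.
    if new_rows:
        state[stream] = max(row["updated_at"] for row in new_rows)
    return state
```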

We trialled Fivetran and Matillion, though Airbyte stood out as it gave us everything we required.

Next steps — data reliability and interpretation

Great, we now have three entities in our architecture: Sources + Destination + a tool that moves data from Sources to Destination. We have all the raw data in our Data Warehouse, so with some knowledge of SQL we can join, transform and clean it to get whatever we need. This is a great first iteration and a huge improvement on the cron job architecture. However, we are still lacking data quality and discovery tooling, so that’s our next step.

We need to know that our data is reliable and that it has passed quality tests. We also want to be able to discover data and understand what ambiguous columns or metrics mean. And, most importantly, we want to transform data and version control those transformations.

DBT can help with all of that!

It’s difficult to find an alternative that matches DBT’s capabilities in the current data ecosystem. DBT has become the dominant transformation tool in modern data architectures and has completely changed how teams work with data in the warehouse.

DBT offers:

  • The ability to version control data transformations (with branches and commits) and to create dependencies between tables. Let’s say we want to show weekly orders from patients with diabetes in a dashboard. We need to join different tables to produce that, retrieving data from at least the orders, patients and medication tables. With DBT we create this new table with a dependency on the other three, and we can see all dependencies in a diagram. This way, if generating the table fails, we can easily spot the root cause of the failure.
  • Data testing — For example, we can check that the orders table has no null ID fields every time we build it. We can do more complex tests, like checking that two tables have the same number of rows. This addresses the data reliability requirement mentioned previously (a sketch of this kind of check follows below).
  • Data catalog — We can document each column and table, all version-controlled, and generate a documentation website with a search engine to find what we’re looking for. How would I know what ampp_id means if it wasn’t for the data catalog? We always try to make names self-explanatory but sometimes that’s not possible!

Auto-generated dbt documentation website
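
In DBT itself this kind of check is declared as a test on a model, but as a standalone illustration of what such a test asserts, here is a sketch of a not-null check run directly against BigQuery. The dataset, table and column names are hypothetical, and it assumes the google-cloud-bigquery client library.

```python
# Standalone sketch of the kind of not-null assertion a DBT test expresses,
# run directly against BigQuery. Dataset, table and column names are
# hypothetical; assumes the google-cloud-bigquery client library.
from google.cloud import bigquery

def assert_no_null_ids(table: str = "analytics.orders", column: str = "order_id") -> None:
    client = bigquery.Client()
    sql = f"SELECT COUNT(*) AS bad_rows FROM `{table}` WHERE {column} IS NULL"
    bad_rows = next(iter(client.query(sql).result())).bad_rows

    # A failing check should stop the pipeline rather than let bad data through.
    if bad_rows:
        raise AssertionError(f"{table}.{column} has {bad_rows} NULL values")

if __name__ == "__main__":
    assert_no_null_ids()
```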

Syncing everything together

We have a few pieces of the puzzle now — Sources + Destination + a tool that moves data from Sources to Destination + a tool that does data testing, transforms it and documents it. Now we just need to orchestrate all these tools. Ideally, we want to run transformations just after data is ingested, which would be more efficient than doing both actions independently.

We will use Airflow for this.

Airflow is a popular orchestration tool used to automatically organise, execute and monitor data flows. It lets us create the basic dependency between Airbyte and DBT that we need.
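
As a rough illustration of that dependency, an Airflow DAG can trigger the Airbyte sync and then run DBT once the sync has finished. This is a sketch, assuming the apache-airflow-providers-airbyte package, an Airbyte connection configured in Airflow, and a DBT project checked out on the worker; the connection ID and paths are placeholders.

```python
# Sketch of an Airflow DAG that runs DBT right after the Airbyte sync.
# Assumes the apache-airflow-providers-airbyte package, an 'airbyte_default'
# connection configured in Airflow, and a DBT project checked out on the
# worker; the connection ID and paths below are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.airbyte.operators.airbyte import AirbyteTriggerSyncOperator

with DAG(
    dag_id="daily_ingest_and_transform",
    start_date=datetime(2022, 6, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Trigger the Airbyte connection and wait for the sync to finish.
    airbyte_sync = AirbyteTriggerSyncOperator(
        task_id="airbyte_sync",
        airbyte_conn_id="airbyte_default",
        connection_id="REPLACE_WITH_AIRBYTE_CONNECTION_ID",
    )

    # Transform and test only once fresh data has landed in BigQuery.
    dbt_run = BashOperator(
        task_id="dbt_run_and_test",
        bash_command="cd /opt/dbt && dbt run && dbt test",
    )

    airbyte_sync >> dbt_run
```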

Astronomer.io offers a managed version of Airflow, while Dagster, Prefect and Luigi are also in the same space. We went with Airflow (or rather, Airphlo 🙂 ) as it had predictable pricing: the cost of the virtual machine each month.

Moving forward, we recognise that our need to run custom Airflow containers may mean turning to other solutions as part of our overall data architecture.

Looking at the data in more detail

Last but not least, we need to visualise the data.

We’re using two tools to meet this need — Google Data Studio and Metabase.

Google Data Studio was an easy first choice: we already use the Google stack, so it’s easy to integrate, and it’s also free. It offers enough functionality for our main use case, which is building KPI (Key Performance Indicator) and metric-tracking dashboards.

We’re also experimenting with Metabase with the idea of enabling self-service analytics. Metabase is a great tool for this purpose and also has a better UI (User Interface) than Google Data Studio.

Having a self-service tool allows anyone in the business to be an analyst and answer data questions without needing to write SQL queries or rely on the data team. This looks easy on paper but requires a lot of background work to implement. The DBT data catalog and a two-way flow of metadata (from DBT to dashboards and vice versa) help with data discovery, but the complex task here is translating our technical model into a rich domain model anyone in the business can understand.

Another approach is to accept that it’s difficult to solve every data question without knowing SQL, so we build dashboards that answer our common data questions. For more complex queries, we’ll rely on analysts. For example, specific questions like “How many patients had medication A combined with medication B in the last two months?” are difficult to solve without SQL knowledge, so the approach would be to assign this task to a “domain expert” analyst.

There is lots of thinking to do here so we don’t have a definitive answer on how to approach this!

We’re almost done! (for now…)

So, in summary, we have:

  • A data warehouse, a single source of truth where we store data from disparate sources → BigQuery
  • A tool that gets data from sources and puts it in our destination → Airbyte
  • A tool that manages data transformation, provides testing functionality and data documentation → DBT
  • A tool that orchestrates all of the above → Airflow
  • A tool for data visualisation → Google Data Studio and Metabase

Our Data Architecture maturity has now reached an acceptable level for retrospective analysis, i.e. to understand what has happened. Next is paving the way to feed valuable data back into our products, and starting to look into the future with predictive models. This requires a higher level of maturity, and currently:

  • We have no properly defined approach for updating tooling versions, such as Airbyte and the connectors it uses.
  • Staging and production are treated as independent environments, which makes it easier for issues to appear.
  • We’ll likely have more issues that we haven’t encountered yet.

Therefore, we need to improve on this and aim for a DevOps type of approach on our Data Architecture. We have a long way to go and lots of improvements to make, so we’ll continue to review our Data Architecture and update it to meet our needs.

As a final thought, we want to emphasise that building an effective data platform boils down to the principles mentioned at the beginning of this article. Our objective is to provide commercial business value by having data that is easy to query, reliable and easy to discover. Keeping that front of mind, it doesn’t necessarily matter which tools you use to achieve it.

We’re always looking for talented people to join Phlo. Find out more about our open positions in engineering and help lead the future of digital pharmacy.

