Engineering

Tech Talks: From zero to Modern Data Stack

June 28, 2022
8 minute read
Nacho Valdelvira

As part of our engineering team's Tech Talks blog series, Phlo's Data Scientist, Nacho Valdelvira, describes the evolution of Phlo's data platform, from an early hand-rolled v1 to a scalable Modern Data Stack.

As a digital service, we have data at the centre of our operations. We collect a range of data every day, and this gives us opportunities to do something useful with it: we can understand our patients' profiles and behaviours more deeply, while also using the data to predict stock demand and patient churn.

To act on these opportunities we need a solid Data Architecture that allows us to easily access any data generated by the business. Our Data Architecture should meet three (apparently) simple requirements: data should be easy to query, reliable and easy to discover.

Let’s define what this means in practice:

- Easy to query: all our data lives in one place and can be interrogated with plain SQL, without stitching together exports from different systems.
- Reliable: the data has passed quality tests, so we can trust the numbers it produces.
- Easy to discover: the data is documented, so anyone can find it and understand what ambiguous columns or metrics mean.

We have now defined our requirements, so which tools can meet our needs?

The Modern Data Stack

In the last few years, many new tools and technologies have emerged into the data ecosystem, allowing us to retrieve and extract value from our data more easily than ever before. As a result, a new stack of tools has emerged, the Modern Data Stack (MDS).

Our Modern Data Stack

The MDS (as defined by Fivetran) is a suite of tools covering the journey from raw data to insight. In order of how the data flows, these tools include:

- a data ingestion tool, which moves data from our sources into a central destination
- a Data Warehouse, which stores all that data and acts as the single source of truth
- a data transformation tool, which tests, transforms and documents the raw data
- a Business Intelligence tool, which visualises the results

The MDS is hosted in the cloud and requires little technical configuration by the user. These characteristics promote end-user accessibility as well as scalability to quickly meet growing data needs without the costly and lengthy downtime associated with scaling local server instances.

Having done our research, we knew that the MDS was what we required. However, implementing the MDS takes time and resources, and the data tasks backlog doesn’t wait.

Data Architecture v1

We first started implementing a simplified v1 Data Architecture to keep the business moving. This consisted of a set of scheduled cronjobs that copied data out of our production sources each night so it could be queried and reported on.

That was our first architecture, so why did we build it that way?

It allowed us to provide value quickly while buying us time to investigate the MDS. Once v1 was operational, we started looking for our new set of tools and iterating on the Data Architecture, keeping a few principles in mind: data should be easy to query, reliable and easy to discover, and every tool choice should ultimately deliver commercial business value.

Our tool research journey

Starting our tool research journey, a key priority was finding a Data Warehouse solution to act as our single source of truth: a place to bring together data from disparate sources.

We chose BigQuery as our Data Warehouse.

There are good alternatives such as Snowflake, Amazon Redshift or Azure Synapse Analytics (formerly Azure SQL Data Warehouse), but as existing Google Cloud Platform users, it felt sensible to stay within the same ecosystem. Our experience with BigQuery is that it’s an easy-to-use and cost-effective tool, which is important to consider in a start-up environment.
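The value of a single source of truth is that one SQL query can span data that originated in completely different systems. As a minimal illustration (using Python's built-in sqlite3 as a stand-in for BigQuery, with invented table and column names rather than Phlo's real schema):

```python
import sqlite3

# Two tables that would have arrived from different sources.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patients (id INTEGER, signup_channel TEXT);
    CREATE TABLE orders (patient_id INTEGER, total REAL);
    INSERT INTO patients VALUES (1, 'google_ads'), (2, 'organic');
    INSERT INTO orders VALUES (1, 12.5), (1, 7.5), (2, 30.0);
""")

# One query across both sources: revenue per acquisition channel.
rows = conn.execute("""
    SELECT p.signup_channel, SUM(o.total) AS revenue
    FROM patients p JOIN orders o ON o.patient_id = p.id
    GROUP BY p.signup_channel
    ORDER BY revenue DESC
""").fetchall()
print(rows)  # [('organic', 30.0), ('google_ads', 20.0)]
```

Once every source lands in the warehouse, questions like this become a join rather than a data export exercise.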

Next up — Airbyte

The Data Warehouse is our destination, but we needed a tool to ingest data from our different sources into it, and Airbyte offers what we need.

No-code tools are on the rise. In the end, the MDS is about empowering anyone to create a data stack without requiring a team of data engineers. Airbyte is on a mission to make data integration pipelines a commodity, connecting with most of the sources we required (Facebook Ads, Intercom, Google Ads, Google Analytics and more) with the click of a button. Airbyte continues to move quickly and is planning to add many more connectors in the future.

Airbyte also offers Incremental Append functionality. As table sizes increase, we won’t be able to dump whole database tables into BigQuery every night without affecting our database’s performance; we want to add to the existing table only what has changed. We also want to refresh the data more frequently, so Change Data Capture (CDC) was a natural next step.
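The idea behind incremental append can be sketched in a few lines: track a cursor field (here a hypothetical `updated_at` column) and copy only rows newer than the last synced value. This is an illustrative stdlib sketch, not Airbyte's actual implementation:

```python
from datetime import datetime

def incremental_sync(source_rows, destination, state):
    """Copy only rows whose cursor field is newer than the saved state."""
    cursor = state.get("updated_at", datetime.min)
    new_rows = [r for r in source_rows if r["updated_at"] > cursor]
    destination.extend(new_rows)
    if new_rows:
        # Advance the cursor so the next run skips everything synced so far.
        state["updated_at"] = max(r["updated_at"] for r in new_rows)
    return len(new_rows)

source = [
    {"id": 1, "updated_at": datetime(2022, 6, 1)},
    {"id": 2, "updated_at": datetime(2022, 6, 2)},
]
dest, state = [], {}
incremental_sync(source, dest, state)   # first run copies both rows

source.append({"id": 3, "updated_at": datetime(2022, 6, 3)})
copied = incremental_sync(source, dest, state)
print(copied, len(dest))  # 1 3 — only the new row moved on the second run
```

CDC goes one step further by reading the database's change log instead of polling a cursor column, which also captures deletes.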

We trialled Fivetran and Matillion, though Airbyte stood out as it gave us everything we required.

Next steps — data reliability and interpretation

Great, we now have three entities in our architecture: Sources + Destination + a tool that moves data from Sources to Destination. We have all the raw data in our Data Warehouse, so with some knowledge of SQL we can join, transform and clean it to get whatever we need. This is a great first iteration and a huge improvement on the cronjob architecture. However, we are still lacking data quality and discovery tooling, so that’s our next step.

We need to know that our data is reliable and has passed quality tests. We also want to be able to discover data and understand what ambiguous columns or metrics mean. And most importantly, we want to transform data and keep those transformations under version control.

dbt can help with all of that!

It’s difficult to find a tool that matches dbt’s capabilities in the current data ecosystem: dbt has quickly become the dominant transformation layer and has changed how teams build their data architecture.

dbt offers:

- SQL-based transformations kept under version control
- data quality tests that tell us our data is reliable
- auto-generated documentation, making data easier to discover

Auto-generated dbt documentation website

Syncing everything together

We have a few pieces of the puzzle now: Sources + Destination + a tool that moves data from Sources to Destination + a tool that tests, transforms and documents the data. Now we just need to orchestrate all these tools. Ideally, we want to run transformations just after data is ingested, which is more reliable than scheduling both actions independently.

We will use Airflow for this.

Airflow is a popular orchestration tool used to automatically organise, execute and monitor data flows. It allows us to create the basic dependency between Airbyte and dbt that we required.
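That dependency — "run dbt only after the Airbyte sync finishes" — is the whole point of the orchestrator. A minimal stdlib sketch of the ordering (function names are illustrative; in real Airflow code these would be operators wired together with the `airbyte_sync >> dbt_run` dependency syntax):

```python
executed = []

def airbyte_sync():
    """Stand-in for triggering the Airbyte ingestion job."""
    executed.append("airbyte_sync")

def dbt_run():
    """Stand-in for the dbt transformation; must run after ingestion."""
    assert "airbyte_sync" in executed, "ingestion must finish first"
    executed.append("dbt_run")

# The orchestrator's job: execute tasks in dependency order.
for task in (airbyte_sync, dbt_run):
    task()
print(executed)  # ['airbyte_sync', 'dbt_run']
```

Beyond ordering, Airflow adds the pieces this sketch omits: scheduling, retries on failure, and monitoring of each run.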

Astronomer offers a managed version of Airflow, while Dagster, Prefect and Luigi are also in the same space. We went with self-hosted Airflow (or rather, Airphlo 🙂 ) as it had predictable pricing: just the monthly cost of the virtual machine it runs on.

Moving forward, we recognise that our need to run custom Airflow containers may mean turning to other solutions as part of our overall data architecture.

Looking at the data in more detail

Last but not least, we need to visualise the data.

We’re using two tools to meet this need — Google Data Studio and Metabase.

Google Data Studio was an easy first choice: we already use the Google stack, so it’s easy to integrate, and it’s also free. It offers enough functionality for our main use case, which is building KPI (Key Performance Indicator) and metric-tracking dashboards.

We’re also experimenting with Metabase with the idea of enabling self-service analytics. Metabase is a great tool for this purpose and also has a better UI (User Interface) than Google Data Studio.

Having a self-service tool allows anyone in the business to be an analyst and solve data questions without needing to create SQL queries or rely on the data team. This looks easy on paper but it requires a lot of background work to implement. dbt’s data catalog and a two-way movement of metadata (from dbt to the dashboards and vice versa) help with data discovery, but the complex task here is translating our technical model into a rich domain model that anyone in the business can understand.
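In its simplest form, that translation is a mapping from technical column names to business-friendly descriptions surfaced in the BI tool. A hypothetical sketch (the column names and descriptions are invented for illustration):

```python
# Business glossary, e.g. exported from dbt column documentation.
catalog = {
    "orders.created_at": "Date the patient placed the order",
    "patients.churned": "Patient has not ordered in the last 90 days",
}

def friendly_label(column):
    """Show the business description if we have one, else the raw name."""
    return catalog.get(column, column)

print(friendly_label("patients.churned"))  # business-friendly description
print(friendly_label("orders.total"))      # no entry: raw name shown
```

The hard part is not the lookup but agreeing on definitions like "churned" across the business and keeping them in sync with the models.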

Another approach is to accept that it’s difficult to solve every data question without knowing SQL, so we build dashboards that answer our common data questions. For more complex queries, we’ll rely on analysts. For example, specific questions like “How many patients had medication A combined with medication B in the last two months?” are difficult to solve without SQL knowledge, so the approach would be to assign this task to a “domain expert” analyst.
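The medication question above is a good example of why SQL knowledge is needed: it requires grouping and counting distinct medications per patient within a date window. A sketch with illustrative data, again using sqlite3 as a stand-in for the warehouse:

```python
import sqlite3

# Invented dispenses table; "last two months" hard-coded as a date cutoff.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dispenses (patient_id INTEGER, medication TEXT, dispensed_on TEXT);
    INSERT INTO dispenses VALUES
        (1, 'A', '2022-06-01'), (1, 'B', '2022-06-10'),
        (2, 'A', '2022-06-05'),
        (3, 'A', '2022-01-05'), (3, 'B', '2022-06-07');
""")

# Patients who had BOTH medication A and B within the window.
rows = conn.execute("""
    SELECT patient_id FROM dispenses
    WHERE dispensed_on >= '2022-05-01'
    GROUP BY patient_id
    HAVING COUNT(DISTINCT CASE WHEN medication IN ('A', 'B')
                               THEN medication END) = 2
""").fetchall()
print(rows)  # [(1,)] — patient 3's medication A falls outside the window
```

A drag-and-drop dashboard struggles with this kind of conditional counting, which is exactly where the domain-expert analyst comes in.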

There is lots of thinking to do here so we don’t have a definitive answer on how to approach this!

We’re almost done! (for now…)

So, in summary, we have:

- Airbyte ingesting data from our sources into the Data Warehouse
- BigQuery as our Data Warehouse and single source of truth
- dbt testing, transforming and documenting the data
- Airflow orchestrating the whole pipeline
- Google Data Studio and Metabase for dashboards and self-service analytics

Our Data Architecture maturity has now reached an acceptable level for retrospective analysis, i.e. understanding what has happened. Next is paving the way to feed valuable data back into our products, and starting to look into the future with predictive models. This requires a higher level of maturity than our architecture currently provides.

Therefore, we need to improve on this and aim for a DevOps-style approach to our Data Architecture. We have a long way to go and lots of improvements to make, so we’ll continue to review our Data Architecture and update it to meet our needs.

As a final thought, we want to emphasise that building an effective data platform boils down to the principles mentioned at the beginning of this article. Our objective is to provide commercial business value by having data that is easy to query, reliable and easy to discover. Keeping that front of mind, it doesn’t necessarily matter which tools you use to achieve it.

We’re always looking for talented people to join Phlo. Find out more about our open positions in engineering and help lead the future of digital pharmacy.

