Antoine Lizée
Data @ Alan
19 Sept · Tech & produit

Alan's solution to sharing data knowledge: "Synced Views"

At Alan, we make data a core piece of our decision processes.

Data is imported and stored in our data warehouse (Postgres) from many different sources: production data from our applications, third parties (Intercom, Trello, Segment user tracking, Hubspot, ...), and data from a few tightly integrated actors in the health ecosystem.

Yet, gathering all this data into one place is not enough. Users of the data tend to run overlapping analyses separately. Work is duplicated, errors are more likely, and every user creates "local metrics" that are neither generalizable nor consistent with each other. Making raw data easily accessible turns into a nightmare.

[Figure: Schema of our main data architecture, with an accurate depiction of the value created when raw data is used directly.]

Prepared data should be centralized

Raw data, even "clean", is not ready for use. Making it actionable requires processing, and this processing needs to be harmonized:

  • Denormalization: scattered in different tables and schemas, imported from different sources, the data needs to be joined together to be useful.
  • Semantics: names need to be adapted to best practices, and key concepts defined elsewhere should be ready to use.
  • Computed fields & aggregations: KPIs and statistics are built from raw data.

To this effect, we need a layer that abstracts knowledge about the data. This layer has two missions: exposing a data API to the Alan team, and creating the data behind this API. It should be readily usable by all data users, transparently defined, easy to maintain, and come with embedded documentation.

Abstracting knowledge in code

We decided to use database views to solve this problem. These views were initially a quick solution to the repeated need of JOINing tables and renaming fields. In a few months, they grew organically into an easily extensible framework that serves all the data needs at Alan. We call it the "Synced Views".

The principle is simple: we use SQL to define this data layer, in views directly stored in our data warehouse. The code defining these views is treated like any code produced by and for a team: using version control, reviews and tests.

We built the synced_views module that takes the code defining the views from a GitHub repository and syncs them to a remote database as a continuous integration step (using our dear CircleCI). These views are the basis for all our analyses and applications down the road.

Description

[Figure: When you treat SQL like any other production code, you can leverage the existing ecosystems of testing, review and continuous integration to create reliable processes and long-term value.]

Views are defined as SQL queries in individual files. The name and schema of a view are determined by the file's name and path: templates/<schema_name>/<view_name>.sql.
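
As an illustration, here is a minimal sketch (not the actual module code) of how this convention maps a file path to a schema and view name:

from pathlib import Path

def schema_and_view_from_path(sql_path: Path) -> tuple[str, str]:
    """Hypothetical helper: derive (schema, view) names from the
    templates/<schema_name>/<view_name>.sql convention."""
    # e.g. Path("templates/core/hubspot_deals.sql") -> ("core", "hubspot_deals")
    return sql_path.parent.name, sql_path.stem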

When a branch is merged to master on Github, our continuous integration provider syncs the views to the remote data warehouse, automatically.

When a new commit is pushed to our GitHub repository, we trigger a "dry run" sync that tests the syntax of the views that have been changed or added since the last commit on master.
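
One way to implement such a dry run - a sketch assuming Postgres and psycopg2, not necessarily how our module does it - is to execute the CREATE VIEW statements inside a transaction and roll everything back:

import psycopg2

def dry_run_sync(view_creation_statements, dsn):
    """Illustrative dry run: Postgres parses and executes each statement,
    surfacing syntax and reference errors, but nothing is persisted."""
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            for sql in view_creation_statements:
                cur.execute(sql)  # raises on a broken view definition
    finally:
        conn.rollback()  # always roll back: this is only a test
        conn.close()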

Structure

We identified 5 key areas of application for the data at Alan: Growth, Product, Insurance, Operations and Business Metrics (more on them in a later post). Each has its own "application" database schema, which hosts the views for that problem area.

To support dependency between views in a clean way, we have the following rules:

  • No view in a schema can depend on a view from another schema.
  • An additional schema named core gathers views that can be used in any application schema.
  • Views within a single schema depend on each other following a model of computation stages. For instance, views from the files stored in the folder insurance_1 are synced before views in the folder insurance_2, so that the latter can depend on the former. They all end up in the insurance schema (see the sketch below).
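
To make the stage mechanism concrete, here is an illustrative sketch (the helper names are hypothetical) of how template folders could be ordered and mapped to their target schema:

import re
from pathlib import Path

# 'insurance_1' -> schema 'insurance', stage 1; 'core' -> schema 'core', stage 0
FOLDER_RE = re.compile(r"^(?P<schema>[a-z_]+?)(?:_(?P<stage>\d+))?$")

def sync_order(template_dirs):
    """Sort template folders so that lower stages sync first."""
    def key(folder: Path):
        match = FOLDER_RE.match(folder.name)
        return (match.group("schema"), int(match.group("stage") or 0))
    return sorted(template_dirs, key=key)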

These are the building blocks of the framework, but we expect it to evolve as our needs expand.

Example & Code

To reward the readers who have made it this far (thank you!), here are a few snippets extracted from our current synced views module, for illustration.

We use Hubspot as the CRM that powers our amazing sales team. That information is imported from their APIs in raw form by ETLs managed in Airflow, before being prepared by the synced views. The file templates/core/hubspot_deals.sql defines the synced view that exposes information about the deals managed by our sales team. Here is an excerpt of the file that demonstrates typical usage of the synced views:

SELECT d.id
    , cp.id AS company_id
    , d.dealstage AS stage_id

    , 'https://app.hubspot.com/contacts/4510447/deal/' || d.id AS url_to_hs
    , d.dealname AS name
    , dps.label AS stage
    , dps."displayOrder" AS stage_order
    , dp.label AS pipeline

    , d.createdate AS created_at
    , d.hs_lastmodifieddate AS updated_at
    , d.closedate AS closed_at

    , cp.numberofemployees AS n_employees
    , cp.name AS company_name

FROM hubspot.synced_deals d
LEFT JOIN hubspot.synced_companies cp ON cp.id = d."associatedCompanyIds"[1]
LEFT JOIN hubspot_stitch.deal_pipelines dp ON dp."pipelineId" = d.pipeline
LEFT JOIN hubspot_stitch.deal_pipelines__stages dps
ON dps."stageId" = d.dealstage AND dps."_sdc_source_key_pipelineId" = d.pipeline

The file above directly defines the query that will be synced as a view. We use Jinja templating to do the heavy lifting in a simple way. Here is a simplified version of the Jinja template that we use:

DROP VIEW IF EXISTS "{{ schema_name }}"."{{ view_name }}" CASCADE;
CREATE VIEW "{{ schema_name }}"."{{ view_name }}"
AS
{% include view_template_path %};

Then the main module renders it with code that looks like:

from jinja2 import Environment, FileSystemLoader

# Load the wrapper template from the repository root (simplified setup).
jinja_env = Environment(loader=FileSystemLoader('.'))

view_creator_template = jinja_env.get_template('view_creator.sql')
template_params = {
    'view_template_path': template_path_name,  # e.g. 'templates/core/hubspot_deals.sql'
    'schema_name': schema_name,                # e.g. 'core'
    'view_name': view_name,                    # e.g. 'hubspot_deals'
}
view_creation_sql = view_creator_template.render(template_params)
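
The rendered statement is then executed against the warehouse. A minimal sketch, assuming psycopg2 and a placeholder dsn connection string:

import psycopg2

# Apply the rendered DROP + CREATE VIEW statement to the warehouse.
with psycopg2.connect(dsn) as conn:
    with conn.cursor() as cur:
        cur.execute(view_creation_sql)
# the connection context manager commits the transaction on success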

Key Features & Next steps

Here are the key features of the running version of our synced_views module:

Dependent syncing: The ability to declare dependencies between views, through stages (discussed above).

Historized views: We didn't cover it in this post, but we support a simple syntax in the definition of a view that lets us retrieve, as a function, a version of the view that emulates past data from current data.

Selective testing: The ability to test only the views that have been modified on the working branch compared to master, rather than all the views all the time (see the sketch below).
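
For illustration, here is a sketch of how the modified views could be detected with git (the paths and refs are assumptions):

import subprocess

def changed_view_templates(base_ref="origin/master"):
    """Illustrative: list view templates added or modified on this branch."""
    diff = subprocess.run(
        ["git", "diff", "--name-only", f"{base_ref}...HEAD", "--", "templates/"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [path for path in diff.splitlines() if path.endswith(".sql")]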

Materialization: All views are now materialized and refreshed every day for better performance. One can still access the real-time version of a view by adding __rt as a suffix to its name.
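
As a sketch of that naming scheme in Postgres (the helper functions are hypothetical, cur being a database cursor):

def materialize_view(cur, schema, view):
    """Illustrative: snapshot the real-time view under the plain name,
    keeping the '__rt' suffix for the live version."""
    cur.execute(
        f'CREATE MATERIALIZED VIEW "{schema}"."{view}" AS '
        f'SELECT * FROM "{schema}"."{view}__rt"'
    )

def refresh_view(cur, schema, view):
    # run daily by a scheduler to keep the snapshot fresh
    cur.execute(f'REFRESH MATERIALIZED VIEW "{schema}"."{view}"')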

Missing-only: Detects and re-syncs only the views that are missing. This is useful when dropping a table or view has cascaded to the synced views that depend on it.

On the roadmap:

Versioned syncing on master: Currently, when merging to master, all the views are re-synced. This process is long and wasteful. We could leverage version control (git) to detect which view definitions have changed since the last successful sync, by storing the corresponding commit hash in the target database.
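
What this could look like, as a sketch (the synced_views_meta table and helper names are hypothetical):

import subprocess

def templates_changed_since_last_sync(cur):
    """Illustrative: diff the warehouse's recorded commit against HEAD."""
    cur.execute("SELECT last_synced_commit FROM synced_views_meta")
    last_commit = cur.fetchone()[0]
    diff = subprocess.run(
        ["git", "diff", "--name-only", f"{last_commit}..HEAD", "--", "templates/"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [path for path in diff.splitlines() if path.endswith(".sql")]

def record_sync(cur, commit_hash):
    """Illustrative: remember which commit the warehouse now reflects."""
    cur.execute("UPDATE synced_views_meta SET last_synced_commit = %s", (commit_hash,))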

Dependent testing: Currently, testing runs against the views already synced to the database. That prevents us from testing changes across several dependent views (since testing doesn't actually persist the views).

Embedded documentation: This is one of the key requirements for the framework to truly fulfill its mission of being a global data API for any user in the company. We have not yet tackled this point, but the vision is to co-locate the documentation and the code in the files that define the views.

Open sourcing: The syncing module would be easily reusable for teams that don't yet have a solution to comparable problems and would like to buy into this framework. We believe, however, that the project is missing a few key features (listed above) to be really useful and to start the open-sourcing discussion.

Beyond the roadmap of the module itself, the data community is working hard to make the data we publish usable and used by the Alan team. This goes beyond the data layer that creates and publishes this data, and requires appropriate tools and training - and we’re far from done!

Conclusion

[Figure: Our architecture with the Synced Views piece in place. Phew!]

Taking a step back, the framework of synced views allowed us to quickly build on our current infrastructure to serve the data needs of Alan. Yet, it ends up converging with a more traditional ETL framework that would build summary tables in succession, using a tool like Airflow for orchestration.

Compared to a classic ETL tool, the current framework shows one main limitation: the inability to have arbitrary code create the data. This framework is designed as a step to prepare data that already exists - we actually use Airflow to create the data upstream.

We believe our synced views framework has a few benefits that make it worth continuing to invest in:

  • Simplicity & Maintainability: No DAG to explicitly manage, no python code to write. Adding a new view is as simple as creating a new SQL file with a query written in it.
  • Declarativeness: The structure of the code doesn't reflect "how" it's done, but "what" is done. Anyone, regardless of technical ability, can easily find the definition of a given view and field.
  • Light & Portable: The module responsible for operating the sync is only a few hundreds line of codes, and doesn't require knowledge of heavy libraries. All the computation is done in the Database.
  • Standard: Any prepared data is in a view created with the same tool, the same way. We can focus on data best practices (naming, concept definition, documentation), not engineering ones (architecture of the code, pipeline management).

If you've read this far, you're passionate about enabling a team with data they can use. We are too! We believe it's core to a company's success, which is why we're hiring to develop our data community. Talk to us!
