<aside> <img src="/icons/hare_gray.svg" alt="/icons/hare_gray.svg" width="40px" /> To create this unique UX for data analytics, we needed a very different tech stack from competitors who base theirs on SQL queries or Python notebooks.

</aside>

Arrow

Apache Arrow is a column-oriented data format optimized specifically for the operations that are common in data analysis. It enables very efficient data storage and data transfer between processes, since it is a compact binary format designed for operating on batches. A rapidly growing ecosystem has formed around it in the Python and machine-learning communities, where widely used technologies such as Spark, Pandas, ClickHouse, and DuckDB are Arrow-compatible. It also simplifies running computations on different architectures: the memory representation is the same regardless of programming language or processing unit (GPU or CPU), making it ideal for sharing data between platforms.

Darro

Darro is our own column-oriented binary format. It shares most of Arrow’s advantages but is specifically optimized for the kinds of aggregations and filtering our UI performs, which makes it tens of times faster than some SQL-based alternatives such as DuckDB.
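
Darro’s internals are not public, but the core idea behind this kind of columnar engine can be sketched in a few lines. The names below are hypothetical, not Darro’s actual API; the point is that a filter plus an aggregation only touches the columns involved, as contiguous arrays:

```python
import numpy as np

# Hypothetical columnar store: each column is one contiguous array,
# which is what makes vectorized filters and aggregations fast.
columns = {
    "category": np.array(["a", "b", "a", "c", "a"]),
    "value": np.array([10.0, 20.0, 30.0, 40.0, 50.0]),
}

# A UI filter ("category == 'a'") becomes a boolean mask over one column...
mask = columns["category"] == "a"

# ...and an aggregation scans only the masked values of another column.
total = columns["value"][mask].sum()  # 90.0
```

A row-oriented store would have to walk every field of every row for the same query; the columnar layout lets the engine skip untouched columns entirely.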

We developed it because we wanted our product to respond to every user interaction immediately, with no noticeable delay, and that means multiple UI refreshes (with hundreds of queries executed) per second. To avoid the overhead of network requests (often larger than the query execution itself), we compute queries directly in the browser, without talking to any server, thanks to WebAssembly. To that end, we need to keep memory requirements low, so memory efficiency is another major focus that we are continuously improving.

All of this means a more responsive user experience and much lower maintenance and operational costs for us, as most of the computation is done locally on the user’s computer, which also makes it possible to offer a more generous free tier.

The exact same code that runs in the browser can be compiled to a native executable, and that is what we use when we need to run the computations in the backend (e.g. automatically refreshing a project with new data from an integration, without user interaction). Sharing code between frontend and backend also lowers maintenance costs.

We are now leveraging this architecture to execute not only exploratory queries but also data transformations (converting data types, computing derived columns with formulas, etc.) and algorithms (clustering, network layout…) in the frontend, without needing to port codebases.

This lets us put new features into production very quickly.

Arrow vs Darro

We use Arrow mainly when working with machine learning algorithms in Python in our backend, as it is widely supported in the Python ecosystem.

We use Darro in our frontend, and we are increasingly using it in the backend as well, replacing Arrow for most functions. The goal is to keep a single codebase, so that every new feature implemented in the backend is readily available in the frontend.

Backend vs Frontend

Even though our goal is to compute almost everything locally in order to improve the user experience, certain functions are only feasible to run on our servers (e.g. large machine learning models). We are building a heterogeneous computation architecture that leverages the best of both worlds: some computations rely on GPUs in the backend, while others run in-browser. The Darro and Arrow formats allow us to make this real with a small development team.