
Under the hood: Architecture (and morality).

· DataOps, Data Operations, Data Pipeline

Welcome to the next blog in our series of five-minute reads, where we briefly cover the architecture of the DataShaka platform and the thinking behind it. This post is definitely for the more ‘technically’ minded :-). And apologies for the ‘morality’ line, a VERY bad ’80s music joke that only OMD fans will get.

The architecture of DataShaka is built on three elements: a modern data pipeline, microservices and an ‘API first’ approach. Together these are designed to achieve our three key tenets of performance, flexibility and security, wherever across the globe we are called on to deploy. Our overriding mindset is to keep the design and architecture as simple and flexible as possible while providing the critical elements our clients require, hopefully without too many trade-offs along the way. Our clients ultimately care about one thing: that their data is fit for purpose when they need it to be. All of our efforts are geared to that goal.

Following the pipeline pattern mentioned above, we give users a visible, trackable view of their data flow via the Data Control Panel. DataShaka pipelines work in parallel, ingesting data both directly from clients/users and from external third-party APIs (such as Brandwatch, Quandl, etc.). Because the work runs in parallel, we can handle multiple users and datasets concurrently without one ingest blocking another.
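
As a rough illustration of that parallel ingestion (not DataShaka’s actual code), the sketch below fetches from several sources concurrently; the endpoint URLs and the ingest helper are hypothetical.

```python
import concurrent.futures
import requests

# Hypothetical source endpoints -- stand-ins for third-party APIs and direct client uploads.
SOURCES = {
    "brandwatch": "https://api.example.com/brandwatch/mentions",
    "quandl": "https://api.example.com/quandl/timeseries",
    "client_upload": "https://api.example.com/uploads/latest",
}

def ingest(name: str, url: str) -> tuple[str, int]:
    """Fetch one source and return (source name, number of records)."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    records = response.json()
    return name, len(records)

# Run the ingests in parallel so one slow source does not block the others.
with concurrent.futures.ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
    futures = [pool.submit(ingest, name, url) for name, url in SOURCES.items()]
    for future in concurrent.futures.as_completed(futures):
        name, count = future.result()
        print(f"{name}: ingested {count} records")
```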

Once data is ingested into the system, it is parsed and converted to our unique format, Katsu (more on this ‘special sauce’ in future posts). The data then propagates through our various pipeline stages (configured per deployment), such as unification of the data, applying taxonomy rules and uploading to a staging area, before arriving at its final destination, for example a data storage location (with a ‘success’ email sent to the user if required).
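
Conceptually, each stage takes the output of the previous one, so a deployment’s pipeline can be thought of as an ordered list of steps. The sketch below is purely illustrative: the stage functions and the Record shape are assumptions, not the Katsu format or the real pipeline code.

```python
from typing import Callable

Record = dict          # Assumed stand-in for a row of client data; Katsu itself is not shown here.
Stage = Callable[[list[Record]], list[Record]]

def unify(records: list[Record]) -> list[Record]:
    """Illustrative stage: normalise field names across sources."""
    return [{k.lower(): v for k, v in r.items()} for r in records]

def apply_taxonomy(records: list[Record]) -> list[Record]:
    """Illustrative stage: tag each record with a (made-up) category rule."""
    return [{**r, "category": r.get("source", "unknown")} for r in records]

def stage_upload(records: list[Record]) -> list[Record]:
    """Illustrative stage: in the real platform this would write to a staging area."""
    print(f"staged {len(records)} records")
    return records

# A per-deployment pipeline is just an ordered list of stages.
PIPELINE: list[Stage] = [unify, apply_taxonomy, stage_upload]

def run_pipeline(records: list[Record]) -> list[Record]:
    for stage in PIPELINE:
        records = stage(records)
    return records

run_pipeline([{"Source": "brandwatch", "Value": 42}])
```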

As you would expect, we orchestrate a number of components under the hood, including the following:

  • Azure App Service, on which we host a Web Application

  • The Web Application contains the UI and API endpoints

  • The Azure App Service runs behind a load balancer to provide failover and performance

  • When data is ingested, the data file is passed to our Pipeline Service

  • The Pipeline Service runs a mix of PowerShell scripts and a console application (scheduled via the Windows Task Scheduler)

  • Blobs provide temporary storage

  • RabbitMQ is the communication layer between components (an illustrative sketch of this messaging appears after this list)

  • Pipeline steps are executed and data is parsed, validated and uploaded to data storage (MongoDB)

  • The status of each step is passed back to the Data Control Panel via blobs

  • Our ‘Static Data’ deployment model sees the data delivered to the client as soon as it has been processed

  • Our ‘Live Data’ deployment model has an API endpoint in the Web Application to serve query requests to our hosted, refined, client data

  • The endpoint is constantly ‘listening’ for requests

  • An intelligent Query Engine leverages distributed, modular microservices to respond to query requests securely

  • Once a request is made, it is parsed and passed to the ‘idle worker’ components

  • When the API endpoint is called (along with query parameters such as Custom Calendar, Tractors, etc.), the Web Application parses the request and passes it to the Query Processors (which are hosted on a cluster of Ubuntu machines)

  • Inside the Query Processor, the system parses the dates, Custom Calendar, etc., fetches data from the underlying data storage (MongoDB) and then applies Tractor scripts before returning the final data to the Web Application (a simplified sketch of this flow appears after this list)

  • The Web Application waits for the response from the Query Processors before returning the output to the user

  • In fulfilling a query request, we conduct detailed tracing and provide key performance metrics to our Datadog monitoring service for publication to the Performance Dashboard (shared with the client)
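
To give a flavour of the communication layer mentioned above, here is a minimal sketch of publishing a pipeline-step status message over RabbitMQ with the Python pika client. The queue name and message shape are assumptions for illustration, not the actual DataShaka message contract.

```python
import json
import pika

# Connect to a (hypothetical) RabbitMQ broker and declare the status queue.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="pipeline.step.status", durable=True)

# Illustrative status message -- the real message contract is not documented here.
status = {
    "pipeline_run": "2024-01-01T00:00:00Z",
    "step": "taxonomy",
    "state": "completed",
    "records": 1200,
}

channel.basic_publish(
    exchange="",
    routing_key="pipeline.step.status",
    body=json.dumps(status),
    properties=pika.BasicProperties(delivery_mode=2),  # persist the message
)
connection.close()
```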
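
And here is a deliberately simplified view of what a Query Processor does per request, using pymongo for the MongoDB fetch. The connection string, collection, date field and the stand-in for Tractor scripts are all hypothetical placeholders.

```python
from datetime import datetime
from pymongo import MongoClient

def run_query(start: datetime, end: datetime) -> list[dict]:
    """Illustrative query-processor step: fetch a date range, then post-process it."""
    client = MongoClient("mongodb://localhost:27017")   # hypothetical connection string
    collection = client["datashaka_demo"]["signals"]    # hypothetical database/collection

    # Fetch the raw documents for the requested window.
    documents = list(collection.find({"timestamp": {"$gte": start, "$lt": end}}))

    # Stand-in for applying Tractor scripts: a trivial post-processing pass.
    for doc in documents:
        doc["processed"] = True
    return documents

results = run_query(datetime(2024, 1, 1), datetime(2024, 2, 1))
print(f"returned {len(results)} documents")
```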

With regard to security, we use the TLS 1.2 protocol to encrypt traffic, meeting the PCI Data Security Standard (PCI DSS). Additionally, our APIs use token-based authentication: every query hitting our API must carry a token, and tokens are issued separately for each account for added security.
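
In practice that means a query request carries the account’s token over HTTPS. The sketch below shows a hypothetical client-side call; the endpoint path, parameter names and Bearer header format are assumptions rather than the documented DataShaka API.

```python
import requests

API_URL = "https://example.datashaka.com/api/query"   # hypothetical endpoint
ACCOUNT_TOKEN = "YOUR-ACCOUNT-TOKEN"                   # issued separately per account

response = requests.get(
    API_URL,
    headers={"Authorization": f"Bearer {ACCOUNT_TOKEN}"},   # token-based authentication
    params={
        "start": "2024-01-01",
        "end": "2024-02-01",
        "calendar": "custom-retail",     # hypothetical Custom Calendar name
        "tractor": "weekly-rollup",      # hypothetical Tractor script name
    },
    timeout=60,                          # requests negotiates TLS for https:// URLs
)
response.raise_for_status()
print(response.json())
```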

Our architecture is founded on the principle of flexibility, and we are not tied to a specific external service such as Azure or AWS. Our services are written in .NET Core, which is platform independent, so we can deploy them within any cloud service.

Services are also deployed to withstand the failure of a specific region, with components placed across different regions and geographies around the globe. Our design approach is to place the primary node of every component in the same region, with failover nodes in other regions. Auto-recovery of all components is also in place in case of any failure.

As you would expect, we are constantly evolving and are always looking for ways to make our platform more robust and performant. Right now, we’re taking a good look at design level enhancements within our data storage.


Thanks for reading. Drop us a line at hello@datashaka.com if you’d like to talk.