Data Science in a Box. Part 1: Overview


As we go about upskilling internal resources at work in Data Science, Machine Learning and Python, I took a step back and looked at the reference architecture we were building.

Whilst I appreciate that in an Enterprise environment you sometimes need to pay for proprietary, closed-source software (for the service wrapper, for the functionality, or simply because it's easier than building an internal capability), there's no reason why this kind of setup should be out of reach for someone looking to teach themselves who's willing to get their hands dirty.

In addition, a lot of the commercials with vendors now rely on a consumption model with a variable cost. This is great for flexibility, but not necessarily when you're just starting out and/or looking to test new ideas - tools like BigQuery (Google) or Athena (AWS) charge by the amount of data scanned, so every exploratory query adds to the bill.
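To make the "charge by the amount of data scanned" point concrete, here's a rough back-of-the-envelope sketch. The per-terabyte rate and the workload figures are placeholders I've made up for illustration, not quoted prices from either vendor, so plug in the current rate from the pricing page before trusting the number.

```python
# Back-of-the-envelope cost of a pay-per-data-scanned query service.
# NOTE: the rate and workload below are illustrative assumptions,
# not real vendor prices.

BYTES_PER_TB = 1024 ** 4
BYTES_PER_GB = 1024 ** 3


def scan_cost(bytes_scanned: int, price_per_tb: float) -> float:
    """Cost of a single query that scans `bytes_scanned` bytes."""
    return bytes_scanned / BYTES_PER_TB * price_per_tb


def monthly_cost(bytes_per_query: int, queries_per_day: int,
                 price_per_tb: float, days: int = 30) -> float:
    """Cost of running the same size of query every day for `days` days."""
    return scan_cost(bytes_per_query, price_per_tb) * queries_per_day * days


if __name__ == "__main__":
    # Hypothetical team: 50 exploratory queries a day, each scanning 200 GB,
    # at an assumed rate of 5.0 currency units per TB scanned.
    total = monthly_cost(200 * BYTES_PER_GB, 50, 5.0)
    print(f"Estimated monthly spend: {total:.2f}")
```

The point isn't the exact figure, it's that the bill grows with how much you poke at the data, which is exactly what you want to do freely when you're learning.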

Given that pretty much everything I know in this space has been learnt through sharing and OpenSource projects, I thought I'd set about giving something back... There's also the real issue that I'd like to be able to set this up and play around at home without worrying about cost!

This will end up being a series of postings as I stumble my way through the implementation. By the end I'd like to have a set of Ansible Playbooks, probably also utilising the new ansible-container for Docker, such that this infrastructure can be deployed and configured in a matter of minutes by anyone, to help democratise this knowledge.

Requirements and assumptions

I'm assuming that we're looking to run a small (<= 5) team of analysts with minimal fuss - that means integrated services and Single Sign On (SSO). Luckily I've already covered setting up OpenLDAP for SSO, so this will hopefully just be a logical extension. You can obviously drop in your own Active Directory / LDAP implementation, but if you're at that stage chances are you don't need this article anyway...!

  • Notebooks: I'm going to use JupyterHub as a collaboration space, with DockerSpawner and LDAP integration
  • Databases: We'll need to cover off blob storage, SQL and NoSQL. I might also chuck a Key/Value store like Redis in there too for good measure
  • Version control: GitLab CE will be deployed with LDAP integration so we can save, commit and manage code for projects.
  • Project management: Somewhere to collaborate, store ideas and track issues. Taiga has been selected to cover this off. Looks awesome, OpenSource and written in the language of champions...

Reference Architecture

As I always find it easier when I can visualise what I'm aiming for, I've taken the liberty of wire-framing out what we're trying to build.

Enterprise (the blue pill)

In the blue corner we have the best-in-class architecture (circa Q4 2017) for Major Cloud Vendor X, which I'm in the process of implementing at work. I'd intended to include it for reference, with some specifics taken out, as there are a few capabilities there which we either need to find a FOSS equivalent for or just accept we won't have. For the moment, though, I've left it out for commercial sensitivity, as some features we're building haven't been released to the general public.

The main gaps between OpenSource and Enterprise are around the 'managed service' type products (primarily in the administration) and the security layers we've put in place so that the infrastructure can scale up massively with no additional overhead. Whilst these might very well be pluggable with a bit of effort, I'm not sure it's 100% necessary for the target audience here...

OpenSource (the red pill)

One of the critical pieces for the end user's sanity is the ability to query a single namespace for datasets, whether they live in SQL, NoSQL or Object Storage. Presto, the Distributed SQL Query Engine for Big Data, claims to solve this headache with a series of connectors to common datasources such as our PostgreSQL and MongoDB, plus the ability to query S3 data via Hive. As we'll be using Minio to replicate the Amazon S3 API for storage, this should work fine.
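As a rough sketch of what that single namespace looks like in practice: Presto addresses every table as catalog.schema.table, so one SQL statement can join across all three stores. The catalog, schema, table and column names below are hypothetical (catalog names depend on how each connector is configured), and the snippet only builds the query text rather than running it against a cluster.

```python
# Sketch of a federated Presto query across three backends.
# All catalog/schema/table/column names are hypothetical examples.


def table_ref(catalog: str, schema: str, table: str) -> str:
    """Presto's fully qualified table name: catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"


# One (assumed) catalog per backend in the reference architecture.
CUSTOMERS = table_ref("postgresql", "public", "customers")  # SQL
EVENTS = table_ref("mongodb", "analytics", "events")        # NoSQL
ORDERS = table_ref("hive", "lake", "orders")                # files in Minio/S3 via Hive


def build_federated_query() -> str:
    """Join the three sources in one statement, as an analyst would write it."""
    return (
        "SELECT c.customer_id,\n"
        "       count(DISTINCT o.order_id) AS orders,\n"
        "       count(DISTINCT e.event_id) AS events\n"
        f"FROM {CUSTOMERS} AS c\n"
        f"LEFT JOIN {ORDERS} AS o ON o.customer_id = c.customer_id\n"
        f"LEFT JOIN {EVENTS} AS e ON e.customer_id = c.customer_id\n"
        "GROUP BY c.customer_id"
    )


if __name__ == "__main__":
    print(build_federated_query())
```

The analyst only ever sees SQL and a three-part table name; which engine actually holds the rows is a deployment detail.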

OpenSource Data Lake reference architecture

What we're going to lose - for the moment - is some of the fine-grained control and masking over the datasets. You could achieve this with a combination of application security levels and mappings between users and groups, using Apache Ranger, Postgres schemas and MongoDB users for Row / Column level security, but this is outside of the scope for today.

James Veitch
