Dotscience consists of a hub and one or more runners. The hub is a central repository for projects (containing runs), datasets (or pointers to S3 datasets), and models (in the model library). The runners are where the actual work (data engineering, model training, etc.) happens. Runs that generate labelled models automatically show up in the model library.
Runners are connected to the hub by starting a Docker container (dotscience-runner), which opens a gRPC connection to the hub and awaits instructions. The user then sends instructions to start Tasks (interactive Jupyter sessions, or CLI tasks via ds run) on a runner. When the runner receives such an instruction, it starts a container called dotscience-agent, which synchronizes datasets and workspaces (mounted as the home directory from the task's perspective) onto the runner. Workspaces and datasets are ZFS filesystems managed by Dotmesh. Once the data is synchronized, the agent spawns another container with the user's specified Docker image (in the case of ds run) or the bundled jupyterlab-tensorflow container (in the case of Jupyter).
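The task-start sequence above can be sketched in a few lines. This is an illustrative toy model, not the actual Dotscience internals (the real components speak gRPC and drive Docker); all names here are hypothetical.

```python
# Illustrative sketch of what the dotscience-agent does when a task
# starts: sync filesystems first, then spawn the task container.

def handle_start_task(task, actions):
    """Simulate the agent's task-start sequence, recording each step."""
    # 1. Synchronize the workspace and any datasets onto the runner.
    for fs in [task["workspace"]] + task["datasets"]:
        actions.append(f"sync {fs}")
    # 2. Spawn the task container: the user's image for ds run,
    #    or the bundled Jupyter image otherwise.
    image = task.get("image", "jupyterlab-tensorflow")
    actions.append(f"run {image}")

actions = []
handle_start_task(
    {"workspace": "my-project", "datasets": ["training-data"], "image": "python:3.8"},
    actions,
)
print(actions)  # ['sync my-project', 'sync training-data', 'run python:3.8']
```

The important ordering constraint is that data synchronization always completes before the task container starts, so the workload sees a consistent home directory.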
Run tracking is like Git for executions of code
Central to Dotscience is the notion of runs. As a run tracker, you can think of Dotscience as a bit like "Git for executions (runs) of your code". Instead of just tracking source files on disk, Dotscience tracks the interaction between code, input data, and output data/models, along with the logs and metrics (such as model accuracy) that are side effects of these runs. This is necessary for machine learning because it has many more moving parts than regular software engineering (data, parameters, models, and metrics as well as code), and all of them need to be tracked to make ML teams productive and to ensure that models have proper reproducibility and provenance. This matters because it lets you trace back from a version of a model to how it was trained, the data it was trained on, and, transitively, where that data came from. That relieves the burden of manual tracking, helps with forensic debugging of model issues, and makes it much easier to comply with regulations about tracking your work.
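Conceptually, a run ties a version of the code to its inputs, outputs, parameters, and metrics. A minimal sketch of what such a run record captures (the field names are illustrative, not Dotscience's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class Run:
    """Illustrative record of what a run tracker stores per execution."""
    code_version: str       # which version of the code executed
    input_datasets: list    # data (and versions) the run read
    output_files: list      # data/models the run produced
    parameters: dict = field(default_factory=dict)  # e.g. hyperparameters
    metrics: dict = field(default_factory=dict)     # e.g. model accuracy
    logs: str = ""

run = Run(
    code_version="a1b2c3d",
    input_datasets=["training-data@v7"],
    output_files=["model.h5"],
    parameters={"learning_rate": 0.01},
    metrics={"accuracy": 0.94},
)
# The provenance question "how was model.h5 trained?" is answered by
# walking back from output_files to code_version and input_datasets.
print(run.metrics["accuracy"])  # 0.94
```

Plain version control tracks only the first field; run tracking keeps all of them together so provenance queries never depend on someone's memory or a spreadsheet.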
See also our blog post about how Run tracking liberates ML teams.
How runs are detected
A component called the committer runs within the dotscience-agent process and watches for new runs (with ds run, the run metadata is written to stdout by the workload; with Jupyter, it is written into the notebook itself, and saving the notebook to disk acts as the trigger). When a new run occurs, the committer automatically creates a lightweight filesystem snapshot and synchronizes it to the hub. This snapshot contains all the code, data, and metadata such as run parameters and metrics.
In the case of Jupyter tasks, the committer continuously watches for runs by detecting changes to notebook files that contain the Dotscience metadata JSON written by ds.publish in the Dotscience Python library. In the case of ds run tasks, the run metadata is written to the process's stdout by the Dotscience Python library and picked up at the end of the run by the dotscience-agent.
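The stdout path can be sketched as follows. The `[[DOTSCIENCE-RUN:...]]` delimiters used here are an assumption for illustration; treat the exact wire format as internal to Dotscience.

```python
import json
import re

# Sketch of how a committer might extract run metadata from a
# workload's captured stdout: find delimited JSON blocks and parse
# each one as a run.
RUN_RE = re.compile(
    r"\[\[DOTSCIENCE-RUN:(?P<id>[\w-]+)\]\](?P<body>.*?)\[\[/DOTSCIENCE-RUN:(?P=id)\]\]",
    re.DOTALL,
)

def extract_runs(stdout: str):
    """Return the metadata dict of every run found in the output."""
    return [json.loads(m.group("body")) for m in RUN_RE.finditer(stdout)]

captured = """training...
[[DOTSCIENCE-RUN:42]]{"parameters": {"lr": 0.01}, "metrics": {"accuracy": 0.94}}[[/DOTSCIENCE-RUN:42]]
done"""

for run in extract_runs(captured):
    print(run["metrics"])  # each detected run triggers a snapshot + sync
```

For Jupyter the mechanism is the same in spirit, except the metadata is found in the saved notebook file rather than in a stdout stream.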
Efficient lightweight filesystem snapshots
Because Dotscience uses Dotmesh, which uses ZFS, it can synchronize changes to workspaces and datasets (both of which can contain large data files) between the hub and the runners very efficiently. Only the blocks that have changed on disk from one run to the next need to be synchronized to the hub, and because ZFS knows which blocks have changed, there is no need to scan or hash large files. ZFS supports multi-petabyte datasets and billions of files.
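The payoff of block-level snapshots can be seen in a toy model. Real ZFS tracks changed blocks as writes happen (so no comparison is ever needed); this sketch fakes that bookkeeping with dicts of block number to bytes, purely to show what an incremental send ships.

```python
# Toy model of an incremental snapshot send: only blocks that differ
# between two snapshots cross the wire, regardless of file sizes.

def incremental_send(prev_snapshot: dict, new_snapshot: dict) -> dict:
    """Return only the blocks that changed since the previous snapshot."""
    return {
        block: data
        for block, data in new_snapshot.items()
        if prev_snapshot.get(block) != data
    }

snap1 = {0: b"header", 1: b"weights-v1", 2: b"footer"}
snap2 = {0: b"header", 1: b"weights-v2", 2: b"footer"}  # one block rewritten

delta = incremental_send(snap1, snap2)
print(delta)  # {1: b'weights-v2'} -- only the changed block is sent
```

A multi-gigabyte model file that changed in one place costs one block of transfer, not a re-upload of the whole file.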
Bring your own compute
With Dotscience you can attach your own machines as runners to the platform, simply by running the Docker commands given in the runners UI or by using the ds runner subcommand in the CLI. The only requirements for a runner are Docker and an internet connection. A public IP address is not even necessary, and yet you can access the Jupyter container running on a runner from anywhere, just by logging in to the Dotscience Hub. How does this work?
The trick is that we start a tunnel container on the runner, which makes an outbound connection to the Dotscience tunnel service and securely exposes the Jupyter container as a subdomain of app.cloud.dotscience.com. When a connection is made from the user's browser to the tunnel URL, it gets proxied through the tunnel service to the connected runner and back to the Jupyter container, even if the runner itself is behind NAT or a firewall that only allows outbound connections. This gives you the flexibility to attach any available compute resource to the cluster, while still letting users log in from anywhere and managing the work in a central location (the hub).
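The routing idea can be modelled in a few lines. This is a simplified in-memory sketch (real tunnels multiplex long-lived encrypted connections); the class and method names are illustrative.

```python
# Toy model of reverse tunnelling: runners dial *out* to the tunnel
# service and register, so the service can route browser requests for
# a subdomain down an already-open connection. No inbound access to
# the runner is ever required.

class TunnelService:
    def __init__(self):
        self.routes = {}  # subdomain -> live outbound connection

    def register(self, subdomain, runner_connection):
        # Called when a runner's tunnel container connects outbound.
        self.routes[subdomain] = runner_connection

    def proxy(self, subdomain, request):
        # Relay a browser request down the runner's open connection
        # and return whatever the Jupyter container answered.
        return self.routes[subdomain](request)

service = TunnelService()
service.register("task-1234.app.cloud.dotscience.com",
                 lambda req: f"jupyter response to {req}")
print(service.proxy("task-1234.app.cloud.dotscience.com", "GET /"))
# jupyter response to GET /
```

Because the runner initiates the connection, NAT and outbound-only firewalls are no obstacle: from the network's point of view, the runner is just an ordinary outbound client.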
Runners enable "hybrid cloud" architectures
Dotscience uses Dotmesh for the workspace and dataset filesystems (which can be mirrors of S3 buckets), and Dotmesh uses ZFS. Because ZFS supports zfs send and zfs receive to stream snapshots between any two nodes regardless of the underlying infrastructure, data can be synchronized from any Linux machine to any other Linux machine, even if they are running in different environments or on different cloud providers.
This makes it very easy to construct "hybrid" architectures where, for example, the hub runs on one cloud provider and one or more of the runners run on a different cloud (say, because you want access to a data service offered by that cloud, such as BigQuery on GCP from a GCP runner). Or you can run the hub in the cloud but attach a powerful GPU machine in your office or data center, such as an NVIDIA DGX server, which makes an awesome Dotscience runner.
Integration with CI systems to trigger model training
It is possible to integrate Dotscience with your CI system so that models are automatically trained, and their metrics and provenance published to Dotscience, on every push of your code to version control. Simply configure your CI job to run, for example, ds run -d --repo git@github.com:org/repo --ref $CI_COMMIT_SHA python train.py. This way, the model training (which might take some time) happens asynchronously in Dotscience, freeing up your CI runners to carry on running tests. Every model training is fully tracked and lands in the Dotscience model library, from where it can easily be deployed (via API or clicky buttons) and then statistically monitored.
Read more about this use case in our ds run CI integration tutorial.
Deploying to production
Beta feature: contact us to enable on your account
Dotscience supports deploying models to production by integrating with a CI system. When a deployment is triggered in Dotscience, it starts a CI job which pulls the model files from a Dotscience S3-compatible endpoint, builds an optimized container image for that model, and pushes it to a Docker registry, from where a continuous delivery tool can deploy it to, for example, a Kubernetes cluster.
Monitoring in production
Beta feature: contact us to enable on your account
Dotscience enables statistical monitoring with a component called the Dotscience model proxy. This service intercepts requests to, and responses from, TensorFlow Serving (or similar model servers). Using the API, users can set which parameters they want to capture statistics for. This integrates with Prometheus to, for example, let you monitor the distribution of predictions in a categorical model (one which predicts what category of thing a given input is, such as recognizing road signs in images). You can then use Prometheus and Grafana to build dashboards of your models' statistics in production, in addition to the usual RED metrics (request rate, errors, duration) that you would want to monitor for any microservice.
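The interception pattern can be sketched as follows. This is a toy model, not the actual Dotscience model proxy: the model server is a stub, and a plain Counter stands in for a real Prometheus metric.

```python
from collections import Counter

# Sketch of a statistical monitoring proxy: forward each request to
# the model server unchanged, but count the predicted categories so
# their distribution can be exported to a metrics system.

class ModelProxy:
    def __init__(self, model_server):
        self.model_server = model_server
        self.prediction_counts = Counter()  # would back a Prometheus metric

    def predict(self, request):
        response = self.model_server(request)  # forward unchanged
        self.prediction_counts[response["category"]] += 1  # capture stat
        return response

def fake_road_sign_model(request):
    # Stand-in for TensorFlow Serving: "classifies" by filename.
    return {"category": "stop" if "stop" in request else "yield"}

proxy = ModelProxy(fake_road_sign_model)
for img in ["stop_01.png", "stop_02.png", "yield_01.png"]:
    proxy.predict(img)

print(proxy.prediction_counts)  # Counter({'stop': 2, 'yield': 1})
```

Because the proxy only observes traffic, the model server itself needs no changes, and a sudden shift in the prediction distribution (say, a road-sign model that suddenly predicts "stop" 95% of the time) shows up on a dashboard immediately.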