Machine Learning and Remote Work

Eero Laaksonen / March 13, 2020

A lot of companies and teams are going fully remote for the first time due to the coronavirus. We at Valohai are big believers in remote work. Having worked as a distributed team for a good four years, we would like to share some of our thoughts on remote work in machine learning. Many of the major pain points we have seen revolve around tooling.

There are a host of things to consider, but we are going to focus on the specifics of machine learning and data science. Here are the two main considerations:

  • Access to data and compute
  • Collaboration and sharing your work

Access to data and compute

There are so-called natural and artificial/legal reasons that limit access to data and compute.

Natural

Many datasets are too big to fit on an average laptop or workstation, and the computation on those datasets is heavy enough that an average single-GPU laptop cannot support an effective workflow.

Naturally, companies are tackling this with on-prem clusters and on-prem compute workstations, and by moving larger compute workloads to the cloud. Remote access to the cloud is the easiest to handle and is one of the good reasons to justify the higher price that cloud services charge.

Companies with on-prem clusters depend heavily on the capabilities of their DevOps teams and on how well those teams have organized remote access. These environments are often less than robust, and in the current situation an outage can completely immobilize your ML progress. So I would suggest making sure your DevOps teams prioritize hardening remote access solutions. Everyone in the company going remote at once can cause surprising increases in load.

The toughest situation is for teams that have individual compute workstations. These often don't have robust remote access solutions, and it's quite common to lose connection to these types of workstations. By the looks of it, most companies are going to be locked down for the better part of a month. Now is a good time to assess whether cloud could be a solution for you, or whether your company has compute clusters that DevOps is actively maintaining.

In an optimal case, you would have a centralized machine learning platform that takes care of access to compute and automatically manages your data, so that data scientists do not need to spend time on it. Now is also a good time to look into solutions like Valohai that can be quick to implement and help you get a productivity jump from remote work instead of losing productivity. (A shameless plug, I know ;)

Artificial/Legal limitations

A lot of the data that companies have on-prem or in hardened cloud systems is not allowed to be, or should not be, downloaded to individual workstations or laptops. This can make working with real data extremely frustrating for data scientists and machine learning engineers.

Easy access to cloud or on-prem compute, and a streamlined way to execute code on those systems, significantly reduces the risk of a data breach. Everyone can imagine the risks involved in 100 laptops containing private data meant for R&D purposes floating around during 30 days of fully remote work.

Again, companies that already use an ML platform to streamline their work on managed clusters, on-premise and in the cloud, will see a relatively small drop in productivity when going remote.

Collaboration

No one works alone. The biggest reason why some software companies can be completely distributed is, again, tooling. Version control like Git, online meeting tools, Slack, etc. make it possible for people to be extremely productive while not sharing the same space.

However, it did not start like this. Software development is becoming a mature industry, with streamlined workflows that we have collectively been practicing for decades now. Version control and workflows like pull requests, in particular, have made it possible for distributed teams to work effectively.

This is not necessarily true for data science and machine learning. In many ways, machine learning is more complicated than software development. Instead of just code, you are also tracking data, parameters, results, etc.
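To make "tracking more than code" concrete, here is a minimal sketch of what a single experiment record could contain. It is not how Valohai or any particular platform stores runs; the function names, the `experiment.json` file name, and the record fields are all assumptions made up for illustration. The idea is only that the code version, a fingerprint of the data, the parameters, and the results are saved together, so a colleague can see exactly what produced a given number.

```python
import hashlib
import json
import time
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_experiment(run_dir, code_version, data_path, params, metrics):
    """Write one JSON record describing a single experiment run.

    Hypothetical helper: the layout of the record is an assumption,
    not the format of any real ML platform.
    """
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    record = {
        "code_version": code_version,  # e.g. a Git commit hash
        "data_sha256": file_sha256(Path(data_path)),  # fingerprint of the dataset
        "params": params,  # hyperparameters used for this run
        "metrics": metrics,  # results produced by this run
        "recorded_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
    out_path = run_dir / "experiment.json"
    out_path.write_text(json.dumps(record, indent=2, sort_keys=True))
    return out_path
```

A run would call something like `record_experiment("runs/001", commit_hash, "train.csv", {"lr": 0.01}, {"accuracy": 0.9})`, where `commit_hash` could come from `git rev-parse HEAD`. Storing a hash instead of the data itself keeps the record small and avoids copying restricted data onto laptops.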

In a remote environment, it's no longer possible to just walk over to your colleague and show them how your notebook should be used. You need a better way to collaborate, preferably something that offers reproducibility and version control for your experiments, for both Jupyter notebook experimentation and production code.

In a remote environment, there is a need to over-communicate everything, and nothing achieves that better than complete reproducibility and version control for experiments. Not to mention that risks are greatly reduced when anyone can pick up where someone else left off and hit the ground running if that person is out for a week or two.

If you are looking for a platform that can do this, Valohai can be a good solution to quickly get you ready for remote work and get that productivity jump people talk about. Valohai installation and onboarding can take as little as one day in your environment, so act now if you want to give productive remote work a try! If you want to hear more, book a demo.
