
Skillup had machine learning version control from the beginning

Case

Keeping the marketplace up to date with machine learning models

Country

France

Team size

3 people

Vendor

Skillup develops machine learning models to build and maintain a marketplace for professional trainings. It is a SaaS platform where a company’s HR people can manage the whole training cycle from beginning to end: gathering requirements, searching for trainings, planning, booking and evaluation.

The Challenge

Many companies in France offer their employees an opportunity to educate themselves with different kinds of courses once or twice a year. That may not sound like much if you are booking trainings for yourself, but imagine being an HR person booking trainings for hundreds of employees.

Normally an HR person would browse through the sites of several already known training providers, or go to Google to find new ones, and decide which training suits whom. This time-consuming search and matching is something that the French company Skillup has tackled with machine learning.

Machine learning helps to keep the marketplace up to date

There are thousands of training organizations in France with websites listing their offerings. Manually keeping up with all the changes is next to impossible, which is why Skillup is building several machine learning models to keep the marketplace up to date.

New trainings are added, old ones are removed, and the dates, prices, programs, durations and other details of already known trainings may change. The challenge is to keep the marketplace up to date at the lowest possible cost.

Skillup uses web scraping to fetch this data from online training websites. The scraper simulates human actions on a website: it clicks through the site, finds the training pages and extracts the title, program, dates and other necessary information for each training.
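The extraction step can be illustrated with a small sketch. It uses only Python's standard `html.parser` module and covers just the last part of the process (pulling fields out of a page that has already been fetched), not the simulated clicking. The class names in `FIELD_CLASSES`, the sample page and the helper `extract_training_fields` are hypothetical and are not taken from Skillup's scraper.

```python
from html.parser import HTMLParser

# Hypothetical mapping from CSS class names to marketplace fields.
FIELD_CLASSES = {"program": "program", "dates": "dates", "duration": "duration"}


class TrainingPageExtractor(HTMLParser):
    """Collects the <h1> text as the title plus any element whose class
    matches FIELD_CLASSES. Nested elements with the same tag name are not
    handled; this is a sketch, not a robust scraper."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None  # field currently being captured
        self._tag = None      # tag that opened the current field

    def handle_starttag(self, tag, attrs):
        if self._current is not None:
            return
        if tag == "h1":
            self._current, self._tag = "title", tag
            return
        classes = (dict(attrs).get("class") or "").split()
        for css_class in classes:
            if css_class in FIELD_CLASSES:
                self._current, self._tag = FIELD_CLASSES[css_class], tag
                break

    def handle_data(self, data):
        if self._current is not None and data.strip():
            previous = self.fields.get(self._current, "")
            self.fields[self._current] = (previous + " " + data.strip()).strip()

    def handle_endtag(self, tag):
        if self._current is not None and tag == self._tag:
            self._current = None
            self._tag = None


def extract_training_fields(html):
    """Return a dict of field name -> extracted text for one page."""
    parser = TrainingPageExtractor()
    parser.feed(html)
    return parser.fields


# Made-up sample page, for illustration only.
SAMPLE_PAGE = """
<html><body>
  <h1>Advanced Excel</h1>
  <div class="duration">2 days</div>
  <div class="dates">12 March</div>
  <p class="program">Pivot tables. Macros.</p>
</body></html>
"""

fields = extract_training_fields(SAMPLE_PAGE)
```

Hand-written rules like these break whenever a site changes its markup, which is the maintenance cost the machine learning models are meant to reduce.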

If we would have to integrate using traditional scraping techniques, a marketplace of 1 000 sites would require 10 people in software development. Instead, we have a three person technical team, several machine learning models and Valohai orchestrating the infrastructure side.
– Ari Rouvinen, Head of Data, Skillup

A classification model recognizes different parts of the training page

Rouvinen has trained a classification model on Valohai to recognize whether a scraped page is a training page. When he started at Skillup, the first objective was to collect data from 50 sites to use for model training. Today the model first decides whether a scraped page is a training page, and the team is working on a second model to categorize the different parts of the page.
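To make the idea of a page classifier concrete, here is a minimal sketch of a binary text classifier. The article does not say which algorithm Skillup uses; multinomial naive Bayes is swapped in here only because it fits in a few lines of standard-library Python. The class name, the labels and the tiny training set are all made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


class NaiveBayesPageClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.word_counts = {}              # label -> Counter of token counts
        self.doc_counts = Counter(labels)  # label -> number of documents
        self.vocabulary = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts.setdefault(label, Counter()).update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.doc_counts.items():
            counts = self.word_counts[label]
            total_tokens = sum(counts.values())
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / (total_tokens + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training data: page text and whether it is a training page.
TRAIN_TEXTS = [
    "Excel advanced training course duration two days program objectives price",
    "Project management training program objectives dates duration booking",
    "Leadership course for managers program duration price next session dates",
    "About us our team history contact address legal notice",
    "Contact form phone address opening hours privacy policy cookies",
    "Blog news company announcement press release careers",
]
TRAIN_LABELS = ["training", "training", "training", "other", "other", "other"]

classifier = NaiveBayesPageClassifier().fit(TRAIN_TEXTS, TRAIN_LABELS)
```

Calling `classifier.predict(page_text)` on the text of a newly scraped page returns either `"training"` or `"other"`. A production model would be trained on far more data, such as the pages collected from the first 50 sites.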

The model is taught to understand the semantics of the text and to estimate which part is a duration, title, program, objective or some other field. After some cleanup, the data can be added to the marketplace.

On the marketplace, there are two thousand different categories for trainings. Another model trained with Valohai classifies the trainings automatically.

Version control in machine learning and sharing the results

Skillup’s data team consists of the CTO, one person working on website scraping, and Rouvinen himself working on data science. In such a small team, the help from the Valohai platform is essential. Without proper tools they would spend a lot of time maintaining servers, managing machine learning infrastructure and keeping a record of executions, all tasks that can now be automated.

Screenshot from Valohai that shows hyperparameter tuning and part of version control

Before using Valohai, Rouvinen kept track of executions with pen and paper, which might work for a handful of trainings but is not nearly enough when building production-level machine learning models.

I’m the only person working with machine learning currently. But working alone is not an excuse to work badly. When new people come in, I want to have everything stored and to be able to onboard them easily.
– Ari Rouvinen, Head of Data, Skillup

After a training finishes in Valohai, Rouvinen has, for example, a metrics table showing the performance for each category. He shares the end results even with the company’s business team, and the discussion has proven to be fruitful.
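A per-category metrics table of this kind can be sketched in a few lines. The article does not say which metrics Rouvinen tracks; precision, recall and support are assumed here as common choices for a classifier. The function name and the example labels are hypothetical.

```python
from collections import Counter


def per_category_metrics(true_labels, predicted_labels):
    """Return {category: {"precision", "recall", "support"}} for a classifier."""
    true_positive = Counter()
    false_positive = Counter()
    false_negative = Counter()
    for truth, prediction in zip(true_labels, predicted_labels):
        if truth == prediction:
            true_positive[truth] += 1
        else:
            false_positive[prediction] += 1
            false_negative[truth] += 1

    table = {}
    for category in sorted(set(true_labels) | set(predicted_labels)):
        tp = true_positive[category]
        predicted = tp + false_positive[category]  # times the model chose it
        actual = tp + false_negative[category]     # times it was correct answer
        table[category] = {
            "precision": tp / predicted if predicted else 0.0,
            "recall": tp / actual if actual else 0.0,
            "support": actual,
        }
    return table


# Made-up evaluation data for three marketplace categories.
TRUE = ["excel", "excel", "management", "language", "language", "language"]
PRED = ["excel", "management", "management", "language", "language", "excel"]

metrics = per_category_metrics(TRUE, PRED)
for category, row in metrics.items():
    print(f"{category:<12} precision={row['precision']:.2f} "
          f"recall={row['recall']:.2f} support={row['support']}")
```

A table like this makes weak categories visible at a glance, which is what makes it useful in a discussion with a non-technical business team.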

Thanks to Valohai, there is a clear process. My features, predictions, metrics and models are always stored automatically and the platform helps me to communicate the results and iterate.
– Ari Rouvinen, Head of Data, Skillup

Airflow for sequencing

The Valohai platform is fully based on an API, which has allowed Rouvinen to create a Valohai operator for Apache Airflow. This means that he can smoothly integrate Valohai trainings into his data ecosystem. Airflow is an open source platform for authoring, scheduling and managing tasks, and it allows tasks to be orchestrated in a complex network of job dependencies.
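The core idea behind orchestrating a network of job dependencies is to run each task only after everything it depends on has finished. The sketch below shows that idea with Python's standard `graphlib` module (Python 3.9 or later). It is not Airflow's API and not the Valohai operator described above; the task names and the `run_pipeline` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "scrape_sites": set(),
    "detect_training_pages": {"scrape_sites"},
    "extract_fields": {"detect_training_pages"},
    "train_category_model": {"extract_fields"},
    "classify_trainings": {"extract_fields", "train_category_model"},
    "update_marketplace": {"classify_trainings"},
}


def run_pipeline(pipeline, run_task):
    """Call run_task on every task, dependencies first, and return the order."""
    order = list(TopologicalSorter(pipeline).static_order())
    for task in order:
        run_task(task)
    return order


executed = []
order = run_pipeline(PIPELINE, executed.append)
```

In a real scheduler, the callable passed as `run_task` is where a step such as a model training would be launched on an external platform and awaited before downstream tasks start.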

Valohai operator in Apache Airflow

Together with Valohai’s development team, Rouvinen is working on open sourcing the Valohai operator and contributing it to the Airflow project on GitHub.

Future brings scale

Skillup has an ambitious goal: to be the biggest marketplace for professional trainings in France. To achieve this, they are planning to hire one data engineer and one data scientist. Growth also means that the number of scraped sites needs to increase from hundreds to thousands, and that the machine learning models will be trained on and applied to larger amounts of data.

Luckily, Valohai helps us to scale fast. A couple of clicks and I have more than enough cloud instances up and working with my model.
– Ari Rouvinen, Head of Data, Skillup

About Valohai

Valohai automates deep learning infrastructure and allows data scientists to concentrate on model development. Scale models to hundreds of CPUs, GPUs and TPUs at the click of a button. Reproduce models and increase team transparency with automatic version control of all input data, hyperparameters, training algorithms and environments.
