Version Control for ML as a necessity from the get-go
Skillup develops machine learning models to build and maintain a marketplace for professional trainings. It is a SaaS platform where a company’s HR people can manage the training cycle from beginning to end: requirements gathering, search for trainings, planning, booking and evaluation.
Many companies in France offer their employees an opportunity to educate themselves with different kinds of courses once or twice a year. That may not sound like much if you are booking trainings for yourself, but imagine being an HR person booking trainings for hundreds of employees.
Normally an HR person would browse through multiple already known training providers’ sites, or go to Google to find new ones, and decide which training suits whom. This time-consuming search and matching is something that the French company Skillup has tackled with machine learning.
Machine learning helps to keep the marketplace up to date
There are thousands of training organizations in France with websites listing their offerings. Manually keeping up with all the changes is next to impossible, which is why several machine learning models are being built to keep the marketplace up to date.
New trainings are added, old ones are removed, and the dates, prices, programs, durations and other details of already known trainings may change. The challenge is to keep the marketplace up to date at the lowest possible cost.
Skillup uses web scraping to fetch this data from training providers’ websites. The scraper simulates human actions: it clicks through a site, finds the training pages and extracts the title, program, dates and other necessary information for each training.
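To make the extraction step concrete, here is a minimal sketch of a hand-written rule for a single page layout, using only the Python standard library. It is not Skillup’s scraper: the HTML structure, the class names (`title`, `program`, `dates`) and the helper `extract_training` are all assumptions made for illustration. Rules like this have to be written and maintained per site, which is the cost the machine learning approach is meant to avoid.

```python
from html.parser import HTMLParser


class TrainingPageParser(HTMLParser):
    """Collects the text of elements whose class attribute names a known field.

    The field names below are hypothetical; a real site would use its own markup.
    """

    FIELDS = {"title", "program", "dates"}

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None  # field whose text we are waiting for

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class in self.FIELDS:
            self._current = css_class

    def handle_data(self, data):
        # Store the first non-empty text that follows a recognized tag.
        if self._current and data.strip():
            self.fields[self._current] = data.strip()
            self._current = None


def extract_training(html):
    """Return a dict of the fields found in one (hypothetical) training page."""
    parser = TrainingPageParser()
    parser.feed(html)
    return parser.fields


# Made-up sample page, for illustration only.
SAMPLE_PAGE = """
<html><body>
<h1 class="title">Project Management Basics</h1>
<div class="program">Planning, budgeting, risk</div>
<span class="dates">2024-03-12</span>
</body></html>
"""

if __name__ == "__main__":
    print(extract_training(SAMPLE_PAGE))
```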
If we would have to integrate using traditional scraping techniques, a marketplace of 1 000 sites would require 10 people in software development. Instead, we have a three person technical team, several machine learning models and Valohai orchestrating the infrastructure side.
Ari Rouvinen – Head of Data, Skillup
A classification model recognizes different parts of the training page
Rouvinen has trained a classification model on Valohai to recognize whether a scraped page is a training page or not. When Rouvinen started at Skillup, the first objective was to collect data from 50 sites to use for model training. Today they have a model that first decides whether a scraped page is a training page, and they are working on a second model to categorize the different parts of it.
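The article does not say which algorithm the page classifier uses. As a rough illustration of the task itself (training page or not), here is a minimal sketch of a bag-of-words naive Bayes classifier written with the standard library only. The algorithm choice, the labels `training` and `other`, and the four example texts are assumptions, not Skillup’s model or data.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer; deliberately simplistic."""
    return text.lower().split()


class NaiveBayesPageClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.doc_counts = Counter(labels)
        self.word_counts = {label: Counter() for label in self.doc_counts}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            total_words = sum(counts.values())
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            for word in tokenize(text):
                score += math.log(
                    (counts[word] + 1) / (total_words + len(self.vocab))
                )
            scores[label] = score
        return max(scores, key=scores.get)


# Made-up page texts, for illustration only.
TEXTS = [
    "training program duration 2 days objectives price",
    "course program dates duration registration",
    "contact us address phone",
    "about our company team history",
]
LABELS = ["training", "training", "other", "other"]

if __name__ == "__main__":
    model = NaiveBayesPageClassifier().fit(TEXTS, LABELS)
    print(model.predict("program and duration of the course"))
    print(model.predict("contact phone and address"))
```

A production model would need far more data and a stronger text representation; the point here is only the shape of the problem, where labeled page texts go in and a page-type decision comes out.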
The model is taught to understand the semantics of the text and to estimate which part is a duration, title, program, objective or some other field. After some cleanup, the data can be added to the marketplace.
On the marketplace, there are two thousand different categories for trainings. Another model trained with Valohai classifies the trainings automatically.
Version Control in ML and sharing of results
Skillup’s data team consists of the CTO, one person working on website scraping and Rouvinen himself working on data science. In a small team, the help from the Valohai platform is essential. Without the proper tools they would spend a lot of time maintaining servers, managing machine learning infrastructure and keeping records of executions, all tasks that can now be automated.
Before using Valohai, Rouvinen kept track of executions with pen and paper, which might work for a handful of trainings but is not nearly enough when building production-level machine learning models.
I’m the only person working with machine learning currently. But working alone is not an excuse to work badly. When new people come in, I want to have everything stored and to be able to onboard them easily.
Ari Rouvinen – Head of Data, Skillup
After a training finishes in Valohai, Rouvinen has, for example, a metrics table showing the performance for each category. He shares the end results even with the company’s business team, and the discussion has proven to be fruitful.
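A per-category metrics table of this kind can be computed from the true and predicted labels of an evaluation set. The sketch below is an assumption about what such a table might contain (precision, recall and support per category); the function name `per_category_metrics` and the category names are invented for illustration and are not taken from Skillup or Valohai.

```python
from collections import Counter


def per_category_metrics(y_true, y_pred):
    """Return {category: {"precision", "recall", "support"}} for a label set."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted category gets a false positive
            fn[truth] += 1  # true category gets a false negative

    table = {}
    for category in sorted(set(y_true) | set(y_pred)):
        predicted = tp[category] + fp[category]
        actual = tp[category] + fn[category]
        table[category] = {
            "precision": tp[category] / predicted if predicted else 0.0,
            "recall": tp[category] / actual if actual else 0.0,
            "support": actual,
        }
    return table


# Made-up evaluation labels, for illustration only.
Y_TRUE = ["management", "management", "it", "it", "languages"]
Y_PRED = ["management", "it", "it", "it", "languages"]

if __name__ == "__main__":
    for category, row in per_category_metrics(Y_TRUE, Y_PRED).items():
        print(f"{category:<12} precision={row['precision']:.2f} "
              f"recall={row['recall']:.2f} support={row['support']}")
```

Breaking the numbers down by category rather than reporting one overall accuracy makes it easier to see which training categories the model handles poorly, which is useful when discussing results with a non-technical team.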
Thanks to Valohai, there is a clear process. My features, predictions, metrics and models are always stored automatically, and the platform helps me to communicate the results and iterate.
Ari Rouvinen – Head of Data, Skillup
Airflow for sequencing
The Valohai platform is fully API-based, which has allowed Rouvinen to create a Valohai operator for Apache Airflow. This means that he can smoothly integrate Valohai trainings into the rest of his data ecosystem. Airflow is an open source platform to author, schedule and manage tasks, and it allows tasks to be orchestrated in a complex network of job dependencies.
The Valohai operator has been open sourced by Skillup and is available on GitHub.
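The idea of a network of job dependencies can be sketched without Airflow itself. The example below uses the standard library’s `graphlib.TopologicalSorter` (Python 3.9+) to order tasks so that each one runs only after the tasks it depends on. It is not Airflow’s API and not the Skillup operator; the task names and the shape of the pipeline are hypothetical.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
# Task names are hypothetical and only loosely follow the steps in this article.
PIPELINE = {
    "scrape_sites": set(),
    "classify_pages": {"scrape_sites"},
    "extract_fields": {"classify_pages"},
    "train_model": {"extract_fields"},
    "publish_to_marketplace": {"extract_fields"},
}


def execution_order(dependencies):
    """Return one valid order in which the tasks could be run."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for step, task in enumerate(execution_order(PIPELINE), start=1):
        print(step, task)
```

A scheduler such as Airflow adds much more on top of this ordering, including scheduling, retries and monitoring, but the dependency graph is the core concept.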
Thousands of models!
Skillup has an ambitious goal: to be the biggest marketplace for professional trainings in France. To achieve this, they are planning to hire one data engineer and one data scientist. Growth also means that the number of scraped sites needs to increase from hundreds to thousands, and the machine learning models will be trained on and applied to larger amounts of data.
Luckily, Valohai helps us to scale fast. A couple of clicks, and I have more than enough cloud instances up and working with my model.
Ari Rouvinen – Head of Data, Skillup