Identify relevant text from complex documents

Joanna Purosto / January 13, 2020

Selko.io builds solutions for multi-disciplinary project teams working in large companies. These teams work from project documents that are usually several hundred pages long. Finding the sections relevant to each team member is a real burden in a project-based working environment.

The normal workflow for project teams is to go through the project documents manually at the beginning of each project and allocate parts of the text to the respective teams. With Selko’s solution, a machine learning model instead classifies sections of the text automatically based on pre-defined categories. For example, certain parts of the text are highlighted and marked as relevant to software engineers.

Multilabel text classification with domain knowledge

Selko works with several pre-trained models, for example open-source projects from Fast.ai and Hugging Face, and uses transfer learning to customize them for its customers’ use cases.

To build a customer-specific model, we chop off the prediction part from a pre-trained language model and replace it with a feedforward network for classifying the texts. The classification layer includes the labels that our customer needs, and then we retrain the classifier.
-Aditya Jitta, Senior Data Scientist
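To make the idea in the quote concrete, here is a minimal, framework-free sketch of a multilabel classification head trained on top of a frozen encoder. It is not Selko's implementation: the `encode` function is a toy stand-in for a pre-trained language model (it only counts a few fixed vocabulary words), the head is a single linear layer rather than a deeper feedforward network, and the labels and training sentences are invented for illustration. In practice the same pattern would be built with a library such as fastai or Hugging Face Transformers.

```python
import math
import re

# Hypothetical customer-defined labels and a toy vocabulary (assumptions).
LABELS = ["software", "electrical"]
VOCAB = ["firmware", "api", "voltage", "cable"]


def encode(text):
    """Stand-in for the frozen pre-trained encoder: counts of fixed vocabulary words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [float(words.count(term)) for term in VOCAB]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class MultilabelHead:
    """One independent sigmoid output per label, so a text can carry several labels."""

    def __init__(self, n_features, labels):
        self.labels = labels
        self.weights = [[0.0] * n_features for _ in labels]
        self.biases = [0.0] * len(labels)

    def probabilities(self, features):
        return [
            sigmoid(sum(w * x for w, x in zip(ws, features)) + b)
            for ws, b in zip(self.weights, self.biases)
        ]

    def fit(self, examples, epochs=300, lr=0.5):
        # Only the head is updated; the encoder stays frozen.
        for _ in range(epochs):
            for text, tags in examples:
                x = encode(text)
                probs = self.probabilities(x)
                for k, label in enumerate(self.labels):
                    # Gradient of binary cross-entropy w.r.t. the logit.
                    error = probs[k] - (1.0 if label in tags else 0.0)
                    for j, xj in enumerate(x):
                        self.weights[k][j] -= lr * error * xj
                    self.biases[k] -= lr * error

    def predict(self, text, threshold=0.5):
        probs = self.probabilities(encode(text))
        return {label for label, p in zip(self.labels, probs) if p >= threshold}


# Invented training sentences; a section may belong to more than one team.
TRAINING = [
    ("The firmware exposes an API", {"software"}),
    ("Route the cable for 24 V voltage", {"electrical"}),
    ("Firmware must monitor the supply voltage", {"software", "electrical"}),
    ("The API is versioned", {"software"}),
    ("Cable shielding is required", {"electrical"}),
]

head = MultilabelHead(len(VOCAB), LABELS)
head.fit(TRAINING)
print(head.predict("Update the firmware API"))
```

Using one sigmoid per label instead of a single softmax is what allows a section to be marked as relevant to several teams at once.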

The requirements and the labels for the classification layer come from Selko’s customers. The number of labels varies from just a couple to tens of labels, and some customers even need hierarchical label structures. Customers choose the labels they need in the user interface, and based on that choice Selko knows which model to use for the specific case.

Selko UI ❤️ Valohai API

After choosing the desired labels in Selko’s tool, the user uploads text files for training the model, containing approximately 200 sentences for each desired category. Under the hood, the user interface calls the Valohai API to start the training with the user-defined labels and data set, using a suitable model that is chosen automatically based on the need. Valohai then runs the training on Selko’s AWS cloud instances with automatic versioning. This way, the model is ready for the inference stage without any intervention from a DevOps specialist.
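As an illustration of the kind of call a user interface could make, the sketch below builds (but does not send) an HTTP request that asks Valohai to start a training execution. The endpoint path, the `Token` authorization scheme, and the payload field names reflect one reading of Valohai's public REST API and should be checked against the current API documentation. The project ID, commit, step name, parameter name, and data URL are all hypothetical and do not describe Selko's actual setup.

```python
import json
import urllib.request

# Assumed endpoint for creating executions; verify against the Valohai API docs.
API_URL = "https://app.valohai.com/api/v0/executions/"


def build_training_request(token, project_id, commit, labels, data_urls):
    """Build a POST request that would start a training run with user-chosen labels."""
    payload = {
        "project": project_id,
        "commit": commit,
        "step": "train-classifier",  # hypothetical step name
        "inputs": {"training-data": data_urls},  # hypothetical input name
        "parameters": {"labels": ",".join(labels)},  # hypothetical parameter
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Token {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_training_request(
    token="example-token",
    project_id="example-project-id",
    commit="example-commit",
    labels=["software", "electrical"],
    data_urls=["s3://example-bucket/training/software.txt"],
)

# Sending is left out on purpose; urllib.request.urlopen(request) would submit it.
print(request.get_method(), request.full_url)
```

Keeping the request construction separate from the network call makes it easy to inspect or test the payload before anything is actually started in the cloud.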

When the user uploads the actual project documents containing the text that needs to be categorized, the inference step is run via the Valohai API, and the user gets a categorized document back in Selko’s tool. In this way, Valohai works as an orchestration layer between Selko’s user interface and the lower-level architecture.

We, a team of two full-stack developers and one data scientist, took up the job to build ourselves a complete machine learning orchestration system. We faced a lot of hurdles to reach a fully functional and working system. Since time was definitely a constraint we decided that we should concentrate on our customers' needs and let Valohai take care of the ML infrastructure.

-Aditya Jitta, Senior Data Scientist

Read how Selko's technical team describes their initial steps with machine learning infrastructure and how they ended up using Valohai.

Future for Selko

Due to rapid advancements in the field, Selko's data scientists are continuously exploring different models to find out whether the new technologies would suit their customers’ needs. With the help of the Valohai platform, they can make sure that all of the experiments are stored automatically, and it is easy to share the findings across the whole team.

According to Aditya, the plan for the future is to move towards active learning, where the machine learning algorithm asks the user whether the model has predicted the right label for a section of text. The team also intends to use unsupervised learning to identify similarities across different documents.
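One common way to decide which sections to ask the user about is uncertainty sampling: query the predictions the model is least sure of. The sketch below shows that idea under stated assumptions; it is not Selko's planned design. The section IDs, probabilities, and function names are invented, and each prediction is assumed to be a list of per-label probabilities from a multilabel classifier.

```python
def uncertainty(probabilities):
    """Return 1.0 when some label sits at 0.5 and 0.0 when every label is at 0 or 1."""
    return max(1.0 - 2.0 * abs(p - 0.5) for p in probabilities)


def select_queries(predictions, budget):
    """Pick the `budget` sections whose predicted labels are least certain."""
    ranked = sorted(predictions, key=lambda item: uncertainty(item[1]), reverse=True)
    return [section_id for section_id, _ in ranked[:budget]]


# Invented per-label probabilities for three document sections.
predictions = [
    ("s1", [0.97, 0.02]),  # confident on both labels
    ("s2", [0.55, 0.10]),  # first label is nearly a coin flip
    ("s3", [0.80, 0.30]),  # moderately unsure about the second label
]

print(select_queries(predictions, budget=2))  # ['s2', 's3']
```

The user's answers for the selected sections can then be added to the training data, so that each round of retraining focuses on the cases the model found hardest.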