Automated SKOS Vocabulary Design

Ensuring quick and consistent access to large collections of unstructured documents is one of the biggest challenges facing knowledge-intensive organizations. Designing dedicated vocabularies to index and retrieve documents is often deemed too expensive, so full-text search is preferred despite its known limitations.

However, the process of creating controlled vocabularies can be partly automated thanks to natural language processing and machine learning techniques.

Using a case study from the biopharmaceutical industry, we demonstrate how small organizations can use an automated workflow to create a controlled vocabulary that indexes unstructured documents in a semantically meaningful way.

MONIC Python package

https://github.com/rhubain/monic

MONIC (for MOsaic Non-structured Information Classification) is a Python package developed in the context of this paper, based on our methodology.

Classification outcomes can vary widely depending on both the features used and the targeted statements. There is therefore a risk of using an inappropriate feature for a given target.

An adaptive classification system was needed to handle this high variability in the quality of the results. To achieve this goal, MONIC uses a weighted classifier combination and a k-fold assessment module, both of which are detailed in the paper.
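The general idea of combining classifiers with weights derived from a k-fold assessment can be illustrated with a minimal sketch. This is not MONIC's actual API: the function names `k_fold_indices`, `cross_val_accuracy` and `weighted_vote` are hypothetical, and using the mean fold accuracy as a classifier's weight is an assumption made for illustration only. The paper describes the method actually used.

```python
from collections import defaultdict


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous (train, test) index pairs."""
    # Earlier folds absorb the remainder so that sizes differ by at most one.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        folds.append((train, test))
        start = stop
    return folds


def cross_val_accuracy(fit, X, y, k=5):
    """Mean accuracy over k folds.

    `fit(X_train, y_train)` is a hypothetical callable that returns a
    `predict(x)` function for a single sample.
    """
    scores = []
    for train, test in k_fold_indices(len(X), k):
        predict = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(X[i]) == y[i] for i in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)


def weighted_vote(predictions, weights):
    """Combine one predicted label per classifier into a single label.

    Each classifier's vote counts in proportion to its weight
    (assumed here to be its cross-validated accuracy).
    """
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)
```

With this scheme, a classifier built on a feature that performs poorly for a given target receives a low weight for that target, so its vote has little influence on the combined decision. For instance, `weighted_vote(["relevant", "irrelevant", "irrelevant"], [0.9, 0.3, 0.3])` favours the single reliable classifier over the two weaker ones.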

Preprint paper