Vincent Warmerdam - Bulk Labelling Techniques

Let's say you've got some unlabelled data and you want to train a classifier. You need annotations before you can model, but you only have an afternoon to spend, so you have to stay pragmatic. What would you do?
It turns out there are a few techniques that can help you with this. You can get an interesting subset annotated quickly by leveraging:
a quick search engine
pre-trained models
sentence/image embeddings
a trick to generate phrase embeddings
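To make the first idea concrete, here is a minimal sketch of search-based bulk labelling. It uses a tiny hand-rolled inverted index from the standard library as a stand-in for a real search engine; the example texts, the "shipping" label and the helper names are all made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_index(texts):
    """Map every token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(texts):
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    hits = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        hits &= index.get(token, set())
    return hits


# Hypothetical unlabelled texts.
texts = [
    "The delivery was late and the box was damaged",
    "Great phone, the battery lasts for days",
    "Late delivery again, very disappointing",
    "The battery died after a week",
]

index = build_index(texts)

# Assign one label to every hit of a query in a single step.
labels = {}
for doc_id in search(index, "late delivery"):
    labels[doc_id] = "shipping"
```

A query only gives you candidates, so in practice you would still skim the hits before accepting the label for all of them.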
In this talk I will explain these techniques for bulk labelling, and I will also highlight some tools that make all of this work. In particular you'll see:
lunr.py (a lightweight search engine)
sentimany (a library with pretrained sentiment models)
embetter (adds pretrained embeddings for scikit-learn)
umap (an amazing dimensionality reduction library)
spaCy (a great NLP tool)
sense2vec (phrase embeddings trained on reddit)
bulk (a user interface for bulk labelling embeddings)
For this talk I'll assume you're familiar with scikit-learn and that you've heard of embeddings before.
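As a rough illustration of why embeddings help with bulk labelling: items that sit close together in embedding space can often be reviewed and labelled as a group. The sketch below assumes you already have one vector per item. The 2-D vectors are toy values invented for this example, and `bulk_candidates` and its `threshold` are hypothetical names, not part of any of the libraries listed above.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def bulk_candidates(embeddings, seed_id, threshold=0.9):
    """Return ids of items whose embedding is close to the seed item's."""
    seed = embeddings[seed_id]
    return [
        i
        for i, vec in enumerate(embeddings)
        if i != seed_id and cosine(seed, vec) >= threshold
    ]


# Toy 2-D "embeddings"; real ones would come from a pretrained model.
embeddings = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
    [0.95, 0.05],
]

# Pick one item you have labelled by hand and collect its neighbours.
candidates = bulk_candidates(embeddings, seed_id=0)
```

The threshold is a judgment call: a higher value gives fewer but cleaner candidates, a lower value gives more candidates that need a closer look.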