Textbooks
-
Foundations of Statistical Natural Language Processing, Christopher Manning & Hinrich Schütze
Foundational text on natural language processing. Available as a free PDF online (linked).
Introductory tutorials
-
How to solve 90% of NLP problems: a step-by-step guide
Written by an Insight alumnus. Begins with the simplest NLP method that could work, then moves on to more nuanced solutions such as feature engineering, word vectors, and deep learning.
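As a taste of the guide's starting point, here is a minimal sketch of the "simplest method that could work": bag-of-words counts plus a linear classifier in scikit-learn. The toy texts and labels are made up for illustration; the post's own code differs.

```python
# Minimal "simplest thing that could work" sketch:
# bag-of-words features plus a linear classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data; any (texts, labels) pair works here.
texts = ["free pizza at the meetup", "server is down again", "great talk today"]
labels = [0, 1, 0]  # e.g., 1 = operational issue

vectorizer = CountVectorizer()        # tokenize and count words
X = vectorizer.fit_transform(texts)   # sparse document-term matrix
clf = LogisticRegression().fit(X, labels)

print(clf.predict(vectorizer.transform(["the server crashed"])))
```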
-
What you need to know about the new 2019 Transformer Models
A blog post reviewing BERT-based models and explaining transformers at a high level. Models covered include:
- XLNet: Generalized Autoregressive Pretraining for Language Understanding
- ERNIE 2.0: A Continual Pre-training Framework for Language Understanding
- RoBERTa: A Robustly Optimized BERT Pretraining Approach
-
Text Classification Algorithms: A Survey
This overview covers text feature extraction, dimensionality reduction methods, existing algorithms and techniques, and evaluation methods. Finally, the limitations of each technique and their applications to real-world problems are discussed.
-
NLP From Scratch: Classifying Names with a Character-Level RNN
A PyTorch tutorial on using RNNs for NLP.
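For flavor, a condensed sketch of the kind of model the tutorial builds: a character-level RNN classifier in PyTorch. The tutorial constructs the RNN cell by hand; this sketch uses nn.RNN instead, and the alphabet and class count are illustrative.

```python
# Character-level RNN classifier sketch in PyTorch.
import torch
import torch.nn as nn

alphabet = "abcdefghijklmnopqrstuvwxyz"
n_classes = 2  # hypothetical: e.g., two name origins

def encode(name):
    # One-hot encode each character: (seq_len, batch=1, len(alphabet))
    t = torch.zeros(len(name), 1, len(alphabet))
    for i, ch in enumerate(name.lower()):
        if ch in alphabet:
            t[i, 0, alphabet.index(ch)] = 1.0
    return t

class CharRNN(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.rnn = nn.RNN(len(alphabet), hidden)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, x):
        _, h = self.rnn(x)      # h: final hidden state (1, 1, hidden)
        return self.out(h[-1])  # class scores, shape (1, n_classes)

model = CharRNN()
print(model(encode("Satoshi")).shape)  # torch.Size([1, 2])
```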
Methods
-
Flowchart showing which text classification method to choose
A flowchart from Google showing which model to use when classifying textual data. They also have a whole tutorial on text classification.
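For a rough sense of the flowchart's top-level branch, here is its decision rule as a sketch: compare the number of samples to the median words per sample. The 1500 cutoff is the guide's heuristic, quoted from memory, and should be checked against the source.

```python
# Sketch of the top decision in Google's text classification flowchart:
# the samples-to-words-per-sample ratio chooses between an n-gram model
# and a sequence model. The 1500 cutoff is the guide's heuristic.
from statistics import median

def suggest_model(texts):
    words_per_sample = median(len(t.split()) for t in texts)
    ratio = len(texts) / words_per_sample
    if ratio < 1500:
        return "n-gram model (e.g., TF-IDF features + MLP)"
    return "sequence model (e.g., sepCNN or an RNN)"

print(suggest_model(["a short example document"] * 100))
```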
-
From the UC Irvine Business Analytics programming guide. If you don't have enough time to read through the entire post, the following covers the key components (a short scikit-learn sketch follows the list):
- Bag-of-words: How to break up long text into individual words.
- Filtering: Different approaches to remove uninformative words.
- Bag of n-grams: Retain some context by breaking long text into sequences of words.
- Log likelihood ratio test: Identify unique combinations of words that are more likely to be used together than not.
- Parts of speech: Tag words with their parts of speech (e.g., noun, verb, adjective).
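A minimal Python sketch of the bag-of-words, filtering, and bag-of-n-grams steps using scikit-learn (the guide's own code differs; the two sample documents are made up):

```python
# Bag-of-words, stop-word filtering, and bag-of-n-grams in scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

# Bag-of-words with English stop words removed (the "filtering" step).
bow = CountVectorizer(stop_words="english")
print(bow.fit_transform(docs).toarray())
print(bow.get_feature_names_out())

# Bag of n-grams: unigrams and bigrams retain some local word order.
ngrams = CountVectorizer(ngram_range=(1, 2))
print(ngrams.fit_transform(docs).toarray())
print(ngrams.get_feature_names_out())
```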
-
Term frequency — Inverse Document Frequency from Scratch in Python with Real World Example
A nice blog post showing how to write your own TF-IDF algorithm to embed documents, then retrieve documents by matching score or cosine similarity.
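For reference, a from-scratch sketch of the same idea, using the plain definitions tf(t, d) = count(t in d) / len(d) and idf(t) = log(N / df(t)). This is illustrative code, not the post's, and query terms must appear in the corpus.

```python
# From-scratch TF-IDF with cosine-similarity retrieval.
import math
from collections import Counter

docs = [doc.lower().split() for doc in [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]]

N = len(docs)
df = Counter(term for doc in docs for term in set(doc))  # document frequency

def tfidf(doc):
    tf = Counter(doc)
    return {t: (tf[t] / len(doc)) * math.log(N / df[t]) for t in tf}

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

vecs = [tfidf(d) for d in docs]
query = tfidf("the cat".split())
print(max(range(N), key=lambda i: cosine(query, vecs[i])))  # best match
```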
-
What are Word Embeddings for Text
A very brief overview of word embeddings, including transfer learning with pre-trained embeddings like GloVe or Word2Vec.
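A quick sketch of querying pre-trained GloVe vectors through gensim's downloader API (assumes gensim is installed; the vectors are fetched on first use):

```python
# Using pre-trained GloVe vectors via gensim's downloader.
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")   # 50-dim GloVe vectors

print(glove.most_similar("frog", topn=3))    # nearest neighbours
print(glove.similarity("cat", "dog"))        # cosine similarity
vec = glove["king"] - glove["man"] + glove["woman"]
print(glove.similar_by_vector(vec, topn=1))  # famously close to "queen"
```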
-
Word Bags vs Word Sequences for Text Classification
Sequence-respecting approaches have an edge over bag-of-words implementations when word order is material to the classification. The post evaluates Long Short-Term Memory (LSTM) neural nets over word sequences against Naive Bayes with TF-IDF vectors on a synthetic text corpus. Uses Keras.
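A condensed sketch of the two pipelines on a toy stand-in corpus where only word order separates the classes (the post's actual setup and corpus are larger):

```python
# Naive Bayes on TF-IDF vs. an LSTM on word sequences (Keras).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

# Toy corpus: identical word bags, opposite meanings.
texts = ["good not bad", "bad not good"] * 50
labels = np.array([1, 0] * 50)

# Bag-of-words baseline: both classes yield the SAME TF-IDF vector,
# so Naive Bayes cannot do better than chance (~0.5).
Xtf = TfidfVectorizer().fit_transform(texts)
nb = MultinomialNB().fit(Xtf, labels)
print("NB accuracy:", nb.score(Xtf, labels))

# Sequence model: the LSTM sees word order.
tok = Tokenizer()
tok.fit_on_texts(texts)
X = pad_sequences(tok.texts_to_sequences(texts))

model = models.Sequential([
    layers.Embedding(input_dim=len(tok.word_index) + 1, output_dim=8),
    layers.LSTM(16),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, labels, epochs=20, verbose=0)
print("LSTM accuracy:", model.evaluate(X, labels, verbose=0)[1])
```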
Tools
-
spaCy: Industrial Strength Natural Language Processing
A very convenient Python package with lots of pre-trained models for NLP.
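A brief usage sketch (assumes the small English model has been installed via python -m spacy download en_core_web_sm):

```python
# Tokenization, tagging, and named entities with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

for token in doc[:5]:
    print(token.text, token.pos_, token.lemma_)  # part of speech + lemma
for ent in doc.ents:
    print(ent.text, ent.label_)                  # named entities
```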
-
🤗 transformers: State-of-the-art NLP for everyone
🤗 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBERT, XLNet…) for Natural Language Understanding (NLU) and Natural Language Generation (NLG), with 32+ pre-trained models in 100+ languages and deep interoperability between TensorFlow 2.0 and PyTorch.
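A minimal sketch of the library's pipeline API (assumes transformers is installed; the default sentiment model is downloaded on first use):

```python
# One-liner inference with a pre-trained model via transformers.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("Transfer learning makes NLP much more accessible."))
# e.g., [{'label': 'POSITIVE', 'score': 0.99...}]
```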