Textbooks
-
Foundations of Statistical Natural Language Processing, Christopher Manning & Hinrich Schütze
Foundational text on natural language processing. Available as a free PDF online (linked).
Introductory tutorials
-
How to solve 90% of NLP problems: a step-by-step guide
Written by an Insight alumnus. Begins with the simplest NLP method that could work, then moves on to more nuanced solutions such as feature engineering, word vectors, and deep learning.
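As a taste of the guide's starting point, here is a minimal sketch of the "simplest method that could work": bag-of-words counts plus a linear classifier in scikit-learn. The toy texts and labels are made up for illustration; the post's own code differs.

```python
# Minimal "simplest thing that could work" sketch:
# bag-of-words features plus a linear classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data; any (texts, labels) pair works here.
texts = ["free pizza at the meetup", "server is down again", "great talk today"]
labels = [0, 1, 0]  # e.g., 1 = operational issue

vectorizer = CountVectorizer()        # tokenize and count words
X = vectorizer.fit_transform(texts)   # sparse document-term matrix
clf = LogisticRegression().fit(X, labels)

print(clf.predict(vectorizer.transform(["the server crashed"])))
```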
-
What you need to know about the new 2019 Transformer Models
A blog post reviewing BERT-based models and explaining transformers at a high level. Models covered include:
- XLNet: Generalized Autoregressive Pretraining for Language Understanding
- ERNIE 2.0: A Continual Pre-training Framework for Language Understanding
- RoBERTa: A Robustly Optimized BERT Pretraining Approach
-
Text Classification Algorithms: A Survey
This overview covers text feature extraction, dimensionality reduction methods, existing algorithms and techniques, and evaluation methods. Finally, the limitations of each technique and their applications to real-world problems are discussed.
-
NLP From Scratch: Classifying Names with a Character-Level RNN
A PyTorch tutorial on using RNNs for NLP.
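For flavor, a condensed sketch of the kind of model the tutorial builds: a character-level RNN classifier in PyTorch. The tutorial constructs the RNN cell by hand; this sketch uses nn.RNN instead, and the alphabet and class count are illustrative.

```python
# Character-level RNN classifier sketch in PyTorch.
import torch
import torch.nn as nn

alphabet = "abcdefghijklmnopqrstuvwxyz"
n_classes = 2  # hypothetical: e.g., two name origins

def encode(name):
    # One-hot encode each character: (seq_len, batch=1, len(alphabet))
    t = torch.zeros(len(name), 1, len(alphabet))
    for i, ch in enumerate(name.lower()):
        if ch in alphabet:
            t[i, 0, alphabet.index(ch)] = 1.0
    return t

class CharRNN(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.rnn = nn.RNN(len(alphabet), hidden)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, x):
        _, h = self.rnn(x)      # h: final hidden state (1, 1, hidden)
        return self.out(h[-1])  # class scores, shape (1, n_classes)

model = CharRNN()
print(model(encode("Satoshi")).shape)  # torch.Size([1, 2])
```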
Methods
-
Flowchart showing which text classification method to choose
A flowchart from Google showing which model to use when classifying textual data. They also have a whole tutorial on text classification.
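For a rough sense of the flowchart's top-level branch, here is its decision rule as a sketch: compare the number of samples to the median words per sample. The 1500 cutoff is the guide's heuristic, quoted from memory, and should be checked against the source.

```python
# Sketch of the top decision in Google's text classification flowchart:
# the samples-to-words-per-sample ratio chooses between an n-gram model
# and a sequence model. The 1500 cutoff is the guide's heuristic.
from statistics import median

def suggest_model(texts):
    words_per_sample = median(len(t.split()) for t in texts)
    ratio = len(texts) / words_per_sample
    if ratio < 1500:
        return "n-gram model (e.g., TF-IDF features + MLP)"
    return "sequence model (e.g., sepCNN or an RNN)"

print(suggest_model(["a short example document"] * 100))
```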
-
From the UC Irvine Business Analytics programming guide. If you don't have enough time to read through the entire post, the following covers the key components (a short scikit-learn sketch follows the list):
- Bag-of-words: How to break up long text into individual words.
- Filtering: Different approaches to remove uninformative words.
- Bag of n-grams: Retain some context by breaking long text into sequences of words.
- Log likelihood ratio test: Identify unique combinations of words that are more likely to be used together than not.
- Parts of speech: Tag words with their parts of speech (e.g., noun, verb, adjective).
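A minimal Python sketch of the bag-of-words, filtering, and bag-of-n-grams steps using scikit-learn (the guide's own code differs; the two sample documents are made up):

```python
# Bag-of-words, stop-word filtering, and bag-of-n-grams in scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

# Bag-of-words with English stop words removed (the "filtering" step).
bow = CountVectorizer(stop_words="english")
print(bow.fit_transform(docs).toarray())
print(bow.get_feature_names_out())

# Bag of n-grams: unigrams and bigrams retain some local word order.
ngrams = CountVectorizer(ngram_range=(1, 2))
print(ngrams.fit_transform(docs).toarray())
print(ngrams.get_feature_names_out())
```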
-
Term frequency — Inverse Document Frequency from Scratch in Python with Real World Example
A nice blog post showing how to write your own TF-IDF algorithm to embed documents, then retrieve documents by matching score or cosine similarity.
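For reference, a from-scratch sketch of the same idea, using the plain definitions tf(t, d) = count(t in d) / len(d) and idf(t) = log(N / df(t)). This is illustrative code, not the post's, and query terms must appear in the corpus.

```python
# From-scratch TF-IDF with cosine-similarity retrieval.
import math
from collections import Counter

docs = [doc.lower().split() for doc in [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]]

N = len(docs)
df = Counter(term for doc in docs for term in set(doc))  # document frequency

def tfidf(doc):
    tf = Counter(doc)
    return {t: (tf[t] / len(doc)) * math.log(N / df[t]) for t in tf}

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

vecs = [tfidf(d) for d in docs]
query = tfidf("the cat".split())
print(max(range(N), key=lambda i: cosine(query, vecs[i])))  # best match
```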
-
What are Word Embeddings for Text
A very brief overview of word embeddings, including transfer learning with pre-trained embeddings like GloVe or Word2Vec.
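A quick sketch of querying pre-trained GloVe vectors through gensim's downloader API (assumes gensim is installed; the vectors are fetched on first use):

```python
# Using pre-trained GloVe vectors via gensim's downloader.
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")   # 50-dim GloVe vectors

print(glove.most_similar("frog", topn=3))    # nearest neighbours
print(glove.similarity("cat", "dog"))        # cosine similarity
vec = glove["king"] - glove["man"] + glove["woman"]
print(glove.similar_by_vector(vec, topn=1))  # famously close to "queen"
```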
-
Word Bags vs Word Sequences for Text Classification
Sequence-respecting approaches have an edge over bag-of-words implementations when word order is material to the classification. The post evaluates Long Short-Term Memory (LSTM) neural nets over word sequences against Naive Bayes with TF-IDF vectors on a synthetic text corpus. Uses Keras.
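A condensed sketch of the two pipelines on a toy stand-in corpus where only word order separates the classes (the post's actual setup and corpus are larger):

```python
# Naive Bayes on TF-IDF vs. an LSTM on word sequences (Keras).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

# Toy corpus: identical word bags, opposite meanings.
texts = ["good not bad", "bad not good"] * 50
labels = np.array([1, 0] * 50)

# Bag-of-words baseline: both classes yield the SAME TF-IDF vector,
# so Naive Bayes cannot do better than chance (~0.5).
Xtf = TfidfVectorizer().fit_transform(texts)
nb = MultinomialNB().fit(Xtf, labels)
print("NB accuracy:", nb.score(Xtf, labels))

# Sequence model: the LSTM sees word order.
tok = Tokenizer()
tok.fit_on_texts(texts)
X = pad_sequences(tok.texts_to_sequences(texts))

model = models.Sequential([
    layers.Embedding(input_dim=len(tok.word_index) + 1, output_dim=8),
    layers.LSTM(16),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, labels, epochs=20, verbose=0)
print("LSTM accuracy:", model.evaluate(X, labels, verbose=0)[1])
```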
Tools
-
spaCy: Industrial Strength Natural Language Processing
A very convenient Python package with lots of pre-trained models for NLP.
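A brief usage sketch (assumes the small English model has been installed via python -m spacy download en_core_web_sm):

```python
# Tokenization, tagging, and named entities with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

for token in doc[:5]:
    print(token.text, token.pos_, token.lemma_)  # part of speech + lemma
for ent in doc.ents:
    print(ent.text, ent.label_)                  # named entities
```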
-
🤗 transformers: State-of-the-art NLP for everyone
🤗 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBERT, XLNet…) for Natural Language Understanding (NLU) and Natural Language Generation (NLG), with 32+ pre-trained models in 100+ languages and deep interoperability between TensorFlow 2.0 and PyTorch.
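A minimal sketch of the library's pipeline API (assumes transformers is installed; the default sentiment model is downloaded on first use):

```python
# One-liner inference with a pre-trained model via transformers.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("Transfer learning makes NLP much more accessible."))
# e.g., [{'label': 'POSITIVE', 'score': 0.99...}]
```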