Text as numerical features

Text is unstructured, and in order for it to be used in a model, we typically need to find a numerical representation of it.
For the code here, you need to run some imports:
import string

The Bag of Words framework

The Bag of Words framework (usually shortened to "BoW") is possibly the simplest numerical representation of strings of text one could envisage.
Wikipedia claims that an early reference to this name is present in Zellig Harris's 1954 paper Distributional Structure.
In BoW, a text is simply transformed into a "bag" (a multiset, that is, a set allowing for multiple occurrences) of the words composing it: this method is very simplistic in that it disregards grammar and word order.
Given a corpus of sentences, you take all the unique words in the corpus and, for each sentence, count the occurrences of each of those words.


Given a corpus composed of the two texts
  1. "John likes watching movies. Mary likes movies too."
  2. "John also likes watching football games."
the list of unique words in it is ["John", "likes", "watching", "movies", "also", "football", "games", "Mary", "too"]: 9 words in total.
Each of the two texts gets encoded into a 9-item list of the occurrence counts of those words. Respecting the order we chose for the list of unique words, we have:
  1. [1, 2, 1, 2, 0, 0, 0, 1, 1]
  2. [1, 1, 1, 0, 1, 1, 1, 0, 0]
This is because "John" (first item) appears once in the first text, "likes" appears twice in the first text, and so on.
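As a quick check, the two encodings above can be reproduced in a few lines of Python, using collections.Counter from the standard library for the counting:

```python
from collections import Counter

corpus = [
    "John likes watching movies. Mary likes movies too.",
    "John also likes watching football games.",
]
vocabulary = ["John", "likes", "watching", "movies", "also",
              "football", "games", "Mary", "too"]

# Tokenise by stripping the full stop and splitting on spaces
counts = [Counter(text.replace(".", "").split()) for text in corpus]

# Encode each text as the list of counts of the vocabulary words
# (a Counter returns 0 for missing keys)
bow = [[c[word] for word in vocabulary] for c in counts]
print(bow)  # [[1, 2, 1, 2, 0, 0, 0, 1, 1], [1, 1, 1, 0, 1, 1, 1, 0, 0]]
```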
We can play around with this a bit!

Play with bag of words

Let's use three sentences:
  A. "I like to watch Netflix but can never choose a movie."
  B. "The movie I saw last night was good."
  C. "Mary didn't like the movie but she can choose on Netflix."
Let's now lowercase the sentences, remove punctuation and tokenise them (simple space-splitting in this case), gathering all unique words:
# The three sentences
s1 = 'I like to watch Netflix but can never choose a movie.'
s2 = 'The movie I saw last night was good.'
s3 = "Mary didn't like the movie but she can choose on Netflix."

# Concatenate the sentences, lowercase, replace punctuation with space
# (do the same for each single sentence, for later use)
s = (s1 + ' ' + s2 + ' ' + s3).lower()
s1, s2, s3 = s1.lower(), s2.lower(), s3.lower()
for sign in string.punctuation:
    s = s.replace(sign, ' ')
    s1 = s1.replace(sign, ' ')
    s2 = s2.replace(sign, ' ')
    s3 = s3.replace(sign, ' ')

# Create the unique words list
unique_words = list(set(s.split()))
print('unique words are: ', unique_words)
This retrieves the unique words (in no particular order, as sets are unordered): "a", "but", "can", "choose", "didn", "good", "i", "last", "like", "mary", "movie", "netflix", "never", "night", "on", "saw", "she", "t", "the", "to", "was", "watch". Note that "didn't" got split into "didn" and "t" by the punctuation removal.
Finally, for each sentence provided, we now compute its BoW representation:
s1_bow, s2_bow, s3_bow = [], [], []
for word in unique_words:
    # Count the occurrences of each unique word in each sentence
    s1_bow.append(s1.split().count(word))
    s2_bow.append(s2.split().count(word))
    s3_bow.append(s3.split().count(word))
print('First sentence in BoW: ', s1_bow)
print('Second sentence in BoW: ', s2_bow)
print('Third sentence in BoW: ', s3_bow)

The TF-IDF framework

TF-IDF stands for term frequency - inverse document frequency and it is an improvement over the BoW model.
There are two main ingredients in the framework. Suppose we are considering a word (or term) $t$ and its presence in a set of documents (texts) $D = \{d\}$ (the corpus). We have:
  • the term frequency $tf(t, d)$: a function of the number of times $t$ appears in a document $d$; it quantifies the relevance of $t$ in $d$
  • the inverse document frequency $idf(t, D)$: an inverse function of the number of documents in the corpus in which $t$ appears; it quantifies how spread $t$ is across $D$, so that it can be used to scale down words which are very frequent across the texts
The idea behind the $idf$ is that words which are very common in the whole corpus (articles, for instance) carry little specific information and need to be weighted down.
The TF-IDF is given as the product of these two components:
$$tf\text{-}idf(t, d, D) = tf(t, d) \cdot idf(t, D)$$
The term frequency $tf(t, d)$ can be given as any of:
  1. raw frequency (count of occurrences): $f_{t, d}$
  2. boolean frequency: 1 if $t \in d$, 0 otherwise
  3. log-scaled frequency: $1 + \log(f_{t, d})$, or 0 if $t \notin d$
  4. augmented frequency: $\frac{f_{t, d}}{\max\{f_{t', d} : t' \in d\}}$ (the augmentation is intended as the division by the max of the raw frequencies in the document, so that we control for the bias towards very frequent words in a text)
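The four term-frequency variants above can be sketched as small Python functions; here f is assumed to be the raw count of the term in the document, and max_f the largest raw count in that document:

```python
import math

def tf_raw(f):
    # raw frequency: the count of occurrences of t in d
    return f

def tf_boolean(f):
    # 1 if the term appears in the document, 0 otherwise
    return 1 if f > 0 else 0

def tf_log(f):
    # log-scaled frequency, 0 when the term is absent
    return 1 + math.log(f) if f > 0 else 0

def tf_augmented(f, max_f):
    # raw frequency divided by the max raw frequency in the document
    return f / max_f
```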
The inverse document frequency $idf(t, D)$ (see the third reference for the original idea of scaling down words by their presence in the corpus) is given as
$$idf(t, D) = \log \left[ \frac{\left|D\right|}{\left|\{d \in D: t \in d\}\right|} \right] \ ,$$
but typically a 1 is added to the denominator to deal with terms that appear in no document, which would otherwise cause a division by zero. The $idf$ is thus the logarithm of the ratio between the total number of documents in the corpus and the number of documents containing the term.
The probabilistic interpretation of this choice (see Wikipedia) comes from the fact that the probability that a given document contains term $t$ can be written as
$$P(t | d) = \frac{\left|\{d \in D: t \in d\}\right|}{\left|D\right|} \ ,$$
so that the $idf$ is a function of the inverse of this probability. Furthermore, the choice of a logarithm is a natural one in Information Retrieval, where scoring functions are sought to be additive, and also because of Zipf's law (see the fifth reference).
$$idf(t, D) = -\log(P(t | d))$$
is then the logarithm of the inverse relative document frequency.
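Putting the two ingredients together, a minimal tf-idf sketch (using the raw count as the term frequency and the unsmoothed $idf$ of the definition above) could look like this:

```python
import math

def tf(term, doc):
    # raw frequency: count of occurrences of the term in the document
    return doc.count(term)

def idf(term, corpus):
    # number of documents in the corpus containing the term
    # (add 1 to this denominator to guard against unseen terms)
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# Tiny corpus: documents as lists of tokens
corpus = [
    "john likes watching movies mary likes movies too".split(),
    "john also likes watching football games".split(),
]
print(tf_idf("movies", corpus[0], corpus))  # 2 * log(2) ≈ 1.386
print(tf_idf("john", corpus[0], corpus))    # 0.0: "john" is in every document
```

Note how "john", which appears in every document, gets a score of 0: this is exactly the weighting-down of corpus-wide common words described above.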

An example

Let us consider a corpus $D$ with the two documents:
  • A: "this is a sample"
  • B: "this is another example"
We have
$$idf(\text{this}, D) = \log(2/2) = 0 \ ,$$
because the term "this" appears in two out of two documents, and
$$idf(\text{example}, D) = \log(2/1) \ ,$$
because "example" appears in one out of two documents.
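These two values can be checked numerically with a short sketch:

```python
import math

# The two documents of the example, as lists of tokens
A = "this is a sample".split()
B = "this is another example".split()
corpus = [A, B]

def idf(term):
    # unsmoothed idf, as in the definition above
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

print(idf("this"))     # log(2/2) = 0.0
print(idf("example"))  # log(2/1) ≈ 0.693
```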

Term-document matrix

Once each text in a collection of documents has been encoded into numerical features, we can represent the whole collection by building the so-called term-document matrix: a matrix with documents as rows and terms (words) as columns, that is, as features.
The typical choice is to fill each row with the occurrence counts of each term in that document, with a column for every term that exists in the whole corpus of texts.
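For the small BoW corpus used earlier, the term-document matrix can be built in pure Python as follows (the alphabetical column order is just a convenient convention here):

```python
corpus = [
    "john likes watching movies mary likes movies too",
    "john also likes watching football games",
]
docs = [text.split() for text in corpus]

# Fixed column order for the terms of the whole corpus
terms = sorted(set(t for doc in docs for t in doc))

# One row per document, one column per term, cells are counts
matrix = [[doc.count(term) for term in terms] for doc in docs]

print(terms)
for row in matrix:
    print(row)
# ['also', 'football', 'games', 'john', 'likes', 'mary', 'movies', 'too', 'watching']
# [0, 0, 0, 1, 2, 1, 2, 1, 1]
# [1, 1, 1, 1, 1, 0, 0, 0, 1]
```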