Text as numerical features
Text is unstructured, and in order for it to be used in a model, we typically need to find a numerical representation of it.
For the code here, you need to run some imports:

import string
The Bag of Words framework (usually shortened to "BoW") is possibly the simplest numerical representation of strings of text one could envisage.
In BoW, a text is simply transformed into a "bag" (a multiset, that is, a set allowing for multiple occurrences) of the words composing it: this method is very simplistic in that it disregards grammar and word order.
Given a corpus of sentences, you take all the unique words in the corpus and, for each sentence, count the occurrences of each of those words.
Given a corpus composed by the two texts
- 1."John likes watching movies. Mary likes movies too."
- 2."John also likes watching football games."
The list of unique words in it is ["John", "likes", "watching", "movies", "also", "football", "games", "Mary", "too"]: there are 9 of them.
Each of the two texts gets encoded into a 9-item list of the occurrence counts of those words. Respecting the order we chose for the list of unique words, we have:
- 1.[1, 2, 1, 2, 0, 0, 0, 1, 1]
- 2.[1, 1, 1, 0, 1, 1, 1, 0, 0]
This is because "John" (first item) appears once in the first text, "likes" appears twice in the first text, and so on.
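These counts are easy to verify in a few lines of Python with `collections.Counter`, a quick sketch of the encoding above:

```python
from collections import Counter

# Unique words of the corpus, in the order chosen above
unique_words = ["John", "likes", "watching", "movies", "also",
                "football", "games", "Mary", "too"]

texts = [
    "John likes watching movies. Mary likes movies too.",
    "John also likes watching football games.",
]

bows = []
for text in texts:
    # Strip the full stops and split on whitespace
    counts = Counter(text.replace(".", "").split())
    bows.append([counts[word] for word in unique_words])

print(bows[0])  # [1, 2, 1, 2, 0, 0, 0, 1, 1]
print(bows[1])  # [1, 1, 1, 0, 1, 1, 1, 0, 0]
```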
We can play around with this a bit!
Let's use three sentences:
- A: "I like to watch Netflix but can never choose a movie."
- B: "The movie I saw last night was good."
- C: "Mary didn't like the movie but she can choose on Netflix."
Let's now lower the case, remove punctuation and tokenise (simply space-splitting in this case) the sentences, gathering all unique words:
# Define the three sentences
s1 = "I like to watch Netflix but can never choose a movie."
s2 = "The movie I saw last night was good."
s3 = "Mary didn't like the movie but she can choose on Netflix."

# Concatenate sentences and lower the case
# Do the same for each single sentence (for later use)
s = (s1 + ' ' + s2 + ' ' + s3).lower()
s1, s2, s3 = s1.lower(), s2.lower(), s3.lower()

# Replace punctuation with space
for sign in string.punctuation:
    s = s.replace(sign, ' ')
    s1 = s1.replace(sign, ' ')
    s2 = s2.replace(sign, ' ')
    s3 = s3.replace(sign, ' ')

# Create the unique words list (space-splitting is our tokenisation here)
unique_words = list(set(s.split()))
print('unique words are: ', unique_words)
This retrieves the unique words (the order of a `set` is arbitrary, so it may vary): "a", "but", "can", "choose", "didn", "good", "i", "last", "like", "mary", "movie", "netflix", "never", "night", "on", "saw", "she", "t", "the", "to", "was", "watch". Note how replacing punctuation with spaces splits "didn't" into "didn" and "t".
Finally, for each sentence provided, we now compute its BoW representation:
s1_bow, s2_bow, s3_bow = [], [], []
for word in unique_words:
    s1_bow.append(s1.split().count(word))
    s2_bow.append(s2.split().count(word))
    s3_bow.append(s3.split().count(word))
print('First sentence in BoW: ', s1_bow)
print('Second sentence in BoW: ', s2_bow)
print('Third sentence in BoW: ', s3_bow)
TF-IDF stands for "term frequency - inverse document frequency", and it is an improvement over the BoW model.
There are two main ingredients in the framework. Suppose we are considering a word (or term) $t$ and its presence in a set of documents (texts) $D$ (the corpus). We have:
- the term frequency $\mathrm{tf}(t, d)$: a function of the number of times $t$ appears in a document $d$; it quantifies the relevance of $t$ in $d$
- the inverse document frequency $\mathrm{idf}(t, D)$: an inverse function of the number of documents in the corpus in which $t$ appears; it quantifies how spread $t$ is across $D$, so that it will be used to scale down words which are very frequent across the texts
The idea behind the idf is that words which are very common in the whole corpus (say articles, for instance) carry little specific information and need to be weighted down.

The TF-IDF is given as the product of these two components:

$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)$$
The term frequency $\mathrm{tf}(t, d)$ can be given as either of:
- 1. raw frequency (the count of occurrences): $f_{t,d}$
- 2. boolean frequency: $1$ if $t$ appears in $d$, $0$ otherwise
- 3. log-scaled frequency: $1 + \log f_{t,d}$, or $0$ if $f_{t,d} = 0$
- 4. augmented frequency: $0.5 + 0.5 \frac{f_{t,d}}{\max_{t' \in d} f_{t',d}}$ (the augmentation is intended as the division by the max of the raw frequencies in the document, so that we control for the bias towards very frequent words in a text)
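The four variants can be sketched on a toy document; the variant names and the `tf_variants` helper below are illustrative, not a standard API:

```python
import math
from collections import Counter

doc = "john likes movies and mary likes movies too".split()
counts = Counter(doc)         # raw frequencies f_{t,d}
max_f = max(counts.values())  # max raw frequency in the document

def tf_variants(term):
    f = counts[term]  # Counter returns 0 for absent terms
    return {
        "raw": f,
        "boolean": 1 if f > 0 else 0,
        "log": 1 + math.log(f) if f > 0 else 0,
        "augmented": 0.5 + 0.5 * f / max_f,
    }

print(tf_variants("likes"))  # f = 2, one of the most frequent terms
print(tf_variants("john"))   # f = 1
```

Note how the augmented frequency of "likes" is 1.0 (it is tied for the document maximum), while rarer terms are pulled towards 0.5 rather than 0.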
The inverse document frequency is given as

$$\mathrm{idf}(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}|} \ ,$$

but typically a $1$ is added to the denominator to control for those cases where the term does not appear in any document, which would otherwise make it zero. So the idf is the logarithm of the ratio of the total number of documents in the corpus to the number of documents containing the term: it is the logarithm of the inverse relative document frequency.
Let us consider a corpus $D$ with the two documents:
- A: "this is a sample"
- B: "this is another example"

We have

$$\mathrm{idf}(\text{"this"}, D) = \log \frac{2}{2} = 0$$

because the term "this" appears in two out of two documents, and

$$\mathrm{idf}(\text{"example"}, D) = \log \frac{2}{1} = \log 2$$

because "example" appears in one out of two documents.
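A minimal sketch of this computation (the `idf` function name is illustrative), using the natural logarithm and the un-smoothed denominator:

```python
import math

corpus = [
    "this is a sample".split(),
    "this is another example".split(),
]

def idf(term, docs):
    # Number of documents containing the term
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

print(idf("this", corpus))     # log(2/2) = 0.0
print(idf("example", corpus))  # log(2/1) = log 2
```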
Once each text in a collection of documents has been encoded into numerical features, we can represent the whole collection by building the so-called term-document matrix: a matrix with documents on the rows and terms (words) on the columns, that is, as features.
Each row of the matrix is then the array of occurrence counts of each term in that document, with a cell for every term which exists in the whole corpus of texts.
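Sticking to raw counts, the term-document matrix for the two-document corpus above can be sketched as follows (terms are sorted only to make the column order reproducible):

```python
corpus = [
    "this is a sample".split(),
    "this is another example".split(),
]

# Columns: all unique terms in the corpus
terms = sorted(set(word for doc in corpus for word in doc))
print(terms)  # ['a', 'another', 'example', 'is', 'sample', 'this']

# One row per document, one cell per term
matrix = [[doc.count(term) for term in terms] for doc in corpus]
for row in matrix:
    print(row)
```

Replacing `doc.count(term)` with a TF-IDF weight would turn this into the weighted variant of the same matrix.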