What is the TF-IDF algorithm?

simply put,The vector space model is to express the query keyword and the document into a vector, and then use the operation between the vectors to further express the relationship between the vectors.. For example, a more common operation is to calculate the vector between the query keyword and the vector corresponding to the document.relativity. "

TF (Term Frequency) - "word frequency"

This means that we calculate the number of times a word in a query keyword appears in the target document. For example, if we want to query "Car Insurance," for each document, we calculate how many times the word "Car" appears in it, and how many times the word "Insurance" appears in it. This is the calculation method of TF.

The underlying assumption behind TF is that the words in the query keyword should be more important than the other words, and the importance of the document, that is, the relevance, is proportional to the number of times the word appears in the document. For example, if the word "Car" appears 5 times in document A and 20 times in document B, the TF calculation considers document B to be more relevant.

However, information retrieval workers quickly discovered that only TF could not fully describe the relevance of a document. Because of language factors, some words may appear more and more naturally in many documents, such as "The", "An", "But" in English and so on. Most of these words play the role of link statements and are indispensable for maintaining language coherence. However, if we want to search for the keyword "How to Build A Car", the "How", "To" and "A" are most likely to appear in most documents. At this time, TF cannot help us distinguish The relevance of the document is gone.

IDF (Inverse Document Frequency) - "Inverse Document Frequency"

In this case, it came into being. The idea here is actually very simple, that is, we need to “Penalize” the words that appear in too many documents.

That is to say, words that actually carry "related" information appear only relatively few, sometimes in very few documents. This information is easily calculated using the "document frequency", that is, how many documents cover the word. Obviously, if there are too many documents covering a word, the less important the word is, or the less information the word has. Therefore, we need to correct the value of TF, and the idea of ​​IDF is to use the reciprocal of DF to correct it. The application of the reciprocal just expresses the idea that the larger the DF value, the less important it is.

The TF-IDF algorithm is mainly applied to English. Chinese must first be divided into words. After the word segmentation, it must solve the multi-word meaning and the word polysemy. These two problems cannot be solved well by the simple tf-idf method. Then there is a later word embedding method that uses a vector to represent a word.

4 variants of TF-IDF

Variant 1: Avoid TF linear growth by logarithmic function

Many people have noticed that the value of TF has no upper limit in the original definition. Although we generally think that a document contains query keywords multiple times to express a certain degree of relevance, such a relationship is hard to say linear. For the example of "Car Insurance" we just mentioned, document A may contain the word "Car" 100 times, and document B may contain 200 times. Does it mean that the relevance of document B is 2 times that of document A? What? In fact, many people realize that after a certain threshold is exceeded, the TF is less discriminating.

Using Log, which is a logarithmic function, transforming TF is a technique that does not let TF grow linearly.. Specifically, people often use the value of 1+Log(TF) instead of the original TF value. Under this new calculation, assume that "Car" appears once, the new value is 1, 100 occurs, the new value is 5.6, and 200 occurs, the new value is 6.3. Obviously, such calculations maintain a balance that is both discriminating but not completely linear.

Variant 2: Standardization for long document, short document problems

The classic calculation does not consider the difference between "long document" and "short document". A document A has 3,000 words, and a document B has 250 words. Obviously, even if "Car" has 20 times in both documents, it cannot be said that both documents are equally relevant."Normalization" of TF, especially the standardization based on the maximum TF value of the document, becomes another more common technique..

Variant 3: Logarithmic function handling IDF

The third common technique, which uses a logarithmic function to transform, is to process the IDF.. Instead of using IDF directly as a "penalty factor," we can use N+1 and divide by DF as the reciprocal of a new DF, and then pass a logarithmic change on this basis. Here N is the total number of all documents. The advantage of this is that, first, the total number of documents is used for standardization, which is similar to the standardization mentioned above. Second, logarithm is used to achieve non-linear growth.

Variant 4: Standardization of query words and document vectors

Another important TF-IDF variant is to normalize the query key vector and the document vector so that these vectors are not affected by the number of valid elements in the vector, ie different documents may have different lengths. In linear algebra, you can normalize vectors to the length of a unit vector. At this time, the dot product operation is performed, which is equivalent to performing the cosine similarity calculation on the original vector. Therefore, another rule that uses this rule is to perform cosine similarity operations directly in most cases instead of dot product operations.

TF-IDF history

Converting both the query keyword (Query) and the document (Document) into "vectors" and trying to solve the problem of information retrieval with mathematical tools such as linear algebra can be traced back at least to the 20 century 70 era.

In 1971, Cornell University professor Gerard Salton published the article "The SMART Retrieval System-Experiments in Automatic Document Processing", which was first mentioned in the article. Convert both the query key and the document into "vectors" and assign different values ​​to the elements in those vectors. The SMART retrieval system described in this paper, especially the description of TF-IDF and its variants, has become an important reference for many subsequent industrial-grade systems.

In the year of 1972, the British computer scientist Karen Spärck Jones in "A Statistical Interpretation of Term Specificity and Its Application in Retrieval" For the first time in the article, the application of IDF is elaborated. Later, Karen discussed the combination of TF and IDF in the article "Index Term Weighting". It can be said that Karen is the first computer scientist to theoretically fully demonstrate TF-IDF, so many people in the future have attributed the invention of TF-IDF to Karen.

Gerard himself is considered to be "the father of information retrieval." He was born in Nuremberg, Germany in 1927, and received his bachelor's and master's degrees in mathematics from Brooklyn College in New York in 1950 and 1952. He received his Ph.D. in applied mathematics from Harvard University in 1958 and then joined Cornell University to form a Ph.D. computer science. In order to pay tribute to Gerard's outstanding contribution to modern information retrieval technology, the American Computer Society's Association of Computing Machinery (ACM) now awards the Gerard Salton Award every three years for Recognize researchers who have made outstanding contributions to information retrieval technology. Karen Jones won the second "Gerard Solton Award" in 1988.

Remarks: Most of the above content is excerpted from the paid course -AI technology internal reference"

Baidu Encyclopedia + Wikipedia

Baidu Encyclopedia version

TF-IDF is a statistical method used to assess the importance of a word for a file set or one of the files in a corpus. The importance of a word increases proportionally with the number of times it appears in the file, but it also decreases inversely with the frequency it appears in the corpus. Various forms of TF-IDF weighting are often applied by search engines as a measure or rating of the degree of correlation between a file and a user query.

In addition to TF-IDF, search engines on the Internet also use a link analysis-based rating method to determine the order in which files appear in search results.