Web reprint version
To describe the LDA model, we first need some background on generative models. A generative model is defined in contrast to a discriminative model. Here we assume that the data to be modeled carries feature information, usually denoted X, and label information, usually denoted Y.
A discriminative model directly describes how Y is obtained from X, that is, it models the conditional distribution P(Y|X), but it does not model the feature information X itself. This makes discriminative models natural tools for building classifiers or for regression analysis. A generative model, by contrast, models X and Y jointly, which makes it better suited to analyzing unlabeled data, such as clustering. Of course, because a generative model must capture more information, it is generally considered harder to learn than a discriminative model on the same data.
In general, a generative model helps the reader understand the model through its generative process. Note that this generative process is essentially a way of decomposing a joint probability distribution (Joint Distribution). In other words, the process is a virtual one; real data is usually not generated this way. The generative process is an assumption of the model, a description. Any generative process is mathematically equivalent to some joint probability distribution.
The LDA generative process describes how documents, and the words within them, are generated. In the original LDA paper, the authors describe the following generative process for each document:
- First, the length N of the document is drawn from a Poisson distribution with a global parameter ξ;
- Next, the topic distribution θ of the current document is drawn from a Dirichlet distribution with a global parameter α;
- Then, for each of the N words of the current document, two steps are performed: first, a topic index z_n is drawn from a multinomial distribution with parameter θ; second, a word w_n is drawn from the multinomial distribution determined by φ and z_n, that is, the word distribution φ_{z_n} of the chosen topic.
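The three steps above can be sketched as a short simulation. This is a minimal illustration, not the learning algorithm: the hyperparameter values (K, V, α, ξ and the Dirichlet prior on the topic rows) are arbitrary toy choices, and φ is sampled randomly rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)

K, V = 3, 8               # number of topics, vocabulary size (toy values)
alpha = np.full(K, 0.5)   # Dirichlet parameter for document-topic draws
xi = 10                   # Poisson parameter for document length

# Topic matrix: each row phi[k] is a multinomial distribution over the V words.
phi = rng.dirichlet(np.full(V, 0.1), size=K)

# Generate one document following the generative process above.
N = rng.poisson(xi)                    # 1. draw the document length
theta = rng.dirichlet(alpha)           # 2. draw the document's topic distribution
z = rng.choice(K, size=N, p=theta)     # 3a. draw a topic index per word
w = np.array([rng.choice(V, p=phi[k]) for k in z])  # 3b. draw each word from its topic

print(N, theta.round(2), w)
```

Each word index in `w` is generated from the topic-matrix row selected by the corresponding entry of `z`, which is exactly the two-step loop described above.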
From this description we can immediately read off several important pieces of model information. First, we have a K × V topic matrix (Topic Matrix). Each row of this matrix is a φ, a multinomial distribution used to generate words. Of course, this topic matrix is not known in advance and must be learned. In addition, each document has a θ, a vector of length K that describes the current document's distribution over the K topics. The generative process tells us that for each word in the document, we first draw an index from this θ vector, which tells us which row of the topic matrix is used to generate the current word.
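The generative process above corresponds to a joint distribution over the per-document variables. In the notation used here (with φ playing the role the original paper assigns to β), it factorizes as:

```latex
p(\theta, z, w \mid \alpha, \varphi)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \varphi)
```

Integrating out θ and summing over the z_n gives the marginal probability of a document, which is the quantity the model is fit to explain.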
The content above is reprinted from "AI Technology Insider Reference", a very good paid tutorial that is recommended.
Baidu Encyclopedia version
LDA (Latent Dirichlet Allocation) is a generative topic model for documents, also known as a three-layer Bayesian probability model, containing a three-layer structure of words, topics, and documents. As a generative model, it assumes that each word in an article is obtained by a process of "choosing a topic with some probability, then choosing a word from that topic with some probability." Document-to-topic follows a multinomial distribution, and topic-to-word follows a multinomial distribution.
LDA is an unsupervised machine learning technique that can be used to identify latent topic information in large document collections or corpora. It uses a bag-of-words approach, treating each document as a word-frequency vector and thereby transforming text into numerical information that is easy to model. The bag-of-words approach does not consider word order, which simplifies the problem but also leaves room for improving the model. Each document represents a probability distribution over topics, and each topic represents a probability distribution over many words.
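The bag-of-words representation mentioned above can be shown in a few lines. The helper name `bow` and the toy vocabulary are illustrative, not from any particular library; the point is that two documents with the same words in different orders map to the same frequency vector.

```python
from collections import Counter

def bow(doc, vocab):
    """Map a tokenized document to a word-frequency vector over vocab."""
    counts = Counter(doc)
    return [counts[w] for w in vocab]

vocab = ["the", "cat", "sat", "mat"]
d1 = ["the", "cat", "sat", "the", "mat"]
d2 = ["the", "mat", "sat", "the", "cat"]   # same words, different order

print(bow(d1, vocab))  # [2, 1, 1, 1]
print(bow(d2, vocab))  # [2, 1, 1, 1] -- word order is discarded
```

This is exactly the simplification the text describes: order information is lost, but each document becomes a fixed-length vector suitable for probabilistic modeling.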
In natural language processing, Latent Dirichlet Allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups, explaining why some parts of the data are similar. For example, if the observations are words collected into documents, the model assumes that each document is a mixture of a small number of topics and that each word's presence is attributable to one of the document's topics. LDA is an example of a topic model.