The data comes from three main sources:
First, we crawl relevant industry data with web crawlers.
Second, the log data generated by our cloud products is collected directly into our data platform.
Third, we turn the material provided by customers into data and knowledge.
Unstructured data: We first clean the collected data, then classify it against our knowledge taxonomy using a combination of machine and manual classification, and then apply coarse-grained labels by some means (such as rules). The labels are confirmed by hand, and the data is stored only after that confirmation.
Semi-structured data: The documents that customers provide in their original formats are classified or clustered by format analysis or by a machine learning model, then reviewed and organized manually, and finally stored.
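As a rough illustration of the format-analysis step for semi-structured documents, the sketch below groups documents by a coarse layout signature before anyone reviews them. The function names, the signature categories, and the heuristics are all assumptions made for this example; the source does not describe the actual rules or models used.

```python
import re
from collections import defaultdict


def format_signature(text: str) -> str:
    """Reduce a document to a coarse description of its layout (hypothetical heuristics)."""
    stripped = text.lstrip()
    if stripped.startswith("{") or stripped.startswith("["):
        return "json-like"
    if stripped.startswith("<"):
        return "markup-like"
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Every non-empty line contains a comma: treat it as a delimited table.
    if lines and all("," in ln for ln in lines):
        return "delimited-table"
    # Every non-empty line looks like "key: value".
    if lines and all(re.match(r"^[^:]+:\s*\S", ln) for ln in lines):
        return "key-value"
    return "unknown"


def group_by_format(docs: dict[str, str]) -> dict[str, list[str]]:
    """Group document names by layout signature so reviewers can handle one group at a time."""
    groups = defaultdict(list)
    for name, text in docs.items():
        groups[format_signature(text)].append(name)
    return dict(groups)
```

Documents that land in the "unknown" group would be the natural candidates for a learned clustering model or for direct manual sorting.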
In both cases the machine only provides preliminary assistance, and the final confirmation is made manually; data does not enter the warehouse directly after machine processing.
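The workflow described above (clean, label coarsely by rule, confirm by hand, then store) can be sketched as follows. This is a minimal sketch under stated assumptions: the keyword rules, the `Record` and `Warehouse` classes, and every function name are hypothetical and only illustrate the idea that nothing is stored without manual confirmation.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical keyword rules for coarse-grained labels (illustrative only).
RULES = {
    "billing": ("invoice", "refund", "payment"),
    "account": ("password", "login"),
}


@dataclass
class Record:
    text: str
    suggested_label: str                    # proposed by the machine
    confirmed_label: Optional[str] = None   # set only by a human reviewer


def clean(raw: str) -> str:
    """Collapse runs of whitespace; a stand-in for real cleaning."""
    return " ".join(raw.split())


def suggest_label(text: str) -> str:
    """Return the first rule label whose keywords appear in the text."""
    lowered = text.lower()
    for label, keywords in RULES.items():
        if any(keyword in lowered for keyword in keywords):
            return label
    return "unlabeled"


def machine_pass(raw_texts: list) -> list:
    """Machine pre-processing: clean each text and attach a suggested label."""
    records = []
    for raw in raw_texts:
        text = clean(raw)
        records.append(Record(text=text, suggested_label=suggest_label(text)))
    return records


def human_confirm(record: Record, final_label: str) -> Record:
    """A reviewer accepts or overrides the machine's suggestion."""
    record.confirmed_label = final_label
    return record


@dataclass
class Warehouse:
    rows: list = field(default_factory=list)

    def store(self, record: Record) -> None:
        # Refuse anything that has not been confirmed by a reviewer.
        if record.confirmed_label is None:
            raise ValueError("record must be confirmed by a reviewer before storage")
        self.rows.append(record)
```

Putting the check inside `store` rather than in the calling code is a design choice for this sketch: it makes it impossible to skip the manual step by accident.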
Xiaoyi has a big data platform, an annotation system, and a laboratory system that work together to generate this industry training data and industry background knowledge, which is then deployed to the production system in the form of a domain semantic library.