  • Essay / Information retrieval based on domain term extraction and query classification algorithms

    Summary: An information retrieval (IR) system searches a large dataset for documents relevant to the user's query. Queries submitted by users to search engines can be ambiguous and concise, and their meaning can change over time. As a result, understanding the nature of the information need behind queries has become an important research problem, and various search engines emphasize query classification. For an efficient IR system, this system proposes a Query Classification Algorithm (QCA) and a Domain Term Extraction Algorithm. The system classifies queries into predefined target categories. In query classification, domain terms are extracted from the query and each of them is classified into its relevant categories stored in the database. Using the categories assigned by QCA, the system finds the relevant documents in the document collection. The vector space IR model is used to retrieve the relevant documents.

I. INTRODUCTION

An Information Retrieval (IR) system finds relevant documents in a large dataset based on the user's query. IR includes basic elements such as indexing, searching, and classifying documents. Current IR systems, including search engines, have a standard interface consisting of a single input box that accepts keywords. User-submitted keywords are compared to the collection index to find documents containing those keywords. When a user query contains multiple topic-specific keywords that accurately describe the information need, the system is likely to return good matches; however, when the user's query is short, and because natural language is inherently ambiguous, this simple retrieval model is prone to errors and omissions. Understanding the meaning of search queries is a key task that lies at the heart of search.
Query classification is a difficult task because queries typically have only a few terms, which often leads to significant ambiguity. Semantic reasoning is very important in understanding queries and building a powerful search engine. A user may be unable to formulate a request precisely when looking for information, even though they know what they want. As a result, understanding the nature of the information need behind queries has become an important research problem. This system therefore provides a domain term extraction algorithm and a query classification algorithm (QCA).

In the proposed system, a conceptual term strategy is used to identify the relevant category for an ambiguous domain term. The system stores conceptual terms in a NoSQL graph database. Based on the conceptual term strategy and the NoSQL graph database, the system uses QCA to classify query features and ambiguous domain terms. Using the classified user query, the system performs the information retrieval process. In the IR system based on query classification, QCA and the vector space model are used to retrieve relevant information related to user queries. According to the analysis of conceptual terms, the system improves retrieval by returning documents more relevant to the user's requirements.

The remainder of the paper is organized as follows. Related work is described in Section II. The basic theory is presented in Section III. The proposed system design is presented in Section IV. The proposed methodology is described in Section V, and the experimental results of the system are presented in Section VI. Finally, the conclusion is given in Section VII.

II. RELATED WORK

In 2006, W. Yue, Z. Chen and X. Lu proposed a new information retrieval algorithm based on query expansion and classification. The algorithm is motivated by the observation that, with traditional information retrieval methods, very short queries often have low precision, although they can achieve high recall.
Their approach aimed to capture more relevant documents by query expansion and text classification. The experimental results showed that the proposed algorithm is more accurate and efficient than traditional query expansion methods.

In 2012, S. M. Fathalla and Y. F. Hassan presented a hybrid method for reformulation and classification of user queries based on a fuzzy semantics-based approach and a K-Nearest Neighbor (KNN) classifier. The overall processes of the system are query preprocessing, fuzzy membership calculation, classification, and query reformulation. Classification is performed using the KNN classifier, not only with keyword-based semantics but also with sentence-level semantics. After classification, the user's query is rephrased for submission to a search engine, yielding better results than submitting the original query. Their experiments show a significant improvement in search results compared to traditional keyword-based search engine results.

In 2015, C. Xia and X. Wang adopted a new method for classifying web queries. Their method includes three steps. First, some contextual information is labeled to enrich the training set. In the second step, the list of labeled queries is divided into word sequences, and then a graph whose nodes and edges are indexed with category labels is constructed. After that, a row equation is formed to evaluate the possibility that a given query belongs to a certain category. Their method can reduce training time by 10% compared to a support vector machine (SVM).

III. BASIC THEORY

A. Domain Term Extraction

Domain term extraction is a categorization or classification task in which terms are classified into a set of predefined domains. It has been applied to tasks such as keyphrase extraction, word sense disambiguation, multilingual text categorization, and query classification.

B. Query Classification

Queries submitted by users to search engines can be ambiguous and concise, and their meaning can change over time.
Nowadays, query classification is emphasized by various search engines because the web keeps growing, with millions of resources added to it every day. Query classification assigns a search query to one or more predefined categories, based on its topics. This involves classifying a user query Qi into a list of n categories ci1, ci2, ..., cin. The importance of query classification is underscored by the many services that search engines provide. A direct application is to provide better search results to users interested in different categories. Search result pages can be grouped based on the categories predicted by the query classification method.

Query classification is a two-step process. The first is the learning stage, where a classification model is built. The second is the classification stage, where the model is used to predict the class label for given data. Given a certain category in an intermediate taxonomy, the query is directly mapped to a target category if and only if the following condition is satisfied: one or more terms in each node along the path in the target category appear along the path corresponding to the matched intermediate category.

C. Information Retrieval

An Information Retrieval (IR) system is capable of accepting a user query, understanding the user's requirements, searching a database for relevant documents, retrieving the documents for the user, and ranking the documents according to their relevance. There are four main IR models:

1) Boolean model: A document matches the query if the set of terms associated with the document satisfies the Boolean expression representing the query. The Boolean expression uses the standard Boolean operators: and, or, and not. The result of the query is the set of matching documents.

2) Vector space model: In the vector space model, a text is represented by a vector of terms. Terms are generally words and phrases.
If words are chosen as terms, then each word in the vocabulary becomes an independent dimension in a very high-dimensional vector space. Any text can then be represented by a vector in this high-dimensional space. If a term belongs to a text, it takes a non-zero value in the text vector along the dimension corresponding to the term. A vector IR method represents both documents and queries as high-dimensional vectors and calculates their similarity by the vector inner product.

3) Language model: Statistical language models are probability-based and rely on statistical theory. The approach first estimates a language model for each document, then ranks documents by the probability of the query given the document's language model.

4) Probabilistic model: Probabilistic IR models estimate the probability that a document is relevant to a query. This model is based on probability theory: it estimates how relevant a given document is to the query.

IV. PROPOSED SYSTEM DESIGN

In this system, there are three main steps. First, the system uses the domain term extraction algorithm to extract the domain terms from the user's query. In the second step, the system classifies each extracted domain term into its category using QCA and the Neo4j graph database. In the last step, the system retrieves the information relevant to the user's query using the classified query.

V. PROPOSED METHODOLOGY

In this system, domain term extraction and query classification algorithms are proposed. Using the classified query, the system retrieves relevant information according to the vector space IR model.

Vector Space IR Model

In the vector space IR model, a document is represented as a vector of term weights. The number of dimensions in the vector space is equal to the number of terms used in the overall collection of documents.
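
To make the vector space model described above more concrete, here is a minimal Python sketch. It is not the implementation used in the proposed system. It assumes whitespace tokenization and uses raw term counts as the term weights, both of which are simplifications chosen for illustration; the function names (`tokenize`, `build_vocabulary`, `to_vector`, `inner_product`, `rank`) and the sample documents are hypothetical.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization stands in for real preprocessing.
    return text.lower().split()


def build_vocabulary(documents):
    # One dimension per distinct term in the whole collection.
    terms = sorted({term for doc in documents for term in tokenize(doc)})
    return {term: index for index, term in enumerate(terms)}


def to_vector(text, vocabulary):
    # Assumption: the term weight is the raw term count in the text.
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for term, count in counts.items():
        if term in vocabulary:
            vector[vocabulary[term]] = count
    return vector


def inner_product(u, v):
    return sum(a * b for a, b in zip(u, v))


def rank(query, documents):
    # Score every document against the query and return the indices of
    # documents with a non-zero score, best match first.
    vocabulary = build_vocabulary(documents)
    query_vector = to_vector(query, vocabulary)
    scored = [
        (inner_product(query_vector, to_vector(doc, vocabulary)), index)
        for index, doc in enumerate(documents)
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for score, index in scored if score > 0]


documents = [
    "apple releases new phone",
    "apple pie recipe with cinnamon",
    "stock market falls",
]
print(rank("apple phone", documents))
```

In this sketch the query "apple phone" shares two terms with the first document and one with the second, so the first document is ranked higher and the third is not returned at all. A practical system would usually replace the raw counts with a weighting scheme such as TF-IDF and normalize the vectors, but those choices are outside what the text above specifies.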