An Improved Information Retrieval Framework for Sparse Data using Knowledge Graph Generation and Enhanced Clustering

Sriyas Kanduri; Radha K

doi:10.14445/22312803/IJCTT-V73I6P115

Research Article | Open Access

Volume 73 | Issue 6 | Year 2025 | Article Id. IJCTT-V73I6P115

Received: 05 May 2025 | Revised: 05 Jun 2025 | Accepted: 22 Jun 2025 | Published: 30 Jun 2025

Citation:

Sriyas Kanduri, Radha K, "An Improved Information Retrieval Framework for Sparse Data using Knowledge Graph Generation and Enhanced Clustering," International Journal of Computer Trends and Technology (IJCTT), vol. 73, no. 6, pp. 124-133, 2025. Crossref, https://doi.org/10.14445/22312803/IJCTT-V73I6P115

Abstract

When dealing with sparse information, classical Retrieval-Augmented Generation (RAG) with hybrid retrieval frequently fails to produce satisfactory answers, which reduces the efficiency and dependability of information retrieval (IR). To overcome this shortcoming, we add cosine distance measures, which quantify the difference between vectors and thus offer a complementary viewpoint. Compared with the existing approach, the proposed technique gives a more complete picture of the semantic links between documents or objects and achieves better retrieval results. Compared with traditional IR models such as the Vector Space Model (VSM), TF-IDF, hybrid retrieval approaches, and knowledge-graph-based enhancements, latent semantic techniques offer a promising way to retrieve relevant information accurately and efficiently in knowledge-intensive applications by increasing the F1-score, precision, and recall. In sparse data environments, IR remains a major challenge, especially for knowledge-intensive applications that require a high degree of contextual relevance and accuracy. This research introduces a hybrid approach that combines conventional IR models, contemporary embedding methods, and transformer-based architectures with Knowledge Graph Generation (KGGen) and KGGen-based clustering. The results indicate that the capabilities of Large Language Models (LLMs) can be used more fully by incorporating hybrid retrieval (BM25 + embeddings) into traditional RAG, which yields high-precision, efficient information retrieval for business-specific data. KGGen and clustering substantially improve the representation and retrieval of documents. The objective is to increase retrieval performance by enhancing semantic comprehension, contextual alignment, and effective access to limited information.
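To make the idea of hybrid retrieval (BM25 + embeddings) with a cosine measure concrete, the following is a minimal sketch and not the authors' implementation. It assumes a weighted sum of min-max-normalised scores as the fusion rule; the parameters `k1`, `b`, and `alpha`, the helper names, and the three-dimensional toy vectors standing in for real embeddings are all illustrative assumptions.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document for the query.

    Uses the non-negative IDF variant log(1 + (N - df + 0.5) / (df + 0.5)).
    k1 and b are common defaults, assumed here for illustration.
    """
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either is a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cosine_distance(u, v):
    """Cosine distance, defined as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(u, v)


def min_max(xs):
    """Rescale scores to [0, 1] so lexical and dense scores are comparable."""
    lo, hi = min(xs), max(xs)
    return [0.0] * len(xs) if hi == lo else [(x - lo) / (hi - lo) for x in xs]


def hybrid_rank(query_tokens, query_vec, docs_tokens, doc_vecs, alpha=0.5):
    """Rank document indices by alpha * BM25 + (1 - alpha) * cosine similarity."""
    lexical = min_max(bm25_scores(query_tokens, docs_tokens))
    dense = min_max([cosine_similarity(query_vec, v) for v in doc_vecs])
    fused = [alpha * l + (1 - alpha) * d for l, d in zip(lexical, dense)]
    return sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)


# Toy corpus; the vectors are hand-made stand-ins for real embeddings.
docs_tokens = [
    ["knowledge", "graph", "generation"],
    ["sparse", "data", "retrieval"],
    ["weather", "report"],
]
doc_vecs = [[0.9, 0.1, 0.0], [0.2, 0.9, 0.1], [0.0, 0.1, 0.9]]

ranking = hybrid_rank(["sparse", "retrieval"], [0.1, 0.95, 0.05],
                      docs_tokens, doc_vecs)
print(ranking)
```

In this toy setup the document that matches the query both lexically and in vector space is ranked first. The weight `alpha` trades off the lexical and dense signals and would need tuning on real data.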
We assess the effectiveness of our strategy using several established IR metrics, which show that it performs better across several datasets. Data representation in knowledge-intensive sectors is frequently sparse, which lowers the efficiency and accuracy of information retrieval (IR) systems. To improve overall system performance, this research proposes a hybrid strategy that combines conventional and contemporary retrieval methods with Knowledge Graph Generation (KGGen) and KGGen-based clustering. For sparse and complex data environments, the proposed approach seeks to increase the effectiveness, dependability, and correctness of IR operations.
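For reference, the set-based definitions of precision, recall, and F1-score named in this abstract can be sketched as follows. This is an illustrative helper, not the authors' evaluation code; the function name and the document identifiers are hypothetical.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and F1 for a single query.

    retrieved: document ids returned by the system
    relevant:  document ids judged relevant (ground truth)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: 4 documents retrieved, 3 relevant, 2 in common.
p, r, f1 = precision_recall_f1(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Here two of the four retrieved documents are relevant (precision 1/2) and two of the three relevant documents are found (recall 2/3), so F1 is their harmonic mean, 4/7.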

Keywords

IR, TF-IDF, KGGen, Precision, Recall, F1-score, Knowledge Graph-Based Improvements.
