Text Summarization using K-Means, Tanimoto Distance & Jaccard Similarity

Annu Sharma, Ms. Nandini Sharma

doi:10.14445/22312803/IJCTT-V68I6P115

Research Article | Open Access

Volume 68 | Issue 6 | Year 2020 | Article Id. IJCTT-V68I6P115 | DOI : https://doi.org/10.14445/22312803/IJCTT-V68I6P115

Received: 09 May 2020 | Revised: 25 Jun 2020 | Accepted: 29 Jun 2020

Citation:

Annu Sharma, Ms. Nandini Sharma, "Text Summarization using K-Means, Tanimoto Distance & Jaccard Similarity," International Journal of Computer Trends and Technology (IJCTT), vol. 68, no. 6, pp. 87-93, 2020. Crossref, https://doi.org/10.14445/22312803/IJCTT-V68I6P115

Abstract

Text summarization is the process of reducing a source text, passage, or other content to a short text that still preserves the crucial information it contains. This scheme summarizes web content such as reviews, blogs, and news, based on content and context for a specific category or class, using machine learning techniques: K-Means, Tanimoto Distance, Jaccard Similarity, and word frequency weighting. The aim is to summarize reviews, blogs, and news web pages automatically, shortening the process of finding the gist of that information. The analysis measured the accuracy of the summaries using precision and recall. The results show an accuracy of approximately 80% for the precise summary and approximately 73% for the concise summary, for English-language reviews, blogs, and news available online. The proposed scheme indicates that combining two or more machine learning techniques was relatively successful and effective at capturing the essence of reviews, blogs, and news, comparable to summaries produced manually by humans.
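As an illustration of the measures named in the abstract, the following is a minimal Python sketch, not the authors' implementation. It assumes sentences are compared as sets of lowercase, whitespace-separated words, that Tanimoto distance is taken in its common set-based form (one minus the Jaccard similarity), and that precision and recall are computed over the indices of sentences selected for a summary versus a human reference. The function names `tokens`, `jaccard_similarity`, `tanimoto_distance`, and `precision_recall` are hypothetical and do not come from the paper.

```python
from typing import Set, Tuple


def tokens(sentence: str) -> Set[str]:
    """Assumed tokenization: lowercase, split on whitespace, keep unique words."""
    return set(sentence.lower().split())


def jaccard_similarity(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; treated as 1.0 when both sets are empty."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def tanimoto_distance(a: Set[str], b: Set[str]) -> float:
    """Set-based Tanimoto distance, assumed here to be 1 - Jaccard similarity."""
    return 1.0 - jaccard_similarity(a, b)


def precision_recall(selected: Set[int], reference: Set[int]) -> Tuple[float, float]:
    """Precision and recall of selected sentence indices against a reference summary."""
    true_positives = len(selected & reference)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


if __name__ == "__main__":
    s1 = tokens("the cat sat on the mat")
    s2 = tokens("the cat lay on the rug")
    print(jaccard_similarity(s1, s2))   # 3 shared words out of 7 distinct words
    print(tanimoto_distance(s1, s2))
    print(precision_recall({0, 2, 3}, {0, 3, 4, 5}))
```

In a clustering-based summarizer, a distance of this kind could serve as the dissimilarity between sentences, with one representative sentence taken from each cluster. How the paper combines these measures with K-Means and word frequency weighting is not specified in the abstract, so that step is left out of the sketch.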

Keywords

Automatic Text Summarization, Machine Learning, K-Means Clustering, Tanimoto Distance, Jaccard Similarity.
