Dictionary Based Text Filter for Lossless Text Compression

International Journal of Computer Trends and Technology (IJCTT)          
 
© 2017 by IJCTT Journal
Volume-49 Number-3
Year of Publication : 2017
Authors : Rexline S. J, Robert L, Trujilla Lobo.F
DOI : 10.14445/22312803/IJCTT-V49P122

MLA

Rexline S. J, Robert L, Trujilla Lobo.F "Dictionary Based Text Filter for Lossless Text Compression". International Journal of Computer Trends and Technology (IJCTT) V49(3):143-149, July 2017. ISSN:2231-2803. www.ijcttjournal.org. Published by Seventh Sense Research Group.

Abstract -
This paper presents a new text transformation technique called Dictionary Based Text Filter for Lossless Text Compression. A text transformation technique must preserve the data through the encoding and decoding process. In the proposed approach, words in the source file are replaced with shorter codewords whenever they are present in an external static dictionary. The immediate advantage of text transformation is that the codewords are shorter than the actual words, so the same amount of text requires less space. On average, about 16% of the characters in text files are spaces; hence, to further improve the compression rates for text files, the spaces between words can be removed from the source files. The unused extended ASCII character codes from 128 to 255 are used to generate the codewords. The chosen codeword combination makes it possible to remove the spaces between words in the encoded file. The proposed algorithm has been implemented and tested on standard corpora, and it reduces file size by up to 85% relative to the source file. We recommend the proposed technique for compressing large text files in the field of library digitization.
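
The idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the codeword layout, so the sketch assumes fixed two-byte codewords whose bytes both lie in 128 to 255, a tiny hypothetical dictionary (`WORDS`), and pure 7-bit ASCII input. A single space between two dictionary words is dropped by the encoder and restored by the decoder whenever two codewords are adjacent; everything else is copied unchanged, which keeps the transformation lossless.

```python
import re

# Hypothetical static dictionary; a real one would hold thousands of frequent words.
WORDS = [b"the", b"of", b"text", b"compression", b"files"]
INDEX = {w: i for i, w in enumerate(WORDS)}  # word -> dictionary position


def codeword(i: int) -> bytes:
    """Assumed layout: two bytes, each in 128..255 (room for 16384 words)."""
    return bytes([128 + i // 128, 128 + i % 128])


def encode(data: bytes) -> bytes:
    if any(b >= 128 for b in data):
        raise ValueError("input must be 7-bit ASCII so codewords stay unambiguous")
    # Split into alternating runs of letters (words) and non-letters (separators).
    tokens = re.findall(rb"[A-Za-z]+|[^A-Za-z]+", data)
    out = bytearray()
    for k, tok in enumerate(tokens):
        if tok in INDEX:
            out += codeword(INDEX[tok])
        elif (tok == b" " and 0 < k < len(tokens) - 1
              and tokens[k - 1] in INDEX and tokens[k + 1] in INDEX):
            continue  # a single space between two codewords is implied
        else:
            out += tok  # unknown words and other separators are copied as-is
    return bytes(out)


def decode(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] >= 128:  # start of a two-byte codeword
            idx = (data[i] - 128) * 128 + (data[i + 1] - 128)
            out += WORDS[idx]
            i += 2
            if i < len(data) and data[i] >= 128:
                out += b" "  # restore the space dropped between codewords
        else:
            out.append(data[i])  # literal ASCII byte
            i += 1
    return bytes(out)


sample = b"the compression of text files, the  text"
packed = encode(sample)
assert decode(packed) == sample   # round trip is lossless
assert len(packed) < len(sample)  # codewords and dropped spaces save bytes
```

In this sketch, words missing from the dictionary (or differing in case) stay as plain ASCII, and a double space or punctuation between words is never dropped, so the decoder only has to reinsert a space between directly adjacent codewords. The transformed output would then be passed to a general-purpose compressor.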

Keywords
Decoder, Encoder, text transformation, preprocessing, Text Filter.