Development of Focused Crawlers for Building Large Punjabi News Corpus

Gurjot Singh Mahi; Amandeep Verma

doi:10.5614/itbj.ict.res.appl.2021.15.3.1

Authors

Gurjot Singh Mahi Department of Computer Science, Punjabi University Patiala, Punjab, 147002 India
Amandeep Verma Department of Computer Science, Punjabi University Patiala, Punjab, 147002 India


Keywords:

corpus, crawler, NLP, Punjabi language, scraper, text extraction, text processing

Abstract

Web crawlers are as old as the Internet and are most commonly used by search engines to visit websites and index them into repositories. Beyond search engines, they are also widely used to build corpora in different domains and languages. This study developed a set of focused web crawlers for three Punjabi news websites. The crawlers extract quality text articles and add them to a local repository for use in further research. They were implemented in Python and used to construct a corpus of more than 134,000 news articles across nine news genres. The crawler code and the extracted corpora have been made publicly available to the scientific community for research purposes.
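To make the idea of a focused crawler concrete, the following is a minimal sketch in Python and not the authors' released code. It assumes that article text lives in <p> elements and that "focused" simply means staying on the seed site's host; the names ArticleParser, parse_page, and crawl, the news.example URLs, and the in-memory pages dictionary are all hypothetical. Only the standard library is used, and the fetch function is passed in so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ArticleParser(HTMLParser):
    """Collects paragraph text and link targets from one HTML page."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self.links = []
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buf = []
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            # Collapse runs of whitespace inside the paragraph.
            text = " ".join("".join(self._buf).split())
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)


def parse_page(html, base_url, allowed_host):
    """Return (article_text, same_host_links) for one page."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    links = []
    for href in parser.links:
        url = urljoin(base_url, href)
        if urlparse(url).netloc == allowed_host and url not in links:
            links.append(url)
    return "\n".join(parser.paragraphs), links


def crawl(seed_url, fetch, max_pages=100):
    """Breadth-first crawl that never leaves the seed's host.

    `fetch` maps a URL to its HTML string, or None if unavailable.
    """
    host = urlparse(seed_url).netloc
    queue, seen, corpus = deque([seed_url]), {seed_url}, {}
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        html = fetch(url)
        fetched += 1
        if html is None:
            continue
        text, links = parse_page(html, url, host)
        if text:  # keep only pages that yielded article text
            corpus[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return corpus


# Hypothetical two-page site held in memory (assumption, for illustration).
pages = {
    "https://news.example/": (
        '<a href="/a1">one</a><a href="https://other.example/x">ext</a>'
    ),
    "https://news.example/a1": (
        "<h1>Title</h1><p>ਪੰਜਾਬੀ   news</p><p>Second <a href='/'>para</a></p>"
    ),
}
corpus = crawl("https://news.example/", pages.get)
```

In a real run, fetch would wrap an HTTP client such as urllib.request.urlopen with error handling, rate limiting, and robots.txt checks, and the corpus dictionary would be written to files in the local repository. A production crawler would also need site-specific rules for locating the article body and its genre, which this sketch does not attempt.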

