Automatically Detect Software Security Vulnerabilities Based on Natural Language Processing Techniques and Machine Learning Algorithms


  • Do Xuan Cho, Faculty of Information Assurance, Posts and Telecommunications Institute of Technology, Hanoi, Vietnam
  • Vu Ngoc Son, Information Assurance Department, FPT University, Hanoi, Vietnam
  • Duong Duc, Information Assurance Department, FPT University, Hanoi, Vietnam



Keywords: machine learning algorithms, natural language processing techniques, software security vulnerability detection, software vulnerabilities, source code features


Software vulnerabilities are a serious problem because attackers routinely compromise systems by exploiting them. Vulnerabilities can be detected with two main methods: i) signature-based detection, which compares software against a list of known security vulnerabilities; and ii) behavior analysis-based detection, which analyzes the software code using classification algorithms. To detect software security vulnerabilities more accurately, this study proposes a new approach that combines a technique for analyzing and standardizing software code with the random forest (RF) classification algorithm. The novelty and advantage of the proposed method is that, instead of trying to define the behaviors of functions in order to identify abnormal ones, it uses the Word2vec natural language processing model to normalize functions and extract their features. Finally, to detect security vulnerabilities in the functions, the RF algorithm, a popular and effective supervised machine learning algorithm, is applied to these features.
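The abstract does not spell out how the code is standardized, so the following is only a minimal sketch of one common option, assuming that "standardizing" means renaming user-defined identifiers to generic placeholders before the tokens are embedded. The keyword and library-call lists, the `FUN`/`VAR` placeholder scheme, and the function name `normalize_function` are all hypothetical and are not taken from the paper.

```python
import re

# Hypothetical, deliberately tiny lists; a real tool would use a full C lexer.
C_KEYWORDS = {"char", "int", "void", "return", "if", "else", "for", "while", "sizeof"}
LIBRARY_CALLS = {"strcpy", "memcpy", "malloc", "free", "printf"}

# Identifiers, integer literals, or any single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")
IDENT_RE = re.compile(r"[A-Za-z_]\w*")


def normalize_function(source):
    """Tokenize a C function and rename user-defined identifiers.

    An identifier whose first occurrence is followed by "(" becomes
    FUN1, FUN2, ...; any other user-defined identifier becomes
    VAR1, VAR2, ... Keywords and the listed library calls are kept.
    """
    tokens = TOKEN_RE.findall(source)
    names = {}                      # original identifier -> placeholder
    counts = {"FUN": 0, "VAR": 0}   # running counter per placeholder kind
    normalized = []
    for i, tok in enumerate(tokens):
        is_identifier = IDENT_RE.fullmatch(tok) is not None
        if not is_identifier or tok in C_KEYWORDS or tok in LIBRARY_CALLS:
            normalized.append(tok)
            continue
        if tok not in names:
            followed_by_paren = i + 1 < len(tokens) and tokens[i + 1] == "("
            kind = "FUN" if followed_by_paren else "VAR"
            counts[kind] += 1
            names[tok] = kind + str(counts[kind])
        normalized.append(names[tok])
    return normalized


snippet = "void copy(char *dst, char *src) { char buf[8]; strcpy(buf, src); }"
tokens = normalize_function(snippet)
print(" ".join(tokens))
```

Token sequences of this kind could then be passed to a Word2vec implementation to obtain per-function feature vectors, which in turn could be used to train a random forest classifier; those two steps depend on third-party libraries and are not shown here.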








How to Cite

Cho, D. X., Son, V. N., & Duc, D. (2022). Automatically Detect Software Security Vulnerabilities Based on Natural Language Processing Techniques and Machine Learning Algorithms. Journal of ICT Research and Applications, 16(1), 70-87.