
Counteracting Dark Web Text-Based CAPTCHA with Generative Adversarial Learning for Proactive Cyber Threat Intelligence

Published: 10 March 2022

    Abstract

    Automated, large-scale monitoring of dark web (DW) platforms is the first step toward developing proactive Cyber Threat Intelligence (CTI). While there are efficient methods for collecting data from the surface web, large-scale dark web data collection is often hindered by anti-crawling measures. In particular, text-based CAPTCHA is the most prevalent and prohibitive of these measures on the dark web. Text-based CAPTCHA identifies and blocks automated crawlers by forcing the user to enter a combination of hard-to-recognize alphanumeric characters. On the dark web, CAPTCHA images are meticulously designed with additional background noise and variable character length to prevent automated CAPTCHA breaking. Existing automated CAPTCHA breaking methods have difficulty overcoming these dark web challenges. As such, solving dark web text-based CAPTCHA has relied heavily on human involvement, which is labor-intensive and time-consuming. In this study, we propose a novel framework for automated breaking of dark web CAPTCHA to facilitate dark web data collection. This framework encompasses a novel generative method to recognize dark web text-based CAPTCHA with noisy background and variable character length. To eliminate the need for human involvement, the proposed framework utilizes a Generative Adversarial Network (GAN) to counteract dark web background noise and leverages an enhanced character segmentation algorithm to handle CAPTCHA images with variable character length. Our proposed framework, DW-GAN, was systematically evaluated on multiple dark web CAPTCHA testbeds. DW-GAN significantly outperformed the state-of-the-art benchmark methods on all datasets, achieving a success rate of over 94.4% on a carefully collected real-world dark web dataset. We further conducted a case study on an emergent Dark Net Marketplace (DNM) to demonstrate that DW-GAN eliminated human involvement by automatically solving CAPTCHA challenges with no more than three attempts. Our research enables the CTI community to develop advanced, large-scale dark web monitoring. We make the DW-GAN code available to the community as an open-source tool on GitHub.
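
    The abstract does not spell out the segmentation algorithm, so the following is only a generic illustration of why variable character length matters for segmentation. It is not the DW-GAN method. It assumes the background noise has already been removed, leaving a binary image, and it uses a simple column-projection heuristic. The names `segment_columns` and `max_char_width` are hypothetical and chosen for this sketch only.

```python
from typing import List, Tuple

Image = List[List[int]]  # binary image: 1 = ink (foreground), 0 = background


def column_profile(image: Image) -> List[int]:
    """Count foreground pixels in each column."""
    return [sum(col) for col in zip(*image)]


def segment_columns(image: Image, max_char_width: int) -> List[Tuple[int, int]]:
    """Return (start, end) column spans, end exclusive, one per presumed character.

    A run of non-empty columns is treated as one character. A run wider than
    max_char_width is assumed (hypothetically) to hold touching characters
    and is split into equal-width pieces.
    """
    profile = column_profile(image)
    runs = []
    start = None
    for x, count in enumerate(profile + [0]):  # trailing 0 closes the last run
        if count > 0 and start is None:
            start = x
        elif count == 0 and start is not None:
            runs.append((start, x))
            start = None

    spans = []
    for s, e in runs:
        width = e - s
        pieces = -(-width // max_char_width)  # ceiling division
        step = width / pieces
        for i in range(pieces):
            spans.append((s + round(i * step), s + round((i + 1) * step)))
    return spans


# Toy 2-row "CAPTCHA": a narrow glyph, two touching glyphs, and a thin glyph.
captcha = [
    [0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0],
    [0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0],
]
print(segment_columns(captcha, max_char_width=3))
# prints [(1, 3), (5, 8), (8, 11), (12, 13)]
```

    Because the number of spans is derived from the image rather than fixed in advance, the same routine returns four spans here and would return a different count for a longer or shorter string. A real solver would need a more robust split rule than equal-width division.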

      Information

      Published In

      ACM Transactions on Management Information Systems  Volume 13, Issue 2
      June 2022
      261 pages
      ISSN:2158-656X
      EISSN:2158-6578
      DOI:10.1145/3483345

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 10 March 2022
      Accepted: 01 December 2021
      Revised: 01 August 2021
      Received: 01 November 2020
      Published in TMIS Volume 13, Issue 2

      Author Tags

      1. Automated CAPTCHA breaking
      2. dark web
      3. generative adversarial networks

      Qualifiers

      • Research-article
      • Refereed

      Funding Sources

      • National Science Foundation (NSF)
