Cybercriminals make extensive use of the tools of the digital revolution, including social networks, to communicate and to engage in illegal activities such as drug trafficking, hacking, online fraud, blackmail, money laundering, cyberbullying, and online predation. To counter the growing number of cybercriminal activities, studying and analyzing the content of online criminal communities is of paramount importance.
A recently published paper presents a method for analyzing the chat logs of potential cybercriminal communities by means of Natural Language Processing (NLP) and data mining techniques. The proposed method obtains chat logs from a social network and then summarizes the relevant conversations by topic. In this article, we take a look at the system presented in that paper.
Overview of the method:
A data mining framework was developed to extract relevant evidence from a given chat log and thereby support the investigation process, particularly in its initial stage, when the investigator may not yet have enough indicators to start with. Figure (1) illustrates the basic structure of the proposed method.
Figure (1): Framework of the proposed method
The method’s framework is composed of three main modules: the Clique Detector, the Concept Miner, and the Information Visualizer.
1- The Clique Detector detects the cliques, i.e. communities, in the analyzed chat log. To accomplish this, it first identifies all the entities mentioned in the chat log. An entity can be the name or alias of an individual, a physical address, a phone number, or an organization. After identifying the entities, the Clique Detector uses how frequently they co-occur within chat sessions to identify the communities, which are referred to as cliques.
2- Thereafter, the Concept Miner processes each clique’s chat log and extracts the concepts associated with the topics discussed. To accomplish this, it detects key terms, which are the ones most frequently used in the text. Each detected key term is then mapped to a matching concept within WordNet, which is used to build a hierarchy of concepts and to identify potential relationships between them. A modified version of agglomerative hierarchical clustering is then applied to form groups of concepts whose terms are strongly related. The top node of the concept hierarchy represents the conversation’s main topic.
3- Finally, the Information Visualizer displays the detected cliques in the form of an interactive graph, in which the recognized entities are represented by nodes and the identified cliques by edges. The graph also shows each clique’s summary, relevant keywords, and concepts.
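To make the first and third modules more concrete, here is a minimal Python sketch of how co-occurrence counts could be turned into a graph of entities. It is not the paper's implementation: the sample sessions, the function names, and the min_sessions threshold are all assumptions made for illustration, and entity extraction is assumed to have happened already.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: each chat session reduced to the set of entities
# (aliases, phone numbers, organizations) mentioned in it.
sessions = [
    {"alice", "bob", "555-0101"},
    {"alice", "bob", "carol"},
    {"bob", "carol"},
    {"alice", "bob", "555-0101"},
    {"dave"},
]

def cooccurrence_counts(sessions):
    """Count, for every pair of entities, the sessions that mention both."""
    counts = Counter()
    for entities in sessions:
        for pair in combinations(sorted(entities), 2):
            counts[pair] += 1
    return counts

def build_graph(counts, min_sessions=2):
    """Keep pairs seen together in at least min_sessions sessions as edges."""
    graph = {}
    for (a, b), n in counts.items():
        if n >= min_sessions:
            graph.setdefault(a, {})[b] = n
            graph.setdefault(b, {})[a] = n
    return graph

counts = cooccurrence_counts(sessions)
graph = build_graph(counts, min_sessions=2)
for entity in sorted(graph):
    # Print each node with its neighbours and the shared-session counts.
    print(entity, dict(sorted(graph[entity].items())))
```

In this toy data, "dave" never co-occurs with anyone and so does not appear in the graph, while "alice" and "bob" share the strongest edge. A real system would need a more careful threshold and a proper entity recognizer.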
Contributions of the study:
This study showed that unstructured text can be used for mining online social communities. This contrasts with most previous research, which relied mainly on structured data in crime investigations. The study managed to extract valuable information by mining chat logs.
Conventional approaches to network analysis rely mainly on counting the number of direct communication instances between members, such as the number of emails exchanged. Through discussions with law enforcement agents, the authors of the paper found that simply counting communication instances is inadequate and yields incomplete results. Accordingly, the study defined its own notion of a clique, tailored to mining data from criminal communities. When mining chat logs, entities that frequently appear together within chat sessions can be considered a clique, even if they don’t communicate directly with each other.
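The difference between the two views can be illustrated with a short Python sketch. The message format, the names, and the data below are hypothetical and only serve to show how two entities can share chat sessions without ever exchanging a message.

```python
from collections import Counter
from itertools import combinations

# Hypothetical chat log: (session_id, sender, recipient, entities mentioned in the text).
messages = [
    (1, "alice", "bob", {"carol", "dave"}),
    (1, "bob", "alice", {"carol"}),
    (2, "alice", "bob", {"carol", "dave"}),
    (2, "bob", "alice", {"dave"}),
]

# Conventional view: count direct exchanges between sender and recipient.
direct = Counter()
for _, sender, recipient, _ in messages:
    direct[frozenset((sender, recipient))] += 1

# Co-occurrence view: collect every entity appearing in a session, whether as
# sender, recipient, or mention, then count the sessions shared by each pair.
session_entities = {}
for session_id, sender, recipient, mentioned in messages:
    session_entities.setdefault(session_id, set()).update({sender, recipient}, mentioned)

together = Counter()
for entities in session_entities.values():
    for pair in combinations(sorted(entities), 2):
        together[frozenset(pair)] += 1

pair = frozenset(("carol", "dave"))
print("direct messages:", direct[pair])    # carol and dave never message each other
print("shared sessions:", together[pair])  # yet both are mentioned in every session
```

Counting only direct messages would miss any link between "carol" and "dave", whereas the co-occurrence view connects them through the sessions in which both are mentioned.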
Many current topic identification approaches require investigators to have training data in order to train a specialized classification model. In crime investigation, however, this kind of data is not readily available. The proposed method does not require any prior data or training and can detect key concepts from the chats’ content with the aid of WordNet, a lexical database. Moreover, concept extraction depends on the semantic similarity of keywords rather than on their frequencies. As such, the approach can identify the most relevant concepts, which summarize and represent the information discussed in chat sessions.
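The idea of deriving a topic from a concept hierarchy, without any training data, can be sketched in a few lines of Python. The hand-written HYPERNYM table below is a hypothetical stand-in for a lexical database such as WordNet, and the path-based score is just one simple similarity measure; the paper's own clustering procedure is not reproduced here.

```python
# Hypothetical hypernym (is-a) links standing in for a lexical database such
# as WordNet; a real system would look these up instead of hard-coding them.
HYPERNYM = {
    "cocaine": "narcotic",
    "heroin": "narcotic",
    "narcotic": "drug",
    "cannabis": "drug",
    "drug": "substance",
    "pistol": "firearm",
    "rifle": "firearm",
    "firearm": "weapon",
    "weapon": "artifact",
}

def ancestors(term):
    """Return the chain from term up to the root of its hierarchy."""
    chain = [term]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain

def common_concept(a, b):
    """Lowest shared ancestor of two terms, or None if they share none."""
    seen = set(ancestors(a))
    for concept in ancestors(b):
        if concept in seen:
            return concept
    return None

def similarity(a, b):
    """Path-based score: 1 / (1 + number of is-a links between a and b)."""
    shared = common_concept(a, b)
    if shared is None:
        return 0.0
    distance = ancestors(a).index(shared) + ancestors(b).index(shared)
    return 1.0 / (1 + distance)

def topic(terms):
    """Most specific concept that covers every term, or None."""
    concept = terms[0]
    for term in terms[1:]:
        concept = common_concept(concept, term)
        if concept is None:
            return None
    return concept

print(topic(["cocaine", "heroin", "cannabis"]))
print(similarity("cocaine", "heroin"), similarity("cocaine", "pistol"))
```

Here the three key terms are summarized by their most specific shared concept, and terms from unrelated branches of the hierarchy receive a similarity of zero, so no labelled examples are needed to group them.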
The method also enables investigators to incorporate domain knowledge, with the ultimate goal of improving the analysis process. Domain knowledge may be represented by a term that is frequently used in online conversations, or a street term associated with certain crimes.
This study presents a method for criminal investigations that relies on mining chat session data. The system was developed with the help of a Canadian internet forensics team, and its effectiveness was confirmed by experienced cybercrime investigators.