ISACA Journal
Volume 1, 2018


Web Monitoring: From Big Data to Small Data Analysis Through OSINT—A Practical and Cost-Effective Implementation Using Web Crawlers 

Paolo Gasperi, CISM, CSIRT Transits First-I, ISO 27001, Luigi Sbriz, CISM, CRISC, ISO 27001, ITILv3 and Marco Tomazzoni, CSIRT Transits First-I 

Every organization should know what information is circulating on the Internet about its activities so that it can take concrete actions to handle any potential threat to the organization or to capture a competitive advantage for its own business. Organizations should monitor not only the so-called “sentiment,”1 but also, for example, information on the unauthorized sale of products, the existence of information classified as confidential, unauthorized use of name and trademark, and counterfeiting.

The identification and assessment of specific risk stemming from information freely circulating on the Internet are often unsatisfactory, inefficient and costly. Monitoring the web means working directly on the big data, which carries multiple challenges: too much information to be managed, limited availability of information on which to focus the research, repetitive and frustrating manual work, problems obtaining information in time, and the assignment of priorities to the results obtained.

It should be noted that focused services covering specific niches of data are available for web monitoring, and they can improve the quality of research compared to generalist search engines. There are specialized search engines, portals and social networks equipped with their own internal search functions and aggregators. Using them makes it possible to go from a generalized manual search process (costly, inefficient, obscure, delayed) to a more focused service (automatic, economical, intelligent, on time), which is useful for discovering external risk to business objectives.2 The resulting information, appropriately filtered, aggregated and put in order, is therefore focused, easier to analyze and of better quality.

To demonstrate all this, this article describes a web monitoring project that proved to be sustainable and adequate for the extent of the monitoring task. It was created and implemented in 2016 at a geographically dispersed company offering a diverse array of products in the automotive sector. The identification of information on the Internet that is potentially significant for the business turned out, in this case, to be ineffective if managed purely by manual research.

Three Steps for Defining the Web Monitoring Model

The first step to creating a systematic search process of all the information on the web that is potentially interesting to the organization (web monitoring) is to establish the context (e.g., industry sector, type of activity, values, problems, needs, goals, expectations) of the organization.

After conducting an analysis of its context and its problems, the organization can then prepare a list of Internet risk.3 This is useful for identifying and isolating the lists of key information, which will form the definition basis of the search rules.

The second step consists of understanding what types of public information archives on the Internet (news, forums, e-commerce, web pages, databases, file sharing services, messages, social media, etc.) are useful in terms of business objectives. To be clear, the organization needs to identify Open-Source Intelligence (OSINT) services4, 5 useful for improving the recovery of the information.

The third and last step is the systematic analysis of the data to extract the smallest and, therefore, humanly manageable set of information. The technology is powerful for making decisions, but a little human knowledge of the business goals drastically reduces false positives, i.e., collected data that pose no problem for the organization and bring it no advantage. Because of this, it is necessary to use an organizational process that involves both technology and human resources.

In summary, this web monitoring service:

  • Identifies the key words and the search rules based on the knowledge of the internal purposes of the business, its questions, requirements and expectations
  • Extracts sets of specialized data from OSINT services that are small enough to be processed by traditional methods, compared to the total chaos of big data
  • Further reduces the size of the data sets and improves the quality of the results by combining human work and ranking algorithms

The upper layer of the results obtained, arranged by the desired ranking, represents the information useful for top management to understand the exposure to risk or potential opportunities for the company.

Outline of the Web Monitoring Process

The web monitoring process is summarized in figure 1.6

It is necessary to have a C-suite (executive committee) team responsible for the process, which performs three main functions:

  1. Defines the corporate assets that could benefit from the web monitoring described. For example, where the asset is an official spare parts network, there is the risk of unauthorized distributors appearing (possibly with counterfeit products); consequently, the committee instructs the operating team on which products to focus. As a result, the operating team enters the names of certain products and their specifications in the lists of search terms (criterion 11 in figure 3) and lists the online sales portals considered worthy of attention in the “OSINT sources” column.
  2. Guides an operating team responsible for the management of the system both with feedback on the notices received (from the system) and through continuous alignment with the company context.
  3. After analyzing the notices received, acts to reduce the risk level or seize the opportunities.

The operating team, composed of individuals with vast experience in the areas identified for the research, uses the instructions of the executive committee to create (on the Internet) a solid identity of the business, aligned with corporate objectives. A well-organized collection of dynamic lists of key words, with weights and rules of correlation, represents a virtual model of the business and the principal factors of risk, threat or opportunity (using a strengths-weaknesses-opportunities-threats [SWOT]7 analysis to focus on the outside factors). The team members must keep the lists of key words updated and their weights aligned with the company requirements, and they must adapt (manually) the classification of the results when it does not match their knowledge.

Logic Structure of the Search Model

The outline of the web monitoring system through its main functions, as shown in figure 2, provides a better understanding of its overall logic operation.

The sequence of the three different operating phases provides the flow of the principal operations in the system:

  1. Search—A set of web crawlers8, 9 (also called spiders), guided by the key words and the relative correlation rules, scans the Internet to feed the specific data set (called spider lair) of the search. Every web crawler is specialized in searches within a single OSINT data source. The results of the search are saved in the spider lair with a frequency defined by the operating team. The web crawler can be launched many times—for example, once for every risk category.
  2. Normalization—The data sets of the searches, correctly filtered, feed into a single data set composed of all the results (the result data set). The simple filters connected to the risk categories guide the extraction of the data. A transformation phase (extraction, transformation, transportation and loading [ETTL] pipeline) in a metadata record precedes the transfer of the actual record in the result data set or its updating, if the record is already present.
  3. Prioritization—The result data set contains the union of all the records collected in the search data set. The assessment of the data performed by the operating team and the ranking algorithm assign a score to each record. The algorithm uses the manual assessment and an aging policy to improve the quality of its classification. Only the first-level records (significant score) receive the right to remain in the presentation layer (database visible to the final user); the hidden part of the database contains all the remaining downgraded records.
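The prioritization phase can be illustrated with a minimal sketch. The article does not give the actual ranking formula, so everything specific here is an assumption: a manual assessment (when present) overrides the automatic score, the aging policy is modeled as an exponential decay with a hypothetical half-life, and a hypothetical threshold decides which records stay in the presentation layer.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class Record:
    url: str
    auto_score: float                     # score from the ranking algorithm
    found_on: date                        # date the page was collected
    manual_score: Optional[float] = None  # operating team's assessment, if any


HALF_LIFE_DAYS = 30.0  # assumed aging policy: the score halves every 30 days
THRESHOLD = 5.0        # assumed cut-off for the presentation layer


def effective_score(rec: Record, today: date) -> float:
    """Prefer the manual assessment, then apply the aging decay."""
    base = rec.manual_score if rec.manual_score is not None else rec.auto_score
    age_days = max((today - rec.found_on).days, 0)
    return base * 0.5 ** (age_days / HALF_LIFE_DAYS)


def split_layers(records: List[Record], today: date) -> Tuple[List[Record], List[Record]]:
    """Return (visible, hidden) records, each ordered by descending score."""
    ranked = sorted(records, key=lambda r: effective_score(r, today), reverse=True)
    visible = [r for r in ranked if effective_score(r, today) >= THRESHOLD]
    hidden = [r for r in ranked if effective_score(r, today) < THRESHOLD]
    return visible, hidden
```

In this sketch, downgraded records are kept in the hidden list rather than deleted, which mirrors the hidden part of the database described above.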

A planned process extracts the summary of the top (specified number) records for every risk category and sends an email to the members of the executive team. The team must check every report and, consequently, act, informing the operating team of every decision made. The operating team plans the changes and implements them by repeating the cycle of the operations.
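A minimal sketch of such a planned extraction follows. The record fields (`category`, `score`, `url`), the subject line and the plain-text layout are assumptions made for illustration; the sketch only builds the message and leaves the actual sending (for example, through an SMTP server) out.

```python
import heapq
from collections import defaultdict
from email.message import EmailMessage


def top_records_by_category(records, top_n):
    """Group records by risk category and keep the top_n highest scores."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["category"]].append(rec)
    return {
        cat: heapq.nlargest(top_n, recs, key=lambda r: r["score"])
        for cat, recs in grouped.items()
    }


def build_digest(records, top_n, recipients):
    """Build (but do not send) the summary email for the executive team."""
    lines = []
    for cat, recs in sorted(top_records_by_category(records, top_n).items()):
        lines.append(f"[{cat}]")
        for rec in recs:
            lines.append(f"  {rec['score']:6.1f}  {rec['url']}")
    msg = EmailMessage()
    msg["Subject"] = "Web monitoring: top records per risk category"
    msg["To"] = ", ".join(recipients)
    msg.set_content("\n".join(lines))
    return msg
```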

Search Through Web Crawler

A particular feature of the search described to this point is the need for specialized web crawlers, each focused on a specific objective on the Internet. An open-source platform that satisfies these criteria is Scrapy.10, 11 In this environment, the focused crawlers are written in Python,12, 13 and the results are stored in MySQL14 databases. Programming around 15 different spiders has shown that the selected platform makes it possible to create new spiders relatively easily. With regard to how often the spiders run, the analysis of the sites searched for possible unauthorized sales has shown that an interval of seven to 10 days is optimal. This also takes into account that part of the analysis of the data collected is, in any case, done manually.

Every single OSINT source is explored by one or more spiders, which are homogeneous but have their own operating parameters and independent rules. Every source is analyzed based on the search and frequency rules assigned. The results are connections to the pages that satisfy the search criteria plus certain additional data.

The data vary depending on the type of source and search criteria, but the intent is to recover at least the URL, date and author and to extract the text cleaned of the HTML tags. For ease of use by the operating team, a screenshot of the page found is also downloaded.
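As an illustration of this extraction step, the following sketch uses only the Python standard library rather than the Scrapy selectors the project would have had available. It assumes that author and date are exposed through `<meta name="author">` and `<meta name="date">` tags, which is a simplification: real sources differ, and each spider would need its own rules.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collect visible text and a few <meta> fields from an HTML page."""

    SKIP = {"script", "style"}  # content of these tags is not visible text

    def __init__(self):
        super().__init__()
        self.meta = {}
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "meta":
            fields = dict(attrs)
            # Assumed convention: <meta name="author|date" content="...">
            if fields.get("name") in ("author", "date") and fields.get("content"):
                self.meta[fields["name"]] = fields["content"]

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    @property
    def text(self):
        return " ".join(self._chunks)


def extract_record(url, html):
    """Return the minimum data set: URL, date, author and tag-free text."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return {
        "url": url,
        "date": parser.meta.get("date"),
        "author": parser.meta.get("author"),
        "text": parser.text,
    }
```

Missing fields are returned as `None` so that a later step can decide whether the record is still worth keeping.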

The data collected by every spider are immediately checked and, if needed, stored in the spider database. Records are discarded if they are duplicates or come from Internet sites considered secure (e.g., company sites, trusted sites), because such sites are considered devoid of potential risk. Every group created from an OSINT source, spider and relative database is a separate object (module), and this makes the system robust. A malfunction in one module does not spread; at worst, it compromises only that module’s update.
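The check described above can be sketched as follows. The trusted domain is a hypothetical placeholder, and the notion of a duplicate used here (same host, path and query string, ignoring scheme, letter case of the host and a trailing slash) is an assumption; the article does not say how duplicates were detected.

```python
from urllib.parse import urlsplit

# Hypothetical list of sites considered secure (e.g., the company's own site).
TRUSTED_DOMAINS = {"example-company.com"}


def normalize_url(url):
    """Reduce a URL to a comparison key: host + path + query."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    path = parts.path.rstrip("/") or "/"
    query = f"?{parts.query}" if parts.query else ""
    return f"{host}{path}{query}"


def is_trusted(url):
    """True if the host is a trusted domain or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in TRUSTED_DOMAINS)


def filter_new(urls, seen):
    """Keep only URLs that are neither trusted nor already seen.

    `seen` is a set of normalized keys and is updated in place.
    """
    kept = []
    for url in urls:
        key = normalize_url(url)
        if is_trusted(url) or key in seen:
            continue
        seen.add(key)
        kept.append(url)
    return kept
```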

During data collection, the spider attributes a temporary score to every page found, based on the type and number of terms found on the page and their weight. The score is then transformed into the real ranking of the page during the normalization/transfer to the database of the results, with an additional weight derived from the spider, the data source and the search category.
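The two-stage scoring can be sketched as below. The article states which factors contribute but not how they are combined, so the formulas are assumptions: the temporary score is taken as the sum of occurrences multiplied by term weight, and the final ranking multiplies that score by the three additional weights.

```python
import re


def temporary_score(text, term_weights):
    """Sum of (number of whole-word occurrences x weight) over all terms."""
    lowered = text.lower()
    score = 0.0
    for term, weight in term_weights.items():
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        score += len(re.findall(pattern, lowered)) * weight
    return score


def final_rank(temp_score, spider_weight, source_weight, category_weight):
    """Assumed combination: scale the temporary score by each extra weight."""
    return temp_score * spider_weight * source_weight * category_weight
```

For example, with the hypothetical weights `{"brake pads": 2.0, "replica": 5.0}`, a page that mentions "brake pads" twice and "replica" once receives a temporary score of 9.0.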

The normalization/transfer of the data to the result data set is a batch process performed at the end of the search session of the spider. It is based on stored procedures that are different for every spider.
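The insert-or-update behavior of this batch transfer can be sketched with SQLite from the Python standard library. This is a stand-in chosen so the example runs without a server; the project itself used MySQL stored procedures, and the table and column names below are hypothetical.

```python
import sqlite3


def open_result_db():
    """Create an in-memory stand-in for the result data set."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE result ("
        "url TEXT PRIMARY KEY, category TEXT, score REAL, last_seen TEXT)"
    )
    return conn


def transfer(conn, spider_rows):
    """Batch transfer: update a record if it already exists, else insert it."""
    with conn:  # one transaction per batch; rolled back if any row fails
        for row in spider_rows:
            cur = conn.execute(
                "UPDATE result SET category = ?, score = ?, last_seen = ? "
                "WHERE url = ?",
                (row["category"], row["score"], row["seen"], row["url"]),
            )
            if cur.rowcount == 0:
                conn.execute(
                    "INSERT INTO result (url, category, score, last_seen) "
                    "VALUES (?, ?, ?, ?)",
                    (row["url"], row["category"], row["score"], row["seen"]),
                )
```

Running the whole batch in a single transaction means a failure leaves the result data set as it was before the session, which fits the idea that a malfunction should affect only the local update.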

Search Criteria

The method adopted to describe the search criteria uses two types of data: a risk category and an OSINT source. The table in figure 3 clarifies this concept.

A risk category and an OSINT source together identify a search criterion. The risk category operationally defines the type of filtering that will be adopted, while the OSINT source identifies where the search will be conducted on the Internet.

The data filtering is based on dynamic lists that can be reused by the various search criteria. The search rule captures the web pages corresponding to the identity of the business on the Internet, but only when additional key words of inclusion are present and without specific key words of exclusion. A weight on each record is then necessary for the ranking algorithm to work correctly.

The method for defining the rule requires a logic expression to be prepared that combines dynamic lists of key words according to the following procedure: A list is represented by a sequence of rows of a table of the database with a weight for each one. The words within a single cell of the table are used by connecting them together with OR logic. The columns use AND logic (possibly in negation) and the rows are always connected using OR logic between them.
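The row/column/cell logic can be sketched as follows. The data classes, the substring matching and the choice to return the highest weight among matching rows are assumptions made for illustration; "Acme" is a hypothetical company name.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cell:
    words: List[str]      # words in one cell are combined with OR
    negate: bool = False  # True for a column of exclusion key words


@dataclass
class Row:
    cells: List[Cell]     # columns of one row are combined with AND
    weight: float


def cell_matches(cell: Cell, lowered_text: str) -> bool:
    found = any(word.lower() in lowered_text for word in cell.words)
    return not found if cell.negate else found


def evaluate(rows: List[Row], text: str) -> Optional[float]:
    """Rows are combined with OR; return the highest matching weight or None."""
    lowered = text.lower()
    weights = [
        row.weight
        for row in rows
        if all(cell_matches(cell, lowered) for cell in row.cells)
    ]
    return max(weights) if weights else None


# Identity AND inclusion AND NOT exclusion, with a second, heavier row.
RULE = [
    Row([Cell(["acme"]), Cell(["brake", "clutch"]),
         Cell(["review", "recall notice"], negate=True)], weight=3.0),
    Row([Cell(["acme"]), Cell(["counterfeit", "replica"])], weight=8.0),
]
```

Plain substring matching is used to keep the sketch short; a production rule would more likely match whole words or phrases.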

User Interface

The system provides all the Internet pages with contents corresponding to the identified selection rules as output, and they are classified and arranged through an automatic algorithm. The team members can manually change the classification of any page using their knowledge and experience. This manual classification affects the ranking algorithm and, therefore, the order. Its propagation by affinity, with the retention of the history of the modifications, influences any additional data processing, improving the quality of the notices to the executive committee.

For a rapid release of the data interface, Drupal has been adopted, which is an open-source content management system (CMS). The database of the CMS (presentation layer) and that of the data collected and processed (data layer) are kept separate to permit, as needed, use of other types of user interfaces.

The CMS reads the data from the result data set and manages how the members of the operating team access, display and handle them. A convenient set of parameters and filters reduces the final collection of the data presented. It is possible to filter by risk category, search criteria, seriousness, department, source and other factors, and to change the order of the columns. A simple table (figure 4) of formatted data is displayed as the result.

As an additional step, it is possible to expand every row (equivalent to a web page) to see its details (figure 5) and, if necessary, also access the original page on the Internet using the link provided.

Conclusion


This article describes a solution for how to identify in the enormous breadth of the unstructured world of big data just that information of potential enterprise interest. This information can be used to perform a detailed investigation aimed at the search for concrete threats.

The methodology balances open-source IT tools with human resources to capture the best aspects of both.

In this method, the organization connects the skills and experience of an operating team with the executive committee’s constant alignment to the business objectives. The technology transforms big data into small data for the operating team and reduces it further, also improving the final quality for the executive committee (figure 1).

The reduction of big data through the programmed action of spiders on specialized portals or search engines makes the overall system simpler and more economical to develop and manage. Furthermore, through the modular approach based on the risk categories, the size of the database depends on the number of search criteria and how restrictive the rules are, but not the characteristics of the company (e.g., size, products, organization, geolocation, exposure on the Internet).

This type of analysis of the web is becoming, over time, a significant part of enterprise risk management and can be usefully integrated into an information security continuous monitoring (ISCM) system.15, 16, 17 Naturally, the costs have to be considered. The modularity and use of the OSINT technology have been the real drivers that guaranteed a low economic exposure for the system. Furthermore, the adoption of open-source software, in particular for the crawlers and the database, has made it possible to drastically reduce costs without compromising the quality of the results, both in terms of its implementation and full working order.

Endnotes


1 Liu, B.; Sentiment Analysis, Cambridge University Press, 2015
2 International Organization for Standardization (ISO)/International Electrotechnical Commission (IEC), ISO/IEC 27001:2013, Clause 4.1
3 A suitable list of risk for the organization can be found in the statement of applicability of Clause 6.1.3 of the standard ISO/IEC 27001:2013, item d, limited to outside risk.
4 Bazzell, M.; Open Source Intelligence Techniques: Resources for Searching and Analyzing Online Information, CreateSpace Independent Publishing Platform, 2016
5 Glassman, M.; M. Kang; “Intelligence in the Internet Age: The Emergence and Evolution of Open Source Intelligence (OSINT),” Computers in Human Behavior, March 2012
6 All the images are taken from the Operating Project, Version 1.1.
7 Management Study Guide, “SWOT Analysis—Definition, Advantages and Limitations”
8 Merriam-Webster Dictionary, “Web Crawler”
9 Kausar, A.; V. S. Dhaka; S. K. Singh; “Web Crawler: A Review,” International Journal of Computer Applications, vol. 63, no. 2, 2013
10 Scrapy, Scrapy 1.4 Documentation
11 Kouzis-Loukas, D.; Learning Scrapy, Packt Publishing, 2016
12 Python, Functions Defined
13 Lawson, R.; Web Scraping with Python, Packt Publishing, 2015
14 MySQL
15 National Institute of Standards and Technology, Special Publication 800-137, “Information Security Continuous Monitoring (ISCM) for Federal Information Systems and Organizations,” USA, 2011. “Information security continuous monitoring (ISCM) is defined as maintaining ongoing awareness of information security, vulnerabilities and threats to support organizational risk management decisions. The terms ‘continuous’ and ‘ongoing’ in this context mean that security controls and organizational risk factors are assessed and analyzed at a frequency sufficient to support risk-based security decisions to adequately protect organization information. Data collection, no matter how frequent, is performed at discrete intervals.”
16 Hargenrader, B.; “Information Security Continuous Monitoring: The Promise and the Challenge,” ISACA Journal, vol. 1, 2015
17 Luu, T.; “Implementing an Information Security Continuous Monitoring Solution—A Case Study,” ISACA Journal, vol. 1, 2015

Paolo Gasperi, CISM, CSIRT Transits First-I, ISO 27001
Lives and works in Switzerland as a consultant in cybersecurity.

Luigi Sbriz, CISM, CRISC, ISO 27001, ITILv3
Has worked as risk monitoring manager at Magneti Marelli for more than three years. His previous experience includes responsibility for information and communications technology (ICT) operations and resources in the Asia-Pacific region (China, Japan, Malaysia), serving as a worldwide IS officer and consulting on business intelligence systems. For internal risk monitoring, he developed a methodology merging an operative risk analysis with a consequent risk assessment driven by the maturity level of the processes.

Marco Tomazzoni, CSIRT Transits First-I
Is a programmer and IT consultant with more than 25 years of experience. After working as IT manager for the Italian headquarters of a Danish multinational, he now mainly deals with web crawling and cybersecurity on a freelance basis.


Opinions expressed in the ISACA Journal represent the views of the authors and advertisers. They may differ from policies and official statements of ISACA and from opinions endorsed by authors’ employers or the editors of the Journal. The ISACA Journal does not attest to the originality of authors’ content.