This article is part of the collection entitled: Office of Scientific & Technical Information Technical Reports and was provided to UNT Digital Library by the UNT Libraries Government Documents Department.
Detecting Distributed Scans Using High-Performance Query-Driven
Visualization
Abstract
Modern forensic analytics applications, like network traffic analysis, perform high-performance hypothesis testing, knowledge discovery and data mining on very
large datasets. One essential strategy to reduce the time required for these operations is to select only the most relevant data records for a given computation. In this
paper, we present a set of parallel algorithms that demonstrate how an efficient selection mechanism - bitmap indexing - significantly speeds up a common analysis
task, namely, computing conditional histograms on very large datasets. We present a thorough study of the performance characteristics of the parallel conditional
histogram algorithms. As a case study, we compute conditional histograms for detecting distributed scans hidden in a dataset consisting of approximately 2.5
billion network connection records. We show that these conditional histograms can be computed on an interactive time scale (i.e., in seconds). We also show how to
progressively modify the selection criteria to narrow the analysis and find the sources of the distributed scans.
Keywords: query-driven visualization, network security, network connection analysis, data mining, visual analytics
1 Introduction
A typical day's worth of network traffic at an "average" government research laboratory may consist of tens of millions of connections
comprising multiple gigabytes worth of connection records. These connection records can be considered as "conversations" between two hosts
on a network. They are generated by routers, traffic analyzers or security systems, and contain information such as source and destination
IP addresses, source and destination ports, duration of connection, number of bytes exchanged and so forth. A year's worth of such data
currently requires on the order of tens of terabytes or more of storage. According to Burrescia [Burrescia and Johnston 2005], traffic volume
over ESnet, a production network servicing the U.S. Department of Energy's research laboratories, has been increasing by an order of
magnitude every 46 months since 1990. This trend is expected to continue into the foreseeable future.
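To put this growth rate in more familiar terms, the short calculation below converts "an order of magnitude every 46 months" into an equivalent annual growth factor and a doubling time. It is only a back-of-the-envelope sketch that assumes smooth exponential growth; the constant name is ours and does not come from the cited report.

```python
import math

# Assumption: traffic grows smoothly by a factor of 10 every 46 months.
MONTHS_PER_TENFOLD = 46

# Growth factor over 12 months: 10 ** (12 / 46).
annual_factor = 10 ** (12 / MONTHS_PER_TENFOLD)

# Months needed for traffic to double: 46 * log10(2).
doubling_months = MONTHS_PER_TENFOLD * math.log10(2)

print(f"annual growth factor: {annual_factor:.2f}")       # about 1.82
print(f"doubling time: {doubling_months:.1f} months")     # about 13.8
```

Under this assumption, traffic grows by roughly 82% per year and doubles a little more often than once every 14 months.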
The steady increase in network traffic volume exacerbates the difficulty of forensic cybersecurity and network performance analysis. Current
network traffic analysis toolsets often rely on simple utilities like grep, awk and gnuplot. While sufficient for analyzing small amounts of
network traffic data, these utilities neither scale nor perform at the level needed for analyzing current or future levels of network traffic.
To address the need for rapid forensic analysis capabilities, our work presents advances in two complementary technology areas, namely
scientific data management and visual analytics.
Interactive network traffic data analysis is often based on histogram methods. One key feature distinguishing histograms in data mining
applications from other types of applications [Ioannidis 2003] is the use of ad-hoc conditions to reduce the number of data records being
analyzed. For example, to identify unusual activities on TCP ports, we may compute a histogram over destination ports and time to discover
high or unusual levels of activity on a particular port. However, rather than computing a histogram over all data records, we might restrict the
computation to network sessions that meet a set of criteria, such as those that originate from a specific IP address or address range. The restriction
on originating IP address, or on other attributes, is referred to as an external condition. Analyses of data records subject to such external conditions are
commonly known as conditional analyses. Our objective here is to develop efficient algorithms for a special type of conditional analysis, i.e.,
computing multi-variate histograms under arbitrary external conditions. We call this type of histogram a conditional histogram.
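As a minimal illustration of this definition, the sketch below computes a 2D conditional histogram over destination port and time by scanning every record and counting only those that satisfy the external condition. It is a sequential baseline for clarity, not the indexed parallel algorithm developed in this paper. The record layout (`Connection`), the function name, and the bin widths are hypothetical choices made for the example.

```python
from collections import Counter
from typing import Callable, Iterable, NamedTuple


class Connection(NamedTuple):
    """Hypothetical, simplified connection record."""
    src_ip: str
    dst_port: int
    timestamp: int  # seconds since the start of the capture


def conditional_histogram(
    records: Iterable[Connection],
    condition: Callable[[Connection], bool],
    port_bin_width: int,
    time_bin_width: int,
) -> Counter:
    """Count records per (port bin, time bin), skipping records that fail the condition."""
    counts: Counter = Counter()
    for rec in records:
        if condition(rec):
            key = (rec.dst_port // port_bin_width, rec.timestamp // time_bin_width)
            counts[key] += 1
    return counts


records = [
    Connection("10.0.0.5", 22, 30),
    Connection("10.0.0.5", 22, 45),
    Connection("10.0.0.9", 80, 50),
    Connection("10.0.0.5", 443, 3700),
]

# External condition: only sessions originating from one source address.
hist = conditional_histogram(
    records,
    condition=lambda r: r.src_ip == "10.0.0.5",
    port_bin_width=1,      # one bin per port
    time_bin_width=3600,   # one bin per hour
)
print(dict(hist))  # {(22, 0): 2, (443, 1): 1}
```

The scan touches every record regardless of how selective the condition is, which is the cost that an index on the condition attributes is meant to avoid.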
The main contributions of this paper are as follows:
* We build upon recent work that combines state-of-the-art indexing with a novel approach to visual analytics [Bethel et al.
2006]. In this paper we introduce a new family of parallel algorithms for efficiently building 2D conditional histograms. The key
feature of these parallel algorithms is that they operate on bitmap indices for evaluating the external conditions to efficiently populate
histogram bins.
* We present a detailed performance analysis of these conditional histogramming algorithms on a 32-way parallel SMP platform. We
show that this work helps accelerate the most data-intensive and time-consuming part of forensic analysis, namely data mining and
knowledge discovery based upon multi-dimensional queries.
* We apply the rapid conditional histogramming technology in conjunction with a specialized visual analytics application for detecting
and analyzing distributed scans. The data for this case study covers a 42-week period at a US government science laboratory. The case
study and its forensic analysis would not be possible without the combination of technologies from scientific data management and
visual analytics.
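The idea behind the first contribution, evaluating the external condition on bitmap indices and then populating histogram bins only for the selected records, can be sketched as follows. This is a toy, single-threaded illustration under our own assumptions: it uses one uncompressed bitmap per distinct value, stored as a Python integer, whereas a practical bitmap index would use compression and the algorithms in this paper run in parallel. All function and variable names are hypothetical.

```python
from collections import Counter, defaultdict


def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when values[i] == value."""
    index = defaultdict(int)
    for row, value in enumerate(values):
        index[value] |= 1 << row
    return dict(index)


def rows_of(bitmap):
    """Yield the row numbers of the set bits, lowest row first."""
    row = 0
    while bitmap:
        if bitmap & 1:
            yield row
        bitmap >>= 1
        row += 1


# Toy column-oriented data: one list per attribute, one entry per connection.
src_ip = ["10.0.0.5", "10.0.0.9", "10.0.0.5", "10.0.0.7", "10.0.0.5"]
dst_port = [22, 22, 80, 22, 22]
hour = [0, 0, 0, 1, 1]

ip_index = build_bitmap_index(src_ip)
port_index = build_bitmap_index(dst_port)

# External condition: src_ip in {10.0.0.5, 10.0.0.7} AND dst_port == 22.
# OR combines the bitmaps of the accepted addresses; AND intersects with the port bitmap.
ip_hits = ip_index.get("10.0.0.5", 0) | ip_index.get("10.0.0.7", 0)
selected = ip_hits & port_index.get(22, 0)

# Populate (port, hour) bins using only the selected rows.
hist = Counter((dst_port[r], hour[r]) for r in rows_of(selected))
print(list(rows_of(selected)))  # [0, 3, 4]
print(dict(hist))               # {(22, 0): 1, (22, 1): 2}
```

The condition is resolved entirely with bitwise operations on the index, and the raw columns are read only for the rows that survive the selection.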
2 Previous Work
Our work takes a multi-disciplinary approach that combines techniques from network traffic analysis, efficient querying and indexing, and query-driven
visualization. In this section we give a brief overview of the related work in these areas.
2.1 Network Traffic Analysis and Visualization
A network connection can be thought of as a set of packets passing between two hosts within a given time interval that have common
characteristics. An example of a network connection is a single communication session or an interaction between two hosts on the Internet.
Several standard tools exist for capturing network connection data. For larger environments, routers and switches can provide connection
data in specialized formats such as NetFlow [Systems 2005] or sFlow [Phaal et al. 2001]. Another tool for analyzing network connection
Stockinger, Kurt; Bethel, E. Wes; Campbell, Scott; Dart, Eli & Wu, Kesheng. Detecting Distributed Scans Using High-Performance Query-Driven Visualization, article, September 1, 2006; Berkeley, California. (https://digital.library.unt.edu/ark:/67531/metadc886771/m1/1/: accessed April 17, 2024), University of North Texas Libraries, UNT Digital Library, https://digital.library.unt.edu; crediting UNT Libraries Government Documents Department.