Deep PDF parsing to extract features for detecting embedded malware.

PDF Version Also Available for Download.

Description

The number of PDF files with embedded malicious code has risen significantly in the past few years. This is due to the portability of the file format, the ways Adobe Reader recovers from corrupt PDF files, the addition of many multimedia and scripting extensions to the file format, and many format properties the malware author may use to disguise the presence of malware. Current research focuses on executable, MS Office, and HTML formats. In this paper, several features and properties of PDF Files are identified. Features are extracted using an instrumented open source PDF viewer. The feature descriptions of benign ... continued below

Physical Description

20 p.

Creation Information

Munson, Miles Arthur & Cross, Jesse S. (Missouri University of Science and Technology, Rolla, MO) September 1, 2011.

Context

This report is part of the collection entitled: Office of Scientific & Technical Information Technical Reports and was provided by UNT Libraries Government Documents Department to Digital Library, a digital repository hosted by the UNT Libraries. More information about this report can be viewed below.

Who

People and organizations associated with either the creation of this report or its content.

Provided By

UNT Libraries Government Documents Department

Serving as both a federal and a state depository library, the UNT Libraries Government Documents Department maintains millions of items in a variety of formats. The department is a member of the FDLP Content Partnerships Program and an Affiliated Archive of the National Archives.

Contact Us

What

Descriptive information to help identify this report. Follow the links below to find similar items on the Digital Library.

Description

The number of PDF files with embedded malicious code has risen significantly in the past few years. This is due to the portability of the file format, the ways Adobe Reader recovers from corrupt PDF files, the addition of many multimedia and scripting extensions to the file format, and many format properties the malware author may use to disguise the presence of malware. Current research focuses on executable, MS Office, and HTML formats. In this paper, several features and properties of PDF Files are identified. Features are extracted using an instrumented open source PDF viewer. The feature descriptions of benign and malicious PDFs can be used to construct a machine learning model for detecting possible malware in future PDF files. The detection rate of PDF malware by current antivirus software is very low. A PDF file is easy to edit and manipulate because it is a text format, providing a low barrier to malware authors. Analyzing PDF files for malware is nonetheless difficult because of (a) the complexity of the formatting language, (b) the parsing idiosyncrasies in Adobe Reader, and (c) undocumented correction techniques employed in Adobe Reader. In May 2011, Esparza demonstrated that PDF malware could be hidden from 42 of 43 antivirus packages by combining multiple obfuscation techniques [4]. One reason current antivirus software fails is the ease of varying byte sequences in PDF malware, thereby rendering conventional signature-based virus detection useless. The compression and encryption functions produce sequences of bytes that are each functions of multiple input bytes. As a result, padding the malware payload with some whitespace before compression/encryption can change many of the bytes in the final payload. In this study we analyzed a corpus of 2591 benign and 87 malicious PDF files. While this corpus is admittedly small, it allowed us to test a system for collecting indicators of embedded PDF malware. We will call these indicators features throughout the rest of this report. The features are extracted using an instrumented PDF viewer, and are the inputs to a prediction model that scores the likelihood of a PDF file containing malware. The prediction model is constructed from a sample of labeled data by a machine learning algorithm (specifically, decision tree ensemble learning). Preliminary experiments show that the model is able to detect half of the PDF malware in the corpus with zero false alarms. We conclude the report with suggestions for extending this work to detect a greater variety of PDF malware.

Physical Description

20 p.

Language

Item Type

Identifier

Unique identifying numbers for this report in the Digital Library or other systems.

  • Report No.: SAND2011-7982
  • Grant Number: AC04-94AL85000
  • DOI: 10.2172/1030303 | External Link
  • Office of Scientific & Technical Information Report Number: 1030303
  • Archival Resource Key: ark:/67531/metadc832480

Collections

This report is part of the following collection of related materials.

Office of Scientific & Technical Information Technical Reports

Reports, articles and other documents harvested from the Office of Scientific and Technical Information.

Office of Scientific and Technical Information (OSTI) is the Department of Energy (DOE) office that collects, preserves, and disseminates DOE-sponsored research and development (R&D) results that are the outcomes of R&D projects or other funded activities at DOE labs and facilities nationwide and grantees at universities and other institutions.

What responsibilities do I have when using this report?

When

Dates and time periods associated with this report.

Creation Date

  • September 1, 2011

Added to The UNT Digital Library

  • May 19, 2016, 3:16 p.m.

Description Last Updated

  • Dec. 5, 2016, 1:29 p.m.

Usage Statistics

When was this report last used?

Yesterday: 0
Past 30 days: 2
Total Uses: 5

Interact With This Report

Here are some suggestions for what to do next.

Start Reading

PDF Version Also Available for Download.

Citations, Rights, Re-Use

Munson, Miles Arthur & Cross, Jesse S. (Missouri University of Science and Technology, Rolla, MO). Deep PDF parsing to extract features for detecting embedded malware., report, September 1, 2011; United States. (digital.library.unt.edu/ark:/67531/metadc832480/: accessed November 24, 2017), University of North Texas Libraries, Digital Library, digital.library.unt.edu; crediting UNT Libraries Government Documents Department.