UNT Theses and Dissertations - Browse

Cross Language Information Retrieval for Languages with Scarce Resources
Our generation has experienced one of the most dramatic changes in how society communicates. Today, we have online information on almost any imaginable topic. However, most of this information is available in only a few dozen languages. In this thesis, I explore the use of parallel texts to enable cross-language information retrieval (CLIR) for languages with scarce resources. To build the parallel text I use the Bible. I evaluate different variables and their impact on the resulting CLIR system, specifically: (1) the CLIR results when using different amounts of parallel text; (2) the role of paraphrasing on the quality of the CLIR output; (3) the impact on accuracy when translating the query versus translating the collection of documents; and finally (4) how the results are affected by the use of different dialects. The results show that all these variables have a direct impact on the quality of the CLIR system.
Cuff-less Blood Pressure Measurement Using a Smart Phone
Blood pressure is a vital sign that physicians often need as preliminary data for immediate intervention during emergency situations or for regular monitoring of people with cardiovascular diseases. Despite the availability of portable blood pressure meters in the market, they are not regularly carried by people, creating a need for an ultra-portable measurement platform or device that can be easily carried and used at all times. One such device is the smartphone which, according to a comScore survey, is used by 26.2% of the US adult population. The mass production of these phones with built-in sensors and high computation power has created numerous possibilities for application development in different domains, including biomedical. Motivated by this capability and their extensive usage, this thesis focuses on developing a blood pressure measurement platform on smartphones. Specifically, I developed a blood pressure measurement system on a smartphone using the built-in camera and a customized external microphone. The system first obtains heart beats using the microphone and the finger pulse using the camera, and then calculates the blood pressure from the recorded data. I developed techniques for finding the best location for obtaining the data, making the system usable by all categories of people. The proposed system achieved accuracies between 90% and 100% when compared to traditional blood pressure meters. The second part of this thesis presents a new system for remote heart beat monitoring using the smartphone. With the proposed system, heart beats can be transferred live by patients and monitored by physicians remotely for diagnosis. The proposed blood pressure measurement and remote monitoring systems will be able to facilitate information acquisition and decision making by 9-1-1 operators.
DADS - A Distributed Agent Delivery System
Mobile agents require an appropriate platform that can facilitate their migration and execution. In particular, the design and implementation of such a system must balance several factors that will ensure that its constituent agents are executed without problems. Besides the basic requirements of migration and execution, an agent system must also provide mechanisms to ensure the security and survivability of an agent when it migrates between hosts. In addition, the system should be simple enough to facilitate its widespread use across large-scale networks (i.e., the Internet). To address these issues, this thesis discusses the design and implementation of the Distributed Agent Delivery System (DADS). The DADS provides a decoupled design that separates agent acceptance from agent execution. Using functional modules, the DADS provides services ranging from language execution and security to fault tolerance and compression. Modules allow the administrator(s) of hosts to declare, at run-time, the services that they want to provide. Since each administrative domain is different, the DADS provides a platform that can be adapted to exchange heterogeneous blends of agents across large-scale networks.
Data-Driven Decision-Making Framework for Large-Scale Dynamical Systems under Uncertainty
Managing large-scale dynamical systems (e.g., transportation systems, complex information systems, and power networks) in real time is very challenging considering their complicated system dynamics, intricate network interactions, large scale, and especially the existence of various uncertainties. To address this issue, intelligent techniques are needed that can quickly design decision-making strategies that are robust to uncertainties. This dissertation aims to meet these challenges by exploring a data-driven decision-making framework, which leverages big-data techniques and scalable uncertainty evaluation approaches to quickly solve optimal control problems. In particular, the following techniques have been developed along this direction: 1) system modeling approaches to simplify the system analysis and design procedures for multiple applications; 2) effective simulation-based and analytical approaches to efficiently evaluate system performance and design control strategies under uncertainty; and 3) big-data techniques that allow some computations of control strategies to be completed offline. These techniques and tools for analysis, design, and control contribute to a wide range of applications including air traffic flow management, complex information systems, and airborne networks.
Ddos Defense Against Botnets in the Mobile Cloud
Mobile phone advancements and ubiquitous internet connectivity are resulting in ever-expanding possibilities for the application of smartphones. Users of mobile phones are now capable of hosting server applications from their personal devices. Whether providing services individually or in an ad hoc network setting, the devices are currently not configured for defending against distributed denial of service (DDoS) attacks. These attacks, often launched from a botnet, have existed in the space of personal computing for decades but have recently begun showing up on mobile devices. Research is done first into the required steps to develop a potential botnet on the Android platform. This includes testing for the amount of malicious traffic an Android phone would be capable of generating for a DDoS attack. On the other end of the spectrum is the need for mobile devices running networked applications to develop security against DDoS attacks. For this, mobile phones are set up with web servers running Apache to simulate users running internet-connected applications, either for local ad hoc networks or serving to the internet. Testing is done for the viability of using commonly available modules developed for Apache and intended for servers, as well as for finding baseline capabilities of mobiles to handle higher traffic volumes. Given the unique challenge of the limited resources a mobile phone can dedicate to Apache when compared to a dedicated hosting server, a new method was needed. A proposed defense algorithm is developed for mitigating DDoS attacks against the mobile server that takes into account the limited resources available on the mobile device. The algorithm is tested against TCP socket flooding for effectiveness and shown to perform better than the common Apache module installations on a mobile device.
Defensive Programming
This research explores the concepts of defensive programming as currently defined in the literature. Then these concepts are extended and more explicitly defined. The relationship between defensive programming, as presented in this research, and current programming practices is discussed and several benefits are observed. Defensive programming appears to benefit the entire software life cycle. Four identifiable phases of the software development process are defined, and the relationship between these four phases and defensive programming is shown. In this research, defensive programming is defined as writing programs in such a way that during execution the program itself produces communication allowing the programmer and the user to observe its dynamic states accurately and critically. To accomplish this end, the use of defensive programming snap shots is presented as a software development tool.
Design and Analysis of Novel Verifiable Voting Schemes
Free and fair elections are the basis for democracy, but conducting elections is not an easy task. Different groups of people try to influence the outcome of an election in their favor using a range of methods, from campaigning for a particular candidate to well-financed lobbying. Often the stakes are too high, and the methods are illegal. Two main properties of any voting scheme are the privacy of a voter’s choice and the integrity of the tally. Unfortunately, they are mutually exclusive. Integrity requires making elections transparent and auditable, but at the same time, we must preserve a voter’s privacy. It is always a trade-off between these two requirements. Current voting schemes favor privacy over auditability, and thus, they are vulnerable to voting fraud. I propose two novel voting systems that can achieve both privacy and verifiability. The first protocol is based on cryptographic primitives to ensure the integrity of the final tally and the privacy of the voter. The second protocol is a simple paper-based voting scheme that achieves almost the same level of security without the use of cryptography.
The Design and Implementation of a Prolog Parser Using Javacc
Operatorless Prolog text is LL(1) in nature and any standard LL parser generator tool can be used to parse it. However, Prolog text that conforms to the ISO Prolog standard allows the definition of dynamic operators. Since Prolog operators can be defined at run-time, operator symbols are not present in the grammar rules of the language. Unless the parser generator allows for some flexibility in the specification of the grammar rules, it is very difficult to generate a parser for such text. In this thesis we discuss the existing parsing methods and their modified versions for parsing languages with dynamic operator capabilities. Implementation details of a parser that uses Javacc as a parser generator tool to parse standard Prolog text are provided. The output of the parser is an “Abstract Syntax Tree” that reflects the correct precedence and associativity rules among the various operators (static and dynamic) of the language. Empirical results are provided that show that a Prolog parser generated by a parser generator like Javacc is comparable in efficiency to a hand-coded parser.
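To illustrate why run-time operators do not fit into fixed grammar rules, here is a minimal Python sketch of a table-driven operator-precedence parser. It is not the Javacc-based method from the thesis. The function name `parse_infix`, the pre-split token list, and the `===>` operator are hypothetical, and only infix operators of type `xfx`, `xfy`, and `yfx` are handled.

```python
def parse_infix(tokens, ops):
    """Parse a flat token list into a nested (op, left, right) tuple tree.

    ops maps an operator symbol to (priority, type), as in Prolog's op/3:
    a lower priority binds tighter; 'yfx' is left-associative,
    'xfy' is right-associative, and 'xfx' is non-associative.
    """
    pos = 0

    def parse_term(max_priority):
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        left = tokens[pos]          # an atom or variable has priority 0
        pos += 1
        left_priority = 0
        while pos < len(tokens) and tokens[pos] in ops:
            priority, kind = ops[tokens[pos]]
            left_max = priority if kind == "yfx" else priority - 1
            right_max = priority if kind == "xfy" else priority - 1
            if priority > max_priority or left_priority > left_max:
                break
            op = tokens[pos]
            pos += 1
            right = parse_term(right_max)
            left = (op, left, right)
            left_priority = priority
        return left

    tree = parse_term(1200)         # 1200 is the largest Prolog priority
    if pos != len(tokens):
        raise SyntaxError("unexpected token: " + tokens[pos])
    return tree


# The operator table is ordinary data, so it can change while parsing runs.
ops = {"+": (500, "yfx"), "-": (500, "yfx"), "*": (400, "yfx"), "^": (200, "xfy")}
tree = parse_infix(["a", "-", "b", "*", "c"], ops)

ops["===>"] = (700, "xfx")          # hypothetical operator defined at run-time
rule = parse_infix(["a", "+", "b", "===>", "c"], ops)
```

Because the table is consulted on every token, the same parser code accepts a new operator as soon as it is added, which is the flexibility a grammar with hard-coded operator rules lacks.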
The Design and Implementation of an Intelligent Agent-Based File System
As bandwidth constraints on LAN/WAN environments decrease, the demand for distributed services will continue to increase. In particular, the proliferation of user-level applications requiring high-capacity distributed file storage systems will demand that such services be universally available. At the same time, the advent of high-speed networks has made the deployment of application and communication solutions based upon an Intelligent Mobile Agent (IMA) framework practical. Agents have proven to present an ideal development paradigm for the creation of autonomous large-scale distributed systems, and an agent-based communication scheme would facilitate the creation of independently administered distributed file services. This thesis thus outlines an architecture for such a distributed file system based upon an IMA communication framework.
Design and Implementation of Large-Scale Wireless Sensor Networks for Environmental Monitoring Applications
Environmental monitoring represents a major application domain for wireless sensor networks (WSN). However, despite significant advances in recent years, there are still many challenging issues to be addressed to exploit the full potential of the emerging WSN technology. In this dissertation, we introduce the design and implementation of low-power wireless sensor networks for long-term, autonomous, and near-real-time environmental monitoring applications. We have developed an out-of-box solution consisting of a suite of software, protocols and algorithms to provide reliable data collection with extremely low power consumption. Two wireless sensor networks based on the proposed solution have been deployed in remote field stations to monitor soil moisture along with other environmental parameters. As parts of the ever-growing environmental monitoring cyberinfrastructure, these networks have been integrated into the Texas Environmental Observatory system for long-term operation. Environmental measurement and network performance results are presented to demonstrate the capability, reliability and energy-efficiency of the network.
A Design Approach for Digital Computer Peripheral Controllers, Case Study Design and Construction
The purpose of this project was to describe a novel design approach for a digital computer peripheral controller, then design and construct a case study controller. This document consists of three chapters and an appendix. Chapter II presents the design approach chosen: a variation on a design presented by Charles R. Richards in an article published in Electronics magazine. Richards' approach consists of a finite state machine circuit controlling all the functions of a controller. The variation on Richards' approach consists of considering the various logically independent processes which a controller carries out and assigning control of each process to a separate finite state machine. The appendix contains the documentation of the design and construction of the controller.
The Design Of A Benchmark For Geo-stream Management Systems
The recent growth in sensor technology allows easier information gathering in real time as sensors have grown smaller, more accurate, and less expensive. The resulting data is often in a geo-stream format: continuously changing input with a spatial extent. Researchers developing geo-stream management systems (GSMSs) require a benchmark system for evaluation, which is currently lacking. This thesis presents GSMark, a benchmark for evaluating GSMSs. GSMark provides a data generator that creates a combination of synthetic and real geo-streaming data, a workload simulator to present the data to the GSMS as a data stream, and a set of benchmark queries that evaluate typical GSMS functionality and query performance. In particular, GSMark generates both moving points and evolving spatial regions, two fundamental data types for a broad range of geo-stream applications, and the geo-streaming queries on this data.
Detection of Ulcerative Colitis Severity and Enhancement of Informative Frame Filtering Using Texture Analysis in Colonoscopy Videos
There are several types of disorders that affect the colon’s ability to function properly, such as colorectal cancer, ulcerative colitis, diverticulitis, irritable bowel syndrome, and colonic polyps. Automatic detection of these diseases would inform the endoscopist of possible sub-optimal inspection during the colonoscopy procedure as well as save time during post-procedure evaluation. However, existing systems detect only a few of those disorders, such as colonic polyps. In this dissertation, we address the automatic detection of another important disorder called ulcerative colitis. We propose a novel texture feature extraction technique to detect the severity of ulcerative colitis at the block, image, and video levels. We also enhance the current informative frame filtering methods by detecting water and bubble frames using our proposed technique. Our feature extraction algorithm, based on accumulation of pixel value differences, provides better accuracy at faster speed than the existing methods, making it highly suitable for real-time systems. We also propose a hybrid approach in which our feature method is combined with existing feature method(s) to provide even better accuracy. We extend the block and image level detection method to video level severity score calculation and shot segmentation. Also, the proposed novel feature extraction method can detect water and bubble frames in colonoscopy videos with very high accuracy in significantly less processing time, even when clustering is used to reduce the training size by 10 times.
Development, Implementation, and Analysis of a Contact Model for an Infectious Disease
With growing concern about infectious diseases spreading in a population, epidemiology is becoming more important for the future of public health. In the past, epidemiologists used existing data from an outbreak to help them determine how an infectious disease might spread in the future. Now, with computational models, they are able to analyze data produced by these models to help with prevention and intervention plans. This paper looks at the design, implementation, and analysis of a computational model based on the interactions between individuals in a population. The design of the working contact model looks closely at the SEIR model used as the foundation and the two timelines of a disease. The implementation of the contact model is reviewed while looking closely at data structures. The analysis of the experiments provides evidence that this contact model can be used to help epidemiologists study the spread of an infectious disease based on the contact rate of individuals.
Direct Online/Offline Digital Signature Schemes.
Online/offline signature schemes are useful in many situations, and two such scenarios are considered in this dissertation: bursty server authentication and embedded device authentication. In this dissertation, new techniques for online/offline signing are introduced and applied in a variety of ways to create online/offline signature schemes, and five different online/offline signature schemes that are proved secure under a variety of models and assumptions are proposed. Two of the proposed five schemes have the best offline or best online performance of any currently known technique, and are particularly well-suited for the scenarios that are considered in this dissertation. To determine if the proposed schemes provide the expected practical improvements, a series of experiments was conducted comparing the proposed schemes with each other and with other state-of-the-art schemes in this area, both on a desktop-class computer and under AVR Studio, a simulation platform for an 8-bit processor that is popular for embedded systems. Under AVR Studio, the proposed SGE scheme, using a typical key size for the embedded device authentication scenario, can complete the offline phase in about 24 seconds and then produce a signature (the online phase) in 15 milliseconds, which is the best offline performance of any known signature scheme that has been proven secure in the standard model. In the tests on a desktop-class computer, the proposed SGS scheme, which has the best online performance and is designed for the bursty server authentication scenario, generated 469,109 signatures per second, while the Schnorr scheme (the next best scheme in terms of online performance) generated only 223,548 signatures per second. The experimental results demonstrate that the SGE and SGS schemes are the most efficient techniques for embedded device authentication and bursty server authentication, respectively.
DirectShow Approach to Low-Cost Multimedia Security Surveillance System
In response to the recent intensive need for civilian security surveillance, both full and compact versions of a Multimedia Security Surveillance (MSS) system have been built. The new Microsoft DirectShow technology was applied in implementing the multimedia stream-processing module. Through the Microsoft Windows Driver Model interface, the chosen IEEE 1394-enabled Fire-i cameras are integrated as external sensors with a PC-based continuous storage unit. The MSS application also allows multimedia broadcasting and remote control. Cost analysis is included.
Distributed Frameworks Towards Building an Open Data Architecture
Data is everywhere. Current technological advancements in digital and social media, and the ease with which different application services can interact with a variety of systems, are generating tremendous volumes of data. Because of such varied services, data is no longer restricted to structured types such as text; it also includes unstructured content such as social media data, videos, and images. The generated data is of no use unless it is stored and analyzed to derive some value. Traditional database systems come with limitations on the type of data format, schema, access rates, storage sizes, etc. Hadoop is an Apache open source distributed framework that supports storing huge datasets of differently formatted data reliably on its file system, the Hadoop Distributed File System (HDFS), and processing the data stored on HDFS using the MapReduce programming model. This thesis is about building a data architecture using Hadoop and its related open source distributed frameworks to support a data flow pipeline on low-cost commodity hardware. The data flow components are data sourcing, storage management on HDFS, and a data access layer. This study also discusses a use case that utilizes the architecture components. Sqoop is used to ingest structured data from a database onto Hadoop, and Flume is used to ingest semi-structured Twitter streaming JSON data onto HDFS for analysis. The data sourced using Sqoop and Flume have been analyzed using Hive for SQL-like analytics and, at a higher level of the data access layer, Hadoop has been compared with an in-memory computing system using Spark. Significant differences in query execution performance have been analyzed when working with the Hadoop and Spark frameworks. This integration helps in ingesting huge Volumes of streaming JSON Variety data to derive better Value-based analytics using Hive and ...
Dynamic Grid-Based Data Distribution Management in Large Scale Distributed Simulations
Distributed simulation is an enabling concept to support the networked interaction of models and real world elements that are geographically distributed. This technology has brought a new set of challenging problems to solve, such as Data Distribution Management (DDM). The aim of DDM is to limit and control the volume of the data exchanged during a distributed simulation, and reduce the processing requirements of the simulation hosts by relaying events and state information only to those applications that require them. In this thesis, we propose a new DDM scheme, which we refer to as dynamic grid-based DDM. A lightweight UNT-RTI has been developed and implemented to investigate the performance of our DDM scheme. Our results clearly indicate that our scheme is scalable and it significantly reduces both the number of multicast groups used, and the message overhead, when compared to previous grid-based allocation schemes using large-scale and real-world scenarios.
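The general idea behind grid-based DDM can be sketched in a few lines of Python. This is only an illustration of mapping regions to grid cells, not the dynamic scheme or the UNT-RTI from the thesis. The function names, the rectangle format `(x_min, y_min, x_max, y_max)`, and the rule that a multicast group is only needed for cells that have both a publisher and a subscriber are assumptions made for the example.

```python
from collections import defaultdict


def cells_for_region(region, cell_size):
    """Return the set of (column, row) grid cells overlapped by a rectangle."""
    x_min, y_min, x_max, y_max = region
    cols = range(int(x_min // cell_size), int(x_max // cell_size) + 1)
    rows = range(int(y_min // cell_size), int(y_max // cell_size) + 1)
    return {(col, row) for col in cols for row in rows}


def active_groups(publications, subscriptions, cell_size):
    """Map each cell that has both publishers and subscribers to their host names.

    publications and subscriptions map a host name to its region.
    Cells with only publishers or only subscribers get no multicast group.
    """
    publishers = defaultdict(set)
    subscribers = defaultdict(set)
    for host, region in publications.items():
        for cell in cells_for_region(region, cell_size):
            publishers[cell].add(host)
    for host, region in subscriptions.items():
        for cell in cells_for_region(region, cell_size):
            subscribers[cell].add(host)
    shared = publishers.keys() & subscribers.keys()
    return {cell: (publishers[cell], subscribers[cell]) for cell in shared}


# Hypothetical hosts: A publishes in cells (0,0) and (1,0); B subscribes in (1,0).
groups = active_groups(
    publications={"A": (0, 0, 15, 5)},
    subscriptions={"B": (12, 2, 18, 8), "C": (30, 30, 35, 35)},
    cell_size=10,
)
```

In this sketch host C never receives A's updates, and only one group is allocated even though the three regions touch four cells in total.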
Dynamic Resource Management in RSVP- Controlled Unicast Networks
Resources are said to be fragmented in the network when they are available in non-contiguous blocks, and calls are dropped as they may not find sufficient resources. Hence, available resources may remain unutilized. In this thesis, the effect of resource fragmentation (RF) on RSVP-controlled networks was studied and new algorithms were proposed to reduce the effect of RF. In order to minimize the effect of RF, resources in the network are dynamically redistributed on different paths to make them available in contiguous blocks. Extra protocol messages are introduced to facilitate resource redistribution in the network. The Dynamic Resource Redistribution (DRR) algorithm, when used in conjunction with RSVP, not only increased the number of calls accommodated into the network but also increased the overall resource utilization of the network. Issues such as how many resources need to be redistributed and of which call(s), and how these choices affect the redistribution process, were investigated. Further, various simulation experiments were conducted to study the performance of the DRR algorithm on different network topologies with varying traffic characteristics.
Efficient Algorithms and Framework for Bandwidth Allocation, Quality-of-Service Provisioning and Location Management in Mobile Wireless Computing
The fusion of computers and communications has promised to herald the age of the information superhighway over high-speed communication networks, where the ultimate goal is to enable a multitude of users at any place to access information from anywhere and at any time. This, in a nutshell, is the goal envisioned by the Personal Communication Services (PCS) and Xerox's ubiquitous computing. In view of the remarkable growth in the number of mobile communication users in the last few years, the radio frequency spectrum allocated by the FCC (Federal Communications Commission) to this service is still very limited and the usable bandwidth is far less than the expected demand, particularly in view of the emergence of next-generation wireless multimedia applications like video-on-demand, WWW browsing, traveler information systems, etc. Proper management of the available spectrum is necessary not only to accommodate these high-bandwidth applications, but also to alleviate problems due to the sudden explosion of traffic in so-called hot cells. In this dissertation, we have developed simple load balancing techniques to cope with the problem of tele-traffic overloads in one or more hot cells in the system. The objective is to ease the high channel demand in hot cells by borrowing channels from suitable cold cells and by proper assignment (or re-assignment) of the channels among the users. We also investigate possible ways of improving system capacity by rescheduling bandwidth in the case of wireless multimedia traffic. In our proposed scheme, traffic using multiple channels releases one or more channels to increase the carried traffic or throughput in the system. Two orthogonal QoS parameters, called carried traffic and bandwidth degradation, are identified, and a cost function describing the total revenue earned by the system from a bandwidth degradation and call admission policy is formulated. A channel sharing scheme is proposed for ...
Efficient Linked List Ranking Algorithms and Parentheses Matching as a New Strategy for Parallel Algorithm Design
The goal of a parallel algorithm is to solve a single problem using multiple processors working together and to do so in an efficient manner. In this regard, there is a need to categorize strategies in order to solve broad classes of problems with similar structures and requirements. In this dissertation, two parallel algorithm design strategies are considered: linked list ranking and parentheses matching.
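For readers unfamiliar with linked list ranking, the following Python sketch shows the textbook pointer-jumping technique, simulated one round at a time on a single processor. It is not one of the dissertation's algorithms. The function name `list_rank` and the convention that the tail node points to itself are assumptions of the example.

```python
def list_rank(successor):
    """Return, for each node, its distance to the tail of the linked list.

    successor[i] is the index of the node after node i; the tail points to
    itself. Each loop iteration models one parallel round in which every
    node adds its successor's rank to its own and then jumps over it.
    """
    n = len(successor)
    rank = [0 if successor[i] == i else 1 for i in range(n)]
    nxt = list(successor)
    # Only the tail is a fixed point, so this stops once every node points at it.
    while any(nxt[i] != nxt[nxt[i]] for i in range(n)):
        # Build new lists so every node reads values from the previous round,
        # as it would if all nodes were updated simultaneously.
        rank = [rank[i] + rank[nxt[i]] for i in range(n)]
        nxt = [nxt[nxt[i]] for i in range(n)]
    return rank


# The list 0 -> 2 -> 1 -> 3, stored in array order rather than list order.
ranks = list_rank([2, 3, 1, 3])
```

Each round doubles the distance a pointer spans, so a list of n nodes needs on the order of log n rounds, at the price of more total work than a sequential traversal.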
Elicitation of Protein-Protein Interactions from Biomedical Literature Using Association Rule Discovery
Extracting information from a stack of data is a tedious task, and the scenario is no different in proteomics. Volumes of research papers are published about the study of various proteins in several species, their interactions with other proteins, and the identification of protein(s) as possible biomarkers in causing diseases. It is a challenging task for biologists to keep track of these developments manually by reading through the literature. Several tools have been developed by computer linguists to assist identification, extraction, and hypothesis generation of proteins and protein-protein interactions from biomedical publications and protein databases. However, they are confronted with the challenges of term variation, term ambiguity, access only to abstracts, and inconsistencies in the time-consuming manual curation of protein and protein-protein interaction repositories. This work attempts to attenuate the challenges by extracting protein-protein interactions in humans and eliciting possible interactions using association rule mining on full text, abstracts, and figure captions from publicly available biomedical literature databases. Two such databases are used in our study: the Directory of Open Access Journals (DOAJ) and PubMed Central (PMC). A corpus is built using articles based on search terms. A dataset of more than 38,000 protein-protein interactions from the Human Protein Reference Database (HPRD) is cross-referenced to validate discovered interactive pairs. An optimally sized set of possible binary protein-protein interactions is generated to be made available for clinical or biological validation. A significant change in the number of new associations was found by altering the thresholds for the support and confidence metrics. This study reduces the limitations biologists face in keeping pace with the discovery of protein-protein interactions by manually reading the literature, and their need to validate each and every possible interaction.
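The support and confidence thresholds mentioned in this abstract can be illustrated with a small Python sketch. This is a generic pairwise association rule computation, not the pipeline used in the study. The function name, the protein labels P1 to P3, and the idea that each "transaction" is the set of proteins mentioned together in one text unit are assumptions of the example.

```python
from collections import Counter
from itertools import permutations


def mine_pair_rules(transactions, min_support, min_confidence):
    """Return rules (a, b, support, confidence) meaning "a implies b".

    support    = fraction of transactions containing both a and b
    confidence = fraction of transactions containing a that also contain b
    """
    total = len(transactions)
    item_count = Counter()
    pair_count = Counter()
    for items in transactions:
        item_count.update(items)
        # Ordered pairs, so both directions of each co-occurrence are counted.
        pair_count.update(permutations(sorted(items), 2))
    rules = []
    for (a, b), both in pair_count.items():
        support = both / total
        confidence = both / item_count[a]
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return sorted(rules)


# Hypothetical co-mentions of proteins in four text units.
mentions = [{"P1", "P2"}, {"P1", "P2", "P3"}, {"P1", "P3"}, {"P2"}]
strict = mine_pair_rules(mentions, min_support=0.5, min_confidence=0.9)
loose = mine_pair_rules(mentions, min_support=0.25, min_confidence=0.5)
```

On this toy data the strict thresholds keep one rule and the loose thresholds keep five, which shows how sensitive the number of candidate associations is to the two cut-offs.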
Embedded monitors for detecting and preventing intrusions in cryptographic and application protocols.
There are two main approaches for intrusion detection: signature-based and anomaly-based. Signature-based detection employs pattern matching to match attack signatures with observed data making it ideal for detecting known attacks. However, it cannot detect unknown attacks for which there is no signature available. Anomaly-based detection builds a profile of normal system behavior to detect known and unknown attacks as behavioral deviations. However, it has a drawback of a high false alarm rate. In this thesis, we describe our anomaly-based IDS designed for detecting intrusions in cryptographic and application-level protocols. Our system has several unique characteristics, such as the ability to monitor cryptographic protocols and application-level protocols embedded in encrypted sessions, a very lightweight monitoring process, and the ability to react to protocol misuse by modifying protocol response directly.
An Empirical Evaluation of Communication and Coordination Effectiveness in Autonomous Reactive Multiagent Systems
This thesis describes experiments designed to measure the effect of collaborative communication on task performance of a multiagent system. A discrete event simulation was developed to model a multiagent system completing a task to find and collect food resources, with the ability to substitute various communication and coordination methods. Experiments were conducted to find the effects of the various communication methods on completion of the task to find and harvest the food resources. Results show that communication decreases the time required to complete the task. However, not all communication methods fare equally well. In particular, results indicate that the communication model of the bee is a particularly effective method of agent communication and collaboration. Furthermore, results indicate that direct communication with additional information content provides better completion results. Cost-benefit models show some conflicting information, indicating that the increased performance may not offset the additional cost of achieving that performance.
An Empirical Study of How Novice Programmers Use the Web
Students often use the web as a source of help for problems that they encounter on programming assignments. In this work, we seek to understand how students use the web to search for help on their assignments. We used a mixed methods approach with 344 students who completed a survey and 41 students who participated in focus group meetings and helped in recording data about their search habits. The survey reveals data about student-reported search habits, while the focus group uses a web browser plug-in to record actual search patterns. We examine the results collectively and as broken down by class year. Survey results show that at least 2/3 of the students from each class year rely on search engines to locate resources for help with their programming bugs in at least half of their assignments; search habits vary by class year; and the value of different types of resources such as tutorials and forums varies by class year. Focus group results expose the high-frequency web sites used by the students in solving their programming assignments.
End of Insertion Detection in Colonoscopy Videos
Colorectal cancer is the second leading cause of cancer-related deaths behind lung cancer in the United States. Colonoscopy is the preferred screening method for detection of diseases like colorectal cancer. In 2006, the American Society for Gastrointestinal Endoscopy (ASGE) and the American College of Gastroenterology (ACG) issued guidelines for quality colonoscopy. The guidelines suggest that on average the withdrawal phase during a screening colonoscopy should last a minimum of 6 minutes. My aim is to classify the colonoscopy video into insertion and withdrawal phases. The problem is that existing shot detection techniques cannot be applied because a colonoscopy is a single camera shot from start to end. An algorithm to detect the phase boundary has already been developed by the MIGLAB team. The existing method has acceptable levels of accuracy, but its main issue is its dependency on MPEG (Moving Picture Experts Group) 1/2. I implemented exhaustive search for motion estimation to reduce the execution time and improve the accuracy. I took advantage of the C/C++ programming languages with multithreading, which yielded even better performance in terms of execution time. I propose a method for improving the current method of colonoscopy video analysis and also an extension that makes it usable for real-time videos. The real-time version I implemented is capable of handling streams coming directly from the camera in the form of uncompressed bitmap frames. The existing implementation could not be applied to a real-time scenario because of its dependency on MPEG 1/2. Future directions of this research include improved motion search and GPU parallel computing techniques.
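As an illustration of the exhaustive (full) search motion estimation mentioned in the abstract above, the following minimal Python sketch matches one block of the current frame against every candidate offset in a reference frame and keeps the offset with the lowest sum of absolute differences (SAD). It is not the thesis's C/C++ implementation; the function names, the SAD cost, and the list-of-lists frame representation are assumptions chosen for illustration.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between an n x n block of `cur` at (bx, by)
    and the block of `ref` displaced by (dx, dy)."""
    return sum(
        abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
        for y in range(n)
        for x in range(n)
    )


def full_search(cur, ref, bx, by, n, radius):
    """Try every displacement within +/- radius (hypothetical helper).
    Returns (dx, dy, cost) for the best match that stays inside the frame."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            inside = (0 <= by + dy and by + dy + n <= height
                      and 0 <= bx + dx and bx + dx + n <= width)
            if not inside:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best


# Synthetic 8 x 8 frames: `cur` is `ref` shifted by 2 columns and 1 row.
ref = [[y * 8 + x for x in range(8)] for y in range(8)]
cur = [[ref[(y + 1) % 8][(x + 2) % 8] for x in range(8)] for y in range(8)]
vector = full_search(cur, ref, bx=2, by=2, n=2, radius=2)
```

The cost of full search grows with the square of the search radius, which is one reason block-level work of this kind is often spread across threads.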
The enhancement of machine translation for low-density languages using Web-gathered parallel texts.
The majority of the world's languages are poorly represented in informational media like radio, television, newspapers, and the Internet. Translation into and out of these languages may offer a way for speakers of these languages to interact with the wider world, but current statistical machine translation models are only effective with a large corpus of parallel texts - texts in two languages that are translations of one another - which most languages lack. This thesis describes the Babylon project which attempts to alleviate this shortage by supplementing existing parallel texts with texts gathered automatically from the Web -- specifically targeting pages that contain text in a pair of languages. Results indicate that parallel texts gathered from the Web can be effectively used as a source of training data for machine translation and can significantly improve the translation quality for text in a similar domain. However, the small quantity of high-quality low-density language parallel texts on the Web remains a significant obstacle.
Evaluating the Scalability of SDF Single-chip Multiprocessor Architecture Using Automatically Parallelizing Code
Advances in integrated circuit technology continue to provide more and more transistors on a chip. Computer architects are faced with the challenge of finding the best way to translate these resources into high performance. The challenge in the design of the next generation CPU (central processing unit) lies not in trying to use up the silicon area, but in finding smart ways to make use of the wealth of transistors now available. In addition, the next generation architecture should offer high throughput, scalability, modularity, and low energy consumption, rather than being suitable for only one class of applications or users, or emphasizing only a faster clock rate. A program exhibits different types of parallelism: instruction level parallelism (ILP), thread level parallelism (TLP), or data level parallelism (DLP). Likewise, architectures can be designed to exploit one or more of these types of parallelism. It is generally not possible to design architectures that can take advantage of all three types of parallelism without using very complex hardware structures and complex compiler optimizations. We present the state-of-the-art architecture SDF (scheduled dataflow), which exploits as much TLP as the application supplies. We implement an SDF single-chip multiprocessor constructed from simpler processors and execute automatically parallelized applications on the single-chip multiprocessor. SDF has many desirable features such as high throughput, scalability, and low power consumption, which meet the requirements of the next generation of CPU design. Compared with superscalar, VLIW (very long instruction word), and SMT (simultaneous multithreading) architectures, the experimental results show that for applications with very little parallelism SDF is comparable to the other architectures, while for applications with large amounts of parallelism SDF outperforms them.
Execution Time Analysis through Software Monitors
The analysis of an executing program and the isolation of critical code has been a problem since the first program was written. This thesis examines the process of program analysis through the use of a software monitoring system. Since there is a trend toward structured languages, a subset of PL/I was developed to exhibit source statement monitoring and costing techniques. By filtering a PL/W program through a preprocessor which determines the cost of source statements and inserts monitoring code, a post-execution analysis of the program can be obtained. This analysis displays an estimated time cost for each source statement, the number of times the statement was executed, and the product of these values. Additionally, a bar graph is printed in order to quickly locate very active code.
Exon/Intron Discrimination Using the Finite Induction Pattern Matching Technique
DNA sequence analysis involves precise discrimination of two of the sequence's most important components: exons and introns. Exons encode the proteins that are responsible for almost all the functions in a living organism. Introns interrupt the sequence coding for a protein and must be removed from primary RNA transcripts before translation to protein can occur. A pattern recognition technique called Finite Induction (FI) is utilized to study the language of exons and introns. FI is especially suited for analyzing and classifying large amounts of data representing sequences of interest. It requires no biological information and employs no statistical functions. Finite Induction is applied to the exon and intron components of DNA by building a collection of rules based upon what it finds in the sequences it examines. It then attempts to match the known rule patterns with new rules formed as a result of analyzing a new sequence. A high number of matches predict a probable close relationship between the two sequences; a low number of matches signifies a large amount of difference between the two. This research demonstrates FI to be a viable tool for measurement when known patterns are available for the formation of rule sets.
Exploring Trusted Platform Module Capabilities: A Theoretical and Experimental Study
Trusted platform modules (TPMs) are hardware modules that are bound to a computer's motherboard and are being included in many desktops and laptops. Augmenting computers with these hardware modules adds powerful functionality in distributed settings, allowing us to reason about the security of these systems in new ways. In this dissertation, I study the functionality of TPMs from a theoretical as well as an experimental perspective. On the theoretical front, I leverage various features of TPMs to construct applications like random oracles that are impossible to implement in a standard model of computation. Apart from random oracles, I construct a new cryptographic primitive which is essentially a non-interactive form of the standard cryptographic primitive of oblivious transfer. I apply this new primitive to secure mobile agent computations, where interaction between various entities is typically required to ensure security. I prove these constructions are secure using standard cryptographic techniques and assumptions. To test the practicability of these constructions and their applications, I performed an experimental study, both on an actual TPM and on a software TPM simulator that has been enhanced to reflect timings from a real TPM. This allowed me to benchmark the performance of the applications and test the feasibility of the proposed extensions to standard TPMs. My tests also show that these constructions are practical.
Extensions to Jinni Mobile Agent Architecture
We extend the Jinni mobile agent architecture with a multicast network transport layer, an agent-to-agent delegation mechanism and a reflection based Prolog-to-Java interface. To ensure that our agent infrastructure runs efficiently, independently of router-level multicast support, we describe a blackboard based algorithm for locating a randomly roaming agent. As part of the agent-to-agent delegation mechanism, we describe an alternative to code-fetching mechanism for stronger mobility of mobile agents with less network overhead. In the context of direct and reflection based extension mechanisms for Jinni, we describe the design and the implementation of a reflection based Prolog-to-Java interface. The presence of subtyping and method overloading makes finding the most specific method corresponding to a Prolog call pattern fairly difficult. We describe a run-time algorithm which provides accurate handling of overloaded methods beyond Java's reflection package's limitations.
Flexible Digital Authentication Techniques
This dissertation investigates authentication techniques in some emerging areas. Specifically, authentication schemes have been proposed that are well-suited for embedded systems and for privacy-respecting pay Web sites. With embedded systems, a person could own several devices which are capable of communication and interaction, but these devices use embedded processors whose computational capabilities are limited compared to desktop computers. Examples of this scenario include entertainment devices or appliances owned by a consumer, multiple control and sensor systems in an automobile or airplane, and environmental controls in a building. An efficient public key cryptosystem has been devised, which provides a complete solution for an embedded system, including protocols for authentication, authenticated key exchange, encryption, and revocation. The new construction is especially suitable for devices with constrained computing capabilities and resources. Compared with other available authentication schemes, such as X.509 and identity-based encryption, the new construction provides unique features such as simplicity, efficiency, forward secrecy, and an efficient re-keying mechanism. In the application scenario for a pay Web site, users may be sensitive about their privacy and may not wish their behavior to be tracked by Web sites. Thus, an anonymous authentication scheme is desirable in this case. That is, a user can prove his/her authenticity without revealing his/her identity. On the other hand, the Web site owner would like to prevent a group of users from sharing a single subscription while hiding behind user anonymity. The Web site should be able to detect such malicious behavior and exclude corrupted users from future service. This dissertation extensively discusses anonymous authentication techniques, such as group signature, direct anonymous attestation, and traceable signature.
Three anonymous authentication schemes have been proposed, which include a group signature scheme with signature claiming and variable linkability, a scheme for direct anonymous attestation in trusted computing platforms ...
Force-Directed Graph Drawing and Aesthetics Measurement in a Non-Strict Pure Functional Programming Language
Non-strict pure functional programming often requires redesigning algorithms and data structures to work more effectively under new constraints of non-strict evaluation and immutable state. Graph drawing algorithms, while numerous and broadly studied, have no presence in the non-strict pure functional programming model. Additionally, there is currently no freely licensed standalone toolkit used to quantitatively analyze aesthetics of graph drawings. This thesis addresses two previously unexplored questions. Can a force-directed graph drawing algorithm be implemented in a non-strict functional language, such as Haskell, and still be practically usable? Can an easily extensible aesthetic measuring tool be implemented in a language such as Haskell and still be practically usable? The focus of the thesis is on implementing one of the simplest force-directed algorithms, that of Fruchterman and Reingold, and comparing its resulting aesthetics to those of a well-known C++ implementation of the same algorithm.
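For readers unfamiliar with the Fruchterman and Reingold algorithm named above, the following minimal Python sketch shows its core loop: every pair of vertices repels with force k²/d, every edge attracts with force d²/k, and each vertex moves along its net displacement by at most the current "temperature", which then cools. This is not the thesis's Haskell implementation; the function name, parameters, and defaults are assumptions, and the frame clamping of the original algorithm is omitted for brevity.

```python
import math


def fruchterman_reingold(positions, edges, k, iterations=50, temp=1.0, cooling=0.95):
    """Hypothetical minimal force-directed layout.
    positions: {vertex: (x, y)}, edges: iterable of (u, v), k: ideal edge length."""
    pos = {v: [float(x), float(y)] for v, (x, y) in positions.items()}
    verts = list(pos)
    for _ in range(iterations):
        disp = {v: [0.0, 0.0] for v in verts}
        # Repulsive force k^2 / d between every pair of vertices.
        for i, u in enumerate(verts):
            for v in verts[i + 1:]:
                dx = pos[u][0] - pos[v][0]
                dy = pos[u][1] - pos[v][1]
                d = math.hypot(dx, dy) or 1e-9  # avoid division by zero
                f = k * k / d
                disp[u][0] += dx / d * f
                disp[u][1] += dy / d * f
                disp[v][0] -= dx / d * f
                disp[v][1] -= dy / d * f
        # Attractive force d^2 / k along every edge.
        for u, v in edges:
            dx = pos[u][0] - pos[v][0]
            dy = pos[u][1] - pos[v][1]
            d = math.hypot(dx, dy) or 1e-9
            f = d * d / k
            disp[u][0] -= dx / d * f
            disp[u][1] -= dy / d * f
            disp[v][0] += dx / d * f
            disp[v][1] += dy / d * f
        # Move each vertex, limiting the step length by the temperature.
        for v in verts:
            length = math.hypot(disp[v][0], disp[v][1])
            if length > 0:
                step = min(length, temp)
                pos[v][0] += disp[v][0] / length * step
                pos[v][1] += disp[v][1] / length * step
        temp *= cooling
    return {v: tuple(p) for v, p in pos.items()}


layout = fruchterman_reingold({"a": (0, 0), "b": (3, 0)}, [("a", "b")], k=1.0)
distance = math.hypot(layout["a"][0] - layout["b"][0], layout["a"][1] - layout["b"][1])
```

For a single connected pair the two forces balance when d²/k = k²/d, that is at d = k, so the two vertices in the usage example settle one ideal edge length apart.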
FORTRAN Optimizations at the Source Code Level
This paper discusses FORTRAN optimizations that the user can perform manually at the source code level to improve object code performance. It makes use of descriptive examples within the text of the paper for explanatory purposes. The paper defines key areas in writing a FORTRAN program and recommends ways to improve efficiency in these areas.
FP-tree Based Spatial Co-location Pattern Mining
A co-location pattern is a set of spatial features frequently located together in space. A frequent pattern is a set of items that frequently appears in a transaction database. Since its introduction, the paradigm of frequent pattern mining has undergone a shift from candidate generation-and-test based approaches to projection based approaches. Co-location patterns resemble frequent patterns in many aspects. However, the lack of a transaction concept, which is crucial in frequent pattern mining, makes a similar shift of paradigm in co-location pattern mining very difficult. This thesis investigates a projection based co-location pattern mining paradigm. In particular, an FP-tree based co-location mining framework and an algorithm called FP-CM, for FP-tree based co-location miner, are proposed. It is proved that FP-CM is complete, correct, and only requires a small constant number of database scans. The experimental results show that FP-CM outperforms the candidate generation-and-test based co-location miner by an order of magnitude.
A Framework for Analyzing and Optimizing Regional Bio-Emergency Response Plans
The presence of naturally occurring and man-made public health threats necessitates the design and implementation of mitigation strategies, such that adequate response is provided in a timely manner. Since multiple variables, such as geographic properties, resource constraints, and government-mandated time frames, must be accounted for, computational methods provide the necessary tools to develop contingency response plans while respecting underlying data and assumptions. A typical response scenario involves the placement of points of dispensing (PODs) in the affected geographic region to supply vaccines or medications to the general public. Computational tools aid in the analysis of such response plans, as well as in the strategic placement of PODs, such that feasible response scenarios can be developed. Due to the sensitivity of bio-emergency response plans, geographic information, such as POD locations, must be kept confidential. The generation of synthetic geographic regions allows for the development of emergency response plans on non-sensitive data, as well as for the study of the effects of single geographic parameters. Further, synthetic representations of geographic regions allow for results to be published and evaluated by the scientific community. This dissertation presents methodology for the analysis of bio-emergency response plans, methods for plan optimization, as well as methodology for the generation of synthetic geographic regions.
Framework for Evaluating Dynamic Memory Allocators Including a New Equivalence Class Based Cache-conscious Allocator
Software applications’ performance is hindered by a variety of factors, but most notably by the well-known CPU-memory speed gap (often known as the memory wall). This results in the CPU sitting idle waiting for data to be brought from memory to processor caches. The addressing used by caches causes non-uniform accesses to various cache sets. The non-uniformity is due to several reasons, including how different objects are accessed by the code and how the data objects are located in memory. Memory allocators determine where dynamically created objects are placed, thus defining addresses and their mapping to cache locations. It is important to evaluate how different allocators behave with respect to the localities of the created objects. Most allocators use a single attribute of an object, its size, in making allocation decisions. Additional attributes, such as the placement with respect to other objects or a specific cache area, may lead to better use of cache memories. In this dissertation, we propose and implement a framework that allows for the development and evaluation of new memory allocation techniques. At the root of the framework is a memory tracing tool called Gleipnir, which provides very detailed information about every memory access and relates it back to source level objects. Using the traces from Gleipnir, we extended a commonly used cache simulator to generate detailed cache statistics (per function, per data object, and per cache line) and to identify specific data objects that conflict with each other. The utility of the framework is demonstrated with a new memory allocator known as the equivalence class allocator. The new allocator allows users to specify cache sets, in addition to object size, where the objects should be placed. We compare this new allocator with two well-known allocators, viz., the Doug Lea and Pool allocators.
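To make the link between object addresses and cache sets concrete, the sketch below computes the set index of an address in a conventional set-associative cache and counts how many objects land in each set. The line size, the number of sets, and the function names are assumptions for illustration only; they are not taken from Gleipnir or from the equivalence class allocator.

```python
from collections import Counter


def cache_set(addr, line_size=64, num_sets=64):
    """Set index for a byte address, assuming a simple modulo-indexed cache
    (hypothetical geometry: 64-byte lines, 64 sets)."""
    return (addr // line_size) % num_sets


def set_histogram(addresses, line_size=64, num_sets=64):
    """Count how many of the given object start addresses map to each set."""
    return Counter(cache_set(a, line_size, num_sets) for a in addresses)


# Objects placed exactly line_size * num_sets = 4096 bytes apart collide in one set.
strided = [0x10000 + i * 4096 for i in range(4)]
# Objects placed one cache line apart spread over distinct sets.
packed = [0x10000 + i * 64 for i in range(4)]

strided_hist = set_histogram(strided)
packed_hist = set_histogram(packed)
```

Under these assumptions the strided objects all compete for a single set while the packed ones occupy four different sets, which is the kind of placement effect an allocator that accepts cache-set hints could steer.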
Freeform Cursive Handwriting Recognition Using a Clustered Neural Network
Optical character recognition (OCR) software has advanced greatly in recent years. Machine-printed text can be scanned and converted to searchable text with word accuracy rates around 98%. Reasonably neat hand-printed text can be recognized with about 85% word accuracy. However, cursive handwriting still remains a challenge, with state-of-the-art performance still around 75%. Algorithms based on hidden Markov models have been only moderately successful, while recurrent neural networks have delivered the best results to date. This thesis explored the feasibility of using a special type of feedforward neural network to convert freeform cursive handwriting to searchable text. The hidden nodes in this network were grouped into clusters, with each cluster being trained to recognize a unique character bigram. The network was trained on writing samples that were pre-segmented and annotated. Post-processing was facilitated in part by using the network to identify overlapping bigrams that were then linked together to form words and sentences. With dictionary assisted post-processing, the network achieved word accuracy of 66.5% on a small, proprietary corpus. The contributions in this thesis are threefold: 1) the novel clustered architecture of the feed-forward neural network, 2) the development of an expanded set of observers combining image masks, modifiers, and feature characterizations, and 3) the use of overlapping bigrams as the textual working unit to assist in context analysis and reconstruction.
General Purpose Computing in GPU - A Watermarking Case Study
The purpose of this project is to explore the GPU for general purpose computing. The GPU is a massively parallel computing device that has high throughput, exhibits high arithmetic intensity, and has a large market presence. With the increasing computation power being added to it each year through innovations, the GPU is a perfect candidate to complement the CPU in performing computations. The GPU follows the single instruction multiple data (SIMD) model for applying operations on its data. This model allows the GPU to be very useful for assisting the CPU in performing computations on data that is highly parallel in nature. The compute unified device architecture (CUDA) is a parallel computing and programming platform for NVIDIA GPUs. The main focus of this project is to show the power, speed, and performance of a CUDA-enabled GPU for digital video watermark insertion in the H.264 video compression domain. Digital video watermarking in general is a highly computationally intensive process that is strongly dependent on the video compression format in place. The H.264/MPEG-4 AVC video compression format has high compression efficiency at the expense of high computational complexity, and it leaves little room for an imperceptible watermark to be inserted. Employing a human visual model to limit the distortion and degradation of visual quality introduced by the watermark is a good choice for designing a video watermarking algorithm, though it does introduce more computational complexity. Research is being conducted into how CPU-GPU execution of the digital watermark application, optimized using the NVIDIA Visual Profiler, can boost its speed several times compared to running it on a standalone CPU.
General Purpose Programming on Modern Graphics Hardware
I start with a brief introduction to the graphics processing unit (GPU) as well as general-purpose computation on modern graphics hardware (GPGPU). Next, I explore the motivations for GPGPU programming, and the capabilities of modern GPUs (including advantages and disadvantages). Also, I give the background required for further exploring GPU programming, including the terminology used and the resources available. Finally, I include a comprehensive survey of previous and current GPGPU work, and end with a look at the future of GPU programming.
A general purpose semantic parser using FrameNet and WordNet®.
Syntactic parsing is one of the best understood language processing applications. Since language and grammar have been formally defined, it is easy for computers to parse the syntactic structure of natural language text. Does meaning have structure as well? If it has, how can we analyze the structure? Previous systems rely on a one-to-one correspondence between syntactic rules and semantic rules. But such systems can only be applied to limited fragments of English. In this thesis, we propose a general-purpose shallow semantic parser which utilizes a semantic network (WordNet), and a frame dataset (FrameNet). Semantic relations recognized by the parser are based on how human beings represent knowledge of the world. Parsing semantic structure allows semantic units and constituents to be accessed and processed in a more meaningful way than syntactic parsing, moving the automation of understanding natural language text to a higher level.
Generating Machine Code for High-Level Programming Languages
The purpose of this research was to investigate the generation of machine code from high-level programming languages. The following steps were undertaken: 1) Choose a high-level programming language as the source language and a computer as the target computer. 2) Examine all stages during the compiling of a high-level programming language and all data sets involved in the compilation. 3) Discover the mechanism for generating machine code and the mechanism to generate more efficient machine code from the language. 4) Construct an algorithm for generating machine code for the target computer. The results suggest that a compiler is best implemented in a high-level programming language, and that SCANNER and PARSER should be independent of target representations, if possible.
A Global Stochastic Modeling Framework to Simulate and Visualize Epidemics
Epidemics have caused major human and monetary losses through the course of human civilization. It is very important that epidemiologists and public health personnel are prepared to handle an impending infectious disease outbreak. The ever-changing demographics, evolving infrastructural resources of geographic regions, and emerging and re-emerging diseases compel the use of simulation to predict disease dynamics. By means of simulation, public health personnel and epidemiologists can predict the disease dynamics, the population groups at risk, and their geographic locations beforehand, so that they are prepared to respond in case of an epidemic outbreak. As a consequence of the large numbers of individuals and inter-personal interactions involved in simulating infectious disease spread in a region such as a county, sizeable amounts of data may be produced that have to be analyzed. Methods to visualize this data would be effective in helping people from diverse disciplines understand and analyze the simulation. This thesis proposes a framework to simulate and visualize the spread of an infectious disease in a population of a region such as a county. As real-world populations have a non-homogeneous demographic and spatial distribution, this framework models the spread of an infectious disease based on the population of and geographic distance between census blocks, and on social behavioral parameters for demographic groups. The population is stratified into demographic groups in individual census blocks using census data. Infection spread is modeled by means of local and global contacts generated between groups of population in census blocks. The strength and likelihood of the contacts are based on population, geographic distance, and social behavioral parameters of the groups involved. The disease dynamics are represented on a geographic map of the region using a heat map representation, where the intensity of infection is mapped to a color scale.
This framework provides a tool for public health personnel and ...
GPS CaPPture: A System for GPS Trajectory Collection, Processing, and Destination Prediction
In the United States, smartphone ownership surpassed 69.5 million in February 2011, with a large portion of those users (20%) downloading applications (apps) that enhance the usability of a device by adding functionality. A large percentage of apps are written specifically to utilize the geographical position of a mobile device. One of the prime factors in developing location prediction models is the use of historical data to train such a model. With larger sets of training data, prediction algorithms become more accurate; however, the use of historical data can quickly become a downfall if the GPS stream is not collected or processed correctly. Inaccurate, incomplete, or improperly interpreted historical data can make it impossible to develop accurate prediction algorithms. As GPS chipsets become standard in the ever-increasing number of mobile devices, the opportunity for the collection of GPS data increases remarkably. The goal of this study is to build a comprehensive system that addresses the following challenges: (1) collection of GPS data streams in a manner such that the data is highly usable and has fewer errors; (2) processing and reduction of the collected data in order to prepare it for the creation of prediction algorithms; (3) creation of prediction/labeling algorithms at such a level that they are viable for commercial use. This study identifies the key research problems toward building the CaPPture (collection, processing, prediction) system.
Graph-based Centrality Algorithms for Unsupervised Word Sense Disambiguation
This thesis introduces an innovative methodology of combining some traditional dictionary based approaches to word sense disambiguation (semantic similarity measures and overlap of word glosses, both based on WordNet) with some graph-based centrality methods, namely the degree of the vertices, PageRank, closeness, and betweenness. The approach is completely unsupervised, and is based on creating graphs for the words to be disambiguated. We experiment with several possible combinations of the semantic similarity measures as the first stage in our experiments. The next stage attempts to score individual vertices in the graphs previously created based on several graph connectivity measures. During the final stage, several voting schemes are applied to the results obtained from the different centrality algorithms. The most important contributions of this work are not only that it is a novel approach and it works well, but also that it has great potential in overcoming the new-knowledge-acquisition bottleneck which has apparently brought research in supervised WSD as an explicit application to a plateau. The type of research reported in this thesis, which does not require manually annotated data, holds promise for further work, and our work is one of the first steps, albeit a small one, in this direction. The complete system is built and tested on standard benchmarks, and is comparable with work done on graph-based word sense disambiguation as well as lexical chains. The evaluation indicates that the right combination of the above-mentioned metrics can be used to develop an unsupervised disambiguation engine as powerful as the state of the art in WSD.
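The vertex-scoring stage described above can be illustrated with the simplest of the listed measures, weighted degree centrality. In the sketch below each vertex is a (word, sense) pair, each edge carries a similarity weight between senses of different words, and the highest-scoring sense of each word is selected. The sense labels and weights are invented for illustration; they are not WordNet senses or scores from the thesis, and the function name is hypothetical.

```python
from collections import defaultdict


def disambiguate_by_degree(edges):
    """Pick, for each word, the sense whose vertex has the largest weighted degree.
    edges: {((word_a, sense_a), (word_b, sense_b)): similarity_weight}"""
    degree = defaultdict(float)
    for (u, v), weight in edges.items():
        degree[u] += weight
        degree[v] += weight
    best = {}
    for (word, sense), score in degree.items():
        if word not in best or score > best[word][1]:
            best[word] = (sense, score)
    return {word: sense for word, (sense, _) in best.items()}


# Made-up similarity weights between candidate senses of three words.
edges = {
    (("bank", "river"), ("water", "liquid")): 0.8,
    (("bank", "finance"), ("water", "liquid")): 0.1,
    (("bank", "river"), ("shore", "land")): 0.7,
    (("bank", "finance"), ("shore", "land")): 0.05,
}
chosen = disambiguate_by_degree(edges)
```

Other centrality measures such as PageRank could be substituted for the degree score in the same selection loop, and their choices combined by voting as the abstract describes.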
Graph-Based Keyphrase Extraction Using Wikipedia
Keyphrases describe a document in a coherent and simple way, giving the prospective reader a way to quickly determine whether the document satisfies their information needs. Given the huge amount of information on the Web, of which only a small number of documents have keyphrases extracted, there is a definite need for automatic keyphrase extraction systems. Typically, a document written by a human develops around one or more general concepts or sub-concepts. These concepts or sub-concepts should be structured and semantically related to each other, so that they can form a meaningful representation of the document. Considering the fact that the phrases or concepts in a document are related to each other, a new approach for keyphrase extraction is introduced that exploits the semantic relations in the document. For measuring the semantic relations between concepts or sub-concepts in the document, I present a comprehensive study aimed at using collaboratively constructed semantic resources like Wikipedia and its link structure. In particular, I introduce a graph-based keyphrase extraction system that exploits the semantic relations in the document and features such as term frequency. I evaluated the proposed system using novel measures, and the results obtained compare favorably with previously published results on established benchmarks.
Grid-based Coordinated Routing in Wireless Sensor Networks
Wireless sensor networks are battery-powered ad-hoc networks in which sensor nodes that are scattered over a region connect to each other and form multi-hop networks. These nodes are equipped with sensors such as temperature sensors, pressure sensors, and light sensors and can be queried to get the corresponding values for analysis. However, since they are battery operated, care has to be taken so that these nodes use energy efficiently. One of the areas in sensor networks where an energy analysis can be done is routing. This work explores grid-based coordinated routing in wireless sensor networks and compares the energy available in the network over time for different grid sizes.
Group-EDF: A New Approach and an Efficient Non-Preemptive Algorithm for Soft Real-Time Systems
Hard real-time systems in robotics, space and military missions, and control devices are specified with stringent and critical time constraints. On the other hand, soft real-time applications arising from multimedia, telecommunications, Internet web services, and games are specified with more lenient constraints. Real-time systems can also be distinguished in terms of their implementation into preemptive and non-preemptive systems. In preemptive systems, tasks are often preempted by higher priority tasks. Non-preemptive systems are gaining interest for implementing soft real-time applications on multithreaded platforms. In this dissertation, I propose a new algorithm that uses a two-level scheduling strategy for scheduling non-preemptive soft real-time tasks. My goal is to improve the success ratios of the well-known earliest deadline first (EDF) approach when the load on the system is very high and to improve the overall performance in both underloaded and overloaded conditions. My approach, known as group-EDF (gEDF), is based on dynamic grouping of tasks with deadlines that are very close to each other, and using a shortest job first (SJF) technique to schedule tasks within the group. I believe that grouping tasks dynamically with similar deadlines and utilizing secondary criteria, such as minimizing the total execution time, can lead to new and more efficient real-time scheduling algorithms. I present results comparing gEDF with other real-time algorithms, including EDF, best-effort, and guarantee schemes, by using randomly generated tasks with varying execution times, release times, deadlines, and tolerances to missing deadlines, under varying workloads. Furthermore, I implemented the gEDF algorithm in the Linux kernel and evaluated gEDF for scheduling real applications.
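The two-level idea behind gEDF (group tasks with nearby deadlines, then run the shortest job in the group) can be sketched in a few lines of Python. This is a simplified illustration, not the dissertation's algorithm or its Linux kernel implementation: it assumes all tasks are released at time 0, uses a hypothetical absolute `window` parameter to decide which deadlines count as "very close", and ignores tolerances to missed deadlines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    exec_time: float
    deadline: float


def gedf_order(tasks, window):
    """Non-preemptive order: repeatedly form the group of pending tasks whose
    deadlines lie within `window` of the earliest one, then pick the shortest job.
    With window=0 this degenerates to plain EDF (when deadlines are distinct)."""
    pending = sorted(tasks, key=lambda t: t.deadline)
    order = []
    while pending:
        earliest = pending[0].deadline
        group = [t for t in pending if t.deadline - earliest <= window]
        shortest = min(group, key=lambda t: t.exec_time)
        order.append(shortest)
        pending.remove(shortest)
    return order


def deadlines_met(order):
    """Run the tasks back to back from time 0 and count those finishing on time."""
    clock, met = 0.0, 0
    for task in order:
        clock += task.exec_time
        if clock <= task.deadline:
            met += 1
    return met


# An overloaded task set: total work is 8 time units, but all deadlines are <= 5.
tasks = [Task("A", 4, 4), Task("B", 2, 4.5), Task("C", 2, 5)]
edf_met = deadlines_met(gedf_order(tasks, window=0))
gedf_met = deadlines_met(gedf_order(tasks, window=1.0))
```

In this overloaded example plain EDF runs the long task A first and meets only one deadline, while the grouped order runs B and C first and meets two, which illustrates the kind of success-ratio gain the abstract targets under high load.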
High Performance Architecture using Speculative Threads and Dynamic Memory Management Hardware
With the advances in very large scale integration (VLSI) technology, hundreds of billions of transistors can be packed into a single chip. With the increased hardware budget, how to take advantage of available hardware resources becomes an important research area. Some researchers have shifted from the control flow von Neumann architecture back to dataflow architecture in order to explore scalable architectures leading to multi-core systems with several hundreds of processing elements. In this dissertation, I address how the performance of modern processing systems can be improved, while attempting to reduce hardware complexity and energy consumption. My research described here tackles both central processing unit (CPU) performance and memory subsystem performance. More specifically, I describe my research related to the design of an innovative decoupled multithreaded architecture that can be used in multi-core processor implementations. I also address how memory management functions can be off-loaded from processing pipelines to further improve system performance and eliminate cache pollution caused by runtime management functions.