In order to run and manage resource-intensive high-performance applications, large-scale computing and storage platforms have been evolving rapidly in various domains in both academia and industry. The energy consumed to operate and maintain these cloud computing infrastructures is a major factor influencing the overall profit and efficiency of most cloud service providers. Moreover, to mitigate the environmental damage caused by excessive carbon dioxide emissions, the amount of power consumed by enterprise-scale data centers should be constrained. Generally speaking, there is a trade-off between power consumption and application performance in large-scale computing systems, and how to balance these two factors has become an important topic for researchers and engineers in the cloud and HPC communities. Therefore, minimizing power usage while satisfying Service Level Agreements has become one of the most desirable objectives in cloud computing research and implementation. Since the fundamental feature of a cloud computing platform is hosting workloads with a variety of characteristics in a consolidated and on-demand manner, it is essential to explore the inherent relationship between power usage and machine configurations. With an understanding of these relationships, researchers can develop effective power management policies that optimize productivity by balancing power usage and system performance. In this dissertation, we develop an autonomic power-aware system management framework for large-scale computer systems. We propose a series of techniques including coarse-grain power profiling, VM power modelling, power-aware resource auto-configuration, and a full-system power usage simulator. These techniques help us to understand the power consumption characteristics of various system components. Based on these techniques, we are able to test various job scheduling strategies and develop resource management approaches to enhance the systems' power efficiency.
Modern wireless communication systems use spectrally efficient modulation schemes to achieve high data rates. These schemes generally involve signals with a high peak-to-average power ratio (PAPR). Moreover, the development of next-generation wireless communication systems requires power amplifiers to operate over a wide frequency band or multiple frequency bands to support different applications. These wide-band and multi-band solutions will lead to reductions in both the size and cost of the whole system. This dissertation presents several advanced power amplifier solutions that provide wide-band and multi-band operation with improved efficiency at power back-off.
In recent years, the collection of genomic data has skyrocketed, and databases of genomic data are growing at a faster rate than ever before. Although many computational methods have been developed to interpret these data, they tend to struggle to process the ever-increasing file sizes that are being produced and fail to take advantage of advances in multi-core processors through parallel processing. In some instances, loss of accuracy has been a necessary trade-off to allow faster computation of the data. This thesis discusses one such algorithm and how changes were made to allow larger input file sizes and reduce the time required to achieve a result without sacrificing accuracy. An information entropy based algorithm was used as a basis to demonstrate these techniques. The algorithm dissects the distinctive patterns underlying genomic data efficiently, requiring no a priori knowledge, and thus is applicable in a variety of biological research applications. This research describes how parallel processing and object-oriented programming techniques were used to process larger files in less time and achieve a more accurate result from the algorithm. Through object-oriented techniques, the maximum allowable input file size was significantly increased from 200 MB to 2000 MB. Using parallel processing techniques allowed the program to finish processing data in less than half the time of the sequential version. The accuracy of the algorithm was improved by reducing data loss throughout the algorithm. Finally, adding user-friendly options enabled the program to use requests more effectively and further customize the logic used within the algorithm.
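To make the notion of an information entropy measure in the preceding abstract concrete, the following is a minimal sketch of Shannon entropy computed over the k-mer distribution of a sequence. It is not the algorithm described in the thesis; the function name `shannon_entropy` and the choice of k-mers as symbols are assumptions made purely for illustration.

```python
import math
from collections import Counter


def shannon_entropy(sequence: str, k: int = 1) -> float:
    """Return the Shannon entropy (in bits) of the k-mer distribution.

    Hypothetical helper for illustration only; the thesis algorithm
    is not specified in the abstract.
    """
    if k < 1 or len(sequence) < k:
        raise ValueError("k must be >= 1 and no longer than the sequence")
    # Slide a window of width k over the sequence to collect overlapping k-mers.
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(kmers)
    total = len(kmers)
    # H = -sum(p * log2(p)) over the observed k-mer frequencies.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A sequence in which all four bases are equally frequent has an entropy of 2 bits at k = 1, while a sequence made of a single repeated base has an entropy of 0.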
Like cells in the human body, transistors are the basic building blocks of any electronic circuit. Silicon has been the industry's obvious choice for making transistors. Large transistors occupy a large chip area and consume a lot of power, and the number of functions on a chip is limited by area constraints. Thus, to make devices smaller, smarter and faster, transistors are aggressively scaled down in each generation. Moore's law states that the transistor count in electronic circuits doubles every 18 months. Following Moore's law, the transistor has already been scaled down to 14 nm. However, there are limitations to how much further these transistors can be scaled down. Particularly below 10 nm, silicon-based transistors hit fundamental limits such as loss of gate control, high leakage and various other short-channel effects. Thus silicon transistors may not remain the favored choice for future electronics applications. As a result, research has shifted to new device concepts and device materials as alternatives to silicon. Carbon is another abundant element found in the Earth, and one such carbon-based nanomaterial is graphene. Graphene, when extracted from graphite, the same material used as the lead in pencils, has tremendous potential to take future electronic devices to new heights in terms of size, cost and efficiency. Thus, since its first experimental discovery in 2004, graphene has been a leading research area for both academia and industry. This dissertation is focused on the analysis and optimization of graphene-based circuits for future electronics. The first part of this dissertation considers graphene-based transistors for analog/radio frequency (RF) circuits. In this section, a dual-gate Graphene Field Effect Transistor (GFET) is considered to build the case study circuits like voltage controlled oscillator (VCO) and low ...
Wireless sensor networks are composed of sensor nodes, which can monitor an environment and observe events of interest. These networks are applied in various fields including but not limited to environmental, industrial and habitat monitoring. In many applications, the exact location of the sensor nodes is unknown after deployment. Localization is the process used to find a sensor node's positional coordinates, which are vital information. Localization is generally assisted by anchor nodes, which are also sensor nodes but with known locations. Anchor nodes are generally expensive and need to be optimally placed for effective localization. Passive localization is a localization technique in which the sensor nodes silently listen to global events such as thunder, seismic waves, and lightning. According to previous studies, the ideal location for anchor nodes is on the perimeter of the sensor network. This may not be the case in passive localization, since the function of anchor nodes here differs from that of the anchor nodes used in other localization systems. I conduct extensive studies on positioning anchor nodes for effective localization. Several simulations are run in dense and sparse networks to determine the proper positioning of anchor nodes. I show that, for effective passive localization, the optimal placement of the anchor nodes is at the center of the network in such a way that no three anchor nodes are collinear. The further the anchor nodes are from being collinear, the better the localization. The localization for our network design proves better when I place anchor nodes at right angles.
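The placement condition in the preceding abstract, that no three anchor nodes lie on one line, can be checked with a simple geometric test. The sketch below is an illustration under stated assumptions, not code from the thesis: anchors are assumed to be 2-D `(x, y)` tuples, and the names `collinear` and `has_collinear_triple` are hypothetical.

```python
from itertools import combinations


def collinear(p, q, r, tol=1e-9):
    """True if the 2-D points p, q, r lie (numerically) on one line."""
    # Twice the signed area of triangle pqr; zero means the points are collinear.
    area2 = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return abs(area2) <= tol


def has_collinear_triple(anchors, tol=1e-9):
    """True if any three anchor positions are collinear (brute force, O(n^3))."""
    return any(collinear(p, q, r, tol) for p, q, r in combinations(anchors, 3))
```

Three anchors at `(0, 0)`, `(1, 0)` and `(0, 1)` form a right angle and pass the check, whereas adding anchors along a diagonal such as `(0, 0)`, `(1, 1)`, `(2, 2)` fails it.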
This thesis explores the classification of emotions in song lyrics, using automatic approaches applied to a novel corpus of 100 popular songs. I use crowdsourcing via Amazon Mechanical Turk to collect line-level emotion annotations for this collection of song lyrics. I then build classifiers that rely on textual features to automatically identify the presence of one or more of the following six Ekman emotions: anger, disgust, fear, joy, sadness and surprise. I compare different classification systems and evaluate the performance of the automatic systems against the manual annotations. I also introduce a system that uses data collected from the social network Twitter. I use the Twitter API to collect a large corpus of tweets manually labeled by their authors for one of the six emotions of interest. I then compare the classification of emotions obtained when training on data automatically collected from Twitter versus data obtained through crowdsourced annotations.
The effectiveness of colonoscopy depends on the quality of the inspection of the colon. There has been no automated measurement method to evaluate the quality of the inspection. This thesis addresses this issue by investigating an automated post-procedure quality measurement technique and proposing a novel approach that automatically determines the percentage of stool areas in images of digitized colonoscopy video files. It involves the classification of image pixels based on their color features using a new method of planes in RGB (red, green and blue) color space. The limitation of post-procedure quality measurement is that quality measurements are available only long after the procedure is done and the patient has been released. A better approach is to flag any sub-optimal inspection immediately so that the endoscopist can improve the quality in real time during the procedure. This thesis also proposes an extension of the post-procedure method to detect stool, bite-block, and blood regions in real time using color features in HSV color space. These three objects play a major role in quality measurements in colonoscopy. The proposed method partitions a very large set of positive examples of each of these objects into a number of groups. These groups are formed by taking the intersection of the positive examples with a hyperplane. This hyperplane is called the 'positive plane'. 'Convex hulls' are used to model positive planes. Comparisons with traditional classifiers such as K-nearest neighbor (K-NN) and support vector machines (SVM) demonstrate the soundness of the proposed method in terms of accuracy and speed, which are critical in the targeted real-time quality measurement system.
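As a rough illustration of classifying pixels with a plane in RGB space and reporting the percentage of positive pixels, as mentioned in the preceding abstract, the sketch below counts the pixels that fall on the non-negative side of a single plane. The thesis method itself is not specified in the abstract; the function name `positive_pixel_percentage`, the single-plane rule, and the plane coefficients in the usage note are all assumptions.

```python
def positive_pixel_percentage(pixels, normal, offset):
    """Percentage of RGB pixels on the non-negative side of a plane.

    pixels : iterable of (r, g, b) tuples
    normal : (a, b, c) plane normal (hypothetical values, not from the thesis)
    offset : plane offset d, so the decision rule is a*r + b*g + c*b + d >= 0
    """
    pixels = list(pixels)
    if not pixels:
        return 0.0
    a, b, c = normal
    # Count pixels whose signed distance to the plane is non-negative.
    positive = sum(
        1 for (r, g, bl) in pixels if a * r + b * g + c * bl + offset >= 0
    )
    return 100.0 * positive / len(pixels)
```

For example, with the made-up plane `normal=(1, 1, -2)`, `offset=-100`, yellowish-brown pixels such as `(200, 150, 50)` land on the positive side and bluish or black pixels do not.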
Shadows in indoor scenarios are usually characterized by multiple light sources that produce complex shadow patterns of a single object. Unless shadows are removed, the foreground object tends to be erroneously segmented. The inconsistent hue and intensity of shadows make automatic removal a challenging task. In this thesis, a dynamic thresholding and transfer learning-based method for removing shadows is proposed. The method suppresses light shadows with a dynamically computed threshold and removes dark shadows using an online learning strategy that is built upon a base classifier trained with manually annotated examples and refined with automatically identified examples from the new videos. Experimental results demonstrate that, despite variation in lighting conditions across videos, our proposed method is able to adapt to the videos and remove shadows effectively. The sensitivity of shadow detection changes slightly with the confidence level used in example selection for classifier retraining, and a high confidence level usually yields better performance with fewer retraining iterations.
Globally distributed software teams are widespread throughout industry. But finding reliable methods that can properly assess a team's activities is a real challenge. Methods such as surveys and manual coding of activities are too time consuming and are often unreliable. Recent advances in information retrieval and linguistics, however, suggest that automated and/or semi-automated text classification algorithms could be an effective way of finding differences in the communication patterns among individuals and groups. Communication among group members is frequent and generates a significant amount of data. Thus having a web-based tool that can automatically analyze the communication patterns among global software teams could lead to a better understanding of group performance. The goal of this thesis, therefore, is to compare automatic and semi-automatic measures of communication and evaluate their effectiveness in classifying different types of group activities that occur within a global software development project. In order to achieve this goal, we developed a web-based component that can be used to help clean and classify communication activities. The component was then used to compare different automated text classification techniques on various group activities to determine their effectiveness in correctly classifying data from a global software development team project.
The increasingly popular cloud-computing paradigm provides on-demand access to computing and storage with the appearance of unlimited resources. Users are given access to a variety of data and software utilities to manage their work. Users rent virtual resources and pay for only what they use. In spite of the many benefits that cloud computing promises, the lack of dependability in shared virtualized infrastructures is a major obstacle for its wider adoption, especially for mission-critical applications. Virtualization and multi-tenancy increase system complexity and dynamicity. They introduce new sources of failure degrading the dependability of cloud computing systems. To assure cloud dependability, in my dissertation research, I develop autonomic failure identification and diagnosis techniques that are crucial for understanding emergent, cloud-wide phenomena and self-managing resource burdens for cloud availability and productivity enhancement. We study the runtime cloud performance data collected from a cloud test-bed and by using traces from production cloud systems. We define cloud signatures including those metrics that are most relevant to failure instances. We exploit profiled cloud performance data in both time and frequency domain to identify anomalous cloud behaviors and leverage cloud metric subspace analysis to automate the diagnosis of observed failures. We implement a prototype of the anomaly identification system and conduct the experiments in an on-campus cloud computing test-bed and by using the Google datacenter traces. Our experimental results show that our proposed anomaly detection mechanism can achieve 93% detection sensitivity while keeping the false positive rate as low as 6.1% and outperform other tested anomaly detection schemes. In addition, the anomaly detector adapts itself by recursively learning from these newly verified detection results to refine future detection.
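The two figures reported in the preceding abstract, detection sensitivity and false positive rate, are both derived from confusion-matrix counts. The sketch below shows the arithmetic. The function name `detection_rates` is hypothetical, and the counts in the usage note are invented solely to reproduce rates of the same magnitude; they are not data from the dissertation.

```python
def detection_rates(tp, fn, fp, tn):
    """Return (sensitivity, false_positive_rate) from confusion-matrix counts.

    sensitivity         = TP / (TP + FN)   share of real anomalies detected
    false positive rate = FP / (FP + TN)   share of normal samples flagged
    """
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("need at least one anomalous and one normal sample")
    return tp / (tp + fn), fp / (fp + tn)
```

With illustrative counts of 93 detected and 7 missed anomalies, and 61 false alarms among 1000 normal samples, `detection_rates(93, 7, 61, 939)` gives a sensitivity of 0.93 and a false positive rate of 0.061.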
In many real-world applications, it is common to have an uneven number of examples among multiple classes. This data imbalance, however, usually complicates the learning process, especially for the minority classes, and results in deteriorated performance. Boosting methods were proposed to handle the imbalance problem. These methods need long training times and require diversity among the classifiers of the ensemble to achieve improved performance. Additionally, extending the boosting method to handle multi-class data sets is not straightforward. Examples of applications that suffer from imbalanced multi-class data can be found in face recognition, where tens of classes exist, and in capsule endoscopy, which suffers from massive imbalance between the classes. This dissertation introduces RegBoost, a new boosting framework to address imbalanced, multi-class problems. This method applies a weighted stratified sampling technique and incorporates a regularization term that accommodates multi-class data sets and automatically determines the error bound of each base classifier. The regularization parameter penalizes the classifier when it misclassifies instances that were correctly classified in the previous iteration. The parameter additionally reduces the bias towards majority classes. Experiments are conducted using 12 diverse data sets with moderate to high imbalance ratios. The results demonstrate superior performance of the proposed method compared to several state-of-the-art algorithms for imbalanced, multi-class classification problems. More importantly, the sensitivity improvement for the minority classes using RegBoost is accompanied by an improvement in the overall accuracy for all classes. With unpredictability regularization, a diverse group of classifiers is created and the maximum accuracy improvement reaches above 24%. Using stratified undersampling, RegBoost exhibits the best efficiency. The reduction in computational cost is significant, reaching above 50%. As the volume of training data increases, the efficiency gain of the proposed method becomes more significant.
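The idea of stratified undersampling mentioned in the preceding abstract can be sketched as follows. This is plain, unweighted undersampling to the size of the smallest class, which is a simplification and not RegBoost's weighted sampling scheme; the function name `stratified_undersample` and the `seed` parameter are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_undersample(labels, seed=0):
    """Return sorted indices that keep an equal number of examples per class.

    Every class is reduced to the size of the smallest class. This is a
    simplified, unweighted stand-in for the sampling described in the text.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    if not by_class:
        return []
    target = min(len(indices) for indices in by_class.values())
    rng = random.Random(seed)  # fixed seed for reproducible draws
    chosen = []
    for label in sorted(by_class):
        # Draw `target` indices without replacement from each class.
        chosen.extend(rng.sample(by_class[label], target))
    return sorted(chosen)
```

For a label list with 10, 3 and 5 examples in three classes, the result contains 3 indices per class, so each boosting round in such a scheme would train on 9 rather than 18 examples.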
We have a huge amount of video data from widely available surveillance cameras, along with rapidly improving technology for recording the motion of a moving object in the form of trajectory data. With the proliferation of location-enabled devices, ongoing growth in smartphone penetration, and advances in image processing techniques, tracking moving objects is becoming increasingly achievable. In this work, we explore some domain-independent qualitative and quantitative features in raw trajectory (spatio-temporal) data in videos captured by a single fixed wide-angle camera sensor in outdoor areas. We study the efficacy of those features in classifying four basic high-level actions by employing two supervised learning algorithms, and show how each of the features affects the learning algorithms' overall accuracy, both as a single factor and in combination with others.
Mitigation response plans must be created to protect affected populations during biological emergencies resulting from the release of harmful biochemical substances. Medical countermeasures have been stockpiled by the federal government for such emergencies. However, it is the responsibility of local governments to maintain solid, functional plans to apply these countermeasures to the entire target population within short, mandated time frames. Further, vulnerabilities in the population may serve as barriers preventing certain individuals from participating in mitigation activities. Therefore, functional response plans must be capable of reaching vulnerable populations. Transportation vulnerability results from lack of access to transportation. Transportation-vulnerable populations located too far from mitigation resources are at risk of not being able to participate in mitigation activities. Quantification of these populations requires the development of computational methods to integrate spatial demographic data and transportation resource data from disparate sources into the context of planned mitigation efforts. Research described in this dissertation focuses on quantifying transportation-vulnerable populations and maximizing participation in response efforts. Algorithms developed as part of this research are integrated into a computational framework to promote a transition from research and development to deployment and use by biological emergency planners.
Publicly available datasets in health science are often large and observational, in contrast to experimental datasets where a small amount of data is collected in controlled experiments. The causal relationships among variables in an observational dataset are yet to be determined. However, there is significant interest in health science in discovering and analyzing causal relationships from health data, since identified causal relationships will greatly help medical professionals prevent diseases or mitigate their negative effects. Recent advances in computer science, particularly in Bayesian networks, have initiated a renewed interest in causality research. Causal relationships can potentially be discovered by learning the network structures from data. However, the number of candidate graphs grows at a more than exponential rate as the number of variables increases. Exact learning of the optimal structure is thus computationally infeasible in practice. As a result, heuristic approaches are imperative to alleviate the computational difficulty. This research provides effective and efficient learning tools for local causal discovery and novel methods of learning causal structures in combination with background knowledge. Specifically, in the direction of constraint-based structural learning, polynomial-time algorithms for constructing causal structures are designed using first-order conditional independence. Algorithms for efficiently discovering non-causal factors are developed and proved. In addition, when the background knowledge is partially known, methods of graph decomposition are provided so as to reduce the number of conditioned variables. Experiments on both synthetic data and real epidemiological data indicate that the provided methods are applicable to large-scale datasets and scalable for causal analysis in health data. Following the research methods and experiments, this dissertation discusses the reliability of causal discoveries in computational health science research, their complexity, and their implications for health science research.
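The claim in the preceding abstract that the number of candidate graphs grows faster than exponentially can be made concrete by counting labeled directed acyclic graphs (DAGs), the structures underlying Bayesian networks. The sketch below uses Robinson's well-known recurrence; it is supplied as an illustration and is not taken from the dissertation. The function name `count_dags` is an assumption.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n: int) -> int:
    """Number of labeled DAGs on n nodes, via Robinson's recurrence:

        a(0) = 1
        a(n) = sum_{k=1..n} (-1)^(k+1) * C(n, k) * 2^(k*(n-k)) * a(n-k)
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )
```

The counts for one to five variables are 1, 3, 25, 543 and 29281, which is why exhaustive search over structures quickly becomes infeasible and heuristic or constraint-based methods are used instead.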
POD (Point of Dispensing)-based emergency response plans involving mass prophylaxis may seem feasible when considering the choice of dispensing points within a region, overall population density, and estimated traffic demands. However, a plan may fail to serve particular vulnerable sub-populations, resulting in access disparities during emergency response. Federal authorities emphasize the need to identify sub-populations that cannot use regular services during an emergency because of their special needs, in order to ensure an effective response. Vulnerable individuals require the targeted allocation of appropriate resources to serve their special needs. Devising schemes to address the needs of vulnerable sub-populations is essential for the effectiveness of response plans. This research focuses on data-driven computational methods to quantify and address vulnerabilities in response plans that require the allocation of targeted resources. Data-driven methods to identify and quantify vulnerabilities in response plans are developed as part of this research. Addressing vulnerabilities requires the targeted allocation of appropriate resources to PODs. The problem of resource allocation to PODs during public health emergencies is introduced, and variants of the resource allocation problem, such as the spatial allocation, spatio-temporal allocation and optimal resource subset variants, are formulated. Generating optimal resource allocation and scheduling solutions can be computationally hard. The application of metaheuristic techniques to find near-optimal solutions to the resource allocation problem in response plans is investigated. A vulnerability analysis and resource allocation framework is described that facilitates the demographic analysis of population data in the context of response plans and the optimal allocation of resources with respect to that analysis.
Blood pressure is vital sign information that physicians often need as preliminary data for immediate intervention during emergency situations or for regular monitoring of people with cardiovascular diseases. Despite the availability of portable blood pressure meters in the market, they are not regularly carried by people, creating a need for an ultra-portable measurement platform or device that can be easily carried and used at all times. One such device is the smartphone which, according to a comScore survey, is used by 26.2% of the US adult population. The mass production of these phones with built-in sensors and high computation power has created numerous possibilities for application development in different domains, including the biomedical domain. Motivated by this capability and their extensive usage, this thesis focuses on developing a blood pressure measurement platform on smartphones. Specifically, I developed a blood pressure measurement system on a smartphone using the built-in camera and a customized external microphone. The system first obtains heart beats using the microphone and the finger pulse using the camera, and then calculates the blood pressure from the recorded data. I developed techniques for finding the best location for obtaining the data, making the system usable by all categories of people. The proposed system achieved accuracies between 90% and 100% when compared to traditional blood pressure meters. The second part of this thesis presents a new system for remote heart beat monitoring using the smartphone. With the proposed system, heart beats can be transferred live by patients and monitored by physicians remotely for diagnosis. The proposed blood pressure measurement and remote monitoring systems will be able to facilitate information acquisition and decision making by 9-1-1 operators.
Managing large-scale dynamical systems (e.g., transportation systems, complex information systems, and power networks) in real time is very challenging considering their complicated system dynamics, intricate network interactions, large scale, and especially the existence of various uncertainties. To address this issue, intelligent techniques that can quickly design decision-making strategies robust to uncertainties are needed. This dissertation aims to address these challenges by exploring a data-driven decision-making framework, which leverages big-data techniques and scalable uncertainty evaluation approaches to quickly solve optimal control problems. In particular, the following techniques have been developed along this direction: 1) system modeling approaches to simplify the system analysis and design procedures for multiple applications; 2) effective simulation-based and analytical approaches to efficiently evaluate system performance and design control strategies under uncertainty; and 3) big-data techniques that allow some computations of control strategies to be completed offline. These techniques and tools for analysis, design and control contribute to a wide range of applications including air traffic flow management, complex information systems, and airborne networks.
Mobile phone advancements and ubiquitous internet connectivity are resulting in ever-expanding possibilities in the application of smartphones. Users of mobile phones are now capable of hosting server applications from their personal devices. Whether providing services individually or in an ad hoc network setting, the devices are currently not configured for defending against distributed denial of service (DDoS) attacks. These attacks, often launched from a botnet, have existed in the space of personal computing for decades but have recently begun showing up on mobile devices. Research is done first into the steps required to develop a potential botnet on the Android platform. This includes testing the amount of malicious traffic an Android phone would be capable of generating for a DDoS attack. On the other end of the spectrum is the need for mobile devices running networked applications to develop security against DDoS attacks. For this, mobile phones are set up with web servers running Apache to simulate users running internet-connected applications, either for local ad hoc networks or for serving to the internet. Testing is done on the viability of using commonly available modules developed for Apache and intended for servers, as well as on finding the baseline capabilities of mobile devices to handle higher traffic volumes. Given the unique challenge of the limited resources a mobile phone can dedicate to Apache compared to a dedicated hosting server, a new method was needed. A defense algorithm is developed for mitigating DDoS attacks against the mobile server that takes into account the limited resources available on the mobile device. The algorithm is tested for effectiveness against TCP socket flooding and shown to perform better than the common Apache module installations on a mobile device.
Free and fair elections are the basis for democracy, but conducting elections is not an easy task. Different groups of people try to influence the outcome of an election in their favor using a range of methods, from campaigning for a particular candidate to well-financed lobbying. Often the stakes are high, and the methods are illegal. Two main properties of any voting scheme are the privacy of a voter’s choice and the integrity of the tally. Unfortunately, these two properties conflict with each other. Integrity requires making elections transparent and auditable, but at the same time, we must preserve a voter’s privacy. There is always a trade-off between these two requirements. Current voting schemes favor privacy over auditability, and thus, they are vulnerable to voting fraud. I propose two novel voting systems that can achieve both privacy and verifiability. The first protocol is based on cryptographic primitives to ensure the integrity of the final tally and the privacy of the voter. The second protocol is a simple paper-based voting scheme that achieves almost the same level of security without the use of cryptography.
Environmental monitoring represents a major application domain for wireless sensor networks (WSN). However, despite significant advances in recent years, there are still many challenging issues to be addressed to exploit the full potential of the emerging WSN technology. In this dissertation, we introduce the design and implementation of low-power wireless sensor networks for long-term, autonomous, and near-real-time environmental monitoring applications. We have developed an out-of-box solution consisting of a suite of software, protocols and algorithms to provide reliable data collection with extremely low power consumption. Two wireless sensor networks based on the proposed solution have been deployed in remote field stations to monitor soil moisture along with other environmental parameters. As parts of the ever-growing environmental monitoring cyberinfrastructure, these networks have been integrated into the Texas Environmental Observatory system for long-term operation. Environmental measurement and network performance results are presented to demonstrate the capability, reliability and energy-efficiency of the network.
The recent growth in sensor technology allows easier information gathering in real time as sensors have grown smaller, more accurate, and less expensive. The resulting data is often in a geo-stream format: continuously changing input with a spatial extent. Researchers developing geo-streaming management systems (GSMS) require a benchmark system for evaluation, which is currently lacking. This thesis presents GSMark, a benchmark for evaluating GSMSs. GSMark provides a data generator that creates a combination of synthetic and real geo-streaming data, a workload simulator to present the data to the GSMS as a data stream, and a set of benchmark queries that evaluate typical GSMS functionality and query performance. In particular, GSMark generates both moving points and evolving spatial regions, two fundamental data types for a broad range of geo-stream applications, and the geo-streaming queries on this data.
Recent reports suggest that objective quality measurement is essential to the success of colonoscopy. Several quality indicators (i.e., metrics) proposed in recent studies are implemented in software systems that compute real-time quality scores for routine screening colonoscopy. Most quality metrics are derived from various temporal events that occur during the colonoscopy procedure. The location of the phase boundary between the insertion and withdrawal phases and the amount of circumferential inspection are two such important temporal events. These two temporal events can be determined by analyzing various camera motions of the colonoscope. This dissertation puts forward a novel method to estimate the X, Y and Z directional motions of the colonoscope using motion vector templates. Since abnormalities in a wireless capsule endoscopy (WCE) or colonoscopy video can be found in only a small number of frames (around 5% of the total frames), it is very helpful if a computer system can decide whether a frame has any mucosal abnormalities. Also, the number of abnormal lesions detected during a procedure is used as a quality indicator. The majority of existing abnormality detection methods focus on detecting only one type of abnormality, or their overall accuracy is somewhat low when they try to detect multiple abnormalities. Most abnormalities in endoscopy images have unique textures which are clearly distinguishable from normal textures. In this dissertation, a new method is proposed that achieves the objective of detecting multiple abnormalities with higher accuracy using a multi-texture analysis technique. The multi-texture analysis method is designed by representing WCE and colonoscopy image textures as textons.
Several types of disorders affect the colon's ability to function properly, such as colorectal cancer, ulcerative colitis, diverticulitis, irritable bowel syndrome, and colonic polyps. Automatic detection of these diseases would inform the endoscopist of possible sub-optimal inspection during the colonoscopy procedure and save time during post-procedure evaluation. However, existing systems detect only a few of these disorders, such as colonic polyps. In this dissertation, we address the automatic detection of another important disorder, ulcerative colitis. We propose a novel texture feature extraction technique to detect the severity of ulcerative colitis at the block, image, and video levels. We also enhance current informative frame filtering methods by detecting water and bubble frames using our proposed technique. Our feature extraction algorithm, based on the accumulation of pixel value differences, provides better accuracy at faster speed than existing methods, making it highly suitable for real-time systems. We also propose a hybrid approach in which our feature method is combined with existing feature method(s) to provide even better accuracy. We extend the block and image level detection method to video level severity score calculation and shot segmentation. In addition, the proposed feature extraction method can detect water and bubble frames in colonoscopy videos with very high accuracy in significantly less processing time, even when clustering is used to reduce the training set size by a factor of 10.
Data is everywhere. Current technological advances in digital and social media, and the ease with which application services interact with a variety of systems, are generating tremendous volumes of data. Because of such varied services, data is no longer restricted to structured formats such as text; it also includes unstructured content such as social media data, videos, and images. Generated data is of no use unless it is stored and analyzed to derive value. Traditional database systems come with limitations on data format, schema, access rates, and storage sizes. Hadoop is an Apache open-source distributed framework that supports storing huge datasets of differently formatted data reliably on its file system, the Hadoop Distributed File System (HDFS), and processing the data stored on HDFS using the MapReduce programming model. This thesis builds a data architecture using Hadoop and related open-source distributed frameworks to support a data flow pipeline on low-cost commodity hardware. The data flow components are data sourcing, storage management on HDFS, and a data access layer. The study also discusses a use case that exercises the architecture components. Sqoop is used to ingest structured data from a database into Hadoop, and Flume is used to ingest semi-structured streaming Twitter JSON data into HDFS for analysis. The data sourced using Sqoop and Flume is analyzed using Hive for SQL-like analytics, and at a higher level of the data access layer, Hadoop is compared with an in-memory computing system, Spark. Significant differences in query execution performance are observed between the Hadoop and Spark frameworks. This integration helps in ingesting huge Volumes of streaming JSON Variety data to derive better Value based analytics using Hive and ...
As vehicle manufacturers continue to increase their emphasis on safety with advanced driver assistance systems (ADAS), I propose a ubiquitous device that is able to analyze and advise on safety conditions. Smartphones are increasing in popularity among younger generations, with an estimated 64% of 25-34 year olds already using one in their daily lives. With over 10 million car accidents reported in the United States each year, car manufacturers have shifted their focus from a passive approach (airbags) to a more active one by adding ADAS features (lane departure warnings). However, vehicles manufactured with these sensors are not economically priced, while older vehicles might have only passive safety features. Given its accessibility and portability, I target the smartphone as a device to complement ADAS that can bring driver assistance to any vehicle without regard for on-vehicle communication system requirements. I use the 3-axis accelerometers of multiple Android-based smartphones to record and analyze various safety factors that can influence a driver while operating a vehicle. These factors, which relate to the driver, vehicle, and road, are lane change maneuvers, vehicular comfort, and road conditions. Each factor could be hazardous to the health of the driver, the neighboring public, and the automobile, and is therefore analyzed thoroughly, achieving 85.60% and 89.89% classification accuracy for identifying road anomalies and lane changes, respectively. Effective use of this data can educate a potentially dangerous driver on how to operate a vehicle safely and efficiently. With real-time analysis and auditory alerts of these factors, I hope to increase a driver's overall awareness to maximize safety.
Colonoscopy is an endoscopic technique that allows a physician to inspect the mucosa of the human colon. Previous methods and software solutions for detecting informative frames in a colonoscopy video (a process called informative frame filtering, or IFF) have been largely ineffective in (1) covering the proper definition of an informative frame in the broadest sense and (2) striking an optimal balance between accuracy and speed of classification in both real-time and non-real-time medical procedures. In my thesis, I propose a more effective method and faster software solutions for IFF. The method is more effective due to the introduction of a heuristic algorithm (derived from experimental analysis of typical colon features) for classification, which contributed to a 5-10% boost in various performance metrics for IFF. The software modules are faster due to the incorporation of parallel-processing-oriented coding techniques on modern microprocessors. Two IFF modules were created, one for post-procedure use and the other for real-time use. Code optimizations through NVIDIA CUDA for GPU processing and/or CPU multi-threading, reflecting two significant microprocessor design philosophies (multi-core design and many-core design), resulted in a 5-fold acceleration for the post-procedure module and a 40-fold acceleration for the real-time module. Some innovative software modules, which are still in the testing phase, have recently been created to exploit the power of multiple GPUs together.
The introduction of 4G LTE brought multiple new technologies, and multiple-input multiple-output (MIMO) is one of the most important among them. The main MIMO modes used in LTE are the open-loop and closed-loop spatial multiplexing modes. This thesis develops an algorithm to calculate the threshold values of UE speed and SNR that are required to implement a switching algorithm, which switches a UE between MIMO modes based on its speed and channel conditions (CSI). Specifically, this thesis provides the values of UE speed and SNR at which better results are obtained by switching between open-loop and closed-loop MIMO modes, so that the UE can then be scheduled in sub-channels accordingly. The results can thus be used to obtain better channel capacity with less ISI. The main objectives of this thesis are to determine the MIMO mode suitable for a UE moving at a given speed, to determine the effects of SNR on the selection of MIMO modes, and to design and implement a scheduling algorithm that enhances channel capacity.
Extracting information from a stack of data is a tedious task, and the scenario is no different in proteomics. Volumes of research papers are published on the study of various proteins in several species, their interactions with other proteins, and the identification of proteins as possible biomarkers of disease. It is a challenging task for biologists to keep track of these developments manually by reading through the literature. Several tools have been developed by computational linguists to assist in the identification, extraction, and hypothesis generation of proteins and protein-protein interactions from biomedical publications and protein databases. However, they are confronted with the challenges of term variation, term ambiguity, access only to abstracts, and inconsistencies in the time-consuming manual curation of protein and protein-protein interaction repositories. This work attempts to attenuate these challenges by extracting protein-protein interactions in humans and eliciting possible interactions using association rule mining on full text, abstracts, and figure captions from publicly available biomedical literature databases. Two such databases are used in our study: the Directory of Open Access Journals (DOAJ) and PubMed Central (PMC). A corpus is built from articles retrieved using search terms. A dataset of more than 38,000 protein-protein interactions from the Human Protein Reference Database (HPRD) is cross-referenced to validate discovered interacting pairs. A set of possible binary protein-protein interactions of an optimal size is generated and made available for clinical or biological validation. A significant change in the number of new associations was found when the thresholds for the support and confidence metrics were altered. This study reduces the limitations biologists face in keeping pace with the discovery of protein-protein interactions by manually reading the literature, and reduces their need to validate each and every possible interaction.
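The role of the support and confidence thresholds mentioned above can be illustrated with a minimal sketch. It is not the study's pipeline: the function name `pair_rules`, the toy corpus, and the threshold values are assumptions made for illustration, and each "document" is simply a set of protein names that are mentioned together in one text unit.

```python
from collections import Counter
from itertools import combinations


def pair_rules(documents, min_support=0.5, min_confidence=0.6):
    """Return sorted (lhs, rhs, support, confidence) tuples for rules lhs -> rhs.

    support(a, b)   = fraction of documents mentioning both a and b
    confidence(a→b) = count(a and b) / count(a)
    """
    n = len(documents)
    item_counts = Counter()
    pair_counts = Counter()
    for proteins in documents:
        unique = sorted(set(proteins))
        item_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Each co-occurring pair yields two candidate directed rules.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


# Hypothetical toy corpus (protein names chosen only as examples).
docs = [
    {"TP53", "MDM2"},
    {"TP53", "MDM2", "BRCA1"},
    {"TP53", "BRCA1"},
    {"MDM2"},
]
loose = pair_rules(docs, min_support=0.5, min_confidence=0.6)
strict = pair_rules(docs, min_support=0.5, min_confidence=0.9)
print(len(loose), len(strict))
```

Raising the confidence threshold shrinks the candidate set, which is the kind of sensitivity to thresholds the abstract reports; in practice the surviving pairs would still need to be cross-referenced against a curated resource such as HPRD.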
Students often use the web as a source of help for problems that they encounter on programming assignments. In this work, we seek to understand how students use the web to search for help on their assignments. We used a mixed methods approach with 344 students who completed a survey and 41 students who participated in focus group meetings and helped record data about their search habits. The survey reveals student-reported search habits, while the focus group used a web browser plug-in to record actual search patterns. We examine the results collectively and broken down by class year. Survey results show that at least two-thirds of the students from each class year rely on search engines to locate resources for help with their programming bugs in at least half of their assignments; search habits vary by class year; and the value of different types of resources, such as tutorials and forums, varies by class year. Focus group results expose the high-frequency websites used by the students in solving their programming assignments.
Hand and arm gestures are an effective way to communicate when you don't want to be heard: they are quieter and often more reliable than whispering into a radio microphone. In recent years, hand gesture identification has become a major active area of research due to its use in various applications. The objective of my work is to develop an integrated sensor system that will enable tactical squads and SWAT teams to communicate when there is no line of sight or when obstacles are present. The gesture set used in this work is the standardized set of hand signals for close range engagement operations used by military and SWAT teams. These gestures are broadly divided into finger movements and arm movements. The core components of the integrated sensor system are surface EMG sensors, bend (flex) sensors, and accelerometers. Surface EMG is the electrical activity produced by muscle contractions and measured by sensors attached directly to the skin. Bend sensors use a piezoresistive material to detect bending; the sensor output is determined by both the angle between the ends of the sensor and the flex radius. Accelerometers sense dynamic acceleration and inclination in three directions simultaneously. EMG sensors are placed on the upper and lower forearm and assist in the classification of finger and wrist movements. Bend sensors are mounted on a glove that is worn on the hand. They are located over the first knuckle of each finger and can determine whether the finger is bent. An accelerometer is attached to the glove at the base of the wrist and determines the speed and direction of the arm movement. A support vector machine (SVM) classifier is used to classify the gestures.
Automatic text summarization and keyphrase extraction are two interesting areas of research which extend along natural language processing and information retrieval. They have recently become very popular because of their wide applicability. Devising generic techniques for these tasks is challenging due to several issues. Yet we have a good number of intelligent systems performing the tasks. As different systems are designed with different perspectives, evaluating their performances with a generic strategy is crucial. It has also become immensely important to evaluate the performances with minimal human effort. In our work, we focus on designing a relativized scale for evaluating different algorithms. This is our major contribution which challenges the traditional approach of working with an absolute scale. We consider the impact of some of the environment variables (length of the document, references, and system-generated outputs) on the performance. Instead of defining some rigid lengths, we show how to adjust to their variations. We prove a mathematically sound baseline that should work for all kinds of documents. We emphasize automatically determining the syntactic well-formedness of the structures (sentences). We also propose defining an equivalence class for each unit (e.g. word) instead of the exact string matching strategy. We show an evaluation approach that considers the weighted relatedness of multiple references to adjust to the degree of disagreements between the gold standards. We publish the proposed approach as a free tool so that other systems can use it. We have also accumulated a dataset (scientific articles) with a reference summary and keyphrases for each document. Our approach is applicable not only for evaluating single-document based tasks but also for evaluating multiple-document based tasks. 
We have tested our evaluation method for three intrinsic tasks (taken from DUC 2004 conference), and in all three cases, it correlates positively with ROUGE. Based on our experiments ...
This research is concerned with the identification of sentiment in multimodal content. This is of particular interest given the increasing presence of subjective multimodal content on the web and other sources, which contains a rich and vast source of people's opinions, feelings, and experiences. Despite the need for tools that can identify opinions in the presence of diverse modalities, most of current methods for sentiment analysis are designed for textual data only, and few attempts have been made to address this problem. The dissertation investigates techniques for augmenting linguistic representations with acoustic, visual, and physiological features. The potential benefits of using these modalities include linguistic disambiguation, visual grounding, and the integration of information about people's internal states. The main goal of this work is to build computational resources and tools that allow sentiment analysis to be applied to multimodal data. This thesis makes three important contributions. First, it shows that modalities such as audio, video, and physiological data can be successfully used to improve existing linguistic representations for sentiment analysis. We present a method that integrates linguistic features with features extracted from these modalities. Features are derived from verbal statements, audiovisual recordings, thermal recordings, and physiological sensors signals. The resulting multimodal sentiment analysis system is shown to significantly outperform the use of language alone. Using this system, we were able to predict the sentiment expressed in video reviews and also the sentiment experienced by viewers while exposed to emotionally loaded content. Second, the thesis provides evidence of the portability of the developed strategies to other affect recognition problems. We provided support for this by studying the deception detection problem. 
Third, this thesis contributes several multimodal datasets that will enable further research in sentiment and deception detection.
The design of VLSI electronic circuits can be carried out at many different abstraction levels, from system behavior down to the most detailed physical layout level. As the number of transistors in VLSI circuits increases, the complexity of the design also increases, and it is now beyond human ability to manage. Hence, CAD (Computer-Aided Design) or EDA (Electronic Design Automation) tools are involved in the design. EDA or CAD tools automate the design, verification, and testing of these VLSI circuits. Many EDA tools are available in today's market. However, they are very expensive and require high-performance platforms. One of the key challenges today is to select appropriate open-source CAD or EDA tools for academic purposes. This thesis provides a detailed examination of an open-source EDA tool called the Electric VLSI Design System. Electric fulfills these requirements: it is an efficient CAD tool that lets students and teachers implement ideas by modifying the source code. The primary objective of this thesis is to explain the features and architecture of the Electric software and to provide various digital and analog designs implemented with it for educational purposes. Since the choice of an EDA tool is based on the efficiency and functions it can provide, this thesis explains all the analysis and synthesis tools that Electric provides and how efficient they are. Hence, this thesis will benefit students and teachers who choose Electric as their open-source EDA tool for educational purposes.
With conventional CMOS technologies approaching their scaling limits, researchers are actively investigating alternative technologies to meet ever-increasing computing and mobile demand. A number of different technologies are currently being studied by different research groups. In the last decade, one-dimensional (1D) carbon nanotubes (CNTs), which are graphene sheets rolled into tubular form, two-dimensional (2D) graphene, and zero-dimensional (0D) fullerenes have been the subject of intensive research. In 2008, HP Labs announced a ground-breaking fabrication of the memristor, the fourth fundamental circuit element postulated by Chua at the University of California, Berkeley in 1971. In the last few years, the memristor has gained a lot of attention from the research community. In-depth studies of the memristor and its analog behavior have convinced the community that it has potential in future nano-architectures for high-density memory and neuromorphic computing. The objective of this thesis is to explore memristors for analog and mixed-signal system design using Simscape. This thesis presents a memristor model in the Simscape language. Simscape has been used because it has the potential for modeling large systems. A memristor-based programmable oscillator is also presented with simulation results and characterization. In addition, simulation results of different memristor models are presented, which are crucial for a detailed understanding of the memristor and its properties.
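To make the notion of a memristor model concrete, the following is a minimal Python sketch of the widely cited linear ion-drift model associated with the 2008 HP Labs device. It is not the thesis's Simscape model; the parameter values, the forward-Euler integration, and the function name `simulate_memristor` are assumptions chosen for illustration. The model uses memristance $M(w) = R_{on}\,w/D + R_{off}\,(1 - w/D)$ with state equation $dw/dt = \mu_v (R_{on}/D)\, i(t)$.

```python
import math


def simulate_memristor(v_amp=1.0, freq=1.0, t_end=2.0, dt=1e-4,
                       r_on=100.0, r_off=16e3, d=10e-9, mu_v=1e-14, w0=0.5):
    """Forward-Euler simulation of a linear ion-drift memristor.

    Drives the device with v(t) = v_amp * sin(2*pi*freq*t) and returns a list
    of (t, v, i, memristance) samples. All default values are illustrative.
    """
    w = w0 * d                      # width of the doped region (state variable)
    steps = int(round(t_end / dt))
    trace = []
    for k in range(steps):
        t = k * dt
        v = v_amp * math.sin(2 * math.pi * freq * t)
        m = r_on * (w / d) + r_off * (1 - w / d)   # memristance M(w)
        i = v / m
        w += mu_v * r_on / d * i * dt              # linear drift of the boundary
        w = min(max(w, 0.0), d)                    # keep the state inside the device
        trace.append((t, v, i, m))
    return trace


trace = simulate_memristor()
resistances = [m for (_, _, _, m) in trace]
print(min(resistances), max(resistances))
```

Plotting current against voltage from `trace` would show the pinched hysteresis loop that characterizes memristive behavior: the current is zero whenever the voltage is zero, while the resistance depends on the history of the applied signal.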
Location-based services (LBS) are available on a variety of mobile platforms, such as cell phones and PDAs, and an increasing number of users subscribe to and use these services. Two of the popular models of information flow in LBS are the client-server model and the peer-to-peer model; in both, existing approaches do not always provide privacy for all parties concerned. In this work, I study the feasibility of applying cryptographic protocols to design privacy-preserving solutions for LBS from an experimental and theoretical standpoint. In the client-server model, I construct a two-phase framework for processing nearest neighbor queries using combinations of cryptographic protocols such as oblivious transfer and private information retrieval. In the peer-to-peer model, I present privacy-preserving solutions for processing group nearest neighbor queries in the semi-honest and dishonest adversarial models. I apply concepts from secure multi-party computation to realize my constructions and also leverage the capabilities of trusted computing technology, specifically TPM chips. My solution for the dishonest adversarial model is also of independent cryptographic interest. I prove my constructions secure under standard cryptographic assumptions, design experiments to test their feasibility and practicality, and benchmark key operations. My experiments show that the proposed constructions are practical to implement and have reasonable costs, while providing strong privacy assurances.
Sense amplifiers are important circuit components of a dynamic random access memory (DRAM), which forms the main memory of digital computers. The importance of the sense amplifier's ability to detect and amplify voltage signals so that data in DRAM cells is interpreted correctly cannot be overstated. The sense amplifier plays a significant role in the overall speed of the DRAM. Sense amplifiers require matched transistors for optimal performance. Hence, the effects of mismatch caused by process variations must be minimized. This thesis presents research that leads to optimal nanoscale CMOS sense amplifiers by incorporating the effects of process variation early in the design process. The effects of process variation on the performance of a standard voltage sense amplifier, which is used in conventional DRAMs, are studied. Parametric analysis is performed through circuit simulations to investigate which parameters have the most impact on the performance of the sense amplifier. The figures of merit (FoMs) used to characterize the circuit are the precharge time, power dissipation, sense delay, and sense margin. Statistical analysis is also performed to study the impact of process variations on each FoM. By analyzing the results of the statistical study, a method is presented to select parameter values that minimize the effects of process variation. A design flow algorithm incorporating dual-oxide and dual-threshold-voltage techniques is used to optimize the FoMs of the sense amplifier. Experimental results show that the proposed approach improves precharge time by 83.9%, sense delay by 80.2%, sense margin by 61.9%, and power dissipation by 13.1%.
Socrates articulated it best, "Speak, so I may see you." Indeed, language represents an invisible probe into the mind. It is the medium through which we express our deepest thoughts, our aspirations, our views, our feelings, our inner reality. From the beginning of artificial intelligence, researchers have sought to impart human like understanding to machines. As much of our language represents a form of self expression, capturing thoughts, beliefs, evaluations, opinions, and emotions which are not available for scrutiny by an outside observer, in the field of natural language, research involving these aspects has crystallized under the name of subjectivity and sentiment analysis. While subjectivity classification labels text as either subjective or objective, sentiment classification further divides subjective text into either positive, negative or neutral. In this thesis, I investigate techniques of generating tools and resources for subjectivity analysis that do not rely on an existing natural language processing infrastructure in a given language. This constraint is motivated by the fact that the vast majority of human languages are scarce from an electronic point of view: they lack basic tools such as part-of-speech taggers, parsers, or basic resources such as electronic text, annotated corpora or lexica. This severely limits the implementation of techniques on par with those developed for English, and by applying methods that are lighter in the usage of text processing infrastructure, we are able to conduct multilingual subjectivity research in these languages as well. Since my aim is also to minimize the amount of manual work required to develop lexica or corpora in these languages, the techniques proposed employ a lever approach, where English often acts as the donor language (the fulcrum in a lever) and allows through a relatively minimal amount of effort to establish preliminary subjectivity research in a target language.
Making computers automatically find the appropriate meaning of words in context is an interesting problem that has proven to be one of the most challenging tasks in natural language processing (NLP). Widespread potential applications of a possible solution to the problem could be envisaged in several NLP tasks such as text simplification, language learning, machine translation, query expansion, information retrieval and text summarization. Ambiguity of words has always been a challenge in these applications, and the traditional endeavor to solve the problem of this ambiguity, namely doing word sense disambiguation using resources like WordNet, has been fraught with debate about the feasibility of the granularity that exists in WordNet senses. The recent trend has therefore been to move away from enforcing any given lexical resource upon automated systems from which to pick potential candidate senses, and to instead encourage them to pick and choose their own resources. Given a sentence with a target ambiguous word, an alternative solution consists of picking potential candidate substitutes for the target, filtering the list of the candidates to a much shorter list using various heuristics, and trying to match these system predictions against a human-generated gold standard, with a view to ensuring that the meaning of the sentence does not change after the substitutions. This solution has manifested itself in the SemEval 2007 task of lexical substitution and the more recent SemEval 2010 task of cross-lingual lexical substitution (which I helped organize), where given an English context and a target word within that context, the systems are required to provide between one and ten appropriate substitutes (in English) or translations (in Spanish) for the target word. In this dissertation, I present a comprehensive overview of state-of-the-art research and describe new experiments to tackle the tasks of lexical substitution and cross-lingual lexical substitution. 
In particular ...
The presence of naturally occurring and man-made public health threats necessitates the design and implementation of mitigation strategies, such that an adequate response is provided in a timely manner. Since multiple variables, such as geographic properties, resource constraints, and government-mandated time frames, must be accounted for, computational methods provide the necessary tools to develop contingency response plans while respecting underlying data and assumptions. A typical response scenario involves the placement of points of dispensing (PODs) in the affected geographic region to supply vaccines or medications to the general public. Computational tools aid in the analysis of such response plans, as well as in the strategic placement of PODs, such that feasible response scenarios can be developed. Due to the sensitivity of bio-emergency response plans, geographic information, such as POD locations, must be kept confidential. The generation of synthetic geographic regions allows for the development of emergency response plans on non-sensitive data, as well as for the study of the effects of individual geographic parameters. Further, synthetic representations of geographic regions allow results to be published and evaluated by the scientific community. This dissertation presents a methodology for the analysis of bio-emergency response plans, methods for plan optimization, and a methodology for the generation of synthetic geographic regions.
The performance of software applications is hindered by a variety of factors, most notably by the well-known CPU-memory speed gap (often called the memory wall). This gap results in the CPU sitting idle while waiting for data to be brought from memory into processor caches. The addressing used by caches causes non-uniform accesses to the various cache sets. The non-uniformity has several causes, including how different objects are accessed by the code and where the data objects are located in memory. Memory allocators determine where dynamically created objects are placed, thus defining their addresses and their mapping to cache locations. It is important to evaluate how different allocators behave with respect to the locality of the created objects. Most allocators use a single attribute of an object, its size, in making allocation decisions. Additional attributes, such as placement with respect to other objects or a specific cache area, may lead to better use of cache memories. In this dissertation, we propose and implement a framework that allows for the development and evaluation of new memory allocation techniques. At the root of the framework is a memory tracing tool called Gleipnir, which provides very detailed information about every memory access and relates it back to source-level objects. Using the traces from Gleipnir, we extend a commonly used cache simulator to generate detailed cache statistics (per function, per data object, and per cache line) and to identify specific data objects that conflict with each other. The utility of the framework is demonstrated with a new memory allocator known as the equivalence class allocator. The new allocator allows users to specify the cache sets, in addition to the object size, where objects should be placed. We compare this new allocator with two well-known allocators, namely the Doug Lea and Pool allocators.
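How object addresses translate into non-uniform use of cache sets can be sketched with the standard set-index computation for a set-associative cache. The example below is a simplified illustration and not part of Gleipnir or the framework described above: the line size, number of sets, and example addresses are assumed values, and virtual-to-physical address translation is ignored.

```python
def cache_set_index(address, line_size=64, num_sets=64):
    """Set index for a cache whose sets are selected by the address bits
    directly above the line offset (assumed geometry: 64-byte lines, 64 sets)."""
    return (address // line_size) % num_sets


def set_histogram(addresses, line_size=64, num_sets=64):
    """Count how many of the given addresses fall into each cache set."""
    counts = [0] * num_sets
    for addr in addresses:
        counts[cache_set_index(addr, line_size, num_sets)] += 1
    return counts


# Eight objects placed 4096 bytes apart: 4096 / 64 = 64 lines, and 64 % 64 == 0,
# so every object maps to the same set and they compete for its few ways.
strided = [0x10000 + i * 4096 for i in range(8)]
# Eight objects placed 64 bytes apart spread over eight consecutive sets.
contiguous = [0x10000 + i * 64 for i in range(8)]

strided_hist = set_histogram(strided)
contiguous_sets = [cache_set_index(a) for a in contiguous]
print(strided_hist[0], contiguous_sets)
```

An allocator that considers only object size can easily produce the strided pattern; letting the caller choose the target cache sets, as the equivalence class allocator described above does, is one way to avoid concentrating hot objects in a single set.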
Optical character recognition (OCR) software has advanced greatly in recent years. Machine-printed text can be scanned and converted to searchable text with word accuracy rates around 98%. Reasonably neat hand-printed text can be recognized with about 85% word accuracy. However, cursive handwriting still remains a challenge, with state-of-the-art performance still around 75%. Algorithms based on hidden Markov models have been only moderately successful, while recurrent neural networks have delivered the best results to date. This thesis explored the feasibility of using a special type of feedforward neural network to convert freeform cursive handwriting to searchable text. The hidden nodes in this network were grouped into clusters, with each cluster being trained to recognize a unique character bigram. The network was trained on writing samples that were pre-segmented and annotated. Post-processing was facilitated in part by using the network to identify overlapping bigrams that were then linked together to form words and sentences. With dictionary assisted post-processing, the network achieved word accuracy of 66.5% on a small, proprietary corpus. The contributions in this thesis are threefold: 1) the novel clustered architecture of the feed-forward neural network, 2) the development of an expanded set of observers combining image masks, modifiers, and feature characterizations, and 3) the use of overlapping bigrams as the textual working unit to assist in context analysis and reconstruction.
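The idea of linking overlapping bigrams into words can be illustrated with a small sketch. The abstract does not describe the actual linking procedure, so the function `link_bigrams` and its strict overlap rule are assumptions for illustration only; a real system would also have to handle recognition errors and competing candidates.

```python
def link_bigrams(bigrams):
    """Chain character bigrams into a word.

    Each bigram must begin with the last character of the text built so far,
    e.g. ["he", "el", "ll", "lo"] -> "hello".
    """
    if not bigrams:
        return ""
    word = bigrams[0]
    for bg in bigrams[1:]:
        if len(bg) != 2 or bg[0] != word[-1]:
            raise ValueError(f"bigram {bg!r} does not overlap with {word!r}")
        word += bg[1]          # append only the new character
    return word


print(link_bigrams(["he", "el", "ll", "lo"]))
```

Because consecutive bigrams share a character, each recognized bigram constrains its neighbors, which is what makes the bigram a useful working unit for context analysis and reconstruction.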
The purpose of this project is to explore the GPU for general-purpose computing. The GPU is a massively parallel computing device with high throughput, high arithmetic intensity, and a large market presence; with more computational power added each year through innovation, it is a strong candidate to complement the CPU in performing computations. The GPU follows the single instruction, multiple data (SIMD) model for applying operations to its data. This model makes the GPU very useful for assisting the CPU with computations on data that is highly parallel in nature. The Compute Unified Device Architecture (CUDA) is a parallel computing and programming platform for NVIDIA GPUs. The main focus of this project is to show the power, speed, and performance of a CUDA-enabled GPU for digital video watermark insertion in the H.264 video compression domain. Digital video watermarking is in general a computationally intensive process that depends strongly on the video compression format in place. The H.264/MPEG-4 AVC video compression format achieves high compression efficiency at the expense of high computational complexity, and it leaves little room for an imperceptible watermark to be inserted. Employing a human visual model to limit the distortion and degradation of visual quality introduced by the watermark is a good choice when designing a video watermarking algorithm, though it adds computational complexity. This research investigates how CPU-GPU execution of the digital watermarking application, optimized using the NVIDIA Visual Profiler, can make the application several times faster than running it on a standalone CPU.
The current trend towards miniaturization of modern consumer electronic devices significantly affects their design. The demand for efficient all-in-one appliances leads to smaller, yet more complex and powerful nanoelectronic devices. The increasing complexity in the design of such nanoscale Analog/Mixed-Signal Systems-on-Chip (AMS-SoCs) presents difficult challenges to designers. One promising design method used to mitigate the burden of this design effort is metamodeling (surrogate modeling). Its use significantly reduces the time needed for computer simulation and for design space exploration and optimization. This dissertation addresses several issues in metamodeling-based design exploration of nanoelectronic AMS systems. A surrogate modeling technique that uses geostatistics-based Kriging prediction methods to create metamodels is proposed. Kriging prediction techniques take into account the correlation effects between input parameters for performance point prediction. We propose the use of Kriging to utilize this property for the accurate modeling of process variation effects of designs in the deep nanometer region. Different Kriging methods, such as simple and ordinary Kriging, have been explored in this work. We also propose another metamodeling technique, the Kriging-Bootstrapped Neural Network, which combines the accuracy and process variation awareness of Kriging with artificial neural network models for ultra-fast and accurate process-aware metamodeling. The proposed methodologies combine Kriging metamodels with selected algorithms for ultra-fast layout optimization. The selected algorithms explored are the Gravitational Search Algorithm (GSA), Simulated Annealing Optimization (SAO), and Ant Colony Optimization (ACO). Experimental results demonstrate that the proposed Kriging metamodel-based methodologies can perform the optimizations with minimal computational burden compared to traditional (SPICE-based) design flows.
Epidemics have caused major human and monetary losses through the course of human civilization. It is very important that epidemiologists and public health personnel are prepared to handle an impending infectious disease outbreak. The ever-changing demographics and evolving infrastructural resources of geographic regions, together with emerging and re-emerging diseases, compel the use of simulation to predict disease dynamics. By means of simulation, public health personnel and epidemiologists can predict the disease dynamics, the population groups at risk, and their geographic locations beforehand, so that they are prepared to respond in case of an epidemic outbreak. Because of the large numbers of individuals and inter-personal interactions involved in simulating infectious disease spread in a region such as a county, sizeable amounts of data may be produced that have to be analyzed. Methods to visualize this data would help people from diverse disciplines understand and analyze the simulation. This thesis proposes a framework to simulate and visualize the spread of an infectious disease in the population of a region such as a county. As real-world populations have a non-homogeneous demographic and spatial distribution, this framework models the spread of an infectious disease based on the population of census blocks, the geographic distance between them, and social behavioral parameters for demographic groups. The population is stratified into demographic groups in individual census blocks using census data. Infection spread is modeled by means of local and global contacts generated between groups of population in census blocks. The strength and likelihood of the contacts are based on population, geographic distance, and social behavioral parameters of the groups involved. The disease dynamics are represented on a geographic map of the region using a heat map representation, where the intensity of infection is mapped to a color scale.
This framework provides a tool for public health personnel and ...
In the United States, smartphone ownership surpassed 69.5 million in February 2011, with a large portion of those users (20%) downloading applications (apps) that enhance the usability of a device by adding functionality. A large percentage of apps are written specifically to utilize the geographical position of a mobile device. One of the prime factors in developing location prediction models is the use of historical data to train such a model. With larger sets of training data, prediction algorithms become more accurate; however, the use of historical data can quickly become a liability if the GPS stream is not collected or processed correctly. Inaccurate, incomplete, or improperly interpreted historical data can make it impossible to develop accurate prediction algorithms. As GPS chipsets become standard in the ever-increasing number of mobile devices, the opportunity for the collection of GPS data increases remarkably. The goal of this study is to build a comprehensive system that addresses the following challenges: (1) collection of GPS data streams in a manner such that the data is highly usable and contains fewer errors; (2) processing and reduction of the collected data in order to prepare it for the creation of prediction algorithms; (3) creation of prediction/labeling algorithms at such a level that they are viable for commercial use. This study identifies the key research problems toward building the CaPPture (collection, processing, prediction) system.
Keyphrases describe a document in a coherent and simple way, giving the prospective reader a way to quickly determine whether the document satisfies their information needs. With the huge amount of information on the Web and only a small share of documents having keyphrases extracted, there is a definite need for automatic keyphrase extraction systems. Typically, a document written by a human develops around one or more general concepts or sub-concepts. These concepts or sub-concepts should be structured and semantically related to each other, so that they can form a meaningful representation of the document. Given that the phrases or concepts in a document are related to each other, a new approach for keyphrase extraction is introduced that exploits the semantic relations in the document. For measuring the semantic relations between concepts or sub-concepts in the document, I present a comprehensive study aimed at using collaboratively constructed semantic resources like Wikipedia and its link structure. In particular, I introduce a graph-based keyphrase extraction system that exploits the semantic relations in the document and features such as term frequency. I evaluated the proposed system using novel measures, and the results obtained compare favorably with previously published results on established benchmarks.
This dissertation focuses on a novel learning strategy based on geometric support vector machines to address the difficulties of processing immense data sets. Support vector machines find the hyperplane that maximizes the margin between two classes; because the decision boundary is represented by a few training samples, they are a favorable choice for incremental learning. The dissertation presents a novel method, Geometric Incremental Support Vector Machines (GISVMs), to address both efficiency and accuracy issues in handling massive data sets. In GISVM, the skin of convex hulls is defined, and an efficient method is designed to find the best skin approximation given the available examples. The set of extreme points is found by recursively searching along the direction defined by a pair of known extreme points. By identifying the skin of the convex hulls, incremental learning employs a much smaller number of samples with comparable or even better accuracy. When additional samples are provided, they are used together with the skin of the convex hull constructed from the previous data set. This results in a small number of instances being used in the incremental steps of the training process. Based on the experimental results with synthetic data sets, public benchmark data sets from UCI, and endoscopy videos, it is evident that GISVM achieves satisfactory classifiers that closely model the underlying data distribution. GISVM improves sensitivity in the incremental steps, significantly reduces the demand for memory space, and demonstrates the ability to recover from temporary performance degradation.
Indoor localization is the problem of locating oneself inside new buildings. GPS does not work indoors due to multipath reflection and signal blockage. WiFi-based systems assume ubiquitous availability, and infrastructure-based systems require expensive installations, so indoor localization remains an open problem. This dissertation addresses indoor localization by thoroughly exploiting indoor ambient magnetic fields, which consist mainly of disturbances, termed anomalies, in the Earth’s magnetic field caused by ferromagnetic pillars, doors, and elevators in hallways. By observing uniqueness in magnetic signatures collected from different campus buildings, the work identifies landmarks and guideposts from these signatures and further develops magnetic maps of buildings, all of which can be used to locate and navigate people indoors. To understand the reason behind these anomalies, a comparison between the measured and model-generated Earth’s magnetic field is made first, verifying the presence of a constant field without any disturbances. Then, by modeling the magnetic field behavior of different pillars, such as steel-reinforced concrete and solid steel, and of other structures like doors and elevators, the interaction of the Earth’s field with the ferromagnetic fields is described, thereby explaining the causes of the uniqueness in the signatures that comprise these disturbances. Next, by employing the dynamic time warping algorithm to account for time differences in signatures obtained from users walking at different speeds, an indoor localization application capable of classifying locations using the magnetic signatures is developed solely on the smartphone. The application required users to walk short distances of 3-6 m anywhere in a hallway to be located with accuracies of 80-99%. 
The classification framework was further validated with over 90% accuracies using model generated magnetic signatures representing hallways with different kinds of pillars, doors and elevators. All in all, this dissertation ...
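The role of dynamic time warping in matching signatures recorded at different walking speeds can be shown with a small sketch. This is the textbook DTW recurrence, not the dissertation's implementation; the function names, the absolute-difference cost, the nearest-reference classification rule, and the sample values are all assumptions made for illustration.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two numeric sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b repeats its sample
                cost[i][j - 1],      # b advances, a repeats its sample
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def classify(signature, references):
    """Return the label of the reference signature closest under DTW."""
    return min(references, key=lambda label: dtw_distance(signature, references[label]))


if __name__ == "__main__":
    # Made-up magnetic magnitudes, for illustration only.
    references = {
        "pillar": [40, 55, 70, 55, 40],
        "door": [40, 42, 45, 42, 40],
    }
    slow_walk = [40, 40, 55, 55, 70, 70, 55, 55, 40, 40]  # same shape, half speed
    print(classify(slow_walk, references))
```

Because DTW allows samples to repeat, a signature recorded at half the walking speed aligns with its reference at zero cost, which a point-by-point comparison of sequences with different lengths could not do.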
The fight against epidemics/pandemics is one of man versus nature. Technological advances have not only improved existing methods for monitoring and controlling disease outbreaks, but have also provided new means for investigation, such as modeling and simulation. This dissertation explores the relationship between social structure and disease dynamics. Social structures are modeled as graphs, and outbreaks are simulated based on a well-recognized standard, the susceptible-infectious-removed (SIR) paradigm. Two independent, but related, studies are presented. The first involves measuring the severity of outbreaks as social network parameters are altered. The second study investigates the efficacy of various vaccination policies based on social structure. Three disease-related centrality measures are introduced: contact, transmission, and spread centrality, which are related to the previously established centrality measures degree, betweenness, and closeness, respectively. The results of experiments presented in this dissertation indicate that reducing the neighborhood size along with outside-of-neighborhood contacts diminishes the severity of disease outbreaks. Vaccination strategies can effectively reduce these parameters. Additionally, vaccination policies that target individuals with high centrality are generally shown to be slightly more effective than a random vaccination policy. These results, combined with past and future studies, will assist public health officials in their effort to minimize the effects of inevitable disease epidemics/pandemics.
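A minimal sketch of an SIR outbreak on a contact graph, with vaccination modeled by moving nodes to the removed state before the outbreak starts, may help make the setup concrete. It is an assumption-laden illustration rather than the dissertation's simulator: the function names, the one-step infectious period, the per-contact transmission probability `p_transmit`, and the use of plain degree as the targeting criterion are all choices made here.

```python
import random


def simulate_sir(graph, seeds, p_transmit, vaccinated=(), rng=None):
    """Run a discrete-time SIR outbreak and return the set of nodes ever infected.

    graph: dict mapping each node to an iterable of its neighbours.
    Each infectious node gets one step to infect susceptible neighbours,
    then becomes removed. Vaccinated nodes start in the removed state.
    """
    rng = rng or random.Random(0)
    removed = set(vaccinated)
    infectious = set(seeds) - removed
    ever_infected = set(infectious)
    while infectious:
        newly_infected = set()
        for node in infectious:
            for neighbour in graph[node]:
                susceptible = neighbour not in removed and neighbour not in ever_infected
                if susceptible and rng.random() < p_transmit:
                    newly_infected.add(neighbour)
        removed |= infectious
        ever_infected |= newly_infected
        infectious = newly_infected
    return ever_infected


def highest_degree(graph, k):
    """Return the k nodes with the most neighbours (a simple targeting rule)."""
    return sorted(graph, key=lambda node: len(graph[node]), reverse=True)[:k]


if __name__ == "__main__":
    # Star network: node 0 is connected to every other node.
    star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
    print(len(simulate_sir(star, seeds=[1], p_transmit=1.0)))
    print(len(simulate_sir(star, seeds=[1], p_transmit=1.0,
                           vaccinated=highest_degree(star, 1))))
```

On the star network, vaccinating the single highest-degree node cuts every path between the remaining nodes, which is the intuition behind targeting individuals with high centrality.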
Trusted Computing capability has become ubiquitous these days, and it is being widely deployed in consumer devices as well as enterprise platforms. As the number of threats is increasing at an exponential rate, securing systems against them is becoming a daunting task. In this context, software integrity measurement at runtime with the support of trusted platforms can be a better security strategy. Trusted Computing devices like TPM secure the evidence of a breach or an attack. These devices remain tamper-proof if the hardware platform is physically secured. This type of trusted security is crucial for forensic analysis in the aftermath of a breach. The advantages of trusted platforms can be further leveraged if they are used wisely. RADIUM (Race-free on-demand Integrity Measurement Architecture) is one such architecture, which is built on the strength of TPM. RADIUM provides an asynchronous root of trust to overcome the TOC condition of DRTM. Even though the underlying architecture is trusted, attacks can still compromise applications during runtime by exploiting their vulnerabilities. I propose an application-level integrity measurement solution that fits into RADIUM, to expand the trusted computing capability to the application layer. This is based on the concept of program invariants that can be used to learn the correct behavior of an application. I used Daikon, a tool to obtain dynamic likely invariants, and developed a method of observing these properties at runtime to verify the integrity. The integrity measurement component was implemented as a Python module on top of Volatility, a virtual machine introspection tool. My approach is a first step towards integrity attestation, using hypervisor-based introspection on RADIUM and a proof of concept of application-level measurement capability.