Subset Selection Algorithms with Applications

Author: Shane Francis Cotter
Publisher:
ISBN:
Category:
Languages: en
Pages: 394

Book Description


Scalable Subset Selection with Filters and Its Applications

Author: Gregory Charles Ditzler
Publisher:
ISBN:
Category: Electrical engineering
Languages: en
Pages: 278

Book Description
Increasingly many applications of machine learning encounter data sets far larger than what was feasible just a few years ago, and many current algorithms do not scale to today's extremely large volumes of data. The data comprise a large set of features describing each observation, and the complexity of predictive models tends to grow not only with the number of observations but also with the number of features. Fortunately, not all of the features carry meaningful information for prediction, so irrelevant features should be filtered from the data before a model is built. This process of removing features to produce a subset is commonly referred to as feature subset selection. In this work, we present two new filter-based feature subset selection algorithms that scale to large data sets, addressing (i) potentially large and distributed data sets and (ii) very large feature sets.

Our first proposed algorithm, Neyman-Pearson Feature Selection (NPFS), uses a statistical hypothesis test derived from the Neyman-Pearson lemma to determine whether a feature is statistically relevant. The approach can be applied as a wrapper around any feature selection algorithm, regardless of the selection criterion used, to decide whether a feature belongs in the relevant set. Perhaps more importantly, the procedure efficiently determines the number of relevant features given an initial starting point, and it fits into a computationally attractive MapReduce model. We also describe a sequential learning framework for feature subset selection (SLSS) that scales with both the number of features and the number of observations. SLSS uses bandit algorithms to process features and assign each feature a level of importance. Feature selection is performed independently of the optimization of any classifier to reduce unnecessary complexity. We demonstrate the capabilities of NPFS and SLSS on synthetic and real-world data sets. We also present a new classifier-dependent feature selection approach: an online learning algorithm that easily handles large amounts of missing feature values in a data stream.

Many real-world applications can benefit from scalable feature subset selection algorithms; one such area is the study of the microbiome (i.e., the study of micro-organisms and their influence on the environments they inhabit). Feature subset selection algorithms can be used to sift through the massive amounts of data collected in the genomic sciences and help microbial ecologists understand the microbes, particularly the micro-organisms that are the best indicators of a phenotype such as healthy or unhealthy. In this work, we provide insights into data collected from the American Gut Project and deliver open-source software implementations of feature selection for biological data formats.
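
As a rough illustration of the NPFS idea summarized above (not the authors' implementation), the following Python sketch runs a pluggable base selector on bootstrap samples, counts how often each feature is chosen, and flags features whose counts exceed a binomial "selected by chance" threshold. The function names (npfs_relevant, correlation_filter), the synthetic data, and the significance level are illustrative assumptions.

    # Sketch of the NPFS idea: bootstrap a base feature selector, count
    # selections per feature, and keep features whose counts are unlikely
    # under a "selected uniformly at random" null (binomial upper tail).
    import numpy as np
    from scipy.stats import binom

    def npfs_relevant(X, y, base_selector, n_bootstraps=50, k=10, alpha=0.01):
        n_samples, n_features = X.shape
        counts = np.zeros(n_features, dtype=int)
        rng = np.random.default_rng(0)
        for _ in range(n_bootstraps):
            idx = rng.integers(0, n_samples, size=n_samples)   # bootstrap sample
            counts[base_selector(X[idx], y[idx], k)] += 1
        p_null = k / n_features                                # chance selection rate
        threshold = binom.ppf(1.0 - alpha, n_bootstraps, p_null)
        return counts > threshold

    def correlation_filter(X, y, k):
        """Toy base selector: top-k features by absolute correlation with y."""
        scores = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
        return np.argsort(scores)[-k:]

    rng = np.random.default_rng(1)
    X = rng.normal(size=(300, 100))
    y = (X[:, 0] + X[:, 1] - X[:, 2] + 0.5 * rng.normal(size=300) > 0).astype(int)
    print("features flagged as relevant:",
          np.flatnonzero(npfs_relevant(X, y, correlation_filter)))

Because each bootstrap run is independent, the per-bootstrap selections map naturally onto a MapReduce-style computation, which matches the scalability claim in the abstract.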

Computational Subset Model Selection Algorithms and Applications

Author:
Publisher:
ISBN:
Category:
Languages: en
Pages:

Book Description
This dissertation develops new computationally efficient algorithms for identifying the subset of variables that minimizes any desired information criterion in model selection. In recent years, the statistical literature has placed increasing emphasis on information-theoretic model selection criteria; a model selection criterion chooses the model that most "closely" approximates the true underlying model. Recent years have also seen many exciting developments in model selection techniques, and as data mining of massive data sets with many variables becomes more common, the need for scalable model selection techniques grows ever stronger. To this end, we introduce a new Implicit Enumeration (IE) algorithm and a hybrid of IE with the Genetic Algorithm (GA) in this dissertation. The proposed Implicit Enumeration algorithm is the first algorithm that explicitly uses an information criterion as the objective function. The algorithm works with a variety of information criteria, including some for which the existing branch-and-bound algorithms developed by Furnival and Wilson (1974) and Gatu and Kontoghiorghes (2003) are not applicable. It also finds the "best" subset model directly, without needing to find the "best" subset of each size as the branch-and-bound techniques do. The proposed methods are demonstrated on multiple, multivariate, and logistic regression and discriminant analysis problems. The implicit enumeration algorithm converged to the optimal solution on real and simulated data sets with up to 80 predictors, i.e., with 2^80 (approximately 1.2 x 10^24) possible subset models in the model portfolio. To our knowledge, none of the existing exact algorithms can optimally solve problems of this size.
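
The implicit enumeration algorithm itself is not reproduced in this abstract, but the brute-force baseline it improves upon is easy to sketch in Python: enumerate every subset of a small predictor set and keep the one that minimizes an information criterion. The Gaussian AIC formula (up to an additive constant) and the helper names below are assumptions for illustration; exhaustive enumeration like this is feasible only for small p, which is exactly why implicit enumeration and branch-and-bound methods are needed at 2^80 scale.

    # Brute-force baseline for the problem implicit enumeration solves:
    # score every subset of predictors and keep the one with the smallest
    # information criterion.  Feasible only for small p.
    import itertools
    import numpy as np

    def aic_ols(X, y, subset):
        """Gaussian AIC (up to an additive constant) for an OLS fit on `subset`."""
        n = len(y)
        Xs = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        rss = np.sum((y - Xs @ beta) ** 2)
        k = Xs.shape[1] + 1                 # coefficients plus the error variance
        return n * np.log(rss / n) + 2 * k

    def best_subset(X, y, criterion=aic_ols):
        p = X.shape[1]
        best = (np.inf, ())
        for size in range(p + 1):
            for subset in itertools.combinations(range(p), size):
                score = criterion(X, y, subset)
                if score < best[0]:
                    best = (score, subset)
        return best

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 8))
    y = 2 * X[:, 0] - 3 * X[:, 3] + rng.normal(size=200)
    score, subset = best_subset(X, y)
    print("best subset by AIC:", subset, "score:", round(score, 2))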

Subset Selection in Regression

Author: Alan J. Miller
Publisher: Chapman and Hall/CRC
ISBN:
Category: Computers
Languages: en
Pages: 248

Book Description
Most scientific computing packages contain facilities for stepwise regression and often for 'all subsets' and other techniques for finding 'best-fitting' subsets of regression variables. The application of standard theory can be very misleading in such cases when the model has not been chosen a priori, but from the data. There is widespread awareness that considerable over-fitting occurs and that prediction equations obtained after extensive 'data dredging' often perform poorly when applied to new data. This monograph relates almost entirely to least-squares methods of finding and fitting subsets of regression variables, though most of the concepts are presented in terms of the interpretation and statistical properties of orthogonal projections. An early chapter introduces these methods, which are still not widely known to users of least-squares methods. Existing methods are described for testing whether any useful improvement can be obtained by using any of a set of predictors. Spjøtvoll's method for comparing two arbitrary subsets of predictor variables is illustrated and described in detail. When the selected model is the 'best-fitting' in some sense, conventional fitting methods give estimates of regression coefficients which are usually biased in the direction of being too large. The extent of this bias is demonstrated for simple cases. Various ad hoc methods for correcting the bias are discussed (ridge regression, James-Stein shrinkage, jack-knifing, etc.), together with the author's maximum likelihood technique. Areas in which further research is needed are also outlined.
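
The monograph's warning that the coefficient of a selected 'best-fitting' predictor is usually biased toward being too large can be illustrated with a short Monte Carlo simulation. The design below (independent predictors with equal small true coefficients, univariate fits, keeping the largest absolute slope) is an illustrative assumption rather than an example from the book.

    # Small Monte Carlo illustration of selection bias: fitting only the
    # single "best" predictor and reporting its coefficient overstates the
    # effect size relative to the true value.
    import numpy as np

    rng = np.random.default_rng(0)
    n, p, true_beta, n_reps = 50, 10, 0.3, 2000
    selected_estimates = []

    for _ in range(n_reps):
        X = rng.normal(size=(n, p))
        # every predictor has the same small true coefficient
        y = X @ np.full(p, true_beta) + rng.normal(size=n)
        # univariate least-squares slope for each predictor
        slopes = (X * y[:, None]).sum(axis=0) / (X ** 2).sum(axis=0)
        # keep the estimate of whichever predictor fit "best" (largest |slope|)
        selected_estimates.append(np.abs(slopes).max())

    print("true |coefficient|:", true_beta)
    print("mean |coefficient| of the selected predictor:",
          round(float(np.mean(selected_estimates)), 3))

The gap between the true coefficient and the average reported estimate is the selection bias that the shrinkage and maximum likelihood corrections discussed above aim to remove.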

Machine Learning Under a Modern Optimization Lens

Author: Dimitris Bertsimas
Publisher:
ISBN: 9781733788502
Category: Machine learning
Languages: en
Pages: 589

Book Description


Feature Selection for High-Dimensional Data

Author: Verónica Bolón-Canedo
Publisher: Springer
ISBN: 3319218581
Category: Computers
Languages: en
Pages: 163

Book Description
This book offers a coherent and comprehensive approach to feature subset selection in the context of classification problems, explaining the foundations, real application problems, and the challenges of feature selection for high-dimensional data. The authors first focus on the analysis and synthesis of feature selection algorithms, presenting a comprehensive review of basic concepts and experimental results for the most well-known algorithms. They then address different real scenarios with high-dimensional data, showing the use of feature selection algorithms in contexts with different requirements and information: microarray data, intrusion detection, tear film lipid layer classification and cost-based features. The book then delves into the scenario of big dimensionality, paying attention to important problems in high-dimensional spaces, such as scalability, distributed processing and real-time processing, scenarios that open up new and interesting challenges for researchers. The book is useful for practitioners, researchers and graduate students in the areas of machine learning and data mining.

Integrated Feature Subset Selection/extraction with Applications in Bioinformatics

Author:
Publisher:
ISBN:
Category:
Languages: en
Pages: 209

Book Description
Feature subset selection and extraction algorithms are actively and extensively studied in the machine learning literature as a way to reduce the dimensionality of the feature space, since high-dimensional data sets are generally not handled efficiently or effectively by a large array of machine learning and pattern recognition algorithms. When we stride into the analysis of large-scale bioinformatics data sets, such as microarray gene expression data sets, the high dimensionality of the feature space, compounded with the low dimensionality of the sample space, creates even more problems for data analysis algorithms. Two foremost characteristics of microarray gene expression data sets are (1) the correlation between features (genes) and (2) the availability of domain knowledge in computable format. In this dissertation, we study effective feature selection and extraction algorithms with applications to the analysis of newly emerging data sets in the bioinformatics domain. Microarray gene expression data sets, the result of large-scale RNA profiling techniques, are our primary focus in this thesis, and several novel feature (gene) selection and extraction algorithms are proposed to deal with their peculiarities.

To address the first characteristic, we first propose a general feature selection algorithm called Boost Feature Subset Selection (BFSS), based on permutation analysis, to broaden the scope of the selected gene set and thus improve classification performance. In BFSS, subsequently selected features focus on those samples where previously selected features fail. Our experiments showed the benefit of BFSS for t-score and S2N (signal-to-noise) based single-gene scores on a variety of publicly available microarray gene expression data sets. We then examine the correlations among features (genes) explicitly to see whether such correlations are informative for sample classification. This results in our gene extraction algorithm called virtual gene: a virtual gene is a group of genes whose expression levels are combined linearly, and the combined expression levels of virtual genes, rather than the real gene expression levels, are used for sample classification. Our experiments confirm that by taking into consideration the correlations between gene pairs, we can indeed build a better sample classifier.

A microarray gene expression data set represents only one aspect of our knowledge of the underlying biological system, and a great deal of biological knowledge is now available in computable format over the Internet. Continuing to address the second characteristic of microarray gene expression data sets, we investigate the integration of domain knowledge, such as that embedded in gene ontology annotations, for gene selection and extraction. (Abstract shortened by UMI.)
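
As a hypothetical sketch of the virtual-gene construction described above (a linear combination of a group of genes used in place of their individual expression levels), the code below ranks genes by a two-sample t-score, pairs up the top-ranked genes, and uses each pair's first principal component as the combined feature. The pairing rule, the synthetic data, and the function names are assumptions for illustration only.

    # Rank genes by a two-sample t-score, pair the top genes, and replace
    # each pair by one linear combination (its first principal component).
    import numpy as np

    def t_scores(X, y):
        """Two-sample t-score per gene for binary labels y in {0, 1}."""
        a, b = X[y == 0], X[y == 1]
        num = a.mean(axis=0) - b.mean(axis=0)
        den = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
        return num / den

    def virtual_genes(X, pairs):
        """Project each gene pair onto its first principal component."""
        features = []
        for i, j in pairs:
            sub = X[:, [i, j]] - X[:, [i, j]].mean(axis=0)
            _, vecs = np.linalg.eigh(np.cov(sub.T))
            features.append(sub @ vecs[:, -1])      # leading eigenvector
        return np.column_stack(features)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 500))                  # 60 samples, 500 genes
    y = np.repeat([0, 1], 30)
    X[y == 1, :10] += 1.0                           # 10 informative genes
    top = np.argsort(np.abs(t_scores(X, y)))[-10:]  # top-10 genes by |t-score|
    pairs = list(zip(top[::2], top[1::2]))          # pair them up
    print("virtual gene feature matrix shape:", virtual_genes(X, pairs).shape)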

Feature Extraction, Construction and Selection

Author: Huan Liu
Publisher: Springer Science & Business Media
ISBN: 1461557259
Category: Computers
Languages: en
Pages: 418

Book Description
There is broad interest in feature extraction, construction, and selection among practitioners in statistics, pattern recognition, data mining, and machine learning. Data preprocessing is an essential step in the knowledge discovery process for real-world applications. This book compiles contributions from many leading and active researchers in this growing field and paints a picture of the state-of-the-art techniques that can boost the capabilities of many existing data mining tools. The objective of this collection is to increase the awareness of the data mining community of research on feature extraction, construction and selection, which is currently conducted mainly in isolation. This book is part of our endeavor to produce a contemporary overview of modern solutions, to create synergy among these seemingly different branches, and to pave the way for developing meta-systems and novel approaches. Even with today's advanced computer technologies, discovering knowledge from data can still be fiendishly hard due to the characteristics of computer-generated data. Feature extraction, construction and selection are a set of techniques that transform and simplify data so as to make data mining tasks easier. Feature construction and selection can be viewed as two sides of the representation problem.

Feature Subset Selection by Estimation of Distribution Algorithms

Author:
Publisher:
ISBN:
Category:
Languages: en
Pages:

Book Description
This paper describes the application of four evolutionary algorithms to the identification of feature subsets for classification problems. Besides a simple GA, the paper considers three estimation of distribution algorithms (EDAs): a compact GA, an extended compact GA, and the Bayesian Optimization Algorithm. The objective is to determine if the EDAs present advantages over the simple GA in terms of accuracy or speed in this problem. The experiments used a Naive Bayes classifier and public-domain and artificial data sets. In contrast with previous studies, we did not find evidence to support or reject the use of EDAs for this problem.
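
A hedged sketch of one of the EDAs compared here, the compact GA, used as a wrapper around a Naive Bayes classifier: a probability vector over feature-inclusion bits is sampled twice per iteration and shifted toward whichever candidate subset achieves higher cross-validated accuracy. The scikit-learn wrapper, the make_classification data, and the hyper-parameters are assumptions, not the paper's experimental setup.

    # Compact GA (an EDA) for feature subset selection with a Naive Bayes
    # wrapper: evolve a per-feature inclusion probability vector.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB

    def fitness(mask, X, y):
        if not mask.any():
            return 0.0
        return cross_val_score(GaussianNB(), X[:, mask], y, cv=5).mean()

    def compact_ga(X, y, n_iters=100, pop_size=50, seed=0):
        rng = np.random.default_rng(seed)
        p = np.full(X.shape[1], 0.5)          # inclusion probability per feature
        for _ in range(n_iters):
            a = rng.random(p.size) < p        # sample two candidate subsets
            b = rng.random(p.size) < p
            winner, loser = (a, b) if fitness(a, X, y) >= fitness(b, X, y) else (b, a)
            # shift probabilities toward the winner where the candidates differ
            step = np.where(winner != loser, np.where(winner, 1.0, -1.0) / pop_size, 0.0)
            p = np.clip(p + step, 0.0, 1.0)
        return p

    X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                               n_redundant=2, random_state=0)
    p = compact_ga(X, y)
    print("features with inclusion probability > 0.8:", np.flatnonzero(p > 0.8))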

Feature Subset Selection by Estimation of Distribution Algorithms

Author: Erick Cantú-Paz
Publisher:
ISBN:
Category:
Languages: en
Pages: 8

Book Description
This paper describes the application of four evolutionary algorithms to the selection of feature subsets for classification problems. Besides a simple genetic algorithm (GA), the paper considers three estimation of distribution algorithms (EDAs): a compact GA, an extended compact GA, and the Bayesian Optimization Algorithm. The objective is to determine whether the EDAs present advantages over the simple GA in terms of accuracy or speed on this problem. The experiments used a Naive Bayes classifier and public-domain and artificial data sets. All the algorithms found feature subsets that resulted in higher accuracies than using all the features. However, in contrast with other studies, we did not find evidence to support or reject the use of EDAs for this problem.