
Improved Tools for Large-scale Hypothesis Testing

Author: Zihao Zheng
Publisher:
ISBN:
Category :
Languages : en
Pages : 0

Book Description
Large-scale hypothesis testing, one of the key statistical tools, has been widely studied and applied to high-throughput bioinformatics experiments, such as high-density peptide array studies and brain imaging data sets. The high dimensionality and small sample size of many experiments challenge conventional statistical approaches, including those aiming to control the false discovery rate (FDR). Motivated by this, in this dissertation I develop several improved statistical and computational tools for large-scale hypothesis testing. The first method, MixTwice, advances an empirical-Bayes tool that computes local false discovery rate statistics when provided with estimated effects and estimated standard errors. I also extend this method from two-group comparison problems to multiple-group comparison settings and develop a generalized method called MixTwice-ANOVA. The second method, GraphicalT, calculates local FDRs semiparametrically using available graph-associated information. MixTwice introduces an empirical-Bayes approach that involves the estimation of two mixing distributions, one on underlying effects and one on underlying variance parameters. Provided with the estimated effect sizes and estimated standard errors, MixTwice estimates the mixing distributions and calculates the local false discovery rates via nonparametric maximum likelihood estimation and constrained optimization, with a unimodal shape constraint on the effect distribution. Numerical experiments show that MixTwice accurately estimates the generative parameters and has good testing operating characteristics. Applied to a high-density peptide array, it powerfully identifies non-null peptides and recovers meaningful peptide markers when the underlying signal is weak, and it has strong reproducibility properties when the underlying signal is strong. The second contribution of this dissertation generalizes MixTwice from scenarios comparing two conditions to scenarios comparing multiple groups. Like MixTwice, MixTwice-ANOVA takes the numerator and denominator statistics of the F test to estimate two underlying mixing distributions. Compared with other large-scale testing tools for one-way ANOVA settings, MixTwice-ANOVA shows better power and FDR control in numerical experiments. Applied to the peptide array study comparing multiple Sjogren's disease (SjD) populations, the proposed approach discovers meaningful epitope structure and novel scientific findings on Sjogren's disease. Numerical experiments further support the comparison among testing tools. Besides the methodological contribution of MixTwice to large-scale testing, I also discuss generalized evaluation and computational aspects. For the former, I propose an evaluation metric, in addition to FDR control and power, called reproducibility, to provide a practical guide for choosing among testing tools. For the latter, I borrow the idea of the pool adjacent violators algorithm (PAVA) and advance a computational algorithm called EM-PAVA to solve the nonparametric MLE problem under an isotonic partial-order constraint. This algorithm is discussed in terms of theoretical guarantees and computational performance. The last contribution of this dissertation deals with large-scale testing problems with graph-associated data. Unlike many studies that incorporate graph-associated information through detailed modeling specifications, GraphicalT provides a semiparametric way to calculate local false discovery rates using an available auxiliary data graph. The method shows good performance in synthetic examples and in a brain-imaging problem from the study of Alzheimer's disease.
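The EM-PAVA algorithm mentioned above builds on the classical pool adjacent violators algorithm (PAVA) for order-restricted estimation. The following Python sketch is not the dissertation's implementation; it is a minimal illustration of plain weighted PAVA producing a nondecreasing fit, with the function name pava and the block-merging bookkeeping chosen purely for exposition.

    import numpy as np

    def pava(y, w=None):
        """Weighted pool adjacent violators: returns the nondecreasing fit
        minimizing sum_i w_i * (y_i - x_i)^2 subject to x_1 <= ... <= x_n."""
        y = np.asarray(y, dtype=float)
        w = np.ones_like(y) if w is None else np.asarray(w, dtype=float)
        blocks = []  # each block stores [weighted mean, total weight, size]
        for yi, wi in zip(y, w):
            blocks.append([yi, wi, 1])
            # Merge backwards while the monotonicity constraint is violated.
            while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
                m2, w2, n2 = blocks.pop()
                m1, w1, n1 = blocks.pop()
                wt = w1 + w2
                blocks.append([(w1 * m1 + w2 * m2) / wt, wt, n1 + n2])
        # Expand the block means back to the original length.
        return np.concatenate([np.full(n, m) for m, _, n in blocks])

    # Example: a non-monotone sequence is averaged into a monotone one.
    print(pava([1.0, 3.0, 2.0, 5.0, 4.0, 6.0]))  # [1.  2.5 2.5 4.5 4.5 6. ]

The same merge-and-average step is the basic building block that isotonic-constrained maximum likelihood routines iterate over.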

Large-scale Multiple Hypothesis Testing with Complex Data Structure

Author: Xiaoyu Dai
Publisher:
ISBN:
Category : Electronic dissertations
Languages : en
Pages : 104

Book Description
In the last decade, motivated by a variety of applications in medicine, bioinformatics, genomics, brain imaging, and other fields, a growing amount of statistical research has been devoted to large-scale multiple testing, where thousands or even greater numbers of tests are conducted simultaneously. However, due to the complexity of real data sets, the assumptions of many existing multiple testing procedures, e.g., that tests are independent and have continuous null distributions of p-values, may not hold. This limits their performance, for example through low detection power and an inflated false discovery rate (FDR). In this dissertation, we study how to better handle multiple testing problems under complex data structures. In Chapter 2, we study multiple testing with discrete test statistics. In Chapter 3, we study discrete multiple testing with prior ordering information incorporated. In Chapter 4, we study multiple testing under complex dependency structure. We propose novel procedures for each scenario, based on the marginal critical functions (MCFs) of randomized tests, the conditional random field (CRF), or the deep neural network (DNN). The theoretical properties of our procedures are carefully studied, and their performance is evaluated through various simulations and real applications, including the analysis of genetic data from next-generation sequencing (NGS) experiments.
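Discreteness of test statistics is one reason conventional procedures lose power: with a discrete null distribution, ordinary p-values are conservative. As a hedged illustration, and not the dissertation's MCF-based procedures, the sketch below computes conventional and mid p-values for one-sided binomial tests and passes them to the Benjamini-Hochberg step-up procedure; the function names and example counts are invented for this sketch, and pairing mid p-values with BH is shown only to convey the idea, not as a validated method.

    import numpy as np
    from scipy.stats import binom

    def one_sided_binom_pvalues(k, n, p0=0.5):
        """Conventional and mid p-values for a one-sided test of p > p0.
        With a discrete null, P(X >= k) is conservative; the mid p-value
        counts only half of the probability mass at the observed value."""
        p_conv = binom.sf(k - 1, n, p0)                        # P(X >= k)
        p_mid = binom.sf(k, n, p0) + 0.5 * binom.pmf(k, n, p0)
        return p_conv, p_mid

    def benjamini_hochberg(pvals, alpha=0.05):
        """Boolean rejection vector for the BH step-up procedure."""
        p = np.asarray(pvals)
        m = len(p)
        order = np.argsort(p)
        below = p[order] <= alpha * np.arange(1, m + 1) / m
        reject = np.zeros(m, dtype=bool)
        if below.any():
            kmax = np.max(np.where(below)[0])
            reject[order[:kmax + 1]] = True
        return reject

    # Example: ten binomial tests with n = 10 trials each.
    ks = [9, 5, 8, 10, 6, 7, 9, 4, 10, 8]
    mid_pvals = [one_sided_binom_pvalues(k, 10)[1] for k in ks]
    print(benjamini_hochberg(mid_pvals, alpha=0.10))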

Large-Scale Inference

Author: Bradley Efron
Publisher: Cambridge University Press
ISBN: 1139492136
Category : Mathematics
Languages : en
Pages :

Book Description
We live in a new age for statistical inference, where modern scientific technology such as microarrays and fMRI machines routinely produce thousands and sometimes millions of parallel data sets, each with its own estimation or testing problem. Doing thousands of problems at once is more than repeated application of classical methods. Taking an empirical Bayes approach, Bradley Efron, inventor of the bootstrap, shows how information accrues across problems in a way that combines Bayesian and frequentist ideas. Estimation, testing and prediction blend in this framework, producing opportunities for new methodologies of increased power. New difficulties also arise, easily leading to flawed inferences. This book takes a careful look at both the promise and pitfalls of large-scale statistical inference, with particular attention to false discovery rates, the most successful of the new statistical techniques. Emphasis is on the inferential ideas underlying technical developments, illustrated using a large number of real examples.
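A concrete way to see how information accrues across parallel problems is the classical James-Stein estimator for many normal means, an example in the empirical-Bayes spirit the book describes. The sketch below is a generic positive-part James-Stein implementation, not code from the book; the simulated means and the sigma parameter are assumptions of the example.

    import numpy as np

    def james_stein(z, sigma=1.0):
        """Positive-part James-Stein shrinkage toward zero for
        z_i ~ N(mu_i, sigma^2), i = 1..N. Borrowing strength across the N
        parallel problems lowers expected total squared error when N >= 3."""
        z = np.asarray(z, dtype=float)
        n = len(z)
        shrink = max(0.0, 1.0 - (n - 2) * sigma ** 2 / np.sum(z ** 2))
        return shrink * z

    # Example: 1000 parallel problems, most true means exactly zero.
    rng = np.random.default_rng(0)
    mu = np.concatenate([np.zeros(900), rng.normal(0.0, 2.0, 100)])
    z = mu + rng.normal(0.0, 1.0, 1000)
    print(np.mean((z - mu) ** 2))               # risk of the unbiased estimate
    print(np.mean((james_stein(z) - mu) ** 2))  # smaller risk after shrinkage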

Large-scale Simultaneous Hypothesis Testing

Author: Bradley Efron
Publisher:
ISBN:
Category :
Languages : en
Pages : 22

Book Description


Introduction to Robust Estimation and Hypothesis Testing

Author: Rand R. Wilcox
Publisher: Academic Press
ISBN: 0123869838
Category : Mathematics
Languages : en
Pages : 713

Book Description
This book focuses on the practical aspects of modern and robust statistical methods. The increased accuracy and power of modern methods, versus conventional approaches to the analysis of variance (ANOVA) and regression, is remarkable. Through a combination of theoretical developments, improved and more flexible statistical methods, and the power of the computer, it is now possible to address problems that seemed insurmountable with standard methods only a few years ago.

Model-Based Hypothesis Testing in Biomedicine

Author: Rikard Johansson
Publisher: Linköping University Electronic Press
ISBN: 9176854574
Category :
Languages : en
Pages : 102

Book Description
The utilization of mathematical tools within biology and medicine has traditionally been less widespread than in other hard sciences, such as physics and chemistry. However, an increased need for tools such as data processing, bioinformatics, statistics, and mathematical modeling has emerged due to advancements during the last decades. These advancements are partly due to the development of high-throughput experimental procedures and techniques, which produce ever-increasing amounts of data. For all aspects of biology and medicine, these data reveal a high level of inter-connectivity between components, which operate on many levels of control, with multiple feedbacks both between and within each level of control. However, the availability of these large-scale data is not synonymous with a detailed mechanistic understanding of the underlying system. Rather, a mechanistic understanding is gained only when we construct a hypothesis and test its predictions experimentally. Identifying interesting predictions that are quantitative in nature generally requires mathematical modeling. This, in turn, requires that the studied system can be formulated as a mathematical model, such as a series of ordinary differential equations, where different hypotheses can be expressed as precise mathematical expressions that influence the output of the model. Within specific sub-domains of biology, the utilization of mathematical models has had a long tradition, such as the modeling of electrophysiology by Hodgkin and Huxley in the 1950s. However, it is only in recent years, with the arrival of the field known as systems biology, that mathematical modeling has become more commonplace. The somewhat slow adoption of mathematical modeling in biology is partly due to historical differences in training and terminology, as well as a lack of awareness of showcases illustrating how modeling can make a difference, or even be required, for a correct analysis of the experimental data. In this work, I provide such showcases by demonstrating the universality and applicability of mathematical modeling and hypothesis testing in three disparate biological systems. In Paper II, we demonstrate how mathematical modeling is necessary for the correct interpretation and analysis of dominant negative inhibition data in insulin signaling in primary human adipocytes. In Paper III, we use modeling to determine transport rates across the nuclear membrane in yeast cells, and we show how this technique is superior to traditional curve-fitting methods. We also demonstrate the issue of population heterogeneity and the need to account for individual differences between cells and the population at large. In Paper IV, we use mathematical modeling to reject three hypotheses concerning the phenomenon of facilitation in pyramidal nerve cells in rats and mice. We also show how one surviving hypothesis can explain all data and adequately describe independent validation data. Finally, in Paper I, we develop a method for model selection and discrimination using parametric bootstrapping and the combination of several different empirical distributions of traditional statistical tests. We show how the empirical log-likelihood ratio test is the best combination of two tests and how it can be used not only for model selection but also for model discrimination. In conclusion, mathematical modeling is a valuable tool for analyzing data and testing biological hypotheses, regardless of the underlying biological system.
Further development of modeling methods and applications is therefore important, since these will in all likelihood play a crucial role in all future aspects of biology and medicine, especially in dealing with the burden of increasing amounts of data made available by new experimental techniques.
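Paper I is described as combining parametric bootstrapping with empirical distributions of classical test statistics for model selection. The sketch below is a toy analogue under simplifying assumptions, not the thesis's ODE-based setting: it compares a constant and a straight-line regression model with Gaussian errors and builds an empirical null distribution of the log-likelihood ratio by simulating from the fitted null model; the function names and example data are invented for illustration.

    import numpy as np

    def fit_gaussian_lsq(x, y, degree):
        """Least-squares polynomial fit; returns coefficients, the MLE of the
        residual variance, and the Gaussian log-likelihood at that MLE."""
        coef = np.polyfit(x, y, degree)
        resid = y - np.polyval(coef, x)
        sigma2 = np.mean(resid ** 2)
        n = len(y)
        loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)
        return coef, sigma2, loglik

    def bootstrap_lr_test(x, y, n_boot=500, seed=0):
        """Empirical log-likelihood ratio test of a constant model against a
        straight-line model, with the null distribution of the statistic
        generated by parametric bootstrap from the fitted null model."""
        rng = np.random.default_rng(seed)
        c0, s0, ll0 = fit_gaussian_lsq(x, y, 0)
        _, _, ll1 = fit_gaussian_lsq(x, y, 1)
        lr_obs = 2.0 * (ll1 - ll0)
        lr_boot = np.empty(n_boot)
        for b in range(n_boot):
            y_sim = np.polyval(c0, x) + rng.normal(0.0, np.sqrt(s0), len(x))
            _, _, l0 = fit_gaussian_lsq(x, y_sim, 0)
            _, _, l1 = fit_gaussian_lsq(x, y_sim, 1)
            lr_boot[b] = 2.0 * (l1 - l0)
        return lr_obs, float(np.mean(lr_boot >= lr_obs))  # bootstrap p-value

    # Example: data with a genuine linear trend.
    x = np.linspace(0.0, 1.0, 30)
    y = 0.5 + 0.8 * x + np.random.default_rng(1).normal(0.0, 0.3, 30)
    print(bootstrap_lr_test(x, y))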

Analysis of Error Control in Large Scale Two-stage Multiple Hypothesis Testing

Author: Wenge Guo
Publisher:
ISBN:
Category :
Languages : en
Pages : 50

Book Description


Improving Large-scale Assessment in Education

Author: Marielle Simon
Publisher: Routledge
ISBN: 0415894565
Category : Education
Languages : en
Pages : 318

Book Description
This book focuses on central issues that are key components of successful planning, development, and implementation of large-scale assessments (LSAs). The book's main distinction is its focus on practice-based, cutting-edge research. This is achieved by having chapters co-authored by world-class researchers in collaboration with measurement practitioners.

Learning Statistics with R

Author: Daniel Navarro
Publisher: Lulu.com
ISBN: 1326189727
Category : Computers
Languages : en
Pages : 617

Book Description
"Learning Statistics with R" covers the contents of an introductory statistics class, as typically taught to undergraduate psychology students, focusing on the use of the R statistical software and adopting a light, conversational style throughout. The book discusses how to get started in R, and gives an introduction to data manipulation and writing scripts. From a statistical perspective, the book discusses descriptive statistics and graphing first, followed by chapters on probability theory, sampling and estimation, and null hypothesis testing. After introducing the theory, the book covers the analysis of contingency tables, t-tests, ANOVAs and regression. Bayesian statistics are covered at the end of the book. For more information (and the opportunity to check the book out before you buy!) visit http://ua.edu.au/ccs/teaching/lsr or http://learningstatisticswithr.com

Estimating the Local False Discovery Rate Via a Bootstrap Solution to the Reference Class Problem

Author: Farnoosh Abbas Aghababazadeh
Publisher:
ISBN:
Category : Analysis of variance
Languages : en
Pages :

Book Description
Modern scientific technology such as microarrays, imaging devices, genome-wide association studies, or social science surveys provides statisticians with hundreds or even thousands of tests to consider simultaneously. Testing many thousands of null hypotheses may increase the number of Type I errors. In large-scale hypothesis testing, researchers can use different statistical techniques such as family-wise error rates, false discovery rates, permutation methods, and local false discovery rates, where all available data usually should be analyzed together. In applications, the thousands of tests are often related by a scientifically meaningful structure. Ignoring that structure can be misleading, as it may increase the number of false positives and false negatives. As an example, in genome-wide association studies each test corresponds to a specific genetic marker, and the scientific structure for each genetic marker can be its minor allele frequency. In this research, the local false discovery rate is considered as a relevant statistical approach to analyzing the thousands of tests together. We present a model for multiple hypothesis testing in which the scientific structure of each test is incorporated as a covariate, with the purpose of using the covariate to improve the performance of testing procedures. The method we consider yields different estimates depending on a tuning parameter, and we estimate the optimal value of that parameter from the observed statistics. Among those estimators, the one that minimizes the estimated errors due to bias and variance is chosen by applying the bootstrap approach. Such an estimation method is called an adaptive reference class method. Under the combined reference class method, the effect of the covariate is ignored and all null hypotheses are analyzed together. In this research, under some assumptions on the covariate and the prior probabilities, the proposed adaptive reference class method shows smaller error than the combined reference class method in estimating the local false discovery rate when the number of tests gets large. We apply the adaptive reference class method to coronary artery disease data, and we use simulation data to evaluate the performance of the estimator associated with the adaptive reference class method.
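The abstract describes estimating the local false discovery rate within covariate-defined reference classes rather than pooling every test. The following sketch is a simplified illustration under assumptions not taken from the source: a theoretical N(0,1) null, the null proportion fixed at 1, equal-count covariate bins, and a kernel density estimate of the marginal density. It is not the author's adaptive bootstrap procedure, and the function names and simulated data are invented.

    import numpy as np
    from scipy.stats import norm, gaussian_kde

    def local_fdr(z, pi0=1.0):
        """Two-groups local fdr, pi0 * f0(z) / f(z), with a theoretical
        N(0,1) null and a kernel estimate of the marginal density f."""
        f = gaussian_kde(z)(z)
        return np.clip(pi0 * norm.pdf(z) / f, 0.0, 1.0)

    def local_fdr_by_reference_class(z, covariate, n_bins=4):
        """Estimate the local fdr separately within equal-count covariate
        bins, so each test is judged against the reference class of tests
        with a similar covariate value."""
        edges = np.quantile(covariate, np.linspace(0.0, 1.0, n_bins + 1))
        bins = np.clip(np.searchsorted(edges, covariate, side="right") - 1,
                       0, n_bins - 1)
        fdr = np.empty_like(z)
        for b in range(n_bins):
            idx = bins == b
            fdr[idx] = local_fdr(z[idx])
        return fdr

    # Example: non-null tests concentrate where the covariate is large.
    rng = np.random.default_rng(0)
    covariate = rng.uniform(0.0, 1.0, 5000)
    is_signal = rng.random(5000) < 0.3 * covariate
    z = rng.normal(0.0, 1.0, 5000) + is_signal * rng.normal(2.5, 0.5, 5000)
    print(np.mean(local_fdr(z)[is_signal]))                               # pooled
    print(np.mean(local_fdr_by_reference_class(z, covariate)[is_signal]))  # binned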