Nonparametric Methods of Data Analysis in Astroparticle Physics
Nonparametric Methods of Data Analysis in Cosmic Ray Astrophysics. An Applied Theory of Monte Carlo Statistical Inference. Monograph
Tech Area / Field
- INF-COM/High Performance Computing and Networking/Information and Communications
- PHY-PFA/Particles, Fields and Accelerator Physics/Physics
8 Project completed
Senior Project Manager
Bunyatov K S
Yerevan Physics Institute, Armenia, Yerevan
- Forschungszentrum Karlsruhe Technik und Umwelt / Institut für Prozessdatenverarbeitung und Elektronik, Germany, Karlsruhe
- Stanford University, USA, CA, Stanford
Project summary
Nowadays, when the multidimensionality of physical phenomena is well recognized and experimental techniques can measure many parameters simultaneously with high precision, the need for adequate multivariate analysis methods is apparent.
The monograph will present a coherent system of multivariate statistical methods for analyzing data of a stochastic nature. All stages of analysis, from preprocessing and outlier detection to sophisticated physical inference on the theoretical models under consideration, will be presented with numerous examples of application.
No such comprehensive collection of multivariate methods yet exists with an emphasis on applications in high-energy astrophysics and in the emerging field of comparative gene expression enumeration.
The most general framework in which to formulate solutions to physical inference problems in Cosmic Ray Astrophysics experiments is a statistical one, which recognizes the probabilistic nature both of the physical processes of cosmic radiation propagation, and of the form in which the data analysis results should be expressed.
To make conclusions about the physical phenomena under investigation more reliable and significant, we have developed a unified theory of statistical inference based on nonparametric models, in which various nonparametric approaches and Neural Networks are implemented and compared. In this context, we treat Neural information technology not as a “black box,” but as an extension of conventional nonparametric techniques of statistical inference.
The Analysis and Nonparametric Inference (ANI) program package is the software realization of our concept and an appropriate tool for physical inference in High Energy Cosmic Ray Astrophysics experiments. Over the last 10 years, the ANI package has been updated and used intensively to compare different nonparametric techniques and to analyze data from some of the world's largest experiments, such as the PAMIR emulsion chamber collaboration, the Whipple air Cherenkov telescope, and the KASCADE and ANI surface installations for detecting Extensive Air Showers.
Another area where the methods described in the monograph are used effectively is genetic data analysis.
An important problem addressed using cDNA microarray data is the detection of genes that are differentially expressed in two tissues of interest. Currently used approaches consider each gene separately and evaluate its differential expression independently, ignoring the multidimensional structure of the data. However, it is well known that correlation among covariates can enhance the ability to detect less pronounced differences.
We propose a novel approach that uses gene correlation information to find differentially expressed genes. The Mahalanobis distance between vectors of gene expressions is the criterion for simultaneously comparing a set of genes, and an evolutionary algorithm is developed to maximize it. However, the extreme imbalance between the number of genes and the number of experiments makes the sample covariance matrices unstable, so a direct application of the Mahalanobis distance is not feasible. To overcome this problem, we develop a new method of combining data from small-scale random search experiments, which we term Multiple Random Search with Early Stop (MRSES).
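The idea can be illustrated with a minimal sketch. It is not the ANI implementation: the summary does not specify the MRSES update rule, so the swap-one-gene proposal, the `patience` stopping rule, the pooled covariance estimate, the function names, and the toy data below are all assumptions made for illustration. Each small search keeps a subset of `k` genes, so only a `k` by `k` covariance matrix is ever inverted, and the results of many such searches are combined by counting how often each gene is selected.

```python
import random

# Toy data (hypothetical): rows are samples, columns are genes 0..3.
# Only gene 0 differs in mean between tissues; gene 3 is correlated with it.
TISSUE_A = [[0, 0, 2, 0], [2, 0, 0, 1], [0, 2, 0, 2], [2, 2, 2, 3]]
TISSUE_B = [[6, 0, 2, 0], [8, 0, 0, 1], [6, 2, 0, 2], [8, 2, 2, 3]]

def mean_vector(rows, genes):
    """Mean expression of the selected genes over the samples of one tissue."""
    return [sum(r[g] for r in rows) / len(rows) for g in genes]

def pooled_covariance(a, b, genes):
    """Pooled sample covariance of the selected genes over both tissues."""
    k = len(genes)
    cov = [[0.0] * k for _ in range(k)]
    for rows in (a, b):
        mu = mean_vector(rows, genes)
        for r in rows:
            d = [r[g] - m for g, m in zip(genes, mu)]
            for i in range(k):
                for j in range(k):
                    cov[i][j] += d[i] * d[j]
    dof = len(a) + len(b) - 2
    return [[c / dof for c in row] for row in cov]

def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [v] for row, v in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular covariance matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def mahalanobis_sq(a, b, genes):
    """Squared Mahalanobis distance between the two tissue mean vectors."""
    diff = [x - y for x, y in zip(mean_vector(a, genes), mean_vector(b, genes))]
    try:
        w = solve(pooled_covariance(a, b, genes), diff)
    except ValueError:
        return 0.0  # assumption: a degenerate subset is scored as uninformative
    return sum(d * v for d, v in zip(diff, w))

def random_search(a, b, n_genes, k, patience, rng):
    """One small search: swap single genes, stop after `patience` failures."""
    subset = rng.sample(range(n_genes), k)
    best = mahalanobis_sq(a, b, subset)
    failures = 0
    while failures < patience:
        outside = [g for g in range(n_genes) if g not in subset]
        candidate = subset[:]
        candidate[rng.randrange(k)] = rng.choice(outside)
        score = mahalanobis_sq(a, b, candidate)
        if score > best:
            subset, best, failures = candidate, score, 0
        else:
            failures += 1
    return subset, best

def selection_counts(a, b, k=2, runs=200, patience=10, seed=0):
    """Combine many small searches: count how often each gene is selected."""
    rng = random.Random(seed)
    n_genes = len(a[0])
    counts = [0] * n_genes
    for _ in range(runs):
        subset, _ = random_search(a, b, n_genes, k, patience, rng)
        for g in subset:
            counts[g] += 1
    return counts
```

On the toy data, `mahalanobis_sq(TISSUE_A, TISSUE_B, [0])` evaluates to 27 (up to floating-point rounding), and adding gene 3 raises it to 33.75 even though gene 3 has no mean shift of its own. This is the effect noted above: correlation with an informative gene can sharpen the separation. `selection_counts(TISSUE_A, TISSUE_B)` then ranks genes by how often the small searches retain them.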
The book presents a novel integrated approach to data analysis consisting of:
– optimization of information contained in simulation trials and experimental events;
– selection of the best feature subsets (for discrimination and estimation purposes) and initial dimensionality reduction;
– optimized methods of multivariate probability density estimation;
– scanning of multivariate distributions to investigate embedded nontrivial structures;
– neural network and Bayesian classification and background rejection;
– nonparametric estimation and best theoretical model selection;
– gene selection based on the MRSES algorithm.
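Two items in this list, multivariate density estimation and Bayesian classification with background rejection, fit together naturally, and the following minimal sketch shows one way they can be combined. It assumes a Gaussian product-kernel (Parzen-window) density estimate with a fixed, hand-chosen bandwidth, which is a standard nonparametric estimator and not necessarily the one used in the monograph. The names `kde`, `posterior_signal`, `signal_mc`, and `background_mc` are hypothetical.

```python
import math

def kde(point, sample, bandwidth):
    """Gaussian product-kernel density estimate at `point` from `sample`."""
    dim = len(point)
    norm = (bandwidth * math.sqrt(2.0 * math.pi)) ** dim
    total = 0.0
    for event in sample:
        sq = sum((p - e) ** 2 for p, e in zip(point, event))
        total += math.exp(-0.5 * sq / bandwidth ** 2)
    return total / (len(sample) * norm)

def posterior_signal(point, signal_mc, background_mc,
                     prior_signal=0.5, bandwidth=0.5):
    """Bayes posterior probability that `point` belongs to the signal class.

    The class-conditional densities are estimated nonparametrically from
    simulated (Monte Carlo) signal and background events.
    """
    ps = prior_signal * kde(point, signal_mc, bandwidth)
    pb = (1.0 - prior_signal) * kde(point, background_mc, bandwidth)
    if ps + pb == 0.0:
        return prior_signal  # no simulated event nearby: fall back to the prior
    return ps / (ps + pb)
```

An experimental event would be kept as signal when `posterior_signal(...)` exceeds a chosen threshold and rejected as background otherwise. In practice, the bandwidth would be tuned, for example on independent simulation trials, rather than fixed by hand.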
Numerous examples of applying the new methods to astroparticle physics and comparative gene analysis are presented.
All procedures are implemented in the Fortran and C/C++ programming languages. A user-friendly interface is available for most UNIX/LINUX platforms. Source code for most of the procedures will be provided on the CD included with the book and will be made available on the Internet. Provision will be made for the use of special hardware to accelerate network training and multiple random searches.