Top 10 Capabilities for Exploring Complex Relationships in Data for Scientific Discovery
Friday, September 12, 2014
With all of the discussion about Big Data these days, there is frequent reference to the 3 V’s that represent the top big data challenges: Volume, Velocity, and Variety. These 3 V’s generally refer to the size of the dataset (Volume), the rate at which data is flowing into (or out of) your systems (Velocity), and the complexity (dimensionality) of the data (Variety).
Most researchers agree that complexity of the data and its relationships is truly the biggest challenge at all scales and in most applications. Consequently, any dataset (whether large or small) that has hundreds or thousands (or more) dimensions per data item is difficult to explore, mine, and interpret.
So, when you find a data tool like AutoDiscovery that helps in the analysis of high-dimensional data, you stop and take a look.
Exploratory Data Analysis (EDA) for Small Data
First, note that AutoDiscovery is not explicitly for big data, though it is certainly useful for small subsets of big data: that is, small data!
This is the style of data science that nearly every scientist needs to carry out on a routine basis, since data from daily experiments are rarely in the rarefied realm of big data, but modern scientific instruments often do generate large numbers of measured parameters per data object.
AutoDiscovery enables the discovery, exploration, and visualization of correlations in high-dimensional data from such experiments – i.e., Exploratory Data Analysis (EDA).
Four Benefits of Early Findings from EDA
AutoDiscovery objectively discovers interesting findings in the early stages of research. This provides four additional benefits to the scientist in the EDA stage of research:
Informs improvements in the experimental design
Validates and substantiates a priori hypotheses
Generates multiple new testable hypotheses
Reveals promising “hot spots” in the data that require deeper statistical analysis.
The last capability is quite exciting – “Interestingness” Discovery – i.e., finding the unexpected, unusual, “interesting” regions and features within your data’s multi-dimensional parameter space! Especially with complex data, the combination of these capabilities empowers the data scientist to tell the “data story” in the full dimensionality of the dataset, not just in a few limited 2-D or 3-D projections.
Consequently, AutoDiscovery is an objective, quantifiable feature-discovery tool that presents the most interesting correlations to end-users for efficient and effective EDA: efficient in the sense that automatic discovery of the most interesting data correlations for deeper analysis avoids lots of useless searches and manual manipulations of the data collection; and effective in the sense that novel discoveries (beyond known correlations and expected relationships) are made possible.
The Top 10 Features of AutoDiscovery
One of the most sensible characteristics of AutoDiscovery is that it does not try to be the “one tool” for all possible statistical analyses. There are other statistical software packages that already do that, and there is no need to compete with R, SAS, or SPSS.
Consequently, AutoDiscovery aims to satisfy a very particular scientific discovery requirement: correlation discovery in the high-dimensional parameter spaces of complex (high-variety) data.
Correlation discovery alone may seem relatively simple and thus a specialized tool for it seems unnecessary. However, several proprietary features within AutoDiscovery can more than justify its use.
The top 10 features of AutoDiscovery for exploring complex relationships in data for scientific discovery are:
Rapid discovery of interesting findings that can confirm (or deny) initial hypotheses, inform further experimentation and experimental design, and generate multiple additional testable hypotheses.
Automatic search for significant correlations across the full set of pairwise parameter combinations in your dataset.
Automatic search for significant correlations between virtual parameters (i.e., the ratios of the original input parameters).
Quantitative assessment and evaluation of the value of each finding.
Automatic sorting of results, including deprecation of weak and insignificant relationships, placing them lower in the output listings, though still searchable if wanted.
Optional correlation analyses within multiple sub-segments of each parameter’s range of possible values (thereby enabling discovery of changes in the parameter correlations across these limited ranges of the data values, which is a reality often observed in complex scientific experiments).
Correlation analysis outputs (tables, visualizations, and the ability to export the correlation tables) that enable efficient and effective browsing, exploration, and navigation of causal connections (and the causal direction) in correlated data items.
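To make a few of the features above concrete (the pairwise search, the ratio-based "virtual parameters", the sorted output, and the per-segment analysis), here is a minimal Python sketch. It is not AutoDiscovery's implementation, which is proprietary and not described in this article. The function names, the use of the Pearson coefficient as the correlation measure, ranking by absolute correlation, and equal-count segments are all assumptions made purely for illustration.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no correlation signal
    return sxy / math.sqrt(sxx * syy)


def add_ratio_parameters(columns):
    """Copy `columns` and add one hypothetical "a/b" ratio column per pair.

    Pairs whose denominator contains a zero are skipped.
    """
    extended = dict(columns)
    for a, b in combinations(columns, 2):
        if all(v != 0 for v in columns[b]):
            extended[f"{a}/{b}"] = [x / y for x, y in zip(columns[a], columns[b])]
    return extended


def rank_correlations(columns):
    """Correlate every pair of columns and sort by |r|, strongest first.

    Weak pairs are kept in the list; they simply sink to the bottom.
    """
    results = [
        (a, b, pearson(columns[a], columns[b]))
        for a, b in combinations(columns, 2)
    ]
    return sorted(results, key=lambda item: abs(item[2]), reverse=True)


def segmented_correlations(xs, ys, n_segments):
    """Pearson r inside consecutive equal-count segments of x's sorted range.

    Assumes len(xs) >= 2 * n_segments so each segment has at least two points.
    """
    pairs = sorted(zip(xs, ys))
    size = len(pairs) // n_segments
    out = []
    for i in range(n_segments):
        start = i * size
        end = len(pairs) if i == n_segments - 1 else start + size
        chunk = pairs[start:end]
        out.append(pearson([p[0] for p in chunk], [p[1] for p in chunk]))
    return out


if __name__ == "__main__":
    data = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [1, -1, 1, -1]}
    for left, right, r in rank_correlations(add_ratio_parameters(data)):
        print(f"{left:>5} vs {right:<5} r = {r:+.3f}")

    # A U-shaped relationship: no overall linear correlation,
    # but a strong one inside each half of the x range.
    xs = [-3, -2, -1, 1, 2, 3]
    ys = [x * x for x in xs]
    print(pearson(xs, ys), segmented_correlations(xs, ys, 2))
```

In the U-shaped example the overall coefficient is essentially zero, while the two segments come out at roughly -0.99 and +0.99. That is the kind of change across a parameter's range that a per-segment analysis is meant to surface. Note that a ratio column is trivially related to its own numerator and denominator, so a real tool would presumably filter or flag such pairs; this sketch does not.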
Exploratory and Confirmatory Analyses
For scientists, the use of EDA for initial exploratory studies is crucial in the early stages of an experiment – both exploratory and confirmatory analyses enable discovery, hypothesis testing, and refinement of scientific hypotheses.
More detailed analysis would follow from initial discoveries of interesting and significant parameter correlations within complex high-dimensional data. An article was recently published in Nature on “Statistical Errors – p Values, the Gold Standard of Statistical Validity, Are Not as Reliable as Many Scientists Assume” (by Regina Nuzzo, Nature, 506, 150-152, 2014). In this article, Columbia University statistician Andrew Gelman states that instead of doing multiple separate small studies, “researchers would first do small exploratory studies and gather potentially interesting findings without worrying too much about false alarms. Then, on the basis of these results, the authors would decide exactly how they planned to confirm the findings.”
In other words, a disciplined scientific methodology that includes both exploratory and confirmatory analyses can be documented within an open science framework (e.g., https://osf.io) to demonstrate repeatability and reproducibility in scientific experiments. This would break down the walls of “black box” software that hide the complex analyses that are being applied to complex data. The ability of the scientist and her/his peers to reproduce an experiment’s rationale as well as its results will yield greater transparency in scientific research.
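The two-stage approach described above, exploring first and then confirming only what looked interesting, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a method taken from the Nature article or from AutoDiscovery: the function names are hypothetical, the data are split at random into two halves, and a simple threshold on the absolute Pearson coefficient stands in for whatever confirmation criterion a researcher would actually preregister.

```python
import math
import random
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no correlation signal
    return sxy / math.sqrt(sxx * syy)


def split_rows(columns, seed=0):
    """Randomly split the rows into an exploratory and a confirmatory half."""
    n_rows = len(next(iter(columns.values())))
    order = list(range(n_rows))
    random.Random(seed).shuffle(order)  # seeded so the split is reproducible
    half = n_rows // 2

    def take(rows):
        return {name: [values[i] for i in rows] for name, values in columns.items()}

    return take(order[:half]), take(order[half:])


def explore_then_confirm(columns, threshold=0.8, seed=0):
    """Stage 1: collect candidate pairs on the exploratory half.
    Stage 2: re-check only those candidates on the untouched half.
    """
    explore, confirm = split_rows(columns, seed)
    candidates = [
        (a, b)
        for a, b in combinations(columns, 2)
        if abs(pearson(explore[a], explore[b])) >= threshold
    ]
    confirmed = [
        (a, b)
        for a, b in candidates
        if abs(pearson(confirm[a], confirm[b])) >= threshold
    ]
    return candidates, confirmed


if __name__ == "__main__":
    a = list(range(20))
    data = {"a": a, "b": [3 * x + 1 for x in a], "c": [(-1) ** x for x in a]}
    candidates, confirmed = explore_then_confirm(data)
    print("candidates:", candidates)
    print("confirmed: ", confirmed)
```

The point of the design is that the confirmatory half is never consulted while candidates are being chosen, so a pair that survives stage 2 was not selected on the same data that confirms it. Recording the seed and threshold alongside the results is one small step toward the reproducibility discussed above.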