Sampling strategies for extracting data from big data
I am working on a template pipeline for flexible sampling of big data sets.
Sampling is very convenient, and sometimes there is no other way to analyze a big
data repository than through small data sets extracted from it. The approximate
results can be statistically verified. Generally speaking, the strategy is to
preserve statistical significance with respect to the population, preserve the
probability distribution, and choose a reasonable significance level.
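As a minimal sketch of that strategy (assuming a synthetic numeric population standing in for the big data repository, and a 95% confidence level): draw a simple random sample and verify that a confidence interval built from the sample covers the population mean.

```python
import math
import random
import statistics

random.seed(42)

# Synthetic "big data" population (assumption: stands in for the repository)
population = [random.gauss(100, 15) for _ in range(1_000_000)]

# Simple random sample without replacement
n = 10_000
sample = random.sample(population, n)

# 95% confidence interval for the population mean, estimated from the sample
mean = statistics.fmean(sample)
se = statistics.stdev(sample) / math.sqrt(n)
ci = (mean - 1.96 * se, mean + 1.96 * se)

pop_mean = statistics.fmean(population)
print(f"population mean: {pop_mean:.2f}")
print(f"sample mean: {mean:.2f}, 95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
print("population mean inside CI:", ci[0] <= pop_mean <= ci[1])
```

In practice the same check would be run against the real repository with a sampler appropriate to its storage layout (reservoir sampling for streams, stratified sampling for skewed categories).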
Multivariate data sets may also need to be analyzed for correlations between
variables, and resampling algorithms such as the bootstrap may be necessary.
This paper offers a thorough comparison of sampling algorithms, sample sizes,
and execution complexity. Naturally, results may differ depending on the data
set at hand.
https://www.researchgate.net/publication/322893311_Sampling_strategies_for_extracting_information_from_large_data_sets