May 23, 2021

Sampling for Big Data

 

Sampling strategies for extracting data from big data

I am working on a template pipeline for the flexible handling of big data sampling.

Analyzing small data sets extracted from the original big data repository is very convenient, and sometimes there is no other choice. The approximate results can be verified statistically. Generally speaking, the strategy is to draw a sample that stays representative of the population: it should preserve the population's probability distribution, and estimates computed from it should hold at a reasonable significance level.
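As a rough illustration of that idea, here is a minimal sketch that draws a uniform sample from a stream with reservoir sampling (Algorithm R) and then measures how far the sample's empirical distribution is from the population's with a two-sample Kolmogorov-Smirnov statistic. This is not the pipeline itself. The function names, the synthetic Gaussian population, the seed, and the sample size are all assumptions made for the example.

```python
import random
from bisect import bisect_right


def reservoir_sample(stream, k, rng):
    """Algorithm R: uniform sample of k items from an iterable of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randint(0, i)       # inclusive on both ends
            if j < k:
                reservoir[j] = item     # replace with probability k / (i + 1)
    return reservoir


def ks_statistic(sample, population):
    """Largest gap between the two empirical CDFs (two-sample KS statistic)."""
    s, p = sorted(sample), sorted(population)
    return max(
        abs(bisect_right(s, x) / len(s) - bisect_right(p, x) / len(p))
        for x in s + p
    )


# Hypothetical stand-in for the big data repository: synthetic Gaussian values.
rng = random.Random(42)
population = [rng.gauss(50.0, 10.0) for _ in range(100_000)]

sample = reservoir_sample(iter(population), 1_000, rng)
d = ks_statistic(sample, population)

# Rough 5% critical value for the two-sample KS test (large-sample approximation).
n, m = len(sample), len(population)
critical = 1.36 * ((n + m) / (n * m)) ** 0.5

print(f"KS statistic: {d:.4f}, approximate 5% critical value: {critical:.4f}")
```

A statistic below the critical value gives no evidence that the sample distorts the distribution. Reservoir sampling is only one option here; it is used because it needs a single pass and does not require knowing the size of the data in advance.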

Multivariate data sets might need to be analyzed in terms of the correlation between variables. Resampling algorithms such as the bootstrap might be necessary, for example to estimate how uncertain a correlation computed from the sample is.
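A minimal sketch of how the bootstrap could be applied to a correlation, assuming a percentile confidence interval for the Pearson coefficient between two variables. The helper names, the synthetic data, and the number of resamples are hypothetical choices for illustration only.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def bootstrap_corr_ci(xs, ys, n_boot=1000, alpha=0.05, rng=None):
    """Percentile bootstrap confidence interval for the Pearson correlation."""
    rng = rng or random.Random()
    n = len(xs)
    stats = []
    for _ in range(n_boot):
        # Resample row indices with replacement so (x, y) pairs stay together.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(pearson([xs[i] for i in idx], [ys[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Hypothetical sample with two correlated variables.
rng = random.Random(7)
xs = [rng.gauss(0.0, 1.0) for _ in range(500)]
ys = [0.6 * x + rng.gauss(0.0, 0.8) for x in xs]

r_hat = pearson(xs, ys)
lo, hi = bootstrap_corr_ci(xs, ys, rng=rng)
print(f"r = {r_hat:.3f}, 95% bootstrap CI = ({lo:.3f}, {hi:.3f})")
```

Resampling whole rows rather than each variable separately is what preserves the dependence between the variables. The percentile interval is the simplest variant; other bootstrap intervals may behave better for strongly skewed statistics.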

The paper linked below is a great comparison of sampling algorithms, sample sizes, and execution complexity. Results may differ depending on the data set at hand.

 

https://www.researchgate.net/publication/322893311_Sampling_strategies_for_extracting_information_from_large_data_sets



