*Result*: A review of subsampling methods for big data (大数据子抽样方法综述).
*Further Information*
*In the big data era, the volume of data grows exponentially, which makes statistical analysis of very large datasets a significant challenge. Processing and analyzing such datasets requires high-performance computing resources and incurs prohibitively high costs. To address this challenge, there is an urgent demand for methods that select a small subset of key data from the full dataset while still allowing researchers to obtain analytical results similar to those from the full dataset. Subsampling is one important method of this kind, and it has attracted significant attention from scholars in statistics, machine learning and data science. Subsampling offers a practical way to manage large datasets: by letting researchers focus on the most relevant and informative data points, it reduces the computational burden and cost. In this paper, we give a brief review of subsampling approaches, organized by whether a subsampling algorithm relies on a statistical model: model-based approaches and model-free approaches. Within the model-based framework, we discuss subsampling algorithms applicable to linear models, generalized linear models and nonlinear models. Each of these algorithms has its own strengths and limitations, and the choice among them depends substantially on the characteristics of the dataset and the research question. Besides the model-based approaches, we also present several new algorithms that fall into the model-free category. These algorithms do not rely on specific model assumptions, so they apply to a broader range of scenarios and offer greater flexibility and adaptability across different types of data and research questions. 
Furthermore, to evaluate the performance of these algorithms, we conducted a simulation study of four of them: uniform random sampling, parallel data-driven subsampling, information-based optimal subsampling, and the twinning algorithm. The results show the specific strengths and weaknesses of each algorithm and serve as a reference for researchers choosing the most suitable subsampling algorithm for a practical application. With a thorough understanding of how these algorithms perform, researchers can make better-informed decisions about how to manage and analyze their data, which in turn supports more accurate and meaningful conclusions. [ABSTRACT FROM AUTHOR]*
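The abstract names the algorithms but does not describe them, so the following is only a minimal sketch, under stated assumptions, of two of the four: uniform random sampling and an information-based optimal subsampling (IBOSS) style selection for a linear model. The selection rule used here (for each covariate, keep the rows with the most extreme values among the rows not yet chosen) is an assumption about how IBOSS is commonly described, not the paper's own code. The function names, the standard normal synthetic data and the spread measure `sxx` are all hypothetical choices made for illustration.

```python
import random


def uniform_subsample(n, k, rng):
    """Draw k distinct row indices uniformly at random from range(n)."""
    return sorted(rng.sample(range(n), k))


def iboss_subsample(X, k):
    """IBOSS-style selection (assumed rule): for each covariate in turn, take
    the r rows with the smallest and the r rows with the largest values among
    rows not yet selected, where r = k // (2 * p)."""
    n, p = len(X), len(X[0])
    r = k // (2 * p)
    if r < 1 or k > n:
        raise ValueError("need 2 * p <= k <= n")
    chosen = set()
    for j in range(p):
        # Sort the still-unselected rows by covariate j.
        remaining = sorted((i for i in range(n) if i not in chosen),
                           key=lambda i: X[i][j])
        chosen.update(remaining[:r])    # r smallest values of covariate j
        chosen.update(remaining[-r:])   # r largest values of covariate j
    return sorted(chosen)


def sxx(X, idx, j):
    """Sum of squared deviations of covariate j over the selected rows.
    In simple linear regression the slope variance is sigma^2 / Sxx, so a
    larger value means a more informative subsample for that slope."""
    vals = [X[i][j] for i in idx]
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals)


rng = random.Random(0)  # fixed seed so the run is reproducible
n, p, k = 5000, 2, 100
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]

uni = uniform_subsample(n, k, rng)
ibo = iboss_subsample(X, k)

for j in range(p):
    print(f"covariate {j}: Sxx uniform = {sxx(X, uni, j):.1f}, "
          f"Sxx IBOSS-style = {sxx(X, ibo, j):.1f}")
```

Both subsamples have the same size, but the IBOSS-style subsample concentrates on extreme covariate values and so should show a larger spread per covariate. This illustrates only the selection step; it does not reproduce the paper's simulation settings or results.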
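*In the era of big data, data volume grows exponentially. As datasets become ever larger, performing statistics and analysis on the complete dataset becomes increasingly difficult: it not only requires high-performance computing but also drives costs up sharply. One feasible response is to select a small portion of key data that achieves results close to those obtained from the complete dataset. Subsampling is one such important method. It offers a solution for the statistical analysis of large datasets, allowing researchers to focus on the most relevant and informative data points and effectively reducing the computational burden and cost of analyzing the complete dataset. The method has become a focus of attention for researchers in fields such as statistics, machine learning and data science. Taking as its starting point whether a subsampling method depends on a statistical model, this paper reviews the state of development of model-dependent and model-independent subsampling methods. For model-dependent methods, it introduces a selection of subsampling algorithms applicable to linear models, generalized linear models and nonlinear models; each algorithm has its own strengths and limitations. The paper also introduces several model-independent subsampling algorithms. These do not depend on specific model assumptions, are more flexible and adaptable when handling different types of data and problems, and suit a wider range of scenarios. To evaluate the performance of these algorithms, the paper runs simulations of four of them: uniform random subsampling, parallel data-driven subsampling, information-based optimal subsampling and twinning subsampling. By presenting the algorithms' performance, the paper gives researchers a basis for choosing a suitable subsampling algorithm in practical applications. [ABSTRACT FROM AUTHOR]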
Copyright of Journal Of Sichuan University (Natural Sciences Division) / Sichuan Daxue Xuebao-Ziran Kexueban is the property of Editorial Department of Journal of Sichuan University Natural Science Edition and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)*