Abstract:
The identification and classification of seismic events are of significant importance for seismic monitoring and earthquake disaster mitigation. This research focuses on 1935 low-magnitude (ML ≤ 3.0) seismic events in the North China region, encompassing three distinct event types: natural earthquakes, artificial explosions, and mining collapses. Preliminary analysis examined the geographical distribution, annual trends, and magnitude distribution of these events. Preprocessing of the raw seismic data included amplitude normalization, detrending, mean removal, and band-pass filtering (0.5–20 Hz). Short-time Fourier transform analysis was then used to visualize waveform and spectrogram characteristics, facilitating observation and analysis of both time-domain and frequency-domain features. Based on the analysis results, 62 features across 10 categories, including time, P/S amplitude ratio, frequency, zero-crossing rate, peak amplitude, peak ground acceleration, energy, signal characteristics, angle, and other ratios, were extracted as the basis for classification.
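For illustration, a minimal sketch of this preprocessing and spectrogram step, assuming the waveforms are available as miniSEED files readable by ObsPy; the file name, channel selection, and STFT window length are illustrative assumptions rather than details reported in the study:

```python
import numpy as np
from obspy import read
from scipy.signal import stft

st = read("event_waveform.mseed")   # hypothetical file name
tr = st[0]                          # a single-component trace is assumed

# Mean removal, detrending, band-pass filtering (0.5-20 Hz), amplitude normalization
tr.detrend("demean")
tr.detrend("linear")
tr.filter("bandpass", freqmin=0.5, freqmax=20.0, corners=4, zerophase=True)
tr.normalize()                      # scale peak amplitude to 1

# Short-time Fourier transform for the time-frequency (spectrogram) view
f, t, Zxx = stft(tr.data, fs=tr.stats.sampling_rate, nperseg=256)
spectrogram_db = 20 * np.log10(np.abs(Zxx) + 1e-12)
```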
This research employed the K-Nearest Neighbors (KNN), Adaptive Boosting (AdaBoost), and Light Gradient Boosting Machine (LGBM) algorithms to train models on the 62 extracted features for binary and ternary classification of natural earthquakes, artificial explosions, and mining collapses. The basic principles of the KNN, AdaBoost, and LGBM algorithms were first introduced, followed by a description of the training process for the classification models. To ensure a balanced sample distribution for each event type, events were selected to be uniformly distributed in time and geographical location. Ultimately, 545 events of each type, totaling 1635 seismic events, were chosen as the sample data for building the classification models. The dataset was divided into training and testing sets using a holdout method, with 75% of the data used for model construction and validation and 25% for evaluating model performance. The training data covered the main geographical range of the North China region (109.3°–123.5°E, 34.1°–43.7°N), ensuring that the models could capture the region's diversity and complexity. The testing data covered a slightly different geographical range (110.8°–124.1°E, 34.9°–42.7°N).
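A minimal sketch of the 75/25 holdout split and the three classifiers, assuming the 62 features are assembled into a (1635, 62) matrix `X` with labels `y` in {0, 1, 2}; the placeholder data and hyperparameters are assumptions, since they are not specified here:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier
from lightgbm import LGBMClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1635, 62))     # placeholder for the 62 extracted features
y = np.repeat([0, 1, 2], 545)       # 0 earthquake, 1 explosion, 2 collapse (545 each)

# 75% for model construction/validation, 25% for performance evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "AdaBoost": AdaBoostClassifier(n_estimators=100),
    "LGBM": LGBMClassifier(n_estimators=100),
}
for name, clf in models.items():
    clf.fit(X_train, y_train)
```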
The models were trained with label 0 representing natural earthquakes, label 1 representing artificial explosions, and label 2 representing mining collapses. Classification models built with KNN, AdaBoost, and LGBM were each trained and tested 100 times for the 0−1, 0−2, 1−2, and 0−1−2 classification tasks. AdaBoost and LGBM demonstrated superior performance compared to KNN across all classification tasks, especially the 0−1 and 0−1−2 tasks. LGBM consistently exhibited the best overall performance, maintaining an accuracy above 95% with high stability. Among the different tasks, the 0−2 classification yielded the most outstanding results, followed by the 1−2 classification.
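The repeated-evaluation protocol might be sketched as follows, reusing `X`, `y`, and the imports from the sketch above; subsets for the binary tasks are formed by masking the labels, and mean test accuracy over 100 resampled holdout rounds is reported (the exact resampling scheme is an assumption):

```python
tasks = {"0-1": (0, 1), "0-2": (0, 2), "1-2": (1, 2), "0-1-2": (0, 1, 2)}
makers = {"KNN": KNeighborsClassifier,
          "AdaBoost": AdaBoostClassifier,
          "LGBM": LGBMClassifier}

for task_name, classes in tasks.items():
    mask = np.isin(y, classes)
    X_task, y_task = X[mask], y[mask]
    for algo_name, make in makers.items():
        scores = []
        for seed in range(100):     # 100 independent train/test rounds
            X_tr, X_te, y_tr, y_te = train_test_split(
                X_task, y_task, test_size=0.25, stratify=y_task, random_state=seed)
            scores.append(make().fit(X_tr, y_tr).score(X_te, y_te))  # test accuracy
        print(f"{task_name} {algo_name}: mean accuracy = {np.mean(scores):.4f}")
```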
After the classification models were trained, they were comprehensively evaluated on the testing data. Each model was used to identify the event types in the testing data, yielding performance results for each model across the different classification tasks. Confusion matrices generated from the identification results demonstrated excellent performance on each classification task, particularly the 0−2 classification with all three algorithms.
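A confusion matrix for one trained model on the testing data can be produced as below; `clf`, `X_test`, and `y_test` follow the holdout sketch above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

y_pred = clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred)   # rows: true class, columns: predicted class
ConfusionMatrixDisplay(cm, display_labels=["earthquake", "explosion", "collapse"]).plot()
plt.show()
```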
Based on the confusion matrices, performance evaluation metrics including accuracy, precision, recall, and F1 score were calculated. In the 0−1 classification task, AdaBoost performed best, achieving an accuracy of 96.69%. In the 0−2 classification task, all three algorithms performed well, with all metrics exceeding 99.26%. In the 1−2 and 0−1−2 classifications, LGBM exhibited the best performance. Overall, every classification model performed excellently, with accuracy, precision, recall, and F1 score all exceeding 89.71%.
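For reference, these metrics follow from the confusion-matrix counts (TP, TN, FP, FN) in the usual way; for the ternary task a per-class (macro-averaged) computation is one common convention, though the averaging scheme is not specified here:

```latex
\begin{aligned}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN}, &
\text{Precision} &= \frac{TP}{TP + FP}, \\
\text{Recall}    &= \frac{TP}{TP + FN}, &
F_1 &= \frac{2\,\text{Precision}\cdot\text{Recall}}{\text{Precision} + \text{Recall}}.
\end{aligned}
```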
LGBM exhibited superior overall performance, maintaining an accuracy above 95.90% and demonstrating high stability. KNN still has significant room for improvement, possibly due to its sensitivity to the data, which results in relatively weaker performance than AdaBoost and LGBM. AdaBoost's overall performance lies between that of LGBM and KNN.
Finally, ROC curves were plotted to visualize the recognition of the testing dataset by the three classification algorithms (KNN, AdaBoost, and LGBM). While the KNN models for the 0−1 and 1−2 classifications still require optimization, all other models, including those for the ternary classification, performed exceptionally well. The confusion matrices and evaluation metrics indicate that the constructed classification models perform well on the testing data. ROC curve analysis further confirms their excellent performance across the various tasks and reveals the applicability of the different algorithms to their respective tasks, providing strong support for the practical application of the models.
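A sketch of the ROC analysis, assuming the classifiers expose class probabilities via `predict_proba` and that one-vs-rest curves are drawn per class for the ternary case (the exact plotting setup is an assumption); `clf`, `X_test`, and `y_test` follow the earlier holdout sketch:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize

y_score = clf.predict_proba(X_test)             # shape (n_samples, n_classes)
y_bin = label_binarize(y_test, classes=[0, 1, 2])

for i, label in enumerate(["earthquake", "explosion", "collapse"]):
    fpr, tpr, _ = roc_curve(y_bin[:, i], y_score[:, i])
    plt.plot(fpr, tpr, label=f"{label} (AUC = {auc(fpr, tpr):.3f})")

plt.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance diagonal
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```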