The observed preprocessing strategies for doing automatic text summarizing

ABSTRACT


INTRODUCTION
When looking for information, people have to sift through hundreds or even thousands of results on the internet, a direct consequence of the exponential growth in the amount of available data. The many informational resources found on the internet call for in-depth research in natural language processing (NLP). Because of this, automatic text summarization systems were developed and quickly gained popularity as a way to condense information so that it is easier for people to comprehend [1]. Documents that summarize text will be needed more and more often to help address these issues.
Automated text summarization (ATS) is a method of extracting the essence of information from text documents while preserving their overall meaning [2]. ATS is now one of the most popular NLP research fields, producing high-quality short paragraphs that cover the main body of a text document. Readability, coherence, syntax, non-redundancy, sentence order, diversity of information, and information coverage are some factors to consider for a good summary [3]. Automatic summarization techniques are classified into two types: extractive and abstractive [4], [5]. An extractive summary extracts the most important sections of a document without modifying the wording. Abstractive summaries rephrase sentences to form new ones, and the results can be comparable to human summaries. Abstractive summaries are more difficult to produce because they involve meaning representation, content arrangement, surface realization, and intuitive understanding [6], [7]. Although the topics, data types, and algorithms differ, there are several lines of ATS research. On a blog summarization dataset, extractive research was carried out using the SummCoder algorithm [8]. On that dataset SummCoder obtained ROUGE-1 (78.0), ROUGE-2 (71.7), ROUGE-SU4 (71.8), and ROUGE-L (72.7), followed by Com01 and Alg09 with ROUGE-1 scores of 77.0 and 76.0, respectively. An abstractive study used a genetic semantic graph to summarize Indonesian news [9]; a 100-word summary achieved an average ROUGE-2 of 0.32 and a 200-word summary an average ROUGE-2 of 0.39. Study [10] reports abstractive summarization in Indonesian using BiGRU, with ROUGE-1 (0.11975) and ROUGE-2 (0.01199) in scenario 1 with 128 hidden units, and ROUGE-1 (0.06745) and ROUGE-2 (0.0055) in scenario 2 with 64 hidden units.
Raw data is prone to noise, missing values, and inconsistencies, which degrade the accuracy of the results [11]. Data preprocessing is an important first step in determining data quality. Preprocessing structures text documents so that a machine can read them easily, and some form of data preparation is used in every system created for text processing and NLP. According to research findings [12], [13], preprocessing improves system performance. Unfortunately, earlier studies did not discuss the impact of the various preprocessing approaches used, and it is also unclear which combination of preprocessing steps delivers the best performance. As a result, this study concentrates on applying various preprocessing approaches to determine the effect of preprocessing on automatic text summarization. The summaries produced by the system will be evaluated with ROUGE. Recall-oriented understudy for gisting evaluation (ROUGE) [14] is an evaluation metric that automatically assesses the results of summarizing text documents.
Combining suitable features and preprocessing stages can improve summary performance and reduce the amount of computation required [15]. Based on these findings, the features and preprocessing tasks that achieve the best summary performance may differ depending on the text's domain and the success metrics used. The purpose of this research is therefore to determine the impact of preprocessing techniques on summarization results, so that it can identify which preprocessing stages are influential and which techniques are actually required. The remainder of this paper is organized as follows. Section 2 describes the research methodology, datasets, and experimental scenarios. Section 3 presents the experimental results, and section 4 concludes.

METHOD
Research flow
This research aims to examine the impact of combinations of preprocessing stages on the performance of a pretrained model and to evaluate model performance systematically using the proposed preprocessing techniques. It is hoped that this research can contribute to the development of more accurate and efficient natural language models. The system to be created is shown in Figure 1.

Figure 1. Research flow
In the research flow shown in Figure 1, the researcher first loads and prepares the dataset. After that, the data is split into training and testing sets. Then 16 experimental combinations of preprocessing stages are carried out, covering data cleaning, stemming, stopword removal, and case folding. After the preprocessing stage is applied, the next step is to feed the data to the model and run the training process. In the final stage, evaluation and testing are carried out using the test data.
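As a rough sketch of how the 16 scenarios can be enumerated, each experiment simply toggles the four techniques on or off (the ordering and names below are illustrative and need not match Table 1):

```python
from itertools import product

# The four preprocessing techniques toggled in the experiments.
TECHNIQUES = ["data_cleaning", "stemming", "stopword_removal", "case_folding"]

# Every on/off combination of the four techniques: 2^4 = 16 experiment scenarios.
scenarios = [
    dict(zip(TECHNIQUES, flags))
    for flags in product([False, True], repeat=len(TECHNIQUES))
]

for i, scenario in enumerate(scenarios, start=1):
    enabled = [name for name, on in scenario.items() if on] or ["none"]
    print(f"Experiment {i:2d}: {', '.join(enabled)}")
```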

Transformers - BERT
The Transformers architecture has two essential components, namely an encoder and a decoder. The encoder captures the input sequence and converts it into an internal numerical representation. The decoder turns the result of the machine process into output that humans can understand [16]. The two are linked by the attention mechanism, shown in Figure 2 [17]. The basic transformer consists of an encoder to read the input text and a decoder to generate the predicted output. BERT only requires the encoder to produce a language representation model. The BERT architecture has two phases of use, namely pre-training and fine-tuning, which can be combined for various tasks. There are slight differences between the pre-training and fine-tuning architectures, as shown in Figure 3. Study [18] introduces IndoBERT, a modified version of BERT following the BERT-Base (uncased) configuration. IndoBERT was trained on about 220 million words drawn from three main sources, namely Indonesian Wikipedia (74M words), Kompas, Tempo, and Liputan6 articles (55M words in total), and an Indonesian Web Corpus (90M words).

Preprocessing
The dataset goes through several preprocessing stages before it is used to summarize the text. Tokenization is the process of splitting text into tokens; words, symbols, numbers, punctuation marks, and other essential entities can be considered tokens. Stopword removal aims to keep only the important words from the token results. Stemming reduces the number of index terms by removing affixes and returning words to their base forms. The stopword removal and stemming used in this study rely on the Sastrawi library. Several combinations are applied in this study by adding data cleaning and case folding. Data cleaning removes digits, punctuation, URLs, and extra white space from strings. Case folding changes all letters in a document to lowercase. The common practice in most automatic text summarization studies is to apply all the preprocessing methods without thoroughly analyzing their contribution to summary performance. Therefore, this study involves several experimental scenarios to see whether the preprocessing stage influences the summarization system, using the four preprocessing techniques listed in Table 1.
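A minimal sketch of the four preprocessing steps, assuming the Sastrawi library is installed (`pip install Sastrawi`); the regular expressions used for data cleaning are illustrative rather than the exact ones used in the study:

```python
import re

from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory

stemmer = StemmerFactory().create_stemmer()
stopword_remover = StopWordRemoverFactory().create_stop_word_remover()

def clean(text: str) -> str:
    """Data cleaning: drop URLs, digits, punctuation, and extra white space."""
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    text = re.sub(r"\d+", " ", text)            # digits
    text = re.sub(r"[^\w\s]", " ", text)        # punctuation
    return re.sub(r"\s+", " ", text).strip()    # white space

def case_fold(text: str) -> str:
    """Case folding: lowercase every letter in the document."""
    return text.lower()

def remove_stopwords(text: str) -> str:
    """Stopword removal with Sastrawi's Indonesian stopword list."""
    return stopword_remover.remove(text)

def stem(text: str) -> str:
    """Stemming with Sastrawi: strip affixes back to base forms."""
    return stemmer.stem(text)

sample = "Pemerintah meresmikan 3 jembatan baru, lihat https://contoh.id"
print(stem(remove_stopwords(case_fold(clean(sample)))))
```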

Model
In the modelling stage, the pretrained IndoBERT model is fine-tuned so that it can be used in the summarization process. The IndoBERT architecture is shown in Figure 4. It is hoped that by using IndoBERT as the base model and fine-tuning it, the model can produce quality summary text according to user needs. As shown in Figure 4, IndoBERT converts an input sentence into a token sequence. During tokenization, the special tokens [CLS], [SEP], and [PAD] are added. The [CLS] token marks the start of a sentence, the [SEP] token separates sentences, and the [PAD] token adds padding up to the initialized maximum token length. When implementing text summarization, the [CLS] and [SEP] tokens are inserted at the beginning and end of each sentence. Bidirectional training is carried out by masking a certain percentage of the input tokens during pretraining. The parameters of the transformer encoder and MLP layers are randomly initialised [24]. The transformer encoder is configured as follows: layers=2, hidden size=768, feed-forward=2,048, and heads=8. The hyperparameters are trained using the Adam optimizer with learning rate=3e-5, batch size=16, epochs=7, and weight decay=5e-3; a sketch of this setup is given after the hardware list below. The hardware and software used are as follows:
− Device name: Asus Vivobook A416JA laptop
− RAM: 8 GB
− GPU: Intel UHD Graphics
− CPU: Intel Core i3-1005G1
− Software: Google Colaboratory
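A hedged sketch of how this fine-tuning setup could be wired up with the Hugging Face transformers library; the checkpoint name indobenchmark/indobert-base-p1 is an assumption about which public IndoBERT release is used, and the summarization head itself is omitted:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed public IndoBERT checkpoint; the paper's exact pretrained weights may differ.
CHECKPOINT = "indobenchmark/indobert-base-p1"

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
encoder = AutoModel.from_pretrained(CHECKPOINT)

# [CLS] ... [SEP] markers are added automatically; [PAD] fills up to max_length.
batch = tokenizer(
    ["Kalimat pertama dari artikel.", "Kalimat kedua dari artikel."],
    padding="max_length",
    truncation=True,
    max_length=128,
    return_tensors="pt",
)

# Hyperparameters reported in the paper: lr=3e-5, batch size 16, 7 epochs, weight decay 5e-3.
optimizer = torch.optim.AdamW(encoder.parameters(), lr=3e-5, weight_decay=5e-3)

with torch.no_grad():
    outputs = encoder(**batch)
print(outputs.last_hidden_state.shape)  # (batch, seq_len, hidden=768)
```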

Evaluation
ROUGE [14] is an evaluation metric or parameter that automatically evaluates the results of summarizing text documents. ROUGE evaluates the summary results by comparing the machine output with the human-written result (gold summary). The most popular evaluation metrics used for ATS are ROUGE-N and ROUGE-L. ROUGE-N is a recall calculation based on the n-gram overlap between the gold summary and the machine-summarized text. The values of n most often used are n=1 (ROUGE-1) and n=2 (ROUGE-2). For example, let x be the number of n-grams shared between the gold standard summary and the machine-summarized text, and let y be the number of n-grams in the gold standard summary. Then ROUGE-N can be calculated with the following formula,

ROUGE-N = x / y
ROUGE-L evaluates text summaries by comparing the longest common subsequence (LCS), that is, the longest sequence of words shared between the machine-generated summary and the gold standard summary. For example, let z be the number of words in the gold standard summary; then ROUGE-L can be calculated with the following formula,

ROUGE-L = LCS(machine summary, gold summary) / z
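A small self-contained sketch of the two recall formulas above (whitespace tokenization is assumed for simplicity; published ROUGE implementations add stemming and other normalization):

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_recall(gold: str, system: str, n: int = 1) -> float:
    """ROUGE-N = x / y: x counts overlapping n-grams (clipped),
    y counts the n-grams in the gold standard summary."""
    gold_counts = Counter(ngrams(gold.split(), n))
    sys_counts = Counter(ngrams(system.split(), n))
    overlap = sum(min(count, sys_counts[g]) for g, count in gold_counts.items())
    total = sum(gold_counts.values())
    return overlap / total if total else 0.0

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            table[i][j] = table[i - 1][j - 1] + 1 if x == y else max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

def rouge_l_recall(gold: str, system: str) -> float:
    """ROUGE-L = LCS(system, gold) / z, where z is the gold summary word count."""
    gold_tokens, sys_tokens = gold.split(), system.split()
    return lcs_length(sys_tokens, gold_tokens) / len(gold_tokens) if gold_tokens else 0.0

gold = "presiden meresmikan jembatan baru di jakarta"
system = "presiden resmikan jembatan baru jakarta"
print(rouge_n_recall(gold, system, 1), rouge_n_recall(gold, system, 2), rouge_l_recall(gold, system))
```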

RESULTS AND DISCUSSION
This section reports the results of 16 experiments conducted to assess the accuracy of the summary results before and after applying the preprocessing methods. The scenarios differ only in the preprocessing steps applied. This text summarization test uses the IndoSum dataset [25] of 14,262 news articles, divided into 80% training data and 20% validation data. The news articles are taken from Indonesian-language news portals and come with titles, categories, and two manually written gold standard summaries. The test data, consisting of 3,762 articles, had the preprocessing stages applied according to the experimental scenario used to test the model. The summaries are then evaluated with the ROUGE score to determine the accuracy of the system being built. Table 2 lists the results of the 16 experiments that have been carried out.
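As an illustration of how per-experiment ROUGE scores like those in Table 2 can be computed, a sketch using the rouge_score package (the paper does not state which ROUGE implementation it used, and the summarize function here is a hypothetical stand-in for the fine-tuned model):

```python
from statistics import mean

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=False)

def evaluate(test_set, summarize, preprocess):
    """Average ROUGE recall over the test set for one preprocessing scenario.

    test_set:   iterable of (article, gold_summary) pairs
    summarize:  fine-tuned model wrapped as a text -> text function (hypothetical)
    preprocess: the scenario's chain of preprocessing functions
    """
    r1, r2, rl = [], [], []
    for article, gold in test_set:
        prediction = summarize(preprocess(article))
        scores = scorer.score(gold, prediction)
        r1.append(scores["rouge1"].recall)
        r2.append(scores["rouge2"].recall)
        rl.append(scores["rougeL"].recall)
    return mean(r1), mean(r2), mean(rl)
```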
Based on Table 2, of the 16 experiments carried out, the highest ROUGE scores were found in experiment 9, with ROUGE-1 (0.78), ROUGE-2 (0.60), and ROUGE-L (0.68). The best system performance is obtained when combining data cleaning and case folding. The high ROUGE score in experiment 9 is due to the data cleaning process, which removes noise from the data. Applying case folding also has an effect, because the data becomes structured and consistent in its use of capital letters. However, case folding without data cleaning gives the lowest results, as in experiment 7, which obtained ROUGE-1 (0.16), ROUGE-2 (0.05), and ROUGE-L (0.15). The low results in experiment 7 were caused by the absence of data cleaning, so the model could not capture the information contained in the original text and the summary quality suffered. Preprocessing that includes data cleaning performs better than preprocessing without it, as shown in Figure 5. In Figure 5, the nine highest-scoring experiments all used data cleaning, except for experiment 6. That experiment used only stemming, yet obtained a better average ROUGE score than experiment 8, which combined data cleaning and stemming; the stemming step itself removed most of the dirty data, although a few dashes remained, which influenced the ROUGE score. Experimental combinations without data cleaning hurt the accuracy of the seven lowest-scoring experiments. Most tests involving stopword removal, stemming, and case folding produce low accuracy. Meanwhile, using stopword removal and stemming together with data cleaning has a negative effect, even though stopword removal and stemming on their own produce good accuracy. This is because stopword removal and stemming can delete words whose removal reduces the information in a sentence, so the resulting features no longer describe the data well. Using a large number of preprocessing techniques therefore does not guarantee better system performance.
Table 3 shows a sample from experiment 7, in which only case folding was applied in the preprocessing stage. Before preprocessing, the article and reference summary columns still contained unnecessary punctuation, URLs, digits, and white space. After the preprocessing stage, the data still looks much the same as before; only the letters are now all lowercase. The summary generated by the model is also poor, because the system fails to capture the complete information.
The test results from experiment 9 are shown in Table 4, a sample to which the preprocessing stages have been applied. Before preprocessing, the article and reference summary columns contained unnecessary punctuation, URLs, digits, and white space. Data cleaning and case folding were applied to these columns. After the preprocessing stage, the text looks cleaner and is easier to read and understand. The summary generated by the model also looks good, with information similar to the reference summary.
Table 3. Sample of summarization result of experiment 7

Figure 3. BERT pre-training and fine-tuning model architecture

Figure 4. Architecture of the IndoBERT summarization model

Figure 5. ROUGE scores in ascending order

Table 1. Experiments of the preprocessing methods

Table 2. Results of the experiments