Vol 8, No 3 (2017), Electrical, Electronics and Computer Engineering

Generating Artificial Error Data for Indonesian Preposition Error Corrections

Budi Irmawati, Hiroyuki Shindo, Yuji Matsumoto

 

Abstract: Large-scale annotated data written by second language learners are not always available for low-resource languages such as Indonesian. To cope with this scarcity, it is important to generate ‘learner-like’ artificial error sentences when the available real learner data are insufficient and language experts cannot construct more. In this paper, we propose a new method for generating effective error-injected artificial data to increase the number of training examples for preposition error correction tasks. Our method first generates a large amount of noisy artificial error data using a simple error injection method. It then selectively removes the uninformative (noisy) instances from the artificial data. We assume that ‘good’ artificial preposition error data would be effective training data for error correction tasks. Therefore, to evaluate the quality of the generated artificial data, we used them as training data to correct preposition errors in real learners’ sentences. The results indicate that training on our artificial data improves preposition error correction performance. They also show that training on a smaller set of good instances outperforms training on a much larger set of noisy instances, as well as training on sentences written by native speakers. This method is language-independent and easy to apply to other low-resource languages because it assumes only a small amount of learner error data and uses features that can be extracted automatically from linguistically annotated sentences.
Keywords: Artificial data; Indonesian language; Low-resourced languages; Noise removal; Preposition error correction
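
The two-stage idea in the abstract (inject errors, then discard uninformative instances) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the preposition list, the random-substitution rule, the function names, and the placeholder scoring function are hypothetical and are not the confusion set, injection procedure, or selection criterion used by the authors.

```python
import random

# Illustrative set of common Indonesian prepositions (an assumption,
# not the confusion set used in the paper).
PREPOSITIONS = ["di", "ke", "dari", "pada", "kepada", "untuk", "dengan"]


def inject_preposition_errors(tokens, rng, error_rate=1.0):
    """Replace prepositions with a different one chosen at random.

    Returns (noisy_tokens, labels), where labels[i] holds the original
    preposition at each replaced position and None elsewhere.
    """
    noisy, labels = [], []
    for tok in tokens:
        if tok.lower() in PREPOSITIONS and rng.random() < error_rate:
            candidates = [p for p in PREPOSITIONS if p != tok.lower()]
            noisy.append(rng.choice(candidates))
            labels.append(tok.lower())
        else:
            noisy.append(tok)
            labels.append(None)
    return noisy, labels


def filter_instances(instances, score, threshold):
    """Keep only instances whose score reaches the threshold.

    `score` is a placeholder for whatever informativeness measure is
    used; the abstract does not specify one.
    """
    return [inst for inst in instances if score(inst) >= threshold]


if __name__ == "__main__":
    rng = random.Random(0)
    sentence = ["Saya", "pergi", "ke", "pasar"]  # "I go to the market"
    noisy, labels = inject_preposition_errors(sentence, rng)
    print(noisy, labels)
```

In this sketch, each noisy sentence paired with its labels would serve as one artificial training instance, and `filter_instances` stands in for the noise-removal step with a scoring function supplied by the caller.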

