BEA shared task - 2019: the dataset released for the BEA-2019 Shared Task on Grammatical Error Correction provides a newer and larger dataset for evaluating GEC models in 3 tracks, defined by the data allowed for training: Restricted track, Unrestricted track, ...
Topics: natural-language-processing, corpus, dataset, corpus-data, corpus-tools, gec, nlp-datasets, grammatical-error-correction, ukrainian-language. Updated Feb 11, 2024.
awasthiabhijeet/PIE (227 stars): Fast + Non-Autoregressive Grammatical Error Correction using BERT. Code and pre-trained models for the paper "Parallel Iterative Edit...
Previous researchers have proposed a variety of data augmentation methods to generate additional training data and enlarge the dataset, but these methods either rely on hand-written rules to generate grammatical errors, and are therefore not fully automated, or produce errors that do not match the errors humans make in real writing. The pre-trained ...
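The rule-based augmentation criticized above typically corrupts clean text with a small, fixed inventory of perturbations. The snippet below is a minimal sketch of that idea; the error types, probabilities, and the `corrupt` helper are illustrative assumptions rather than the method of any particular paper.

```python
import random

# Illustrative rule-based noising: corrupt a clean sentence with a few
# hand-written perturbations to create a synthetic (erroneous, correct) pair.
# The rules and probabilities here are assumptions for demonstration only.

ARTICLES = {"a", "an", "the"}
CONFUSION = {"their": "there", "there": "their", "its": "it's", "it's": "its"}

def corrupt(tokens, p=0.15, rng=random):
    noisy = []
    for tok in tokens:
        r = rng.random()
        if tok.lower() in ARTICLES and r < p:
            continue                                   # article deletion
        if tok.lower() in CONFUSION and r < p:
            noisy.append(CONFUSION[tok.lower()])       # confusion-set substitution
            continue
        if r < p / 3 and noisy:
            noisy.append(noisy[-1])                    # word-repetition error
        noisy.append(tok)
    return noisy

clean = "The students have submitted their assignments".split()
print(" ".join(corrupt(clean)), "->", " ".join(clean))
```

Because the inventory of rules is fixed, such noise covers only a narrow slice of error types, which is exactly the mismatch with natural human errors noted above.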
NaSGEC: a Multi-Domain Chinese Grammatical Error Correction Dataset from Native Speaker Texts
Authors: Yue Zhang, Bo Zhang, Haochen Jiang, Zhenghua Li, Chen Li, Fei Huang, Min Zhang
Conference: ACL Findings
Link: https://aclanthology.org/2023.findings-acl.630/
Abstract: We introduce NaSGEC,...
Experiments show that the new model can effectively correct errors of both types by incorporating word and character-level information, and that the model significantly outperforms previous neural models for GEC as measured on the standard CoNLL14 benchmark dataset. Further analysis also sho...
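One common way to combine word- and character-level information, as described above, is to concatenate each token's word embedding with a pooled character-level encoding of the same token. The sketch below illustrates that generic scheme; the vocabulary sizes, dimensions, and GRU pooling are assumptions for illustration, not the architecture of the cited model.

```python
import torch
import torch.nn as nn

class WordCharEmbedding(nn.Module):
    """Concatenate a word embedding with a character-level encoding.

    Generic sketch: vocabulary sizes and dimensions are placeholders.
    """
    def __init__(self, word_vocab=10000, char_vocab=100,
                 word_dim=128, char_dim=32, char_hidden=64):
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab, word_dim)
        self.char_emb = nn.Embedding(char_vocab, char_dim)
        self.char_rnn = nn.GRU(char_dim, char_hidden, batch_first=True)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, seq_len); char_ids: (batch, seq_len, max_chars)
        b, s, c = char_ids.shape
        w = self.word_emb(word_ids)                    # (b, s, word_dim)
        ch = self.char_emb(char_ids.view(b * s, c))    # (b*s, c, char_dim)
        _, h = self.char_rnn(ch)                       # h: (1, b*s, char_hidden)
        ch = h.squeeze(0).view(b, s, -1)               # (b, s, char_hidden)
        return torch.cat([w, ch], dim=-1)              # word + char features

emb = WordCharEmbedding()
out = emb(torch.randint(0, 10000, (2, 5)), torch.randint(0, 100, (2, 5, 8)))
print(out.shape)  # torch.Size([2, 5, 192])
```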
To mitigate the limited number of available training samples, a denoising autoencoder is used to generate a synthetic dataset for pretraining. Additionally, a new character-level transformation is proposed to enhance the sequence-to-edit approach and improve the model's ...
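For orientation, in a sequence-to-edit setup the model predicts one edit tag per source token, and the corrected sentence is produced by applying those tags. The sketch below shows that application step using a GECToR-style tag inventory ($KEEP, $DELETE, $APPEND_x, $REPLACE_x), which is assumed here for illustration and is not necessarily the exact transformation set of the work summarized above.

```python
# Minimal sketch of applying per-token edit tags (sequence-to-edit decoding).
# The tag inventory ($KEEP, $DELETE, $APPEND_x, $REPLACE_x) is a common
# GECToR-style convention used here for illustration.

def apply_edits(tokens, tags):
    out = []
    for tok, tag in zip(tokens, tags):
        if tag == "$KEEP":
            out.append(tok)
        elif tag == "$DELETE":
            continue
        elif tag.startswith("$REPLACE_"):
            out.append(tag[len("$REPLACE_"):])
        elif tag.startswith("$APPEND_"):
            out.append(tok)
            out.append(tag[len("$APPEND_"):])
    return out

src  = ["She", "go", "to", "school", "yesterday"]
tags = ["$KEEP", "$REPLACE_went", "$KEEP", "$KEEP", "$KEEP"]
print(" ".join(apply_edits(src, tags)))  # She went to school yesterday
```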
MuCGEC: a Multi-Reference Multi-Source Evaluation Dataset for Chinese Grammatical Error Correction
This paper presents MuCGEC, a multi-reference multi-source evaluation dataset for Chinese Grammatical Error Correction (CGEC), consisting of 7,063 sentences collected from three Chinese-as-a-Second-Languag...
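Because each sentence comes with multiple references, systems are typically credited against the best-matching reference. The sketch below illustrates that multi-reference scoring idea with pre-extracted edit spans and a simple F0.5; it is a simplification of what evaluators such as the M2 scorer or ERRANT-style tools actually compute from alignments.

```python
# Simplified multi-reference scoring: compare a hypothesis edit set against
# each reference edit set and keep the best F0.5. Real evaluators extract the
# edits by aligning source and corrected sentences; here edits are given.

def f_beta(tp, fp, fn, beta=0.5):
    if tp == 0:
        return 0.0
    p, r = tp / (tp + fp), tp / (tp + fn)
    return (1 + beta**2) * p * r / (beta**2 * p + r)

def best_reference_score(hyp_edits, reference_edit_sets):
    scores = []
    for ref_edits in reference_edit_sets:
        tp = len(hyp_edits & ref_edits)
        fp = len(hyp_edits - ref_edits)
        fn = len(ref_edits - hyp_edits)
        scores.append(f_beta(tp, fp, fn))
    return max(scores)

hyp = {(1, 2, "went")}                       # (start, end, correction) spans
refs = [{(1, 2, "went"), (4, 4, ",")},       # reference annotator 1
        {(1, 2, "went")}]                    # reference annotator 2
print(best_reference_score(hyp, refs))       # 1.0 (matches annotator 2 exactly)
```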
Dataset
English: based on BART.large, apply BPE.
German, Czech, and Russian: based on mbart.cc25, detokenize the GEC data and apply SentencePiece. For evaluation, use a spaCy-based tokenizer for German and Russian, and the MorphoDiTa tokenizer for Czech. ...
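As a concrete illustration of this preprocessing, the snippet below runs a SentencePiece model over detokenized source/target GEC files before training. The model and file paths are placeholders, and the English/BART.large setup would analogously apply its BPE codes instead.

```python
import sentencepiece as spm

# Placeholder paths: point these at the mbart.cc25 SentencePiece model and
# your detokenized GEC parallel data (assumptions for illustration).
sp = spm.SentencePieceProcessor(model_file="mbart.cc25/sentence.bpe.model")

def encode_parallel(src_path, tgt_path, out_src, out_tgt):
    # Subword-encode parallel source/target files line by line.
    with open(src_path) as fs, open(tgt_path) as ft, \
         open(out_src, "w") as out_s, open(out_tgt, "w") as out_t:
        for src, tgt in zip(fs, ft):
            out_s.write(" ".join(sp.encode(src.strip(), out_type=str)) + "\n")
            out_t.write(" ".join(sp.encode(tgt.strip(), out_type=str)) + "\n")

encode_parallel("train.src", "train.tgt", "train.spm.src", "train.spm.tgt")
```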
We overview the current state of GEC by evaluating the performance of four leading systems on this new dataset. We analyze the edits made in JFLEG and summarize which types of changes the systems successfully make, and which they need to address. JFLEG will enable the field to move beyond minimal error...
the most prominent peaks in the global responses were identified. This is believed to be optimal for approaching the data in an unbiased way [52] by focusing on the periods of largest neuronal activity overall and thus avoiding double-dipping in dataset comparisons. These GFP curves manifested 2 distinct ...