The OneStopEnglish [2] corpus was designed to support automatic readability assessment and text simplification. More documentation on the OneStopEnglish corpus can be found in [1]. This package contains a modified version of the OneStopEnglish corpus. The modifications fix some mistakes which most likely slipped in during the conversion from .pdf to plain text. The tm.corpus.OneStopEnglish package was compiled to make the data more easily available from R.
library("tm")
## Loading required package: NLP
library("tm.corpus.OneStopEnglish")
The tm.corpus.OneStopEnglish package contains a parallel corpus of texts: articles from The Guardian which have been rewritten for three different reading levels.
data("ose_corpus")
The ose_corpus object contains the documents of the OneStopEnglish corpus for the three reading levels ‘elementary’, ‘intermediate’ and ‘advanced’.
ose_corpus
## $elementary
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 189
##
## $intermediate
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 189
##
## $advanced
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 189
The documents are aligned.
i <- 130
substr(content(ose_corpus[["elementary"]])[i], 1, 200)
## [1] "It is no longer legal to smoke a cigarette inside a bar in the worlds drinking capital, New Orleans, Louisiana. Since Hurricane Katrina in 2005, New Orleans city government has begun trying to reduce "
substr(content(ose_corpus[["intermediate"]])[i], 1, 200)
## [1] "It is no longer legal to smoke a cigarette inside a bar in the worlds drinking capital, New Orleans, Louisiana. Many other cities have banned indoor smoking but New Orleans is different it attracts to"
substr(content(ose_corpus[["advanced"]])[i], 1, 200)
## [1] "You can no longer legally smoke a cigarette inside a bar in the worlds drinking capital, New Orleans, Louisiana. City after city has banned indoor smoking but that's different because other cities don"
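Since the three sub-corpora are aligned document by document, a quick consistency check is to verify that they all contain the same number of documents (a minimal sketch using the ose_corpus object loaded above):
sapply(ose_corpus, length)
##   elementary intermediate     advanced 
##          189          189          189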
The annotations can be created with NLPclient or StanfordCoreNLP. NLPclient is available from https://cran.r-project.org. StanfordCoreNLP is available from https://datacube.wu.ac.at/. The code to reproduce the annotations can be found at https://readability.r-forge.r-project.org/. Since creating the annotations is time-consuming, we also provide them ready-made.
data("ose_annotations")
c(head(ose_annotations, 2), tail(ose_annotations, 2))
## $ele_001
## <<AnnotatedPlainTextDocument>>
## Metadata: 0
## Annotations: length: 523
## Content: chars: 2515
##
## $ele_002
## <<AnnotatedPlainTextDocument>>
## Metadata: 0
## Annotations: length: 517
## Content: chars: 2543
##
## $adv_188
## <<AnnotatedPlainTextDocument>>
## Metadata: 0
## Annotations: length: 921
## Content: chars: 4854
##
## $adv_189
## <<AnnotatedPlainTextDocument>>
## Metadata: 0
## Annotations: length: 784
## Content: chars: 4275
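The annotated documents are AnnotatedPlainTextDocument objects, so the accessor generics from the NLP package, such as sents() and tagged_words(), can be used to extract the annotations. A minimal sketch (the document id ele_001 is taken from the output above; the accessors assume the usual sentence and word annotations are present):
doc <- ose_annotations[["ele_001"]]
head(sents(doc), 1)
head(tagged_words(doc))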
To make the feature creation process reproducible we created the NLPreadability package. The code to reproduce the feature generation can be found at https://readability.r-forge.r-project.org/.
data("ose_features")
str(ose_features)
## 'data.frame': 567 obs. of 96 variables:
## $ readability : Ord.factor w/ 3 levels "ele"<"int"<"adv": 1 1 1 1 1 1 1 1 1 1 ...
## $ avg_sl : num 12.6 20 17.2 20.8 19.9 ...
## $ avg_char_sl : num 63.6 95.4 79.7 96.6 87 ...
## $ avg_wl : num 4.84 4.59 4.37 4.51 4.23 ...
## $ avg_syll : num 1.66 1.54 1.41 1.47 1.34 ...
## $ r_long_words : num 0.07294 0.04505 0.01525 0.02292 0.00236 ...
## $ r_polysy : num 0.1859 0.1554 0.0871 0.1062 0.0519 ...
## $ r_unique_words : num 0.502 0.482 0.542 0.552 0.446 ...
## $ r_unique_lemmas : num 0.445 0.418 0.48 0.467 0.374 ...
## $ r_adjectives : num 0.113 0.0932 0.1138 0.0711 0.0504 ...
## $ r_adverbs : num 0.0409 0.0523 0.0402 0.0418 0.0216 ...
## $ r_nouns : num 0.337 0.27 0.275 0.335 0.29 ...
## $ r_prepositions : num 0.108 0.102 0.136 0.107 0.12 ...
## $ r_verbs : num 0.156 0.209 0.161 0.172 0.221 ...
## $ r_pronouns : num 0.0337 0.0523 0.0826 0.0335 0.0743 ...
## $ r_determiners : num 0.113 0.0886 0.1004 0.1192 0.1271 ...
## $ r_cooconj : num 0.0409 0.0409 0.0513 0.0397 0.0264 ...
## $ r_unique_adjectives : num 0.0024 0.00227 0.00223 0.00209 0.0024 ...
## $ r_unique_adverbs : num 0.0312 0.0341 0.029 0.0335 0.0192 ...
## $ r_unique_nouns : num 0.228 0.17 0.203 0.234 0.175 ...
## $ r_unique_prepositions : num 0.0337 0.0341 0.058 0.0335 0.0288 ...
## $ r_unique_verbs : num 0.0817 0.1227 0.0982 0.1276 0.1151 ...
## $ r_unique_pronouns : num 0.0168 0.0227 0.0268 0.0188 0.024 ...
## $ r_unique_determiners : num 0.0216 0.0114 0.0223 0.0188 0.012 ...
## $ r_unique_cooconj : num 0.00962 0.00909 0.00893 0.00837 0.00719 ...
## $ r_unique_adjectives_pty : num 0.00518 0.00518 0.00437 0.00437 0.00617 ...
## $ r_unique_adverbs_pty : num 0.0674 0.0777 0.0568 0.0699 0.0494 ...
## $ r_unique_nouns_pty : num 0.492 0.389 0.397 0.489 0.451 ...
## $ r_unique_prepositions_pty : num 0.0725 0.0777 0.1135 0.0699 0.0741 ...
## $ r_unique_verbs_pty : num 0.176 0.28 0.192 0.266 0.296 ...
## $ r_unique_pronouns_pty : num 0.0363 0.0518 0.0524 0.0393 0.0617 ...
## $ r_unique_determiners_pty : num 0.0466 0.0259 0.0437 0.0393 0.0309 ...
## $ r_unique_cooconj_pty : num 0.0207 0.0207 0.0175 0.0175 0.0185 ...
## $ avg_adjectives_ps : num 1.42 1.86 1.96 1.48 1 ...
## $ avg_adverbs_ps : num 0.515 1.045 0.692 0.87 0.429 ...
## $ avg_nouns_ps : num 4.24 5.41 4.73 6.96 5.76 ...
## $ avg_prepositions_ps : num 1.36 2.05 2.35 2.22 2.38 ...
## $ avg_verbs_ps : num 1.97 4.18 2.77 3.57 4.38 ...
## $ avg_pronouns_ps : num 0.424 1.045 1.423 0.696 1.476 ...
## $ avg_determiners_ps : num 1.42 1.77 1.73 2.48 2.52 ...
## $ avg_cooconj_ps : num 0.515 0.818 0.885 0.826 0.524 ...
## $ avg_unique_adjectives_ps : num 0.0303 0.0455 0.0385 0.0435 0.0476 ...
## $ avg_unique_adverbs_ps : num 0.394 0.682 0.5 0.696 0.381 ...
## $ avg_unique_nouns_ps : num 2.88 3.41 3.5 4.87 3.48 ...
## $ avg_unique_prepositions_ps : num 0.424 0.682 1 0.696 0.571 ...
## $ avg_unique_verbs_ps : num 1.03 2.45 1.69 2.65 2.29 ...
## $ avg_unique_pronouns_ps : num 0.212 0.455 0.462 0.391 0.476 ...
## $ avg_unique_determiners_ps : num 0.273 0.227 0.385 0.391 0.238 ...
## $ avg_unique_cooconj_ps : num 0.121 0.182 0.154 0.174 0.143 ...
## $ avg_nc_adjectives : num 7.04 6.59 6.06 5.85 5.38 ...
## $ avg_nc_adverbs : num 5.18 4.04 4 4.65 3.78 ...
## $ avg_nc_nouns : num 6.49 6.69 6.15 5.88 5.82 ...
## $ avg_nc_prepositions : num 2.87 3.18 3.34 2.78 3.16 ...
## $ avg_nc_verbs : num 3.95 4.1 3.89 4.78 4.47 ...
## $ avg_nc_pronouns : num 0.585 0.75 1.5 0.683 0.902 ...
## $ avg_nc_determiners : num 2 1.08 1.38 1.88 1.39 ...
## $ avg_nc_cooconj : num 0.738 0.576 0.917 0.683 0.359 ...
## $ avg_ptree_height : num 8.7 12.9 10.8 11.7 12.9 ...
## $ avg_subord_conj : num 0.576 1.182 0.692 0.783 1.571 ...
## $ avg_NP : num 5.21 7.18 7.12 7.83 7.76 ...
## $ avg_VP : num 2.3 5.45 3.04 4.43 5.05 ...
## $ avg_PP : num 1.36 1.77 1.92 2.39 2.05 ...
## $ avg_ADVP : num 0.364 0.364 0.346 0.609 0.19 ...
## $ avg_ADJP : num 0.152 0.455 0.5 0.348 0.286 ...
## $ avg_ALLP : num 11.6 18.8 15.9 18.7 19.5 ...
## $ r_NP : num 0.186 0.164 0.191 0.178 0.178 ...
## $ r_VP : num 0.0823 0.1246 0.0814 0.1011 0.1155 ...
## $ r_PP : num 0.0487 0.0405 0.0515 0.0545 0.0468 ...
## $ r_ADVP : num 0.01299 0.00831 0.00927 0.01388 0.00436 ...
## $ r_ADJP : num 0.00541 0.01038 0.01339 0.00793 0.00654 ...
## $ r_ALLP : num 0.416 0.43 0.426 0.427 0.447 ...
## $ avg_no_VP_ps : num 0.182 0 0 0 0 ...
## $ r_entities : num 0.349 0.275 0.288 0.337 0.297 ...
## $ r_uentities : num 0.455 0.354 0.387 0.417 0.392 ...
## $ avg_entities_ps : num 4.39 5.5 4.96 7 5.9 ...
## $ avg_uentities_ps : num 2.88 3.41 3.62 4.78 3.48 ...
## $ r_named_entities : num 0.1058 0.0432 0.058 0.0565 0.0983 ...
## $ r_unamed_entities : num 0.1627 0.0566 0.0741 0.0606 0.1398 ...
## $ avg_named_entities_ps : num 1.333 0.864 1 1.174 1.952 ...
## $ avg_unamed_entities_ps : num 1.03 0.545 0.692 0.696 1.238 ...
## $ r_nent_to_ent : num 0.303 0.157 0.202 0.168 0.331 ...
## $ r_overlapping_nouns : num 0.0938 0.0386 0.0446 0.0544 0.0911 ...
## $ avg_named_entity_len : num 10.17 7.5 8.28 6.91 7.94 ...
## $ r_named_entity_len : num 0.1465 0.052 0.0761 0.0738 0.1395 ...
## $ avg_passives : num 0.0909 0.2273 0.0385 0.1304 0.2857 ...
## $ avg_num_coref_per_chain : num 2.65 3.82 4 3.18 3.7 ...
## $ avg_coref_chain_span : num 142 137 115 147 134 ...
## $ r_long_corefs : num 0.294 0.235 0.167 0.235 0.217 ...
## $ avg_coref_chains_ps : num 0.515 0.773 0.692 0.739 1.095 ...
## $ avg_coref_inference_distance : num 433 247 185 344 239 ...
## $ median_coref_inference_distance: num 123 138.5 66.5 160 115.5 ...
## $ r_coref_per_words : num 0.108 0.148 0.161 0.113 0.204 ...
## $ r_coref_per_entities : num 0.31 0.537 0.558 0.335 0.685 ...
## $ avg_word_overlap : num 0.0539 0.1297 0.1086 0.0892 0.1281 ...
## $ avg_noun_overlap : num 0.0312 0.0476 0.4127 0 0.1917 ...
## $ r_content_words : num 0.577 0.58 0.556 0.623 0.528 ...
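The readability levels are balanced across the 567 rows, with 189 documents per level:
table(ose_features$readability)
## 
## ele int adv 
## 189 189 189
Since readability is an ordered factor, the features can be used to fit, e.g., an ordinal regression model. A minimal sketch using MASS::polr, where the choice of predictors is purely illustrative:
library("MASS")
fit <- polr(readability ~ avg_sl + avg_syll + r_polysy,
            data = ose_features, Hess = TRUE)
summary(fit)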
https://www.aclweb.org/anthology/W18-0535