OneStopEnglish corpus

The OneStopEnglish corpus [2] was designed to support automatic readability assessment and text simplification; more documentation on the corpus can be found in [1]. This package contains a modified version of the OneStopEnglish corpus. The modifications fix some mistakes which most likely slipped in during the conversion from PDF to plain text. The tm.corpus.OneStopEnglish package was compiled to make the data easily available from R.

library("tm")
## Loading required package: NLP
library("tm.corpus.OneStopEnglish")

Corpus

The tm.corpus.OneStopEnglish package contains a parallel corpus of texts: articles from The Guardian which have been rewritten for three different reading levels.

data("ose_corpus")

The ose_corpus object contains the documents of the OneStopEnglish corpus for the three reading levels ‘elementary’, ‘intermediate’ and ‘advanced’.

ose_corpus
## $elementary
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 189
## 
## $intermediate
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 189
## 
## $advanced
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 189

The documents are aligned across the three reading levels, i.e., the i-th document in each corpus is the same article rewritten for the respective level.

i <- 130
substr(content(ose_corpus[["elementary"]])[i], 1, 200)
## [1] "It is no longer legal to smoke a cigarette inside a bar in the worlds drinking capital, New Orleans, Louisiana. Since Hurricane Katrina in 2005, New Orleans city government has begun trying to reduce "
substr(content(ose_corpus[["intermediate"]])[i], 1, 200)
## [1] "It is no longer legal to smoke a cigarette inside a bar in the worlds drinking capital, New Orleans, Louisiana. Many other cities have banned indoor smoking but New Orleans is different it attracts to"
substr(content(ose_corpus[["advanced"]])[i], 1, 200)
## [1] "You can no longer legally smoke a cigarette inside a bar in the worlds drinking capital, New Orleans, Louisiana. City after city has banned indoor smoking but that's different because other cities don"

Annotations

The annotations can be created with NLPclient or StanfordCoreNLP. NLPclient is available from https://cran.r-project.org, StanfordCoreNLP from https://datacube.wu.ac.at/. The code to reproduce the annotations can be found at https://readability.r-forge.r-project.org/. Since creating the annotations is time-consuming, we also provide them with this package.
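
As a minimal sketch of how such an annotation could be created with the StanfordCoreNLP package: the pipeline components and the document chosen below are illustrative assumptions and need not match the settings used for the annotations shipped with this package.

library("StanfordCoreNLP")
## Assumed pipeline components; the shipped annotations may have been
## created with different settings.
pipeline <- StanfordCoreNLP_Pipeline(c("pos", "lemma", "parse"))
s <- as.String(content(ose_corpus[["elementary"]])[1])
## Annotate the text and combine text and annotations into a document.
doc <- AnnotatedPlainTextDocument(s, pipeline(s))

The precomputed annotations provided with the package can be loaded directly: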

data("ose_annotations")
c(head(ose_annotations, 2), tail(ose_annotations, 2))
## $ele_001
## <<AnnotatedPlainTextDocument>>
## Metadata:  0
## Annotations:  length: 523
## Content:  chars: 2515
## 
## $ele_002
## <<AnnotatedPlainTextDocument>>
## Metadata:  0
## Annotations:  length: 517
## Content:  chars: 2543
## 
## $adv_188
## <<AnnotatedPlainTextDocument>>
## Metadata:  0
## Annotations:  length: 921
## Content:  chars: 4854
## 
## $adv_189
## <<AnnotatedPlainTextDocument>>
## Metadata:  0
## Annotations:  length: 784
## Content:  chars: 4275
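
The annotated documents can be queried with the accessor functions from the NLP package. Assuming the annotations contain sentence, token and part-of-speech information (which the feature set below suggests), sentences and tagged words can be extracted as follows:

doc <- ose_annotations[["ele_001"]]
## The first sentence, as a vector of tokens.
head(sents(doc), 1)
## Words together with their part-of-speech tags.
head(tagged_words(doc))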

Features

To make the feature creation process reproducible we created the NLPreadability package. The code to reproduce the feature generation can be found at https://readability.r-forge.r-project.org/.

data("ose_features")
str(ose_features)
## 'data.frame':    567 obs. of  96 variables:
##  $ readability                    : Ord.factor w/ 3 levels "ele"<"int"<"adv": 1 1 1 1 1 1 1 1 1 1 ...
##  $ avg_sl                         : num  12.6 20 17.2 20.8 19.9 ...
##  $ avg_char_sl                    : num  63.6 95.4 79.7 96.6 87 ...
##  $ avg_wl                         : num  4.84 4.59 4.37 4.51 4.23 ...
##  $ avg_syll                       : num  1.66 1.54 1.41 1.47 1.34 ...
##  $ r_long_words                   : num  0.07294 0.04505 0.01525 0.02292 0.00236 ...
##  $ r_polysy                       : num  0.1859 0.1554 0.0871 0.1062 0.0519 ...
##  $ r_unique_words                 : num  0.502 0.482 0.542 0.552 0.446 ...
##  $ r_unique_lemmas                : num  0.445 0.418 0.48 0.467 0.374 ...
##  $ r_adjectives                   : num  0.113 0.0932 0.1138 0.0711 0.0504 ...
##  $ r_adverbs                      : num  0.0409 0.0523 0.0402 0.0418 0.0216 ...
##  $ r_nouns                        : num  0.337 0.27 0.275 0.335 0.29 ...
##  $ r_prepositions                 : num  0.108 0.102 0.136 0.107 0.12 ...
##  $ r_verbs                        : num  0.156 0.209 0.161 0.172 0.221 ...
##  $ r_pronouns                     : num  0.0337 0.0523 0.0826 0.0335 0.0743 ...
##  $ r_determiners                  : num  0.113 0.0886 0.1004 0.1192 0.1271 ...
##  $ r_cooconj                      : num  0.0409 0.0409 0.0513 0.0397 0.0264 ...
##  $ r_unique_adjectives            : num  0.0024 0.00227 0.00223 0.00209 0.0024 ...
##  $ r_unique_adverbs               : num  0.0312 0.0341 0.029 0.0335 0.0192 ...
##  $ r_unique_nouns                 : num  0.228 0.17 0.203 0.234 0.175 ...
##  $ r_unique_prepositions          : num  0.0337 0.0341 0.058 0.0335 0.0288 ...
##  $ r_unique_verbs                 : num  0.0817 0.1227 0.0982 0.1276 0.1151 ...
##  $ r_unique_pronouns              : num  0.0168 0.0227 0.0268 0.0188 0.024 ...
##  $ r_unique_determiners           : num  0.0216 0.0114 0.0223 0.0188 0.012 ...
##  $ r_unique_cooconj               : num  0.00962 0.00909 0.00893 0.00837 0.00719 ...
##  $ r_unique_adjectives_pty        : num  0.00518 0.00518 0.00437 0.00437 0.00617 ...
##  $ r_unique_adverbs_pty           : num  0.0674 0.0777 0.0568 0.0699 0.0494 ...
##  $ r_unique_nouns_pty             : num  0.492 0.389 0.397 0.489 0.451 ...
##  $ r_unique_prepositions_pty      : num  0.0725 0.0777 0.1135 0.0699 0.0741 ...
##  $ r_unique_verbs_pty             : num  0.176 0.28 0.192 0.266 0.296 ...
##  $ r_unique_pronouns_pty          : num  0.0363 0.0518 0.0524 0.0393 0.0617 ...
##  $ r_unique_determiners_pty       : num  0.0466 0.0259 0.0437 0.0393 0.0309 ...
##  $ r_unique_cooconj_pty           : num  0.0207 0.0207 0.0175 0.0175 0.0185 ...
##  $ avg_adjectives_ps              : num  1.42 1.86 1.96 1.48 1 ...
##  $ avg_adverbs_ps                 : num  0.515 1.045 0.692 0.87 0.429 ...
##  $ avg_nouns_ps                   : num  4.24 5.41 4.73 6.96 5.76 ...
##  $ avg_prepositions_ps            : num  1.36 2.05 2.35 2.22 2.38 ...
##  $ avg_verbs_ps                   : num  1.97 4.18 2.77 3.57 4.38 ...
##  $ avg_pronouns_ps                : num  0.424 1.045 1.423 0.696 1.476 ...
##  $ avg_determiners_ps             : num  1.42 1.77 1.73 2.48 2.52 ...
##  $ avg_cooconj_ps                 : num  0.515 0.818 0.885 0.826 0.524 ...
##  $ avg_unique_adjectives_ps       : num  0.0303 0.0455 0.0385 0.0435 0.0476 ...
##  $ avg_unique_adverbs_ps          : num  0.394 0.682 0.5 0.696 0.381 ...
##  $ avg_unique_nouns_ps            : num  2.88 3.41 3.5 4.87 3.48 ...
##  $ avg_unique_prepositions_ps     : num  0.424 0.682 1 0.696 0.571 ...
##  $ avg_unique_verbs_ps            : num  1.03 2.45 1.69 2.65 2.29 ...
##  $ avg_unique_pronouns_ps         : num  0.212 0.455 0.462 0.391 0.476 ...
##  $ avg_unique_determiners_ps      : num  0.273 0.227 0.385 0.391 0.238 ...
##  $ avg_unique_cooconj_ps          : num  0.121 0.182 0.154 0.174 0.143 ...
##  $ avg_nc_adjectives              : num  7.04 6.59 6.06 5.85 5.38 ...
##  $ avg_nc_adverbs                 : num  5.18 4.04 4 4.65 3.78 ...
##  $ avg_nc_nouns                   : num  6.49 6.69 6.15 5.88 5.82 ...
##  $ avg_nc_prepositions            : num  2.87 3.18 3.34 2.78 3.16 ...
##  $ avg_nc_verbs                   : num  3.95 4.1 3.89 4.78 4.47 ...
##  $ avg_nc_pronouns                : num  0.585 0.75 1.5 0.683 0.902 ...
##  $ avg_nc_determiners             : num  2 1.08 1.38 1.88 1.39 ...
##  $ avg_nc_cooconj                 : num  0.738 0.576 0.917 0.683 0.359 ...
##  $ avg_ptree_height               : num  8.7 12.9 10.8 11.7 12.9 ...
##  $ avg_subord_conj                : num  0.576 1.182 0.692 0.783 1.571 ...
##  $ avg_NP                         : num  5.21 7.18 7.12 7.83 7.76 ...
##  $ avg_VP                         : num  2.3 5.45 3.04 4.43 5.05 ...
##  $ avg_PP                         : num  1.36 1.77 1.92 2.39 2.05 ...
##  $ avg_ADVP                       : num  0.364 0.364 0.346 0.609 0.19 ...
##  $ avg_ADJP                       : num  0.152 0.455 0.5 0.348 0.286 ...
##  $ avg_ALLP                       : num  11.6 18.8 15.9 18.7 19.5 ...
##  $ r_NP                           : num  0.186 0.164 0.191 0.178 0.178 ...
##  $ r_VP                           : num  0.0823 0.1246 0.0814 0.1011 0.1155 ...
##  $ r_PP                           : num  0.0487 0.0405 0.0515 0.0545 0.0468 ...
##  $ r_ADVP                         : num  0.01299 0.00831 0.00927 0.01388 0.00436 ...
##  $ r_ADJP                         : num  0.00541 0.01038 0.01339 0.00793 0.00654 ...
##  $ r_ALLP                         : num  0.416 0.43 0.426 0.427 0.447 ...
##  $ avg_no_VP_ps                   : num  0.182 0 0 0 0 ...
##  $ r_entities                     : num  0.349 0.275 0.288 0.337 0.297 ...
##  $ r_uentities                    : num  0.455 0.354 0.387 0.417 0.392 ...
##  $ avg_entities_ps                : num  4.39 5.5 4.96 7 5.9 ...
##  $ avg_uentities_ps               : num  2.88 3.41 3.62 4.78 3.48 ...
##  $ r_named_entities               : num  0.1058 0.0432 0.058 0.0565 0.0983 ...
##  $ r_unamed_entities              : num  0.1627 0.0566 0.0741 0.0606 0.1398 ...
##  $ avg_named_entities_ps          : num  1.333 0.864 1 1.174 1.952 ...
##  $ avg_unamed_entities_ps         : num  1.03 0.545 0.692 0.696 1.238 ...
##  $ r_nent_to_ent                  : num  0.303 0.157 0.202 0.168 0.331 ...
##  $ r_overlapping_nouns            : num  0.0938 0.0386 0.0446 0.0544 0.0911 ...
##  $ avg_named_entity_len           : num  10.17 7.5 8.28 6.91 7.94 ...
##  $ r_named_entity_len             : num  0.1465 0.052 0.0761 0.0738 0.1395 ...
##  $ avg_passives                   : num  0.0909 0.2273 0.0385 0.1304 0.2857 ...
##  $ avg_num_coref_per_chain        : num  2.65 3.82 4 3.18 3.7 ...
##  $ avg_coref_chain_span           : num  142 137 115 147 134 ...
##  $ r_long_corefs                  : num  0.294 0.235 0.167 0.235 0.217 ...
##  $ avg_coref_chains_ps            : num  0.515 0.773 0.692 0.739 1.095 ...
##  $ avg_coref_inference_distance   : num  433 247 185 344 239 ...
##  $ median_coref_inference_distance: num  123 138.5 66.5 160 115.5 ...
##  $ r_coref_per_words              : num  0.108 0.148 0.161 0.113 0.204 ...
##  $ r_coref_per_entities           : num  0.31 0.537 0.558 0.335 0.685 ...
##  $ avg_word_overlap               : num  0.0539 0.1297 0.1086 0.0892 0.1281 ...
##  $ avg_noun_overlap               : num  0.0312 0.0476 0.4127 0 0.1917 ...
##  $ r_content_words                : num  0.577 0.58 0.556 0.623 0.528 ...
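
The features can be used directly for readability modelling. As a simple illustrative check, the average sentence length can be summarized by reading level; one would expect it to increase from ‘elementary’ to ‘advanced’:

## Mean average sentence length per reading level.
aggregate(avg_sl ~ readability, data = ose_features, FUN = mean)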

References

  • [1] Sowmya Vajjala and Ivana Lučić. 2018. ‘OneStopEnglish corpus: A new corpus for automatic readability assessment and text simplification’. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications. Association for Computational Linguistics. https://www.aclweb.org/anthology/W18-0535.
  • [2] Sowmya Vajjala. 2018. nishkalavallabhi/OneStopEnglishCorpus: OneStopEnglish Corpus Release (Version bea2018). Zenodo.