The English textbook corpus contains texts from the English versions of textbooks used in public schools in Bangladesh. More documentation on the corpus can be found in [1]. The tm.corpus.enTextbook package was compiled to make the data more easily accessible from R.
library("tm")
## Loading required package: NLP
library("tm.corpus.enTextbook")
data("entb_corpus")
The corpus contains 519 documents for four reading levels: ‘veryEasy’, ‘easy’, ‘medium’ and ‘difficult’.
data("entb_corpus")
entb_corpus
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 7
## Content: documents: 519
The text can be accessed with the content() function.
substr(head(content(entb_corpus)), 1, 60)
## [1] "Soil usually denotes the soft surface layer of the earth cru"
## [2] "The principle source of protein for the peoples of Banglades"
## [3] "The physical and chemical characteristics of the water of a "
## [4] "Shrimp is an important fisheries resource.\nIt is joint foote"
## [5] "Disease is developed in the body of the fish by the joint ac"
## [6] "With a view to maintain the taste and quality of fish the pr"
The reading levels are stored in the metadata.
head(meta(entb_corpus))
##       title               timestamp number_of_sentences number_of_tokens
## 1  9-agri-1 2014-10-30 12:56:03.187                  77             1120
## 2 9-agri-10  2014-10-30 12:56:03.21                 181             2635
## 3 9-agri-11 2014-10-30 12:56:03.243                 310             4651
## 4 9-agri-12 2014-10-30 12:56:03.266                 112             1628
## 5 9-agri-13 2014-10-30 12:56:03.285                  75              959
## 6 9-agri-14 2014-10-30 12:56:03.306                 129             2004
##   number_of_token_types readability_level language
## 1                   430         difficult  English
## 2                   863         difficult  English
## 3                  1297         difficult  English
## 4                   597         difficult  English
## 5                   382         difficult  English
## 6                   667         difficult  English
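Because the metadata is returned as a data frame, simple per-document statistics can be derived with ordinary column arithmetic. The sketch below re-enters the six rows shown above so that it runs without the package installed. The derived column names tokens_per_sentence and type_token_ratio are illustrative only and are not part of the package metadata.

```r
# First six metadata rows, copied from the head(meta(entb_corpus)) output above.
meta_head <- data.frame(
  title = c("9-agri-1", "9-agri-10", "9-agri-11",
            "9-agri-12", "9-agri-13", "9-agri-14"),
  number_of_sentences = c(77, 181, 310, 112, 75, 129),
  number_of_tokens = c(1120, 2635, 4651, 1628, 959, 2004),
  number_of_token_types = c(430, 863, 1297, 597, 382, 667)
)

# Hypothetical derived columns (assumed names, not provided by the package).
meta_head$tokens_per_sentence <-
  meta_head$number_of_tokens / meta_head$number_of_sentences
meta_head$type_token_ratio <-
  meta_head$number_of_token_types / meta_head$number_of_tokens

print(meta_head[, c("title", "tokens_per_sentence", "type_token_ratio")],
      digits = 3)
```

With the corpus loaded, the same arithmetic should apply to the full data frame returned by meta(entb_corpus).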
table(meta(entb_corpus)$readability_level)
##
## difficult      easy    medium  veryEasy
##       117       120       179       103
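table() lists the levels alphabetically, which hides their natural order. As a small sketch, the counts above can be re-entered and reordered from easiest to hardest, together with each level's share of the 519 documents. The vector names counts, level_order, ordered_counts and shares are chosen here for illustration.

```r
# Counts copied from the table() output above (alphabetical order).
counts <- c(difficult = 117, easy = 120, medium = 179, veryEasy = 103)

# Order of the reading levels as listed in the text, easiest first.
level_order <- c("veryEasy", "easy", "medium", "difficult")
ordered_counts <- counts[level_order]

# Share of each level in the whole corpus.
shares <- round(ordered_counts / sum(ordered_counts), 3)

print(ordered_counts)
print(shares)

# With the corpus loaded, an equivalent ordered tabulation would be:
# table(factor(meta(entb_corpus)$readability_level, levels = level_order))
```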
The annotations can be created with NLPclient or StanfordCoreNLP. NLPclient is available from https://cran.r-project.org. StanfordCoreNLP is available from https://datacube.wu.ac.at/. The code to reproduce the annotation can be found at https://readability.r-forge.r-project.org/. Since creating the annotations is time-consuming, we also provide the annotations.
data("entb_annotations")
c(head(entb_annotations, 2), tail(entb_annotations, 2))
## [[1]]
## <<AnnotatedPlainTextDocument>>
## Metadata: 0
## Annotations: length: 1339
## Content: chars: 6678
##
## [[2]]
## <<AnnotatedPlainTextDocument>>
## Metadata: 0
## Annotations: length: 3123
## Content: chars: 15732
##
## [[3]]
## <<AnnotatedPlainTextDocument>>
## Metadata: 0
## Annotations: length: 1455
## Content: chars: 6900
##
## [[4]]
## <<AnnotatedPlainTextDocument>>
## Metadata: 0
## Annotations: length: 985
## Content: chars: 5012
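Each element of entb_annotations is an AnnotatedPlainTextDocument from the NLP package, which pairs a text with its annotations. To show how such an object can be queried, the following minimal sketch builds a hand-made two-sentence document. The text and the character spans are invented for illustration, and this is not the pipeline used to annotate the corpus.

```r
library("NLP")

# Toy text; character offsets are 1-based and inclusive.
s <- String("Soil is soft. Fish need water.")

# Hand-written annotations (hypothetical, for illustration only).
sentences <- Annotation(1:2, rep("sentence", 2), c(1L, 15L), c(13L, 30L))
tokens <- Annotation(3:8, rep("word", 6),
                     c(1L, 6L, 9L, 15L, 20L, 25L),
                     c(4L, 7L, 12L, 18L, 23L, 29L))
a <- c(sentences, tokens)

doc <- AnnotatedPlainTextDocument(s, a)

# Extract the word tokens and count the annotations by type.
toks <- words(doc)
n_by_type <- table(a$type)

# Average sentence length in tokens for the toy document.
avg_sentence_length <- n_by_type[["word"]] / n_by_type[["sentence"]]

print(toks)
print(avg_sentence_length)
```

The same words() call should work on a document taken from the list, for example entb_annotations[[1]], provided it carries word annotations.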
To make the feature creation process reproducible, we created the NLPreadability package. The code to reproduce the feature generation can be found at https://readability.r-forge.r-project.org/.
data("entb_features")
str(entb_features)
## 'data.frame': 519 obs. of 96 variables:
## $ readability : Ord.factor w/ 3 levels "ele"<"int"<"adv": NA NA NA NA NA NA NA NA NA NA ...
## $ avg_sl : num 14.8 14.4 14.6 14 12.2 ...
## $ avg_char_sl : num 74.1 73.2 70.3 67.5 56.4 ...
## $ avg_wl : num 4.85 4.9 4.67 4.58 4.48 ...
## $ avg_syll : num 1.62 1.64 1.54 1.5 1.51 ...
## $ r_long_words : num 0.0502 0.0593 0.0447 0.0337 0.0502 ...
## $ r_polysy : num 0.183 0.153 0.142 0.118 0.147 ...
## $ r_unique_words : num 0.341 0.278 0.235 0.304 0.346 ...
## $ r_unique_lemmas : num 0.292 0.237 0.186 0.253 0.299 ...
## $ r_adjectives : num 0.1132 0.1029 0.0863 0.0953 0.0791 ...
## $ r_adverbs : num 0.0377 0.0388 0.0275 0.0205 0.0267 ...
## $ r_nouns : num 0.385 0.373 0.337 0.34 0.33 ...
## $ r_prepositions : num 0.139 0.15 0.161 0.185 0.192 ...
## $ r_verbs : num 0.159 0.144 0.174 0.164 0.144 ...
## $ r_pronouns : num 0.00809 0.01049 0.00895 0.00895 0.00855 ...
## $ r_determiners : num 0.0979 0.1002 0.1103 0.1061 0.1282 ...
## $ r_cooconj : num 0.027 0.0416 0.0382 0.0288 0.0438 ...
## $ r_unique_adjectives : num 0.000898 0.000388 0.000218 0.000639 0.001068 ...
## $ r_unique_adverbs : num 0.0234 0.0175 0.0124 0.0141 0.016 ...
## $ r_unique_nouns : num 0.155 0.138 0.115 0.139 0.153 ...
## $ r_unique_prepositions : num 0.02156 0.01398 0.00917 0.02302 0.02457 ...
## $ r_unique_verbs : num 0.0701 0.0482 0.0553 0.0601 0.0748 ...
## $ r_unique_pronouns : num 0.00449 0.00272 0.0024 0.0032 0.00534 ...
## $ r_unique_determiners : num 0.00719 0.0066 0.00393 0.00831 0.01175 ...
## $ r_unique_cooconj : num 0.0027 0.00155 0.00153 0.00128 0.00214 ...
## $ r_unique_adjectives_pty : num 0.00302 0.00154 0.00114 0.00229 0.00341 ...
## $ r_unique_adverbs_pty : num 0.0785 0.0692 0.0652 0.0503 0.0512 ...
## $ r_unique_nouns_pty : num 0.52 0.546 0.601 0.499 0.488 ...
## $ r_unique_prepositions_pty : num 0.0725 0.0554 0.0481 0.0824 0.0785 ...
## $ r_unique_verbs_pty : num 0.236 0.191 0.289 0.215 0.239 ...
## $ r_unique_pronouns_pty : num 0.0151 0.0108 0.0126 0.0114 0.0171 ...
## $ r_unique_determiners_pty : num 0.0242 0.0262 0.0206 0.0297 0.0375 ...
## $ r_unique_cooconj_pty : num 0.00906 0.00615 0.00801 0.00458 0.00683 ...
## $ avg_adjectives_ps : num 1.68 1.48 1.258 1.33 0.961 ...
## $ avg_adverbs_ps : num 0.56 0.559 0.401 0.286 0.325 ...
## $ avg_nouns_ps : num 5.72 5.36 4.92 4.75 4.01 ...
## $ avg_prepositions_ps : num 2.07 2.16 2.35 2.59 2.34 ...
## $ avg_verbs_ps : num 2.36 2.07 2.54 2.29 1.75 ...
## $ avg_pronouns_ps : num 0.12 0.151 0.131 0.125 0.104 ...
## $ avg_determiners_ps : num 1.45 1.44 1.61 1.48 1.56 ...
## $ avg_cooconj_ps : num 0.4 0.598 0.557 0.402 0.532 ...
## $ avg_unique_adjectives_ps : num 0.01333 0.00559 0.00318 0.00893 0.01299 ...
## $ avg_unique_adverbs_ps : num 0.347 0.251 0.182 0.196 0.195 ...
## $ avg_unique_nouns_ps : num 2.29 1.98 1.67 1.95 1.86 ...
## $ avg_unique_prepositions_ps : num 0.32 0.201 0.134 0.321 0.299 ...
## $ avg_unique_verbs_ps : num 1.04 0.693 0.806 0.839 0.909 ...
## $ avg_unique_pronouns_ps : num 0.0667 0.0391 0.035 0.0446 0.0649 ...
## $ avg_unique_determiners_ps : num 0.1067 0.095 0.0573 0.1161 0.1429 ...
## $ avg_unique_cooconj_ps : num 0.04 0.0223 0.0223 0.0179 0.026 ...
## $ avg_nc_adjectives : num 6.64 6.48 6.53 6.42 6.7 ...
## $ avg_nc_adverbs : num 5.74 6.42 5.41 5.25 7.04 ...
## $ avg_nc_nouns : num 5.84 6.19 6.04 5.8 5.62 ...
## $ avg_nc_prepositions : num 2.5 2.79 2.5 2.51 2.42 ...
## $ avg_nc_verbs : num 5 5 5.02 4.88 5.1 ...
## $ avg_nc_pronouns : num 0.141 0.181 0.162 0.168 0.148 ...
## $ avg_nc_determiners : num 1.75 2.01 1.91 2.07 2.63 ...
## $ avg_nc_cooconj : num 0.497 0.833 0.622 0.504 0.807 ...
## $ avg_ptree_height : num 9.2 8.87 9.66 9.11 8.53 ...
## $ avg_subord_conj : num 0.3067 0.1117 0.2516 0.1875 0.0779 ...
## $ avg_NP : num 5.77 6.06 6.08 6.29 5.23 ...
## $ avg_VP : num 2.36 2.29 3.04 2.67 2.29 ...
## $ avg_PP : num 2.16 2.27 2.34 2.46 2.3 ...
## $ avg_ADVP : num 0.36 0.263 0.274 0.214 0.26 ...
## $ avg_ADJP : num 0.493 0.508 0.274 0.259 0.156 ...
## $ avg_ALLP : num 12.3 12.9 12.9 12.9 10.6 ...
## $ r_NP : num 0.186 0.199 0.192 0.204 0.198 ...
## $ r_VP : num 0.0759 0.0751 0.0958 0.0865 0.0865 ...
## $ r_PP : num 0.0695 0.0745 0.0737 0.0798 0.087 ...
## $ r_ADVP : num 0.01158 0.00861 0.00864 0.00694 0.00983 ...
## $ r_ADJP : num 0.01587 0.01667 0.00864 0.00839 0.0059 ...
## $ r_ALLP : num 0.395 0.422 0.406 0.418 0.4 ...
## $ avg_no_VP_ps : num 0 0.0447 0.0446 0.0625 0.1169 ...
## $ r_entities : num 0.385 0.374 0.338 0.341 0.331 ...
## $ r_uentities : num 0.438 0.47 0.44 0.437 0.41 ...
## $ avg_entities_ps : num 5.72 5.38 4.93 4.76 4.03 ...
## $ avg_uentities_ps : num 2.21 1.88 1.51 1.86 1.73 ...
## $ r_named_entities : num 0.01078 0.03845 0.00808 0.02174 0.00214 ...
## $ r_unamed_entities : num 0.02639 0.06154 0.02414 0.03571 0.00617 ...
## $ avg_named_entities_ps : num 0.16 0.553 0.118 0.304 0.026 ...
## $ avg_unamed_entities_ps : num 0.1333 0.2458 0.0828 0.1518 0.026 ...
## $ r_nent_to_ent : num 0.02797 0.1028 0.0239 0.06379 0.00645 ...
## $ r_overlapping_nouns : num 0.01078 0.03728 0.00721 0.0211 0.00107 ...
## $ avg_named_entity_len : num 7.6 5.84 6.47 6.38 6 ...
## $ r_named_entity_len : num 0.01408 0.04301 0.00907 0.02582 0.00143 ...
## $ avg_passives : num 0.453 0.341 0.529 0.562 0.429 ...
## $ avg_num_coref_per_chain : num 2.82 4.15 3.68 2.89 4 ...
## $ avg_coref_chain_span : num 162 735 1079 323 414 ...
## $ r_long_corefs : num 0.0526 0.1429 0.1697 0.1316 0.3333 ...
## $ avg_coref_chains_ps : num 0.507 0.469 0.525 0.679 0.351 ...
## $ avg_coref_inference_distance : num 460 1230 2094 860 676 ...
## $ median_coref_inference_distance: num 177 431 366 269 446 232 207 227 178 119 ...
## $ r_coref_per_words : num 0.0961 0.1355 0.1326 0.1407 0.1154 ...
## $ r_coref_per_entities : num 0.249 0.362 0.392 0.413 0.348 ...
## $ avg_word_overlap : num 0.188 0.145 0.134 0.136 0.11 ...
## $ avg_noun_overlap : num 0.00676 0.00562 0.00639 0 0 ...
## $ r_content_words : num 0.672 0.64 0.596 0.597 0.576 ...
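A typical first step with such a feature table is to compare feature means across reading levels. The sketch below does this with aggregate() on a toy data frame. Its values are made up and are NOT taken from the corpus, and only two of the feature columns (avg_sl and avg_wl) are used. It also assumes that the labels come from meta(entb_corpus)$readability_level in the same row order as entb_features, since the readability column in the str() output above shows NA for the first entries.

```r
# Toy stand-in for entb_features plus a reading-level label.
# All numbers are invented for illustration.
toy <- data.frame(
  readability_level = c("veryEasy", "veryEasy", "easy", "easy",
                        "medium", "medium", "difficult", "difficult"),
  avg_sl = c(6, 8, 9, 11, 12, 14, 14, 16),
  avg_wl = c(3.8, 4.0, 4.1, 4.3, 4.4, 4.6, 4.7, 4.9)
)

# Treat the label as an ordered factor, easiest level first.
toy$readability_level <- factor(
  toy$readability_level,
  levels = c("veryEasy", "easy", "medium", "difficult"),
  ordered = TRUE
)

# Mean of each feature per reading level.
level_means <- aggregate(cbind(avg_sl, avg_wl) ~ readability_level,
                         data = toy, FUN = mean)
print(level_means)
```

Under the row-order assumption, the real data could be prepared with something like cbind(readability_level = meta(entb_corpus)$readability_level, entb_features) before calling aggregate().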
[1] http://publikationen.ub.uni-frankfurt.de/opus4/frontdoor/index/index/year/2015/docId/38157