English textbook corpus

The English textbook corpus contains texts from the English versions of textbooks used in public schools in Bangladesh. More documentation on the corpus can be found in [1]. The tm.corpus.enTextbook package was compiled to make the data more easily accessible from R.

library("tm")

## Loading required package: NLP

library("tm.corpus.enTextbook")
data("entb_corpus")

Corpus

The corpus contains 519 documents for four reading levels: ‘veryEasy’, ‘easy’, ‘medium’ and ‘difficult’.

data("entb_corpus")

entb_corpus

## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 7
## Content:  documents: 519

The text can be accessed with the content function.

substr(head(content(entb_corpus)), 1, 60)

## [1] "Soil usually denotes the soft surface layer of the earth cru" 
## [2] "The principle source of protein for the peoples of Banglades" 
## [3] "The physical and chemical characteristics of the water of a " 
## [4] "Shrimp is an important fisheries resource.\nIt is joint foote"
## [5] "Disease is developed in the body of the fish by the joint ac" 
## [6] "With a view to maintain the taste and quality of fish the pr"

The reading levels are stored in the readability_level column of the document-level metadata.

head(meta(entb_corpus))

##       title               timestamp number_of_sentences number_of_tokens
## 1  9-agri-1 2014-10-30 12:56:03.187                  77             1120
## 2 9-agri-10  2014-10-30 12:56:03.21                 181             2635
## 3 9-agri-11 2014-10-30 12:56:03.243                 310             4651
## 4 9-agri-12 2014-10-30 12:56:03.266                 112             1628
## 5 9-agri-13 2014-10-30 12:56:03.285                  75              959
## 6 9-agri-14 2014-10-30 12:56:03.306                 129             2004
##   number_of_token_types readability_level language
## 1                   430         difficult  English
## 2                   863         difficult  English
## 3                  1297         difficult  English
## 4                   597         difficult  English
## 5                   382         difficult  English
## 6                   667         difficult  English

table(meta(entb_corpus)$readability_level)

## 
## difficult      easy    medium  veryEasy 
##       117       120       179       103
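
Because table() sorts the levels alphabetically, it can help to convert the reading level to an ordered factor before tabulating or modelling. The following is a minimal sketch, not part of the package: the helper name as_reading_level is hypothetical, and the label vector is rebuilt from the counts printed above so that the code runs without the corpus. With the corpus loaded, the same helper could be applied to meta(entb_corpus)$readability_level.

```r
# Hypothetical helper: order the four reading levels from easiest to hardest.
as_reading_level <- function(x) {
  factor(x, levels = c("veryEasy", "easy", "medium", "difficult"),
         ordered = TRUE)
}

# Stand-in for meta(entb_corpus)$readability_level,
# rebuilt from the counts shown in the table above.
level_counts <- c(difficult = 117, easy = 120, medium = 179, veryEasy = 103)
levels_chr <- rep(names(level_counts), times = level_counts)

reading_level <- as_reading_level(levels_chr)

# Counts in reading-level order instead of alphabetical order.
level_table <- table(reading_level)
# Share of documents per reading level.
level_share <- round(prop.table(level_table), 3)

print(level_table)
print(level_share)
```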

Annotations

The annotations can be created with NLPclient or StanfordCoreNLP. NLPclient is available from https://cran.r-project.org, and StanfordCoreNLP from https://datacube.wu.ac.at/. The code to reproduce the annotations can be found at https://readability.r-forge.r-project.org/. Since creating the annotations is time-consuming, we also provide the precomputed annotations.

data("entb_annotations")
c(head(entb_annotations, 2), tail(entb_annotations, 2))

## [[1]]
## <<AnnotatedPlainTextDocument>>
## Metadata:  0
## Annotations:  length: 1339
## Content:  chars: 6678
## 
## [[2]]
## <<AnnotatedPlainTextDocument>>
## Metadata:  0
## Annotations:  length: 3123
## Content:  chars: 15732
## 
## [[3]]
## <<AnnotatedPlainTextDocument>>
## Metadata:  0
## Annotations:  length: 1455
## Content:  chars: 6900
## 
## [[4]]
## <<AnnotatedPlainTextDocument>>
## Metadata:  0
## Annotations:  length: 985
## Content:  chars: 5012
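
Annotated documents of this class can be queried with the accessor functions of the NLP package. The sketch below builds a toy AnnotatedPlainTextDocument by hand so that it runs without the corpus; the regular-expression tokenizer is an assumption made for illustration and is not how the shipped annotations were produced. With the data loaded, words(entb_annotations[[1]]) should work in the same way.

```r
library("NLP")

# Toy text: the first sentence of document 4 as printed above.
txt <- "Shrimp is an important fisheries resource."

# Toy tokenizer (assumption: every run of letters is a word).
# The annotations in the package come from Stanford CoreNLP instead.
m <- gregexpr("[[:alpha:]]+", txt)[[1]]
word_start <- as.integer(m)
word_end <- word_start + attr(m, "match.length") - 1L
n_words <- length(word_start)

# One sentence annotation spanning the text, then one annotation per word.
a <- Annotation(id = seq_len(n_words + 1L),
                type = c("sentence", rep("word", n_words)),
                start = c(1L, word_start),
                end = c(nchar(txt), word_end))
toy_doc <- AnnotatedPlainTextDocument(txt, a)

# words() returns the text covered by the word annotations.
toy_words <- words(toy_doc)
print(toy_words)
```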

Features

To make the feature creation process reproducible, we created the NLPreadability package. The code to reproduce the feature generation can be found at https://readability.r-forge.r-project.org/.

data("entb_features")
str(entb_features)

## 'data.frame':    519 obs. of  96 variables:
##  $ readability                    : Ord.factor w/ 3 levels "ele"<"int"<"adv": NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_sl                         : num  14.8 14.4 14.6 14 12.2 ...
##  $ avg_char_sl                    : num  74.1 73.2 70.3 67.5 56.4 ...
##  $ avg_wl                         : num  4.85 4.9 4.67 4.58 4.48 ...
##  $ avg_syll                       : num  1.62 1.64 1.54 1.5 1.51 ...
##  $ r_long_words                   : num  0.0502 0.0593 0.0447 0.0337 0.0502 ...
##  $ r_polysy                       : num  0.183 0.153 0.142 0.118 0.147 ...
##  $ r_unique_words                 : num  0.341 0.278 0.235 0.304 0.346 ...
##  $ r_unique_lemmas                : num  0.292 0.237 0.186 0.253 0.299 ...
##  $ r_adjectives                   : num  0.1132 0.1029 0.0863 0.0953 0.0791 ...
##  $ r_adverbs                      : num  0.0377 0.0388 0.0275 0.0205 0.0267 ...
##  $ r_nouns                        : num  0.385 0.373 0.337 0.34 0.33 ...
##  $ r_prepositions                 : num  0.139 0.15 0.161 0.185 0.192 ...
##  $ r_verbs                        : num  0.159 0.144 0.174 0.164 0.144 ...
##  $ r_pronouns                     : num  0.00809 0.01049 0.00895 0.00895 0.00855 ...
##  $ r_determiners                  : num  0.0979 0.1002 0.1103 0.1061 0.1282 ...
##  $ r_cooconj                      : num  0.027 0.0416 0.0382 0.0288 0.0438 ...
##  $ r_unique_adjectives            : num  0.000898 0.000388 0.000218 0.000639 0.001068 ...
##  $ r_unique_adverbs               : num  0.0234 0.0175 0.0124 0.0141 0.016 ...
##  $ r_unique_nouns                 : num  0.155 0.138 0.115 0.139 0.153 ...
##  $ r_unique_prepositions          : num  0.02156 0.01398 0.00917 0.02302 0.02457 ...
##  $ r_unique_verbs                 : num  0.0701 0.0482 0.0553 0.0601 0.0748 ...
##  $ r_unique_pronouns              : num  0.00449 0.00272 0.0024 0.0032 0.00534 ...
##  $ r_unique_determiners           : num  0.00719 0.0066 0.00393 0.00831 0.01175 ...
##  $ r_unique_cooconj               : num  0.0027 0.00155 0.00153 0.00128 0.00214 ...
##  $ r_unique_adjectives_pty        : num  0.00302 0.00154 0.00114 0.00229 0.00341 ...
##  $ r_unique_adverbs_pty           : num  0.0785 0.0692 0.0652 0.0503 0.0512 ...
##  $ r_unique_nouns_pty             : num  0.52 0.546 0.601 0.499 0.488 ...
##  $ r_unique_prepositions_pty      : num  0.0725 0.0554 0.0481 0.0824 0.0785 ...
##  $ r_unique_verbs_pty             : num  0.236 0.191 0.289 0.215 0.239 ...
##  $ r_unique_pronouns_pty          : num  0.0151 0.0108 0.0126 0.0114 0.0171 ...
##  $ r_unique_determiners_pty       : num  0.0242 0.0262 0.0206 0.0297 0.0375 ...
##  $ r_unique_cooconj_pty           : num  0.00906 0.00615 0.00801 0.00458 0.00683 ...
##  $ avg_adjectives_ps              : num  1.68 1.48 1.258 1.33 0.961 ...
##  $ avg_adverbs_ps                 : num  0.56 0.559 0.401 0.286 0.325 ...
##  $ avg_nouns_ps                   : num  5.72 5.36 4.92 4.75 4.01 ...
##  $ avg_prepositions_ps            : num  2.07 2.16 2.35 2.59 2.34 ...
##  $ avg_verbs_ps                   : num  2.36 2.07 2.54 2.29 1.75 ...
##  $ avg_pronouns_ps                : num  0.12 0.151 0.131 0.125 0.104 ...
##  $ avg_determiners_ps             : num  1.45 1.44 1.61 1.48 1.56 ...
##  $ avg_cooconj_ps                 : num  0.4 0.598 0.557 0.402 0.532 ...
##  $ avg_unique_adjectives_ps       : num  0.01333 0.00559 0.00318 0.00893 0.01299 ...
##  $ avg_unique_adverbs_ps          : num  0.347 0.251 0.182 0.196 0.195 ...
##  $ avg_unique_nouns_ps            : num  2.29 1.98 1.67 1.95 1.86 ...
##  $ avg_unique_prepositions_ps     : num  0.32 0.201 0.134 0.321 0.299 ...
##  $ avg_unique_verbs_ps            : num  1.04 0.693 0.806 0.839 0.909 ...
##  $ avg_unique_pronouns_ps         : num  0.0667 0.0391 0.035 0.0446 0.0649 ...
##  $ avg_unique_determiners_ps      : num  0.1067 0.095 0.0573 0.1161 0.1429 ...
##  $ avg_unique_cooconj_ps          : num  0.04 0.0223 0.0223 0.0179 0.026 ...
##  $ avg_nc_adjectives              : num  6.64 6.48 6.53 6.42 6.7 ...
##  $ avg_nc_adverbs                 : num  5.74 6.42 5.41 5.25 7.04 ...
##  $ avg_nc_nouns                   : num  5.84 6.19 6.04 5.8 5.62 ...
##  $ avg_nc_prepositions            : num  2.5 2.79 2.5 2.51 2.42 ...
##  $ avg_nc_verbs                   : num  5 5 5.02 4.88 5.1 ...
##  $ avg_nc_pronouns                : num  0.141 0.181 0.162 0.168 0.148 ...
##  $ avg_nc_determiners             : num  1.75 2.01 1.91 2.07 2.63 ...
##  $ avg_nc_cooconj                 : num  0.497 0.833 0.622 0.504 0.807 ...
##  $ avg_ptree_height               : num  9.2 8.87 9.66 9.11 8.53 ...
##  $ avg_subord_conj                : num  0.3067 0.1117 0.2516 0.1875 0.0779 ...
##  $ avg_NP                         : num  5.77 6.06 6.08 6.29 5.23 ...
##  $ avg_VP                         : num  2.36 2.29 3.04 2.67 2.29 ...
##  $ avg_PP                         : num  2.16 2.27 2.34 2.46 2.3 ...
##  $ avg_ADVP                       : num  0.36 0.263 0.274 0.214 0.26 ...
##  $ avg_ADJP                       : num  0.493 0.508 0.274 0.259 0.156 ...
##  $ avg_ALLP                       : num  12.3 12.9 12.9 12.9 10.6 ...
##  $ r_NP                           : num  0.186 0.199 0.192 0.204 0.198 ...
##  $ r_VP                           : num  0.0759 0.0751 0.0958 0.0865 0.0865 ...
##  $ r_PP                           : num  0.0695 0.0745 0.0737 0.0798 0.087 ...
##  $ r_ADVP                         : num  0.01158 0.00861 0.00864 0.00694 0.00983 ...
##  $ r_ADJP                         : num  0.01587 0.01667 0.00864 0.00839 0.0059 ...
##  $ r_ALLP                         : num  0.395 0.422 0.406 0.418 0.4 ...
##  $ avg_no_VP_ps                   : num  0 0.0447 0.0446 0.0625 0.1169 ...
##  $ r_entities                     : num  0.385 0.374 0.338 0.341 0.331 ...
##  $ r_uentities                    : num  0.438 0.47 0.44 0.437 0.41 ...
##  $ avg_entities_ps                : num  5.72 5.38 4.93 4.76 4.03 ...
##  $ avg_uentities_ps               : num  2.21 1.88 1.51 1.86 1.73 ...
##  $ r_named_entities               : num  0.01078 0.03845 0.00808 0.02174 0.00214 ...
##  $ r_unamed_entities              : num  0.02639 0.06154 0.02414 0.03571 0.00617 ...
##  $ avg_named_entities_ps          : num  0.16 0.553 0.118 0.304 0.026 ...
##  $ avg_unamed_entities_ps         : num  0.1333 0.2458 0.0828 0.1518 0.026 ...
##  $ r_nent_to_ent                  : num  0.02797 0.1028 0.0239 0.06379 0.00645 ...
##  $ r_overlapping_nouns            : num  0.01078 0.03728 0.00721 0.0211 0.00107 ...
##  $ avg_named_entity_len           : num  7.6 5.84 6.47 6.38 6 ...
##  $ r_named_entity_len             : num  0.01408 0.04301 0.00907 0.02582 0.00143 ...
##  $ avg_passives                   : num  0.453 0.341 0.529 0.562 0.429 ...
##  $ avg_num_coref_per_chain        : num  2.82 4.15 3.68 2.89 4 ...
##  $ avg_coref_chain_span           : num  162 735 1079 323 414 ...
##  $ r_long_corefs                  : num  0.0526 0.1429 0.1697 0.1316 0.3333 ...
##  $ avg_coref_chains_ps            : num  0.507 0.469 0.525 0.679 0.351 ...
##  $ avg_coref_inference_distance   : num  460 1230 2094 860 676 ...
##  $ median_coref_inference_distance: num  177 431 366 269 446 232 207 227 178 119 ...
##  $ r_coref_per_words              : num  0.0961 0.1355 0.1326 0.1407 0.1154 ...
##  $ r_coref_per_entities           : num  0.249 0.362 0.392 0.413 0.348 ...
##  $ avg_word_overlap               : num  0.188 0.145 0.134 0.136 0.11 ...
##  $ avg_noun_overlap               : num  0.00676 0.00562 0.00639 0 0 ...
##  $ r_content_words                : num  0.672 0.64 0.596 0.597 0.576 ...

References

[1] Islam, Md. Zahurul. 2014. ‘Multilingual Text Classification using Information-Theoretic Features’. PhD Thesis, Goethe University Frankfurt. http://publikationen.ub.uni-frankfurt.de/opus4/frontdoor/index/index/year/2015/docId/38157.