We are proud to announce the version 3.0 release of the quanteda package, just over a year following our last major release of v2.0. Version 3.0 is a significant update that makes quanteda and its growing family of extension packages more solid, more consistent, and more extensible.
Main changes
Modularisation
We have now separated the textplot_*() functions from the main package into a separate package quanteda.textplots, and the textstat_*() functions into a separate package quanteda.textstats. This completes the modularisation begun in v2 with the move of the textmodel_*() functions to the separate package [quanteda.textmodels](https://github.com/quanteda/quanteda.textmodels). The quanteda package now consists only of core functions for textual data processing and management.
Dependencies
v3 has a much lighter dependency footprint. We have greatly reduced its package dependency structure by eliminating unnecessary package dependencies, by modularising the quanteda packages, and by addressing complex downstream dependencies in packages such as stopwords. v3 should therefore serve as a more lightweight and more consistent platform upon which other text analysis package developers can build.
Non-standard evaluation
v3 brings a new, consistent implementation of direct evaluation within docvars for the by and groups arguments:

- The by argument in the *_sample() functions, and the groups argument in the *_group() functions, now take unquoted document variable (docvar) names directly, similar to the way the subset argument works in the *_subset() functions.
- For groups, the default is now docid(x), which is now documented more completely. See ?groups and ?docid.
- by = "document" formerly sampled from docid(x), but this functionality is now removed. Use by = docid(x) to replicate it.
- Quoted docvar names no longer work, as these are now evaluated literally. This may break some existing code, but it makes the usage of the by and groups arguments consistent with how other functions in R work.
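As a minimal sketch of the new unquoted usage, the following uses the built-in data_corpus_inaugural corpus and its Party docvar:

```r
library(quanteda)

# sample one document per party, passing an unquoted docvar name to by
corpus_sample(data_corpus_inaugural, size = 1, by = Party)

# group tokens by an unquoted docvar; the default for groups is docid(x)
toks <- tokens(data_corpus_inaugural)
tokens_group(toks, groups = Party)
```

Note that by = "Party" (quoted) would now be evaluated literally and fail, which is the breaking change described above.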
Deprecation of previous “shortcut” functions
To enforce a more consistent workflow, and one that users control more explicitly, we now require functions that operate on tokens to take only tokenised inputs. Previously, some functions such as dfm() and kwic() worked directly with untokenised inputs, such as character or corpus objects, and tokenised these on the fly, with additional arguments to tokens() fed via ... .
In v3, we now restrict such functions to tokens objects. This may be less convenient, in that users previously could run dfm() or kwic() on a corpus object directly. The new usage, by contrast, requires users to take more direct control of tokenisation options, or to substitute an alternative tokeniser of their choice (and then coerce its output to tokens via as.tokens()).
This also makes our function behaviour more consistent, with each function performing a single task, rather than combining tasks (such as tokenisation and constructing a matrix).
The most commonly used shortcut involved constructing a dfm directly from a character or corpus object. Formerly, this would construct a tokens object internally before creating the dfm, and allowed passing arguments to tokens() via ... . This is now deprecated, although still functional with a warning.
What are the advantages of skipping the shortcut approach? Requiring the text processing steps to be explicit means that users will be in greater control of the consequences of the sequencing of these steps. Some processing steps are sequence-dependent, for instance when a user removes stopwords and also applies a stemmer.
```r
txt <- "Because during the concert it's very loud."
tokens(txt, remove_punct = TRUE) %>%
  tokens_remove(stopwords("en")) %>%
  tokens_wordstem()
## Tokens consisting of 1 document.
## text1 :
## [1] "concert" "loud"
```
is different from:
```r
txt <- "Because during the concert it's very loud."
tokens(txt, remove_punct = TRUE) %>%
  tokens_wordstem() %>%
  tokens_remove(stopwords("en"))
## Tokens consisting of 1 document.
## text1 :
## [1] "Becaus" "dure" "concert" "veri" "loud"
```
While the first is probably what nearly all users want, removing the options previously hard-wired into dfm() means the user must now choose the sequence. Users must either create a tokens object first, or pipe the tokens return to dfm() using %>%. This has always been possible prior to v3, of course, but now it is required.
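For example, the required explicit workflow now looks like this (a sketch using the built-in inaugural corpus):

```r
library(quanteda)

# v3 workflow: tokenise explicitly, then construct the dfm
dfmat <- data_corpus_inaugural %>%
  tokens(remove_punct = TRUE) %>%
  tokens_remove(stopwords("en")) %>%
  dfm()
```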
We have also deprecated direct character or corpus inputs to kwic(), since this function too now requires a tokenised input.
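The same pattern applies to kwic(): tokenise first, then search. A minimal sketch:

```r
library(quanteda)

toks <- tokens("Because during the concert it's very loud.")
kwic(toks, pattern = "concert")  # v3: a tokens input is required
```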
These are deprecations rather than removals, since in v3.x the deprecated arguments and methods still work, but with deprecation warnings. We strongly encourage users to switch to the new workflow, as the deprecated arguments will be removed in the next major release.
Other new features
The full list of new features is the following:

- dfm() has a new argument, remove_padding, for removing the “pads” left behind after removing tokens with padding = TRUE. (For other extensive changes to dfm(), see “Deprecations” below.)
- tokens_group(), formerly internal-only, is now exported.
- corpus_sample(), dfm_sample(), and tokens_sample() now work consistently.
- The kwic() return object structure has been redefined, and is built with an option to use a new function index() that returns token spans following a pattern search.
- The regular expressions for punctuation and for matching social media usernames have been redefined, so that the valid Twitter username @_ is now counted as a “tag” rather than as “punctuation”.
- The data object data_corpus_inaugural has been updated to include the Biden 2021 inaugural address.
- A new system of validators for input types now provides better argument type and value checking, with more consistent error messages for invalid types or values.
- Upon startup, we now message the console with the Unicode and ICU version information. Because we removed our redefinition of View() (see below), the former conflict warning is now gone.
- as.character.corpus() now has a use.names = TRUE argument, similar to as.character.tokens() (but with a different default value).
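As an example of the new remove_padding argument, the following sketch removes stopwords while leaving pads in the tokens object, then drops the pads when constructing the dfm:

```r
library(quanteda)

toks <- tokens("The quick brown fox jumps over the lazy dog") %>%
  tokens_remove(stopwords("en"), padding = TRUE)
# the tokens object retains "" pads marking the removed positions
dfm(toks, remove_padding = TRUE)  # pads are dropped from the dfm
```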
Deprecations
In addition to the deprecation of convenience shortcuts noted above, the following usages are also deprecated in v3.
- dfm(): As of version 3, only tokens objects (as well as existing dfm objects) are supported as inputs to dfm(). Calling dfm() on character or corpus objects is still functional, but issues a warning. Convenience passing of arguments to tokens() via ... in dfm() is also deprecated and undocumented, and functions only with a warning. Users should now create a tokens object (using tokens()) from character or corpus inputs before calling dfm().
- kwic(): As of version 3, only tokens objects are supported as inputs to kwic(). Calling kwic() on character or corpus objects is still functional, but issues a warning. Passing arguments to tokens() via ... in kwic() is now disabled. Users should now create a tokens object (using tokens()) from character or corpus inputs before calling kwic().
- (As noted above.) Shortcut arguments to dfm() are now deprecated. These are still active, with a warning, although they are no longer documented. They are:
  - stem: use tokens_wordstem() or dfm_wordstem() instead.
  - select, remove: use tokens_select()/dfm_select() or tokens_remove()/dfm_remove() instead.
  - dictionary, thesaurus: use tokens_lookup() or dfm_lookup() instead.
  - valuetype, case_insensitive: these are disabled; for the deprecated arguments that take these qualifiers, they are fixed to the defaults "glob" and TRUE.
  - groups: use tokens_group() or dfm_group() instead.
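Put together, a deprecated call and its explicit v3 equivalent might look like the following sketch (assuming corp is a corpus object):

```r
# deprecated shortcut (still works in v3.x, with a warning):
# dfm(corp, stem = TRUE, remove = stopwords("en"))

# explicit v3 replacement:
tokens(corp, remove_punct = TRUE) %>%
  tokens_remove(stopwords("en")) %>%
  tokens_wordstem() %>%
  dfm()
```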
- texts() and texts<- are deprecated.
  - Use as.character.corpus() to turn a corpus into a simple named character vector.
  - Use corpus_group() instead of texts(x, groups = ...) to aggregate texts by a grouping variable.
  - Use [<- instead of texts()<- for replacing texts in a corpus object. To replace all of the texts in a corpus, while keeping it a corpus, use [], e.g. for a five-document corpus:

```r
corp <- data_corpus_inaugural[1:5]
# equivalent to the old usage texts(corp) <- LETTERS[1:5]
corp[] <- LETTERS[1:5]
```
Removals
Finally, we have removed some previously deprecated arguments and functions, and moved others to different packages.
- The textplot_*() and textstat_*() functions have been moved to quanteda.textplots and quanteda.textstats, respectively.
- The following functions have been removed:
  - all methods for the defunct corpuszip objects;
  - the View() functions – so no more namespace conflict warnings on startup;
  - as.wfm() and as.DocumentTermMatrix() (the same functionality is available via convert());
  - metadoc() and metacorpus();
  - corpus_trimsentences() (replaced by corpus_trim()); and
  - all of the tortl functions.
- dfm objects can no longer be used as a pattern in dfm_select() (formerly deprecated).
- dfm_sample():
  - no longer has a margin argument; instead, dfm_sample() now samples only on documents, the same as corpus_sample() and tokens_sample(); and
  - no longer works with by = "document" – use by = docid(x) instead.
- dictionary_edit(), char_edit(), and list_edit() have been removed.
- dfm_weight(): the formerly deprecated "scheme" options are now removed.
- tokens(): the formerly deprecated options remove_hyphens and remove_twitter are now removed. (Use split_hyphens instead; the default tokeniser now always preserves Twitter and other social media tags.)
- The special versions of head() and tail() for corpus, dfm, and fcm objects are now removed, since the base methods work fine for these objects. The main consequence is the removal of the nf option, which limited the number of features shown for dfm and fcm objects. The same result can be achieved using the index operator [, or, for printing, by specifying print(x, max_nfeat = 6L) (for instance).
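For instance, where one formerly wrote something like head(dfmat, nf = 6), a v3 sketch would be:

```r
library(quanteda)

dfmat <- dfm(tokens(data_corpus_inaugural))
dfmat[, 1:6]                  # first six features, via the index operator
print(dfmat, max_nfeat = 6)   # or limit the features shown when printing
```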
Acknowledgements
We wish to thank the CRAN maintainers, especially Kurt Hornik, for their patience and assistance in preparing this release, which involved some tricky tests and a simultaneous refresh of all of the quanteda packages.
We also thank the authors of quanteda’s numerous reverse-dependency packages for working with us to update their code to work with the changes introduced in version 3. We hope the new infrastructure and more consistent usage will provide an even more solid base on which your code can build.