Numeric vectors with tags
Matt Kerlogue
2023-05-31
Source: vignettes/shrthnd-num.Rmd
The development of the shrthnd package is heavily influenced by experience of working with statistical datasets published by governments and international bodies, especially departments and agencies in the UK producing outputs as part of the UK statistical system.
While data is increasingly released in machine readable formats or through APIs, many data products are still released as spreadsheets, and historical data from these institutions is often only available in spreadsheets. Beyond layout issues, such as the use of header and footer rows to communicate related information, it is not uncommon to encounter columns in these spreadsheets that contain a mix of numeric and non-numeric content. This non-numeric content may sometimes be the only content of a cell (to explain why there is no numeric value) or may sit alongside a numeric value (to qualify, caveat or otherwise explain something about the value).
The most common approach in data processing when encountering these sorts of issues is simply to scrub the vector of the non-numeric components and coerce it into a numeric vector. However, these tags often convey useful information which you may wish to retain.
Introducing shrthnd_num()
The shrthnd_num() data type builds on vctrs::new_rcrd() to split numeric and non-numeric components while keeping them attached to each other. In practice a shrthnd_num() can be thought of as a numeric() and character() vector that have been coupled together.
Specifically it has a num component representing the numeric value and a tag component representing the non-numeric shorthand, symbol or marker.
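To make the idea of two coupled components concrete, here is a minimal base R sketch that splits a small character vector into a numeric part and a tag part. It is an illustration only and not how shrthnd does it internally (the package builds on vctrs::new_rcrd()); the input vector raw and the regular expressions are assumptions chosen for this example.

```r
# Hypothetical input: a plain number, a bare tag, and a tagged number.
raw <- c("1.5", "[c]", "2.75[e]")

# Numeric component: drop everything except digits, decimal point and minus.
# as.numeric("") returns NA silently, so a bare tag becomes a missing value.
num <- as.numeric(gsub("[^-0-9.]", "", raw))

# Tag component: drop numeric characters and whitespace, then turn
# empty strings into missing values.
tag <- gsub("[-0-9.[:space:]]", "", raw)
tag[tag == ""] <- NA

# Keep the two components side by side, one row per original value.
data.frame(num = num, tag = tag)
```

Each row pairs a numeric value with its tag, which is the relationship a shrthnd_num() preserves inside a single vector.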
Let us create the vector x with seven values:
x <- c("12", "34.567", "[c]", "NA", "56.78[e]", "78.9", "90.123[e]")
x
#> [1] "12" "34.567" "[c]" "NA" "56.78[e]" "78.9"
#> [7] "90.123[e]"
The first, second and sixth values in this vector are purely numeric (12, 34.567 and 78.9). The third value is a shorthand symbol ("[c]") denoting that the value has been suppressed because it is confidential. The fourth value is a missing value ("NA"). The fifth and seventh values contain both numeric information (56.78 and 90.123 respectively) and shorthand ("[e]") to denote that these values are estimated. Depending on what processing we wish to do with this vector in the future it might be useful to know that a value has been suppressed or estimated.
Using as.numeric() on this vector converts every value containing a non-numeric element to a missing value, so we lose all the information in the third, fifth and seventh values of the vector.
as.numeric(x)
#> Warning: NAs introduced by coercion
#> [1] 12.000 34.567 NA NA NA 78.900 NA
We could scrub the non-numeric elements of the vector, but we still lose the information provided by the shorthand.
as.numeric(gsub("[^0-9.]", "", c(x)))
#> [1] 12.000 34.567 NA NA 56.780 78.900 90.123
The shrthnd_num() function, however, allows us to retain both sets of information, and we can easily coerce a shrthnd_num() vector into a regular base R numeric() vector. We can also easily access the shorthand or symbol tags with the shrthnd_tags() function.
sh_x <- shrthnd_num(x)
sh_x
#> <shrthnd_num[7]>
#> [1] 12.00 34.57 NA [c] NA 56.78 [e] 78.90 90.12 [e]
as.numeric(sh_x)
#> [1] 12.000 34.567 NA NA 56.780 78.900 90.123
shrthnd_tags(sh_x)
#> [1] NA NA "[c]" NA "[e]" NA "[e]"
The shrthnd_list() function provides a summary of the tags contained in a shrthnd_num() vector, their frequency and positions in the vector.
shrthnd_list(sh_x)
#> <shrthnd_list[2]>
#> [c] (1 location): 3
#> [e] (2 locations): 5, 7
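For comparison, a similar summary can be sketched in base R from a plain character vector of tags. This is only an analogy to what shrthnd_list() reports, not its implementation; the tags vector below is typed in by hand to mirror the output of shrthnd_tags(sh_x).

```r
# Tags as a plain character vector (copied from the shrthnd_tags() output).
tags <- c(NA, NA, "[c]", NA, "[e]", NA, "[e]")

# split() groups the positions by tag; missing tags are dropped.
positions <- split(seq_along(tags), tags)
positions

# Frequency of each tag.
lengths(positions)
```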
We saw above how as.numeric() converts a shrthnd_num() to a numeric vector. Similarly, as.character() converts a shrthnd_num() to a character vector as if it were a numeric vector, dropping the tags. To instead get a character vector that combines the numeric and non-numeric components, use as_shrthnd().
as.character(sh_x)
#> [1] "12" "34.567" NA NA "56.78" "78.9" "90.123"
as_shrthnd(sh_x)
#> [1] "12.00" "34.57" "NA [c]" "NA" "56.78 [e]" "78.90"
#> [7] "90.12 [e]"
Making a shrthnd_num
You can make a shrthnd_num() vector in two ways: using shrthnd_num() to convert a character vector containing numeric and non-numeric components, or make_shrthnd_num() to merge a vector of numbers and a vector of character strings.
Conversion to a shrthnd_num
You convert a character vector containing shorthand using shrthnd_num(). In addition to the character vector you can supply further arguments to control the behaviour of the conversion and the resulting display of the vector.
shrthnd_num(
  x,
  shorthand = NULL,
  na_values = c("", "NA"),
  digits = 2L,
  paren_nums = c("negative", "strip"),
  dec = ".",
  bigmark = ","
)
#> <shrthnd_num[7]>
#> [1] 12.00 34.57 NA [c] NA 56.78 [e] 78.90 90.12 [e]
The shorthand argument allows you to pass a character vector of shorthand, symbols and markers that you want to validate against, i.e. you can cause the conversion to throw an error if it detects shorthand that is not in this vector.
The na_values argument is used to determine values that should be ignored when identifying shorthand tags and converted to missing values when extracting the numeric component.
The digits, dec and bigmark arguments are passed on to formatC() to format the numeric component when the vector is formatted or printed.
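As a rough illustration of what these settings do, the following base R call formats a few numbers with formatC() directly. The mapping of digits, dec and bigmark onto formatC()'s digits, decimal.mark and big.mark arguments, and the use of format = "f", are assumptions made for this sketch rather than a description of the package internals.

```r
# Hypothetical values, including one large enough to show the big mark.
num <- c(12, 34.567, 1234.5)

# Fixed notation with two decimal places, "." as decimal mark and
# "," as thousands separator.
formatted <- formatC(num, digits = 2L, format = "f",
                     big.mark = ",", decimal.mark = ".")
formatted
#> [1] "12.00"    "34.57"    "1,234.50"
```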
The paren_nums argument determines how to handle numbers in parentheses, i.e. whether to treat a number in parentheses as a negative number (as is commonly used in accounting formats, and the default setting) or whether to just strip the parentheses from the number before its conversion.
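The two behaviours can be sketched in base R as follows. The helper parse_paren() is hypothetical and written only for this illustration; it is not part of shrthnd and makes no claim about how the package parses its input.

```r
# Hypothetical helper illustrating the two paren_nums options.
parse_paren <- function(x, paren_nums = c("negative", "strip")) {
  paren_nums <- match.arg(paren_nums)
  # Flag values that are wrapped entirely in parentheses.
  is_paren <- grepl("^\\(.*\\)$", x)
  # Remove the parentheses before converting to numeric.
  num <- as.numeric(gsub("[()]", "", x))
  # Accounting convention: a parenthesised number is negative.
  if (paren_nums == "negative") {
    num[is_paren] <- -num[is_paren]
  }
  num
}

parse_paren(c("10", "(2.5)"))
#> [1] 10.0 -2.5
parse_paren(c("10", "(2.5)"), paren_nums = "strip")
#> [1] 10.0  2.5
```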
The coercion to a numeric() vector is handled by utils::type.convert().
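For readers unfamiliar with utils::type.convert(), this short sketch shows how it behaves on plain character input. The example values are made up, and as.is = TRUE is used here only so that character input is never turned into a factor; how shrthnd calls the function is not shown in this article.

```r
# Character strings that look like decimals become a double vector,
# with missing values preserved.
dbl <- utils::type.convert(c("12", "34.567", NA), as.is = TRUE)
dbl
#> [1] 12.000 34.567     NA

# Strings that all look like whole numbers become an integer vector.
int <- utils::type.convert(c("1", "2"), as.is = TRUE)
is.integer(int)
#> [1] TRUE
```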
Making a shrthnd_num from scratch
You can use make_shrthnd_num() to create a shrthnd_num() from a numeric() and character() vector of the same length.
make_shrthnd_num(c(1:3, NA, 4:5, NA), c("", "", "", "[c]", "", "[e]", NA))
#> <shrthnd_num[7]>
#> [1] 1 2 3 NA [c] 4 5 [e] NA
Coercion, maths and statistics
Generally a shrthnd_num() should behave like a numeric() vector. For example, using is.na() will return TRUE where the numeric value is missing and FALSE where the numeric value is not missing. Or, if you use c() to combine a shrthnd_num() with another vector, it will first coerce the shrthnd_num() to numeric so that R can proceed from there.
is.na(sh_x)
#> [1] FALSE FALSE TRUE TRUE FALSE FALSE FALSE
c(sh_x, 1)
#> [1] 12.000 34.567 NA NA 56.780 78.900 90.123 1.000
c(sh_x, "c")
#> [1] "12" "34.567" NA NA "56.78" "78.9" "90.123" "c"
However, in keeping with base R practice around complex numeric objects such as Date(), difftime() and POSIXct(), using is.numeric() on a shrthnd_num() vector will return FALSE. Use is_shrthnd_num() to test if a vector is a shrthnd_num() vector. shrthnd also includes an is_numeric() function that allows you to test for vectors that are either standard numeric vectors or a coercible shrthnd_num() vector, for example if you want to apply a function across a range of columns in a dplyr::mutate() call.
is.numeric(sh_x)
#> [1] FALSE
is_shrthnd_num(sh_x)
#> [1] TRUE
is_numeric(x)
#> [1] FALSE
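The base R behaviour referred to above can be checked directly. In this short sketch the date value is arbitrary; it shows that a Date is stored as a number yet is.numeric() still returns FALSE for it.

```r
d <- as.Date("2023-05-31")

# A Date is a number with a class attribute, but is.numeric() says FALSE.
is.numeric(d)
#> [1] FALSE

# Removing the class exposes the underlying numeric value.
is.numeric(unclass(d))
#> [1] TRUE
```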
Through vctrs::vec_arith() and vctrs::vec_math() there is generalised support for arithmetic and mathematical operations on a shrthnd_num() vector. Bespoke methods have also been added for some functions which are not directly supported, such as median() and quantile(), so that they can easily work with the numeric components of the shrthnd_num() vector.
x <- c("12", "34.567", "[c]", "NA", "56.78[e]", "78.9", "90.123[e]")
sh_x <- shrthnd_num(x, c("[c]", "[e]"))
sh_x * 2
#> [1] 24.000 69.134 NA NA 113.560 157.800 180.246
2 + sh_x
#> [1] 14.000 36.567 NA NA 58.780 80.900 92.123
sum(sh_x, na.rm = TRUE)
#> [1] 272.37
range(sh_x, na.rm = TRUE)
#> [1] 12.000 90.123
mean(sh_x, na.rm = TRUE)
#> [1] 54.474
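As a cross-check, the same summaries can be computed in base R on the bare numeric component, here typed in by hand from the as.numeric(sh_x) output. This sketch does not call shrthnd at all; it only shows the values one would expect such functions to return when they operate on the numeric component, including median() and quantile().

```r
# Numeric component of the example vector, as a plain numeric vector.
num <- c(12, 34.567, NA, NA, 56.78, 78.9, 90.123)

total <- sum(num, na.rm = TRUE)    # 272.37
avg   <- mean(num, na.rm = TRUE)   # 54.474
mid   <- median(num, na.rm = TRUE) # 56.78

# Lower and upper quartiles of the five non-missing values.
quartiles <- quantile(num, c(0.25, 0.75), na.rm = TRUE)
quartiles
```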
Working with shrthnd tags
The shrthnd_tags() function allows us to access the tag components of a shrthnd_num(). It has a related function shrthnd_unique_tags() which will return a unique list of tags, and is simply a convenience function in place of unique(shrthnd_tags(x)).
shrthnd_tags(sh_x)
#> [1] NA NA "[c]" NA "[e]" NA "[e]"
shrthnd_unique_tags(sh_x)
#> [1] "[c]" "[e]"
The base R functions for value matching work with the numeric component of a shrthnd_num() vector. Separate tag locator functions are provided to support matching the tag components of a shrthnd_num() vector.
tag_match() returns an integer vector showing the first location of the tag provided while tag_in() will return TRUE or FALSE depending on whether the tag is in the vector’s shorthand.
To locate where a specific tag is used in a vector use where_tag(), which is equivalent to computing tags == tag. To identify if a value has a tag, irrespective of its value, use any_tag(), which is equivalent to !is.na(tags).
where_tag(sh_x, "[e]")
#> [1] NA NA FALSE NA TRUE NA TRUE
any_tag(sh_x)
#> [1] FALSE FALSE TRUE FALSE TRUE FALSE TRUE
Using is.na() on a shrthnd_num() will assess if the numeric component is missing. To identify if tags are missing use is_na_tag(), which is equivalent to is.na(tags). To identify if both the numeric and tag components are missing use is_na_both().
is_na_tag(sh_x)
#> [1] TRUE TRUE FALSE TRUE FALSE TRUE FALSE
is_na_both(sh_x)
#> [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE
Finally, you can locate the positions of a specific tag, tagged values or untagged values using a set of locate_*() functions, which are convenience functions wrapping the functions that return logical vectors in which().
locate_tag(sh_x, "[e]")
#> [1] 5 7
locate_any_tag(sh_x)
#> [1] 3 5 7
locate_no_tag(sh_x)
#> [1] 1 2 4 6
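The equivalences described in this section can be reproduced in base R on plain vectors. In this sketch the num and tags vectors are typed in by hand from the earlier outputs, and the pairing of each expression with a shrthnd function in the comments follows the descriptions above; it is an analogy, not the package's implementation.

```r
num  <- c(12, 34.567, NA, NA, 56.78, 78.9, 90.123)
tags <- c(NA, NA, "[c]", NA, "[e]", NA, "[e]")

is_e     <- tags == "[e]"             # compare where_tag(sh_x, "[e]")
has_tag  <- !is.na(tags)              # compare any_tag(sh_x)
no_tag   <- is.na(tags)               # compare is_na_tag(sh_x)
na_both  <- is.na(num) & is.na(tags)  # compare is_na_both(sh_x)

# which() turns the logical vectors into positions, dropping NA.
which(is_e)     # compare locate_tag(sh_x, "[e]")
#> [1] 5 7
which(has_tag)  # compare locate_any_tag(sh_x)
#> [1] 3 5 7
which(no_tag)   # compare locate_no_tag(sh_x)
#> [1] 1 2 4 6
```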