The aim of tidyods is to extract information about cells in OpenDocument Spreadsheet (ODS) files into a “tidy” format. In addition to the value contained by a cell it extracts a range of addition metadata about the cells.
This vignette explains how to use
tidyods::read_ods_cells()
and the types of information
about cells it extracts. The Spreadsheet
Munging Strategies book by Duncan Garmonsway explores in more
detail how you can use datasets of “tidy cells” to work with non-tabular
data stored in (Excel) spreadsheets.
Reading data from cells
You read in data from an ODS file with the
read_ods_cells()
function. The resulting tibble provides 28
columns of information about cells, with each row representing an
individual cell in the input spreadsheet.
example_file <- system.file("extdata", "basic_example.ods", package = "tidyods")
types_cells <- read_ods_cells(example_file, 2)
dplyr::glimpse(types_cells)
#> Rows: 160
#> Columns: 28
#> $ sheet <chr> "types", "types", "types", "types", "types", "t…
#> $ address <chr> "A1", "B1", "C1", "D1", "E1", "F1", "G1", "H1",…
#> $ row <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2,…
#> $ col <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6…
#> $ cell_type <chr> "cell", "cell", "cell", "cell", "cell", "cell",…
#> $ is_empty <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ value_type <chr> "string", "string", "string", "string", "string…
#> $ cell_content <chr> "String", "Boolean", "Currency", "Date", "Time"…
#> $ base_value <chr> "String", "Boolean", "Currency", "Date", "Time"…
#> $ numeric_value <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ currency_symbol <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ boolean_value <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, TRU…
#> $ date_value <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ time_value <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ has_formula <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ formula <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ has_error <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ error_type <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ has_annotation <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ annotation <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ is_merged <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ merge_colspan <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ merge_rowspan <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ merge_shape <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ cell_style <chr> NA, "Default", "Default", "Default", "Default",…
#> $ row_style <chr> "ro1", "ro1", "ro1", "ro1", "ro1", "ro1", "ro1"…
#> $ col_style <chr> "co6", "co7", "co8", "co9", "co10", "co11", "co…
#> $ col_default_cell_style <chr> "Default", "ce56", "ce4", "ce11", "ce21", "ce31…
There are four arguments to read_ods_cells()
:
-
path
, the path to the ODS file -
sheet
, the sheet you wish to extract -
quick
, whether to extract a subset of cell information -
quiet
, whether to show console messages or not
The sheet
argument can either be the index number of a
sheet (ranging from 1 to the total number of sheets in the spreadsheet
document) or the name of a sheet (e.g. "types"
instead of
2
in the example above).
The quick
argument extracts only a small number of
columns and is useful if you are largely concerned with cell values; see
the quick extraction section for more
details.
The quiet
argument turns off console messages, ODS files
are written in XML which can take significant time to process and so
progress messages are provided to the console for interactive users. It
is advised to set this to false if you are doing batch operations.
Cell information
There are 28 types of information returned by
read_ods_cells()
, these can be grouped into three
categories: location information, value information, and other
metadata.
Location information
The first four columns of information relate to a cell’s location in the spreadsheet.
-
sheet
, the name of the sheet the cell belongs to -
address
, the address of the cell within a sheet using “A1” notation -
row
, the row number of the cell -
col
, the column number of the cell
The address
column is derived from the row
and col
using the cellranger package. It
provides the address in the “A1” notation commonly used in spreadsheet
applications where the letter corresponds to a column and the number
corresponds to a row. The “A1” notation is used by ODS files for
calculation formulas.
Value information
There are 10 columns that provide information about a cell’s value.
-
cell_type
, the type of cell -
is_empty
, whether the cell has content or not (i.e.cell_type == "empty"
) -
value_type
, the value type as set in the ODS file -
cell_content
, the cell’s value as formatted for users of spreadsheet applications -
base_value
, the “raw” value of the cell -
numeric_value
, for numeric value types the raw value as anumeric()
vector -
currency_symbol
, for currency value types the currency symbol associated with the numeric value -
boolean_value
, for boolean value types the raw value as alogical()
vector -
date_value
, for date value types (dates and date-times) the “raw” value as acharacter()
vector -
time_value
, for time value types, the “raw” value as acharacter()
vector
Cells have one of four types, it is empty
if it has no
content and/or no value type, it is a merge-lead
if it is
the top-left cell of a merge group while other cells in the merge group
are classified as merge-hidden
, all other cells are simply
a cell
.
The value_type
is derived from the ODS
specification which defines seven value types:
-
boolean
, equivalent to R’slogical()
data type -
currency
, a numeric value, optionally with a currency symbol -
date
, a date or date-time stored in ISO 8601 format, the ODS specification does not differentiate between a date (e.g. 15/06/2022) and a date-time (e.g. 15/06/2023 13:47) -
float
, a numeric value, the ODS format does not distinguish between different numeric types -
percentage
, a numeric value, but formatted for display as a percentage -
string
, equivalent to R’scharacter()
-
time
, a duration, stored in ISO 8601 format
The cell_content
column provides the value of a cell as
displayed and formatted to users of a spreadsheet application.
Non-string values are stored separately in the underlying ODS XML file,
and these are extracted as separate columns (numeric_value
,
boolean_value
, date_value
and
time_value
) by read_ods_cells()
, for numeric
and boolean values these are also coerced into a numeric()
and logical()
vector. The base_value
column
provides a combination of the cell_content
for cells that
are strings and for cells that are not string value types the value from
the relevant type specific column.
For currency value types optionally a currency symbol (e.g. GBP, EUR,
USD, JPY) can be set, this is provided in the
currency_symbol
column.
types_cells |>
dplyr::filter(row == 3) |>
dplyr::select(value_type:currency_symbol)
#> # A tibble: 10 × 5
#> value_type cell_content base_value numeric_value currency_symbol
#> <chr> <chr> <chr> <dbl> <chr>
#> 1 string Cell with comment Cell with comment NA NA
#> 2 boolean FALSE false NA NA
#> 3 currency £1.20 1.2 1.2 GBP
#> 4 date 06/15/22 2022-06-15 NA NA
#> 5 time 13:24 PT13H24M56S NA NA
#> 6 date 15/06/22 13:24 2022-06-15T13:24:… NA NA
#> 7 float 12034.57 12034.56789 12035. NA
#> 8 percentage 55% 0.5467 0.547 NA
#> 9 float 6579.3 6579.298265463 6579. NA
#> 10 string #DIV/0! #DIV/0! NA NA
types_cells |>
dplyr::filter(row == 3) |>
dplyr::select(value_type, base_value, boolean_value:time_value)
#> # A tibble: 10 × 5
#> value_type base_value boolean_value date_value time_value
#> <chr> <chr> <lgl> <chr> <chr>
#> 1 string Cell with comment NA NA NA
#> 2 boolean false FALSE NA NA
#> 3 currency 1.2 NA NA NA
#> 4 date 2022-06-15 NA 2022-06-15 NA
#> 5 time PT13H24M56S NA NA PT13H24M56S
#> 6 date 2022-06-15T13:24:56 NA 2022-06-15T13:24:56 NA
#> 7 float 12034.56789 NA NA NA
#> 8 percentage 0.5467 NA NA NA
#> 9 float 6579.298265463 NA NA NA
#> 10 string #DIV/0! NA NA NA
The value type examples section below provides more details on the different value types and how they are stored and processed.
Other metadata
Finally, read_ods_cells()
provides 12 further columns
with additional metadata about each cell.
-
has_formula
, whether the cell value is calculated by a formula -
formula
, the formula used to calculate the cell value -
has_error
, whether the cell value or formula results in an error -
error_type
, the type of error the cell is experiencing -
has_annotation
, whether the cell has a comment/annotation -
annotation
, the content of the cell’s comment/annotation -
is_merged
, whether the cell is part of a merge group -
merge_colspan
, for the lead cell in a merge group the number of columns spanned by the merge group -
merge_rowspan
, for the lead cell in a merge group the number of rows spanned by the merge group -
merge_shape
, for the lead cell in a merge group the shape of the merge group -
cell_style
, the style reference for the cell -
row_style
, the style reference for the row -
col_style
, the style reference for the column -
col_deftaul_cell_style
, the default style reference for cells in the column
Formulas
The has_formula
column indicates if a cell value is
calculated by a formula, which is itself provided in the
formula
column.
types_cells |>
dplyr::filter(col == 9 & has_formula) |>
dplyr::select(value_type, cell_content, numeric_value, has_formula, formula) |>
head()
#> # A tibble: 6 × 5
#> value_type cell_content numeric_value has_formula formula
#> <chr> <chr> <dbl> <lgl> <chr>
#> 1 float 6579.3 6579. TRUE of:=[.G2]*[.H2]
#> 2 float 6579.3 6579. TRUE of:=[.G3]*[.H3]
#> 3 float 6579.3 6579. TRUE of:=[.G4]*[.H4]
#> 4 float 6579.3 6579. TRUE of:=[.G5]*[.H5]
#> 5 float 6579.3 6579. TRUE of:=[.G6]*[.H6]
#> 6 float 6579.3 6579. TRUE of:=[.G7]*[.H7]
For more on the use of formulas in ODS files you can consult the OpenFormula specification.
Errors
Formulas in spreadsheets can result in calculation errors. The
has_error
column indicates if there the cell results in an
error. The error_type
column is similar to the
=ERROR.TYPE()
function in Microsoft Excel, Google Sheets
and LibreOffice, these provide a number to denote the type of error.
error_type
will have a value from 1 to 7 or 501 to
540:
-
1
for #NULL! errors in Excel/Google Sheets or Err:511 in LibreOffice, typically this is some sort of error in the formula specification such as a missing operator (Excel/Google Sheets) or a missing variable (LibreOffice) -
2
for #DIV/0! (division by zero) errors -
3
for #VALUE! (value type) errors, e.g. when a formula uses a string rather than a number -
4
for #REF! (invalid cell reference) errors -
5
for #NAME? errors, e.g. formula specifies a function that doesn’t exist -
6
for #NUM! (calculation overflow) errors, e.g. the result of9999^9999
-
7
for #N/A (logic) errors, e.g. unable to match values -
501
to540
a LibreOffice error
LibreOffice assigns all errors an error code from 501 to 540, where
not displayed as an error like #DIV/0!
these are shown in
cell_content
as "Err:5##"
. See LibreOffice’s
documentation on error
codes for more details on these error codes.
Annotations
Spreadsheet applications allow you to add notes and comments to cells
to provide addition contextual information. The
has_annotation
column indicates whether the cell has an
annotation associated with it. The annotation
column
provides the text content of the annotation.
types_cells |>
dplyr::filter(has_annotation) |>
dplyr::select(has_annotation, annotation)
#> # A tibble: 2 × 2
#> has_annotation annotation
#> <lgl> <chr>
#> 1 TRUE "Test comment"
#> 2 TRUE "Comment 2\nMore comments\nComments with space\nBold comment"
Note: different spreadsheet applications handle
annotations in different ways, read_ods_cells()
does not
take account of these differences and provides only a basic
representation of the text contained in the annotation. LibreOffice only
supports a basic “comment” annotation which is called a “note” in
Microsoft Excel and Google Sheets.
In addition to a “note” Microsoft Excel and Google Sheets proprietary formats allow for threaded comments that allow users to leave comments and reply to these comments, when saving into ODS these features are saved as annotations like notes. However the coercion of these threads into a note and inclusion of comment/reply author details varies.
Merged cells
Spreadsheets allow you to merge cells, largely this is for
presentational purposes. Merged cells have a cell_type
of
either merge-lead
for the top-left cell of a merge group or
merge-hidden
for a cell hidden by a merge group. The
is_merged
indicates whether the cell is either of these
types.
When merging cells in Microsoft Office and Google Sheets then only content in the top-left cell is retained, however in LibreOffice merging cells retains the content of the cells that are hidden by a merge operation.
types_cells |>
dplyr::filter(is_merged) |>
dplyr::select(address, cell_type, cell_content, is_merged)
#> # A tibble: 4 × 4
#> address cell_type cell_content is_merged
#> <chr> <chr> <chr> <lgl>
#> 1 A8 merge-lead Merged cell TRUE
#> 2 A9 merge-hidden NA TRUE
#> 3 A10 merge-lead Merged cell with hidden text TRUE
#> 4 A11 merge-hidden Hidden text TRUE
For merge-lead
cells you will also get the number of
columns and rows spanned by the merge group in
merge_colspan
and merge_rowspan
respectively.
This span information is used to define the merge_shape
column:
- the shape of a merge group is defined as
vertical
whenmerge_colspan
is equal to 1 andmerge_rowspan
is greater than 1 - the shape of a merge group is defined as
horizontal
whenmerge_colspan
is greater than 1 andmerge_rowspan
is equal to 1 - otherwise the shape of the merge group is
rectangular
.
A | B | C | D | |
---|---|---|---|---|
1 | A cell across two columns (a horizontal merge) | C1 | D1 | |
2 | A cell across two rows (a vertical merge) | B2 | C2 | D2 |
3 | B3 | C3 | D3 | |
4 | A4 | B4 | A cell across two columns and two rows (a rectangular merge) | |
5 | A5 | B5 |
merge_cells <- read_ods_cells(example_file, 3)
merge_cells |>
dplyr::select(row, col, cell_type) |>
dplyr::mutate(col = paste0("x", col)) |>
tidyr::pivot_wider(names_from = col, values_from = cell_type)
#> # A tibble: 5 × 5
#> row x1 x2 x3 x4
#> <dbl> <chr> <chr> <chr> <chr>
#> 1 1 merge-lead merge-hidden NA NA
#> 2 2 merge-lead NA NA NA
#> 3 3 merge-hidden NA NA NA
#> 4 4 empty empty merge-lead merge-hidden
#> 5 5 empty empty merge-hidden merge-hidden
merge_cells |>
dplyr::filter(cell_type == "merge-lead") |>
dplyr::select(address, cell_content, merge_colspan, merge_rowspan, merge_shape)
#> # A tibble: 3 × 5
#> address cell_content merge_colspan merge_rowspan merge_shape
#> <chr> <chr> <dbl> <dbl> <chr>
#> 1 A1 "A cell across two columns (a… 2 1 horizontal
#> 2 A2 "A cell across two rows (a ve… 1 2 vertical
#> 3 C4 "A cell across two columns an… 2 2 rectangular
Style information
Style and formatting information is stored separately from cells themselves, instead cells, rows and columns are given reference identifiers to look up in the style information. There are four style identifiers for each cell:
-
cell_style
, style and formatting specifically applied to that cell -
row_style
, style and formatting applied to the row -
col_style
, style and formatting applied to column -
col_default_cell_style
, the default style and formatting for cells in the column, i.e. when acell_style
is not set
Note: While these style identifiers are extracted, at present the tidyods package does not currently support the extraction of the style and formatting rules from the ODS XML.
Quick extraction
At its core an ODS file is an XML file, extracting this information
can be processor heavy. If your primary interest is just in cell
location and cell values then setting the argument
quick = TRUE
can result in a faster extraction process as
it provides only six columns: sheet
, address
,
row
, col
, value_type
and
base_value
.
The quick extraction process varies in how the
base_value
column is constructed. When
quick = TRUE
only float and percentage values are taken in
their raw numeric form, for all other types the
cell_content
is used. That means that for currency, date
and time value types the value as formatted for spreadsheet application
users is returned rather than the “raw” value stored in the underlying
XML file.
read_ods_cells(example_file, 2, quick = TRUE) |>
dplyr::filter(row == 3)
#> # A tibble: 10 × 6
#> sheet address row col value_type base_value
#> <chr> <chr> <int> <int> <chr> <chr>
#> 1 types A3 3 1 string Cell with comment
#> 2 types B3 3 2 boolean FALSE
#> 3 types C3 3 3 currency £1.20
#> 4 types D3 3 4 date 06/15/22
#> 5 types E3 3 5 time 13:24
#> 6 types F3 3 6 date 15/06/22 13:24
#> 7 types G3 3 7 float 12034.56789
#> 8 types H3 3 8 percentage 0.5467
#> 9 types I3 3 9 float 6579.298265463
#> 10 types J3 3 10 string #DIV/0!
Value types examples
Boolean
Boolean values (TRUE
/FALSE
) are equivalent
to R’s logical()
data type.
types_cells |>
dplyr::filter(value_type == "boolean") |>
dplyr::select(cell_content, base_value, boolean_value)
#> # A tibble: 14 × 3
#> cell_content base_value boolean_value
#> <chr> <chr> <lgl>
#> 1 TRUE true TRUE
#> 2 FALSE false FALSE
#> 3 TRUE true TRUE
#> 4 TRUE true TRUE
#> 5 FALSE false FALSE
#> 6 FALSE false FALSE
#> 7 TRUE true TRUE
#> 8 FALSE false FALSE
#> 9 FALSE false FALSE
#> 10 TRUE true TRUE
#> 11 FALSE false FALSE
#> 12 TRUE true TRUE
#> 13 TRUE true TRUE
#> 14 TRUE true TRUE
Note: ODS files produced by Google Sheets output
boolean values differently from other applications. Google Sheets stores
booleans as float
values of 0
for
FALSE
and 1
for TRUE
, and gives
the cells a formula of =FALSE()
and =TRUE()
instead. read_ods_cells()
will output the value type,
numeric value and formula written by Google Sheets, however it uses the
formula references to input the relevant
TRUE
/FALSE
value to the
boolean_value
column.
Currency
Currency values are numeric values with a
currency_symbol
, the specific formatting of the currency
symbol is handled by style formatting rules
types_cells |>
dplyr::filter(value_type == "currency") |>
dplyr::select(cell_content, base_value, numeric_value, currency_symbol)
#> # A tibble: 14 × 4
#> cell_content base_value numeric_value currency_symbol
#> <chr> <chr> <dbl> <chr>
#> 1 £1.20 1.2 1.2 GBP
#> 2 £1.20 1.2 1.2 GBP
#> 3 £1.20 1.2 1.2 GBP
#> 4 £1.20 1.2 1.2 GBP
#> 5 1.20 € 1.2 1.2 EUR
#> 6 1.20 € 1.2 1.2 EUR
#> 7 €1.20 1.2 1.2 EUR
#> 8 $1.20 1.2 1.2 CAD
#> 9 $1.20 1.2 1.2 USD
#> 10 1.20 USD 1.2 1.2 USD
#> 11 £1.20 1.2 1.2 GBP
#> 12 £1.20 1.2 1.2 GBP
#> 13 £1.20 1.2 1.2 GBP
#> 14 £1.20 1.2 1.2 GBP
Date
The ODS specification does not differentiate between a date
(e.g. 15/06/2022) and a date-time (e.g. 15/06/2023 13:47). The
cell_content
column will show the date as formatted for
spreadsheet users, e.g. Wednesday, 15 June 2022
or
15/06/22
. The date_value
(and thus
base_value
) is stored in ISO 8601 format
(0000-00-00T00:00:00
).
types_cells |>
dplyr::filter(value_type == "date") |>
dplyr::select(cell_content, base_value, date_value)
#> # A tibble: 28 × 3
#> cell_content base_value date_value
#> <chr> <chr> <chr>
#> 1 15/06/22 2022-06-15 2022-06-15
#> 2 15/06/2022 13:24:56 2022-06-15T13:24:56 2022-06-15T13:24:56
#> 3 06/15/22 2022-06-15 2022-06-15
#> 4 15/06/22 13:24 2022-06-15T13:24:56 2022-06-15T13:24:56
#> 5 Wed 15 Jun 22 2022-06-15 2022-06-15
#> 6 15-Jun-22 13:24 2022-06-15T13:24:56 2022-06-15T13:24:56
#> 7 Wednesday 15 June 2022 2022-06-15 2022-06-15
#> 8 Wednesday 15 June 2022 01:24 PM 2022-06-15T13:24:56 2022-06-15T13:24:56
#> 9 15-Jun-22 2022-06-15 2022-06-15
#> 10 15/06 13.24 2022-06-15T13:24:56 2022-06-15T13:24:56
#> # ℹ 18 more rows
As the date value type can contain both date and date-time values the
date_value
column is not coerced by
read_ods_cells()
into a base R data type. While you can use
base R’s as.Date()
to coerce to a date, if processing
date-times you need to replace the T
marker stored in
date_value
with a space in order for
as.POSIXct()
to coerce date-times correctly. Alternatively,
you can use lubridate::as_datetime()
without modifying the
date_value
.
types_cells |>
dplyr::filter(value_type == "date") |>
dplyr::select(cell_content, date_value) |>
dplyr::filter(!grepl("T", date_value)) |>
dplyr::mutate(date_value2 = as.Date(date_value))
#> # A tibble: 14 × 3
#> cell_content date_value date_value2
#> <chr> <chr> <date>
#> 1 15/06/22 2022-06-15 2022-06-15
#> 2 06/15/22 2022-06-15 2022-06-15
#> 3 Wed 15 Jun 22 2022-06-15 2022-06-15
#> 4 Wednesday 15 June 2022 2022-06-15 2022-06-15
#> 5 15-Jun-22 2022-06-15 2022-06-15
#> 6 15/06/2022 2022-06-15 2022-06-15
#> 7 15. Jun. 2022 2022-06-15 2022-06-15
#> 8 15 Jun 22 2022-06-15 2022-06-15
#> 9 mer. 15 juin 22 2022-06-15 2022-06-15
#> 10 2022-06-15 2022-06-15 2022-06-15
#> 11 15/06/22 2022-06-15 2022-06-15
#> 12 15/06/22 2022-06-15 2022-06-15
#> 13 15/06/22 2022-06-15 2022-06-15
#> 14 15/06/22 2022-06-15 2022-06-15
types_cells |>
dplyr::filter(value_type == "date") |>
dplyr::select(cell_content, date_value) |>
dplyr::filter(grepl("T", date_value)) |>
# dplyr::mutate(date_value2 = as.POSIXct(gsub("T", " ", y)))
dplyr::mutate(date_value2 = lubridate::as_datetime(date_value))
#> # A tibble: 14 × 3
#> cell_content date_value date_value2
#> <chr> <chr> <dttm>
#> 1 15/06/2022 13:24:56 2022-06-15T13:24:56 2022-06-15 13:24:56
#> 2 15/06/22 13:24 2022-06-15T13:24:56 2022-06-15 13:24:56
#> 3 15-Jun-22 13:24 2022-06-15T13:24:56 2022-06-15 13:24:56
#> 4 Wednesday 15 June 2022 01:24 PM 2022-06-15T13:24:56 2022-06-15 13:24:56
#> 5 15/06 13.24 2022-06-15T13:24:56 2022-06-15 13:24:56
#> 6 2022-06-15 13:24:56 2022-06-15T13:24:56 2022-06-15 13:24:56
#> 7 20220615-132456 2022-06-15T13:24:56 2022-06-15 13:24:56
#> 8 15-Jun 1:24 PM 2022-06-15T13:24:56 2022-06-15 13:24:56
#> 9 15 juin 2022 13:24 2022-06-15T13:24:56 2022-06-15 13:24:56
#> 10 06/15/22 13-24 2022-06-15T13:24:56 2022-06-15 13:24:56
#> 11 15/06/2022 13:24:56 2022-06-15T13:24:56 2022-06-15 13:24:56
#> 12 15/06/2022 13:24:56 2022-06-15T13:24:56 2022-06-15 13:24:56
#> 13 15/06/2022 13:24:56 2022-06-15T13:24:56 2022-06-15 13:24:56
#> 14 15/06/2022 13:24:56 2022-06-15T13:24:56 2022-06-15 13:24:56
The W3C XML schema has more details on the specifics of ISO formatting of dates and date-times.
Note: ODS files produced by Microsoft Excel output
all dates as date-times with the time component of the date set to
00:00:00
, this is a further reason for
read_ods_cells()
not automatically coercing
date_value
to an R data type.
Float
Numeric values, with the exception of currency and percentage types,
are defined as a float
value type.
types_cells |>
dplyr::filter(value_type == "float") |>
dplyr::select(cell_content, base_value, numeric_value)
#> # A tibble: 37 × 3
#> cell_content base_value numeric_value
#> <chr> <chr> <dbl>
#> 1 12035 12034.56789 12034.56789
#> 2 0.5467 0.5467 0.5467
#> 3 6579.3 6579.298265463 6579.298265
#> 4 12034.57 12034.56789 12034.56789
#> 5 6579.3 6579.298265463 6579.298265
#> 6 12034.5679 12034.56789 12034.56789
#> 7 6579.3 6579.298265463 6579.298265
#> 8 12034.568 12034.56789 12034.56789
#> 9 6579.3 6579.298265463 6579.298265
#> 10 12034.56789 12034.56789 12034.56789
#> # ℹ 27 more rows
Percentage
Percentages are numeric values with formatting to include a percentage sign. As with the currency value type the specifics of formatting are handled by style formatting rules.
types_cells |>
dplyr::filter(value_type == "percentage") |>
dplyr::select(cell_content, base_value, numeric_value)
#> # A tibble: 5 × 3
#> cell_content base_value numeric_value
#> <chr> <chr> <dbl>
#> 1 55% 0.5467 0.5467
#> 2 54.7% 0.5467 0.5467
#> 3 54.67% 0.5467 0.5467
#> 4 54.670% 0.5467 0.5467
#> 5 54.6700% 0.5467 0.5467
String
String values are derived from the cell_content
, the
text representation of the cell as seen by a spreadsheet application
user.
types_cells |>
dplyr::filter(col == 1 & row > 1 & value_type == "string") |>
dplyr::select(cell_content, base_value)
#> # A tibble: 14 × 2
#> cell_content base_value
#> <chr> <chr>
#> 1 "Cell" "Cell"
#> 2 "Cell with comment" "Cell with comment"
#> 3 "Cell with\nMultiple lines" "Cell with\nMultiple lines"
#> 4 "Cell with repeated spaces" "Cell with repeated spaces"
#> 5 "Cells with spaces\nAnd multiple lines" "Cells with spaces\nAnd mu…
#> 6 "Cell with formatted comment" "Cell with formatted comment"
#> 7 "Merged cell" "Merged cell"
#> 8 "Merged cell with hidden text" "Merged cell with hidden text"
#> 9 "Hidden text" "Hidden text"
#> 10 "Cell with bold text" "Cell with bold text"
#> 11 "Repeated rows" "Repeated rows"
#> 12 "Repeated rows" "Repeated rows"
#> 13 "Repeated rows" "Repeated rows"
#> 14 "une cellule dans une feuille de calcul" "une cellule dans une feuille …
While the ODS specification allows for a separate
string-value
attribute in the XML table-cell tags, in
practice this is not implemented by the major spreadsheet
applications.
Time
Time values are defined as a duration as specified by ISO 8601, in
the format PT00H00M00S
.
types_cells |>
dplyr::filter(value_type == "time") |>
dplyr::select(cell_content, base_value, time_value)
#> # A tibble: 14 × 3
#> cell_content base_value time_value
#> <chr> <chr> <chr>
#> 1 13:24:56 PT13H24M56S PT13H24M56S
#> 2 13:24 PT13H24M56S PT13H24M56S
#> 3 01:24 PM PT13H24M56S PT13H24M56S
#> 4 1:24 PM PT13H24M56S PT13H24M56S
#> 5 1.24 PM PT13H24M56S PT13H24M56S
#> 6 13h24 PT13H24M56S PT13H24M56S
#> 7 1324 PT13H24M56S PT13H24M56S
#> 8 13 24 PT13H24M56S PT13H24M56S
#> 9 13-24-56 PT13H24M56S PT13H24M56S
#> 10 13:24:56.00 PT13H24M56S PT13H24M56S
#> 11 13:24:56 PT13H24M56S PT13H24M56S
#> 12 13:24:56 PT13H24M56S PT13H24M56S
#> 13 13:24:56 PT13H24M56S PT13H24M56S
#> 14 13:24:56 PT13H24M56S PT13H24M56S
As with date values time values are not automatically coerced to an R
data type. To use base R’s as.difftime()
you will need to
remove or convert the letter characters, whereas
lubridate::as.duration()
can coerce without
modification.
types_cells |>
dplyr::filter(value_type == "time") |>
dplyr::select(cell_content, time_value) |>
# dplyr::mutate(
# time_value2 = as.difftime(gsub("[PTS]", "", gsub("[HM]", ":", time_value)))
# )
dplyr::mutate(time_value2 = lubridate::as.duration(time_value))
#> # A tibble: 14 × 3
#> cell_content time_value time_value2
#> <chr> <chr> <Duration>
#> 1 13:24:56 PT13H24M56S 48296s (~13.42 hours)
#> 2 13:24 PT13H24M56S 48296s (~13.42 hours)
#> 3 01:24 PM PT13H24M56S 48296s (~13.42 hours)
#> 4 1:24 PM PT13H24M56S 48296s (~13.42 hours)
#> 5 1.24 PM PT13H24M56S 48296s (~13.42 hours)
#> 6 13h24 PT13H24M56S 48296s (~13.42 hours)
#> 7 1324 PT13H24M56S 48296s (~13.42 hours)
#> 8 13 24 PT13H24M56S 48296s (~13.42 hours)
#> 9 13-24-56 PT13H24M56S 48296s (~13.42 hours)
#> 10 13:24:56.00 PT13H24M56S 48296s (~13.42 hours)
#> 11 13:24:56 PT13H24M56S 48296s (~13.42 hours)
#> 12 13:24:56 PT13H24M56S 48296s (~13.42 hours)
#> 13 13:24:56 PT13H24M56S 48296s (~13.42 hours)
#> 14 13:24:56 PT13H24M56S 48296s (~13.42 hours)
The W3C XML schema has more details on the specifics of ISO formatting of dates and date-times.