Extracting cell information • tidyods

library(tidyods)

The aim of tidyods is to extract information about cells in OpenDocument Spreadsheet (ODS) files into a “tidy” format. In addition to the value contained by a cell it extracts a range of addition metadata about the cells.

This vignette explains how to use tidyods::read_ods_cells() and the types of information about cells it extracts. The Spreadsheet Munging Strategies book by Duncan Garmonsway explores in more detail how you can use datasets of “tidy cells” to work with non-tabular data stored in (Excel) spreadsheets.

Reading data from cells

You read in data from an ODS file with the read_ods_cells() function. The resulting tibble provides 28 columns of information about cells, with each row representing an individual cell in the input spreadsheet.

example_file <- system.file("extdata", "basic_example.ods", package = "tidyods")
types_cells <- read_ods_cells(example_file, 2)
dplyr::glimpse(types_cells)
#> Rows: 160
#> Columns: 28
#> $ sheet                  <chr> "types", "types", "types", "types", "types", "t…
#> $ address                <chr> "A1", "B1", "C1", "D1", "E1", "F1", "G1", "H1",…
#> $ row                    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2,…
#> $ col                    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6…
#> $ cell_type              <chr> "cell", "cell", "cell", "cell", "cell", "cell",…
#> $ is_empty               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ value_type             <chr> "string", "string", "string", "string", "string…
#> $ cell_content           <chr> "String", "Boolean", "Currency", "Date", "Time"…
#> $ base_value             <chr> "String", "Boolean", "Currency", "Date", "Time"…
#> $ numeric_value          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ currency_symbol        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ boolean_value          <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, TRU…
#> $ date_value             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ time_value             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ has_formula            <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ formula                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ has_error              <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ error_type             <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ has_annotation         <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ annotation             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ is_merged              <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ merge_colspan          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ merge_rowspan          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ merge_shape            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ cell_style             <chr> NA, "Default", "Default", "Default", "Default",…
#> $ row_style              <chr> "ro1", "ro1", "ro1", "ro1", "ro1", "ro1", "ro1"…
#> $ col_style              <chr> "co6", "co7", "co8", "co9", "co10", "co11", "co…
#> $ col_default_cell_style <chr> "Default", "ce56", "ce4", "ce11", "ce21", "ce31…

There are four arguments to read_ods_cells():

path, the path to the ODS file
sheet, the sheet you wish to extract
quick, whether to extract a subset of cell information
quiet, whether to show console messages or not

The sheet argument can either be the index number of a sheet (ranging from 1 to the total number of sheets in the spreadsheet document) or the name of a sheet (e.g. "types" instead of 2 in the example above).

The quick argument extracts only a small number of columns and is useful if you are largely concerned with cell values; see the quick extraction section for more details.

The quiet argument turns off console messages, ODS files are written in XML which can take significant time to process and so progress messages are provided to the console for interactive users. It is advised to set this to false if you are doing batch operations.

Cell information

There are 28 types of information returned by read_ods_cells(), these can be grouped into three categories: location information, value information, and other metadata.

Location information

The first four columns of information relate to a cell’s location in the spreadsheet.

sheet, the name of the sheet the cell belongs to
address, the address of the cell within a sheet using “A1” notation
row, the row number of the cell
col, the column number of the cell

The address column is derived from the row and col using the cellranger package. It provides the address in the “A1” notation commonly used in spreadsheet applications where the letter corresponds to a column and the number corresponds to a row. The “A1” notation is used by ODS files for calculation formulas.

Value information

There are 10 columns that provide information about a cell’s value.

cell_type, the type of cell
is_empty, whether the cell has content or not (i.e. cell_type == "empty")
value_type, the value type as set in the ODS file
cell_content, the cell’s value as formatted for users of spreadsheet applications
base_value, the “raw” value of the cell
numeric_value, for numeric value types the raw value as a numeric() vector
currency_symbol, for currency value types the currency symbol associated with the numeric value
boolean_value, for boolean value types the raw value as a logical() vector
date_value, for date value types (dates and date-times) the “raw” value as a character() vector
time_value, for time value types, the “raw” value as a character() vector

Cells have one of four types, it is empty if it has no content and/or no value type, it is a merge-lead if it is the top-left cell of a merge group while other cells in the merge group are classified as merge-hidden, all other cells are simply a cell.

The value_type is derived from the ODS specification which defines seven value types:

boolean, equivalent to R’s logical() data type
currency, a numeric value, optionally with a currency symbol
date, a date or date-time stored in ISO 8601 format, the ODS specification does not differentiate between a date (e.g. 15/06/2022) and a date-time (e.g. 15/06/2023 13:47)
float, a numeric value, the ODS format does not distinguish between different numeric types
percentage, a numeric value, but formatted for display as a percentage
string, equivalent to R’s character()
time, a duration, stored in ISO 8601 format

The cell_content column provides the value of a cell as displayed and formatted to users of a spreadsheet application. Non-string values are stored separately in the underlying ODS XML file, and these are extracted as separate columns (numeric_value, boolean_value, date_value and time_value) by read_ods_cells(), for numeric and boolean values these are also coerced into a numeric() and logical() vector. The base_value column provides a combination of the cell_content for cells that are strings and for cells that are not string value types the value from the relevant type specific column.

For currency value types optionally a currency symbol (e.g. GBP, EUR, USD, JPY) can be set, this is provided in the currency_symbol column.

types_cells |> 
  dplyr::filter(row == 3) |> 
  dplyr::select(value_type:currency_symbol)
#> # A tibble: 10 × 5
#>    value_type cell_content      base_value         numeric_value currency_symbol
#>    <chr>      <chr>             <chr>                      <dbl> <chr>          
#>  1 string     Cell with comment Cell with comment         NA     NA             
#>  2 boolean    FALSE             false                     NA     NA             
#>  3 currency   £1.20             1.2                        1.2   GBP            
#>  4 date       06/15/22          2022-06-15                NA     NA             
#>  5 time       13:24             PT13H24M56S               NA     NA             
#>  6 date       15/06/22 13:24    2022-06-15T13:24:…        NA     NA             
#>  7 float      12034.57          12034.56789            12035.    NA             
#>  8 percentage 55%               0.5467                     0.547 NA             
#>  9 float      6579.3            6579.298265463          6579.    NA             
#> 10 string     #DIV/0!           #DIV/0!                   NA     NA

types_cells |> 
  dplyr::filter(row == 3) |> 
  dplyr::select(value_type, base_value, boolean_value:time_value)
#> # A tibble: 10 × 5
#>    value_type base_value          boolean_value date_value          time_value 
#>    <chr>      <chr>               <lgl>         <chr>               <chr>      
#>  1 string     Cell with comment   NA            NA                  NA         
#>  2 boolean    false               FALSE         NA                  NA         
#>  3 currency   1.2                 NA            NA                  NA         
#>  4 date       2022-06-15          NA            2022-06-15          NA         
#>  5 time       PT13H24M56S         NA            NA                  PT13H24M56S
#>  6 date       2022-06-15T13:24:56 NA            2022-06-15T13:24:56 NA         
#>  7 float      12034.56789         NA            NA                  NA         
#>  8 percentage 0.5467              NA            NA                  NA         
#>  9 float      6579.298265463      NA            NA                  NA         
#> 10 string     #DIV/0!             NA            NA                  NA

The value type examples section below provides more details on the different value types and how they are stored and processed.

Other metadata

Finally, read_ods_cells() provides 12 further columns with additional metadata about each cell.

has_formula, whether the cell value is calculated by a formula
formula, the formula used to calculate the cell value
has_error, whether the cell value or formula results in an error
error_type, the type of error the cell is experiencing
has_annotation, whether the cell has a comment/annotation
annotation, the content of the cell’s comment/annotation
is_merged, whether the cell is part of a merge group
merge_colspan, for the lead cell in a merge group the number of columns spanned by the merge group
merge_rowspan, for the lead cell in a merge group the number of rows spanned by the merge group
merge_shape, for the lead cell in a merge group the shape of the merge group
cell_style, the style reference for the cell
row_style, the style reference for the row
col_style, the style reference for the column
col_deftaul_cell_style, the default style reference for cells in the column

Formulas

The has_formula column indicates if a cell value is calculated by a formula, which is itself provided in the formula column.

types_cells |> 
  dplyr::filter(col == 9 & has_formula) |> 
  dplyr::select(value_type, cell_content, numeric_value, has_formula, formula) |>
  head()
#> # A tibble: 6 × 5
#>   value_type cell_content numeric_value has_formula formula        
#>   <chr>      <chr>                <dbl> <lgl>       <chr>          
#> 1 float      6579.3               6579. TRUE        of:=[.G2]*[.H2]
#> 2 float      6579.3               6579. TRUE        of:=[.G3]*[.H3]
#> 3 float      6579.3               6579. TRUE        of:=[.G4]*[.H4]
#> 4 float      6579.3               6579. TRUE        of:=[.G5]*[.H5]
#> 5 float      6579.3               6579. TRUE        of:=[.G6]*[.H6]
#> 6 float      6579.3               6579. TRUE        of:=[.G7]*[.H7]

For more on the use of formulas in ODS files you can consult the OpenFormula specification.

Errors

Formulas in spreadsheets can result in calculation errors. The has_error column indicates if there the cell results in an error. The error_type column is similar to the =ERROR.TYPE() function in Microsoft Excel, Google Sheets and LibreOffice, these provide a number to denote the type of error.

error_type will have a value from 1 to 7 or 501 to 540:

1 for #NULL! errors in Excel/Google Sheets or Err:511 in LibreOffice, typically this is some sort of error in the formula specification such as a missing operator (Excel/Google Sheets) or a missing variable (LibreOffice)
2 for #DIV/0! (division by zero) errors
3 for #VALUE! (value type) errors, e.g. when a formula uses a string rather than a number
4 for #REF! (invalid cell reference) errors
5 for #NAME? errors, e.g. formula specifies a function that doesn’t exist
6 for #NUM! (calculation overflow) errors, e.g. the result of 9999^9999
7 for #N/A (logic) errors, e.g. unable to match values
501 to 540 a LibreOffice error

LibreOffice assigns all errors an error code from 501 to 540, where not displayed as an error like #DIV/0! these are shown in cell_content as "Err:5##". See LibreOffice’s documentation on error codes for more details on these error codes.

Annotations

Spreadsheet applications allow you to add notes and comments to cells to provide addition contextual information. The has_annotation column indicates whether the cell has an annotation associated with it. The annotation column provides the text content of the annotation.

types_cells |> 
    dplyr::filter(has_annotation) |> 
  dplyr::select(has_annotation, annotation)
#> # A tibble: 2 × 2
#>   has_annotation annotation                                                   
#>   <lgl>          <chr>                                                        
#> 1 TRUE           "Test comment"                                               
#> 2 TRUE           "Comment 2\nMore comments\nComments with space\nBold comment"

Note: different spreadsheet applications handle annotations in different ways, read_ods_cells() does not take account of these differences and provides only a basic representation of the text contained in the annotation. LibreOffice only supports a basic “comment” annotation which is called a “note” in Microsoft Excel and Google Sheets.

In addition to a “note” Microsoft Excel and Google Sheets proprietary formats allow for threaded comments that allow users to leave comments and reply to these comments, when saving into ODS these features are saved as annotations like notes. However the coercion of these threads into a note and inclusion of comment/reply author details varies.

Merged cells

Spreadsheets allow you to merge cells, largely this is for presentational purposes. Merged cells have a cell_type of either merge-lead for the top-left cell of a merge group or merge-hidden for a cell hidden by a merge group. The is_merged indicates whether the cell is either of these types.

When merging cells in Microsoft Office and Google Sheets then only content in the top-left cell is retained, however in LibreOffice merging cells retains the content of the cells that are hidden by a merge operation.

types_cells |> 
    dplyr::filter(is_merged) |> 
  dplyr::select(address, cell_type, cell_content, is_merged)
#> # A tibble: 4 × 4
#>   address cell_type    cell_content                 is_merged
#>   <chr>   <chr>        <chr>                        <lgl>    
#> 1 A8      merge-lead   Merged cell                  TRUE     
#> 2 A9      merge-hidden NA                           TRUE     
#> 3 A10     merge-lead   Merged cell with hidden text TRUE     
#> 4 A11     merge-hidden Hidden text                  TRUE

For merge-lead cells you will also get the number of columns and rows spanned by the merge group in merge_colspan and merge_rowspan respectively. This span information is used to define the merge_shape column:

the shape of a merge group is defined as vertical when merge_colspan is equal to 1 and merge_rowspan is greater than 1
the shape of a merge group is defined as horizontal when merge_colspan is greater than 1 and merge_rowspan is equal to 1
otherwise the shape of the merge group is rectangular.

	A	B	C	D
1	A cell across two columns (a horizontal merge)		C1	D1
2	A cell across two rows (a vertical merge)	B2	C2	D2
3	A cell across two rows (a vertical merge)	B3	C3	D3
4	A4	B4	A cell across two columns and two rows (a rectangular merge)
5	A5	B5

merge_cells <- read_ods_cells(example_file, 3)

merge_cells |>
  dplyr::select(row, col, cell_type) |>
  dplyr::mutate(col = paste0("x", col)) |>
  tidyr::pivot_wider(names_from = col, values_from = cell_type)
#> # A tibble: 5 × 5
#>     row x1           x2           x3           x4          
#>   <dbl> <chr>        <chr>        <chr>        <chr>       
#> 1     1 merge-lead   merge-hidden NA           NA          
#> 2     2 merge-lead   NA           NA           NA          
#> 3     3 merge-hidden NA           NA           NA          
#> 4     4 empty        empty        merge-lead   merge-hidden
#> 5     5 empty        empty        merge-hidden merge-hidden

merge_cells |> 
  dplyr::filter(cell_type == "merge-lead") |> 
  dplyr::select(address, cell_content, merge_colspan, merge_rowspan, merge_shape)
#> # A tibble: 3 × 5
#>   address cell_content                   merge_colspan merge_rowspan merge_shape
#>   <chr>   <chr>                                  <dbl>         <dbl> <chr>      
#> 1 A1      "A cell across two columns (a…             2             1 horizontal 
#> 2 A2      "A cell across two rows (a ve…             1             2 vertical   
#> 3 C4      "A cell across two columns an…             2             2 rectangular

Style information

Style and formatting information is stored separately from cells themselves, instead cells, rows and columns are given reference identifiers to look up in the style information. There are four style identifiers for each cell:

cell_style, style and formatting specifically applied to that cell
row_style, style and formatting applied to the row
col_style, style and formatting applied to column
col_default_cell_style, the default style and formatting for cells in the column, i.e. when a cell_style is not set

Note: While these style identifiers are extracted, at present the tidyods package does not currently support the extraction of the style and formatting rules from the ODS XML.

Quick extraction

At its core an ODS file is an XML file, extracting this information can be processor heavy. If your primary interest is just in cell location and cell values then setting the argument quick = TRUE can result in a faster extraction process as it provides only six columns: sheet, address, row, col, value_type and base_value.

The quick extraction process varies in how the base_value column is constructed. When quick = TRUE only float and percentage values are taken in their raw numeric form, for all other types the cell_content is used. That means that for currency, date and time value types the value as formatted for spreadsheet application users is returned rather than the “raw” value stored in the underlying XML file.

read_ods_cells(example_file, 2, quick = TRUE) |> 
  dplyr::filter(row == 3)
#> # A tibble: 10 × 6
#>    sheet address   row   col value_type base_value       
#>    <chr> <chr>   <int> <int> <chr>      <chr>            
#>  1 types A3          3     1 string     Cell with comment
#>  2 types B3          3     2 boolean    FALSE            
#>  3 types C3          3     3 currency   £1.20            
#>  4 types D3          3     4 date       06/15/22         
#>  5 types E3          3     5 time       13:24            
#>  6 types F3          3     6 date       15/06/22 13:24   
#>  7 types G3          3     7 float      12034.56789      
#>  8 types H3          3     8 percentage 0.5467           
#>  9 types I3          3     9 float      6579.298265463   
#> 10 types J3          3    10 string     #DIV/0!

Value types examples

Boolean

Boolean values (TRUE/FALSE) are equivalent to R’s logical() data type.

types_cells |> 
  dplyr::filter(value_type == "boolean") |> 
  dplyr::select(cell_content, base_value, boolean_value)
#> # A tibble: 14 × 3
#>    cell_content base_value boolean_value
#>    <chr>        <chr>      <lgl>        
#>  1 TRUE         true       TRUE         
#>  2 FALSE        false      FALSE        
#>  3 TRUE         true       TRUE         
#>  4 TRUE         true       TRUE         
#>  5 FALSE        false      FALSE        
#>  6 FALSE        false      FALSE        
#>  7 TRUE         true       TRUE         
#>  8 FALSE        false      FALSE        
#>  9 FALSE        false      FALSE        
#> 10 TRUE         true       TRUE         
#> 11 FALSE        false      FALSE        
#> 12 TRUE         true       TRUE         
#> 13 TRUE         true       TRUE         
#> 14 TRUE         true       TRUE

Note: ODS files produced by Google Sheets output boolean values differently from other applications. Google Sheets stores booleans as float values of 0 for FALSE and 1 for TRUE, and gives the cells a formula of =FALSE() and =TRUE() instead. read_ods_cells() will output the value type, numeric value and formula written by Google Sheets, however it uses the formula references to input the relevant TRUE/FALSE value to the boolean_value column.

Currency

Currency values are numeric values with a currency_symbol, the specific formatting of the currency symbol is handled by style formatting rules

types_cells |> 
  dplyr::filter(value_type == "currency") |> 
  dplyr::select(cell_content, base_value, numeric_value, currency_symbol)
#> # A tibble: 14 × 4
#>    cell_content base_value numeric_value currency_symbol
#>    <chr>        <chr>              <dbl> <chr>          
#>  1 £1.20        1.2                  1.2 GBP            
#>  2 £1.20        1.2                  1.2 GBP            
#>  3 £1.20        1.2                  1.2 GBP            
#>  4 £1.20        1.2                  1.2 GBP            
#>  5 1.20 €       1.2                  1.2 EUR            
#>  6 1.20 €       1.2                  1.2 EUR            
#>  7 €1.20        1.2                  1.2 EUR            
#>  8 $1.20        1.2                  1.2 CAD            
#>  9 $1.20        1.2                  1.2 USD            
#> 10 1.20 USD     1.2                  1.2 USD            
#> 11 £1.20        1.2                  1.2 GBP            
#> 12 £1.20        1.2                  1.2 GBP            
#> 13 £1.20        1.2                  1.2 GBP            
#> 14 £1.20        1.2                  1.2 GBP

Date

The ODS specification does not differentiate between a date (e.g. 15/06/2022) and a date-time (e.g. 15/06/2023 13:47). The cell_content column will show the date as formatted for spreadsheet users, e.g. Wednesday, 15 June 2022 or 15/06/22. The date_value (and thus base_value) is stored in ISO 8601 format (0000-00-00T00:00:00).

types_cells |> 
  dplyr::filter(value_type == "date") |> 
  dplyr::select(cell_content, base_value, date_value)
#> # A tibble: 28 × 3
#>    cell_content                    base_value          date_value         
#>    <chr>                           <chr>               <chr>              
#>  1 15/06/22                        2022-06-15          2022-06-15         
#>  2 15/06/2022 13:24:56             2022-06-15T13:24:56 2022-06-15T13:24:56
#>  3 06/15/22                        2022-06-15          2022-06-15         
#>  4 15/06/22 13:24                  2022-06-15T13:24:56 2022-06-15T13:24:56
#>  5 Wed 15 Jun 22                   2022-06-15          2022-06-15         
#>  6 15-Jun-22 13:24                 2022-06-15T13:24:56 2022-06-15T13:24:56
#>  7 Wednesday 15 June 2022          2022-06-15          2022-06-15         
#>  8 Wednesday 15 June 2022 01:24 PM 2022-06-15T13:24:56 2022-06-15T13:24:56
#>  9 15-Jun-22                       2022-06-15          2022-06-15         
#> 10 15/06 13.24                     2022-06-15T13:24:56 2022-06-15T13:24:56
#> # ℹ 18 more rows

As the date value type can contain both date and date-time values the date_value column is not coerced by read_ods_cells() into a base R data type. While you can use base R’s as.Date() to coerce to a date, if processing date-times you need to replace the T marker stored in date_value with a space in order for as.POSIXct() to coerce date-times correctly. Alternatively, you can use lubridate::as_datetime() without modifying the date_value.

types_cells |> 
  dplyr::filter(value_type == "date") |> 
  dplyr::select(cell_content, date_value) |>
  dplyr::filter(!grepl("T", date_value)) |> 
  dplyr::mutate(date_value2 = as.Date(date_value))
#> # A tibble: 14 × 3
#>    cell_content           date_value date_value2
#>    <chr>                  <chr>      <date>     
#>  1 15/06/22               2022-06-15 2022-06-15 
#>  2 06/15/22               2022-06-15 2022-06-15 
#>  3 Wed 15 Jun 22          2022-06-15 2022-06-15 
#>  4 Wednesday 15 June 2022 2022-06-15 2022-06-15 
#>  5 15-Jun-22              2022-06-15 2022-06-15 
#>  6 15/06/2022             2022-06-15 2022-06-15 
#>  7 15. Jun. 2022          2022-06-15 2022-06-15 
#>  8 15 Jun 22              2022-06-15 2022-06-15 
#>  9 mer. 15 juin 22        2022-06-15 2022-06-15 
#> 10 2022-06-15             2022-06-15 2022-06-15 
#> 11 15/06/22               2022-06-15 2022-06-15 
#> 12 15/06/22               2022-06-15 2022-06-15 
#> 13 15/06/22               2022-06-15 2022-06-15 
#> 14 15/06/22               2022-06-15 2022-06-15

types_cells |> 
  dplyr::filter(value_type == "date") |> 
  dplyr::select(cell_content, date_value) |>
  dplyr::filter(grepl("T", date_value)) |> 
  # dplyr::mutate(date_value2 = as.POSIXct(gsub("T", " ", y)))
  dplyr::mutate(date_value2 = lubridate::as_datetime(date_value))
#> # A tibble: 14 × 3
#>    cell_content                    date_value          date_value2        
#>    <chr>                           <chr>               <dttm>             
#>  1 15/06/2022 13:24:56             2022-06-15T13:24:56 2022-06-15 13:24:56
#>  2 15/06/22 13:24                  2022-06-15T13:24:56 2022-06-15 13:24:56
#>  3 15-Jun-22 13:24                 2022-06-15T13:24:56 2022-06-15 13:24:56
#>  4 Wednesday 15 June 2022 01:24 PM 2022-06-15T13:24:56 2022-06-15 13:24:56
#>  5 15/06 13.24                     2022-06-15T13:24:56 2022-06-15 13:24:56
#>  6 2022-06-15 13:24:56             2022-06-15T13:24:56 2022-06-15 13:24:56
#>  7 20220615-132456                 2022-06-15T13:24:56 2022-06-15 13:24:56
#>  8 15-Jun 1:24 PM                  2022-06-15T13:24:56 2022-06-15 13:24:56
#>  9 15 juin 2022 13:24              2022-06-15T13:24:56 2022-06-15 13:24:56
#> 10 06/15/22 13-24                  2022-06-15T13:24:56 2022-06-15 13:24:56
#> 11 15/06/2022 13:24:56             2022-06-15T13:24:56 2022-06-15 13:24:56
#> 12 15/06/2022 13:24:56             2022-06-15T13:24:56 2022-06-15 13:24:56
#> 13 15/06/2022 13:24:56             2022-06-15T13:24:56 2022-06-15 13:24:56
#> 14 15/06/2022 13:24:56             2022-06-15T13:24:56 2022-06-15 13:24:56

The W3C XML schema has more details on the specifics of ISO formatting of dates and date-times.

Note: ODS files produced by Microsoft Excel output all dates as date-times with the time component of the date set to 00:00:00, this is a further reason for read_ods_cells() not automatically coercing date_value to an R data type.

Float

Numeric values, with the exception of currency and percentage types, are defined as a float value type.

types_cells |> 
  dplyr::filter(value_type == "float") |> 
  dplyr::select(cell_content, base_value, numeric_value)
#> # A tibble: 37 × 3
#>    cell_content base_value     numeric_value
#>    <chr>        <chr>                  <dbl>
#>  1 12035        12034.56789     12034.56789 
#>  2 0.5467       0.5467              0.5467  
#>  3 6579.3       6579.298265463   6579.298265
#>  4 12034.57     12034.56789     12034.56789 
#>  5 6579.3       6579.298265463   6579.298265
#>  6 12034.5679   12034.56789     12034.56789 
#>  7 6579.3       6579.298265463   6579.298265
#>  8 12034.568    12034.56789     12034.56789 
#>  9 6579.3       6579.298265463   6579.298265
#> 10 12034.56789  12034.56789     12034.56789 
#> # ℹ 27 more rows

Percentage

Percentages are numeric values with formatting to include a percentage sign. As with the currency value type the specifics of formatting are handled by style formatting rules.

types_cells |> 
  dplyr::filter(value_type == "percentage") |> 
  dplyr::select(cell_content, base_value, numeric_value)
#> # A tibble: 5 × 3
#>   cell_content base_value numeric_value
#>   <chr>        <chr>              <dbl>
#> 1 55%          0.5467            0.5467
#> 2 54.7%        0.5467            0.5467
#> 3 54.67%       0.5467            0.5467
#> 4 54.670%      0.5467            0.5467
#> 5 54.6700%     0.5467            0.5467

String

String values are derived from the cell_content, the text representation of the cell as seen by a spreadsheet application user.

types_cells |> 
  dplyr::filter(col == 1 & row > 1 & value_type == "string") |> 
  dplyr::select(cell_content, base_value)
#> # A tibble: 14 × 2
#>    cell_content                                  base_value                     
#>    <chr>                                         <chr>                          
#>  1 "Cell"                                        "Cell"                         
#>  2 "Cell with comment"                           "Cell with comment"            
#>  3 "Cell with\nMultiple lines"                   "Cell with\nMultiple lines"    
#>  4 "Cell  with    repeated spaces"               "Cell  with    repeated spaces"
#>  5 "Cells  with    spaces\nAnd multiple   lines" "Cells  with    spaces\nAnd mu…
#>  6 "Cell with formatted comment"                 "Cell with formatted comment"  
#>  7 "Merged cell"                                 "Merged cell"                  
#>  8 "Merged cell with hidden text"                "Merged cell with hidden text" 
#>  9 "Hidden text"                                 "Hidden text"                  
#> 10 "Cell with bold text"                         "Cell with bold text"          
#> 11 "Repeated rows"                               "Repeated rows"                
#> 12 "Repeated rows"                               "Repeated rows"                
#> 13 "Repeated rows"                               "Repeated rows"                
#> 14 "une cellule dans une feuille de calcul"      "une cellule dans une feuille …

While the ODS specification allows for a separate string-value attribute in the XML table-cell tags, in practice this is not implemented by the major spreadsheet applications.

Time

Time values are defined as a duration as specified by ISO 8601, in the format PT00H00M00S.

types_cells |> 
  dplyr::filter(value_type == "time") |> 
  dplyr::select(cell_content, base_value, time_value)
#> # A tibble: 14 × 3
#>    cell_content base_value  time_value 
#>    <chr>        <chr>       <chr>      
#>  1 13:24:56     PT13H24M56S PT13H24M56S
#>  2 13:24        PT13H24M56S PT13H24M56S
#>  3 01:24 PM     PT13H24M56S PT13H24M56S
#>  4 1:24 PM      PT13H24M56S PT13H24M56S
#>  5 1.24 PM      PT13H24M56S PT13H24M56S
#>  6 13h24        PT13H24M56S PT13H24M56S
#>  7 1324         PT13H24M56S PT13H24M56S
#>  8 13 24        PT13H24M56S PT13H24M56S
#>  9 13-24-56     PT13H24M56S PT13H24M56S
#> 10 13:24:56.00  PT13H24M56S PT13H24M56S
#> 11 13:24:56     PT13H24M56S PT13H24M56S
#> 12 13:24:56     PT13H24M56S PT13H24M56S
#> 13 13:24:56     PT13H24M56S PT13H24M56S
#> 14 13:24:56     PT13H24M56S PT13H24M56S

As with date values time values are not automatically coerced to an R data type. To use base R’s as.difftime() you will need to remove or convert the letter characters, whereas lubridate::as.duration() can coerce without modification.

types_cells |> 
  dplyr::filter(value_type == "time") |> 
  dplyr::select(cell_content, time_value) |>
  # dplyr::mutate(
  #   time_value2 = as.difftime(gsub("[PTS]", "", gsub("[HM]", ":", time_value)))
  # )
  dplyr::mutate(time_value2 = lubridate::as.duration(time_value))
#> # A tibble: 14 × 3
#>    cell_content time_value  time_value2          
#>    <chr>        <chr>       <Duration>           
#>  1 13:24:56     PT13H24M56S 48296s (~13.42 hours)
#>  2 13:24        PT13H24M56S 48296s (~13.42 hours)
#>  3 01:24 PM     PT13H24M56S 48296s (~13.42 hours)
#>  4 1:24 PM      PT13H24M56S 48296s (~13.42 hours)
#>  5 1.24 PM      PT13H24M56S 48296s (~13.42 hours)
#>  6 13h24        PT13H24M56S 48296s (~13.42 hours)
#>  7 1324         PT13H24M56S 48296s (~13.42 hours)
#>  8 13 24        PT13H24M56S 48296s (~13.42 hours)
#>  9 13-24-56     PT13H24M56S 48296s (~13.42 hours)
#> 10 13:24:56.00  PT13H24M56S 48296s (~13.42 hours)
#> 11 13:24:56     PT13H24M56S 48296s (~13.42 hours)
#> 12 13:24:56     PT13H24M56S 48296s (~13.42 hours)
#> 13 13:24:56     PT13H24M56S 48296s (~13.42 hours)
#> 14 13:24:56     PT13H24M56S 48296s (~13.42 hours)

The W3C XML schema has more details on the specifics of ISO formatting of dates and date-times.