1 Technical approach
The People Survey data held in the csps-data repository has been processed in R using the Positron IDE. The project is version controlled with Git, using GitHub for storage.
File formats
The original People Survey source data is stored in a mix of CSV files, Microsoft Excel spreadsheet files (.xls/.xlsx) and OpenDocument Format spreadsheet files (.ods).
Interim and output data are stored in either CSV format or Apache Parquet.
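As a rough illustration of how this mix of formats might be handled, the sketch below chooses a reader from the file extension. The helper names `source_format()` and `read_source()` are hypothetical and are not taken from the csps-data repository; the reader functions they call are the standard ones from the packages listed under Dependencies.

```r
# Hypothetical helper: classify a file by its extension (case-insensitive).
source_format <- function(path) {
  ext <- tolower(tools::file_ext(path))
  switch(
    ext,
    csv = "csv",
    xls = ,               # falls through to the xlsx entry
    xlsx = "excel",
    ods = "ods",
    parquet = "parquet",
    stop("Unsupported file extension: ", ext)
  )
}

# Hypothetical helper: read a file with the reader matching its format.
# The packages are only needed when the relevant branch is actually used.
read_source <- function(path, ...) {
  switch(
    source_format(path),
    csv = readr::read_csv(path, ...),
    excel = readxl::read_excel(path, ...),
    ods = readODS::read_ods(path, ...),
    parquet = arrow::read_parquet(path, ...)
  )
}

# Example usage (the paths are illustrative only):
# df <- read_source("raw-data/example.ods")
# arrow::write_parquet(df, "proc/example.parquet")
```

Dispatching on the extension keeps the format-specific calls in one place. In practice, spreadsheet sources often need extra arguments such as a sheet name or cell range, which can be passed through `...`.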
Dependencies
The following R packages (and their dependencies) have been used in the development of the processing code:
- {arrow} - for storing and accessing data in Parquet files
- {cli} - for console messages
- {dplyr} - for tabular data manipulation
- {fs} - for file system operations
- {janitor} - for column name cleaning
- {purrr} - for data manipulation
- {readODS} - to read ODS files
- {readr} - to read and write CSV files
- {readxl} - to read Excel files
- {rlang} - for helper functions
- {stringi} - for string processing
- {stringr} - for string processing
- {tibble} - for tabular data manipulation
- {tidyr} - for tabular data manipulation
- {tidyxl} - to read Excel files
- {unpivotr} - to reshape tidyxl data
- {yaml12} - to read YAML files
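Before running the processing code it can be useful to check which of these packages are available locally. The following is a minimal sketch using base R only; it is an illustration rather than part of the repository, and the variable names are assumptions.

```r
# Packages named in the list above.
pkgs <- c(
  "arrow", "cli", "dplyr", "fs", "janitor", "purrr", "readODS", "readr",
  "readxl", "rlang", "stringi", "stringr", "tibble", "tidyr", "tidyxl",
  "unpivotr", "yaml12"
)

# system.file() returns "" for a package that is not installed, so this
# checks availability without loading any of the packages.
installed <- vapply(
  pkgs,
  function(pkg) nzchar(system.file(package = pkg)),
  logical(1)
)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All listed packages are available.")
}
```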
The following packages (and their dependencies) have been used in the production of this companion:
Repository structure
This companion documents the contents of the csps-data repository. The high-level folder structure of the repo is as follows:
csps-data
├─ companion
├─ data
│ ├─ 00-reference
│ ├─ 01-benchmarks
│ ├─ 02-organisations
│ └─ 03-demographics
├─ proc
│ ├─ 01-questions_ref
│ ├─ 02-organisations_ref
│ ├─ 03-demographics_ref
│ ├─ 04-extract_data
│ └─ 05-processing
├─ R
│ ├─ 01-questions_ref
│ ├─ 02-organisations_ref
│ ├─ 03-demographics_ref
│ ├─ 04-extract_data
│ └─ 05-processing
└─ raw-data

- The companion folder contains the underlying source code for this companion book.
- The data folder contains the harmonised and re-processed People Survey data, see the data dictionaries part of the companion for details.
- The proc folder contains intermediate outputs from the data processing.
- The R folder contains the source code for the data processing, see the workflow section below for details.
- The raw-data folder contains the original published source datasets and associated materials, see the chapter on source_data for details.
Workflow
The csps-data repo and this companion are largely structured to follow the workflow of the data processing. In particular, the structure and naming of the scripts in the R folder outline the workflow taken to process the data:
csps-data
├─ ...
├─ R
│ ├─ 01-questions_ref
│ │ ├─ 01-questions_ref/01_01-extract_questions.R
│ │ ├─ 01-questions_ref/01_02-regex_development.R
│ │ ├─ 01-questions_ref/01_03-regex_refinement.R
│ │ └─ 01-questions_ref/01_04-question_reference.R
│ ├─ 02-organisations_ref
│ │ ├─ 02-organisations_ref/02_01-extract_organisations.R
│ │ ├─ 02-organisations_ref/02_02-org_regex_refinement.R
│ │ └─ 02-organisations_ref/02_03-org_reference.R
│ ├─ 03-demographics_ref
│ │ ├─ 03-demographics_ref/03_01-extract-demogs.R
│ │ ├─ 03-demographics_ref/03_02-demog_regex_refinement.R
│ │ ├─ 03-demographics_ref/03_03-demcat_development.R
│ │ └─ 03-demographics_ref/03_04-demcat_ref.R
│ ├─ 04-extract_data
│ │ ├─ 04-extract_data/04_01-benchmark_extract.R
│ │ ├─ 04-extract_data/04_02-organisation_extract.R
│ │ └─ 04-extract_data/04_03-demograhpic_extract.R
│ ├─ 05-processing
│ │ ├─ 05-processing/05_01-process_benchmarks.R
│ │ ├─ 05-processing/05_02-process_organisations.R
│ │ └─ 05-processing/05_03-process_demographics.R
│ ├─ utils
│ │ ├─ data_extract_helpers.R
│ │ ├─ data_files.R
│ │ ├─ data_files_ref.R
│ │ ├─ extract_response_category.R
│ │ ├─ regex_matches.R
│ │ ├─ text_to_uid.R
│ │ └─ variable_extract_helpers.R
│ ├─ LICENSE-MIT
│ └─ README.md
└─ ...

- The first stages of the workflow are dedicated to developing regexes and harmonised indicators.
  - The R/01-questions_ref folder contains the scripts for regexes and identifiers related to questions and measures.
  - The R/02-organisations_ref folder contains the scripts for regexes and identifiers related to organisations.
  - The R/03-demographics_ref folder contains the scripts for regexes and identifiers related to demographic questions and categories.
- The R/04-extract_data folder contains the scripts for extracting data from the raw source files for benchmarks, organisations, and demographics.
- The R/05-processing folder contains the scripts for processing the raw data and producing the harmonised datasets for benchmarks, organisations, and demographics.
- The R/utils folder contains the scripts for bespoke helpers and utility functions.
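Because the stage and step numbers are encoded in the script names, the workflow order can be recovered by sorting the file names. The sketch below is an illustration only: `list_workflow_scripts()` is a hypothetical helper, not part of the repository, and it assumes the scripts follow the `NN_NN-name.R` pattern shown in the tree above.

```r
# Hypothetical helper: list the numbered workflow scripts under `r_dir` in
# the order implied by their names. Files that do not match the NN_NN-name.R
# pattern (such as those in utils) are skipped.
list_workflow_scripts <- function(r_dir = "R") {
  scripts <- list.files(
    r_dir,
    pattern = "^[0-9]{2}_[0-9]{2}-.*\\.R$",  # matched against file names only
    recursive = TRUE
  )
  # Radix sorting uses the C locale, so the order does not depend on the
  # user's locale settings.
  sort(scripts, method = "radix")
}

# Example usage (run from the repository root):
# for (script in list_workflow_scripts("R")) {
#   message("Workflow step: ", script)
# }
```

Whether the scripts are intended to be run unattended in a single pass is not stated here, so the usage example only prints the order rather than sourcing each file.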