Under development: This book is still under active development; its content and structure are subject to change.

1  Technical approach

The People Survey data held in the csps-data repository has been processed in R using the Positron IDE. The project is version controlled with Git, using GitHub for storage.

File formats

The original People Survey source data is stored in a mix of CSV files, Microsoft Excel spreadsheet files (.xls/.xlsx) and OpenDocument Format spreadsheet files (.ods).

Interim and output data are stored in either CSV format or Apache Parquet.
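Because the source data arrives in several formats, a processing script typically needs to decide how to read each file before doing anything else. The following is a minimal sketch of that decision in base R; the function name `source_file_format()` is hypothetical and is not taken from the repository, and the reader functions mentioned in the comments are assumptions about what could be used for each format.

```r
# Hypothetical helper (not from the csps-data repo): classify a data file
# by its extension so that a suitable reader can be chosen for it.
source_file_format <- function(path) {
  ext <- tolower(tools::file_ext(path))
  switch(
    ext,
    csv = "csv",          # could be read with utils::read.csv()
    xls = ,               # falls through to the xlsx case
    xlsx = "excel",       # assumption: e.g. readxl::read_excel()
    ods = "ods",          # assumption: e.g. readODS::read_ods()
    parquet = "parquet",  # assumption: e.g. arrow::read_parquet()
    stop("Unsupported file format: ", path)
  )
}

# Example file names below are illustrative only.
source_file_format("raw-data/example_benchmarks.XLSX")
source_file_format("data/example_output.parquet")
```

Failing loudly on an unrecognised extension is a deliberate choice in this sketch: it makes an unexpected file in the source folder visible rather than silently skipped.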

Dependencies

The following R packages (and their dependencies) have been used in the development of the processing code:

The following packages (and their dependencies) have been used in the production of this companion:

Repository structure

This companion documents the contents of the csps-data repository. The high-level folder structure of the repo is as follows:

csps-data
├─ companion
├─ data
│  ├─ 00-reference
│  ├─ 01-benchmarks
│  ├─ 02-organisations
│  └─ 03-demographics
├─ proc
│  ├─ 01-questions_ref
│  ├─ 02-organisations_ref
│  ├─ 03-demographics_ref
│  ├─ 04-extract_data
│  └─ 05-processing
├─ R
│  ├─ 01-questions_ref
│  ├─ 02-organisations_ref
│  ├─ 03-demographics_ref
│  ├─ 04-extract_data
│  └─ 05-processing
└─ raw-data
  • The companion folder contains the underlying source code for this companion book.
  • The data folder contains the harmonised and re-processed People Survey data. See the data dictionaries part of the companion for details.
  • The proc folder contains intermediate outputs from the data processing.
  • The R folder contains the source code for the data processing. See the workflow section below for details.
  • The raw-data folder contains the original published source datasets and associated materials. See the chapter on source data for details.
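As an illustration of the layout described above, the sketch below checks that a local copy of the repository contains the five top-level folders. The helper name `missing_repo_dirs()` is hypothetical and is not part of the repository; only the folder names come from the tree above.

```r
# Top-level folders, taken from the repository tree above.
expected_dirs <- c("companion", "data", "proc", "R", "raw-data")

# Hypothetical helper (not from the csps-data repo): return the names of
# any expected top-level folders that are missing under `root`.
missing_repo_dirs <- function(root = ".") {
  present <- dir.exists(file.path(root, expected_dirs))
  expected_dirs[!present]
}

# Usage: an empty character vector means every expected folder was found.
missing_repo_dirs(".")
```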

Workflow

The csps-data repo and this companion are largely structured to follow the data processing workflow. In particular, the structure and naming of the scripts in the R folder outline the steps taken to process the data:

csps-data
├─ ...
├─ R
│ ├─ 01-questions_ref
│ │  ├─ 01-questions_ref/01_01-extract_questions.R
│ │  ├─ 01-questions_ref/01_02-regex_development.R
│ │  ├─ 01-questions_ref/01_03-regex_refinement.R
│ │  └─ 01-questions_ref/01_04-question_reference.R
│ ├─ 02-organisations_ref
│ │  ├─ 02-organisations_ref/02_01-extract_organisations.R
│ │  ├─ 02-organisations_ref/02_02-org_regex_refinement.R
│ │  └─ 02-organisations_ref/02_03-org_reference.R
│ ├─ 03-demographics_ref
│ │  ├─ 03-demographics_ref/03_01-extract-demogs.R
│ │  ├─ 03-demographics_ref/03_02-demog_regex_refinement.R
│ │  ├─ 03-demographics_ref/03_03-demcat_development.R
│ │  └─ 03-demographics_ref/03_04-demcat_ref.R
│ ├─ 04-extract_data
│ │  ├─ 04-extract_data/04_01-benchmark_extract.R
│ │  ├─ 04-extract_data/04_02-organisation_extract.R
│ │  └─ 04-extract_data/04_03-demograhpic_extract.R
│ ├─ 05-processing
│ │  ├─ 05-processing/05_01-process_benchmarks.R
│ │  ├─ 05-processing/05_02-process_organisations.R
│ │  └─ 05-processing/05_03-process_demographics.R
│ ├─ utils
│ │  ├─ data_extract_helpers.R
│ │  ├─ data_files.R
│ │  ├─ data_files_ref.R
│ │  ├─ extract_response_category.R
│ │  ├─ regex_matches.R
│ │  ├─ text_to_uid.R
│ │  └─ variable_extract_helpers.R
│ ├─ LICENSE-MIT
│ └─ README.md
└─ ...
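The numbered script names in the tree above encode the workflow order: the first two digits match the stage folder and the next two give the step within that stage. The following is a minimal sketch, in base R, of how that convention could be parsed and used to order the scripts. The function names `parse_script_name()` and `workflow_order()` are hypothetical and are not part of the repository.

```r
# Hypothetical helper (not from the csps-data repo): split script names such
# as "01_02-regex_development.R" into stage number, step number and label.
# Names that do not follow the convention (e.g. utils scripts) get NA values.
parse_script_name <- function(paths) {
  name <- basename(paths)
  pattern <- "^([0-9]{2})_([0-9]{2})-(.+)\\.R$"
  matched <- grepl(pattern, name)

  stage <- rep(NA_integer_, length(name))
  step <- rep(NA_integer_, length(name))
  label <- rep(NA_character_, length(name))

  stage[matched] <- as.integer(sub(pattern, "\\1", name[matched]))
  step[matched] <- as.integer(sub(pattern, "\\2", name[matched]))
  label[matched] <- sub(pattern, "\\3", name[matched])

  data.frame(script = name, stage = stage, step = step, label = label,
             stringsAsFactors = FALSE)
}

# Hypothetical helper: keep only the numbered scripts, sorted by stage and step.
workflow_order <- function(paths) {
  parsed <- parse_script_name(paths)
  parsed <- parsed[!is.na(parsed$stage), ]
  parsed[order(parsed$stage, parsed$step), ]
}

# Usage with a few of the script paths listed above.
workflow_order(c(
  "05-processing/05_01-process_benchmarks.R",
  "01-questions_ref/01_02-regex_development.R",
  "01-questions_ref/01_01-extract_questions.R",
  "utils/text_to_uid.R"
))
```

In a local copy of the repository, the input could instead come from `list.files("R", pattern = "\\.R$", recursive = TRUE)`. That assumes the working directory is the repository root.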