Data

Access is provided to .h5ad files selected for the manuscript as well as the full pretraining corpus (approximately 1.7 TB).

The data is hosted in an S3-compatible bucket at the Wellcome Sanger Institute. Downloading it requires the AWS CLI (installation guide: https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html). No AWS credentials are needed, because the commands below pass --no-sign-request.

You can browse the available folders before downloading, or download individual files, at: https://perturbgen.cog.sanger.ac.uk/data.html
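The bucket contents can also be listed from the command line. The following is a minimal sketch: `list_perturbgen` is a hypothetical helper name introduced here for illustration, while the endpoint and bucket name are taken from the download commands in this section. It assumes anonymous listing is permitted, which the recursive downloads below also rely on.

```shell
# Hypothetical helper (not part of the project): list objects under a
# prefix of the perturbgen bucket, with human-readable sizes and a total.
list_perturbgen() {
  aws --endpoint-url https://cog.sanger.ac.uk --no-sign-request \
    s3 ls "s3://perturbgen/${1:-}" --recursive --human-readable --summarize
}

# Example (requires network access):
#   list_perturbgen Manuscript/
```

The `--summarize` flag prints the total object count and size at the end of the listing, which is useful for estimating disk requirements before downloading.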

Manuscript data

To download the dataset used in the manuscript:

aws --endpoint-url https://cog.sanger.ac.uk --no-sign-request \
  s3 cp s3://perturbgen/Manuscript Manuscript --recursive

That will create a folder named Manuscript in your current working directory containing all the relevant .h5ad files.
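To check what a download would fetch before committing to it, the AWS CLI `--dryrun` flag can be added to the same command. This is a sketch only; `preview_download` is a hypothetical helper name, not something provided by the project.

```shell
# Hypothetical helper: print the copy operations a recursive download of
# the given top-level folder would perform, without transferring any data.
preview_download() {
  aws --endpoint-url https://cog.sanger.ac.uk --no-sign-request \
    s3 cp "s3://perturbgen/$1" "$1" --recursive --dryrun
}

# Example (requires network access):
#   preview_download Manuscript
```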

Pretraining corpus data

To download the full pretraining corpus (approximately 1.7 TB):

aws --endpoint-url https://cog.sanger.ac.uk --no-sign-request \
  s3 cp s3://perturbgen/PretrainCorpus PretrainCorpus --recursive

That will create a folder named PretrainCorpus in your current working directory containing all the relevant folders with .h5ad files.

Warning

Downloading the full pretraining corpus requires approximately 1.7 TB of free disk space and may take a long time, depending on your network connection.
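For a transfer of this size, it may help to check free disk space first and to use a command that can be restarted after an interruption. The sketch below is one possible approach, not the project's documented method: it swaps `s3 cp --recursive` for `s3 sync`, which copies only files that are missing or changed locally. Both function names are hypothetical, and the 2000 GB threshold in the example is an assumed, conservative figure.

```shell
# Hypothetical helper: succeed only if the filesystem holding the current
# directory has at least the given number of gigabytes (GiB) available.
has_free_space_gb() {
  required_kb=$(( $1 * 1024 * 1024 ))
  # df -P guarantees one line per filesystem; column 4 is available KiB.
  available_kb=$(df -Pk . | awk 'NR==2 {print $4}')
  [ "$available_kb" -ge "$required_kb" ]
}

# Hypothetical helper: download the corpus with `s3 sync`, which skips
# files already present locally, so an interrupted run can be rerun.
sync_pretrain_corpus() {
  aws --endpoint-url https://cog.sanger.ac.uk --no-sign-request \
    s3 sync s3://perturbgen/PretrainCorpus PretrainCorpus
}

# Example (requires network access); 2000 is an assumed threshold:
#   has_free_space_gb 2000 && sync_pretrain_corpus
```

Rerunning `sync_pretrain_corpus` after a dropped connection continues with the files that have not yet been copied, rather than starting over.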