Data
====

Access is provided to the ``.h5ad`` files selected for the manuscript as well as
the full pretraining corpus *(approximately 1.7 TB)*. The data is hosted in an
S3-compatible bucket at the Wellcome Sanger Institute.

Downloading the data requires the AWS CLI to be installed:
https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html

You can browse the available folders before downloading, or download a single
file, at:
https://perturbgen.cog.sanger.ac.uk/data.html

Manuscript data
---------------

To download the dataset used in the manuscript:

.. code-block:: bash

   aws --endpoint-url https://cog.sanger.ac.uk --no-sign-request \
       s3 cp s3://perturbgen/Manuscript Manuscript --recursive

This creates a folder named ``Manuscript`` in your current working directory
containing all the relevant ``.h5ad`` files.

Pretrain Corpus data
--------------------

To download the full pretraining corpus (approximately 1.7 TB):

.. code-block:: bash

   aws --endpoint-url https://cog.sanger.ac.uk --no-sign-request \
       s3 cp s3://perturbgen/PretrainCorpus PretrainCorpus --recursive

This creates a folder named ``PretrainCorpus`` in your current working directory
containing all the relevant folders with ``.h5ad`` files.

.. warning::

   Downloading the full pretraining corpus requires roughly 1.7 TB of free disk
   space and may take a significant amount of time.
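
Because the corpus is large, it can help to check free disk space first and to
use a transfer that can be resumed. The following is a minimal sketch, not part
of the official instructions. The helper names ``has_free_space`` and
``download_corpus`` are hypothetical, and the threshold assumes the approximate
1.7 TB corpus size counted as 1.7 × 10^12 bytes. It also swaps in
``aws s3 sync`` for ``aws s3 cp --recursive``, because ``sync`` skips files that
are already present locally, so re-running it after an interruption continues
the download rather than starting over.

.. code-block:: bash

   #!/usr/bin/env bash
   # Hypothetical helpers; the names and the size threshold are assumptions.

   # Return 0 if the filesystem holding directory $1 has at least $2 KiB available.
   has_free_space() {
       local dir="$1" required_kb="$2" available_kb
       # df -Pk prints POSIX-format output in 1024-byte blocks;
       # the fourth column of the second line is the available space.
       available_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
       [ "$available_kb" -ge "$required_kb" ]
   }

   # Download the pretraining corpus into ./PretrainCorpus, resuming if partly present.
   download_corpus() {
       # Assumed threshold: 1.7 * 10^12 bytes expressed in KiB.
       local required_kb=1660156250
       if ! has_free_space . "$required_kb"; then
           echo "Not enough free disk space in $(pwd)" >&2
           return 1
       fi
       aws --endpoint-url https://cog.sanger.ac.uk --no-sign-request \
           s3 sync s3://perturbgen/PretrainCorpus PretrainCorpus
   }

After sourcing the snippet, call ``download_corpus`` from the directory in which
the ``PretrainCorpus`` folder should be created.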