2. Tokenize transcriptomes and data pairing

This notebook starts with the CPU-only tokenization step. Replace placeholder paths like path/to/... with real locations on your system.

2.1. Configure paths and parameters

These parameters mirror the CLI options in perturbgen.pp.GF_tokenisation.

In order to download data, including the LPS data for this tutorial see https://perturbgen.cog.sanger.ac.uk/docs/data.html

# download LPS data from AWS S3 bucket
!aws --endpoint-url https://cog.sanger.ac.uk --no-sign-request \
  s3 cp s3://perturbgen/Manuscript/lps_otar.h5ad ./lps_otar.h5ad

download: s3://perturbgen/Manuscript/lps_otar.h5ad to ./lps_otar.h5adining remainingremainingremainingremainingremaining12.7 MiB/s) with 1 file(s) remaining   98.3 MiB/s) with 1 file(s) remaining   

H5AD_PATH = "path/to/adata.h5ad" #"path/to/data.h5ad"
DATASET_NAME = "LPS_all_tps_2k" # choose a name for the dataset
GENE_FILTERING_MODE = "hvg"  # one of: hvg, degs, all
HVG_MODE = "before_tokenisation"  # before_tokenisation or after_tokenisation

VAR_LIST = [
    "cell_type_harmonized",
    "time_after_LPS",
] # list of obs to retain in adata.vars after preprocessing

PAIRING_MODE = "stratified"  # stratified, random, mapping. We select the pairing strategy here, for more info please read the paper.
TIME_OBS = "time_after_LPS" # the obs that contains the time point information for pairing
PAIRING_FILE = "path/to/pairing.csv"  # only if PAIRING_MODE == 'mapping'
MAIN_PAIRING_OBS = "cell_type_harmonized" # the main obs to use for pairing amongst TIME_OBS
OPT_PAIRING_OBS = []  # optional additional obs

NPROC = 8 # number of parallel processes to use
N_HVG = 2000 # number of highly variable genes to select if HVG filtering is used
TIME_POINT_ORDER = ["normal", "90m_LPS", "6h_LPS", "10h_LPS"]
REFERENCE_TIME = "normal" # the reference time point for pairing, usually the control or untreated condition

GENE_MEDIAN_PATH = "Perturbgen/perturbgen/pp/gene_median_dict_gftokens_gc95M.pkl"  # path to gene median dictionary, provided with Perturbgen for pretrained model
TOKEN_DICT_PATH  = "Perturbgen/perturbgen/pp/token_dict_gftokens_gc95M.pkl"  # path to token dictionary, provided with Perturbgen for pretrained model
GENE_MAPPING_PATH = "Perturbgen/perturbgen/pp/ensembl_mapping_dict_gc95M.pkl"  # path to gene mapping dictionary, Geneformer 95M mapping dictionary provided with Perturbgen

2.2. Build the tokenization command

This prints the exact command that will be executed. Review it before running.

cmd = [
    "python",
    "-m",
    "perturbgen",
    "tokenise",
    "--h5ad_path", H5AD_PATH,
    "--dataset", DATASET_NAME,
    "--gene_filtering_mode", GENE_FILTERING_MODE,
    "--hvg_mode", HVG_MODE,
    "--var_list", *VAR_LIST,
    "--pairing_mode", PAIRING_MODE,
    "--time_obs", TIME_OBS,
    "--main_pairing_obs", MAIN_PAIRING_OBS,
    "--nproc", str(NPROC),
    "--n_hvg", str(N_HVG),
    "--reference_time", REFERENCE_TIME,
    "--time_point_order", *TIME_POINT_ORDER,
    "--gene_median_path", GENE_MEDIAN_PATH,
    "--token_dict_path", TOKEN_DICT_PATH,
    "--gene_mapping_path", GENE_MAPPING_PATH,
]

if PAIRING_MODE == "mapping":
    cmd += ["--pairing_file", PAIRING_FILE]
if OPT_PAIRING_OBS:
    cmd += ["--opt_pairing_obs", *OPT_PAIRING_OBS]

print(" ".join(cmd))

python -m perturbgen tokenise --h5ad_path /nfs/team361/am74/Cytomeister/Evaluation_datasets/LPS/lps_otar.h5ad --dataset LPS_all_tps_2k --gene_filtering_mode hvg --hvg_mode before_tokenisation --var_list cell_type_harmonized time_after_LPS --pairing_mode stratified --time_obs time_after_LPS --main_pairing_obs cell_type_harmonized --nproc 8 --n_hvg 2000 --reference_time normal --time_point_order normal 90m_LPS 6h_LPS 10h_LPS --gene_median_path Perturbgen/perturbgen/pp/gene_median_dict_gftokens_gc95M.pkl --token_dict_path Perturbgen/perturbgen/pp/token_dict_gftokens_gc95M.pkl --gene_mapping_path Perturbgen/perturbgen/pp/ensembl_mapping_dict_gc95M.pkl

2.3. Run tokenization (CPU-only)

This step can take time depending on dataset size.

import subprocess

subprocess.run(cmd, check=True)

loading, please wait...
Current working directory: /lustre/scratch126/cellgen/lotfollahi/dv8/for_otar
Start preprocessing adata...
Number of genes dropped: 0
Finished preprocessing adata.
Start tokenisation of adata...
Tokenizing /lustre/scratch126/cellgen/lotfollahi/dv8/for_otar/T_perturb/tokenized_data/LPS_all_tps_2k/h5ad_pairing_2000_hvg/LPS_all_tps_2k.h5ad

/lustre/scratch126/cellgen/lotfollahi/dv8/for_otar/Perturbgen/perturbgen/pp/tokenizer.py:495: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  for i in adata.var["ensembl_id_collapsed"][coding_miRNA_loc]
/lustre/scratch126/cellgen/lotfollahi/dv8/for_otar/Perturbgen/perturbgen/pp/tokenizer.py:498: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  coding_miRNA_ids = adata.var["ensembl_id_collapsed"][coding_miRNA_loc]

/lustre/scratch126/cellgen/lotfollahi/dv8/for_otar/T_perturb/tokenized_data/LPS_all_tps_2k/h5ad_pairing_2000_hvg/LPS_all_tps_2k.h5ad has no column attribute 'filter_pass'; tokenizing all cells.
Creating dataset.

Map (num_proc=4): 100%|██████████| 223478/223478 [00:29<00:00, 7559.59 examples/s] 
Saving the dataset (1/1 shards): 100%|██████████| 223478/223478 [00:00<00:00, 458199.76 examples/s]

Finished tokenisation.

/lustre/scratch126/cellgen/lotfollahi/dv8/for_otar/Perturbgen/perturbgen/src/utils.py:1643: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  adata_obs_.groupby(grouping_obs)[time_obs].transform('nunique') == total_tps
/lustre/scratch126/cellgen/lotfollahi/dv8/for_otar/Perturbgen/perturbgen/src/utils.py:1646: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  grouped = adata_grouped.groupby(grouping_obs)
Map (num_proc=4): 100%|██████████| 223478/223478 [01:33<00:00, 2399.51 examples/s]
  0%|          | 0/4 [00:00<?, ?it/s]/lustre/scratch126/cellgen/lotfollahi/dv8/for_otar/Perturbgen/.venv/lib/python3.11/site-packages/anndata/_core/aligned_df.py:68: ImplicitModificationWarning: Transforming to str index.
  warnings.warn("Transforming to str index.", ImplicitModificationWarning)

Saving the dataset (0/1 shards):   0%|          | 0/148107 [00:00<?, ? examples/s]
Saving the dataset (0/1 shards):   1%|          | 1000/148107 [00:00<00:36, 4022.79 examples/s]
Saving the dataset (0/1 shards):   1%|▏         | 2000/148107 [00:00<00:24, 5861.26 examples/s]
Saving the dataset (0/1 shards):   3%|▎         | 4000/148107 [00:00<00:16, 8494.98 examples/s]
Saving the dataset (0/1 shards):   4%|▍         | 6000/148107 [00:00<00:16, 8565.91 examples/s]
Saving the dataset (0/1 shards):   5%|▍         | 7000/148107 [00:00<00:16, 8508.87 examples/s]
Saving the dataset (0/1 shards):   6%|▌         | 9000/148107 [00:01<00:13, 10273.72 examples/s]
Saving the dataset (0/1 shards):   7%|▋         | 11000/148107 [00:01<00:12, 11398.90 examples/s]
Saving the dataset (0/1 shards):   9%|▉         | 13000/148107 [00:01<00:11, 11534.82 examples/s]
Saving the dataset (0/1 shards):  10%|█         | 15000/148107 [00:01<00:11, 11741.47 examples/s]
Saving the dataset (0/1 shards):  11%|█▏        | 17000/148107 [00:01<00:10, 12271.56 examples/s]
Saving the dataset (0/1 shards):  13%|█▎        | 19000/148107 [00:01<00:09, 13040.10 examples/s]
Saving the dataset (0/1 shards):  14%|█▍        | 21000/148107 [00:01<00:09, 13350.15 examples/s]
Saving the dataset (0/1 shards):  16%|█▌        | 24000/148107 [00:02<00:07, 15798.87 examples/s]
Saving the dataset (0/1 shards):  18%|█▊        | 27000/148107 [00:02<00:07, 17165.03 examples/s]
Saving the dataset (0/1 shards):  21%|██        | 31000/148107 [00:02<00:05, 20105.64 examples/s]
Saving the dataset (0/1 shards):  24%|██▎       | 35000/148107 [00:02<00:04, 22851.98 examples/s]
Saving the dataset (0/1 shards):  26%|██▋       | 39000/148107 [00:02<00:04, 26076.16 examples/s]
Saving the dataset (0/1 shards):  29%|██▉       | 43000/148107 [00:02<00:04, 25776.42 examples/s]
Saving the dataset (0/1 shards):  33%|███▎      | 49000/148107 [00:02<00:03, 31866.48 examples/s]
Saving the dataset (0/1 shards):  37%|███▋      | 55000/148107 [00:03<00:02, 37425.21 examples/s]
Saving the dataset (0/1 shards):  41%|████      | 60000/148107 [00:03<00:02, 39768.59 examples/s]
Saving the dataset (0/1 shards):  45%|████▌     | 67000/148107 [00:03<00:01, 44655.50 examples/s]
Saving the dataset (0/1 shards):  50%|████▉     | 74000/148107 [00:03<00:01, 46819.13 examples/s]
Saving the dataset (0/1 shards):  55%|█████▌    | 82000/148107 [00:03<00:01, 54587.43 examples/s]
Saving the dataset (0/1 shards):  60%|██████    | 89000/148107 [00:03<00:02, 27999.20 examples/s]
Saving the dataset (0/1 shards):  63%|██████▎   | 94000/148107 [00:04<00:01, 29521.77 examples/s]
Saving the dataset (0/1 shards):  68%|██████▊   | 100000/148107 [00:04<00:01, 33540.70 examples/s]
Saving the dataset (0/1 shards):  73%|███████▎  | 108000/148107 [00:04<00:00, 41849.53 examples/s]
Saving the dataset (0/1 shards):  80%|███████▉  | 118000/148107 [00:04<00:00, 52490.38 examples/s]
Saving the dataset (0/1 shards):  86%|████████▌ | 127000/148107 [00:04<00:00, 59553.49 examples/s]
Saving the dataset (0/1 shards):  93%|█████████▎| 137000/148107 [00:04<00:00, 67642.13 examples/s]
Saving the dataset (0/1 shards):  99%|█████████▉| 147000/148107 [00:04<00:00, 73855.08 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 148107/148107 [00:04<00:00, 30679.80 examples/s]
 25%|██▌       | 1/4 [00:42<02:06, 42.17s/it]/lustre/scratch126/cellgen/lotfollahi/dv8/for_otar/Perturbgen/.venv/lib/python3.11/site-packages/anndata/_core/aligned_df.py:68: ImplicitModificationWarning: Transforming to str index.
  warnings.warn("Transforming to str index.", ImplicitModificationWarning)
/lustre/scratch126/cellgen/lotfollahi/dv8/for_otar/Perturbgen/.venv/lib/python3.11/site-packages/anndata/_core/anndata.py:1758: UserWarning: Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
  utils.warn_names_duplicates("obs")

Saving the dataset (0/1 shards):   0%|          | 0/148107 [00:00<?, ? examples/s]
Saving the dataset (0/1 shards):   1%|          | 1000/148107 [00:00<00:27, 5442.45 examples/s]
Saving the dataset (0/1 shards):   5%|▌         | 8000/148107 [00:00<00:04, 30728.95 examples/s]
Saving the dataset (0/1 shards):   9%|▉         | 13000/148107 [00:00<00:03, 36805.37 examples/s]
Saving the dataset (0/1 shards):  16%|█▌        | 23000/148107 [00:00<00:02, 56483.40 examples/s]
Saving the dataset (0/1 shards):  23%|██▎       | 34000/148107 [00:00<00:01, 71237.55 examples/s]
Saving the dataset (0/1 shards):  30%|███       | 45000/148107 [00:00<00:01, 80922.66 examples/s]
Saving the dataset (0/1 shards):  37%|███▋      | 55000/148107 [00:00<00:01, 85035.23 examples/s]
Saving the dataset (0/1 shards):  44%|████▍     | 65000/148107 [00:00<00:00, 86037.80 examples/s]
Saving the dataset (0/1 shards):  50%|████▉     | 74000/148107 [00:01<00:00, 85103.32 examples/s]
Saving the dataset (0/1 shards):  57%|█████▋    | 84000/148107 [00:01<00:00, 85710.06 examples/s]
Saving the dataset (0/1 shards):  63%|██████▎   | 94000/148107 [00:01<00:00, 86442.46 examples/s]
Saving the dataset (0/1 shards):  71%|███████   | 105000/148107 [00:01<00:00, 90388.36 examples/s]
Saving the dataset (0/1 shards):  78%|███████▊  | 115000/148107 [00:01<00:00, 91664.24 examples/s]
Saving the dataset (0/1 shards):  84%|████████▍ | 125000/148107 [00:01<00:00, 93846.51 examples/s]
Saving the dataset (0/1 shards):  95%|█████████▍| 140000/148107 [00:01<00:00, 92515.21 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 148107/148107 [00:01<00:00, 79380.54 examples/s]
 50%|█████     | 2/4 [01:21<01:20, 40.48s/it]

2.4. Outputs

Tokenized files are written under the tokenized data directory defined in perturbgen/configs/paths.py. You should see a new folder for your dataset name containing .dataset files and pairing outputs in root_dir/T_perturb/tokenized_data.

Next: training steps (masking model and count decoder) in the following sections. See Notebook 3 for training the Perturbgen model