2. Tokenize transcriptomes and data pairing

This notebook starts with the CPU-only tokenization step. Replace placeholder paths like path/to/... with real locations on your system.

2.1. Configure paths and parameters

These parameters mirror the CLI options in perturbgen.pp.GF_tokenisation.

In order to download data, including the LPS data for this tutorial see https://perturbgen.cog.sanger.ac.uk/docs/data.html

# download LPS data from AWS S3 bucket
!aws --endpoint-url https://cog.sanger.ac.uk --no-sign-request \
  s3 cp s3://perturbgen/Manuscript/lps_otar.h5ad ./lps_otar.h5ad
download: s3://perturbgen/Manuscript/lps_otar.h5ad to ./lps_otar.h5adining remainingremainingremainingremainingremaining12.7 MiB/s) with 1 file(s) remaining   98.3 MiB/s) with 1 file(s) remaining   
H5AD_PATH = "path/to/adata.h5ad" #"path/to/data.h5ad"
DATASET_NAME = "LPS_all_tps_2k" # choose a name for the dataset
GENE_FILTERING_MODE = "hvg"  # one of: hvg, degs, all
HVG_MODE = "before_tokenisation"  # before_tokenisation or after_tokenisation

VAR_LIST = [
    "cell_type_harmonized",
    "time_after_LPS",
] # list of obs to retain in adata.vars after preprocessing

PAIRING_MODE = "stratified"  # stratified, random, mapping. We select the pairing strategy here, for more info please read the paper.
TIME_OBS = "time_after_LPS" # the obs that contains the time point information for pairing
PAIRING_FILE = "path/to/pairing.csv"  # only if PAIRING_MODE == 'mapping'
MAIN_PAIRING_OBS = "cell_type_harmonized" # the main obs to use for pairing amongst TIME_OBS
OPT_PAIRING_OBS = []  # optional additional obs

NPROC = 8 # number of parallel processes to use
N_HVG = 2000 # number of highly variable genes to select if HVG filtering is used
TIME_POINT_ORDER = ["normal", "90m_LPS", "6h_LPS", "10h_LPS"]
REFERENCE_TIME = "normal" # the reference time point for pairing, usually the control or untreated condition

GENE_MEDIAN_PATH = "Perturbgen/perturbgen/pp/gene_median_dict_gftokens_gc95M.pkl"  # path to gene median dictionary, provided with Perturbgen for pretrained model
TOKEN_DICT_PATH  = "Perturbgen/perturbgen/pp/token_dict_gftokens_gc95M.pkl"  # path to token dictionary, provided with Perturbgen for pretrained model
GENE_MAPPING_PATH = "Perturbgen/perturbgen/pp/ensembl_mapping_dict_gc95M.pkl"  # path to gene mapping dictionary, Geneformer 95M mapping dictionary provided with Perturbgen

2.2. Build the tokenization command

This prints the exact command that will be executed. Review it before running.

cmd = [
    "python",
    "-m",
    "perturbgen",
    "tokenise",
    "--h5ad_path", H5AD_PATH,
    "--dataset", DATASET_NAME,
    "--gene_filtering_mode", GENE_FILTERING_MODE,
    "--hvg_mode", HVG_MODE,
    "--var_list", *VAR_LIST,
    "--pairing_mode", PAIRING_MODE,
    "--time_obs", TIME_OBS,
    "--main_pairing_obs", MAIN_PAIRING_OBS,
    "--nproc", str(NPROC),
    "--n_hvg", str(N_HVG),
    "--reference_time", REFERENCE_TIME,
    "--time_point_order", *TIME_POINT_ORDER,
    "--gene_median_path", GENE_MEDIAN_PATH,
    "--token_dict_path", TOKEN_DICT_PATH,
    "--gene_mapping_path", GENE_MAPPING_PATH,
]

if PAIRING_MODE == "mapping":
    cmd += ["--pairing_file", PAIRING_FILE]
if OPT_PAIRING_OBS:
    cmd += ["--opt_pairing_obs", *OPT_PAIRING_OBS]

print(" ".join(cmd))
python -m perturbgen tokenise --h5ad_path /nfs/team361/am74/Cytomeister/Evaluation_datasets/LPS/lps_otar.h5ad --dataset LPS_all_tps_2k --gene_filtering_mode hvg --hvg_mode before_tokenisation --var_list cell_type_harmonized time_after_LPS --pairing_mode stratified --time_obs time_after_LPS --main_pairing_obs cell_type_harmonized --nproc 8 --n_hvg 2000 --reference_time normal --time_point_order normal 90m_LPS 6h_LPS 10h_LPS --gene_median_path Perturbgen/perturbgen/pp/gene_median_dict_gftokens_gc95M.pkl --token_dict_path Perturbgen/perturbgen/pp/token_dict_gftokens_gc95M.pkl --gene_mapping_path Perturbgen/perturbgen/pp/ensembl_mapping_dict_gc95M.pkl

2.3. Run tokenization (CPU-only)

This step can take time depending on dataset size.

import subprocess

subprocess.run(cmd, check=True)
loading, please wait...
Current working directory: /lustre/scratch126/cellgen/lotfollahi/dv8/for_otar
Start preprocessing adata...
Number of genes dropped: 0
Finished preprocessing adata.
Start tokenisation of adata...
Tokenizing /lustre/scratch126/cellgen/lotfollahi/dv8/for_otar/T_perturb/tokenized_data/LPS_all_tps_2k/h5ad_pairing_2000_hvg/LPS_all_tps_2k.h5ad
/lustre/scratch126/cellgen/lotfollahi/dv8/for_otar/Perturbgen/perturbgen/pp/tokenizer.py:495: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  for i in adata.var["ensembl_id_collapsed"][coding_miRNA_loc]
/lustre/scratch126/cellgen/lotfollahi/dv8/for_otar/Perturbgen/perturbgen/pp/tokenizer.py:498: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  coding_miRNA_ids = adata.var["ensembl_id_collapsed"][coding_miRNA_loc]
/lustre/scratch126/cellgen/lotfollahi/dv8/for_otar/T_perturb/tokenized_data/LPS_all_tps_2k/h5ad_pairing_2000_hvg/LPS_all_tps_2k.h5ad has no column attribute 'filter_pass'; tokenizing all cells.
Creating dataset.
Map (num_proc=4): 100%|██████████| 223478/223478 [00:29<00:00, 7559.59 examples/s] 
Saving the dataset (1/1 shards): 100%|██████████| 223478/223478 [00:00<00:00, 458199.76 examples/s]
Finished tokenisation.
/lustre/scratch126/cellgen/lotfollahi/dv8/for_otar/Perturbgen/perturbgen/src/utils.py:1643: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  adata_obs_.groupby(grouping_obs)[time_obs].transform('nunique') == total_tps
/lustre/scratch126/cellgen/lotfollahi/dv8/for_otar/Perturbgen/perturbgen/src/utils.py:1646: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  grouped = adata_grouped.groupby(grouping_obs)
Map (num_proc=4): 100%|██████████| 223478/223478 [01:33<00:00, 2399.51 examples/s]
  0%|          | 0/4 [00:00<?, ?it/s]/lustre/scratch126/cellgen/lotfollahi/dv8/for_otar/Perturbgen/.venv/lib/python3.11/site-packages/anndata/_core/aligned_df.py:68: ImplicitModificationWarning: Transforming to str index.
  warnings.warn("Transforming to str index.", ImplicitModificationWarning)

Saving the dataset (0/1 shards):   0%|          | 0/148107 [00:00<?, ? examples/s]
Saving the dataset (0/1 shards):   1%|          | 1000/148107 [00:00<00:36, 4022.79 examples/s]
Saving the dataset (0/1 shards):   1%|▏         | 2000/148107 [00:00<00:24, 5861.26 examples/s]
Saving the dataset (0/1 shards):   3%|▎         | 4000/148107 [00:00<00:16, 8494.98 examples/s]
Saving the dataset (0/1 shards):   4%|▍         | 6000/148107 [00:00<00:16, 8565.91 examples/s]
Saving the dataset (0/1 shards):   5%|▍         | 7000/148107 [00:00<00:16, 8508.87 examples/s]
Saving the dataset (0/1 shards):   6%|▌         | 9000/148107 [00:01<00:13, 10273.72 examples/s]
Saving the dataset (0/1 shards):   7%|▋         | 11000/148107 [00:01<00:12, 11398.90 examples/s]
Saving the dataset (0/1 shards):   9%|▉         | 13000/148107 [00:01<00:11, 11534.82 examples/s]
Saving the dataset (0/1 shards):  10%|█         | 15000/148107 [00:01<00:11, 11741.47 examples/s]
Saving the dataset (0/1 shards):  11%|█▏        | 17000/148107 [00:01<00:10, 12271.56 examples/s]
Saving the dataset (0/1 shards):  13%|█▎        | 19000/148107 [00:01<00:09, 13040.10 examples/s]
Saving the dataset (0/1 shards):  14%|█▍        | 21000/148107 [00:01<00:09, 13350.15 examples/s]
Saving the dataset (0/1 shards):  16%|█▌        | 24000/148107 [00:02<00:07, 15798.87 examples/s]
Saving the dataset (0/1 shards):  18%|█▊        | 27000/148107 [00:02<00:07, 17165.03 examples/s]
Saving the dataset (0/1 shards):  21%|██        | 31000/148107 [00:02<00:05, 20105.64 examples/s]
Saving the dataset (0/1 shards):  24%|██▎       | 35000/148107 [00:02<00:04, 22851.98 examples/s]
Saving the dataset (0/1 shards):  26%|██▋       | 39000/148107 [00:02<00:04, 26076.16 examples/s]
Saving the dataset (0/1 shards):  29%|██▉       | 43000/148107 [00:02<00:04, 25776.42 examples/s]
Saving the dataset (0/1 shards):  33%|███▎      | 49000/148107 [00:02<00:03, 31866.48 examples/s]
Saving the dataset (0/1 shards):  37%|███▋      | 55000/148107 [00:03<00:02, 37425.21 examples/s]
Saving the dataset (0/1 shards):  41%|████      | 60000/148107 [00:03<00:02, 39768.59 examples/s]
Saving the dataset (0/1 shards):  45%|████▌     | 67000/148107 [00:03<00:01, 44655.50 examples/s]
Saving the dataset (0/1 shards):  50%|████▉     | 74000/148107 [00:03<00:01, 46819.13 examples/s]
Saving the dataset (0/1 shards):  55%|█████▌    | 82000/148107 [00:03<00:01, 54587.43 examples/s]
Saving the dataset (0/1 shards):  60%|██████    | 89000/148107 [00:03<00:02, 27999.20 examples/s]
Saving the dataset (0/1 shards):  63%|██████▎   | 94000/148107 [00:04<00:01, 29521.77 examples/s]
Saving the dataset (0/1 shards):  68%|██████▊   | 100000/148107 [00:04<00:01, 33540.70 examples/s]
Saving the dataset (0/1 shards):  73%|███████▎  | 108000/148107 [00:04<00:00, 41849.53 examples/s]
Saving the dataset (0/1 shards):  80%|███████▉  | 118000/148107 [00:04<00:00, 52490.38 examples/s]
Saving the dataset (0/1 shards):  86%|████████▌ | 127000/148107 [00:04<00:00, 59553.49 examples/s]
Saving the dataset (0/1 shards):  93%|█████████▎| 137000/148107 [00:04<00:00, 67642.13 examples/s]
Saving the dataset (0/1 shards):  99%|█████████▉| 147000/148107 [00:04<00:00, 73855.08 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 148107/148107 [00:04<00:00, 30679.80 examples/s]
 25%|██▌       | 1/4 [00:42<02:06, 42.17s/it]/lustre/scratch126/cellgen/lotfollahi/dv8/for_otar/Perturbgen/.venv/lib/python3.11/site-packages/anndata/_core/aligned_df.py:68: ImplicitModificationWarning: Transforming to str index.
  warnings.warn("Transforming to str index.", ImplicitModificationWarning)
/lustre/scratch126/cellgen/lotfollahi/dv8/for_otar/Perturbgen/.venv/lib/python3.11/site-packages/anndata/_core/anndata.py:1758: UserWarning: Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
  utils.warn_names_duplicates("obs")

Saving the dataset (0/1 shards):   0%|          | 0/148107 [00:00<?, ? examples/s]
Saving the dataset (0/1 shards):   1%|          | 1000/148107 [00:00<00:27, 5442.45 examples/s]
Saving the dataset (0/1 shards):   5%|▌         | 8000/148107 [00:00<00:04, 30728.95 examples/s]
Saving the dataset (0/1 shards):   9%|▉         | 13000/148107 [00:00<00:03, 36805.37 examples/s]
Saving the dataset (0/1 shards):  16%|█▌        | 23000/148107 [00:00<00:02, 56483.40 examples/s]
Saving the dataset (0/1 shards):  23%|██▎       | 34000/148107 [00:00<00:01, 71237.55 examples/s]
Saving the dataset (0/1 shards):  30%|███       | 45000/148107 [00:00<00:01, 80922.66 examples/s]
Saving the dataset (0/1 shards):  37%|███▋      | 55000/148107 [00:00<00:01, 85035.23 examples/s]
Saving the dataset (0/1 shards):  44%|████▍     | 65000/148107 [00:00<00:00, 86037.80 examples/s]
Saving the dataset (0/1 shards):  50%|████▉     | 74000/148107 [00:01<00:00, 85103.32 examples/s]
Saving the dataset (0/1 shards):  57%|█████▋    | 84000/148107 [00:01<00:00, 85710.06 examples/s]
Saving the dataset (0/1 shards):  63%|██████▎   | 94000/148107 [00:01<00:00, 86442.46 examples/s]
Saving the dataset (0/1 shards):  71%|███████   | 105000/148107 [00:01<00:00, 90388.36 examples/s]
Saving the dataset (0/1 shards):  78%|███████▊  | 115000/148107 [00:01<00:00, 91664.24 examples/s]
Saving the dataset (0/1 shards):  84%|████████▍ | 125000/148107 [00:01<00:00, 93846.51 examples/s]
Saving the dataset (0/1 shards):  95%|█████████▍| 140000/148107 [00:01<00:00, 92515.21 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 148107/148107 [00:01<00:00, 79380.54 examples/s]
 50%|█████     | 2/4 [01:21<01:20, 40.48s/it]

2.4. Outputs

Tokenized files are written under the tokenized data directory defined in perturbgen/configs/paths.py. You should see a new folder for your dataset name containing .dataset files and pairing outputs in root_dir/T_perturb/tokenized_data.


Next: training steps (masking model and count decoder) in the following sections. See Notebook 3 for training the Perturbgen model