{ "cells": [ { "cell_type": "markdown", "id": "tamil-advancement", "metadata": {}, "source": [ "# 1. Data preprocessing and curation" ] }, { "cell_type": "markdown", "id": "c0d1d069", "metadata": {}, "source": [ "This notebook prepares the public and private LPS datasets and demonstrates how to merge them for training the PerturbGen model." ] }, { "cell_type": "markdown", "id": "d8b14104", "metadata": {}, "source": [ "## 1.1. Public LPS data" ] }, { "cell_type": "code", "execution_count": null, "id": "1bba2a08", "metadata": {}, "outputs": [], "source": [ "import gdown # downloads from Google Drive links and plain URLs\n", "import requests" ] }, { "cell_type": "markdown", "id": "da1d0485", "metadata": {}, "source": [ "Download the public LPS dataset (E-MTAB-10026) from BioStudies" ] }, { "cell_type": "code", "execution_count": null, "id": "broke-mexico", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Downloading...\n", "From: https://ftp.ebi.ac.uk/biostudies/fire/E-MTAB-/026/E-MTAB-10026/Files/covid_portal_210320_with_raw.h5ad\n", "To: /home/jovyan/farm_mount/Cytomeister/Evaluation_datasets/LPS/Public_Emily/Raw_h5ad.zip\n", "100%|██████████| 7.19G/7.19G [01:31<00:00, 78.9MB/s]\n" ] }, { "data": { "text/plain": [ "'./Raw_h5ad.zip'" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "url = 'https://ftp.ebi.ac.uk/biostudies/fire/E-MTAB-/026/E-MTAB-10026/Files/covid_portal_210320_with_raw.h5ad'\n", "output = './Raw.h5ad' # must match the path passed to sc.read below\n", "gdown.download(url, output) # download the file" ] }, { "cell_type": "markdown", "id": "17127c7f", "metadata": {}, "source": [ "Import libraries" ] }, { "cell_type": "code", "execution_count": null, "id": "98744b7e", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import scanpy as sc\n", "import numpy as np\n", "import seaborn as sns" ] }, { "cell_type": "code", "execution_count": null, "id": "functional-vertex", "metadata": {}, "outputs": [], 
"source": [ "# Read the downloaded h5ad file\n", "adata = sc.read('./Raw.h5ad')" ] }, { "cell_type": "markdown", "id": "3e8a88f7", "metadata": {}, "source": [ "Here, we investigate the .h5ad file to identify LPS samples and healthy control cells in the downloaded LPS data." ] }, { "cell_type": "code", "execution_count": 22, "id": "understood-tutorial", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "AnnData object with n_obs × n_vars = 647366 × 24929\n", " obs: 'sample_id', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'full_clustering', 'initial_clustering', 'Resample', 'Collection_Day', 'Sex', 'Age_interval', 'Swab_result', 'Status', 'Smoker', 'Status_on_day_collection', 'Status_on_day_collection_summary', 'Days_from_onset', 'Site', 'time_after_LPS', 'Worst_Clinical_Status', 'Outcome', 'patient_id'\n", " var: 'feature_types'\n", " uns: 'hvg', 'leiden', 'neighbors', 'pca', 'umap'\n", " obsm: 'X_pca', 'X_pca_harmony', 'X_umap'\n", " layers: 'raw'" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adata" ] }, { "cell_type": "code", "execution_count": null, "id": "objective-offset", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "time_after_LPS\n", "nan 639482\n", "90m 3999\n", "10h 3885\n", "Name: count, dtype: int64" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adata.obs['time_after_LPS'].value_counts() # check time points available" ] }, { "cell_type": "code", "execution_count": null, "id": "verified-italian", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Status\n", "Covid 527286\n", "Healthy 97039\n", "Non_covid 15157\n", "LPS 7884\n", "Name: count, dtype: int64" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adata.obs['Status'].value_counts() # check conditions available" ] }, { "cell_type": "code", "execution_count": null, "id": "generic-livestock", 
"metadata": {}, "outputs": [], "source": [ "lps_sample_ids = adata.obs.loc[adata.obs['Status'] == 'LPS', 'patient_id'].unique() # get patient ids for LPS samples\n", "\n", "healthy_sample_ids = adata.obs.loc[adata.obs['Status'] == 'Healthy', 'patient_id'].unique() # get patient ids for Healthy samples" ] }, { "cell_type": "code", "execution_count": null, "id": "advanced-canal", "metadata": { "tags": [] }, "outputs": [], "source": [ "adata = adata[adata.obs['Status'].isin(['LPS','Healthy'])].copy() # keep only LPS and Healthy samples" ] }, { "cell_type": "markdown", "id": "9e5a3cde", "metadata": {}, "source": [ "Next, we tidy the metadata to make downstream analyses more convenient, e.g. by converting the `time_after_LPS` column into an ordered categorical." ] }, { "cell_type": "code", "execution_count": null, "id": "sensitive-northern", "metadata": {}, "outputs": [], "source": [ "adata.obs['time_after_LPS'] = adata.obs['time_after_LPS'].astype(str) # convert to string\n", "adata.obs['time_after_LPS'] = adata.obs['time_after_LPS'].replace('nan', 'normal') # cells without an LPS time point are the healthy controls" ] }, { "cell_type": "code", "execution_count": 30, "id": "23d4552a-7c48-4603-af08-c0ee6e05903a", "metadata": {}, "outputs": [], "source": [ "adata.obs['time_after_LPS'] = adata.obs['time_after_LPS'].apply(lambda x: x if x == 'normal' else f'{x}_LPS')\n", "\n", "adata.obs['time_after_LPS'] = pd.Categorical(\n", " adata.obs['time_after_LPS'],\n", " categories=['normal', '90m_LPS', '10h_LPS'],\n", " ordered=True\n", ")" ] }, { "cell_type": "code", "execution_count": 31, "id": "soviet-simulation", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "time_after_LPS\n", "normal 97039\n", "90m_LPS 3999\n", "10h_LPS 3885\n", "Name: count, dtype: int64" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adata.obs['time_after_LPS'].value_counts()" ] }, { "cell_type": "code", "execution_count": 32, "id": "plain-nurse", "metadata": {}, "outputs": [], 
"source": [ "adata.obs['donor_id'] = adata.obs['patient_id']" ] }, { "cell_type": "code", "execution_count": 37, "id": "07dc3ac2-3623-432c-b61a-4413a0ae3638", "metadata": {}, "outputs": [], "source": [ "adata.obs['study'] = 'Public_Emily2021'" ] }, { "cell_type": "code", "execution_count": 36, "id": "labeled-disaster", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Age_interval\n", "(30, 39] 30538\n", "(50, 59] 26344\n", "(20, 29] 20426\n", "(40, 49] 15926\n", "(60, 69] 9970\n", "(70, 79] 1719\n", "Name: count, dtype: int64" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adata.obs['Age_interval'].value_counts()" ] }, { "cell_type": "code", "execution_count": 38, "id": "simplified-sender", "metadata": {}, "outputs": [], "source": [ "mapping = {\n", " '(20, 29]': 'Adult', \n", " '(30, 39]': 'Adult',\n", " '(40, 49]': 'Adult',\n", " '(50, 59]': 'Adult',\n", " '(60, 69]': 'Adult',\n", " '(70, 79]': 'Old'\n", "}\n" ] }, { "cell_type": "code", "execution_count": 39, "id": "earned-gallery", "metadata": {}, "outputs": [], "source": [ "adata.obs['life_stage'] = adata.obs['Age_interval'].map(mapping)\n", "\n", "\n", "adata.obs['life_stage'] = pd.Categorical(\n", " adata.obs['life_stage'],\n", " categories=['Embryo', 'Fetal', 'Childhood', 'Young Adult', 'Adult', 'Old'],\n", " ordered=True\n", ")\n" ] }, { "cell_type": "code", "execution_count": 40, "id": "56e64b5a-88e0-450c-85c7-dd868b2ff74f", "metadata": {}, "outputs": [], "source": [ "adata.obs['development_stage'] = adata.obs['life_stage']" ] }, { "cell_type": "code", "execution_count": 41, "id": "e6d56d3d-afcb-4867-abc3-b2a417be64f5", "metadata": {}, "outputs": [], "source": [ "adata.obs['tissue'] = 'blood'" ] }, { "cell_type": "code", "execution_count": 45, "id": "67f058ae-8547-42bb-b97c-546d62b5731d", "metadata": {}, "outputs": [], "source": [ "adata.X = adata.layers['raw'].copy()" ] }, { "cell_type": "code", "execution_count": 46, "id": 
"e31c5979-704f-49cb-bf2d-2d949986c1db", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "np.float32(275279.0)" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adata.X.max() # sanity check: raw counts should reach large integer values" ] }, { "cell_type": "code", "execution_count": 51, "id": "65175042-d236-4102-a2f6-da2015beb8bc", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "covid_index\n", "AAACCTGAGACCACGA-newcastle65 Initial\n", "AAACCTGAGATGTCGG-newcastle65 Initial\n", "AAACCTGAGGCGATAC-newcastle65 Initial\n", "AAACCTGAGTACACCT-newcastle65 Initial\n", "AAACCTGAGTGAATTG-newcastle65 Initial\n", " ... \n", "BGCV15_TTTCCTCTCTGATACG-1 Initial\n", "BGCV15_TTTCCTCTCTTTAGGG-1 Initial\n", "BGCV15_TTTGCGCTCACCGTAA-1 Initial\n", "BGCV15_TTTGGTTTCAAGATCC-1 Initial\n", "BGCV15_TTTGTCACAAGCCATT-1 Initial\n", "Name: Resample, Length: 104923, dtype: category\n", "Categories (1, object): ['Initial']" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adata.obs['Resample']" ] }, { "cell_type": "code", "execution_count": 52, "id": "090676e0-9ddb-49e5-9fbe-09a3ffdf3d0a", "metadata": {}, "outputs": [], "source": [ "# Subset to Gene Expression features; copy so later edits to adata.var do not act on a view\n", "adata = adata[:, adata.var['feature_types'] == 'Gene Expression'].copy()\n" ] }, { "cell_type": "code", "execution_count": null, "id": "e27651a7-dcac-4e2f-b048-2c82a50f662e", "metadata": {}, "outputs": [], "source": [ "# BioMart export mapping HGNC symbols to Ensembl gene IDs (GRCh38.p14)\n", "ref = pd.read_csv(\"../../../../Ensembl_symbol_Human_(GRCh38.p14)/mart_export.txt\")\n" ] }, { "cell_type": "code", "execution_count": 56, "id": "ae9dd219-a3e6-4883-a436-7d46a6b05840", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | feature_types |
|---|---|
| MIR1302-2HG | Gene Expression |
| AL627309.1 | Gene Expression |
| AL627309.3 | Gene Expression |
| AL627309.2 | Gene Expression |
| AL669831.2 | Gene Expression |
| ... | ... |
| AC007325.2 | Gene Expression |
| AL354822.1 | Gene Expression |
| AC233755.2 | Gene Expression |
| AC233755.1 | Gene Expression |
| AC240274.1 | Gene Expression |

24737 rows × 1 columns

| gene_id | feature_types | gene_symbol | ensembl_id |
|---|---|---|---|
| ENSG00000243485 | Gene Expression | MIR1302-2HG | ENSG00000243485 |
| ENSG00000177757 | Gene Expression | FAM87B | ENSG00000177757 |
| ENSG00000225880 | Gene Expression | LINC00115 | ENSG00000225880 |
| ENSG00000230368 | Gene Expression | FAM41C | ENSG00000230368 |
| ENSG00000187634 | Gene Expression | SAMD11 | ENSG00000187634 |
| ... | ... | ... | ... |
| ENSG00000198886 | Gene Expression | MT-ND4 | ENSG00000198886 |
| ENSG00000198786 | Gene Expression | MT-ND5 | ENSG00000198786 |
| ENSG00000198695 | Gene Expression | MT-ND6 | ENSG00000198695 |
| ENSG00000198727 | Gene Expression | MT-CYB | ENSG00000198727 |
| ENSG00000274847 | Gene Expression | MAFIP | ENSG00000274847 |

19193 rows × 3 columns

| | HGNC symbol | Gene stable ID |
|---|---|---|
| 0 | MT-TF | ENSG00000210049 |
| 1 | MT-RNR1 | ENSG00000211459 |
| 2 | MT-TV | ENSG00000210077 |
| 3 | MT-RNR2 | ENSG00000210082 |
| 4 | MT-TL1 | ENSG00000209082 |
| ... | ... | ... |
| 86363 | SNHG12 | ENSG00000197989 |
| 86364 | TAF12-DT | ENSG00000229388 |
| 86365 | NaN | ENSG00000289291 |
| 86366 | RNU11 | ENSG00000274978 |
| 86367 | NaN | ENSG00000296488 |

86368 rows × 2 columns
\n", "