The FACSPy dataset: Structure#

This vignette is supposed to explain the structure of a FACSPy dataset.

FACSPy is built around the AnnData object. For further information on AnnData objects, please refer to their official documentation.

This notebook covers a brief explanation of the AnnData slots and where essential information is stored.

Later on, FACSPy-specific features will be introduced that enable the user to analyze specific populations from within the same AnnData object.

A quick overview:#

The .obs slot contains all cell-specific metadata.
The .var slot contains all panel-related information.
The .layers slot contains different versions of the data.
The .obsm slot contains the gating matrix as well as dimensionality reduction coordinates.
The .uns slot contains unstructured metadata, user-supplied information and sample-wise analyses.

[1]:

import FACSPy as fp

First, we read in a dataset via the fp.read_dataset() function. This dataset has been pre-assembled and saved to the hard drive.

As we can see, the dataset consists of 3103969 cells and 36 channels.

[2]:

dataset = fp.read_dataset(input_dir = "../Tutorials/spectral_dataset/",
                          file_name = "raw_dataset")
dataset

[2]:

AnnData object with n_obs × n_vars = 3103969 × 36
    obs: 'sample_ID', 'file_name', 'group_fd', 'internal_id', 'organ', 'staining', 'diag_main', 'diag_fine', 'donor_id', 'material', 'batch'
    var: 'pns', 'png', 'pne', 'type', 'pnn', 'cofactors'
    uns: 'metadata', 'panel', 'workspace', 'gating_cols', 'dataset_status_hash', 'cofactors', 'raw_cofactors', 'cofactor_table'
    obsm: 'gating'
    layers: 'compensated', 'transformed'

The `.obs` slot - cell metadata#

.obs stores the observation’s characteristics, in this case the metadata. The metadata that were provided during dataset assembly are copied for each cell of the respective sample.

The .obs slot contains a pandas.DataFrame that can be modified using conventional pandas methods.

[3]:

dataset.obs.head()

[3]:

	sample_ID	file_name	group_fd	internal_id	organ	staining	diag_main	diag_fine	donor_id	material	batch
0-0	1	3742.fcs	healthy	3742	PB	stained	healthy	healthy	3742	PBMC	1
1-0	1	3742.fcs	healthy	3742	PB	stained	healthy	healthy	3742	PBMC	1
2-0	1	3742.fcs	healthy	3742	PB	stained	healthy	healthy	3742	PBMC	1
3-0	1	3742.fcs	healthy	3742	PB	stained	healthy	healthy	3742	PBMC	1
4-0	1	3742.fcs	healthy	3742	PB	stained	healthy	healthy	3742	PBMC	1

Indexing `.obs` - Selecting groups of cells#

In order to select specific subsets of the data that can be described in terms of their metadata we can use indexing similar to indexing numpy arrays and pandas dataframes.

Here, we subset the dataset so that only cells from the sample with sample_ID 1 are kept.

[4]:

sample_1_subset = dataset[dataset.obs["sample_ID"] == "1",:]
sample_1_subset

[4]:

View of AnnData object with n_obs × n_vars = 183756 × 36
    obs: 'sample_ID', 'file_name', 'group_fd', 'internal_id', 'organ', 'staining', 'diag_main', 'diag_fine', 'donor_id', 'material', 'batch'
    var: 'pns', 'png', 'pne', 'type', 'pnn', 'cofactors'
    uns: 'metadata', 'panel', 'workspace', 'gating_cols', 'dataset_status_hash', 'cofactors', 'raw_cofactors', 'cofactor_table'
    obsm: 'gating'
    layers: 'compensated', 'transformed'

The `.var` slot - the cytometry panel#

.var stores the panel information. This information is concatenated from the .fcs file and the user supplied panel. Note that the index of the table contains the antigens and not the fluorochromes. That means that for all functions that perform visualization tasks, the antigen name is used.

The type of a channel can be either ‘time’, ‘scatter’, ‘fluo’ and ‘technical’ and is automatically determined upon dataset creation. Channels of type ‘fluo’ correspond to channels that are neither recognized as ‘time’, ‘scatter’ or ‘technical’.

[5]:

dataset.var.head(10)

[5]:

	pns	png	pne	type	pnn	cofactors
Time	Time	1.0	(0.0, 0.0)	time	Time	1.0
SSC-H	SSC-H	1.0	(0.0, 0.0)	scatter	SSC-H	1.0
SSC-A	SSC-A	1.0	(0.0, 0.0)	scatter	SSC-A	1.0
FSC-H	FSC-H	1.0	(0.0, 0.0)	scatter	FSC-H	1.0
FSC-A	FSC-A	1.0	(0.0, 0.0)	scatter	FSC-A	1.0
SSC-B-H	SSC-B-H	1.0	(0.0, 0.0)	scatter	SSC-B-H	1.0
SSC-B-A	SSC-B-A	1.0	(0.0, 0.0)	scatter	SSC-B-A	1.0
CD38	CD38	1.0	(0.0, 0.0)	fluo	BUV395-A	1298.1292
NKG2C_(CD159c)	NKG2C_(CD159c)	1.0	(0.0, 0.0)	fluo	BUV496-A	1132.2448
CD3	CD3	1.0	(0.0, 0.0)	fluo	BUV563-A	5287.9663

The individual channels can be accessed by the .var_names attributes

[6]:

dataset.var_names

[6]:

Index(['Time', 'SSC-H', 'SSC-A', 'FSC-H', 'FSC-A', 'SSC-B-H', 'SSC-B-A',
       'CD38', 'NKG2C_(CD159c)', 'CD3', 'CD16', 'CD161', 'CD32', 'CD56',
       '41BB_(CD137)', 'CD4', 'CD64', 'KLRG1', 'CD45', 'HLA_DR', 'CD19',
       'NKp44', 'CD69', 'TIGIT', 'CD57', 'CD8', 'CD14', 'CD27',
       'NKG2A_(CD159a)', 'CTLA-4_(CD152)', 'TRAIL_(CD253)', 'PD-1_(CD279) ',
       'CD18', 'Zombie_NIR', 'CD66b', 'AF-A'],
      dtype='object')

Indexing `.var` - Selecting specific channels#

Similar to the .obs indexing, we use numpy/pandas indexing to subselect data corresponding to specific channels.

Here, we subset the dataset so that only the channels corresponding to CD3 and CD19 are kept.

[7]:

channel_subset = dataset[:,dataset.var["pns"].isin(["CD3", "CD19"])]
channel_subset

[7]:

View of AnnData object with n_obs × n_vars = 3103969 × 2
    obs: 'sample_ID', 'file_name', 'group_fd', 'internal_id', 'organ', 'staining', 'diag_main', 'diag_fine', 'donor_id', 'material', 'batch'
    var: 'pns', 'png', 'pne', 'type', 'pnn', 'cofactors'
    uns: 'metadata', 'panel', 'workspace', 'gating_cols', 'dataset_status_hash', 'cofactors', 'raw_cofactors', 'cofactor_table'
    obsm: 'gating'
    layers: 'compensated', 'transformed'

[8]:

channel_subset.var

[8]:

	pns	png	pne	type	pnn	cofactors
CD3	CD3	1.0	(0.0, 0.0)	fluo	BUV563-A	5287.9663
CD19	CD19	1.0	(0.0, 0.0)	fluo	BV650-A	617.30945

FACSPy offers convenience functions in order to subset specific channels.

Here, we only keep the channels of type ‘fluo’.

Note that in order to perform analysis, you can decide whether to include scatter/time/technical channels in the analysis in the respective functions. It is therefore not strictly necessary to subset fluo channels in advance.

[9]:

fluo_only = fp.subset_fluo_channels(dataset,
                                    copy = True)
fluo_only.var.head()

[9]:

	pns	png	pne	type	pnn	cofactors
CD38	CD38	1.0	(0.0, 0.0)	fluo	BUV395-A	1298.1292
NKG2C_(CD159c)	NKG2C_(CD159c)	1.0	(0.0, 0.0)	fluo	BUV496-A	1132.2448
CD3	CD3	1.0	(0.0, 0.0)	fluo	BUV563-A	5287.9663
CD16	CD16	1.0	(0.0, 0.0)	fluo	BUV615-A	14275.482
CD161	CD161	1.0	(0.0, 0.0)	fluo	BUV661-A	2890.5862

In order to remove a specific channel, you can use the fp.remove_channel() function Here, we remove the channel corresponding to the antigen CD16.

[10]:

without_CD16 = fp.remove_channel(dataset, "CD16",
                                 copy = True)
assert "CD16" not in without_CD16.var_names

The `.obsm slot` - gating matrix#

By default, the .obsm slot stores the gating information retrieved from a workspace or added by the user as a sparse matrix.

[11]:

type(dataset.obsm["gating"])

[11]:

scipy.sparse._csr.csr_matrix

This matrix is a boolean matrix where ‘True’ means that a cell is within the gate and False means the opposite.

The gate names are stored in .uns['gating_cols']. We assemble the matrix, the index and the gate-names into a pandas dataframe:

[12]:

import pandas as pd
gating = pd.DataFrame(data = dataset.obsm["gating"].toarray(),
                      index = dataset.obs_names,
                      columns = dataset.uns["gating_cols"])
gating.head()

[12]:

	root/all_cells	root/all_cells/FSC_singlets	root/all_cells/FSC_singlets/live	root/all_cells/FSC_singlets/live/CD45+	root/all_cells/FSC_singlets/live/CD45+/PBMC	root/all_cells/FSC_singlets/live/CD45+/PBMC/CD19neg,_CD14neg	root/all_cells/FSC_singlets/live/CD45+/PBMC/CD19neg,_CD14neg/NK	root/all_cells/FSC_singlets/live/CD45+/PBMC/CD19neg,_CD14neg/T	root/all_cells/FSC_singlets/live/CD45+/PBMC/CD19neg,_CD14neg/T/CD8+T
0-0	True	True	True	True	True	True	False	True	False
1-0	True	True	True	True	True	True	False	False	False
2-0	True	True	True	True	True	True	False	True	False
3-0	True	True	True	False	False	False	False	False	False
4-0	True	True	True	True	True	True	False	True	True

The user is not supposed to access this matrix directly. Instead, we can use FACSPy functions to access specific gates.

Here, we subset only live cells.

Note that we are using the copy parameter to copy the live cells into the variable live_cells.

[13]:

live_cells = fp.subset_gate(dataset,
                            "live",
                            copy = True)
live_cells

[13]:

AnnData object with n_obs × n_vars = 2367839 × 36
    obs: 'sample_ID', 'file_name', 'group_fd', 'internal_id', 'organ', 'staining', 'diag_main', 'diag_fine', 'donor_id', 'material', 'batch'
    var: 'pns', 'png', 'pne', 'type', 'pnn', 'cofactors'
    uns: 'metadata', 'panel', 'workspace', 'gating_cols', 'dataset_status_hash', 'cofactors', 'raw_cofactors', 'cofactor_table'
    obsm: 'gating'
    layers: 'compensated', 'transformed'

If instead of copying we want to discard all other cells that are not within the live cells, we set the parameter copy to False or leave it as default by not explicitly passing it.

For the rest of the vignette, we only keep live cells.

[14]:

fp.subset_gate(dataset, "live")

The `.obsm` slot - dimensionality reduction coordinates#

The .obsm slot also stores coordinates for dimensionality reductions.

Here, we calculate a PCA on the transformed data for CD45+ cells.

[15]:

fp.tl.pca(dataset,
          layer = "transformed",
          gate = "CD45+")

dataset

[15]:

AnnData object with n_obs × n_vars = 2367839 × 36
    obs: 'sample_ID', 'file_name', 'group_fd', 'internal_id', 'organ', 'staining', 'diag_main', 'diag_fine', 'donor_id', 'material', 'batch'
    var: 'pns', 'png', 'pne', 'type', 'pnn', 'cofactors'
    uns: 'metadata', 'panel', 'workspace', 'gating_cols', 'dataset_status_hash', 'cofactors', 'raw_cofactors', 'cofactor_table', 'settings', 'pca_CD45+_transformed'
    obsm: 'gating', 'X_pca_CD45+_transformed'
    varm: 'pca_CD45+_transformed'
    layers: 'compensated', 'transformed'

Similar to the mfi dataframe, we store the dimensionality reduction, the gate and the data layer within the slot entry name. That way, we can store dimensionality reductions, neighbors information etc. for multiple gates at the same time and easily access them.

The plotting utility will use the corresponding information to access the correct slots.

[16]:

fp.pl.pca(dataset,
          layer = "transformed",
          gate = "CD45+",
          color = "organ")

C:\Users\tarik\anaconda3\envs\FACSPypeline\lib\site-packages\scanpy\plotting\_tools\scatterplots.py:1208: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if not is_categorical_dtype(values):
C:\Users\tarik\anaconda3\envs\FACSPypeline\lib\site-packages\scanpy\plotting\_tools\scatterplots.py:1217: FutureWarning: The default value of 'ignore' for the `na_action` parameter in pandas.Categorical.map is deprecated and will be changed to 'None' in a future version. Please set na_action to the desired value to avoid seeing this warning
  color_vector = pd.Categorical(values.map(color_map))
C:\Users\tarik\anaconda3\envs\FACSPypeline\lib\site-packages\scanpy\plotting\_tools\scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(

../_images/vignettes_dataset_structure_31_1.png

The `.layers` slot - data storage#

The .layers slot contains different versions of the data. By default, we have compensated data that are created from the .fcs file and a compensation matrix.

Here, we also have transformed data that were created by fp.dt.transform(). Since we use both data formats for different analysis types, we keep them both readily available in the same dataset.

[17]:

dataset

[17]:

AnnData object with n_obs × n_vars = 2367839 × 36
    obs: 'sample_ID', 'file_name', 'group_fd', 'internal_id', 'organ', 'staining', 'diag_main', 'diag_fine', 'donor_id', 'material', 'batch'
    var: 'pns', 'png', 'pne', 'type', 'pnn', 'cofactors'
    uns: 'metadata', 'panel', 'workspace', 'gating_cols', 'dataset_status_hash', 'cofactors', 'raw_cofactors', 'cofactor_table', 'settings', 'pca_CD45+_transformed', 'organ_colors'
    obsm: 'gating', 'X_pca_CD45+_transformed'
    varm: 'pca_CD45+_transformed'
    layers: 'compensated', 'transformed'

We can access these layers by using anndata’s .to_df() method. Here, we convert the layer ‘compensated’ to a pandas dataframe.

[18]:

dataset.to_df(layer = "compensated").head()

[18]:

	Time	SSC-H	SSC-A	FSC-H	FSC-A	SSC-B-H	SSC-B-A	CD38	NKG2C_(CD159c)	CD3	...	CD14	CD27	NKG2A_(CD159a)	CTLA-4_(CD152)	TRAIL_(CD253)	PD-1_(CD279)	CD18	Zombie_NIR	CD66b	AF-A
0-0	0.0000	280669.0	2.964933e+05	1294084.0	1501242.750	239248.0	2.784677e+05	3733.869629	2428.832031	83645.859375	...	2196.446045	11616.447266	-6762.053223	228.966385	33.528599	94.960457	7313.366211	350.470825	275.108398	8161.391113
1-0	0.0034	408956.0	4.385952e+05	1095373.0	1253709.375	331308.0	3.864948e+05	-81.559128	-1160.806519	104297.421875	...	1770.571655	-821.060364	-8796.852539	-1698.822144	369.262634	488.935730	28683.802734	-980.601562	758.078125	9775.565430
2-0	0.0043	456595.0	4.995947e+05	1037646.0	1191190.000	346673.0	4.079460e+05	3523.129639	1337.915161	97211.875000	...	151.382462	3082.001221	1003.328857	67.463631	748.273804	3780.447021	40503.281250	-1788.460205	-21.585938	8605.550781
3-0	0.0055	1145342.0	1.912832e+06	1896447.0	3095241.500	1068368.0	1.890969e+06	-2893.820557	-66354.523438	117631.554688	...	66223.476562	-27051.164062	-201888.609375	-44016.957031	1557.418945	-10924.555664	127950.484375	12481.312500	7094.832031	88197.375000
4-0	0.0058	566604.0	6.517887e+05	1006409.0	1248624.875	414126.0	4.997060e+05	-10.057960	3012.665771	23707.751953	...	-1838.767578	5670.687500	4823.494141	3098.549561	717.823181	10720.612305	20161.494141	-860.871826	392.693848	3137.052734

5 rows × 36 columns

The `.uns` slot - unstructured metadata#

The .uns slot contains all data that do not fit the original shape (n_cells x n_channels) of the dataset.

By default, a Metadata object, a Panel object, a FlowJoWorkspace object, the gate names and a dataset status hash are stored here.

[19]:

print(dataset.uns["metadata"], end = "\n\n")
print(dataset.uns["panel"], end = "\n\n")
print(dataset.uns["cofactors"], end = "\n\n")
dataset

Metadata(36 entries with factors ['group_fd', 'internal_id', 'organ', 'diag_main', 'diag_fine', 'donor_id', 'material', 'batch'])

Panel(28 channels, loaded as provided file)

CofactorTable(28 channels, loaded as provided dataframe)

[19]:

AnnData object with n_obs × n_vars = 2367839 × 36
    obs: 'sample_ID', 'file_name', 'group_fd', 'internal_id', 'organ', 'staining', 'diag_main', 'diag_fine', 'donor_id', 'material', 'batch'
    var: 'pns', 'png', 'pne', 'type', 'pnn', 'cofactors'
    uns: 'metadata', 'panel', 'workspace', 'gating_cols', 'dataset_status_hash', 'cofactors', 'raw_cofactors', 'cofactor_table', 'settings', 'pca_CD45+_transformed', 'organ_colors'
    obsm: 'gating', 'X_pca_CD45+_transformed'
    varm: 'pca_CD45+_transformed'
    layers: 'compensated', 'transformed'

After calculating samplewise analyses (for example MFI/FOPs), the corresponding dataframes are also stored here.

Here, we calculate the MFI on the compensated data and in turn create a .uns entry with the name ‘mfi_sample_ID_compensated’. The entry is a pandas dataframe.

[20]:

fp.tl.mfi(dataset,
          layer = "compensated")

dataset

[20]:

AnnData object with n_obs × n_vars = 2367839 × 36
    obs: 'sample_ID', 'file_name', 'group_fd', 'internal_id', 'organ', 'staining', 'diag_main', 'diag_fine', 'donor_id', 'material', 'batch'
    var: 'pns', 'png', 'pne', 'type', 'pnn', 'cofactors'
    uns: 'metadata', 'panel', 'workspace', 'gating_cols', 'dataset_status_hash', 'cofactors', 'raw_cofactors', 'cofactor_table', 'settings', 'pca_CD45+_transformed', 'organ_colors', 'mfi_sample_ID_compensated'
    obsm: 'gating', 'X_pca_CD45+_transformed'
    varm: 'pca_CD45+_transformed'
    layers: 'compensated', 'transformed'

When we inspect the dataframe, we see that for every gate, the MFI values have been calculated per channel.

[21]:

dataset.uns["mfi_sample_ID_compensated"].head()

[21]:

		Time	SSC-H	SSC-A	FSC-H	FSC-A	SSC-B-H	SSC-B-A	CD38	NKG2C_(CD159c)	CD3	...	CD14	CD27	NKG2A_(CD159a)	CTLA-4_(CD152)	TRAIL_(CD253)	PD-1_(CD279)	CD18	Zombie_NIR	CD66b	AF-A
sample_ID	gate
1	root/all_cells	92.338799	426979.0	483348.156250	1280046.0	1.525810e+06	343562.0	416366.468750	3096.480225	421.562561	4260.855957	...	326.190948	2415.716309	-2186.655518	113.783081	567.933838	476.055206	21647.068359	64.106445	382.719727	8249.457031
2	root/all_cells	91.307152	461467.5	528586.281250	1314374.5	1.597389e+06	370216.0	454101.234375	3649.359985	752.088287	27555.433594	...	49.189646	4699.732422	-2126.341431	354.737915	765.017456	763.312592	19111.833008	380.439178	394.168945	7477.313477
3	root/all_cells	100.621799	458393.0	523310.906250	1203675.5	1.438810e+06	374794.0	456158.156250	2338.043701	541.534790	28436.040039	...	1491.264160	1293.785889	-2469.103516	-16.103633	844.625580	579.380524	17958.804688	694.674988	371.653076	6341.876709
4	root/all_cells	98.125847	412581.0	460528.484375	1208110.0	1.420063e+06	336580.0	401097.625000	3147.915283	1111.765991	43195.044922	...	1263.234619	4732.218994	-2937.132690	5.521069	671.986633	344.843430	11126.077637	737.335571	271.416504	6536.433105
5	root/all_cells	90.351402	442781.0	505627.750000	1265257.0	1.513078e+06	356986.0	436810.468750	2581.451660	1372.416504	4461.762695	...	5.471334	2372.531494	-1844.797852	191.291962	702.836304	584.618591	16370.416016	529.221924	345.144775	7698.866211

5 rows × 36 columns

Similarly, if we calculate the fop using the same settings we create an entry called ‘fop_sample_ID_compensated’

[22]:

fp.tl.fop(dataset,
          layer = "compensated")
dataset

[22]:

AnnData object with n_obs × n_vars = 2367839 × 36
    obs: 'sample_ID', 'file_name', 'group_fd', 'internal_id', 'organ', 'staining', 'diag_main', 'diag_fine', 'donor_id', 'material', 'batch'
    var: 'pns', 'png', 'pne', 'type', 'pnn', 'cofactors'
    uns: 'metadata', 'panel', 'workspace', 'gating_cols', 'dataset_status_hash', 'cofactors', 'raw_cofactors', 'cofactor_table', 'settings', 'pca_CD45+_transformed', 'organ_colors', 'mfi_sample_ID_compensated', 'fop_sample_ID_compensated'
    obsm: 'gating', 'X_pca_CD45+_transformed'
    varm: 'pca_CD45+_transformed'
    layers: 'compensated', 'transformed'

We also store analysis settings that were used for the respective function parameters in the .uns['settings'] slot.

[23]:

dataset.uns["settings"]

[23]:

{'_pca_CD45+_transformed': {'gate': 'CD45+',
  'layer': 'transformed',
  'use_only_fluo': True,
  'exclude': [],
  'scaling': None},
 '_mfi_sample_ID_compensated': {'groupby': 'sample_ID',
  'method': 'median',
  'use_only_fluo': False,
  'layer': 'compensated'},
 '_fop_sample_ID_compensated': {'groupby': 'sample_ID',
  'cutoff': array([1.0000000e+00, 1.0000000e+00, 1.0000000e+00, 1.0000000e+00,
         1.0000000e+00, 1.0000000e+00, 1.0000000e+00, 1.2981292e+03,
         1.1322448e+03, 5.2879663e+03, 1.4275482e+04, 2.8905862e+03,
         5.5698975e+03, 2.7737961e+03, 2.1693916e+03, 4.2989844e+03,
         2.0747034e+03, 2.4437014e+03, 5.0000000e+03, 7.1690107e+03,
         6.1730945e+02, 1.3852777e+03, 1.4649585e+03, 4.0099905e+03,
         1.3888629e+03, 4.0046506e+03, 4.8489380e+03, 4.0826614e+02,
         7.7138184e+02, 1.3791636e+03, 2.5787056e+03, 1.2436598e+03,
         6.2204258e+03, 4.2631496e+04, 8.8387695e+03, 1.0000000e+00],
        dtype=float32),
  'use_only_fluo': False,
  'layer': 'compensated'}}

The FACSPy dataset: Structure

Contents

The FACSPy dataset: Structure#

A quick overview:#

The .obs slot - cell metadata#

Indexing .obs - Selecting groups of cells#

The .var slot - the cytometry panel#

Indexing .var - Selecting specific channels#

The .obsm slot - gating matrix#

The .obsm slot - dimensionality reduction coordinates#

The .layers slot - data storage#

The .uns slot - unstructured metadata#