The most crucial step when setting up your own app is curating your data in preparation for loading into the browser. There are two reasons for these steps:

- `AnnData` files, which creates a database accessible by the browser

As the Curator, you are responsible for understanding the specific research questions involved in your data, and for deciding how these questions can be explored using a visualization tool.
The process of curating your data generally includes the following steps:

- `AnnData` structure
- (`additional_info.Rmd`)

Each step of data curation is described below, using the PBMC3k dataset from 10X Genomics as an example. The complete script used for curation can be found in `examples/browse_pbmc3k.R`. Additional examples of how the data curation process is applied to various data types and use cases can be found in the `examples/` subdirectory.
There are three main locations you’ll need to specify while curating your data and otherwise preparing your browser app:

- Run directory: stored as `OMICSER_RUN_DIR`. For this example, create a new directory named `omicser_test/`, which will contain the other content for this project. If you plan to make more extensive modifications to the browser, we recommend cloning the `omicser` package and using the top-level folder for your run directory.
- Raw data directory: stored as `RAW_DIR` and set to `raw_data/` in the PBMC3k example. This location is only used in the data curation steps.
- Database directory: stored as `DB_ROOT_PATH` and set to `test_db/`. Here we will be working with a single dataset (and a single database), though it is possible to visualize multiple datasets with the same browser configuration (i.e., this folder can potentially contain multiple databases).

The example data used for this tutorial are from ~3000 peripheral blood mononuclear cells (PBMCs) from a healthy donor. Further information about the data is available at the 10X Genomics website. These data have also been used in tutorials from scanpy and Seurat.
The PBMC3k example script includes code sections to help you install software/environments, download/organize data, and otherwise prepare for the data curation steps below. Please note that the data files from 10X Genomics need to be saved as `.gz` files for them to be read correctly when importing the data.
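As a rough sketch of that preparation, the three locations could be defined and created in base R as shown here. The variable names come from this tutorial; the use of `tempdir()` as the parent folder, the nesting of `raw_data/` and `test_db/` inside the run directory, and the database name `"pbmc3k"` are assumptions made only so the block runs on its own.

```r
# Assumption: a temporary folder stands in for wherever your project lives
project_parent <- tempdir()

# Run directory that holds the rest of the project content
OMICSER_RUN_DIR <- file.path(project_parent, "omicser_test")

# Raw data location (only used during curation) and database root
# Assumption: both are placed inside the run directory
RAW_DIR <- file.path(OMICSER_RUN_DIR, "raw_data")
DB_ROOT_PATH <- file.path(OMICSER_RUN_DIR, "test_db")

# Hypothetical database name; each database gets its own folder under DB_ROOT_PATH
DB_NAME <- "pbmc3k"

# Create the folders if they do not exist yet
for (d in c(RAW_DIR, file.path(DB_ROOT_PATH, DB_NAME))) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}
```

In a real project you would replace `project_parent` with a permanent location, since temporary folders are removed when the R session ends.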
The exact steps required to ingest raw data vary widely depending on your data type and how these data are formatted. The general goal is to obtain the data necessary to create an `AnnData` object containing all data and metadata required for visualization.
### `AnnData` Schema
The `AnnData` format includes three types of data:

- the data matrix
- sample metadata
- `omic` annotation (feature metadata) - e.g. gene/protein names, families, “highly variable genes”, “is marker gene”, etc.

More information on this format is available on the AnnData for R website.
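To make the three parts concrete, here is a small base R sketch that mimics the layout with toy values. It does not create a real `AnnData` object; the names `X`, `obs`, and `var` follow the usual `AnnData` convention (rows of `X` are samples, columns are omic features), and all values and column names are invented for illustration.

```r
set.seed(1)

# Data matrix: one row per sample (cell), one column per omic feature (gene)
X <- matrix(rpois(12, lambda = 5), nrow = 4, ncol = 3,
            dimnames = list(paste0("cell_", 1:4), paste0("gene_", 1:3)))

# Sample metadata: one row per row of X (toy grouping variable)
obs <- data.frame(sample_group = c("a", "a", "b", "b"),
                  row.names = rownames(X))

# Omic annotation: one row per column of X (toy "highly variable" flag)
var <- data.frame(highly_variable = c(TRUE, FALSE, TRUE),
                  row.names = colnames(X))

# The three parts must line up along the shared dimensions
stopifnot(nrow(obs) == nrow(X), nrow(var) == ncol(X))
```

The final check reflects the main constraint of the format: sample metadata is indexed by the rows of the matrix, and omic annotation by its columns.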
### Ingesting raw data
For our example, we’ll read the PBMC3k data files using the `read_10x_mtx()` function from Python’s `scanpy` package, then write the data to file in `.h5ad` format. We’ll access `scanpy` using the `reticulate` R package. If you have difficulty accessing `scanpy` in this section, please see the troubleshooting section below.
Please see the other example curation scripts to understand how other data can be processed at this step. This may include writing custom helper scripts to standardize data manipulation from files resulting from your data generation processes.
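The ingest step for 10X data might look roughly like the following helper. This is a sketch rather than the code from the example script: `ingest_10x()` is a hypothetical function name, it assumes `reticulate` and a Python environment with `scanpy` are available, and it assumes `raw_dir` points at the folder holding the 10X matrix files.

```r
# Hypothetical helper: read 10X files with scanpy and write them as .h5ad
ingest_10x <- function(raw_dir, out_file) {
  # Import scanpy through reticulate (requires a working Python setup)
  sc <- reticulate::import("scanpy")

  # Read the 10X matrix, using gene symbols as feature names
  adata <- sc$read_10x_mtx(raw_dir, var_names = "gene_symbols")

  # Write the raw data to disk in .h5ad format
  adata$write_h5ad(filename = out_file)

  invisible(out_file)
}

# Usage (not run here):
# ingest_10x(RAW_DIR, file.path(DB_ROOT_PATH, DB_NAME, "core_data.h5ad"))
```

The output name `core_data.h5ad` matches the file that the database setup step in this tutorial reads.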
AnnData
Our next step will use the raw database we just created and add the consistent structure we’ll need to orient the browser features later. The `omicser::setup_database()` function incorporates the three separate data sources (data matrix, omic feature metadata annotations, and sample metadata) into the `AnnData` object.
For our PBMC3k example, we’ll apply the `omicser::setup_database()` function:
# identify location of raw data
data_list <- list(object = file.path(DB_ROOT_PATH, DB_NAME, "core_data.h5ad"))
# create database formatted as AnnData
adata <- omicser::setup_database(database_name = DB_NAME,
                                 db_path = DB_ROOT_PATH,
                                 data_in = data_list,
                                 re_pack = TRUE)
In this case, all necessary data were ingested in the previous step. This results in a database that is consistently formatted, and can therefore be combined with other datasets in the future.
Now that the data are packed into the `AnnData` object, we can leverage additional functions from `scanpy` via `reticulate` to perform additional data post-processing. The example script for the PBMC3k data includes steps to filter, annotate, and otherwise prepare the data for visualization. The data are periodically saved at intermediate steps, which is useful when optimizing visualizations such as dimension reduction and clustering.
You have flexibility in the data curation process regarding which approaches you use, and when those steps are performed. For example, clustering and dimension reduction aren’t required to browse data, but are useful steps if you’d like to work with your data in other tools, like `cellxgene`.
The example below demonstrates a semi-automatic procedure for inferring clusters in the data using the `scanpy` Python library for processing `AnnData` objects. For more information on using `scanpy` in R, please see the troubleshooting section below.
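One possible shape for such a procedure is sketched here. It is not the exact sequence from the example script: `infer_leiden_clusters()` is a hypothetical helper, the parameter values are illustrative assumptions, and running it requires `scanpy` (plus its Leiden dependency) in the Python environment used by `reticulate`.

```r
# Hypothetical helper: normalize, reduce dimensions, and cluster with scanpy
infer_leiden_clusters <- function(adata, sc) {
  # Normalize counts per cell and log-transform (modifies adata in place)
  sc$pp$normalize_total(adata, target_sum = 1e4)
  sc$pp$log1p(adata)

  # Dimension reduction: PCA, neighborhood graph, then UMAP
  sc$pp$pca(adata)
  sc$pp$neighbors(adata, n_neighbors = 10L, n_pcs = 40L)
  sc$tl$umap(adata)

  # Leiden clustering; scanpy stores the labels in the "leiden" column of obs
  sc$tl$leiden(adata)

  adata
}

# Usage (not run here):
# sc <- reticulate::import("scanpy")
# adata <- infer_leiden_clusters(adata, sc)
```

Integer arguments are written with the `L` suffix so that `reticulate` passes them to Python as integers rather than floats.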
The ability to identify which -omics features differ among categories is perhaps the most important part of data curation. However, it is also one of the trickiest parts, as some data generation services (especially for proteomics) compute differential expression (DE) using commercial software associated with instrumentation. These algorithms leverage bespoke statistics, with unknown assumptions about the statistical tests involved, so it will be best to reformat those tables. Additionally, the schema below provides a way to standardize comparisons across datasets, which is important for browsers drawing from multi-omics data.
The differential expression table has the following fields:

- `group` - the comparison, using the format {names}V{reference} from fields below
- `names` - what are we comparing?
- `obs_name` - name of the metadata variable
- `test_type` - what statistic are we using?
- `reference` - the denominator, or the condition against which we are comparing expression values
- `comp_type` - same format as `group`, as grpVref or grpVrest, with rest representing all other conditions
- `logfoldchanges` - log2(name/reference)
- `scores` - statistic score
- `pvals` - p-values from the statistical test, e.g. t-test
- `pvals_adj` - adjusted p-value (Q)
- `versus` - label which we will choose in the browser

The `omicser::compute_de_table()` function leverages `scanpy` functions and the `AnnData` object to compute differential expression and provide a properly formatted DE table. We will need to provide the function with the following parameters:

- `adata` - the `AnnData` object
- `comp_types` - what kind of comparisons? There are two types: grpVref (each group versus a reference group) and grpVrest (each group versus all other conditions)
- `test_types` - statistical tests. See our examples or the `scanpy` documentation for which test types are available.
- `obs_names` - name of the `adata$obs` column defining the comparison groups
- `sc` - the `scanpy` module we imported with `reticulate`

Here’s an example which computes differential expression with a `wilcoxon` test of significance for each of our inferred groups (`leiden`) with respect to the rest of the groups. (These groups were inferred programmatically via dimension reduction and a “leiden” clustering procedure using `scanpy` tools; cf. line 207 of `pbmc3k_curate_and_config.R`.)
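# import scanpy through reticulate
sc <- reticulate::import("scanpy")
# Wilcoxon test, comparing each leiden cluster against the rest
test_types <- c("wilcoxon")
comp_types <- c("grpVrest")
obs_names <- c("leiden")
diff_exp <- omicser::compute_de_table(adata, comp_types, test_types, obs_names, sc)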
The DE table output from this function should be saved to the database folder as `db_de_table.rds`:
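A minimal sketch of this step is shown here, assuming the DE table is a data frame. `save_de_table()` is a hypothetical helper, not part of `omicser`; it warns if any field from the schema above is absent and then writes the table with base R's `saveRDS()`.

```r
# Field names taken from the DE table schema described above
de_fields <- c("group", "names", "obs_name", "test_type", "reference",
               "comp_type", "logfoldchanges", "scores", "pvals",
               "pvals_adj", "versus")

# Hypothetical helper: check the fields, then save as db_de_table.rds
save_de_table <- function(diff_exp, db_root, db_name) {
  # Warn (rather than stop) if expected fields are absent
  missing_fields <- setdiff(de_fields, colnames(diff_exp))
  if (length(missing_fields) > 0) {
    warning("DE table is missing fields: ",
            paste(missing_fields, collapse = ", "))
  }

  # Write the table into the folder for this database
  out_file <- file.path(db_root, db_name, "db_de_table.rds")
  dir.create(dirname(out_file), recursive = TRUE, showWarnings = FALSE)
  saveRDS(diff_exp, file = out_file)

  invisible(out_file)
}

# Usage, with diff_exp from omicser::compute_de_table() (not run here):
# save_de_table(diff_exp, DB_ROOT_PATH, DB_NAME)
```

The saved table can be read back with `readRDS()` if you want to inspect it before launching the browser.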
The last step of data curation is to write the final database to the database directory as `db_data.h5ad`:
adata$write_h5ad(filename = file.path(DB_ROOT_PATH, DB_NAME, "db_data.h5ad"))
At this point, each folder in your database directory should include at least three files. Shiny will look for these files when creating the browser, so their names and location are important:

1. `db_data.h5ad` - the `AnnData` object
2. `db_de_table.rds` - differential expression table
3. `additional_info.Rmd` - explanation of the context and logic of the database

You may also have intermediate database files created during post-processing: `normalized_data.h5ad`, `core_data.h5ad`, and `norm_data_plus_dr.h5ad`. They are available for your reference, but are not required for further app development.
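Before moving on, it can help to confirm that the three required files are present. The following base R sketch does this for one database folder; `missing_db_files()` is a hypothetical helper, and only the file names come from this tutorial.

```r
# File names the browser expects in each database folder
required_files <- c("db_data.h5ad", "db_de_table.rds", "additional_info.Rmd")

# Hypothetical helper: return the required files that are absent from db_dir
missing_db_files <- function(db_dir) {
  present <- file.exists(file.path(db_dir, required_files))
  required_files[!present]
}

# Usage (not run here):
# missing_db_files(file.path(DB_ROOT_PATH, DB_NAME))
```

An empty result means the folder contains everything listed above.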
Now that the data are formatted and organized appropriately, we’ll continue to the next step and configure the browser.
It is useful, and in many cases necessary, to include information about your data and analysis methods when sharing a visualization tool with your colleagues. This section describes approaches for sharing such information within the browser application.
The box on the right of the `Ingest` tab in the browser application is available for documenting both analysis methods (pre-processing approaches/assumptions, hypotheses being tested, statistical parameters) and information about the project (link to preprint/publication, acknowledgements, etc).
The content of this tab is rendered from `additional_info.Rmd`. If you choose to customize this file, you must organize your browser application files within a copy (clone) of the `omicser` package. This ensures that your modified copy of `additional_info.Rmd` is read when you launch your browser, instead of the original version from your installed library.
### `scanpy` in R
If you have difficulty accessing `scanpy`, please see the Troubleshooting section of 1-Installation.
`scanpy` was originally written as a Python package, but we are using it within R courtesy of the `reticulate` R package. The original `scanpy` documentation can help you understand the assumptions and requirements of functions, which retain the same names between Python and R. For more information on syntax for `scanpy` in R, please see this documentation.
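As a brief illustration of the syntax difference, the sketch below pairs two `scanpy` filtering calls in Python (as comments) with their `reticulate` equivalents. `run_basic_filters()` is a hypothetical helper and the thresholds are illustrative assumptions; running it requires `scanpy` in the Python environment used by `reticulate`.

```r
# Python:  sc.pp.filter_cells(adata, min_genes=200)
#          sc.pp.filter_genes(adata, min_cells=3)
#
# In R, the dots between modules become `$`, and whole numbers get an `L`
# suffix so reticulate converts them to Python integers rather than floats.
run_basic_filters <- function(adata, sc) {
  # Drop cells with few detected genes (modifies adata in place)
  sc$pp$filter_cells(adata, min_genes = 200L)

  # Drop genes detected in few cells (modifies adata in place)
  sc$pp$filter_genes(adata, min_cells = 3L)

  adata
}

# Usage (not run here):
# sc <- reticulate::import("scanpy")
# adata <- run_basic_filters(adata, sc)
```

Function and argument names are unchanged between the two languages, so the Python documentation for each function applies directly.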