API Reference
SummarizedPy
A class to hold bulk proteomics (and metabolomics) data and process it for differential expression analysis.
Parameters:
Attributes:
Raises:
Examples:
Constructing SummarizedPy object from numpy array and pandas DataFrame.
>>> import pandas as pd
>>> import numpy as np
>>> import depy as dp
>>> data = np.array([[1, 2, 3],
...                  [4, 5, 6],
...                  [7, 8, 9]])
>>> features = pd.DataFrame({"proteinID": ["feature1", "feature2", "feature3"]})
>>> samples = pd.DataFrame({"sample": ["sample1", "sample2", "sample3"]})
>>> sp = dp.SummarizedPy(data=data, features=features, samples=samples)
>>> sp
<SummarizedPy(data=ndarray(shape=(3, 3), dtype=int64), features=DataFrame(shape=(3, 1)), samples=DataFrame(shape=(3, 1)))>
filter_features(expr=None, mask=None)
Filter SummarizedPy object based on feature metadata, using either Pandas-like query strings or a mask.
Parameters:
Returns:
Raises:
Examples:
Filter out reverse hits in example dataset PXD000438.
>>> import depy as dp
>>> import re
>>> sp = dp.SummarizedPy().load_example_data()
>>> rev_hits = sp.features["protein_id"].apply(lambda x: bool(re.match("REV", x)))
>>> sp.features["rev"] = rev_hits
>>> sp = sp.filter_features(expr="~rev")
>>> sp = sp.filter_features(mask=~rev_hits)  # Alternatively, use rev_hits as a boolean mask
filter_missingness(frac=0.75, strategy='all_conditions', condition_column=None)
Filter the SummarizedPy object by per-feature missingness, evaluated in one of three ways: overall, across all conditions, or within any condition.
Parameters:
Returns:
Raises:
Examples:
Filter out missing values in example dataset PXD000438.
>>> import depy as dp
>>> sp = dp.SummarizedPy().load_example_data()
>>> sp = sp.filter_missingness(strategy="overall", frac=0.75)
>>> sp = sp.filter_missingness(strategy="any_condition", condition_column="condition", frac=0.75)
>>> sp = sp.filter_missingness(strategy="all_conditions", condition_column="condition", frac=0.75)
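For intuition, the fraction-based filtering can be sketched in plain NumPy. This is an illustrative sketch of the "overall" strategy only, not depy's internals:

```python
import numpy as np

# Keep features (rows) whose fraction of observed (non-NaN) values meets `frac`.
data = np.array([
    [1.0, np.nan, 3.0, 4.0],    # 3/4 observed -> kept at frac=0.75
    [np.nan, np.nan, 1.0, 2.0], # 2/4 observed -> dropped at frac=0.75
])
frac = 0.75
observed = ~np.isnan(data)            # boolean mask of present values
keep = observed.mean(axis=1) >= frac  # feature-wise observed fraction
filtered = data[keep]                 # keep == [True, False]
```

The condition-aware strategies apply the same computation within sample groups defined by condition_column.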
filter_samples(expr=None, mask=None)
Filter SummarizedPy object based on sample metadata, using either Pandas-like query strings or a mask.
Parameters:
Returns:
Raises:
Examples:
Filter for ADC samples in example dataset PXD000438.
>>> import depy as dp
>>> sp = dp.SummarizedPy().load_example_data()
>>> sp.samples["condition"] = ["ADC"] * 6 + ["SCC"] * 6
>>> sp = sp.filter_samples(expr="condition=='ADC'")
import_from_delim_file(path, delim, data_selector=None, feature_selector=None, replace_val_with_nan=None, clean_column_names=False)
classmethod
Alternative constructor from file. Reads data directly from a delimited file, including feature data, feature metadata, and sample metadata, and assigns them to the data, features, and samples attributes automatically. This is intended for convenient import of standardized outputs such as MaxQuant's proteingroups.txt or FragPipe/DIA-NN's diann-output.pg_matrix.tsv.
The method uses ColumnSelector objects to assign columns to their relevant storage containers. If no ColumnSelector is provided, all numerical (float64) columns are assigned to data and all string (object) columns to features; it is therefore best to state explicitly which columns to import. The samples attribute is automatically populated with the column names from data.
The original data row and column indices are stored in the 'orig_index' variables of features and samples, respectively, for bookkeeping. The path and delimiter used are appended to the history attribute. Values in data can be replaced with NaN to indicate missingness (e.g. intensity values of 0), and column names can be automatically cleaned.
Parameters:
Returns:
Examples:
Read data from a protein groups file (e.g. MaxQuant or FragPipe) and construct a SummarizedPy object. By default, all numerical columns are placed in 'data', their associated column names in 'samples', and object- or string-type columns in 'features'.
>>> import depy as dp
>>> pg_path = "~/path/to/my/proteingroups.txt"
>>> sp = dp.SummarizedPy().import_from_delim_file(path=pg_path, delim='\t', replace_val_with_nan=0., clean_column_names=True)
Select columns to import using ColumnSelector object. Assume data are in columns containing sub-string 'LFQ_intensity_'.
>>> import depy as dp
>>> pg_path = "~/path/to/my/proteingroups.txt"
>>> data = dp.ColumnSelector(regex="LFQ_intensity_")
>>> features = dp.ColumnSelector(names=["proteinID", "geneSymbol", "proteinDescription"])
>>> sp = dp.SummarizedPy().import_from_delim_file(path=pg_path, delim='\t', data_selector=data, feature_selector=features)
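The documented dtype-based default (float64 columns to data, object columns to features) can be illustrated with pandas. The column names below are hypothetical:

```python
import pandas as pd

# Sketch of the documented default split, not depy's implementation.
pg = pd.DataFrame({
    "proteinID": ["P1", "P2"],          # object dtype -> features
    "LFQ_intensity_s1": [1.5, 2.5],     # float64 dtype -> data
    "LFQ_intensity_s2": [3.0, 4.0],
})
data_cols = pg.select_dtypes(include="float64")
feature_cols = pg.select_dtypes(include="object")
# data_cols.columns == ['LFQ_intensity_s1', 'LFQ_intensity_s2']
# feature_cols.columns == ['proteinID']
```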
impute_missing_values(method=None, extra_args=None)
Impute missing values using the ImputeLCMD R package.
Several common methods are available under the following assumptions:
- MAR (KNN, SVD, MLE)
- MNAR (QRILC, MinDet, MinProb)
- Both MAR and MNAR (Hybrid)
Refer to the ImputeLCMD package documentation for further information: https://cran.r-project.org/web/packages/imputeLCMD/imputeLCMD.pdf
Parameters:
Returns:
Raises:
Examples:
Impute missing values using ImputeLCMD's hybrid strategy. Use example dataset PXD000438 after filtering excessive missingness.
>>> import depy as dp
>>> sp = dp.SummarizedPy().load_example_data()
>>> sp = sp.filter_missingness(strategy="overall")
>>> sp = sp.impute_missing_values(method="Hybrid")
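For intuition, a minimal MinDet-style (left-censored) imputation can be sketched in NumPy. This is a conceptual stand-in, not the imputeLCMD implementation:

```python
import numpy as np

# Replace each sample's missing values with a low quantile (here the 1st
# percentile) of that sample's observed intensities, mimicking the idea
# that MNAR values sit near the detection limit.
data = np.array([
    [5.0, np.nan],
    [6.0, 7.0],
    [np.nan, 8.0],
])
q = 0.01
imputed = data.copy()
for j in range(data.shape[1]):        # iterate over samples (columns)
    col = data[:, j]
    low = np.nanquantile(col, q)      # small value from observed data
    imputed[np.isnan(col), j] = low
```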
limma_trend_dea(design_formula=None, contrasts=None, feature_id_col=None, robust=False, block=None, array_weights=False, extra_args=None)
Run differential expression analysis (DEA) with limma-trend. Limma powers its analyses by incorporating an empirical mean-variance trend, estimated from the data, as a prior. This alleviates the issue of estimating fold changes in the face of heteroscedasticity: low-abundance features are prone to false positives due to their inherently lower variance, whereas high-abundance features are prone to false negatives. By modeling the overall mean-variance trend and incorporating it as a prior, information is borrowed across features (which powers low-N designs) and variance estimates are effectively regularized. Compared to traditional parametric statistics, this empirical Bayes approach has consistently been found to be more powerful and to achieve better FDR (false discovery rate) control. Additionally, a robust approximation can be used if the data contain hypo- or hypervariable features, to avoid skewing the mean-variance trend.
By building on linear models, limma can accommodate complex designs, including fixed and random factors (i.e. mixed effects, such as nested factors or repeated measures) and their combination (i.e. to model between- and within-subject designs).
Limma can also incorporate sample quality weights, which are especially powerful in noisy datasets, as is often the case with human or animal samples. Using limma's arrayWeights function, samples are up- or down-weighted based on how variable they are compared to the average sample. Importantly, this function takes the experimental design into account when estimating the sample weights. The user can also provide an arbitrary design, or none at all, to estimate averaged weights for groups of samples (e.g. where sample quality is known to be especially poor for some condition or technical covariate) or simply to estimate sample-specific weights independent of design covariates.
arrayWeights is run with the 'REML' method, which allows for missing values. The weights are inversely proportional to sample variability (i.e. a sample with a weight of 0.5 is twice as variable as the average sample; weights >1 indicate less-variable samples and tend to reflect higher quality). The weights can be stabilized further and squeezed towards 1 by increasing the 'prior_n' parameter above its default of 10; this tends to make weights more symmetric around 1 (average/equality), thus up- and down-weighting samples by similar magnitudes rather than disproportionately up-weighting good samples.
Parameters:
Returns:
Raises:
Examples:
Full DEA pipeline on example dataset PXD000438.
>>> import depy as dp
>>> sp = dp.SummarizedPy().load_example_data()
>>> sp.samples["condition"] = ["ADC"] * 6 + ["SCC"] * 6 # Add condition variable
>>> sp = sp.filter_missingness(strategy="overall") # Pre-process
>>> sp = sp.transform_features(method="log", by=2)
>>> sp = sp.impute_missing_values(method="Hybrid")
>>> sp = sp.surrogate_variable_analysis(mod="~condition")
>>> des = "~condition+sv_1+sv_2+sv_3" # design formula (incl. 'condition' and surrogate variables)
>>> contr = {"SCCvsADC": "SCC-ADC"} # define contrast (levels must be present in covariates above)
>>> sp = sp.limma_trend_dea(design_formula=des, contrasts=contr, array_weights=True) # with array_weights option
>>> sp.results # Check newly created results attribute
load_example_data()
classmethod
Load a real-world example proteomics dataset for demonstration purposes. The function loads dataset 'PXD000438' from the ImputeLCMD package. The data were generated from a super-SILAC experiment of human adenocarcinoma (ADC) and squamous cell carcinoma (SCC) samples. The dataset contains six ADC and six SCC samples and 3,709 proteomic features with raw feature intensities and missing values. Samples 092.1-3 and 441.1-3 are ADC; 561.1-3 and 691.1-3 are SCC.
For more information about the dataset: https://proteomecentral.proteomexchange.org/cgi/GetDataset?ID=PXD000438
Returns:
Examples:
Load example dataset.
>>> import depy as dp
>>> sp = dp.SummarizedPy().load_example_data()
<SummarizedPy(data=ndarray(shape=(3709, 12), dtype=float64), features=DataFrame(shape=(3709, 1)), samples=DataFrame(shape=(12, 1)))>
load_sp(path=None)
classmethod
Load a previously saved SummarizedPy object, stored as a pickle file on disk.
Parameters:
Returns:
Examples:
Load a saved SP object from pickle file.
>>> import depy as dp
>>> sp = dp.SummarizedPy().load_sp("my_sp.pkl")
plot_pca(standardize=True, n_comp=None, fill_by=None, label=False)
Generate PCA plot of data using the first two principal components. PCA is computed using scikit-learn's PCA estimator with defaults.
If the data contain features with missing values, those features are omitted, as PCA requires complete data.
Parameters:
Returns:
Raises:
Examples:
Generate PCA plot on standardized data.
>>> sp.plot_pca(standardize=True)
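The documented handling of missing values (incomplete features are dropped before PCA) can be sketched with a plain SVD in place of scikit-learn's estimator. This is illustrative only, not depy's implementation:

```python
import numpy as np

# Drop features with any missing value, then compute the first two PCs
# of the samples via SVD on the centered samples-by-features matrix.
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 6))   # features x samples
data[0, 0] = np.nan               # one feature has a missing value
complete = data[~np.isnan(data).any(axis=1)]  # omit incomplete features
X = complete.T                    # samples x features
Xc = X - X.mean(axis=0)           # center each feature
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U[:, :2] * S[:2]         # sample coordinates on PC1/PC2, shape (6, 2)
```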
save_sp(path=None)
Save SummarizedPy object to disk using pickle for easy loading in the future.
The method automatically appends '.pkl' to the file name.
Parameters:
Examples:
Save SP object to disk.
>>> sp.save_sp(path="my_sp")
select_variable_features(top_n=None, top_percentile=None, plot=False)
Select highly variable features (HVF) based on deviation from the data's mean-variance trend.
Uses LOWESS to fit a smooth trend to the feature-wise mean and standard deviation values.
Note: if a log2 transformation has not already been applied via transform_features(method='log', by=2), it will be applied prior to fitting the mean-variance trend; data are returned on the original scale.
Parameters:
Returns:
Raises:
Examples:
Select top 500 most variable features in example dataset PXD000438 and plot the fitted mean-variance trend.
>>> import depy as dp
>>> sp = dp.SummarizedPy().load_example_data()
>>> sp = sp.select_variable_features(top_n=500, plot=True)
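The deviation-from-trend idea can be sketched in NumPy, using a polynomial fit as a simple stand-in for the LOWESS trend described above. This is a conceptual sketch, not depy's implementation:

```python
import numpy as np

# Fit a mean-sd trend on log2 data and rank features by how far their
# standard deviation sits above the trend.
rng = np.random.default_rng(2)
data = rng.lognormal(mean=8, sigma=1, size=(1000, 12))  # features x samples
logged = np.log2(data)
mean = logged.mean(axis=1)
sd = logged.std(axis=1)
trend = np.polyval(np.polyfit(mean, sd, deg=2), mean)   # fitted mean-sd trend
residual = sd - trend                                   # deviation from trend
top_n = 100
hvf_idx = np.argsort(residual)[-top_n:]                 # most variable features
```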
surrogate_variable_analysis(mod=None, mod0='~1', num_sv=None)
Run surrogate variable analysis (SVA) to estimate latent factors that capture expression heterogeneity or hidden batch effects. The surrogate variables (SVs) are added to the samples attribute and can be included as covariates in DEA.
SVs are estimated through PCA on the residualized feature matrix after regressing out known experimental and technical/batch covariates. This is done by supplying the method with a fully parameterized model (mod), including all known covariates (experimental and technical, i.e. as present in the samples attribute), and a null model including only technical (adjustment) covariates (mod0). The number of surrogate variables to estimate can be specified with the num_sv argument; alternatively, the method can be run without it, letting SVA estimate the number empirically (using SVA's 'num.sv' function with the default 'leek' method). Note that this can return 0 SVs and fail; it is still possible to find significant SVs by forcing the method to run with a pre-specified num_sv.
The mod and mod0 arguments must use R formula syntax, which starts with a tilde (~) and adds covariates with + and their interactions with *. Covariates must be present in the samples attribute. If no technical covariates are known, the method runs with the recommended default of mod0='~1' (i.e. an intercept-only model). For more information, see: https://bioconductor.org/packages/3.19/bioc/vignettes/sva/inst/doc/sva.pdf
Parameters:
Returns:
Raises:
Examples:
Use SVA to estimate surrogate variables for inclusion in DEA. Use example dataset PXD000438: filter missing values and log2 transform features first.
>>> import depy as dp
>>> sp = dp.SummarizedPy().load_example_data()
>>> sp = sp.filter_missingness(strategy="overall") # Filter excessive missingness (this is important)
>>> sp = sp.transform_features(method="log", by=2) # Log transform data (important)
>>> sp = sp.impute_missing_values(method="Hybrid") # Optionally, impute remaining missing values (sva excludes any feature with NaN values)
>>> sp = sp.surrogate_variable_analysis(mod="~condition") # Default null model: mod0 = '~1' (intercept-only)
>>> sp.samples # SVs now in samples attribute
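The residualize-then-PCA idea behind SVA can be sketched in NumPy. This is a rough conceptual sketch under simplifying assumptions, not the sva package's algorithm:

```python
import numpy as np

# Regress out known covariates, then take principal components of the
# residuals as surrogate-variable candidates.
rng = np.random.default_rng(1)
n_samples, n_features = 12, 100
condition = np.repeat([0.0, 1.0], 6)
X = np.column_stack([np.ones(n_samples), condition])  # design: ~condition
Y = rng.normal(size=(n_samples, n_features))          # samples x features
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)          # fit known model
resid = Y - X @ beta                                  # remove known effects
U, S, Vt = np.linalg.svd(resid - resid.mean(axis=0), full_matrices=False)
sv_1 = U[:, 0] * S[0]                                 # first surrogate variable
```

In the real method, such components would be appended to the samples attribute for use in the design formula.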
transform_features(method, axis=None, by=None)
transform_features(
    method: Literal["center"],
    axis: int,
    by: Optional[Literal["mean", "median"]] = None,
) -> SummarizedPy
transform_features(
    method: Literal["log"],
    by: Optional[int] = None,
) -> SummarizedPy
transform_features(
    method: Literal["z-score"],
    axis: int,
    by: None = None,
) -> SummarizedPy
transform_features(
    method: Literal["vsn"],
) -> SummarizedPy
Mathematically transform features stored in the data attribute using one of: log (base N), center (mean or median subtraction), z-score (standardize), or vsn (variance-stabilizing normalization).
Parameters:
Returns:
Raises:
Examples:
Transform feature data in example dataset PXD000438.
>>> import depy as dp
>>> sp = dp.SummarizedPy().load_example_data()
>>> sp = sp.transform_features(method="log", by=2) # Log transformation (base 2)
>>> sp = sp.transform_features(method="center", by="median", axis=1) # Center data sample-wise by median
>>> sp = sp.transform_features(method="z-score", axis=0) # Feature-wise standardization
>>> sp = sp.transform_features(method="vsn") # vsn normalization
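The first three transforms have simple NumPy equivalents. The axis conventions below are assumptions for illustration, not necessarily depy's:

```python
import numpy as np

data = np.array([[1.0, 2.0], [4.0, 8.0]])  # features x samples
logged = np.log2(data)                                    # method="log", by=2
centered = data - np.median(data, axis=0, keepdims=True)  # center each sample (column) by its median
z = (data - data.mean(axis=1, keepdims=True)) / data.std(axis=1, keepdims=True)  # feature-wise z-score
# logged[1, 1] == 3.0 (log2 of 8)
```

vsn has no one-line equivalent; it is a model-based normalization provided by the R vsn package.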
volcano_plot(contrasts=None, top_n=3, de_colors=None)
Generate volcano plots for limma-trend results, highlighting the top N up- and downregulated features.
Parameters:
Returns:
Raises:
Examples:
Generate volcano plots after running limma_trend_dea method.
>>> sp.volcano_plot()
>>> sp.volcano_plot(contrasts=["SCCvsADC"])
ColumnSelector
Object for pre-selecting columns when constructing SummarizedPy from file.
Parameters:
Examples:
Define columns to import from file. Assume data are in columns labeled "LFQ_intensity_*".
>>> import depy as dp
>>> data = dp.ColumnSelector(regex="LFQ_intensity_")
>>> features = dp.ColumnSelector(names=["proteinID", "geneSymbol", "proteinDescription"])
>>> sp = dp.SummarizedPy().import_from_delim_file(path="my/path/proteingroups.txt", delim='\t', data_selector=data, feature_selector=features)
select_cols(df)
Select columns from a DataFrame according to the selector's criteria.
Parameters:
Returns:
Raises:
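Assuming select_cols filters a DataFrame by the selector's configured names or regex (the actual depy behavior may differ), the idea can be sketched with pandas:

```python
import pandas as pd

# Hypothetical sketch of ColumnSelector-style column selection.
df = pd.DataFrame({
    "proteinID": ["P1"],
    "LFQ_intensity_s1": [1.0],
    "LFQ_intensity_s2": [2.0],
})
by_regex = df.filter(regex="LFQ_intensity_")  # regex-based selection
by_names = df[["proteinID"]]                  # explicit name-based selection
# by_regex.columns == ['LFQ_intensity_s1', 'LFQ_intensity_s2']
```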