Advanced Filtering
In addition to filtering the dataset by searching annotations such as gene symbols or descriptions, you can filter the dataset by the data itself. For example, statistics such as the min, max, and mean of the values in each row can be used to select which rows of the dataset to retrieve. You accomplish this by writing a simple Script.
A Script consists of a set of Variable definitions and a set of Filters.
A Variable definition has the form:
v = name1,name2,name3,...
Internally this is turned into a list of strings:
v = ['name1','name2','name3',...]
That is, a variable definition is shorthand for making a list of names. These variables are used as parameters for Filters.
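To illustrate how a definition line becomes a list of strings, here is a minimal Python sketch. It is not the tool's actual parser: the function name parse_variable is hypothetical, and the sketch assumes names contain no commas and that a trailing # comment should be dropped.

```python
def parse_variable(line):
    """Split 'v = a,b,c  # comment' into ('v', ['a', 'b', 'c'])."""
    # Drop a trailing comment, if any (assumes '#' never appears in a name).
    code = line.split("#", 1)[0]
    # Everything before the first '=' is the variable name.
    name, _, value = code.partition("=")
    # Each comma-separated entry becomes one string in the list.
    items = [item.strip() for item in value.split(",") if item.strip()]
    return name.strip(), items


print(parse_variable("v = name1,name2,name3"))
# ('v', ['name1', 'name2', 'name3'])
```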
When defining variables, you can also use wildcards to select multiple group names or sample names:
* | matches everything |
? | matches any single character |
[seq] | matches any character in seq ([1-9],[a-z],[abc],[12] etc.) |
[!seq] | matches any character not in seq |
- | a minus (-) sign in front removes the matched names from the previously matched list; if there is no previous match, it removes the matched names from the whole list |
For example:
g1 = *CA1 # matches all Hippocampus CA1 groups
g2 = Trpv1* # matches all Trpv1 mouse line groups
g3 = *.unfed* # matches unfed condition
g4 = Trp*,-*NL* # matches all Trpv1 mouse samples but excludes samples whose names contain "NL"
Tip
Anything after # is treated as a comment and ignored.
You can also use '-' at the beginning to indicate exclusion:
g1 = -*ChIP,*INP # exclude ChIP data
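The wildcard rules above use the same syntax as Python's fnmatch module, so they can be sketched with it. This is only an approximation of the behavior described here, not the tool's code: the function name expand and the sample group names are made up, case-sensitive matching is an assumption, and "no previous match" is interpreted as "no pattern has been processed yet".

```python
from fnmatch import fnmatchcase


def expand(patterns, names):
    """Resolve a list of wildcard patterns against the available names."""
    selected = None  # None means no pattern has been processed yet
    for pattern in patterns:
        if pattern.startswith("-"):
            # Exclusion: remove matches from the current selection, or from
            # the whole list when nothing has been selected so far.
            base = list(names) if selected is None else selected
            selected = [n for n in base if not fnmatchcase(n, pattern[1:])]
        else:
            # Inclusion: append new matches, keeping the original order.
            if selected is None:
                selected = []
            selected += [n for n in names
                         if fnmatchcase(n, pattern) and n not in selected]
    return selected or []


# Hypothetical group names, made up for this example.
names = ["Trpv1.fed.Arc", "Trpv1.unfed.Arc", "Trpv1.NL.CA1", "Agrp.unfed.CA1"]

print(expand(["Trp*", "-*NL*"], names))  # ['Trpv1.fed.Arc', 'Trpv1.unfed.Arc']
print(expand(["-*CA1"], names))          # ['Trpv1.fed.Arc', 'Trpv1.unfed.Arc']
```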
A Filter selects a subset of the dataset according to the supplied criteria. It takes the form of a function call:
FilterName( param1, param2, ...)
If multiple filters are specified, they are applied in order, each one operating on the result of the previous filter.
For example:
g1 = *.unfed* # define g1 as unfed group
g2 = *.fed* # define g2 as fed group
Max(g1+g2, th=20) # only consider genes with max expression value over 20 across unfed, fed samples.
FoldChange(g1,g2,th=3) # now filter according to the foldchange between groups g1 and g2
TTest(g1,g2,th=0.05) # then further filter according to ttest pvalues
Sort('max') # sort according to max gene value across fed, unfed samples
c = *Arc # define c as groups taken from Arcuate nucleus
Columns(c) # only show these Arcuate samples
This script produces a status message and outputs a heatmap (figures not shown).
More examples are listed in the Example Scripts section.
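The idea that each filter works on the rows left over by the previous one can be pictured with a small Python sketch. Everything here is illustrative: stat_filter and run are hypothetical helpers, the rows are made-up values, and a strict "greater than" comparison is assumed.

```python
def stat_filter(stat, columns, th):
    """Return a step that keeps rows where stat(row values) is above th."""
    def step(rows):
        return [r for r in rows if stat([r[c] for c in columns]) > th]
    return step


def run(rows, steps):
    """Apply each step to the rows that survived the previous step."""
    for step in steps:
        rows = step(rows)
    return rows


# Made-up expression values for three genes across four samples.
rows = [
    {"symbol": "GeneA", "s1": 90.0, "s2": 110.0, "s3": 20.0, "s4": 30.0},
    {"symbol": "GeneB", "s1": 5.0, "s2": 7.0, "s3": 1.0, "s4": 1.0},
    {"symbol": "GeneC", "s1": 50.0, "s2": 60.0, "s3": 2.0, "s4": 55.0},
]
samples = ["s1", "s2", "s3", "s4"]

# First keep rows with max > 20, then, among those, rows with min > 10.
kept = run(rows, [stat_filter(max, samples, 20), stat_filter(min, samples, 10)])
print([r["symbol"] for r in kept])  # ['GeneA']
```

GeneC passes the first step but is dropped by the second, which shows that the order and combination of filters determines the final selection.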
Available Filters are listed below.
Available Filters
- Columns(include=None, exclude=None)
This filter restricts columns (samples) rather than rows.
Parameters: - include – list of groups to include in the final output
- exclude – list of groups to exclude from the final output
Example1:
c = celltype1,celltype2
Columns(c) # This will only show data for celltype1 and celltype2
Example2:
c = celltype1,celltype2
Columns(exclude=c) # This will exclude celltype1 and celltype2 from the final output.
- Find(val, col='symbol', exact=True)
Finds rows that match a regular expression. Case is ignored.
Parameters: - val – list of strings or a regular expression search string (^:indicates start, $:end, . (period):wild card, *:repeat, |:or etc.)
- col – which column to search in
- exact – when a list is supplied, whether to match the exact word (equivalent to putting ^ and $ at the beginning and end)
Available columns
'id', 'etid', 'egid', 'symbol', 'sym', 'description', 'chrloc', 'strand', 'band', 'biotype', 'GC', 'refseq', 'entrez', 'mirbase', 'ids', 'wikigene', 'gob', 'gom', 'goc', 'interpro', 'pfam'
Example1:
Find('^Gad1$|^Gad2$') # Find rows whose symbol is exactly Gad1 or Gad2
Example2:
Find('Gad') # Find rows whose symbol contains Gad (returns Gad1, Gad2, Gadd45a, Itgad etc.)
Example3:
Find('peptide', 'description') # Find rows whose description column contains "peptide".
Example4:
s = Gad1,Gad2
Find(s) # same as Example 1
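The matching behavior described for Find can be approximated with Python's re module. This is a sketch under assumptions, not the tool's implementation: build_pattern is a hypothetical helper, and escaping the entries of a list before anchoring them is a choice made for this example.

```python
import re


def build_pattern(val, exact=True):
    """Compile a case-insensitive pattern from a regex string or a word list."""
    if isinstance(val, str):
        # A plain string is used as a regular expression as-is.
        return re.compile(val, re.IGNORECASE)
    # A list of words: optionally anchor each one with ^ and $.
    words = [re.escape(w) for w in val]
    if exact:
        words = ["^" + w + "$" for w in words]
    return re.compile("|".join(words), re.IGNORECASE)


symbols = ["Gad1", "Gad2", "Gadd45a", "Itgad", "Sst"]

exact = build_pattern(["Gad1", "Gad2"])
print([s for s in symbols if exact.search(s)])
# ['Gad1', 'Gad2']

loose = build_pattern("Gad")
print([s for s in symbols if loose.search(s)])
# ['Gad1', 'Gad2', 'Gadd45a', 'Itgad']
```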
- FoldChange(group1, group2=None, th=None, prefix='', both=True, bigger=True)
Calculates fold change and selects according to the supplied threshold.
Parameters: - group1 – first group
- group2 – second group; if not supplied (or None), the complement of group1 is used
- th (float) – threshold to apply for the statistics
- prefix (string) – prefix to prepend on the name for the statistics
- bigger (bool) – whether to select bigger than threshold or not
- both (bool) – if both==True then foldchange = max(group1/group2, group2/group1), otherwise foldchange = group1/group2
Calculated fold change values will be in the column named fc or prefix+'fc'.
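The effect of the both parameter can be shown with a short Python sketch. The function fold_change is hypothetical and takes two already summarized group values; how the tool summarizes each group (for example by its mean) is not stated here and is left out.

```python
def fold_change(value1, value2, both=True):
    """Fold change between two positive group values."""
    if both:
        # Direction is ignored: a 4x increase and a 4x decrease both give 4.
        return max(value1 / value2, value2 / value1)
    # One-directional: values below 1 mean group1 is lower than group2.
    return value1 / value2


print(fold_change(25.0, 100.0))              # 4.0
print(fold_change(25.0, 100.0, both=False))  # 0.25
```

With both=True a threshold such as th=3 selects changes in either direction, while both=False only selects rows where group1 is higher than group2 by that factor.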
- FoldDiff(group1, group2=None, th=None, prefix='', base=2, both=True, bigger=True)
Calculates fold change when signal values are in log space and selects according to the supplied threshold.
Parameters: - group1 – first group
- group2 – second group; if not supplied (or None), the complement of group1 is used
- th (float) – threshold to apply for the statistics
- prefix (string) – prefix to prepend on the name for the statistics
- bigger (bool) – whether to select bigger than threshold or not
- both (bool) – if both==True then foldchange = max(group1-group2, group2-group1), otherwise foldchange = group1-group2
Calculated fold change values will be in the column named fc or prefix+'fc'.
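In log space a ratio becomes a difference, which is why FoldDiff subtracts instead of dividing. The following Python sketch illustrates this with a hypothetical fold_diff function. It assumes the one-directional case is the plain difference group1 - group2, and it does not model how the tool uses the base parameter.

```python
import math


def fold_diff(log_value1, log_value2, both=True):
    """Difference of two log-scale values; both=True ignores direction."""
    diff = log_value1 - log_value2
    # max(a - b, b - a) is the same as the absolute difference.
    return abs(diff) if both else diff


a, b = 8.0, 2.0  # made-up linear-scale values

print(fold_diff(math.log2(a), math.log2(b)))  # 2.0
print(math.log2(a / b))                       # 2.0, the log2 of the 4x ratio
```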
- GO(val, subtree=True)
Filters rows based on Gene Ontology annotation.
Parameters: - val – list of GO ids
- subtree (boolean) – whether to search for subtree or not
Example:
goid = 0005184 # neuropeptide hormone activity
GO(goid) # this will return rows annotated to have neuropeptide hormone activity
- Sort(groups=None, ascending=None)
Sorts the dataset according to the mean of the supplied groups or according to a single supplied field.
Parameters: - groups – list of groups or a single field name
- ascending (bool) – whether to sort ascending
- TTest(group1, group2=None, th=None, prefix='', twosided=True, bigger=False, logscale=True)
Calculates Student's t-test p-values and selects according to the supplied threshold.
Parameters: - group1 – first group
- group2 – second group; if not supplied (or None), the complement of group1 is used
- th (float) – threshold to apply for the statistics
- prefix (string) – prefix to prepend on the name for the statistics
- bigger (bool) – whether to select bigger than threshold or not
- twosided (bool) – whether to use a two-sided t-test or not
T-test p-values will be in the column named ttestp or prefix+'ttestp'.
- Mean(groups=None, complement=False, th=None, prefix='', bigger=True)
Calculates the average for the supplied groups and selects according to the supplied threshold.
Parameters: - groups – specifies groups (list)
- complement (bool) – whether to take the complement of the supplied groups
- th (float) – threshold to apply for the statistics
- prefix (string) – prefix to prepend on the name for the statistics
- bigger (bool) – whether to select bigger than threshold or not
Calculated mean values are in the column named mean or prefix+'mean'.
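The complement option can be illustrated with a small Python sketch. It is a simplification: group_mean is a hypothetical helper that treats each group as a single column, and the row values are made up.

```python
def group_mean(row, groups, all_columns, complement=False):
    """Mean of a row over the given columns, or over all the other columns."""
    if complement:
        # Use every column that is NOT in the supplied list.
        columns = [c for c in all_columns if c not in groups]
    else:
        columns = list(groups)
    return sum(row[c] for c in columns) / len(columns)


row = {"a": 1.0, "b": 3.0, "c": 8.0}  # made-up values
all_columns = list(row)

print(group_mean(row, ["a", "b"], all_columns))                   # 2.0
print(group_mean(row, ["a", "b"], all_columns, complement=True))  # 8.0
```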
- Max(groups=None, complement=False, th=None, prefix='', bigger=True)
Calculates maximum values. Output column is max or prefix+'max'.
Parameters: - groups – specifies groups (list)
- complement (bool) – whether to take the complement of the supplied groups
- th (float) – threshold to apply for the statistics
- prefix (string) – prefix to prepend on the name for the statistics
- bigger (bool) – whether to select bigger than threshold or not
- Min(groups=None, complement=False, th=None, prefix='', bigger=True)
Calculates minimum values. Output column is min or prefix+'min'.
Parameters: - groups – specifies groups (list)
- complement (bool) – whether to take the complement of the supplied groups
- th (float) – threshold to apply for the statistics
- prefix (string) – prefix to prepend on the name for the statistics
- bigger (bool) – whether to select bigger than threshold or not
- Std(groups=None, complement=False, th=None, prefix='', bigger=True)
Calculates standard deviations. Output column is std or prefix+'std'.
Parameters: - groups – specifies groups (list)
- complement (bool) – whether to take the complement of the supplied groups
- th (float) – threshold to apply for the statistics
- prefix (string) – prefix to prepend on the name for the statistics
- bigger (bool) – whether to select bigger than threshold or not
- Var(groups=None, complement=False, th=None, prefix='', bigger=True)
Calculates variances. Output column is var or prefix+'var'.
Parameters: - groups – specifies groups (list)
- complement (bool) – whether to take the complement of the supplied groups
- th (float) – threshold to apply for the statistics
- prefix (string) – prefix to prepend on the name for the statistics
- bigger (bool) – whether to select bigger than threshold or not
- ColumnSort(stats='mean', grouped=True, reverse=True)
Sorts columns according to the calculated stats (mean, max, min).
Parameters: - stats – mean, max or min
- grouped (boolean) – whether to calculate group average or treat each sample independently
- reverse (boolean) – sort order
- ColumnSelect(th, stats='mean', grouped=True, bigger=True)
Selects columns according to the calculated stats (mean, max, min).
Parameters: - th – threshold
- stats – either mean, max, or min
- grouped (boolean) – whether to calculate group average or treat each sample independently
- bigger (boolean) – whether to select bigger than threshold or not
- Threshold(th, field, bigger=False, absolute=False)
This filter selects rows according to a threshold set for a column.
Parameters: - th (float) – threshold
- field – column name to apply the threshold
- bigger (bool) – whether to select rows bigger than threshold
- absolute (bool) – whether to take absolute value before thresholding
Example:
Threshold(2, 'sample1', bigger=True) # select rows where sample1 has a value bigger than 2
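How bigger and absolute interact can be sketched in Python. The helper passes_threshold is hypothetical, and the use of strict comparisons (rather than "greater than or equal") is an assumption.

```python
def passes_threshold(value, th, bigger=False, absolute=False):
    """Decide whether a single value passes the threshold."""
    if absolute:
        # Compare the magnitude only, ignoring the sign.
        value = abs(value)
    # bigger=True keeps values above th; bigger=False keeps values below th.
    return value > th if bigger else value < th


print(passes_threshold(3.0, 2, bigger=True))                   # True
print(passes_threshold(-3.0, 2, bigger=True))                  # False
print(passes_threshold(-3.0, 2, bigger=True, absolute=True))   # True
```

Taking the absolute value first is useful for columns such as log-scale differences, where large negative values are as interesting as large positive ones.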
- CV(groups=None, complement=False, th=None, prefix='', bigger=True)
Calculates coefficients of variation. Output column is cv or prefix+'cv'.
Parameters: - groups – specifies groups (list)
- complement (bool) – whether to take the complement of the supplied groups
- th (float) – threshold to apply for the statistics
- prefix (string) – prefix to prepend on the name for the statistics
- bigger (bool) – whether to select bigger than threshold or not
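The coefficient of variation is the standard deviation divided by the mean. The Python sketch below shows the calculation for one row. Whether the tool uses the sample or the population standard deviation is not stated here; the sample version is assumed, and the function name is hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (mean must be non-zero)."""
    return stdev(values) / mean(values)


values = [10.0, 12.0, 14.0]  # made-up expression values for one row
cv = coefficient_of_variation(values)
print(round(cv, 4))  # 0.1667
```

Because it is relative to the mean, this statistic lets rows with very different expression levels be compared by how variable they are.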
- ANOVA(groups=None, complement=False, th=None, prefix='', bigger=False)
Calculates ANOVA p-values and selects according to the supplied threshold.
Parameters: - groups – specifies groups (list)
- complement (bool) – whether to take the complement of the supplied groups
- th (float) – threshold to apply for the statistics
- prefix (string) – prefix to prepend on the name for the statistics
- bigger (bool) – whether to select bigger than threshold or not
Calculated ANOVA p-values are in the column named anovap, or prefix+'anovap' if prefix is not an empty string.
- Limit(limit=100, page=1)
Limits the number of rows returned.
Parameters: - limit (integer) – how many rows to return
- page (integer) – which page to return
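The relationship between limit and page can be sketched as list slicing in Python. The helper limit_rows is hypothetical, and 1-based page numbering is an assumption suggested by the default page=1.

```python
def limit_rows(rows, limit=100, page=1):
    """Return one page of rows; pages are assumed to be numbered from 1."""
    start = (page - 1) * limit
    return rows[start:start + limit]


rows = list(range(250))  # stand-in for 250 dataset rows

print(len(limit_rows(rows)))             # 100
print(limit_rows(rows, page=3)[0])       # 200
print(len(limit_rows(rows, page=3)))     # 50
```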
- Scale(groups=None, complement=False, lim=[0.0, 1.0])
Scales z values to the defined limits. Useful if you only care about differences between samples and not about differences between genes.
Parameters: - groups – which groups to scale
- complement (bool) – whether to take complement of supplied groups
- lim – list of two floats, limits for z values
- Standardize(groups=None, complement=False)
Similar to Scale, but standardizes z values instead of scaling them.
Parameters: - groups – which groups to standardize
- complement (bool) – whether to take complement of supplied groups
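The difference between Scale and Standardize can be pictured with the Python sketch below. It rests on assumptions that are not confirmed by this page: that Scale performs a per-row min-max rescaling into lim, that Standardize computes per-row z-scores, and that the population standard deviation is used. The function names are hypothetical.

```python
from statistics import mean, pstdev


def scale_row(values, lim=(0.0, 1.0)):
    """Min-max rescale one row so its values span lim[0]..lim[1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [lim[0]] * len(values)  # flat row: nothing to spread out
    span = lim[1] - lim[0]
    return [lim[0] + (v - lo) / (hi - lo) * span for v in values]


def standardize_row(values):
    """Convert one row to z-scores (zero mean, unit standard deviation)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)  # flat row: every value equals the mean
    return [(v - mu) / sigma for v in values]


print(scale_row([2.0, 4.0, 6.0]))                     # [0.0, 0.5, 1.0]
print(scale_row([10.0, 20.0, 30.0], lim=(-1.0, 1.0)))  # [-1.0, 0.0, 1.0]
print(standardize_row([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
```

Either way, every row ends up on a comparable scale, so a heatmap shows the pattern across samples rather than the absolute expression level of each gene.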