Advanced Filtering¶

In addition to filtering the dataset by searching annotations such as gene symbols or descriptions, you can filter it by properties of the data itself. For example, statistics such as the min, max, and mean of the dataset can be used to select which rows to retrieve. You do this by writing a simple Script.

A Script consists of a set of Variable definitions and a set of Filters.

A Variable definition has the form:

v = name1,name2,name3,...
                

Internally this is turned into a list of strings:

v = ['name1','name2','name3',...]
                

That is, a variable definition is shorthand for making a list of names. These variables are used as parameters for Filters.

When defining variables, you can also use wildcards to select multiple group names or sample names:

*	matches everything
?	matches any single character
[seq]	matches any character in seq ([1-9],[a-z],[abc],[12] etc.)
[!seq]	matches any character not in seq
-	a minus sign (-) in front of a pattern removes matched names from the previously matched list; if there is no previous match, it removes matched names from the whole list

For example:

g1 = *CA1 # matches all Hippocampus CA1 groups
g2 = Trpv1* # matches all Trpv1 mouse line groups
g3 = *.unfed* # matches unfed condition
g4 = Trp*,-*NL* # matches all Trpv1 mouse samples but excludes samples whose names contain "NL"

Tip

Anything after # is treated as comment and ignored.

You can also put a '-' at the beginning of a definition to indicate exclusion:

g1 = -*ChIP,*INP # exclude ChIP data
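The wildcard rules above closely resemble the patterns of Python's fnmatch module. As a rough illustration only (this page does not show how the tool itself expands patterns), the following Python sketch expands the right-hand side of a variable definition against a list of known names. The function name expand_definition and the sample names are hypothetical, and the handling of a leading minus follows one reading of the rule in the table: "no previous match" is taken to mean that the list matched so far is empty.

```python
from fnmatch import fnmatchcase


def expand_definition(definition, all_names):
    """Expand 'pat1,pat2,-pat3 # comment' into a list of matching names.

    Hypothetical helper: it illustrates the documented wildcard rules and
    is not the tool's real code.
    """
    # Anything after '#' is a comment and is ignored.
    body = definition.split("#", 1)[0]
    selected = []
    for raw in body.split(","):
        pattern = raw.strip()
        if not pattern:
            continue
        if pattern.startswith("-"):
            # Exclusion: remove from what was matched so far, or from the
            # whole list if nothing has been matched yet (assumed reading).
            base = selected if selected else list(all_names)
            selected = [n for n in base if not fnmatchcase(n, pattern[1:])]
        else:
            # Inclusion: append new matches in the order of all_names.
            selected += [n for n in all_names
                         if fnmatchcase(n, pattern) and n not in selected]
    return selected


# Made-up sample names, for illustration only.
names = ["Trpv1.CA1", "Trpv1.NL.CA1", "Pvalb.CA1", "H3K4.ChIP", "H3K4.INP"]
print(expand_definition("Trp*,-*NL*  # Trpv1 without NL", names))
print(expand_definition("-*ChIP,*INP", names))
```

In this sketch, a leading exclusion such as -*ChIP starts from the full list of names, so the patterns that follow it only add names that are not already selected.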

A Filter selects a subset of the dataset according to the supplied criteria. It takes the form of a function call:

FilterName( param1, param2, ...)
                

If multiple filters are specified, they are applied in sequence, each one operating on the output of the previous one.

For example:

g1 = *.unfed*  # define g1 as unfed group
g2 = *.fed*  # define g2 as fed group
Max(g1+g2, th=20) # only consider genes with max expression value over 20 across unfed, fed samples.
FoldChange(g1,g2,th=3) # now filter according to the foldchange between groups g1 and g2
TTest(g1,g2,th=0.05) # then further filter according to ttest pvalues
Sort('max') # sort according to max gene value across fed, unfed samples
c = *Arc # define c as groups taken from Arcuate nucleus
Columns(c) # only show these Arcuate samples

This script produces a status message and outputs a heatmap.

More examples are listed in the Example Scripts section.
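To make the idea of chained filters concrete, here is a minimal Python sketch of the first two steps of the fed/unfed script above. It is not the tool's implementation: the functions max_filter and fold_change_filter, the gene names, and the expression values are all made up, and it assumes that FoldChange compares group means.

```python
def max_filter(rows, samples, th):
    """Keep rows whose maximum over `samples` exceeds `th`."""
    return {g: v for g, v in rows.items() if max(v[s] for s in samples) > th}


def fold_change_filter(rows, group1, group2, th):
    """Keep rows whose symmetric fold change between group means exceeds `th`."""
    kept = {}
    for gene, values in rows.items():
        m1 = sum(values[s] for s in group1) / len(group1)
        m2 = sum(values[s] for s in group2) / len(group2)
        # Toy data has no zero means; real data would need a guard here.
        if max(m1 / m2, m2 / m1) > th:
            kept[gene] = values
    return kept


# Made-up expression table: gene -> {sample name: value}.
rows = {
    "geneA": {"unfed1": 90.0, "unfed2": 110.0, "fed1": 20.0, "fed2": 30.0},
    "geneB": {"unfed1": 5.0, "unfed2": 4.0, "fed1": 1.0, "fed2": 1.0},
    "geneC": {"unfed1": 50.0, "unfed2": 60.0, "fed1": 55.0, "fed2": 45.0},
}
unfed, fed = ["unfed1", "unfed2"], ["fed1", "fed2"]

# Each filter receives the output of the previous one.
step1 = max_filter(rows, unfed + fed, th=20)          # geneB never exceeds 20
step2 = fold_change_filter(step1, unfed, fed, th=3)   # geneC changes too little
print(sorted(step2))
```

Because each step only sees the rows that survived the previous step, placing a cheap filter such as Max first reduces the work done by later statistical filters.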

Available Filters are listed below.

Available Filters¶

Columns(include=None, exclude=None)¶

This filter restricts columns (samples) rather than rows.

Parameters:	include – list of groups to include in the final output exclude – list of groups to exclude in the final output

Example1:

c = celltype1,celltype2
Columns(c) # This will only show data for celltype1 and celltype2
                    

Example2:

c = celltype1,celltype2
Columns(exclude=c) # This will exclude celltype1 and celltype2 from the final output.
                    

Find(val, col='symbol', exact=True)¶

Finds rows that match a regular expression. Case is ignored.

Parameters:

val – list of strings, or a regular expression search string (^: start, $: end, . (period): wildcard, *: repeat, |: or, etc.)
col – which column to search in
exact – when a list is supplied, whether to match whole words exactly (equivalent to putting ^ and $ at the beginning and end)

Available columns

'id', 'etid', 'egid', 'symbol', 'sym', 'description', 'chrloc', 'strand', 'band', 'biotype', 'GC', 'refseq', 'entrez', 'mirbase', 'ids', 'wikigene', 'gob', 'gom', 'goc', 'interpro', 'pfam'

Example1:

Find('^Gad1$|^Gad2$') # Find rows whose symbol is exactly Gad1 or Gad2
                    

Example2:

Find('Gad') # Find rows whose symbol contains Gad (returns Gad1, Gad2, Gadd45a, Itgad etc.)
                    

Example3:

Find('peptide', 'description') # Find rows whose description column contains "peptide".
                    

Example4:

s = Gad1,Gad2
Find(s)  # same as Example 1
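The equivalence between Example 1 and Example 4 can be sketched in Python with the standard re module. This is an illustration under assumptions, not the tool's code: the helper build_pattern is hypothetical, and escaping the list entries before joining them is a choice made here for safety.

```python
import re


def build_pattern(val, exact=True):
    """Turn a search string or a list of words into a case-insensitive regex."""
    if isinstance(val, str):
        # A plain string is used as a regular expression as-is.
        return re.compile(val, re.IGNORECASE)
    words = [re.escape(w) for w in val]
    if exact:
        # exact=True anchors each word, like writing ^word$ by hand.
        words = ["^" + w + "$" for w in words]
    return re.compile("|".join(words), re.IGNORECASE)


symbols = ["Gad1", "Gad2", "Gadd45a", "Itgad"]
exact_hits = [s for s in symbols if build_pattern(["Gad1", "Gad2"]).search(s)]
loose_hits = [s for s in symbols if build_pattern("Gad").search(s)]
print(exact_hits)
print(loose_hits)
```

With exact=True the list form behaves like the anchored alternation in Example 1, while the unanchored string in Example 2 also matches symbols that merely contain the text.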
                    

FoldChange(group1, group2=None, th=None, prefix='', both=True, bigger=True)¶

Calculates fold change and selects rows according to the supplied threshold.

Parameters:

group1 – group1
group2 – group2; if not supplied (or None), the complement of group1
th (float) – threshold to apply for the statistics
prefix (string) – prefix to prepend on the name for the statistics
bigger (bool) – whether to select bigger than threshold or not
both (bool) – if both==True then foldchange=max(group1/group2,group2/group1), otherwise foldchange=group1/group2

Calculated foldchange values will be in column named fc or prefix+’fc’.
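As a small illustration of the both and bigger parameters, the following Python sketch computes a fold change from two group summaries and applies a threshold. It assumes each group is summarized by a single value such as its mean, which this page does not state, and that the comparison with th is strict. The function names are hypothetical.

```python
def fold_change(value1, value2, both=True):
    """Fold change between two group summaries, following the `both` rule."""
    ratio = value1 / value2
    # both=True is direction-independent: the larger of the two ratios.
    return max(ratio, 1 / ratio) if both else ratio


def passes(value, th, bigger=True):
    """Select values above the threshold (bigger=True) or below it."""
    return value > th if bigger else value < th


symmetric = fold_change(25.0, 100.0)               # max(0.25, 4.0)
directional = fold_change(25.0, 100.0, both=False)  # 25 / 100
print(symmetric, directional)
print(passes(symmetric, th=3), passes(directional, th=3))
```

With both=True a gene passes whether it goes up or down between the groups. With both=False only an increase in group1 relative to group2 can exceed a threshold greater than 1.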

FoldDiff(group1, group2=None, th=None, prefix='', base=2, both=True, bigger=True)¶

Calculates fold change when signal values are in log space and selects rows according to the supplied threshold.

Parameters:

group1 – group1
group2 – group2; if not supplied (or None), the complement of group1
th (float) – threshold to apply for the statistics
prefix (string) – prefix to prepend on the name for the statistics
bigger (bool) – whether to select bigger than threshold or not
both (bool) – if both==True then foldchange=max(group1-group2,group2-group1), otherwise foldchange=group1-group2

Calculated foldchange values will be in column named fc or prefix+’fc’.
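The reason a difference replaces the ratio in log space is that log(a/b) = log(a) - log(b). The Python sketch below illustrates this with log2 values. It is only a sketch: the function name fold_diff is hypothetical, and how the tool uses the base parameter and in which units it interprets th are not described on this page.

```python
import math


def fold_diff(log_value1, log_value2, both=True):
    """Difference of two log-scale group summaries, following the `both` rule."""
    diff = log_value1 - log_value2
    # both=True takes max(diff, -diff), i.e. the magnitude of the difference.
    return max(diff, -diff) if both else diff


linear1, linear2 = 100.0, 25.0
d = fold_diff(math.log2(linear1), math.log2(linear2))
print(d)        # difference in log2 units
print(2 ** d)   # converted back to a linear-scale ratio
```

Raising the log base to the difference recovers the linear fold change, so a difference of 1 in log2 space corresponds to a two-fold change.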

GO(val, subtree=True)¶

Filtering based on Gene Ontology annotation.

Parameters:

val – list of GO ids
subtree (boolean) – whether to search for subtree or not

Example:

goid = 0005184 # neuropeptide hormone activity
GO(goid)  # this will return rows annotated to have neuropeptide hormone activity

Sort(groups=None, ascending=None)¶

Sorts the dataset according to the mean of the supplied groups, or by a single supplied field.

Parameters:

groups – list of groups, or a single field name
ascending (bool) – whether to sort in ascending order

TTest(group1, group2=None, th=None, prefix='', twosided=True, bigger=False, logscale=True)¶

Calculates Student’s t-test p-values and selects rows according to the supplied threshold.

Parameters:

group1 – group1
group2 – group2; if not supplied (or None), the complement of group1
th (float) – threshold to apply for the statistics
prefix (string) – prefix to prepend on the name for the statistics
bigger (bool) – whether to select bigger than threshold or not
twosided (bool) – whether to use a two-sided t-test or not

T-test pvalues will be in a column named ttestp or prefix+’ttestp’.

Mean(groups=None, complement=False, th=None, prefix='', bigger=True)¶

Calculates the average for the supplied groups and selects rows according to the supplied threshold.

Parameters:

groups – specifies groups (list)
complement (bool) – whether to take the complement of the supplied groups
th (float) – threshold to apply for the statistics
prefix (string) – prefix to prepend on the name for the statistics
bigger (bool) – whether to select bigger than threshold or not

Calculated mean values are in column named mean or prefix+’mean’.

Max(groups=None, complement=False, th=None, prefix='', bigger=True)¶

Calculates maximum values. The output column is max or prefix+’max’.

Parameters:

groups – specifies groups (list)
complement (bool) – whether to take the complement of the supplied groups
th (float) – threshold to apply for the statistics
prefix (string) – prefix to prepend on the name for the statistics
bigger (bool) – whether to select bigger than threshold or not

Min(groups=None, complement=False, th=None, prefix='', bigger=True)¶

Calculates minimum values. The output column is min or prefix+’min’.

Parameters:

groups – specifies groups (list)
complement (bool) – whether to take the complement of the supplied groups
th (float) – threshold to apply for the statistics
prefix (string) – prefix to prepend on the name for the statistics
bigger (bool) – whether to select bigger than threshold or not

Std(groups=None, complement=False, th=None, prefix='', bigger=True)¶

Calculates standard deviations. The output column is std or prefix+’std’.

Parameters:

groups – specifies groups (list)
complement (bool) – whether to take the complement of the supplied groups
th (float) – threshold to apply for the statistics
prefix (string) – prefix to prepend on the name for the statistics
bigger (bool) – whether to select bigger than threshold or not

Var(groups=None, complement=False, th=None, prefix='', bigger=True)¶

Calculates variances. The output column is var or prefix+’var’.

Parameters:

groups – specifies groups (list)
complement (bool) – whether to take the complement of the supplied groups
th (float) – threshold to apply for the statistics
prefix (string) – prefix to prepend on the name for the statistics
bigger (bool) – whether to select bigger than threshold or not

ColumnSort(stats='mean', grouped=True, reverse=True)¶

Sort columns according to calculated stats (mean, max, min).

Parameters:

stats – mean, max or min
grouped (boolean) – whether to calculate group average or treat each sample independently
reverse (boolean) – sort order

ColumnSelect(th, stats='mean', grouped=True, bigger=True)¶

Select columns according to calculated stats (mean, max, min).

Parameters:

th – threshold
stats – either mean, max, or min
grouped (boolean) – whether to calculate group average or treat each sample independently
bigger (boolean) – whether to select bigger than threshold or not

Threshold(th, field, bigger=False, absolute=False)¶

This filter selects rows according to a threshold set for a column.

Parameters:

th (float) – threshold
field – name of the column to apply the threshold to
bigger (bool) – whether to select rows bigger than threshold
absolute (bool) – whether to take absolute value before thresholding

Example:

Threshold(2, 'sample1', bigger=True) # select rows where sample1 has bigger value than 2
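The interaction of bigger and absolute can be sketched in Python as follows. This is a hypothetical stand-in for the filter, assuming strict comparisons and rows represented as dictionaries; the gene names and values are made up.

```python
def threshold(rows, th, field, bigger=False, absolute=False):
    """Keep rows whose `field` value is above or below `th`."""
    kept = []
    for row in rows:
        # absolute=True compares the magnitude instead of the signed value.
        value = abs(row[field]) if absolute else row[field]
        if (value > th) if bigger else (value < th):
            kept.append(row)
    return kept


# Made-up rows, for illustration only.
rows = [
    {"symbol": "geneA", "sample1": 3.5},
    {"symbol": "geneB", "sample1": -4.0},
    {"symbol": "geneC", "sample1": 0.5},
]
above = threshold(rows, 2, "sample1", bigger=True)
far_from_zero = threshold(rows, 2, "sample1", bigger=True, absolute=True)
print([r["symbol"] for r in above])
print([r["symbol"] for r in far_from_zero])
```

Setting absolute=True is useful for signed columns such as log-scale differences, where large negative values are as interesting as large positive ones.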
                    

CV(groups=None, complement=False, th=None, prefix='', bigger=True)¶

Calculates coefficients of variation. The output column is cv or prefix+’cv’.

Parameters:

groups – specifies groups (list)
complement (bool) – whether to take the complement of the supplied groups
th (float) – threshold to apply for the statistics
prefix (string) – prefix to prepend on the name for the statistics
bigger (bool) – whether to select bigger than threshold or not
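The statistics behind Std, Var, and CV are related: the standard deviation is the square root of the variance, and the coefficient of variation is the standard deviation divided by the mean. The Python sketch below shows these relations for one row of made-up values. Whether the tool uses sample (n - 1) or population (n) estimators is not stated on this page; the sample versions are used here as an assumption.

```python
import statistics

# Made-up expression values for one gene across the selected samples.
values = [8.0, 10.0, 12.0]

mean = statistics.mean(values)
var = statistics.variance(values)   # sample variance (divides by n - 1)
std = statistics.stdev(values)      # sample standard deviation
cv = std / mean                     # coefficient of variation (unitless)

print(mean, var, std, cv)
```

Because CV is normalized by the mean, it allows variability to be compared between genes with very different expression levels, which Std and Var do not.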

ANOVA(groups=None, complement=False, th=None, prefix='', bigger=False)¶

Calculates ANOVA p-values and selects rows according to the supplied threshold value.

Parameters:

groups – specifies groups (list)
complement (bool) – whether to take the complement of the supplied groups
th (float) – threshold to apply for the statistics
prefix (string) – prefix to prepend on the name for the statistics
bigger (bool) – whether to select bigger than threshold or not

Calculated ANOVA p-values are in the column named anovap, or prefix+’anovap’ if prefix is not an empty string.

Limit(limit=100, page=1)¶

Just limits the number of rows.

Parameters:

limit (integer) – how many to return
page (integer) – which page to return
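The relationship between limit and page can be sketched in Python with list slicing. The sketch assumes that pages are numbered from 1, as the default page=1 suggests; the function name limit_rows is hypothetical.

```python
def limit_rows(rows, limit=100, page=1):
    """Return the `page`-th block of `limit` rows, with pages counted from 1."""
    start = (page - 1) * limit
    return rows[start:start + limit]


rows = list(range(250))            # stand-in for 250 filtered rows
first = limit_rows(rows)           # rows 0..99
last = limit_rows(rows, page=3)    # rows 200..249, a partial page
print(len(first), len(last))
```

A page beyond the end of the data simply yields an empty list in this sketch.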

Scale(groups=None, complement=False, lim=[0.0, 1.0])¶

Scales z values to the defined limits. Useful if you only care about differences between samples and not differences between genes.

Parameters:

groups – which groups to scale
complement (bool) – whether to take the complement of the supplied groups
lim – list of two floats, limits for z values

Standardize(groups=None, complement=False)¶

Similar to Scale but instead of scaling it standardizes z values.

Parameters:

groups – which groups to standardize
complement – whether to take the complement of the supplied groups
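One plausible reading of Scale and Standardize is sketched below in Python: Scale maps each row linearly onto the interval given by lim, and Standardize converts each row to zero mean and unit standard deviation. Both are assumptions, since this page does not define the exact formulas. The function names are hypothetical, and the choice of the population standard deviation is arbitrary.

```python
import statistics


def scale(values, lim=(0.0, 1.0)):
    """Linearly map one row of values onto [lim[0], lim[1]] (min-max scaling)."""
    lo, hi = min(values), max(values)
    low, high = lim
    if hi == lo:
        # A constant row carries no between-sample difference.
        return [low for _ in values]
    return [low + (v - lo) * (high - low) / (hi - lo) for v in values]


def standardize(values):
    """Convert one row of values to zero mean and unit standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)   # population standard deviation (assumed)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


row = [2.0, 4.0, 6.0]
scaled = scale(row)
zscores = standardize(row)
print(scaled)
print(zscores)
```

Either transformation removes the overall expression level of a gene, so rows with very different magnitudes can share one color scale in a heatmap.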