Moduler.py — 2123 lines (1890 loc) · 86.2 KB
#!/usr/bin/python
"""
**Moduler Copyright (C) 2012 Jose Sergio Hleap, with contributions of Kyle Nguyen, Alex Safatli and Christian Blouin**
Graph based Modularity.
This script will evaluate the data for modules. Such modules are defined as correlating variables, so the clustering
is performed in the correlation space. It has an optional statistical significance test for the clustering and power
analysis of the result, as well as a bootstrap analysis. See options for more details.
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
E-mail: jshleap@dal.ca
In this version the R dependencies have been extracted, and with them the RV coefficient test.
Requirements:
-------------
1. numpy module
2. scipy module
3. statsmodels module
4. matplotlib
5. scikit-learn
6. PDBnet module: This is an open source module and can be found in :download: `LabBlouinTools<https://github.com/LabBlouin/LabBlouinTools>`
*To install python modules 1-5 in UBUNTU: sudo apt-get install python-<module> OR sudo easy_install -U <module>
OR sudo pip install -U <module>*
*For PDBnet, the whole directory should be downloaded into a labblouin folder which should be in your pythonpath*
Citation
========
If you use this software, please cite:
Hleap, J.S., Susko, E., & Blouin, C. Defining structural and evolutionary modules in proteins: a community detection approach to explore sub-domain architecture. BMC Structural Biology, 13, 20.
If you use the LDA pre-merge approach, please cite:
Hleap, J.S. & Blouin, C. 2014. Inferring meaningful communities from topology constrained correlation networks. PLOS ONE 9(11):e113438. DOI:10.1371/journal.pone.00113438
"""
__author__ = ['Jose Sergio Hleap',]
__contributions__ = ['Kyle Nguyen', 'Alex Safatli', 'Christian Blouin']
__version__ = 1.1
# Standard Library Imports
from numpy import array, cov, corrcoef, zeros, mean, std, sqrt, savetxt, \
column_stack, isnan, rad2deg, arccos, where, concatenate, triu_indices,\
log, trace, arange, percentile, arctanh, triu, mat, sum, subtract,\
vstack, equal, nonzero, savetxt, ndarray
from numpy.random import choice
from numpy.random import rand as randm
from numpy.linalg import eig, pinv
from scipy.stats import pearsonr, kendalltau, spearmanr, norm
from igraph import Graph, VertexClustering
from os.path import isfile,dirname,join
from matplotlib.patches import Ellipse
from joblib import Parallel, delayed
from collections import Counter
from copy import deepcopy
from random import sample
from os import rename
from glob import glob
import optparse
import datetime
import sys
try: import dill as pickle
except ImportError: import pickle
# Third Party Imports
import statsmodels.stats.power as smp
from sklearn.lda import LDA
# Library-Specific Imports
from labblouin.PDBnet import PDBstructure as P
# __licence__ = open(join(dirname(__file__),'gpl.txt')).read()
# main classes
class GMdata:
'''
GM object that populates GM data
:param prefix: Prefix of your GM file and accompanying files
:type prefix: string
:param dimension: The number of cartesian dimensions to be analyzed. By default is set to 3.
:type dimension: integer
:param t: Type of input. Either gm or csv, both semicolon delimited files, but the former includes names of the observations in the first field. By default is gm
:type t: string
:param asdistance: whether or not to transform the data into a distance matrix
:type asdistance: boolean
:param Matrix: user defined matrix. By default a file will be read, based on t.
:type Matrix: :class `numpy.array`
'''
def __init__(self, prefix, dimension = 3, t='gm',asdistance=False, contacts=False,
Matrix=False):
self.dim = dimension
self.prefix = prefix
self.sample_size = None
self.type = t
if isfile(self.prefix+'.pdb'): self.pdb = P(self.prefix+'.pdb')
if not isinstance(Matrix,bool):
self.data = []
for d in xrange(self.dim):
self.data.append(Matrix[:,d::self.dim])
self.data = array(self.data)
else:
# read file
self.Read_GM()
# transform to distance? NOTE: the boolean flag shadows the module-level
# asdistance() helper, so fetch the function from the module namespace
if asdistance:
globals()['asdistance'](self)
if contacts:
self.Load_contacts()
else:
self.contacts = False
def Read_GM(self):
'''
Load data from a gm (coordinates file) file. The file is a semicolon
separated file with row names in the first field.
'''
# Load Data ##########################################################
data = []
sample_size = 0
labels = []
# line for each dimension and create slush temp lists
temps = []
for i in range(self.dim):
temps.append([])
data.append([])
# Set the filename with the proper suffix
if self.type == 'csv':
fname = self.prefix+'.csv'
else:
fname = self.prefix+'.gm'
# Open file and read each line
with open(fname) as F:
for line in F:
#get the names
labels.append(line.strip().split(';')[0])
# Split into list
if self.type == 'gm':
if line.strip()[-1] == ';':
line = line.strip().split(';')[1:-1]
else:
line = line.strip().split(';')[1:]
elif self.type == 'csv':
if line.strip()[-1] == ';':
line = line.strip().split(';')[:-1]
else:
line = line.strip().split(';')
# check if there is a titleline
if all([not isFloat(x) for x in line]):
continue
# Dispatch values in the right dimension list using
#i%dim to index
for i in range(len(line)):
temps[i%self.dim].append(float(line[i]))
# Grow each matrix by 1 row
for i in range(len(temps)):
data[i].append(temps[i])
sample_size += 1
# Flush temps
temps = []
for i in range(self.dim):
temps.append([])
self.data = array(data)
self.sample_size = sample_size
self.labels = labels
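The `i % dim` dispatch used above to split interleaved coordinates can be shown in isolation (a Python 3 sketch with a hypothetical two-landmark row; Read_GM itself targets Python 2):

```python
def split_gm_line(line, dim=3):
    """Split one semicolon-delimited .gm row (name;x1;y1;z1;x2;y2;z2;...)
    into the row name and one list of values per dimension."""
    fields = line.strip().strip(';').split(';')
    name = fields[0]
    values = [float(v) for v in fields[1:]]
    dims = [[] for _ in range(dim)]
    for i, v in enumerate(values):
        dims[i % dim].append(v)  # same i % dim dispatch as Read_GM
    return name, dims

name, dims = split_gm_line('obs1;1.0;2.0;3.0;4.0;5.0;6.0;')
```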
def bootstrap_replicate(self):
'''
Create a bootstrap replicate of the data
'''
temp = []
ch = choice(self.data.shape[1],self.data.shape[1])
for d in xrange(self.dim):
temp.append(self.data[d][ch,:])
self.boot = array(temp)
return self.boot
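The same row resampling can be sketched with modern NumPy (illustrative only; it assumes data shaped (dim, observations, landmarks) as in self.data):

```python
import numpy as np

def bootstrap_rows(data, seed=None):
    """Resample observations (axis 1) with replacement, using the SAME
    row choice in every dimension, as bootstrap_replicate does."""
    rng = np.random.default_rng(seed)
    ch = rng.integers(0, data.shape[1], size=data.shape[1])
    return data[:, ch, :]

data = np.arange(24, dtype=float).reshape(2, 4, 3)  # dim x obs x landmarks
boot = bootstrap_rows(data, seed=0)
```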
def Load_contacts(self):
'''
Load the contacts from file, from PDBstructure or None
'''
self.contacts=[]
fn = '%s.contacts'%(self.prefix)
if isfile(fn):
with open(fn) as F:
for line in F:
l = line.strip().strip('(').strip(')').split(',')
self.contacts.append(tuple(l))
elif isfile('%s.pdb'%(self.prefix)):
if not self.pdb: self.pdb = P('%s.pdb'%(self.prefix))
contacts = self.pdb.Contacts(fasta='%s.fasta'%(self.prefix))
if isinstance(contacts,list): self.contacts = contacts
elif isinstance(contacts,dict):
c=[]
for v in contacts.values():
if not c: c=v
else:
sc = set(c)
c = list(sc.union(v))
self.contacts = c
with open(self.prefix+'.contacts','w') as F:
F.write('\n'.join([str(x) for x in self.contacts]))
else:
raise Exception('No Contact file nor PDB file provided')
def Correlated(self,GMstatsInstance):
'''
Include a GMstats instance
:param GMstatsInstance: An instance of the class GMstats
:type GMstatsInstance: :class GMstats
'''
self.GMstats = GMstatsInstance
###############################################################################
class GMstats:
'''
Include all stats related things with a GM file
:param prefix: Prefix of your GM file and accompanying files
:type prefix: string
:param matrix: A numpy nd array with the coordinates or info to be analysed. It contains the dimensions as first element, rows and columns follow.
:type matrix: :class `numpy.array`
:param dimensions: The number of cartesian dimensions to be analyzed. By default is set to 3.
:type dimensions: integer
:param sample_size: Number of observations
:type sample_size: integer
'''
def __init__(self,prefix,matrix,dimensions,lms=None,sample_size=None,rand=True):
self.data = matrix
self.prefix = prefix
self.dim = dimensions
self.lms = lms
self.rand = rand
if sample_size is None: self.n = self.data.shape[0]
else: self.n = sample_size
if self.lms is None:
self.Compute_correlations()
def Compute_correlations(self, method='fisher', absolutecov=False,
confval=0.95, threshold=0.0, usecov=False,
writemat=False,additive=True,power=0.8):
'''
:param method: Which method to use for correlation. By default it uses the fisher transformation. Available options are pearson, spearman and fisher.
:type method: string
:param confval: Define the confidence level (1-alpha) for the correlation test. By default is 0.95. Can take any float value between 0 and 1.
:type confval: float
:param absolutecov: Whether or not to use absolute values of the correlation matrix
:type absolutecov: boolean
:param power: Perform a statistical power analysis with this value as the desired Power (1-type II error). By default 0.8 is used. Can be any float from 0 to 1
:type power: float
:param usecov: Use covariance instead of correlation.
:type usecov: boolean
:param writemat: Write correlation/covariance matrices to file. By default it is False. It can take 'agg' for the aggregated matrix or 'cor' for the correlation matrices.
:type writemat: boolean or string
:param additive: Use the euclidean distance of the correlations in each dimension (as opposed to their mean) to agglomerate the dimensions. The default behaviour is additive.
:type additive: boolean
'''
self.method = method
self.confval = confval
self.absolutecov = absolutecov
self.power = power
self.threshold = threshold
self.usecov = usecov
self.writemat = writemat
self.additive = additive
thr = 0
# Create a correlation matrix testing for significance ################
if self.method:
self.Sigcorr()
# Else, use covariance matrix instead or correlation without
# statistical test ####################################################
elif self.usecov:
thr = self.UseCov()
else:
thr = self.UseCorr()
if self.writemat == 'cor':
for m in range(len(self.matrices)):
savetxt(self.prefix+str(m)+'.mat',self.matrices[m],
delimiter=';', fmt='%s')
# Agglomerate landmark dimensions #####################################
if self.additive:
self.agglomerare_additive()
else:
self.agglomerare_mean()
if self.writemat == 'agg':
savetxt(self.prefix+'.lms',self.lms,delimiter=';',fmt='%s')
def agglomerare_additive(self):
'''
Agglomerate landmark dimensions using euclidean distance
'''
lms = []
for i in range(0,self.matrices[0].shape[0]):
temp = []
for j in range(0, self.matrices[0].shape[0]):
sm = 0.0
# Sum over all dimensions
for m in self.matrices:
d = m[i][j]
if self.absolutecov:
d = abs(d)
sm += (d**2)
sq = sqrt(sm)
temp.append(sq)
lms.append(temp)
self.lms = array(lms)
def agglomerare_mean(self):
'''Agglomerate landmark dimensions using average of correlation '''
lms = []
for i in xrange(0,self.matrices[0].shape[0]):
temp = []
for j in xrange(0, self.matrices[0].shape[0]):
sm = 0.0
# Sum over all dimensions
for m in self.matrices:
d = m[i][j]
if self.absolutecov:
d = abs(d)
sm += (d/self.dim)
temp.append(sm)
lms.append(temp)
self.lms = array(lms)
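Both agglomeration loops above can be vectorized; this sketch mirrors agglomerare_additive (element-wise euclidean norm across the per-dimension matrices) and is illustrative, not the code the class calls:

```python
import numpy as np

def aggregate_additive(matrices, absolutecov=False):
    """Collapse per-dimension correlation matrices into a single matrix
    via the element-wise euclidean norm, as agglomerare_additive does."""
    m = np.asarray(matrices, dtype=float)
    if absolutecov:
        m = np.abs(m)  # kept for parity with the loop; squaring makes it a no-op
    return np.sqrt((m ** 2).sum(axis=0))

agg = aggregate_additive([[[3.0, 0.3], [0.3, 3.0]],
                          [[4.0, 0.4], [0.4, 4.0]]])
```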
def Sigcorr(self):
'''
Test if the correlation is significantly different than 0 with the method specified
'''
if isbetterthanrandom(self.dim,self.data):
self.matrices = []
if self.dim == 1:
self.matrices.append(self.SigCorrOneMatrix(self.data))
else:
for i in xrange(len(self.data)):
# Perform the significant correlation test in each dimension
self.matrices.append(self.SigCorrOneMatrix(self.data[i]))
else:
sys.exit("Your data's correlation is not better than random. Exiting Moduler")
def SigCorrOneMatrix(self, sliced):
'''
Performs the significance of correlation test according to the method passed
:param sliced: an array with single dimensional data
:type sliced: :class `numpy.array`
'''
data = sliced
# Get a matrix of zeroes.
zero=zeros(shape=(data.shape[1], data.shape[1]))
for e in range(len(data.T)):
for f in range(e, len(data.T)):
if self.method == 'pearson' or self.method == 'fisher':
p=pearsonr(array(data.T[e]).ravel(),array(data.T[f]).ravel())
if self.method == 'kendall':
p=kendalltau(array(data.T[e]).ravel(),array(data.T[f]).ravel())
if self.method == 'spearman':
p=spearmanr(array(data.T[e]).ravel(),array(data.T[f]).ravel())
# Symmetrize
if self.method == 'fisher':
if p[0] == 1.0:
p = (0.999,p[1])
if self.absolutecov:
if abs(self.F_transf(p[0])) > self.Z_fisher()\
and self.Power_r(p[0]) <= self.n:
zero[e][f] = zero[f][e] = abs(p[0])
else:
zero[e][f] = zero[f][e] = 0.0
elif self.F_transf(p[0]) > self.Z_fisher()\
and self.Power_r(p[0]) <= self.n:
zero[e][f] = zero[f][e] = p[0]
else:
zero[e][f] = zero[f][e] = 0.0
else:
if (p[1]<= (1-self.confval)) and (self.Power_r(p[0]) <= self.n):
zero[e][f] = zero[f][e] = p[0]
else:
zero[e][f] = zero[f][e] = 0.0
return zero
def F_transf(self,r):
'''
Compute the Fisher transformation of correlation
'''
return 0.5 * (log((1+r)/(1-r)))
def Z_fisher(self):
'''
Compute the sample-size-corrected Z_alpha for hypothesis testing of the Fisher transformation of correlation
'''
confval = 1-((1-self.confval)/2)
return norm.isf(1-confval) / sqrt(self.n-3)
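F_transf and Z_fisher together implement a two-sided Fisher z test that r differs from 0; a stdlib-only Python 3 sketch of the same decision rule:

```python
import math
from statistics import NormalDist

def fisher_significant(r, n, confval=0.95):
    """True when |z(r)| exceeds the sample-corrected critical value,
    matching the F_transf/Z_fisher pair above."""
    z = 0.5 * math.log((1 + r) / (1 - r))                  # Fisher z (F_transf)
    alpha = 1 - confval
    z_crit = NormalDist().inv_cdf(1 - alpha / 2) / math.sqrt(n - 3)
    return abs(z) > z_crit
```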
def Power_r(self, corr):
'''
Compute the power of the correlation using the Z' transformation of the correlation coefficient:
Z' = arctanh(r) + r/(2*(n-1)) (see Cohen (1988) p. 546).
It will return the required n for the power and significance chosen.
:param corr: Correlation value
:type corr: float
'''
# clamp the correlation so arctanh and the power solver stay defined
if -0.001 < corr < 0.01:
corr = 0.0012
elif corr >= 0.99:
corr = 0.99
'''
r('p=pwr.r.test(power=%f,r=%f, sig.level=%f)'%(float(self.power), float(corr), float(1-self.confval)))
n = r('p[1]')[0]
if this part is reverted:
Require the R package PWR written by Stephane Champely <champely@univ-lyon1.fr>.
'''
#This is in beta to avoid dependencies outside python
n = smp.NormalIndPower().solve_power(arctanh(corr), alpha=float(1-self.confval), ratio=0,
power= float(self.power), alternative='two-sided') + 3
return n
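The statsmodels call above solves for the sample size needed to detect r; the textbook closed form (Fisher z approximation) gives essentially the same number, shown here as an independent sketch rather than the method the script uses:

```python
import math
from statistics import NormalDist

def required_n(r, power=0.8, alpha=0.05):
    """Approximate sample size to detect correlation r at the given
    power and two-sided alpha: n = ((z_a + z_b) / arctanh(r))**2 + 3."""
    nd = NormalDist()
    z_a = nd.inv_cdf(1 - alpha / 2)   # critical value for the test
    z_b = nd.inv_cdf(power)           # quantile for the desired power
    return ((z_a + z_b) / math.atanh(r)) ** 2 + 3

n = required_n(0.5)
```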
def UseCov(self):
'''
Create a variance-covariance matrix per dimension
'''
thr = []
self.matrices = []
for i in range(len(self.data)):
# Covariance of the variables (columns) in dimension i
self.matrices.append(cov(mat(self.data[i]).T))
if self.threshold == 'Auto':
#get the threshold from the covariance matrix#
for e in range(len(self.matrices[i])):
thr.append(std(self.matrices[i][e]))
return thr
def UseCorr(self):
'''
Use pearson correlation without a significance test
'''
thr = []
self.matrices = []
for i in range(len(self.data)):
# Correlation of the variables (columns) in dimension i
self.matrices.append(corrcoef(mat(self.data[i]).T))
if self.threshold == 'Auto':
#get the threshold from the correlation matrix#
for e in range(len(self.matrices[i])):
thr.append(std(self.matrices[i][e]))
return thr
def GetAglomeratedThreshold(self):
''' Get threshold for the n dimensions'''
# Call if you need to get the vector of threshold
thr = []
for i in range(len(self.lms)):
for j in range(len(self.lms)):
if i !=j and i < j and self.lms[i][j] != 0.00000:
thr.append(self.lms[i][j])
threshold = std(thr)
self.threshold = threshold
def LDA(self, membership, which='lms', ellipses=2.5):
'''
Perform a Linear discriminant analysis of the transposed data. Membership must be an array
of integers of the same length as the number of observations in the data.
:param membership: a list corresponding to memberships of the entries in data
:type membership: list
:param which: Either 'gm' or 'lms'. To perform the LDA in the full matrix or only in the correlation matrix.
:type which: string
:param ellipses: A value representing the standard deviations for confidence ellipses. By default it is set to 2.5 (95% confidence ellipses)
:type ellipses: float
'''
if which == 'gm':
data = self.data
if len(data.shape) == 3:
membership = membership*data.shape[0]
data = column_stack(tuple([x for x in data]))
elif any(((data.shape[0] == len(membership)), (data.shape[1] == len(membership)))):
m = []
for i in xrange(len(membership)):
m.extend([membership[i]]*self.dim)
membership = m
else: data = self.lms
membership = array(membership)
#dont include singletons
nonsingles = where(membership != '?')
membership = membership[nonsingles]
if len(set(membership)) >= 2:
data = data[nonsingles]
lda = LDA(n_components=2)
self.fit = lda.fit(data.T, membership).transform(data.T)
if ellipses:
ell = getEllipses(self.fit, membership,ellipses)
else:
print 'Not enough groups to perform LDA prefiltering. Exiting the program.'
sys.exit()
def Clus_Power(self, membership):
'''
Make a power analysis of the clusters by igraph and outputs a table with the proportion
of elements above the n required according to significance, power and correlation.
It can test membership vectors that are not stored in the class.
:param membership: a membership vector in list form or the name of a file.
:type membership: list or string
'''
if isinstance(membership,str): mem = open(self.prefix+'.graphcluster').read().strip().split()
else: mem = membership
# alternative dictionary
proportion={}
# Split the data into the clusters ignoring singletons. This assumes that singletons have
# been identified
darr = clus2list(mem,self.matrices)
if darr:
# get all outputs
resline = '#################################################################\n'
resline+= '#Power analysis and description of the intracluster correlation:#\n'
resline+= '#################################################################\n\n'
resline+= '1 - Significance(alpha)= %f;\t1 - Power(beta):%f\n'%(1-self.confval,1-self.power)
resline+= 'Sample Size : %d\n\n'%(self.n)
#loop over clusters
for clusters in darr:
c=darr[clusters]
# if is a list of n dimensions zip each correlation and return the max
if isinstance(c,list): c=array([max(x) for x in zip(*c)])
# estimate quantiles
q = percentile(c, [0,25,50,75,100], axis=None, out=None, overwrite_input=False)
# estimate the proportion of variables with enough power
enough=0.0
for x in c:
nsam = self.Power_r(x)
if int(nsam) <= self.n:
enough += 1.0
prop = enough/len(c)
tr = ' 0% \t\t 25% \t\t 50% \t\t 75% \t\t 100% \t\t PVP \t\t nvar \n'
resline+= tr +' %f \t %f \t %f \t %f \t %f \t %f \t %d/%d \n\n'%(round(q[0],3),round(q[1],3),
round(q[2],3),round(q[3],3),round(q[4],3),
round(prop,3), enough, len(c))
proportion[clusters]=prop
resline+= '_'*len(tr)+'\n'
resline+= 'Percentages are the quantiles of the lower triangle of the correlation matrix\n'
resline+= 'PVP: Proportion of variables with enough statistical power \n'
resline+= 'nvar: Number of variables with enough power within the cluster \n'
# Open the output file and write resline
# write to file
with open('%s_Power_analysis.asc'%(self.prefix),'w') as ouf:
ouf.write(resline)
print resline
else:
print 'Only singletons in the dataset. Either the correlation is zero or there is not enough power '\
'to resolve a negligible correlation.'
self.PVP = proportion
return proportion
###############################################################################################
class GMgraph:
'''
Create a graph based on an input in matrix form and compute modularity
:param Matrix: an square matrix where the indices represent intended nodes and the values the relationship between them
:type Matrix: :class `numpy.array` or :class `GMstats`
:param unweighted: create an unweighted graph as opposed to a weighted (correlation; default) one.
:type unweighted: boolean
:param gfilter: List of tuples corresponding to desired connections of the graph. Each element in a pair tuple must correspond to node indices in the graph. This is a topology constraint.
:type gfilter: list of tuples
:param threshold: A float corresponding to the threshold to create a connection between nodes. This is very user dependent and is set to 0 by default.
:type threshold: float
:param rand: whether to test if your data is better than random. Default is True
:type rand: boolean
'''
def __init__(self, prefix, Matrix, unweighted=False, gfilter=[], threshold=0.0, rand=True):
if isinstance(Matrix,GMstats): self.lms = Matrix.lms; self.gmstats = Matrix; self.gmgraph=False
elif isinstance(Matrix, Graph): self.gmgraph = True ; self.gmstats = False
else: self.lms = Matrix ; self.gmstats = False; self.gmgraph=False
self.gfilter = gfilter
self.threshold = threshold
self.unweighted= unweighted
self.membership= None
self.prefix = prefix
self.rand = rand
#check if the graph has been built before; if so load it, otherwise execute Build_igraph
if isinstance(Matrix, ndarray) and not self.gmgraph:
if self.prefix:
if not isfile(str(self.prefix) + '.igraph.pickl'): self.Build_igraph()
else: self.g = pickle.load(open(self.prefix + '.igraph.pickl'))
elif isinstance(Matrix,Graph):
self.g = Matrix
def Build_igraph(self):
'''
Build a graph, using igraph library. It will return it, and store it as an attribute (self.g)
:returns: :class:`igraph.Graph`
'''
# create a multiplier if the weights are not high enough for resolution
if (self.lms.max()/100000.) <= 1e-4: multiplier = 100000
else: multiplier=1
g = Graph(len(self.lms))
for i in range(len(g.vs)):
g.vs[i]['label'] = i
for i in range(len(self.lms)):
for j in range(i+1,len(self.lms)):
#Assign edges if value is over the threshold
if self.lms[i][j] and self.lms[i][j] >= self.threshold:
if self.gfilter:
#Only add edges if not filtered out
if ((i,j) in self.gfilter) or ((str(i),str(j)) in self.gfilter):
g.add_edge(i,j)
if not self.unweighted:##if unweighted, don't include weight
g.es[len(g.es)-1]['wts'] = int(self.lms[i][j]*multiplier)
else:
g.add_edge(i,j)
if not self.unweighted:##if unweighted, don't include weight
g.es[len(g.es)-1]['wts'] = int(self.lms[i][j]*multiplier)
self.g = g
g.write_pickle(fname=self.prefix + '.igraph.pickl')
g.write(self.prefix+'.edges')
return g
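The edge-selection rule inside Build_igraph (nonzero weight, at or above the threshold, optionally restricted by the topology filter) can be isolated without the igraph dependency; a small illustrative sketch:

```python
def threshold_edges(lms, threshold=0.0, gfilter=None):
    """Return the upper-triangle (i, j, weight) pairs that Build_igraph
    would turn into edges."""
    edges = []
    for i in range(len(lms)):
        for j in range(i + 1, len(lms)):
            w = lms[i][j]
            if w and w >= threshold:
                # topology constraint: only keep allowed pairs
                if gfilter is None or (i, j) in gfilter:
                    edges.append((i, j, w))
    return edges

lms = [[0.0, 0.9, 0.1],
       [0.9, 0.0, 0.5],
       [0.1, 0.5, 0.0]]
edges = threshold_edges(lms, threshold=0.4)
```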
def Get_StructProps(self,overall=False):
'''
Get the structural properties in the graph
:param overall: Calculate the centralities on the full graph, as opposed as by modules ( Default behaviour ).
:type overall: boolean
'''
if self.unweighted: weights = None
else: weights = self.g.es['wts']
if overall:
self.betweeness = self.g.betweenness(weights=weights)
self.edge_betweenness = self.g.edge_betweenness(weights=weights)
self.degree = self.g.degree()
self.closeness = self.g.closeness(weights=weights)
self.eigen = self.g.eigenvector_centrality(weights=weights)
self.shell_index = self.g.coreness()
else:
mem = self.membership
evcen,btcen,clcen,degree,edge_bt,shell=[],[],[],[],[],[]
for e in sorted(set(mem), key=str):
# build the subgraph induced by the members of cluster e
sg = self.g.subgraph([v for v in range(len(mem)) if mem[v] == e])
sw = None if self.unweighted else sg.es['wts']
evcen.append(sg.eigenvector_centrality(weights=sw))
btcen.append(sg.betweenness(weights=sw))
clcen.append(sg.closeness(weights=sw))
degree.append(sg.degree())
edge_bt.append(sg.edge_betweenness(weights=sw))
shell.append(sg.coreness())
self.betweeness = btcen
self.edge_betweenness = edge_bt
self.degree = degree
self.closeness = clcen
self.eigen = evcen
self.shell_index = shell
def Graph_Cluster(self, method = 'fastgreedy',**kwargs):
'''
Clustering by connected components, using igraph library.
:param method: method in igraph for community detection. It can be fastgreedy, infomap, leading_eigenvector_naive, leading_eigenvector, label_propagation, multilevel, optimal_modularity, edge_betweenness, spinglass, or walktrap. For details see the igraph documentation.
:type method: string.
:param kwargs: other arguments passed to the igraph methods
'''
self.ClusterMethod = method
memgreed = [0]*self.lms.shape[0]
index = 0
#get independent components of the graph
comp = self.g.components()
print "There are %d components in the network."%(len(comp))
# Check if only singletons
if len(comp) == len(self.lms):
Warning('Only singletons in the dataset')
mem = range(len(comp))
for e in range(len(memgreed)):
memgreed[e] = e
else:
# loop over each component and perform modularity optimization given the method provided
for cp in range(len(comp)):
cpn = comp.subgraph(cp)
if len(cpn.vs) > 1:
try: mem = getattr(cpn,'community_'+method)(cpn.es['wts'],**kwargs).as_clustering()
except: mem = getattr(cpn,'community_'+method)(cpn.es['wts'],**kwargs)
for i in mem:
for j in i:
memgreed[cpn.vs[j]['label']] = index
index += 1
else:
memgreed[comp[cp][0]] = index
index += 1
self.mem = memgreed
self.vertexclus = VertexClustering(self.g,self.mem)
self.membership = self.vertexclus.membership
try: self.modularityScore = self.g.modularity(self.membership, self.g.es['wts'])
except: self.modularityScore = self.g.modularity(self.membership)
print "%d cluster(s) found."%(len(set(self.membership)))
print "Modularity score: %f"%(self.modularityScore)
def Cluster2File(self):
''' Write the cluster to a file, renaming the clusters with PDB-friendly chain characters when possible (this is for a specific use case)'''
chains = ['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','Y','Z',
'a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','y','z',
'0','1','2','3','4','5','6','7','8','9']
line=''
if isinstance(self.membership ,str): memb = self.membership.strip().split()
else: memb = self.membership
if not len(set(memb)) > len(chains):
for i in memb:
if isinstance(i,str) and i.isdigit():
line += chains[int(i)]+' '
elif isinstance(i,str) and not i.isdigit():
line += i + ' '
else:
for i in memb:
line += str(i)+' '
with open(self.prefix+'.graphcluster','w') as fout:
fout.write(line)
print 'Membership vector found:'
print line
print 'Membership vector written to %s'%(self.prefix+'.graphcluster')
self.membership = line.strip().split()
def Identify_Singletons(self,method='fastgreedy'):
''' Given a membership vector identify singletons'''
if not self.membership: self.Graph_Cluster(method=method)
nc=''
d = Counter(self.membership)
for e in self.membership:
if d[e] == 1:
nc+= '? '
else:
nc+= '%s '%(str(e))
self.membership = nc
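The singleton-masking rule above (labels that occur exactly once become '?') in isolation, returning a list rather than the space-joined string the method builds:

```python
from collections import Counter

def mark_singletons(membership):
    """Replace cluster labels that occur exactly once with '?',
    as Identify_Singletons does."""
    counts = Counter(membership)
    return ['?' if counts[m] == 1 else str(m) for m in membership]

marked = mark_singletons([0, 0, 1, 2, 2])
```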
def LDAmerge(self,which='lms',ellipses=2.5,dimensions=1):
'''
Perform an LDA analysis and merge all classes whose 95% confidence ellipses collide.
:param which: Either 'gm' or 'lms'. To perform the LDA in the full matrix or only in the correlation matrix.
:type which: string
:param ellipses: A value representing the standard deviations for confidence ellipses. By default it is set to 2.5 (95% confidence ellipses)
:type ellipses: float
:param dimensions: dimensions of the matrix
:type dimensions: integer
'''
if self.gmstats:
self.gmstats.LDA(self.membership, which=which, ellipses=ellipses)
fit = self.gmstats.fit
else:
stats = GMstats(self.prefix, self.lms, dimensions=dimensions,
sample_size=None,rand=self.rand)
stats.LDA(self.membership, which=which, ellipses=ellipses)
fit = stats.fit
Ellipses = getEllipses(fit, self.membership, std=ellipses)
merges=[]
#Check for collisions between classes with 95% confidence ellipses
K=Ellipses.keys()
for i in range(len(Ellipses)):
for j in range(len(Ellipses)):
if i > j:
if EllipseOverlap(Ellipses[K[i]],Ellipses[K[j]]):
merges.append((K[i],K[j]))
#merge unsupported clusters
newcl = equiclus(merges)
self.newmem = rename_clusters(newcl, self.membership)
###############################################################################################
class SupportClustering:
'''
This class provides ways to give statistical support for a given clustering scheme. It is based on
testing whether the correlation within groups is significantly different from that between groups.
:param prefix: a prefix for the output.
:type prefix: string
:param data: a 2D numpy array with the data from which the clustering was inferred.
:type data: :class `numpy.ndarray`
:param membership: a list equal to the second dimension of data (i.e. data.shape[1]), corresponding to the clustering scheme being tested.
:type membership: list
:param dimensions: Number of dimensions in your data matrix. If your matrix is a correlation or related matrix, dimensions should be one.
:type dimensions: integer
:param permutations: Number of permutations to perform the permutation t-test.
:type permutations: integer
:param confval: confidence value for the test (1 - alpha).
:type confval: float
:param threshold: Value to filter out values of the correlation. If set to none, no threshold filtering will be done.
:type threshold: float or None
:param rand: whether to test if your data is better than random. Default is True
:type rand: boolean
'''
def __init__(self,prefix, data, membership, dimensions, permutations, confval=0.95,
threshold=0.0, rand=True):
self.prefix = prefix
if isinstance(data,GMdata):
self.data = data.data
self.GMdata = data
else:
self.data = data
self.GMdata = None
self.confval = confval
if self.data.shape[0] == self.data.shape[1]: self.lms = self.data
else: self.DealDimensions()
self.dim = dimensions
self.perm = permutations
self.thres= threshold
self.mem = membership
self.rand = rand
# containers filled by subsetTest
self.pvals = []
self.cl = {}
def DealDimensions(self):
'''
If the data has more than one dimension (in the cartesian sense), split it,
compute the correlation of each dimension, and aggregate them as the euclidean
norm of the coefficients of determination. It assumes that the dimensions are interleaved.
This correlation does not have a significance test; use caution.
It is recommended to run the GMstats class before using this class.
'''
dims=[]
Rs=0
for i in xrange(self.dim):
dims.append(range(i, self.data.shape[1], self.dim))
for d in dims:
C = corrcoef(self.data[:,d], rowvar=0)
Rs+= C**2
self.lms = sqrt(Rs)
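DealDimensions aggregates one correlation matrix per dimension into a single matrix. A minimal standalone sketch of the same idea, assuming interleaved columns (x0, y0, x1, y1, ...); `aggregate_interleaved` is a hypothetical helper, not part of this module:

```python
import numpy as np

def aggregate_interleaved(data, ndim):
    # Columns are interleaved by dimension: x0, y0, x1, y1, ...
    # Correlate each dimension separately, then combine the
    # coefficients of determination (R^2) by a euclidean-style sum.
    rs = 0
    for i in range(ndim):
        cols = list(range(i, data.shape[1], ndim))
        c = np.corrcoef(data[:, cols], rowvar=False)
        rs = rs + c ** 2
    return np.sqrt(rs)
```

For a matrix with 6 interleaved columns and ndim=2, the result is a 3x3 matrix whose diagonal is sqrt(ndim), since each per-dimension correlation of a variable with itself is 1.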
def subsetTest(self, k1,k2):
'''
Given the names of two clusters, split the data into intra- and
inter-group correlations and test them for equality.
:param k1: Name of one of the clusters being compared
:type k1: string or integer
:param k2: Name of the other cluster being compared
:type k2: string or integer
'''
# if no threshold is required (self.thres is None or False), use a very
# negative number so that no values are filtered out
if self.thres: threshold = self.thres
elif self.thres != 0.0: threshold = -1e300
else: threshold = 0.0
# get the indices of each group from lms
indA = where(array(self.mem) == k1)
indB = where(array(self.mem) == k2)
# get the elements corresponding to intragroups (A and B) and intergroups (AB)
# only the upper triangle
A = self.lms[indA[0],:][:,indA[0]][triu_indices(len(indA[0]))]
B = self.lms[indB[0],:][:,indB[0]][triu_indices(len(indB[0]))]
AB = self.lms[indA[0],:][:,indB[0]].flatten()
# select cases above the threshold
A = A[where(A > threshold)]
B = B[where(B > threshold)]
AB = AB[where(AB > threshold)]
if not self.rand:
# test if random cases are significant
RA = randm(A.shape[0])
RAB = randm(AB.shape[0])
pvalR= twotPermutation(RA,RAB,self.perm)
if pvalR < (1-self.confval):
print 'Not enough data to discern %s vs %s from random. Setting to insignificant'%(k1,k2)
pvalA=1
pvalB=1
else:
if not isbetterthanrandom(self.dim, self.data):
print 'Not enough data to discern %s vs %s from random. Setting to insignificant'%(k1,k2)
pvalA=1
pvalB=1
else:
# perform the permutation t-test
pvalA= twotPermutation(A, AB,self.perm)
pvalB= twotPermutation(B, AB,self.perm)
# store pvalues in their corresponding containers
self.pvals.extend([pvalA,pvalB])
self.cl['%svs%s_%s'%(k1,k1,k2)]= pvalA
self.cl['%svs%s_%s'%(k2,k1,k2)]= pvalB
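`twotPermutation` is an external dependency here. As an illustration of what a two-sample permutation test on the difference of means does (`perm_ttest_pvalue` is a hypothetical stand-in for illustration, not the function this module calls):

```python
import numpy as np

def perm_ttest_pvalue(a, b, n_perm=999, seed=0):
    # Two-sample permutation test on the difference of means:
    # shuffle the pooled values, re-split them into two groups of the
    # original sizes, and count how often the permuted |mean difference|
    # reaches the observed one.
    rng = np.random.RandomState(seed)
    pooled = np.concatenate([a, b]).astype(float)
    observed = abs(a.mean() - b.mean())
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(pa.mean() - pb.mean()) >= observed:
            count += 1
    # add-one correction keeps the p-value strictly positive
    return (count + 1.0) / (n_perm + 1.0)
```

Two clearly separated samples yield a small p-value, while identical samples yield p = 1, because every permutation matches the observed difference of zero.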
def get_cluster_indices(self):
'''
Returns a dictionary with cluster indices
'''
if isinstance(self.mem ,str): self.mem = self.mem.strip().split()
s = set(self.mem)
d={}
# get cluster indices
for l in s:
d[l]=[]
for e in range(len(self.mem)):
if self.mem[e] == l:
d[l].append(e)
return d
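A compact equivalent of the index-gathering above, shown as a standalone sketch (`cluster_indices` is a hypothetical name for illustration):

```python
def cluster_indices(membership):
    # Map each cluster label to the positions where it occurs
    d = {}
    for i, label in enumerate(membership):
        d.setdefault(label, []).append(i)
    return d
```

For example, `cluster_indices(['a', 'b', 'a'])` returns `{'a': [0, 2], 'b': [1]}`.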
def filter_singletons(self,D):
'''
Given a dictionary with cluster indices, return which ones are singletons
:param D: a dictionary with cluster indices (currently unused by this method)
:type D: dictionary
:returns: a dictionary mapping singleton labels to '?', and a new membership vector with singletons replaced by '?'
'''
singles={}
if isinstance(self.mem,str): self.mem = self.mem.strip().split()
C = Counter(self.mem)
if '?' in C:
# singletons already in membership vector
Warning('Singletons might be already filtered. Check your results carefully')
for i in C:
# if the count of the class is 1, it is a singleton
if (C[i] == 1) or (i == '?'):
singles[i]='?'
#Create a new membership vector replacing singletons by ?
newm=[]
for m in self.mem:
if m in singles.keys(): newm.append('?')
else: newm.append(m)
return singles, newm
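The core of the singleton masking can be summarized in a standalone sketch (`mask_singletons` is a hypothetical name; it omits the pre-existing '?' warning handled above):

```python
from collections import Counter

def mask_singletons(membership, missing='?'):
    # Replace labels that occur exactly once with the missing marker
    counts = Counter(membership)
    return [missing if counts[m] == 1 else m for m in membership]
```

For example, `mask_singletons(['a', 'a', 'b', 'c', 'c'])` returns `['a', 'a', '?', 'c', 'c']`.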
def AreModneighbours(self,A,indA,indB):
'''
Loop over the adjacency matrix to find whether indA and indB are in the
same neighborhood.
:param A: an adjacency matrix of related entries
:type A: list of lists
:param indA: indices of group A
:type indA: list
:param indB: indices of group B
:type indB: list
:returns: boolean of whether or not indA and indB are neighbours
'''
for i in indA:
for j in indB:
if A[i][j] == 1:
return True
return False
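The neighbourhood test reduces to asking whether any edge connects the two index sets. A standalone sketch using a generator expression (`are_neighbours` is a hypothetical name for illustration):

```python
def are_neighbours(adj, ind_a, ind_b):
    # True if any edge connects a node in ind_a to a node in ind_b
    return any(adj[i][j] == 1 for i in ind_a for j in ind_b)
```

`any()` short-circuits on the first edge found, so the full double loop only runs when the groups are not connected.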