The title is a little bit misleading. I basically need a "weighted" logistic regression with the weight applied to the log likelihood function. Instead of true logistic regression's loglik function: y*log(p) + (1-y)*log(1-p), I want w*(y*log(p) + (1-p)*log(1-p)). I am familiar with R, and one obvious way to solve it is to write down my customized log likelihood and call "optim" to optimize it. But I also want nice summary statistics available from glm object. I tried to use :
glm(y~x, weights = w, family="binomial")
the problem with this code is that "w" must be integer according to glm's doc: "For a binomial GLM prior weights are used to give the number of trials when the response is the proportion of successes: they would rarely be used for a Poisson GLM".
My "w" vectors are not integers. However, the requirement of integer is ok for summary statistics: the results are same as optimizing loglikelihood function solution. But when I ask for statistics like "AIC", "loglik", it's no longer correct! Here is a code snippet from binomial's aic function in R:
-2 * sum(ifelse(m > 0, (wt/m), 0) * dbinom(round(m *
y), round(m), mu, log = TRUE))
That's the problem!
Since I am playing with Python's statsmodels moduaroundle recently, I looked into its source code for class "Logit" and found it is extremely clear and should be easily extensible. So, here is a extended LogitWeight class that can do the work:
import statsmodels.api as sm
import numpy as np
class LogitWeight(sm.Logit):
def __init__(self, endog, exog, **kargs):
super(sm.Logit, self).__init__(endog, exog, **kargs)
self.weights = kargs.get("weights", np.ones_like(endog))
def loglike(self, params):
q = 2*self.endog - 1
X = self.exog
return np.sum(self.weights*np.log(self.cdf(q*np.dot(X, params))))
def loglikeobs(self, params):
q = 2*self.endog - 1
X = self.exog
return self.weights*np.log(self.cdf(q*np.dot(X, params)))
def jac(self, params):
y = self.endog
X = self.exog
L = self.cdf(np.dot(X, params))
return ((y - L) * self.weights)[:, None] * X
def score(self, params):
y = self.endog
X = self.exog
L = self.cdf(np.dot(X, params))
return np.dot((y - L)*self.weights, X)
def hessian(self, params):
X = self.exog
L = self.cdf(np.dot(X, params))
return -np.dot((self.weights*L*(1-L)*X.T), X)
This is almost identical to LogLik's code except the added weights and it works! It produces the same parameter estimation as R's optimizing loglikelihood solution and produces correct AIC and loglikelihood. The only problem is that "weights" must be a np array and cannot be a Pandas object (although endog and exog can be pandas.DataFrame or pandas.Series which are converted to np array by default initialization function).
Afterthought: I should be able to write a customized binomial family to replace R's default binomial family to achieve same results.
This comment has been removed by the author.
ReplyDeleteHi Lu,
ReplyDeleteThanks for this post,
I tried your method. I used the native SM Logit and LogitWeight.
logit = sm.Logit(y,X)
result = logit.fit(maxiter = max_iteration,disp=False)
logit_w = LogitWeight(y,X) #,weights = weight)
result_w = logit_w.fit(maxiter = max_iteration,disp=False)
Using the same data set, the former ran without issue but LogitWeight gave me this error:
Traceback (most recent call last):
File "", line 3, in
result_w = logit_w.fit(maxiter = max_iteration,disp=False)
File "C:\Users\seyed.mahboobi\AppData\Local\Continuum\Anaconda\lib\site-packages\statsmodels\discrete\discrete_model.py", line 1186, in fit
disp=disp, callback=callback, **kwargs)
File "C:\Users\seyed.mahboobi\AppData\Local\Continuum\Anaconda\lib\site-packages\statsmodels\discrete\discrete_model.py", line 164, in fit
disp=disp, callback=callback, **kwargs)
File "C:\Users\seyed.mahboobi\AppData\Local\Continuum\Anaconda\lib\site-packages\statsmodels\base\model.py", line 357, in fit
hess=hess)
File "C:\Users\seyed.mahboobi\AppData\Local\Continuum\Anaconda\lib\site-packages\statsmodels\base\model.py", line 405, in _fit_mle_newton
newparams = oldparams - np.dot(np.linalg.inv(H),
File "C:\Users\seyed.mahboobi\AppData\Local\Continuum\Anaconda\lib\site-packages\numpy\linalg\linalg.py", line 520, in inv
ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj)
File "C:\Users\seyed.mahboobi\AppData\Local\Continuum\Anaconda\lib\site-packages\numpy\linalg\linalg.py", line 90, in _raise_linalgerror_singular
raise LinAlgError("Singular matrix")
LinAlgError: Singular matrix
Any idea about the issue?
Thanks