It was always an unpleasant experience when a LendingClub loan in my portfolio defaulted. Out of curiosity, I would click into the loan and look for clues that I and my model might have missed. One possibly biased observation of mine: I saw many "nurse" and "teacher" defaults. For some time, my gut feeling was to lower my model's tendency to pick certain job titles; I resisted that temptation because I had no solid evidence to support it. LendingClub's loan applicants have tens of thousands of distinct job titles, some with typos, some abbreviated; it is difficult to treat a user-typed job title as an independent categorical input to the model -- that would create far too many factors, and a new job title could not be mapped onto the existing ones.
Recently I started playing with NLP, and a simple idea emerged:
- vectorize the job titles
- cluster the vectors
I used Google's pre-trained word2vec model to vectorize the job titles of borrowers in LendingClub's historical data, then grouped the resulting vectors with a clustering algorithm (e.g., KMeans).
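Here is a minimal sketch of that pipeline, assuming the GoogleNews word2vec binary is available locally and the loan data sits in a CSV with an `emp_title` column; the file names and the choice of 13 clusters are illustrative, not the exact setup I ran.

```python
import numpy as np
import pandas as pd
from gensim.models import KeyedVectors
from sklearn.cluster import KMeans

# Load Google's pre-trained 300-dimensional word2vec vectors.
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def title_to_vec(title):
    """Average the word vectors of a job title; zeros if no word is in the vocabulary."""
    words = [w for w in str(title).split() if w in w2v]
    if not words:
        return np.zeros(w2v.vector_size)
    return np.mean([w2v[w] for w in words], axis=0)

loans = pd.read_csv("lendingclub_loans.csv")  # hypothetical file name
vectors = np.vstack(loans["emp_title"].map(title_to_vec).tolist())

# Group the job-title vectors into 13 clusters (matching the table below).
kmeans = KMeans(n_clusters=13, random_state=0)
loans["title_cluster"] = kmeans.fit_predict(vectors)
```

Averaging word vectors is a crude way to embed a multi-word title, but it is enough to put "Business Analyst" near "Lead Analyst" and far from "Truck Driver".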
Here are some interesting findings:
This simple clustering exercise works reasonably well; it does group similar job titles together. For example, the following job titles all fall in one cluster:
Speech Language Pathologist
Chemistry Lead
Senior Database Researcher
Associate Professor
Researcher
Radiologic technologist
Professor
Statistician
Scientist
Geologist
Liability Analyst
Columnist
Senior Strategist
Inventory Analyst
Project Analyst
Lead Analyst
Business Analyst
Default rate by cluster:

Cluster   Default rate
0         0.161513
1         0.158743
2         0.178444
3         0.224322
4         0.197112
5         0.217364
6         0.276460
7         0.271106
8         0.221385
9         0.233080
10        0.206700
11        0.251355
12        0.186574
Some job clusters do have a higher default rate than others. The question is, does job title bring more information than I already have? Clearly, different jobs indicate different income levels. Will job title still have predictive power after I account for income and other financial health information? That remains to be seen...
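One way I could test that, continuing from the clustering sketch above (reusing the `loans` DataFrame and its `title_cluster` column), is to compare a default model with and without the cluster label. The column names here (`defaulted`, `annual_inc`, `dti`) are hypothetical stand-ins for the actual financial fields.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Baseline features vs. baseline + one-hot cluster label.
base_cols = ["annual_inc", "dti"]
X = loans[base_cols].join(pd.get_dummies(loans["title_cluster"], prefix="cluster"))
y = loans["defaulted"]  # assumed 0/1 default indicator

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

base = LogisticRegression(max_iter=1000).fit(X_train[base_cols], y_train)
full = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("AUC without clusters:",
      roc_auc_score(y_test, base.predict_proba(X_test[base_cols])[:, 1]))
print("AUC with clusters:   ",
      roc_auc_score(y_test, full.predict_proba(X_test)[:, 1]))
```

If the cluster feature barely moves the held-out AUC once income and debt ratios are in the model, then the "nurse and teacher" pattern was probably just income in disguise.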