Mentorship Web Tree Visualization

The following blog post presents a project completed for the 2017 Preparing the Future Professoriate class at Virginia Tech.


Project Goal

Mentorship is a profound part of the experience of academic research. A mentor who provides guidance and wisdom may earn our lifelong admiration. A mentee who struggles to succeed may earn our compassion; one who fails, our empathy; and one who succeeds, our greatest pride. While most researchers will work closely and develop strong relationships with their PIs, it’s important to recognize those interstitial mentors: those who may have helped us to fix a thorny bit of code, to procure a prized analyte, or to assuage the whims of an ornery administrator.

In this project I sought to identify these networks of proximal mentors, providing a means of charting the flow of knowledge and wisdom through an academic organization. I also sought to make them beautiful, visualizing these organically grown relationships via a natural metaphor. Initially I had planned to craft these trees from individual users’ networks, serving as a sort of mirror into their professional lives and the academic pedigrees from which they hailed. I was hoping to see a tree which fanned out laterally to a wide breadth of unrelated talents and applications, yet drilled down historically to a surprisingly singular expertise. Still, as more and more progress was made in crafting the trees, viewing the development of a cohesive organization, both in its individual members and in its missions, became the more interesting challenge.

The presented visualization draws from a coauthorship network of the laboratory where I work, the Network Dynamics and Simulation Science Laboratory (NDSSL) of the Biocomplexity Institute of Virginia Tech. I had initially considered using the publication history of the members of the 2017 Preparing the Future Professoriate class to construct this network but decided upon the NDSSL, as its member and publication list could be obtained without utilizing privileged information. The code for this project may be used under the terms of the GNU General Public License and is generalizable to any coauthorship network.

MentNet Capabilities

  • Identify proximal mentors within a coauthorship network
  • Identify and collect mentor-mentee publications from PubMed
  • Automatically categorize primary publication topics within a research organization

 

Development

The first challenge in creating MentNet was identifying a pleasing aesthetic. Just as elementary school classrooms are littered with cheery, inspirational posters to motivate young students, I wanted to create something that would show the complexity and structure of an organization but could also be appreciated on a purely aesthetic level. I wanted to construct the visualization in such a way that the viewer would feel no pressure to analyze it, that it could merely be taken as an intricate metaphor absent obligations to function or practical utility. As trees presented an obvious structural metaphor, random dummy trees were generated and rendered via Python matplotlib Axes3D objects to test this. Trees were defined by the mean relative length and angle dispersion of off-shooting branches with respect to their origin branches.
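To make the branch parameters concrete, here is a minimal sketch of how such a dummy tree might be grown. It is not the notebook's code: the function names, the fixed number of children per branch, the fixed (rather than mean) length ratio, and the Gaussian jitter used to approximate angle dispersion are all assumptions made for illustration.

```python
import math
import random


def perturb(direction, angle_sd, rng):
    """Return a unit vector near `direction`, jittered by Gaussian noise.

    Assumption: component-wise noise followed by renormalization stands in
    for the "angle dispersion" parameter described in the post.
    """
    noisy = [d + rng.gauss(0.0, angle_sd) for d in direction]
    norm = math.sqrt(sum(c * c for c in noisy))
    return tuple(c / norm for c in noisy)


def grow_tree(depth, n_children=2, rel_length=0.7, angle_sd=0.5, seed=0):
    """Return a list of (start, end) 3D segments for a random tree."""
    rng = random.Random(seed)
    segments = []

    def grow(start, direction, length, level):
        end = tuple(s + length * d for s, d in zip(start, direction))
        segments.append((start, end))
        if level == depth:
            return
        for _ in range(n_children):
            # Each child is shorter than its parent and points in a
            # randomly perturbed version of the parent's direction.
            child_dir = perturb(direction, angle_sd, rng)
            grow(end, child_dir, length * rel_length, level + 1)

    # The trunk starts at the origin and points straight up.
    grow((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), 1.0, 0)
    return segments


dummy_tree = grow_tree(depth=3)
```

Each segment could then be drawn on a 3D axis by passing its x, y, and z coordinates to the axis's plot method. Growing several trees with different seeds gives different shapes from the same two parameters.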

A sample dummy tree may be found here:

A coauthorship network queried from the NDSSL content management system (CMS) by Dr. Stephen Eubank was used to create the base network [1]. This may present some challenges to the generalizability of the resulting code, though it would be trivial to create novel coauthorship networks given access to a collection of papers from a given laboratory or discipline. Per the advice of my advisor, Dr. Bryan Lewis, future efforts may explore just that, seeking to establish the flow of knowledge across the fields of public health and epidemiological modeling.

Here is that underlying coauthorship network:

Once the coauthorship network was obtained, the next challenge was identifying proximal mentors. The network was loaded into NetworkX and each node was labelled with its betweenness centrality. Betweenness centrality measures how often the shortest path between any two nodes in a network passes through a given node [2]. As expected, the individuals with the highest centrality were generally principal investigators and funding partners. A quick algorithm was written that identified each individual’s proximal mentor as the neighbor with the lowest centrality that was still greater than their own.
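The selection rule can be sketched in a few lines. The project computed centrality with NetworkX; to keep this example free of dependencies, the sketch below swaps in a plain-Python version of Brandes' algorithm for unweighted, undirected graphs. The toy lab, the function names, and the alphabetical tie-break are assumptions for illustration only.

```python
from collections import deque


def betweenness(adj):
    """Unnormalized betweenness centrality (Brandes' algorithm).

    `adj` maps each node to the set of its neighbors in an undirected,
    unweighted graph.
    """
    score = {v: 0.0 for v in adj}
    for source in adj:
        order = []                      # nodes in breadth-first order
        preds = {v: [] for v in adj}    # predecessors on shortest paths
        n_paths = {v: 0 for v in adj}   # shortest-path counts from source
        dist = {v: -1 for v in adj}
        n_paths[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    n_paths[w] += n_paths[v]
                    preds[w].append(v)
        dependency = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    # Every unordered pair was counted from both endpoints.
    return {v: s / 2 for v, s in score.items()}


def proximal_mentors(adj, centrality):
    """Map each person to the coauthor with the lowest centrality that is
    still strictly greater than their own (None if there is no such coauthor)."""
    mentors = {}
    for person, coauthors in adj.items():
        senior = [c for c in coauthors if centrality[c] > centrality[person]]
        # Ties are broken alphabetically (an assumption, for determinism).
        mentors[person] = min(senior, key=lambda c: (centrality[c], c)) if senior else None
    return mentors


# Hypothetical toy lab, not the NDSSL network.
coauthorships = [("pi", "postdoc"), ("pi", "student"), ("postdoc", "student"),
                 ("pi", "collab1"), ("pi", "collab2"), ("postdoc", "intern")]
adj = {}
for a, b in coauthorships:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

centrality = betweenness(adj)
mentors = proximal_mentors(adj, centrality)
```

In this toy lab the student has coauthored with both the PI and the postdoc, and the rule assigns the postdoc as the proximal mentor because the postdoc is the less central of the two. NetworkX normalizes betweenness by default, but that rescaling does not change the ordering of the nodes and so should not change the mentors selected.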

Incorporating this metric, we create the following: (image)

From here, a true(r) mentorship tree was created using a slight modification of the dummy tree creation code. While the basic network represented within the tree remains constant, the structures are randomly generated to suit the aesthetic. As such, several test trees were grown until a selection with pleasing aesthetics was found and saved.

The next challenge came in adding leaves with maximal symbolic content and information density. A PubMed query was run for each researcher in the network to collect the abstracts of the 20 most recent papers they published with their mentor [3]. From these, a superset of abstracts was collected and categorized via latent Dirichlet allocation (LDA) [4,5]. LDA is a generative statistical model that explains a set of documents through unobserved groups (topics), each characterized by the words that tend to occur together. In this case, LDA was applied to a vector of the most frequently occurring, unusual words within each abstract. From there, the algorithm identified the ten most important categories as follows:

{'0': u'influenza vaccine health social pandemic measures distancing strain transmission research',
 '1': u'pylori gastric responses infection cell host response immune regulatory cells',
 '2': u'ppar cells mice expression inflammatory aba cell immune induced disease',
 '3': u'patients high activity study risk pei hospitals mortality frailty clinical',
 '4': u'cells cd4 th17 itreg differentiation ppar\u03b3 networks modeling study regulatory',
 '5': u'network model networks contact based population epidemic models results knockout',
 '6': u'text formula using parameters large model based modeling time used',
 '7': u'dose 10 disease effects brain data group rates studies results',
 '8': u'stock flow failure health energy human subjects weight care training',
 '9': u'disease health data public human africa ebola outbreaks outbreak provide'}

 

Each paper (leaf) written by a mentor-mentee pair was color coded based upon its transformed categorization via the LDA algorithm. Likewise, the transparency of each leaf was set by its year of publication, with more recent papers rendered as more opaque. It’s important to note that, while 866 abstracts were pulled via this method, many more leaves may appear on the tree, as a single paper may be matched to multiple collaborating mentees of a given mentor.
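One way such a leaf styling could be expressed is sketched below. The palette, the linear opacity ramp, the minimum opacity, and the use of the single highest-weighted topic are all assumptions; the post only states that color follows the LDA categorization and that newer papers are more opaque.

```python
# Hypothetical palette: one matplotlib named color per LDA topic.
PALETTE = ["tab:blue", "tab:orange", "tab:green", "tab:red", "tab:purple",
           "tab:brown", "tab:pink", "tab:gray", "tab:olive", "tab:cyan"]


def leaf_alpha(year, first_year, last_year, min_alpha=0.2):
    """Scale opacity linearly from min_alpha (oldest) to 1.0 (newest)."""
    span = max(last_year - first_year, 1)
    fraction = min(max((year - first_year) / span, 0.0), 1.0)
    return min_alpha + (1.0 - min_alpha) * fraction


def leaf_style(topic_weights, year, first_year=1980, last_year=2017):
    """Return (color, alpha) for a paper given its topic weights and year."""
    # Assumption: color the leaf by its single highest-weighted topic.
    topic = max(range(len(topic_weights)), key=topic_weights.__getitem__)
    return PALETTE[topic % len(PALETTE)], leaf_alpha(year, first_year, last_year)


color, alpha = leaf_style([0.1, 0.7, 0.2], year=2017)
```

A paper dominated by the second topic and published in 2017 comes out fully opaque in the second palette color, while a paper from the start of the time range would be drawn at the minimum opacity.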

From there a script was written to render the scene via a change-of-season motif. Starting in 1980, before the first publication, time moves forward to 2017 with bright flowers gradually appearing on the tree during their year of publication. From there, seasons progress from spring to summer as the pink flowers are replaced with green leaves. Finally, as fall hits, the leaves slowly fall from the tree, fading away on the ground and revealing the identities of each researcher where once they had been. One of the biggest challenges of this entire process was that matplotlib Axes3D does not account for true spatial overlap, only rendering objects in the order in which they are plotted. As a workaround, the zorder of each object is set to the inverse of the distance from its midpoint to the observer. There is some distortion in the rendering of overlap as the axes are not rendered to scale, though the effect is mostly intact.
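The depth-ordering workaround amounts to a one-line calculation per object. Here is a minimal sketch of it. The function name and the observer coordinates are hypothetical, and working out the observer's position from the axis's viewing angles is left out.

```python
import math


def depth_zorder(start, end, observer):
    """Inverse distance from a segment's midpoint to the observer.

    Nearer objects get larger values, so they are drawn on top.
    Assumes the midpoint never coincides with the observer.
    """
    midpoint = tuple((a + b) / 2 for a, b in zip(start, end))
    return 1.0 / math.dist(midpoint, observer)


# Hypothetical observer position in data coordinates.
observer = (10.0, 10.0, 5.0)
near = depth_zorder((4.0, 4.0, 2.0), (6.0, 6.0, 2.0), observer)
far = depth_zorder((-6.0, -6.0, 0.0), (-4.0, -4.0, 0.0), observer)
```

The resulting value would then be supplied as the `zorder` keyword argument when each branch or leaf is plotted. Because the distance is measured in data coordinates, unequal axis scaling can reorder objects slightly, which matches the distortion noted above.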

Once all images were created, an mp4 animation was generated on the command line via ImageMagick’s convert command.

 

The Jupyter notebook source code for this project may be found here:

https://gist.github.com/jschlitt84/1ab25fd7b50b01acb82d87dd3a9a5941

Version 3 of the GNU General Public License may be found here:

https://www.gnu.org/licenses/gpl-3.0.en.html

The video may be found here:

 

References:

  1. Fwd: INformation to Ms Pierre [E-mail from S. Eubank]. (2017, March 27).
  2. Franceschet, M. (n.d.). Betweenness Centrality. Retrieved April 01, 2017, from https://www.sci.unich.it/~francesc/teaching/network/betweeness.html
  3. National Center for Biotechnology Information (NCBI)[Internet]. Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information; [1988] – [cited 2017 Apr 08]. Available from: https://www.ncbi.nlm.nih.gov/
  4. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993-1022.
  5. Grisel, O., & You, C. (n.d.). Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation. Retrieved April 07, 2017, from http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-topics-extraction-with-nmf-lda-py