In this post, we describe the tools and techniques we’re using throughout the project and why. We also show how you can get started with them yourself.
Following our introductory post about AI and lifelong learning, we wanted to focus on the technology stack we’re using to conduct our research. We thought this topic made sense to cover now for a couple of reasons.
For example, we chose early on to incorporate machine learning techniques into our research: partly to address a research challenge we encountered, but also to gain firsthand experience using AI to augment human intelligence. If we’re going to assess claims about how AI can be used for learning, after all, it seems sensible for us to gain experience applying AI in this space ourselves!
As a reminder, we’re interested in mapping not only the breadth of discourse about AI and lifelong learning, but also, and especially, the underexplored relationships between the two subjects. This makes for a lot of material to potentially review, much of it from related but distinct communities.
As a result, the object of our analysis came into focus quickly: documents such as journal articles, press releases, social media posts, and unstructured web content. Tools and methods for analyzing those documents took more time and experimentation to develop. As of today, we identify three phases of analytical foci and corresponding tools, which we walk through below.
We’ll share the outputs of these analyses themselves in future posts. For the rest of this post, we focus on which tools we chose in each phase, how we used them, and what lessons we learned.
To help organize each phase, I’ve found it helpful to distinguish between two sets of data management tasks: collecting and managing the data itself, and visualizing and analyzing it.
For each phase, I first present tools by sub-task, then discuss how all the tools fit together within the phase. Every tool we used is free and/or open-source software, and (with a little patience) all of them run on Windows, Mac, and Linux. I encourage you to check them out if you haven’t already!
We started with academic publications, which benefit from readily available data as well as interpretive standards established through scientometrics, the quantitative study of research.
We first targeted bulk collection of bibliographic metadata, such as article and journal titles, authors, and keywords, using Harzing’s Publish or Perish. The bibliographic data manager JabRef allowed us to merge data files from different sources into a single dataset using the BibTeX standard. We also used JabRef to de-duplicate entries as far as possible, which was a significant challenge given the sometimes thousands of overlapping entries from multiple databases. Finally, VOSviewer allowed us to visualize patterns of term usage in article abstracts.
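For readers unfamiliar with BibTeX, each record in the merged dataset is a plain-text entry of named fields like those we collected. The entry below is invented purely to illustrate the format; real entries vary in which fields each database provides:

```bibtex
% A hypothetical merged entry, for illustration only.
@article{doe2020example,
  author   = {Doe, Jane and Roe, Richard},
  title    = {Artificial Intelligence and Lifelong Learning: An Invented Example},
  journal  = {Journal of Examples},
  year     = {2020},
  keywords = {artificial intelligence, lifelong learning},
  abstract = {An invented abstract, included here only to show the fields we collect.}
}
```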
Though these “out of the box” tools were quick and easy to get working, the approach presented three main drawbacks.
In short, while Phase 1 proved out our basic approach for triangulating discourse, the actual process of downloading from multiple data sources, compiling into a monolithic file in JabRef, then exporting to VOSviewer was too error prone and labor intensive to be sustainable, given our goals. Those goals included questions such as: if distinct topic “networks” are emerging, which disciplines do the source articles and journals belong to? Do journal disciplines correspond to topic networks? And so on.
For Phase 2, we attempted to streamline these steps, moving from a process designed around a monolithic dataset and a single analysis step towards a more iterative search process involving permutations of the dataset itself and of the analytical techniques applied to it.
We turned to topic modeling and text classification techniques, paired with specialized visualizations and a bespoke “report generation” approach, to develop a more qualitative and expressive analysis of the topic space.
While we did reuse the BibTeX data standard from Phase 1, we shifted to using the Python 3 Anaconda Distribution to take advantage of multiple community packages dedicated to various sub-tasks in the phase. The tools below correspond to these packages.
Phase 2 data visualization and analysis tools
In our shift from article metrics to article abstracts, we wanted to ensure we could rapidly explore the parameter space of multiple dimensions in combination with one another.
To achieve this, we designed a pipeline of processing steps in Python that allowed us to tweak parameters at multiple points in the pipeline and quickly assess impact using interactive and static outputs. We did this by first bringing in bibliographic data using bibtexparser, then converting it to a Pandas dataframe for further processing. For example, Pandas allowed us to apply regular expression matching to filter and subset data.
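As a minimal sketch of this loading and filtering step, assuming bibtexparser 1.x and a merged BibTeX file exported from JabRef (the file name and the regular expression below are illustrative, not the ones used in the project):

```python
import bibtexparser
import pandas as pd

# Load the merged BibTeX file produced in the data collection step.
with open("merged_library.bib", encoding="utf-8") as f:
    bib_database = bibtexparser.load(f)

# Each BibTeX entry becomes a dict of fields (title, abstract, year, ...).
df = pd.DataFrame(bib_database.entries)

# Keep only entries whose abstract matches a regular expression of interest.
pattern = r"lifelong learning|adult learning"
subset = df[df["abstract"].str.contains(pattern, case=False, na=False)]

print(f"{len(subset)} of {len(df)} entries matched the filter")
```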
After this filtering step, a “create topic report” function applies a series of transformations to the data before outputting an interactive topic model visualization using pyLDAvis, as well as a detailed Word document report (created with python-docx) summarizing each topic.
To arrive at these report outputs, a number of NLP tasks are chained together. NLTK removes stopwords and lemmatizes article abstracts. Next, the data is copied across two parallel processing flows using scikit-learn. In each flow, the text is converted into a matrix representation suitable for quantitative and statistical analysis by topic modeling algorithms: raw term frequency counts to support latent Dirichlet allocation (LDA), and term frequency-inverse document frequency (tf-idf) weights to support non-negative matrix factorization (NMF).
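A minimal sketch of this preprocessing and vectorization chain is shown below, continuing from the filtered DataFrame (subset) in the previous sketch; parameter values such as max_df and the topic count are illustrative only:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF

# One-time downloads of the NLTK resources used below.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    """Lowercase, tokenize, drop stopwords, and lemmatize an abstract."""
    tokens = nltk.word_tokenize(text.lower())
    return " ".join(lemmatizer.lemmatize(t) for t in tokens
                    if t.isalpha() and t not in stop_words)

# 'subset' is the filtered DataFrame from the previous sketch.
abstracts = [preprocess(a) for a in subset["abstract"]]

# Flow 1: term frequency counts feeding LDA.
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2)
tf_matrix = tf_vectorizer.fit_transform(abstracts)
lda = LatentDirichletAllocation(n_components=10, random_state=0).fit(tf_matrix)

# Flow 2: tf-idf weights feeding NMF.
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2)
tfidf_matrix = tfidf_vectorizer.fit_transform(abstracts)
nmf = NMF(n_components=10, random_state=0).fit(tfidf_matrix)
```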
For both the LDA and NMF data flows, each document receives a score describing how well it fits within each of a given number of topic groups. Rather than settle on a single, “correct” number of topics, we wanted to explore the effect of varying the topic number to see what patterns emerged. Each time the pipeline is run with a given data input, it therefore iterates across multiple values for the topic number, creating a distinct report file for each count, e.g. reports for 3, 5, 7, 10, 20, and 40 topics. By reviewing reports across this parameter space of topic numbers, we could isolate topic groups that were especially distinctive, persistent across topic numbers, or else irrelevant to our purposes. This last category enabled us to prune irrelevant articles from our dataset and then re-run the analysis.
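A condensed sketch of this report loop appears below, for the LDA flow only and continuing from the tf_vectorizer and tf_matrix objects in the previous sketch. File names and the list of topic counts are illustrative, and note that pyLDAvis’s scikit-learn adapter is named pyLDAvis.sklearn in releases before 3.4:

```python
import pyLDAvis
import pyLDAvis.lda_model  # called pyLDAvis.sklearn in pyLDAvis < 3.4
from docx import Document
from sklearn.decomposition import LatentDirichletAllocation

# Vocabulary terms (use get_feature_names() in scikit-learn < 1.0).
feature_names = tf_vectorizer.get_feature_names_out()

for n_topics in [3, 5, 7, 10, 20, 40]:
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    # Document-topic scores: one row per article, one column per topic.
    doc_topic = lda.fit_transform(tf_matrix)

    # Interactive visualization for browsing the topic space.
    panel = pyLDAvis.lda_model.prepare(lda, tf_matrix, tf_vectorizer)
    pyLDAvis.save_html(panel, f"lda_{n_topics}_topics.html")

    # Word report listing the top terms for each topic.
    report = Document()
    report.add_heading(f"LDA topic report ({n_topics} topics)", level=1)
    for topic_idx, topic in enumerate(lda.components_):
        top_terms = [feature_names[i] for i in topic.argsort()[:-11:-1]]
        report.add_heading(f"Topic {topic_idx}", level=2)
        report.add_paragraph(", ".join(top_terms))
    report.save(f"lda_{n_topics}_topics.docx")
```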
This qualitative, exploratory approach to data analysis would not have been possible with out-of-the-box tools. At the same time, while this approach helped us grasp the breadth of topical foci within the space, it was less clear how to understand the social situatedness of topics, articles, or journals. For this, we turned to a more ambitious software pipeline in our third phase.
In Phase 3, we wanted to take advantage of recent advances in an NLP task known as “entity recognition” to build out a network analysis of activities occurring in social media related to AI and lifelong learning. That is, by collecting news articles, blogs and microblogs, and possibly the academic articles we had collected in Phases 1-2, we wanted to develop a semi-automated way of identifying what was being discussed in the various articles, as well as which social actors we could therefore infer were collaborating in some way. We envisioned a graph database system to serve as a knowledge base for tracking these insights, with a data collection and analysis system on top of this database to help populate it.
This phase is a work in progress, so this list is subject to change!
For Phase 3, the vision is to use Python tools like feedparser and Newspaper3k to collect data from the open web. The results will be stored in a Neo4j database in a way that preserves provenance, with data sources and documents represented as connected graph nodes. Using spaCy’s named entity recognition (NER) functionality, we can then identify entities such as companies, products, and locations in article texts. These can also be represented as distinct “entity” graph nodes connected to document nodes. By analyzing within-document as well as across-document mentions of specific products, companies, and so on, we can identify a “collaboration network” within the space. Pruning, enhancing, or creating entries in the database will be handled by giving the wider research team access to Graphileon Interactor, which provides ad-hoc querying as well as data creation, deletion, and editing capabilities. Finally, Cytoscape provides more advanced capabilities for visualizing and analyzing the graph itself.
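As a very rough sketch of how these pieces might fit together (the feed URL, Neo4j credentials, entity labels, and graph schema below are placeholders, not the project’s actual configuration):

```python
import feedparser
import spacy
from newspaper import Article
from neo4j import GraphDatabase

FEED_URL = "https://example.com/ai-education-news.rss"  # placeholder feed

nlp = spacy.load("en_core_web_sm")
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

feed = feedparser.parse(FEED_URL)

with driver.session() as session:
    for entry in feed.entries:
        # Download and parse the full article text.
        article = Article(entry.link)
        article.download()
        article.parse()

        # Store the document and its provenance (the feed it came from).
        session.run(
            "MERGE (s:Source {url: $source}) "
            "MERGE (d:Document {url: $url}) SET d.title = $title "
            "MERGE (s)-[:PUBLISHED]->(d)",
            source=FEED_URL, url=entry.link, title=article.title,
        )

        # Extract named entities (organizations, products, places) and link them.
        for ent in nlp(article.text).ents:
            if ent.label_ in {"ORG", "PRODUCT", "GPE"}:
                session.run(
                    "MERGE (e:Entity {name: $name, label: $label}) "
                    "MERGE (d:Document {url: $url})-[:MENTIONS]->(e)",
                    name=ent.text, label=ent.label_, url=entry.link,
                )

driver.close()
```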
We are actively working on this phase and hope to dedicate a future blog post to its progress.
Over the course of these three phases of research, it has struck me that although our choice of tools has always been led by research questions, so too have our questions been led by technical capabilities. This give and take of course also characterizes the application of AI for lifelong learning: certain tasks are becoming increasingly efficient and effective for computers to perform—but what are the “right” applications of AI techniques, on balance? (“Where is the knowledge we have lost in information?”)
For us, an important part of navigating this question has been to ensure we maintain a “human in the middle” approach to our use of machine learning and other computational techniques. By this, we mean more than just the “art” or pragmatic dimension of applying unsupervised learning techniques like topic clustering. Rather, we mean that qualitative checks like our Phase 2 topic reports were important to ensure, regardless of the technical performance of the processing steps themselves, that we had opportunities to leverage (and develop) our own intuition, creativity, and expertise in the space to make further decisions and draw conclusions. Though I risk anthropomorphizing AI by saying so, I’m inclined to characterize this as a “partnering with” relation to AI technologies, rather than a “hand off work to” relation.
As the project has evolved, we’ve also addressed the concerns raised in this post by pursuing two other branches of activity: a more conventional literature review and synthesis of policy related to AI and lifelong learning, and a plan for case studies applying a more ethnographic approach to studying the use of AI for lifelong learning in situ. We hope you’ll join us as we present these and more over the coming weeks and months.