GitHub for Academic Research

There was an article on Slate this morning that made the argument:

We need a GitHub for academic research.

It is an interesting idea. GitHub is a repository site for software projects and their source code (see Wikipedia: GitHub). At this point, we are going on a solid three decades of the internet. Academic listservs are more or less gone (granted, Linguistlist is still going strong, albeit in a less e-mail-oriented form). The golden age of academic blogging is mostly over. Many of us are still writing, of course. But our readers do not really socialize and interact here anymore. They’ve moved elsewhere. When I publish a post, discussion rarely appears in the comments section these days. That happens on Facebook or on Twitter now, which is fine. Some of the best academic discussion today happens on Academia.edu’s sessions feature, where papers can be discussed in real time, usually in draft form and by invitation only; you can

But none of these are what GitHub is. GitHub is a repository for code, not academic argument or discussion. GitHub is for data, not for prose.

The thrust of the argument is this:

The academic paper has some inherent limitations—chief among them that it can provide only a summary of a given research project. Even an outstanding paper cannot provide direct access to all of the research data collected or to the record of discussions among scientists that is reflected in lab notes. These windows into the messy and halting process of science, which can be extremely valuable learning objects, are not yet part of the official record of a research study.

But it doesn’t have to be this way. If we take advantage of the unique capabilities of the web to tell the full story of a research project, rather than merely using it as a faster printing press as we do today, we can build greater transparency into our approach to reporting science. Besides improving information-sharing among scientists, a push toward transparency could improve public trust in science and scientists. Now, when the very concepts of fact and truth are under assault and many scientists feel compelled to march in response, is the perfect time to rethink our approach to scientific communication altogether.

A striking proposal indeed.

Now, the author here, Marcus Banks, is talking about science specifically. Most readers here likely view themselves as being more within the humanities. But linguistics, even (or perhaps especially?) linguistics for an ancient language like Greek, is a data-driven discipline. Our theses and dissertations tend to be one of two types. Either they are a summary of research, with an argument for a view, that provides a snapshot of the data; Con Campbell’s (2007) Verbal Aspect in the Indicative Mood and Narrative is an example of this. Or they simply are the data in its entirety with commentary; Douglas Huffman’s (2014) Verbal Aspect Theory and the Prohibitions in the Greek New Testament is an example of this second type.

I chose these two examples in particular for a reason. Both represent some form of the tenseless view of Greek that I find highly unconvincing. So which is more useful for me, as a researcher? Which one would I be more likely to recommend to others, despite my disagreement? Is it the one that merely provides a snapshot of the data or the one that provides a comprehensive database of the author's analysis (albeit in print form)? Quite obviously, it is the latter.

Huffman’s monograph is of far greater value to me.[1] I can disagree with him on any number of points: his view of the status of tense in Greek, his interpretation of individual instances of prohibitions, or his categories for analyzing prohibitions. But despite that, I can always come back to the volume to see what his opinion is on whatever prohibition I’m looking at. You cannot do that with the other approach, the summary approach. Books that simply make an argument based on a summary/snapshot of the data tend to get read once. I read a book of this type and I either agree with it or disagree with it. If I disagree with the argument or conclusion, then the book has little use to me afterward. We are doing language work. The data and its analysis are at least as important as the argument.

But Huffman’s data is still merely a print source. It is not searchable, and it can’t be manipulated or visualized in different ways. It exists merely as a list. This has historically been the challenge for biblical studies. The print concordance is the original database for our work.
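To make concrete what "searchable" could mean here, the following is a minimal Python sketch of annotated tokens stored as structured records rather than as a printed list. Everything in it is an assumption made for illustration: the field names (ref, form, category), the placeholder values, and the three helper functions are hypothetical and are not drawn from Huffman's volume or from any existing dataset.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Annotation:
    """One annotated token. All field names are hypothetical."""
    ref: str                 # placeholder reference, not a real citation
    form: str                # placeholder surface form
    category: Optional[str]  # analyst's label; None if not yet written down


def search(records, category):
    """Return every record carrying the given category label."""
    return [r for r in records if r.category == category]


def tally(records):
    """Count records per category, ignoring unannotated tokens."""
    return Counter(r.category for r in records if r.category is not None)


def fraction_annotated(records):
    """Share of records whose analysis has actually been recorded."""
    if not records:
        return 0.0
    done = sum(1 for r in records if r.category is not None)
    return done / len(records)


# Invented placeholder rows, used only to exercise the functions above.
records = [
    Annotation("ref-1", "form-a", "type-x"),
    Annotation("ref-2", "form-b", "type-y"),
    Annotation("ref-3", "form-c", None),
    Annotation("ref-4", "form-d", "type-x"),
]

print([r.ref for r in search(records, "type-x")])
print(tally(records))
print(fraction_annotated(records))
```

Even something this small allows filtering, counting, and tracking how much of the analysis has been written down, none of which a printed list can do. A real project would need a carefully designed annotation scheme and a stable file format, which this sketch does not attempt.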

We need to be digitizing our research, especially if it’s already published somewhere. Some of us already are. A few of my research projects are published with Logos Bible Software, such as my semantic role/argument structure analysis of New Testament verbs. But I need to be better at this, too, especially for my non-commissioned/contracted (i.e., personal) projects. Creating consistently annotated data is time-consuming. Often it is easier in the moment to do the analysis token by token in my head without actually writing it down. When you are looking at 10,000 instances of something, the extra 20 seconds it takes to type the analysis adds a lot of time to projects that probably already feel like they are moving too slowly.

The annotated database of Greek perfects for my thesis is, sadly, probably only two-thirds filled in, even though I checked everything. And the thought of going back now feels worse.

I should probably be putting my personal projects up on GitHub, though (until there’s an academic alternative). Even in partially completed form, documentation is just as important as the final project. Long term, it is probably more important. If I want to take my grammar project seriously, it needs to be more than just prose. It needs to be data, too. And that data needs to be accessible. Otherwise, it’s useless.


[1] I should emphasize at this point that I still value Campbell’s work. Even with my disagreements, he has made some excellent contributions. It is simply that on the practical level of usefulness, having the complete and fully annotated data creates value in a way that summary and prose do not. In fact, the reverse also holds: data without a prose summary would be nearly as useless.

Works cited:

Campbell, Constantine. 2007. Verbal Aspect in the Indicative Mood and Narrative. New York: Peter Lang.

Huffman, Douglas. 2014. Verbal Aspect Theory and the Prohibitions in the Greek New Testament. New York: Peter Lang.