A Text Mining Week at Texas A&M (2)

I am now back from my Texan adventures in humanities computing at Texas A&M, but I still wish to mention some of the later projects to which I was introduced during my stay.

One of the major differences between DH at Texas A&M and the UoA is that researchers at the former institution focus on an older corpus of texts that is both difficult to access and challenging to digitize on a large scale. While we work with tweets and other born-digital documents, they work with books from the 18th century. Even once scanned, these books remain difficult to transform into a machine-readable format because of problems such as the absence of typesetting standards and the noise that ink blots and bleed-through produce when the page is read by a machine. The EEBO and ECCO corpora are fraught with these problems.

With these problems in mind, the Initiative for Digital Humanities, Media & Culture has worked on making these texts more accessible to the broader academic community through the 18th Connect portal. This search engine is linked to several other online collections and repositories and allows users to search libraries and collections for specific texts published in the 18th century.

Feel like contributing? The 18th Connect portal also hosts TypeWright, an online tool that allows the public to correct the OCR results of certain digitized texts by retyping lines of text directly from the scanned document. Just create an account and start typing!


Last but not least, I wish to spread the word about the online class Programming for Humanists at TAMU, which has been offered since 2014. The program offers different registration options (with or without an official certificate) and covers a lot of important topics for DH students. This is a neat online program for students who are interested in the fundamentals of digital humanities but do not have access to a DH introduction class at their home institutions. Take a look if that is your case!

A Text Mining Week at Texas A&M (1)

Blogging live from the extremely sunny campus of Texas A&M, College Station, Texas. Quite a contrast from snowy Edmonton.

I have been fortunate enough to be invited to spend a week at Texas A&M (TAMU) to visit some of the scholars with whom I collaborate on the Novel TM project. Project co-investigator Doctor Laura Mandell and PhD student Nigel Lepianka were nice enough to show me around the campus (unable to drive, I find myself relying on Nigel most of the time).

So far, I have presented some of my work on text mining JRPG video game reviews and have been introduced to other text mining techniques using R (specifically, Nigel’s method for directed topic modelling). I was also introduced to some of the projects that the team here is working on as part of their Initiative for Digital Humanities, Media, and Culture.

The first one is syriaca.org, an extensive web portal that brings resources for the study of the Syriac language to the web. While some of its contents remain to be published, the Gazetteer showcases how the platform can serve as a geographical reference index.

I was also introduced to the BigDIVA viewer today. This is a promising interface that could revolutionize how universities display library search results. I am particularly interested in its potential to help rethink queries with space in mind, presenting results in a less hierarchical manner that would allow marginal files and documents to surface. This is radically different from the regular Google search algorithm, which relies more on the popularity of results among millions of users (a form of crowdsourcing) who may be looking for the same specific website. An interesting tool, and one that triggers reflections about what it means to read (and play) space.