A Deep Dive into Big Data
Brian Flaherty
I have a colleague who now and again laughs about the time “when we were wizards”: when we reference librarians would wave our wands (or eyes) over digests or (worse yet) Shepard’s – gibberish to mere muggles – and come up with citations to relevant legal authority crowned with the title “good law.” He points out that we are no longer wizards – that with Google Scholar and well-chosen search terms, people unfamiliar with the law are able to do adequate legal research (of course, I’m quick to point out that “adequate legal research” just doesn’t cut it these days).
I say this because just now, I think some librarians look at “big data” as mystical, and at the folks who can harvest and manipulate it as magicians. But demystifying it is, I think, essential for a clear understanding of what’s behind it, how powerful it is, and how it can be used. The “Deep Dive” on big data went a long way towards doing this. I am going to attempt a short summary, but what I write here will be inadequate. I urge everyone to get the PowerPoints and reading lists from the AALL site when they’re available, and become familiar with this. The overused phrase fits here: it is the future of law practice.
Briefly: Big Data are information sets that are too large or complex for traditional processing models. For example, a data set including every federal case would be “Big Data.” Robert Kingan from Bloomberg Law began the program with a great definition of what constitutes big data, and a discussion of just how difficult it is to collect it in a form that is useful for any kind of analysis. He said that some 80-90% of the work on any project is just collecting and cleaning the data and putting it into a format where it can be used. Think, for example, of getting the aforementioned set of all federal cases into a spreadsheet where one column is “Judge’s name,” and you begin to get a feel for how huge an endeavor this is. Daniel Lewis from Ravel talked a bit about how they go about manipulating the data once they’ve got it, with a short discussion of the uses of SQL and NoSQL (“Not Only SQL”) and the benefits of both. Irina Matveeva from NexLP gave a short discussion of language processing – what would seem to be the next step in data analysis – in which programs can do document analysis, email forensics, and other linguistic manipulation extremely quickly.
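To make the “Judge’s name” problem concrete, here is a minimal Python sketch of the kind of cleaning step Robert described. Everything in it is an assumption for illustration: the sample rows are invented, and `judge_key` is a hypothetical helper using a crude surname heuristic, not anything the presenters showed. The point is only that the same judge can appear in several spellings, and each has to be reduced to one value before a spreadsheet column is usable.

```python
import csv
import io
import re

# Invented sample rows: one judge written three different ways.
RAW_ROWS = [
    {"case": "Case A", "judge": "  SMITH, Circuit Judge. "},
    {"case": "Case B", "judge": "Hon. Jane Smith"},
    {"case": "Case C", "judge": "Smith, J."},
]

# Titles and role words to strip (an assumed, deliberately incomplete list).
NOISE = re.compile(
    r"\b(?:hon|honorable|chief|circuit|district|judge|justice|j)\b\.?",
    re.IGNORECASE,
)


def judge_key(raw: str) -> str:
    """Reduce a raw judge string to a lowercase surname (crude heuristic)."""
    # In "SURNAME, title" forms, the surname comes before the first comma.
    name_part = raw.split(",", 1)[0] if "," in raw else raw
    cleaned = NOISE.sub(" ", name_part)
    cleaned = re.sub(r"[^A-Za-z\s'-]", " ", cleaned)  # drop stray punctuation
    words = cleaned.split()
    # Otherwise assume the surname is the last remaining word.
    return words[-1].lower() if words else ""


# Write the cleaned column out as CSV, the "spreadsheet" of the example.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["case", "judge_raw", "judge_key"])
writer.writeheader()
for row in RAW_ROWS:
    writer.writerow(
        {
            "case": row["case"],
            "judge_raw": row["judge"].strip(),
            "judge_key": judge_key(row["judge"]),
        }
    )

print(buffer.getvalue())
```

Even this toy version breaks down quickly: two judges who share a surname collapse into one key, and panels of several judges are not handled at all. Multiply those exceptions across every federal case and the 80-90% figure starts to look plausible.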
Following this introduction, and a brief discussion of how Bloomberg Law (Robert), Ravel (Daniel) and NexLP (Irina) harness Big Data in the resources they provide, we were given the opportunity to explore what planning a “Big Data” project would be like. Folks got into groups at tables and devised a possible project, a list of some of the resources they would need, and a list of some of the necessary players (e.g., librarians, programmers, third-party vendors). Some of the ideas were fantastic: one table talked about creating a predictive tool that could be used to determine whether a law enforcement officer would be likely to be accused of a civil rights violation, and what kind of data sources would be necessary to create such a tool (personal history? demographic information?). At our table, one person was engaged in creating a resource that would predict the likelihood that a piece of legislation would pass, and so we talked about the data necessary to do that: the sponsor’s history, party affiliation, words in the titles of bills that have passed, and public sentiment (retrieved from news sources and social media).
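As a rough illustration of just one of those inputs, the words in bill titles, here is a small Python sketch that computes how often bills containing a given title word passed. The bill titles and outcomes are entirely made up, and `pass_rate_by_title_word` is a hypothetical helper of my own, not the tool discussed at our table. A real project would need far more data and would combine this signal with the others listed above.

```python
from collections import Counter

# Entirely made-up toy records: (bill title, whether it passed).
BILLS = [
    ("An Act to Fund Highway Repair", True),
    ("An Act to Rename a Post Office", True),
    ("An Act to Reform Highway Tolls", False),
    ("An Act to Fund School Lunches", True),
]

# Common title words that carry no signal (an assumed, minimal list).
STOPWORDS = {"an", "act", "to", "a"}


def pass_rate_by_title_word(bills):
    """Return, for each title word, the share of bills containing it that passed."""
    seen = Counter()
    passed = Counter()
    for title, did_pass in bills:
        # Count each word once per bill, ignoring case and stopwords.
        words = {w.lower() for w in title.split()} - STOPWORDS
        for word in words:
            seen[word] += 1
            if did_pass:
                passed[word] += 1
    return {word: passed[word] / seen[word] for word in seen}


rates = pass_rate_by_title_word(BILLS)
for word in sorted(rates):
    print(f"{word}: {rates[word]:.2f}")
```

With four invented bills the numbers mean nothing, of course. The sketch is only meant to show that “words in bill titles” becomes a countable feature once the titles and outcomes have been collected in one place.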
In all, Big Data is fascinating stuff, and incredibly useful for predicting everything from the outcome of a court case to the passage of legislation. Not only should we be paying attention, we should be “deep diving” into it, to understand what it can do for us and the legal community.