Harvard Converts Millions of Legal Documents into Open Data

The Caselaw Access Project, from the Library Innovation Lab at Harvard Law School, went live Oct. 29 and aggregates millions of state and federal cases on a free website.

by Theo Douglas / November 2, 2018

A new free website spearheaded by the Library Innovation Lab at the Harvard Law School makes available nearly 6.5 million state and federal cases dating from the 1600s to earlier this year, in an initiative that could alter and inform the future availability of similar areas of public-sector big data.

Led by the Lab, which was founded in 2010 as an arena for experimentation and exploration into expanding the role of libraries in the online era, the Caselaw Access Project went live Oct. 29 after five years of discussions, planning and digitization of roughly 100,000 pages per day over two years.

The effort was inspired by the Google Books Project; the Free Law Project, a California 501(c)(3) that provides free, public online access to primary legal sources, including so-called “slip opinions,” or early but nearly final versions of legal opinions; and the Legal Information Institute, a nonprofit service of Cornell University that provides free online access to key legal materials.

The conversion, done in-house at the Harvard Law School Library to preserve the chain of custody of the millions of cases it had collected, used a hydraulic cutter to trim the binding from thousands of volumes and a machine similar to those employed in the meatpacking industry to vacuum-seal them after scanning. Scanning costs were in the millions of dollars. Scanned, resealed volumes were shipped out of state for long-term storage underground at a former limestone mine in Louisville, Ky. Pages were subsequently uploaded to an optical character recognition (OCR) vendor for extraction into text files.

The project, which was funded by venture capital-backed startup Ravel Law and the Harvard Law School, doesn’t aggregate every court battle. Its legal trove primarily focuses on supreme court and appellate decisions, but is limited, the Lab’s director said, by the extent to which bygone officials “cared enough at the time” to compile decisions. Director Adam Ziegler said the project has a high concentration of federal trial opinions and lots of trial opinions from the state of New York, an early legal center, but fewer from some other states.

In standing up the project website, Ziegler said the Lab hopes to provide “anyone and everyone” with easy access to the law via court opinions, but noted that concept will have different meanings to different groups and “definitely means things we don’t even envision ourselves.”

“Every field is trying to learn things from big data these days and this data set has a lot to say about our history, our politics and our policy over time and our language over time. All that kind of stuff is going to be affected or supported by the availability of this data,” Ziegler said, pointing out it may one day power not just legal but language and historical analysis around nomenclature and change.

He characterized the Caselaw Access Project as something of a public interest exercise by Harvard, one that does much of the work needed to move this area of the historical record online and may spur courts to move quickly in publishing their prospective, or future, decisions online for free. Information services such as LexisNexis and Ravel, which Lexis owns, could use that free data to create services that improve how residents access the law. Commercial and noncommercial services and academic research could stand near or atop that, Ziegler said.

Underlying these layers and eventually enabling these multiple services is the project’s application programming interface (API), with endpoints empowering users to get information on state and federal jurisdictions, courts and case volumes. The site also has bulk data downloads available and some search capabilities, though Ziegler said he expects lots of people to build tools to interface with the API.
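For readers curious what building on such an API might look like, here is a minimal Python sketch of a client for the three resource types named above: jurisdictions, courts and volumes. The base URL, the path names and the query parameters in it are assumptions made for illustration, not details from the article or from Ziegler, and the project's own documentation is the authority on what is actually available.

```python
import json
from urllib.parse import urlencode, urljoin
from urllib.request import urlopen

# Assumption: base URL and version prefix are illustrative only.
BASE_URL = "https://api.case.law/v1/"

# The three resource types the article names; the exact paths are assumed.
ENDPOINTS = ("jurisdictions", "courts", "volumes")


def build_url(endpoint, **params):
    """Return the request URL for one endpoint plus optional query parameters."""
    if endpoint not in ENDPOINTS:
        raise ValueError(f"unknown endpoint: {endpoint!r}")
    url = urljoin(BASE_URL, endpoint + "/")
    if params:
        # Sort the parameters so the same request always yields the same URL.
        url += "?" + urlencode(sorted(params.items()))
    return url


def fetch(endpoint, **params):
    """Fetch one page of results and decode it, assuming a JSON response."""
    with urlopen(build_url(endpoint, **params), timeout=30) as response:
        return json.load(response)
```

Calling build_url("courts") would produce a URL ending in "/courts/", while fetch("courts") would send the request itself. Any filter passed as a keyword argument, such as a page size, is hypothetical until checked against the published documentation.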

“We’ve taken a lot of time to document and describe the API on our website. We’ve done it in a way that hopefully is accessible both to experts and beginners. We hope it’s fairly self-explanatory, though it’s definitely still kind of intimidating and mysterious to a lot of people, which we understand,” he said.

Beyond merely expanding access to the law, the Caselaw Access Project is on the leading edge of a fundamental change in how legal data is made available. Many courts currently charge for access to trial cases, but Ziegler said the business of legal data is already shifting from preserving the exclusivity or scarcity of raw data to creating services, analytics and insight around it.

“That’s really what should matter,” said Ziegler. “Building businesses around artificial scarcity of public information should not be much of a viable business in this day and age with the Internet. But building really amazing search capabilities, building really amazing analytical insights, building really amazing applications using that data is where all the action is in the future and should be.”

Theo Douglas Staff Writer

Theo Douglas is a staff writer for Government Technology. His reporting experience includes covering municipal, county and state governments, business and breaking news. He has a Bachelor's degree in Newspaper Journalism and a Master's in History, both from California State University, Long Beach.