Welcome to the July 2015 “First Tuesday” feature—this month on a Wednesday due to the holidays! On the first Tuesday of every month we feature a one-on-one discussion with a person behind one of the previous month’s Best of the Business Web e-letter selections whom we felt would be particularly interesting to talk to.
This month we have chosen Petr Knoth, the founder of CORE (COnnecting REpositories), a meta repository and search engine containing over 24 million journal articles and other academic and scholarly documents held in approximately 700 data repositories around the world, housed at The Open University in the UK.
We chose CORE as one of our June 2015 Best of the Business Web selections, describing it as “a real gold mine of a research site.”
The following is an edited summary of our Skype discussion with Petr.
Q. What is your background?
A. I am originally from the Czech Republic, with a background in machine learning and text mining. I’m currently a Senior Data Scientist at Mendeley and a research fellow at The Open University.
When I joined The Open University about seven years ago, I was doing my PhD and needed access to research papers, but it was hard to get them because they were all behind a paywall. The open access movement was starting, yet I found that you could not access all these open access articles in one place. So I thought, let’s see how to bring all this together. The original idea behind CORE was not a search engine but simply a way to connect these repositories.
I began developing the idea in November 2010, and we obtained our first funding in February 2011. I built CORE myself for the first few months, though I have since stepped back somewhat from the day-to-day work, which is now handled by my colleague Lucas Anastasiou.
Q. What were some of the barriers and challenges you faced in developing the project?
A. There were both technical and legal barriers. The biggest issue has been, and continues to be, that there is no single harmonized protocol we can use to get the data from institutional repositories and publisher platforms, so we need to create specific systems for each one.
Q. Do you need to set up formal agreements with each platform, or do you just send out a spider?
A. It all depends on each entity’s own terms and conditions. In some cases we need a formal agreement, and we comply. We are not pirates here. And we don’t actually send out a spider; we harvest content, which is a different process: we access the metadata from these repositories and, from that metadata, try to identify links to a full-text PDF.
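To illustrate what metadata-based harvesting looks like in practice, here is a minimal sketch assuming the repository exposes Dublin Core records over the standard OAI-PMH protocol (a common repository interface; this is a hypothetical illustration, not CORE’s actual code). The harvester parses a `ListRecords` response and scans each record’s `dc:identifier` fields for links that appear to point at a full-text PDF.

```python
# Hypothetical sketch of metadata harvesting -- not CORE's actual code.
# An OAI-PMH ListRecords response carries Dublin Core metadata; the
# harvester scans each record's dc:identifier fields for PDF links.
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

# A tiny example response; in production this XML would come from an
# HTTP request such as:
#   http://repository.example.org/oai?verb=ListRecords&metadataPrefix=oai_dc
SAMPLE_RESPONSE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An Example Paper</dc:title>
          <dc:identifier>http://repository.example.org/123</dc:identifier>
          <dc:identifier>http://repository.example.org/123/paper.pdf</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

def pdf_links(oai_xml: str) -> list:
    """Return candidate full-text PDF links found in dc:identifier fields."""
    root = ET.fromstring(oai_xml)
    links = []
    for record in root.iter(OAI + "record"):
        for ident in record.iter(DC + "identifier"):
            if ident.text and ident.text.lower().endswith(".pdf"):
                links.append(ident.text)
    return links

print(pdf_links(SAMPLE_RESPONSE))
```

Note the contrast with web spidering: nothing here crawls pages or follows arbitrary links; the harvester consumes only the structured metadata the repository chooses to expose.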
Q. What would you say is unique about CORE vs. other open access repositories?
A. Well, one is our ability to provide full text when we can find it. Also, we are trying to help text miners who need access to raw data to extract knowledge and then do interesting things with that data. Previously, publishers or databases holding the data were the only ones who could exploit it, which I thought was unfair; so we serve the community, so that others can reuse the data set and do something cool with it.
We also serve the general public when they are searching the Web for scholarly works, and we serve governmental, funding, and other institutional bodies that need to see what is going on in scholarly research, to help make decisions on where research is heading and which areas are good candidates for future funding.
Q. Is CORE particularly strong in covering certain disciplines or particular countries?
A. We are discipline agnostic, so I’d say the representation of works in various disciplines closely aligns to what’s out there and available in general. In terms of countries and languages, after English the next most represented language is Spanish. And as a UK aggregator, we do try to prioritize works from the UK. After that we focus on the US, and then the rest of the world.
Q. What is next for CORE?
A. We have a lot of projects in the works. One we are putting a lot of attention into is a cloud computing infrastructure for text miners, so that rather than downloading the text or using an API, they can run text mining algorithms on the data directly in the cloud. We want to encourage people to get access to the data and do cool things with it.