Implementing a Research Knowledge Base

Michael Yuan

Issue #91, November 2001

Keeping up with large volumes of research requires a systemboth flexible and intuitive.

Research has always been an integral part of education, especially higher education. Research generates new knowledge and offers training in creative and independent thinking. The role of research in higher education is shown by the fact that almost all the famous universities in the world are also known as research universities.

Modern research is highly specialized and highly collaborative. Finding papers with more than 100 references is common these days. Moreover, the rise of the Internet has made it possible to publish results quickly, and the barrier to accessing scientific information has been lowered drastically. For instance, in the astrophysics field alone, its internet preprint service publishes more than 15 new research papers every day, all of which are freely accessible to the public. While this provides enormous power for researchers to conduct advanced research, it also creates a huge burden to keep up-to-date with the newest results.

These conditions give rise to two important questions in modern research: how do we organize knowledge to keep in step with recent developments and be able to retrieve the information when we need it? Second, how do we organize and keep record of research projects involving several researchers? In the research world, new knowledge is presented in the form of publications. I use the term “references” to refer to pieces of knowledge, including unpublished or privately communicated tips and results. In this article, I discuss research knowledge-base systems in the form of reference management systems.

Why We Need New Software

There have been commercial attempts to address the questions above. Applications like EndNote, Pro-Cite and PAPYRUS offer reference management capabilities and even some web capabilities. However, I found them hard to use and customize in my own research.

Research covers every field of human knowledge, and more importantly, research is intended to explore the unexpected. Every group might have different requirements for organizing and displaying their references. Most of the proprietary reference management systems have targeted specific research fields (usually medical research). Being closed-source software, it is impossible for the user to change and improve the software to adapt to specific needs.

The best solution to this problem is to design an open-source, web-based, multi-user knowledge-base system. It would run on an internet-connected server and be accessible from any standard web browser, from all platforms. Users could post/organize references and hold discussions in the comment section. All the knowledge and discussion interactions could be archived centrally from one secured server. This web-based approach would use web browsers as the user interface (UI) and would enable anyone to change the UI by simply changing the HTML-like source code. My answer to the need for such software is the OpenReference reference management system.

Goals and Required Features

One goal for writing this software was to categorize knowledge for easy future retrieval by multiple users. A simple keyword search is not enough, since searching cannot guarantee finding all the related references. It is much easier to browse through the category tree if you have a specific subject in mind. So, advanced categories and user management are two core features I decided to implement.

Big categories, each containing hundreds of references, are no better than no category at all. For the categorization to be useful, the leaf categories on the tree should contain less than one page or 20 references each. That demands a very fine-tuned category structure. Finer categories are needed in areas of more active research. It is impossible, however, to judge in advance how many levels of subcategories are needed in any field so as to make efficient categorization. The only way is to design a dynamic category structure that can be adjusted at runtime. If a lot of references show up in a particular category, the administrator can divide it into several subcategories, according to the nature of those references.

As I have stated, the leaf categories need to focus on narrow subjects to keep the number of references small. Today's research works have become more and more interdisciplinary, making it hard to categorize a reference into a narrow category. The solution to this problem is to allow a reference to associate with more than one category.

This system is designed not only as a personal reference organizer, but also as a group discussion server to exchange ideas in the comments section. A web-based collaboration system can keep track of information and make archives of idea exchanges possible.

Being a multi-user system, this software must establish some user access control. The administrator can set the policy to accept new users. Every user needs to log in with a legitimate username/password combination to post references and comments. Each user can edit/delete/recategorize his or her own postings. Only the administrator can touch the category structure.

Finally, in order to enable private conversation in the forum, I also allow a user to specify a list of other users who can see her or his postings. Those private conversations traditionally take place in e-mail communications, but this software encourages users to use the web system for better archiving of the research effort.

Database

The back end of the system is naturally a relational database management system (RDBMS). Most of the data we want to store/retrieve/search is textual, and RDBMS is perfect for this purpose. Any SQL-enabled database server will work, and there are many such servers to choose from. I chose the GPLed MySQL for its speed and reliability. If data integrity and transaction support is a must for you, you can choose the open-source PostgreSQL database server.

Middleware

The middleware is a collection of classes/methods that wrap the data query operations. They are designed to shield front-end developers from the technical details of database connections and SQL language.

I chose to use Java to develop the middleware. One big advantage of using Java is that it is a full-blown, object-oriented language, making it much easier to implement complex logic/structure designs required by large projects. Using Java also allows us to take advantage of a large number of utility classes already existing as Java libraries or beans. The ones I used for this project include JDBC driver, database connection pool, session management and text processing.

The standard way to build database middleware in J2EE is to use entity EJBs (Enterprise JavaBeans). However, this approach requires running an EJB container, which can be expensive. In fact, few low-cost JSP hosting services provide EJB containers. For OpenReference's relatively simple database structure, I decided to use a simpler approach: static methods in helper classes to provide database access. Each row in a table is represented by a HashTable.

The middleware uses JDBC to pass information between the Java application and the SQL database. There are JDBC drivers for all the major RDBMSes. For MySQL, I used the mm.mysql driver. I constructed one class for each database table. The class knows the fields in that table and knows which fields are searchable. Each class implements a set of basic data query functions (e.g., getAllRows, AddRow, updateRows, etc.) and a search function that searches all the searchable fields and returns all the matched rows. Each class also has its own query functions to do specific or cross-table queries. For example, in the ReferenceTable.java class, there is a function getReferencesByUserName. This will take the user name as input and find the corresponding user ID in the User table, and then return the rows with matching user ID from the Reference table. As an example, see Listing 1 [available at ftp.ssc.com/pub/lj/listings/issue91/4769.tgz] for the complete API for the Category class.

Front End

I chose JavaServer Pages (JSPs) for the front end. JSPs have the power of the full Java language plus the benefit of separating the web presentation from the application functions. JSPs support all the HTML tag syntax for formatting display, and one can add whole Java programs dealing with beans and other functions in HTML comments. One can even design custom tags to encapsulate the back-end operations (e.g., database queries). It is easy to train an HTML programmer/presentation expert to work on the web pages using their favorite HTML editor, without caring about how the database queries are executed. In the meantime, a back-end programmer can work on the data query part without caring about how the data will be displayed. More information about JSPs can be found in Reuven Lerner's At the Forge column in the May through July 2001 issues of Linux Journal.

Any J2EE-compatible Java server is capable of running JSPs. My favorites are the Tomcat engine from the Apache Foundation and the Resin engine from Caucho Technology. Either one of them can run as an extension module of the Apache web server to take advantage of many other useful features of Apache. Tomcat is released under GPL. Resin is closed-source software but is free for noncommercial use. Resin runs considerably faster than Tomcat and offers some useful features, such as a built-in JDBC driver and driver pool and multiple JVM for fallback.

Dynamic Category Tree

Originally I planned to use an XML file to represent the category structure because XML documents are naturally organized in a tree structure (DOM model), and they are easy for human reading and editing. There are many good XML-DOM tools in Java to manipulate trees.

However, we need a large and dynamic category structure for accurate classification and browsing, as it is important to be able to search the categories. One drawback of using XML is its difficulty to be searched together with other content stored in RDBMS. Also, to store a big parsed DOM object in memory constantly is inefficient and difficult to synchronize among several JVMs. To store it on disk and parse it when needed introduces too much processing overhead. So, I decided to use database tables for the categories.

The whole category is stored in a database table. Each record represents a category and has its unique category ID. It also has the parent category ID so that the records are linked together into a tree. References are linked to the categories by a separate category ID vs. reference ID table.

The Category class in middleware contains all methods to operate the tree. In order to maintain the links and structure integrity of the category table, it is important that we manipulate the category table only through the Category object public methods. Some important methods include:

Insert new subcategories in or between any level(s) in the tree. A special note about adding a new subcategory under a leaf category: since all references must be associated with leaves only, references that were associated with the old leaf become associated with the new subcategory now.
Change category description/keywords/properties.
Delete a category from any level in the tree to make the target category's children be children of its parent and then delete the target itself from the table.
Return a list of children (or parent) for a given category.
Search keywords from category descriptions.

An illustrative listing of Category-class source code is presented in Listing 1. I found those methods sufficient for my use. You might want to do more complicated operations, such as moving a subtree to another branch, etc. The strength of open-source software is that anyone can add functions to the code without rewriting the basic part.

User Authentication and Sessions

The user login and sessions are managed through the session API from J2EE. Usernames and passwords are authorized from a database table, then the servlet engine establishes a session for this user. The servlet engine tracks the sessions through session objects.

The session object needs to know about the user who owns it. That includes the user name, profile, preferences and current browsing status. Storing such information in the session object improves performance a lot. Otherwise, for example, the program has to look up the database for display preferences every time prior to displaying a page for a user.

Instead, I could store all the information inside the session object itself. But for better organization, I used several JavaBeans associated with session objects to store additional user information. In JSP specifications, a JavaBean can have a scope of a session or a page. Sessions with JavaBeans greatly simplify the development work.

Database Connections

Database connections are expensive to establish. To open a new connection every time we need to query can slow down the computer drastically and use up the system resources quickly.

There are several ways we can reduce the need for new connections. One way is to assign one connection for each session. That connection can be stored in a bean with session scope, and all the queries from that session go through it. However, if many sessions are active at the same time, available connections run out fast. This solution does not scale well.

Another choice is to assign one connection per page. However, I do not like this idea; the connection object has to be passed to the middleware by the JSP coder. This is not intuitive to a nontechnical JSP coder and defies the goal of separating presentation from application logic. It is possible to design a set of custom JSP tags to encapsulate the database connections so that JSP coders do not see them, but it requires extra designing work and that the JSP coder learn a nonstandard language. Using custom tags certainly increases productivity in the long run, but it also makes short and simple changes more difficult with a steeper learning curve for the system. I am still seeking the best solution. For now, I decided to hide the connection handling completely inside the middleware.

I made a new database connection from every database query function in the middleware. Each query function completes multiple-related queries to make efficient use of connections. Each JSP page calls only one or two such functions.

To reduce the overhead caused by opening new connections, I used “connection pool” utilities. These utilities are classes that maintain a pool of open connections in memory. When the user requests a new connection, it simply fetches one from its pool rather than making a new one. When the user closes the connection, it returns to the pool. There are several such utilities, e.g., PoolMan. Their usage is very straightforward, and you only need minimal changes to convert your code to take advantage of those utilities.

Text Processing

We have to allow users to mix some HTML tags in the text so that they can format the memo and comments properly. However, displaying user HTML has a big security risk: unauthorized user HTML tags can corrupt/change the display of page navigation and other people's comments. They can then be used to load unauthorized images/applets or even redirect the user to another site using JavaScript.

So, I decided to allow only four HTML tags: <p>, <b>, <i> and <a>. That ensures the user cannot change the format of any content other than her or his own. The unauthorized HTML tags are filtered out before submitting text into the database. The allowed HTML tags are configurable by the site administrator at compile time.

It is also important to note that some users might want to input XML sample code or mathematical formulas containing “<” in memos or comments. It would be inappropriate to treat all “<” as HTML tag tokens. So, I provide the plain text mode under which all the “<” symbols are converted to “<” before being submitted into the database.

A powerful way to process text is using regular expressions. There have been several good Java regular expression engines available. I choose gnu.regexp because it is a free software implementation of almost all Perl5 regular expression features through a simple and intuitive API. The code for filtering HTML tags and composing the SQL query string using a regular expression engine is listed in Listing 2 [available at ftp.ssc.com/pub/lj/listings/issue91/4769.tgz].

Resources