Data Mining and Information Extraction for CiteSeerX and Friends

Date & Time:
12:00:00 - 13:30:00,
Friday 22 June, 2012


Cyberinfrastructure, or e-science, has become crucial in many areas of science where access to data often determines scientific progress. Open source (OS) systems have greatly facilitated the design, implementation and support of cyberinfrastructure, permitting the design of specialized integrated search engines and digital libraries. These offer many opportunities for domain-relevant information and knowledge extraction, such as citation extraction, automated indexing and ranking, chemical formulae search and table indexing.

In this talk, Professor Giles will describe the open source SeerSuite architecture, a modular, extensible system built on successful OS projects such as Lucene/Solr, and will discuss issues in building domain-specific enterprise cyberinfrastructure for the sciences and academia. Because of the large amount of information crawled, many problems arise in information extraction and data mining, such as author and entity disambiguation, data extraction and ranking. Professor Giles will highlight application domains with examples from computer science (CiteSeerX), chemistry (ChemXSeer) and related problem areas.

As such enterprise systems require unique information extraction approaches, several different machine learning methods (such as conditional random fields, support vector machines, mutual information-based feature selection and sequence mining) are critical for performance. Professor Giles will draw lessons for the design and implementation of other e-science and cyberinfrastructure systems, and discuss future directions, systems and research.

  • Name: Professor C. Lee Giles
  • Affiliation: David Reese Professor, College of Information Sciences and Technology,
    Pennsylvania State University