Extracting and Querying a Comprehensive Web Database - M. Cafarella, UWS

A tremendous amount of information is lost in existing web databases that try to fit crawled information into a specific, predefined domain. The loss occurs because the information is 'forced' into that predefined domain.

This paper improves the extraction model for web databases.

Architecture

Web crawl --> Multimodel Extraction --> Entity Database --> Multimodel Transaction --> User Query


This paper proposes a dynamic domain generation approach.

The challenges faced by this approach are:

a) Web Extraction - Generating the E-R model is going to be a challenge, because different domains may be generated for the same topic. Take "George W Bush" and "President George Bush", for example: how do you know that these refer to the same domain? Data reconciliation is a huge issue.

(Dong, A Halevy and Madhavan - Reference Reconciliation in Complex Information Space)
(Singla and Domingos - Entity Resolution with Markov logic)
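To make the reconciliation problem concrete, here is a minimal sketch in Python. It is not the method from the paper or from the two references above; it is a naive name-normalization heuristic, and the TITLES list and function names are my own assumptions for illustration.

```python
import re
from collections import defaultdict

# Hypothetical list of honorifics to ignore; not taken from the paper.
TITLES = {"president", "mr", "mrs", "ms", "dr", "senator"}


def canonical_key(mention):
    """Reduce a mention to a crude key: lowercase words minus titles and initials."""
    tokens = re.findall(r"[a-z]+", mention.lower())
    kept = [t for t in tokens if t not in TITLES and len(t) > 1]
    return tuple(sorted(kept))


def group_mentions(mentions):
    """Group mentions that share the same canonical key."""
    groups = defaultdict(list)
    for mention in mentions:
        groups[canonical_key(mention)].append(mention)
    return dict(groups)


groups = group_mentions(["George W Bush", "President George Bush", "Tony Blair"])
for key, members in groups.items():
    print(key, members)
```

Note that this heuristic would also merge "George H W Bush" with "George W Bush", which is exactly why reconciliation needs something smarter than string cleanup.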

b) Entity - Relation - A component that contains the entities extracted from the web. There are two methods of querying the system:

1. Structured Query - Queries a specific table / domain.

2. Unstructured Query - A query that spans multiple tables, trying to find rows that match a criterion.
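The difference between the two query styles can be sketched with a toy in-memory database. This is only an illustration under my own assumptions: ENTITY_DB, its tables, and both function names are hypothetical and do not come from the paper.

```python
# Hypothetical toy entity database: table name -> list of rows.
ENTITY_DB = {
    "politicians": [
        {"name": "George W Bush", "office": "President", "country": "USA"},
        {"name": "Tony Blair", "office": "Prime Minister", "country": "UK"},
    ],
    "cities": [
        {"name": "Washington", "country": "USA"},
        {"name": "London", "country": "UK"},
    ],
}


def structured_query(db, table, **criteria):
    """Query one named table; every criterion must match exactly."""
    return [
        row for row in db.get(table, [])
        if all(row.get(col) == val for col, val in criteria.items())
    ]


def unstructured_query(db, keyword):
    """Scan every table and return (table, row) pairs where any value contains the keyword."""
    needle = keyword.lower()
    hits = []
    for table, rows in db.items():
        for row in rows:
            if any(needle in str(value).lower() for value in row.values()):
                hits.append((table, row))
    return hits


print(structured_query(ENTITY_DB, "politicians", country="UK"))
print(unstructured_query(ENTITY_DB, "usa"))
```

The structured query needs the caller to know the table and column names up front, while the unstructured one only needs a keyword but has to touch every table.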

c) Query Processing - An interface that accepts user requests and supports both structured and unstructured queries. Results are stored as on-the-fly tables. It also takes feedback from the user to make the results more accurate.
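A rough sketch of what an on-the-fly result table with user feedback could look like is shown here. The ResultTable class and its methods are hypothetical names of mine, and the feedback step (simply dropping rows the user rejects) is an assumption, not the mechanism described in the paper.

```python
class ResultTable:
    """Hypothetical on-the-fly table holding the rows returned for one query."""

    def __init__(self, columns, rows):
        self.columns = list(columns)
        self.rows = [dict(row) for row in rows]

    def reject(self, predicate):
        """Drop rows the user marks as wrong; return how many were removed."""
        before = len(self.rows)
        self.rows = [row for row in self.rows if not predicate(row)]
        return before - len(self.rows)

    def as_lines(self):
        """Render the table as plain-text lines, header first."""
        header = " | ".join(self.columns)
        body = [
            " | ".join(str(row.get(col, "")) for col in self.columns)
            for row in self.rows
        ]
        return [header] + body


table = ResultTable(
    ["name", "country"],
    [
        {"name": "Washington", "country": "USA"},
        {"name": "George W Bush", "country": "USA"},
    ],
)
# The user was looking for people, so they reject the city row.
removed = table.reject(lambda row: row["name"] == "Washington")
print("\n".join(table.as_lines()))
```

The table exists only for the lifetime of the query, so the user's corrections change the result they see without touching the underlying entity database.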


Other related work

KnowItAll

TextRunner

WebTables

WeakAssoc
