Extracting and Querying a Comprehensive Web Database - M. Cafarella, UWS

A tremendous amount of information is lost in existing web databases, because they try to fit crawled information into a specific, predefined domain. Anything that does not match that domain is 'forced' into it or dropped.

This paper improves the extraction model for web databases.


Web crawl --> Multimodel Extraction --> Entity Database --> Multimodel Transaction --> User Query

This paper proposes a dynamic domain generation approach.

The challenges faced by this approach are:

a) Web Extraction - Generating the E-R model is going to be a challenge, as different domains may be generated for the same topic, for example "George W Bush" and "President George Bush". How do you know that these refer to the same domain? Data reconciliation is a huge issue.

(Dong, Halevy and Madhavan - Reference Reconciliation in Complex Information Spaces)
(Singla and Domingos - Entity Resolution with Markov Logic)
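
To make the reconciliation problem concrete, here is a minimal sketch of a naive name matcher. It is not the method from the paper or from either reference above; it only shows why the problem is hard. The TITLES set and the function names are hypothetical, chosen for illustration.

```python
import re

# Hypothetical list of honorifics to ignore when comparing names (an assumption).
TITLES = {"president", "mr", "mrs", "ms", "dr", "senator"}

def name_tokens(mention):
    """Lowercase a mention and keep only informative name words."""
    words = re.findall(r"[a-z]+", mention.lower())
    # Drop honorifics and single-letter initials such as "W".
    return frozenset(w for w in words if w not in TITLES and len(w) > 1)

def same_entity(a, b):
    """Naive rule: two mentions match if their remaining name words are identical."""
    tokens_a, tokens_b = name_tokens(a), name_tokens(b)
    return bool(tokens_a) and tokens_a == tokens_b

print(same_entity("George W Bush", "President George Bush"))
print(same_entity("George W Bush", "Laura Bush"))
```

This rule merges the two mentions from the example, but it would also merge "George H W Bush" with "George W Bush", which are different people. That kind of false match is why reconciliation needs more context than the name string alone.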

b) Entity-Relation - A component that contains the entities extracted from the web. There are two methods of querying the system:

1. Structured Query - A query against a specific table / domain.

2. Unstructured Query - A query that spans multiple tables, trying to find rows that match the given criteria.
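
The difference between the two query styles can be sketched as follows. This is only an illustration under assumed data: ENTITY_DB, its tables, and both function names are hypothetical and do not come from the paper.

```python
# Hypothetical entity database: table name -> list of rows (an assumption).
ENTITY_DB = {
    "presidents": [
        {"name": "George W Bush", "party": "Republican"},
        {"name": "Bill Clinton", "party": "Democratic"},
    ],
    "cities": [
        {"name": "Clinton", "state": "Iowa"},
        {"name": "Seattle", "state": "Washington"},
    ],
}

def structured_query(db, table, **conditions):
    """Query one named table with exact column = value conditions."""
    return [
        row for row in db[table]
        if all(row.get(col) == val for col, val in conditions.items())
    ]

def unstructured_query(db, keyword):
    """Search every table for rows where any value contains the keyword."""
    needle = keyword.lower()
    hits = []
    for table, rows in db.items():
        for row in rows:
            if any(needle in str(value).lower() for value in row.values()):
                hits.append((table, row))
    return hits

print(structured_query(ENTITY_DB, "presidents", party="Republican"))
print(unstructured_query(ENTITY_DB, "clinton"))
```

The structured query needs the caller to know the table and its columns. The unstructured query does not, but its hits come from tables with different schemas (here a president and a city), so the results still have to be assembled into something presentable.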

c) Query Processing - The interface that accepts user requests. It supports both structured and unstructured queries. Results are stored as an on-the-fly table. It also takes feedback from the user to make the results more accurate.
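
One possible reading of "on-the-fly table" with user feedback is sketched below. The OnTheFlyTable class, build_result_table, and the refine step are all hypothetical names and design choices made for illustration; the paper's actual interface may work differently.

```python
from dataclasses import dataclass, field

@dataclass
class OnTheFlyTable:
    """Temporary result table built for a single query (hypothetical structure)."""
    columns: list
    rows: list = field(default_factory=list)

    def refine(self, keep):
        """Return a new table holding only the rows the user marked as relevant."""
        return OnTheFlyTable(self.columns, [r for r in self.rows if keep(r)])

def build_result_table(hits):
    """Merge result rows with different schemas into one table, padding gaps with None."""
    columns = []
    for row in hits:
        for key in row:
            if key not in columns:
                columns.append(key)
    rows = [[row.get(col) for col in columns] for row in hits]
    return OnTheFlyTable(columns, rows)

# Assumed hits from an unstructured query that matched two different tables.
hits = [
    {"name": "Bill Clinton", "party": "Democratic"},
    {"name": "Clinton", "state": "Iowa"},
]
table = build_result_table(hits)
print(table.columns)

# User feedback: keep only rows that have a party, i.e. the politician.
refined = table.refine(lambda row: row[table.columns.index("party")] is not None)
print(refined.rows)
```

Here the feedback step is just a filter the user supplies. A real system would presumably use such feedback to adjust ranking or extraction as well, which is beyond this sketch.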

