Extracting and Querying a Comprehensive Web Database - M. Cafarella, UWS
Existing web databases lose a tremendous amount of information because they try to fit crawled information into a specific, predefined domain. The loss occurs because information is 'forced' into that predefined domain.
This paper improves the extraction model for web databases.
Architecture
Web crawl --> Multimodel Extraction --> Entity Database --> Multimodel Transaction --> User Query
This paper proposes a dynamic domain generation approach.
The challenges faced by this approach are:
a) Web Extraction - Generating the E-R model is going to be a challenge, as different domains may be generated for the same topic, for example 'George W Bush' and 'President George Bush'. How do you know that these are the same domain? Data reconciliation is a huge issue.
(Dong, Halevy and Madhavan - Reference Reconciliation in Complex Information Spaces)
(Singla and Domingos - Entity Resolution with Markov Logic)
b) Entity-Relation - A component that contains the entities extracted from the web. There are two methods of querying the system:
1. Structured Query - Queries a specific table / domain.
2. Unstructured Query - Queries that span multiple tables, trying to find entries that match the given criteria.
c) Query Processing - An interface that accepts user requests and supports both structured and unstructured queries. Results are stored as on-the-fly tables. It also takes in interaction from the user to make the results more accurate.
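The three challenges above can be made concrete with a small sketch. This is not the paper's method: it assumes a toy in-memory entity database, a naive name-normalization heuristic standing in for real reference reconciliation, and made-up sample facts. All names here (TITLES, canonical_key, build_database, structured_query, unstructured_query) are hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict

# Hypothetical list of honorifics to ignore; real reconciliation uses far richer evidence.
TITLES = {"president", "mr", "mrs", "dr", "senator"}

def canonical_key(name):
    """Reduce a mention to (first name, last name), dropping titles and initials."""
    tokens = [t for t in re.findall(r"[a-z]+", name.lower()) if t not in TITLES]
    tokens = [t for t in tokens if len(t) > 1]  # drop single-letter initials such as "W"
    return (tokens[0], tokens[-1]) if tokens else ("", "")

def build_database(mentions):
    """Group (table, name, attribute, value) facts into tables keyed by reconciled entity."""
    db = defaultdict(dict)
    for table, name, attr, value in mentions:
        row = db[table].setdefault(canonical_key(name), {"names": set()})
        row["names"].add(name)
        row[attr] = value
    return db

def structured_query(db, table, attr, value):
    """Structured query: look inside one specific table only."""
    return [row for row in db.get(table, {}).values() if row.get(attr) == value]

def unstructured_query(db, keyword):
    """Unstructured query: scan every table and build an on-the-fly result table."""
    keyword = keyword.lower()
    results = []
    for table, rows in db.items():
        for row in rows.values():
            values = [str(v) for k, v in row.items() if k != "names"]
            text = " ".join(values + sorted(row["names"])).lower()
            if keyword in text:
                results.append({"table": table, **row})
    return results

# Illustrative sample facts, as if produced by the extraction stage.
mentions = [
    ("politicians", "George W Bush", "office", "President"),
    ("politicians", "President George Bush", "party", "Republican"),
    ("authors", "George W Bush", "book", "Decision Points"),
]
db = build_database(mentions)

print(structured_query(db, "politicians", "party", "Republican"))
print(unstructured_query(db, "bush"))
```

In this sketch both mentions of Bush collapse into a single row of the politicians table, the structured query touches only that table, and the unstructured query returns rows from every table that mentions the keyword. The heuristic is deliberately crude: it would also merge 'George H. W. Bush' with 'George W Bush', which is exactly why the reconciliation work cited above is needed.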
Other related work
KnowItAll
TextRunner
WebTables
WeakAssoc