Paper 17

Paper Title: Semantic matching across heterogeneous data sources

Three Critical Questions

Monday

Group 1:

Member Name: Amarthaluri Abilash, Bhardwaz Somavarapu

• Cluster analysis is suited to identifying schema-level correspondences, while classification is suited to identifying instance-level correspondences (a rough sketch of both appears after this list). But what if the data set keeps changing dynamically? Such changes are common in real-world problems. Does every change require the entire cluster analysis to be redone?
• Many learning methods have been proposed for cluster analysis, and each takes its own approach to clustering the data, with its own merits and disadvantages. What if different applications cluster their data using different learning methods and those applications later need to interoperate?
• No classifier can classify the data effectively without human intervention. Analysts have to review the data again to find misclassified records, and there is still the risk of data being misinterpreted or not classified as desired. Is a fully automated classifier needed that avoids this manual work and saves time?
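As a rough illustration of the two techniques mentioned above (not the paper's actual implementation), the sketch below clusters attribute-description vectors to suggest schema-level correspondences and trains a classifier on labelled record pairs for instance-level matching; the feature choices, the toy values, and the use of scikit-learn are assumptions for illustration only.

```python
# Illustrative only: clustering for schema-level correspondences,
# classification for instance-level correspondences. Feature choices
# and scikit-learn usage are assumptions, not the paper's method.
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

# Schema level: describe each attribute by simple statistics
# (average value length, fraction of numeric values), then cluster.
# Attributes falling into the same cluster are candidate correspondences.
attribute_features = [
    [7.2, 0.0],   # customer_name
    [7.5, 0.0],   # client_nm
    [5.0, 1.0],   # zip_code
    [5.1, 1.0],   # postal_cd
]
schema_clusters = KMeans(n_clusters=2, n_init=10).fit_predict(attribute_features)

# Instance level: each record pair is described by field-similarity scores
# and labelled 1 (same real-world entity) or 0 (different); a classifier
# trained on these examples labels new pairs.
pair_features = [[0.9, 1.0], [0.2, 0.0], [0.8, 0.9], [0.1, 0.1]]
pair_labels = [1, 0, 1, 0]
matcher = DecisionTreeClassifier().fit(pair_features, pair_labels)
print(schema_clusters, matcher.predict([[0.85, 0.95]]))
```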

Group 2:

Member Name: Sai ram kota

Q1: The author does not clearly state the average success rate of identification. He only vaguely says that human intervention is needed in the worst case, but he does not express this in terms of the number of successful versus failed cases.

Q2: The author talks about speed of operation, but how many correspondences are output, how many are actually correct, and how many need human intervention? Which approach is more accurate, and under what conditions?

Q3: The author uses Soundex to compare attribute names across databases, but how would a Soundex tool handle multilingual data?
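For context on this question, here is a minimal sketch of the basic Soundex coding (a simplified version of the standard algorithm, not necessarily the exact tool the paper uses); the hard-coded letter-to-digit table is also what ties it to English phonetics.

```python
def soundex(name: str) -> str:
    """Simplified Soundex: first letter plus three digits (e.g. 'Robert' -> 'R163')."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    digits, prev = [], codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch)
        if code and code != prev:
            digits.append(code)
        if ch not in "hw":            # h and w do not break a run of equal codes
            prev = code or ""
    return (name[0].upper() + "".join(digits) + "000")[:4]

# Attribute names that sound alike map to the same code...
print(soundex("CustName"), soundex("CustomerName"))   # C235 C235
# ...but the letter-to-digit table encodes English phonetics only,
# which is exactly the multilingual concern raised above.
```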

Group 3:

Member Name: Sunil Kakaraparthi, Sunil Kumar Garrepally

1) Even though the clustering concept was introduced years ago, no single clustering method can be chosen as the best option; different methods need to be integrated. How well can these methods be integrated, what are the consequences of combining them, and how well do they perform together as a solution?

2) Statistical analysis techniques are used to evaluate attribute correspondences (a rough sketch of the idea follows this list). If the analysis reveals new corresponding records, the record matching must be repeated. How can this iterative process be a workable solution when it is so expensive?
3) Regression analyses depend on statistical analysis. How can the limitations of regression analysis, such as measurement error, specification error, and multicollinearity, be eliminated?
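As a rough, hedged illustration of what statistical evaluation of attribute correspondences can look like (the chosen statistics, distance measure, and toy values are assumptions, not the paper's method), the sketch below summarizes each attribute's values with a few statistics and pairs up the attributes whose summaries are closest.

```python
# Illustrative sketch: pair attributes whose value statistics are closest.
# The statistics and the distance measure are assumptions for illustration.
from statistics import mean, pstdev

def summary(values):
    """Describe an attribute by (mean, standard deviation, distinct-value ratio)."""
    return (mean(values), pstdev(values), len(set(values)) / len(values))

def closest_pairs(schema_a, schema_b):
    """For each attribute in schema_a, report the schema_b attribute
    whose summary statistics are most similar."""
    pairs = {}
    for name_a, vals_a in schema_a.items():
        sa = summary(vals_a)
        pairs[name_a] = min(
            schema_b,
            key=lambda name_b: sum((x - y) ** 2
                                   for x, y in zip(sa, summary(schema_b[name_b]))),
        )
    return pairs

# Toy numeric columns from two databases (hypothetical names and values).
db1 = {"salary": [50_000, 62_000, 58_000], "age": [34, 41, 29]}
db2 = {"annual_pay": [51_000, 60_000, 59_500], "years_old": [31, 45, 27]}
print(closest_pairs(db1, db2))   # {'salary': 'annual_pay', 'age': 'years_old'}
```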

Group 4:

Member Name: Ramya Devabhakthuni, Prashant Sunkari

• The paper states that “similar concepts” from different data sets can be modeled with “different structures”, and that modeling them differently leads to “discrepancies”. How can the learning techniques and tools be applied efficiently without human analysts?
• When integration is to be done, the different databases may contain attributes in common as well as “overlapping records”. How is “semantic integration” done when the databases contain thousands of attributes, hundreds of tables, and millions of records?
• The author discusses the integration of data sets while restricting the discussion to two simple tables. When it comes to huge amounts of heterogeneous data, are the specified techniques sufficient to identify the correspondences and perform the integration?
• If there is no common key between the records, human analysts have to classify the record pairs manually (a rough sketch of key-less pair matching follows this list). How are factors like time consumption and cost handled in this case?
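To make the “no common key” situation concrete, here is a minimal, hedged sketch (the field names, the equal field weights, and the 0.8 threshold are illustrative assumptions) that scores record pairs by field-wise string similarity instead of joining on a key.

```python
# Illustrative sketch: matching record pairs without a shared key by comparing
# fields approximately. Field names and the threshold are assumptions only.
from difflib import SequenceMatcher

def field_sim(a: str, b: str) -> float:
    """String similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def pair_score(rec1: dict, rec2: dict, fields=("name", "city")) -> float:
    """Average similarity over the compared fields."""
    return sum(field_sim(rec1[f], rec2[f]) for f in fields) / len(fields)

r1 = {"name": "Jon Smith", "city": "Kansas City"}
r2 = {"name": "John Smith", "city": "Kansas Cty"}
score = pair_score(r1, r2)
print(round(score, 2), "match" if score >= 0.8 else "send to a human analyst")
```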

Group 5:

Member Name: Lokesh Reddy Gokul

• After the discussion of classification and clustering techniques for schema-level and record-level correspondences in heterogeneous data sources, what association techniques are possible for actually relating the data sources? The author discusses classifying and clustering them but not the mechanism for relating them.
• In the comprehensive procedure discussed at the end of the article, the attribute-analysis stage is part of the repeating life cycle and requires human intervention, which points to a continuous need for the user to intervene in the semantic integration of data sources. This is a poor sign, since the main standpoint of the paper is to automate the semantic integration procedure. How, then, can the article claim to have achieved its purpose of obtaining a model for autonomous semantic integration?
• The author also does not discuss privacy-related concerns in the process of semantically integrating various data sources. Can that scope safely be ignored, and if not, what are the possible concerns or difficulties? What other attribute- and schema-level concerns does this bring to light?

Group 6:

1) The author mentions that in order to achieve an optimal solution, several different methods need to be adopted. But does this reduce the success rate, given that the system itself has to adopt and combine the different methods? Can the different methods be chosen by the system?

2) In the paper, the author mentions that because the learning-based approach bypasses the knowledge-acquisition bottleneck, it is preferable to the rule-based approach. How is the knowledge-acquisition bottleneck resolved in the learning-based approach?

3) In the paper, the author says that “Tools help the humans to analyze but never totally replace them”. Is there any way to develop the application interface so that verification and classification are done automatically by the tools, without manual work?

Group 7:

Member Name: Kishore Kumar Mannava

Mohana Siri Kamineni

1. The author mentions that large databases owned by various organizations often have attributes in common and “overlapping records”, and that they contain “discrepancies” and “data errors”. How, then, does comparing these databases lead to reliable results in determining the “semantic correspondences”?
2. The paper mentions that real-world data sources are very large, with many tuples, attributes, and records, and that detecting the correspondences manually is very expensive. How is this problem eliminated? It is mentioned that analysts rely on automated tools to determine the correspondences; how does this affect the overall cost, and what are the various tools used?
3. The paper tells us that cluster analysis is best suited to finding sets of similar examples in a data set. However, when the data set changes, the process has to be repeated. How efficient is this process when there are many frequently updated data sets, considering time and cost?
4. The author notes that clustering is very expensive for record matching. He also mentions that record matching must be repeated every time the underlying or supporting data sources are modified or updated (a rough sketch of an incremental approach follows this list). How can this issue be resolved to make the process more economical and efficient?
5. Clustering often uses design documents that contain a detailed description of the attributes and tables being compared. It is mentioned that these design documents are often wrong, outdated, incomplete, or simply unavailable. How accurately, then, can clustering compare the various sources and give appropriate results?
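As a hedged sketch of one way the re-matching cost raised in question 4 might be reduced (our own assumption, not something the paper prescribes), the code below keeps the matches from an earlier run and re-runs the expensive comparison only for records added since then.

```python
# Illustrative sketch: incremental record matching. Only records added since
# the last run are compared against the other source; earlier matches are kept.
# The similarity function and threshold are assumptions for illustration.
from difflib import SequenceMatcher

def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def incremental_match(new_records, other_source, cached_matches):
    """Extend the cached matches with matches for newly added records only."""
    for rec in new_records:
        for other in other_source:
            if similar(rec, other):
                cached_matches.append((rec, other))
    return cached_matches

cached = [("ACME Corp", "Acme Corporation")]        # result of an earlier run
new_records = ["Globex Inc"]                        # added since that run
other_source = ["Acme Corporation", "Globex Inc."]
# Keeps the earlier ACME pair and adds the new Globex pair.
print(incremental_match(new_records, other_source, cached))
```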

Group 8:

Member Name: Hema Snigdha Putheti

1. It is stated that the classifiers do not classify the record pairs exactly and leave the difficult cases for the analysts, who have to evaluate them manually. Is this not as troublesome and time-consuming as reviewing the records manually? How can the analysts review the output when there are huge numbers of records, and what is the use of classifiers if they do not classify all the records?
2. It is stated that record matching has to be redone whenever the data is updated. How can this be done when there is a huge amount of data, given that it is time-consuming and generates a lot of overhead?
3. The data is integrated using techniques such as semantic integration, and many issues arise when the databases have records in common. How are issues like overlap addressed when dealing with many databases, and how can the integration be performed when there are many intersecting values across different domains?

Wednesday

Group 1:

Member Name: Lattupalli, Pelluri, Voruganti

1. Is the semantic matching process dynamic or static? If it is dynamic, how is the time factor affected while categorizing, analyzing, and accessing the data? If it is static, how will updated records change the matching? Will it have to be checked constantly, and is it worth spending that much time?
2. If a database holds huge amounts of data and requires manual analysis, is it efficient, or even affordable, for the process to rely on such manual analysis?
3. For the security databases, what kind of architecture is needed so that all the tools and methods used are integrated to produce the most efficient data set for the two databases? If a less reliable architecture is deployed, could that affect the security guarantees between the two databases?

Group 2:

Member Name: Addagalla, Bobbili, Gopinath

• The author states that the classifier cannot classify all record pairs. How did the author come to this conclusion?
• Also, if some of the design documents tend to be wrong or incorrect, would the tool still match the semantics of the data correctly? This point is not discussed.
• The author talks about various sources of evidence such as clustering tools, data patterns, and usage patterns. Given the huge ocean of Internet data, how will these tools compare the data?

Group 3:

Member Name: Swathi Shastry

1> How does the classifier keep learning when the data sets keep changing, and how does the clustering technique form a set of clusters under such changes? How frequently are the clusters re-formed? Does that depend on how frequently the data sets change, and what is the course of action when a data set is updated?

2> The security databases are matched based on the similarity of their characteristics, which results in sets of roughly similar characteristics. What is the size limit for a characteristic set, beyond which a dimensionality-reduction technique needs to be applied?

3> The metadata search engine needs to learn the relationships between element types from users’ logs so that it can adaptively and effectively search the documents. Is there any verification scheme to validate the relationships learnt by the metadata search engine?

Group 4:

Member Name: Karunapiya Rameshwaram, Shaiv, Anusha Vunnam

Critical Questions:
1.) Real-world data sources are very bulky, with numerous tables, attributes, and records. Can costs be kept reasonable when the correspondences have to be acknowledged manually, which tends to be exorbitantly expensive?
2.) The article explains that corresponding records can be merged into a single data set so that statistical analysis can be used for further examination. But to what extent is this analysis reliable?
3.) The article makes clear the significance of matching semantics in theory. But to what extent is this supported in practice?

Group 5:

Member Name: Rahul Reddy, Rahul Mootha

1. The characteristic set developed for the various matching techniques is based on a large number of parameters, and its dependence on so many parameters is highly questionable, especially for real-time systems.
2. The solution suggests various learning techniques and integrates them in order to find semantic equivalences among databases. What is the guarantee that this will solve the problem without any human intervention or checking?
3. The learning techniques in use draw their evidence from sources such as documents, dictionaries, and thesauri so that equivalence can be checked. What if those values are wrong, or no documents with correct values exist? Then the whole result will be incorrect.

Group 6:

Member Name:

Group 7:

Member Name:

Group 8:

Member Name: Bhargav Sandeep Ramayanam

1) The author says in the paper that “Tools help the humans to analyze but never totally replace them.” Is there any way to develop the AI such that even the verification can be done by the tools alone, without human involvement?
2) The author says there is no single best method; several different methods have to be used together to achieve an optimal solution. Can the system itself choose which methods to adopt, for example based on their success rates?
3) The author says that the learning-based approach is preferred to the rule-based approach because it bypasses the knowledge-acquisition bottleneck. How is this bottleneck resolved in the learning-based approach? Is there one particular way of implementing it, or do we need to depend on multiple methods to resolve the issue?

Group 9:

Member Name: Satish Bhat, Holly Vo

1. What is the cost-versus-benefit ratio of these classification techniques? This is an important factor that any business needs to take into consideration.
2. The paper does not provide any evaluation of the comprehensive procedure. Measurements such as query size versus running time should have been provided.
3. Combining schema-level and instance-level matching techniques can slow the system down considerably. Can this approach be applied to time-critical applications, such as those used in hospitals?
4. Should manual verification be required for instances matched by an approximation function? How can a misspelling be distinguished from an intentionally unusual name, or a phonetic error from similar-sounding words in another language? Should a matching confidence be attached to matched results to direct the user's verification effort (see the sketch after this list)?
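On the confidence suggestion in question 4, here is a minimal, hedged sketch (the similarity measure and the 0.6–0.9 review band are assumptions, not the paper's) that attaches a confidence score to each approximate match and routes only the uncertain band to a human reviewer.

```python
# Illustrative sketch: attach a confidence to each approximate match and send
# only the uncertain ones to a human. The similarity measure and the 0.6-0.9
# review band are assumptions for illustration.
from difflib import SequenceMatcher

def confidence(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def triage(pairs):
    """Label each candidate pair as accept / review / reject by confidence."""
    results = []
    for a, b in pairs:
        c = confidence(a, b)
        label = "accept" if c >= 0.9 else "review" if c >= 0.6 else "reject"
        results.append((a, b, round(c, 2), label))
    return results

candidates = [("Jon Smith", "John Smith"),   # likely misspelling
              ("Juan Smit", "John Smith"),   # similar sound, other language?
              ("Jane Doe", "John Smith")]    # unrelated
for row in triage(candidates):
    print(row)
```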

Group 10:

Member Name: Sunae Shin, Hyungbae Park

1. This paper takes only a methodological view of integrating heterogeneous data across many organizations. However, privacy and security issues also need to be discussed, since every organization has its own protocols and strategies, which may not be compatible with those of others.
2. The paper discusses the suitability of cluster analysis techniques for schema-level correspondences and of classification techniques for instance-level correspondences. However, it does not give enough reasons why cluster analysis is better suited to identifying schema-level correspondences and classification to detecting instance-level correspondences.
3. The paper shows several examples of discrepancies and data errors and also enumerates some existing techniques that resolve them. This helped me understand how the problems of discrepancies and data errors are solved.
4. Sometimes human analysis is more efficient than a computer for schema-level correspondences. For example, we can see immediately that Table 1 and Table 2 in the paper describe the same data, whereas a computer will take much longer to analyze the correspondences.
5. Tables and attributes are named to reflect their meanings. However, this alone is not enough to find similarities among them using string-matching methods and linguistic tools, because many different names can represent the same tables and attributes.
