Paper 18

Paper Title: Service-Oriented Distributed Data Mining

Three Critical Questions

Monday

Group 1:

Member Name: Chiranjeevi Ashok Puvvula

1) How does the introduction of a Service-Oriented Architecture reduce the amount of information transmitted over the Internet? A service-oriented mechanism is a distributed mechanism on the Internet, which would seem to increase, not decrease, data transmission and processing on the Internet.

2) How do "patterns" help provide the distributed nature of data mining? In such a dynamic and distributed setting, combining different "patterns" into a common or generalised pattern is difficult.

3) The article assumes that the "data flow follows a Gaussian distribution". Under what conditions is this assumption true, and when is it false? Given the dynamic nature of the data, how often does this assumption actually hold?

4) A privacy-conscious DDM is implemented with "different data granularity" at different levels, along with access privileges. This is good in one sense of security, but there should also be a cryptographic aspect to add to it. Are such cryptographic aspects included? If so, what are they, and which security concerns are they aimed at?

Group 2:

Member Name: Srikanth Kodali

1. The method of learning from abstraction was designed to provide privacy for the data, but no statistical or theoretical measure is given of how much the accuracy of the results is compromised: each new round is based on the results obtained in the previous round rather than on the actual data present in that round, which makes things more generalized.

2. The authors say that the DDM protocol continues to query each data source until the data likelihood stops improving or the computational budget runs out. If the budget runs out before the final value is achieved, how do the authors then plan to get the answer, and how beneficial is this approach, considering that we are after the answer and not the budgets involved? Is there any measure of how near the current result is to the final required value? (See the stopping-rule sketch after these questions.)

3. The authors themselves state that they sacrifice the accuracy of the algorithm to save time: "We can sacrifice optimality (regarding the modeling accuracy and representation efficiency) to a certain extent so that we can incorporate more efficient approximate algorithms instead." How much of a compromise is made in this case, and how far is it beneficial for the effectiveness of the process?
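A minimal sketch of the stopping rule the protocol is described as using: stop when the data log-likelihood stops improving or the computational budget is exhausted. The relative gain of the last round is one possible measure of "nearness" to the final value; the function name and thresholds here are illustrative assumptions, not taken from the paper.

```python
def should_stop(log_likelihoods, spent, budget, min_rel_gain=1e-3):
    """Decide whether the broker should stop querying data sources.

    log_likelihoods -- history of global-model log-likelihoods, one per round
    spent, budget   -- computational cost used so far and the total allowed
    """
    if spent >= budget:
        return True, "budget exhausted"
    if len(log_likelihoods) >= 2:
        prev, curr = log_likelihoods[-2], log_likelihoods[-1]
        rel_gain = (curr - prev) / max(abs(prev), 1e-12)
        if rel_gain < min_rel_gain:          # likelihood has stopped improving
            return True, "likelihood converged"
    return False, "keep querying"
```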

Group 3:

Member Name: Yaswanth Kantamaneni

1) In the implementation of distributed data mining systems, the paper faces a major issue in integrating two demands: synthesizing useful and usable knowledge, and performing large-scale computations. How will these two problems be resolved?
2) Distributed data mining applications are extended by implementing new services and remodeling the service flow. How will these new web services be integrated with the old ones, and what are the further consequences?
3) In service-oriented distributed data mining, a model-based approach was developed for information abstraction and analysis, supporting an adaptive data mining process. How is the computational complexity of the data analysis decreased, and how is the privacy of the information controlled?

Group 4:

Member Name: Nikhilesh Katakam

• The paper states that it is difficult to access "heterogeneous data" when there are different levels of "privacy concerns". How is the mining done accurately when access to data is restricted at different levels?
• The paper states that algorithms are used to compute the "data abstraction" repeatedly at "different levels". How are the time and cost constraints met when abstractions are performed at each level?
• How is information abstracted from the "local models" to the "global models", and vice versa, when it contains huge amounts of "private data"? How are issues like "resampling" resolved? (See the resampling sketch after these questions.)
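A hedged sketch of the "learning from abstraction via resampling" idea these questions refer to: each local site exposes only Gaussian-mixture parameters (no raw records), and a global service resamples synthetic points from those mixtures to fit a global model. The function names and sample sizes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

def sample_from_abstraction(weights, means, covs, n_samples):
    """Draw synthetic points from one local GMM abstraction (no private records touched)."""
    counts = rng.multinomial(n_samples, weights)
    parts = [rng.multivariate_normal(m, c, size=k)
             for m, c, k in zip(means, covs, counts) if k > 0]
    return np.vstack(parts)

def fit_global_model(local_abstractions, samples_per_site=500, n_components=3):
    """Pool synthetic samples from every site and fit a single global GMM on them."""
    pooled = np.vstack([sample_from_abstraction(w, m, c, samples_per_site)
                        for (w, m, c) in local_abstractions])
    return GaussianMixture(n_components=n_components, random_state=0).fit(pooled)
```

Resampling in this style is what makes the step expensive: the global fit runs over len(local_abstractions) × samples_per_site synthetic points and never sees which original records contributed to each component, which is also why individual contributions are discarded.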

Group 5:

Member Name: Ritesh Mengji

• In the paper, to address the privacy concerns, only statistical data patterns are allowed to be visible to the outside world, and clustering of this data is supposed to be used to derive conclusions, which does not sound meaningful. So what is the point of exposing the statistical data patterns? What might be the scope of this statistical pattern recognition in deriving meaningful conclusions to drive data mining across various platforms?

• In the active exploration section, the process is defined so that the global service keeps asking for more detail as long as it thinks more data is needed. But the protocol specifies that this process should stop either when the likelihood stops improving or when the cost boundaries are expected to be crossed. So the actual degree of detail needed for data mining is not guaranteed; only a relative scale of accuracy is reached, which may sometimes stop well short of the needed precision. What is the actual driving force, or argument, that justifies the use of such a methodology?

• In specifying the local data abstractions and selecting from the available hierarchies, there is scope for redundant indexing, because the Gaussian components are stored redundantly at each level. Is such a mechanism effective enough to provide better selection for the algorithm, or is a more effective indexing possible? This becomes critical because this process is itself the core structure of the discussion. (A sketch of such a hierarchy follows.)
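A minimal sketch of the kind of per-source granularity hierarchy discussed above, in which each level stores its own Gaussian components (weights, means, covariances). The class and field names are illustrative assumptions, not the paper's actual data structures.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GaussianComponent:
    weight: float
    mean: np.ndarray   # shape (d,)
    cov: np.ndarray    # shape (d, d)

@dataclass
class AbstractionLevel:
    level: int                 # 0 = coarsest granularity
    components: list           # GaussianComponent objects for this level

class LocalAbstractionHierarchy:
    """One hierarchy per local source; components are stored redundantly at
    every level, which is exactly the redundancy the question points out."""

    def __init__(self, levels):
        self.levels = sorted(levels, key=lambda lv: lv.level)

    def finest_allowed(self, max_level):
        """Return the finest abstraction the privacy policy permits, capped at max_level."""
        allowed = [lv for lv in self.levels if lv.level <= max_level]
        return allowed[-1] if allowed else None
```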

Group 6:

Member Name:

Group 7:

Member Name: Priyanka Koneru

1. Since the resampling process is computationally expensive and also discards the contribution of similar data points to the estimation of each global model parameter, how is this global model best suited to the existing model learning techniques?
2. The paper does not answer how to identify the exact versions of the available SA tools, since almost all of them are still in the development stage, nor how to integrate those available tools.
3. "Data mining services and data provisioning services are the component services of DDM." But how can these services automatically determine which services suit best, what data is to be shared, and whether or not to share the data?
4. How can one achieve better accuracy, resource utilization, efficiency, and privacy using DDM?

Group 8:

Member Name: Muppalla, Putheti

1. It is stated that during the "hierarchical data abstraction" the "data abstraction" is to be done at each level, and moreover the data revealed at the "lowest granularity" is very limited. Isn't computing the abstraction at different levels time consuming, and doesn't it require additional overhead?
2. How is the data abstracted from the "local models" to the "global models" when a lot of private data is involved within the "local models", and how is the "resampling" issue addressed in mining data from the "local models" to the "global models"?
3. How does the "DDM" provide services on demand, and how can it provide private data that is not recognized in advance? Since it has to request the services from the "local providers", accessing such services requires a lot of effort.

Wednesday

Group 1:

Member Name: Pelluri, Voruganti, Lattupalli

1. Learning from abstraction still remains an unanswered question even if the SOA supports it. The reason this question persists is that the abstraction of the data sometimes changes rapidly when the data is not consistent; in such cases, how does the author explain the data mining process?

2. The atomicity of the DDM needs to be categorized, because the levels of atomicity need to change accordingly. How does the author categorize the various factors affecting the atomicity of DDM?

3. As the data mining process is made into a distributed process with several services interacting with each other, will the risk factor not increase? Is it not more error prone? Validating and checking so many services will be a problem, and the malfunctioning of one service could affect the whole architecture. Will such an architecture be secure enough?

Group 2:

Member Name:Addagalla, Bobbili, Gopinath

• How is the heterogeneity of data handled by the service-oriented architecture? The granularity levels of the various data sources follow a hierarchical manner. How will the application figure out the appropriate granularity level?
• How can a map-manipulation application for geographic data be handled?
• How will an application that relies on an interim solution be taken care of?
• "Local models to the global models": for these models, how will the private data be abstracted?

Group 4:

Member Name: Swati Thorve

1> To implement service-oriented distributed data mining, the author uses existing web service standards such as WSDL, BPEL4WS, and UDDI as they are. However, when we compare a stand-alone DDM application with traditional applications, DDM is more data-centric while traditional applications are more action-oriented, and they differ in the way they store and retrieve data. Are traditional web services sufficient to accommodate these differences? If not, how should DDM services differ from traditional web services?

2> Local data sources are divided into different granularity levels. However, the more granularity levels there are, the more processing is required. In the author's approach, every request starts from the lowest granularity level, and working up from there to the highest point in the hierarchy is a time-consuming task. How can we avoid this by specifying a particular granularity level at the time of the request itself? How will the application know which granularity level to request?

3> How should data be divided into different levels? What should the criteria be? Does this division remain the same for different DDM services, or will it vary according to the service?

4> The author uses an iterative bidding process, where the global broker first requests the lowest-granularity data, builds a global model, sends that model to the local data sources, then requests higher-granularity data from the local data source that provided more data, and repeats this procedure iteratively. How can we replace this procedure to get more effective results in less time? (A sketch of this loop follows.)
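A hedged sketch of the iterative bidding loop described in question 4> above. The broker starts from the coarsest abstractions, fits a global model, sends it back to the local sources, and asks the most promising source for a finer abstraction; the broker/source interface used here is a hypothetical stand-in, not the paper's API.

```python
def iterative_bidding(broker, sources, max_rounds=10):
    # Round 0: every source contributes its lowest-granularity abstraction.
    abstractions = {src: src.abstraction(level=0) for src in sources}
    model = broker.fit(abstractions)

    for _ in range(max_rounds):
        # Each source inspects the current global model and "bids" how much
        # extra likelihood a finer abstraction of its data is expected to add.
        bids = {src: src.expected_improvement(model) for src in sources}
        best = max(bids, key=bids.get)
        if bids[best] <= 0 or not best.can_refine():
            break                                  # nothing left worth requesting
        abstractions[best] = best.refine()         # ask for the next granularity level
        model = broker.fit(abstractions)           # rebuild the global model
    return model
```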

Group 4:

Member Name: Karunapriya Rameshwaram, Shaiv, Anusha Vunnam

CRITICAL QUESTIONS:
1.) Is it economical to use the EM algorithm on local abstractions in order to learn global models, given that it involves more complicated steps?
2.) Would the design still be compatible if, in some case, we wanted stateful, long-running interactions in our DDM design?
3.) Will it be able to achieve the goals of data mining when dealing with privacy issues that are not known in advance?

Group 5:

Member Name: Gayathri Devi Bojja

1. The character set developed for use by the various matching techniques is based on too many parameters, and the dependence of the set on this large number of parameters is highly questionable, especially for real-time systems.
2. The solution suggests various learning techniques and integrates them in order to find the semantic equivalence among various databases; what is the guarantee that it will solve the problem without any human intervention and checking?
3. The learning techniques being used get their data from various sources such as documents, dictionaries, and thesauri so that their equivalence can be checked. What if those values are wrong, or the documents do not exist with the correct values? Then the whole result that comes out will be incorrect.

Group 6:

Member Name:

Group 7:

Member Name:

Group 8:

Member Name: Brugu Kumara Bhargava Mudumba

1) The author says that using the conventional EM algorithm for the GMM maximizes the observed data likelihood. But performing this iteratively will affect the posterior probability. How can this issue be resolved? (See the EM sketch after these questions.)
2) The author says that there are two different implementations of DDM: one where the data sources have no exclusiveness and one where they coexist completely. What if the sources overlap, or have some part included in each other?
3) In his method, the author says that the global brokering services can send to all the local services up to a specific moment. How can full-length transmission be achieved without errors?
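A small sketch, not taken from the paper, of the observed-data-likelihood property question 1) refers to: scikit-learn's GaussianMixture runs the conventional EM algorithm, and its per-sample log-likelihood never decreases as more EM iterations are allowed. The data here is synthetic and purely illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Synthetic two-cluster data standing in for a local data source.
X = np.vstack([rng.normal(-2.0, 1.0, size=(200, 2)),
               rng.normal(+3.0, 0.5, size=(200, 2))])

prev = -np.inf
for n_iter in range(1, 6):
    gmm = GaussianMixture(n_components=2, max_iter=n_iter,
                          random_state=0).fit(X)
    ll = gmm.lower_bound_        # average log-likelihood of the observed data
    assert ll >= prev - 1e-9     # EM never decreases the observed-data likelihood
    prev = ll
    print(n_iter, round(ll, 4))
```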

Group 9:

Member Name: Satish Bhat, Holly Vo

1. What is the time complexity of the distributed data mining services? With a convergence time of tens to hundreds of seconds, how can a DDM service be used in a real-time application? In what fields can this technique be applied with tolerance for its current performance?
2. How should accuracy be weighed against privacy? For different applications, one can be more important than the other. How can a flexible policy on the weighting between accuracy and privacy be built, fed, and embedded in a DDM service for a self-balancing process?
3. Privacy policies implemented by the local data owner control the granularity and privacy of a local abstraction. How does negotiation in an autonomous DDM behave if a local abstraction has no privacy policies? Will a default policy be used? Will a policy be extracted from the data sample?

Group 10:

Member Name: Sunae Shin, Hyungbae Park

1. The description of the lowest granularity level is abstract. They mention that less information is exposed at the lowest granularity level than at the highest; it would be easier to understand if they gave the precise amount of information that can be kept private.
2. They recognize that repeating the computation of the local data abstraction for each level takes a lot of time. However, they do not consider the aggregation time of the local abstractions, which may also be significant when there is a huge number of local sources.
3. It would be better if they provided a proof that the latent variables' values follow a Gaussian distribution, rather than assuming it from a list of examples that follow a Gaussian distribution.
4. They should consider the privacy problem during the aggregation of the local abstractions, since the privacy mechanism could be broken during that process. In addition, they suggest controlling privacy during the aggregation process with less control at the local sources, which might be more efficient in cost or time for the system.
