Data sets should have domains

Post by heatzync » Thu Jun 24, 2010 2:36 pm

Neither the DataSetBuilder class, the DataSet class nor the DataTable interface knows about domains. The Problem hierarchy, however, does.

I have a situation where I want to do mathematical calculations between the Vector of an Entity and the Vector of a Pattern in a data set. Since the data sets do not know about domains, the Vectors of the Patterns are created unbounded (default bounds). The problem arises when one wants, for example, to calculate the mean of a data set and use it as the Vector of an Entity: the Entity ends up unbounded. The bounds of a Real are also final, so they cannot be corrected afterwards.

Would it make sense to implement a DataTable that is aware of a domain? The elements that are read from a file could then automatically be validated against the specified type and domain bounds, and the appropriate Numeric object could be added to the row Vector.
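A minimal sketch of the idea, assuming one (lower, upper) pair per column and comma-separated input. BoundedValue and DomainAwareRowParser are hypothetical names made up for illustration; they are not the library's Real, Numeric or DataTable types.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical bounded numeric type for illustration only; this is NOT the library's Real.
final class BoundedValue {
    final double value;
    final double lower;
    final double upper;

    BoundedValue(double value, double lower, double upper) {
        // The negated form also rejects NaN, which fails every comparison.
        if (!(value >= lower && value <= upper)) {
            throw new IllegalArgumentException(
                value + " is outside [" + lower + ", " + upper + "]");
        }
        this.value = value;
        this.lower = lower;
        this.upper = upper;
    }
}

// Hypothetical parser that knows one (lower, upper) pair per column.
final class DomainAwareRowParser {
    private final double[] lower;
    private final double[] upper;

    DomainAwareRowParser(double[] lower, double[] upper) {
        if (lower.length != upper.length) {
            throw new IllegalArgumentException("bounds arrays differ in length");
        }
        this.lower = lower.clone();
        this.upper = upper.clone();
    }

    // Parses one comma-separated line and validates every element as it is read.
    List<BoundedValue> parseRow(String line) {
        String[] tokens = line.trim().split("\\s*,\\s*");
        if (tokens.length != lower.length) {
            throw new IllegalArgumentException(
                "expected " + lower.length + " columns, got " + tokens.length);
        }
        List<BoundedValue> row = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            // Double.parseDouble throws NumberFormatException on malformed input.
            double v = Double.parseDouble(tokens[i]);
            row.add(new BoundedValue(v, lower[i], upper[i]));
        }
        return row;
    }
}
```

With bounds [0, 10] and [-1, 1], parsing "2.5, 0.5" would yield two bounded values, while "11.0, 0.0" would be rejected when the row is read rather than later during a calculation.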

We should also keep in mind that the domain of a data set can be different from the domain of the corresponding Problem, as in the case of Neural Networks.
heatzync
 
Posts: 7
Joined: Mon Aug 10, 2009 2:10 pm

Re: Data sets should have domains

Post by gpampara » Fri Jun 25, 2010 9:41 am

The current definition of problems is not correct in my opinion.

To correct this issue, I think the best approach would be to remove the "dataset" functionality from Problem and create a new "DataBasedProblem" that decorates the current Problem with data functionality. That way, you could add the domain knowledge to the DataSets without much effort, and all problems could still potentially have datasets, although it would not be enforced.
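Roughly, the decorator could look like the sketch below. Everything here is a simplified assumption: the Problem interface is reduced to two methods, DomainDataSet and SphereProblem are invented for the example, and the domain strings are placeholders rather than the library's domain syntax.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical, heavily simplified Problem; the real hierarchy is richer.
interface Problem {
    double fitness(double[] candidate);
    String domain(); // search-space domain, as a placeholder string
}

// Example problem that only knows about the search space: sum of squares.
final class SphereProblem implements Problem {
    private final String domain;

    SphereProblem(String domain) { this.domain = domain; }

    @Override public double fitness(double[] candidate) {
        double sum = 0.0;
        for (double x : candidate) {
            sum += x * x;
        }
        return sum;
    }

    @Override public String domain() { return domain; }
}

// Hypothetical data set that carries its own domain, independent of the problem's.
final class DomainDataSet {
    private final String domain;
    private final List<double[]> patterns;

    DomainDataSet(String domain, List<double[]> patterns) {
        this.domain = domain;
        this.patterns = Collections.unmodifiableList(new ArrayList<>(patterns));
    }

    String domain() { return domain; }
    int size() { return patterns.size(); }
    List<double[]> patterns() { return patterns; }
}

// Decorator: adds data access to any Problem without changing that Problem.
final class DataBasedProblem implements Problem {
    private final Problem delegate;
    private final DomainDataSet dataSet;

    DataBasedProblem(Problem delegate, DomainDataSet dataSet) {
        this.delegate = delegate;
        this.dataSet = dataSet;
    }

    // Search-space concerns are forwarded to the wrapped problem.
    @Override public double fitness(double[] candidate) { return delegate.fitness(candidate); }
    @Override public String domain() { return delegate.domain(); }

    // Data concerns live only in the decorator.
    DomainDataSet dataSet() { return dataSet; }
}
```

In this form the wrapped problem keeps its own search-space domain, the data set keeps its own data domain, and problems that need no data are simply never wrapped.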

The way the data sets were added is not correct. I'd like your opinion on this: the refactor will be large, yet I believe it's worth the effort, as the current implementation is very naive.
gpampara
Site Admin
 
Posts: 114
Joined: Fri Aug 07, 2009 2:44 pm

Re: Data sets should have domains

Post by heatzync » Fri Jun 25, 2010 9:52 am

I agree with the idea of a DataBasedProblem, which will remove the "forced" data set from all OptimisationProblems, since not all OptimisationProblems require data sets. I assume the DataBasedProblem will be an interface.

Just to confirm: The DataBasedProblem will still have its own domain and the DataSet / DataTable (contained in the DataBasedProblem) will also have its own domain?

Re: Data sets should have domains

Post by gpampara » Fri Jun 25, 2010 9:57 am

That would be possible now, without much effort. The DataBasedProblem is focused on the "data", whereas the OptimisationProblem that it wraps should be focused on the search space.

My only concern is whether that is valid. I mean, are we not overloading the meaning of Problem now?

Re: Data sets should have domains

Post by heatzync » Fri Jun 25, 2010 10:22 am

No matter how we look at it, we will need a Problem that references a data set. In my case, I would probably have the following classes and interfaces:

Code:
public interface ClusteringProblem extends OptimisationProblem, DataBasedProblem

after the data set stuff has been removed from OptimisationProblem, and

Code:
public class PartitionalClusteringProblem implements ClusteringProblem


We could also define the data sets separately (also in the XML configuration) and everything that needs a data set (Algorithm, Problem, Measurement, whatever) gets it via Dependency Injection.
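As a rough illustration of the injection idea, the sketch below builds one data set and hands the same instance to a problem and to a measurement through their constructors. The interface and class names follow the declarations above, but their bodies, SharedDataSet and PatternCountMeasurement are assumptions made up for this example, and a real container or XML configuration would do the wiring instead of hand-written code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical data set, defined once and shared by whoever needs it.
final class SharedDataSet {
    private final List<double[]> patterns;

    SharedDataSet(List<double[]> patterns) {
        if (patterns.isEmpty()) {
            throw new IllegalArgumentException("data set must not be empty");
        }
        this.patterns = Collections.unmodifiableList(new ArrayList<>(patterns));
    }

    List<double[]> patterns() { return patterns; }
    int dimension() { return patterns.get(0).length; }
}

// Simplified stand-ins for the interfaces discussed in this thread.
interface OptimisationProblem { double fitness(double[] candidate); }
interface DataBasedProblem { SharedDataSet dataSet(); }
interface ClusteringProblem extends OptimisationProblem, DataBasedProblem { }

// The data set arrives through the constructor; the problem never builds it.
final class PartitionalClusteringProblem implements ClusteringProblem {
    private final SharedDataSet dataSet;

    PartitionalClusteringProblem(SharedDataSet dataSet) { this.dataSet = dataSet; }

    @Override public SharedDataSet dataSet() { return dataSet; }

    // Candidate = concatenated centroids; fitness = sum of squared
    // distances from each pattern to its nearest centroid.
    @Override public double fitness(double[] candidate) {
        int dim = dataSet.dimension();
        int k = candidate.length / dim;
        double total = 0.0;
        for (double[] pattern : dataSet.patterns()) {
            double best = Double.POSITIVE_INFINITY;
            for (int c = 0; c < k; c++) {
                double dist = 0.0;
                for (int j = 0; j < dim; j++) {
                    double diff = pattern[j] - candidate[c * dim + j];
                    dist += diff * diff;
                }
                best = Math.min(best, dist);
            }
            total += best;
        }
        return total;
    }
}

// Hypothetical measurement that receives the same data set by injection.
final class PatternCountMeasurement {
    private final SharedDataSet dataSet;

    PatternCountMeasurement(SharedDataSet dataSet) { this.dataSet = dataSet; }

    SharedDataSet dataSet() { return dataSet; }
    int value() { return dataSet.patterns().size(); }
}
```

Because neither class creates its own data, a test can pass in a tiny hand-built SharedDataSet, and the problem and measurement are guaranteed to see the same patterns.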

Re: Data sets should have domains

Post by gpampara » Mon Jun 28, 2010 9:22 am

Agreed on the data requirements.

The current XML configurations are actually limiting what we can do. I'm toying with an experimental XML representation that will allow such configuration, but being XML, the process is tedious and difficult to manage. The root issue is that data sets were added to the library in a rather silly place, which resulted in more dependencies and in problems wiring up the simulations.

My current idea is that the data-related services need to be extracted so that they can be mocked and used in a variety of different ways. This will not only create more opportunities, but will also ensure that the flexibility we require is present.
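To make the mocking point concrete, here is a small sketch under assumed names: DataProvider is a hypothetical single-method interface, InMemoryDataProvider is a stand-in for a stub used in tests, and MeanCalculator is an invented consumer. None of these are existing library classes.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical extracted data service: consumers depend only on this interface.
interface DataProvider {
    List<double[]> rows();
}

// In-memory stub; a file-backed provider would implement the same interface.
final class InMemoryDataProvider implements DataProvider {
    private final List<double[]> rows;

    InMemoryDataProvider(double[]... rows) {
        this.rows = Arrays.asList(rows);
    }

    @Override public List<double[]> rows() { return rows; }
}

// Invented consumer: computes the per-column mean of whatever the provider returns.
final class MeanCalculator {
    private final DataProvider provider;

    MeanCalculator(DataProvider provider) { this.provider = provider; }

    double[] mean() {
        List<double[]> rows = provider.rows();
        if (rows.isEmpty()) {
            throw new IllegalStateException("provider returned no rows");
        }
        double[] result = new double[rows.get(0).length];
        for (double[] row : rows) {
            for (int j = 0; j < result.length; j++) {
                result[j] += row[j];
            }
        }
        for (int j = 0; j < result.length; j++) {
            result[j] /= rows.size();
        }
        return result;
    }
}
```

Since DataProvider has a single method, a test can also supply a lambda in place of a full class, so the consumer can be exercised without touching the file system.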

I'm considering something like the following (this is pseudo-xml):
Code:
<?xml...>
<data id="data" ....>
  <data source />
</data>

<simulation>
  <algorithm ....>
  </algorithm>

  <problem class="databasedproblem" ....>
    <dataProvider idref="data"/>
    <problem class="...." ...>
    </problem>
  </problem>

  <measurements....>
  </measurements>

  <output format="TXT" file="..." />

</simulation>

Re: Data sets should have domains

Post by heatzync » Mon Jun 28, 2010 9:26 am

I second that!

