Datamining type

4 11 2008

In Talend Open Profiler, when you create a column analysis, you can see a combo box near each column in the editor which represents the data mining type of the column. What is it? And what is it useful for?

The available data mining types are

  1. nominal
  2. interval
  3. unstructured text
  4. other

Not all indicators (or metrics) can be computed on all kind of data. These data mining types helps Talend Open Profiler to choose the appropriate metrics for the column.

Nominal (also called “categorical” sometimes) means that the data can serve as label. For example, the type of a column called “WEATHER” with values: “sun”, “cloud”, “rain” would be nominal. In Talend Open Profiler, textual data are set to nominal data mining type.

But it happens that data such as “52200”, “75014” are nominal data too although they are represented by numbers. In fact, a column called “POSTAL_CODE” could have these values. It is clear for the user that these data are of nominal type because they identify a postal code in France. Computing mathematical quantities such as the average on these data is a non sense. In that case, the user should set the data mining type of this column to “nominal”, because there is currently no way to automatically guess the correct type in Talend Open Profiler in such a case.
The same is true for primary or foreign key data. Keys are most of the time numerical data, but their data mining type is “nominal”.

“Interval” data mining type is used for numerical data and time data. Difference between two values, averages can be computed on this kind of data. In databases, sometimes numerical quantities are stored in textual fields. With Talend Open Profiler, it’s possible to declare a textual column (e.g. a column of type VARCHAR) as an interval. In that case, the data should be treated as numerical data and summary statistics should be available. Currently, it’s not yet implemented because there is not yet an interface which allows the user to specify the format of the data. But this feature is planned for a future release.

The other two data mining types are not usual data mining types. In data mining we find sometimes the types “ordinal” and “ratio”.

The reason is that for the indicators currently available in Talend Open Profiler, these two types are not needed. Instead we added two other types to handle textual data. For example, a column “COMMENT” which contains text is not a nominal data, but still we could be interested in seeing the duplicate values of this column. Or we could implement metrics specific to text mining (but this is not for the current release…).

And finally, we also have the type “other” which design a data which Talend Open Profiler does not know how to handle yet.




Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: