Datamining type

4 11 2008

In Talend Open Profiler, when you create a column analysis, you can see a combo box near each column in the editor which represents the data mining type of the column. What is it? And what is it useful for?

The available data mining types are

  1. nominal
  2. interval
  3. unstructured text
  4. other

Not all indicators (or metrics) can be computed on all kind of data. These data mining types helps Talend Open Profiler to choose the appropriate metrics for the column.

Nominal (also called “categorical” sometimes) means that the data can serve as label. For example, the type of a column called “WEATHER” with values: “sun”, “cloud”, “rain” would be nominal. In Talend Open Profiler, textual data are set to nominal data mining type.

But it happens that data such as “52200”, “75014” are nominal data too although they are represented by numbers. In fact, a column called “POSTAL_CODE” could have these values. It is clear for the user that these data are of nominal type because they identify a postal code in France. Computing mathematical quantities such as the average on these data is a non sense. In that case, the user should set the data mining type of this column to “nominal”, because there is currently no way to automatically guess the correct type in Talend Open Profiler in such a case.
The same is true for primary or foreign key data. Keys are most of the time numerical data, but their data mining type is “nominal”.

“Interval” data mining type is used for numerical data and time data. Difference between two values, averages can be computed on this kind of data. In databases, sometimes numerical quantities are stored in textual fields. With Talend Open Profiler, it’s possible to declare a textual column (e.g. a column of type VARCHAR) as an interval. In that case, the data should be treated as numerical data and summary statistics should be available. Currently, it’s not yet implemented because there is not yet an interface which allows the user to specify the format of the data. But this feature is planned for a future release.

The other two data mining types are not usual data mining types. In data mining we find sometimes the types “ordinal” and “ratio”.

The reason is that for the indicators currently available in Talend Open Profiler, these two types are not needed. Instead we added two other types to handle textual data. For example, a column “COMMENT” which contains text is not a nominal data, but still we could be interested in seeing the duplicate values of this column. Or we could implement metrics specific to text mining (but this is not for the current release…).

And finally, we also have the type “other” which design a data which Talend Open Profiler does not know how to handle yet.

Advertisements




How to compute a median in SQL

28 10 2008

In Talend Open Profiler, we generate SQL queries to get statistical informations. Among the currently available indicators, the median is one of the most difficult to compute. Nevertheless this indicator is worth computing because it is more stable than the mean indicator (average). By stable, I mean that it is less influenced by extremal values. This is not the case with the average which can vary a lot when extremal values exist.

I found several ways to compute the median depending on the database type. The most simple is for example with Oracle 10g which provides a MEDIAN function, so that your query writes
SELECT MEDIAN(salary) FROM employee

But for other databases, things begin to be more tricky. Let’s take MySQL first. One way to compute the median is the following:
SELECT AVG(salary) FROM (
SELECT salary FROM employee
WHERE salary IS NOT NULL
ORDER by salary ASC
LIMIT n,p) T

where p=1 and n=N/2-1 when the number of non null rows N is even, or p=2 and n=(N-1)/2 when N is odd.

For Postgresql, the query is similar to the MySQL query and uses LIMIT too.
SELECT AVG(salary) FROM (
SELECT salary FROM employee
WHERE salary IS NOT NULL
ORDER by salary ASC
LIMIT p OFFSET n) T

This query can also be used on MySQL but not on old versions of MySQL (before 5.0).
For Oracle 9i, the MEDIAN function does not exists and we must use the PERCENTILE_CONT function:
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary)
FROM employee

For DB2, the following query is used in Talend Open Profiler:
SELECT AVG(salary) FROM (
SELECT salary, COUNT(*) OVER( ) total, CAST(COUNT(*) OVER( ) AS DECIMAL)/2 mid, CEIL(CAST(COUNT(*) OVER( ) AS DECIMAL)/2) next, ROW_NUMBER() OVER ( ORDER BY salary) rn FROM employee
) x
WHERE ( MOD(total,2) = 0 AND rn IN ( mid, mid+1 ) )
OR
( MOD(total,2) = 1 AND rn = next )

For Microsoft SQL Server, we used the TOP clause as follows
SELECT AVG(CAST(salary AS NUMERIC)) FROM (
SELECT TOP n salary FROM (
SELECT TOP m salary FROM employee
WHERE salary IS NOT NULL ORDER BY salary ASC
) AS FOO
ORDER BY salary DESC
) AS BAR

where n is given as in the MySQL case and m=n+p (p being given above for the MySQL case).

Up to now, the only way I found for computing the median on Sybase ASE is the following:
SELECT AVG(CAST (salary AS NUMERIC)) FROM (
SELECT DISTINCT salary FROM (
SELECT salary FROM employee
UNION ALL
SELECT salary FROM employee
) STT
WHERE
(SELECT COUNT(salary) FROM employee) <= (SELECT COUNT(salary) FROM (
SELECT salary FROM employee
UNION ALL
SELECT salary FROM employee
) AS SOU
WHERE SOU.salary <= STT.salary)
AND
(SELECT COUNT(salary) FROM employee) <= (SELECT COUNT(salary) FROM (
SELECT salary FROM employee
UNION ALL
SELECT salary FROM employee
) AS SUR
WHERE SUR.salary >= STT.salary) ) T

This query makes heavy use of correlated subqueries and I hope to find a more efficient way to compute a median on this database.

Median can be computed by other approaches. Temporary tables could be used or cursors. But Talend Open Profiler must only use SELECT statements because a data profiler could not have the permissions to create a table on a database and the use of cursors is too complex for this tool.





krunner history

18 10 2008

krunner is the small linux application launcher that pops-up when you hit Alt+F2. Its behavior in kde4 changed from the behavior of minicli of kde-3.5. In order to get the previous behavior back, right-click on the text field of krunner and change the text completion mode to “short automatic”.

I never knew this menu before today. That’s why I give this tip here.





Want to create a startup?

16 10 2008

Before you start to create your startup, listen to (or read) Bill Inmon’s advice.

Some things that you must keep in mind are:

  • be pragmatic: “build what you can sell”, “95% of the startup resources will go to marketing and sales.”
  • protect your work “intellectual property is the backbone of the startup”
  • be ready for competition “competition is lurking behind every shadow and every bush”
  • business is business, don’t keep unproductive people “Get rid of unproductive people quickly”
  • Remember that a lot of money is needed to start because there is a “long way to go before there is any income”
  • But avoid venture capitalists at all costs
  • “be flexible” and adapt your business plan

I have not created a startup, but as I work in a startup, I can see how these advice are more or less followed by my employers. And these advice apply to open source editors too as soon as they need to make money.





Data quality blogs and resources

7 10 2008

The best data quality blogs are now listed in one place: Data Quality Pro Blog Finder

This post explains how the blogs are evaluated and shows that independent bloggers have more interactions with the community than vendor bloggers.

Data Quality Tools vendors are listed here.

Latest news are aggregated here and Data quality events are here.





What’s in your databases?

2 10 2008

Often you only know approximately what’s in your databases. Data profiling tools can help you to get a better idea of your database content. The goal of a data profiler is not to analyze your data in depth but to give you at a glance the main features of your data. Especially, data profilers can give you information about missing data, duplicates, badly formatted data, invalid data (out of range, incorrect business pattern…)

Talend Open Profiler (TOP) can help you to explore your data. The latest version is the 1.1.0. Its official documentation is available here. A lot of other informations can be found on the Data Quality Pro website which also made a 21 page tutorial for addressing your data quality issues with Talend Open Profiler and their free DQ Pattern analyser.

This tutorial shows you how to use TOP to explore your data and gives a lot of tips about how to interpret your profiling results. And this is really important, because you can profile easily your data and produce nice graphics with TOP, but if you don’t know what to do once you obtained the results, then profiling your data did not really help you to enhance your data quality. The tutorial also presents a very useful function called “DQ Pattern analyser” that lists the patterns existing in the data. It helps you to quickly see what’s wrong with your data and permits to identify rare occurences.
This function does not exists yet in TOP, but it will be implemented for the next version along with other new features.

By the way, if you are missing a feature, it’s time to tell Talend’s team which new feature would be great to be in TOP.





TOP announcements

11 09 2008

As I don’t want to announce here every version of Talend Open Profiler, you can find the announcements at the freshmeat page of TOP. The latest release is the 1.1.0RC1 release.