Tips&Tricks | My IT Weblog

How do I connect my ipod to linux?

6 05 2012

Here is what I use in my .alias file:

alias ipodmount=’ifuse ~/ipod’
alias ipodumount=’fusermount -u ~/ipod’
alias ipod=’ipodmount; gtkpod;ipodumount’

This launches gtkpod and connects automatically to my ipod.

Comments : Leave a Comment »

Categories : linux, Tips&Tricks

Using regular expressions when profiling SQL Server with Talend Open Profiler

28 08 2009

Email column analysis

By default, SQL Server does not provide integrated functions to do pattern matching with regular expressions. If you want to use Talend Open Profiler to analyze the matching of your data against some regular expressions (for example to analyze a column that contains emails, dates,…) you need to install a CLR User-Defined Function (UDF) on your SQL Server.

Amine who has done an internship at Talend has tried it. He provided us a step-by-step user guide about how to create the UDF, install it on your SQL Server and call it from Talend Open Profiler. This will allow you to use all predefined patterns to ensure the quality of your data.

You can also find new regular expressions for Talend Open Profiler in Talend Exchange.

Comments : 7 Comments »

Categories : data profiling, regex, Talend, Tips&Tricks

svn+ssh

11 06 2009

How to avoid that svn+ssh always asks for the password:

1. create a ssh key: ssh-keygen (without a passphrase or it will be asked each time you do an update)

2. connect to the remote server via ssh

3. edit or create a file “.ssh/authorized_keys” and put the content of your local “.ssh/id_rsa.pub” file (if the “.ssh” folder does not exist, execute “ssh-keygen” in order to generate it. Then remove the generated key files: “id_rsa” and “id_rsa.pub” which are not required)

Comments : 2 Comments »

Categories : linux, Tips&Tricks

How to detect random text in a free text field?

13 03 2009

As suggested by Pierrick 😉 , I start this post with a picture:

I give here a tip to detect random text by using regular expressions. But before I give this expression, let me explain when and what for it is useful.

Nowadays, we find webforms almost everywhere in the Internet. These forms are used to collect data entered by users. If the form is well designed, the required fields cannot be left empty by the user. Other controls can also be done, such as to check the address, the country, the phone number format and so on.
Sometimes these controls are not done, either because the field is a free text field or simply because the developper of the web site did not develop them. In this case, the user can type anything in the text area and then anything enter in the database.

If the user does not want to fill in a field and this field is required, what is going to happen?
Either the user goes away and does not fill the form, or he types something like “lmqkjfdmklgj”

if you are doing data quality, you may need to identify this kind of bad data. The question is: how to detect something that can be anything but has clearly no meaning?
There can exist several solutions to this question. The first one could be to check that the data is composed of real words. For this solution, you need a dictionary. And it should be rather complete in order not to miss some words. Then how do you handle proper names?

Another solution is to use regular expressions. But what would be the regular expression that matches random text?

When I looked at my keyboard (an azerty keyboard as in the picture above) I saw that all the vowels are on the second row. Moreover, the default starting position of the hands on the keyboard is to put the left index finger onto the “F” key and the right index finger onto the “J” key. These keys are on the third row of keys (called the home row). This means that when you want to type something randomly, there’s a great chance that you will type only letters from the home row. And on a French keyboard, this means that there will be no vowel in the entered text.

Given these considerations, a random text is a string of characters without any vowel. The regular expression to match it can be something like:
[zrtypqsdfghjklmwxcvbnZRTYPQSDFGHJKLMWXCVBN]{4,}

This expression matches any 4 consecutive consonants. Maybe, it’s not enough and some real words will be matched by this expression. For example, it matches the word “length“.
Either you can require at least 5 consecutive consonants or you can restrict the expression to the letter of the home row:
[qsdfghjklmQSDFGHJKLM]{3,}

Try it on your data with Talend Open Profiler. You can either create your own “pattern” or download it on Talend Exchange.

For the English keyboards, the “a” vowel appears in the home row. This adds some difficulties because there are probably several words that could be formed with “a” and the letters of the home row. I let you adapt the regular expression to your needs and keyboard…

Comments : 3 Comments »

Categories : data profiling, data quality, Tips&Tricks

Share your regular expressions on Talend Exchange

12 02 2009

A new service is offered on the Talend Forge website. It’s called Talend Exchange and is aimed at providing a place where you can share TOS components, TOS templates, TOS jobs and TOP patterns. It is an enhancement of the previously called “Ecosystem” service.

I have added a few new regular expressions for Talend Open Profiler which may be useful to French users. You may download them and import them in Talend Open Profiler-1.1.x via the “import patterns” menu available on the “Libraries/Patterns” folder.

You may add your own regular expressions too. For the Talend Open Profiler-1.1.x platform, you need to create a simple xls file with the following columns
* C1: Label: the label of the pattern
* C2: Purpose: the purpose of the pattern
* C3: Description: the description of the pattern
* C4: Regular expression: the regular expression applicable to all databases
* C5: MySQL Regexp: the regular expression applicable to MySQL databases
* C6: Oracle Regexp: the regular expression applicable to Oracle databases
* C7: Author: the author of the regular expression
Then compress it in the zip format and you’re ready to upload it on Talend Exchange.

For the current development version of Talend Open Profiler-1.2.0xx, the task is even easier because there is a menu to export your regular patterns and/or your SQL patterns in csv format. For this version, it is suggested to export your patterns individually. Then compress your csv file(s) and upload them on Talend Exchange to share them with other users. The “import” menu will be able to read these files (once unzipped).

We’ll work later on a more integrated mechanism to share your patterns in one click.

Comments : 3 Comments »

Categories : data profiling, regex, Talend, Tips&Tricks

Datamining type

4 11 2008

In Talend Open Profiler, when you create a column analysis, you can see a combo box near each column in the editor which represents the data mining type of the column. What is it? And what is it useful for?

The available data mining types are

nominal
interval
unstructured text
other

Not all indicators (or metrics) can be computed on all kind of data. These data mining types helps Talend Open Profiler to choose the appropriate metrics for the column.

Nominal (also called “categorical” sometimes) means that the data can serve as label. For example, the type of a column called “WEATHER” with values: “sun”, “cloud”, “rain” would be nominal. In Talend Open Profiler, textual data are set to nominal data mining type.

But it happens that data such as “52200”, “75014” are nominal data too although they are represented by numbers. In fact, a column called “POSTAL_CODE” could have these values. It is clear for the user that these data are of nominal type because they identify a postal code in France. Computing mathematical quantities such as the average on these data is a non sense. In that case, the user should set the data mining type of this column to “nominal”, because there is currently no way to automatically guess the correct type in Talend Open Profiler in such a case.
The same is true for primary or foreign key data. Keys are most of the time numerical data, but their data mining type is “nominal”.

“Interval” data mining type is used for numerical data and time data. Difference between two values, averages can be computed on this kind of data. In databases, sometimes numerical quantities are stored in textual fields. With Talend Open Profiler, it’s possible to declare a textual column (e.g. a column of type VARCHAR) as an interval. In that case, the data should be treated as numerical data and summary statistics should be available. Currently, it’s not yet implemented because there is not yet an interface which allows the user to specify the format of the data. But this feature is planned for a future release.

The other two data mining types are not usual data mining types. In data mining we find sometimes the types “ordinal” and “ratio”.

The reason is that for the indicators currently available in Talend Open Profiler, these two types are not needed. Instead we added two other types to handle textual data. For example, a column “COMMENT” which contains text is not a nominal data, but still we could be interested in seeing the duplicate values of this column. Or we could implement metrics specific to text mining (but this is not for the current release…).

And finally, we also have the type “other” which design a data which Talend Open Profiler does not know how to handle yet.

Comments : Leave a Comment »

Categories : data quality, Talend, Tips&Tricks

How to compute a median in SQL

28 10 2008

In Talend Open Profiler, we generate SQL queries to get statistical informations. Among the currently available indicators, the median is one of the most difficult to compute. Nevertheless this indicator is worth computing because it is more stable than the mean indicator (average). By stable, I mean that it is less influenced by extremal values. This is not the case with the average which can vary a lot when extremal values exist.

I found several ways to compute the median depending on the database type. The most simple is for example with Oracle 10g which provides a MEDIAN function, so that your query writes
SELECT MEDIAN(salary) FROM employee

But for other databases, things begin to be more tricky. Let’s take MySQL first. One way to compute the median is the following:
SELECT AVG(salary) FROM ( SELECT salary FROM employee WHERE salary IS NOT NULL ORDER by salary ASC LIMIT n,p) T
where p=1 and n=N/2-1 when the number of non null rows N is even, or p=2 and n=(N-1)/2 when N is odd.

For Postgresql, the query is similar to the MySQL query and uses LIMIT too.
SELECT AVG(salary) FROM ( SELECT salary FROM employee WHERE salary IS NOT NULL ORDER by salary ASC LIMIT p OFFSET n) T
This query can also be used on MySQL but not on old versions of MySQL (before 5.0).
For Oracle 9i, the MEDIAN function does not exists and we must use the PERCENTILE_CONT function:
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) FROM employee
For DB2, the following query is used in Talend Open Profiler:
SELECT AVG(salary) FROM ( SELECT salary, COUNT(*) OVER( ) total, CAST(COUNT(*) OVER( ) AS DECIMAL)/2 mid, CEIL(CAST(COUNT(*) OVER( ) AS DECIMAL)/2) next, ROW_NUMBER() OVER ( ORDER BY salary) rn FROM employee ) x WHERE ( MOD(total,2) = 0 AND rn IN ( mid, mid+1 ) ) OR ( MOD(total,2) = 1 AND rn = next )
For Microsoft SQL Server, we used the TOP clause as follows
SELECT AVG(CAST(salary AS NUMERIC)) FROM ( SELECT TOP n salary FROM ( SELECT TOP m salary FROM employee WHERE salary IS NOT NULL ORDER BY salary ASC ) AS FOO ORDER BY salary DESC ) AS BAR
where n is given as in the MySQL case and m=n+p (p being given above for the MySQL case).

Up to now, the only way I found for computing the median on Sybase ASE is the following:
SELECT AVG(CAST (salary AS NUMERIC)) FROM ( SELECT DISTINCT salary FROM ( SELECT salary FROM employee UNION ALL SELECT salary FROM employee ) STT WHERE (SELECT COUNT(salary) FROM employee) <= (SELECT COUNT(salary) FROM ( SELECT salary FROM employee UNION ALL SELECT salary FROM employee ) AS SOU WHERE SOU.salary <= STT.salary) AND (SELECT COUNT(salary) FROM employee) <= (SELECT COUNT(salary) FROM ( SELECT salary FROM employee UNION ALL SELECT salary FROM employee ) AS SUR WHERE SUR.salary >= STT.salary) ) T
This query makes heavy use of correlated subqueries and I hope to find a more efficient way to compute a median on this database.

Median can be computed by other approaches. Temporary tables could be used or cursors. But Talend Open Profiler must only use SELECT statements because a data profiler could not have the permissions to create a table on a database and the use of cursors is too complex for this tool.

Comments : 12 Comments »

Categories : data profiling, SQL, Tips&Tricks

krunner history

18 10 2008

krunner is the small linux application launcher that pops-up when you hit Alt+F2. Its behavior in kde4 changed from the behavior of minicli of kde-3.5. In order to get the previous behavior back, right-click on the text field of krunner and change the text completion mode to “short automatic”.

I never knew this menu before today. That’s why I give this tip here.

Comments : Leave a Comment »

Categories : Tips&Tricks

How to launch TOP with a specified JVM?

1 08 2008

If you need to specify the JVM path to be used by Talend Open Profiler. Simply edit the TalendOpenProfiler-XXX.ini corresponding to your system and add the following 2 lines at the beginning of the file:
-vm
C:/usr/bin/java.exe
Be sure to write them on 2 lines, not on one line, otherwise it will not work.

The same configuration setting applies to Talend Open Studio.

Source: Eclipse FAQ.

Comments : Leave a Comment »

Categories : Talend, Tips&Tricks

Java program as a Windows service

11 02 2008

A few days ago, I used JSL for transforming a small Java program as a Windows service. The usage is really simple, define a class with static methods for starting and stopping the service. An example is the class TelnetEcho.java given with the sources of the library.

Then add the “jsl.jar” into the classpath, adapt the “jsl.ini” file to your needs and run “jsl.exe -debug” for testing the correct configuration of the service. This launches the service. Type “Ctrl-C” for stopping properly the service.

If everything is OK in this debug test, then you simply have to run “jsl.exe -install” in order to install the service. The name of the service is set in the jsl.ini file.

The service is then removed (if needed) with “jsl.exe -remove”.

Thanks Michael for this library.

Comments : 1 Comment »

Categories : Java, Tips&Tricks

My IT Weblog

How do I connect my ipod to linux?

Using regular expressions when profiling SQL Server with Talend Open Profiler

svn+ssh

How to detect random text in a free text field?

Share your regular expressions on Talend Exchange

Datamining type

How to compute a median in SQL

krunner history

How to launch TOP with a specified JVM?

Java program as a Windows service

Blogroll

Links

Talend Community

Categories

Meta

Business Intelligence

Development

Data Quality

Talend News

Get Opensuse

TOP