Wednesday, 1 June 2011

Best Practices


The following lists some performance considerations when building mappings.    

1.  Filter Expressions - try to evaluate them in a port expression. Try to create the filter (true/false) answer inside a port expression upstream. Complex filter expressions slow down the mapping.  Again, expressions/conditions operate fastest in an Expression Object with an output port for the result.  Turns out - the longer the expression, or the more complex - the more severe the speed degradation. Place the actual expression (complex or not) in an EXPRESSION OBJECT upstream from the filter. Compute a single numerical flag: 1 for true, 0 for false as an output port. Pump this in to the filter - you should see the maximum performance ability with this configuration.

2.  Remove all "DEFAULT" value expressions where possible. Having a default value - even the "ERROR(xxx)" command slows down the session. It causes an unnecessary evaluation of values for every data element in the map. The only time you want to use "DEFAULT value is when you have to provide a default value for a specific port. There is another method: placing a variable with an IIF(xxxx, DEFAULT VALUE, xxxx) condition within an expression. This will always be faster (if assigned to an output port) than a default value.

3.  Variable Ports are "slower" than Output Expressions. Whenever possible, use output expressions instead of variable ports. The variables are good for "static - and state driven" but do slow down the processing time - as they are allocated/reallocated each pass of a row through the expression transform.
4.  Datatype conversion - perform it in a port expression. Simply mapping a string to an integer, or an integer to a string will perform the conversion, however it will be slower than creating an output port with an expression like: to_integer(xxxx) and mapping an integer to an integer. It's because PMServer is left to decide if the conversion can be done mid-stream which seems to slow things down.

5.  Unused Ports. Surprisingly, unused output ports have no affect on performance. This is a good thing. However in general it is good practice to remove any unused ports in the mapping, including variables. Unfortunately - there is no "quick" method for identifying unused ports.

6.  String Functions. String functions definitely have an impact on performance. Particularly those that change the length of a string (substring, ltrim, rtrim, etc.). These functions slow the map down considerably, the operations behind each string function are expensive (de-allocate, and re-allocate memory within a READER block in the session). String functions are a necessary and important part of ETL, we do not recommend removing their use completely, only try to limit them to necessary operations. One of the ways to tune these is to use "varchar/varchar2" data types in your database sources, or to use delimited strings in source flat files (as much as possible). This will help reduce the need for "trimming" input. If your sources are in a database, perform the LTRIM/RTRIM functions on the data coming in from a database SQL statement, this will be much faster than operationally performing it mid-stream.

7.  IIF Conditionals are costly. When possible - arrange the logic to minimize the use of IIF conditionals. This is not particular to Informatica.  It is costly in ANY programming language. It introduces "decisions" within the tool; it also introduces multiple code paths across the logic (thus increasing complexity). Therefore - when possible, avoid utilizing an IIF conditional - again, the only possibility here might be (for example) an ORACLE DECODE function applied to a SQL source.

8.  Sequence Generators slow down mappings. Unfortunately there is no "fast" and easy way to create sequence generators. The cost is not that high for using a sequence generator inside of Informatica, particularly if you are caching values (cache at around 2000) - seems to be the suite spot. However - if at all avoidable, this is one "card" up a sleeve that can be played. If you don't absolutely need the sequence number in the map for calculation reasons, and you are utilizing Oracle, then let SQL*Loader create the sequence generator for all Insert Rows. If you're using Sybase, don't specify the Identity column as a target - let the Sybase Server generate the column. Also - try to avoid "reusable" sequence generators - they tend to slow the session down further, even with cached values.

9.  Test Expressions slow down sessions. Expressions such as: IS_SPACES tend slow down the mappings, this is a data validation expression which has to run through the entire string to determine if it is spaces, much the same as IS_NUMBER has to validate an entire string. These expressions (if at all avoidable) should be removed in cases where it is not necessary to "test" prior to conversion. Be aware however, that direct conversion without testing (conversion of an invalid value) will kill the transformation. If you absolutely need a test expression for a numerical value, try this: IIF(<field> * 1 >= 0,<field>,NULL) preferably you don't care if it's zero. An alpha in this expression should return a NULL to the computation. Yes - the IIF condition is slightly faster than the IS_NUMBER - because IS_NUMBER parses the entire string, where the multiplication operator is the actual speed gain.

10.       Reduce Number of OBJETS in a map. Frequently, the idea of these tools is to make the "data translation map" as easy as possible. All to often, that means creating "an" expression for each throughput/translation (taking it to an extreme of course). Each object adds computational overhead to the session and timings may suffer. Sometimes if performance is an issue / goal, you can integrate several expressions in to one expression object, thus reducing the "object" overhead. In doing so - you could speed up the map.

11.       Update Expressions - Session set to Update Else Insert. If you have this switch turned on - it will definitely slow the session down - Informatica performs 2 operations for each row: update (w/PK), then if it returns a ZERO rows updated, performs an insert. The way to speed this up is to "know" ahead of time if you need to issue a DD_UPDATE or DD_INSERT inside the mapping, then tell the update strategy what to do. After which you can change the session setting to: INSERT and UPDATE AS UPDATE or UPDATE AS INSERT.

12.       Multiple Targets are too slow. Frequently mappings are generated with multiple targets, and sometimes multiple sources. This (despite first appearances) can really burn up time. If the architecture permits change, and the users support re-work, then try to change the architecture -> 1 map per target is the general rule of thumb. Once reaching one map per target, the tuning gets easier. Sometimes it helps to reduce it to 1 source and 1 target per map. But - if the architecture allows more modularization 1 map per target usually does the trick. Going further, you could break it up: 1 map per target per operation (such as insert vs update). In doing this, it will provide a few more cards to the deck with which you can "tune" the session, as well as the target table itself. Going this route also introduces parallel operations. For further info on this topic, see my architecture presentations on Staging Tables, and 3rd normal form architecture (Corporate Data Warehouse Slides).

13.       Slow Sources - Flat Files. If you've got slow sources, and these sources are flat files, you can look at some of the following possibilities. If the sources reside on a different machine, and you've opened a named pipe to get them across the network - then you've opened (potentially) a can of worms. You've introduced the network speed as a variable on the speed of the flat file source. Try to compress the source file, FTP PUT it on the local machine (local to PMServer), decompress it, and then utilize it as a source.  If you're reaching across the network to a relational table - and the session is pulling many rows (over 10,000) then the source system itself may be slow.  You may be better off using a source system extract program to dump it to file first, then follow the above instructions. However, there is something your SA's and Network Ops folks could do (if necessary) - this is covered in detail in the advanced section.  They could backbone the two servers together with a dedicated network line (no hubs, routers, or other items in between the two machines).  At the very least, they could put the two machines on the same sub-net. Now, if your file is local to PMServer but is still slow, examine the location of the file (which device is it on). If it's not on an INTERNAL DISK then it will be slower than if it were on an internal disk (C drive for you folks on NT).  This doesn't mean a unix file LINK exists locally, and the file is remote - it means the actual file is local.

14.       Too Many Aggregators. If your map has more than 1 aggregator, chances are the session will run very slowly - unless the CACHE directory is extremely fast, and your drive seek/access times are very high.  Even still, placing aggregators end-to-end in mappings will slow the session down by factors of at least 2.  This is because of all the I/O activity being a bottleneck in Informatica.  What needs to be known here is that Informatica's products: PM / PC up through 4.7x are NOT built for parallel processing.  In other words, the internal core doesn't put the aggregators on threads, nor does it put the I/O on threads - therefore being a single strung process it becomes easy for a part of the session/map to become a "blocked" process by I/O factors.

15.       Maplets containing Aggregators. Maplets are a good source for replicating data logic. But just because an aggregator is in a maplet doesn't mean it won't affect the mapping. The reason maplets that don't affect speed of the mappings is they are treated as a part of the mapping once the session starts.  In other words, if you have an aggregator in a maplet, followed by another aggregator in a mapping you will still have the problem mentioned above in #14. Reduce the number of aggregators in the entire mapping (including maplets) to 1 if possible. If necessary split the mapping up in to several different mappings, use intermediate tables in the database if required to achieve processing goals.

16.       Eliminate "too many lookups". What happens and why? Well - with too many lookups, your cache is eaten in memory - particularly on the 1.6 / 4.6 products. The end result is that there is no memory left for the sessions to run in. The DTM reader/writer/transformer threads are not left with enough memory to be able to run efficiently. PC 1.7, PM 4.7 solve some of these problems by caching some of these lookups out to disk when the cache is full. But you still end up with contention - in this case, with too many lookups, you're trading in Memory Contention for Disk Contention.  The memory contention may be worse than the disk contention because the system OS end's up thrashing (swapping in and out of TEMP/SWAP disk space) with small block sizes to try and locate your lookup row.   As the row goes from lookup to lookup, the swapping / thrashing gets worse.

17.       Lookups & Aggregators Fight. The lookups and the aggregators fight for memory space as discussed above. Each requires Index Cache, and Data Cache and they "share" the same HEAP segments inside the core.   Particularly in the 4.6 / 1.6 products and prior - these memory areas become critical, and when dealing with many rows - the session is almost certain to cause the server to "thrash" memory in and out of the OS Swap space. If possible, separate the maps - perform the lookups in the first section of the maps, position the data in an intermediate target table - then a second map reads the target table and performs the aggregation (also provides the option for a group by to be done within the database)... Another speed improvement...


Tips & Tricks And Other Observations

Copying Sessions

When coping sessions from one repository to another, do not use the right-click copy/paste commands. This will copy the session in the current folder. Instead, use the CTRL-C, CTRL-V or the menu copy/paste commands.


IIF and Datatypes


If the precision and scale of value1 in an IIF statement is different than value 2, the precision for value 2 is rounded.

For example, if you have a value for sales of 7.75, and you have the following statement, a value of 7.7 will be returned:

IIF (ISNULL (Sales), 0.0, Sales)

If you had wanted two decimal places returned for sales, you would need to make your IIF statement look like the following:

IIF (IS Null (Sales), 0.00, Sales)


Lookup SQL Performance Considerations


A lookup SQL query always places an order by on the SQL for every lookup port.   For example, if the Lookup transformation includes three lookup ports used in the mapping, ITEM_ID, ITEM_NAME, and PRICE, the lookup query is:

SELECT  ITEM_NAME, PRICE, ITEM_ID FROM ITEMS_DIM ORDER BY ITEM_ID, ITEM_NAME, PRICE.

Therefore, if your lookups are slow in loading, you may need to have an index that your ORDER BY statement can utilize. 

Unconnected Lookup Port Default


If you have an unconnected lookup, Informatica ignores what is in the default port.  It always returns a null if the lookup is unsuccessful.

Unexpected Results When A Null Value Is In A Calculation


If you have an expression transformation that adds two values together (e.g.  A+B), if one of the values happens to contain a null value, the resulting sum will be set to null.  

One solution to this problem is to make sure you always set your default value for your port to 0 for numeric values.


Sessions and Using Debugger
 When using the debugger , always create a new session specifically for the debugger instead of using the original session created for the mapping.  The debugger can do things such as change your session to be a test load instead of a regular mode.  

Invalid Batch Ids
Let your administrator know when you are moving mappings/sessions from one repository to another – especially when you are moving to production.   If you are going to be running the sessions utilizing the pmcmd command, the server will need to be restarted.   If you don’t restart the server, you will receive ‘invalid batch id’ when you try to run something using pmcmd.   I have talked to Informatica about this and they state that their future architecture will eliminate this problem.

Resetting Designer or Server Manager Windows
 If you ever have your windows (e.g. navigator window, toolbars) undocked in designer or server manager, you can reset them back the standard settings by going to the \winnt directory and deleting either your pmdesign.ini file (to reset designer) or you pmsrvmgr.ini file (to reset your server manager).

Using Sequence Generators Across Multiple Environments
 If you ever struggle with managing your sequence generated ids between dev, fit, and prod, you can modify your mappings to make it seamless to move from one environment to another.   The following lists the steps you will need to include in your mappings:

1) Add a disconnected lookup that retrieves the max key from your target table.   Note:  This process will only work if you only have one session inserting into a target table at a time.
2) Create a mapping variable (e.g. $max_cust_key) and assign the max key from the disconnected lookup to the mapping variable.
3) Add a sequence generator to your mapping and specify that the value should be reset to 1 every time.
4) Pass the next_val from the sequence generator to an expression transformation.  Call this port something like ‘next_id’.    Add a new port called ‘next_key’ that uses the following formula:   $max_cust_key + ‘next_key’.   
5) Use ‘next_key’ as the key when you insert your records.