Blog Archive

Saturday, October 8, 2011

Why Do We Need an ETL Tool?

Think of any big organization: a company with over 100 years of history and a presence in almost all industries. Over these years the company's management style has changed from bookkeeping to SAP. This transition did not happen in a single day. In moving from bookkeeping to SAP, the company used a wide array of technologies: hardware ranging from mainframes to PCs, data storage ranging from flat files to relational databases, and programming languages ranging from COBOL to Java. The result is different businesses, or to be precise different sub-businesses within a business, running different applications on different hardware and different architectures, with technologies introduced as and when they were invented and required.
This leads directly to a scenario where, for example, the HR department of the company runs on Oracle Applications, Finance runs SAP, part of the process chain is supported by mainframes, some data is stored in Oracle, some on mainframes, some in VSAM files, and the list goes on. If one day the company requires a consolidated report of its assets, there are two ways to produce it:
•       First, the completely manual way: generate different reports from the different systems and integrate them by hand.
•       Second, fetch all the data from the different systems/applications, build a Data Warehouse, and generate reports from it as required.
Obviously the second approach is the better one.
Now, fetching the data from the different systems, making it consistent, and loading it into a Data Warehouse requires some kind of extraction, cleansing, integration, and loading. ETL stands for Extraction, Transformation & Load.

ETL tools provide the facility to extract data from different, non-coherent systems, cleanse it, merge it, and load it into target systems.
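As a sketch of the idea, the extract-transform-load flow can be illustrated in a few lines of Python. The source records, field names, and cleansing rule here are entirely hypothetical, not taken from any specific system:

```python
# Minimal ETL sketch: pull asset records from two hypothetical source
# systems, cleanse and merge them, and load them into one target.

# Extract: the same kind of data, shaped differently by each system.
hr_system = [{"emp": "101", "asset_value": "1200.50"},
             {"emp": "102", "asset_value": None}]
finance_system = [{"employee_id": 101, "asset": 300.0},
                  {"employee_id": 103, "asset": 75.25}]

def transform(records, id_field, value_field):
    """Cleanse: normalize field names/types; default missing values to 0."""
    out = {}
    for rec in records:
        emp_id = int(rec[id_field])
        value = float(rec[value_field] or 0)
        out[emp_id] = out.get(emp_id, 0.0) + value
    return out

# Transform and merge the two cleansed feeds.
warehouse = transform(hr_system, "emp", "asset_value")
for emp_id, value in transform(finance_system, "employee_id", "asset").items():
    warehouse[emp_id] = warehouse.get(emp_id, 0.0) + value

# Load: here the "target" is just a dict; in practice it is a warehouse table.
print(warehouse)  # {101: 1500.5, 102: 0.0, 103: 75.25}
```

A real ETL tool does the same three steps, but with connectors, transformation logic, and load strategies configured rather than hand-coded.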

Thursday, September 29, 2011

Informatica Best Practices (Session Recovery)

There are various approaches to implementing Session Recovery:
Ø Using the PowerCenter built-in Session Recovery feature
Ø Using a Session ID
Ø Setting a high commit interval in the sessions
Ø Using an Update Strategy
Ø Using a combination of Session ID and a temp table


Informatica Best Practices (Migration process from development to production)

It is always best to have dedicated environments for development, test, and production. If that is not possible, there can be one shared environment for development and test and another dedicated environment for production. Again, from a security standpoint, any user other than the project administrator should have only read access to the production repository. This stops developers from accidentally logging onto the wrong repository and making changes without realizing that they are not in the development repository.
Migration from development to production can be done in two primary ways: copying entire folders, or copying individual mappings into production. The advantages and disadvantages of each approach are described below.
Copy an Entire Folder

Copying an entire folder gives the ability to quickly promote all of the objects in the folder from test to production. All of the source tables, target tables, re-usable transformations, mappings, and sessions are promoted at once. What this means is that everything must be ready to be moved to the next stage. If certain mappings are not ready, then after the folder is copied the users must manually delete them from the new folder. It is possible to copy an entire folder with all of the versions associated with it or to copy a specific version of the folder. The steps for copying a folder are:
Ø If using shortcuts, follow these steps:
o In each of the dedicated repositories, create a common folder with the EXACT same name (including case)
o Copy the shortcut objects into the common folder in production, making sure that each shortcut keeps the exact same name
Ø Open either the Repository Manager or Designer connecting to the repository
Ø Drag and drop the folder onto the repository icon within the Navigator tree structure
o Copying the entire folder: drag and drop the folder icon just under the repository level
o Copying a specific version: drag and drop the version # icon under the folder level
Ø Follow the Copy Folder Wizard steps
o If a folder with that name already exists, the folder must be renamed
Ø Modify the pre/post session commands as necessary
o It is recommended to set up the pre/post-session commands to work in both environments; this means the paths (log/bad files, scripts, etc.) are the same in both environments.
o In the Server Manager, in the Session Wizard's General tab, press the pre/post session button and make the required changes
Ø Implement appropriate production security
o In development, have the owner of the folders be a user in the development group
o In production, change the owner of the folders to a user in the production/operations group
o Revoke all rights from Public other than Read, and all privileges other than Browse Repository
o Now if a developer accidentally logs into production, they cannot make any changes
Copy Individual Mappings

Copying individual mappings into the next stage has the same advantages and disadvantages in the dedicated environment as it does in the shared environment. It gives the user the ability to promote into production only the desired mappings; the mapping is the lowest level of granularity.
If the folder the mapping is copied into does not contain the source/target tables or the reusable transformations, they will be brought along as well.
There are two disadvantages to copying individual mappings into production. First, only one mapping can be copied at a time; if a large number of mappings need to be promoted, this becomes time consuming. Second, the sessions must be re-created from scratch, which is a tedious process, especially if pre/post-session scripts are used.
Ø If using shortcuts, follow these steps:
o In each of the dedicated repositories, create a common folder with the EXACT same name (including case)
o Copy the shortcut objects into the common folder in production, making sure that each shortcut keeps the exact same name
Ø Copy the mapping from development into production
o In the Designer, connect to both the development and production repositories and open the appropriate folders in each
o Drag and drop the mapping from development into production
Ø Create a session in the Server Manager to run the mapping
o Enter all the appropriate information in the Session Wizard
Ø Implement appropriate production security
o In development, have the owner of the folders be a user in the development group
o In production, change the owner of the folders to a user in the production/operations group
o Revoke all rights from Public other than Read for the production folders
o Now if a developer accidentally logs into production, they cannot make any changes.

It is recommended to promote entire folders into a production environment that already exists. Promoting individual mappings gives control at a lower level of granularity, but the sessions associated with the mappings must be manually re-created.

Informatica Best Practices (Workflow Development standards)

Always run workflows in Suspended mode (check the SUSPEND ON ERROR option in the workflow-level properties).

Workflow Objects
Naming Convention

Workflow - WF_Name
Worklets - WLT_Name
Assignment - ASGN_Name
Command - CMD_Name
Control - CTL_Name
Decision - DCN_Name
Email - EML_Name
Event Raise - EVTR_Name
Event Wait - EVTW_Name
Session - s_Name

Informatica Best Practices (using different transformations)

Aggregator transformation
The best practices of using Aggregator transformation are –
Ø Always sort the input before passing it to the Aggregator; one way of sorting the data is to apply an ORDER BY clause in the Source Qualifier on the GROUP BY columns of the Aggregator.
Ø In the Aggregator transformation, the GROUP BY ports should be placed in the same order as required for the aggregation.
Ø When summarizing an Amount field, the input port should always be defaulted to zero (set the default value to zero); a NULL value added to other values results in NULL.
Ø If the Amount field has a precision of 2 (two decimal places), it should be defaulted to 0.00; defaulting it to 0 will result in an integer value, truncating the decimal places.
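Both the sorting and the NULL-defaulting advice can be illustrated outside Informatica with plain SQL. In Informatica, a NULL flowing into an addition yields NULL; in plain SQL the analogous symptom is a group whose amounts are all NULL, whose SUM comes back NULL. The sketch below uses Python's sqlite3 with a made-up ORDERS table (table and column names are hypothetical): the ORDER BY on the GROUP BY columns plays the role of sorting the input in the Source Qualifier, and COALESCE(amount, 0.00) plays the role of defaulting the port to zero.

```python
import sqlite3

# Hypothetical source table; NORTH has only NULL amounts.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (region TEXT, amount NUMERIC(10,2))")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [("EAST", 10.50), ("EAST", None),
                 ("NORTH", None), ("WEST", 20.00)])

# Without defaulting, the all-NULL group aggregates to NULL (None).
with_null = con.execute("""
    SELECT region, SUM(amount)
    FROM orders GROUP BY region ORDER BY region
""").fetchall()

# Defaulting the value to 0.00 keeps NULL out of the aggregate,
# and ORDER BY on the GROUP BY column delivers sorted input.
defaulted = con.execute("""
    SELECT region, SUM(COALESCE(amount, 0.00))
    FROM orders GROUP BY region ORDER BY region
""").fetchall()

print(with_null)   # [('EAST', 10.5), ('NORTH', None), ('WEST', 20.0)]
print(defaulted)   # [('EAST', 10.5), ('NORTH', 0.0), ('WEST', 20.0)]
```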

Expression transformation

Ø To write an expression in an output/variable port, select functions and ports by pointing and clicking instead of typing the names of ports, functions, or transformations, to avoid typos.
Ø For all output ports, delete the default value "ERROR ('transformation error')".

Lookup transformation

Ø When calling an unconnected Lookup transformation in an expression, point and click the lookup in the function pane instead of typing the expression.
Ø The data types should match for the ports on each side of a lookup condition. To avoid errors here, copy and paste the lookup ports to create the input ports.
Ø If there is a Lookup SQL override statement, the server automatically appends an ORDER BY clause to it, based on the order of the lookup ports, so putting your own ORDER BY clause in the override creates an error. The workaround is to write the specific ORDER BY clause in the override and end the statement with the comment characters "--", so that the automatically appended ORDER BY is commented out.
Ø Only the fields required by downstream transformations should be marked as output ports in the Lookup.
Ø Remove all unnecessary fields from the lookup.
Ø All input ports must be prefixed with the keyword IN, e.g. IN_Loan_Number.
Ø The location information of the lookup object must be $Target at the mapping level.
Ø If there are multiple calls to a single lookup in the mapping, use an unconnected lookup.
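The ORDER BY workaround is purely mechanical: the server appends its generated ORDER BY after whatever override text you supply, so ending the override with the SQL comment marker "--" neutralizes the appended clause while your own ORDER BY survives. The sketch below simulates that concatenation with sqlite3 and a made-up lookup table (all names are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE loan_lkp (loan_number TEXT, status TEXT)")
con.executemany("INSERT INTO loan_lkp VALUES (?, ?)",
                [("L2", "OPEN"), ("L1", "CLOSED")])

# Lookup SQL override: our own ORDER BY, then a trailing comment marker.
override = ("SELECT loan_number, status FROM loan_lkp "
            "ORDER BY loan_number --")

# Simulate the server appending its generated ORDER BY on the same line.
generated = override + " ORDER BY loan_number, status"

# The appended clause is inside the comment, so the statement still
# parses and uses only the ORDER BY written in the override.
rows = con.execute(generated).fetchall()
print(rows)  # [('L1', 'CLOSED'), ('L2', 'OPEN')]
```

Note that "--" comments out the rest of the line, which is why the trick works when the generated clause is appended to the same statement.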

Source Qualifier transformation

Ø Filter as many records as possible at the Source Qualifier level.
Ø All sources under one folder must point to the same database using the same connect string.
Ø Link onward only those ports that are required in the mapping.
Ø All port names should be in Initcap.
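The filter-early advice is just a question of where the condition lives. A sketch with sqlite3 and a hypothetical source table: pushing the condition into the source query means only qualifying rows ever leave the database, whereas filtering downstream fetches every row first and discards most of them in the pipeline.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE src_orders (order_id INTEGER, status TEXT)")
con.executemany("INSERT INTO src_orders VALUES (?, ?)",
                [(1, "ACTIVE"), (2, "CANCELLED"), (3, "ACTIVE")])

# Filter in the source query (Source Qualifier level): only rows that
# satisfy the condition are read out of the database.
at_source = con.execute(
    "SELECT order_id FROM src_orders WHERE status = 'ACTIVE'").fetchall()

# Filter downstream (e.g. in a Filter transformation): every row is
# fetched first, then discarded in the pipeline.
all_rows = con.execute("SELECT order_id, status FROM src_orders").fetchall()
downstream = [(oid,) for oid, status in all_rows if status == "ACTIVE"]

assert at_source == downstream  # same result, but less data moved
print(at_source)  # [(1,), (3,)]
```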


Informatica Best Practices (Mapping Description)

It is good practice to note the name of the developer, the date of development, and a brief description of the mapping's functionality in the description field of the mapping. Whenever the mapping is changed, the change, the date of the change, and the person responsible should be noted there as well. This helps track the history of changes in a mapping and acts as a log.
Similarly, all major transformations should have a brief description of their functionality; for example, a Lookup transformation should note its lookup override condition, and a Source Qualifier its filter condition.

Informatica Best Practices (Mapping Development Standards)

Naming conventions for mappings will be in the form m_SubjectArea_Table, where:
Subject Area: Data Warehouse, ODS, or Data Mart name, etc.
Dimension/Fact/Table: name of the dimension, fact, or related table. Use underscores wherever required to name dimensions and facts.
Ex: m_VISION_ACTION_RESULT_CODES


Mapplets

Use same convention as mapping name with prefix MPLT_;
E.g. MPLT_


Transformations Objects
Transformation Naming Convention
Expression Transformation - EXP_Name
Lookup Transformation - LKP_Name
Source Qualifier Transformation - SQ_Name
Aggregator Transformation - AGG_Name
Filter Transformation - FIL_Name
Update Strategy Transformation - UPD_Name
Normalizer Transformation - NRM_Name
Rank Transformation - RNK_Name
Stored Procedure Transformation - SP_StoredProcedureName
External Procedure Transformation - EXT_ProcedureName
Joiner Transformation - JNR_Name
Sorter Transformation - SRT_Name
Router Transformation - RTR_Name
Sequence Generator - SEQ_Name (always reusable)

For any reusable transformation, use R_ as a prefix.

Mapping Variable
$$Function or Process that is being done

Mapping Parameter
$$Function or Process that is being done