Using Logical Data Lakes

Today, data-driven decision making is at the center of all things. The emergence of data science and machine learning has further reinforced the importance of data as the most critical commodity in today’s world. From FAAMG (the biggest five tech companies: Facebook, Amazon, Apple, Microsoft, and Google) to governments and non-profits, everyone is busy leveraging the power of data to achieve final goals. Unfortunately, this growing demand for data has exposed the inefficiency of the current systems to support the ever-growing data needs. This inefficiency is what led to the evolution of what we today know as Logical Data Lakes.

What Is a Logical Data Lake?

In simple words, a data lake is a data repository that is capable of storing any data in its original format. As opposed to traditional data sources that use the ETL (Extract, Transform, and Load) strategy, data lakes work on the ELT (Extract, Load, and Transform) strategy. This means data does not have to be first transformed and then loaded, which essentially translates into reduced time and efforts. Logical data lakes have captured the attention of millions as they do away with the need to integrate data from different data repositories. Thus, with this open access to data, companies can now begin to draw correlations between separate data entities and use this exercise to their advantage.

Primary Use Case Scenarios of Data Lakes

Logical data lakes are a relatively new concept, and thus, readers can benefit from some knowledge of how logical data lakes can be used in real-life scenarios.

To conduct Experimental Analysis of Data:

  • Logical data lakes can play an essential role in the experimental analysis of data to establish its value. Since data lakes work on the ELT strategy, they grant deftness and speed to processes during such experiments.

To store and analyze IoT Data:

  • Logical data lakes can efficiently store the Internet of Things type of data. Data lakes are capable of storing both relational as well as non-relational data. Under logical data lakes, it is not mandatory to define the structure or schema of the data stored. Moreover, logical data lakes can run analytics on IoT data and come up with ways to enhance quality and reduce operational cost.

To improve Customer Interaction:

  • Logical data lakes can methodically combine CRM data with social media analytics to give businesses an understanding of customer behavior as well as customer churn and its various causes.

To create a Data Warehouse:

  • Logical data lakes contain raw data. Data warehouses, on the other hand, store structured and filtered data. Creating a data lake is the first step in the process of data warehouse creation. A data lake may also be used to augment a data warehouse.

To support reporting and analytical function:

  • Data lakes can also be used to support the reporting and analytical function in organizations. By storing maximum data in a single repository, logical data lakes make it easier to analyze all data to come up with relevant and valuable findings.

A logical data lake is a comparatively new area of study. However, it can be said with certainty that logical data lakes will revolutionize the traditional data theories.

Related References

How to know if your Oracle Client install is 32 Bit or 64 Bit

Oracle Database, How to know if your Oracle Client install is 32 Bit or 64 Bit
Oracle Database

 

How to know if your Oracle Client install is 32 Bit or 64 Bit

Sometimes you just need to know if your Oracle Client install is 32 bit or 64 bit. But how do you figure that out? Here are two methods you can try.

The first method

Go to the %ORACLE_HOME%\inventory\ContentsXML folder and open the comps.xml file.
Look for <DEP_LIST> on the ~second screen.

If you see this: PLAT=”NT_AMD64” then your Oracle Home is 64 bit
If you see this: PLAT=”NT_X86” then your Oracle Home is 32 bit.

It is possible to have both the 32-bit and the 64-bit Oracle Homes installed.

The second method

This method is a bit faster. Windows has a different lib directory for 32-bit and 64-bit software. If you look under the ORACLE_HOME folder if you see a “lib” AND a “lib32” folder you have a 64 bit Oracle Client. If you see just the “lib” folder you’ve got a 32 bit Oracle Client.

Related References

 

OLTP vs Data Warehousing

OLTP Versus Data Warehousing

I’ve tried to explain the difference between OLTP systems and a Data Warehouse to my managers many times, as I’ve worked at a hospital as a Data Warehouse Manager/data analyst for many years. Why was the list that came from the operational applications different than the one that came from the Data Warehouse? Why couldn’t I just get a list of patients that were laying in the hospital right now from the Data Warehouse? So I explained, and explained again, and explained to another manager, and another. You get the picture.
In this article I will explain this very same thing to you. So you know  how to explain this to your manager. Or, if you are a manager, you might understand what your data analyst can and cannot give you.

OLTP

OLTP stands for OLine Transactional Processing. With other words: getting your data directly from the operational systems to make reports. An operational system is a system that is used for the day to day processes.
For example: When a patient checks in, his or her information gets entered into a Patient Information System. The doctor put scheduled tests, a diagnoses and a treatment plan in there as well. Doctors, nurses and other people working with patients use this system on a daily basis to enter and get detailed information on their patients.
The way the data is stored within operational systems is so the data can be used efficiently by the people working directly on the product, or with the patient in this case.

Data Warehousing

A Data Warehouse is a big database that fills itself with data from operational systems. It is used solely for reporting and analytical purposes. No one uses this data for day to day operations. The beauty of a Data Warehouse is, among others, that you can combine the data from the different operational systems. You can actually combine the number of patients in a department with the number of nurses for example. You can see how far a doctor is behind schedule and find the cause of that by looking at the patients. Does he run late with elderly patients? Is there a particular diagnoses that takes more time? Or does he just oversleep a lot? You can use this information to look at the past, see trends, so you can plan for the future.

The difference between OLTP and Data Warehousing

This is how a Data Warehouse works:

The data gets entered into the operational systems. Then the ETL processes Extract this data from these systems, Transforms the data so it will fit neatly into the Data Warehouse, and then Loads it into the Data Warehouse. After that reports are formed with a reporting tool, from the data that lies in the Data Warehouse.

This is how OLTP works:

Reports are directly made from the data inside the database of the operational systems. Some operational systems come with their own reporting tool, but you can always use a standalone reporting tool to make reports form the operational databases.

Pro’s and Con’s

Data Warehousing

Pro’s:

  • There is no strain on the operational systems during business hours
    • As you can schedule the ETL processes to run during the hours the least amount of people are using the operational system, you won’t disturb the operational processes. And when you need to run a large query, the operational systems won’t be affected, as you are working directly on the Data Warehouse database.
  • Data from different systems can be combined
    • It is possible to combine finance and productivity data for example. As the ETL process transforms the data so it can be combined.
  • Data is optimized for making queries and reports
    • You use different data in reports than you use on a day to day base. A Data Warehouse is built for this. For instance: most Data Warehouses have a separate date table where the weekday, day, month and year is saved. You can make a query to derive the weekday from a date, but that takes processing time. By using a separate table like this you’ll save time and decrease the strain on the database.
  • Data is saved longer than in the source systems
    • The source systems need to have their old records deleted when they are no longer used in the day to day operations. So they get deleted to gain performance.

Con’s:

  • You always look at the past
    • A Data Warehouse is updated once a night, or even just once a week. That means that you never have the latest data. Staying with the hospital example: you never knew how many patients are in the hospital are right now. Or what surgeon didn’t show up on time this morning.
  • You don’t have all the data
    • A Data Warehouse is built for discovering trends, showing the big picture. The little details, the ones not used in trends, get discarded during the ETL process.
  • Data isn’t the same as the data in the source systems
    • Because the data is older than those of the source systems it will always be a little different. But also because of the Transformation step in the ETL process, data will be a little different. It doesn’t mean one or the other is wrong. It’s just a different way of looking at the data. For example: the Data Warehouse at the hospital excluded all transactions that were marked as cancelled. If you try to get the same reports from both systems, and don’t exclude the cancelled transactions in the source system, you’ll get different results.

online transactional processing (OLTP)

Pro’s

  • You get real time data
    • If someone is entering a new record now, you’ll see it right away in your report. No delays.
  • You’ve got all the details
    • You have access to all the details that the employees have entered into the system. No grouping, no skipping records, just all the raw data that’s available.

Con’s

  • You are putting strain on an application during business hours.
    • When you are making a large query, you can take processing space that would otherwise be available to the people that need to work with this system for their day to day operations. And if you make an error, by for instance forgetting to put a date filter on your query, you could even bring the system down so no one can use it anymore.
  • You can’t compare the data with data from other sources.
    • Even when the systems are similar. Like an HR system and a payroll system that use each other to work. Data is always going to be different because it is granulated on a different level, or not all data is relevant for both systems.
  • You don’t have access to old data
    • To keep the applications at peak performance, old data, that’s irrelevant to day to day operations is deleted.
  • Data is optimized to suit day to day operations
    • And not for report making. This means you’ll have to get creative with your queries to get the data you need.

So what method should you use?

That all depends on what you need at that moment. If you need detailed information about things that are happening now, you should use OLTP.
If you are looking for trends, or insights on a higher level, you should use a Data Warehouse.

 Related References

Databases – Database Isolation Level Cross Reference

Database Type Isolation Levels Cross Reference
Database And Tables

 

Here is a table quick reference of some common database and/or connection types, which use connection level isolation and the equivalent isolation levels. This quick reference may prove useful as a job aid reference, when working with and making decisions about isolation level usage.

Database isolation levels

Data sources

Most restrictive isolation level

More restrictive isolation level

Less restrictive isolation level

Least restrictive isolation level

Amazon SimpleDB

Serializable Repeatable read Read committed Read Uncommitted

dashDB

Repeatable read Read stability Cursor stability Uncommitted read

DB2® family of products

Repeatable read Read stability* Cursor stability Uncommitted read

Informix®

Repeatable read Repeatable read Cursor stability Dirty read

JDBC

Serializable Repeatable read Read committed Read Uncommitted

MariaDB

Serializable Repeatable read Read committed Read Uncommitted

Microsoft SQL Server

Serializable Repeatable read Read committed Read Uncommitted

MySQL

Serializable Repeatable read Read committed Read Uncommitted

ODBC

Serializable Repeatable read Read committed Read Uncommitted

Oracle

Serializable Serializable Read committed Read committed

PostgreSQL

Serializable Repeatable read Read committed Read committed

Sybase

Level 3 Level 3 Level 1 Level 0

 

Related References

 

Infosphere Information Server (IIS) Component Alignment

Infosphere Information Server SDLC Alignment
Infosphere Information Server SDLC Alignment

In recent history, I have been asked several times to describe where different IIS components fit in the Software Development Lifecycle (SDLC) process.  The graphic above, list most of the more important IIS components in their relative SDLC relationships. However, it is important to note that that these are not absolutes. Many applications may cross boundaries depending on the practices of the individual company, the application spurt licensed by the company, and/or the applications implemented by the company.  For example, many components will participate in the sustainment phase of SDLC, although I did not list him in that role. This is especially true, if you’re using the governance tools (e.g. governance catalog ) and supporting your sustainment activities with modeling and development tools, such as, data architect.

Related References

InfoSphere Information Server (IIS) Component Descriptions

Infosphere Information Server LaunchPad Page
Infosphere Information Server LaunchPad Page

Each IIS component has a primary function in the InfoSphere architecture, which can be synopsized as follows:

ApplicationFunction
Blueprint DirectorIBM InfoSphere Blueprint Director is aimed at the Information Architect designing solution architectures for information-intensive projects.
Cognos (If purchased)Governance Dashboard (Framework Manager Model provided by IBM), Semantics, Analytics, and Reporting
Data Architect Data Architect is an enterprise data modeling and integration design tool. You can use it to discover, model, visualize, relate, and standardize diverse and distributed data assets, including dimensional models.
Data Click Data Click is an exciting new capability that helps novices and business users retrieve data and provision systems easily in only a few clicks.
DatastageDataStage is a data integration tool that enables users to move and transform data between operational, transactional, and analytical target systems.
DiscoveryDiscovery is used to identify the transformation rules that have been applied to source system data to populate a target. Once accurately defined, these business objects and transformation rules provide the essential input into information-centric projects.
FasttrackFastTrack streamlines collaboration between business analysts, data modelers, and developers by capturing and defining business requirements in a common format and, then, transforming that business logic (Source-to-Target-Mapping (STTM)) directly into DataStage ETL jobs.
Glossary AnywhereBusiness Glossary Anywhere, its companion module, augments Governance Catalog with more ease-of-use and extensibility features.
Governance CatalogThe Governance Catalog includes business glossary assets (categories, terms, information governance policies, and information governance rules) and information assets.
Information AnalyzerInformation Analyzer provides capabilities to profile and analyze data.
Information Services DirectorInformation Services Director provides a unified and consistent way to publish and manage shared information services in a service-oriented architecture (SOA).
Metadata Asset ManagerImport, export, and manage common metadata assets in Metadata Repository and across applications
Operations ConsoleAdmin workspaces to investigate data, deploy applications, Web services, and monitor schedules and logs.
QualitystageQualityStage provides data cleansing capabilities to help ensure quality and consistency by standardizing, validating, matching, and merging information to create comprehensive and authoritative information.
Server ManagerDeployment tool to move, deploy and control DataStage and QualityStage assets.

Related References