What is Big Data Intelligence

Advice for medium-sized companies: what exactly is big data about?

The term big data is not without controversy, but it has been established for a number of years and now denotes an IT field of its own. Big data can initially be understood literally in the sense of "large data (volumes)". As a rule, one speaks of big data when the volume of a certain definable set of data is in the range of terabytes (1 terabyte = 1024 GB), petabytes (1 petabyte = 1024 terabytes) or exabytes (1 exabyte = 1024 petabytes). Such amounts of data are no longer theoretical values, but are more and more common in practice.
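The unit ladder described above can be sketched in a few lines of Python; this is purely illustrative, using the binary (factor-1024) convention the text itself uses:

```python
# The binary prefixes mentioned in the text, expressed in bytes.
GIB = 1024 ** 3      # 1 gigabyte (binary convention)
TIB = 1024 * GIB     # 1 terabyte = 1024 GB
PIB = 1024 * TIB     # 1 petabyte = 1024 terabytes
EIB = 1024 * PIB     # 1 exabyte  = 1024 petabytes

print(TIB // GIB)    # gigabytes per terabyte -> 1024
print(EIB // GIB)    # gigabytes per exabyte  -> 1073741824
```

Each step up the ladder multiplies the volume by 1024, which is why an exabyte already corresponds to more than a billion gigabytes.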

The development towards big data is not an isolated phenomenon limited to certain industries or regions, but applies across industries and globally. According to IDC market researchers, the digital information that is generated or copied each year broke the zettabyte barrier for the first time in 2011 and has swelled to 1.8 zettabytes. The zettabyte is the next unit above the exabyte and corresponds to one trillion gigabytes of data.

According to IDC, the amount of information has grown by a factor of five in the past five years. There is no end in sight, on the contrary: According to the IDC Big Data survey in Germany, more than three quarters of German companies expect annual data growth of up to 25 percent over the next few years. 13 percent even expect their mountain of data to grow by 25 to 50 percent.

Why Big Data Now?

Why are the data volumes so high today? There are a number of reasons for this. Because "everything" is being digitized in the meantime, new types of mass data and real-time data are emerging in numerous industries. Machines and computers in particular produce enormous amounts of data: a modern airplane, for example, generates up to 10 terabytes of data in 30 minutes. With 25,000 flights per day, this creates petabytes of data.
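A quick back-of-the-envelope check of the figures quoted above (the per-flight and per-day numbers are taken from the text; the calculation itself is only illustrative):

```python
# Figures from the text: up to 10 TB per 30-minute flight segment,
# roughly 25,000 flights per day.
tb_per_flight = 10
flights_per_day = 25_000

daily_tb = tb_per_flight * flights_per_day  # 250,000 TB per day
daily_pb = daily_tb / 1024                  # convert to petabytes

print(f"~{daily_pb:.0f} PB of flight data per day")
```

Even at this conservative estimate, aviation alone produces on the order of a quarter of a million terabytes, i.e. several hundred petabytes, of sensor data every day.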

New applications also drive data growth. Technologies such as cloud computing, RFID, transactional systems, data warehouses, business intelligence, document management and enterprise content management systems are IT applications that lead to big data.

The decisive factor in the data explosion, however, is likely to have been the Internet, combined with the increasing share of mobile devices and, above all, social media such as Facebook, Twitter and Co. On Twitter, for example, millions of users post from at least one account, often several times a day. At 140 characters per tweet and the speed at which the short messages are sent, Twitter alone generates at least 8 terabytes of data - per day.

Big Data = Volume + Variety + Velocity

The huge amount of data, however, is only one aspect of big data. The "3V" formula is a widely accepted criterion for characterizing big data: in addition to the sheer volume of data ("volume"), there are the additional features of variety ("variety") and speed ("velocity").

"Variety" refers to the number of different data sources from which data pours in today, as well as to the diversity of the data itself. Companies today must manage and integrate data from a wide range of traditional and newer information sources, both internal and external: data from sensors, for example, from mobile communication, from intelligent devices, or from social media channels and social collaboration technologies.

With the variety of sources, the variety of data formats also increases. Until a few years ago, data was still well structured and could be saved efficiently and without great effort as tables in relational databases. With the increasing complexity of the data sources, the data formats have also become more complex.

Often the data that is generated in the new media today is completely unstructured. Unstructured data are texts, images, audio and video files - i.e. the lion's share of data types in the social media environment. They are difficult to squeeze into given schemes.

If you add a mixed area of "semi-structured" data such as e-mails, which have a certain structure with "recipient", "sender" and "subject" while the content itself is unstructured, then today we are dealing with a mish-mash of structured, semi-structured and unstructured data from a variety of different sources.

Typical types of data today

  • Structured data: data that is mapped in tables and structures of relational databases, such as addresses, product lists, personnel administration, etc.

  • Semi-structured data: data that is partly structured, partly unstructured, such as e-mails. Such data is often generated through the use of data exchange programs between companies and is often based on XML

  • Unstructured data: text files, PDFs, scanned mail, presentations, images, videos
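The e-mail example from the list above illustrates the semi-structured case well, and Python's standard library can make it concrete (the addresses and content here are made up purely for illustration):

```python
import email

# A minimal semi-structured record: the header fields follow a fixed
# scheme, while the body is free-form text.
raw = """\
From: alice@example.com
To: bob@example.com
Subject: Quarterly report

Hi Bob, the numbers look good this quarter.
"""

msg = email.message_from_string(raw)
print(msg["From"])        # structured part: well-defined header fields
print(msg["Subject"])
print(msg.get_payload())  # unstructured part: free-form body text
```

The headers could be stored in a relational table without any trouble; the body, like most social media content, resists any fixed schema.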

In order to store, manage and analyze such data meaningfully, new approaches have to be found - which brings the third aspect of big data into play: speed ("velocity"). This data, large in volume and arriving from different sources in different formats, has to be stored and analyzed as quickly and efficiently as possible. This is a challenge because conventional relational database systems reach their limits here.

Relational databases

Relational databases can be used efficiently for frequent transactions at the record level or for scenarios with small to medium data volumes. They are not geared towards processing and analyzing data volumes in the petabyte or even exabyte range. Above all, however, unstructured data can only be transferred into table-oriented relational database systems with contortions, if at all.

A conventional database becomes slower the more data it has to manage and the more relations a query joins. The performance required for queries with acceptable response times is then no longer achieved. There are optimizations for large databases, but beyond a certain depth and complexity even the best optimization no longer helps.

Unstructured data is also a problem for traditional databases. As mentioned, unstructured data is difficult to squeeze into a table schema. Table-oriented data models are not designed to work with masses of chaotic data, and imposing a relational structure on social media data from Facebook or Twitter is hardly feasible.
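To see why a relational schema fits social media data so poorly, consider a simplified, hypothetical post record (the field names here are invented for illustration, not an actual Twitter or Facebook API format):

```python
import json

# A hypothetical social media post: nested, sparse and schema-free.
post = {
    "user": {"name": "example_user", "followers": 1200},
    "text": "Short message with #hashtag",
    "entities": {"hashtags": ["hashtag"], "urls": []},
    "retweet_of": None,  # present on some posts, absent on others
}

# A document-oriented store can keep the record as-is; a relational
# schema would need several tables (posts, users, hashtags, ...)
# plus foreign keys, and every new optional field forces a migration.
print(json.dumps(post, indent=2))
```

The nesting, optional fields and variable-length lists are exactly the properties that map awkwardly onto fixed tables, which is why such data is typically handled by document or key-value stores rather than relational systems.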

Another problem: in contrast to classic business intelligence, where reports were generated in batch processes over hours, ad-hoc queries with analysis results are now expected in real time if possible. They form the basis for immediate, proactive decisions or even enable automated intervention. Today, not only the company boss, but also department heads and other decision-makers, right down to the clerk, want the results of such analyses as soon as possible.