BigSheets is a spreadsheet-style tool for business analysts provided with IBM InfoSphere BigInsights, a platform based on the open source Apache Hadoop project. BigSheets enables non-programmers to iteratively explore, manipulate, and visualize data stored in your distributed file system. This article demonstrates how to analyze and visualize data we collected from Twitter in my previous article An example how to collect, analyze and visualize social media data with JAQL and Hadoop.
“BigSheets translates user commands, expressed through a graphical interface, into Pig scripts executed against a subset of the underlying data. In this manner, an analyst can iteratively explore various transformations efficiently. When satisfied, the user can save and run the workbook, which causes BigSheets to initiate MapReduce jobs over the full set of data, write the results to the distributed file system, and display the contents of the new workbook.” 
BigSheets can process huge amount of data due to the fact that user commands, expressed through a graphical interface, are translated into Pig scripts and can be run as MapReduce jobs in parallel on many nodes.
Step 1: Creating BigSheets Master Workbooks
To begin analyzing data with BigSheets, we need to create a workbook (spreadsheet-style structure). To do so, open your BigInsights web console (typically http://localhost:8080) and follow these steps:
- On the main menu go to Files and navigate to the comma-delimited file with tweets
tweets.del(we generated this file in the previous article and stored it to
/user/root/tweets.del). Click on the file and you should see its content.
- Click on the Radio button “Sheet”, edit the reader from Line Reader to Comma Separated Value (CSV) Data, then uncheck Headers Included and confirm the changes.
- You should see a table structure similar to the following one.
- Click on Save as Master Workbook, set the name to “Tweets-IBM+BigData ALL” and click Save.
- You just created a new Master Workbook. Master Workbook is used as a main data source and its structure and content cannot be changed. You will see how to work with data in the master workbook later.
As we also run our Java MapReduce application which counts the occurrences of words, we need to create a master workbook for the data the application generated. In my case, the path to the file is
/user/root/results/part-00000. So follow the previous steps 1-5 for this file with these modifications: change the reader to the Tab Separated Value (TSV) Data and the name of a new master workbook set to “WordCount-IBM+BigData”.
Step 2: Tailoring BigSheets Workbooks
Now we have two master workbooks:
- Tweets-IBM+BigData ALL with 17 columns corresponding to the following fields:
- header1 = created_at
- header2 = id_str
- header3 = geo
- header4 = coordinates
- header5 = location
- header6 = user.id_str
- header7 = user.name
- header8 = user.screen_name
- header9 = user.location
- header10 = user.description
- header11 = user.url
- header12 = user.followers_count
- header13 = user.friends_count
- header14 = retweet_count
- header15 = favorite_count
- header16 = lang
- header17 = text
- WordCount-IBM+BigData with 2 columns corresponding to:
- header1 = word
- header 2 = number of occurrences
We will work with these data sources and create 4 different visualizations.
Example: Coverage by language in a pie chart
The graph above shows total percentage coverage by languages. To achieve this result, we just basically group data from the Tweets-IBM+BigData ALL by header16 (lang) and provide the total count of tweets in every group. To do so, follow these steps:
- From the BigSheets page of the web console, open the Tweets-IBM+BigData ALL master workbook.
- As I mentioned before, master workbooks cannot be modified so we need to create a new workbook based on this master workbook. Within the master workbook, click on Build new workbook.
- New workbook will automatically open. In the left down corner click on Add sheets and as a type of sheet choose Pivot.
- Set the
- Go to the Calculate tab and create a new column (for example “total”) showing the results of function COUNT applied on groups based on column header16.
- Apply the settings. You should get results similar to the following screenshot.
- Save the workbook (for example as “Tweets-IBM+BigData ALL – language pie”) and Exit the editor.
Step 3: Creating charts
- Go to the Tweets-IBM+BigData ALL – language pie workbook (if you are not already there).
- In the left down corner click on Add chart and choose Chart, then Pie.
- In the settings set Value to header16 and Count to total. Apply the settings.
- Now you are supposed to run the computation of your chart. Note that a status bar to the right of the Run button enables you to monitor the progress of job. Behind the scenes, BigSheets executes Pig scripts that initiate MapReduce job. In the end you should see the pie chart representing the coverage by languages. Note that every time the data source change, you can just re-run your charts to get fresh results.
The other charts
All the other graphs follow almost the same steps. There are just very simple differrencies.
- Tag cloud
- You need to work with the second master workbook we have created (WordCount-IBM+BigData). Note that you don’t have to create a new sheet because there is no reason to modify the source data. All you need for visualization is already there so you don’t have to tailor anything.
- Bar chart
- To create the bar chart above, you again create a new workbook based on the master book with tweets. This time you group tweets by users (header8) and again use Count function to calculate the size of each group. Then you create a bar chart based on user names and results of count function.
- Timeline chart
- To create this chart, you need to add sheet using a Function (instead of Pivot). There you need to use the proper function to extract date and time format from your tweets (for example function FORMATDATE).
This article very briefly explores how BigInsights with BigSheets enable business analysts to work with big data without writing scripts. Introduced scenario was very simple and should just demonstrate the basic work with BigSheets. What’s very important to realize – BigSheets can process huge amount of data due to the fact that user commands, expressed through a graphical interface, are translated into Pig scripts and can be run as MapReduce jobs in parallel on many nodes.
BigSheets offers more than I would be able to cover here. Just for example, gather data from other sources like databases, sentiments analysis macros (part of accelerators), ability to write your own macros, … If you want to know more information, don’t hesitate to ask through comments or to contact me.