
Cloud Business Intelligence: Google BigQuery

  • Writer: June Tucay
  • Jan 19, 2017
  • 24 min read

Let's turn now to this module on Google BigQuery and where BigQuery fits into our picture. We have BI products as a service, which provide the analytical capabilities, the charting and graphing, and we have data warehousing as a service, which provides data storage and aggregation. In the latter category sits Google BigQuery, and it is incredibly robust: you can store terabytes, possibly petabytes, of data on Google's servers in the cloud and access it through several different, really easy methods that take the headache and hassle out of standing up a petabyte-scale data warehouse and allow users to focus on the data. Imagine if you could just upload your data and then start using it; it really is that simple. In this module that's what we'll do: walk through all the steps you need to start using Google BigQuery as a data warehouse as a service.

Architecture

First let's take a look at the Google BigQuery architecture. The first thing to understand is how BigQuery stores data: it uses a columnar store rather than a row store. Think of it like this: if you store data row by row, you are looking at record one, record two, and so on, and each column holds a different piece of information about that record. This is the traditional way. So when we want to get at data, say we want to find the most popular titles of books sold, we would need to scan all the records to read those columns. The way columnar storage works is that you pivot this a bit: you don't store the data row by row, you store sets of column values together with pointers back to which rows they refer to. If I take this data, make a copy of it, pivot it, and bring it back into a columnar structure, each column now lives in its own bucket with pointers back to the rows it came from. For example, all the red values stick together, each with a pointer back to the row it belongs to, and the same for the green values. For certain analytical functions, in fact most of the things we do in analytics, which amount to looking for patterns in things, this is an incredible optimization because it reduces the amount of data read. In a story Google published, they suggested that if you wanted to find the most popular titles of articles on Wikipedia, the columnar layout reduced the amount of data scanned from 9 GB to 3 GB. That's pretty significant, and as data volumes grow, optimizations like that really help with speed.
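The row-versus-column pivot described above can be modeled in a few lines of Python. This is a toy illustration only (the sample rows are invented, and BigQuery's real storage format is far more sophisticated), but it shows the two ideas at work: each column lives in its own bucket, and each unique value keeps pointers back to the rows it came from.

```python
# Illustrative sketch of pivoting row storage into columnar storage with
# row pointers. The dataset is invented; this is a toy model, not BigQuery's
# actual on-disk format.

rows = [
    {"category": "Office",     "region": "Central", "orders": 6},
    {"category": "Furniture",  "region": "West",    "orders": 50},
    {"category": "Technology", "region": "West",    "orders": 12},
]

def to_columnar(rows):
    """Pivot row-oriented records into {column: {value: [row_ids]}}."""
    columns = {}
    for row_id, row in enumerate(rows):
        for col, value in row.items():
            columns.setdefault(col, {}).setdefault(value, []).append(row_id)
    return columns

columnar = to_columnar(rows)

# Counting distinct categories no longer touches any other column:
distinct_categories = len(columnar["category"])  # -> 3

# Each value is stored once, with integer pointers back to its rows:
west_rows = columnar["region"]["West"]  # -> [1, 2]
```

Note that the repeated string "West" is now stored once, with cheap integer pointers taking its place, which is where the compression win described above comes from.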
Let's look at an example using some real data. Here I have a basic dataset, a very simple one, with a product category, a region, and a number of orders: Office, Central, 6 orders; Furniture, West, 50 orders; et cetera. In the row-storage world, this table would have a primary identifier such as a row ID, and to answer a question like how many sales we had in the South region we would have to scan down every row. If we switch over to a columnar format, we put all of the product categories into a single column store. You can think of it as another table, one row per column, but to make it visually comparable, picture a structure that holds each unique value from the column plus pointers back to the rows it came from: Office exists in rows 1 and 2, Technology in rows 3, 6, and 7, Furniture in 4 and 5. Already you can see how much shorter and more concise this is; the columnar storage format reduces the size of the data as well as the amount we have to scan. If we also do this with the region, it compresses even further: instead of listing the word West four different times, it is stored once with integer references back to the rows it belongs to, and integers compress far better than strings do. This alone really reduces the overall size of the data. Some studies show that in a regular row store we get compression ratios of around 1:3, while here we can get 1:10, roughly a 10x compression ratio for columnar versus row storage where raw values are repeated. With this layout, if you want to answer questions like how many distinct product categories we have, we simply count the number of nodes in one of the column lists. Aggregating a sum across something is easy as well, because all we have to do is find the category and then follow its related pointers.

Looking back at the example where we pivoted the data: storing it in column families enables some of the more advanced analytical functions as well as compression, which reduces the overall size and the amount of data being read in a query, which in turn makes it dramatically faster. The next part of Google BigQuery's architecture to talk about is how it actually performs queries. The challenge Google had when working on Dremel, the internal version of BigQuery they had been using since around 2006, was how to run these queries across tens of thousands of machines and collect the results in a matter of seconds. They figured this out by inventing the tree architecture, which forms a massively parallel distributed tree for pushing a query down and then aggregating the results from the leaves at really, really fast speeds. It uses a multi-level serving tree to execute queries. At the top is a client, which issues the query against the root server. The root reads some metadata about the query and pushes it down to the intermediate servers. The intermediate servers develop the query tree: each has its own query execution logic and rewrites the query based on the storage, then pushes it down to the leaf servers, which actually hold the data and perform the aggregations against the storage layer. Everything then rolls back up through the intermediates and the root to the client. The result is incredibly fast speeds at huge scale. Google published some basic examples a few years ago about Dremel and what they actually use this product for internally; you can imagine how large the data must be in some of those systems and how complex some of those workloads must be. That should give you confidence that these platforms and tools really do deliver incredible value, and now it's even easier and cheaper to get up and running.
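The serving tree can be sketched the same way: the root splits the work across intermediate servers, each leaf aggregates its own shard of the data, and partial results are merged on the way back up. The shards and fan-out here are invented for illustration; real Dremel rewrites SQL and runs across thousands of machines.

```python
# Toy model of a Dremel-style serving tree. Here a "query" is just a row
# count, and the shards are invented lists standing in for leaf-local storage.

def leaf_count(shard):
    """A leaf server holds actual data and computes a partial aggregate."""
    return len(shard)

def intermediate(shards):
    """An intermediate server fans the query out to its leaves and merges
    their partial results."""
    return sum(leaf_count(s) for s in shards)

def root(all_shards, fanout=2):
    """The root splits the shards across intermediates, then merges again."""
    groups = [all_shards[i::fanout] for i in range(fanout)]
    return sum(intermediate(g) for g in groups)

# Six shards holding 10..60 rows each: the tree returns the global count.
shards = [list(range(n)) for n in (10, 20, 30, 40, 50, 60)]
print(root(shards))  # 210
```

The key property is that no single machine ever sees all the data: each level only merges already-reduced partial results, which is why the answer comes back in seconds even over huge tables.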

Pricing

Let's take a look at pricing for Google BigQuery. BigQuery has two components: you pay for the storage, and you pay for how much you actually query the data, which means the bill is variable, and that can be a little tricky. The pricing is ever-evolving, so rather than memorize figures, find the pricing page: search for "Google BigQuery pricing" and you'll see it; it's pretty simple. Scrolling down, there are the free operations: loading data, exporting data, copying and deleting data, and metadata operations are all free (a quota applies to these). Storage pricing is $0.026 per gigabyte per month, less than three cents per gigabyte. BigQuery also supports streaming inserts, which were free up until January 1, and they seem to keep extending that date. If you have a live system and you want real-time analytics, the way to do that is to stream inserts: you throw rows into storage as they arrive and visualize them on the other side to see what's happening. Streaming inserts are different from conventional loads, which are more like a nightly batch job, the way a traditional data warehouse is typically set up. Streaming is priced per megabyte, so you pay a fraction of a cent for each megabyte you stream; it's pretty incredible.

Then there is querying. Beyond the storage cost (storing one terabyte of data runs around $26 a month at that rate), you have the cost associated with querying. One thing to remember: because the data is stored in a columnar format, it is compressed in the BigQuery storage engine and is smaller than it looks in a CSV file. The first terabyte of data processed every month is free, which means that if you had, say, a 500 GB warehouse, not even a terabyte, and each query touched maybe 5% of it, you might never hit the point where you pay for querying at all. I can say from personal experience, having stood this up for several different clients, including one we are testing right now: in the past month we worked against a very detailed dataset and racked up about seven dollars total in cost, including storage and querying. It's incredibly cheap. To put that in context, I have done implementations where we had a 15 to 20 TB data warehouse on other columnar platforms at roughly $10K per terabyte, easily $150K to $200K a year. That same 15 to 20 TB in here would probably run you less than $5,000, and on top of that there is no administration. The difference between this and Redshift, for example, Amazon's columnar cloud warehouse, is that there isn't really any configuration you need to do: you upload your data and BigQuery does the rest. That saves a lot of time and effort. There is no need to hire an admin for it, and if you already have one, they are freed up to do something else. Because of how the platform is designed, with the tree query architecture and the columnar storage engine, the whole burden of administering and understanding the complexities of servers and architectures is taken away.

There is also reserved-capacity pricing: if you have a consistent, much larger workload, you can pay around $25,000 per month for the first 5 GB per second of processing capacity. Think about it this way: if you were building an app with 100 million users and needed guaranteed throughput, you could pay for that to make sure the speed is there and the scaling is handled; for custom use, contact their sales team, who can lay out how it is all calculated across the different data types and sizes. One caveat about query costs: you are billed for the data that is processed by the query, not just the size of the returned result set. I may query something that has a huge amount of data behind it and get back only a small result, but I am charged for the overall bytes processed across the whole dataset to produce that small result. The pricing page has some examples, similar to what we'll walk through as we get into the more functional aspects.
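As a sanity check on the numbers above, here is a back-of-the-envelope cost calculator. The storage rate is the one quoted in this article ($0.026 per GB per month); the $5-per-TB-processed query rate is an assumption on my part, so verify both on Google's pricing page before relying on them.

```python
# Rough BigQuery monthly cost sketch. STORAGE_PER_GB_MONTH comes from the
# rate discussed above; QUERY_PER_TB is an assumed on-demand rate (verify on
# Google's pricing page). The first TB of query processing per month is free.

STORAGE_PER_GB_MONTH = 0.026
QUERY_PER_TB = 5.00          # assumption, not quoted in this article
FREE_QUERY_TB = 1.0

def monthly_cost(stored_gb, queried_tb):
    storage = stored_gb * STORAGE_PER_GB_MONTH
    billable_tb = max(0.0, queried_tb - FREE_QUERY_TB)
    return storage + billable_tb * QUERY_PER_TB

# 1 TB stored, querying under the free tier: storage only, about $26.62.
print(round(monthly_cost(1024, 0.5), 2))  # 26.62

# 500 GB stored, 3 TB processed: $13.00 storage + $10.00 query.
print(round(monthly_cost(500, 3.0), 2))   # 23.0
```

This also makes the free-tier point concrete: a 500 GB warehouse queried lightly never accrues any query charges at all.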

Demo Setup

Let's walk through the steps of getting set up. To begin, we're on google.com, and if you don't have a Google account already you'll need to create one. Once you do, go ahead and sign in, then search for the Google BigQuery console. That gets us the Developers Console link, plus a page that describes everything you need to know, and it's actually really simple to set up. All I need to do is jump over to the Google Developers Console and create a project. With the project created, you see your dashboard, which lists all the different APIs, and from there I see "Try the query API" with a sample on name popularity. I go ahead and try the query, sign in again, and we are now in BigQuery. To recap: if you already have a project in the Developers Console, you just need to enable the BigQuery API; if you haven't, go to APIs, search for the Google BigQuery one, and enable it. By default you get a certain amount of storage and querying available for free (I couldn't find exact figures on that), but once you actually want to use it for production purposes, and you've decided on the data you'll be loading, you'll need to go in and enter billing information. The prices, as we've seen, are very reasonable, and depending on how much data you have it can be very cost-effective. If this is a brand-new idea to you and you're not sure about using BigQuery for your data warehouse, that's fine: do an experiment, and I encourage everyone to do that. Pick a dataset, maybe one you're not worried about from a security standpoint, data that may be public already, and use it to see whether it gives you the performance you want for all your analysts' needs. The best way to test is with what BigQuery gives you out of the box.

We're back in the query window, and here I see my projects, where we'll have the data we've uploaded (we'll go through that in a minute), and there is this public data area with several sample datasets. If you click into one, it has fields just like a regular database table; each is more like one wide table, which is the idea here. I can set up the column names and the descriptions of the fields, which is valuable; on this one those are disabled because I don't own it, but on my own dataset I can have all the definitions listed right here in my BigQuery console on the web. Another thing I can do is click the Details tab, which describes the table: this one holds about 37 million rows and is about 21 GB in size. When we build more of these later, say a view or some kind of logical object that pulls different datasets together, we would also see the SQL code that generated it, along with the details about the asset itself. Among the public samples there is one with Shakespeare's works as word-by-word counts, and there are the trigrams and the Wikipedia dataset, which I find interesting: you can analyze which articles and words are the most popular and how often they are updated.
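The public samples mentioned above can be queried without loading anything yourself. Here is a minimal sketch using the google-cloud-bigquery Python client; it assumes a project with the BigQuery API enabled and application default credentials, and `my-project-id` is a placeholder.

```python
# Sketch: querying the public Shakespeare sample mentioned above. The client
# call is wrapped in a function so the snippet reads fine without the
# google-cloud-bigquery library installed (pip install google-cloud-bigquery).

QUERY = """
SELECT word, SUM(word_count) AS total
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY word
ORDER BY total DESC
LIMIT 10
"""

def run(project_id):
    from google.cloud import bigquery  # lazy import: external dependency
    client = bigquery.Client(project=project_id)
    for row in client.query(QUERY).result():
        print(row.word, row.total)

# run("my-project-id")  # hypothetical project id
```

This is exactly the kind of zero-risk experiment suggested above: the sample data is public, so you can test query performance before committing any of your own data.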

Demo Loading Data

Now let's actually load data into our BigQuery project. Back in my console, the first thing we need to do is jump over and upload our data to Google Cloud Storage. On Google Cloud Storage, when I click in, I see I actually need to enable billing, so let me do that now. With billing enabled, I can go back into my Cloud Storage browser and upload data. Before I upload, I want to quickly point out the docs that go into detail about how you should prepare and format your data before uploading it. Over in the Google documentation for BigQuery, "Preparing data for BigQuery" describes how to load data in detail. I won't walk through the whole page, I just want to show you where it is. Basically there are two ways to do it. One is to load the data onto Google Cloud Storage and then import it into BigQuery; a lot of what we'll do, we'll do by hand, though of course you can automate it with languages like Java and Python. The other way is streaming inserts, which are really great because they enable real-time analytics.

On preparing your data: I'm going to give you the actual dataset, so you don't need to worry about preparing it, but when you bring your own data it's important to understand how BigQuery processes it when you query, so you can structure it properly. It accepts two formats, CSV and JSON, and if you work with data at all you should be comfortable with both. There are different rules the data has to follow, with different size limitations for compressed versus uncompressed files, for example, and the rules differ a little between CSV and JSON. The datatypes are not as rich as in some full database platforms, but they cover everything you'll need. You can upload compressed data, although it's not recommended; it adds too much overhead to uncompress it before it's loaded. One other thing you want to do is use denormalized data, so that you create what I call flat files. Think of a star schema in a normal data warehousing environment, but with everything collapsed into a single table: one giant flat table with all the data in it. That is actually faster for BigQuery to process, because of how it structures the data and how it generates results. What we'll upload is a denormalized dataset, not too large, just one that shows how you would structure data to make it really easy for BigQuery to consume and answer your queries. Here's an example: in a normal database you'd have two tables, one for people and one for addresses, whereas denormalized it would just be a single table. I call these flat files; other people call them wide datasets; call them what you like, that's the idea. You can also supply the data in JSON, actually nested, which is closer to how a lot of data is represented on the web.

Now that data preparation is covered, let's look at the data itself and actually work with it. Over in Excel I have a dataset of sales for a store, with all the different fields denormalized. It started life as a normalized dataset, and I joined the different tables to get everything into one file: order information, containers, categories, locations, customer information, shipping information, discounts, profitability, and the region the order was placed in. It's not an extremely large dataset. We'll save it as a CSV and upload it to Google Cloud Storage; later we'll also look at how to load data using Python. Back in my Google Developers Console, I add a storage bucket, call it sales-data, and upload the file; that's it, the upload is there.

To load the data into BigQuery we have a couple of options. The more programmatic way is to write a job, usually in Python, Java, or something like that, and load it with that. On the docs page on loading and downloading data in BigQuery, there is Python code that will take the CSV we just uploaded to Google Cloud Storage and load it into BigQuery. It's pretty basic: a little function with a try/except, where you give it the configuration of your project, the CSV location, the fields and the datatypes associated with them, and it then polls the load job, waits for it to be done, and prints out the results. That is the more enterprise-grade way of managing it; if you know Python, go ahead and do that, and if not, no problem, there are courses that can get you up to speed. For this course we'll keep it big-picture and load manually.

To do it manually: we just uploaded the file to a storage bucket, so now let's put it into BigQuery. In BigQuery I've created a new dataset called superstore, and within superstore I need to actually add a table, so I click the little + next to it and it starts Create and Import. I give it the table name superstore_sales. I could upload a file directly from my machine, but I don't need to; I only need to point it at the Google Cloud Storage object we just set up. So rather than choosing a local file, I paste in the Google Storage path. To see what that should be, I click back to my Developers Console: I'm in my sales-data storage bucket, and here's my file name, so the path is simply the bucket, a slash, and the file name, along the lines of gs://sales-data/bigquery-sales-data.csv. Click Next, and now I have to specify the schema, just as the Python and Java examples are given a JSON array of field names and datatypes. We have datatypes of string, integer, float, and timestamp, and if you don't specify, strings are assumed.

Let me show a very simple way to create the schema for this dataset. Over in Excel, with the CSV I saved earlier, I copy the row of headers into a new workbook and transpose them. BigQuery doesn't like things like spaces or dashes in field names, so I do a find-and-replace: find a space, replace it with nothing (or with an underscore if you want to keep separators), and the same with the dashes. Now all my column names are in camel case. Next I add the datatypes, populating them as best I can. With the datatypes defined, I need to produce the expected format, which is the field name, a colon, the datatype, and commas between the pairs. This is the kind of thing I like to use a little Excel formula for; Excel will generate the code fragments you need. I take the name column, concatenate the colon and the datatype, and fill that down; now every field has its datatype. Then I flatten these into a single comma-separated list with a handy trick: in a new column, take the top pair, then concatenate the running value with a comma and the next pair, and fill down. The last cell now contains every field with its datatype, in the same order as my data. I jump back over to BigQuery, paste this into the schema box, and click Next. In the advanced options I set it to skip one header row, accept the rest of the defaults, and submit, and it loads the data into BigQuery. Once it has loaded, you can see the schema, which matches what I pasted; I can describe each field or highlight one, and in the details I can see a preview of the data, with everything formatted nice and neat, just like my CSV. So that was the manual way of doing it. As I said, the docs on loading data into BigQuery provide examples of how to do it with Python in a more automated fashion, plus all the nitty-gritty details about formatting your data for upload.
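Both loading paths described above, building the name:type schema string from the CSV headers and submitting a load job programmatically, can be sketched in Python. The field names and the gs:// path follow the demo; the load function is a sketch against the google-cloud-bigquery client library, not the exact code from Google's docs page.

```python
import re

def schema_string(headers, types):
    """Build BigQuery's 'name:type,name:type,...' schema string from CSV
    headers, stripping the spaces and dashes BigQuery rejects. This replaces
    the Excel transpose / find-and-replace trick shown in the demo."""
    cleaned = [re.sub(r"[ \-]", "", h) for h in headers]
    return ",".join("%s:%s" % (n, t) for n, t in zip(cleaned, types))

# A few of the superstore fields as an example:
schema = schema_string(
    ["Order Date", "Product Category", "Order Quantity", "Profit"],
    ["timestamp", "string", "integer", "float"],
)
# -> "OrderDate:timestamp,ProductCategory:string,OrderQuantity:integer,Profit:float"

def load_csv(project_id, dataset, table, gcs_uri, schema):
    """Submit a load job for a CSV already sitting in Cloud Storage.
    Requires: pip install google-cloud-bigquery, billing enabled."""
    from google.cloud import bigquery  # lazy import: external dependency
    client = bigquery.Client(project=project_id)
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # the header row, as in the manual import
        schema=[bigquery.SchemaField(*pair.split(":"))
                for pair in schema.split(",")],
    )
    job = client.load_table_from_uri(
        gcs_uri, "%s.%s" % (dataset, table), job_config=job_config)
    job.result()  # poll until the load job finishes

# load_csv("my-project-id", "superstore", "superstore_sales",
#          "gs://sales-data/bigquery-sales-data.csv", schema)
```

The same schema string works in both places: pasted into the web console's schema box, or split into SchemaField objects for the programmatic job.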

Demo Web SQL

Now that we've got data loaded into BigQuery, we can start querying it directly from the web. This is one of the neat features: instead of needing a desktop client that has to be configured, or some third-party tool, you can go straight to the website and run your queries. Back in BigQuery, I click on Query Table, and it gives me the boilerplate I need to start writing a query. I want to do something like select the order date and the average order quantity from superstore sales; I remove the limit, add a condition where the state equals Washington, and group by the order date. I click Run Query, and it gives me my results, grouped by order date with the average order quantity. You can see some interesting metadata here: the query completed in 2.8 seconds and processed 219 KB. On bigger datasets, and these can easily be well into the gigabytes, it is still able to run in virtually the same time, which is really the power of BigQuery. Keep in mind, though, that we are being charged for our query usage. If we are querying massive datasets all the time but find that we really only ever query a handful of aggregations of them, it may be more beneficial from a cost standpoint to write those aggregates out to separate tables, and you can do that by creating views. It's quite easy to create a view out of the query you just ran: click Save View right here and give it a name. So one of the things you'll be able to do is create lots of aggregations, small chunks of the data, or cuts of data you use often, and save on querying costs you would otherwise incur. There are some other options here to consider as well. There are query priorities, interactive versus batch, which are priced differently. There is the destination table option: we can write a result set out to another table, which is how you materialize aggregates once a day, say. It can also be a cheaper, supported path compared to paying to export and download data you've already uploaded. Either way, the way you build all of this from the web inside BigQuery is to keep creating new tables and views.
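The query built in the web console above looks roughly like the SQL below (field and table names follow the demo schema). The bytes-processed figure the console reports can also be checked programmatically with a dry run before a query is ever billed; the helper assumes the google-cloud-bigquery client library.

```python
# The aggregation built in the web console, as SQL. Field and table names
# follow the demo dataset and are illustrative.
QUERY = """
SELECT OrderDate, AVG(OrderQuantity) AS avg_order_quantity
FROM superstore.superstore_sales
WHERE State = 'Washington'
GROUP BY OrderDate
"""

def bytes_processed(project_id, sql):
    """Dry-run a query to see how many bytes it would process (and therefore
    roughly what it would cost) without actually running or billing it.
    Requires: pip install google-cloud-bigquery."""
    from google.cloud import bigquery  # lazy import: external dependency
    client = bigquery.Client(project=project_id)
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=job_config)
    return job.total_bytes_processed

# print(bytes_processed("my-project-id", QUERY))  # ~219 KB for the demo table
```

Since billing is based on bytes processed rather than rows returned, a dry run like this is the programmatic equivalent of the figure the web console shows next to each query.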

Getting Data via API

Let's take a brief look at how to get data from Google BigQuery using the APIs. Like the pages we've been looking at, the docs have a great page on querying data, and you can go through it in detail to see how to actually execute queries and what a synchronous job looks like. Partway down the page you get to the synchronous query examples, which give you code, in my case Python, showing how to run a query, and you can see that with these examples you can run queries, save the results to a table, overwrite tables, and page through result sets. So with this you have essentially API access to the data I just uploaded. You can do asynchronous queries as well; there are examples of those too. You can also go to the client libraries page and see all the different libraries and APIs that BigQuery supports; it's pretty robust, and you can hit it directly from JavaScript, Python, and so on. There are other ways to access it as well: you can do it from Google products, for example using BigQuery from Google Docs if you want. So there are a lot of different options for hitting this. You will need to authorize, which uses OAuth. When you want to hit the API, you first need to authorize, just as you do on the web SQL console: you do it with your Google login, and once you have, your calls are authorized.

Demo Getting Data With Tableau

Let's take a look at using Tableau Desktop to get some of our data. Tableau is one of the most popular BI tools out there, and it has native connections to a lot of cloud data sources: salesforce.com, Google Analytics, and BigQuery included. If you don't have Tableau, the first thing to do is download it from Tableau Software; you can go and get a free trial, and then you should be able to install it. It is available for Windows and Mac. In my case I'm going to open it up and run it on the Mac, but the process is very similar, in fact almost identical, on Windows. What you do first is click Connect to Data, and from here I'm going to choose Google BigQuery; the connection dialog lists all the data sources Tableau can connect to, local files as well as cloud-based ones. I click Google BigQuery, I go ahead and log in (this is basically Tableau asking me to authorize via OAuth), and from there it asks me to select a project. I select my project and the superstore dataset that we just uploaded, and I can drag the table over to my tables window; if I click Update Now, it shows me all the field names and a preview of the data from BigQuery. So what Tableau did there: it actually went out to BigQuery, grabbed a sample of data, and pulled it into my preview window. From here I go to a worksheet, and what Tableau does is categorize all these different fields. It read the datatypes, and, based on the names, it gave some of these fields added properties: city, state, and ZIP code, for instance, are recognized as geographic fields and get a geolocation role. It also categorized them as dimensions and measures. On top you have dimensions, which are the categorical-type fields, the fields we would normally slice and dice by; below those are the measures, the ones we actually aggregate, as in looking at sales over time, for example.

To start, I'll show a quick example here, and then I'll show you how it works with a much larger dataset. On the superstore data we just uploaded, I take product category and drag it over to rows, and sales to columns, so I'm now looking at sales by product category. On its own it gives me just the numbers, but with Show Me, Tableau can turn that into a visual: product category versus sales as a bar chart. Next let me do something like color by profit: now I can see at a glance which categories are quite profitable and which, despite decent sales, are not as profitable. Then I want to break this down by region, so I drag region onto the shelf, and say I really only care about the South and West regions: I keep only those, and in a couple of clicks I'm slicing and dicing my data like that. So that's a basic report using our sales data, and every time I do this, Tableau is making a query out to BigQuery, asking for data using the API, and getting it back. This is the power of what you can do with that API, as well as how easy Tableau is to use to visualize data.

Now let's take a look at a much larger dataset: the natality dataset, which is already embedded in BigQuery. I open a new tab, a new data source, say Data, Connect to Data, back to BigQuery. I log in (I don't have to authorize it again), select the public data project, look at the available tables, and drag the natality dataset over, then go to a worksheet. On the first one, to see how many rows I was looking at, I double-clicked Number of Records on the superstore data: about 8,300 to 8,400 rows. Here on the natality data, if I double-click Number of Records, it is actually around 137 million rows, so we're dealing with a much larger dataset; if you recall, it was about 20 GB worth. Even so, bringing that count in took around a second: Tableau went out to BigQuery, used the API, ran the query, and drew a chart on-screen from it. Let's see how responsive it stays as we get more in depth. First I drag year onto columns: almost instantly it drew the number of records in the dataset per year, how many births there had been, as a trend over time. I didn't ask it to draw a line; Tableau wants to be smart about how it helps you analyze your data. We as humans are very visual creatures, and it is so ingrained in us that it makes more sense to look at data in visual form than in any other way. So what Tableau tries to do, with Show Me, is use an engine built into it that guesses the correct way to look at the data you've chosen. Here it knows that when you plot a measure, in this case Number of Records, over a year, over time, then all the best practices, all the literature, all the neuroscience research says you should look at that in a line form, because our brains are better at understanding it that way. Let me show you what happens with a few more steps: I filter down to a specific state, say California, and let's see what happens here. Now I want to see the weight: instead of the number of records I drag weight in pounds in, so with one small change it's actually pulling in new data live from BigQuery. Then I change it from a sum to an average to get something more interesting: now I'm seeing, for California, year by year, basically the average weight in pounds. Next I apply a quick table calculation, a percent-difference change (I won't go into the details here, but it gives a better view of what's happening), and, what's even cooler, if I hold Control (on Windows; Command on the Mac) and drag that onto color, you can see the change over time. The beauty of this is that that data, which is pretty large, 20 GB and around 140 million rows, is something we're interacting with and visualizing on the fly; we're hitting BigQuery over HTTP, and it is letting us really answer questions and explore the data. The whole concept of having to manage servers, optimize queries, index tables, aggregate data properly, all that stuff, even how to write good SQL in this example, is completely irrelevant. It has completely changed the game of how we can use data to help ourselves make better decisions in business, to understand things in the public sector, and to really gain a new understanding of how the world works. Of course, at the end of the road, once I have something that might be a nice visual solution to a problem, I want to share it, and that's what we'll talk about in our next module: how to create a Tableau Server environment, hosted in the cloud, connect it to BigQuery, do analysis like this, and make it available to all our users. That brings us to the end of Google BigQuery in this module.

Recap

During this module we talked about data warehousing as a service, and we really dug into Google BigQuery as a data-warehouse-as-a-service platform. Along the way we looked at several different ways of using it. We have the APIs, which you can hit directly from code in languages like Python and Java. We looked at the web SQL console, which lets us issue SQL directly in the website and get results, along with table previews and management tasks. Then we moved on to BI tools: we used Tableau and saw how it can natively connect to BigQuery and pull data out without writing any code, and extremely fast. We were looking at a roughly 20 GB dataset that we were able to ask questions of with hardly any delay, under a second every time, all being extracted from BigQuery down to my desktop. Next we'll dive in and really understand how we can do the same kind of thing with the rest of our BI infrastructure.

