Weather Patterns Using Hadoop and R – Part 2

The first part of this post looked at data mining of world weather patterns such as cloud location and cloud coverage levels.

As these features describe aggregate values and ratios they are quite simple to evaluate with Hadoop, R and MapReduce Streaming. However for time dependant analytics, such as how long rainfall happens, additional issues need to be addressed.

Time Dependent Data Mining

Time based weather analytics focuses on weather at any location as time passes. This is done by extracting data using R so that Hadoop sorts data in the shuffle phase by weather station then time. However this is quite inefficient as, for any station, data is scattered across every file in the dataset.

Filtering down to 6 stations spread across the globe significantly reduces the data load for Hadoop’s shuffle phase, while there remains sufficient data to look at world weather and regional patterns.

The 6 stations are chosen across the globe:

  • Arctic and Antarctic – Danmarkshavn,Greenland and Base Belgrano II, an Argentinian run Antarctic weather station.
  • Mid Latitudes – Dublin Airport, Ireland and Christchurch, New Zealand.
  • Tropics – Khartoum, Sudan and La Paz, Bolivia.

The following R script brings this all together to form the Mapper:

# Mapper script
#! /usr/bin/env Rscript
stations <- c(" 4320", #Danmarkshavn
              " 3969", #Dublin Airport
              "62721", #Khartoum
              "85201", #La Paz
              "93890", #Christchurch
              "89674") #Base Belgrano II

#days in standard 365 day year
months <- c(31,28,31,30,31,30,31,31,30,31,30,31)
#days in year
#from 1981 to 1991, inclusive
years <- c(365,365,365,366,365,365,365,366,365,365,365)

con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0)
     | substr(line,1,2)=="88")
     months[2] <- 29
    } else {
     months[2] <- 28

 #land station synoptic code
#time in number of hours since Jan 1, 1981 00:00
 #precipitation level

Using this script with an identity reducer data mining can be performed from the output in R, outside Hadoop. Note that a map-only job (achieved with the command line argument -D mapred.reduce.tasks=0) won’t shuffle the data, so an identity reducer is necessary for this task.


When Precipitation Happens

As noted in the previous post, nimbostratus always occurs with precipitation. For stratus, 45% of the time this cloud appears precipitation develops before either the stratus disappears or 24 hours have passed. With other cloud types this rate ranges from 5% to 26%.

As mentioned before, high-level cloud types are not precipitation inducing so any data mining results relating to precipitation are meaningless[1].

How Long Precipitation Lasts

Within the 6 chosen stations nimbostratus precipitation lasts for up to 6 days, corresponding with the expectation that precipitation “is generally steady and prolonged”[1]. With stratus, precipitation can also last for up to 6 days, with cirrostratus for up to 4.5 days while for other cloud types the maximum is 3 days.

Cloud Development Patterns

Nimbostratus – follows altocumulus slightly more often than altostratus. May also form following occurrence of cirrus cloud cover and is roughly equally as likely to develop into altocumulus as altostratus.

Cirrus – at the 6 stations often develops into cirrostratus. Cirrostratus is known to form from “spreading and joining of cirrus clouds”[1].

Cumulus – at the chosen stations usually develops into stratocumulus or cumulonimbus. Indeed cumulonimbus is known to form from “upwardly mobile cumulus congestus clouds (thermals)”[1].

Stratocumulus – 31% of occurrences are followed by altocumulus. Indeed
altocumulus can form “by subdivision of a layer of stratocumulus”[2]. 27% of stratocumulus occurrences are followed by cumulus. Indeed “cumulus may originate from … stratocumulus”[3]. Also, 33% of stratocumulus occurrences are followed by stratus. This is line with the expectation that “stratus may develop from stratocumulus”[4].

Stratus – is more likely to develop into stratocumulus than no low cloud cover.


Again, data mining results compare quite well with established cloud facts and world weather satellite data, albeit this part of the study restricts itself to 6 land based locations.

The set of stations can be changed so that the trade-off between accuracy and time running Hadoop is optimal while various data analytics, such as regional weather effects, can be extracted. The main drawback is that this data mining involves re-running the MapReduce job each time the station list is changed.


  1. ^ a b c Common Cloud Names, Shapes, and Altitudes
  2. ^ Altocumulus
  3. ^ Cumulus
  4. ^ Stratus

Weather Patterns Using Hadoop and R – Part 1

In the post Cloud Weather Reporting with R and Hadoop the meaning of synoptic cloud data was not explored. This post looks at cloud patterns and their impact on world weather.

The cloud weather analytics covered mainly comprises aggregate frequency data (e.g. cloud location, coverage and precipitation types) and time series data (e.g. when precipitation happens and for how long). This post focusses on frequency data analytics, while the next post will focus on time dependant analytics.

Frequency Based Analytics

Frequency based properties are evaluated through MapReduce by extracting relevant weather data using the following R script as Mapper:

# Mapper script
#! /usr/bin/env Rscript

con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0)
 #output latitude, cloud types and precipitation level
 #latitude is coded as 0 for tropical, 1 for mid-latitudes, 2 for polar regions

The Reducer script from a previous post gives counts for each combination of cloud type and precipitation level, from which most frequency based weather analytics can be derived. For example, stratocumulus frequency is the aggregate count of where the low cloud type code is 4,5 or 8, divided by the total number of reports in the dataset.

Ideally from a data mining perspective, error prone work such as interpretation of WMO codes and further data analytics can be done using an R script and, as this is outside Hadoop, can be tested and amended quickly.


This data mining reveals a few properties of world weather, in addition to a few local weather patterns.

Location of Clouds

Nimbostratus is “common in middle latitudes”[1] – in the dataset 85% of nimbostratus occurrences are mid-latitude. Cumulonimbus is “common in tropics and temperate regions, rare at poles”[1], somewhat reflected in the dataset where 23% of occurrences of cumulonimbus are in tropical regions and 4% are in polar regions.

Likelihood of Precipitation

100% of occurrences of nimbostratus occur with precipitation, as WMO codes define precipitation bearing altostratus or altocumulus to be nimbostratus. Indeed nimbostratus is derived from ‘nimbus’, Latin for rain.

Conversely, high cloud types (cirrus, cirrostratus and cirrocumulus) generate no precipitation[1] so any perceived correlation with precipitation levels is meaningless. For remaining cloud types, precipitation occurrence levels range from 3% for cumulus to 38% for stratus. Indeed cumulus is associated with “generally no precipitation”[1].

Types of Precipitation

Cumulonimbus typically results in “Heavy downpours, hail”[1] – in the dataset 91% of precipitation associated with cumulonimbus is in the thunderstorm range. Indeed in the tropics, cumulonimbus often forms from hot weather through convection during monsoon season.

Conversely, nimbostratus is rarely seen with thunderstorms, more with “moderate to heavy rain or snow”[1] – 54% of nimbostratus occurrences happen with rain. Similarly, stratocumulus is usually accompanied by “occasional light rain, snow”[1] – 47% of cumulonimbus occurrences happen with rain.

Cloud Embedding

Cumulonimbus can be embedded with nimbostratus, and in such instances may result in heavy rain, but this is rare[2]. Indeed, further data mining reveals that only 5.9% of the occurrences of nimbostratus show this embedding.

Cloud Coverage Levels

The most significant cloud coverage comes from cirrus, stratocumulus and cumulus. For the dataset cirrus cloud cover is 22%, stratocumulus cloud cover is 24% and altocumulus cloud cover is 25%. This compares quite well to satellite data of 20 to 25% for cirrus[3], 25% for stratocumulus[2] and 20 to 25% for altocumulus[4].


The data mining results compare quite well with established cloud facts and world weather satellite data, particularly on location and coverage of clouds.

In the next post more complex data mining is presented i.e. time dependant analytics. For instance how to determine which clouds cause most prolonged rainfall? And how is this analytics done efficiently, given the amount of re-sorting of data involved?


  1. ^ a b c d e f Common Cloud Names, Shapes, and Altitudes
  2. ^ a Cloudwatching
  3. ^ Cirrus Clouds and Climate
  4. ^ Comparisons and analyses of aircraft and satellite observations for wintertime mixed-phase clouds

Evaluating Film User Behaviour with Hive

Part 1 – Introduction to Hive

Hive is a data warehouse system built on top of Hadoop, allowing queries to be run through Hive Query Language (HiveQL), a language similar to SQL. Hive is best used where data structures are rigid and flat and where operations such as JOINs are needed i.e. operations that are simple in HiveQL, but very complex in standard Java MapReduce or Streaming.

In Hive, since you are using HDFS, large datasets can be manipulated across many nodes, while viewing data as a table and creating queries as if it were a database.

Issues with Hive

Hive is slowed by time spent communicating with MapReduce. However, many solutions exist which do not use MapReduce, such as Shark and Impala[1].

Shark is fault tolerant like Hadoop – when one node fails, data is moved to another node so the whole query doesn’t fail. This is very useful for large datasets, where node failure is likely and queries take time.

Simple Hive Example

Take a simple file demoTable.txt which lists 4 famous Spanish League footballers of the 1950s/1960s:

PlayerID   PlayerName  Birth
1   Alfredo di Stefano  1926
2   Luis Suarez 1935
3   Ferenc Puskas   1927
4   Zoltan Czibor   1929

First create a table with a schema and load some data from HDFS into Hive:

CREATE TABLE demoTable (ID INT, PlayerName STRING, Birth INT)

Now for a simple query to find players born before 1930:

SELECT PlayerName
FROM demoTable
WHERE Birth<1930;

This results in the following output, which can be found in the new directory demoResults:

Alfredo di Stefano
Ferenc Puskas
Zoltan Czibor

Key Differences between SQL and HiveQL

Hortonworks’ cheat sheet covers key differences between SQL and HiveQL as well as code for basic tasks[2].

One notable example is that JOIN conditions have to be exact (unlike in SQL), however there are various ways around this e.g. WHERE clauses or CROSS JOINs can be used instead. Also, HiveQL is constantly being updated to remove as many of these restrictions as possible.

Part 2 – MovieLens Dataset

Lets look at the University of Minnesota’s MovieLens dataset[3] and the “10M” dataset, which has 10,000,054 ratings and 95,580 tags applied to 10,681 movies by 71,567 users of the online movie recommender service MovieLens. Not all users provided both ratings and tags – 69,878 rated films (at least 20 each), while only 4,016 applied tags to films.

Format of Data

Original data is contained in 3 files – ratings.dat, tags.dat, movies.dat. For the following observations, a file genres.dat was added to help look more closely at individual genres.

Filename Data Structure Example
ratings.dat UserID::MovieID::Rating::Timestamp “1::122::5.0::838985046” represents User #1’s rating of 5.0 for Movie #122, which is 5.0. Using the file movies.dat we see that Movie #122 is the film “Boomerang” made in 1992.
movies.dat MovieID::Title::Genres The line for film 4973 is “4973::Amélie (Fabuleux destin d’Amélie Poulain, Le) (2001)::Comedy|Romance”, which is self-explanatory.
tags.dat UserID::MovieID::Tag::Timestamp The corresponding tag for User #15 is the line “15::4973::excellent!::1215184630” representing User #15’s tag of “excellent!” for Amélie
genres.dat Genre Alphabetic list of 18 genres from Action to Western.

The above format clearly suits the use of Hive and HiveQL. Firstly we have one wrinkle – fields are separated by a “::”, not recognised by Hive as a delimiter. However, with a short script in Python or R we can replace all occurrences of “::” with a tab (“\t”) in all input files.

Now we look at what’s actually in the files to see if there’s any issues. The following code shows the first 10 lines for movies.dat:

CREATE TABLE movies (MovieID INT, Title STRING, Genres String)

SELECT * FROM movies LIMIT 10;

Which gives the following output:

1  Toy Story (1995)    Adventure|Animation|Children|Comedy|Fantasy
2   Jumanji (1995)  Adventure|Children|Fantasy
3   Grumpier Old Men (1995) Comedy|Romance
4   Waiting to Exhale (1995)    Comedy|Drama|Romance
5   Father of the Bride Part II (1995)  Comedy
6   Heat (1995) Action|Crime|Thriller
7   Sabrina (1995)  Comedy|Romance
8   Tom and Huck (1995) Adventure|Children
9   Sudden Death (1995) Action
10  GoldenEye (1995)    Action|Adventure|Thriller

As can be seen above, the title includes the film year e.g. the title for Toy Story is “Toy Story (1995)”, so it’s worth parsing the year into a separate record. Also this film has multiple genres, so we should use the Hive instr function to look at individual genres. No other issues are found in the other tables, tags or ratings.

The table full_data presented below, brings together all key components of the 2 tables ratings and movies, including film year which is extracted using Hive’s substr function:

CREATE TABLE full_data (UserID INT, Title STRING, Year INT, 
           Genres STRING, Rating DOUBLE);
SELECT r.UserID,substr(m.Title,1,length(m.Title)-7),
       substr(m.Title,length(m.Title)-4,4), m.Genres,r.Rating
FROM ratings r
CROSS JOIN movies m
ON r.MovieID=m.MovieID;

This allows for us to run queries using Hive’s aggregate functions (e.g. avg, min, max, count) and make key observations that have an impact on marketing.

Note on Genres

As noted multiple genres appear in each film. To deal with this a JOIN of the tables full_data and genres using the condition instr(full_data.Genres, genres.genre) > 0 should work. However, pure JOINs have to be done with exact ON conditions. However, this is not so for CROSS JOINs:

SELECT g.genre,avg(f.rating)
FROM full_data f
CROSS JOIN genres g
WHERE instr(f.Genres,g.genre)>0
GROUP BY g.genre;

As CROSS JOINs produce more rows than a pure JOIN, this solution is quite inefficient. However Hive may be expanded in future to allow inexact conditions for pure JOINs.

Part 3 – Films

The following graphs show the average rating and number of films that were rated on an annual basis.

avgratingbyyear NumFilmsByYear

This shows a rating preference towards older movies – only 3 times before 1978 is it below 3.6 and these years (1915,1919 and 1926) are in the silent era, while it is only very occasionally above 3.6 from 1978 on.

This is the trend that is expected. It would appear, based on the number of films rated by year, that older movies are only watched if they have a reputation for being good, while there is a wider range of quality when it comes to modern movies. For instance, only 12% of the films were made before 1960, about halfway through the film range chronologically.


Below is a histogram of average film ratings, which shows a normal distribution spread from 1 to 5 with the centre somewhere between 3 and 4, and a Q-Q plot against normally distributed random variates.

Film_ratings_full Film_ratings_full-NORM_CHECK

The Q-Q plot shows this distribution similar to normal. This is the kind of distribution we expect, as the average rating is the average of many similarly distributed factors (e.g. overall quality, user interest, genre).

Part 4 – Users

Average Ratings by User

The following is a histogram of users’ average ratings, and a Q-Q plot which shows how close this is to a normal distribution.

UserID_avg_rating UserID_avg_rating-NORM_CHECK

Similar to the above case of average ratings by film, the distribution is shown to be similar to normal, where the average rating is the average of many similarly distributed factors (e.g. overall quality, user interest, genre).

Users’ Preferences by Year

The following 2 plots show the number of ratings by year of film and a frequency graph of the “average” year of user’s ratings, which give very similar distributions. The “average” year of a user’s ratings is defined as the mathematical average of the years of films they have rated, rounded to the nearest integer.

numfilmsbyyear UserID_avg_year

Both show a left skewed distribution, with 1995 being the year with most ratings and 1994 the most common user “average” year. This is the distribution we expect i.e. where most ratings are made on modern films but the number of ratings falls off as more immediately recent movies have not yet been viewed by everyone.

Users’ Preferences by Genre

Most users appear quite diverse in their tastes with 87% of users having rated films from 10 or more genres. However, many films come under multiple genres (e.g. The 1988 film “Who Framed Roger Rabbit?” comes under Adventure, Animation, Children, Comedy, Crime, Fantasy and Mystery) and only 37.5% of the films come under one genre.

Number of Ratings by User

The following is a frequency plot of the number of ratings.

The shape of this curve suggests that the number of users’ ratings follow a survival distribution. To check this we rephrase this as a survival problem and plot the function f, where f(n) is the number of users that have given at least n ratings:

Based on this curve, the number of films rated by a user appears to be indeed driven by a survival distribution. It would appear that the key driver behind the number of films rated by users is their interest in giving ratings which dies quite rapidly – while all 69,878 users rated at least 20 films, only 26,874 rated at least 100 movies and only 843 rated 1,000 movies.

Part 5 – Genres

The following scatter plots show the average rating and number of ratings by genre.

The most popular genre is Drama with 4.3 million films (43% of the dataset). While only 131,592 films fall under Film-Noir, this is the best rated film genre – perhaps linked to a propensity to rate films made before 1960 highly as 36.5% of film-noir films were made before 1960 (in contrast to 4.5% for Horror) the worst rated genre.

A closer look at these genres therefore appears prudent, particularly movie timings, as ratings for these genres would be expected to vary dramatically over time. Indeed the following graphs are the average ratings for each of these genres taken over 5 year intervals.

FilmNoir_avg_ratings_5years Horror_avg_ratings_5years

In short users appear to prefer older Horror movies to modern Horror, while the difference isn’t as dramatic with Film-Noir. As well as standard recommender techniques (e.g. collaborative filtering based on users’ favourite films) it may be worth marketing vintage films to users with preferences for Horror.


  1. ^ Shark: Real-time queries and analytics for big data, November 27, 2012
  2. ^ Simple Hive ‘Cheat Sheet’ for SQL Users, August 20, 2013
  3. ^ MovieLens Datasets