To start, first download the assignment: timeusage.zip. For this assignment, you also need to download the data (164 MB):
and place it in the folder
in your project directory.
The dataset is provided by Kaggle and is documented here:
The file uses the comma-separated values format: the first line is a header defining the field names of each column, and every following line contains an information record, which is itself made of several columns. It contains information about how people spend their time (e.g., sleeping, eating, working, etc.).
Here are the first lines of the dataset:
Our goal is to identify three groups of activities:
And then to observe how do people allocate their time between these three kinds of activities, and if we can see differences between men and women, employed and unemployed people, and young (less than 22 years old), active (between 22 and 55 years old) and elder people.
At the end of the assignment we will be able to answer the following questions based on the dataset:
To achieve this, we will first read the dataset with Spark, transform it into an intermediate dataset which will be easier to work with for our use case, and finally compute information that will answer the above questions.
The simplest way to create a
DataFrame consists in reading a file and letting Spark-sql infer
the underlying schema. However this approach does not work well with CSV files, because the
inferred column types are always
In our case, the first column contains a
String value identifying the respondent but all the
other columns contain numeric values. Since this schema will not be correctly inferred by
Spark-sql, we will define it programmatically. However, the number of columns is huge.
So, instead of manually enumerating all the columns we can rely on the fact that, in the
CSV file, the first line contains the name of all the columns of the dataset.
Our first task consists in turning this first line into a Spark-sql
This is the purpose of the
dfSchema method. This method returns a
the schema of the CSV file, where the first column has type
StringType and all the others
DoubleType. None of these columns are nullable.
def dfSchema(columnNames: List[String]): StructType
The second step is to be able to effectively read the CSV file is to turn each line into
containing columns that match the schema returned by
dfSchema. That’s the job of the
def row(line: List[String]): Row
As you probably noticed, the initial dataset contains lots of information that we don’t need to answer our questions, and even the columns that contain useful information are too detailed. For instance, we are not interested in the exact age of each respondent, but just whether she was “young”, “active” or “elder”.
Also, the time spent on each activity is very detailed (there are more than 50 reported activities). Again, we don’t need this level of detail; we are only interested in three activities: primary needs, work and other.
So, with this initial dataset it would a bit hard to express the queries that would give us the answers we are looking for.
The second part of this assignment consists in transforming the initial dataset into a format that will be easier to work with.
A first step in this direction is to identify which columns are related to the same activity. Based on the description of the activity corresponding to each column (given in this document), we deduce the following rules:
Then our work consists in implementing the
classifiedColumns method, which classifies the given
list of column names into three
groups (primary needs, work or other). This method should return a triplet containing the
“primary needs” columns list, the “work” columns list and the “other” columns list.
def classifiedColumns(columnNames: List[String]): (List[Column], List[Column], List[Column])
The second step is to implement the
timeUsageSummary method, which projects the detailed
dataset into a summarized dataset. This summary will contain only 6 columns: the working status
of the respondent, his sex, his age, the amount of daily hours spent on primary needs activities,
the amount of daily hours spent on working and the amount of daily hours spent on other activities.
def timeUsageSummary( primaryNeedsColumns: List[Column], workColumns: List[Column], otherColumns: List[Column], df: DataFrame ): DataFrame
Each “activity column” will contain the sum of the columns related to the same activity of the initial dataset. Note that time amounts are given in minutes in the initial dataset, whereas in our resulting dataset we want them to be in hours.
The columns describing the work status, the sex and the age, will contain simplified information compared to the initial dataset.
Last, people that are not employable will be filtered out of the resulting dataset.
The comment on top of the
timeUsageSummary method will give you more specific information
about what is expected in each column.
Finally, we want to compare the average time spent on each activity, for all the combinations of working status, sex and age.
We will implement the
timeUsageGrouped method which computes the
average number of hours spent on each activity, grouped by working status (employed
or unemployed), sex and age (young, active or elder), and also ordered by working status,
sex and age. The values will be rounded to the nearest tenth.
def timeUsageGrouped(summed: DataFrame): DataFrame
Now you can run the project and see what the final
What do you see when you compare elderly men versus elderly women’s time usage?
How much time elder people allocate to leisure compared to active people?
How much time do active employed people spend to work?
We can also implement the
timeUsageGrouped method by using a plain SQL query instead
DataFrame API. Note that sometimes using the programmatic API to build
queries is a lot easier than writing a plain SQL query. If you do not have experience
with SQL, you might find these examples
def timeUsageGroupedSqlQuery(viewName: String): String
Can you think of a previous query that would have been a nightmare to write in plain SQL?
Finally, in the last part of this assignment we will explore yet another
alternative way to express queries: using typed
Datasets instead of untyped
def timeUsageSummaryTyped(timeUsageSummaryDf: DataFrame): Dataset[TimeUsageRow]
timeUsageSummaryTyped method to convert a
timeUsageSummary into a
TimeUsageRow is a data type that models the content of a row
of a summarized dataset. To achieve the conversion you might want
to use the
getAs method of
Row. This method retrieves a
named column of the row and attempts to cast its value to
a given type.
def timeUsageGroupedTyped(summed: Dataset[TimeUsageRow]): Dataset[TimeUsageRow]
Then, implement the
timeUsageGroupedTyped method that performs the
same query as
timeUsageGrouped but uses typed APIs as much as possible.
Note that not all the operations have a typed equivalent.
round is an
example of operations that has no typed equivalent: it will return a
that you will have to turn into a
TypedColumn by calling
Another example is
orderBy, which has no typed equivalent. Make sure
Dataset has a schema because this operation requires one (column
names are generally lost when using typed transformations).
This exercise includes no real tests. We will run your submission against a large test suite to check for correctness. So if you want full points, be sure to add your own tests to double check that you’ve caught all corner cases.
You can also style check your code by running
styleCheck in the sbt
You will be graded out of 10 points. 2 points are awarded for style; don’t
submit code written in a Java-esque style. Again, you can also style check your
code by running
styleCheck in the sbt shell. 8 points are awarded for
correctness of your solution.
To submit your assignment, simply zip the entire
timeusage directory and upload
timeusage.zip to Blackboard.