2. Prepare your solutions as Rmarkdown documents. For problems in vega-lite, copy your code into the .Rmd, making sure to enclose it within js blocks,
// your code here
so that the syntax is properly highlighted, e.g.,
// your code here
3. For vega-lite problems, include a screenshot of your result.
4. We give example figures below to guide your work. However, view these only as suggestions – we will keep an eye out for improvements over our plots.
5. Include two files in your submission, (a) a pdf of the compiled .Rmd file and (b) the original .Rmd file.
Two problems below will be graded according to the following criteria:
• Correctness: For plots, the displays meet the required specifications. For conceptual questions, all parts are accurately addressed.
• Attention to detail: Writing is clear and designs are elegant. For example, no superfluous marks are included, all axes are labeled, text is neither too small nor too large.
• Code quality (if applicable): Code is concise but readable, properly formatted and commented.
The remaining problems will be graded for completeness.
(1) Ikea Furniture
The dataset below shows prices of pieces of Ikea furniture. We will compare prices of different furniture categories and label the (relatively few) articles which cannot be bought online.
ikea <- read_csv("https://uwmadison.box.com/shared/static/iat31h1wjg7abhd2889cput7k264bdzd.csv")
a. Make a plot that shows the relationship between the category of furniture and price (on a log-scale). Show each item_id as a point – do not aggregate to boxplots or ridgelines – but make sure to jitter and adjust the size of the points to reduce the amount of overlap. Hint: use the geom_jitter layer.
b. Modify the plot in (a) so that categories are sorted from those with highest to lowest average prices.
c. Color points according to whether they can be purchased online. If they cannot be purchased online, add a text label giving the name of that item of furniture. An example result is given by Figure 1.
(2) Penguins
The data below measures properties of various Antarctic penguins.
penguins <- read_csv("https://uwmadison.box.com/shared/static/ijh7iipc9ect1jf0z8qa2n3j7dgem1gh.csv")
Using either vega-lite or ggplot2, create a single plot that makes it easy to answer both of these questions:
i) How is bill length related to bill depth within and across species?
ii) On which islands are which species found?
(Notice that the answer to part (i) is an example of Simpson’s paradox!)
(3) 2012 London Olympics
This exercise is similar to the Ikea furniture one, except that it will be interactive. The data at this link describes all participants in the London 2012 Olympics. From an Observable notebook, the following code can be used to derive a new, jittered version of the Age variable, which will be useful in part (a).
import { vl } from "@vega/vega-lite-api"
import { aq, op } from "@uwdata/arquero"
data_raw = aq.fromCSV(await FileAttachment("All London 2012 athletes - ALL ATHLETES.csv").text())
data = data_raw.derive({Age_: d => d.Age + 0.25 * Math.random() })
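The derive step above adds a uniform offset in [0, 0.25) to each age, so athletes who share the same integer age no longer sit exactly on top of one another. A minimal plain-JavaScript sketch of the same idea, outside arquero, may help when checking the result by hand. The jitter helper, its width parameter, and the sample rows are hypothetical and only illustrate the technique.

```javascript
// Hypothetical helper: add uniform noise in [0, width) to a numeric value.
function jitter(value, width = 0.25) {
  return value + width * Math.random();
}

// Made-up sample rows standing in for the athlete table (not real data).
const athletes = [
  { Name: "A", Age: 23 },
  { Name: "B", Age: 23 },
  { Name: "C", Age: 31 },
];

// Mirror of the derive step: keep every field and add a jittered Age_ field.
const jittered = athletes.map(d => ({ ...d, Age_: jitter(d.Age) }));

console.log(jittered);
```

Because the noise is never negative and stays below 0.25, every jittered value remains within a quarter of a year above the recorded age, so the overall shape of the age distribution is preserved.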
a. Create a layered display that shows (i) the ages of athletes across sports and (ii) the average age within each sport. Use different marks for participants and for averages. To avoid overplotting, use the jittered Age_ variable defined in the code block above.
b. Sort the sports from lowest to highest average age. Add a tooltip so that hovering over an athlete shows their name. Your results should look something like the display in Figure 1.

Figure 1: Example vega-lite result for Problem (3). Hovering over a tick mark shows the name of the athlete.
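The ordering asked for in part (b) depends on one number per sport: its average age. As a rough sketch of that reasoning in plain JavaScript, the per-sport means can be computed and then sorted. The helper names meanAgeBySport and sportsByMeanAge, the column name Sport, and the sample rows are assumptions made for illustration; in vega-lite the ordering would normally be requested through the sort property of the encoding rather than computed by hand.

```javascript
// Hypothetical helper: average Age for each Sport, returned as a Map.
function meanAgeBySport(rows) {
  const sums = new Map();
  for (const { Sport, Age } of rows) {
    const s = sums.get(Sport) || { total: 0, count: 0 };
    s.total += Age;
    s.count += 1;
    sums.set(Sport, s);
  }
  return new Map([...sums].map(([sport, s]) => [sport, s.total / s.count]));
}

// Hypothetical helper: sport names ordered from lowest to highest mean age.
function sportsByMeanAge(rows) {
  return [...meanAgeBySport(rows)]
    .sort((a, b) => a[1] - b[1])
    .map(([sport]) => sport);
}

// Made-up rows; the column name "Sport" is an assumption about the dataset.
const sampleRows = [
  { Sport: "Equestrian", Age: 40 },
  { Sport: "Equestrian", Age: 36 },
  { Sport: "Gymnastics", Age: 19 },
  { Sport: "Gymnastics", Age: 21 },
  { Sport: "Rowing", Age: 27 },
];

console.log(sportsByMeanAge(sampleRows));
```

Note that the means are taken over the recorded Age, not the jittered Age_, so the ordering does not change from one random draw to the next.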
(4) Traffic
In lecture, we looked at the geom_density_ridges function. In this exercise, we will instead use geom_ridgeline, which is useful whenever the heights of the ridges have been computed in advance. We will use the traffic data read in below.
traffic <- read_csv("https://uwmadison.box.com/shared/static/x0mp3rhhic78vufsxtgrwencchmghbdf.csv")
Each row is a timepoint of traffic within a city in Germany. Using geom_ridgeline, make a plot of traffic over time, within each of the cities. An example result is shown below.
(5) Language Learning
This problem will look at a simplified version of the data from the study A critical period for second language acquisition: Evidence from 2/3 million English speakers, which measured the effect of the age of initial language learning on performance in grammar quizzes. We have downloaded the raw data from the supplementary material and reduced it down to the averages and standard deviations of test scores within (initial learning age) × (current age-group) combinations. We have kept a column n showing how many participants were used to compute the associated statistics. The resulting data are available here.
a) Using the .derive() command in arquero, create two new fields, low and high, giving confidence intervals for the means in each row. That is, derive new variables according to x̂ .
b) Create a markArea-based ribbon plot showing confidence intervals for average test scores as a function of starting age. Include a line for the average score within that combination. An example result is shown in Figure 2. Interpret the results of the study.
Figure 2: Example result for problem (5).
(6) Deconstruction
Take a static screenshot from any of the visualizations in this article, and deconstruct its associated visual encodings.
a) What do you think was the underlying data behind the current view? What were the rows, and what were the columns?
b) What were the data types of each of the columns?
c) What encodings were used? How are properties of marks on the page derived from the underlying abstract data?
d) Is multi-view composition being used? If so, how