Tuesday, May 20, 2014

Cooking: New Foods

I'm not exactly a novice chef: I've roasted a goose, made my own duck confit, and done a flambé. But over the past year, there have been several brand new things I've made or worked with that seem like Serious Gaps in my experience. When I say "Serious Gap," I don't think I am overstating the issue; the two that spring immediately to mind are plain white rice and avocados.

I'm not sure I have a huge point here, other than that even seeming expertise can sometimes mask a weird fundamental flaw.

Friday, April 18, 2014

Statnet & "foreign" Vertex IDs

It's common that your network comes from an external source, and the vertices already have unique identifiers. Unfortunately, if I am importing the network into Statnet using edgelist format, Statnet requires the vertex identifiers to be sequential integers, beginning at one. This requires me to do a bit of pre-processing of the data, which is not always desirable: it requires a look-up step if I want to figure out who a particular vertex refers to, and makes attaching additional attributes more complex.
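For reference, the pre-processing step described above can be sketched in base R. This is only an illustration: the names foreign.edgelist, id.lookup and seq.edgelist are made up for the example, and the renumbering starts at 1 because that is how R indexes vectors (adjust the offset if your import expects something else).

```r
# Hypothetical edge list with "foreign" numeric vertex IDs (tail, head)
foreign.edgelist <- rbind(c(3, 4), c(4, 6), c(6, 8), c(6, 9))

# Look-up table: position i holds the foreign ID of sequential vertex i
id.lookup <- sort(unique(as.vector(foreign.edgelist)))

# Renumber every endpoint to its position in the look-up table.
# match() works on the matrix in column-major order, so matrix()
# with ncol = 2 restores the original tail/head layout.
seq.edgelist <- matrix(match(foreign.edgelist, id.lookup), ncol = 2)

# The look-up step: which foreign IDs does the second edge connect?
second.edge <- id.lookup[seq.edgelist[2, ]]
```

The cost is visible here: every question about "who is vertex 3?" has to go back through id.lookup, and any attribute table has to be re-keyed the same way.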

However, two "lightly documented" features can help out here. First, statnet already accepts textual vertex names in the standard network constructors. So, if the vertices already have (unique) text names, they can be used with no additional work:

text.edgelist <- rbind(c("tom_", "dick7"), c("dick7", "harry"),
   c("harry", "jill"), c("harry", "jane"))
n_alpha <- as.network(text.edgelist, matrix.type='edgelist', directed=F)
plot(n_alpha, displaylabels=T)
If the identifiers are numbers, the simple approach then is to convert the numeric vertex identifiers to a string (of numbers), and use that string as the input to as.network:

num.edgelist <- rbind(c(3, 4), c(4, 6),
   c(6, 8), c(6, 9))

# Map the numeric edge list to characters
char.edgelist <- apply(num.edgelist, c(1,2), as.character)
n_alpha <- as.network(char.edgelist, matrix.type='edgelist', directed=F)
plot(n_alpha, displaylabels=T)
This approach works, and depending on data and goals, may be the easiest approach. There's also a second feature in statnet that can solve this problem: persistent IDs. Persistent IDs are part of the networkDynamic package, and provide a way to attach unique values to vertices and edges that don't change regardless of any network manipulation.

Persistent IDs (at least to me) "feel" like the better approach to this, but they are also more complex to use in practice. In particular, adding edges is not as easy as you'd like. In order to use persistent IDs, I first have to create the network, then initialize it for use with persistent IDs. In this example, I'm going to start with a (mostly) empty network, and then add all of the vertices and edges to it. I'm assuming that the edgelist is already loaded in num.edgelist and it has two columns, a tail and a head. The network has to be initialized with at least one node, otherwise the initialize.pids() call is not "sticky."

net.pid <- network.initialize(1, directed=F)
initialize.pids(net.pid)

The next step is to add all of the vertices. This step is straightforward. First, I create a list of the unique node identifiers (unique() is part of the base R distribution). Then, I use that list to add the vertices and initialize them with the persistent ID.

node.list <- unique(c(unique(num.edgelist[,1]),
   unique(num.edgelist[,2])))
add.vertices(net.pid, length(node.list), vertex.pid = node.list)
Unfortunately, there is no add.edges() function which takes an edge defined in terms of the persistent IDs. This is not a huge problem: it's only a few lines of code to map a list of persistent IDs to the internal node representation, using the get.vertex.id function already written. Using apply, it's one (logical) line of R code, which gives the edgelist "mapped" to the internal vertex IDs. This list can be passed directly to add.edges():

mapped.edges <- apply(num.edgelist, c(1,2),
   function(v.pid, net) {
      get.vertex.id(net, v.pid)
   }, net.pid)
add.edges(net.pid, mapped.edges[,1], mapped.edges[,2])
There's one last bit of clean-up to do. At the beginning, I created the network with one node. This was done to make sure that the initialize.pids() call was "sticky." I need to remove that initial node. Since it was the first node added, I know the id is 1:

delete.vertices(net.pid, 1)
Finally, it's possible to plot the graph using the persistent IDs as labels. Just use the %v% operator to extract the vertex.pid attribute:
plot(net.pid, label=net.pid %v% 'vertex.pid')

Tuesday, December 17, 2013

#NoManagers / #NoMeetings

Rands makes the comment that even in organizations without "managers" in anyone's job title, there are still people who do the job function of a manager. I'd say the same thing is true about meetings as well: even if you claim you don't have meetings, there are still things that act as meetings.

The best example was an interview with a small start-up from a year or so ago (and, unfortunately, I can't remember the name or have the link to it). They said, as a point of pride, "we don't have meetings." But, in almost the same breath, they said something to the effect of "we all go to lunch together." So: in other words, they have an hour to an hour and a half of meetings every day. The meeting just happens to be catered.

McGrath is right here: meetings are not about getting work done. Meetings are about building a group identity and learning to work together. Work happens in the space around meetings, in the walks to and from the meetings, and in the opportunistic interactions that meetings help make possible. It would be interesting to formally think about meetings as social enablers: in other words, ignore all those rules about how to run a meeting, and make it purely an "unmeeting." We've got no agenda and we've got no action plan, but we do have cheese and crackers.

Tuesday, November 26, 2013

ERGM: Creating Large Fully-Connected Network objects in Statnet


Use the following approach:

network <- network(matrix(1, n, n), directed=F)
(At least with Statnet 1.7 / R64 2.13)

In more depth

Statnet seems optimized for sparsely connected graphs. This is not too surprising, since many of the "real" graphs I deal with have a density around d = .0002 or thereabouts, and even some fairly large small-world graphs, like IMDB, only have a density of ~.18. However, there's one special case where I need to have a fully connected graph: the input to an edgecov() term in an ERGM model. This graph has to have all possible edges, not just the observed edges, and so, density = 1.0.
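To put numbers on "density = 1.0," here is a small base-R sketch for an undirected graph without self-loops. The helper name graph.density is made up for this example; it is not a statnet function.

```r
# Density of an undirected graph without self-loops:
# observed edges divided by the number of possible vertex pairs.
graph.density <- function(n.vertices, n.edges) {
  n.edges / choose(n.vertices, 2)
}

# A fully connected 500-vertex graph has every possible edge
full.edges <- choose(500, 2)
full.density <- graph.density(500, full.edges)

# For comparison: roughly how many edges a 500-vertex graph
# would have at the sparse density of .0002 mentioned above
sparse.edges <- .0002 * choose(500, 2)
```

So the fully connected 500-node case carries 124,750 edges, where a graph at d = .0002 on the same vertices would have about 25.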

One of the challenges is how to create and initialize these very large networks. The step to create them would often take a very long time, and it wasn't clear I was using the best approach. There are at least two methods that seemed plausible: use matrix to initialize an adjacency matrix, or use network.initialize and then add all of the edges in afterwards. It was not clear up front which one would be faster. So, I did a quick experiment: I ran each method 50 times on various sized graphs, and compared the results.

# Method 1: network(matrix())
startTime = proc.time()
for (i in 1:50) n <- network(matrix(1, 200, 200), directed=F)
proc.time() - startTime

# Method 2: network.initialize() and then assignment
startTime = proc.time()
for (i in 1:50) {
   n <- network.initialize(200, directed=F)
   n[,] <- 1
}
proc.time() - startTime

The results were pretty clear:

Method   200x200   500x500
#1       12.32     361.91
#2       42.42     8+ hours

I used proc.time() for the timing. There are suggestions that this is not super-accurate, but the difference is so stark that I think even 1s resolution is more than enough. Also: I've discovered that 32-bit R is a really bad environment for working with even "medium-sized" graphs (500 nodes or so), much less "large" graphs. The extra address space afforded by the 64-bit version of R avoids a lot of out-of-memory conditions.
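If you want to repeat the comparison, one way to package the timing loop is a small helper around base R's system.time(). This is a sketch, not what I ran: time.method is a made-up name, the graph size and repetition count are kept small so it finishes quickly, and the network calls only run if the network package is installed.

```r
# Run build() reps times and return the total elapsed seconds
time.method <- function(build, reps = 50) {
  timing <- system.time(for (i in seq_len(reps)) build())
  unname(timing["elapsed"])
}

if (requireNamespace("network", quietly = TRUE)) {
  # Method 1: build from a full adjacency matrix
  t1 <- time.method(function() {
    network::network(matrix(1, 50, 50), directed = FALSE)
  }, reps = 5)

  # Method 2: initialize empty, then assign every edge
  t2 <- time.method(function() {
    n <- network::network.initialize(50, directed = FALSE)
    n[,] <- 1
    n
  }, reps = 5)

  print(c(method1 = t1, method2 = t2))
}
```

Passing the construction step in as a function keeps the two methods on exactly the same timing code, so only the thing being compared changes.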

Saturday, August 24, 2013

Wuzzie Club: The Rules

1. Do not talk about Wuzzie Club, except for Rule #2
2. You may talk about Wuzzie Club to make non-Wuzzies jealous.
3. If you get a new title, bar or otherwise, you are no longer part of Wuzzie Club
4. As many Wuzzies as necessary.
5. As many chapters of Wuzzie Club as necessary.
6. No sashes. No title belts.
7. Wuzzie Club is for life, except for rule #3
8. If this is your first time at Wuzzie Club, you must say how happy you are to be a Wuzzie.

Thursday, August 15, 2013

Johnnie Walker Blue: Great Scotch? Or Great Marketing?

1. The Set-up

Johnnie Walker Blue Label retails (at least in Seattle) for more than $300 a bottle. It comes in an impressive presentation box, with a heavy blue-glass bottle, carefully designed to look far more impressive than the 750ml it really is. It is very much a luxury good, suitable for the most refined of tastes. But, if you read the label, there are some worrisome signs. For instance, it doesn't state an age. Under Scottish law, anything labeled "Scotch Whisky" must be aged at least three years. If an age is stated on the label, it must be the age of the youngest whisky used in the blend. Johnnie Walker Black Label (~$50) is labeled 12 years; Green Label (~$75) 15 years; and Gold Label (~$100) 18 years. All of these are significantly cheaper than Blue Label, so you'd expect Blue to at least be able to drink in the US (21 years). But, alas, there is no age at all. Blue may not even be quite out of diapers yet. (Red Label, their cheapest, similarly does not state an age.)
Also, because Blue is labeled "Blended Scotch Whisky" it could also contain grain whisky--meaning a cheaper spirit made from grains other than malted barley. Whiskies labeled "single malt" and "blended malt" (such as Johnnie Walker Green, for instance) are made from 100% malted barley, and are generally considered to be of higher quality.
Three hundred dollars is a lot of money. I can get some very impressive bottles for a third of that, for instance, Edradour "Caledonia" 12 year, a non-chill filtered single malt ($100), or even just a bottle of Johnnie Walker Green Label, a blended malt ($75). The open question is whether, for $300, I am getting a Scotch that is truly three times as good. This calls for a (rigorous) test.

2. The Methodology

The test was done in conjunction with a friend. He helped to procure three Scotchen¹, a bottle each of:
  • Johnnie Walker Green Label (15 year blended malt, ~$75/bottle)
  • Edradour "Caledonia" (12 Year, non-chill filtered single malt, ~$100/bottle)
  • Johnnie Walker Blue Label (No age stated, blended whisky, ~$300/bottle)
Together, we carefully selected 18 discerning individuals to blind sample the three Scotches. Each person got a 1oz pour of each Scotch, labelled only with an "A", "B" or "C." They were told that this was a tasting to determine a best all-round Scotch; they were told nothing about the price or the background of the actual experiment. In addition to the three glasses of Scotch, the subjects were given water, both to cleanse the palate and to dilute the Scotch as desired; they were also provided with crackers to nibble on. The subjects were to rank the Scotchii² in order, from favorite to least favorite.
Once all the subjects had a chance to taste and rank the Scotchii, the results were tabulated. An in-depth statistical analysis was conducted to understand the results.
Note: Certain individuals were asked not to dilute the Scotchen. A non-chill filtered Scotch will often turn cloudy with water, and this would have given away details of the samples to a more knowledgeable subject. Also, due to modern human subject protection rules, we were unable to use electroshock for training purposes, much to the dismay of at least one subject.

3. The Results

First, the raw results, looking only at the top-ranked Scotch here.
Green Label   Edradour   Blue Label
4             9          5

As can be seen, the Edradour was the most popular, followed by the Blue Label, and finally the Green Label. But are these significant differences? I will take two approaches to answering this question. The first approach is to assume that all of the Scotches are the same. That is, I expect that people would pick each one more or less at random, or six votes for each of the Scotches. So, are there statistically significant differences in these rankings?
Using an exact multinomial test, we find no significant differences in the rankings (p = .419). (Similarly, there is no difference using a chi-square test, χ² = 2.33, df = 2, p = .31; again, not significant.) This means that we cannot reject our baseline assumption that the Scotches are equivalent.
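The chi-square figures for the equal-preference case can be reproduced with base R's chisq.test(), which tests a vector of counts against equal probabilities by default. The variable names are just for this sketch, and the exact multinomial test is left out because it needs an add-on package.

```r
# First-place votes from the 18 tasters
first.place <- c(Green = 4, Edradour = 9, Blue = 5)

# Goodness-of-fit test against equal shares (six votes each)
equal.test <- chisq.test(first.place)

equal.chisq <- unname(equal.test$statistic)   # about 2.33
equal.df    <- unname(equal.test$parameter)   # 2
equal.p     <- equal.test$p.value             # about .31
```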
But, there is a huge variation in the prices of the Scitch³: from $75 to $300. This suggests that I should revise the assumption that there is no difference between the Scotches: at $300, I expect Blue Label to be significantly better. As a rough approximation, I'll use the cost to reflect the expected distribution of the first-place rankings.
           Green Label   Edradour   Blue Label
Observed   4             9          5
Expected   16%           21%        63%
The question is, with a price-based weighting, is there a significant difference in the first-place rankings?
Yes, but...
There is a significant difference between the observed and expected rankings when I do a cost-weighted comparison (exact multinomial p = .0004; chi-square, χ² = 11.19, df = 2, p = .003). However, what's interesting here is that the significance is being driven by two things: first, Edradour does significantly better than expected (chi-square residual = 2.68), and second, Johnnie Walker Blue does significantly worse (chi-square residual = -1.88).
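The cost-weighted comparison can be sketched the same way. One assumption to flag: the expected shares here are each bottle's price divided by the total of the three prices, rounded to two decimals, which reproduces the 16% / 21% / 63% used above. The variable names are made up for the example.

```r
first.place <- c(Green = 4, Edradour = 9, Blue = 5)
price       <- c(Green = 75, Edradour = 100, Blue = 300)

# Assumed weighting: share of the total price, rounded to whole percents
expected.share <- round(price / sum(price), 2)

# Two expected counts fall below 5, so chisq.test() warns that the
# approximation may be poor; the warning is suppressed for the sketch.
weighted.test <- suppressWarnings(chisq.test(first.place, p = expected.share))

weighted.chisq <- unname(weighted.test$statistic)   # about 11.19
weighted.p     <- weighted.test$p.value
# Pearson residuals: (observed - expected) / sqrt(expected)
weighted.resid <- weighted.test$residuals
```

The residuals show where the mismatch comes from: a large positive value for Edradour and a negative one for Blue Label.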

4. Conclusions

TL;DR: Johnnie Walker Blue Label is great marketing,...
... but not a great Scotch.
Overall, Johnnie Walker Blue is a disappointment: using an assumption of equivalence, it doesn't even beat its cheaper relative, Johnnie Walker Green. On a price-adjusted basis, it fails against both the Green Label and Edradour "Caledonia." In fact, the Edradour scores very well: perhaps the non-chill filtering is making a difference. In any case, in this tasting, it ends up being the overall favorite.
Thanks to PSD for spearheading this.

1: Surprisingly, certain British sources say that "Scotchen" is a valid plural for Scotch. This is a relic of a Germanic / Viking invasion from around 1100.
2: American sources say that "Scotchii" is correct. Maybe the melting pot of America, where the Scots and the Greeks mixed freely caused this unusual plural.
3: Australian English uses "Scitch" (Like mouse / mice) to refer to more than one Scotch. This is probably an example of Foster's Rule.

Wednesday, May 15, 2013

Minor Rant: "Natural" Cures

An acquaintance posted a link on Facebook with the commentary, "Nature does provide the cures for what she throws at us. It's just up to us to find them and use them. We don't need all the chemicals we take in. We really don't." I decided to dig in a little bit further as to what they were talking about. It turns out that the "cure" was a protein extracted from a natural source, then "glued" onto a nanoparticle (no detail of what the nanoparticle was made out of), then the nanoparticle had "bumpers" (again, no detail of what the bumpers were made of, but I doubt it was sugar and spice and everything nice; probably more Compound X) added to it, so that human cells wouldn't hit the peptide, but viruses would. If this is what passes as "natural," then, to quote Inigo Montoya, "I do not think it means what you think it means."

That being said, I don't really understand the fetish about "natural" and "artificial," especially in the medicinal context. Every time someone says "it's natural," all I think about is questions like "Well, what's my dosage going to be like?" or "What's my lot-to-lot variability?" There are a lot of really nice toxins out there in the natural world.