<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

  <title><![CDATA[With Pith]]></title>
  <link href="http://ethanp.github.io/atom.xml" rel="self"/>
  <link href="http://ethanp.github.io/"/>
  <updated>2016-10-10T01:04:45-07:00</updated>
  <id>http://ethanp.github.io/</id>
  <author>
    <name><![CDATA[Ethan Petuchowski]]></name>
    
  </author>
  <generator uri="http://octopress.org/">Octopress</generator>

  
  <entry>
    <title type="html"><![CDATA[On Learning]]></title>
    <link href="http://ethanp.github.io/blog/2016/10/09/on-learning/"/>
    <updated>2016-10-09T15:03:32-07:00</updated>
    <id>http://ethanp.github.io/blog/2016/10/09/on-learning</id>
    <content type="html"><![CDATA[<p><a href="https://www.youtube.com/watch?v=Zjm8JeDKvdc">I like to learn things</a>, but I
often feel like time I spend learning things is wasted.</p>

<h3>It&rsquo;s hard to tell what info is going to be useful and what is fluff</h3>

<p>Classes in school (at all levels) lead to lots of wasted effort, because a lot of the stuff you&rsquo;re supposed to learn is simply not useful knowledge. For example, for one college exam, I had to name about 30 different rocks and minerals by sight, and the names are very complicated. People say: &ldquo;well just going through the process is teaching you how to think.&rdquo; I don&rsquo;t disagree, but I think that&rsquo;s a fallacy; learing to think and useful knowledge are not mutually exclusive. Useless knowledge includes specific dates, long names, irrelevant historical events, and other things we might in everyday life dismiss as &ldquo;trivia&rdquo;.</p>

<blockquote><p>In my experience, all academic disciplines have room to reduce the &ldquo;trivia&rdquo; overhead in their curricula.</p></blockquote>

<p>In college computer science classes, there is not a whole lot of trivia. However, their is a &ldquo;jump into abstraction&rdquo; that often left me confused. I quite often didn&rsquo;t understand <em>why</em> what I was learning was important. Edsger Dijkstra said we should teach the abstractions before rotting the students&#8217; brains with modern programming realities and deficiencies. That&rsquo;s a laudable mission, but problems still need to be better <em>motivated</em>.</p>

<h4>Difficult material with no motivation is not rewarding</h4>

<p>E.g. during my first Operating Systems course. Everything seemed way more complicated than anything I should ever be expected to know in real life. So I figured I would never end up needing to understand virtual memory, processes and threads, scheduling, networking, etc. In reality, I just had no idea what I was talking about, and that&rsquo;s why it should have been <em>motivated</em>. For example, I have since learned that the Linux kernel has a very interesting open source development ecosystem. They are making upgrades to the thing all the time that affect everyone developing most kinds of modern software. In addition, a whole lot of the different features of computers are motivated by internal business tool usecases, rather than home consumer usecases. Being a college student without a computer nerd background, I had no idea about any of that. Being taught to appreciate how important this stuff would end up being for me would have made me learn drastically more during the course of the class.</p>

<p>What I wanted at that time was for the material to be motivated with an example, like, &ldquo;let&rsquo;s build a program for running a medical radiation machine.&#8221;Now we&rsquo;re talking; we&rsquo;re going to need to get all the different low-level components right and make them fit together so that we can save lives!</p>

<p>What actually happened was similar in content, but not in objective. We were expected to read a very long and dry paper on the Therac-25, a radiation machine with concurrency issues that killed a few people in the 1980s. I spent a few hours trying to read the paper. One could say I was spending that time &ldquo;learning to think&rdquo;, but I think I spent that time &ldquo;getting nowhere&rdquo;. The whole time I was reading the paper, I was thinking: it would take me so long to get to the point of this paper, and I&rsquo;m not even going to get a whole lot out of it. No good. In the end I learned about the Therac-25 from Wikipedia, a source of information whose expected readership has a level of background knowledge more commensurate with my own. Then, a few years later, I had to read that paper again for a graduate operating systems class, and at that point I had sufficient background knowledge to simply read the paper and feel like I understood its content and learned important lessons about software engineering.</p>

<p>This example demonstrates the fundamental principle that the same content can lead to completely different learning outcomes for different people, and even for the same person at different points of time and (as we shall return to later) in different emotional moods.</p>

<h3>Some keys to not wasting effort</h3>

<ul>
<li>Have something in mind that you want to accomplish with your newly-obtained knowledge

<ul>
<li>E.g. &ldquo;I want to build the software for a medical radiation device&rdquo;</li>
</ul>
</li>
<li>Have someone (colleague) or some place (e.g. stack overflow) where you can ask questions when you get confused

<ul>
<li>Sometimes having another person just <em>explain the whole thing</em> in one shot face to face can lead you to simply <em>just get it</em></li>
</ul>
</li>
<li>Learn the relevant vocabulary for the field on wikipedia

<ul>
<li>For example take at least half an hour digging through linked topics until you have a general grasp of the various concerns and their names, and ideally how they relate to one another</li>
</ul>
</li>
<li>Skim liberally but don&rsquo;t expect to understand what you skim</li>
</ul>


<!-- more -->


<h3>Qualities of time</h3>

<p>During college, I learned from <a href="http://www.aaronsw.com/weblog/productivity">an Aaron Swartz article</a> about how one can&rsquo;t just spend all day reading tough material because it&rsquo;s too mentally strenuous. Instead, you have to choose the right activity for the mood you&rsquo;re in, and there&rsquo;s not a whole lot you can do if you&rsquo;re just not in the mood to learn anything. Mr. Swartz says we should have a todo list that is aware of the fact that different tasks require different levels of mental effort. Maybe after reading my compilers textbook for an hour I&rsquo;m mentally tired and physcally peppy, so I go for a jog. Then I&rsquo;m relaxed and ready to work on writing code for a program. Then I&rsquo;ll get overwhelmed so I step back and plan out some next steps for the program. By now I&rsquo;m sick of the program so I meet some friends. Then I&rsquo;m tired and take a nap. And so on. The point is each activity follows from the next according to my current mental and physical state. So many people don&rsquo;t start the project until the night before, and have to do one strenuous activity all night, fighting exhaustion, and not really getting much out of it because it&rsquo;s all slapped together. Granted, plenty of people are really proficient at getting that strategy to work out.</p>

<h3>Find the right source of information for you right now</h3>

<p>Some places to learn things include youtube, conversations, podcasts, wikipedia, books, tutorials, projects, and blogposts. In terms of enabling lots of deep and useful knowledge to be gained per minute, the most effective might be reading a great book, and doing a great project.</p>

<h4>Try out a few different compiler books</h4>

<p>Finding a great book can be pretty challenging, because you may spend some time reading the wrong books. But it is worth it, because the right book can make a big impact. After looking for some time, I have finally found a <a href="https://www.amazon.com/Writing-Compilers-%20Interpreters-%20Software-%20Engineerin%20g/dp/0470177071/ref=sr_1_sc_1?ie=UTF8&amp;qid=1476048360&amp;sr=8-1-spell&amp;keywords=writ%20ing+compilers+and+interpretere">compilers book</a> from which I am learning what I wanted to know. (It&rsquo;s cover happens to be quite ugly compared to the other compiler books.) Before that I tried reading a hundred pages of each of a few more theoretical compiler books, but over time I decided that for now I&rsquo;m more interested in learning to write the code, over understanding the theory. So the book I found is a tutorial for building a Pascal compiler from scratch in Java, with real executable code accompanying the text. The point is I found a book that is teaching me what I want to know more efficiently than several other popular books on the subject. And I know that from having tried out the other books. I think this is a useful strategy in general.</p>

<h4>Try out a few different algorithms books</h4>

<p>I went through a similar process for learning the fundamentals of algorithms. First I took the class in school and got an A. I can&rsquo;t say that did me much good, as evidenced by failing many easy job interview questions. I didn&rsquo;t like to go to class so someone else would hand in my homework. I studied for the exams directly from the book called &ldquo;Algorithm Design&rdquo;.</p>

<p>After failing at interviews, I started rotating through different algorithms books. I finally sat and (over a few months) read the whole explanatory section of &ldquo;The Algorithm Design Manual,&rdquo; but didn&rsquo;t feel like was walking away with a good grasp of the material it covered.</p>

<p>So I tried to read the classic &ldquo;CLRS&rdquo;. After reading about two chapters then skimming around, I decided I wanted to focus on <em>implementing</em> algorithms instead of proving correctness and complexity bounds.</p>

<p>So I started &ldquo;Algorithms&rdquo; by Sedgewick and Wayne. It started off pretty slow, so I found that by skimming, I was able to read a few hundred pages in the first few hours, which wasn&rsquo;t productive but motivated me to get into some more challenging material. So I skimmed up to a lot of material that I didn&rsquo;t already know (I think chapter 2 or 3).</p>

<p>In addition to incredible explanations, the book has clean code, amazing diagrams, easy, medium, and hard code and theoretical exercises, a full sequence of lecture videos by Dr. Sedgewick himself, etc. With this mix, I was finally able to achieve a satisfying learning per minute score. In general, I was able to just &ldquo;follow along&rdquo; with the book and understand things as they came.</p>

<p>The whole time I was reading this book, whenever I found a concept hard to grasp, based on my bad experiences with other books and classes and tutorials, I could rest assured that it would be unlikely that I could find a better explanation elsewhere. This attitude strengthened my resolve, and I ended up reading and understanding almost everything in the book.</p>

<h3>Maybe it&rsquo;s just the tone</h3>

<p>I have always been deeply affected by the tone in which information is presented to me in turms of being able to remember it. So have you, just ask your parents. I&rsquo;ve had professors I couldn&rsquo;t stand teach me subjects I love and walked away with nothing from that class. And vice versa. Everyone knows that. It&rsquo;s the same with textbooks. I think the key is inspring the student about the potential of what they can do with the knowledge, and to actually convey the excitement of it!, instead of simply boring you to tears with facts.</p>

<p>One thing the Khan Academy (among others) has brought to teaching is excitement for the actual content. The end result is that people (e.g. me) can actually bear to spend hours watching and re-watching Mr. Khan explain differential equations. He&rsquo;s doing a taylor series expansion, and he&rsquo;s like &ldquo;just check out how cool this is!&rdquo; and I&rsquo;m like &ldquo;yeah, expand that shit!&rdquo; Whereas otherwise I&rsquo;d be sitting at class at 8:30 in the morning like, &ldquo;FML I can&rsquo;t believe I&rsquo;m not sleeping right now&rdquo;. Junior year of college, without Khan Academy&rsquo;s help (and <em>Paul&rsquo;s Online Notes</em>, a free math textbook), there&rsquo;s no way I would have done so well in diff-eq, which gave me the confidence to transfer into a computer science degree later on.</p>

<h3>When to skim</h3>

<p>Most of the time, skimming material does not lead to understanding the material. Instead, it only leads to some kind of <em>awareness</em> of the material; that feeling later of &ldquo;yeah I think I&rsquo;ve seen this stuff before&rdquo;. That said, skimming is often crucial, especially for reading blogposts.</p>

<p>Blogposts may be written with any of an assortment of aims.</p>

<p>For example, the author may be trying to assert her expertise in an area by providing a good explanation, in order to further her career. Those vary widely in quality, and can occassionally be read deeply for understanding. Data science has a <em>huge</em> number of those.</p>

<p>Or the author may be trying to clear up his own thoughts on some subject. That&rsquo;s what I&rsquo;m trying to do here. In that case, skim to find the parts that are relevant or entertaining, and read those.</p>

<p>Or it may be written for the author to remember something, like &ldquo;how to get HDFS running locally on Mac&rdquo;. Thank heavens for such blogposts.</p>

<p>Or it may be some sort of <em>war story</em>: &ldquo;this is what we did, and this is how it worked out&rdquo;. If you&rsquo;re trying to build a complex system, you probably want to read some of those.</p>

<p>Lectures are generally only worth skimming. It takes me less effort to understand a lecturer than to understand a book. But even when it&rsquo;s clearly stated, the lecturer is generally not divulging much information, and when they start to say something actually complicated, it becomes easy to zone out, get lost, and lose interest. They may also get lost on side-tracks or try to pull things in that they think are cool. A great lecturer can be a very entertaining way of learning, but it&rsquo;s generally not time spent learning nearly as much substance as can be read from a page. Please note, hour long lectures are quite different from a super-helpful and targeted 5 minute youtube video.</p>

<p>One useful question to ask before reading something is &ldquo;Can I just skim it?&rdquo; The answer to this is yes if you <em>don&rsquo;t</em> want to be able to actively use that knowledge in the future. Often, skimming is useful for verifying that you understand something that you learned about elsewhere. For example, if I were to skim a tutorial on parsing theory, I would be able to roughly guage how much of that stuff I actually understand based on how much I&rsquo;m recognizing concepts and &ldquo;agreeing with the author&rdquo; as I skim.</p>

<p>A frustrating situation is to read something deeply when you could have just skimmed it because there was nothing really there. Maybe this can be avoided by skimming first to see whether it&rsquo;s even worth really reading. But I never do that.</p>

<p>Another time skimming is useful is when you&rsquo;re stuck. When I got stuck reading about computer architecture, I would go back to the table of contents. &ldquo;How did I get to where I am?&rdquo; &ldquo;Where are we going with this stuff?&rdquo; &ldquo;What do I find interesting about this?&rdquo; Then I would skip to some figures. Look at the figure. Is it cool looking? Wouldn&rsquo;t it be fun to understand what it means? Let&rsquo;s look at some of the keywords in the text near the figure. Oh there&rsquo;s some words I&rsquo;ve seen before but didn&rsquo;t understand, I guess they were important. Now I can go back to their definitions and see if I can place them properly in the figure. And so on.</p>

<h3>Is learning important</h3>

<p>By getting better at seeking out the best sources of information, we build an improved ability to affix knowledge into the brain accessibly, without expecting the best educators to just to fall in our lap when we need them. Here, I have tried to explore and share some lessons about building the conditions for learning efficiently.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[First Glance at Genomics With ADAM and Spark]]></title>
    <link href="http://ethanp.github.io/blog/2016/06/19/first-glance-at-genomics-with-adam-and-spark/"/>
    <updated>2016-06-19T14:15:22-07:00</updated>
    <id>http://ethanp.github.io/blog/2016/06/19/first-glance-at-genomics-with-adam-and-spark</id>
    <content type="html"><![CDATA[<p>At work, we have a Spark cluster. One of my first responsibilities was to make
it more reliable and efficient. So I looked on Github to see how people
actually use Spark, and what magic they use to get their clusters not to crash
in the process. This is how I found <a href="https://github.com/bigdatagenomics/adam/">ADAM</a>, &ldquo;a genomics analysis platform
built using Avro, Spark, and Parquet&rdquo;. Then I looked in the repo&rsquo;s
<em>contributors</em> list, and watched a few lectures by Frank Nothaft and Matt
Massie, two of the project&rsquo;s main contributors. What I heard there was pretty
cool.</p>

<p>In short, they&rsquo;re looking to build systems that will &ldquo;one day&rdquo; recommend more
effective treatments for diseases including cancer and Alzheimer&rsquo;s within an
hour of receiving a patient&rsquo;s DNA sample. They describe several components of
what needs to be done to make [research toward] this possible.</p>

<!-- more -->


<h3>ADAM according to YouTube</h3>

<p>See the <strong>references</strong> at the bottom of this post.</p>

<p>They didn&rsquo;t explain much of the actual genomics, though what they did explain
of that, they did with laudable clarity. In essence, the first step is to
assemble (&ldquo;sequence&rdquo;) a large number of small strips of DNA into a single
massive (250 GB) chain (per sample per person), even in the presence of small
errors in the small strips. Ok now the the second step is to take the data from
multiple samples from multiple people, and find anomalies that correlate
historically with the likelihood to end up being diagnosed with a particular
disease. These anomalies may be on the scale of a few single-character changes
scattered around the DNA, again in the presence of mistakes in the
transcription of the DNA. So statistical techniques are used (from what I
remember, Poisson random sampling methods and binomial Bayesian methods, and
Chi-squared independence tests) to evaluate the confidence in the correlations.</p>

<p>According to the lecturers, currently, techniques for doing this sort of
computation and analysis relies on code written by PhD students in genomics,
who are not so interested in writing high quality code, as learning about
genomics and finishing their dissertations. Therefore, there are many
(compressed) file formats, and processing subsystems written in every
programming language you can think of. These subsystems are assembled together
by piping data through the filesystem between each compoenent. Many of these
subsystems are inherent unscalable.</p>

<p>All these are issues the researchers at the <em>Big Data Genomics</em> project are
trying to solve using <strong>ADAM</strong>. They <a href="https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/adam.pdf">reported in <em>SIGMOD 2015</em></a> to
have achieved a 28x speedup and 63% cost reduction over current genomics
pipelines. The <em>Big Data Genomics</em> project is a collaboration between
researchers at the AMPLab at UC Berkeley, and other genetics and cloud-
computing research institutions and companies. They note that their ADAM
pipeline infrustructure is able to accommodate analyses from other areas of
scientific research as well, including astronomy and neuroscience.</p>

<p>Producing a high-quality human genome currently takes 120 hours using a
&ldquo;single, beefy node&rdquo;. Their improvements involve using map-reduce (via Spark),
and columnar storage (via Avro &amp; Parquet) to distribute the workload across a
scalable cluster, so their 28x speedup is probably something people are happy
about.</p>

<p>A major issue faced by genomics researchers is the proprietary nature of the
data. This means that researchers must collect and process their own data,
which is heavily human and computer labor intensive. The humongous (I mean
really&hellip;) nature of the datasets means that the only practical means of
transferring them is by shipping boxes of hard drives. As data collection gets
easier, and the amount of data available explodes, even shipping boxes will
become impractical. So the vision is to keep the data in the Simple Storage
Service (S3) hosted by Amazon Web Services (AWS) so that it can be operated on
via Spark without being copied into an institution&rsquo;s private data center. This
is currently more dream than reality, but seems like the logical step because
of how Big the Data is.</p>

<p>Another major issue faced by genomics researchers is the large number of (open
and proprietary) genomics data formats, and the quirks/bugs in their
implementations. The ADAM team&rsquo;s solution to this is their large, fixed schema,
created using Apache Avro. This schema is designed to be able to accommodate
whatever genetics research you may want to do. To allow ADAM to ingest your
format, you write a transformation from your format to their standard Avro
schema. Then all the applications built-in to ADAM for analyzing the data are
available to you.</p>

<h3>ADAM, according to the paper</h3>

<p>The software architecture is (explicitly) based to a large extent on the
Internet&rsquo;s OSI model. It also has 7 layers, starting with a few storage layers,
going to a schema layer, going to compute and transform layers, then to an
application layer. The point is, like the OSI model, to make it easier to
implement on top of existing components, to make it portable in the important
dimensions across scientific disciplines and execution environments; and also
to allow the implementations of each layer to be swapped out over time as the
hardware, compute software, and relevant scientific algorithm ecosystems
evolve.</p>

<p>They proceed to discuss several pipelines (one genomics, the other astronomy)
that they implemented (mainly) around the needs of Spark and their AWS
(virtual) hardware. They note that their re-implementation provides several
bug-fixes with respect to algorithm implementations in reference genetics
software components. Plus, each of the reference components can only
communicate through disk I/O, while the Spark data pipeline keeps data in
memory until the end. This is reminiscent of the original insight of Spark:
speedup over Hadoop MapReduce by keeping as much data in memory as possible
throughout the job.</p>

<p>Datasets in genomics, astronomy, and many other scientific fields involve a
coordinate system. In genomics, it is generally a 1-dimensional string of
<em>nucleotides</em> (A, C, G, T), which serves as a convenient abstraction over what
is really a collection of molecules, each containing a packaged collection of
DNA polymers in a complex 3D shape, crammed into the nucleus of a cell (I
think, pg. 8). There is an assumed independence of data between distant regions
of the 1D string.</p>

<h4>The Region Join Algorithm</h4>

<p>To figure out which regions of the 1D space were acquired by this sample from
this &ldquo;non-reference&rdquo; human being, it is matched up with a &ldquo;reference&rdquo; (a.k.a.
&ldquo;idealized&rdquo;) human genome, generated by aggregating many people together. What
they use Spark for is to figure out which regions of the reference genome were
sequenced in <em>this</em> real DNA sample. They call it &ldquo;convex hull&rdquo; (see below for
definition), because to generalize to multi-dimensional spaces that&rsquo;s the right
way to see it, but for 1-Dimension, you&rsquo;re really just looking to line-up sub-
lines along a big line, and that would have been a simpler way to explain it if
I&rsquo;m not mistaken.</p>

<p>This &ldquo;region join&rdquo; can be implemented by (a) shipping the reference genome to
each node, and having them output which part was matched for each input data-
strip, or (b) repartitioning and sorting the datasets in such a way that puts
the joinable data from each dataset on the same machine in sorted order (which
I believe is part of the Spark API), and then iterating through both, and
&ldquo;zipping&rdquo; them together.</p>

<p>The output of that stage is a powerful primitive for higher-level algorithms
that researchers will want to run.</p>

<h4>Storage</h4>

<p>Part of the Hadoop concept is that you get data locality because MapReduce (or
Spark) schedules your computations on the HDFS nodes that already happen to be
holding the relevant data. However, this conflicts with their goal of being
able to store like 100s of <em>petabytes</em> of data. They can&rsquo;t really maintain the
cost of having enough (even virtual) machines up and storing their giant
datasets. So they opted to store the data in Amazon S3 (distributed storage)
&ldquo;buckets&rdquo;. This increases job start (and sometimes finish) duration, but lowers
costs.</p>

<p>This reminds me of how the default method for using the &ldquo;Databricks cloud
platform&rdquo; assumes that you are storing your dataset in S3 buckets, and the data
will be loaded (remotely) into the relevant EMR (virtual machine) nodes when
the user of an interactive Spark session (or scheduled script) asks for them.</p>

<p>They store the data as in Parquet files. They wrote their own Parquet &ldquo;row
chunk&rdquo; index file, that Spark then uses to figure out how to intelligently
paralellize file [system block] reads, and not read (too much) more of the
files than is really necessary. This is done with the help of a query predicate
pushdown mechanism. If I&rsquo;m not mistaken, all this stuff is really cool, but now
it&rsquo;s built-in to Spark with the help of the Catalyst optimizer, which I think
came out after this paper did.</p>

<h4>Cost and Performance</h4>

<p>In their experiments, ADAM is way faster and cheaper than existing
implementations of each of its components in almost all genomic situations.
Taking the whole pipeline as a single system, then it&rsquo;s always way cheaper and
faster. This is also true for the astronomy dataset and task. They achieve near
linear performance scaling (i.e. when adding more machines to the cluster) in
almost all components.</p>

<h3>Diversion: Convex Hull</h3>

<ul>
<li><strong>Convex set/shape</strong> &mdash; for any two points in the shape, the line connecting
them is also part of the shape</li>
<li><strong>Convex hull</strong> &mdash; the smallest convex set that contains a given subset of
points in the space</li>
<li><strong>Convex hull in 1-Dimension</strong>

<ul>
<li>given a set of lines, the smallest range of the line that contains the
start and end of every given line (right?)</li>
<li>given a set of points, just find the line connecting the smallest and
largest point (right?)</li>
</ul>
</li>
</ul>


<h3>References</h3>

<ul>
<li>Nothaft, Frank Austin, et al. <a href="https://scholar.google.com/scholar?hl=en&amp;q=Next-Generation+Genomics+Analysis+Using+Spark+and+ADAM-+Timothy+Danford+(AMPLab%2C+UC+Berkeley)&amp;btnG=&amp;as_sdt=1%2C44&amp;as_sdtp=">&ldquo;Rethinking data-intensive science using scalable analytics systems.&rdquo; Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 2015.</a></li>
<li>Youtube: <a href="https://www.youtube.com/watch?v=axLEBM_PZeI">Next-Generation Genomics Analysis Using Spark and ADAM- Timothy Danford (AMPLab, UC Berkeley)</a></li>
<li>Youtube: <a href="https://www.youtube.com/watch?v=dJX-bphMZRs">ADAM: Fast, Scalable Genomic Analysis &ndash; Frank Austin Nothaft (UC Berkeley)</a></li>
<li>Youtube: <a href="https://www.youtube.com/watch?v=ctLyjYw0BOg">sfspark.org: Frank Nothaft, Scalable Genome Analysis With ADAM </a></li>
<li><a href="https://github.com/bigdatagenomics/adam">Github project</a></li>
</ul>

]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Hdfs Output Stream Api Semantics]]></title>
    <link href="http://ethanp.github.io/blog/2016/05/30/hdfs-output-stream-api-semantics/"/>
    <updated>2016-05-30T14:00:09-07:00</updated>
    <id>http://ethanp.github.io/blog/2016/05/30/hdfs-output-stream-api-semantics</id>
    <content type="html"><![CDATA[<p>Writing to files can get tricky. You have to think about the semantics you
want, versus any performance imperatives, etc. Here, we look a briefly at the
Linux file system API, and then contrast it with a brief look at the Hadoop
File System (HDFS) Java API.</p>

<h2>Linux file API</h2>

<p>In the normal Linux file system API, there are various ways to &ldquo;flush&rdquo; a file.
Here are a few of the ones I have seen.</p>

<p>We have <code>fflush(3)</code>, which flushes all user-space buffers via the
stream&rsquo;s underlying write function. This data may still exist in the kernel
(e.g. buffer cache and page cache [since 2.4, Linux buffer cache usually just
points to an entry in the page cache <a href="">see quora</a>]).</p>

<p>We have <code>fsync(2)</code>, which flushes modified pages of data from the operating
system&rsquo;s buffer cache to the actual disk device, and blocks until this has
completed. Modified metadata (e.g. file size) is also written out to the
file&rsquo;s inode&rsquo;s metadata section.</p>

<p>We have <code>close(2)</code>, which closes a file descriptor, but does not cause
flushing of any kernel buffers.</p>

<p>We have <code>fclose(3)</code>, which closes the file descriptor, <em>and</em> flushes its <em>user
space</em> buffers (like <code>fflush(3)</code>).</p>

<p><a href="https://www.quora.com/What-is-the-major-difference-between-the-buffer-cache-and-the-page-cache/answer/Robert-Love-1?srid=zA4O">https://www.quora.com/What-is-the-major-difference-between-the-buffer-cache-and-the-page-cache/answer/Robert-Love-1?srid=zA4O</a></p>

<h3>References</h3>

<ul>
<li><a href="http://man7.org/linux/man-pages/man3/fflush.3.html">man fflush(3)</a></li>
<li><a href="http://man7.org/linux/man-pages/man2/fsync.2.html">man fsync(2)</a></li>
<li><a href="https://www.wikiwand.com/en/Inode#/POSIX_inode_description">wiki inode</a></li>
<li><a href="http://linux.die.net/man/2/close">man close(2)</a></li>
<li><a href="http://linux.die.net/man/3/fclose">man fclose(3)</a></li>
</ul>


<h2>Hadoop File System (HDFS) file Java API</h2>

<p>In this API the names of the functions are similar, but the semantics are
quite different.</p>

<p>In HDFS, a &ldquo;file&rdquo; is stored as a sequence of &ldquo;blocks&rdquo;, and each block is is
globally-configured to be e.g. either 64MB or 128MB in size. Each block is
separately stored on the configured number of machines, according to the
chosen HDFS &ldquo;replication factor&rdquo;. For the instance of Linux running on a
particular node in the HDFS cluster, a block is a file that Linux must track
just like it would any other file: with a page/buffer cache (see above),
inode, etc. Tracking and deciding which blocks belong to each HDFS file, and
on which nodes each of those blocks are stored is the responsibility of the
HDFS NameNode (i.e. the single master node).</p>

<p>But the whole block-level view of HDFS is not (directly) visible to the HDFS
client API. Instead, a client simply opens an <code>OutputStream</code> to a file by
telling the name node that it either wants to create a new file, or append to
an existing file. The NameNode responds with nodes that should accept the
first block of data. The client starts writing to the first DataNode willing
to take its data. That DataNode, pipelines this incoming data to the other
DataNodes responsible for replicating this block.</p>

<p>Similar to the Linux file system API above, just because bytes are being
&ldquo;written&rdquo; by the client, does <strong>not</strong> mean they&rsquo;ll necessarily:</p>

<ol>
<li>be visible to someone who now tries to the read the file</li>
<li>be reflected in the current metadata available about the file (which lives
in the NameNode)</li>
<li>survive crashes of</li>
<li>the client or</li>
<li>the DataNode(s)</li>
</ol>


<p>Similar to the Linux file system API above, we have a few methods we can use
to decide the buffering semantics we want of our pending writes.</p>

<p>We have <code>hflush()</code>, which flushes data in the client&rsquo;s user buffer all the way
out to the nodes which are responsible for storing the relevant <em>&ldquo;blocks&rdquo;</em> of
this file. The metadata in the NameNode is <strong>not</strong> updated. Data is <em>not</em>
necessily flushed from the DataNodes&#8217; buffer caches to the actual disk device.</p>

<p>We have <code>hsync()</code>, which is <em>just</em> an alias for <code>hflush</code>.</p>

<p>We have <code>close()</code>, which closes the stream, makes sure all the data has arrived
at all the relevant nodes, and updates the metadata in the NameNode (e.g.
updates the file-length as seen from the <code>hadoop fs -ls myFile.txt</code> command
line interface).</p>

<p>In my experience, <strong>it is not safe to be opening and closing the same files
from the same instance of the Hadoop client on different threads</strong>. Maybe I
was naive in thinking this would be OK, as the Linux man pages given above
seem to suggest that this would be problematic even with the direct Linux file
system API.</p>

<h3>References</h3>

<ul>
<li><a href="https://hadoop.apache.org/docs/r2.6.1/api/org/apache/hadoop/fs/FSDataOutputStream.html">HDFS output stream API</a></li>
</ul>

]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Ramblings on Insight]]></title>
    <link href="http://ethanp.github.io/blog/2016/05/14/ramblings-on-insight/"/>
    <updated>2016-05-14T14:46:12-07:00</updated>
    <id>http://ethanp.github.io/blog/2016/05/14/ramblings-on-insight</id>
    <content type="html"><![CDATA[<p>I&rsquo;m re-designing a program I&rsquo;ve been working on for a few months. I&rsquo;ve
implemented a prototype for the new design and also started implementing it.
Because I haven&rsquo;t thought this through completely or written much of the code
yet, there is still room for fundamental issues to come up with plugging this
model into my codebase which will make it impossible for this major
refactoring to complete. But really, what I expect is that this model allows
us to express what is really going on in such a way that by looking at the
<em>names</em> of the components, we know <em>where</em> to start looking for code
describing <em>what</em> is really happening. If that is true, and we write function
names from the top-down (i.e. <strong>start with &ldquo;main&rdquo; and go down, as a breadth-
first search</strong>), then whole bunches of code will be isolated in that they
survive to create the capability of some unique higher-level function (or a
small set of functions).</p>

<p>Even though I have a clearer conception of how to go forward than I did when
first writing the program. I still find it hard to think through the whole
problem without just trying it. But if I just try it then I don&rsquo;t know what
I&rsquo;m doing. Then, after I learn what the pieces I need <em>are</em>, I can start from
the top-level design and then go down.</p>

<p>I don&rsquo;t know if there&rsquo;s a particular <em>moment</em> when I am <em>ready</em> to see the
problem more fully. In this case, it seemed to happen because I <em>had</em> to find
a better way. The program was going to be a lot of trouble to keep dealing
with if I didn&rsquo;t figure out a better way to name and organize the different
concerns it actually addresses. By naming and organizing things better, we get
a little &ldquo;registry&rdquo; of package, class, method, and value names that reveal the
solution&rsquo;s structure.</p>

<p>Refactoring can get a bit tedious. But maybe the most tedious bits are often
not really worth doing. I hope to build up a better intuition for it. But for
now, it&rsquo;s one of those things where I don&rsquo;t know the words to describe what
I&rsquo;m really making. I can see how knowing a bunch of &ldquo;patterns&rdquo; would make this
easier. Especially considering that the crux of the pipeline I implemented is
the &ldquo;observer&rdquo; pattern, one of the only patterns I know. But I think the main
thing is to just try things out, see what works, and what doesn&rsquo;t, then go
back and learn the &ldquo;patterns&rdquo; after I&rsquo;ve read and written a bunch of different
designs <em>in the code</em>.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Form in 'Main' Follows Program Function]]></title>
    <link href="http://ethanp.github.io/blog/2016/05/14/form-in-main-follows-program-function/"/>
    <updated>2016-05-14T14:44:11-07:00</updated>
    <id>http://ethanp.github.io/blog/2016/05/14/form-in-main-follows-program-function</id>
    <content type="html"><![CDATA[<p>My program is a pipeline that takes multiple data sources, transforms them,
mashes them together, and writes them to multiple locations. It does this in a
somewhat resilient way by using Kafka as an internal buffer and data bus.
However you would have no idea from the structure of the program that that is
what is going on. In the &ldquo;main&rdquo; method, all that happens is a few
configuration settings are overridden, and a server is started. That doesn&rsquo;t
tell the reader <em>anything</em> about what&rsquo;s happening.</p>

<p>Since I&rsquo;m using Scala, the new design makes the &ldquo;main&rdquo; function look more like
a Unix program:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
</pre></td><td class='code'><pre><code class='scala'><span class='line'><span class="k">val</span> <span class="n">src1</span><span class="k">:</span> <span class="kt">DataSource</span><span class="o">[</span><span class="kt">Type1</span><span class="o">]</span> <span class="k">=</span> <span class="nc">Type1Source</span><span class="o">()</span>
</span><span class='line'><span class="k">val</span> <span class="n">src2</span><span class="k">:</span> <span class="kt">DataSource</span><span class="o">[</span><span class="kt">Type2</span><span class="o">]</span> <span class="k">=</span> <span class="nc">Type2Source</span><span class="o">()</span>
</span><span class='line'><span class="k">val</span> <span class="n">merger</span><span class="k">:</span> <span class="kt">Merger</span><span class="o">[</span><span class="kt">Type1</span>, <span class="kt">Type2</span>, <span class="kt">Type3</span><span class="o">]</span> <span class="k">=</span> <span class="nc">OneAndTwoMerger</span><span class="o">()</span>
</span><span class='line'><span class="k">val</span> <span class="n">output1</span><span class="k">:</span> <span class="kt">DataSink</span><span class="o">[</span><span class="kt">Type3</span><span class="o">]</span> <span class="k">=</span> <span class="nc">Dest1Sink</span><span class="o">()</span>
</span><span class='line'><span class="k">val</span> <span class="n">output2</span><span class="k">:</span> <span class="kt">DataSink</span><span class="o">[</span><span class="kt">Type3</span><span class="o">]</span> <span class="k">=</span> <span class="nc">Dest2Sink</span><span class="o">()</span>
</span><span class='line'>
</span><span class='line'><span class="nc">Merge</span><span class="o">(</span><span class="n">src1</span><span class="o">,</span> <span class="n">src2</span><span class="o">)</span> <span class="o">|</span> <span class="n">merger</span> <span class="n">tee</span> <span class="o">(</span><span class="n">output1</span><span class="o">,</span> <span class="n">output2</span><span class="o">)</span>
</span><span class='line'>
</span><span class='line'><span class="c1">// desugared to show there is no magic</span>
</span><span class='line'><span class="nc">Merge</span><span class="o">(</span><span class="n">src1</span><span class="o">,</span> <span class="n">src2</span><span class="o">).|(</span><span class="n">merger</span><span class="o">).</span><span class="n">tee</span><span class="o">(</span><span class="n">output1</span><span class="o">,</span> <span class="n">output2</span><span class="o">)</span>
</span></code></pre></td></tr></table></div></figure>


<p>Ok, so data streams emanating from source-1 and source-2 are merged together
by a type-compatible &ldquo;merger&rdquo; class, who writes its output stream into both
output-1 and output-2.</p>

<p>There&rsquo;s not a lot of code required to create those interfaces and methods.
Basically any number of <code>DataSink[T]</code>s can be &ldquo;observers&rdquo; of a
<code>DataSource[T]</code>. Whenever a <code>DataSource</code> finds itself with data to publish, it
calls the <code>receiveMsgs(msgs: Seq[T])</code> method of all the observing <code>DataSink</code>s.
So now we have a &ldquo;reactive&rdquo; (sources produce data whenever it is available to
them), and typesafe pipeline where components can be swapped in an out.
Communication between sources and sinks by default is just function calls
(i.e. synchronous), but their calls could be wrapped with Futures or Akka
actors. Using function-calls makes coding, testing, debugging easier, has
better type-checking, and doesn&rsquo;t need backpressure. Increased asynchrony
would allow for higher speeds, but is not needed yet, and will be hooked-in
as-needed.</p>

<p>The biggest influences on this design are the Unix shell, and the Akka
Streaming library, which I saw some presentations about. I think both were
inspired by electrical engineering (e.g. circuits and signal processing).</p>

<p>With this approach, each component has a single responsibility: to ingest,
filter, transform, aggregate, or output streams of data. Then in the &ldquo;main&rdquo;
function we just assemble the data flow of the program by hooking components
together. This means to test the program, we just need to test that each
component produces or consumes the data that it says it does properly.</p>

<p>Before, almost all of my tests involved at least three separate major program
components. I think I will start by re-writing those, and wherever things
don&rsquo;t work, write lower-level tests of one thing, and keep zooming in like
that. That way, testing effort is spent on the parts that are hard to get
right. I&rsquo;m not writing the tests first because most of the code for the
program is simple hooking things into each other. Testing that would be an
unecessary duplication of effort. If the main logic is so plain to see and
understand and will not undergo heavy modification, it does not need to be
written twice. Then there are a few bits that use some pretty difficult
external APIs that can be used well and can be used badly. I want to make sure
that I&rsquo;m using those at least as well as is necessary for the program to
function properly. Most of the issues I&rsquo;ve had in the past are with the HDFS
API. With HDFS, it takes to take a little while sometimes for opens, writes,
and closes to propagate properly to all the replicas. Before I knew that, I
was using the API sub-optimally, and the program would crash every twelve
hours or so. That problem itself would not be simple to test against, but it
gives the impression that interaction ith these external APIs is where the
main complexity in my program lives.</p>

<p>In this new &ldquo;source-to-sink&rdquo; program model, a single Kafka &ldquo;topic&rdquo; can be
implemented as an <code>object</code> (i.e. Singleton) that has two ends (fields): a
<code>Producer</code> (which is a <code>DataSink[T]</code>, since it writes data out of the
program), and a corresponding <code>Consumer</code> (a <code>DataSource[T]</code> for the program).</p>

<p>So if the program has two &ldquo;main&rdquo; functions, one connects to the <code>Producer</code>
side of a Kafka topic, and the other connects to the <code>Consumer</code> side, all
using this Unix-like Scala DSL, then we have integrated Kafka as a resilient
buffer connecting two stages of the pipeline. This means the computation
subgraph connected within-JVM to the Consumer side can be taken offline for
fixing or augmentation without losing ephemeral data being collected by the
Producer side.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Name According to Function]]></title>
    <link href="http://ethanp.github.io/blog/2016/05/12/name-according-to-function/"/>
    <updated>2016-05-12T15:08:16-07:00</updated>
    <id>http://ethanp.github.io/blog/2016/05/12/name-according-to-function</id>
    <content type="html"><![CDATA[<p>Over the past few months, I wrote a kind of crappy program. Now I need to make
additions to that program and there are a lot of internals about the program
that I need to recall in order to be able to implement the additions
efficiently and robustly. This is not a world I want to continue to live in,
so my crappy program requires some sort of improvement. I looked to Amazon for
a book with the answers on &ldquo;what <em>specifically</em> to do&rdquo;. Based on its cover,
title, and reviews, I ended up with the book <strong>Clean Code: A Handbook of Agile
Software Craftsmanship</strong>, put together from multiple authors by a guy who
refers to himself as &ldquo;Uncle Bob&rdquo; (real name Robert C Martin). The book is
frankly a big part of the answer I was looking for.</p>

<p>I don&rsquo;t know a whole lot about Uncle Bob, but he seems to be <em>very</em>
experienced with designing, writing, maintaining, refactoring, etc. large Java
projects. He also is very well-read on modern software engineering, but for
the first quarter of the book, manages to stay away from the over-use of
terminology I&rsquo;m not familiar with. After that it goes into things like agile,
test driven development, behavior driven development, cohesion, object
oriented programming patterns (e.g. Gang of Four), plain old Java objects,
data access objects, etc. that I don&rsquo;t have much familiarity with. He also
talks about his conversations with (the only) guys in this arena that I have
heard of, including Fred Brooks and Martin Fowler.</p>

<p>His writing tone can be characterized as follows</p>

<ul>
<li>I have read and written sooo much more code than you</li>
<li>Over time, I have taken the time to think about the best way to make the
code &lsquo;clean&rsquo;</li>
<li>Herein, I shall share with you what I have learned</li>
<li>I&rsquo;ll use the best didactic methods I can think of</li>
<li>I hope you benefit as much as possible from my wisdom</li>
</ul>


<p>His main didactic method can be characterized as</p>

<ul>
<li>Here is an example of some bad code</li>
<li>This is how it goes wrong</li>
<li>This is why that is bad</li>
<li>Let&rsquo;s improve it in these areas</li>
<li>Here is the improved code</li>
<li>Note how this new code doesn&rsquo;t have the flaws of the previous</li>
</ul>


<p>So far I have read perhaps &frac12; of the book, and I have already come up with
and partially implemented a design for my problematic program that is <strong>so
much</strong> better than the old design, and a rough prototype implementation is
already running, and I have much more &ldquo;direction/vision&rdquo; for how it is going
to progress. That shows me that the <em>Clean Code</em> book has produced an
incredible return on investment.</p>

<p>The main thing I have understood from the book is that <strong>parts of programs
should do what they say they do</strong>. Put most briefly, there are two parts of
making that possible</p>

<ol>
<li><strong>Components should be simple enough to be described by just a few words</strong></li>
<li><strong>The <em>name</em> of a component should be the few words that describe it</strong></li>
</ol>


<p>It is embarrassing to say that I never thought of that myself, but the truth
is I didn&rsquo;t, and that is reflected all-too-obviously in the program that I
wrote. I can come up with countless excuses for why I wrote the program that
way, but that doesn&rsquo;t fix the problem. The only way to fix the problem is to
reorganize the program to conform to the above two rules.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Bigtable Paper Summary]]></title>
    <link href="http://ethanp.github.io/blog/2016/04/10/bigtable-paper-summary/"/>
    <updated>2016-04-10T22:02:12-07:00</updated>
    <id>http://ethanp.github.io/blog/2016/04/10/bigtable-paper-summary</id>
    <content type="html"><![CDATA[<p>When looking into what Cassandra and HBase are, and their relative strengths
and weaknesses, people often seem to think they can get away with the
following very succinct characterizations: &ldquo;Cassandra is like is Dynamo plus
Bigtable, and HBase is just Bigtable&rdquo;. I don&rsquo;t know much about Dynamo <em>or</em>
BigTable because we skipped those papers in my systems courses. So to get
started understanding what&rsquo;s going on with all this mess, I decided to read
the Bigtable paper. What follows is a brief summarization/retelling of the
<a href="http://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf">Bigtable paper</a>. It follows roughly the form of the paper, especially in
that it starts high level, and then digs slightly lower-down into the
implementation. It contains basically only and all of the parts of the paper
that I found illuminating, but broken down into sentences that are hopefully
easier to understand.</p>

<h3>What problem are we solving?</h3>

<p>Bigtable provides an API for storing and retrieving data.</p>

<p>It is most useful if</p>

<ul>
<li>there is a <em>lot</em> of data coming in at a high rate over time</li>
<li>there is no need to join each data table with another</li>
<li>data might need to be updated</li>
<li>range queries are common</li>
</ul>


<p>Bigtable is a distributed database. It is a database management system which
allows you to define tables, write and update data, and run queries against
the stored data. It is similar to a relational database, except that it is not
&ldquo;relational&rdquo; in the sense denoted by the term &ldquo;relational database&rdquo;. Instead,
it brings its own type of data model, where instead of storing data in normal
two-dimensional table cells, you store it according to a new set of rules that
allows for flexibility in the shape of each record, while still enabling
overall efficiency.</p>

<!-- more -->


<h3>Go on&hellip;</h3>

<p>In the Bigtable data model, data is grouped in tables. Each record in one
table must have a string identifier unique in that table. Each table is
defined to have a set of &ldquo;column families&rdquo;. This is where the data goes. A
record has values, that is the point of having a record. Each of those values
is associated with a timestamp when it was created. This can either be defined
by Bigtable or by the user inserting the data. Having two versions of a record
means it was updated. Each value has its own key (a string) pointing to it.
That key has two components, a &ldquo;family&rdquo; and a &ldquo;qualifier&rdquo;. Each table has a
finite set of families with known names. However, the qualifier segment of a
value&rsquo;s key may be an arbitrary string. There may be any number of anchors for
each family in each record. You can hint whether you want Bigtable to prefer
disk locality when writing many values in the same column family.</p>

<p>Rows are stored on disk in alphabetically by row key, and for each value, in
<em>descending</em> order of timestamps (i.e. most recent first). All reads and
writes over a single key are atomic with respect to each other, but not with
respect to changes in other rows. (This is something the client must be
careful about.) All the rows for a given table are chopped up (dynamically)
into &ldquo;tablets&rdquo; which each are stored in by exactly one &ldquo;tablet server&rdquo; node in
the cluster. One can specify garbage collection to either only keep the last
<em>n</em> versions, or to only keep everything for a specified duration.</p>

<p>When reading the data, one might e.g. just fetch the data for one column
family within a particular range of row names, but get all currently known
versions of that data. This type of query is what was optimized by the
creators of Bigtable. One might for example want to execute this type of query
to feed as input to a MapReduce job. One could do that, and then also be able
to enjoy the intentionally efficient writing of MapReduce results into a
different Bigtable table.</p>

<p>Bigtable data is stored in the SSTable file format. In this format, each file
represents what in Scala would be called a <code>immutable.Map[String, String]</code>. A
file is made of blocks, each containing a continuous interval of the keys that
is listed in the file&rsquo;s &ldquo;block index&rdquo;. Each SSTable file is stored in Google
File System (GFS), which I will not describe in detail here, and is similar to
the Hadoop File System (HDFS).</p>

<p>Bigtable relies on Chubby (similar to Apache Zookeeper) to ensure there is &ldquo;at
most one active aster at any time&rdquo; and for bootstrapping.</p>

<p>A Bigtable instance has a single master node, and a potentially dynamic number
of tablet servers. A tablet is a complete alphabetical interval of all the
data in a table. The master manages tablet assignment, server cluster
membership changes (with help from Chubby), load-balancing of tablets among servers, garbage
collection, and schema changes. The tablet server handles read and write
requests to its tablets, and splits tablets when they get too big.</p>

<p>Clients only need to communicate with the master to find the relevant tablet
servers. The relative amount of this sort of communication compared to the
actual data transfers is such that this practice is not a system-wide
bottleneck. Clients also cache this data until a server says the information
is stale.</p>

<p>Tablet location information (across all tables being managed by a single
Bigtable instance) is stored in one big 3-level hierarchical B+-tree type
thing (i.e. it&rsquo;s a sorted <em>n</em>-way tree, where you go down the appropriate
pointer from an internal node based on what it says about the interval of row
IDs contained in its leaves. The root is stored in Chubby. At the third level
I think you have the locations of all the tablets and what keys are owned by
each.</p>

<p>As previously mentioned, persistent tablet state is stored in GFS, but what is
that &ldquo;tablet state&rdquo;. Each write or update is appended a commit log (with a
unique mutation identifier to resolve duplication caused by tricky crash
scenarios). It also modifies the (copy-on-write) in-memory representation of
the current state of this chunk of the table. One can always reconstruct the
in-memory representation by playing back the commit log, but ideally most
reads are serviced from memory. Every once in a while the in-memory state gets
too big, and <em>it</em> is written out as an SSTable into GFS. Every once in a
while, multiple SSTables are read back into, and since some might be updates
to others, they are merged together, and the sources deleted.</p>

<p>In practice, the authors found that applying a very computationally efficient
compression algorithm (rather than one with a top-notch compression ratio)
increased overall system efficiency in many cases. They also added bloom
filters to the SSTables to reduce the number of unecessary disk reads of non-
existent data.</p>

<p>The paper says Google has used Bigtable as a backend for its Google Analytics
product, Google Earth, Personalized Search, and storing websites for
retrieving results for its Search Engine.</p>

<p>As future work they want to be able to provide better (but not full) support
for transactions (I think this is addressed in the Spanner paper released a
few years later).</p>

<p>Their lessons learned include</p>

<ul>
<li>&ldquo;delay adding new features until it is clear how the new features will be
used&rdquo;</li>
<li>&ldquo;the importance of proper system-level monitoring&rdquo;</li>
<li>&ldquo;the value of simple designs&hellip;code and design clarity are of immennse
help in code maintenance and debugging&rdquo;</li>
</ul>


<p>They conclude by noting that this system bears some resemblance to the C-Store
column-oriented database system, for which there is another good paper that I
ought to finish reading at some point. They also note that people who are used
to &ldquo;traditional&rdquo; relational databases find the Bigtable interface awkward, but
that it works so much better than the alternatives that people end up using it
anyway.</p>

<h3>References</h3>

<ul>
<li>Chang, Fay, et al. &ldquo;Bigtable: A distributed storage system for structured
data.&rdquo; ACM Transactions on Computer Systems (TOCS) 26.2 (2008): 4.</li>
</ul>

]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[What Is a Rails Application]]></title>
    <link href="http://ethanp.github.io/blog/2016/04/09/what-is-a-rails-application/"/>
    <updated>2016-04-09T14:47:14-07:00</updated>
    <id>http://ethanp.github.io/blog/2016/04/09/what-is-a-rails-application</id>
    <content type="html"><![CDATA[<h1>What is a server and how does it relate to my Rails app?</h1>

<p>I learned the basics of Ruby on Rails web application development a few years
ago. At that time, my understanding how that system works was as follows</p>

<ol>
<li>Find articles describing things Rails allows you to do easily</li>
<li>Decide on an app that requires only those things</li>
<li>Follow the instructions in those articles to implement your app idea</li>
<li>Push the app to Heroku</li>
<li>Now the app is accessible to anyone on the World Wide Web</li>
</ol>


<p>The fact that it is <em>so</em> easy to do such a thing is nothing short of magical.
Especially while you have <em>no idea</em> how any of it works. Now that I am
slightly more knowledgeable about how software systems are put together and
deployed, I&rsquo;d like to take a slightly more nuanced stab at what really was
going on when the above 5 steps were executed.</p>

<!-- more -->


<h2>How Rails makes things so easy</h2>

<p>Ruby on Rails was created out of frustration with &ldquo;heavyweight&rdquo; web
application development frameworks, such as ones from the late 90&rsquo;s that use
Java. It gives you base-classes for all the important parts of your program so
that you only need to write (and even <em>see</em>) code for the pieces that are
unique to your system. It generates a base project with everything you need
already included in a well-structured layout that helps you make sense of what
is going on. Through some clever usage of the Ruby programming language, it
creates methods named after the components you implemented, and makes them
available in the places you may need them. This has led to a culture of
natural-language inspired community-built APIs that enable you to easily do
things only the most modern websites have.</p>

<h2>How does traffic get into and out of Rails?</h2>

<p>Maybe there is another way to do it, but at least by default Rails uses Rack.
Rack is a &ldquo;webserver interface&rdquo;. A Rack application has a Ruby object that has
a method that receives parsed HTTP requests in a well-specified format, and
then calls a method (called <code>call</code>) which receives the parsed HTTP request as
a Ruby object, and returns a parsed HTTP response in a well-specified format
(specifically a Ruby array structured like <code>[response_code, {headers_map},
[body_lines]]</code>).</p>

<p>A Rails application implements Rack&rsquo;s <code>call</code> method. What that <code>call</code> method
actually returns for any given HTTP request argument is defined by all your
Rails code, e.g. your router, controllers, models, views, etc.</p>

<h3>What about Rack then?</h3>

<p>Fine, now we understand that Rack passes the HTTP request to Rails, and Rails
returns the HTTP response to Rack. But how did Rack get the HTTP request in
the first place? Answer: from a Web Server.</p>

<p>A web server suitable for Rack would be a program that</p>

<ul>
<li>listens on a TCP port</li>
<li>connects to clients on that port</li>
<li>receives the HTTP requests from those clients</li>
<li>parses them into the format that Rack expects</li>
<li>Calls the Rack method <code>call</code> using that format</li>
<li>Receives the HTTP response from Rack</li>
<li>Converts that into the raw HTTP response format</li>
<li>Sends the response to the client</li>
</ul>


<p>There are many such suitable web servers, such as Unicorn and Thin.</p>

<h3>So what is nginx?</h3>

<p>It seems to me that one does not typically use nginx to call Rack&rsquo;s <code>call</code>
method (though this might be possible to do). One typically uses nginx as a
waypoint of traffic bound for the application, but in need of some sort of
pre-processing. That pre-processing could take multiple forms. Maybe there is
a cache sitting outside of Rails itself, and nginx can be setup to return
results from the cache instead of proxying requests into the server which
interacts with Rails itself. This will improve performance for each client who
hits such a cache, and will increase the load the system is capable of
supporting overall. Nginx might know of multiple servers connected to
different instances of the Rails app running concurrently, and will choose
which server to send each request to in such a way as to distribute load
across the Rails applications (&ldquo;load balancing&rdquo;). Nginx might &ldquo;terminate SSL&rdquo;,
meaning that clients connect to the address on which nginx is listening, and
will send their request to nginx over the TLS protocol, but nginx will decrypt
that request before passing it wherever it will go next; then nginx will
encrypt the response according to the TLS state and the client will know how
to decrypt the response (and ideally no one else will know how to decrypt it).</p>

<h2>What Heroku does</h2>

<p>Using the simple <code>git push heroku</code> workflow, Heroku takes care of all the
aspects of managing the deployment of your Rails server. It will provide a
virtual server to run your Rails code. It will provide a server which speaks
Rack to receive requests, invoke your Rails code, and return the response to
the client. It has DNS entries installed in the relevant places on the
internet so that you can choose a name for your app to be accessed by your
friends in their homes using their browsers. It will do a whole lot more, such
as autoscaling, caching, and other things beyond the scope of this post
because I don&rsquo;t know about them.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Simple First Deployment]]></title>
    <link href="http://ethanp.github.io/blog/2016/04/09/simple-first-deployment/"/>
    <updated>2016-04-09T13:45:39-07:00</updated>
    <id>http://ethanp.github.io/blog/2016/04/09/simple-first-deployment</id>
    <content type="html"><![CDATA[<p>I just had my first experience of &ldquo;deploying my system into production&rdquo;. I
have been learning about software engineering for a few years now, and I have
seen this term &ldquo;deploy into production&rdquo; many times, but never experienced it
myself. The &ldquo;software system&rdquo; that was deployed is an internal tool, almost 3
months in development by yours truly. This article is a retrospective of the
development of this system so far.</p>

<!-- more -->


<p>The system itself uses many APIs with which I was previously unfamiliar;
Kafka, Elasticsearch, HDFS, ZooKeeper, Spray, CouchDB, etc.; but I have glued
together unfamiliar APIs on many previous occasions. The system is
&ldquo;distributed&rdquo; and has in-built fault-tolerance in various ways; but I have
included that feature in previous projects. The thing that is new for me about
this project is that it was written with maintainability as a goal, and
&ldquo;released&rdquo; in such a way that it runs continuously as a service, and its
output can serve as input for the daily work of other human beings. This is my
first project for my first job, and it is finally working.</p>

<p>There were <em>many</em> times during the development of this project when I figured
it would probably be deployed within the next few days. At those points, I
told my manager to begin finding me something interesting to work on next. He
always sounded more skeptical than I expected about the project actually being
nearly done. On each of those occasions, I realized within the next few hours
after announcing my near-completion, that there was still, in fact, a <em>slew</em>
of components still needing to be worked out before the software could be
deployed, and that I had <em>no idea</em> how to craft any of these components.</p>

<p>There are <em>too many</em> guides out there about how to develop and deploy software
systems. For me, the most important ended up being <a href="https://medium.com/@jlouis666/how-to-build-stable-systems-6fe9dcf32fc4">a Medium article</a> by
Jesper L. Andersen (no idea who that is). I think I found the article on Hacker
News a month and a half ago. I will summarize my takaways from it</p>

<ul>
<li>The developer is responsible for every aspect of the software</li>
<li>Make sure the project makes sense to tackle</li>
<li>Choose the right libraries, frameworks etc. to use

<ul>
<li>This requires a small amount of experimentation which should be done up
front in a separated/extracted code-base</li>
</ul>
</li>
<li>Limit scope

<ul>
<li>E.g. create a list of problems you&rsquo;re <em>not</em> solving</li>
</ul>
</li>
<li>Limit external dependencies</li>
<li>Create a transition plan from the previous system to the new one</li>
<li>Design for future extension</li>
<li>Prefer correctness and elegance to efficiency of execution

<ul>
<li>This is especially true for my project because it is network bound anyway</li>
<li>But still, understand which parts must be more correct than others</li>
</ul>
</li>
<li>Collect metrics about system performance</li>
<li>Write automated tests of whether

<ul>
<li>The components that need to do something correctly are</li>
<li>The different components of the system can communicate with each other
effectively</li>
</ul>
</li>
</ul>


<p>There are still aspects of my system that are not perfect. For example, I
occasionaly see errors about a timeout in some library used by a library I am
using. I don&rsquo;t know whether those errors are corrupting my data, or killing
JVM threads, or whatever else. The system is already a vast improvement over
what was in place before in terms of correctness and latency of the data
produced, so having the occasional issue was not a good enough reason to delay
deployment any further.</p>

<p>The system was deployed in a manner one might describe as &ldquo;by hand&rdquo;. I rsync
the code onto a server, build the code on the server, open up two <code>tmux</code>
windows, run one JVM which produces to Kafka, run another JVM which consumes
from Kafka, make sure everything that is <em>supposed</em> to be happening <em>is</em>
happening, and detach from tmux. According to the blog post I linked above, it
is desirable to be able to hit a button and see it deploy itself
automatically. In my case this conflicts with the other suggestion in that
article, which is to stick to limit the number of things you have to learn in
order to get the first version of your software working, which I believe the
author meant to be more important than a one-click deployment scheme.</p>

<p>The system is not exactly &ldquo;done&rdquo; in the sense that it doesn&rsquo;t have all the
features it is supposed to have. But at least I know now that the system works
in its production environment and is able to interact successfully and
(demonstrably) correctly with all the external infrastructure that it relies
upon. But to be sure, many corners were cut, many pieces are <em>kind of</em> there
but not really, the metrics are not collected well, and configurations (e.g.
of Kafka, ZooKeeper, and Elasticsearch) were chosen on a &ldquo;that&rsquo;ll probably get
us running&rdquo; basis. As it turns out, there have already been issues with
Elasticsearch not being able to handle the load of multiple of my colleagues
querying it at once.</p>

<p>A problem with having deployed it without a larger share of the issues worked-
out is that I now need to be more careful to make my changes behave as
transitions from &ldquo;working state&rdquo; to &ldquo;working state&rdquo;. There are pieces of the
system that if they are left not-working for more than 24 hours, data will be
irretrievably lost, and people will be sad. But by design of the software, the
piece that <em>must</em> be left running properly is mostly disentangled from the
rest of the system, so that it can run while the rest of the system is
offline, and none of the data will be lost. This is accomplished by buffering
data in Kafka.</p>

<p>Since I have no previous experience deploying a software system like this, I
did not have a good basis to evaluate which libraries to build my system upon,
except for the buzz that I see on Hacker News, and my own intuition and
background. I like the Scala programming language best, and because of Spark,
some people I work with already use Scala, so I chose Scala. Is Go a better
tool for the job? According to the relative buzzes of the two ecosystems on
Hacker News, maybe it is. Would Docker have provided a better mechanism for
deployment? Most-likely it would have, but that can be added at any time if it
turns out that deployment is a <em>problem</em> which needs to be <em>solved</em>, which at
this simple stage, it is not.</p>

<p>It is exciting to final have real people relying on code that I wrote. I am
excited to continue making this system better-meet the needs of its users. I
am even more excited by the shift in the way I look at writing software now
that I have seen more of the lifecycle of a single project.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[What Actually Is SSH]]></title>
    <link href="http://ethanp.github.io/blog/2016/03/06/what-actually-is-ssh/"/>
    <updated>2016-03-06T14:18:53-08:00</updated>
    <id>http://ethanp.github.io/blog/2016/03/06/what-actually-is-ssh</id>
    <content type="html"><![CDATA[<h3>SSH Tunneling</h3>

<p><em>&ldquo;Tunnelling&rdquo;, with two ells, is the British spelling.</em></p>

<p>A few months ago, I downloaded a tool (an ELK stack) that didn&rsquo;t work right
off the bat due to some sort of misconfiguration. It was running in a Vagrant-
made virtual machine (VM) on my laptop. The Vagrant setup script had forwarded
a local port on my laptop into the VM. So in order to debug it, my coworker
configured a chain of tunnels that enabled him to SSH into the VM on my
laptop.</p>

<p>In the back of my mind, I spent the next few weeks trying to figure out what
SSH port-forwarding is and how its syntax works, then another few weeks to
figure out what reverse port-forwarding is, and another few weeks to find
practical use-cases for each.</p>

<p>Here is my executive summary</p>

<pre><code>ssh -L [&lt;localhost&gt;:]&lt;localport&gt;:&lt;remotehost&gt;:&lt;remoteport&gt; &lt;gateway&gt;
</code></pre>

<p>By default, <code>&lt;localhost&gt;</code> will be <code>localhost</code>.</p>

<p>What this does is, start a serversocket listening to local address
<code>localhost:localport</code> using the &ldquo;SSH client&rdquo;. When a client establishes a
connection to that address, traffic received from that client will be
encrypted, and forwarded to the <code>sshd</code>[aemon] listening on port 22 of
<code>gateway</code>. (Only) after the gateway receives this traffic, the <code>sshd</code> will
establish a (normal, unencrypted) connection to remote address
<code>remotehost:remoteport</code>, and forward the data originally received by the SSH
client there. Response traffic originating from <code>remotehost:remoteport</code> will
go back to the <code>sshd</code>, back through the encrypted tunnel to the SSH client,
and back to the originating client.</p>

<p><em>Reverse</em> tunneling by contrast, means that traffic <em>originating</em> on the
remote end will be forwarded to the local end.</p>

<h2>What&rsquo;s SSH</h2>

<p>One thing that confused me about SSH forwarding, is that if I don&rsquo;t use some
extra flags to disable it, when I set up port forwarding to a another machine,
I also end up with a shell ready executing commands remotely. What is going on
here? It turns out it is doing what is called &ldquo;remote command execution&rdquo;.</p>

<!-- more -->


<p>Here&rsquo;s what&rsquo;s happening: SSH is capable of multiplexing multiple communication
&ldquo;channels&rdquo; over a single encrypted TCP connection. If you just execute the
normal tunneling command given above, it will create two channels, one for
forwarding the tcp traffic, and another for executing commands visually in
your pseudo-terminal but physically in a remote shell.</p>

<p>When you <code>ssh</code> onto a remote machine, a particular sequence of things happen
that I think really clarify what the heck&rsquo;s going on. You can see it happening
if you turn on verbose mode (<code>-v</code>).</p>

<ol>
<li>You connect via TCP to the standard SSH-daemon port (22) on the remote
machine</li>
<li>You authenticate it via its public key

<ul>
<li>That&rsquo;s why the first time you connect is says &ldquo;do you want to trust this
server&rsquo;s fingerprint?&rdquo; or something like that (I think saying &ldquo;yes I
trust it&rdquo; implies you have verified that the fingerprint does not belong
to a <em>man-in-the-middle</em> attacker)</li>
</ul>
</li>
<li>It authenticates you, either by you supplying a password, or by it finding
your username in its <code>~/.ssh/authorized_keys</code> file, and [using what protocol?] you match it with your private key</li>
<li>Y&#8217;all negotiate an encryption for the session, over which future
communications will take place

<ul>
<li>[Maybe you have to choose a specific cryptographic algorithm like in TLS?]</li>
</ul>
</li>
<li>Now that communications are encrypted, control and channel data are
transferred using a binary packet format that contains

<ul>
<li>Encrypted packet length &mdash; answers: when does this packet end?</li>
<li>Various random paddings &mdash; to fool the bad guys</li>
<li>The encrypted (and possibly compressed) payload &mdash; here&rsquo;s the good stuff</li>
<li>A message authentication code (MAC) &mdash; find evidence of corruption/tampering</li>
<li>Recipient&rsquo;s integer identifier for this channel &mdash; answers: what <em>channel</em>
did this data come in from?</li>
</ul>
</li>
<li>To setup a remote shell, your SSH client requests permission to create a
special channel specifically designed for the purpose of remote shells

<ul>
<li>At this point you tell the server which ID you&rsquo;re allocating to this
channel, so that it knows how to fill the last item in the format
specified above</li>
<li>Since each channel performs its own flow-control, you also state your
&ldquo;window size&rdquo; (similar to TCP)

<ul>
<li>Both sides maintain a receive window, and once the sender has sent
enough bytes to fill it, they will not send more data until you say
that your window now has more room</li>
</ul>
</li>
</ul>
</li>
<li>If the server supports &ldquo;shell&rdquo; channels, it will tell you which ID it
allocates for the new shell channel

<ul>
<li>so that you can put the correct ID in SSH packets in this channel</li>
</ul>
</li>
<li>Then you tell the server your psuedo-terminal&rsquo;s display specifics

<ul>
<li>characters horizontally and vertically, as well as pixels horizontally and
vertically</li>
<li>If you were using X11, you would tell the server your screen&rsquo;s specifics</li>
<li>If your dimensions change, you can send an update message to the server</li>
</ul>
</li>
<li>Now you can send the server the values of environment variables to use</li>
<li>Then you request that the server initiate a shell in which to execute the
commands you&rsquo;re going to give it. You can specify the path to the specific
shell to use.</li>
<li>Then you start sending commands to execute, which the server will pass to
the shell it created</li>
<li>The server will send you back output received from the shell, formatted
for your terminal</li>
<li>You can also send signals for the server to pass to its shell</li>
<li>When the remote shell command terminates, the server will send you its
&ldquo;exit-status&rdquo;, or if there was a violent termination, it will send an
&ldquo;exit-signal&rdquo;</li>
</ol>


<p>Note this communication only used one channel, and there can be however many
channels open over the same SSH connection. This is implemented using the
framing layer that puts the data into the higher-level datagram format
described in list item 5 above. Datagrams are then demultiplexed by the
receiver according to the the channel identifiers in the datagram. So one
channel might be used for forwarding packets received on a port registered
with SSH for forwarding, while another is used for the normal SSH remote shell
execution terminals. So that&rsquo;s how we see the effects of the &ldquo;-L&rdquo; feature of
SSH described in the beginning.</p>

<h3>It&rsquo;s like HTTP/2</h3>

<p>I don&rsquo;t know many protocols, so a lot of this stuff is new territory for me.
When I got curious, to find out more, I started by reading the Wikipedia page,
because in addition to having its explanation, it often has good references
and links to related topics. From there I followed the reference to the IETF
RFC. Maybe it was because I read the Wikipedia page first, but I found the RFC
much easier to follow, and with a better enunciation of what is going on than
the Wikipedia page. I didn&rsquo;t read the RFC word for word except for the first
few sections, but then, based on its structure and formatting, was able to
find the pieces I was interested in surprisingly fast. The reason for that is
the RFC&rsquo;s authors are familiar with explaning such concepts, and spent months
working as a team to increase the document&rsquo;s utter clarity to reduce the
possibility of miscommunication.</p>

<p>In a previous post I looked at the new HTTP/2 IETF RFC. That protocol contains
a similar framing layer over TCP that allows it to multiplex multiple channels
concurrently over a single underlying TCP socket. IIRC, the reason for that
was to ease issues with &ldquo;head-of-line blocking&rdquo;, which is where one channel&rsquo;s
being slow ends up slowing down all the other channels because of TCP
unecessarily guaranteeing maintaining FIFO order of packets across multiplexed
channels that have no globally-defined order. At the same time opening up many
TCP connections per client is undesirable for the server. So after having seen
this framing-layer thing in two protocols, now I understand it to be some sort
of &ldquo;pattern&rdquo; for allowing multiplexing communications over a single TCP link.
One big advantage is less connections per client on the server side. Also, the
TCP algorithm has a &ldquo;ramp up&rdquo; time, i.e. the congestion control starts out
with a rather pessimistic window size. By using only one socket, that ramp up
only must be done once. Surely, better knowledge of TCP would lead to more
insight here.</p>

<p>I like to get at least a basic idea of what my favorite tools are actually
doing. This sort of investigation can only go so far because the modern stack
of technologies is so vast and the &ldquo;lyfe so shorte&rdquo;. My first computer science
professor encouraged us to press the &ldquo;I believe&rdquo; button when something seemed
like magic. Still, lifting some of the veil on one tool I use everyday is
satisfying, and hopefully the time taken for others to get to my (still
minimal) level of understanding is reduced by the explanations above.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Some Java Network Programming Fundamentals]]></title>
    <link href="http://ethanp.github.io/blog/2016/02/28/some-java-network-programming-fundamentals/"/>
    <updated>2016-02-28T15:15:13-08:00</updated>
    <id>http://ethanp.github.io/blog/2016/02/28/some-java-network-programming-fundamentals</id>
    <content type="html"><![CDATA[<p>Most of what I&rsquo;ve learned and discussed here comes from <em>TCP/IP Sockets in
Java</em>, a highly recommended book about this stuff by Calvert and Donahoo. Some
of it also comes from <em>Java Network Programming</em> by Elliotte Rusty Harold.</p>

<h2>Overview</h2>

<ul>
<li>The <em>only</em> transport-layer protocols Java supports are TCP &amp; UDP; for
anything else, you must link to native code via the Java Native Interface</li>
<li>TCP uses <em>stream</em> sockets, through which one generally just writes to an
<code>OutputStream</code> and reads from an <code>InputStream</code> of bytes that remain in-order
and uncorrupted and are (practically) guaranteed delivery by the
implementation of the protocol by the operating system.

<ul>
<li>Unless you&rsquo;re using NIO; see below for more on that</li>
</ul>
</li>
<li>UDP uses <em>datagram</em> sockets, through which you <code>send</code> and <code>receive</code> objects
called <code>DatagramPacket</code>s, which are just a length, a destination, and data</li>
<li>Unless you&rsquo;re using NIO, everything <em>blocks</em>: e.g. connecting to servers,
listening for clients, reads, writes, and disconnecting (for TCP)

<ul>
<li>By default most of these actions may block <em>indefinitely</em></li>
<li>For reading and connecting, you can configure a timeout, after which you
will receive an <code>InterruptedIOException</code></li>
<li>For <em>writing</em> to a TCP stream, you <em>cannot</em> configure a timeout</li>
</ul>
</li>
</ul>


<h3>Handling multiple clients</h3>

<ul>
<li>Deal with one at a time, which is simplest, especially if there&rsquo;s some state
that is shared by all potential clients. Speed may become problematic
quickly.</li>
</ul>


<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
</pre></td><td class='code'><pre><code class='java'><span class='line'><span class="kt">void</span> <span class="nf">mainLoop</span><span class="o">()</span> <span class="o">{</span>
</span><span class='line'>    <span class="k">while</span> <span class="o">(</span><span class="kc">true</span><span class="o">)</span> <span class="o">{</span>
</span><span class='line'>        <span class="n">Socket</span> <span class="n">s</span> <span class="o">=</span> <span class="n">serverSocket</span><span class="o">.</span><span class="na">accept</span><span class="o">();</span>
</span><span class='line'>        <span class="n">handle</span><span class="o">(</span><span class="n">s</span><span class="o">);</span>
</span><span class='line'>    <span class="o">}</span>
</span><span class='line'><span class="o">}</span>
</span><span class='line'><span class="kt">void</span> <span class="nf">handle</span><span class="o">(</span><span class="n">Socket</span> <span class="n">s</span><span class="o">)</span> <span class="o">{</span>
</span><span class='line'>    <span class="n">InputStream</span> <span class="n">in</span> <span class="o">=</span> <span class="n">s</span><span class="o">.</span><span class="na">getInputStream</span><span class="o">();</span>
</span><span class='line'>    <span class="c1">// process request, etc.</span>
</span><span class='line'>    <span class="n">s</span><span class="o">.</span><span class="na">close</span><span class="o">();</span>
</span><span class='line'><span class="o">}</span>
</span></code></pre></td></tr></table></div></figure>


<ul>
<li>Create a new thread to handle each incoming client. This is still pretty
simple, but will lead to massive overhead if you have many concurrent
clients, and therefore you&rsquo;re context-switching all the time.</li>
</ul>


<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
</pre></td><td class='code'><pre><code class='java'><span class='line'><span class="kt">void</span> <span class="nf">mainLoop</span><span class="o">()</span> <span class="o">{</span>
</span><span class='line'>    <span class="k">while</span> <span class="o">(</span><span class="kc">true</span><span class="o">)</span> <span class="o">{</span>
</span><span class='line'>        <span class="n">Socket</span> <span class="n">s</span> <span class="o">=</span> <span class="n">serverSocket</span><span class="o">.</span><span class="na">accept</span><span class="o">();</span>
</span><span class='line'>        <span class="k">new</span> <span class="nf">Thread</span><span class="o">()</span> <span class="o">{</span>
</span><span class='line'>            <span class="nd">@Override</span> <span class="kd">public</span> <span class="kt">void</span> <span class="n">run</span><span class="o">()</span> <span class="o">{</span>
</span><span class='line'>                <span class="n">InputStream</span> <span class="n">in</span> <span class="o">=</span> <span class="n">s</span><span class="o">.</span><span class="na">getInputStream</span><span class="o">();</span>
</span><span class='line'>                <span class="c1">// process request, etc.</span>
</span><span class='line'>                <span class="n">s</span><span class="o">.</span><span class="na">close</span><span class="o">();</span>
</span><span class='line'>            <span class="o">}</span>
</span><span class='line'>        <span class="o">}.</span><span class="na">start</span><span class="o">()</span>
</span><span class='line'>    <span class="o">}</span>
</span><span class='line'><span class="o">}</span>
</span></code></pre></td></tr></table></div></figure>


<ul>
<li>Use a thread pool to handle requests. Java has abstracted the thread pool
concept into the <code>Executors</code> factory class. There are a multitude of
executors to choose from. This <code>newCachedThreadPool()</code> one will execute each
task on an existing thread if one is idle, and will create a thread
otherwise. Threads sitting idle in the cache for over one minute are
terminated.</li>
</ul>


<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
<span class='line-number'>14</span>
<span class='line-number'>15</span>
<span class='line-number'>16</span>
<span class='line-number'>17</span>
</pre></td><td class='code'><pre><code class='java'><span class='line'><span class="n">ExecutorService</span> <span class="n">executor</span> <span class="o">=</span> <span class="n">Executors</span><span class="o">.</span><span class="na">newCachedThreadPool</span><span class="o">()</span>
</span><span class='line'>
</span><span class='line'><span class="kt">void</span> <span class="nf">mainLoop</span><span class="o">()</span> <span class="o">{</span>
</span><span class='line'>    <span class="k">while</span> <span class="o">(</span><span class="kc">true</span><span class="o">)</span> <span class="o">{</span>
</span><span class='line'>        <span class="n">Socket</span> <span class="n">s</span> <span class="o">=</span> <span class="n">serverSocket</span><span class="o">.</span><span class="na">accept</span><span class="o">();</span>
</span><span class='line'>        <span class="n">executor</span><span class="o">.</span><span class="na">execute</span><span class="o">(</span><span class="k">new</span> <span class="n">TheHandler</span><span class="o">(</span><span class="n">s</span><span class="o">));</span>
</span><span class='line'>    <span class="o">}</span>
</span><span class='line'><span class="o">}</span>
</span><span class='line'><span class="kd">static</span> <span class="kd">class</span> <span class="nc">TheHandler</span> <span class="kd">implements</span> <span class="n">Runnable</span> <span class="o">{</span>
</span><span class='line'>    <span class="n">Socket</span> <span class="n">s</span><span class="o">;</span>
</span><span class='line'>    <span class="kd">public</span> <span class="nf">TheHandler</span><span class="o">(</span><span class="n">Socket</span> <span class="n">s</span><span class="o">)</span> <span class="o">{</span> <span class="k">this</span><span class="o">.</span><span class="na">s</span> <span class="o">=</span> <span class="n">s</span><span class="o">;</span> <span class="o">}</span>
</span><span class='line'>    <span class="nd">@Override</span> <span class="kd">public</span> <span class="kt">void</span> <span class="n">run</span><span class="o">()</span> <span class="o">{</span>
</span><span class='line'>        <span class="n">InputStream</span> <span class="n">in</span> <span class="o">=</span> <span class="n">s</span><span class="o">.</span><span class="na">getInputStream</span><span class="o">();</span>
</span><span class='line'>        <span class="c1">// process request, etc.</span>
</span><span class='line'>        <span class="n">s</span><span class="o">.</span><span class="na">close</span><span class="o">();</span>
</span><span class='line'>    <span class="o">}</span>
</span><span class='line'><span class="o">}</span>
</span></code></pre></td></tr></table></div></figure>


<ul>
<li>Use NIO (rather complicated) to allow <code>N</code> threads to service <code>M</code>
clients, where <code>N</code> is small and <code>M</code> is huge. This uses event-based
programming. We can set all network operations to be non-blocking, and
only wait as long as we want to for them. An extensive example can be
found below.</li>
<li>Use a framework like Netty, Akka, etc. that wraps the NIO stuff up in a
ribbon and a tie</li>
</ul>


<h3>10K feet above NIO</h3>

<!-- more -->


<ul>
<li>If you&rsquo;re using NIO, you create <code>Channel</code>s of bytes into and out of sockets
(or file handles)</li>
<li>You register a <code>Selector</code> to be notified when the <code>Channel</code> is ready to be
read from or written to</li>
<li>You query the <code>Selector</code> to tell you which <code>Channel</code> are ready, and may then
take action on those that are</li>
<li>You get data in and out by passing a <code>Buffer</code> to the <code>Channel</code></li>
</ul>


<p>Here&rsquo;s an example based on TCP/IP Sockets in Java, a highly recommended book
about this stuff by Calvert and Donahoo.</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
<span class='line-number'>14</span>
<span class='line-number'>15</span>
<span class='line-number'>16</span>
<span class='line-number'>17</span>
<span class='line-number'>18</span>
<span class='line-number'>19</span>
<span class='line-number'>20</span>
<span class='line-number'>21</span>
<span class='line-number'>22</span>
<span class='line-number'>23</span>
<span class='line-number'>24</span>
<span class='line-number'>25</span>
<span class='line-number'>26</span>
<span class='line-number'>27</span>
<span class='line-number'>28</span>
<span class='line-number'>29</span>
<span class='line-number'>30</span>
<span class='line-number'>31</span>
<span class='line-number'>32</span>
<span class='line-number'>33</span>
<span class='line-number'>34</span>
<span class='line-number'>35</span>
<span class='line-number'>36</span>
<span class='line-number'>37</span>
<span class='line-number'>38</span>
<span class='line-number'>39</span>
<span class='line-number'>40</span>
<span class='line-number'>41</span>
<span class='line-number'>42</span>
<span class='line-number'>43</span>
<span class='line-number'>44</span>
<span class='line-number'>45</span>
<span class='line-number'>46</span>
<span class='line-number'>47</span>
<span class='line-number'>48</span>
<span class='line-number'>49</span>
<span class='line-number'>50</span>
<span class='line-number'>51</span>
<span class='line-number'>52</span>
<span class='line-number'>53</span>
<span class='line-number'>54</span>
<span class='line-number'>55</span>
<span class='line-number'>56</span>
<span class='line-number'>57</span>
<span class='line-number'>58</span>
<span class='line-number'>59</span>
<span class='line-number'>60</span>
<span class='line-number'>61</span>
<span class='line-number'>62</span>
<span class='line-number'>63</span>
<span class='line-number'>64</span>
<span class='line-number'>65</span>
<span class='line-number'>66</span>
<span class='line-number'>67</span>
<span class='line-number'>68</span>
<span class='line-number'>69</span>
<span class='line-number'>70</span>
<span class='line-number'>71</span>
<span class='line-number'>72</span>
<span class='line-number'>73</span>
<span class='line-number'>74</span>
<span class='line-number'>75</span>
<span class='line-number'>76</span>
<span class='line-number'>77</span>
<span class='line-number'>78</span>
<span class='line-number'>79</span>
<span class='line-number'>80</span>
<span class='line-number'>81</span>
<span class='line-number'>82</span>
<span class='line-number'>83</span>
</pre></td><td class='code'><pre><code class='java'><span class='line'><span class="n">Selector</span> <span class="n">slctr</span> <span class="o">=</span> <span class="n">Selector</span><span class="o">.</span><span class="na">open</span><span class="o">();</span> <span class="c1">// factory</span>
</span><span class='line'><span class="n">ServerSocketChannel</span> <span class="n">chnl</span> <span class="o">=</span> <span class="n">ServerSocketChannel</span><span class="o">.</span><span class="na">open</span><span class="o">();</span> <span class="c1">// factory</span>
</span><span class='line'><span class="n">chnl</span><span class="o">.</span><span class="na">socket</span><span class="o">().</span><span class="na">bind</span><span class="o">(</span><span class="n">inetAddr</span><span class="o">);</span> <span class="c1">// set address to listen on</span>
</span><span class='line'>
</span><span class='line'><span class="c1">// For some reason Channels block by default. If we want to</span>
</span><span class='line'><span class="c1">// register with the Selector for notifications, we must turn</span>
</span><span class='line'><span class="c1">// that off.</span>
</span><span class='line'><span class="n">chnl</span><span class="o">.</span><span class="na">configureBlocking</span><span class="o">(</span><span class="kc">false</span><span class="o">);</span>
</span><span class='line'>
</span><span class='line'><span class="c1">// Notify Selector whenever this Channel has a new connection</span>
</span><span class='line'><span class="c1">// ready to be &quot;accepted&quot;. Such a notification still does</span>
</span><span class='line'><span class="c1">// *not* guarantee it will work immediately.</span>
</span><span class='line'><span class="n">chnl</span><span class="o">.</span><span class="na">register</span><span class="o">(</span><span class="n">slctr</span><span class="o">,</span> <span class="n">SelectionKey</span><span class="o">.</span><span class="na">OP_ACCEPT</span><span class="o">);</span>
</span><span class='line'>
</span><span class='line'><span class="k">while</span> <span class="o">(</span><span class="kc">true</span><span class="o">)</span> <span class="o">{</span>
</span><span class='line'>    <span class="c1">// wait configurable period of time to be notified</span>
</span><span class='line'>    <span class="c1">// by any registered channel</span>
</span><span class='line'>    <span class="kt">int</span> <span class="n">numNotifications</span> <span class="o">=</span> <span class="n">slctr</span><span class="o">.</span><span class="na">select</span><span class="o">(</span><span class="n">timeoutMS</span><span class="o">);</span>
</span><span class='line'>    <span class="k">if</span> <span class="o">(</span><span class="n">numNotifications</span> <span class="o">==</span> <span class="mi">0</span><span class="o">)</span> <span class="o">{</span>
</span><span class='line'>        <span class="c1">// We timed out without any notification.</span>
</span><span class='line'>        <span class="c1">// We could do whatever we want here because we&#39;re</span>
</span><span class='line'>        <span class="c1">// no longer blocked.</span>
</span><span class='line'>    <span class="o">}</span> <span class="k">else</span> <span class="o">{</span>
</span><span class='line'>        <span class="c1">// numNotifications different channels have notified us of</span>
</span><span class='line'>        <span class="c1">// being available for Connect, Read, Accept, or Write.</span>
</span><span class='line'>        <span class="c1">// It is OK to use these keys in concurrent threads.</span>
</span><span class='line'>        <span class="k">for</span> <span class="o">(</span><span class="n">SelectionKey</span> <span class="n">key</span> <span class="o">:</span> <span class="n">slctr</span><span class="o">.</span><span class="na">selectedKeys</span><span class="o">())</span> <span class="o">{</span>
</span><span class='line'>            <span class="c1">// We&#39;re not sure which channel this key belonged to.</span>
</span><span class='line'>            <span class="c1">// Also, notification was just a &quot;hint&quot; and we need to</span>
</span><span class='line'>            <span class="c1">// check again whether the Channel is available.</span>
</span><span class='line'>            <span class="k">if</span> <span class="o">(</span><span class="n">key</span><span class="o">.</span><span class="na">isAcceptable</span><span class="o">())</span> <span class="o">{</span>
</span><span class='line'>                <span class="c1">// here&#39;s the actual call to accept()</span>
</span><span class='line'>                <span class="n">SocketChannel</span> <span class="n">clientChnl</span> <span class="o">=</span>
</span><span class='line'>                    <span class="o">((</span><span class="n">ServerSocketChannel</span><span class="o">)</span> <span class="n">key</span><span class="o">.</span><span class="na">channel</span><span class="o">()).</span><span class="na">accept</span><span class="o">();</span>
</span><span class='line'>
</span><span class='line'>                <span class="c1">// similar to the ServerSocketChannel</span>
</span><span class='line'>                <span class="n">clientChnl</span><span class="o">.</span><span class="na">configureBlocking</span><span class="o">(</span><span class="kc">false</span><span class="o">);</span>
</span><span class='line'>                <span class="c1">// Except that here we register to notify Selector</span>
</span><span class='line'>                <span class="c1">// about being &quot;readable&quot;, and</span>
</span><span class='line'>                <span class="n">clientChnl</span><span class="o">.</span><span class="na">register</span><span class="o">(</span>
</span><span class='line'>                    <span class="n">key</span><span class="o">.</span><span class="na">selector</span><span class="o">(),</span>
</span><span class='line'>                    <span class="n">SelectionKey</span><span class="o">.</span><span class="na">OP_READ</span><span class="o">,</span>
</span><span class='line'>                    <span class="c1">// We must associate an &quot;attachment&quot; with this</span>
</span><span class='line'>                    <span class="c1">// channel. This is the buffer that will be</span>
</span><span class='line'>                    <span class="c1">// filled with the incoming bytes rcvd via TCP.</span>
</span><span class='line'>                    <span class="n">ByteBuffer</span><span class="o">.</span><span class="na">allocate</span><span class="o">(</span><span class="n">NUM_BYTES</span><span class="o">)</span> <span class="c1">// eg 256?</span>
</span><span class='line'>                <span class="o">);</span>
</span><span class='line'>            <span class="o">}</span>
</span><span class='line'>            <span class="k">if</span> <span class="o">(</span><span class="n">key</span><span class="o">.</span><span class="na">isReadable</span><span class="o">())</span> <span class="o">{</span>
</span><span class='line'>                <span class="c1">// retrieve the readable client socket&#39;s channel</span>
</span><span class='line'>                <span class="n">SocketChannel</span> <span class="n">client</span> <span class="o">=</span>
</span><span class='line'>                    <span class="o">(</span><span class="n">SocketChannel</span><span class="o">)</span> <span class="n">key</span><span class="o">.</span><span class="na">channel</span><span class="o">();</span>
</span><span class='line'>
</span><span class='line'>                <span class="c1">// retrieve the ByteBuffer we associated with</span>
</span><span class='line'>                <span class="c1">// that channel</span>
</span><span class='line'>                <span class="n">ByteBuffer</span> <span class="n">buf</span> <span class="o">=</span> <span class="o">(</span><span class="n">ByteBuffer</span><span class="o">)</span> <span class="n">key</span><span class="o">.</span><span class="na">attachment</span><span class="o">();</span>
</span><span class='line'>
</span><span class='line'>                <span class="c1">// Attempt to read `buf.remaining()` bytes _from_</span>
</span><span class='line'>                <span class="c1">// the Channel _into_ the ByteBuffer.</span>
</span><span class='line'>                <span class="kt">int</span> <span class="n">bytesRead</span> <span class="o">=</span> <span class="n">client</span><span class="o">.</span><span class="na">read</span><span class="o">(</span><span class="n">buf</span><span class="o">);</span>
</span><span class='line'>
</span><span class='line'>                <span class="c1">// -1 from read() means end-of-stream, which in</span>
</span><span class='line'>                <span class="c1">// this case means the client closed their output</span>
</span><span class='line'>                <span class="c1">// side of the TCP connection. We may still be</span>
</span><span class='line'>                <span class="c1">// able to send data if that side of the connection</span>
</span><span class='line'>                <span class="c1">// has not been closed yet.</span>
</span><span class='line'>                <span class="k">if</span> <span class="o">(</span><span class="n">bytesRead</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="o">)</span> <span class="n">client</span><span class="o">.</span><span class="na">close</span><span class="o">();</span>
</span><span class='line'>
</span><span class='line'>                <span class="k">else</span> <span class="nf">if</span> <span class="o">(</span><span class="n">bytesRead</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="o">)</span> <span class="o">{</span>
</span><span class='line'>                    <span class="c1">// if our application has data to write back</span>
</span><span class='line'>                    <span class="c1">// to the client, we must tell the selector</span>
</span><span class='line'>                    <span class="c1">// that we&#39;ve now become interested in writing</span>
</span><span class='line'>                    <span class="n">key</span><span class="o">.</span><span class="na">interestOps</span><span class="o">(</span><span class="n">SelectionKey</span><span class="o">.</span><span class="na">OP_READ</span>
</span><span class='line'>                                    <span class="o">|</span> <span class="n">SelectionKey</span><span class="o">.</span><span class="na">OP_WRITE</span><span class="o">);</span>
</span><span class='line'>                <span class="o">}</span>
</span><span class='line'>            <span class="o">}</span>
</span><span class='line'>            <span class="c1">// socket not closed, and is writable</span>
</span><span class='line'>            <span class="k">if</span> <span class="o">(</span><span class="n">key</span><span class="o">.</span><span class="na">isValid</span><span class="o">()</span> <span class="o">&amp;&amp;</span> <span class="n">key</span><span class="o">.</span><span class="na">isWritable</span><span class="o">())</span> <span class="o">{</span>
</span><span class='line'>                <span class="c1">// beyond the scope of this post.</span>
</span><span class='line'>            <span class="o">}</span>
</span><span class='line'>        <span class="o">}</span>
</span><span class='line'>    <span class="o">}</span>
</span><span class='line'><span class="o">}</span>
</span></code></pre></td></tr></table></div></figure>


<h2>Tips for Traps</h2>

<ul>
<li>Don&rsquo;t write to the network through a <code>PrintStream</code>

<ul>
<li>It chooses end-of-line chars based on your platform, not the protocol
(HTTP uses <code>\r\n</code>)</li>
<li>It uses the default char encoding of your platform (likely UTF-8), not
whatever the server expects (likely UTF-8)</li>
<li>It eats all exceptions into this <code>boolean checkError()</code> method, when
you&rsquo;re better off just using the normal exception hubbub</li>
</ul>
</li>
</ul>

]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Asking for Advice]]></title>
    <link href="http://ethanp.github.io/blog/2016/02/27/asking-for-advice/"/>
    <updated>2016-02-27T15:38:06-08:00</updated>
    <id>http://ethanp.github.io/blog/2016/02/27/asking-for-advice</id>
    <content type="html"><![CDATA[<p>There are many circumstances in life during which one may feel the need to ask
for advice. For example, when evaluating a significant decision, or when
dealing with an emotionally stressful circumstance. Here, I will discuss
advice about significant decisions.</p>

<p>In giving advice, everyone has a different approach. I generally try to follow
a line I heard in a rap song by The Streets, &ldquo;If you never tell a lie to her,
you don&rsquo;t have to remember anything.&rdquo; In other words, lying will only
complicate your life because you have to remember the lies you made up.
(Caveat: this may not always the best way to go for emotionally complex
issues.) I also enjoy helping people rationally and realistically evaluate
their options for significant decisions, and surely if someone recalls that
your input was helpful in the past, they will be more likely to ask you in the
future.</p>

<p>Most people I know don&rsquo;t seem to like giving useful advice. It seems they
either are (1) too afraid that their honesty will lead you to dislike them, or
(2) they feel so stressed with their own issues that taking on yours for a few
minutes would be overwhelming, or (3) they find your problem uninteresting and
simply have better things to do.</p>

<p>But some people are the opposite. They will patiently listen to your question
and give what they feel to be an honest evaluation of where you stand and what
you should do. The advice of people in this category will often be heavily and
obviously biased by their own experience and ideology. This is simply a
symptom of &ldquo;being honest&rdquo;.</p>

<p>So if you want good advice, it would be ideal to find someone who is honest,
not stressed about a similar problem to yours, as well as interested in and
knowledgeable of the subject; they should generally also be disinterested in
your particular problem. However, this ideal candidate is not always
available.</p>

<!-- more -->


<p>In that case one can obviously try the timeless &ldquo;pros vs cons&rdquo; list, which
doesn&rsquo;t necessarily rely on external sources of wisdom, but often external
sources of wisdom are critical. One can ask unideal candidates for advice, and
maybe they&rsquo;ll at least have some curt nugget that has some use. That&rsquo;s what I
usually do, and it is generally not effective at all, but occasionaly that
curt nugget is exactly the required pithy jewel.</p>

<p>One can consult the Internet, but I kind of assumed that if you needed advice,
you already checked the Internet for answers. But to go one step further you
can <em>ask</em> the Internet, treating it like this all knowing Oracle. This will
work to varying degrees depending on your problem. If your problem is one that
everyone and their mother has an opinion on, you will end up sifting through
junk answers, and may or may not get anything useful. In that case, the more
details you can reveal, the more benefit you will obtain. Of course <em>where</em>
you ask matters: for example Quora will be more effective than Yahoo Answers.
If your problem is esoteric, <em>find the people who are into that thing</em>.</p>

<p>For example, I have noticed to my surprise that StackOverflow/StackExchange is
not always the best place for all programming-related questions, because if
your question is esoteric to one technological ecosystem, the question-
answering population on StackOverflow won&rsquo;t necessarily have the flag-bearers
of the cause of that particular ecosystem. But if it really is an ecosystem,
it will have a place where the flag-bearers dwell, so <em>find it</em> and they will
probably help. In my experience, IRC channels tend to be empty; Gitter
channels may be well-attended; Github issues get a lot of flak for being a
mess, but are often effective if you&rsquo;re sure your question isn&rsquo;t stupid;
mailing lists can be high-latency but that&rsquo;s often where the true experts of
super-technical projects converse. Following <strong>good forum-question-asking
practices</strong> is <em>crucial</em> in <em>any</em> of these environments, and it can be
trickier to do that than it sounds. <strong>The better you phrase your question, the
more appealing it is to answer in the eyes of a high-quality potential source
of advice.</strong></p>

<p>Sometimes the advice you receive from multiple sources will be in direct
conflict with each other. I have been having this issue a lot lately. Let&rsquo;s
say one person suggested I do <code>A</code>, and another suggested I do <code>not A</code>. I have
decided this pair of suggestions implies that <em>either</em> decision is just fine,
which is basically the ideal outcome; viz. my so-called &ldquo;significant decision&rdquo;
was not so significant after all and doesn&rsquo;t not need to be balanced
carefully.</p>

<p>Just some biased thoughts from my experience; take some, leave some, etc.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[The Tone Makes the Point]]></title>
    <link href="http://ethanp.github.io/blog/2016/01/07/the-tone-makes-the-point/"/>
    <updated>2016-01-07T08:40:04-08:00</updated>
    <id>http://ethanp.github.io/blog/2016/01/07/the-tone-makes-the-point</id>
    <content type="html"><![CDATA[<p>Recently, hot-shot investor Paul Graham wrote a <a href="http://www.paulgraham.com/ineq.html">piece about economic
inequality</a>. This essay provoked a lot of dialogue online. I read the
piece, and a few responses to it. I enjoyed the essay, and I found it very
persuasive. In fact, I would say my &lsquo;opinion on the matter&rsquo; has been changed by
reading the essay and its accompanying summary by the author. The responses I
read were also very interesting. They were from seemingly center-left-wing
individuals like myself, only they were <em>not</em> persuaded by Graham&rsquo;s arguments.</p>

<p>My point here is not so much to discuss the piece, as the tone of the piece.
This piece was written with a tone that raises the hair on the back of the
necks of liberals. This may have been unintentional on the part of Graham,
&hellip;although considering the follow-up discussions of his previous piece on
females in tech, may he just likes raising those particular hairs (they&rsquo;re
called &ldquo;hackles&rdquo;).</p>

<!-- more -->


<p>I will now summarize what I took from the essay. Technology increases the
potential productivity of people at an exponential rate. Since some people do
not become more productive, this correspondingly increases the <em>variability</em> of
productivity within the population. Those who are more productive in producing
innovative goods are benefitting society. They mainly only are interested in
being more productive <em>because</em> of the possibility of getting rich off of it.
If we want to help the poor, the problem is <em>not</em> that there are rich people;
the problem is that there are poor people. To help the poor, instead of taking
away the ability for the productive to get rich, we should help the poor become
more productive. This is because economic inequality <em>per se</em> is not the issue,
poverty is.</p>

<p>For me, this is, as I&rsquo;ve mentioned, both convincing and enlightening. But there
are two concerns any left-winger is almost certainly going to bring up. First,
a lot of people&rsquo;s wealth is not made by being productive in a way that benefits
society; shouldn&rsquo;t we wage class-war against them? Second, even people who
became rich by benefitting society, are likely to abuse their wealth via some
form of corruption, so that&rsquo;s why we should prevent them from getting rich in
the first place. After those two arguments, there are a few other popular
rebuttals. (E.g.) Third, why should we allow some people to be <em>that</em> much
richer than everyone else; something about it seems creepy, therefore it should
be prevented.</p>

<p>These <em>are</em> legitimate concerns, they just don&rsquo;t <em>actually</em> rebut anything
Graham was saying. That&rsquo;s why, if Graham wants to have a legitimate
conversation with people who are going to make the above rebuttals, his
argument should <em>start</em> by saying that those rebuttals are <em>correct</em>, and then
go into how it is <em>still</em> the case that his original arguments are correct. In
this way, the hackles are down, and the legitimate points can be heard.</p>

<p>However, maybe Graham wasn&rsquo;t writing the essay for left-wing people to be
convinced. Maybe he was trying to convince right-wing people. If that is true,
the essay was pointless, because right-wing people <em>already</em> agree with the
opinions he stated. Or maybe he was writing the essay for maximum &ldquo;virality&rdquo; to
increase his fame and international mindshare. In that case, I&rsquo;m not a good
judge of effective argument technique.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Gamified Learning]]></title>
    <link href="http://ethanp.github.io/blog/2016/01/03/gamified-learning/"/>
    <updated>2016-01-03T21:34:05-08:00</updated>
    <id>http://ethanp.github.io/blog/2016/01/03/gamified-learning</id>
    <content type="html"><![CDATA[<p>I have spent the last few days trying to get better at programming. I am no
longer in school, so if I want to learn about, say, compilers, I can&rsquo;t just
sign up for a class on that. There are multiple ways to go about it. It depends
on what you really want to learn and what you want to be able to do with that
knowledge. I have tried various methods of learning over the past few days,
which I have listed below.</p>

<p>What I have noticed of myself is that gamification <strong>works</strong> for me. I love the
satisfaction of (in descending order of [perceived] satisfaction)</p>

<ul>
<li>earning a badge</li>
<li>getting to the next level</li>
<li>beating other humans (directly, or via &ldquo;percentile&rdquo; calculation)</li>
<li>beating robots (similar to &ldquo;beating a boss&rdquo; in a video game)</li>
<li>getting points</li>
</ul>


<!-- more -->


<p>Missing from the list below is Coursera, Udacity, etc. After 6 years of school,
I&rsquo;m <em>really</em> of sick of sitting in a lecture hall, even if it&rsquo;s a virtual one.
As a testament to this sentiment, so far I&rsquo;ve read ~150 pages of the Compilers
textbook, which is the book I would be reading were I to take the class at
University. Sometimes I miss having a teacher whose personality I can associate
with the material. This is a bizarre notion, but it is really important. For
me, having a &ldquo;great teacher&rdquo; means someone I want to please by doing well on
the test. Having a &ldquo;bad teacher&rdquo; means I don&rsquo;t care what the teacher thinks of
me, and frankly I want to do badly to make them feel bad. This is a mean way of
looking at the world, and I don&rsquo;t do it consciously, but looking back at how I
approached my classes, this is what I did a lot of the time. Having a &ldquo;great
teacher&rdquo; is an amazing feeling, and an excellent motivator to really dig into
the subject. But I certainly <em>don&rsquo;t</em> miss all the other crap that comes from
having a teacher (even a &ldquo;great&rdquo; one). The main problem with teachers is that
you need to be respectful of their time and the time of the other students in
the class. This leads to a huge amount of wasted time on the part of each
student, because his or her <em>particular</em> questions aren&rsquo;t being answered. I
could go on about that but I won&rsquo;t. The point is, that I really
feel like I understand everything that has happened so far in the 150 pages of
the Compilers book, and it didn&rsquo;t require and physical hand-holding. It is
exceptionally easy to read a book and think you understand it, and actually
<em>not</em> be understanding it. So who knows.</p>

<h3>Here on out</h3>

<p>My first &ldquo;real&rdquo; day of my first &ldquo;real&rdquo; job is tomorrow. So I probably won&rsquo;t be
spending much time just randomly learning whatever I want in the future. But in
general, these are things that I like to learn about, and methods of learning
them that I enjoy doing; so as I find time, I will come back to these things.</p>

<h3>The List</h3>

<p>This list contains bullets for each thing I have tried to learn over the past
few weeks. The sub-bullets contain methods I have used to actually <em>do</em> the
learning. Many of them are gamified methods of learning (e.g. HackerRank).
Others, I have tried to gamify on my own (e.g. using test-driven development).</p>

<ul>
<li>Learn about compilers

<ul>
<li>read Compilers &ldquo;Dragon Book&rdquo; textbook

<ul>
<li>avg 8 pages per day for 20 days</li>
</ul>
</li>
<li>do the exercises (on paper)</li>
</ul>
</li>
<li>Review AI &amp; machine learning

<ul>
<li>earn an AI &ldquo;badge&rdquo; on HackerRank by

<ul>
<li>doing statistics problems (e.g. &ldquo;find the z-score&rdquo;)</li>
<li>writing a few programs (e.g. &ldquo;do a multiple linear regression on the
given data&rdquo;)</li>
</ul>
</li>
</ul>
</li>
<li>Become more familiar with algorithms and their application

<ul>
<li>do practice problems on Hackerrank</li>
<li>Read [from] the algorithms textbook &ldquo;CLRS&rdquo;</li>
</ul>
</li>
<li>Become a &ldquo;good programmer&rdquo;

<ul>
<li>write the programs suggested in &ldquo;Programming Pearls&rdquo; by Jon Bentley</li>
<li>Beat bots and people at debugging code on <code>codefights.com</code></li>
</ul>
</li>
<li>Learn via observation

<ul>
<li><code>Livecoding.tv</code> (haven&rsquo;t actually looked into this one yet)</li>
<li>Annotate the code of small open source projects with comments explaining
how it is working

<ul>
<li>substack &mdash; creator of popular modules for Node.js</li>
<li>Li Haoyi &mdash; Scala projects using concepts from compilers</li>
</ul>
</li>
</ul>
</li>
<li>Understand P2P networking

<ul>
<li>Create a personal &ldquo;open source&rdquo; programming project that is a P2P
networking application

<ul>
<li>To a large extent using test-driven development</li>
</ul>
</li>
</ul>
</li>
<li>Learn to use NodeJS

<ul>
<li>do the <code>learnyounode</code> interactive tutorial</li>
</ul>
</li>
<li>Learn about functional programming

<ul>
<li>read from Learn you a Haskell</li>
<li>write some simple haskell programs as part of a &ldquo;learn to code in 30
days&rdquo; competition on HackerRank</li>
</ul>
</li>
</ul>

]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[First Glimpse of Compilers]]></title>
    <link href="http://ethanp.github.io/blog/2015/12/26/first-glimpse-of-compilers/"/>
    <updated>2015-12-26T22:49:00-08:00</updated>
    <id>http://ethanp.github.io/blog/2015/12/26/first-glimpse-of-compilers</id>
    <content type="html"><![CDATA[<p>I am close to two chapters into &ldquo;Compilers&rdquo; (a.k.a. &ldquo;The Dragon Book&rdquo;), by Aho,
Lam, Sethi, and Ullman. It is an exciting topic to learn about.</p>

<p>The most fundamental thing I have learned so far is the overall pipeline for
understanding the modular components forming the way a compiler is typically
constructed. It goes</p>

<ol>
<li>Lexical analyzer</li>
<li>Syntax analyzer</li>
<li>Semantic analyzer</li>
<li>Intermediate code generator</li>
<li><em>Machine-independent code optimizer</em> (optional)</li>
<li>Code generator</li>
<li><em>Machine-dependent code optimizer</em> (optional)</li>
</ol>


<p>The input to the compiler pipeline is a &ldquo;character stream&rdquo; of the program.
However I don&rsquo;t recall the book ever dealing with specifications of that
stream, except that it supports a generic <code>getchar()</code> operation. I can
understand that if the entire source consists of one file, you just read
character-by-character from the file into the compiler. Maybe the language
designer decides how multiple-file projects are to be read by the compiler,
i.e. it is beyond the scope of this book; or maybe they&rsquo;ll go more in-depth on
this later-on.</p>

<h3>The Lexical Analyzer</h3>

<p>The character stream is read into first component in the compilation pipeline:
the <strong>lexical analyzer</strong>, which maps the <em>character stream</em> into a <em>&ldquo;token&rdquo;
stream</em>. It seems like a <strong>token</strong> knows its <em>tag</em>, and its <em>value</em>. The
<strong>tag</strong> is basically the <em>type</em> of this token in the eyes of the compiler.
Possible tags in their example include <code>NUM</code>, <code>ID</code>, <code>[keyword]</code>s, and I added
<code>COMMENT</code> as part of an exercise. The <strong>value</strong> for a token might be the
literal <em>value</em> of a literal; or it might be the name of a variable. More about
that below.</p>

<h4>Symbol Tables</h4>

<p>In addition to creating the token stream, the lexical analyzer builds the
<em>symbol table</em>. It seems to that the <strong>symbol table</strong> maps names to places in
memory. A symbol table also &lsquo;by default&rsquo; implements the language&rsquo;s preferred
form of <em>block scoping</em>. Upon entering a new scope, a new symbol table is
created that points to its &ldquo;parent&rdquo;, the one it is nested inside of. If a
variable is <em>declared</em>, it is added to the symbol table to this scope. Later,
when a variable is <em>referenced</em> (i.e. init&rsquo;d, read, or updated), we will
consult the table for the current innermost scope. If the variable&rsquo;s data is
not found there, we will continue to check ancestors until we find it.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Learning vs Doing]]></title>
    <link href="http://ethanp.github.io/blog/2015/12/19/learning-vs-doing/"/>
    <updated>2015-12-19T14:05:20-08:00</updated>
    <id>http://ethanp.github.io/blog/2015/12/19/learning-vs-doing</id>
    <content type="html"><![CDATA[<p>The goal of my first project at my first &ldquo;real&rdquo; job is to get something useful
done within a few days and start to feel like a contributing member of the
team. However it has been about a week, and I still have not finished that
project.</p>

<p>From my experience talking with some of my colleagues so far, their collective
attitude might be summed up with, &ldquo;Google it, then copy-paste it, but don&rsquo;t
worry about what it does.&rdquo; In my life, I have worked with and met <em>many</em> people
having that attitude. Partially because of my experience working with those
people, it happens to decidedly <em>not</em> be <em>my</em> attitude. My attitude is more
like &ldquo;google it, learn what to do, learn why that is the right approach, take
notes, and then copy-paste and modify the best solution to make the final
solution as clean as possible.&rdquo; This strategy got me through many tough
situations, so I have built up faith in it.</p>

<p>So, after seeing me spend <em>days</em> learning about ssh tunnelling, ansible, and
vagrant &mdash; and not finishing my simple project &mdash; they finally said something
along the lines of</p>

<blockquote><p>At this rate it will take you weeks to learn how to automate deployment of a
virtual machine. Why don&rsquo;t you just deploy <em>one</em> copy and then learn about
how to automate it on your <em>own</em> time?</p></blockquote>

<p>Now, &ldquo;weeks&rdquo; is probably an overstatement, but they pointed out to me that I
had sort of assumed out of nowhere that I was hired as some sort of devops role
for the company, even though what I&rsquo;m actually interested in is what one might
call &ldquo;big data engineering&rdquo;. They said, &ldquo;<em>If</em> you find yourself repeating the
same tasks over and over, <em>then</em> you should learn to automate them.&rdquo; It is now
obvious that they are in the right.</p>

<p>At the time, I was startled by the way they approached me about what I consider
to be largely a difference in personalities. But I can see that they are
concerned that someone who reads too much never gets anything done, and so far
I have fit that stereotype. At the same time, I am concerned that a person who
<em>doesn&rsquo;t</em> read will do their work quickly and incorrectly and will then spend
the next few weeks rejiggering a broken project, and so far they have fit that
stereotype. In short, it seems that we are all prejudiced and have plenty to
learn from each other.</p>

<p>Going forward I shall take their advice and do the more mundane tasks as fast
as possible and only learn things that come up more than once. If that doesn&rsquo;t
work out for me because I&rsquo;m just cranking out shitty work, I will revert back
to understanding what I am doing.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Very Basics of Jetty]]></title>
    <link href="http://ethanp.github.io/blog/2015/10/27/very-basics-of-jetty/"/>
    <updated>2015-10-27T09:09:50-07:00</updated>
    <id>http://ethanp.github.io/blog/2015/10/27/very-basics-of-jetty</id>
    <content type="html"><![CDATA[<h3>Jetty has <em>cutting edge</em> HTTP/2 support</h3>

<p>For my Wireless Networking project, I&rsquo;m going to compare HTTP/1.1 and HTTP/2
with respect to performance metrics when talking to a mobile phone over the
cellular network. There are not so many <a href="https://github.com/http2/http2-spec/wiki/Implementations">implementations</a> of HTTP/2 right
now, and some of them seem a bit shaky. At the end of the day, it seems to me
that the easiest way to run this sort of experiment in a reliable fashion is to
use Java&rsquo;s <a href="http://www.eclipse.org/jetty/">Jetty</a> project. It has well-tested HTTP/1.1 support, many
heavyweight framework users, implements HTTP/2&rsquo;s server push and all sorts of
HTTP/2 negotation mechanisms, and I like static types.</p>

<h3>Jetty is hard to find tutorials for</h3>

<p>So I need to learn the basics of Jetty; and there&rsquo;s a lot to learn, and I
haven&rsquo;t found a stellar resource, so I&rsquo;ve been reading the <a href="http://www.eclipse.org/jetty/documentation/current/embedding-jetty.html">embedded
examples</a>, which are decent. I&rsquo;m usng &ldquo;embedded&rdquo; Jetty because that means
I can write type-safe Java rather than XML. Perhaps XML would be a good choice
for a long-standing app, but I&rsquo;m making a prototype and it&rsquo;s easier to just
write code.</p>

<p>Here&rsquo;s a brief overview of what I&rsquo;ve learned so far, which may come in handy
for someone else wanting to understand the very basics of Jetty (version 9).
This is not a detailed overview because I don&rsquo;t know what I&rsquo;m talking about.
I&rsquo;m just explaining the things I have learned from the tutorial linked above
and some mucking around.</p>

<!-- more -->


<h3>Jetty has a few large architectural components</h3>

<p>The main things you need to deal with in Jetty are <code>Server</code>s, <code>Handler</code>s, and
<code>Connector</code>s.</p>

<p>The <code>Server</code> manages your entire web site. It does this by connecting
<code>Connector</code>s to <code>Handler</code>s. The default <code>Connector</code>, <code>ServerConnector</code> responds
to vanilla HTTP/1.1 requests. If you try to make a request via <code>https</code> to a
<code>ServerConnector</code> without <code>https</code> support, in Chrome you will see the following
error message</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
</pre></td><td class='code'><pre><code class='text'><span class='line'>ERR_SSL_PROTOCOL_ERROR
</span><span class='line'>
</span><span class='line'>Unable to make a secure connection to the server. This may be a problem with
</span><span class='line'>the server, or it may be requiring a client authentication certificate that you
</span><span class='line'>don&#39;t have.
</span></code></pre></td></tr></table></div></figure>


<h4>Handler</h4>

<p>You may want a chain of <code>Handler</code>s that will be tried one after another until
one is found to be appropriate for handling the current request</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
</pre></td><td class='code'><pre><code class='java'><span class='line'><span class="n">HandlerList</span> <span class="n">handlers</span> <span class="o">=</span> <span class="k">new</span> <span class="n">HandlerList</span><span class="o">();</span>
</span><span class='line'><span class="n">handlers</span><span class="o">.</span><span class="na">setHandlers</span><span class="o">(</span><span class="k">new</span> <span class="n">Handler</span><span class="o">[]</span> <span class="o">{</span> <span class="n">resource_handler</span><span class="o">,</span> <span class="k">new</span> <span class="n">DefaultHandler</span><span class="o">()</span> <span class="o">});</span>
</span></code></pre></td></tr></table></div></figure>


<p>The <code>DefaultHandler</code> will simply return <code>404</code>s for everything. Jetty will keep trying <code>Handler</code>s in the list until one of them performs</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
</pre></td><td class='code'><pre><code class='java'><span class='line'><span class="n">baseRequest</span><span class="o">.</span><span class="na">setHandled</span><span class="o">(</span><span class="kc">true</span><span class="o">);</span>
</span></code></pre></td></tr></table></div></figure>


<p>Looking at the <code>DefaultHandler</code>&rsquo;s source, we can see that indeed, in the
<code>handle</code> method, this is the <em>first</em> thing it does.</p>

<h4>Connector</h4>

<p>Each <code>Connector</code> instance can respond to a particular <em>protocol</em> at a
particular <em>host</em> on a particular <em>port</em>. So if you want to respond to <code>https</code>
requests as well as vanilla <code>http</code>, you&rsquo;re going to need another <code>Connector</code>.</p>

<h4>Routing</h4>

<p>A <code>ContextHandler</code> is used to do what in Rails would be called &ldquo;routing&rdquo;. This
is where you associate path&rsquo;s in the HTTP request header to the right
controller. After you&rsquo;ve created your <code>ContextHandler</code>s and set their
<code>contextPath</code>s, you need to do the following</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
</pre></td><td class='code'><pre><code class='java'><span class='line'><span class="n">Server</span> <span class="n">server</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Server</span><span class="o">(</span><span class="mi">8080</span><span class="o">);</span>
</span><span class='line'><span class="c1">//...</span>
</span><span class='line'><span class="n">ContextHandlerCollection</span> <span class="n">contexts</span> <span class="o">=</span> <span class="k">new</span> <span class="n">ContextHandlerCollection</span><span class="o">();</span>
</span><span class='line'><span class="n">contexts</span><span class="o">.</span><span class="na">setHandlers</span><span class="o">(</span><span class="k">new</span> <span class="n">Handler</span><span class="o">[]</span> <span class="o">{</span> <span class="n">context0</span><span class="o">,</span> <span class="n">context1</span><span class="o">,</span> <span class="o">...</span> <span class="o">});</span>
</span><span class='line'><span class="n">server</span><span class="o">.</span><span class="na">setHandler</span><span class="o">(</span><span class="n">contexts</span><span class="o">);</span>
</span></code></pre></td></tr></table></div></figure>


<p>This means the server is going to ask the <code>ContextHandlerCollection</code> for the
first <code>ContextHandler</code> that matches the path specified on the incoming request.</p>

<h3>Conclusion: Jetty feels dated compared to Ruby on Rails</h3>

<p>So far, I find Jetty surprisingly nice. I&rsquo;m
starting to appreciate the fact that it is written in Java and it is open
source. It&rsquo;s so different from Ruby on Rails and Node.js because there is not a
lot of magical mystery code running in the background whose source is difficult
to find. The advantage of those other frameworks is that they have much higher
level APIs. I find myself wrapping Jetty methods into my own methods that do
what in Rails would have been done automatically. Now I understand the draw of
Rails&rsquo;s &ldquo;convention over configuration&rdquo; and &ldquo;don&rsquo;t repeat yourself&rdquo;. I didn&rsquo;t
realize how much configuration and repeating yourself was necessary back in the
Java days. And that leads me to my next point, that writing Jetty feels like
writing 10 year old code. I wasn&rsquo;t coding 10 years ago, but I imagine you had
to explicitly wire every last thing together by hand over and over again, as
you do in Jetty.</p>

<p>Unlike Rails or iOS, basic Jetty is not a Model View Controller framework. It&rsquo;s
just handlers on a server, more like Node.js. Personally, I like MVC out of
ignorance because I assume people who came up with it thought the &ldquo;application
framework&rdquo; way of doing thigs was &lsquo;bad&rsquo;, and I tend to trust people who don&rsquo;t
like the &lsquo;old&rsquo; way of doing things. That said, I am not making a database based
application here so I will not evaluate this difference any further.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Intro to HTTP/2]]></title>
    <link href="http://ethanp.github.io/blog/2015/10/26/intro-to-http-slash-2/"/>
    <updated>2015-10-26T14:11:59-07:00</updated>
    <id>http://ethanp.github.io/blog/2015/10/26/intro-to-http-slash-2</id>
    <content type="html"><![CDATA[<h2>The problem of HTTP/1.1</h2>

<h3>HTTP/1.1 is from an earlier Web</h3>

<p>For the past several years, Google, Mozilla, Akamai, the IETF, the academic
research community, and others have been engaged in efforts to reduce PLTs
experienced by users of the WWW. One bottleneck in apparent need of an update
is the now-&ldquo;ancient&rdquo; HTTP/1.1 (H1) protocol, published in 1999.</p>

<p>From the beginning, one of the major design goals of the original HTTP protocol
was simplicity to implement and adopt, to encourage growth of the WWW.
Evidently, this technique worked. However, over the past 16 years, the way
people create, distribute, and view pages on the WWW has changed drastically.
Pages today have far more Javascript, CSS, images, and other content to go
along with the vanilla HTML. In the course of this evolution, numerous issues
with H1 have come up, mostly pertaining to the number of sequential round-trips
it requires to fully download the data for each web page. These round-trips
introduce unnecessary latency.</p>

<!-- more -->


<p>Over the past 16 years, countermeasures, &ldquo;optional&rdquo; protocol alterations, and
application level workarounds have been suggested and implemented to reduce the
round-trip count and therefore latency of pages retrieved using H1.</p>

<h3>HTTP/1.1 PLT Optimizations</h3>

<p>Some optimizations servers and browsers use to lower PLTs seen by clients are
the following:</p>

<ol>
<li>Opening multiple TCP connections to request the download of multiple
required web page resources in parallel.</li>
<li>Inlining and concatenating scripts and stylesheets to reduce the number of
requests.</li>
<li><em>HTTP pipelining</em>, in which multiple requests are sent by the browser over a
single TCP connection before waiting for responses to any, then the server
sends all of the responses back in the same order.</li>
</ol>


<p>HTTP pipelining was an idea that did not work out. It seemed like a logical
approach to improve HTTP performance by reducing latency. However, in reality,
HOL blocking became an even greater issue then it already had been, and it also
led to issues with older proxies that are difficult to update. Nowadays, modern
browsers do not enable pipelining.</p>

<h2>HTTP/2 aims to address these issues</h2>

<p>The main reason multiple resources can&rsquo;t be <em>multiplexed</em> (downloaded in
parallel) over a single TCP connection in H1 is because it is an ASCII
protocol. Being an ASCII protocol means that there is no easy way to specify
how to demultiplex those responses with a parser.</p>

<p>To address this, HTTP/2 (H2) is a <em>binary</em> protocol that <em>can</em> easily multiplex
multiple requests and responses through a single TCP connection. It does this
using a new <em>binary framing layer</em>, in which responses are broken into separate
<em>frames</em> and interleaved through the TCP socket.</p>

<p>Web designers have discovered that to yield the best user experience, the
server should transmit certain most-important resources first, as soon as they
are available to send (e.g. loaded from disk). H2 frames (and <em>streams</em>) have
an (optional) <code>priority</code> field to allow the application developer (and server
writer) to bias the ordering of frames sent through the socket. For example,
one may want to ensure that <code>.html</code> files have a higher priority than <code>.jpeg</code>
files because <code>.jpeg</code>s contain a lot of data and are not as crucial for showing
the user a <em>barebones</em> version of the page they requested.</p>

<h3>Brief History</h3>

<p>The current H2 specification (RFC 7540) is based primarily on the SPDY protocol
invented and first used at Google. Google is in a rather unique position to
conduct research on this topic because they write the code running large
proportion of both user&rsquo;s browsers, <em>as well as</em> the servers for a large
portion of the web pages those people visit. This allows them to run real
controlled experimental studies on web traffic, to try to guage the best way to
speed up the Web.</p>

<p>Many have voiced concerns that Google&rsquo;s ideas are welcomed to quickly by the
standards committee. However, supporters are eager to get the ball rolling on
any good ideas, and point to the fact that Google already has <em>results</em>.</p>

<p>Note that some of the experimental results below were actually obtained using
SPDY rather than H2, but we will not concern ourselves with that because they
are the same in essence. Google Chrome still supports SPDY, but has claimed it
will remove support in 2016 so that there are not two competing
standardizations of basically the same protocol.</p>

<p>In reality, in my Chrome browser, using the &ldquo;HTTP/2 and SPDY indicator&rdquo;
extension, I can see that Google&rsquo;s search results and YouTube videos are served
over SPDY/3 running on top of their still-in-research QUIC protocol which is
meant to replace TCP as the transport layer protocol underneath HTTP. The
&ldquo;indicator&rdquo; reveals that many websites are in fact already being served over
H2, including Twitter. I hope to explore what exactly QUIC does, promises, and
delivers further in my research if there is time and space for it.</p>

<h3>Experimental Results</h3>

<p>Experiments investigating relative performance differences between H1 and H2
have yielded contradictory results (Wang et al.). In (Wang et al.), they found
that H2 generally outperforms H1, except when the task is to transmit <em>large</em>
objects over a <em>high-loss</em> connection. One of the goals in defining H2 is that
it is supposed to be easy to swap-in in-place of H1. This finding indicates
that especially with respect to mobile phones, the H1 optimization techniques
of inlining and concatenating web page resources is the wrong way to
improve performance for H2.</p>

<h3>HTTP for Mobile</h3>

<p>With respect to mobile devices, there are multiple problems with HTTP that are
<em>not</em> addressed by H2 (Erman et al.).</p>

<p>First of all, TCP congestion window part of congestion avoidance (<code>cwnd</code> and
<code>ssthresh</code>) makes it so that dropping a single TCP datagram leads to a
drastically reduced throughput. This is because TCP assumes the packet was
dropped due to network congestion, when in reality, it could have been for any
of the wireless-specific issues (multipath propagation, etc.)</p>

<p>Secondly, bad interactions between the 3G &amp; LTE MAC state machine timeouts and
modern default TCP timeouts and mechanisms lead to datagram <em>retransmissions</em>
with further resets on congestion window state variables, resulting in a
drastically negative impact on throughput.</p>

<h3>Research is needed</h3>

<p>From the system administrators&#8217; perspective, it is still difficult to decide
whether enabling H2 on one&rsquo;s servers is going to have a positive impact on
performance <em>at all</em> (Varvello et al.). The following questions still do not
have good enough answers to make answering that decision easy.</p>

<ol>
<li>What changes to the infrastructure and optimization pipeline are needed to
provide a significant benefit over H1?</li>
<li>What are the best algorithms for leveraging H2&rsquo;s new capabilities for
<em>server push</em> and <em>prioritization</em>?</li>
<li>Why do experiments by different research groups yield such a wide disparity
in results?</li>
<li>Are we unlikely to see any real PLT improvements until we improve the
underlying transport layer protocol?</li>
<li>In what ways is the problem different for mobile?

<ul>
<li>What changes need to be made in the mobile-specific scenario, viz. where
interference, low bandwidth, high latency, and battery life
considerations are major concerns?</li>
</ul>
</li>
</ol>


<h2>Glossary</h2>

<ul>
<li><strong>IETF</strong> (Internet Engineering Task Force) &mdash; a standards organization for
the Internet, which produces &ldquo;RFC&#8221;s (requests for comment) specifying some of
the crucial internet protocols, such as HTTP, TCP, and TLS.</li>
<li><strong>PLT</strong> (Page Load Time) &mdash; the duration between the time at which a user
submits a request for a web page, and the time at which she receives the last
byte of data needed to represent that web page correctly</li>
<li><strong>WWW</strong> (World Wide Web) &mdash; a decentralized collection of resources,
requested and transmitted using <em>HTTP</em></li>
<li><strong>HTTP</strong> (HyperText Transfer Protocol) &mdash; a protocol specifying the way an
HTTP <em>client</em> may upload, download, update, etc. documents on an HTTP
<em>server</em>. Often used for downloading <em>HTML</em> in particular.</li>
<li><strong>TCP</strong> (Transmission Control Protocol) &mdash; a protocol which provides an
application the abstraction of having a reliable pipe across the Internet
into/from which it may send/receive ordered sequences of bytes</li>
<li><strong>HOL blocking</strong> (head-of-line blocking) &mdash; when multiple messages are being
sent, but due to imposed FIFO ordering, messages ready to be sent must wait
for an earlier message which is taking a long time</li>
</ul>


<h2>References</h2>

<ul>
<li>Erman, Jeffrey, et al. &ldquo;Towards a SPDY&#8217;ier mobile web?.&rdquo; <em>Proceedings of the
ninth ACM conference on Emerging networking experiments and technologies.
ACM</em>, 2013.</li>
<li>Grigorik, Ilya. High Performance Browser Networking: What every web developer
should know about networking and web performance. <em>O&#8217;Reilly Media, Inc.</em>, 2013.</li>
<li>Roth, Gregor. &ldquo;HTTP/2 for Java Developers&rdquo;. Retrieved from
<a href="http://www.javaworld.com/article/2916548/java-web-development/http-2-for-">http://www.javaworld.com/article/2916548/java-web-development/http-2-for-</a>
java-developers.html on October 8, 2015.</li>
<li>Stenberg, Daniel aka. bagder. <em>http2 explained</em>, retrieved October 8, 2015;
downloaded as MOBI.</li>
<li>Varvello, Matteo, et al. &ldquo;To HTTP/2, or Not To HTTP/2, That Is The Question&rdquo;
arXiv preprint arXiv:1507.06562 (2015).</li>
<li>Wang, Xiao Sophia, et al. &ldquo;How speedy is SPDY.&rdquo; <em>Proc. of the 11th USENIX
Symposium on Networked Systems Design and Implementation (NSDI)</em>. 2014.</li>
</ul>

]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Why the WiFi Sucks on the Porch]]></title>
    <link href="http://ethanp.github.io/blog/2015/09/24/why-the-wifi-sucks-on-the-porch/"/>
    <updated>2015-09-24T14:31:30-07:00</updated>
    <id>http://ethanp.github.io/blog/2015/09/24/why-the-wifi-sucks-on-the-porch</id>
    <content type="html"><![CDATA[<h2>Why is WiFi so slow on the porch?</h2>

<p>The problem at my house is that WiFi connectivity is generally fine inside the
house, but while sitting on the porch it is nearly impossible for anyone to
browse the Hinternets. I think a lesson of my <a href="http://www.cs.utexas.edu/~lili/classes/F15-CS386W/">Wireless Networking</a> course
explains this observation.</p>

<p>In Wireless networking we have the situation called the <a href="http://www.wikiwand.com/en/Hidden_node_problem"><strong>Hidden Terminal
Problem</strong></a>, which is the following. Node <code>A</code> is too far from node <code>C</code> to
hear its transmissions, but both are in range of node <code>B</code> which sits between
<code>A</code> and <code>C</code>. Suppose <code>C</code> is currently transmitting data to <code>B</code>. Now <code>A</code> checks
whether any transmissions are currently happening, and finds that there are
none (because it is out of range of <code>C</code>). So <code>A</code> goes ahead and sends data to
<code>B</code>. Now <code>B</code> can&rsquo;t understand either data packet because they collided and
interfered with each other in an unrecoverable way because they were both sent
in the same channel.</p>

<p>I think the porch scenario is simply an example of the <em>Hidden Terminal
Problem</em> above. In my room, my laptop can hear many of my neighbors&#8217; WiFi LAN
networks, but it&rsquo;s the same set that my WiFi router sitting in the closet can
hear. However outside, where there&rsquo;s less cause for signal attenuation, my
laptop can hear many more WiFi LAN networks than the router inside. So the
router checks whether there is congestion, doesn&rsquo;t hear any, and sends the
data. But there <em>is</em>, in fact, congestion, and my computer doesn&rsquo;t receive the
data properly. The data is therefore not <em>ACKd</em>, the router times out on the
<em>ACK</em>, and has to retransmit, and so on. This makes for a <em>far</em> slower
Hinterconnectivity outside on the porch than in my room.</p>

<p>Maybe the lesson learnt is that the WiFi router should be situated in a place
where it can hear more of the outside noise so that it can compensate better
for that noise.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Bit Twiddling: Data Structure Alignment]]></title>
    <link href="http://ethanp.github.io/blog/2015/09/23/bit-twiddling-data-structure-alignment/"/>
    <updated>2015-09-23T11:44:04-07:00</updated>
    <id>http://ethanp.github.io/blog/2015/09/23/bit-twiddling-data-structure-alignment</id>
    <content type="html"><![CDATA[<p>In order to align a data structure at the nearest word-alignment to a given
starting address, we&rsquo;d must find the smallest multiple of N (a power of 2)
which is greater than or equal to integer X. Is there a way we can <em>make this
fast</em>, i.e. not use division or modulo?</p>

<h2>First Thoughts</h2>

<h3>First of all</h3>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
</pre></td><td class='code'><pre><code class='c'><span class='line'><span class="k">if</span> <span class="p">(</span><span class="n">N</span> <span class="o">&gt;</span> <span class="n">X</span><span class="p">)</span> <span class="k">return</span> <span class="n">N</span><span class="p">;</span>
</span></code></pre></td></tr></table></div></figure>


<h3>Otherwise, we have two cases</h3>

<h4>First Case</h4>

<pre><code>N: 00100
X: 01100
--------
=&gt; 01100
</code></pre>

<p>This example indicates the following:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
</pre></td><td class='code'><pre><code class='c'><span class='line'><span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="p">(</span><span class="n">X</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">N</span><span class="o">-</span><span class="mi">1</span><span class="p">)))</span> <span class="k">return</span> <span class="n">X</span><span class="p">;</span>
</span></code></pre></td></tr></table></div></figure>


<h4>Second Case</h4>

<pre><code>N: 00100       
X: 01101       
--------       
=&gt; 10000       
</code></pre>

<p>Which can be accomplished by</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
</pre></td><td class='code'><pre><code class='c'><span class='line'><span class="k">return</span> <span class="p">(</span><span class="n">X</span> <span class="o">|</span> <span class="p">(</span><span class="n">N</span><span class="o">-</span><span class="mi">1</span><span class="p">))</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
</span></code></pre></td></tr></table></div></figure>


<p>So we have the first-take program of</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
</pre></td><td class='code'><pre><code class='c'><span class='line'><span class="k">if</span> <span class="p">(</span><span class="n">N</span> <span class="o">&gt;</span> <span class="n">X</span><span class="p">)</span> <span class="k">return</span> <span class="n">N</span><span class="p">;</span>
</span><span class='line'><span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="p">(</span><span class="n">X</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">N</span><span class="o">-</span><span class="mi">1</span><span class="p">)))</span> <span class="k">return</span> <span class="n">X</span><span class="p">;</span>
</span><span class='line'><span class="k">return</span> <span class="p">(</span><span class="n">X</span> <span class="o">|</span> <span class="p">(</span><span class="n">N</span><span class="o">-</span><span class="mi">1</span><span class="p">))</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
</span></code></pre></td></tr></table></div></figure>




<!-- more -->


<h2>Engineering Wisdom</h2>

<p>The above solution is in fact correct <em>(takes a quick bow)</em>.</p>

<p>When I looked up a better answer on <a href="http://stackoverflow.com/questions/19450743">StackOverflow</a>, I found a
program that does not use conditionals, and is therefore much faster</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
</pre></td><td class='code'><pre><code class='c'><span class='line'><span class="k">return</span> <span class="p">(</span><span class="n">X</span> <span class="o">+</span> <span class="n">N</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">&amp;</span> <span class="o">~</span><span class="p">(</span><span class="n">N</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
</span></code></pre></td></tr></table></div></figure>


<p>However, it was not obvious to me that this program works! Let&rsquo;s peer inside
and see how similar it is to my solution.</p>

<p>First we&rsquo;re going to compute something, and then we&rsquo;re going to &ldquo;and&rdquo; it with
<code>~(N-1)</code>. Well, &ldquo;and&#8221;ing <em>anything</em> against <code>~(N-1)</code> is going to round it down
to a multiple of N. So that part is taken care of.</p>

<p>Now we need to get the <em>right</em> multiple of N. <code>if X &lt; N</code>, then then adding X to
N and rounding down to a multiple N is just going to give us N. That&rsquo;s the
first line of my answer, taken care of.</p>

<p>So <code>if X &gt; N</code>, we either do or don&rsquo;t have bits in place values below N. This is
what my last two lines are taking care of. However, we can deal with this by
noting that after we add X to N, if there are bits in place values below N,
when we subtract 1, it will not change any bits that will survive the &ldquo;and&rdquo;
stage. But if there are no bits in place values below N, then subtracting one
will decrement our answer wrt bits N and higher.</p>

<p>Hmm, not the best explanation of all time. But that&rsquo;s what&rsquo;s going on in there.</p>
]]></content>
  </entry>
  
</feed>
