Jekyll2021-09-20T21:47:03-07:00/feed.xmlProgrammer’s MusingsHello! I am a soon to be junior at UC Berkeley studying Computer Science! My goal is to be a software engineer, and catalog the process of me getting there.
What am I Working On?2021-09-20T00:00:00-07:002021-09-20T00:00:00-07:00/2021/09/20/What-am-I-Working-on<h2 id="preface">Preface</h2>
<p>I have been in a bit of a rut. Nothing bad, just have more free time now since I am between school and work, and time has led to both introspection and tackling a few different projects. I wanted to write this to see and show what my current projects are and how long I have been tackling them!</p>
<h2 id="daily">Daily</h2>
<p>What have I been doing each day?</p>
<h4 id="learning-chinese">Learning Chinese</h4>
<p>I have been learning for around 2 years now, and have probably been doing it daily for the past month. Learning Chinese breaks down into a few different projects.</p>
<h4 id="remembering-the-hanzi"><a href="https://en.wikipedia.org/wiki/Remembering_the_Kanji_and_Remembering_the_Hanzi#Remembering_the_Hanzi">Remembering the Hanzi</a></h4>
<p>Two books that systematically teach ~3000 Chinese characters. For each character the book teaches you how to handwrite it, as well as an English keyword that is similar to one of the meanings of the character. I have augmented the system to also teach me the pronunciation of the character. I have been doing this daily for the last month and am ~500 characters in. I try to do ~30 characters a day.</p>
<h4 id="reading-about-learning-chinese">Reading about learning Chinese</h4>
<p>I think this is more procrastination than studying, but I spend too much time on https://www.reddit.com/r/ChineseLanguage/ and https://www.chinese-forums.com/</p>
<h4 id="go-on-the-internet-on-my-phone">Go on the internet on my phone</h4>
<p>I spend a lot of time on Hacker News, Reddit, and YouTube. Internet obsession for me started at a young age, with my iPod touch and my grandfather’s old PC.</p>
<h2 id="weekly">Weekly</h2>
<p>Things I do at least once a week</p>
<h4 id="running">Running!</h4>
<p>I got into it again last month after a hiatus, and have been doing it a couple of times a week. I might run the Golden Gate Half Marathon in November. I also bike fairly often.</p>
<h4 id="time-with-friends">Time with Friends</h4>
<p>I really enjoy chatting with and seeing friends, as well as making new ones, so I try to make plans fairly often.</p>
<h4 id="reading-books-in-chinese">Reading books in Chinese</h4>
<p>I have read quite a few. Currently I am reading an abridged version of 西游记, <a href="https://imagin8press.com/books/the-rise-of-the-monkey-king-2/">Journey to the West</a> as well as The Witches by Roald Dahl translated into Chinese. I think I read at least once a week for an hour, upwards of 5 hours a week.</p>
<h4 id="reading-books">Reading books</h4>
<p>I listen to a lot of audio books, and try to read books on my kindle/phone at other times. Hope I can make this a daily habit. I peaked in reading around middle school, dipped a lot in high school, and have been doing it more often in and after college.</p>
<h4 id="video-games-in-chinese">Video games in Chinese</h4>
<p>Currently I am having a lot of fun playing <a href="https://eastwardgame.com/">Eastward</a> on the Nintendo Switch. I have also recently played Pokemon Snap, Pokemon Unite, and Oneshot in Chinese, which have all been quite fun.</p>
<h4 id="video-games">Video games</h4>
<p>Dark Souls, Stardew Valley, Brawl Stars, and Slither.io have kept me busy in the past few months. I have loved video games ever since I played my brother’s Pokemon game in elementary school.</p>
<h2 id="monthly">Monthly</h2>
<h4 id="lifting-weightgym">Lifting weight/gym</h4>
<p>I am not very consistent with this, but hope to get better so I can get stronger!</p>
<h4 id="programming-for-fun">Programming for fun</h4>
<p>I like learning and programming. I have started back up with this because this course <a href="http://brendanfong.com/programmingcats.html">Programming with Categories</a> just seemed too fun. I am very good at starting, but not finishing lots of things I put my effort in. I am not too harsh on myself.</p>
<h4 id="watching-a-show-in-chinese">Watching a show in Chinese.</h4>
<p>I have paired <a href="https://languagelearningwithnetflix.com/">language learning with netflix</a> with the anime Hunter x Hunter, which has both subtitles and audio in Mandarin Chinese. I am one season in and there are six seasons. I have been watching it on and off for half a year. I watch in spurts, and it has been a while since I last did.</p>
<h2 id="yearly">Yearly</h2>
<h4 id="blogging">BLOGGING</h4>
<p>I wrote my last post nearly a year ago! Oh well. I have lots of ideas on what to write about: book reviews, my time at UC Berkeley, Python explanations, my experiences with learning Chinese, and more. We shall see what I get to, and maybe future me will go back and add links from this article to those posts as I write them?</p>
<h4 id="traveling">Traveling</h4>
<p>I am lucky I get to do this a few times a year.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Listing things I am doing really shows me how much I am doing. Even then I feel like I am “wasting” more than half my day when I browse the internet. Though I do recognize it is good to recharge to avoid burnout.</p>
<p>Chinese is my current biggest hobby. Before that in college it was teaching, in high school it was programming, in middle school it was chess. Reading, exercise, and learning are things I do fairly consistently and have been doing them for many years, which I am proud of. I can, however, do better.</p>PrefaceCounting with less Bits2020-12-11T00:00:00-08:002020-12-11T00:00:00-08:00/2020/12/11/Counting-with-less-Bits<h2 id="introduction">Introduction</h2>
<p>Currently I am taking a class called <a href="https://www.sketchingbigdata.org/fall20">Sketching Algorithms</a>. This course covers how many problems can be solved probabilistically in much less space/faster. You improve space/compute time, but instead of getting back exact answers, you get a solution that is provably approximately close to the actual solution most of the time.</p>
<h3 id="counting">Counting</h3>
<p>There is nothing special about a simple counter. Just a variable that you increment.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Counter</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">X</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">def</span> <span class="nf">update</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">X</span> <span class="o">+=</span> <span class="mi">1</span>
    <span class="k">def</span> <span class="nf">query</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">X</span>
</code></pre></div></div>
<p><em>A Python class for a simple counter</em></p>
<p>What gets more interesting is if you have a probabilistic counter. Here instead of always incrementing X, we increment X with probability of 1/2 to the power of X. Wow! And instead of returning X, we return 2 to the power of X minus 1.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">random</span>

<span class="k">class</span> <span class="nc">MorrisCounter</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">bits</span><span class="o">=</span><span class="mi">8</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">X</span> <span class="o">=</span> <span class="mi">0</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">bits</span> <span class="o">=</span> <span class="n">bits</span>
    <span class="k">def</span> <span class="nf">update</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">X</span> <span class="o">&lt;</span> <span class="p">(</span><span class="mi">2</span> <span class="o">**</span> <span class="bp">self</span><span class="p">.</span><span class="n">bits</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span> <span class="ow">and</span> \
           <span class="n">random</span><span class="p">.</span><span class="n">random</span><span class="p">()</span> <span class="o">&lt;</span> <span class="mi">1</span><span class="o">/</span><span class="p">(</span><span class="mi">2</span> <span class="o">**</span> <span class="bp">self</span><span class="p">.</span><span class="n">X</span><span class="p">):</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">X</span> <span class="o">+=</span> <span class="mi">1</span>
    <span class="k">def</span> <span class="nf">query</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="p">(</span><span class="mi">2</span> <span class="o">**</span> <span class="bp">self</span><span class="p">.</span><span class="n">X</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span>
</code></pre></div></div>
<p><em>A Python class for a simple probabilistic counter</em></p>
<p>For the Python code I also added a variable to simulate the counter being a certain number of bits.</p>
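<p>To see why the query returns 2 to the power of X minus 1, here is a short worked derivation. It is a sketch that assumes an idealized counter with no cap on the number of bits. The notation X_n, introduced here only for illustration, is the value of X after n calls to update.</p>

```latex
\begin{align*}
\mathbb{E}\left[2^{X_{n+1}} \mid X_n\right]
  &= 2^{-X_n}\cdot 2^{X_n+1} + \left(1 - 2^{-X_n}\right)\cdot 2^{X_n} \\
  &= 2 + 2^{X_n} - 1 \\
  &= 2^{X_n} + 1 .
\end{align*}
% Since X_0 = 0, induction on n gives
% E[2^{X_n}] = n + 1, and therefore E[2^{X_n} - 1] = n.
```

<p>In other words, each update raises the expected value of 2 to the power of X by exactly 1, so the estimate is correct on average even though any single run can be far off.</p>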
<p>The counter above is known as the <a href="https://core.ac.uk/download/pdf/208681313.pdf">Morris Counter</a>. This algorithm is from 1985, and to me it is exciting both how young and how old the algorithm is. First off it was invented within the last 50 years, but also it was invented long before I was born. What else is exciting is that this algorithm, with a minor tweak, recently had a <a href="https://arxiv.org/abs/2010.02116">lower, optimal bound</a> proven in October of 2020. Never have I been so close to the frontier of computer science research!</p>
<p>The Morris Counter linked above can only count in powers of 2 minus 1, which is great in terms of conserving bits, but there is a tradeoff: with a few more bits you can reduce the variance and get a better probabilistic counter. Instead of incrementing with probability 1/2 to the power of X, you increment with probability 1/(1 + alpha) to the power of X, where alpha is a small constant, and setting alpha equal to 1 gets you back the original Morris Counter. To return the approximate number of increments we return (1/alpha) * ((1 + alpha) ^ X - 1). Another important improvement is to use an exact counter rather than an approximate one for small values, and switch over to the approximate counter after a certain threshold.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">MorrisAlpha</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">a</span><span class="o">=</span><span class="p">.</span><span class="mi">05</span><span class="p">,</span> <span class="n">bits</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span> <span class="n">default</span><span class="o">=</span><span class="mi">5</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">X</span> <span class="o">=</span> <span class="mi">0</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">a</span> <span class="o">=</span> <span class="n">a</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">bits</span> <span class="o">=</span> <span class="n">bits</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">default</span> <span class="o">=</span> <span class="n">default</span>
    <span class="k">def</span> <span class="nf">update</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">X</span> <span class="o">&lt;</span> <span class="bp">self</span><span class="p">.</span><span class="n">default</span><span class="p">:</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">X</span> <span class="o">+=</span> <span class="mi">1</span>
        <span class="k">elif</span> <span class="bp">self</span><span class="p">.</span><span class="n">X</span> <span class="o">&lt;</span> <span class="p">(</span><span class="mi">2</span> <span class="o">**</span> <span class="bp">self</span><span class="p">.</span><span class="n">bits</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span> <span class="ow">and</span> \
             <span class="n">random</span><span class="p">.</span><span class="n">random</span><span class="p">()</span> <span class="o">&lt;</span> <span class="mi">1</span><span class="o">/</span><span class="p">((</span><span class="mi">1</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">a</span><span class="p">)</span><span class="o">**</span><span class="bp">self</span><span class="p">.</span><span class="n">X</span><span class="p">):</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">X</span> <span class="o">+=</span> <span class="mi">1</span>
    <span class="k">def</span> <span class="nf">query</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">X</span> <span class="o">&lt;=</span> <span class="bp">self</span><span class="p">.</span><span class="n">default</span><span class="p">:</span>
            <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">X</span>
        <span class="k">return</span> <span class="mi">1</span><span class="o">/</span><span class="bp">self</span><span class="p">.</span><span class="n">a</span> <span class="o">*</span> <span class="p">(((</span><span class="mi">1</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">a</span><span class="p">)</span><span class="o">**</span><span class="bp">self</span><span class="p">.</span><span class="n">X</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>
<p><em>A Python class for the improved Morris Counter</em></p>
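<p>The variance claim can be checked without any randomness by tracking the exact probability of each value of X. The sketch below is an illustration under simplifying assumptions: it models the textbook counter with no bit cap and no exact phase for small values, and the function name <code class="language-plaintext highlighter-rouge">estimator_moments</code> is made up for this example. It compares the exact variance with the known closed form alpha * n * (n - 1) / 2.</p>

```python
def estimator_moments(n, a):
    """Exact mean and variance of ((1 + a)**X - 1) / a after n updates.

    Assumes an idealized counter: no bit cap and no exact phase for small X.
    """
    dist = {0: 1.0}  # dist[x] = probability that the counter holds x
    for _ in range(n):
        nxt = {}
        for x, p in dist.items():
            up = (1 + a) ** -x  # probability of incrementing from state x
            nxt[x + 1] = nxt.get(x + 1, 0.0) + p * up
            nxt[x] = nxt.get(x, 0.0) + p * (1 - up)
        dist = nxt
    est = {x: ((1 + a) ** x - 1) / a for x in dist}
    mean = sum(p * est[x] for x, p in dist.items())
    var = sum(p * (est[x] - mean) ** 2 for x, p in dist.items())
    return mean, var


n = 200
for a in (1.0, 0.05):
    mean, var = estimator_moments(n, a)
    closed_form = a * n * (n - 1) / 2
    print(f"alpha={a}: mean={mean:.3f} variance={var:.3f} closed form={closed_form:.3f}")
```

<p>The mean stays at n for both settings, while the variance shrinks in proportion to alpha. The price is that X grows faster for small alpha, which is where the extra bits go.</p>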
<p>Finally, for my final project I did a little investigation into the variant of the Morris Counter used by <a href="https://github.com/redis">Redis</a>. The investigation was inspired by <a href="https://github.com/redis/redis/issues/7943">Professor Jelani Nelson’s GitHub issue</a>, and the TL;DR is that Redis has a Morris counter, but increments X with probability 1/(1 + alpha * X), with the corresponding approximation for the number of increments being (alpha/2) * X ^ 2.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">RedisCounter</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">a</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">default</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">bits</span><span class="o">=</span><span class="mi">8</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">a</span> <span class="o">=</span> <span class="n">a</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">default</span> <span class="o">=</span> <span class="n">default</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">X</span> <span class="o">=</span> <span class="mi">0</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">bits</span> <span class="o">=</span> <span class="n">bits</span>
    <span class="k">def</span> <span class="nf">update</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">X</span> <span class="o">!=</span> <span class="p">(</span><span class="mi">2</span> <span class="o">**</span> <span class="bp">self</span><span class="p">.</span><span class="n">bits</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">:</span>
            <span class="n">r</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">random</span><span class="p">()</span>
            <span class="n">baseval</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">X</span> <span class="o">-</span> <span class="bp">self</span><span class="p">.</span><span class="n">default</span>
            <span class="k">if</span> <span class="n">baseval</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">:</span>
                <span class="n">baseval</span> <span class="o">=</span> <span class="mi">0</span>
            <span class="n">p</span> <span class="o">=</span> <span class="mi">1</span><span class="o">/</span><span class="p">(</span><span class="n">baseval</span> <span class="o">*</span> <span class="bp">self</span><span class="p">.</span><span class="n">a</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span>
            <span class="k">if</span> <span class="n">r</span> <span class="o">&lt;</span> <span class="n">p</span><span class="p">:</span>
                <span class="bp">self</span><span class="p">.</span><span class="n">X</span> <span class="o">+=</span> <span class="mi">1</span>
    <span class="k">def</span> <span class="nf">query</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">X</span> <span class="o">&lt;=</span> <span class="bp">self</span><span class="p">.</span><span class="n">default</span><span class="p">:</span>
            <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">X</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">a</span> <span class="o">/</span> <span class="mi">2</span> <span class="o">*</span> <span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">X</span> <span class="o">-</span> <span class="bp">self</span><span class="p">.</span><span class="n">default</span><span class="p">)</span> <span class="o">**</span> <span class="mi">2</span>
</code></pre></div></div>
<p><em>A Python class for the Redis Morris Counter</em></p>
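<p>A rough way to see where the (alpha/2) * X ^ 2 estimate comes from: when the increment probability is 1/(1 + alpha * x), the expected number of updates needed to leave state x is 1 + alpha * x, and summing those waits up to X gives roughly (alpha/2) * X ^ 2. The sketch below checks that arithmetic. It is an illustration only, with a made-up helper name, and it ignores the <code class="language-plaintext highlighter-rouge">default</code> offset and the bit cap used in the class above.</p>

```python
def expected_updates_to_reach(x_target, a=10):
    """Expected number of update() calls for X to climb from 0 to x_target.

    Leaving state x takes a geometric number of tries with success
    probability 1 / (1 + a * x), whose mean is 1 + a * x.
    """
    return sum(1 + a * x for x in range(x_target))


for x_target in (10, 100, 250):
    exact = expected_updates_to_reach(x_target)
    approx = 10 / 2 * x_target ** 2
    print(x_target, exact, approx, round(exact / approx, 3))
```

<p>The ratio between the two columns approaches 1 as X grows, which is why the quadratic formula is a reasonable estimate for large counts.</p>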
<p>Is this counter better? Worse? Find out in <a href="Testing-Random-Algorithms.html">part 2</a>.</p>IntroductionTesting Random Algorithms2020-12-11T00:00:00-08:002020-12-11T00:00:00-08:00/2020/12/11/Testing-Random-Algorithms<h2 id="introduction">Introduction</h2>
<p>This is a followup to <a href="Counting-with-less-Bits.html">part 1</a>. A quick recap is we had two slightly different approximate counters, and now we will figure out which is better and learn about how to test functions that have randomness.</p>
<h3 id="tests">Tests</h3>
<p>I was interested in exploring how programmers write tests for code that is inherently random. One runs into a few problems immediately. What should your expected output be to test a function that approximates? What if the function fails? What should I be testing?</p>
<p>I found two satisfying answers. One is to decouple the randomness from the code you are testing. One way to do this is with <a href="https://softwareengineering.stackexchange.com/questions/356456/testing-a-function-that-uses-random-number-generator">Dependency Injection</a>, or passing in the source of randomness, and then mocking it out during the test with any deterministic sequence you want. One way to utilize this for testing the Morris Counter is to mock the randomness and have the calls to randomness return values from 0 to 1, each time increasing by increments of 1/n, where n is the number of times you are incrementing the counter. Such a test would be complicated enough to check that your Counter is correct, while also being deterministic, so it would have no chance of failing with a correct implementation. Another way to achieve similar results is to seed your random generator at the beginning of each test with the same seed. That way each algorithm gets the same stream of pseudorandom numbers, and results are consistent.</p>
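<p>A minimal sketch of the dependency injection idea, for illustration only: the class name <code class="language-plaintext highlighter-rouge">InjectableMorrisCounter</code>, the <code class="language-plaintext highlighter-rouge">rand</code> parameter, and the <code class="language-plaintext highlighter-rouge">evenly_spaced</code> helper are hypothetical and are not part of the linked repository. The counter takes its source of randomness as an argument, and the test passes in a fake that returns 0, 1/n, 2/n, and so on.</p>

```python
import random


class InjectableMorrisCounter:
    """Basic Morris counter whose randomness source is passed in (hypothetical)."""

    def __init__(self, rand=random.random):
        self.X = 0
        self.rand = rand  # any zero-argument callable returning a float in [0, 1)

    def update(self):
        if self.rand() < 1 / (2 ** self.X):
            self.X += 1

    def query(self):
        return (2 ** self.X) - 1


def evenly_spaced(n):
    """Return a fake random() that yields 0, 1/n, 2/n, ... on successive calls."""
    values = iter(i / n for i in range(n))
    return lambda: next(values)


n = 1000
counter = InjectableMorrisCounter(rand=evenly_spaced(n))
for _ in range(n):
    counter.update()
print(counter.query())  # same value on every run, since nothing here is random
```

<p>The point of this particular fake sequence is repeatability rather than accuracy: the result is a fixed number that a test can assert on, and production code can keep using the default <code class="language-plaintext highlighter-rouge">random.random</code>.</p>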
<p>Below is an example of a test I wrote with the random seed being set to decouple randomness from code that is being tested. The code is all available <a href="https://github.com/alexkassil/sketching-testing">here</a>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="n">counterClass</span><span class="p">,</span> <span class="n">kwargs</span><span class="p">,</span> <span class="n">times</span><span class="p">,</span> <span class="n">N</span><span class="p">):</span>
    <span class="n">res</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">times</span><span class="p">):</span>
        <span class="n">counter</span> <span class="o">=</span> <span class="n">counterClass</span><span class="p">(</span><span class="o">**</span><span class="n">kwargs</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">N</span><span class="p">):</span>
            <span class="n">counter</span><span class="p">.</span><span class="n">update</span><span class="p">()</span>
        <span class="n">res</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">counter</span><span class="p">.</span><span class="n">query</span><span class="p">())</span>
    <span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">res</span><span class="p">)</span>
<span class="p">...</span>
<span class="k">def</span> <span class="nf">test_RedisCounter_standard_deviation</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
    <span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span>
    <span class="n">vals</span> <span class="o">=</span> <span class="n">run</span><span class="p">(</span><span class="n">RedisCounter</span><span class="p">,</span> <span class="p">{},</span> <span class="bp">self</span><span class="p">.</span><span class="n">times</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">N</span><span class="p">)</span>
    <span class="n">within_25_percent</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">((</span><span class="bp">self</span><span class="p">.</span><span class="n">N</span> <span class="o">-</span> <span class="bp">self</span><span class="p">.</span><span class="n">N</span><span class="o">/</span><span class="mi">4</span> <span class="o">&lt;=</span> <span class="n">vals</span><span class="p">)</span> <span class="o">&amp;</span> \
        <span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">N</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">N</span><span class="o">/</span><span class="mi">4</span> <span class="o">&gt;=</span> <span class="n">vals</span><span class="p">))</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">For RedisCounter,"</span><span class="p">,</span> <span class="nb">str</span><span class="p">(</span><span class="n">within_25_percent</span> <span class="o">*</span> <span class="mi">100</span><span class="p">)</span><span class="o">+</span>\
          <span class="s">"% of runs are within 25% on either side of"</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">N</span><span class="p">,</span> \
          <span class="s">"after"</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">times</span><span class="p">,</span> <span class="s">"runs calling update"</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">N</span><span class="p">,</span><span class="s">"times"</span><span class="p">)</span>
    <span class="bp">self</span><span class="p">.</span><span class="n">assertTrue</span><span class="p">(</span><span class="n">within_25_percent</span> <span class="o">&gt;</span> <span class="p">.</span><span class="mi">75</span><span class="p">)</span>
<span class="p">...</span>
</code></pre></div></div>
<p>The second satisfying answer I found to testing probabilistic code is <a href="https://beust.com/weblog2/archives/2006_02_21.html">statistical tests</a>. A simple example of a statistical test for the Morris Counter is to average the results of many different runs of the same counter. You can also utilize the theoretical bounds on failure probabilities and arbitrarily-close guarantees to know exactly how small of a chance this test fails if everything is implemented correctly. The test above is an example of that, so is the test below.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">test_RedisCounter_expectation</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
    <span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span>
    <span class="n">average</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">times</span><span class="p">):</span>
        <span class="n">c</span> <span class="o">=</span> <span class="n">RedisCounter</span><span class="p">()</span>
        <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">N</span><span class="p">):</span>
            <span class="n">c</span><span class="p">.</span><span class="n">update</span><span class="p">()</span>
        <span class="n">average</span> <span class="o">+=</span> <span class="n">c</span><span class="p">.</span><span class="n">query</span><span class="p">()</span>
    <span class="n">average</span> <span class="o">=</span> <span class="n">average</span> <span class="o">/</span> <span class="bp">self</span><span class="p">.</span><span class="n">times</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span> <span class="o">+</span> <span class="nb">type</span><span class="p">(</span><span class="n">c</span><span class="p">).</span><span class="n">__name__</span> <span class="o">+</span> <span class="s">"'s average is"</span><span class="p">,</span> <span class="n">average</span><span class="p">,</span>\
          <span class="s">"after"</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">times</span><span class="p">,</span> <span class="s">"runs calling update"</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">N</span><span class="p">,</span><span class="s">"times"</span><span class="p">)</span>
    <span class="bp">self</span><span class="p">.</span><span class="n">assertTrue</span><span class="p">(</span><span class="n">within</span><span class="p">(</span><span class="n">average</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">N</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">N</span> <span class="o">*</span> <span class="mi">1</span><span class="o">/</span><span class="mi">25</span><span class="p">))</span>
</code></pre></div></div>
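<p>As an illustration of using a theoretical bound to reason about how often such a test can fail, here is a sketch based on Chebyshev's inequality. It assumes the textbook variance alpha * n * (n - 1) / 2 for a single idealized Morris counter (no bit cap, no exact phase for small values), so it only approximates the classes in this post, and the function name is made up for the example.</p>

```python
def chebyshev_failure_bound(n, a, runs, tolerance):
    """Upper bound on P(|average of `runs` estimates - n| >= tolerance).

    Uses Chebyshev's inequality with the idealized single-counter variance
    a * n * (n - 1) / 2; averaging independent runs divides it by `runs`.
    """
    variance_single = a * n * (n - 1) / 2
    variance_of_mean = variance_single / runs
    return min(1.0, variance_of_mean / tolerance ** 2)


n, runs = 10_000, 100
tolerance = n / 25  # same tolerance as the expectation test
for name, a in [("basic Morris (alpha=1)", 1.0), ("alpha Morris (alpha=0.05)", 0.05)]:
    print(name, chebyshev_failure_bound(n, a, runs, tolerance))
```

<p>With these parameters the bound for alpha=1 is vacuous (capped at 1), meaning 100 runs are not enough for Chebyshev to promise anything about the basic counter, while alpha=0.05 gets a nontrivial guarantee. Chebyshev is loose, so the true failure probabilities are smaller than these numbers.</p>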
<p>Finally, below is the output of running the full suite of tests I wrote for the deterministic Counter, the Basic Morris Counter, the Alpha Morris Counter with alpha=.05 and 8 bits, and the Redis Morris Counter with alpha=10 and 8 bits. I ran them with <code class="language-plaintext highlighter-rouge">python3 -m unittest test.py</code> (<a href="https://github.com/alexkassil/sketching-testing/blob/main/test.py">test file</a>):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Counter's average is 10000.0 after \
100 runs calling update 10000 times
.
MorrisAlpha's average is 9838.326270536447 after \
100 runs calling update 10000 times
.
MorrisCounter's average is 9399.32 after \
100 runs calling update 10000 times
F
RedisCounter's average is 10089.4 after \
100 runs calling update 10000 times
.
For Counter, 100.0% of runs are within 25% on either \
side of 10000 after 100 runs calling update 10000 times
.
For MorrisAlpha, 90.0% of runs are within 25% on either \
side of 10000 after 100 runs calling update 10000 times
.
For MorrisCounter, 49.0% of runs are within 25% on either \
side of 10000 after 100 runs calling update 10000 times
F
For RedisCounter, 91.0% of runs are within 25% on either \
side of 10000 after 100 runs calling update 10000 times
.
======================================================================
FAIL: test_MorrisCounter_expectation (test.TestExpectation)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test.py", line 79, in test_MorrisCounter_expectation
self.assertTrue(within(average, self.N, self.N * 1/25))
AssertionError: False is not true
======================================================================
FAIL: test_MorrisCounter_standard_deviation (test.TestStandardDeviation)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test.py", line 37, in test_MorrisCounter_standard_deviation
self.assertTrue(within_25_percent > .75)
AssertionError: False is not true
----------------------------------------------------------------------
Ran 8 tests in 5.141s
FAILED (failures=2)
</code></pre></div></div>
<p>The two failures are from the Basic Morris Counter and highlight why you shouldn’t use it in practice.</p>
<h3 id="visualizations">Visualizations</h3>
<p>I made some charts here to help show what the Morris Counter in all its forms is doing and how the Alpha Morris Counter and the Redis Morris Counter relate.</p>
<h2 id="conclusion">Conclusion</h2>
<p>I was hoping to have a definitive answer to why the Redis Morris algorithm was great, but based on the testing and the visualizations it seems like the two algorithms, for the right parameter alpha, are quite similar. There might be a slight edge towards the Redis implementation, due to its more concentrated nature at large values, as shown in the last chart for 1,000,000 insertions.</p>IntroductionNotes of MOSS and Followups2020-08-23T00:00:00-07:002020-08-23T00:00:00-07:00/2020/08/23/Notes-on-Moss-and-Followups<h2 id="introduction">Introduction</h2>
<p>Cheating seems quite prevalent in computer science courses: students, often desperate, overwhelmed, and near a deadline, copy solutions or collaborate too closely with others and submit work that isn’t entirely their own. It is all too easy for students to copy and paste code from a friend or from a solution posted on GitHub.</p>
<p>For as long as students have been copying code, instructors have had countermeasures in place to catch them. This post covers a prevalent system, MOSS: how it works, and a few follow-ups to it.</p>
<h3 id="moss"><a href="http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf">MOSS</a></h3>
<p>Moss stands for Measure of Software Similarity and automatically detects similarity between two files of code. You can read more about MOSS <a href="https://theory.stanford.edu/~aiken/moss/">here</a>.</p>
<p>An interesting bit of Moss history is found in this <a href="https://www.berkeley.edu/news/media/releases/97legacy/11_19_97b.html">Berkeley news article from 1997</a>. At the time, Alex Aiken was a professor at Berkeley. According to the article, about 10 percent of students were caught cheating when Moss was used in Professor Griswold’s compilers course at UC San Diego; Professor Aiken’s penalty for cheating was a 0 on the assignment plus a decrease of one letter grade; and his long-term goal was not to flunk students but to deter students as a whole from cheating.</p>
<p>Moss and previous work deal with finding k-grams that match between documents, where a k-gram is a contiguous substring of length k. Copy detection before Moss removed irrelevant features from the text, split the text into k-grams, hashed them, selected a subset of those hashes to be the document’s fingerprints, and checked whether those fingerprints matched another document’s.</p>
<p>Example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Text:
1. Alex Kassil
2. Alexander F. Kassil
Text with irrelevant Features Removed:
1. alexkassil
2. alexanderfkassil
Text split into 5-grams:
1. alexk lexka exkas xkass kassi assil
2. alexa lexan exand xande ander nderf derfk erfka rfkas fkass kassi assil
Imagining our subset selection took every other 5-gram,
kassi or assil would match between the two documents and be the one fingerprint match
</code></pre></div></div>
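<p>The example above can be reproduced with a few lines of Python. This is a minimal sketch, not Moss’s actual code: the helper names <code class="language-plaintext highlighter-rouge">normalize</code>, <code class="language-plaintext highlighter-rouge">kgrams</code>, and <code class="language-plaintext highlighter-rouge">every_other</code> are made up for illustration, raw k-grams stand in for their hashes, and “every other k-gram” is the toy selection rule from the example.</p>

```python
import re


def normalize(text):
    """Lowercase the text and drop everything that is not a letter or digit."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def kgrams(text, k):
    """Return every contiguous substring of length k, in order."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


def every_other(grams):
    """Toy subset selection: keep the k-grams at even positions."""
    return set(grams[::2])


a = kgrams(normalize("Alex Kassil"), 5)
b = kgrams(normalize("Alexander F. Kassil"), 5)

# Fingerprints that both documents have in common.
shared = every_other(a) & every_other(b)
print(a)
print(shared)
```

<p>With this particular selection rule the only shared fingerprint is <code class="language-plaintext highlighter-rouge">kassi</code>; starting the selection one position later would pick <code class="language-plaintext highlighter-rouge">assil</code> instead.</p>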
<p>There are three desirable properties of copy-detection: whitespace insensitivity, noise suppression, and position independence.</p>
<p>For whitespace insensitivity, everything other than text is removed, text is lowercased, and for code all variable names are replaced with “V” so that matching does not depend on variable names.</p>
<p>For noise suppression, k must be chosen so that it is not so small that many short, coincidental commonalities are reported, and not so large that nothing ever matches.</p>
<p>For position independence, fingerprints are chosen independently of where they occur in the document. One way to do this is to keep only the hashes that equal <em>0 mod p</em>.</p>
<p>The Moss paper describes an algorithm, called <strong>winnowing</strong>, for selecting which hashes become the fingerprints. The algorithm slides a window across the hashes and repeatedly selects the minimum in the window to add to the list of fingerprints.</p>
<p><img src="/assets/winnowing.png" alt="image from the Moss paper" /></p>
<p>The winnowing algorithm produces fingerprints with desirable properties: a guarantee threshold t, where matches at least as long as t are always detected, and a noise threshold k, where matches shorter than k are never detected.</p>
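<p>As a rough sketch of the window-sliding step (not the paper’s reference implementation), here is winnowing over a list of already-computed hashes. The function name <code class="language-plaintext highlighter-rouge">winnow</code>, the sample hash values, and the window size of 4 are assumptions for illustration. In this sketch ties are broken by taking the rightmost minimum, and each fingerprint is recorded once together with its position.</p>

```python
def winnow(hashes, w):
    """Select the minimum hash of every window of size w.

    Returns a list of (hash, position) pairs. A pair is only recorded
    when it differs from the previously recorded one, so a minimum that
    stays inside several consecutive windows is kept once.
    """
    fingerprints = []
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        # Rightmost occurrence of the minimum inside this window.
        offset = w - 1 - window[::-1].index(smallest)
        pick = (smallest, start + offset)
        if not fingerprints or fingerprints[-1] != pick:
            fingerprints.append(pick)
    return fingerprints


# Sample hash values, chosen for illustration only.
sample = [77, 74, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 74, 42, 17, 98]
print(winnow(sample, 4))
```

<p>The window size is what ties the two thresholds together: with k-grams of length k and a guarantee threshold t, the window covers w = t - k + 1 hashes, so any match of length t must contain at least one selected hash.</p>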
<p>So the Moss algorithm improves on previous detection algorithms, and it is so prevalent because it provides a nice web interface for all to use. For matching code specifically, the algorithm can easily ignore boilerplate by fingerprinting the boilerplate with a special document ID that indicates any match with that fingerprint should be discarded. A simple, elegant algorithm for plagiarism detection.</p>
<h3 id="taps-a-moss-extension-for-detecting-software-plagiarism-at-scale"><a href="http://lucylabs.gatech.edu/b/wp-content/uploads/2016/04/wip146-sheahenA.pdf">TAPS: A MOSS Extension for Detecting Software Plagiarism at Scale</a></h3>
<p><a href="https://github.com/danainschool/moss-taps">Code for this project</a></p>
<p>A <strong>MOSS</strong> <strong>T</strong>ool <strong>A</strong>ddressing <strong>P</strong>lagiarism at <strong>S</strong>cale. The original Moss project works great for a single set of documents that you want to check against each other and against online solutions that might exist, like a one-off job for a big course project. TAPS is an extension that does a lot of important preprocessing when a class has multiple assignments as well as multiple previous offerings with similar or identical assignments.</p>
<p>The problem TAPS solves is that it becomes quite cumbersome to check a current batch of assignments against each other as well as against previous ones, so TAPS:</p>
<ol>
<li>Allows for mixed languages, and separates them before submission to Moss.</li>
<li>Deals with file management, like zips and nested directories, which need to be expanded and normalized before being sent to Moss.</li>
<li>Filters matches, since a student’s nth assignment shouldn’t be checked against their own (n - 1)th assignment, where matches are likely when assignments build upon each other.</li>
</ol>
<p>Impressively, this slashed the time an instructor spent organizing, submitting, and filtering class assignments for software plagiarism detection from 50 hours to only 10 minutes.</p>
<h3 id="tmoss-using-intermediate-assignment-work-to-understand-excessive-collaboration-in-large-classes"><a href="https://stanford.edu/~cpiech/bio/papers/tmoss.pdf">TMOSS: Using Intermediate Assignment Work to Understand Excessive Collaboration in Large Classes</a></h3>
<p><a href="https://github.com/yanlisa/tmoss">Code for this project</a></p>
<p>TMOSS is another extension to Moss, relying on the fact that students often have backups of intermediate assignment work through, say, git or okpy. By using these backups as well as the final submission for plagiarism detection, TMOSS takes longer to run, but it is “almost twice as effective as traditional software similarity detectors in identifying the number of students who exhibit excessive collaboration”. Also of interest, the paper finds “that such students [who cheat] spend significantly less time on their assignment, use fewer class tutoring resources, and perform worse on exams than their peers”.</p>
<p>The heart of the paper is the algorithm below, which simply adds a comparison of each student’s intermediate backups against every other student’s submission as well as against online solutions.</p>
<p><img src="/assets/tmoss.png" alt="algorithm from the TMOSS paper" /></p>
<p>Interestingly enough, the paper also examines how the day a student starts an assignment affects midterm/final scores, with HEC standing for hypothesized excessive collaboration, i.e. students who are suspected of cheating or copying solutions. Students who started earlier and didn’t cheat had higher exam scores, while students flagged as HEC had lower exam scores regardless of when they started.</p>
<p><img src="/assets/tmoss-startday-graph.png" alt="image from the TMOSS paper showing start day vs midterm score" /></p>
<h2 id="conclusion">Conclusion</h2>
<p>Moss allows for great plagiarism detection checks. TAPS makes it easy to manage Moss submissions when a course is offered repeatedly or its assignments build on each other, and TMOSS allows for better checks when in-progress work on assignments is available too.</p>Back to 42?!?2018-04-29T00:00:00-07:002018-04-29T00:00:00-07:00/2018/04/29/Back-To-42-Maybe<p>I’ve been loving my first year at UC Berkeley, which I plan to detail in a different post. One thing I’ve missed is the intense focus on programming I had at 42. During the piscine, I had a single goal in mind - programming. Here there are clubs and classes and people and many other distractions. It’s really great, but I miss the hardcore nature of programming at 42.</p>
<p>And so I’ve started working on libft again. I am confident I can finish it within the next couple of days, get it corrected at 42, and maybe keep the connection open and continue doing something small, part-time, at 42. I’ll be busy this summer interning, but maybe I’ll come during the weekends, and then come more often during my sophomore year. We shall see. All I know is I’m going to give it a damn good try.</p>42 USA Piscine Results2017-07-26T00:00:00-07:002017-07-26T00:00:00-07:00/piscine/2017/07/26/Piscine-Results<h1 id="i-got-in">I GOT IN!!!</h1>
<h1 id="yay">YAY!!!!!!!!</h1>
<p><img src="/assets/42acceptance.png" alt="Screenshot of the acceptance email" /></p>
<p>Only 5 short (read: long) days after the piscine I, and many of my friends from the 42 piscine, received this very email.</p>
<p>I start the 3-5 year program September 19th, but 42 will not be my focus; Berkeley will. 42 will, however, serve as an extracurricular that delivers many interesting side projects!</p>Exam Final: Eight Hours of Fun2017-07-21T00:00:00-07:002017-07-21T00:00:00-07:00/piscine/2017/07/21/Piscine-Exam-Final<h2 id="the-final-exam-and-the-final-day-day-26">The Final Exam and The Final Day (Day 26)</h2>
<p>It’s over. After almost one month of grueling, intense programming, 12 hours a day, I finished the piscine. What a month. It flew by, since every day I was so busy. I also barely procrastinated and spent very little time on my phone. I was more or less 100% focused on programming. I will speak more about my final thoughts in the next post. For now, I will just recap the last day.</p>
<h3 id="one-last-exam">One Last Exam</h3>
<p>This final exam was special, in that it was twice as long as the other three: eight hours instead of the usual four. It also went much further, going way past level 5 up to level 10. I only (barely) made it to level 9. The first 6 questions were those of a normal exam, level 0 to level 5. I breezed through them. Each of these questions was worth 9 points, but if you got one wrong you would get a question of the same level worth only 4 points.</p>
<h4 id="level-0">Level 0</h4>
<p>Print last param. Using only write, I had to display the last command line argument passed into the program. Very simple.</p>
<h4 id="level-1">Level 1</h4>
<p>String copy. I had to write a function that takes two char arrays as inputs and copies the second’s value into the first, and return the first. Quite simple.</p>
<h4 id="level-2">Level 2</h4>
<p>String compare. Return the ASCII difference between two strings. I wrote a function that iterated through the two strings until a character differed or null was reached, and then returned the difference between the current chars. Pretty straightforward.</p>
<h4 id="level-3">Level 3</h4>
<p>Ft rrange. Using malloc, return an array filled with ints starting from input end and ending at input start. Not too hard, but definitely starting to ramp up in complexity. At this point I was about two hours in, and at 36 points.</p>
<h4 id="level-4">Level 4</h4>
<p>Reverse string by whitespace. This one was the first new problem for me. I had to write a program that takes a string delimited by whitespace and prints the string with the words reversed. <code class="language-plaintext highlighter-rouge">./rev_wstr "This is a test"</code> would output <code class="language-plaintext highlighter-rouge">"test a is This"</code>. I ended up doing this one recursively, printing the last word, then setting the last space to null and calling the function again. Conceptually interesting, and nontrivial to implement.</p>
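<p>The recursive idea can be sketched in Python. This is only an illustration of the logic, not the C program written during the exam (which could only use write): slicing the string stands in for overwriting the last space with null, and the input is assumed to have single spaces between words.</p>

```python
def rev_wstr(s):
    """Return the words of s in reverse order, one recursive call per word."""
    cut = s.rfind(" ")
    if cut == -1:
        # Only one word left, nothing more to reverse.
        return s
    # Emit the last word, then recurse on everything before the last space.
    return s[cut + 1:] + " " + rev_wstr(s[:cut])


print(rev_wstr("This is a test"))
```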
<h4 id="level-5">Level 5</h4>
<p>Brainfuck for the third time! This time I breezed through it in 15 minutes. It’s crazy how something so complex to me on the first exam became so easy! I was even able to explain how brainfuck works/how to write an interpreter for it to others so they could pass brainfuck on the exam. Super easy this time around, satisfying to complete. ~3 hours in</p>
<h4 id="level-6---uncharted-territory">Level 6 - Uncharted territory</h4>
<p>Count alpha. Taking a string as input, print out the number of each letter present. So <code class="language-plaintext highlighter-rouge">./count_alpha "Hello, world!"</code> should display <code class="language-plaintext highlighter-rouge">"1h, 1e, 3l, 2o, 1w, 1r, 1d"</code>. This problem wasn’t too difficult. I used a global char array with 26 elements to keep track of which letters had already been printed so as to not print any letter twice, and then iterated over the input displaying the number and letter. Kind of difficult, but also fun. ~4 hours in</p>
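<p>Here is a short Python sketch of the same approach, assuming letters are counted case-insensitively and printed in order of first appearance, as in the example output. The 26-element <code class="language-plaintext highlighter-rouge">seen</code> list plays the role of the global array; the rest is an illustration, not the original C.</p>

```python
def count_alpha(s):
    """Return 'countletter' pairs in order of first appearance."""
    text = s.lower()
    seen = [False] * 26  # one flag per letter, set once it has been emitted
    parts = []
    for ch in text:
        if "a" <= ch <= "z" and not seen[ord(ch) - ord("a")]:
            seen[ord(ch) - ord("a")] = True
            parts.append(f"{text.count(ch)}{ch}")
    return ", ".join(parts)


print(count_alpha("Hello, world!"))
```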
<p>I later learned from the cadets that these last 5 levels and the problems they hold are only used for the final exam of the C piscine. It was honestly jarring during the exam when I got to level 6. I thought I would have to do multiple level 5 questions.</p>
<h4 id="level-7">Level 7</h4>
<p>Order by alpha and length. This one was long. You had to take the string given as input, and then print all the words in it sorted first by length and then by lexicographical order. For this problem I had to actually rewrite a level 4 problem, split whitespaces, in order to manage the given input. Hard and long. ~6 hours in</p>
<h4 id="level-8-attempt-1">Level 8 Attempt 1</h4>
<p>Count Island. The first problem I failed. For this one you had to design a program that takes a map as input (a rectangle populated only by .’s and #’s), and number each island (represented as a group of touching #’s). My solution worked for the two given examples, but didn’t pass the tests. I believe this is because I didn’t clear the buffer every time the program ran. Fun, but difficult. ~7 hours in.</p>
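<p>One common way to number islands is a flood fill, sketched below in Python. This is not the solution from the exam. It assumes that “touching” means sharing an edge, that islands are numbered from 0 in reading order, and that there are fewer than 10 islands so each label stays one character wide. The function name <code class="language-plaintext highlighter-rouge">count_island</code> is reused here only as a label.</p>

```python
def count_island(grid):
    """Replace each group of edge-connected '#' cells with its island number."""
    cells = [list(row) for row in grid]
    label = 0
    for r in range(len(cells)):
        for c in range(len(cells[r])):
            if cells[r][c] != "#":
                continue
            # Found an unlabeled island: flood fill it with the current label.
            cells[r][c] = str(label)
            stack = [(r, c)]
            while stack:
                y, x = stack.pop()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < len(cells) and 0 <= nx < len(cells[ny])
                    if inside and cells[ny][nx] == "#":
                        cells[ny][nx] = str(label)
                        stack.append((ny, nx))
            label += 1
    return ["".join(row) for row in cells]


for line in count_island(["##..", "#..#", "...#"]):
    print(line)
```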
<h4 id="level-8-attempt-2">Level 8 Attempt 2</h4>
<p>Infinity Addition. For this program I had to take 2 strings representing valid integers as input, and return their sum. I ended up doing this recursively, adding (or subtracting) the number digit by digit with the carry recursively to push the first result way back into the stack and print it at the very end. While this problem wasn’t crazy difficult, I was so exhausted and mentally drained I had a lot of trouble getting it to work. Before I submitted it with minutes left, I kept having seg faults for the input of -10 and 9. Thankfully the automated tests didn’t catch that and I passed this problem. Very difficult due to physical and mental exhaustion. ~7 hours and 57 minutes in</p>
<h4 id="level-9">Level 9</h4>
<p>Graph diameter. Given a string with graph links, return the length of the largest circular route. Now, I’ve never done any work with graphs, and I only had three minutes left. So I was unable to even start this question.</p>
<h4 id="level-10-didnt-get-to-this-one">Level 10 (Didn’t get to this one)</h4>
<p>MD5. Rewrite the MD5 hash algorithm. Yep, this one is hard. No one, as far as I know, has gotten this right.</p>
<p>I ended up with a 76/100. The highest score in my piscine was 81/100. I was very happy with what I got, it being the second highest. I definitely enjoyed this very long exam, but was brain dead afterwards.</p>
<h3 id="bbq">BBQ</h3>
<p>Woohoo! Piscine is over. The pisciners plus some cadets all had a barbecue on the lawn. It was nice to relax and play volleyball and eat food and be done with the piscine.</p>
<h3 id="viewing-of-the-hitchhikers-guide-to-the-galaxy">Viewing of The Hitchhiker’s Guide to the Galaxy</h3>
<p>To end the piscine, the movie club had a viewing of the movie that 42 got its name from. It also made clear a lot of the references during the piscine. Vogsphere, the server we pushed all our work to, is from Hitchhiker’s Guide to the Galaxy. So are Sastantua, Marvin, and 42! A great way to end a great month of learning.</p>BSQ: BSQ2017-07-19T00:00:00-07:002017-07-19T00:00:00-07:00/piscine/2017/07/19/BSQ<h2 id="the-last-day-of-the-last-project-day-24">The Last Day of The Last Project (Day 24)</h2>
<p>The last project is BSQ, aka Biggest SQuare. For this project we got to choose partners, which was a refreshing change from the randomly assigned rush groups. Also it was a team of two instead of three. For this project we had to find the biggest square from a given map filled with empty spaces and obstacles. The cool thing about this project was that we had to take into consideration not only correctness, but also speed and efficiency.</p>
<h3 id="the-strategy">The strategy</h3>
<p>What my group ended up doing was brute force. We inefficiently sent multiple copies of the input map around to functions that had while loops inside of while loops. Our time complexity was O(n ^ 8). But it worked. And 50% of the grade is correctness.</p>
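<p>For contrast, the biggest square can be found in a single pass over the map with dynamic programming. The Python sketch below is not what we submitted. It assumes a rectangular map given as a list of strings and an obstacle character of <code class="language-plaintext highlighter-rouge">o</code>; both the function name and that character are assumptions made for the example.</p>

```python
def biggest_square(grid, obstacle="o"):
    """Return (size, top_row, left_col) of the largest obstacle-free square."""
    best = (0, 0, 0)
    prev = [0] * (len(grid[0]) + 1) if grid else []
    for r, row in enumerate(grid):
        cur = [0] * (len(row) + 1)
        for c, ch in enumerate(row):
            if ch != obstacle:
                # Side of the largest empty square whose bottom-right
                # corner is (r, c): limited by the squares ending just
                # above-left, above, and to the left of this cell.
                cur[c + 1] = 1 + min(prev[c], prev[c + 1], cur[c])
                if cur[c + 1] > best[0]:
                    size = cur[c + 1]
                    best = (size, r - size + 1, c - size + 1)
        prev = cur
    return best


print(biggest_square(["....", ".o..", "....", "...."]))
```

<p>Each cell is visited once, so the work grows with the number of cells in the map instead of with nested scans over it.</p>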
<p><strong>UPDATE</strong>
We needed to print a newline after every map, and we didn’t. 0/100</p>Piscine Day 11: Making the most of Makefiles2017-07-11T00:00:00-07:002017-07-11T00:00:00-07:00/piscine/2017/07/11/Piscine-Day-11<h2 id="day-11-day-16">Day 11 (Day 16)</h2>
<p>While Day 11 opened up today, I did not get to it. I was so busy finishing up Day 10! I learned a lot about Makefiles yesterday, and today I got to use them, together with function pointers, for quite a few exercises. The coolest exercise of today was a rudimentary calculator that would take operations and integer inputs and give out the result.</p>
<p>Today I almost ruined my whole submission. With minutes to go before 11:42, I accidentally pushed my whole practice folder to the Vogsphere git submission repository rather than just my files for the day. Thankfully someone with a lot more git knowledge than me fixed my stupid error and it all worked out. I definitely started to panic at the last minute.</p>
<p>Time has been flying by. It’s crazy how every day I am so busy, working and working. Hours fly by, and the blurred together days are only delineated by food and sleep. I love it.</p>Piscine Day 08: And Day 07 And Day 092017-07-06T00:00:00-07:002017-07-06T00:00:00-07:00/piscine/2017/07/06/Piscine-Day-08<h2 id="day-08-day-11">Day 08 (Day 11)</h2>
<p>Today was a lot of work. For a few hours I had 3 different days open at once. Day 07 was due at 11:42 pm on Thursday, Day 08 opened at 8:42 am on Thursday, and Day 09 started at 5:42 pm on Thursday, so from 5:42 to 11:42 I was swamped with work. Quite a lot of fun and a lot to learn. Day 08 introduced the fascinating concept of structures.</p>
<p>Structures are a compound data type that groups several variables into one. They are similar to classes, but they cannot have any functions. We also learned about header files. A header file takes all the repetitive preprocessing from the beginning of a C file and puts it in one place, where macros, prototypes, structures, and even global variables can be defined. These header files can then be used and reused for all the C files in a project.</p>
<p>As I learn more and more C components, I am itching to build some sort of grand project in C. Hopefully the final project of the piscine can be that.</p>