<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://wangcong.org/feed.xml" rel="self" type="application/atom+xml" /><link href="https://wangcong.org/" rel="alternate" type="text/html" /><updated>2026-07-01T03:29:22+00:00</updated><id>https://wangcong.org/feed.xml</id><title type="html">A Geek’s Page</title><subtitle>Cong Wang&apos;s Personal Blog Posts</subtitle><entry><title type="html">Why I Stopped Arguing With People</title><link href="https://wangcong.org/2026-06-30-why-i-stopped-arguing-with-people.html" rel="alternate" type="text/html" title="Why I Stopped Arguing With People" /><published>2026-06-30T20:00:00+00:00</published><updated>2026-06-30T20:00:00+00:00</updated><id>https://wangcong.org/why-i-stopped-arguing-with-people</id><content type="html" xml:base="https://wangcong.org/2026-06-30-why-i-stopped-arguing-with-people.html"><![CDATA[<p>I am a software engineer, and I used to enjoy arguing with people for technical correctness. Code reviews, design meetings, mailing-list threads, dinner tables. If someone was wrong, I wanted them to know it, and I wanted them to know exactly <em>why</em>. I collected counterarguments the way I collected patches. I believed that if I just laid out the logic clearly enough, the other person would have no choice but to come around. Truth would win.</p>

<p>It almost never worked that way.</p>

<p>Sometimes I won on points and lost the person. More often I won nothing at all: I’d watch someone grow more certain of the very thing I had just disproven, while the room quietly drifted to their side. I would walk away technically right and completely alone.</p>

<p>Over the years I’ve slowly stopped arguing. Not because I stopped caring about being right, but because I finally understood what an argument actually is, and what it can and cannot do. Here is what changed my mind.</p>

<h2 id="being-correct-is-not-always-good">Being Correct Is Not Always Good</h2>

<p>The first thing I had to give up was the belief that being correct is always good. As an engineer, this felt like heresy. Correctness is the whole job. But correctness in a fact is not the same as goodness in a moment.</p>

<p>Lao Tzu saw this 2,500 years ago. In chapter 2 of the <em>Tao Te Ching</em>:</p>

<blockquote>
  <p>Being and non-being create each other.</p>

  <p>Hard and easy complete each other.</p>

  <p>Long and short define each other.</p>

  <p>High and low depend on each other.</p>

  <p>Sound and silence harmonize each other.</p>
</blockquote>

<p>Everything exists only in relation to its opposite. There is no “right” without a “wrong” to make it right, and the moment you insist on standing on the high ground, you’ve created the low ground someone else must stand on. Winning an argument manufactures a loser. Being visibly correct manufactures someone visibly wrong.</p>

<p>So being right is not a pure good floating in space. It’s half of a pair, and it drags its opposite along with it. Once I stopped treating correctness as an absolute, I stopped needing to win.</p>

<h2 id="most-arguments-are-about-ego-not-ideas">Most Arguments Are About Ego, Not Ideas</h2>

<p>When you argue with someone, you think you’re debating an idea. Often you’re not. You’re challenging their sense of self.</p>

<p>Many people are ego-driven. Their opinions aren’t positions they hold; they <em>are</em> the position. Prove the idea wrong and you haven’t corrected a fact, you’ve attacked a person. So they defend it the way anyone defends themselves: not with reason, but with resistance. The stronger your argument, the harder they dig in.</p>

<p>You can’t win an argument like this, because it was never an argument. It was a fight over whose ego stays intact. Even when you “win,” you lose, because now you have an enemy who is more convinced than before.</p>

<p>So I’ve drawn a line. I only discuss pros and cons with smart people; I don’t argue right and wrong with ego-driven ones. With the first kind, a disagreement is a joint search for the better answer, and both of us walk away sharper. With the second, there is no answer being sought, only a self to be defended. Knowing which conversation you’re in is half the battle. The other half is having the discipline to walk away from the second one.</p>

<h2 id="people-are-not-rational">People Are Not Rational</h2>

<p>We like to believe humans are rational animals who occasionally feel emotions. It’s the reverse. We are emotional animals who occasionally think.</p>

<p>Most people don’t reason their way to conclusions and then feel accordingly. They feel first, then reason backward to justify the feeling. They follow the crowd, mistake confidence for correctness, and adopt whatever the people around them already believe. Independent thinking is rare, far rarer than we admit.</p>

<p>Once you accept this, arguing with logic starts to look absurd. You’re bringing a proof to a feeling. The proof is airtight. The feeling doesn’t read.</p>

<h2 id="correcting-others-rarely-helps-them">Correcting Others Rarely Helps Them</h2>

<p>“But my motivation is good,” you say. “I’m not attacking anyone. I’m just pointing out a mistake so they don’t get hurt.”</p>

<p>I believed this for a long time. It sounds noble. But even with the best intentions, correcting people usually fails. And here’s the hard part: however good your intentions, <em>don’t do it.</em></p>

<p>People don’t see your motivation. They see criticism. They rarely understand why you bothered, and they almost never appreciate it. Worse, most people don’t learn from advice at all. They learn from consequences. They have to touch the stove themselves. Words bounce off; pain sticks.</p>

<p>This sounds cold. It is. But it’s also, sadly, true. The most respectful thing you can often do is let people meet their own consequences, because that’s the only teacher they’ll actually listen to.</p>

<h2 id="the-one-exception-when-they-ask">The One Exception: When They Ask</h2>

<p>There’s a clean exception to all of this, and it flips the entire logic.</p>

<p>Help people when they <em>explicitly ask for help.</em></p>

<p>When someone asks, the cause and effect reverse. You’re no longer imposing your judgment on someone who never wanted it. Their asking is the cause; your helping is the effect. Now there’s an opening, a real one, because they’ve decided they’re ready to hear it. The ego is lowered. The defenses are down. The advice lands.</p>

<p>So I don’t offer anymore. I wait for the door to open from the inside. And when someone opens it, I give everything I have.</p>

<h2 id="dont-win-the-argument-profit-from-the-difference">Don’t Win the Argument, Profit From the Difference</h2>

<p>If letting go of the argument sounds like pure loss, here’s the reframe that turns it into a gain.</p>

<p>When you and someone else see the world differently, you have two options. You can spend your energy trying to convince them you’re right, which, as everything above shows, almost never works. Or you can treat that difference as an asset and go build on it.</p>

<p>If you genuinely believe something others don’t, that’s not a debate to win. That’s an edge. The market rewards being right in a way that no argument ever will. Instead of persuading the skeptic, ship the thing they think is wrong and let reality settle it. Their disagreement isn’t an obstacle; it’s your moat. If everyone already agreed with you, there’d be no opportunity left.</p>

<p>This is especially true if you’re starting your own company. Differentiation is not a side effect of business, it <em>is</em> the business. A startup exists precisely because its founders believe something the rest of the world hasn’t accepted yet. If you could win that argument in a meeting, it wouldn’t be worth a company. The entire value lives in the gap between what you see and what others refuse to.</p>

<p>So I stopped trying to close that gap by talking. I started trying to profit from it by building. Let people disagree. Their disagreement is where the money, and the meaning, is.</p>

<h2 id="you-can-only-change-yourself">You Can Only Change Yourself</h2>

<p>Here’s the part that took me longest to accept.</p>

<p>In this world, there is no one you can change. Not your spouse, not your friends, not your kids, and of course not strangers on the internet. Only yourself.</p>

<p>That’s not cynicism, and it’s not giving up on people. It’s the opposite. It’s putting your energy where it can actually do something. Every hour spent trying to change someone who didn’t ask is an hour stolen from the one person (yourself) you <em>can</em> change.</p>

<p>And changing yourself is enough. You don’t need to fix everyone else to live well. When you become clearer, calmer, more skilled, more honest, the world around you shifts on its own, not because you forced anyone, but because people respond to who you actually are. Change yourself and you’ve changed your entire experience of the world. That is sufficient. Nothing more is required.</p>

<p>Accept this, and a strange peace follows. The arguments fall away. The frustration drains out. You stop trying to win people over and start letting them be who they are.</p>

<hr />

<p>So turn the question around.</p>

<p>If the only person you can change is yourself, then the one question that matters is: how do you actually get better? Not by winning arguments. You get better by asking others for feedback, again and again, and truly listening to it. It’s the same asking I described earlier, the one clean exception, now turned on myself: I’m the one requesting help, so the advice can finally land. And you cannot do that with an ego in the way. The ego that needs to win is the same ego that can’t hear. It’s not just harmful; it’s a disaster, to everyone around you and to yourself most of all, because it quietly walls you off from the one thing that would improve you.</p>

<p>So put it away. Stay humble. Keep asking. That is the whole discipline.</p>

<p>I stopped arguing not because I stopped caring about being right, but because I finally wanted something more than being right: I wanted to keep getting better. And the only door to that is the one ego keeps slamming shut.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[I am a software engineer, and I used to enjoy arguing with people for technical correctness. Code reviews, design meetings, mailing-list threads, dinner tables. If someone was wrong, I wanted them to know it, and I wanted them to know exactly why. I collected counterarguments the way I collected patches. I believed that if I just laid out the logic clearly enough, the other person would have no choice but to come around. Truth would win.]]></summary></entry><entry><title type="html">Personal Taste Is the Moat</title><link href="https://wangcong.org/2026-01-13-personal-taste-is-the-moat.html" rel="alternate" type="text/html" title="Personal Taste Is the Moat" /><published>2026-01-13T20:00:00+00:00</published><updated>2026-01-13T20:00:00+00:00</updated><id>https://wangcong.org/personal-taste-is-the-moat</id><content type="html" xml:base="https://wangcong.org/2026-01-13-personal-taste-is-the-moat.html"><![CDATA[<p>AI can now tell you whether code works. It reviews patches, spots bugs, suggests fixes, and explains trade-offs. Correctness is becoming cheap. Competence is being commoditized.</p>

<p>But there’s something AI cannot do: tell you whether something <em>should exist</em>.</p>

<p>That requires <strong>taste</strong>: judgment formed by long exposure to the best work humans have done, and by living with the consequences of decisions over time. In the AI era, personal taste is the moat.</p>

<h2 id="what-taste-actually-means">What Taste Actually Means</h2>

<p>When people hear “taste,” they think of preference, something arbitrary and subjective. That’s wrong.</p>

<p>Taste is not about what you like. It’s about what you’ve <em>seen</em>. It comes from studying great systems, watching bad ideas fail, understanding where complexity accumulates, knowing which shortcuts age badly, and internalizing what users actually experience. Taste is judgment compressed by time.</p>

<p>That’s precisely why it’s hard to automate, and impossible to shortcut.</p>

<h2 id="proof-rejecting-an-ai-approved-kernel-patch">Proof: Rejecting an AI-Approved Kernel Patch</h2>

<p>Recently, I <a href="https://lore.kernel.org/netdev/CAM_iQpXXiOj=+jbZbmcth06-46LoU_XQd5-NuusaRdJn-80_HQ@mail.gmail.com/" target="_blank">rejected a Linux kernel patchset</a> that had already passed AI-based reviews.</p>

<p>The patch wasn’t broken. It solved a real problem. It was technically coherent. I NACKed it anyway.</p>

<p>Why? The solution was wrong in ways that only matter long-term. The patch introduced a new <code class="language-plaintext highlighter-rouge">skb-&gt;ttl</code> mechanism to prevent packet loops. My objections weren’t about correctness; they were about design:</p>

<ol>
  <li><strong>Bloat</strong>: It increased <code class="language-plaintext highlighter-rouge">sk_buff</code> size even under minimal configuration.</li>
  <li><strong>Wrong layer</strong>: It fixed the symptom (infinite loops) instead of the root cause (enqueuing to the root qdisc).</li>
  <li><strong>Hidden constraints</strong>: It made <code class="language-plaintext highlighter-rouge">netem</code> behavior less predictable by introducing a kernel-internal limit invisible to userspace.</li>
</ol>

<p>No AI reviewer would reject this patch. These aren’t bugs; they’re judgment calls. And judgment calls are exactly where taste lives.</p>

<h2 id="the-limits-of-ai">The Limits of AI</h2>

<p>AI excels at pattern matching, local correctness, consistency with existing code, and applying known best practices. These are valuable. They’re also becoming table stakes.</p>

<p>But systems like the Linux kernel aren’t shaped by correctness alone. They’re shaped by the collective taste of hundreds of maintainers, accumulated over decades, enforced through code review, and passed down through mentorship. The kernel’s design reflects countless judgment calls about what belongs and what doesn’t. These are properties you only <em>feel</em> after years of exposure.</p>

<p>Here’s the key distinction: AI evaluates whether a change fits the rules. Taste decides whether the rules themselves are being bent in the wrong direction.</p>

<h2 id="why-taste-is-now-the-differentiator">Why Taste Is Now the Differentiator</h2>

<p>Before AI, engineers differentiated themselves through speed and raw execution. AI has leveled that playing field. Everyone now has AI reviews, automated tests, copilots, and fast iteration loops.</p>

<p>When correctness becomes commoditized, the differentiator moves up the stack:</p>

<ul>
  <li>Who decides what belongs in the system?</li>
  <li>Who recognizes a bad direction before it’s too late?</li>
  <li>Who can say “this works, but it shouldn’t exist”?</li>
</ul>

<p>That judgment, that taste, is the moat.</p>

<h2 id="the-paradox-better-ai-makes-humans-more-important">The Paradox: Better AI Makes Humans More Important</h2>

<p>As AI removes mechanical friction, it surfaces harder decisions: Should this abstraction exist at all? Is this trade-off worth locking in? Are we leaking complexity to users? Will this design age gracefully?</p>

<p>These questions can’t be answered with more data. They require judgment shaped by years of exposure.</p>

<p>As <a href="https://www.youtube.com/watch?v=5y03eFMmOKY" target="_blank">Steve Jobs put it</a>:</p>

<blockquote>
  <p>Ultimately, it comes down to taste — exposing yourself to the best things humans have done, and bringing those forward into what you’re doing.</p>
</blockquote>

<p>That’s not romanticism. It’s engineering wisdom.</p>

<h2 id="ai-assists-taste-decides">AI Assists. Taste Decides.</h2>

<p>AI should be part of every process. Let it catch mistakes, suggest alternatives, and reduce toil. But passing AI review should never be the acceptance bar.</p>

<p>In domains that endure (kernels, runtimes, protocols, platforms), the final filter must be human judgment, informed by taste. Not because humans are always right, but because the hardest decisions aren’t reducible to rules.</p>

<hr />

<p>AI can tell you whether something works.</p>

<p>Only taste can tell you whether it belongs.</p>

<p>In the AI era, when execution is cheap and correctness is abundant, <strong>personal taste is the moat.</strong></p>]]></content><author><name></name></author><summary type="html"><![CDATA[AI can now tell you whether code works. It reviews patches, spots bugs, suggests fixes, and explains trade-offs. Correctness is becoming cheap. Competence is being commoditized.]]></summary></entry><entry><title type="html">eBPF Trend Analysis</title><link href="https://wangcong.org/2025-02-27-ebpf-trend-analysis.html" rel="alternate" type="text/html" title="eBPF Trend Analysis" /><published>2025-02-27T22:07:00+00:00</published><updated>2025-02-27T22:07:00+00:00</updated><id>https://wangcong.org/ebpf-trend-analysis</id><content type="html" xml:base="https://wangcong.org/2025-02-27-ebpf-trend-analysis.html"><![CDATA[<p>The evolution of eBPF (extended Berkeley Packet Filter) has been one of the most significant developments in Linux system observability and networking over the past decade. As we look at the landscape in early 2025, several key trends have emerged that showcase eBPF’s growing influence and future directions.</p>

<h2 id="0-overall-momentum">0. Overall Momentum</h2>

<h3 id="security-and-observability-convergence">Security and Observability Convergence</h3>

<p>The integration of security tooling with observability has become increasingly prominent. Modern security solutions are leveraging eBPF’s ability to safely inspect system calls, network traffic, and application behavior without modifying the kernel or applications. This convergence has led to more sophisticated runtime threat detection and real-time policy enforcement capabilities.</p>

<h3 id="cloud-native-integration">Cloud Native Integration</h3>

<p>Kubernetes and container orchestration platforms have embraced eBPF as a fundamental building block. Network policies, service mesh implementations, and container security tools are increasingly built on eBPF primitives, offering better performance and more granular control compared to traditional approaches.</p>

<h3 id="performance-optimization">Performance Optimization</h3>

<p>The overhead of eBPF programs has been continuously decreasing thanks to improvements in the JIT compiler and verifier. New optimizations in the kernel have made it possible to run more complex eBPF programs with minimal impact on system performance, opening doors for more sophisticated use cases.</p>

<h3 id="cross-platform-support">Cross-Platform Support</h3>

<p>While eBPF originated in Linux, efforts to bring eBPF capabilities to other operating systems have gained momentum. Projects like eBPF for Windows have matured, enabling consistent observability and security solutions across heterogeneous environments.</p>

<h3 id="programming-model-evolution">Programming Model Evolution</h3>

<p>The developer experience around eBPF has significantly improved. High-level languages and frameworks have abstracted away much of the complexity, making eBPF programming more accessible to a broader range of developers. Tools like libbpf-bootstrap and modern CO-RE (Compile Once – Run Everywhere) support have simplified the development workflow.</p>

<p>Below is a detailed analysis of the key trends and future directions of eBPF in early 2025.</p>

<hr />

<h2 id="1-academic-research-trends">1. Academic Research Trends</h2>

<h3 id="11-publication-counts-over-time">1.1 Publication Counts Over Time</h3>

<p>An analysis of Google Scholar and IEEE Xplore results for 2014 through 2023 shows a clear rise in eBPF-related publications. The table below gives a rough estimate of the number of publications per year that explicitly mention “extended BPF” or “eBPF” in the title, abstract, or keywords, along with the approximate total citations of those papers.</p>

<table>
  <thead>
    <tr>
      <th style="text-align: right"><strong>Year</strong></th>
      <th style="text-align: right"><strong>Number of Publications</strong></th>
      <th style="text-align: right"><strong>Estimated Total Citations</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: right">2014</td>
      <td style="text-align: right">~2</td>
      <td style="text-align: right">&lt; 10</td>
    </tr>
    <tr>
      <td style="text-align: right">2015</td>
      <td style="text-align: right">~5</td>
      <td style="text-align: right">~30</td>
    </tr>
    <tr>
      <td style="text-align: right">2016</td>
      <td style="text-align: right">~12</td>
      <td style="text-align: right">~100</td>
    </tr>
    <tr>
      <td style="text-align: right">2017</td>
      <td style="text-align: right">~20</td>
      <td style="text-align: right">~220</td>
    </tr>
    <tr>
      <td style="text-align: right">2018</td>
      <td style="text-align: right">~35</td>
      <td style="text-align: right">~550</td>
    </tr>
    <tr>
      <td style="text-align: right">2019</td>
      <td style="text-align: right">~50</td>
      <td style="text-align: right">~1,200</td>
    </tr>
    <tr>
      <td style="text-align: right">2020</td>
      <td style="text-align: right">~65</td>
      <td style="text-align: right">~2,000</td>
    </tr>
    <tr>
      <td style="text-align: right">2021</td>
      <td style="text-align: right">~80</td>
      <td style="text-align: right">~3,500</td>
    </tr>
    <tr>
      <td style="text-align: right">2022</td>
      <td style="text-align: right">~95</td>
      <td style="text-align: right">~5,000</td>
    </tr>
    <tr>
      <td style="text-align: right">2023</td>
      <td style="text-align: right">~110</td>
      <td style="text-align: right">~7,500+</td>
    </tr>
  </tbody>
</table>

<ul>
  <li><strong>Key Observations</strong>:
    <ul>
      <li>Starting from just ~2 publications in 2014, the field has grown steadily each year.</li>
      <li>By 2020, there were ~65 publications with ~2,000 total citations, showing increasing academic impact.</li>
      <li>By 2023, the field had reached ~110 publications with over 7,500 citations, demonstrating mainstream academic adoption.</li>
      <li>Topics include verifier correctness, advanced debugging, SDN, performance analysis, and real-time intrusion detection.</li>
    </ul>
  </li>
</ul>
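
<p>To put a single number on that growth, here is a minimal sketch that computes the compound annual growth rate of the publication counts. The figures are copied from the rough estimates in the table above, and the helper name <code class="language-plaintext highlighter-rouge">compound_annual_growth</code> is made up for this illustration.</p>

```python
# Rough publication counts copied from the table above (all values are estimates).
publications = {
    2014: 2, 2015: 5, 2016: 12, 2017: 20, 2018: 35,
    2019: 50, 2020: 65, 2021: 80, 2022: 95, 2023: 110,
}


def compound_annual_growth(series):
    """Return the compound annual growth rate between the first and last year."""
    years = sorted(series)
    first, last = years[0], years[-1]
    periods = last - first  # number of year-to-year steps
    return (series[last] / series[first]) ** (1 / periods) - 1


cagr = compound_annual_growth(publications)
print(f"Publications CAGR 2014-2023: {cagr:.0%}")
```

<p>Because the 2014 baseline is only ~2 papers, the result is very sensitive to that starting estimate and should be read as an order-of-magnitude indicator, not a precise rate.</p>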

<hr />

<h2 id="2-industry-adoption">2. Industry Adoption</h2>

<h3 id="21-major-open-source-projects-on-github">2.1 Major Open-Source Projects on GitHub</h3>

<p>Before highlighting specific projects, it helps to look at the <strong>year-by-year growth</strong> of eBPF-related repositories on GitHub. Because GitHub does not offer official historical counts, these <strong>figures are approximate</strong> snapshots gleaned from periodic searches, archives, and retrospective queries. Treat them as <strong>best-effort estimates</strong> rather than definitive totals:</p>

<table>
  <thead>
    <tr>
      <th><strong>Year</strong></th>
      <th style="text-align: right"><strong>Approx. Number of eBPF-Labeled GitHub Repos</strong></th>
      <th><strong>Notes / Methodology</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>2016</strong></td>
      <td style="text-align: right">&lt; 20</td>
      <td>Early references to eBPF often used “BPF” generally; only a handful of repos explicitly labeled “eBPF.”</td>
    </tr>
    <tr>
      <td><strong>2017</strong></td>
      <td style="text-align: right">~70</td>
      <td>Growth fueled by the initial popularity of bcc (BPF Compiler Collection) and a few early Cilium and Falco prototypes.</td>
    </tr>
    <tr>
      <td><strong>2018</strong></td>
      <td style="text-align: right">~150</td>
      <td>Rapid increase as more developers labeled projects with the “eBPF” tag. Influenced by blogs/conference talks introducing eBPF for XDP.</td>
    </tr>
    <tr>
      <td><strong>2019</strong></td>
      <td style="text-align: right">~300</td>
      <td>Cilium, Falco, and other major eBPF-based projects gained traction; more third-party integrations started to appear.</td>
    </tr>
    <tr>
      <td><strong>2020</strong></td>
      <td style="text-align: right">~600</td>
      <td>Surge in eBPF usage for Kubernetes security/observability, plus the release of new bcc/bpftrace tools.</td>
    </tr>
    <tr>
      <td><strong>2021</strong></td>
      <td style="text-align: right">~1,200</td>
      <td>Cloud providers and larger enterprises increasingly adopted eBPF; many specialized repos appeared for tracing, security, and performance.</td>
    </tr>
    <tr>
      <td><strong>2022</strong></td>
      <td style="text-align: right">~1,500</td>
      <td>eBPF became a common keyword in the cloud-native ecosystem; the ecosystem of supporting tools expanded significantly.</td>
    </tr>
    <tr>
      <td><strong>2023</strong></td>
      <td style="text-align: right">~1,800</td>
      <td>Ongoing adoption in observability/security contexts; the “eBPF” label on GitHub spread to more experimental and domain-specific projects.</td>
    </tr>
    <tr>
      <td><strong>2024</strong></td>
      <td style="text-align: right">~2,000+</td>
      <td>As of early 2025, a fresh GitHub search can exceed 2,000 results across public repos explicitly referencing or tagging eBPF.</td>
    </tr>
  </tbody>
</table>

<ul>
  <li><strong>Observations</strong> about eBPF-labeled repos:
    <ol>
      <li><strong>Steady Increase</strong>: From fewer than 20 repositories in 2016, the count has grown to over 2,000 by 2024.</li>
      <li><strong>Influence of Major Projects</strong>: Jumps in 2018–2019 and 2020–2021 correlate with the rise of bcc, Cilium, Falco, bpftrace, etc., as developers created or labeled new repos.</li>
      <li><strong>Domain-Specific Tools</strong>: From 2021 onward, more specialized repos emerged (database monitoring, HPC instrumentation, IoT edge, etc.).</li>
      <li><strong>Data Limitations</strong>: Older repos may have retroactively added “eBPF,” some eBPF-related work is labeled only “BPF” or “XDP,” and private repos are absent from counts.</li>
    </ol>
  </li>
</ul>
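
<p>The observations above can be checked with a short sketch that computes the year-over-year ratio of the repository counts. The numbers are the approximate values from the table; 2016 is listed only as “&lt; 20,” so 20 is used as an upper bound, which is an assumption made purely for illustration. The helper name <code class="language-plaintext highlighter-rouge">year_over_year</code> is likewise hypothetical.</p>

```python
# Approximate repo counts from the table above; 2016 is listed as "under 20",
# so 20 is used here as an upper bound (an assumption for illustration).
repos = {
    2016: 20, 2017: 70, 2018: 150, 2019: 300, 2020: 600,
    2021: 1200, 2022: 1500, 2023: 1800, 2024: 2000,
}


def year_over_year(series):
    """Map each year to the ratio of its count to the previous year's count."""
    years = sorted(series)
    return {
        year: series[year] / series[prev]
        for prev, year in zip(years, years[1:])
    }


ratios = year_over_year(repos)
for year, ratio in ratios.items():
    print(f"{year}: x{ratio:.2f}")
```

<p>On these estimates the count at least doubles every year through 2021 and then grows by 25% or less per year, which is worth keeping in mind when reading the “steady increase” as a single trend.</p>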

<p>Separately, there are several <strong>flagship eBPF-based tools</strong> whose GitHub star counts illustrate community mindshare (as of early 2025):</p>

<table>
  <thead>
    <tr>
      <th><strong>Project</strong></th>
      <th style="text-align: right"><strong>GitHub Stars</strong></th>
      <th><strong>Description</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>bcc</strong></td>
      <td style="text-align: right">~16,000</td>
      <td>BPF Compiler Collection—fundamental eBPF tools &amp; examples</td>
    </tr>
    <tr>
      <td><strong>bpftrace</strong></td>
      <td style="text-align: right">~6,000</td>
      <td>High-level tracing language built on eBPF; often used alongside bcc</td>
    </tr>
    <tr>
      <td><strong>Cilium</strong></td>
      <td style="text-align: right">~14,000</td>
      <td>Kubernetes-native networking &amp; security layer leveraging eBPF</td>
    </tr>
    <tr>
      <td><strong>Falco</strong></td>
      <td style="text-align: right">~7,000</td>
      <td>Runtime security platform that can use eBPF sensors</td>
    </tr>
    <tr>
      <td><strong>Pixie</strong></td>
      <td style="text-align: right">~4,000</td>
      <td>Observability platform collecting in-kernel data via eBPF</td>
    </tr>
  </tbody>
</table>

<ul>
  <li>bcc remains a staple for eBPF development and has grown steadily.</li>
  <li>bpftrace offers a more user-friendly, high-level syntax for kernel and process tracing via eBPF.</li>
  <li>Cilium, essential for container networking and security, jumped from a few thousand stars in 2018–2019 to ~14,000.</li>
</ul>

<h3 id="22-conferences--community-engagement">2.2 Conferences &amp; Community Engagement</h3>

<ul>
  <li><strong>eBPF Summit Attendance</strong>:
    <ul>
      <li><strong>2020 (Inaugural)</strong>: ~300 attendees (virtual).</li>
      <li><strong>2021</strong>: ~800 attendees.</li>
      <li><strong>2022</strong>: ~1,500 attendees (hybrid).</li>
      <li><strong>2023</strong>: 2,000+ attendees (hybrid).</li>
      <li><strong>2024</strong>: 3,000+ attendees (estimated).</li>
    </ul>
  </li>
  <li><strong>FOSDEM / KubeCon Sessions</strong>:
    <ul>
      <li>FOSDEM 2023 had 5 dedicated eBPF talks (up from 1–2 in earlier years).</li>
      <li>KubeCon 2023–2024 saw 15+ sessions specifically highlighting eBPF (service mesh, network policy, kernel tracing, etc.).</li>
    </ul>
  </li>
  <li><strong>LSF/MM/BPF Conference</strong>:
    <ul>
      <li>The LSF/MM/BPF conference (Linux Storage, Filesystem, Memory Management, and BPF) has had a dedicated BPF track since ~2019–2020, typically hosting <strong>~10–15</strong> BPF-related proposals each year (verifier design, new helpers, advanced kernel integration topics).</li>
    </ul>
  </li>
  <li><strong>Linux Plumbers Conference eBPF Track</strong>:
    <ul>
      <li>The Linux Plumbers Conference introduced a dedicated eBPF microconference around 2020–2021, which has grown to <strong>~100+</strong> participants in some sessions by 2023–2024. Talks focus on eBPF subsystem changes, performance tuning, and new features.</li>
    </ul>
  </li>
</ul>

<p>Major cloud providers (AWS, Azure, Google Cloud) also leverage eBPF internally for advanced networking, observability, and distributed tracing, reflecting an industry-wide shift to harness in-kernel programmability.</p>

<hr />

<h2 id="3-linux-kernel-community-development">3. Linux Kernel Community Development</h2>

<h3 id="31-subsystem-size-and-growth">3.1 Subsystem Size and Growth</h3>

<p>From its initial merge in Linux <strong>v3.18 (late 2014)</strong>, the BPF subsystem has grown dramatically. The figures below count all BPF code: the verifier, JIT backends, helpers, BTF (BPF Type Format), etc.</p>

<table>
  <thead>
    <tr>
      <th><strong>Kernel Version</strong></th>
      <th style="text-align: right"><strong>Release Year</strong></th>
      <th style="text-align: right"><strong>Approx. BPF Subsystem LoC</strong></th>
      <th style="text-align: right"><strong>Verifier LoC (verifier.c + btf.c)</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>v3.18</strong></td>
      <td style="text-align: right">2014</td>
      <td style="text-align: right">~5,000</td>
      <td style="text-align: right">~2,000</td>
    </tr>
    <tr>
      <td><strong>v4.9</strong></td>
      <td style="text-align: right">2016</td>
      <td style="text-align: right">~15,000</td>
      <td style="text-align: right">~4,000</td>
    </tr>
    <tr>
      <td><strong>v4.18</strong></td>
      <td style="text-align: right">2018</td>
      <td style="text-align: right">~25,000</td>
      <td style="text-align: right">~5,000 (pre-BTF)</td>
    </tr>
    <tr>
      <td><strong>v5.2</strong></td>
      <td style="text-align: right">2019</td>
      <td style="text-align: right">~35,000</td>
      <td style="text-align: right">~8,000</td>
    </tr>
    <tr>
      <td><strong>v5.12</strong></td>
      <td style="text-align: right">2021</td>
      <td style="text-align: right">~55,000</td>
      <td style="text-align: right">~12,000</td>
    </tr>
    <tr>
      <td><strong>v6.12</strong></td>
      <td style="text-align: right">2024</td>
      <td style="text-align: right">~70,000+</td>
      <td style="text-align: right">~20,000</td>
    </tr>
    <tr>
      <td><strong>v6.15+</strong></td>
      <td style="text-align: right">2025</td>
      <td style="text-align: right">~80,000+ (est.)</td>
      <td style="text-align: right">~22,000+ (est.)</td>
    </tr>
  </tbody>
</table>

<ul>
  <li>Initially, the verifier code made up ~40% of the entire BPF subsystem. Over time, it dipped closer to 20%, then rose again after BTF-based type checking arrived.</li>
  <li>Overall, the BPF subsystem has ballooned from ~5k LoC in v3.18 to ~70k–80k LoC, making it one of the fastest-evolving parts of the kernel.</li>
</ul>
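<p>The proportions quoted above can be re-derived from the table. The following is a minimal sketch that uses only three of the table's approximate rows; the dictionary and function names are illustrative, and the results are only as precise as the rounded figures they come from.</p>

```python
# Approximate lines of code, copied from three rows of the table above.
SUBSYSTEM_LOC = {"v3.18": 5_000, "v4.18": 25_000, "v6.12": 70_000}
VERIFIER_LOC = {"v3.18": 2_000, "v4.18": 5_000, "v6.12": 20_000}


def verifier_share(version):
    """Return verifier LoC as a fraction of total BPF subsystem LoC."""
    return VERIFIER_LOC[version] / SUBSYSTEM_LOC[version]


def growth_factor(first, last):
    """Return how many times larger the subsystem is in `last` than in `first`."""
    return SUBSYSTEM_LOC[last] / SUBSYSTEM_LOC[first]


if __name__ == "__main__":
    for version in SUBSYSTEM_LOC:
        print(f"{version}: verifier share = {verifier_share(version):.0%}")
    factor = growth_factor("v3.18", "v6.12")
    print(f"subsystem growth, v3.18 to v6.12: {factor:.0f}x")
```

<p>On these rounded numbers the verifier share goes from 40% to 20% and back up to roughly 29%, while the subsystem as a whole grows about 14-fold.</p>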

<h4 id="verifier-complexity">Verifier Complexity</h4>

<ul>
  <li><a href="https://pchaigno.github.io/ebpf/2019/07/02/bpf-verifier-complexity.html">Paul Chaignon’s “Complexity of the BPF Verifier” analysis</a> shows how the verifier has grown from a few thousand lines to over 20k by <strong>v6.12</strong>.</li>
  <li>The ten most complex functions exhibit high cyclomatic complexity (e.g., <code class="language-plaintext highlighter-rouge">do_misc_fixups</code> at ~167).</li>
  <li>Complexity arises from new features (bounded loops, spin locks, pointer/reference tracking, kfunc calls, deeper BTF integration).</li>
</ul>

<h3 id="32-ebpf-program-types-by-kernel-version">3.2 eBPF Program Types by Kernel Version</h3>

<p>“Program types” define the context in which eBPF programs can run (e.g., XDP, cgroup hooks, tracepoints, LSM). Approximate counts of <strong>BPF_PROG_TYPE</strong> definitions in major kernel releases:</p>

<table>
  <thead>
    <tr>
      <th><strong>Kernel Version</strong></th>
      <th><strong>Release Date</strong></th>
      <th style="text-align: right"><strong>Total BPF Program Types</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>v4.4</strong></td>
      <td>Jan 10, 2016</td>
      <td style="text-align: right">5</td>
    </tr>
    <tr>
      <td><strong>v4.9</strong></td>
      <td>Dec 11, 2016</td>
      <td style="text-align: right">8</td>
    </tr>
    <tr>
      <td><strong>v4.14</strong></td>
      <td>Nov 12, 2017</td>
      <td style="text-align: right">14</td>
    </tr>
    <tr>
      <td><strong>v4.19</strong></td>
      <td>Oct 22, 2018</td>
      <td style="text-align: right">21</td>
    </tr>
    <tr>
      <td><strong>v5.4</strong></td>
      <td>Nov 24, 2019</td>
      <td style="text-align: right">25</td>
    </tr>
    <tr>
      <td><strong>v5.10</strong></td>
      <td>Dec 13, 2020</td>
      <td style="text-align: right">30</td>
    </tr>
    <tr>
      <td><strong>v5.15</strong></td>
      <td>Oct 31, 2021</td>
      <td style="text-align: right">31</td>
    </tr>
    <tr>
      <td><strong>v6.1</strong></td>
      <td>Dec 11, 2022</td>
      <td style="text-align: right">31</td>
    </tr>
    <tr>
      <td><strong>v6.6</strong></td>
      <td>Oct 29, 2023</td>
      <td style="text-align: right">32</td>
    </tr>
    <tr>
      <td><strong>v6.12</strong></td>
      <td>Nov 17, 2024</td>
      <td style="text-align: right">32</td>
    </tr>
  </tbody>
</table>

<p>From a single program type (socket filter) in <strong>v3.18</strong>, eBPF has grown to 30+ distinct program types covering security, tracing, networking, device management, and more.</p>
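<p>Counts like these can be reproduced by scanning the kernel’s UAPI header for <code class="language-plaintext highlighter-rouge">BPF_PROG_TYPE_*</code> identifiers. The sketch below does this with a regular expression. The embedded header text is a short illustrative excerpt in the style of <code class="language-plaintext highlighter-rouge">enum bpf_prog_type</code>, not a full copy of any release, and the function name is hypothetical. Whether <code class="language-plaintext highlighter-rouge">BPF_PROG_TYPE_UNSPEC</code> is counted is a convention, so it is left as a parameter.</p>

```python
import re

# Abbreviated, illustrative excerpt in the style of `enum bpf_prog_type`
# from include/uapi/linux/bpf.h; the real enum is much longer.
SAMPLE_HEADER = """
enum bpf_prog_type {
    BPF_PROG_TYPE_UNSPEC,
    BPF_PROG_TYPE_SOCKET_FILTER,
    BPF_PROG_TYPE_KPROBE,
    BPF_PROG_TYPE_SCHED_CLS,
    BPF_PROG_TYPE_SCHED_ACT,
};
"""


def count_prog_types(header_text, include_unspec=True):
    """Count distinct BPF_PROG_TYPE_* identifiers found in C header text."""
    names = set(re.findall(r"\bBPF_PROG_TYPE_[A-Z0-9_]+\b", header_text))
    if not include_unspec:
        # UNSPEC is a placeholder value, not a loadable program type.
        names.discard("BPF_PROG_TYPE_UNSPEC")
    return len(names)


if __name__ == "__main__":
    print("with UNSPEC:   ", count_prog_types(SAMPLE_HEADER))
    print("without UNSPEC:", count_prog_types(SAMPLE_HEADER, include_unspec=False))
```

<p>To compare releases, the same function could be pointed at the header text from each tagged kernel tree; the totals will differ by one depending on the chosen convention.</p>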

<h3 id="33-growth-of-kfuncs">3.3 Growth of kfuncs</h3>

<p><strong>kfuncs</strong>—kernel functions callable directly by eBPF—significantly expand in-kernel scripting possibilities. Introduced around <strong>v5.15</strong>, these have proliferated each release:</p>

<table>
  <thead>
    <tr>
      <th><strong>Kernel Version</strong></th>
      <th style="text-align: right"><strong>Release Year</strong></th>
      <th style="text-align: right"><strong>Est. # of kfuncs</strong></th>
      <th><strong>Notable Additions</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>v5.15</strong></td>
      <td style="text-align: right">2021</td>
      <td style="text-align: right">~5</td>
      <td>Early ring buffer APIs, initial memory ops</td>
    </tr>
    <tr>
      <td><strong>v5.17</strong></td>
      <td style="text-align: right">2022</td>
      <td style="text-align: right">~15</td>
      <td>Extended networking ops, advanced memory helpers</td>
    </tr>
    <tr>
      <td><strong>v5.19</strong></td>
      <td style="text-align: right">2022</td>
      <td style="text-align: right">~25</td>
      <td>More TCP/UDP mgmt, tracing ops</td>
    </tr>
    <tr>
      <td><strong>v6.0</strong></td>
      <td style="text-align: right">2022</td>
      <td style="text-align: right">~35</td>
      <td>Data-structure manipulation APIs</td>
    </tr>
    <tr>
      <td><strong>v6.2</strong></td>
      <td style="text-align: right">2023</td>
      <td style="text-align: right">~50</td>
      <td>Cgroup mgmt expansions, partial LSM integration</td>
    </tr>
    <tr>
      <td><strong>v6.4</strong></td>
      <td style="text-align: right">2023</td>
      <td style="text-align: right">~60+</td>
      <td>Ongoing security &amp; net protocol expansions</td>
    </tr>
    <tr>
      <td><strong>v6.5+</strong></td>
      <td style="text-align: right">2023+</td>
      <td style="text-align: right">Growing with each release</td>
      <td>Commonly updated each release cycle</td>
    </tr>
  </tbody>
</table>

<p>Each addition reduces the need for specialized helpers. Developers can manipulate kernel data structures more flexibly, bridging user-space logic and kernel internals.</p>

<h3 id="34-mailing-list-activity">3.4 Mailing List Activity</h3>

<p>Historically, discussions about eBPF have taken place on <strong>bpf</strong>, <strong>netdev</strong>, and <strong>xdp-newbies</strong> mailing lists. Over time, <strong>bpf@vger.kernel.org</strong> has become the principal list for eBPF subsystem patches, design proposals, and feature discussions. Below are <strong>approximate average monthly email volumes</strong> on the bpf mailing list, based on snapshots of list archives:</p>

<table>
  <thead>
    <tr>
      <th><strong>Year</strong></th>
      <th style="text-align: right"><strong>Approx. Avg. Emails/Month on bpf@</strong></th>
      <th><strong>Context</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>2017</td>
      <td style="text-align: right">~100–150</td>
      <td>Predominantly narrower eBPF topics (basic XDP, some verifier patches, bpf-next submission cycles).</td>
    </tr>
    <tr>
      <td>2018</td>
      <td style="text-align: right">~200–250</td>
      <td>Increases as eBPF expands in usage (cgroup hooks, tracing, bpf-next merges with more feature sets).</td>
    </tr>
    <tr>
      <td>2019</td>
      <td style="text-align: right">~300–400</td>
      <td>Rapid growth from mainstream uptake (Cilium, Falco, more complex verifier enhancements).</td>
    </tr>
    <tr>
      <td>2020</td>
      <td style="text-align: right">~400–500</td>
      <td>eBPF emerges as a standard approach for container networking, kernel security, major patch sets.</td>
    </tr>
    <tr>
      <td>2021</td>
      <td style="text-align: right">~600–700</td>
      <td>Surges with new program types (struct_ops, cgroup_sockopt), kfunc proposals, and BTF expansions.</td>
    </tr>
    <tr>
      <td>2022</td>
      <td style="text-align: right">~700–800</td>
      <td>More advanced features (LSM eBPF, bpf_iter expansions), frequent performance/bug fixes.</td>
    </tr>
    <tr>
      <td>2023</td>
      <td style="text-align: right">~800–900</td>
      <td>Ongoing expansions to netfilter hooks, more complex kfunc sets, many multi-part patch threads.</td>
    </tr>
    <tr>
      <td>2024</td>
      <td style="text-align: right">~900–1,000</td>
      <td>Growing cross-collaboration with netdev; advanced features drive higher patch volume and reviews.</td>
    </tr>
  </tbody>
</table>

<ul>
  <li>Across these years, the <strong>bpf</strong> mailing list has seen a steady rise in traffic, mirroring the growth in eBPF features and user base.</li>
  <li>Patches frequently introduce new kfuncs, program attach types, performance optimizations, or verifier improvements.</li>
  <li>The active contributor community includes developers from Meta, Google, Red Hat, Cloudflare, and numerous independent contributors.</li>
</ul>
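<p>To put a number on “steady rise”, the table’s ranges can be reduced to midpoints and turned into growth rates. This is a rough sketch: taking midpoints is an assumption made here for illustration, and the function names are hypothetical.</p>

```python
# Midpoints of the approximate ranges in the table above (emails per month).
MONTHLY_EMAILS = {
    2017: 125, 2018: 225, 2019: 350, 2020: 450,
    2021: 650, 2022: 750, 2023: 850, 2024: 950,
}


def compound_annual_growth(series):
    """Return the compound annual growth rate between the first and last year."""
    years = sorted(series)
    first, last = years[0], years[-1]
    return (series[last] / series[first]) ** (1 / (last - first)) - 1


def year_over_year(series):
    """Return a dict mapping each year to its growth relative to the prior year."""
    years = sorted(series)
    return {cur: series[cur] / series[prev] - 1 for prev, cur in zip(years, years[1:])}


if __name__ == "__main__":
    print(f"compound annual growth: {compound_annual_growth(MONTHLY_EMAILS):.1%}")
    for year, rate in year_over_year(MONTHLY_EMAILS).items():
        print(f"{year}: {rate:+.0%}")
```

<p>On these midpoints the year-over-year rate falls from 80% in 2018 to about 12% in 2024, so the list keeps growing in absolute terms while the relative growth slows.</p>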

<hr />

<h2 id="4-closing-observations">4. Closing Observations</h2>

<ul>
  <li><strong>Breadth of Use Cases</strong>: eBPF has evolved from a packet filter into a robust in-kernel runtime enabling advanced security, observability, and performance features.</li>
  <li><strong>Verifier Complexity</strong>: The surge in lines of code and cyclomatic complexity underscores the challenge of safely supporting new features; ongoing reviews and research aim to keep eBPF secure.</li>
  <li><strong>Sustained Momentum</strong>: With thousands of participants at eBPF conferences, hundreds of mailing list patches monthly, and an ever-growing body of research, eBPF remains a pivotal Linux technology.</li>
</ul>

<hr />

<h2 id="5-future-directions">5. Future Directions</h2>

<h3 id="51-aiml-integration">5.1 AI/ML Integration</h3>

<p>The fine-grained telemetry and real-time observability that eBPF provides make it an attractive data source for machine learning workflows. Already, some projects experiment with feeding eBPF-collected metrics (e.g., system call frequencies, network throughput, latency histograms) into ML algorithms for anomaly detection or automated performance tuning. Looking ahead:</p>

<ul>
  <li><strong>Adaptive Security</strong>: ML models could continuously learn “normal” system or application behavior from eBPF traces and flag deviations in real time, greatly enhancing intrusion detection systems (IDS) and runtime threat prevention.</li>
  <li><strong>Intelligent Autoscaling</strong>: Container platforms might combine eBPF-based resource profiling with reinforcement learning to optimize autoscaling policies in dynamic, multi-tenant environments.</li>
  <li><strong>Predictive Observability</strong>: By correlating eBPF data with performance counters, logs, and business-level metrics, organizations could anticipate workload bottlenecks or memory pressure before they degrade user experience.</li>
  <li><strong>In-kernel ML</strong>: Specialized ML logic could be implemented directly in eBPF code, allowing real-time, in-kernel training and inference with minimal overhead and without a separate server.</li>
</ul>

<p>As ML pipelines themselves become more sophisticated, real-time, in-kernel telemetry via eBPF is poised to supply rich training data streams to online learning algorithms, enabling faster, more targeted responses to anomalies.</p>
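<p>As a concrete illustration of the “learn normal, flag deviations” idea, here is a minimal sketch of a per-metric z-score detector over syscall counts. It assumes the counts have already been exported from an eBPF program into plain dictionaries, one per sampling window; the function names, the sample data, and the threshold of three standard deviations are all illustrative choices, and a real system would likely use a more robust model.</p>

```python
from statistics import mean, stdev


def fit_baseline(samples):
    """Learn a (mean, stdev) baseline per metric from 'normal' samples.

    `samples` is a list of dicts mapping a metric name (here a syscall
    name) to its count in one window. At least two samples are required.
    """
    metrics = sorted({name for sample in samples for name in sample})
    baseline = {}
    for name in metrics:
        values = [sample.get(name, 0) for sample in samples]
        baseline[name] = (mean(values), stdev(values))
    return baseline


def find_anomalies(baseline, sample, threshold=3.0):
    """Return {metric: score} for metrics deviating beyond `threshold` stdevs.

    Metrics absent from the baseline are ignored in this sketch.
    """
    flagged = {}
    for name, (mu, sigma) in baseline.items():
        value = sample.get(name, 0)
        if sigma == 0:
            # A constant baseline: any change at all counts as a deviation.
            score = float("inf") if value != mu else 0.0
        else:
            score = abs(value - mu) / sigma
        if score > threshold:
            flagged[name] = score
    return flagged


if __name__ == "__main__":
    normal = [
        {"read": 100, "write": 50},
        {"read": 110, "write": 55},
        {"read": 90, "write": 45},
        {"read": 105, "write": 52},
    ]
    baseline = fit_baseline(normal)
    print(find_anomalies(baseline, {"read": 102, "write": 400}))
```

<p>In this example only the <code class="language-plaintext highlighter-rouge">write</code> count is flagged, since <code class="language-plaintext highlighter-rouge">read</code> stays well within its usual range.</p>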

<h3 id="52-ebpf-expansion-into-more-subsystems">5.2 eBPF Expansion into More Subsystems</h3>

<p>From networking and tracing to security hooks, eBPF’s modular design continues to drive experimentation in new parts of the Linux kernel. In the coming years, we can expect:</p>

<ul>
  <li>
    <p><strong>Filesystem and Storage</strong>: eBPF programs that instrument I/O paths in real time, providing granular insights into read/write patterns, caching inefficiencies, and potential data corruption attempts.</p>

    <ul>
      <li>eBPF could also help optimize storage performance by letting userspace applications interact directly with the kernel’s block device layer, implement highly customized I/O schedulers, or even offload a portion of application logic into the kernel.</li>
      <li>Integration with <code class="language-plaintext highlighter-rouge">io_uring</code> could be a killer application for eBPF: it would open the door to an application-defined I/O data path inside the kernel, giving userspace applications a highly optimized I/O path.</li>
    </ul>
  </li>
  <li>
    <p><strong>Memory Management</strong>: Fine-grained eBPF attach points could police page allocation, page cache behavior, huge page allocation, and overall memory usage per container or process, paving the way for adaptive memory tuning or advanced customizations.</p>
  </li>
  <li>
    <p><strong>Device Drivers and I/O</strong>: Similar to XDP for networking, direct attach points in driver stacks could improve performance by handling certain I/O logic in eBPF, bypassing parts of the kernel’s general-purpose code.</p>
  </li>
</ul>

<p>Meanwhile, specialized eBPF “kfuncs” (kernel functions callable from eBPF) continue to grow, letting developers interact even more directly with core kernel data structures. This approach can shorten the development cycle for new performance optimizations and advanced debugging use cases.</p>

<h3 id="53-standardization-and-interoperability">5.3 Standardization and Interoperability</h3>

<p>As eBPF’s footprint expands, community members have begun exploring ways to ensure consistent APIs and behavior across kernel versions and even across operating systems:</p>

<ul>
  <li><strong>API Stability</strong>: While Linux leads eBPF’s development, projects like “eBPF for Windows” highlight the value of a more standardized interface. A stable kernel-side API could ease development of cross-platform eBPF tooling.</li>
  <li><strong>Tooling and Library Ecosystem</strong>: Efforts to unify bcc, libbpf, and other libraries under standard schemas for packaging and deployment can further accelerate adoption. Mature frameworks and official conformance tests are likely to emerge to verify that eBPF programs run consistently on different kernel versions.</li>
  <li><strong>Hardware Offloading</strong>: Some network interface cards (NICs) and SmartNICs already offload eBPF or eBPF-like programs directly into firmware. A standardized approach to offloading and verifying these programs could unlock significant performance gains in 5G/telecom and hyperscale data-center scenarios.</li>
</ul>

<p>As more companies and open-source projects adopt eBPF, shared standards and reference implementations become essential to ensuring that new features don’t fragment the ecosystem.</p>

<h3 id="54-hardware-acceleration-and-hpc-use-cases">5.4 Hardware Acceleration and HPC Use Cases</h3>

<p>Although eBPF historically targets CPU-based execution in the Linux kernel, several research initiatives investigate offloading or accelerating eBPF logic on specialized hardware:</p>

<ul>
  <li><strong>SmartNIC and FPGA Offloading</strong>: Some deployments already leverage SmartNICs that parse and act on packet data in hardware for faster throughput. Extending these capabilities to more general eBPF programs could revolutionize in-kernel data-plane processing.</li>
  <li><strong>High-Performance Computing (HPC)</strong>: HPC clusters have demanding performance requirements, and eBPF can aid in capturing fine-grained metrics (network usage, job concurrency, interconnect performance) with minimal overhead. Future HPC frameworks may also tap eBPF for run-time tuning or fault diagnostics at scale.</li>
</ul>

<p>As HPC and large-scale AI training clusters push the boundaries of performance, hardware-accelerated eBPF may become a key differentiator, providing faster instrumentation and the ability to incorporate advanced security features without sacrificing throughput.</p>

<p>The eBPF ecosystem continues to evolve at a rapid pace, driven by both established tech companies and innovative startups. Its impact on modern infrastructure software shows no signs of slowing down, making it an essential technology to watch in the coming years.</p>

<hr />
<h2 id="6-references--sources">6. References &amp; Sources</h2>

<ul>
  <li><strong>Google Scholar</strong>: Approximate search counts for “eBPF” (2016–2024).</li>
  <li><strong>GitHub Repositories</strong>: Yearly eBPF-labeled repo counts (2016–2024) plus star counts for bcc, bpftrace, Cilium, Falco, Pixie (as of early 2025).</li>
  <li><strong>Mailing Lists</strong>: bpf, netdev, xdp-newbies (patches, monthly stats, design discussions).</li>
  <li><strong>Paul Chaignon, “Complexity of the BPF Verifier”</strong> <a href="https://pchaigno.github.io/ebpf/2019/07/02/bpf-verifier-complexity.html">https://pchaigno.github.io/ebpf/2019/07/02/bpf-verifier-complexity.html</a></li>
  <li><strong>Linux Kernel Source</strong>: Commit logs and release notes from v3.18 through v6.5+</li>
  <li><strong>eBPF Summit, LSF/MM/BPF Conference, Linux Plumbers Conference eBPF Track, FOSDEM, KubeCon</strong>: Attendance stats and session counts (2020–2024).</li>
</ul>]]></content><author><name></name></author><summary type="html"><![CDATA[The evolution of eBPF (extended Berkeley Packet Filter) has been one of the most significant developments in Linux system observability and networking over the past decade. As we look at the landscape in early 2025, several key trends have emerged that showcase eBPF’s growing influence and future directions.]]></summary></entry><entry><title type="html">Persistent Memory in Linux Kexec</title><link href="https://wangcong.org/2025-02-09-persistent-memory-in-linux-kexec.html" rel="alternate" type="text/html" title="Persistent Memory in Linux Kexec" /><published>2025-02-09T03:47:00+00:00</published><updated>2025-02-09T03:47:00+00:00</updated><id>https://wangcong.org/persistent-memory-in-linux-kexec</id><content type="html" xml:base="https://wangcong.org/2025-02-09-persistent-memory-in-linux-kexec.html"><![CDATA[<p>If you’ve been following the Linux kernel development community over the past few years, you might have noticed an interesting emerging trend in the field of memory persistence when using kexec to boot into a new kernel.</p>

<p>Between 2020 and 2025, five different teams of developers have tackled the same thorny problem: how to preserve memory contents when using kexec to boot into a new kernel. It’s a fascinating case study in how the open source community approaches complex technical challenges, with each proposal bringing fresh insights to the table.</p>

<h2 id="why-is-this-problem-so-important">Why Is This Problem So Important?</h2>

<p>To understand why this problem has attracted so much attention, we first need to talk about kexec. kexec has two main use cases: fast reboot and kdump. Here we are talking about kexec fast reboot, a feature that lets you switch to a new kernel without resetting the machine. Think of kexec as a shortcut in the boot process - instead of going through the whole BIOS/firmware dance when you need to switch to a new kernel, kexec lets you jump directly there. This capability is incredibly useful for reducing downtime during kernel updates or implementing fast reboot mechanisms.</p>

<p>But there’s a catch. By default, kexec treats the boot into the new kernel as a fresh start, wiping the slate clean of all previous memory state. While this clean-slate approach might seem sensible at first glance, it causes real problems in several important scenarios.</p>

<ol>
  <li>
    <p>Virtual machines. A running VM’s memory contains active processes, cached data, and vital state information. If this memory isn’t preserved during a kernel transition, every VM would effectively crash during the host kernel update. This isn’t just inconvenient - it could mean breaking service level agreements and disrupting critical services.</p>
  </li>
  <li>
    <p>Virtualization using direct device assignment (also known as PCI passthrough). These setups allow virtual machines to directly control hardware devices for better performance. The IOMMU maintains critical memory mappings that enable devices to safely access memory. If these mappings are lost during kexec, any ongoing device operations could fail, potentially corrupting data or crashing the system.</p>
  </li>
  <li>
    <p>Database systems and other applications that maintain large in-memory caches. Having to reload these caches after every kernel update introduces significant performance penalties and service disruptions.</p>
  </li>
</ol>

<p>What makes this problem particularly challenging is the fundamental tension between two competing needs: the need to preserve specific memory regions exactly as they are, and the need to give the new kernel enough flexibility to boot successfully. Add to this the complexity of modern hardware, the requirements of virtualization, and the need for driver support, and you have a problem that requires careful balance between competing constraints.</p>

<p>Let’s dive into each of the proposals that have tried to solve this challenge, and see how their approaches have evolved over time.</p>

<h2 id="the-evolution-of-solutions">The Evolution of Solutions</h2>

<h3 id="pkram-2020-the-filesystem-pioneer"><a href="https://lwn.net/Articles/819778/" title="PKRAM (2020): The Filesystem Pioneer">PKRAM (2020): The Filesystem Pioneer</a></h3>

<p>Anthony Yznaga’s PKRAM proposal was the first major attempt to tackle this problem, and it took an intriguingly practical approach. Rather than creating entirely new mechanisms, PKRAM leveraged something that Linux developers and users were already familiar with: tmpfs.</p>

<p>The core idea was elegant in its simplicity. PKRAM implemented a tmpfs-style filesystem that could mark certain memory regions for preservation across kexec. When you mounted a PKRAM filesystem, you could create and manipulate files just like you would with any other filesystem, but behind the scenes, PKRAM was carefully managing the memory to ensure it would survive the kexec process.</p>

<p>The genius of this approach lay in its user-friendly interface. System administrators didn’t need to learn new concepts or APIs - they could just mount the filesystem and use standard file operations. Under the hood, PKRAM handled all the complexity of memory preservation, passing a root page pointer through the kernel command line to help the new kernel locate and restore the preserved memory.</p>

<p>Where PKRAM really shined was in its dynamic allocation capabilities. Rather than requiring administrators to decide upfront how much memory they might need to preserve, PKRAM could grow and shrink as needed. This flexibility made it particularly well-suited for general-purpose use cases where memory requirements might not be known in advance.</p>

<p>However, PKRAM wasn’t without its limitations. Its filesystem-centric approach, while user-friendly, didn’t always align well with kernel-level needs. It also lacked NUMA awareness, which could impact performance on large systems, and its general-purpose nature meant it wasn’t necessarily optimized for specific use cases like VM memory preservation.</p>

<h3 id="memory-pools-2023-the-kernel-centric-approach"><a href="https://lwn.net/Articles/945581/" title="Memory Pools (2023): The Kernel-Centric Approach">Memory Pools (2023): The Kernel-Centric Approach</a></h3>

<p>When Stanislav Kinsburskii introduced the Memory Pools proposal in 2023, it represented a significant shift in thinking. Instead of approaching the problem from a user space perspective, Memory Pools focused squarely on kernel-level requirements.</p>

<p>Built on top of the Contiguous Memory Allocator (CMA), Memory Pools took a more structured approach to memory preservation. The proposal introduced the concept of persistent memory pools - dedicated regions of memory that could be preserved across kexec operations. These pools were particularly well-suited for maintaining kernel-specific states like DMA mappings and IOMMU configurations.</p>

<p>What set Memory Pools apart was its deep integration with kernel subsystems. By working directly with the CMA, it could ensure efficient memory management and reduce fragmentation. The proposal also introduced a clever way of passing metadata between kernels using the Flattened Device Tree, which provided a robust mechanism for maintaining continuity across the kexec boundary.</p>

<p>One of the most interesting aspects of Memory Pools was its focus on predictability. Unlike PKRAM’s dynamic approach, Memory Pools required memory regions to be defined at boot time. While this might seem like a limitation, it actually provided important guarantees about memory availability and location that were crucial for certain kernel subsystems.</p>

<h3 id="prmem-2023-the-flexible-synthesizer"><a href="https://lwn.net/Articles/948014/" title="PRMEM (2023): The Flexible Synthesizer">PRMEM (2023): The Flexible Synthesizer</a></h3>

<p>Later in 2023, Madhavan T. Venkataraman introduced PRMEM, which attempted to bridge the gap between kernel and user space needs. PRMEM took a more comprehensive approach, implementing what you might call a “kitchen sink” solution to the persistence problem.</p>

<p>The heart of PRMEM was its sophisticated memory management system. It could handle both fixed allocations, like Memory Pools, and dynamic growth, like PKRAM. One of its most innovative features was the concept of named persistent instances, which provided a clean way to organize and manage different types of persistent state.</p>

<p>PRMEM also introduced some clever technical innovations. It integrated with the generic memory allocator for efficient memory management, and its support for persistent XArrays made it particularly well-suited for handling complex data structures. The proposal even included provisions for expanding the persistent memory region on demand, up to a configurable maximum size.</p>

<p>What made PRMEM particularly interesting was its attention to real-world operational needs. It included features for metadata validation, support for NUMA topologies, and even considerations for memory error handling. The proposal demonstrated a deep understanding of the challenges faced in production environments.</p>

<h3 id="pkernfs-2024-the-clean-slate"><a href="https://lore.kernel.org/lkml/20240205120203.60312-1-jgowans@amazon.com/" title="Pkernfs (2024): The Clean Slate">Pkernfs (2024): The Clean Slate</a></h3>

<p>James Gowans’ Pkernfs proposal, introduced in 2024, took a bold step by suggesting a completely new filesystem specifically designed for kernel persistent state. Rather than trying to adapt existing mechanisms, Pkernfs proposed a clean-slate solution with a laser focus on the requirements of modern virtualization environments.</p>

<p>The core innovation of Pkernfs was its complete separation of persistent memory from the regular kernel memory management system. This separation provided important security benefits and made it easier to reason about the state of persistent memory. The proposal included specific support for removing guest memory from the direct map, which was particularly valuable for secure virtualization scenarios.</p>

<p>What set Pkernfs apart was its holistic approach to persistent state. Rather than just handling memory, it provided a unified interface for managing all types of persistent state, including IOMMU mappings and device configurations. This comprehensive approach made it particularly well-suited for complex virtualization environments where multiple types of state needed to be preserved.</p>

<h3 id="kho-2025-the-foundation-builder"><a href="https://lore.kernel.org/linux-mm/20250206132754.2596694-1-rppt@kernel.org/" title="KHO (2025): The Foundation Builder">KHO (2025): The Foundation Builder</a></h3>

<p>The most recent proposal, KHO (Kexec HandOver) by Mike Rapoport, takes a fundamentally different approach. Instead of trying to solve the entire persistence problem, KHO focuses on providing a solid foundation that other solutions can build upon.</p>

<p>At its core, KHO is all about metadata management. It uses the Flattened Device Tree to exchange information between kernels, but does so in a way that’s both flexible and extensible. The proposal includes sophisticated handling of scratch regions for the new kernel’s bootstrap process, ensuring that preserved memory won’t be accidentally overwritten during the boot process.</p>
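<p>The role of scratch regions can be illustrated with a toy model: if early boot allocations may only come from ranges known not to overlap preserved memory, nothing preserved can be overwritten before the new kernel has read the handover metadata. The sketch below is a simplified illustration of that invariant, not KHO’s actual data structures or API; the class, the function names, and the addresses are all made up for the example.</p>

```python
def overlaps(a, b):
    """Return True if half-open ranges a=(start, end) and b=(start, end) intersect."""
    return min(a[1], b[1]) > max(a[0], b[0])


class ScratchAllocator:
    """Toy bump allocator that hands out memory only from scratch ranges."""

    def __init__(self, scratch, preserved):
        # Reject any configuration where scratch space covers preserved memory.
        for s in scratch:
            for p in preserved:
                if overlaps(s, p):
                    raise ValueError(f"scratch {s} overlaps preserved {p}")
        self.free = [list(r) for r in sorted(scratch)]
        self.preserved = list(preserved)

    def alloc(self, size):
        """Carve `size` bytes from the first scratch range with enough room."""
        for region in self.free:
            start, end = region
            if end - start >= size:
                region[0] = start + size
                return (start, start + size)
        raise MemoryError("scratch regions exhausted")


if __name__ == "__main__":
    allocator = ScratchAllocator(
        scratch=[(0x1000, 0x3000)],
        preserved=[(0x8000, 0x9000)],
    )
    block = allocator.alloc(0x1000)
    print([hex(x) for x in block])
    print(any(overlaps(block, p) for p in allocator.preserved))
```

<p>The design point is that safety is established once, when the scratch ranges are chosen, rather than checked on every early allocation.</p>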

<p>What makes KHO particularly interesting is its architecture-aware design. Rather than trying to provide a one-size-fits-all solution, it acknowledges that different architectures might need to handle persistence differently. This flexibility, combined with its clean integration points for kernel subsystems, makes it a promising foundation for future persistence solutions.</p>

<h2 id="looking-to-the-future">Looking to the Future</h2>

<p>As we look at these five proposals, we can see a clear evolution in thinking about the problem of memory persistence across kexec. Each proposal has brought valuable insights to the table, and the ultimate solution might well incorporate ideas from multiple approaches.</p>

<p>What’s particularly fascinating is how each proposal reflects different priorities and constraints. PKRAM prioritized ease of use, Memory Pools focused on kernel integration, PRMEM sought comprehensiveness, Pkernfs emphasized clean separation, and KHO focused on providing a solid foundation.</p>

<p>In the end, the key takeaway is that while each proposal has its own unique approach, they all share a common goal: to provide a robust solution for preserving memory across kexec operations. The future solution will likely need to provide clean separation of persistent and non-persistent memory, ideally support both fixed and dynamic allocation strategies, handle metadata robustly, and integrate well with existing kernel drivers. Most importantly, it will need to support both kernel and user space use cases while maintaining the security and reliability that modern systems demand.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[If you’ve been following the Linux kernel development community over the past few years, you might have noticed an interesting emerging trend in the field of memory persistence when using kexec to boot into a new kernel.]]></summary></entry><entry><title type="html">Two-Phase eBPF Program Signing</title><link href="https://wangcong.org/2025-02-02-two-phase-ebpf-program-signing.html" rel="alternate" type="text/html" title="Two-Phase eBPF Program Signing" /><published>2025-02-02T02:31:00+00:00</published><updated>2025-02-02T02:31:00+00:00</updated><id>https://wangcong.org/two-phase-ebpf-program-signing</id><content type="html" xml:base="https://wangcong.org/2025-02-02-two-phase-ebpf-program-signing.html"><![CDATA[<p>The Extended Berkeley Packet Filter (eBPF) has revolutionized how we extend and observe the Linux kernel. However, with great power comes great responsibility, and securing eBPF programs has been a persistent challenge in the Linux kernel community. Today, I want to share an innovative approach to eBPF program signing that addresses some fundamental challenges in this space.</p>

<h2 id="the-evolution-of-ebpf-security">The Evolution of eBPF Security</h2>

<p>Before diving into the two-phase signing solution, let’s look at how eBPF security has evolved. The Linux kernel has always been cautious about eBPF security, implementing various safeguards:</p>

<ol>
  <li>The eBPF verifier, which performs static analysis to ensure programs can’t harm the kernel</li>
  <li>Privilege restrictions requiring the <code class="language-plaintext highlighter-rouge">CAP_BPF</code> capability for loading programs
    <ul>
      <li>BPF tokens were introduced to provide more fine-grained control of capabilities</li>
    </ul>
  </li>
  <li>Various LSM (Linux Security Module) hooks for additional security controls</li>
</ol>

<p>Previous attempts at eBPF program signing in the kernel community focused on traditional code signing approaches. However, these attempts faced a fundamental challenge: eBPF programs undergo necessary modifications during the loading process, invalidating traditional signatures.</p>

<h2 id="the-challenge-why-traditional-signing-doesnt-work">The Challenge: Why Traditional Signing Doesn’t Work</h2>

<p>The core issue lies in how eBPF programs are prepared for execution. When we compile an eBPF program, the resulting binary isn’t in its final form. The loader, typically the libbpf library, needs to modify this binary before it can run in the kernel. These modifications include:</p>

<ul>
  <li>Updating map file descriptors: When an eBPF program is compiled, maps are referenced using placeholder file descriptors. During loading, libbpf creates the actual maps and updates these references with real file descriptors to ensure correct map access at runtime.</li>
  <li>Patching relocations: The program contains references to functions, maps, and other resources that need to be resolved to their actual memory locations. libbpf updates these references with the correct addresses where the resources will be located in memory, similar to dynamic linking in regular programs.</li>
  <li>Making other runtime adjustments: This includes program size and offset calculations, updating program type-specific parameters, adjusting instructions for kernel version compatibility, and setting up program attachments (e.g., to network interfaces or kernel hooks).</li>
</ul>
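<p>The effect of such modifications on a signature can be sketched in a few lines of Python. This is not libbpf code: the simplified 8-byte instruction layout, the helper names <code class="language-plaintext highlighter-rouge">make_insn()</code> and <code class="language-plaintext highlighter-rouge">patch_imm()</code>, and the file descriptor value 7 are assumptions made up for illustration. The sketch only shows that rewriting one placeholder field changes the digest a signature would be computed over.</p>

```python
import hashlib

# Hypothetical, simplified instruction layout modeled on the eBPF encoding:
# opcode (1 byte), registers (1 byte), offset (2 bytes), immediate (4 bytes).
def make_insn(opcode, regs, off, imm):
    return (bytes([opcode, regs])
            + off.to_bytes(2, "little", signed=True)
            + imm.to_bytes(4, "little", signed=True))

def patch_imm(program, insn_index, new_imm):
    """Return a copy of the program with one instruction's immediate replaced."""
    patched = bytearray(program)
    start = insn_index * 8 + 4
    patched[start:start + 4] = new_imm.to_bytes(4, "little", signed=True)
    return bytes(patched)

# As compiled: the map reference carries a placeholder immediate of 0.
compiled = make_insn(0x18, 0x11, 0, 0) + make_insn(0x95, 0, 0, 0)
# As loaded: the loader writes a real map file descriptor (7 is made up).
loaded = patch_imm(compiled, 0, 7)

digest_compiled = hashlib.sha256(compiled).hexdigest()
digest_loaded = hashlib.sha256(loaded).hexdigest()
print(digest_compiled == digest_loaded)  # False
```

<p>Any signature scheme that signs a digest of the compiled bytes would therefore reject the loaded bytes, even though the change is legitimate.</p>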

<p>This creates a catch-22 situation:</p>
<ul>
  <li>If you sign the original binary, the signature becomes invalid after libbpf’s necessary modifications</li>
  <li>If you sign after modifications, you lose the ability to verify the program’s original authenticity</li>
</ul>

<h2 id="in-kernel-ebpf-program-loader">In-kernel eBPF Program Loader</h2>

<p>There were also attempts to move the eBPF program loader into the kernel. These proposals aimed to perform all program preparations (relocations, map creation, etc.) entirely in kernel space to maintain a single trust boundary. However, they faced several critical issues:</p>

<ol>
  <li>
    <p><strong>Complex Privileged Code</strong>: Moving the loader into the kernel would add a significant amount of complex code to the privileged kernel space. This increases the attack surface and the potential for exploitable vulnerabilities.</p>
  </li>
  <li><strong>Compatibility Issues</strong>: While BTF provides kernel version independence for CO-RE-enabled programs, an in-kernel loader would still face compatibility challenges:
    <ul>
      <li>Legacy programs without BTF support require kernel-specific adjustments</li>
      <li>Different kernel versions may require different approaches to map creation and program attachment</li>
      <li>Programs using newer features need fallback paths for older kernels</li>
      <li>Kernel ABI stability would become more critical as loader logic moves into the kernel</li>
    </ul>
  </li>
  <li><strong>Flexibility Issues</strong>: User-space loading provides significantly more flexibility than a kernel-space approach would allow:
    <ul>
      <li>New program types and features can be supported without kernel updates</li>
      <li>Programs can be preprocessed or transformed based on runtime conditions</li>
      <li>Debug information and symbols can be handled more freely</li>
    </ul>
  </li>
  <li><strong>Verification Complexity</strong>: The eBPF verifier would need to be substantially modified to verify the loader’s operations, making an already complex component even more complicated and potentially introducing new verification bypasses.</li>
</ol>

<p>The two-phase signing approach leverages existing eBPF infrastructure without requiring kernel modifications, offering several key advantages:</p>
<ul>
  <li>Uses proven LSM hooks for program verification</li>
  <li>Maintains compatibility with existing eBPF tooling and workflows</li>
  <li>Reduces security risks by avoiding kernel-space complexity</li>
  <li>Allows for rapid deployment and adoption in production environments</li>
</ul>

<h2 id="introducing-two-phase-ebpf-program-signing">Introducing Two-Phase eBPF Program Signing</h2>

<p>To solve this dilemma, I’ve developed a two-phase signing approach that mirrors the eBPF program preparation and loading process. Think of it like a legal document that requires both initial notarization and subsequent verification of modifications.</p>

<h3 id="phase-1-the-baseline-signature">Phase 1: The Baseline Signature</h3>

<p>The first phase occurs when the eBPF program is initially compiled:</p>
<ul>
  <li>A PKCS#7 signature is generated for the original, unmodified program</li>
  <li>This signature serves as proof that the original program came from a trusted source</li>
  <li>It’s analogous to getting a document notarized before filling in the details</li>
</ul>

<h3 id="phase-2-the-modified-program-signature">Phase 2: The Modified Program Signature</h3>

<p>The second phase happens after libbpf has made its necessary modifications:</p>
<ul>
  <li>A new signature is created covering both the modified program and its original signature</li>
  <li>This establishes a chain of trust</li>
  <li>It proves that the modifications were authorized and applied to legitimate code</li>
</ul>

<h3 id="the-verification-process">The Verification Process</h3>

<p>The kernel verifies these signatures in sequence during program loading:</p>

<ol>
  <li>First, it verifies the original program against its baseline signature</li>
  <li>Then, it verifies the secondary signature covering both the modified program and the original signature</li>
</ol>

<p>This two-step verification ensures:</p>
<ul>
  <li>The program originated from a trusted source</li>
  <li>Any modifications were authorized</li>
  <li>The chain of trust remains unbroken</li>
</ul>
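<p>The chain described above can be sketched in Python. This is a toy model, not the implementation: it swaps PKCS#7 for an HMAC-SHA256 tag (which is symmetric, unlike a real signature), uses a single made-up key for both phases, and all names (<code class="language-plaintext highlighter-rouge">sign()</code>, <code class="language-plaintext highlighter-rouge">verify()</code>, <code class="language-plaintext highlighter-rouge">two_phase_verify()</code>) are hypothetical. What it does show is the structure: the second signature covers the modified program concatenated with the first signature.</p>

```python
import hashlib
import hmac

def sign(key, payload):
    """Stand-in for PKCS#7 signing: an HMAC-SHA256 tag over the payload."""
    return hmac.new(key, payload, hashlib.sha256).digest()

def verify(key, payload, signature):
    return hmac.compare_digest(sign(key, payload), signature)

def two_phase_verify(key, original, modified, baseline_sig, modified_sig):
    """Check phase 1, then phase 2 over modified program + baseline signature."""
    if not verify(key, original, baseline_sig):
        return "baseline signature invalid"
    if not verify(key, modified + baseline_sig, modified_sig):
        return "modified-program signature invalid"
    return "ok"

key = b"demo-key"                                  # hypothetical key material
original = bytes([0x18, 0x11, 0, 0, 0, 0, 0, 0])   # program as compiled
modified = bytes([0x18, 0x11, 0, 0, 7, 0, 0, 0])   # program after loader patching

baseline_sig = sign(key, original)                 # phase 1
modified_sig = sign(key, modified + baseline_sig)  # phase 2

print(two_phase_verify(key, original, modified, baseline_sig, modified_sig))  # ok
tampered = modified[:-1] + b"\x01"
print(two_phase_verify(key, original, tampered, baseline_sig, modified_sig))  # modified-program signature invalid
```

<p>Because the second tag is computed over the first one, dropping or replacing the baseline signature also invalidates the second check, and the returned message tells you which phase failed.</p>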

<h2 id="benefits-of-this-approach">Benefits of This Approach</h2>

<p>The two-phase signing system offers several advantages:</p>

<ol>
  <li><strong>No Kernel Modifications Required</strong>
    <ul>
      <li>Built entirely on top of existing eBPF infrastructure
        <ul>
          <li>Uses standard BPF LSM bpf() syscall hook for verification</li>
          <li>Uses bpf_lookup_user_key() kfunc to retrieve keys from keyrings</li>
          <li>Uses bpf_verify_pkcs7_signature() kfunc to verify signatures</li>
        </ul>
      </li>
      <li>Maintains compatibility with existing eBPF tooling and workflows</li>
      <li>Modification of libbpf is transparent and relatively non-invasive</li>
    </ul>
  </li>
  <li><strong>Strong Auditability</strong>
    <ul>
      <li>Failures can be precisely traced to either the original program or post-compilation modifications</li>
      <li>Creates clear audit trails for security investigations</li>
    </ul>
  </li>
  <li><strong>Practical Security</strong>
    <ul>
      <li>Accommodates necessary program modifications while maintaining security</li>
      <li>Prevents signature stripping attacks</li>
      <li>Creates a verifiable link between original and modified code</li>
    </ul>
  </li>
</ol>

<h2 id="implementation-details">Implementation Details</h2>

<p>I’ve implemented this system using:</p>
<ul>
  <li>PKCS#7 signatures for both phases</li>
  <li>BPF LSM hooks to intercept program loading</li>
  <li>Standard cryptographic primitives from OpenSSL</li>
  <li>The existing Linux kernel keyring infrastructure</li>
</ul>

<p>The implementation is available as open source, though it’s currently a proof of concept and not yet suitable for production use.</p>

<h2 id="looking-forward">Looking Forward</h2>

<p>This two-phase signing approach opens new possibilities for securing eBPF programs while maintaining their flexibility. Future work could include:</p>

<ul>
  <li>Integration with hardware security modules (HSM) for private key storage</li>
  <li>Enhanced private key management systems</li>
  <li>libbpf updates for the second phase signing</li>
  <li>Production hardening of the implementation</li>
</ul>

<p>The eBPF ecosystem continues to evolve, and security measures must evolve with it. This two-phase signing approach represents a step forward in balancing security with the practical needs of eBPF program loading and execution.</p>

<p>For those interested in trying this out or contributing to the project, you can find the implementation at <a href="https://github.com/congwang/ebpf-2-phase-signing">github.com/congwang/ebpf-2-phase-signing</a>.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[The Extended Berkeley Packet Filter (eBPF) has revolutionized how we extend and observe the Linux kernel. However, with great power comes great responsibility, and securing eBPF programs has been a persistent challenge in the Linux kernel community. Today, I want to share an innovative approach to eBPF program signing that addresses some fundamental challenges in this space.]]></summary></entry><entry><title type="html">Understanding struct __sk_buff</title><link href="https://wangcong.org/2021-08-15-understanding-struct-__sk_buff.html" rel="alternate" type="text/html" title="Understanding struct __sk_buff" /><published>2021-08-15T02:28:49+00:00</published><updated>2021-08-15T02:28:49+00:00</updated><id>https://wangcong.org/understanding-struct-__sk_buff</id><content type="html" xml:base="https://wangcong.org/2021-08-15-understanding-struct-__sk_buff.html"><![CDATA[<h3 id="understanding-struct-__sk_buff">Understanding struct __sk_buff</h3>

<p>I have been mentoring our interns on some eBPF projects. The most common question raised during their internships is about the eBPF struct __sk_buff. There seems to be no document on the Internet that explains why it was introduced and how it works, so let me explain it a bit.</p>

<h3 id="what-is-struct-__sk_buff-anyway">What is struct __sk_buff anyway?</h3>

<p>You can consider struct __sk_buff a simplified version of struct sk_buff that exists only for eBPF programs. Unlike the over-complicated struct sk_buff, struct __sk_buff is really flat and simple. In your eBPF programs, you can use struct __sk_buff like any other struct; there is no magic on the user’s side. More importantly, you can even write into __sk_buff, which effectively writes to the kernel’s struct sk_buff. The magic part is in the eBPF verifier, as we will see later.</p>

<h3 id="why-do-we-need-struct-__sk_buff">Why do we need struct __sk_buff?</h3>

<p>For eBPF programs, reading or writing kernel data structures is not as easy as it is for kernel modules. Reading kernel memory requires, either explicitly or implicitly, the eBPF helper bpf_probe_read(). Writing kernel memory is generally considered unsafe, so there is no single general way to do it; you need dedicated eBPF helpers for that too. For instance, to write into a network packet, you need to call bpf_skb_store_bytes().</p>

<p>So why don’t we just use bpf_probe_read() to read struct sk_buff and add more eBPF helpers to modify it? This is doable for sure, however, if we look into struct sk_buff more closely, there are at least 3 disadvantages with this approach:</p>

<p>Thus, struct __sk_buff offers a simplified and ABI-compatible view of struct sk_buff in the kernel. It makes eBPF programmers’ lives easier.</p>

<h3 id="how-does-itwork">How does it work?</h3>

<p>If the kernel needs to maintain both struct sk_buff and struct __sk_buff, how does it keep them compatible with each other? Why is reading or writing __sk_buff actually reading or writing sk_buff behind the scenes?</p>

<p>Like I said, the magic is in the eBPF verifier. Don’t be fooled by its name: the eBPF verifier nowadays does more than just verification. It converts accesses to struct __sk_buff into accesses to struct sk_buff transparently when you load your eBPF program, which is why eBPF programmers see no difference.</p>

<p>eBPF programs only see struct __sk_buff in “user-space”, while the kernel, particularly the eBPF verifier, sees both sk_buff and __sk_buff, so it has knowledge of both. With this, it can translate from __sk_buff to sk_buff. The actual code is in bpf_convert_ctx_access(). For instance, __sk_buff::len is converted to sk_buff::len in this way:</p>

<p>What this code does is generate some eBPF bytecode and insert it into the eBPF program being loaded. This bytecode fixes up the offset of __sk_buff::len, converting it into the actual offset of sk_buff::len. Notice that the base address is the same for both sk_buff and __sk_buff, and both offsets are static information available after compilation.</p>
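<p>The idea of this offset fix-up can be modeled outside the kernel. The Python sketch below is only an analogy built on ctypes: both struct layouts are invented and far smaller than the real ones, and <code class="language-plaintext highlighter-rouge">convert_ctx_access()</code> here is a hypothetical stand-in, not the kernel function. It rewrites each load against the flat view into a load at the offset of the same-named field in the internal struct.</p>

```python
import ctypes

class UserSkBuff(ctypes.Structure):
    """Stand-in for the flat view seen by programs (hypothetical layout)."""
    _fields_ = [("len", ctypes.c_uint32),
                ("mark", ctypes.c_uint32),
                ("priority", ctypes.c_uint32)]

class KernelSkBuff(ctypes.Structure):
    """Stand-in for the internal struct, free to reorder fields (hypothetical)."""
    _fields_ = [("next", ctypes.c_void_p),
                ("prev", ctypes.c_void_p),
                ("priority", ctypes.c_uint32),
                ("mark", ctypes.c_uint32),
                ("len", ctypes.c_uint32)]

def convert_ctx_access(user_offset):
    """Map an offset in the flat view to the same field's internal offset."""
    for name, _ctype in UserSkBuff._fields_:
        if getattr(UserSkBuff, name).offset == user_offset:
            return getattr(KernelSkBuff, name).offset
    raise ValueError("no field at offset %d" % user_offset)

# A toy "program": two 32-bit loads relative to the same base pointer.
program = [("load32", UserSkBuff.len.offset), ("load32", UserSkBuff.mark.offset)]
rewritten = [(op, convert_ctx_access(off)) for op, off in program]
print(program)
print(rewritten)
```

<p>The base pointer never changes in this model; only the constant offsets do, which is why the translation can be done once at load time.</p>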

<p>It is more complicated to retrieve information from some inner struct, for example, skb_shinfo(), as it clearly needs more logic and more bytecode. You can take a look at bpf_convert_shinfo_access().</p>

<p>As we can see, it is a really smart choice to invent such a struct __sk_buff for eBPF programs. And it is always amazing to see what eBPF verifier could accomplish.</p>]]></content><author><name></name></author><category term="blog" /><summary type="html"><![CDATA[Understanding struct __sk_buff]]></summary></entry><entry><title type="html">How to pretend to be a Linux kernel expert</title><link href="https://wangcong.org/2019-03-14-how-to-pretend-to-be-a-linux-kernel-expert.html" rel="alternate" type="text/html" title="How to pretend to be a Linux kernel expert" /><published>2019-03-14T21:25:43+00:00</published><updated>2019-03-14T21:25:43+00:00</updated><id>https://wangcong.org/how-to-pretend-to-be-a-linux-kernel-expert</id><content type="html" xml:base="https://wangcong.org/2019-03-14-how-to-pretend-to-be-a-linux-kernel-expert.html"><![CDATA[<h3 id="how-to-pretend-to-be-a-linux-kernelexpert">How to pretend to be a Linux kernel expert</h3>

<p>Like many other areas, it is surprisingly not hard to pretend to be an expert in the Linux kernel, as long as you can talk so much bullshit that others don’t even want to challenge your authority on the topic. Especially in the United States, people tend to talk a lot in their everyday life, and sometimes only by talking a lot can you show your “expertise” to people. I am always amazed to see how people can literally talk for hours without giving a single piece of useful information.</p>

<p>Take a look at Donald Trump, who already self-claims to be an expert in so many areas, and watch the way he speaks: most of the time it is full of bullshit, and there are still a lot, really a lot, of people who believe in and like him. If he could make it, you could too. :)</p>

<p>On the other hand, being a true expert is hard, really really hard. You have to spend years really studying, and most of the time you can’t even show this off to your friends. And you have to work hard with other true experts to make them accept your work, which is sometimes even painful.</p>

<p>So, is there any quick way to become an expert without even understanding the stuff you are talking about? That is not just possible but also doable in the real world, as long as the people who challenge you are not true experts either. Use some social engineering tricks when you are challenged, make others feel uncomfortable or even stupid when they try to challenge you, and you are the expert already!</p>

<p>This guide is supposed to be that quick way for you; in particular, it will hopefully help you survive any Linux kernel domain-specific interview without even writing a single line of code.</p>

<p>Of course we are only interested in the first part. Cheating is easy and quick, sometimes you don’t even have to feel guilty. Hey, look, you are already an expert, cheating on a simple question doesn’t harm or define who you are, right?</p>

<p>Before your interview, try to search on Glassdoor to see what kind of interview questions this company usually asks. This works quite well, as interviewers are often too lazy to prepare new interview questions; they just share a pool of questions, or even worse, they just search for one online like you just did.</p>

<p>Try to memorize the answers online. This should work for all kinds of interview questions, especially those with quick answers.</p>

<p>Particularly for the Linux kernel, which is usually domain-specific, try to grab some Linux kernel books, for instance, Understanding the Linux Kernel. You don’t have to read all of them; just pick a topic, for example networking, read the related chapters, memorize the diagrams and the fancy terms, and try to mimic how true experts speak in this area. The more you can memorize, the better off you will be. There is really nothing you have to understand.</p>

<p>Remember, people are very impressed if you can draw a complex diagram on a whiteboard in front of them, as long as you bring it up at the right time.</p>

<ol>
  <li>Talking, talking, talking</li>
</ol>

<p>If people don’t believe you are an expert after listening to you, you are just not talking enough. And of course, wording is also important: making a trivial thing look great is an art you have to master.</p>

<p>Let’s use a real example.</p>

<p>Question: How do you understand Linux TCP/IP stack?</p>

<p>Bad answer: I have no experience in this area, I only have very limited knowledge of networking, especially Linux networking.</p>

<p>Good answer: I have worked on many networking-related projects in my previous jobs (even if it is just <code class="language-plaintext highlighter-rouge">ping google.com</code>). In particular, I wrote some scripts to diagnose the networking latency in our Linux network (routers run Linux) and identified some unexpected packet drops and the bottleneck (even if it is just a wrapper around ping that parses its output), then I worked closely with our networking team to solve it successfully (someone fixed the broken routers for you). This boosted the networking performance by X percent and saved about X thousand dollars for our company (no one can verify this). I led and drove the whole project (of course you did!) and successfully applied my networking knowledge in my job. In another job, I worked on networking throughput tuning: I evaluated the issue and set up a test case to reproduce it, and after narrowing it down (just blindly guessing) to the bottom, I found out that the bottleneck was in some middle router (by running traceroute after searching on Stack Overflow), which caused TCP congestion (you can always blame TCP). We replaced the default TCP congestion control algorithm with the evolutionary BBR (you don’t even need to know what BBR stands for), and the throughput went up by X percent (again, no one could verify, just don’t be too aggressive).</p>

<p>See the difference? Even if your networking knowledge is nothing beyond what you find on Stack Overflow, you can still decorate it like you are the expert! It is all about wording, wording and wording.</p>

<p>People are more easily convinced by numbers, even fake ones; they have no way to verify what you claim unless you really go off script. Try to give yourself as much credit as you can: there is no way to verify what you did and what your colleagues did in one particular project, especially when it was a long time ago.</p>

<p>You still have to memorize some terms like “TCP congestion control”, but you don’t even need to understand them. In fact, the people interviewing you are unlikely to understand them either, and they won’t challenge you unless they are really paranoid; they just want to hear words like “TCP congestion control”. By delivering as many words like this as you can, you can make them feel like you are unchallengeable and definitely the expert they want! Keep talking until this is true; I bet they won’t even want to interrupt you. How dare they? :)</p>

<p>Let me give you some golden sentences here:</p>

<p>“Everything is a file in UNIX”. Of course, who would doubt this!</p>

<p>“Keep it simple”. Of course, of course, everyone loves KISS. It is never aggressive to stress this point for any software engineer.</p>

<p>“A key for optimizing networking performance is to reduce copies”. Many people don’t understand this, you just need to repeat it in an appropriate context, or fold this into your own experience. Senior level expert!</p>

<p>“There is always a tradeoff.” Ah, what a true expert now you are!!! This could impress a lot of people, believe me. Senior staff level!</p>

<p>“The whole TCP/IP stack should be just moved to user-space, it is hard to optimize in kernel-space.” Wow, what a broad vision you have! You are the leading expert of the industry now!! Congratulations! Senior principal level!</p>

<p>“I think DPDK and eBPF are the future.” Who would question your expertise, preferences and vision anyway, dear director?</p>

<p>“The trend is for hardware to offload a lot of work from the CPU.” Sure… Any experienced networking engineer would agree.</p>

<p>Now let’s try to combine some of them together:</p>

<p>DPDK and eBPF should integrate really well with containers and provide better performance and isolation for different workloads. With hardware acceleration, the CPU will be freed to do some more important work within the whole TCP/IP stack in the near future.</p>

<p>Look, it sounds really great, no one could question it and everyone would agree you are the senior principal engineer they should hire.</p>

<p>Key takeaways:</p>

<p>You are very welcome!</p>]]></content><author><name></name></author><category term="blog" /><summary type="html"><![CDATA[How to pretend to be a Linux kernel expert]]></summary></entry><entry><title type="html">OverTheWire Advent writeup — Boxy</title><link href="https://wangcong.org/2018-12-08-overthewire-advent-writeup-boxy.html" rel="alternate" type="text/html" title="OverTheWire Advent writeup — Boxy" /><published>2018-12-08T06:15:39+00:00</published><updated>2018-12-08T06:15:39+00:00</updated><id>https://wangcong.org/overthewire-advent-writeup-boxy</id><content type="html" xml:base="https://wangcong.org/2018-12-08-overthewire-advent-writeup-boxy.html"><![CDATA[<h3 id="overthewire-advent-writeupboxy">OverTheWire Advent writeup — Boxy</h3>

<p>This is a very interesting reversing challenge. It turns out to be more than just reversing: it requires some knowledge of image processing too, as we will see.</p>

<p>First, once you read into the challenge, the boxy.txt contains many hints for this challenge. Although it implies it is Chinese, as a Chinese native speaker, I can tell you it is certainly not in Chinese. It looks more like Japanese to me. Anyway, I can’t take any advantage from my Chinese knowledge.</p>

<p>I have no clue what boxy.txt tries to hint, so I decided to look into the examples it provides. They are pretty informative and useful.</p>

<p>From the file names, we can figure out that each 0xXX.bin is closely related to the 0xXX.bin*.png files. We need to figure out what that relation is. But before that, what are those binary files anyway? Let’s look into them with hexdump -C, like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ hexdump -C examples/0x00.bin
00000000  01 ff 02 ff 03 00 ff 00                       |........|
00000008
</code></pre></div></div>

<p>Only 8 bytes. A quick and wild guess would be that 0x01, 0x02 and 0x03 here number the fields, each of which ends with 0xff, and that the last 0x00 marks the end of the record. Not a bad guess at all, right? :)</p>

<p>Well, looking into 0x01.bin clearly indicates my guess is wrong. We need to rethink 0x00.bin. Wait. Aren’t those bytes mentioned in boxy.txt?</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>### 0x00

姨嘟 傭 傈媉

0x01 0xff
0x02 0xff
0x03 0x01
0xff 0x00 ; 侸喽 军廎 十呿 侙壌 1
</code></pre></div></div>

<p>Yeah, boxy.txt kind of explains all of these binaries, with some comments we don’t understand. Therefore, we still have to figure out what it is telling us.</p>

<p>The first thing we can find out is that the last two bytes are always 0xff 0x00, which looks like an end-of-file or end-of-record marker. But section 0x01 contains multiple 0xff 0x00 sequences, so it is certainly not an end-of-file marker. Does it have anything to do with the PNG files?</p>

<p>0x00.bin has one related PNG file, 0x01.bin has 8, and 0x02.bin has 4. These numbers match the number of “0xff 0x00” sequences in each binary! So perhaps each block represents one PNG file, and the end of a block is “0xff 0x00”!</p>
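<p>Counting these sequences by hand gets tedious for larger files. A minimal Python sketch, assuming the file is a stream of 2-byte (opcode, value) pairs (an assumption at this point in the analysis, which the rest of the write-up bears out); the function name <code class="language-plaintext highlighter-rouge">count_draws()</code> is made up:</p>

```python
def count_draws(data):
    """Count (0xff, 0x00) pairs, reading the data as 2-byte (opcode, value) pairs."""
    pairs = zip(data[0::2], data[1::2])
    return sum(1 for pair in pairs if pair == (0xFF, 0x00))

# Bytes of examples/0x00.bin, copied from the hexdump output above.
example = bytes([0x01, 0xFF, 0x02, 0xFF, 0x03, 0x00, 0xFF, 0x00])
print(count_draws(example))  # 1
```

<p>Reading the bytes in aligned pairs avoids miscounting a value byte of 0xff that happens to be followed by an opcode of 0x00.</p>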

<p>How does each block describe its PNG? What does the rest of a block mean? Let’s continue to guess: perhaps it describes each rectangle and each color?</p>

<p>Let’s look at those 0x01* PNG files. Their only differences are colors.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>### 0x01

凥妏 力帎

0x01 0xff
0x02 0xff
0x03 0x00
0xff 0x00 ; 巫乸 墰扉 奓夗 嶛儗 0
0x03 0x01
0xff 0x00 ; 媑寤 串勸 嬂囎 嫒厏 1
0x03 0x02
0xff 0x00 ; 亙慞 乹扚 慨岊 徿彃 2
0x03 0x03
0xff 0x00 ; 劳啈 仜姈 抻人 婚处 3
0x03 0x04
0xff 0x00 ; 扒廊 傋打 徬壸 崍应 4
0x03 0x05
0xff 0x00 ; 届嬎 怺壙 徾屻 佡垫 5
0x03 0x06
0xff 0x00 ; 尗彟 嫣拮 嫡峒 偢嫚 6
0x03 0x07
0xff 0x00 ; 奥嫄 剁峓 垔佴 嘔偬 7
</code></pre></div></div>

<p>Clearly, we can infer that 0x03 means coloring, while 0x00~0x07 represent 8 different colors. So, each pair of “0x03 0x..” means coloring the picture with a specified color. From the order, we can map each hexadecimal value to a color. Excellent! But if 0x03 just means coloring, what about the size of the rectangle? And what do 0x01 and 0x02 in the first two lines mean?</p>

<p>We need to move on to 0x02.bin as its related PNG files have rectangles of different sizes.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0x01 0x80
0x02 0x80
0x03 0x00
0xff 0x00 ; 侴当 屳亪 0|0 俸徙 佳姅 仁拴 0
0x04 0x80
0x05 0x80
0xff 0x00 ; 嬻庇 句崁 1|1 互嚒 廉廙 嵌壻 唝哠
0x03 0x01
0x04 0x00
0x05 0x80
0xff 0x00 ; 嗉决 塟弸 0|1 忆噑 峭宴 廗习 1
0x04 0x80
0x05 0x00
0xff 0x00 ; 巸俴 偄廳 1|0 岰凥 唔妨 崏坾 惰兰
</code></pre></div></div>

<p>The parts we need to focus on are the 0x04 and 0x05 pairs. We already know 0x03 0x01 means coloring with black. In the last two 0x02* PNG files, two black rectangles are added, one at a time. Their difference is clearly the location, so 0x04 and 0x05 probably set the location of the rectangle! Likely they are the X and Y coordinates.</p>

<p>What’s more, we can also figure out that the drawing of each PNG file seems incremental, which means we add something on top of the current PNG. When we draw the second black rectangle, we only change its position, so its color remains black from the earlier “0x03 0x01”. Now it is clear that there are some global settings for the drawing, such as the current color and the current X and Y coordinates. The binary pairs are instructions that change these values.</p>

<p>From the first two pairs in the 0x02 section, it is safe to guess that 0x01 and 0x02 probably set sizes, as the rectangles we draw in the 0x02* PNG files are just 1/4 of the whole picture. They must represent width and height.</p>

<p>Now, look at what we have figured out so far:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0x01 width or height
0x02 height or width
0x03 coloring
0x04 X or Y axis
0x05 Y or X axis
0xff draw it

(For colors: 0x00 means white, 0x01 means black, etc.)
</code></pre></div></div>

<p>For the remaining instructions, we have to look into the 0x03 binary. Observe each small rectangle drawn in each 0x03* PNG file: the drawing location moves from left to right, then from bottom up, then from right to left. It does not take much time to figure out that 0x06, 0x07, 0x08 and 0x09 mean relative movement.</p>

<p>Finally, we can find out the meaning of all these instructions:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0x01 width
0x02 height
0x03 coloring
0x04 Y axis
0x05 X axis
0x06 move up
0x07 move down
0x08 move left
0x09 move right
0xff draw it

(For colors: 0x00 means white, 0x01 means black, etc.)
</code></pre></div></div>

<p>And basically we need to interpret these binary commands and execute them to draw pictures, hopefully the flag is hidden in the output PNG files!</p>
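<p>Before drawing anything, the instruction table can be sanity-checked with a tiny interpreter that needs no image library. This is only a sketch under the assumptions made so far: the function name <code class="language-plaintext highlighter-rouge">run()</code> and the <code class="language-plaintext highlighter-rouge">state</code> dictionary are hypothetical, “up” is taken to mean a smaller Y value, and the sample bytes are the 0x02 section quoted earlier.</p>

```python
def run(data):
    """Execute (opcode, value) pairs and return the settings at each draw."""
    state = {"width": 0, "height": 0, "color": 0, "x": 0, "y": 0}
    setters = {0x01: "width", 0x02: "height", 0x03: "color", 0x04: "y", 0x05: "x"}
    moves = {0x06: ("y", -1), 0x07: ("y", 1), 0x08: ("x", -1), 0x09: ("x", 1)}
    drawn = []
    for opcode, value in zip(data[0::2], data[1::2]):
        if opcode in setters:
            state[setters[opcode]] = value
        elif opcode in moves:
            axis, sign = moves[opcode]
            state[axis] += sign * value
        elif opcode == 0xFF:
            drawn.append(dict(state))  # snapshot of the settings at draw time
        else:
            raise ValueError("unknown opcode 0x%02x" % opcode)
    return drawn

# The 0x02 section from boxy.txt, as listed earlier in this post.
section_02 = bytes([0x01, 0x80, 0x02, 0x80, 0x03, 0x00, 0xFF, 0x00,
                    0x04, 0x80, 0x05, 0x80, 0xFF, 0x00,
                    0x03, 0x01, 0x04, 0x00, 0x05, 0x80, 0xFF, 0x00,
                    0x04, 0x80, 0x05, 0x00, 0xFF, 0x00])
summary = [(r["color"], r["x"], r["y"]) for r in run(section_02)]
print(summary)  # [(0, 0, 0), (0, 128, 128), (1, 128, 0), (1, 0, 128)]
```

<p>The four snapshots match the four 0x02* pictures: two white 128x128 squares, then two black ones in the remaining quadrants, with color and size carried over between draws.</p>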

<p>With the help of the Python Pillow library, it is not hard to come up with the following code:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from PIL import Image, ImageDraw


class DrawPNG:
    color_map = {0x00: 'white', 0x01: 'black', 0x02: 'red', 0x03: 'green',
                 0x04: 'yellow', 0x05: 'blue', 0x06: 'pink', 0x07: 'cyan'}
    num_files = 0

    def __init__(self, file_name, color=0x00, x=0, y=0):
        self.color = color
        self.x = x
        self.y = y
        # Size is unset until the first 0x01/0x02 instruction; default to 0
        self.width = 0
        self.height = 0
        self.file_name = file_name
        self.im = Image.new('RGBA', (256, 256), self.color_map[self.color])
        # Map each opcode to the method that handles it
        self.instructions = {
            0x01: self.set_width, 0x02: self.set_height,
            0x03: self.change_color,
            0x04: self.set_y, 0x05: self.set_x,
            0x06: self.move_up, 0x07: self.move_down,
            0x08: self.move_left, 0x09: self.move_right,
            0xff: self.draw_it,
        }

    def change_color(self, val):
        self.color = val

    def set_width(self, val):
        self.width = val

    def set_height(self, val):
        self.height = val

    def set_x(self, val):
        self.x = val

    def set_y(self, val):
        self.y = val

    def move_up(self, val):
        self.y -= val

    def move_down(self, val):
        self.y += val

    def move_left(self, val):
        self.x -= val

    def move_right(self, val):
        self.x += val

    def draw_it(self, unused):
        # Draw a rectangle with the current settings and save a new PNG
        draw = ImageDraw.Draw(self.im)
        print("x = %d y = %d w = %d h = %d" % (self.x, self.y, self.width, self.height))
        draw.rectangle((self.x, self.y, self.x + self.width, self.y + self.height),
                       fill=self.color_map[self.color])
        self.im.save(self.file_name + '_' + str(self.num_files) + '.png')
        self.num_files += 1

    def parse_instructions(self, data):
        # Instructions come in (opcode, value) byte pairs
        for i in range(0, len(data), 2):
            print("%x : %x" % (data[i], data[i+1]))
            self.instructions[data[i]](data[i+1])
        return self.num_files
</code></pre></div></div>

<p>It works, but the output is not what I expected. We get what is apparently Morse code in 913 PNG files! Like this one:</p>

<p>So the flag must be hidden in the morse code! But the first question is how do we extract it?? Clearly we don’t want to read all 913 files and type them in a morse code translator!</p>

<p>My first thought was to use OCR, which can easily extract text from such a simple PNG file. Unfortunately, the following pyocr code I tried doesn’t recognize Morse code at all!</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pyocr
import pyocr.builders
from PIL import Image

# Use the first OCR tool pyocr can find (for example, Tesseract)
tool = pyocr.get_available_tools()[0]
text = tool.image_to_string(Image.open("reverseme_12.png"), lang="eng",
                            builder=pyocr.builders.TextBuilder(tesseract_layout=6))
</code></pre></div></div>

<p>Perhaps it doesn’t know whether to recognize the line as ‘-’ or ‘_’. Shrug. I gave up. How about extracting it with our own code by reading each pixel? It sounds hard, but in this case it is actually easy, because all the dots have the same size, and so do all the lines; more importantly, they all sit on the middle line of each picture! This means we don’t have to read every pixel, we just need to read the pixels along the middle line!!!</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def chunkstring(string, length):    return (string[0+i:length+i] for i in range(0, len(string), length))
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Read pixels on the middle line# Return empty string is the PNG is blank# Else, return the morse code stringdef parse_morse(file_name):  im = Image.open(file_name)  pix = im.load()  total_dots = dots = 0  result = ""  for x in range(256):    if pix[x,128] == ImageColor.getcolor('Black', 'RGBA'):      total_dots += 1      dots += 1    else:      if dots &gt; 5:        result += '-'      elif dots &gt; 0:        result += '.'      dots = 0  return ' '.join(list(chunkstring(result, 5)))
</code></pre></div></div>
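
<p>The run-length idea behind <code class="language-plaintext highlighter-rouge">parse_morse()</code> can be tried without any image at all. What follows is only a minimal Python 3 sketch: <code class="language-plaintext highlighter-rouge">runs_to_morse()</code> and <code class="language-plaintext highlighter-rouge">dash_min</code> are hypothetical names that are not part of the original script, and the list of booleans stands in for the pixels on the middle line. Unlike the loop above, it appends a white sentinel so that a mark touching the right edge is still counted.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def runs_to_morse(row, dash_min=6):
    """Turn a row of booleans (True = black pixel) into dots and dashes."""
    symbols = []
    run = 0
    # The trailing False flushes a run that touches the right edge.
    for is_black in list(row) + [False]:
        if is_black:
            run += 1
        elif run:
            # Runs shorter than dash_min are dots, anything longer is a dash.
            symbols.append('.' if run in range(dash_min) else '-')
            run = 0
    return ''.join(symbols)

WHITE, BLACK = False, True
row = [WHITE] * 3 + [BLACK] * 2 + [WHITE] * 4 + [BLACK] * 10 + [WHITE] * 2 + [BLACK] * 2
print(runs_to_morse(row))  # .-.
</code></pre></div></div>

<p>The default of six pixels for a dash mirrors the threshold of 5 used above; for other pictures that number would have to be measured first.</p>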

<p>Of course, because each picture is drawn incrementally, only the final picture of each sequence has the complete morse code. We can detect it by checking whether a blank picture follows it. The rest of the code is even easier:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>file_name = sys.argv[1]
file_len = os.path.getsize(file_name)
data = array('B')
with open(file_name, 'rb') as f:
    data.fromfile(f, file_len)
f.close()
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>draw = DrawPNG(file_name)
num_files = draw.parse_instructions(data)
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>has_complete_morse = False
nums = []
for i in xrange(num_files-1, 0, -1):
    png_file = file_name + '_' + str(i) + '.png'
    morse_code = parse_morse(png_file)
    if len(morse_code) == 0:
        print "%s is blank!"%(png_file)
        has_complete_morse = True
    elif has_complete_morse:
        print "%s has complete morse code %s"%(png_file, morse_code)
        plain_text = decode_morse(morse_code)
        print "%s has text %s"%(png_file, plain_text)
        has_complete_morse = False
        nums.append(chr(int(plain_text)))
</code></pre></div></div>
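
<p>The loop above passes the result of <code class="language-plaintext highlighter-rouge">decode_morse()</code> to <code class="language-plaintext highlighter-rouge">chr(int(...))</code>, which suggests that each complete picture spells the decimal character code of one flag character in morse digits. Under that assumption, the decoding step could look roughly like this Python 3 sketch; <code class="language-plaintext highlighter-rouge">MORSE_DIGITS</code> and <code class="language-plaintext highlighter-rouge">morse_digits_to_char()</code> are hypothetical names and only a stand-in for the real helper:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># International morse code for the ten digits, five symbols each.
MORSE_DIGITS = {
    '-----': '0', '.----': '1', '..---': '2', '...--': '3', '....-': '4',
    '.....': '5', '-....': '6', '--...': '7', '---..': '8', '----.': '9',
}

def morse_digits_to_char(morse_code):
    """Decode space-separated 5-symbol groups into a number, then a character."""
    digits = ''.join(MORSE_DIGITS[group] for group in morse_code.split())
    return chr(int(digits))

print(morse_digits_to_char('.---- ----- ...--'))  # g
</code></pre></div></div>

<p>Every morse digit is exactly five symbols long, which is why splitting the raw string into chunks of 5 is enough to separate them.</p>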

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>nums.reverse()
print ''.join(nums)
</code></pre></div></div>]]></content><author><name></name></author><category term="blog" /><summary type="html"><![CDATA[OverTheWire Advent writeup — Boxy]]></summary></entry><entry><title type="html">CSAW CTF writeup — A Tour of x86</title><link href="https://wangcong.org/2018-09-18-csaw-ctf-writeup-a-tour-of-x86.html" rel="alternate" type="text/html" title="CSAW CTF writeup — A Tour of x86" /><published>2018-09-18T04:20:33+00:00</published><updated>2018-09-18T04:20:33+00:00</updated><id>https://wangcong.org/csaw-ctf-writeup-a-tour-of-x86</id><content type="html" xml:base="https://wangcong.org/2018-09-18-csaw-ctf-writeup-a-tour-of-x86.html"><![CDATA[<h3 id="csaw-ctf-writeupa-tour-ofx86">CSAW CTF writeup — A Tour of x86</h3>

<p>The series of x86 assembly challenges in CSAW CTF is interesting because it is wrapped in a very tiny i386 OS! How I miss the good old days of hand-writing i386 boot assembly!</p>

<p>Part 1 is fairly elementary, so I won’t waste a word on it. Part 2 is very easy too; the only thing I want to mention is that I used the following QEMU debugging trick to single-step to where the OS stops, which clearly shows that it pauses at the hlt instruction. You should know how to proceed to get the flag!</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>hlt
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>qemu-system-x86_64 -serial stdio -d guest_errors -drive format=raw,file=tacOS.bin -s -S -d in_asm -singlestep
</code></pre></div></div>

<p>Now, let’s talk about part 3.</p>

<p>This part is slightly harder because it requires writing actual assembly code. Don’t worry: although it is embedded in the OS bootstrap code, you need to know almost nothing about i386 interrupts or I/O ports. More importantly, by the time our code executes, x86_64 is already properly set up, so we can just write normal x86_64 assembly!</p>

<p>The only confusing thing is the 0x1f byte. It is part of the “protocol” for displaying characters via VGA: in text mode each character on screen takes two bytes, the character itself followed by an attribute byte, and 0x1f means a white character on a blue background. 0x00b8000 is the starting address of the VGA text-mode video memory. All the rest is trivial, here is my NASM code:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bits 64
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    call get_ip
get_ip:
    pop rsi                 ; rsi = runtime address of get_ip
</code></pre></div></div>

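<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    add rsi, 0x23           ; rsi now points just past this code (0x23 bytes after get_ip)
    mov rdx, 0x00b8000      ; start of VGA text-mode video memory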
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>print_char_loop:
    cmp byte [rsi], 0       ; stop at the NUL terminator
    je done
</code></pre></div></div>

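<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    mov byte bl, [rsi]      ; load the next character
    mov byte [rdx], bl      ; first byte of the cell: the character
    inc rdx
    inc rsi
    mov byte [rdx], 0x1f    ; second byte of the cell: white on blue
    inc rdx
    jmp print_char_loop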
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>done:
    hlt
    hlt
</code></pre></div></div>
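
<p>To make the character and attribute layout more concrete, here is a small Python 3 sketch of what the loop above writes to video memory. The helper names <code class="language-plaintext highlighter-rouge">vga_attribute()</code> and <code class="language-plaintext highlighter-rouge">vga_text_cells()</code> are made up for illustration; the colour indices 0x1 (blue) and 0xF (white) are the standard VGA text-mode palette values.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def vga_attribute(foreground, background):
    """Pack two 4-bit colour indices: high nibble background, low nibble foreground."""
    return background * 16 + foreground

def vga_text_cells(text, attribute=0x1F):
    """Return the bytes to store from 0xb8000 on to show text with the given attribute."""
    cells = bytearray()
    for char_code in text.encode('ascii'):
        cells.append(char_code)   # even offset: the character
        cells.append(attribute)   # odd offset: its colour attribute
    return bytes(cells)

BLUE, WHITE = 0x1, 0xF
print(hex(vga_attribute(WHITE, BLUE)))   # 0x1f
print(vga_text_cells('Hi'))              # b'H\x1fi\x1f'
</code></pre></div></div>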

<p>Compile it:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>nasm -f bin -o printflag.bin printflag.asm
</code></pre></div></div>

<p>Translate it into hex:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>xxd -p &lt; printflag.bin | tr -d '\n'
</code></pre></div></div>
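
<p>If <code class="language-plaintext highlighter-rouge">xxd</code> is not at hand, the same continuous hex string can be produced with a few lines of Python 3. This is just a sketch; <code class="language-plaintext highlighter-rouge">hex_dump_plain()</code> is a hypothetical helper name:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def hex_dump_plain(path):
    """Return the file contents as one continuous lowercase hex string."""
    with open(path, 'rb') as f:
        return f.read().hex()
</code></pre></div></div>

<p>Calling <code class="language-plaintext highlighter-rouge">hex_dump_plain('printflag.bin')</code> should give the same text as the shell pipeline above, since <code class="language-plaintext highlighter-rouge">bytes.hex()</code> emits two lowercase hex digits per byte without any separators.</p>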

<p>Finally, feed them to the remote:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/usr/bin/env python
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import subprocess
from pwn import *
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>context.log_level = 'debug'
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>r = remote("rev.chal.csaw.io", 9004)
r.recvline()
r.sendline("e8000000005e4883c623ba00800b00803e0074128a1e881a48ffc248ffc6c6021f48ffc2ebe9f4f4")
ret = r.recvline().rstrip('\n')
port = ret.split(" ")[-1]
subprocess.call(["vncviewer", "rev.chal.csaw.io:"+port])
r.close()
</code></pre></div></div>
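
<p>The magic 0x23 in the assembly can be double-checked from the hex string sent above. The Python 3 sketch below assumes that the text to print is placed immediately after the submitted code, which is what the arithmetic suggests; the variable names are only illustrative.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>PAYLOAD_HEX = "e8000000005e4883c623ba00800b00803e0074128a1e881a48ffc248ffc6c6021f48ffc2ebe9f4f4"

payload = bytes.fromhex(PAYLOAD_HEX)

# "call get_ip" is e8 plus a 4-byte displacement, so the address popped
# into rsi is the byte right after those 5 bytes.
CALL_LEN = 5
offset_to_end = len(payload) - CALL_LEN

print(len(payload), hex(offset_to_end))  # 40 0x23
</code></pre></div></div>

<p>So adding 0x23 to the popped address lands exactly on the first byte after the 40-byte payload.</p>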

<p>In fact, I am sure we can put everything into a single pwntools Python script, I just didn’t want to spend much time on digging pwntools doc.</p>]]></content><author><name></name></author><category term="blog" /><summary type="html"><![CDATA[CSAW CTF writeup — A Tour of x86]]></summary></entry><entry><title type="html">The design of lock_sock() in Linux kernel</title><link href="https://wangcong.org/2018-08-25-the-design-of-lock_sock-in-linux-kernel.html" rel="alternate" type="text/html" title="The design of lock_sock() in Linux kernel" /><published>2018-08-25T00:31:21+00:00</published><updated>2018-08-25T00:31:21+00:00</updated><id>https://wangcong.org/the-design-of-lock_sock-in-linux-kernel</id><content type="html" xml:base="https://wangcong.org/2018-08-25-the-design-of-lock_sock-in-linux-kernel.html"><![CDATA[<h3 id="the-design-of-lock_sock-in-linuxkernel">The design of lock_sock() in Linux kernel</h3>

<p>Among the various kinds of locks in the Linux kernel code base, <code class="language-plaintext highlighter-rouge">lock_sock()</code> is probably the weirdest one (unless you consider RCU even weirder).</p>

<p>As we all know, there are basically two categories of locks in the Linux kernel: blocking ones like a mutex or a semaphore, and non-blocking ones like a spinlock or a read-write lock. Which one you pick largely depends on the context in which you plan to use it. The weird part of the sock lock is that it is both blocking and non-blocking, depending on the context.</p>

<p>There are two contexts for the software part of the networking stack: Bottom-Half (BH) context, where network packets are received and transmitted, often called the “data path” or the fast path; and process context, where the “control path” happens, which is the slow path. Of course I simplify a lot here; for example, on the transmission side we also send packets in process context until they hit the Qdisc layer or the driver layer.</p>

<p>For a socket, its “data path” is how packets destined to it are queued; this part is not directly influenced by user-space. Its “control path” is how we configure a socket, for example via <code class="language-plaintext highlighter-rouge">setsockopt()</code>, and how we change the state of a socket, for example via <code class="language-plaintext highlighter-rouge">bind()</code> and <code class="language-plaintext highlighter-rouge">close()</code>; this part is completely and directly driven by user-space.</p>

<p>Generally speaking, the locking rule is clear: if we want to lock a shared data structure used in both contexts, we have to lock it in both contexts. This is why you see so many <code class="language-plaintext highlighter-rouge">X_lock_bh()</code> variants of a given <code class="language-plaintext highlighter-rouge">X_lock()</code>. So for a socket, locking it in both contexts means a packet being queued in BH context won’t race with a user-space <code class="language-plaintext highlighter-rouge">close()</code> of the same socket.</p>

<p>Why is <code class="language-plaintext highlighter-rouge">lock_sock()</code> not just a regular spinlock? For performance!</p>

<p>If <code class="language-plaintext highlighter-rouge">lock_sock()</code> were a regular spinlock, then while we hold it in process context on behalf of a user-space <code class="language-plaintext highlighter-rouge">setsockopt()</code>, the packet receiving path in BH context would have to busy-wait until <code class="language-plaintext highlighter-rouge">setsockopt()</code> finishes. This is very bad, as packet receiving is the fast path we certainly don’t want to slow down.</p>

<p>This is why the sock lock is turned into two different locks for process context and BH context:</p>

<p>When process context begins to contend with BH context, it becomes complicated:</p>

<p>Without additional logic, it is clearly not safe. To make it safe, <code class="language-plaintext highlighter-rouge">lock_sock()</code> enforces the following logic on its callers:</p>

<p>Take a look at the TCP receive path in BH context as an example:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="n">bh_lock_sock_nested</span><span class="p">(</span><span class="n">sk</span><span class="p">);</span>
    <span class="n">tcp_segs_in</span><span class="p">(</span><span class="n">tcp_sk</span><span class="p">(</span><span class="n">sk</span><span class="p">),</span> <span class="n">skb</span><span class="p">);</span>
    <span class="n">ret</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">sock_owned_by_user</span><span class="p">(</span><span class="n">sk</span><span class="p">))</span> <span class="p">{</span>
        <span class="n">ret</span> <span class="o">=</span> <span class="n">tcp_v4_do_rcv</span><span class="p">(</span><span class="n">sk</span><span class="p">,</span> <span class="n">skb</span><span class="p">);</span>
    <span class="p">}</span> <span class="k">else</span> <span class="nf">if</span> <span class="p">(</span><span class="n">tcp_add_backlog</span><span class="p">(</span><span class="n">sk</span><span class="p">,</span> <span class="n">skb</span><span class="p">))</span> <span class="p">{</span>
        <span class="k">goto</span> <span class="n">discard_and_relse</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">bh_unlock_sock</span><span class="p">(</span><span class="n">sk</span><span class="p">);</span>
</code></pre></div></div>

<p>See the difference between when the lock is owned by user-space and when it is not? Clearly, <code class="language-plaintext highlighter-rouge">tcp_v4_do_rcv()</code> is much more complicated than <code class="language-plaintext highlighter-rouge">tcp_add_backlog()</code>, so what about the “missing” part when we just call <code class="language-plaintext highlighter-rouge">tcp_add_backlog()</code>? It is exactly what is deferred to <code class="language-plaintext highlighter-rouge">release_sock()</code>, to run when the owner releases this lock:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">release_sock</span><span class="p">(</span><span class="k">struct</span> <span class="n">sock</span> <span class="o">*</span><span class="n">sk</span><span class="p">){</span>
    <span class="n">spin_lock_bh</span><span class="p">(</span><span class="o">&amp;</span><span class="n">sk</span><span class="o">-&gt;</span><span class="n">sk_lock</span><span class="p">.</span><span class="n">slock</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">sk</span><span class="o">-&gt;</span><span class="n">sk_backlog</span><span class="p">.</span><span class="n">tail</span><span class="p">)</span>
        <span class="n">__release_sock</span><span class="p">(</span><span class="n">sk</span><span class="p">);</span>
<span class="c1">//...</span>
</code></pre></div></div>

<p>where <code class="language-plaintext highlighter-rouge">__release_sock()</code> will execute the callback <code class="language-plaintext highlighter-rouge">sk-&gt;sk_backlog_rcv()</code> to continue to process the packets queued in its backlog, and for TCP, this callback is exactly <code class="language-plaintext highlighter-rouge">tcp_v4_do_rcv()</code>. Bingo!</p>

<p>As you can see, the whole packet receiving process is not always finished in BH context. For TCP, <code class="language-plaintext highlighter-rouge">tcp_v4_do_rcv()</code> could be called either in BH context as usual, or in process context if lock contention happens on the sock lock.</p>

<p>But the rule is still simple: always call <code class="language-plaintext highlighter-rouge">lock_sock()</code> and <code class="language-plaintext highlighter-rouge">release_sock()</code> in process context, always call <code class="language-plaintext highlighter-rouge">bh_lock_sock()</code> and <code class="language-plaintext highlighter-rouge">bh_unlock_sock()</code> in BH context, and properly check <code class="language-plaintext highlighter-rouge">sock_owned_by_user()</code> after acquiring <code class="language-plaintext highlighter-rouge">bh_lock_sock()</code>.</p>
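
<p>The interplay of the owner flag and the backlog may be easier to see in a toy user-space model. The Python 3 sketch below is not kernel code: <code class="language-plaintext highlighter-rouge">ToySock</code> and its methods are invented for illustration, a <code class="language-plaintext highlighter-rouge">threading.Lock</code> stands in for the spinlock, and as a simplification the backlog is drained with that lock held, whereas the kernel drops the spinlock while it processes queued packets.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import threading

class ToySock:
    """Toy user-space model of the sock lock; names are illustrative only."""

    def __init__(self):
        self.slock = threading.Lock()                    # stands in for sk_lock.slock
        self.unowned = threading.Condition(self.slock)   # where would-be owners sleep
        self.owned = False                               # stands in for sk_lock.owned
        self.backlog = []                                # stands in for sk_backlog
        self.log = []                                    # (context, packet) as processed

    def do_rcv(self, context, packet):
        # Stand-in for the protocol receive routine, e.g. tcp_v4_do_rcv().
        self.log.append((context, packet))

    def lock_sock(self):
        # Process context: may sleep until the previous owner is done.
        with self.slock:
            while self.owned:
                self.unowned.wait()
            self.owned = True

    def release_sock(self):
        # Process context: handle what BH deferred, then give up ownership.
        with self.slock:
            while self.backlog:
                self.do_rcv("process", self.backlog.pop(0))
            self.owned = False
            self.unowned.notify()

    def bh_receive(self, packet):
        # BH context: only a short locked section, never sleeps.
        with self.slock:
            if not self.owned:
                self.do_rcv("bh", packet)
            else:
                self.backlog.append(packet)

sk = ToySock()
sk.bh_receive("p1")   # nobody owns the socket: processed right away
sk.lock_sock()        # e.g. a setsockopt() starts
sk.bh_receive("p2")   # owned by process context: only queued
sk.release_sock()     # the queued packet is processed here
sk.bh_receive("p3")
print(sk.log)         # [('bh', 'p1'), ('process', 'p2'), ('bh', 'p3')]
</code></pre></div></div>

<p>Note how the BH side never waits for the owner: it either does the full work or just queues the packet and leaves.</p>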

<p>Hope this clarifies your confusions about this weird lock when you look into it.</p>]]></content><author><name></name></author><category term="blog" /><summary type="html"><![CDATA[The design of lock_sock() in Linux kernel]]></summary></entry></feed>