I had all the skills broken down into bite-size pieces and the structure laid out in a sensible way, but the agent would often improvise. When I pressed the agent on why it wasn’t following the skills, I would usually hear something like:
“The skills are clearly written, I just didn’t follow them well. I skimmed them, and then improvised. I should have done x, y, z. Should I redo it?”
The solution wasn’t to write clearer skills. It was to codify correctness so it was enforceable.
Backend Gates FTW
The failures were happening because there was no way for me to force the agent to follow the skills or redirect the agent if it made a misstep during the procedure. Backend gates help address this by doing things like:
Only give the agent the initial step. It can’t complete its full task without a backend request for the next step.
Don’t allow writing to the database without an access token.
Disallow writes that don’t have required data.
In combination, these form powerful guardrails to keep your agent on track.
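Here is a minimal sketch of how the three gates listed above might fit together. It is written in TypeScript for illustration and is not the plugin's actual WordPress code: `ProcedureGate`, `current`, `submit`, and the token format are all hypothetical names, and the token is an in-memory stand-in for whatever a real server would issue.

```typescript
// Hypothetical sketch: every name here is illustrative, not the plugin's real API.
interface Step {
  id: string;
  instructions: string;
  requiredFields: string[];
}

type GateResponse =
  | { ok: true; next: Step | null; token: string | null }
  | { ok: false; error: string };

class ProcedureGate {
  private steps: Step[];
  private position = 0;
  private issued = 0;
  private token: string;

  constructor(steps: Step[]) {
    this.steps = steps;
    this.token = this.issueToken();
  }

  // Stand-in for a real token; a server would issue a random or signed value.
  private issueToken(): string {
    this.issued += 1;
    return "token-" + this.issued;
  }

  // Gate 1: the agent only ever sees the current step, never the full procedure.
  current(): { step: Step | null; token: string } {
    const done = this.position >= this.steps.length;
    return { step: done ? null : this.steps[this.position], token: this.token };
  }

  submit(token: string, data: Record<string, unknown>): GateResponse {
    if (this.position >= this.steps.length) {
      return { ok: false, error: "Procedure already complete." };
    }
    // Gate 2: no write without the token issued for the current step.
    if (token !== this.token) {
      return { ok: false, error: "Invalid or stale token. Request the current step again." };
    }
    // Gate 3: reject writes that are missing required data, and say what is missing.
    const step = this.steps[this.position];
    const missing = step.requiredFields.filter((field) => data[field] === undefined);
    if (missing.length > 0) {
      return { ok: false, error: "Missing required fields: " + missing.join(", ") };
    }
    // Only a valid write unlocks the next step and rotates the token.
    this.position += 1;
    this.token = this.issueToken();
    const done = this.position >= this.steps.length;
    return {
      ok: true,
      next: done ? null : this.steps[this.position],
      token: done ? null : this.token,
    };
  }
}
```

The point of the design is that the agent cannot skim and improvise: the next instruction only exists for it after the backend has accepted the previous write.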
Applying Backend Gates for Grocery Shopping
For my case, I wanted the grocery shopping procedure to be split into concrete tasks on smaller single-responsibility agents that can operate in parallel:
Item Resolution Agent
Receives the data for what stores to shop at, and what item needs to be searched for. Then it can:
Use the store’s API to search for the item
Collect the results
Hand them to the parent orchestrator
The agent would discard relevant results, not search thoroughly when it received lackluster results, or not write images.
Backend Gate
To address this, I now require it to write its results to the database in a transient database row (another default WordPress feature that majorly helps with AI infrastructure) with the list id, item id and shop cycle id. Now, when it writes its results, if it doesn’t satisfy the backend’s requirements, then it receives instructions on how to improve its results.
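As a rough sketch of that feedback loop, the gate can be modelled as a write function that either stores the results or returns instructions. The post does not show the real requirements, so the field names, the two thresholds, and the in-memory `transientStore` (standing in for a WordPress transient, which would presumably be written with something like `set_transient()`) are all assumptions made for illustration.

```typescript
// Hypothetical shapes; the plugin's real fields are not shown in the post.
interface StoreResult {
  store: string;
  title: string;
  price: number;
  imageUrl?: string;
}

interface ResolutionWrite {
  listId: number;
  itemId: number;
  shopCycleId: number;
  queriesTried: string[];
  results: StoreResult[];
}

type WriteOutcome =
  | { accepted: true; key: string }
  | { accepted: false; instructions: string[] };

// In-memory stand-in for a transient row keyed by list, item, and shop cycle.
const transientStore: Record<string, ResolutionWrite> = {};

// Assumed thresholds, chosen only for this example.
const MIN_RESULTS = 3;
const MIN_QUERIES_WHEN_SPARSE = 2;

function writeResolution(write: ResolutionWrite): WriteOutcome {
  const instructions: string[] = [];

  // Lackluster results are only acceptable after the agent has tried more than one query.
  const sparse = write.results.length < MIN_RESULTS;
  if (sparse && write.queriesTried.length < MIN_QUERIES_WHEN_SPARSE) {
    instructions.push(
      "Only " + write.results.length + " result(s) found. Retry with a broader or alternate query before writing."
    );
  }

  // Every result must carry an image.
  const missingImages = write.results.filter((r) => !r.imageUrl).length;
  if (missingImages > 0) {
    instructions.push(missingImages + " result(s) have no image. Include imageUrl for every result.");
  }

  // Reject with guidance instead of failing silently.
  if (instructions.length > 0) {
    return { accepted: false, instructions: instructions };
  }

  const key = ["resolution", write.listId, write.itemId, write.shopCycleId].join("_");
  transientStore[key] = write;
  return { accepted: true, key: key };
}
```

Returning instructions rather than a bare error is what lets the agent correct itself on the next attempt.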
Writing the Selection
When the orchestrator wrote the final selection, it often discarded results or wrote only the final selection instead of offering alternates. Or it capped alternates at 3 when there were really 10 valid alternates.
Backend Gate
To improve results, I added these backend gates:
Must include alternates array or pass a flag and reason there are no alternates
Must include images or a flag and reason why there are no images
Must include at least 3 items per store or flag and reason there are fewer than 3 per store
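The three rules above share one pattern: provide the data, or provide a flag with a reason. A minimal sketch of that validation follows. The payload shape and names such as `Waiver` and `validateSelection` are hypothetical, and counting the selection plus its alternates toward the per-store minimum is an assumption.

```typescript
// Hypothetical payload shape for the orchestrator's final write.
interface Candidate {
  store: string;
  title: string;
  price: number;
  imageUrl?: string;
}

// The "flag and reason" escape hatch: present only when the agent explains itself.
interface Waiver {
  reason: string;
}

interface SelectionWrite {
  selection: Candidate;
  alternates?: Candidate[];
  noAlternates?: Waiver;
  noImages?: Waiver;
  fewerThanThreePerStore?: Waiver;
}

function hasReason(waiver: Waiver | undefined): boolean {
  return waiver !== undefined && waiver.reason.trim().length > 0;
}

// Returns a list of problems; an empty list means the write passes the gate.
function validateSelection(write: SelectionWrite, stores: string[]): string[] {
  const errors: string[] = [];
  const alternates = write.alternates || [];
  const all = [write.selection].concat(alternates);

  if (alternates.length === 0 && !hasReason(write.noAlternates)) {
    errors.push("Include an alternates array, or set noAlternates with a reason.");
  }

  const withoutImage = all.filter((c) => !c.imageUrl).length;
  if (withoutImage > 0 && !hasReason(write.noImages)) {
    errors.push(withoutImage + " item(s) lack images. Add them, or set noImages with a reason.");
  }

  // Assumption: the selection and its alternates both count toward the minimum.
  const shortStores = stores.filter((s) => all.filter((c) => c.store === s).length < 3);
  if (shortStores.length > 0 && !hasReason(write.fewerThanThreePerStore)) {
    errors.push("Fewer than 3 items for: " + shortStores.join(", ") + ". Add more, or give a reason.");
  }

  return errors;
}
```

Requiring a written reason makes skipping a deliberate act: the agent can still report that a store has only one match, but it can no longer drop alternates without saying so.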
With how significant the improvements are, I highly recommend using server-side gates to improve agents’ reliability.
Over the past month, I’ve been building a WordPress site as a data-brain layer connected to Claude in a NanoClaw container. My goal is a plugin with a group of skills and endpoints that can:
meal plan,
build a grocery list
shop the list from multiple stores
add items to store cart(s)
Since starting the project, I’ve dramatically evolved my thoughts on the future of the UX of AI interfaces and its practical applications to everyday life.
Why not Chat-only?
Chat is great for general information, but reviewing, updating, and finalizing a plan in chat is tedious and error prone. I started the plugin with a chat-powered approach but found myself in a mundane loop of things like:
“Remove red peppers. I already have Paper towels. I only need 1 pack of beef.”
Then re-review all its text output… again…
I knew I had to find a better way. How could I review the output more easily, provide quick edits to the output, and provide context to the agent quickly?
I started out thinking I was building a meal-planning and grocery-shopping tool. Now, I think I’m building an example of how agentic software should work: structured context, constrained abilities, and easy-to-review, easy-to-edit outputs.
The result: an interface where the agent and I can build a list together. The agent gets you close, and the interface lets you quickly make adjustments.
A collaborative shopping list for you and the Agent
Plugin Overview
A meal-planning and weekly grocery list solution to make your life easier.
It uses WordPress as a context layer to provide an agent with all the data and abilities it needs to:
choose meals you’ll like
decide on grocery items you’ll need
shop from your preferred stores
1. Meal Planning
Start by asking the agent for a meal plan. I have my WordPress site managed in a NanoClaw container by Claude, and I interact with it via Telegram.
I ask it to build me a meal plan for the week, and the agent accesses my site to see what my family’s preferences are, what our schedule is, etc. Information is saved in WordPress post content and metadata.
It has examples from previous weeks, and a list of recipes we like (also WordPress posts, but it could be connected to any Recipe API).
List Building
After the Meal Plan is set, it generates a grocery list from the meals and checks my past orders for recurring items and household items.
This list gets saved as a WordPress post that polls for updates so the Agent and I can collaboratively build the grocery list.
The UX layer has been really interesting. I’ve found I needed a quick way to fix AI’s mistakes and provide context. The AI gets me a rough draft, and I can quickly edit it for final approval.
The initial list created by the agent, ready for me to edit
You can:
edit item quantities
add items
delete items
choose which store(s) you want to shop at
add notes to an item for more context (“Buy organic if less than $1 more expensive than the cheapest option”)
filter the list by meal, aisle, store, or unshopped items
This UX layer is where I see the human layer of AI evolving very quickly in the coming years. There are no established patterns for how to interact with AI in the most useful way. Building this plugin is rapidly evolving my thoughts on the UX of AI beyond chat interfaces.
Grocery Shopping
After you have your list where you want it, you can have your agent shop the list for you, streaming its updates along the way.
Once it has shopped your list, it returns the item titles, prices, images, and a quick-change dropdown to select an alternate related item.
Selecting an alternate item for Teriyaki.
The UX is all about quick fixes and edits to the agent’s output. We know the agent isn’t right 100% of the time. It’s close, but not perfect. So, we let it get close, and provide a quick way to make it 100% right.
When you’re done, you can send the items to the stores you’ve selected.
How is the plugin useful?
It hits the perfect duo for usefulness:
Saves time
Saves money
Every week I do a meal plan and start from a blank slate. It doesn’t need to be that way. Now that my WordPress site has all my preferences, the agent can access that context and can get me a very solid start. It has taken an annoying task and gotten me to a finished meal plan in less time and with less annoyance. That’s a win.
Same with building a shopping list. It knows all the items for a recipe. It can evaluate what I likely need and give me a chance to easily edit. Another win.
The really powerful thing is being able to shop multiple stores and build the best grocery store list for you. Imagine if you could ask the agent:
“Pick the store that has all the items for the cheapest total cost”
“Build me a grocery list with the cheapest items from each store”
“Prefer organic ingredients if they’re within 20% of the cost of the cheapest item”
It’s like having a personal grocery shopping assistant that knows you and your preferences, and gives you a quick way to review and fine tune the decisions.
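To make the arithmetic behind those three requests concrete, here is a small sketch. The `Offer` shape and the function names are hypothetical, and nothing here reflects how the plugin actually compares prices. "Within 20%" is read as: the organic price is at most 1.2 times the cheapest price for that item.

```typescript
// Hypothetical data shape: one priced search result for one item at one store.
interface Offer {
  store: string;
  item: string;
  price: number;
  organic: boolean;
}

// Picks the offer for a single item across all stores. Prefers the cheapest
// organic offer when it costs at most (1 + tolerance) times the cheapest
// offer overall; otherwise falls back to the cheapest offer.
function pickOffer(offers: Offer[], tolerance: number): Offer | null {
  if (offers.length === 0) return null;
  const byPrice = offers.slice().sort((a, b) => a.price - b.price);
  const cheapest = byPrice[0];
  const limit = cheapest.price * (1 + tolerance);
  for (const offer of byPrice) {
    if (offer.organic && offer.price <= limit) return offer;
  }
  return cheapest;
}

// Total cost of buying every item at one store, or null if the store lacks an item.
function storeTotal(offers: Offer[], store: string, items: string[]): number | null {
  let total = 0;
  for (const item of items) {
    const prices = offers
      .filter((o) => o.store === store && o.item === item)
      .map((o) => o.price);
    if (prices.length === 0) return null;
    total += prices.reduce((a, b) => Math.min(a, b));
  }
  return total;
}

// The store that carries every item for the lowest total, or null if none does.
function cheapestSingleStore(offers: Offer[], stores: string[], items: string[]): string | null {
  let best: string | null = null;
  let bestTotal = Infinity;
  for (const store of stores) {
    const total = storeTotal(offers, store, items);
    if (total !== null && total < bestTotal) {
      best = store;
      bestTotal = total;
    }
  }
  return best;
}
```

Calling `pickOffer` with a tolerance of `0` covers the "cheapest items from each store" request, and `0.2` covers the organic preference.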
Why WordPress?
WordPress provides a great foundation for building personal OS agents. The biggest issues with a system like OpenClaw are:
Security (whole computer access)
Difficult, technical setup
WordPress solves this with:
Secure container for your data
Easy server install
Plugin System for expanding agent skills
Agent Abilities API (in 7.0)
Consistent data structure
Custom API endpoints
Login/privacy layer
Plenty more 🙂
Building the Plugin on WordPress meant I got so many things “for free” and didn’t have to reinvent the wheel. It can be a containerized, secure, online endpoint for me and an agent to work together on. It’s a perfect use case.
What’s Next
I’m still finalizing how this can get released. It may end up as a public WordPress plugin, or get run on WordPress.com’s agent abilities. We’re not sure yet. Follow along to find out 🙂
I’ve been a jack of all trades web developer for nearly 15 years and moved to working full-time in hiring in September 2020. At Automattic, we’re hiring as fast as we can. This has given me a chance to work with and evaluate hundreds of software engineers over the past eight months.
The Day-to-Day: A Bit of Context
As a full-time hirer, my first responsibility is to kindly assess, mentor, and guide candidates through our hiring process to determine if they would be successful at Automattic. On the surface, it sounds like all I do is run interviews, grade code tests, and guide trials all day.
While those may be my primary responsibilities, the work comes across less as “grading” and more as:
Digging into the why behind someone’s code.
Becoming a PR Review expert.
Delivering kind, constructive feedback.
Implementing new ideas on how to accurately and more quickly evaluate candidates.
Formalizing/automating processes to make hiring more efficient.
Working on interesting technical challenges to improve our hiring toolkit.
Improving my own problem-solving and abstraction-design skills by challenging my assumptions of how code can and should be written.
Changing someone’s life by letting them know they got the job. 🎉
The Human Side of Engineering
Most engineering roles at Automattic are focused on coding in order to help people. In hiring, it’s a bit more meta: we are focused on identifying the best people to write code in order to help people.
Kramer can’t believe it.
Engineering hiring is an interdisciplinary role that allows you to think deeply about how to best evaluate and mentor your candidates. Even when a candidate might not be ready to be an Automattician today, most are eager, kind, and smart. Our goal is to help them learn through useful feedback and mentorship on how they can grow to be successful their next time around.
If you’re like me and have a hard time with giving direct feedback, then working in hiring is also a huge opportunity to grow as a person. I’ve learned more about how to confidently give effective, constructive, useful feedback in the last eight months than I have in my whole life.
Giving Effective Feedback
This should be its own post to dive into the details. Here are a few quick lessons learned:
Be positive, and focus on their ability to grow to meet the challenge.
Be direct. If the feedback isn’t clear and actionable, you can’t be sure it was understood. And if you can’t be sure it was understood, you didn’t really give any feedback.
Give feedback continuously. Don’t save it all for one big review at the end. The “official” feedback shouldn’t be a surprise – they should already have heard it before. This also allows people a chance to correct problematic areas before they become a bigger problem. It also shows you how they are at receiving and addressing feedback.
Set clear expectations. What does successfully addressing the feedback look like? How will they know they’re doing it better now?
What Makes a Good Developer?
This is something I’ve spent a lot of time thinking about recently. I think it’s different for each organization. For example, you can be a really technically skilled engineer and not get a job at Automattic. And you can be a great engineer for Automattic, and not be able to get a job at another big organization. I don’t think I have the technical skills to pass more algorithmic-focused code tests, but I’m a great match for the kinds of problems that need to be solved at Automattic.
So, more accurately, the question should be, “What makes a good developer at [insert organization name here].”
For us, a brief list looks like:
High self-motivation and drive
Technical ability
Fast learner
Great written communication
Can adapt and solve problems from many different areas, even if they’re not familiar with the language or codebase
Good self-management and dependability
Can give and receive feedback well, allowing for a lot of positive growth
There’s plenty more, but the main takeaway here is that technical ability is one of many pieces. It’s OK to not have the strongest technical ability as long as you have many of the others and, especially, if you can learn whatever you need to learn.
How do you Evaluate that?
Having a list of things we’re looking for is one thing, but how do you fairly and accurately determine someone has those qualities?
This is much harder to do than it seems, and relying on your own intuition of, “I know it when I see it,” is a great way to have a super-biased, toxic, homogenous company.
I wrote about Qualities of Expertise earlier this year as part of my learning around evaluating expertise. It’s hard, as the problems that require expertise to solve are open-ended, contextual, and don’t have a clear right/wrong answer.
The best way we’ve found to evaluate this is to have a 25 – 40 hour, paid trial where we give candidates a close-to-real-world codebase with some basic direction and let them solve it however they feel is best. It’s like a paid-mentorship and code camp where you might get a job offer in the end.
The great thing about the trial is that there are no right answers. There are multiple, valid approaches to everything. It’s less about the direction you choose, and more about:
How you arrived at your decision.
How you communicate your decision.
How you solve/implement it.
Does it work? Does it work well? How do you know?
There’s no accidentally passing the trial. If you receive an offer, it’s because you’ve been thoroughly vetted and deserve the offer.
The fast-paced learning and self-improvement is keeping me highly engaged in Engineering hiring. My next big step is learning more data and rubric development. I doubt I’ll run out of things to learn anytime soon.
If you’re a software developer looking to grow in these areas and have the opportunity to work in hiring, I highly recommend it.
Since joining the hiring team at Automattic in the fall of 2019, I’ve noticed different patterns and preferences on text-based interviews. Some of these are also general interviewing tips.
Send shorter messages
The pacing of the conversation is improved when you send multiple shorter messages instead of one multi-paragraph message. Answer as you go along. This gives more opportunity for your interviewer to drop pertinent emoji reactions and give feedback/redirect along the way.
Shorter messages = more engagement and faster feedback.
If someone starts down the wrong path on a question and answers once every five minutes in a wall of text, then the interviewer isn’t given an opportunity to redirect early on. The interviewee may unintentionally be cutting their interview short since they spent so long answering the wrong question. Also, it’s not a very engaging conversation if you only see a message once every five minutes.
Tip: It’s OK to type fast and correct typos/edit for clarity as you go.
Avoid Threads if possible
This is something I prefer. I watch that “_ is typing,” alert like a hawk. This is your cue to know if you’ll be interrupting someone or not.
It’s really helpful to know if you should wait for them to finish sending a message or if you should jump in with another message. In Slack, this alert only shows up if you’re typing in the main channel. The threads don’t show it, so you’re not sure if someone is typing.
Of course, there are some small comments that are better in threads. That’s fine. Keep threaded messages short if you need to use them. If you need to go back and discuss something, instead of using a thread, you can use a block quote for context:
Quoted text you want to talk about.
Can you clarify this comment? I’m not sure I understand what you’re asking.
In async conversations where you don’t need to know if someone is typing (most everything except interviews), then I 100% prefer threads. It’s all about context.
Show your thought process
So much of our interview and hiring process is about how you approach a problem rather than the final answer. Do you need more context to make a good decision or provide a good answer? Ask for it! Don’t make an assumption about it.
In the end, a lucky right answer without showing your thought process isn’t worth as much as a solid thought process that arrives at a different answer.
Don’t bother name dropping
We don’t have a checklist of frameworks or languages we want you to mention. It’s about how you solve and approach problems. If you demonstrate a strong learning ability, then we’re not worried about if you’ve worked with x, y, or z, since we know you can learn whatever you need to in order to get the job done.
Tell the story
If we ask a question like, “Tell me about a time when…,” act like you’re telling someone a story about it, because you are! Give context:
What was the project?
Who worked on it? What was your role?
Why did you solve the problem the way you solved it? Did you consider alternatives?
How did you know you were successful? Specific metrics are 🔥
In the end, it should sound like a concise story. This allows the interviewer to understand the issue and context so much more than if you only focus on the “What.”
Also, be prepared to answer follow-up questions related to your story. For example, that specific metric you mentioned earlier — how did you measure that?
It’s not that different
A few points of etiquette are different, but, in the end, the same rules apply to in-person interviews as text-only interviews. Most people have never been interviewed over text, and most people do great with it. If you practice telling your stories and can demonstrate a pattern of success, you’ll be fine, text-based or in-person.
Thanks to Thuy Copeland, Josh Betz, and Enej Bajgoric, my fellow interviewers at Automattic, for reviewing this post 🙌🏻
I just finished reading Code Simplicity, and, as someone who has a tendency for perfectionism, one thing that stood out to me was the idea of not worrying about building something perfect.
It’s OK to not aim for perfection on version one. Or any version. You don’t even know what perfection looks like when first starting a project. Instead, aim for a simple solution to your current problem, not a perfect solution to a future problem that may never exist.
If you keep striving for simplicity with each new addition, your system will gain as much complexity as it needs, while still maintaining enough flexibility and simplicity for the next addition.
Gaining expertise is difficult enough, but how can you tell if you (or someone else) have reached expert status? Unfortunately, multiple-choice tests are ineffective predictors since the benefit of expertise is being able to conjure up complex, situational-dependent solutions where there is no “correct” answer.
One of the simpler ideas is to see how long someone has been in an industry (say, 10+ years) and give them the benefit of the doubt. Surprisingly, for programmers at least, “programming experience gained in industry does not appear to have any effect whatsoever on quality and productivity.”1 Ironically, what does increase with experience is the confidence in your incorrect decisions.2
If you have the same one year of experience, ten years in a row, is that nine years of experience?3 Putting the time in is not necessarily effective. You need a strong feedback loop that allows you to hone your decision making and see what is effective,4 and an opportunity to vary and increase the difficulty and scope of your work.
To find expertise, we’ll need to look beyond the CV, using some generalized qualities across different domains.
Discrimination of subtle differences
The ability to make fine discriminations between similar, but not equivalent, cases is a defining skill of experts.
One of the most chill shows I watch is The Repair Shop. It’s a team of repair experts taking worn-down heirloom objects and returning them to their former glory. In one episode, art conservation expert Lucia Scalisi is shown restoring a damaged painting. Where I see an old painting with a hole in it, Scalisi sees:
how highly finished it is and what finish is used (and thus how to best clean it)
and surely much more they didn’t show/I missed.
An expert can get information from clues that the novice didn’t even realize existed.
However, it’s not simply about knowing the information. If a quick Google search can tell us the answer, then it doesn’t require expertise. The big piece is knowing how to apply the information to make successful, situational-dependent decisions.
Consistency
We all get lucky from time to time. Sometimes when the problem is a nail, we happen to have a hammer. Being able to consistently formulate a good solution to a variety of problems is the key here.
Situational-Dependent Solutions
I’ve mentioned “situational-dependent” several times now. What does that even mean?
As Andy Hunt emphasizes in Pragmatic Thinking & Learning: context is key.5 The same problem in a different context may require a different solution. Problems rarely have a one-size-fits-all solution. It requires expertise to correctly apply information within the right context.6
This means when encountering a new problem, an expert often asks important questions about the context. For example, the art restoration expert may ask where the painting has been stored, under what conditions, etc. I have a guess as to how some of these details could impact a painting, but I’m not sure how I could practically use that information. An expert does.
These details, which may seem unimportant to the novice, reveal nuances that can significantly inform the expert’s solution.
Assessing Expertise
Accurately assessing these qualities when hiring is easier said than done, and likely looks different across industries. However, the key qualities of expertise remain:
Ability to discriminate between subtle differences
Consistency
Situational-dependent solutions in the right context
If you are trying to assess expertise in an area where you don’t have much domain knowledge, Tyler Alterman has some great recommendations in their article, Why and how to assess expertise.
In many of my examples it’s easy to see the difference between a novice and an expert. Assessing someone proficient in a subject vs an expert can get tricky. In the end, if someone can consistently discriminate subtle differences in context-dependent situations to find the best solution, they’re well on their way to expertise.
As a joke (with some seriousness), I bought a glass-like impossible puzzle. It’s made of clear acrylic, has eight corners for extra trickiness, and it’s impossible to see which side is “up.” My wife is much better at puzzles than I am, and I thought this might finally be her match.
After an hour or two of combined work, we had 15 pieces together. It’s hard to even look at it for too long – your eyes don’t know how to focus on the edges and the pieces become a double-vision ghost of themselves.
My best success came from sorting pieces by the way the angles of the edges slanted. It wasn’t very fun to methodically go through each piece, and only mildly satisfying when two pieces snapped into place.
My wife gave in. Even with zero consequences for abandoning, I had a strong urge to push through to the end. It’s a problem to be solved. This drive to solve whatever problem is in front of me makes me a good programmer, but at times like this, the benefit is questionable.
What problems need solving, and which are we working on simply because they’re here in front of us? Which should we allow ourselves, guilt-free, to abandon?
Doing something like a puzzle, that @ghosthoney on TikTok described as, “too much work for a wrinkly version of an image I don’t really care about1,” isn’t inherently worthwhile. If you enjoy it or get satisfaction from it, go for it! But if it feels like a chore, then it’s OK to skip it. Leave the feeling of being a chore for things that are actually chores.
Starting doesn’t mean I need to finish. My friend Jonathan Vieker talks about how it’s better to stop reading a book once you understand the point, rather than slogging through to the end: “we’re best off using that time to read something that will benefit us.” With this in mind, I’m giving in and giving myself permission to simply be,2 and see what I become when I don’t attach my self-worth to my accomplishments.
That said, I won’t be surprised if someday I find myself, pausing with appreciation, as I place the final piece into a wrinkly, transparent rectangle.
I’m very proud of myself for working that line into this post.
For months, our bedroom light would not turn on. Well, at times it would, randomly illuminating the room whenever it saw fit. And at others, it would turn off without warning. As far as we could tell, there were no signs of ghosts.
We were having our attic turned into livable space and the contractors had recently put in subflooring. We figured they nicked a wire and that was causing the flickering.
I read up on how to find where the wire might be compromised, thinking through how to identify potential causes and learning electrical diagrams. All I can recall now is something about tracing the neutral using some metering device I don’t have.
A couple months later, we had an electrician come out for something else and asked him about it. He suggested we try changing the lightbulb. Really? That’s it?
He explained that CFL lightbulbs, the ones that look like the spiral staircase of some futuristic space habitation, have a wire inside them. The jolting of the nail gun on the subfloor installation right above it probably caused that wire connection to come loose. As the bulb heated up and cooled down, the expansion and contraction would cause the wire to connect then detach in a slow cycle. On and off.
He was right. We switched out the light bulb and it has been working as a light should ever since.
Next time, whether it’s a light or a web development project, I’ll try the simple fix first.
Since joining the hiring team at Automattic, I’ve been using the recruiting/hiring software Greenhouse to score Code Tests and evaluate trial candidates.
There are some things that have annoyed me a little about the site, so I wrote a few user scripts to improve my own time on Greenhouse.
A user script is JavaScript that runs on a site after it’s loaded. A regex pattern match determines which sites the script runs on, and then you can add your own JavaScript code to do whatever you want. 🙂
When you have a lot of text in a scorecard attribute, it won’t show you all of the content. The textarea is too small.
Before the user script.
So, to see the full overview of everything you have to manually drag the little corner handle to expand all the textareas. That can be a slow, repetitive task. So, I wrote a script to automatically expand all textareas to show the full content on page load.
After installing the user script, each textarea is expanded on load to show its full contents.
Much better! Now I can see the full overview as soon as I go to the page without needing to do any extra work.
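The core of a script like this is small. Here is a sketch of the resizing logic, written in TypeScript against a minimal interface so it does not depend on a browser to type-check. The names `Growable`, `expandToFit`, and `expandAll` are made up for this example and are not taken from the original script.

```typescript
// Minimal shape of what the logic needs from a textarea element.
// A real HTMLTextAreaElement satisfies this interface.
interface Growable {
  scrollHeight: number;
  style: { height: string; overflowY: string };
}

// Grow one element so its full content is visible without scrolling.
function expandToFit(el: Growable): void {
  // Reset first so scrollHeight reflects the content rather than the current box.
  el.style.height = "auto";
  el.style.overflowY = "hidden";
  el.style.height = el.scrollHeight + "px";
}

// Expand every element in an array-like collection (for example, a NodeList).
// Returns how many elements were resized.
function expandAll(els: ArrayLike<Growable>): number {
  for (let i = 0; i < els.length; i++) {
    expandToFit(els[i]);
  }
  return els.length;
}
```

In a user script, this would be compiled to JavaScript and called once the page has loaded, with something like `expandAll(document.querySelectorAll("textarea"))`.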
Show/Hide Code Test Scorecard Sections
There isn’t a way in Greenhouse to have separate scorecards for individual stages of the hiring process. So, any attributes you have on your interview scorecard will also show up on your code test scorecard, and so on.
Our code test has a lot of individual attributes that are only applicable for the code test. This clutters your interview scorecard with a lot of things that are unnecessary for that stage.
I wrote a user script that will look for the string “Code Test” in the Interview Plan/Scorecard title. If the scorecard title matches “Code Test,” then it will only show scorecard sections with “Code Test” in their titles.
Code test scorecard after the user script
This majorly cleans up the clutter across all scorecards.
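The matching rule can be sketched as a pure function plus a small loop that applies it. This is an illustration in TypeScript, not the original script: the names are hypothetical, and hiding “Code Test” sections on every other scorecard is an assumption based on the clutter described above.

```typescript
const MARKER = "Code Test";

// Decide whether a scorecard section should be visible.
// On a "Code Test" scorecard, only "Code Test" sections are shown.
// Assumption: on any other scorecard, "Code Test" sections are hidden.
function isSectionVisible(scorecardTitle: string, sectionTitle: string): boolean {
  const onCodeTest = scorecardTitle.indexOf(MARKER) !== -1;
  const isCodeTestSection = sectionTitle.indexOf(MARKER) !== -1;
  return onCodeTest === isCodeTestSection;
}

// Minimal shape of a section element, so the logic runs without a browser.
interface Section {
  title: string;
  style: { display: string };
}

// Show or hide each section based on the scorecard title.
function applyVisibility(scorecardTitle: string, sections: ArrayLike<Section>): void {
  for (let i = 0; i < sections.length; i++) {
    const visible = isSectionVisible(scorecardTitle, sections[i].title);
    sections[i].style.display = visible ? "" : "none";
  }
}
```

Keeping the decision in `isSectionVisible` separate from the DOM changes makes the rule easy to check on its own.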
I knew I needed to build some kind of survey to see if dropping the time limit from the code test would have any measurable impact on time spent or pressure. But I wasn’t sure if it should be anonymous or not.
On one hand, I assumed the data for an anonymous survey would be more reliable as people would be more honest. On the other, we could get more info about the outcomes of the candidate if we knew who sent it.
To figure out the best path, I asked myself two questions:
What am I measuring?
Are people more honest in an anonymous survey?
What am I measuring?
I wanted to see if removing the time limit had an impact on:
Time spent taking the test
Pressure felt from the test
In my instance, knowing the outcome of the test (did they pass or not, do they end up being hired, etc), did not influence either of those pieces of data. While that extra info would be interesting, it would not help me answer my core questions. As a result, I felt anonymous was the best choice.
Are people actually more honest in an anonymous survey?
This decision relied on my assumption that people were more honest in an anonymous survey. I figured someone had thought about and researched this before.
Someone had. The researchers designed a study where they knew whether people had cheated on a test, then asked them if they had cheated under confidentiality vs. anonymity. Under confidentiality, only 25% told the truth, while 75% told the truth under anonymity.
The really interesting (and funny) part is how they designed the study. Basically, they wanted to see if people would self-report cheating in a scenario where they could tell if a person had actually cheated or not. 😈
They told people they’d get $25 if they got a score better than 17/20 on a test with really difficult words. There was a dictionary amongst some books set out that the participant could access, but they didn’t mention this. They would know if the person cheated based on if the dictionary was moved or if a bookmark in it ended up in a different spot.
Then, the pièce de résistance:
In order to ensure that the words would be difficult enough to inspire cheating, we made up the last three words.
The whole study is quite clever and funny. It’s well worth a read.
Anonymous is Best for Honesty
In the end, I went with an anonymous survey because I needed to be able to trust the self-reported time and pressure results as much as possible. Anonymous surveys are more reliable in this sense, and the extra info gleaned from a confidential survey would not have helped me determine the core goal of the study.