Eugene Yan · 2026-05-03

How to Work and Compound with AI

Translations: Korean by DG Hong

How can we work effectively with AI? What’s the workflow, how does it scale, and how do we improve our systems over time? And ideally, it should compound. Every finished artifact (code, docs, analysis, decisions) becomes context for the next session. And each correction updates a config that reduces future errors. While I’m still learning, I’ve repeated my answers often enough that I’m writing them down here, so the next time I’m asked I can share a link instead.

If you use AI regularly, you likely already apply many of these practices. Nonetheless, I believe the underlying principles apply broadly: provide good context, encode your taste as config, make verification easy, delegate bigger tasks, and close the loop. If a practice does not fit, adapt the principle and invent your own. Also notice, as you read, that none of this is specific to AI. It’s simply how you onboard and work with any new collaborator.

• • •

Context as infrastructure

Help models navigate your context. For example, all my code lives in ~/src and all my knowledge work lives in ~/vault (organized into projects/, notes/, kb/, and so on). When our work is organized, it’s easier for the model to retrieve context using grep or glob. And with a clean directory tree, it’s more straightforward for the model to navigate the directories, find prior code, project docs, analysis, etc., and lean on them to improve the work being done.
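As a rough illustration of why layout matters, here is a minimal Python sketch of the glob-then-grep retrieval a model might do over a vault like ~/vault. The find_notes helper and its markdown-only default are assumptions for illustration, not how Claude Code actually searches.

```python
from pathlib import Path


def find_notes(root: Path, term: str, pattern: str = "*.md") -> list[Path]:
    """Glob for files under root, then keep those whose text contains term (grep-like)."""
    hits = []
    for path in sorted(root.rglob(pattern)):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        if term.lower() in text.lower():  # case-insensitive substring match
            hits.append(path)
    return hits
```

For example, find_notes(Path.home() / "vault", "eval") would list every markdown note that mentions evals, in path order. The tidier the tree, the fewer irrelevant hits come back.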

Connect models to your organization’s context. Models can benefit from organizational knowledge, which likely lives in Slack, Drive, Mail, etc. Most of these have MCPs (Model Context Protocol servers) for Claude Code, Cowork, and Claude.ai. On top of these, I also maintain an INDEX.md per project. It’s an annotated index of the relevant docs and channels; each entry includes the URL, owner, and a brief paragraph explaining what’s inside and when to read it. The annotation helps a lot. A bare list of URLs forces the model to open every link to figure out what’s relevant, wasting time and context. By annotating upfront, we do the heavy lifting once and store it in the index.
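A minimal sketch of what an annotated INDEX.md could look like, generated here from structured entries in Python. The field names, heading layout, and the render_index helper are hypothetical; what matters is that each entry carries the URL, owner, what’s inside, and when to read it.

```python
from dataclasses import dataclass


@dataclass
class IndexEntry:
    title: str
    url: str
    owner: str
    summary: str       # what's inside
    when_to_read: str  # when the model should bother opening it


def render_index(project: str, entries: list[IndexEntry]) -> str:
    """Render annotated entries as markdown (hypothetical INDEX.md layout)."""
    lines = [f"# {project} index", ""]
    for e in entries:
        lines.append(f"## {e.title}")
        lines.append(f"- URL: {e.url}")
        lines.append(f"- Owner: {e.owner}")
        lines.append(f"- What's inside: {e.summary}")
        lines.append(f"- Read when: {e.when_to_read}")
        lines.append("")
    return "\n".join(lines)
```

The "Read when" line is what lets the model skip most links without opening them.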

Onboard each new session like a new hire. With each new session, the model starts with a blank slate. Thus, it helps to treat the per-project CLAUDE.md like the onboarding doc we’d hand to a new teammate on day one. Claude scanned my per-project CLAUDE.md files and highlighted that they included glossaries for acronyms, project code names, and teammates with the same first name. I also have a suggested reading order in the CLAUDE.md, like telling the model to skim INDEX.md first, then TODOS.md, and finally specific topic notes.

Build your memory layer. By default, models don’t remember what happened in the last session, so anything worth persisting should be written to disk. I split my memory layer into two buckets. ~/vault holds facts such as project state, artifacts, and domain knowledge; ~/.claude (along with its CLAUDE.md, skills/, guides/) contains my preferences, workflows, and personal taste. The former provides context while the latter provides configuration.

Taste as configuration

Start with ~/.claude/CLAUDE.md. Claude reads this at the start of every session. I think of it as a behavioral contract. My CLAUDE.md contains preferences like how direct to be, when to push back, how to handle mistakes, what to teach me, etc. Here’s a trimmed version:

<behavior>
- Be direct and push back when you disagree; if my approach has problems, say so.
- When unsure about something, say you're unsure rather than guessing confidently.
- When something fails, investigate the root cause before retrying.
- Keep diffs scoped to the task: no drive-by reformats or unrelated refactors.
...
</behavior>

<teaching>
I'm always picking up new systems and domains. When a key term surfaces that I 
likely haven't internalized, explain it in 1-2 sentences and then move on. Format:

> 💡 followed by 1 - 2 sentence explanation
...
</teaching>

Scope it by directory: global, then repo, then project. Put preferences that apply everywhere (e.g., behavior, long-term goals, teaching) in ~/.claude/CLAUDE.md. Put conventions for a specific repo (e.g., linting, naming, pull requests) in the repo’s root. Put project-specific context (e.g., directory layout, domain knowledge) in the project directory. When you start Claude Code in a subdirectory, it walks up the tree and loads each CLAUDE.md it finds. And when the model navigates into a subdirectory mid-session, it picks up that directory’s CLAUDE.md too. More in the docs.
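To make the layering concrete, here is a Python sketch that approximates the walk-up lookup: starting from a directory, it collects every CLAUDE.md on the way to the filesystem root and puts the global file first. The claude_md_chain helper is hypothetical, it ignores the mid-session subdirectory pickup, and Claude Code’s actual loading and precedence rules are the ones in its docs.

```python
from pathlib import Path


def claude_md_chain(start: Path, global_dir: Path) -> list[Path]:
    """Global CLAUDE.md first, then each ancestor's CLAUDE.md from outermost to innermost."""
    found = []
    for directory in [start, *start.parents]:  # walk up toward the filesystem root
        candidate = directory / "CLAUDE.md"
        if candidate.is_file():
            found.append(candidate)
    found.reverse()  # outermost (e.g. repo root) first, innermost (project) last
    global_md = global_dir / "CLAUDE.md"
    if global_md.is_file() and global_md not in found:
        found.insert(0, global_md)
    return found
```

Reading the list top to bottom mirrors the global, repo, project scoping described above.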

When CLAUDE.md gets too long, split it out. A long CLAUDE.md can become a context tax: everything loads in every session, even if the session doesn’t need it. To fix this, refactor chunks into guides that load lazily. Don’t @import them (that just inlines them). Instead, have your CLAUDE.md tell the model to read them when relevant. This way, a session that’s building evals skips the guide on writing docs. Here’s an example guide section:

<guides>
- Docs, 1-pagers, any writing: ~/.claude/guides/writing.md
- Eval building and reports: ~/.claude/guides/evals.md
- Dashboards: ~/.claude/guides/dashboards.md
...
</guides>
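In practice the model reads the guides list and decides what to open, but a keyword-routing sketch in Python shows the effect: only the matching guides enter the context. The keyword table and the guides_for helper are hypothetical stand-ins for the model’s judgment.

```python
# Hypothetical keyword table mirroring the <guides> section above.
GUIDES = {
    ("doc", "1-pager", "writing"): "~/.claude/guides/writing.md",
    ("eval", "report"): "~/.claude/guides/evals.md",
    ("dashboard",): "~/.claude/guides/dashboards.md",
}


def guides_for(task: str) -> list[str]:
    """Return only the guides whose keywords appear in the task description."""
    task = task.lower()
    return [path for keywords, path in GUIDES.items() if any(k in task for k in keywords)]
```

An eval-building task gets only evals.md; a task matching nothing loads no guides at all.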

If you do something ≥ once a week, make it a skill. A skill is a markdown file with a name, trigger, and procedure that the model loads on demand. Think of skills as workflows written in markdown. They can include logic. For example, my /polish skill looks at the artifact diff. If it produces a metric, it runs the associated eval. If it renders in a browser, it checks the output via Claude in Chrome. If neither, it runs the code and reads the output or error. Skills encode both the steps and the judgment of which steps apply. A few of mine:

/polish: checks for bugs, simplifies the code, verifies the output (via evals, Claude in Chrome, or something else), iterates until there’s no critical feedback, and drafts the PR
/write: interviews me for the outline, spawns research subagents, writes the draft, critiques it via an adversarial critic, and iterates until there’s no critical feedback
/daily: reads my calendar, Slack, PRs, yesterday’s log, etc., and writes today’s priorities
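The /polish skill itself is markdown that the model interprets, not code. Still, its routing can be sketched in Python to show the judgment it encodes. Artifact and polish_plan are hypothetical names, and the step strings paraphrase the description above.

```python
from dataclasses import dataclass


@dataclass
class Artifact:
    produces_metric: bool     # e.g. a model or pipeline with an eval
    renders_in_browser: bool  # e.g. a dashboard or web page


def polish_plan(artifact: Artifact) -> list[str]:
    """Order the /polish steps, choosing verification by what the artifact is."""
    verify = []
    if artifact.produces_metric:
        verify.append("run the associated eval")
    if artifact.renders_in_browser:
        verify.append("inspect the output via Claude in Chrome")
    if not verify:  # neither applies: fall back to running it
        verify.append("run the code and read the output or error")
    return [
        "check for bugs",
        "simplify the code",
        *verify,
        "iterate until no critical feedback",
        "draft the PR",
    ]
```

A dashboard that also reports a metric gets both checks; a plain script falls through to running the code.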

I tend to keep SKILL.md small and focused on the workflow and routing. The knowledge, like templates and scripts, lives in separate files that the model reads or runs only when needed, just like lazy-loaded guides.

Bootstrap skills by doing the task once and then asking the model to make it a skill. This is how I build most skills. First, I do the task once, interactively, in a normal session. Then, I ask the model to turn what we just did into a skill. Next, I run the skill on the same or a similar task. Inevitably, I’ll need to correct the output, which I do in the same session so the feedback is logged in the session transcript. Finally, I ask the model to update the skill based on the corrections and feedback. You can also seed a skill with examples of the desired output. Ask the model to extract the patterns, like how you organize your code, or the structure and tone of your docs.

Refine skills via the transcript, not the file directly. The first version of a skill rarely works perfectly because it overfits the original session. This is normal. When you run it and need to update the output, correct it within the session. Try not to open and edit SKILL.md directly. Providing feedback in the session gives the model before-and-after pairs that accumulate in the transcript: here’s what we did, here’s what I wanted, and why. Once the output is right, ask the model to merge the feedback into the skill. After a few rounds, the skill converges and you barely have to edit the final output.

Nonetheless, not every task needs this context. For brainstorming, exploration, and rough drafts, I enjoy using simple mode (CLAUDE_CODE_SIMPLE=1 claude). Here, CLAUDE.md still loads but the agentic harness—hooks, skills, tool-heavy loops—doesn’t. This gets me closer to the model, which is what I want when I’m thinking out loud rather than shipping.

Verification for autonomy

Shift verification left; catch errors at write time. I think of verification as a ladder. The bottom is cheap and deterministic; the top is expensive and requires judgement. We want to address issues at the lowest possible rung. Near the bottom are post-edit hooks that run ruff format and ruff check --fix on the files the model just updated. These run deterministically and don’t cost tokens. Higher on the ladder are tests, evals, LLM reviews, etc.
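A minimal Python sketch of such a post-edit hook script. It assumes the hook receives a JSON payload on stdin with the edited path under tool_input.file_path, which matches my understanding of Claude Code’s hook input for Edit and Write but should be checked against the hooks reference; ruff_commands and main are hypothetical names.

```python
"""Sketch of a post-edit hook that formats and lints the file the model just wrote."""
import json
import shutil
import subprocess
import sys


def ruff_commands(hook_input: dict) -> list[list[str]]:
    """Return ruff commands for the edited file, or an empty list for non-Python files."""
    path = hook_input.get("tool_input", {}).get("file_path", "")
    if not path.endswith(".py"):
        return []
    return [["ruff", "format", path], ["ruff", "check", "--fix", path]]


def main() -> None:
    raw = sys.stdin.read()  # assumed: the hook payload arrives as JSON on stdin
    if not raw.strip() or shutil.which("ruff") is None:
        return  # nothing to do without a payload or without ruff installed
    for cmd in ruff_commands(json.loads(raw)):
        # Deterministic and token-free; lint failures don't block the session here.
        subprocess.run(cmd, check=False)
```

To use it, call main() under an if __name__ == "__main__": guard and register the script as a command hook on the PostToolUse event with a matcher such as Edit|Write.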

Make it easy for the model to verify the work. Give the model feedback loops to improve its output. If the system produces a metric, let the model run the eval and optimize it. If the output renders in a browser, let the model inspect it via Claude in Chrome. If neither, let the model run it and read the error. For example, when building Docker images, I let the model build, read the error, edit the Dockerfile, and rebuild. If I’m tuning a harness, the model runs evals, reads the transcripts, and fixes failures. When building a dashboard, the model checks in Chrome that tooltips render, labels don’t overlap, and the narrative matches the numbers.
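The build, read, edit, rebuild loop is easy to express as a bounded retry. In this Python sketch, build returns an error string or None, and fix stands in for the model editing files (say, the Dockerfile) in response; both callbacks and the attempt cap are assumptions.

```python
from typing import Callable, Optional


def iterate_until_clean(
    build: Callable[[], Optional[str]],
    fix: Callable[[str], None],
    max_attempts: int = 5,
) -> bool:
    """Build; on error, hand the error text to fix, then rebuild. True once a build succeeds."""
    for _ in range(max_attempts):
        error = build()  # None means the build succeeded
        if error is None:
            return True
        fix(error)  # e.g. the model edits the Dockerfile based on the error
    return False  # give up and surface the problem instead of looping forever
```

The cap matters: without it, a model stuck on an unfixable error burns time and tokens instead of asking for help.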

For long-running tasks, have models watch models. Long sessions can drift as errors build up. One fix is to run a secondary session with fresh context to read the original spec and the recent turns of the primary session. My minimal setup uses two tmux panes, one for the primary dev, one for the pair programmer. Initial instructions and follow-up prompts are appended to a shared file. Periodically, the pair programmer spins up, checks the spec against the primary’s recent transcript, and if something’s off, provides feedback to course correct.
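Here is a minimal Python sketch of the pair programmer’s input: the shared spec file plus the last few records of the primary session’s transcript. It assumes the transcript is JSONL (one JSON object per line) and makes no assumption about the record schema; the helper names, n, and the prompt wording are hypothetical. Scheduling the secondary session and routing its feedback back to the primary are left out.

```python
import json
from pathlib import Path


def recent_turns(transcript: Path, n: int = 20) -> list[dict]:
    """Return the last n records of a JSONL transcript (one JSON object per line)."""
    lines = transcript.read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines[-n:] if line.strip()]


def review_prompt(shared_spec: Path, transcript: Path, n: int = 20) -> str:
    """Build the prompt a fresh-context pair programmer would receive."""
    spec = shared_spec.read_text(encoding="utf-8")
    turns = "\n".join(json.dumps(t) for t in recent_turns(transcript, n))
    return (
        "You are reviewing another session's work.\n"
        f"## Spec and follow-up prompts\n{spec}\n"
        f"## Primary session's recent turns\n{turns}\n"
        "If the work has drifted from the spec, say what to correct."
    )
```

Because the reviewer starts from a fresh context each time, it sees only the spec and recent work, not the primary’s accumulated assumptions.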

We can do this in various ways. For example, the pair programmer can watch for execution drift—is the model doing the task right? This is local and tactical, like ignoring an error, reporting a bad metric, or diverging from the spec. There’s also direction drift—is the model doing the right task? These are bigger picture and strategic, and occur when the model misinterprets the original intent and spends hours building the wrong thing. Check for execution drift often and direction drift occasionally.
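One way to encode “often” versus “occasionally” is a simple turn-based schedule, sketched in Python below. The intervals of 5 and 25 turns are arbitrary assumptions to tune per task, and checks_due is a hypothetical helper.

```python
def checks_due(turn: int, execution_every: int = 5, direction_every: int = 25) -> list[str]:
    """Which drift checks to run at this turn: execution often, direction occasionally."""
    if turn <= 0:
        return []
    due = []
    if turn % execution_every == 0:
        due.append("execution")  # is the model doing the task right?
    if turn % direction_every == 0:
        due.append("direction")  # is the model doing the right task?
    return due
```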

Scaling via delegation

Delegate increasingly bigger chunks of work. Sometimes, we pair-program with models: short tasks, fast feedback, staying in the loop. This works well for fast iterations, exploratory analysis, and prototyping. But with increasingly strong models, we should aim to delegate bigger tasks. Explain your intent, constraints, and success criteria upfront, then let the model work. You can’t delegate what you can’t verify, so this requires first defining success criteria and metrics. The shift is from giving instructions one at a time to fleshing out plans and letting the model execute them end to end:

“Given these eval suites, build isolated containers per suite and smoke-test that each builds. Then, do the full run, log the eval metrics and transcripts, and use subagents to read the transcripts and confirm the evals ran correctly. Run each eval n times for confidence intervals. Finally, generate the report, verify it follows the report guide, and slack me the results and report URL.”
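For the “run each eval n times for confidence intervals” step, a minimal Python sketch of the interval calculation follows. It uses a normal approximation (z = 1.96 for roughly 95%); with only a handful of runs, a t-distribution critical value gives a wider, more honest interval. mean_ci is a hypothetical helper.

```python
import math
import statistics


def mean_ci(scores: list[float], z: float = 1.96) -> tuple[float, float, float]:
    """Mean and approximate 95% confidence interval (normal approximation) over n eval runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs for an interval")
    mean = statistics.fmean(scores)
    half_width = z * statistics.stdev(scores) / math.sqrt(len(scores))  # z * standard error
    return mean, mean - half_width, mean + half_width
```

With three runs scoring 0.7, 0.8, and 0.9, the mean is 0.8 and the half-width is about 0.113, which is a useful reminder of how noisy small n is.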

Run sessions in parallel and find the bottleneck. Delegating bigger tasks means we can run more at once. Claude says I typically run three to six sessions simultaneously. The bottleneck has shifted from doing the work to writing clear specs and reviewing outputs fast enough to keep the pipeline moving; the middle is hollowing out. If parallel sessions share a repo, use git worktrees so each session gets its own checkout and sessions don’t overwrite each other’s changes.
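A small Python sketch that builds one git worktree add -b <branch> <path> command per session, placing each checkout next to the main repo. The naming scheme (a <repo>-<session> directory and a branch named after the session) is an assumption, as is the worktree_commands helper.

```python
from pathlib import Path


def worktree_commands(repo: Path, sessions: list[str]) -> list[list[str]]:
    """One `git worktree add` per session: a sibling checkout on its own new branch."""
    cmds = []
    for name in sessions:
        path = repo.parent / f"{repo.name}-{name}"
        cmds.append(["git", "-C", str(repo), "worktree", "add", "-b", name, str(path)])
    return cmds
```

Running each command with subprocess.run(cmd, check=True) and starting a session in each path gives every session its own working tree; git worktree remove <path> cleans up afterward.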

Make sessions easy to observe. When running multiple sessions, I need to know their state and which one needs attention. On my Mac, a stop hook plays a sound when a session finishes (example below). My tmux window titles use a status emoji (⏳ working; 🟢 complete) and a short Haiku-generated label so I know what each pane is doing. The Claude Code status line shows context usage and the current mode. Together, the stop-hook sound signals a finished task, the tmux titles show which one, and the status line provides the details.

# Example stop hook alert
{
  "hooks": {
    "Stop": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "if command -v afplay >/dev/null 2>&1; then afplay -v 1.0 /System/Library/Sounds/Glass.aiff; else tput bel; fi"
          }
        ]
      }
    ]
  }
}
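The tmux side can be as simple as renaming the window whenever a session changes state. This Python sketch only builds the tmux rename-window -t <target> <name> call; the target format, the status names, and where the Haiku-generated label comes from are assumptions, and tmux_title_command is a hypothetical helper.

```python
STATUS = {"working": "⏳", "complete": "🟢"}


def tmux_title_command(target: str, status: str, label: str) -> list[str]:
    """Build a `tmux rename-window` call that prefixes the window name with a status emoji."""
    return ["tmux", "rename-window", "-t", target, f"{STATUS[status]} {label}"]
```

Passing the result to subprocess.run from a hook on session start and stop keeps the titles current without any manual bookkeeping.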

You can check in even if AFK. /remote-control in Claude Code makes this easy. While commuting or waiting in line, I open the code tab in the Claude app to see what’s running and what’s blocked, and if needed, unblock a stalled session with additional context or new instructions. This keeps sessions moving instead of sitting idle for hours. Only do this if there’s something urgent though, not when you’re trying to be present or touch grass.

Closing the loop

Keep the context rich by working in the open. When we do our work in shared docs, repos, and channels, it makes it easier for everyone—including models—to retrieve and benefit from the context. What we share today becomes part of the org context tomorrow. Try this simple test: could a new teammate replicate your work from last week using only the shared context? If yes, you’re contributing well to the org context; if not, that precious context is stuck in your head. I automate this somewhat via instructions in my CLAUDE.md to post short updates in a worklog channel whenever I finish a substantial task, with links to the artifact PR or doc.

Mine your transcripts for config updates. Have the model read past session transcripts to find gaps. When I scanned ~2,500 of my past user turns, a sizable percentage contained phrases like “can you also…”, “did you check…”, “still wrong”, etc. These suggest either that the model should have done something unprompted, meaning I should update the CLAUDE.md or a skill, or that a verification step is missing or broken. Hit counts show how often a correction happens, and the transcripts show exactly what failed. This is why I make corrections within the session: so I can use the transcript as input for my next CLAUDE.md or skill update.
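A rough Python sketch of the phrase scan: count the user turns that match each correction phrase and keep the matching turns as evidence for the next config update. The phrase list comes from the examples above; extracting user turns from the transcripts is assumed to have happened already, and correction_hits is a hypothetical helper.

```python
import re
from collections import Counter

# Correction phrases taken from the examples in the text; extend as patterns emerge.
CORRECTION_PATTERNS = {
    "can you also": re.compile(r"\bcan you also\b", re.IGNORECASE),
    "did you check": re.compile(r"\bdid you check\b", re.IGNORECASE),
    "still wrong": re.compile(r"\bstill wrong\b", re.IGNORECASE),
}


def correction_hits(user_turns: list[str]) -> tuple[Counter, dict[str, list[str]]]:
    """Count turns matching each correction phrase and keep the matching turns as evidence."""
    counts: Counter = Counter()
    examples: dict[str, list[str]] = {name: [] for name in CORRECTION_PATTERNS}
    for turn in user_turns:
        for name, pattern in CORRECTION_PATTERNS.items():
            if pattern.search(turn):
                counts[name] += 1
                examples[name].append(turn)
    return counts, examples
```

The counts say which gap to fix first; the examples are what you hand the model when asking it to update the CLAUDE.md or skill.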

Refactor and prune periodically. As configs grow, they can overlap or conflict with each other. As a result, if the model ignores a rule, it can be because another rule contradicts it. Fix this by refactoring periodically. Each rule or preference should live in exactly one place (though critical instructions can be repeated in the main CLAUDE.md). I also check for stray directory-level settings.json and consolidate them back into ~/.claude.
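Part of the pruning can be mechanical. This Python sketch flags bullet rules that appear verbatim (after lowercasing) in more than one config file, and lists settings.json files that live outside ~/.claude. Both helpers are hypothetical, and rules that contradict each other in different words still need a human or model read-through.

```python
from collections import defaultdict
from pathlib import Path


def duplicate_rules(config_files: list[Path]) -> dict[str, list[Path]]:
    """Map each bullet rule that appears in more than one file to the files containing it."""
    seen: dict[str, list[Path]] = defaultdict(list)
    for path in config_files:
        for line in path.read_text(encoding="utf-8").splitlines():
            rule = line.strip()
            if rule.startswith("- "):
                key = rule[2:].strip().lower()  # normalize case and whitespace only
                if path not in seen[key]:
                    seen[key].append(path)
    return {rule: paths for rule, paths in seen.items() if len(paths) > 1}


def stray_settings(root: Path, home_claude: Path) -> list[Path]:
    """Find settings.json files under root that are not inside home_claude."""
    return sorted(p for p in root.rglob("settings.json") if home_claude not in p.parents)
```

Each duplicate is a candidate to keep in exactly one place; each stray settings.json is a candidate to merge back into ~/.claude.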

• • •

While the specific setup will likely change as models get better, I think the principles will remain relevant: provide good context, encode your taste, make verification cheap, delegate more, and close the loop. What we’re doing is training a collaborator, one piece of feedback at a time. And if you think about it, these principles apply to how we work with a human team too.

To get started, have your model read this SETUP.txt and help you apply it. Also, I’d love to learn what practices or principles you’ve found valuable—please comment below or reach out!

p.s. This isn’t just about personal tooling. It’s also how you’d design agent harnesses, set team norms, and build org infrastructure. Try reading it again with those layers in mind.

If you found this useful, please cite this write-up as:

Yan, Ziyou. (May 2026). How to Work and Compound with AI. eugeneyan.com. https://eugeneyan.com/writing/working-with-ai/.

@article{yan2026default,
  title   = {How to Work and Compound with AI},
  author  = {Yan, Ziyou},
  journal = {eugeneyan.com},
  year    = {2026},
  month   = {May},
  url     = {https://eugeneyan.com/writing/working-with-ai/}
}


#ai-workflows #compound-growth #context-infrastructure #autonomous-verification #ai-delegation #taste-configuration