507 Tests Written by AI

"What happens if you don't write tests?"

Honestly, I didn't even know what test code was at first. If the code runs, it's fine, right? Click around in the browser, check the output, done. In 12 years of working in the recruiting industry, "test automation" was an interview question I asked candidates -- not something I ever expected to do myself.

But as features piled up, things got scary. Fix the JD parsing logic and resume analysis breaks. Modify the matching algorithm and the report format goes haywire. Fix one thing, three things break. Developers call these "regression bugs."

• • •

A Non-Developer's Testing Fear

Convince-X's Candidate Analyzer runs on a 3-stage AI pipeline: resume extraction → JD structuring → matching analysis. Each stage depends on the others, so a change in one triggers a chain reaction.
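To make the dependency chain concrete, here is a minimal sketch of a three-stage pipeline in JavaScript. The function names and data shapes (`extractResume`, `structureJd`, `matchCandidate`, `skills`, `required`) are hypothetical and are not taken from the Candidate Analyzer codebase. The real stages call AI models, which are replaced here by trivial string handling.

```javascript
// Hypothetical stand-ins for the three stages; the real ones call AI models.

// Stage 1: pull a skill list out of raw resume text.
function extractResume(resumeText) {
  const skills = resumeText
    .split(/[,\n]/)
    .map((s) => s.trim().toLowerCase())
    .filter(Boolean);
  return { skills };
}

// Stage 2: turn a JD into structured requirements.
function structureJd(jdText) {
  const required = jdText
    .split(/[,\n]/)
    .map((s) => s.trim().toLowerCase())
    .filter(Boolean);
  return { required };
}

// Stage 3: depends on the exact output shape of stages 1 and 2.
function matchCandidate(resume, jd) {
  const matched = jd.required.filter((skill) => resume.skills.includes(skill));
  const missing = jd.required.filter((skill) => !resume.skills.includes(skill));
  return { matched, missing };
}

// The full pipeline: any change to an earlier stage flows into the later ones.
function analyze(resumeText, jdText) {
  return matchCandidate(extractResume(resumeText), structureJd(jdText));
}

console.log(analyze("Python, SQL", "Python, Kubernetes"));
// matched: ["python"], missing: ["kubernetes"]
```

If stage 1 renamed `skills` to something else, stage 3 would fail even though nobody touched it. An automated test on `analyze` is what catches that kind of regression.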

In the early days, I manually tested every feature by opening the browser -- uploading a JD, uploading a resume, eyeballing the results. This worked with 10 features. Past 20, it became impossible.

      The Limits of Manual Testing

      10 features: Manual testing feasible (30 min)

      30 features: Manual testing takes half a day

      50+ features: Manual testing impossible -- edge cases get missed

      → Automated testing is essential

"Well, just write tests then" is the obvious response. But I didn't know how to write test code. Jest? Mocha? Assertions? For 12 years as a recruiter, "What's your test coverage?" was a question I asked candidates -- and suddenly I had to do it myself.

That's where AI re-enters the picture.

• • •

414 Security Tests

Security tests came first. The reason is simple: running a SaaS means handling customer data, and a security breach means game over.

There's a well-known list called the OWASP Top 10 -- the ten most critical security risks for web applications, maintained by the OWASP Foundation. I asked AI to "create tests for each OWASP Top 10 category against our codebase."

414 Security Tests -- Breakdown

Category	Count	What's Covered
XSS (Cross-Site Scripting)	127	Input sanitization, CSP headers, encoding
Auth/Session Security	98	Token-based auth, secure cookies, token refresh
Injection (SQL/NoSQL)	89	Input validation, DB query protection
CSRF / Rate Limiting / Other	100	Request forgery prevention, rate limits, security middleware

The first batch of AI-generated security tests contained about 280 tests. Running them revealed actual vulnerabilities: 3 endpoints that weren't properly sanitizing inputs and 2 routes missing CSRF validation. Building the tests caught real bugs.
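The kind of check that surfaces an unsanitized endpoint can be sketched roughly as follows. This is an illustration only: `escapeHtml`, `payloads`, and `findUnsafePayloads` are hypothetical names, and the project's real sanitizer and test suite are not shown in this post.

```javascript
// Hypothetical output-encoding helper; not the project's actual sanitizer.
function escapeHtml(input) {
  return String(input)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// A few classic payloads; a real suite would cover many more.
const payloads = [
  "<script>alert(1)</script>",
  '<img src=x onerror="alert(1)">',
  "'><svg onload=alert(1)>",
];

// Returns the payloads that still contain raw angle brackets or quotes
// after passing through the given sanitizer.
function findUnsafePayloads(sanitize) {
  return payloads.filter((p) => /[<>"']/.test(sanitize(p)));
}

console.log(findUnsafePayloads(escapeHtml).length); // 0: every payload was neutralized
console.log(findUnsafePayloads((s) => s).length);   // 3: a handler that forgot to sanitize
```

The second call models an endpoint that passes input through untouched, which is the sort of gap the first batch of tests exposed.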

That experience completely changed my perception of testing. Tests aren't "verification" -- they're a "discovery tool."

• • •

93 Functional Tests

After security came functional tests -- verifying the core features of Candidate Analyzer: JD parsing, resume analysis, and matching logic.

Functional tests were harder to build than security tests. Security tests have clear pass/fail criteria ("this input should be blocked"), but functional tests require defining "what result should come out of parsing this JD?" -- and that's hard.

Test Area	Count	What's Verified
JD Parsing	28	Required vs. preferred separation, YOE extraction, tech stack parsing
Resume Analysis	31	Career structuring, tech keyword extraction, project scope
Matching Logic	22	Strengths/weaknesses, risk flags, recommendation comments
API/Infrastructure	12	Endpoint responses, error handling, DB connectivity

This is where my domain expertise mattered. When AI was building "JD tech stack extraction tests," I provided real-world JD patterns from the field. AI can't easily distinguish between "Python experience preferred" and "Python required" on its own. 12 years of reviewing thousands of JDs directly improved the quality of test cases.
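A table-driven test is one way to capture this kind of field knowledge. The sketch below assumes a hypothetical rule-based `classifyRequirement` function standing in for the AI parsing step; the phrases and labels are illustrative and are not the project's actual test data.

```javascript
// Hypothetical rule-based classifier standing in for the AI parsing step.
function classifyRequirement(line) {
  const text = line.toLowerCase();
  if (/(preferred|nice to have|a plus|bonus)/.test(text)) return "preferred";
  if (/(required|must have|mandatory)/.test(text)) return "required";
  return "unknown";
}

// Each case pairs a JD phrase with the label a recruiter would expect.
const cases = [
  ["Python required", "required"],
  ["Python experience preferred", "preferred"],
  ["Must have 5+ years of backend experience", "required"],
  ["Kubernetes is a plus", "preferred"],
  ["Experience with AWS", "unknown"],
];

// Collect every case where the classifier disagrees with the expected label.
const failures = cases.filter(
  ([line, expected]) => classifyRequirement(line) !== expected
);

console.log(`${cases.length - failures.length}/${cases.length} cases passed`);
```

Adding a newly discovered JD pattern then means adding one row to `cases`, which is a task a domain expert can do without writing test logic.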

• • •

How AI Writes Tests

Many people imagine "AI writes tests" means pressing a button and getting perfect output. Reality is different.

Here's the actual process:

AI Test Writing Process

Step 1: Provide Code Context
Feed AI the source code, dependencies, and expected behavior of the module to test.

Step 2: AI Draft Generation
AI generates test case drafts, typically happy path + error cases + edge cases.

Step 3: Run + Debug
On the first run, only 40-60% pass. Feed failures back to AI for iterative fixes.

Step 4: Human Review
Add missing edge cases based on field experience. "Here's another JD pattern you missed."

Step 5: Full Suite Run
Verify new tests don't conflict with existing ones. Confirm all 507 pass.

The key is iteration in Step 3. Only about half of the tests in AI's first draft actually work. The rest fail because of wrong import paths, misconfigured mocks, or missed async handling. You send these errors back to AI for 2-3 rounds before the tests become stable.
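Missed async handling is the least obvious of these failure modes, because the broken test can look green. The sketch below uses a hypothetical `parseJd` function and a tiny hand-rolled `runTest` helper instead of a real framework, so it is an approximation of how a runner such as Jest treats a test, not its actual implementation.

```javascript
// Hypothetical async function under test: resolves to a parsed JD object.
async function parseJd(text) {
  return { title: text.split("\n")[0].trim() };
}

// Minimal stand-in for a test runner: a test passes unless it throws or rejects.
async function runTest(testFn) {
  try {
    await testFn();
    return "pass";
  } catch (err) {
    return "fail";
  }
}

// Broken draft: the promise is neither awaited nor returned, so the runner
// never sees the outcome of the check and the wrong expectation goes unnoticed.
function brokenTest() {
  parseJd("Backend Engineer\nPython required").then((jd) => {
    if (jd.title !== "Frontend Engineer") {
      console.log("check failed, but the runner cannot see it");
    }
  });
}

// Fixed version: awaiting the promise lets the wrong expectation fail the test.
async function fixedTest() {
  const jd = await parseJd("Backend Engineer\nPython required");
  if (jd.title !== "Frontend Engineer") {
    throw new Error(`unexpected title: ${jd.title}`);
  }
}

async function main() {
  console.log("broken:", await runTest(brokenTest)); // reports pass despite the wrong expectation
  console.log("fixed:", await runTest(fixedTest));   // reports fail, as it should
}
main();
```

Both tests contain the same deliberately wrong expectation. Only the version that awaits the promise reports it.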

Building 507 tests took about 2 weeks. Not pure coding time, but the result of this iterative process done a little each day.

Rather than AI "writing" tests, it's more accurate to say you "build" tests together with AI.

• • •

Why Deployments Aren't Scary

What changes when you have 507 tests?

Deployments aren't scary anymore. That's the biggest change.

Before tests, every deployment came with anxiety. "Will this break existing features?" I'd pray while hitting the deploy button. I actually had to roll back after production issues two or three times.

Now it's different. After modifying code, one npm test command runs all 507 tests in under 30 seconds. If anything fails, you know exactly what broke.

Deploying Without Tests

• 1+ hour of manual testing pre-deploy
• "Please don't break" prayers
• Production incidents → emergency rollbacks
• Weekend deploys = sleepless nights

Deploying with 507 Tests

• npm test → 30s full pass
• Broken parts caught instantly
• 0 production incidents
• Deploy with confidence, anytime

The peace of mind from 414 security tests is especially valuable. Every time I add a new API endpoint, I don't need to manually check "Is there an XSS vulnerability? Can authentication be bypassed?" -- the tests verify this automatically.

• • •

The Circular Validation Trap

Here's where I need to be honest.

AI-written code tested by AI. This structure has a fundamental limitation. When the same AI writes the code and validates it, both can be wrong in the same direction. This is called "circular validation."

      The Danger of Circular Validation

      If AI decides "Base64-encoding user input prevents XSS" and writes the code accordingly, the same AI's tests will only check "was Base64 encoding applied?" In reality, Base64 alone doesn't fully prevent XSS -- but the AI only validates its own assumptions.

      → When the creator and the verifier are the same, it's not real checks and balances.
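The Base64 scenario can be reproduced in a few lines. The functions `storeComment` and `renderComment` are hypothetical and exist only to illustrate the trap; the sketch assumes a runtime where `btoa` and `atob` are available as globals (browsers and recent Node.js versions).

```javascript
// Hypothetical flawed "defense": Base64-encode user input before storing it.
function storeComment(input) {
  return btoa(input);
}

// The page later decodes the stored value and inserts it into HTML as-is.
function renderComment(stored) {
  return `<p>${atob(stored)}</p>`;
}

const payload = "<script>alert(1)</script>";

// Test written from the same assumption as the code: it only checks
// that the stored value was transformed.
const circularTestPasses = storeComment(payload) !== payload;

// Independent test: checks what actually reaches the browser.
const independentTestPasses =
  !renderComment(storeComment(payload)).includes("<script>");

console.log({ circularTestPasses, independentTestPasses });
// circularTestPasses is true, independentTestPasses is false
```

The first check shares the code's assumption and passes. The second ignores how the defense works and looks only at the final output, which is why it catches the problem.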

After recognizing this problem, we introduced an AI cross-review system. A separate AI model independently reviews code changes. Every modification triggers multi-stage AI verification covering security, performance, and code quality.

Even this isn't perfect. The ultimate conclusion is that "real checks" can only come from a human. A key insight from the AI C-Suite system: "An AI wearing different hats is not real oversight. Real oversight can only come from the CEO."

• • •

607 Tests and Beyond

The test count that started at 507 has grown to 607. Each new feature -- the email sequence engine, candidate pipeline, i18n for 3 languages -- added more tests alongside it.

Total Tests: 607
Pass Rate: 100%
Production Incidents: 0
Full Run Time: ~30s

607 tests, 100% pass rate, and zero production incidents. The confidence these numbers provide is bigger than you'd expect.

But don't get intoxicated by the numbers. Even with 607 tests, there are areas they can't cover. Testing the "quality" of AI output remains inherently difficult. "Is this matching analysis correct?" is a question that ultimately requires human judgment.
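The gap between "the output is well-formed" and "the output is right" can be shown with a small sketch. The report fields (`strengths`, `weaknesses`, `comment`) and the `validateMatchReport` function are assumptions made for illustration, not the product's actual schema.

```javascript
// Hypothetical structural check for a matching-analysis result.
// Returns a list of problems; an empty list means the shape is valid.
function validateMatchReport(report) {
  const problems = [];
  if (!Array.isArray(report.strengths)) problems.push("strengths must be an array");
  if (!Array.isArray(report.weaknesses)) problems.push("weaknesses must be an array");
  if (typeof report.comment !== "string" || report.comment.trim() === "") {
    problems.push("comment must be a non-empty string");
  }
  return problems;
}

// Structurally valid, but the content contradicts itself.
// An automated shape check cannot notice that; a human reviewer can.
const plausibleButWrong = {
  strengths: ["10 years of Python"],
  weaknesses: ["No Python experience"],
  comment: "Strong match.",
};

console.log(validateMatchReport(plausibleButWrong).length);               // 0 problems
console.log(validateMatchReport({ strengths: [], comment: "" }).length);  // 2 problems
```

Automated tests handle the second case well. The first case is the one that still needs a person.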

That's why tests are a "safety net," not a "guarantee." What 607 tests tell you is "basic functionality and security work" -- not "the product is flawless." Maintaining that humility is important.

A non-developer building 607 automated tests with AI assistance. That's the reality of 2026. And this is just the beginning.
