Monday, July 21, 2008

Dependency does matter


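Right before every Milestone build, something big inevitably blows up. This time the most interesting one was XXListen suddenly dropping dead. With the deadline looming, all we could do was hastily back out the suspicious change and let it limp along. But debts always come due: as soon as the first build of this release came out, the final boss respawned.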


The program dies the moment it starts, and even the first-chance call stack leaves you scratching your head.

CommandLine: C:\Builds\XXA_15.1.1007_PrivateBuild \XXListen.exe

Symbol search path is: SRV*C:\Symbols*http://myfault.kicks-ass.net;C:\PrivateBuilds\Symbols

Executable search path is: C:\Builds\XXA_15.1.1007_PrivateBuild \

ModLoad: 00400000 004f5000 XXListen.exe


(c9c.24c): Break instruction exception - code 80000003 (first chance)

eax=00351eb4 ebx=7ffda000 ecx=00000004 edx=00000010 esi=00351f48 edi=00351eb4

eip=7c92120e esp=0022fb20 ebp=0022fc94 iopl=0 nv up ei pl nz na po nc

cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000202

ntdll!DbgBreakPoint:

7c92120e cc int 3

0:000> kv

ChildEBP RetAddr Args to Child

0022fb1c 7c95e612 7ffdf000 7ffda000 00000000 ntdll!DbgBreakPoint (FPO: [0,0,0])

0022fc94 7c94108f 0022fd30 7c920000 0022fce0 ntdll!LdrpInitializeProcess+0xffa (FPO: [Non-Fpo])

0022fd1c 7c92e437 0022fd30 7c920000 00000000 ntdll!_LdrpInitialize+0x183 (FPO: [Non-Fpo])

00000000 00000000 00000000 00000000 00000000 ntdll!KiUserApcDispatcher+0x7

0:000> .ecxr

Unable to get exception context, HRESULT 0x8000FFFF


It turns out the process died before even the CRT's exception handler had been installed; only ntdll!_except_handler3 is there.

0022fd0c: ntdll!_except_handler3+0 (7c92e900)

CRT scope 1, filter: ntdll!_LdrpInitialize+1d5 (7c95f0f6)

func: ntdll!_LdrpInitialize+1e6 (7c95f10c)

CRT scope 0, func: ntdll!_LdrpInitialize+249 (7c95f158)


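The outermost exception handler wrecks the crime scene too badly, so I had to stop before the exception was raised… After a great deal of trouble, I finally found that it dies on this line: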

0:000> p

eax=0ce590be ebx=0022f78c ecx=00000000 edx=000ce590 esi=656806d0 edi=00000000

eip=65634f25 esp=0022f774 ebp=0022f9a8 iopl=0 nv up ei pl zr na pe nc

cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000246

XXSOCK!XXSsockApp::InitInstance+0x155:

65634f25 68a0066865 push offset XXSOCK!g_ppszIPList (656806a0)

0:000> p

(6a0.5f4): Stack overflow - code c00000fd (first chance)

First chance exceptions are reported before any exception handling.

This exception may be expected and handled.

eax=00033018 ebx=00230000 ecx=00000002 edx=00000001 esi=002300d4 edi=00000004

eip=7c935401 esp=00033000 ebp=00033030 iopl=0 nv up ei pl zr na pe nc

cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00010246

ntdll!RtlpLocateActivationContextSection+0x150:

7c935401 56 push esi


Even a push can blow up??!! The stack is plainly readable and writable.

00030000 : 00031000 - 001ff000

Type 00020000 MEM_PRIVATE

Protect 00000004 PAGE_READWRITE

State 00001000 MEM_COMMIT

Usage RegionUsageStack

Pid.Tid 6a0.5f4


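Once I had calmed down and looked closely… wait a minute… I am single-stepping, so how did it jump several instructions at once? This means the push is not necessarily the culprit; it could just as well be one of the following lines that were skipped over and on which no breakpoint could be set.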


65634eeb 8b0da0066865 mov ecx,dword ptr [XXSOCK!g_ppszIPList (656806a0)]

65634ef1 83c428 add esp,28h


0:000> p

eax=0ce590be ebx=0022f78c ecx=37fdeebe edx=000ce590 esi=656806d0 edi=00000000

eip=65634eeb esp=0022f74c ebp=0022f9a8 iopl=0 nv up ei pl nz na pe nc

cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000206

TMSOCK!CTmsockApp::InitInstance+0x11b:

65634eeb 8b0da0066865 mov ecx,dword ptr [XXSOCK!g_ppszIPList (656806a0)] ds:0023:656806a0=00000000

0:000> p

eax=0ce590be ebx=0022f78c ecx=00000000 edx=000ce590 esi=656806d0 edi=00000000

eip=65634f25 esp=0022f774 ebp=0022f9a8 iopl=0 nv up ei pl zr na pe nc

cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000246

TMSOCK!CTmsockApp::InitInstance+0x155:

65634f25 68a0066865 push offset TMSOCK!g_ppszIPList (656806a0)


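So even WinDbg can't be trusted. Just as I was about to give OllyICE a try… fate turned out to be kind after all, because by this point the call stack looked far more respectable.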


00033030 7c93532a ntdll!RtlpLocateActivationContextSection+0x150

00033060 7c93528d ntdll!RtlpFindNextActivationContextSection+0x61

00033078 7c935571 ntdll!RtlpFindFirstActivationContextSection+0x41

000330c4 7c935cc9 ntdll!RtlFindActivationContextSectionString+0x8e

00033188 7c935b8a ntdll!RtlDecodeSystemPointer+0x9e7

000332f0 7c936748 ntdll!RtlDosApplyFileIsolationRedirection_Ustr+0x267

0003337c 7c936698 ntdll!LdrGetDllHandleEx+0xc9

00033398 7c80e524 ntdll!LdrGetDllHandle+0x18

000333e8 7c80e63b kernel32!GetModuleHandleForUnicodeString+0x1d

0003386c 7c80e4ec kernel32!BasepGetModuleHandleExW+0x18e

00033884 7c80b750 kernel32!GetModuleHandleW+0x29

00033890 0060c192 kernel32!GetModuleHandleA+0x2d

0003389c 0060c1e8 libNetCtrl!_decode_pointer+0x3f

000338a8 0060c30d libNetCtrl!__set_flsgetvalue+0x1e

000338b8 0060a6c8 libNetCtrl!_getptd_noexit+0x15

000338bc 0060e786 libNetCtrl!_errno+0x5

000338c4 0060c08b libNetCtrl!_get_winmajor+0x10

000338e0 0060c19d libNetCtrl!_use_encode_pointer+0x1b

000338e8 0060c1e8 libNetCtrl!_decode_pointer+0x4a

000338f4 0060c30d libNetCtrl!__set_flsgetvalue+0x1e


Hold on… this doesn't look much like where a call stack begins… where is the full call stack?

00033030 7c93532a ntdll!RtlpLocateActivationContextSection+0x150

00033060 7c93528d ntdll!RtlpFindNextActivationContextSection+0x61

00033078 7c935571 ntdll!RtlpFindFirstActivationContextSection+0x41

000330c4 7c935cc9 ntdll!RtlFindActivationContextSectionString+0x8e

00033188 7c935b8a ntdll!RtlDecodeSystemPointer+0x9e7

000332f0 7c936748 ntdll!RtlDosApplyFileIsolationRedirection_Ustr+0x267

0003337c 7c936698 ntdll!LdrGetDllHandleEx+0xc9

00033398 7c80e524 ntdll!LdrGetDllHandle+0x18

000333e8 7c80e63b kernel32!GetModuleHandleForUnicodeString+0x1d

0003386c 7c80e4ec kernel32!BasepGetModuleHandleExW+0x18e

00033884 7c80b750 kernel32!GetModuleHandleW+0x29

00033890 0060c192 kernel32!GetModuleHandleA+0x2d

0003389c 0060c1e8 libNetCtrl!_decode_pointer+0x3f

000338a8 0060c30d libNetCtrl!__set_flsgetvalue+0x1e

000338b8 0060a6c8 libNetCtrl!_getptd_noexit+0x15

000338bc 0060e786 libNetCtrl!_errno+0x5

000338c4 0060c08b libNetCtrl!_get_winmajor+0x10

000338e0 0060c19d libNetCtrl!_use_encode_pointer+0x1b

000338e8 0060c1e8 libNetCtrl!_decode_pointer+0x4a

000338f4 0060c30d libNetCtrl!__set_flsgetvalue+0x1e


000fe250 0060e786 libNetCtrl!_errno+0x5

000fe258 0060c08b libNetCtrl!_get_winmajor+0x10

000fe274 0060c19d libNetCtrl!_use_encode_pointer+0x1b

000fe27c 0060c1e8 libNetCtrl!_decode_pointer+0x4a

000fe288 0060c30d libNetCtrl!__set_flsgetvalue+0x1e

000fe298 0060a6c8 libNetCtrl!_getptd_noexit+0x15


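The two blocks above repeat tens of thousands of times, like walking in circles in a haunted maze. So that is what caused the stack overflow.

libNetCtrl!_errno shows up in there, so this looks like an error handler… Nice one: an error handler that raises an error of its own…

It reminds me of Dr. Watson crashing and then spawning another Dr. Watson, one falling after another.

(Its author, Don Corbitt, died in a plane crash many years ago. Out of respect I have to be honest: Dr. Watson has its small flaws, but it is still very useful.)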
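To make the repeating frames easier to picture, here is a minimal C++ sketch of a call cycle with the same shape. It is a toy model under stated assumptions: the names `ToyCrtState`, `decodePointer`, `getWinMajor` and `setErrno` merely echo the frames in the trace, the bodies are guesses rather than the real CRT source, and `maxDecodeCalls` is an invented cap that lets the model stop where a real stack would overflow.

```cpp
#include <string>
#include <vector>

// Toy model only. Names echo the frames in the stack trace; the bodies are
// guesses at the shape of the cycle, not the real CRT source.
struct ToyCrtState {
    bool osVersionKnown = false;     // assumed to be set by CRT start-up
    int decodeCalls = 0;             // how many times decodePointer was entered
    int maxDecodeCalls = 0;          // hypothetical cap so the model terminates
    std::vector<std::string> trace;  // functions entered, in call order
};

bool decodePointer(ToyCrtState& s);

// Reporting an error needs per-thread data, which needs a decoded pointer.
bool setErrno(ToyCrtState& s) {
    s.trace.push_back("setErrno");
    return decodePointer(s);
}

// Asking for the OS version fails before start-up has run, so it reports an error.
bool getWinMajor(ToyCrtState& s) {
    s.trace.push_back("getWinMajor");
    if (s.osVersionKnown) return true;
    return setErrno(s);
}

// Decoding a pointer first asks which OS version it is running on.
bool decodePointer(ToyCrtState& s) {
    if (++s.decodeCalls > s.maxDecodeCalls) return false;  // a real stack would overflow here
    s.trace.push_back("decodePointer");
    return getWinMajor(s);
}
```

With `osVersionKnown` set, a call to `decodePointer` returns after two entries in `trace`. With it left unset, every round adds the same three names again until the cap is hit, which is the pattern the two repeating blocks of frames show.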


Next I broke right at the entry point of this error handler, to see who raised the very first error and set off this disaster.

0022f4b4 0060a6c8 libNetCtrl!_getptd_noexit

0022f4b8 0060e786 libNetCtrl!_errno+0x5

0022f4c0 0060c08b libNetCtrl!_get_winmajor+0x10

0022f4dc 0060c126 libNetCtrl!_use_encode_pointer+0x1b

0022f4e4 0060c151 libNetCtrl!_encode_pointer+0x4a

0022f4ec 00616393 libNetCtrl!_encoded_null+0x7

0022f520 0060eaf4 libNetCtrl!__crtMessageBoxA+0xe

0022f544 00607c19 libNetCtrl!_NMSG_WRITE+0x162

0022f55c 006040e9 libNetCtrl!malloc+0x2f

0022f574 0060298b libNetCtrl!NetLocalMachine::LoadAdapters+0x19


Found it at last. But why malloc, of all things?

00607c01 33f6 xor esi,esi

00607c03 393598fa6100 cmp dword ptr [libNetCtrl!_crtheap (0061fa98)],esi

00607c09 8bfd mov edi,ebp

00607c0b 7518 jne libNetCtrl!malloc+0x3b (00607c25)

00607c0d e8206f0000 call libNetCtrl!_FF_MSGBANNER (0060eb32)

00607c12 6a1e push 1Eh

00607c14 e8796d0000 call libNetCtrl!_NMSG_WRITE (0060e992)

00607c19 68ff000000 push 0FFh


It turns out that libNetCtrl!_crtheap is 0, so execution falls into the error handling block right at the top of malloc.

0:000> dd libNetCtrl!_crtheap

0061fa98 00000000 00000000 00000000 00000000
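The guard visible in the disassembly (compare _crtheap with zero, and report an error if it is) can be sketched in a few lines of C++. This is only a toy stand-in, assuming a hypothetical `ToyCrt` struct with invented members `heapReady` and `notInitErrors`; it is not the CRT's code, and where the real malloc calls _FF_MSGBANNER and _NMSG_WRITE the sketch simply counts the failure.

```cpp
#include <cstddef>
#include <cstdlib>

// Toy stand-in for one module's private CRT state (hypothetical names).
struct ToyCrt {
    bool heapReady = false;  // plays the role of _crtheap != 0
    int notInitErrors = 0;   // how many times the guard fired

    // Plays the role of _CRT_INIT, which is run from the DLL entry point.
    void init() { heapReady = true; }

    // Plays the role of malloc: refuse to allocate before init() has run.
    void* allocate(std::size_t bytes) {
        if (!heapReady) {
            ++notInitErrors;  // the disassembly calls _FF_MSGBANNER / _NMSG_WRITE here
            return nullptr;
        }
        return std::malloc(bytes);
    }
};
```

Calling `allocate` on a fresh `ToyCrt` takes the error path; calling it after `init()` returns memory as usual. Each module that links its own copy of the CRT has its own such state, which is why the question becomes whether this particular DLL's initialization ever ran.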


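Why is it 0? The CRT source that ships with VC shows that once _DllMainCRTStartup has been called, the CRT heap belonging to that DLL should already have been created.

Which means… _DllMainCRTStartup was never called?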


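// Simplified for the sake of explanation. The difference between the double-underscore and single-underscore versions is irrelevant here; treat them as the same thing.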

#define _DllMainCRTStartup __DllMainCRTStartup

__declspec(noinline)

BOOL __cdecl

__DllMainCRTStartup(

HANDLE hDllHandle,

DWORD dwReason,

LPVOID lpreserved

)

{

if ( dwReason == DLL_PROCESS_ATTACH || dwReason == DLL_THREAD_ATTACH ) {


if ( retcode )

retcode = _CRT_INIT(hDllHandle, dwReason, lpreserved);


But calling _DllMainCRTStartup is the executable loader's job… Luckily, I once picked up a piece of pseudo code off the road, and according to it the loader works roughly like this:


NTLoadDLL(DLL)
{
    if (!AlreadyLoad(DLL)) {
        MapDLL(DLL);
        if (IsDLL()) {
            RecursivelyLoadDependencyDLL();
        }
        InsertDLLInitRoutineToList();
        if (All_Implicitly_Linked_DLL_Loaded()) {
            RunTheInitRoutineList();
        }
    }
}


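In other words, only once every implicitly linked DLL has been loaded are the DLLs' _DllMainCRTStartup routines (strictly speaking, the entry point recorded in each DLL's PE header) run one by one, as a batch.

libNetCtrl.dll is one of the DLLs that XXListen.exe links implicitly. When XXSock.dll loads it, the loader considers XXListen.exe to still have DLLs that are not fully loaded (libNetCtrl.dll itself being one of them), so it does not run libNetCtrl.dll's _DllMainCRTStartup. No Main means no _CRT_INIT, and no _CRT_INIT means no _crtheap to use.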
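The batching described above can be tried out with a small C++ model. It is a sketch, not the real loader: `ImportTable`, `mapModule` and `initOrder` are invented names, and the rule it encodes (walk the imports depth-first and queue a module's entry point only after its own imports are queued) is an assumption taken from the pseudo code in this post.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy model only: module name -> names listed in its import table.
using ImportTable = std::map<std::string, std::vector<std::string>>;

// Map `name` and everything it imports (depth-first), queueing each module's
// entry point only after its own imports have been queued.
static void mapModule(const ImportTable& imports, const std::string& name,
                      std::set<std::string>& mapped,
                      std::vector<std::string>& initList) {
    if (!mapped.insert(name).second) return;  // already mapped: nothing to do
    auto it = imports.find(name);
    if (it != imports.end())
        for (const auto& dep : it->second) mapModule(imports, dep, mapped, initList);
    initList.push_back(name);
}

// Returns the order in which this model would run the DLL entry points for `exe`.
std::vector<std::string> initOrder(const ImportTable& imports, const std::string& exe) {
    std::set<std::string> mapped;
    std::vector<std::string> initList;
    mapModule(imports, exe, mapped, initList);
    assert(!initList.empty());  // the EXE itself is always queued last
    initList.pop_back();        // drop it: only DLL entry points are run as a batch
    return initList;
}
```

Given `{{"XXListen.exe", {"DLL1", "DLL2"}}}` the model queues DLL1 before DLL2. Adding the entry `{"DLL1", {"DLL2"}}` flips the order to DLL2 and then DLL1, which is the only difference between the broken and the working configuration in this model.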


The relationship among these three projects can be understood through the following pseudo code:

XXListen.cpp:

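// Both functions are imported implicitly, through the DLLs' import libraries.
extern "C" void fun_in_DLL1(void);
extern "C" void fun_in_DLL2(void);

int main()
{
    fun_in_DLL1();
    fun_in_DLL2();
    return 0;
}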


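DLL1.cpp (this is XXSock.dll):

#include <windows.h>
#include <stdio.h>

typedef void (*_fun_in_DLL2)(void);

class _obj_in_DLL1 {
public:
    // Runs while DLL1 is being initialized, because obj_in_DLL1 is a global object.
    _obj_in_DLL1() {
        // DLL2 is resolved at run time, so DLL1's import table never mentions it.
        _fun_in_DLL2 fun_in_DLL2 = (_fun_in_DLL2)GetProcAddress(LoadLibraryA("DLL2.dll"), "fun_in_DLL2");
        fun_in_DLL2();
    }
} obj_in_DLL1;

extern "C" void fun_in_DLL1(void)
{
    printf("fun_in_DLL1\n");
}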


And DLL2.cpp (this is libNetCtrl.dll):

extern "C" void fun_in_DLL2(void)

{

printf("fun_in_DLL2\n");

malloc(65536);

}


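Then make XXListen depend on both DLL1 and DLL2, but never record the fact that DLL1 depends on DLL2.

That amounts to telling the loader that DLL1 and DLL2 do not depend on each other, so it may run their DllMain functions in either order.

If the loader runs DLL2's DllMain first, all is well; but the day it decides to run DLL1's DllMain first, the problem erupts.

Incidentally, going by that pseudo code I picked up off the road, the loader runs the DllMain functions in the order in which their DLLs appear in the Import Table: first listed, first run.

Even if that behavior never changes, one variable remains: how does the linker decide which of two DLLs that do not depend on each other goes first and which goes second?

In short, these two variables are the culprits that make this problem come and go.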
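To see why the order alone decides between a healthy start and a crash, the two cases can be replayed in a toy C++ model. Everything here is hypothetical: `ToyProcess`, `funInDll2`, `runEntryPoint` and `runInitList` are invented for illustration, and the only behavior modelled is that DLL1's initialization calls into DLL2, as in the pseudo code for the three projects.

```cpp
#include <map>
#include <string>
#include <vector>

// Toy process: tracks which modules have had their entry point run so far.
struct ToyProcess {
    std::map<std::string, bool> initialized;
    std::vector<std::string> events;

    // DLL2's exported function: only safe once DLL2's entry point has run.
    void funInDll2() {
        events.push_back(initialized["DLL2"] ? "fun_in_DLL2: ok"
                                             : "fun_in_DLL2: CRT not initialized");
    }

    // Entry point of one module. DLL1's global constructor calls into DLL2
    // before DLL1 itself counts as initialized.
    void runEntryPoint(const std::string& name) {
        if (name == "DLL1") funInDll2();
        initialized[name] = true;
    }

    // Run the queued entry points in the given order.
    void runInitList(const std::vector<std::string>& order) {
        for (const auto& name : order) runEntryPoint(name);
    }
};
```

Running `runInitList({"DLL1", "DLL2"})` records the "CRT not initialized" event, while `runInitList({"DLL2", "DLL1"})` records the "ok" event. Nothing else differs between the two runs.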

1 comment:

Homeless said...

Now I finally get it. So that's how it works.
Awesome!
--seamxr