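It seems that on the eve of every Milestone build, something big has to blow up. This time the more interesting one was XXListen suddenly falling over dead. Time was tight, so all we could do was urgently back out the suspicious change and keep it limping along. But debts always have to be repaid: as soon as the first build of this version came out, the demon was reborn.
The program died the moment it was launched, and even the first-chance call stack was baffling.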
CommandLine: C:\Builds\XXA_15.1.1007_PrivateBuild \XXListen.exe
Symbol search path is: SRV*C:\Symbols*http://myfault.kicks-ass.net;C:\PrivateBuilds\Symbols
Executable search path is: C:\Builds\XXA_15.1.1007_PrivateBuild \
ModLoad: 00400000 004f5000 XXListen.exe
…
(c9c.24c): Break instruction exception - code 80000003 (first chance)
eax=00351eb4 ebx=7ffda000 ecx=00000004 edx=00000010 esi=00351f48 edi=00351eb4
eip=7c92120e esp=0022fb20 ebp=0022fc94 iopl=0 nv up ei pl nz na po nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000202
ntdll!DbgBreakPoint:
7c92120e cc int 3
0:000> kv
ChildEBP RetAddr Args to Child
0022fb1c 7c95e612 7ffdf000 7ffda000 00000000 ntdll!DbgBreakPoint (FPO: [0,0,0])
0022fc94 7c94108f 0022fd30 7c920000 0022fce0 ntdll!LdrpInitializeProcess+0xffa (FPO: [Non-Fpo])
0022fd1c 7c92e437 0022fd30 7c920000 00000000 ntdll!_LdrpInitialize+0x183 (FPO: [Non-Fpo])
00000000 00000000 00000000 00000000 00000000 ntdll!KiUserApcDispatcher+0x7
0:000> .ecxr
Unable to get exception context, HRESULT 0x8000FFFF
It turned out that the process died before even the CRT's exception handler had been installed; the only handler present was ntdll!_except_handler3.
0022fd0c: ntdll!_except_handler3+0 (7c92e900)
CRT scope 1, filter: ntdll!_LdrpInitialize+1d5 (7c95f0f6)
func: ntdll!_LdrpInitialize+1e6 (7c95f10c)
CRT scope 0, func: ntdll!_LdrpInitialize+249 (7c95f158)
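Because the outermost exception handler had wrecked the scene too badly, I had to break in before the exception occurred... after a great deal of effort I finally found that it died on this line: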
0:000> p
eax=0ce590be ebx=0022f78c ecx=00000000 edx=000ce590 esi=656806d0 edi=00000000
eip=65634f25 esp=0022f774 ebp=0022f9a8 iopl=0 nv up ei pl zr na pe nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000246
XXSOCK!XXSsockApp::InitInstance+0x155:
65634f25 68a0066865 push offset XXSOCK!g_ppszIPList (656806a0)
0:000> p
(6a0.5f4): Stack overflow - code c00000fd (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
eax=00033018 ebx=00230000 ecx=00000002 edx=00000001 esi=002300d4 edi=00000004
eip=7c935401 esp=00033000 ebp=00033030 iopl=0 nv up ei pl zr na pe nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00010246
ntdll!RtlpLocateActivationContextSection+0x150:
7c935401 56 push esi
Even a push can fail??!! The stack is clearly readable and writable:
00030000 : 00031000 - 001ff000
Type 00020000 MEM_PRIVATE
Protect 00000004 PAGE_READWRITE
State 00001000 MEM_COMMIT
Usage RegionUsageStack
Pid.Tid 6a0.5f4
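Once I calmed down and looked carefully... wait... I was single-stepping, so why did it jump several instructions at once? This means the problem was not necessarily caused by that push; it could also have been caused by one of the instructions that were skipped over and on which no breakpoint could be set.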
65634eeb 8b0da0066865 mov ecx,dword ptr [XXSOCK!g_ppszIPList (656806a0)]
65634ef1 83c428 add esp,28h
0:000> p
eax=0ce590be ebx=0022f78c ecx=37fdeebe edx=000ce590 esi=656806d0 edi=00000000
eip=65634eeb esp=0022f74c ebp=0022f9a8 iopl=0 nv up ei pl nz na pe nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000206
TMSOCK!CTmsockApp::InitInstance+0x11b:
65634eeb 8b0da0066865 mov ecx,dword ptr [XXSOCK!g_ppszIPList (656806a0)] ds:0023:656806a0=00000000
0:000> p
eax=0ce590be ebx=0022f78c ecx=00000000 edx=000ce590 esi=656806d0 edi=00000000
eip=65634f25 esp=0022f774 ebp=0022f9a8 iopl=0 nv up ei pl zr na pe nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000246
TMSOCK!CTmsockApp::InitInstance+0x155:
65634f25 68a0066865 push offset TMSOCK!g_ppszIPList (656806a0)
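Not even WinDbg could be trusted. I was just about to try OllyICE when... heaven was kind to me after all, because the call stack at this point looked much more reasonable: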
00033030 7c93532a ntdll!RtlpLocateActivationContextSection+0x150
00033060 7c93528d ntdll!RtlpFindNextActivationContextSection+0x61
00033078 7c935571 ntdll!RtlpFindFirstActivationContextSection+0x41
000330c4 7c935cc9 ntdll!RtlFindActivationContextSectionString+0x8e
00033188 7c935b8a ntdll!RtlDecodeSystemPointer+0x9e7
000332f0 7c936748 ntdll!RtlDosApplyFileIsolationRedirection_Ustr+0x267
0003337c 7c936698 ntdll!LdrGetDllHandleEx+0xc9
00033398 7c80e524 ntdll!LdrGetDllHandle+0x18
000333e8 7c80e63b kernel32!GetModuleHandleForUnicodeString+0x1d
0003386c 7c80e4ec kernel32!BasepGetModuleHandleExW+0x18e
00033884 7c80b750 kernel32!GetModuleHandleW+0x29
00033890 0060c192 kernel32!GetModuleHandleA+0x2d
0003389c 0060c1e8 libNetCtrl!_decode_pointer+0x3f
000338a8 0060c30d libNetCtrl!__set_flsgetvalue+0x1e
000338b8 0060a6c8 libNetCtrl!_getptd_noexit+0x15
000338bc 0060e786 libNetCtrl!_errno+0x5
000338c4 0060c08b libNetCtrl!_get_winmajor+0x10
000338e0 0060c19d libNetCtrl!_use_encode_pointer+0x1b
000338e8 0060c1e8 libNetCtrl!_decode_pointer+0x4a
000338f4 0060c30d libNetCtrl!__set_flsgetvalue+0x1e
Wait... this doesn't look like the start of a call stack... where is the complete call stack?
00033030 7c93532a ntdll!RtlpLocateActivationContextSection+0x150
00033060 7c93528d ntdll!RtlpFindNextActivationContextSection+0x61
00033078 7c935571 ntdll!RtlpFindFirstActivationContextSection+0x41
000330c4 7c935cc9 ntdll!RtlFindActivationContextSectionString+0x8e
00033188 7c935b8a ntdll!RtlDecodeSystemPointer+0x9e7
000332f0 7c936748 ntdll!RtlDosApplyFileIsolationRedirection_Ustr+0x267
0003337c 7c936698 ntdll!LdrGetDllHandleEx+0xc9
00033398 7c80e524 ntdll!LdrGetDllHandle+0x18
000333e8 7c80e63b kernel32!GetModuleHandleForUnicodeString+0x1d
0003386c 7c80e4ec kernel32!BasepGetModuleHandleExW+0x18e
00033884 7c80b750 kernel32!GetModuleHandleW+0x29
00033890 0060c192 kernel32!GetModuleHandleA+0x2d
0003389c 0060c1e8 libNetCtrl!_decode_pointer+0x3f
000338a8 0060c30d libNetCtrl!__set_flsgetvalue+0x1e
000338b8 0060a6c8 libNetCtrl!_getptd_noexit+0x15
000338bc 0060e786 libNetCtrl!_errno+0x5
000338c4 0060c08b libNetCtrl!_get_winmajor+0x10
000338e0 0060c19d libNetCtrl!_use_encode_pointer+0x1b
000338e8 0060c1e8 libNetCtrl!_decode_pointer+0x4a
000338f4 0060c30d libNetCtrl!__set_flsgetvalue+0x1e
…
000fe250 0060e786 libNetCtrl!_errno+0x5
000fe258 0060c08b libNetCtrl!_get_winmajor+0x10
000fe274 0060c19d libNetCtrl!_use_encode_pointer+0x1b
000fe27c 0060c1e8 libNetCtrl!_decode_pointer+0x4a
000fe288 0060c30d libNetCtrl!__set_flsgetvalue+0x1e
000fe298 0060a6c8 libNetCtrl!_getptd_noexit+0x15
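The two blocks above repeat tens of thousands of times, like walking in circles in a haunted maze. So that is what caused the stack overflow.
libNetCtrl!_errno shows up in there, so this looks like an error handler... Nice one: an error handler that itself raises an error...
It reminded me of Dr. Watson crashing and then spawning another Dr. Watson, one after another.
(Its author, Don Corbitt, died in a plane crash long ago; out of respect I have to be honest: Dr. Watson has minor flaws, but it is still very useful.)
Next, I broke directly at the entry point of this error handler to see who caused the first error that set off this disaster: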
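The loop in the stack trace can be pictured with a toy model. The sketch below is only an illustration: the function names echo the frames above, but the bodies are invented and are not the CRT source. It assumes the error path needs errno, that errno lives in per-thread data, and that fetching the per-thread data fails while the CRT is uninitialized. The depth guard is artificial, so that the demo terminates instead of overflowing the stack.

```cpp
// Toy model only: names mimic the frames in the trace, bodies are invented.
bool g_crt_initialized = false;  // stands in for state set up by _CRT_INIT
int  g_depth = 0;                // how deep the recursion went
const int kMaxDepth = 1000;      // artificial guard; the real code has none

int* toy_errno();

// Needs per-thread data. When the CRT is not initialized it takes the
// error path, and the error path asks for errno.
int* toy_getptd() {
    static int storage = 0;
    if (g_crt_initialized) return &storage;
    if (++g_depth >= kMaxDepth) return &storage;  // stop the demo here
    return toy_errno();
}

// errno itself lives in the per-thread data, which closes the loop.
int* toy_errno() { return toy_getptd(); }

// Runs one lookup and reports how deep the recursion went.
int run_cycle(bool initialized) {
    g_crt_initialized = initialized;
    g_depth = 0;
    toy_getptd();
    return g_depth;
}
```

In this model `run_cycle(true)` returns 0, while `run_cycle(false)` only stops because of the guard, at depth 1000. Without the guard the recursion would continue until the stack ran out.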
0022f4b4 0060a6c8 libNetCtrl!_getptd_noexit
0022f4b8 0060e786 libNetCtrl!_errno+0x5
0022f4c0 0060c08b libNetCtrl!_get_winmajor+0x10
0022f4dc 0060c126 libNetCtrl!_use_encode_pointer+0x1b
0022f4e4 0060c151 libNetCtrl!_encode_pointer+0x4a
0022f4ec 00616393 libNetCtrl!_encoded_null+0x7
0022f520 0060eaf4 libNetCtrl!__crtMessageBoxA+0xe
0022f544 00607c19 libNetCtrl!_NMSG_WRITE+0x162
0022f55c 006040e9 libNetCtrl!malloc+0x2f
0022f574 0060298b libNetCtrl!NetLocalMachine::LoadAdapters+0x19
Finally found it, but why malloc of all things?
00607c01 33f6 xor esi,esi
00607c03 393598fa6100 cmp dword ptr [libNetCtrl!_crtheap (0061fa98)],esi
00607c09 8bfd mov edi,ebp
00607c0b 7518 jne libNetCtrl!malloc+0x3b (00607c25)
00607c0d e8206f0000 call libNetCtrl!_FF_MSGBANNER (0060eb32)
00607c12 6a1e push 1Eh
00607c14 e8796d0000 call libNetCtrl!_NMSG_WRITE (0060e992)
00607c19 68ff000000 push 0FFh
It turns out that "libNetCtrl!_crtheap" is 0, so execution fell into the error-handling block at the very beginning of malloc.
0:000> dd libNetCtrl!_crtheap
0061fa98 00000000 00000000 00000000 00000000
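Why is it 0? From the CRT source that ships with VC, we know that as long as _DllMainCRTStartup has been called, the CRT heap for that DLL should already have been created.
So... _DllMainCRTStartup was never called?
// Simplified for explanation. The difference between the double-underscore and single-underscore versions is irrelevant here, so treat them as the same.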
#define _DllMainCRTStartup __DllMainCRTStartup
__declspec(noinline)
BOOL __cdecl
__DllMainCRTStartup(
HANDLE hDllHandle,
DWORD dwReason,
LPVOID lpreserved
)
{
if ( dwReason == DLL_PROCESS_ATTACH || dwReason == DLL_THREAD_ATTACH ) {
…
if ( retcode )
retcode = _CRT_INIT(hDllHandle, dwReason, lpreserved);
Calling _DllMainCRTStartup is the executable loader's job... Luckily, I once picked up a piece of pseudo code off the street, which says the Loader works roughly like this:
NTLoadDLL(DLL)
{
    if (!AlreadyLoad(DLL)) {
        MapDLL(DLL);
        if (IsDLL()) {
            RecursivelyLoadDependencyDLL();
        }
        InsertDLLInitRoutineToList();
        if (All_Implicitly_Linked_DLL_Loaded()) {
            RunTheInitRoutineList();
        }
    }
}
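In other words, only once all implicitly linked DLLs have been loaded are the _DllMainCRTStartup routines of the DLLs (more precisely, the entry point recorded in each DLL's PE header) executed one by one as a batch.
libNetCtrl.dll is one of the DLLs that XXListen.exe links implicitly. When XXSock.dll loaded it, the Loader considered that XXListen.exe still had DLLs that were not fully loaded (libNetCtrl.dll being one of them),
so it did not run libNetCtrl.dll's _DllMainCRTStartup. No Main means no _CRT_INIT, and no _CRT_INIT means no _crtheap to use.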
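The behaviour described above can be sketched as a small simulation. This is a toy model built on the pseudo code, not the real loader: `Module`, `ToyLoader`, `simulate` and the `heap_ready` flag (standing in for `_crtheap != 0`) are all hypothetical names. The one assumption it encodes is that asking for a DLL that is already mapped simply hands it back without running its pending init routine.

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

// Toy model; every name here is hypothetical.
struct Module {
    bool mapped = false;         // image is mapped into the process
    bool heap_ready = false;     // stands in for _crtheap != 0
    std::function<void()> init;  // stands in for the DLL entry point
};

struct ToyLoader {
    std::map<std::string, Module> modules;
    std::vector<std::string> log;

    // Loading an already-mapped module just returns it. The pending init
    // routine is NOT run here.
    Module& load_library(const std::string& name) {
        Module& m = modules[name];
        m.mapped = true;
        return m;
    }

    // Process start-up: map every implicit import first, then run the
    // queued init routines as one batch, in the order given.
    void start_process(const std::vector<std::string>& imports) {
        for (const auto& name : imports) modules[name].mapped = true;
        for (const auto& name : imports) {
            Module& m = modules[name];
            if (m.init) m.init();
        }
    }
};

// Runs the scenario for a given init order and returns what "malloc" saw.
std::vector<std::string> simulate(const std::vector<std::string>& order) {
    ToyLoader loader;
    // DLL2's entry point creates its CRT heap.
    loader.modules["DLL2"].init = [&] { loader.modules["DLL2"].heap_ready = true; };
    // DLL1's entry point calls into DLL2 right away.
    loader.modules["DLL1"].init = [&] {
        Module& dll2 = loader.load_library("DLL2");
        loader.log.push_back(dll2.heap_ready ? "malloc ok" : "malloc: heap missing");
    };
    loader.start_process(order);
    return loader.log;
}
```

Calling `simulate({"DLL1", "DLL2"})` records "malloc: heap missing", because DLL2 was mapped but its entry point was still waiting in the batch when DLL1 called into it.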
The relationship among these three projects can be understood through the following pseudo code:
XXListen.cpp:
int main()
{
    fun_in_DLL1();
    fun_in_DLL2();
}
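DLL1.cpp (that is, XXSock.dll):
typedef void (*_fun_in_DLL2)(void);
class _obj_in_DLL1 {
public:
    _obj_in_DLL1() {
        // Runs while DLL1's entry point is initializing its global objects
        _fun_in_DLL2 fun_in_DLL2 =
            (_fun_in_DLL2)GetProcAddress(LoadLibraryA("DLL2.dll"), "fun_in_DLL2");
        fun_in_DLL2();
    }
} obj_in_DLL1;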
extern "C" void fun_in_DLL1(void)
{
printf("fun_in_DLL1\n");
}
and DLL2.cpp (that is, libNetCtrl.dll):
extern "C" void fun_in_DLL2(void)
{
printf("fun_in_DLL2\n");
malloc(65536);
}
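Then make XXListen depend on both DLL1 and DLL2, but do not record the fact that DLL1 depends on DLL2.
This amounts to telling the Loader that DLL1 and DLL2 do not depend on each other, so it may run either DllMain first.
If the Loader runs DLL2's DllMain first, all is well. But if one day it decides to run DLL1's DllMain first, the problem erupts.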
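Incidentally, judging by that pseudo code picked up off the street,
the Loader runs the DllMains in the order in which their DLLs appear in the Import Table, first listed, first run.
Even if that behavior never changes, one variable remains: how does the linker decide which of two mutually independent DLLs goes first and which goes second?
In short, these two variables are the culprits that make this problem come and go.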
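One possible way to take both variables out of the picture, offered as a sketch rather than as the fix actually applied here, is to stop calling into DLL2 from a global object's constructor and resolve the function on first use instead. By then all entry points have normally run. In the sketch below, `resolve_fun_in_DLL2` is a hypothetical stand-in for `GetProcAddress(LoadLibraryA("DLL2.dll"), "fun_in_DLL2")`, written as a plain function so the example stays portable. The counters exist only to make the behaviour observable.

```cpp
// Sketch only; every name below is hypothetical.
typedef void (*fun_t)();

int g_resolve_calls = 0;  // how many times the lookup ran
int g_fun_calls = 0;      // how many times the target was called

// Stand-in for the real fun_in_DLL2 exported by DLL2.
void fake_fun_in_DLL2() { ++g_fun_calls; }

// Stand-in for GetProcAddress(LoadLibraryA("DLL2.dll"), "fun_in_DLL2").
fun_t resolve_fun_in_DLL2() {
    ++g_resolve_calls;
    return &fake_fun_in_DLL2;
}

// Resolve lazily: the function-local static is initialized on the first
// call, not while DLL1's global objects are being constructed.
void call_fun_in_DLL2() {
    static const fun_t fn = resolve_fun_in_DLL2();
    if (fn) fn();
}
```

With this shape nothing is resolved until `call_fun_in_DLL2()` is first invoked, and later calls reuse the cached pointer. The other obvious option is to record DLL1's dependency on DLL2, so that the Loader knows about it.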