-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathpager.c
2787 lines (2772 loc) · 85.9 KB
/
pager.c
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
/*
* Symisc unQLite: An Embeddable NoSQL (Post Modern) Database Engine.
* Copyright (C) 2012-2013, Symisc Systems http://unqlite.org/
* Version 1.1.6
* For information on licensing, redistribution of this file, and for a DISCLAIMER OF ALL WARRANTIES
* please contact Symisc Systems via:
* or visit:
* http://unqlite.org/licensing.html
*/
/* $SymiscID: pager.c v1.1 Win7 2012-11-29 03:46 stable <[email protected]> $ */
#ifndef UNQLITE_AMALGAMATION
#include "unqliteInt.h"
#endif
/*
** This file implements the pager and the transaction manager for UnQLite (Mostly inspired from the SQLite3 Source tree).
**
** The Pager.eState variable stores the current 'state' of a pager. A
** pager may be in any one of the seven states shown in the following
** state diagram.
**
** OPEN <------+------+
** | | |
** V | |
** +---------> READER-------+ |
** | | |
** | V |
** |<-------WRITER_LOCKED--------->|
** | | |
** | V |
** |<------WRITER_CACHEMOD-------->|
** | | |
** | V |
** |<-------WRITER_DBMOD---------->|
** | | |
** | V |
** +<------WRITER_FINISHED-------->+
**
** OPEN:
**
** The pager starts up in this state. Nothing is guaranteed in this
** state - the file may or may not be locked and the database size is
** unknown. The database may not be read or written.
**
** * No read or write transaction is active.
** * Any lock, or no lock at all, may be held on the database file.
** * The dbSize and dbOrigSize variables may not be trusted.
**
** READER:
**
** In this state all the requirements for reading the database in
** rollback mode are met. Unless the pager is (or recently
** was) in exclusive-locking mode, a user-level read transaction is
** open. The database size is known in this state.
**
** * A read transaction may be active (but a write-transaction cannot).
** * A SHARED or greater lock is held on the database file.
** * The dbSize variable may be trusted (even if a user-level read
** transaction is not active). The dbOrigSize variables
** may not be trusted at this point.
** * Even if a read-transaction is not open, it is guaranteed that
** there is no hot-journal in the file-system.
**
** WRITER_LOCKED:
**
** The pager moves to this state from READER when a write-transaction
** is first opened on the database. In WRITER_LOCKED state, all locks
** required to start a write-transaction are held, but no actual
** modifications to the cache or database have taken place.
**
** In rollback mode, a RESERVED or (if the transaction was opened with
** EXCLUSIVE flag) EXCLUSIVE lock is obtained on the database file when
** moving to this state, but the journal file is not written to or opened
** to in this state. If the transaction is committed or rolled back while
** in WRITER_LOCKED state, all that is required is to unlock the database
** file.
**
** * A write transaction is active.
** * If the connection is open in rollback-mode, a RESERVED or greater
** lock is held on the database file.
** * The dbSize and dbOrigSize variables are all valid.
** * The contents of the pager cache have not been modified.
** * The journal file may or may not be open.
** * Nothing (not even the first header) has been written to the journal.
**
** WRITER_CACHEMOD:
**
** A pager moves from WRITER_LOCKED state to this state when a page is
** first modified by the upper layer. In rollback mode the journal file
** is opened (if it is not already open) and a header written to the
** start of it. The database file on disk has not been modified.
**
** * A write transaction is active.
** * A RESERVED or greater lock is held on the database file.
** * The journal file is open and the first header has been written
** to it, but the header has not been synced to disk.
** * The contents of the page cache have been modified.
**
** WRITER_DBMOD:
**
** The pager transitions from WRITER_CACHEMOD into WRITER_DBMOD state
** when it modifies the contents of the database file.
**
** * A write transaction is active.
** * An EXCLUSIVE or greater lock is held on the database file.
** * The journal file is open and the first header has been written
** and synced to disk.
** * The contents of the page cache have been modified (and possibly
** written to disk).
**
** WRITER_FINISHED:
**
** A rollback-mode pager changes to WRITER_FINISHED state from WRITER_DBMOD
** state after the entire transaction has been successfully written into the
** database file. In this state the transaction may be committed simply
** by finalizing the journal file. Once in WRITER_FINISHED state, it is
** not possible to modify the database further. At this point, the upper
** layer must either commit or rollback the transaction.
**
** * A write transaction is active.
** * An EXCLUSIVE or greater lock is held on the database file.
** * All writing and syncing of journal and database data has finished.
** If no error occured, all that remains is to finalize the journal to
** commit the transaction. If an error did occur, the caller will need
** to rollback the transaction.
**
**
*/
#define PAGER_OPEN 0
#define PAGER_READER 1
#define PAGER_WRITER_LOCKED 2
#define PAGER_WRITER_CACHEMOD 3
#define PAGER_WRITER_DBMOD 4
#define PAGER_WRITER_FINISHED 5
/*
** Journal files begin with the following magic string. The data
** was obtained from /dev/random. It is used only as a sanity check.
**
** NOTE: These values must be different from the one used by SQLite3
** to avoid journal file collision.
**
*/
static const unsigned char aJournalMagic[] = {
0xa6, 0xe8, 0xcd, 0x2b, 0x1c, 0x92, 0xdb, 0x9f,
};
/*
** The journal header size for this pager. This is usually the same
** size as a single disk sector. See also setSectorSize().
*/
#define JOURNAL_HDR_SZ(pPager) (pPager->iSectorSize)
/*
* Database page handle.
* Each raw disk page is represented in memory by an instance
* of the following structure.
*/
typedef struct Page Page;
struct Page {
/* Must correspond to unqlite_page */
unsigned char *zData; /* Content of this page */
void *pUserData; /* Extra content */
pgno pgno; /* Page number for this page */
/**********************************************************************
** Elements above are public. All that follows is private to pcache.c
** and should not be accessed by other modules.
*/
Pager *pPager; /* The pager this page is part of */
int flags; /* Page flags defined below */
int nRef; /* Number of users of this page */
Page *pNext, *pPrev; /* A list of all pages */
Page *pDirtyNext; /* Next element in list of dirty pages */
Page *pDirtyPrev; /* Previous element in list of dirty pages */
Page *pNextCollide,*pPrevCollide; /* Collission chain */
Page *pNextHot,*pPrevHot; /* Hot dirty pages chain */
};
/* Bit values for Page.flags */
#define PAGE_DIRTY 0x002 /* Page has changed */
#define PAGE_NEED_SYNC 0x004 /* fsync the rollback journal before
** writing this page to the database */
#define PAGE_DONT_WRITE 0x008 /* Dont write page content to disk */
#define PAGE_NEED_READ 0x010 /* Content is unread */
#define PAGE_IN_JOURNAL 0x020 /* Page written to the journal */
#define PAGE_HOT_DIRTY 0x040 /* Hot dirty page */
#define PAGE_DONT_MAKE_HOT 0x080 /* Dont make this page Hot. In other words,
* do not link it to the hot dirty list.
*/
/*
* Each active database pager is represented by an instance of
* the following structure.
*/
struct Pager
{
SyMemBackend *pAllocator; /* Memory backend */
unqlite *pDb; /* DB handle that own this instance */
unqlite_kv_engine *pEngine; /* Underlying KV storage engine */
char *zFilename; /* Name of the database file */
char *zJournal; /* Name of the journal file */
unqlite_vfs *pVfs; /* Underlying virtual file system */
unqlite_file *pfd,*pjfd; /* File descriptors for database and journal */
pgno dbSize; /* Number of pages in the file */
pgno dbOrigSize; /* dbSize before the current change */
sxi64 dbByteSize; /* Database size in bytes */
void *pMmap; /* Read-only Memory view (mmap) of the whole file if requested (UNQLITE_OPEN_MMAP). */
sxu32 nRec; /* Number of pages written to the journal */
SyPRNGCtx sPrng; /* PRNG Context */
sxu32 cksumInit; /* Quasi-random value added to every checksum */
sxu32 iOpenFlags; /* Flag passed to unqlite_open() after processing */
sxi64 iJournalOfft; /* Journal offset we are reading from */
int (*xBusyHandler)(void *); /* Busy handler */
void *pBusyHandlerArg; /* First arg to xBusyHandler() */
void (*xPageUnpin)(void *); /* Page Unpin callback */
void (*xPageReload)(void *); /* Page Reload callback */
Bitvec *pVec; /* Bitmap */
Page *pHeader; /* Page one of the database (Unqlite header) */
Sytm tmCreate; /* Database creation time */
SyString sKv; /* Underlying Key/Value storage engine name */
int iState; /* Pager state */
int iLock; /* Lock state */
sxi32 iFlags; /* Control flags (see below) */
int is_mem; /* True for an in-memory database */
int is_rdonly; /* True for a read-only database */
int no_jrnl; /* TRUE to omit journaling */
int iPageSize; /* Page size in bytes (default 4K) */
int iSectorSize; /* Size of a single sector on disk */
unsigned char *zTmpPage; /* Temporary page */
Page *pFirstDirty; /* First dirty pages */
Page *pDirty; /* Transient list of dirty pages */
Page *pAll; /* List of all pages */
Page *pHotDirty; /* List of hot dirty pages */
Page *pFirstHot; /* First hot dirty page */
sxu32 nHot; /* Total number of hot dirty pages */
Page **apHash; /* Page table */
sxu32 nSize; /* apHash[] size: Must be a power of two */
sxu32 nPage; /* Total number of page loaded in memory */
sxu32 nCacheMax; /* Maximum page to cache*/
};
/* Control flags */
#define PAGER_CTRL_COMMIT_ERR 0x001 /* Commit error */
#define PAGER_CTRL_DIRTY_COMMIT 0x002 /* Dirty commit has been applied */
/*
** Read a 32-bit integer from the given file descriptor.
** All values are stored on disk as big-endian.
*/
static int ReadInt32(unqlite_file *pFd,sxu32 *pOut,sxi64 iOfft)
{
unsigned char zBuf[4];
int rc;
rc = unqliteOsRead(pFd,zBuf,sizeof(zBuf),iOfft);
if( rc != UNQLITE_OK ){
return rc;
}
SyBigEndianUnpack32(zBuf,pOut);
return UNQLITE_OK;
}
/*
** Read a 64-bit integer from the given file descriptor.
** All values are stored on disk as big-endian.
*/
static int ReadInt64(unqlite_file *pFd,sxu64 *pOut,sxi64 iOfft)
{
unsigned char zBuf[8];
int rc;
rc = unqliteOsRead(pFd,zBuf,sizeof(zBuf),iOfft);
if( rc != UNQLITE_OK ){
return rc;
}
SyBigEndianUnpack64(zBuf,pOut);
return UNQLITE_OK;
}
/*
** Write a 32-bit integer into the given file descriptor.
*/
static int WriteInt32(unqlite_file *pFd,sxu32 iNum,sxi64 iOfft)
{
unsigned char zBuf[4];
int rc;
SyBigEndianPack32(zBuf,iNum);
rc = unqliteOsWrite(pFd,zBuf,sizeof(zBuf),iOfft);
return rc;
}
/*
** Write a 64-bit integer into the given file descriptor.
*/
static int WriteInt64(unqlite_file *pFd,sxu64 iNum,sxi64 iOfft)
{
unsigned char zBuf[8];
int rc;
SyBigEndianPack64(zBuf,iNum);
rc = unqliteOsWrite(pFd,zBuf,sizeof(zBuf),iOfft);
return rc;
}
/*
** The maximum allowed sector size. 64KiB. If the xSectorsize() method
** returns a value larger than this, then MAX_SECTOR_SIZE is used instead.
** This could conceivably cause corruption following a power failure on
** such a system. This is currently an undocumented limit.
*/
#define MAX_SECTOR_SIZE 0x10000
/*
** Get the size of a single sector on disk.
** The sector size will be used used to determine the size
** and alignment of journal header and within created journal files.
**
** The default sector size is set to 512.
*/
static int GetSectorSize(unqlite_file *pFd)
{
int iSectorSize = UNQLITE_DEFAULT_SECTOR_SIZE;
if( pFd ){
iSectorSize = unqliteOsSectorSize(pFd);
if( iSectorSize < 32 ){
iSectorSize = 512;
}
if( iSectorSize > MAX_SECTOR_SIZE ){
iSectorSize = MAX_SECTOR_SIZE;
}
}
return iSectorSize;
}
/* Hash function for page number */
#define PAGE_HASH(PNUM) (PNUM)
/*
* Fetch a page from the cache.
*/
static Page * pager_fetch_page(Pager *pPager,pgno page_num)
{
Page *pEntry;
if( pPager->nPage < 1 ){
/* Don't bother hashing */
return 0;
}
/* Perform the lookup */
pEntry = pPager->apHash[PAGE_HASH(page_num) & (pPager->nSize - 1)];
for(;;){
if( pEntry == 0 ){
break;
}
if( pEntry->pgno == page_num ){
return pEntry;
}
/* Point to the next entry in the colission chain */
pEntry = pEntry->pNextCollide;
}
/* No such page */
return 0;
}
/*
* Allocate and initialize a new page.
*/
static Page * pager_alloc_page(Pager *pPager,pgno num_page)
{
Page *pNew;
pNew = (Page *)SyMemBackendPoolAlloc(pPager->pAllocator,sizeof(Page)+pPager->iPageSize);
if( pNew == 0 ){
return 0;
}
/* Zero the structure */
SyZero(pNew,sizeof(Page)+pPager->iPageSize);
/* Page data */
pNew->zData = (unsigned char *)&pNew[1];
/* Fill in the structure */
pNew->pPager = pPager;
pNew->nRef = 1;
pNew->pgno = num_page;
return pNew;
}
/*
* Increment the reference count of a given page.
*/
static void page_ref(Page *pPage)
{
pPage->nRef++;
}
/*
* Release an in-memory page after its reference count reach zero.
*/
static int pager_release_page(Pager *pPager,Page *pPage)
{
int rc = UNQLITE_OK;
if( !(pPage->flags & PAGE_DIRTY)){
/* Invoke the unpin callback if available */
if( pPager->xPageUnpin && pPage->pUserData ){
pPager->xPageUnpin(pPage->pUserData);
}
pPage->pUserData = 0;
SyMemBackendPoolFree(pPager->pAllocator,pPage);
}else{
/* Dirty page, it will be released later when a dirty commit
* or the final commit have been applied.
*/
rc = UNQLITE_LOCKED;
}
return rc;
}
/* Forward declaration */
static int pager_unlink_page(Pager *pPager,Page *pPage);
/*
* Decrement the reference count of a given page.
*/
static void page_unref(Page *pPage)
{
pPage->nRef--;
if( pPage->nRef < 1 ){
Pager *pPager = pPage->pPager;
if( !(pPage->flags & PAGE_DIRTY) ){
pager_unlink_page(pPager,pPage);
/* Release the page */
pager_release_page(pPager,pPage);
}else{
if( pPage->flags & PAGE_DONT_MAKE_HOT ){
/* Do not add this page to the hot dirty list */
return;
}
if( !(pPage->flags & PAGE_HOT_DIRTY) ){
/* Add to the hot dirty list */
pPage->pPrevHot = 0;
if( pPager->pFirstHot == 0 ){
pPager->pFirstHot = pPager->pHotDirty = pPage;
}else{
pPage->pNextHot = pPager->pHotDirty;
if( pPager->pHotDirty ){
pPager->pHotDirty->pPrevHot = pPage;
}
pPager->pHotDirty = pPage;
}
pPager->nHot++;
pPage->flags |= PAGE_HOT_DIRTY;
}
}
}
}
/*
* Link a freshly created page to the list of active page.
*/
static int pager_link_page(Pager *pPager,Page *pPage)
{
sxu32 nBucket;
/* Install in the corresponding bucket */
nBucket = PAGE_HASH(pPage->pgno) & (pPager->nSize - 1);
pPage->pNextCollide = pPager->apHash[nBucket];
if( pPager->apHash[nBucket] ){
pPager->apHash[nBucket]->pPrevCollide = pPage;
}
pPager->apHash[nBucket] = pPage;
/* Link to the list of active pages */
MACRO_LD_PUSH(pPager->pAll,pPage);
pPager->nPage++;
if( (pPager->nPage >= pPager->nSize * 4) && pPager->nPage < 100000 ){
/* Grow the hashtable */
sxu32 nNewSize = pPager->nSize << 1;
Page *pEntry,**apNew;
sxu32 n;
apNew = (Page **)SyMemBackendAlloc(pPager->pAllocator, nNewSize * sizeof(Page *));
if( apNew ){
sxu32 iBucket;
/* Zero the new table */
SyZero((void *)apNew, nNewSize * sizeof(Page *));
/* Rehash all entries */
n = 0;
pEntry = pPager->pAll;
for(;;){
/* Loop one */
if( n >= pPager->nPage ){
break;
}
pEntry->pNextCollide = pEntry->pPrevCollide = 0;
/* Install in the new bucket */
iBucket = PAGE_HASH(pEntry->pgno) & (nNewSize - 1);
pEntry->pNextCollide = apNew[iBucket];
if( apNew[iBucket] ){
apNew[iBucket]->pPrevCollide = pEntry;
}
apNew[iBucket] = pEntry;
/* Point to the next entry */
pEntry = pEntry->pNext;
n++;
}
/* Release the old table and reflect the change */
SyMemBackendFree(pPager->pAllocator,(void *)pPager->apHash);
pPager->apHash = apNew;
pPager->nSize = nNewSize;
}
}
return UNQLITE_OK;
}
/*
* Unlink a page from the list of active pages.
*/
static int pager_unlink_page(Pager *pPager,Page *pPage)
{
if( pPage->pNextCollide ){
pPage->pNextCollide->pPrevCollide = pPage->pPrevCollide;
}
if( pPage->pPrevCollide ){
pPage->pPrevCollide->pNextCollide = pPage->pNextCollide;
}else{
sxu32 nBucket = PAGE_HASH(pPage->pgno) & (pPager->nSize - 1);
pPager->apHash[nBucket] = pPage->pNextCollide;
}
MACRO_LD_REMOVE(pPager->pAll,pPage);
pPager->nPage--;
return UNQLITE_OK;
}
/*
* Update the content of a cached page.
*/
static int pager_fill_page(Pager *pPager,pgno iNum,void *pContents)
{
Page *pPage;
/* Fetch the page from the catch */
pPage = pager_fetch_page(pPager,iNum);
if( pPage == 0 ){
return SXERR_NOTFOUND;
}
/* Reflect the change */
SyMemcpy(pContents,pPage->zData,pPager->iPageSize);
return UNQLITE_OK;
}
/*
* Read the content of a page from disk.
*/
static int pager_get_page_contents(Pager *pPager,Page *pPage,int noContent)
{
int rc = UNQLITE_OK;
if( pPager->is_mem || noContent || pPage->pgno >= pPager->dbSize ){
/* Do not bother reading, zero the page contents only */
SyZero(pPage->zData,pPager->iPageSize);
return UNQLITE_OK;
}
if( (pPager->iOpenFlags & UNQLITE_OPEN_MMAP) && (pPager->pMmap /* Paranoid edition */) ){
unsigned char *zMap = (unsigned char *)pPager->pMmap;
pPage->zData = &zMap[pPage->pgno * pPager->iPageSize];
}else{
/* Read content */
rc = unqliteOsRead(pPager->pfd,pPage->zData,pPager->iPageSize,pPage->pgno * pPager->iPageSize);
}
return rc;
}
/*
* Add a page to the dirty list.
*/
static void pager_page_to_dirty_list(Pager *pPager,Page *pPage)
{
if( pPage->flags & PAGE_DIRTY ){
/* Already set */
return;
}
/* Mark the page as dirty */
pPage->flags |= PAGE_DIRTY|PAGE_NEED_SYNC|PAGE_IN_JOURNAL;
/* Link to the list */
pPage->pDirtyPrev = 0;
pPage->pDirtyNext = pPager->pDirty;
if( pPager->pDirty ){
pPager->pDirty->pDirtyPrev = pPage;
}
pPager->pDirty = pPage;
if( pPager->pFirstDirty == 0 ){
pPager->pFirstDirty = pPage;
}
}
/*
* Merge sort.
* The merge sort implementation is based on the one used by
* the PH7 Embeddable PHP Engine (http://ph7.symisc.net/).
*/
/*
** Inputs:
** a: A sorted, null-terminated linked list. (May be null).
** b: A sorted, null-terminated linked list. (May be null).
** cmp: A pointer to the comparison function.
**
** Return Value:
** A pointer to the head of a sorted list containing the elements
** of both a and b.
**
** Side effects:
** The "next", "prev" pointers for elements in the lists a and b are
** changed.
*/
static Page * page_merge_dirty(Page *pA, Page *pB)
{
Page result, *pTail;
/* Prevent compiler warning */
result.pDirtyNext = result.pDirtyPrev = 0;
pTail = &result;
while( pA && pB ){
if( pA->pgno < pB->pgno ){
pTail->pDirtyPrev = pA;
pA->pDirtyNext = pTail;
pTail = pA;
pA = pA->pDirtyPrev;
}else{
pTail->pDirtyPrev = pB;
pB->pDirtyNext = pTail;
pTail = pB;
pB = pB->pDirtyPrev;
}
}
if( pA ){
pTail->pDirtyPrev = pA;
pA->pDirtyNext = pTail;
}else if( pB ){
pTail->pDirtyPrev = pB;
pB->pDirtyNext = pTail;
}else{
pTail->pDirtyPrev = pTail->pDirtyNext = 0;
}
return result.pDirtyPrev;
}
/*
** Inputs:
** Map: Input hashmap
** cmp: A comparison function.
**
** Return Value:
** Sorted hashmap.
**
** Side effects:
** The "next" pointers for elements in list are changed.
*/
#define N_SORT_BUCKET 32
static Page * pager_get_dirty_pages(Pager *pPager)
{
Page *a[N_SORT_BUCKET], *p, *pIn;
sxu32 i;
if( pPager->pFirstDirty == 0 ){
/* Don't bother sorting, the list is already empty */
return 0;
}
SyZero(a, sizeof(a));
/* Point to the first inserted entry */
pIn = pPager->pFirstDirty;
while( pIn ){
p = pIn;
pIn = p->pDirtyPrev;
p->pDirtyPrev = 0;
for(i=0; i<N_SORT_BUCKET-1; i++){
if( a[i]==0 ){
a[i] = p;
break;
}else{
p = page_merge_dirty(a[i], p);
a[i] = 0;
}
}
if( i==N_SORT_BUCKET-1 ){
/* To get here, there need to be 2^(N_SORT_BUCKET) elements in he input list.
* But that is impossible.
*/
a[i] = page_merge_dirty(a[i], p);
}
}
p = a[0];
for(i=1; i<N_SORT_BUCKET; i++){
p = page_merge_dirty(p,a[i]);
}
p->pDirtyNext = 0;
return p;
}
/*
* See block comment above.
*/
static Page * page_merge_hot(Page *pA, Page *pB)
{
Page result, *pTail;
/* Prevent compiler warning */
result.pNextHot = result.pPrevHot = 0;
pTail = &result;
while( pA && pB ){
if( pA->pgno < pB->pgno ){
pTail->pPrevHot = pA;
pA->pNextHot = pTail;
pTail = pA;
pA = pA->pPrevHot;
}else{
pTail->pPrevHot = pB;
pB->pNextHot = pTail;
pTail = pB;
pB = pB->pPrevHot;
}
}
if( pA ){
pTail->pPrevHot = pA;
pA->pNextHot = pTail;
}else if( pB ){
pTail->pPrevHot = pB;
pB->pNextHot = pTail;
}else{
pTail->pPrevHot = pTail->pNextHot = 0;
}
return result.pPrevHot;
}
/*
** Inputs:
** Map: Input hashmap
** cmp: A comparison function.
**
** Return Value:
** Sorted hashmap.
**
** Side effects:
** The "next" pointers for elements in list are changed.
*/
#define N_SORT_BUCKET 32
static Page * pager_get_hot_pages(Pager *pPager)
{
Page *a[N_SORT_BUCKET], *p, *pIn;
sxu32 i;
if( pPager->pFirstHot == 0 ){
/* Don't bother sorting, the list is already empty */
return 0;
}
SyZero(a, sizeof(a));
/* Point to the first inserted entry */
pIn = pPager->pFirstHot;
while( pIn ){
p = pIn;
pIn = p->pPrevHot;
p->pPrevHot = 0;
for(i=0; i<N_SORT_BUCKET-1; i++){
if( a[i]==0 ){
a[i] = p;
break;
}else{
p = page_merge_hot(a[i], p);
a[i] = 0;
}
}
if( i==N_SORT_BUCKET-1 ){
/* To get here, there need to be 2^(N_SORT_BUCKET) elements in he input list.
* But that is impossible.
*/
a[i] = page_merge_hot(a[i], p);
}
}
p = a[0];
for(i=1; i<N_SORT_BUCKET; i++){
p = page_merge_hot(p,a[i]);
}
p->pNextHot = 0;
return p;
}
/*
** The format for the journal header is as follows:
** - 8 bytes: Magic identifying journal format.
** - 4 bytes: Number of records in journal.
** - 4 bytes: Random number used for page hash.
** - 8 bytes: Initial database page count.
** - 4 bytes: Sector size used by the process that wrote this journal.
** - 4 bytes: Database page size.
**
** Followed by (JOURNAL_HDR_SZ - 28) bytes of unused space.
*/
/*
** Open the journal file and extract its header information.
**
** If the header is read successfully, *pNRec is set to the number of
** page records following this header and *pDbSize is set to the size of the
** database before the transaction began, in pages. Also, pPager->cksumInit
** is set to the value read from the journal header. UNQLITE_OK is returned
** in this case.
**
** If the journal header file appears to be corrupted, UNQLITE_DONE is
** returned and *pNRec and *PDbSize are undefined. If JOURNAL_HDR_SZ bytes
** cannot be read from the journal file an error code is returned.
*/
static int pager_read_journal_header(
Pager *pPager, /* Pager object */
sxu32 *pNRec, /* OUT: Value read from the nRec field */
pgno *pDbSize /* OUT: Value of original database size field */
)
{
sxu32 iPageSize,iSectorSize;
unsigned char zMagic[8];
sxi64 iHdrOfft;
sxi64 iSize;
int rc;
/* Offset to start reading from */
iHdrOfft = 0;
/* Get the size of the journal */
rc = unqliteOsFileSize(pPager->pjfd,&iSize);
if( rc != UNQLITE_OK ){
return UNQLITE_DONE;
}
/* If the journal file is too small, return UNQLITE_DONE. */
if( 32 /* Minimum sector size */> iSize ){
return UNQLITE_DONE;
}
/* Make sure we are dealing with a valid journal */
rc = unqliteOsRead(pPager->pjfd,zMagic,sizeof(zMagic),iHdrOfft);
if( rc != UNQLITE_OK ){
return rc;
}
if( SyMemcmp(zMagic,aJournalMagic,sizeof(zMagic)) != 0 ){
return UNQLITE_DONE;
}
iHdrOfft += sizeof(zMagic);
/* Read the first three 32-bit fields of the journal header: The nRec
** field, the checksum-initializer and the database size at the start
** of the transaction. Return an error code if anything goes wrong.
*/
rc = ReadInt32(pPager->pjfd,pNRec,iHdrOfft);
if( rc != UNQLITE_OK ){
return rc;
}
iHdrOfft += 4;
rc = ReadInt32(pPager->pjfd,&pPager->cksumInit,iHdrOfft);
if( rc != UNQLITE_OK ){
return rc;
}
iHdrOfft += 4;
rc = ReadInt64(pPager->pjfd,pDbSize,iHdrOfft);
if( rc != UNQLITE_OK ){
return rc;
}
iHdrOfft += 8;
/* Read the page-size and sector-size journal header fields. */
rc = ReadInt32(pPager->pjfd,&iSectorSize,iHdrOfft);
if( rc != UNQLITE_OK ){
return rc;
}
iHdrOfft += 4;
rc = ReadInt32(pPager->pjfd,&iPageSize,iHdrOfft);
if( rc != UNQLITE_OK ){
return rc;
}
/* Check that the values read from the page-size and sector-size fields
** are within range. To be 'in range', both values need to be a power
** of two greater than or equal to 512 or 32, and not greater than their
** respective compile time maximum limits.
*/
if( iPageSize < UNQLITE_MIN_PAGE_SIZE || iSectorSize<32
|| iPageSize > UNQLITE_MAX_PAGE_SIZE || iSectorSize>MAX_SECTOR_SIZE
|| ((iPageSize-1)&iPageSize)!=0 || ((iSectorSize-1)&iSectorSize)!=0
){
/* If the either the page-size or sector-size in the journal-header is
** invalid, then the process that wrote the journal-header must have
** crashed before the header was synced. In this case stop reading
** the journal file here.
*/
return UNQLITE_DONE;
}
/* Update the assumed sector-size to match the value used by
** the process that created this journal. If this journal was
** created by a process other than this one, then this routine
** is being called from within pager_playback(). The local value
** of Pager.sectorSize is restored at the end of that routine.
*/
pPager->iSectorSize = iSectorSize;
pPager->iPageSize = iPageSize;
/* Ready to rollback */
pPager->iJournalOfft = JOURNAL_HDR_SZ(pPager);
/* All done */
return UNQLITE_OK;
}
/*
* Write the journal header in the given memory buffer.
* The given buffer is big enough to hold the whole header.
*/
static int pager_write_journal_header(Pager *pPager,unsigned char *zBuf)
{
unsigned char *zPtr = zBuf;
/* 8 bytes magic number */
SyMemcpy(aJournalMagic,zPtr,sizeof(aJournalMagic));
zPtr += sizeof(aJournalMagic);
/* 4 bytes: Number of records in journal. */
SyBigEndianPack32(zPtr,0);
zPtr += 4;
/* 4 bytes: Random number used to compute page checksum. */
SyBigEndianPack32(zPtr,pPager->cksumInit);
zPtr += 4;
/* 8 bytes: Initial database page count. */
SyBigEndianPack64(zPtr,pPager->dbOrigSize);
zPtr += 8;
/* 4 bytes: Sector size used by the process that wrote this journal. */
SyBigEndianPack32(zPtr,(sxu32)pPager->iSectorSize);
zPtr += 4;
/* 4 bytes: Database page size. */
SyBigEndianPack32(zPtr,(sxu32)pPager->iPageSize);
return UNQLITE_OK;
}
/*
** Parameter aData must point to a buffer of pPager->pageSize bytes
** of data. Compute and return a checksum based ont the contents of the
** page of data and the current value of pPager->cksumInit.
**
** This is not a real checksum. It is really just the sum of the
** random initial value (pPager->cksumInit) and every 200th byte
** of the page data, starting with byte offset (pPager->pageSize%200).
** Each byte is interpreted as an 8-bit unsigned integer.
**
** Changing the formula used to compute this checksum results in an
** incompatible journal file format.
**
** If journal corruption occurs due to a power failure, the most likely
** scenario is that one end or the other of the record will be changed.
** It is much less likely that the two ends of the journal record will be
** correct and the middle be corrupt. Thus, this "checksum" scheme,
** though fast and simple, catches the mostly likely kind of corruption.
*/
static sxu32 pager_cksum(Pager *pPager,const unsigned char *zData)
{
sxu32 cksum = pPager->cksumInit; /* Checksum value to return */
int i = pPager->iPageSize-200; /* Loop counter */
while( i>0 ){
cksum += zData[i];
i -= 200;
}
return cksum;
}
/*
** Read a single page from the journal file opened on file descriptor
** jfd. Playback this one page. Update the offset to read from.
*/
static int pager_play_back_one_page(Pager *pPager,sxi64 *pOfft,unsigned char *zTmp)
{
unsigned char *zData = zTmp;
sxi64 iOfft; /* Offset to read from */
pgno iNum; /* Pager number */
sxu32 ckSum; /* Sanity check */
int rc;
/* Offset to start reading from */
iOfft = *pOfft;
/* Database page number */
rc = ReadInt64(pPager->pjfd,&iNum,iOfft);
if( rc != UNQLITE_OK ){ return rc; }
iOfft += 8;
/* Page data */
rc = unqliteOsRead(pPager->pjfd,zData,pPager->iPageSize,iOfft);
if( rc != UNQLITE_OK ){ return rc; }
iOfft += pPager->iPageSize;
/* Page cksum */
rc = ReadInt32(pPager->pjfd,&ckSum,iOfft);
if( rc != UNQLITE_OK ){ return rc; }
iOfft += 4;
/* Synchronize pointers */
*pOfft = iOfft;
/* Make sure we are dealing with a valid page */
if( ckSum != pager_cksum(pPager,zData) ){
/* Ignore that page */
return SXERR_IGNORE;
}
if( iNum >= pPager->dbSize ){
/* Ignore that page */
return UNQLITE_OK;
}
/* playback */
rc = unqliteOsWrite(pPager->pfd,zData,pPager->iPageSize,iNum * pPager->iPageSize);
if( rc == UNQLITE_OK ){
/* Flush the cache */
pager_fill_page(pPager,iNum,zData);
}
return rc;
}
/*
** Playback the journal and thus restore the database file to
** the state it was in before we started making changes.
**
** The journal file format is as follows:
**
** (1) 8 byte prefix. A copy of aJournalMagic[].
** (2) 4 byte big-endian integer which is the number of valid page records
** in the journal.
** (3) 4 byte big-endian integer which is the initial value for the
** sanity checksum.
** (4) 8 byte integer which is the number of pages to truncate the
** database to during a rollback.
** (5) 4 byte big-endian integer which is the sector size. The header
** is this many bytes in size.
** (6) 4 byte big-endian integer which is the page size.
** (7) zero padding out to the next sector size.
** (8) Zero or more pages instances, each as follows:
** + 4 byte page number.
** + pPager->pageSize bytes of data.
** + 4 byte checksum
**
** When we speak of the journal header, we mean the first 7 items above.
** Each entry in the journal is an instance of the 8th item.
**
** Call the value from the second bullet "nRec". nRec is the number of
** valid page entries in the journal. In most cases, you can compute the
** value of nRec from the size of the journal file. But if a power
** failure occurred while the journal was being written, it could be the
** case that the size of the journal file had already been increased but
** the extra entries had not yet made it safely to disk. In such a case,
** the value of nRec computed from the file size would be too large. For
** that reason, we always use the nRec value in the header.
**
** If the file opened as the journal file is not a well-formed
** journal file then all pages up to the first corrupted page are rolled
** back (or no pages if the journal header is corrupted). The journal file
** is then deleted and SQLITE_OK returned, just as if no corruption had
** been encountered.
**
** If an I/O or malloc() error occurs, the journal-file is not deleted
** and an error code is returned.
**