Channel: Linux Sleuthing

Who's Texting? The iOS6 sms.db

It didn't take long from the release of iOS6 on 09/14/12 until I got my first iOS6 device to process: 18 days.  In this case, it was an iPhone4 with iOS6.  In so short a time, there are not too many commercial or open source tools out there prepared to analyze the new OS or its new data formats.

Fortunately for me, the phone was not locked.  To process the phone, in which the prime interest was text messages and contacts, I used the latest iTunes to create a backup of the phone.  Fortune was again on my side: the phone was not set for encrypted backups.

iTunes Backup

If you are not familiar with iTunes backups, then allow me to introduce them generally.  iTunes copies user data (media, databases, settings, applications, etc.) to a directory in the operating system.  The location of the directory varies by operating system and version.  Since this is primarily a discussion about the new iOS6 sms.db, I'll let you find the location yourself.  What's most important to know is this:
  1. the backups are flat:  all the files are dumped into one directory and are not preserved in their original tree structure.
  2. the files are renamed: the file names are sha1 values calculated from the domain, path, and filename of the original file.
The nature of hash digests is that they produce effectively unique, fixed-length values based on the data being hashed.  This means that a 1-byte file and a 200GB data stream both produce a 40-character hexadecimal sha1 value.  It also means that the data on which the hash value was calculated cannot be reverse engineered from the value alone.  Why did I just state all this?  Certainly not to ignite another hash controversy with anything I might have just misstated, but to point out that we can't really know the original file name, path, and domain from the sha1 value file names found in the backup.  Sure, you'll find postings that tell you the sms.db is renamed "3d0d7e5fb2ce288813306e4d4636395e047a3d28", but you won't find a file name for every file in your backup.  And you won't learn any metadata about the backup files from the hash value, either.
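Incidentally, the naming scheme is simple to reproduce in the other direction: if you suspect a particular domain and relative path, you can compute the name that file should have in the backup.  Here is a minimal Python 3 sketch based on the commonly reported recipe of hashing the string "domain-relativePath"; treat the recipe as an assumption to verify against your own backup:

import hashlib

def backup_name(domain, relative_path):
    # assumed recipe: sha1 of "domain-relativePath"
    return hashlib.sha1('{}-{}'.format(domain, relative_path).encode()).hexdigest()

print(backup_name('HomeDomain', 'Library/SMS/sms.db'))
# prints 3d0d7e5fb2ce288813306e4d4636395e047a3d28, the sms.db name mentioned above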

Now, Linux helps us here.  On the command line, the file command will tell us the file type of each of the backup files, and most Linux file managers will render thumbnails for known file types.  This means it's quite easy to view media files and identify databases just by opening a file browser pointed at the backup directory.  It's good for low-hanging fruit, but how do you differentiate between the databases, for example?  Keyword search on a table name, maybe?  Sure, that might narrow the field, but it's a less than perfect solution.
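If you prefer a script to a file browser, a few lines of Python will do the same triage.  The sketch below simply checks each file in the backup directory for the SQLite magic string rather than relying on the file utility; the backup path is, of course, hypothetical:

import os

backup_dir = '/path/to/backup'   # hypothetical backup location

# flag every file that begins with the 16-byte SQLite 3 magic string
for name in os.listdir(backup_dir):
    path = os.path.join(backup_dir, name)
    if os.path.isfile(path):
        with open(path, 'rb') as f:
            if f.read(16) == b'SQLite format 3\x00':
                print(path)

It still doesn't tell you which database is which, but it narrows the field to the files worth opening.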

Identifying Files in the Backup

Keeping the discussion general, know one more thing about iTunes backups before we proceed with the sms.db discussion: there are database files included in the backup that identify the files by domain, path, and file name.  In fact, file metadata such as MAC times, ownership, permissions, and file size are included in the databases.

Now, the databases have changed with each iOS evolution--at least since iOS3 when I started examining iTunes backups.  Well, more accurately, I should say they changed with each evolution except iOS6: iOS6 retains the format introduced in iOS5.  That was again fortunate, because it just so happens that I coded a tool, based on a python script posted on stackoverflow by user galloglass, to 'unback' the iTunes backup based on the information in the backup databases.  The iOS5-unback.py tool works just fine on the iOS6 backup.

The sms.db

While the backup format did not change in iOS6, the sms.db changed markedly.  Compare, if you will, the database schema side by side with an iOS5 sms.db.


                     iOS6    iOS5
Number of tables:    10      8
Number of triggers:  9       7
Number of indexes:   16      18

iOS6 table names: _SqliteDatabaseProperties, message, sqlite_sequence, chat, attachment, handle, message_attachment_join, chat_handle_join, chat_message_join, sqlite_stat1

iOS5 table names: _SqliteDatabaseProperties, message, sqlite_sequence, msg_group, group_member, msg_pieces, madrid_attachment, madrid_chat

iOS6 trigger names: set_message_has_attachments, update_message_roomname_cache_insert, delete_attachment_files, clean_orphaned_attachments, clean_orphaned_handles, clear_message_has_attachments, clean_orphaned_messages, update_message_roomname_cache_delete, clean_orphaned_handles2

iOS5 trigger names: insert_unread_message, mark_message_unread, mark_message_read, delete_message, insert_newest_message, delete_newest_message, delete_pieces

iOS6 index names: sqlite_autoindex__SqliteDatabaseProperties_1, sqlite_autoindex_message_1, sqlite_autoindex_chat_1, sqlite_autoindex_attachment_1, sqlite_autoindex_handle_1, sqlite_autoindex_handle_2, sqlite_autoindex_message_attachment_join_1, sqlite_autoindex_chat_handle_join_1, sqlite_autoindex_chat_message_join_1, message_idx_is_read, message_idx_failed, message_idx_handle, chat_idx_identifier, chat_idx_room_name, message_idx_was_downgraded, chat_message_join_idx_message_id

iOS5 index names: sqlite_autoindex__SqliteDatabaseProperties_1, madrid_attachment_message_index, madrid_attachment_guid_index, madrid_attachment_filename_index, madrid_chat_style_index, madrid_chat_state_index, madrid_chat_account_id_index, madrid_chat_chat_identifier_index, madrid_chat_service_name_index, madrid_chat_guid_index, madrid_chat_room_name_index, madrid_chat_account_login_index, message_group_index, message_flags_index, pieces_message_index, madrid_guid_index, madrid_roomname_service_index, madrid_handle_service_index
For anyone familiar with the iOS5 sms.db database structure, the most apparent change in the iOS6 sms.db is the lack of "madrid" tables.  In iOS5, the iMessage app was introduced, and iMessage texting and SMS texting were unified in the sms.db.  This is still true in iOS6, but it is handled quite differently, and by "differently," I mean "better."

Time Stamps

When parsing the iOS5 database, it was necessary to convert two different date formats: Unix epoch time and Mac Absolute Time.  The dates of SMS messages sent through the wireless carrier were recorded in unix epoch time in the date field of the message table.  iMessages, on the other hand, were recorded in Mac Absolute Time in the same table and field as the SMS message date!  

What's the difference between unix epoch and Mac Absolute Time?  Exactly 978307200 seconds.  Unix epoch starts on 01/01/1970, while Mac Absolute Time starts on 01/01/2001.  SQLite queries that integrated the SMS and iMessage texts had to account for these differences.  Some examiners with automated tools for parsing the sms.db were likely blissfully ignorant of this issue, but it existed nonetheless.  And it presented challenges to those of us extracting data from the database with the SQLite command line program or a SQLite GUI browser.

The iOS6 sms.db simplifies the date issue.  All text message dates, whether SMS or iMessage, are recorded in Mac Absolute Time.  The datetime function in SQLite can be used to convert that time by adding 978307200 seconds to the stored value.  This converts the time to unix epoch, which is one of the 'modifiers' the datetime function was coded to handle (Mac Absolute Time is not a valid datetime modifier):
sqlite> select datetime(370884516 + 978307200, 'unixepoch', 'utc');
datetime(370884516 + 978307200, 'unixepoch', 'utc')
2012-10-02 22:28:36
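The same arithmetic is easy to reproduce outside SQLite if you prefer.  A minimal Python 3 sketch (this one returns UTC; note that SQLite's 'utc' modifier actually treats the value to its left as local time, so choose your modifiers deliberately):

from datetime import datetime, timedelta

MAC_EPOCH = datetime(2001, 1, 1)   # 978307200 seconds after the unix epoch

def mac_absolute_to_utc(seconds):
    # convert a Mac Absolute Time value to a UTC datetime
    return MAC_EPOCH + timedelta(seconds=seconds)

print(mac_absolute_to_utc(370884516))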
While on the topic of time stamps, two additional time stamps are possible in each record: date_read and date_delivered.  The date_read field defaults to 0 until the message is opened in the iMessage application.  The date_delivered field is populated with a date when the message is sent through the iMessage service, but not SMS.  This means it is possible to differentiate between the time an iMessage was initiated by the device user and when it was actually delivered.

Addresses 

Another significant change in the database is the address field.  In the iOS5 version, the "address" was the phone number of the remote contact in the text conversation.  The address field has been replaced with the handle_id in iOS6.  The handle_id corresponds to the ROWID in the handle table, which contains the phone number of the contact in the id field.

Flags

I'll quickly mention a couple of other changes.  In the previous versions of the sms.db, fields such as flags and read were used to mark the type (sent, received, etc) and status (read, unread, etc) of the message.  New fields exist for these attributes including is_from_me, is_empty, is_delayed, is_auto_reply, is_prepared, is_read, is_system_message, is_sent.  The values 0 and 1 are used as "no" and "yes" respectively in interpreting these fields.

I mentioned earlier that the service used to transmit the text, SMS or iMessage, resulted in different time stamp types.  It also resulted in different fields being populated in the message table.  In iOS6, the messages share the same flags regardless of the service used to send them.  The service utilized is recorded in the new service field of the message table.
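As a quick illustration of how the new flags and service field read in practice, here is a short Python 3 sketch that tallies messages by service and direction.  It assumes nothing beyond the fields just described and a copy of the sms.db extracted from the backup:

import sqlite3

conn = sqlite3.connect('sms.db')   # path to the extracted database

# count messages by service (SMS or iMessage) and direction (is_from_me: 0 = received, 1 = sent)
query = 'SELECT service, is_from_me, count(*) FROM message GROUP BY service, is_from_me;'

for service, is_from_me, total in conn.execute(query):
    direction = 'Sent' if is_from_me else 'Received'
    print(service, direction, total)

conn.close()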

Putting it All Together

I did not come anywhere close to describing all the data in the new sms.db, nor how to relate all of its tables.  I only intended to alert investigators that there are differences in the new database that must be considered.

That said, I'd be remiss if I did not provide a sample query to produce a basic chat list.  The following query produces a list with several important fields (RowID, Date, Phone Number, Service, Type, Date Read/Sent, Text):

SELECT m.rowid as RowID, DATETIME(date + 978307200, 'unixepoch', 'localtime') as Date, h.id as "Phone Number", m.service as Service, CASE is_from_me WHEN 0 THEN "Received" WHEN 1 THEN "Sent" ELSE "Unknown" END as Type, CASE WHEN date_read > 0 THEN DATETIME(date_read + 978307200, 'unixepoch', 'utc') WHEN date_delivered > 0 THEN DATETIME(date_delivered + 978307200, 'unixepoch', 'utc') ELSE NULL END as "Date Read/Sent", text as Text FROM message m, handle h WHERE h.rowid = m.handle_id ORDER BY m.rowid ASC;


I'll make that a little easier to read:

SELECT 
  m.rowid as RowID, 
  DATETIME(date + 978307200, 'unixepoch', 'localtime') as Date, 
  h.id as "Phone Number", m.service as Service, 
  CASE is_from_me 
    WHEN 0 THEN "Received" 
    WHEN 1 THEN "Sent" 
    ELSE "Unknown" 
  END as Type, 
  CASE 
    WHEN date_read > 0 then DATETIME(date_read + 978307200, 'unixepoch', 'utc')
    WHEN date_delivered > 0 THEN DATETIME(date_delivered + 978307200, 'unixepoch', 'utc') 
    ELSE NULL END as "Date Read/Sent", 
  text as Text 
FROM message m, handle h 
WHERE h.rowid = m.handle_id 
ORDER BY m.rowid ASC;

Your results, when imported to a spreadsheet, should look like this:

ROWID | Date                | Phone Number | Service  | Type     | Date Read/Sent      | Text
2484  | 2012-10-02 08:28:36 | 11231234567  | iMessage | Sent     | 2012-10-02 08:28:39 | Hey
2485  | 2012-10-02 08:45:17 | 11231234567  | iMessage | Sent     | 2012-10-02 08:46:11 | Call me when you get this
2486  | 2012-10-02 08:46:06 | 13217654321  | SMS      | Received | 2012-10-02 08:47:21 | Can I borrow some bucks?
2487  | 2012-10-02 08:47:10 | 1321765321   | SMS      | Sent     |                     | No! I don't have any doe.



Getting to know the Relatives: SQLite Databases

I was contacted by a colleague who needed some help analyzing a SQLite database.  It was the myspace.messaging.database#database located in the "\Users\\appdata\local\google\chrome\userdata\default\plugin data\google gears\messaging.myspace.com\http_80\" folder.  I didn't and still don't know a whole lot about this file, but it appears to contain myspace email messages.

The Challenges of SQLite

Let's face it: SQLite is everywhere.  Understanding it is essential to good examinations, and a big part of that understanding comes from learning SQL statements. There are many good online sources for learning SQL, and one of my favorites is w3schools.com.

But, for digital forensics practitioners, there is another challenge beyond understanding SQL commands--understanding the construction and relationships of the tables.  SQLite is a relational database, and the tables are meant to be related to one another to produce a result not possible or impractical from a single table.  Knowing how the tables were intended to be used can be very difficult... after all, a SQLite database is more akin to a file cabinet than to the secretary who uses the file cabinet.

For example, the secretary can place company bank records in a file called "Financial Records" or she can put them in a file called "Artichokes".  It really doesn't matter, because she knows what goes in the file.  Someone coming along behind her won't have much trouble finding the bank records in the Financial Records file, but might overlook them entirely in the Artichokes file.  The point is, without the secretary, it might be very hard to understand the filing system.

SQLite databases can be a lot like that.  You can see the structure, or the schema as it is called, very easily.  But what is not so easily understood is how the structure is intended to be used.  That mystery is usually locked up in the application that utilizes the database, but it is not explained in the database itself.

Getting a Clue

To be sure, there can be hints about how tables in a database interrelate.  Table and field names often speak volumes.  A database called "AddressBook.db" with two tables called "Names" and "Addresses" that have a field in common called "SubjectID" isn't too hard to fathom.  If we are lucky enough to be able to run the application that uses the database and test our inferences based on the application's output, our confidence grows (and our understanding, if supported by the outcome, would now be considered reliable).

My favorite hints by far are SQL 'view' statements.  These are virtual tables that draw their data from other tables in the database (or an attached database).  By studying a view statement, you get insight from the database creator into how the database was intended to be used... at least in one capacity.  Think of a view as a macro: it saves the database user the trouble of repeatedly typing a frequently used query.  And, if the query is frequently used, then you have a good sense of how the database was intended to be used.
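To make that concrete, here is a small, purely hypothetical example (the tables and view are invented, not taken from any real database).  The view is the author's hint: it tells you, in one statement, that Names and Addresses are meant to be joined on SubjectID:

import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript('''
    CREATE TABLE Names (SubjectID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Addresses (SubjectID INTEGER, Street TEXT);

    -- the view encodes the intended relationship between the two tables
    CREATE VIEW ContactList AS
        SELECT n.Name, a.Street
        FROM Names n JOIN Addresses a ON n.SubjectID = a.SubjectID;
''')

conn.execute("INSERT INTO Names VALUES (1, 'Some Guy')")
conn.execute("INSERT INTO Addresses VALUES (1, '123 Main St')")
print(conn.execute('SELECT * FROM ContactList').fetchall())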

What if There are No Clues?
What about circumstances in which there are no clues in the database to help us understand its use?  Well, if there are really no clues, then the only safe answer is that we look at the data flat, that is to say, we look at the tables individually and we don't relate them in any way.  But there are often less obvious clues that can reveal an underlying relationship... which brings me to the point of this article.

Latent Rows

Latent fingerprint examiners know the term "latent" means hidden or invisible.  Latent fingerprints must be revealed to be seen by some external method, such as fingerprint powder.  SQLite tables have a latent field, so to speak.  And, we can reveal it to help us form relations in a SQLite database.

Consider the myspace.messaging.database#database I mentioned in the opening paragraph. It has the following schema:

CREATE VIRTUAL TABLE AuthorData USING fts2(AuthorDisplayName, AuthorUserName); 
CREATE TABLE AuthorData_content(c0AuthorDisplayName, c1AuthorUserName); 
CREATE TABLE AuthorData_segdir(  level integer,  idx integer,  start_block integer,  leaves_end_block integer,  end_block integer,  root blob,  primary key(level, idx)); 
CREATE TABLE AuthorData_segments(block blob); 
CREATE TABLE AuthorMetaData (AuthorId INTEGER PRIMARY KEY, AuthorImageUrl TEXT); 
CREATE VIRTUAL TABLE MessageData USING fts2(Subject, Body); 
CREATE TABLE MessageData_content(c0Subject, c1Body); 
CREATE TABLE MessageData_segdir(  level integer,  idx integer,  start_block integer,  leaves_end_block integer,  end_block integer,  root blob,  primary key(level, idx)); 
CREATE TABLE MessageData_segments(block blob); 
CREATE TABLE MessageMetaData (MessageId INTEGER PRIMARY KEY, RecipientId INTEGER, AuthorId INTEGER, Folder INTEGER, Status INTEGER, CreatedDate INTEGER); 
CREATE TABLE UserSettings (UserId INTEGER PRIMARY KEY, MachineId TEXT, Enabled INTEGER, TimeStamp INTEGER, LastSyncTimeStamp INTEGER, FirstRunIndexPass INTEGER, FirstRunIndexTargetCount INTEGER, OldestMessageId INTEGER, LastServerTotalCount INTEGER); 
CREATE INDEX AuthorIdIndex ON MessageMetaData (AuthorId, RecipientId); 
CREATE INDEX StatusIndex ON MessageMetaData (Status, CreatedDate);

Now look more closely at two tables of interest, MessageMetaData and MessageData_content:

CREATE TABLE MessageMetaData (MessageId INTEGER PRIMARY KEY, RecipientId INTEGER, AuthorId INTEGER, Folder INTEGER, Status INTEGER, CreatedDate INTEGER);
CREATE TABLE MessageData_content(c0Subject, c1Body)

It would seem from the table names that MessageMetaData contains information about the messages, and MessageData_content contains the messages themselves.  But, they don't share any fields that allow the two tables to be related. In other words, which rows of the metadata table correspond to which row of the content table?  Do they even correspond at all?

Let's look at our first hint of correspondence:

$ sqlite3 myspace.messaging.database#database.db 'select count(*) from MessageMetaData;'
1358 
$ sqlite3 myspace.messaging.database#database.db 'select count(*) from MessageData_content;'
1358

Both tables have the same number of records.  Hmm, a clue?  Quite likely, especially upon study of the table content and the remaining tables' contents.  In fact, conducting a similar study, we find another set of corresponding tables: AuthorMetaData and AuthorData_content also have an equal number of records (172, to be exact) but no obvious, interrelated fields.

Unless you've studied SQLite construction in any depth, you probably don't know that it creates a 'rowid' field for every table to act as a primary key.  If a table is created with a defined primary key, that primary key is just an alias to the builtin rowid (with one exception outside the scope of this discussion).  But the rowid is not represented in the table or database schema, which is probably why you didn't know about it (at least, I didn't until I bought a SQLite book).

Knowing about the rowid, I can now check to see if the two tables have matching rowid fields:

$ sqlite3 myspace.messaging.database#database.db 'select count(*) from MessageMetaData m, MessageData_content c where m.rowid = c.rowid'
1358 

We don't have to trust the count function, take a look for yourself:

$ sqlite3 myspace.messaging.database#database.db 'select m.rowid, c.rowid from MessageMetaData m, MessageData_content c where m.rowid = c.rowid'
...
81407357|81407357
81416917|81416917
81504605|81504605
81505714|81505714
81530947|81530947
81569294|81569294

Well, now this is even more interesting.  We not only have two tables with the same number of rows, but we have two tables with fields in relation, i.e., rowid!  

Understand that rowid is simply an autoincrementing, unique, 64-bit integer unless specifically assigned otherwise by insert and update commands.  But is this just a coincidence?  Let's consider: we have non-sequential rowids throughout both tables.  That might be explained by dropped rows from the tables.  But two tables, each with 1358 rows, and each row having a matching rowid in the other table?  That is more than coincidence--it's programmatic.  The application populating the tables is assigning the rowids.

The Proof is in the Pudding

My assertion is that the application behind the myspace.messaging.database#database.db is assigning the rowids as it populates the related tables and linking the rows by matching rowid.  Let me demonstrate how a rowid can be assigned:

sqlite> create table numbers(digit integer);
sqlite> insert into numbers (digit) values(1);
sqlite> insert into numbers (digit) values(2);
sqlite> insert into numbers (digit) values(3);
sqlite> select rowid, digit from numbers;
1|1
2|2
3|3
sqlite> insert into numbers (rowid, digit) values (1000, 4);
sqlite> select rowid, digit from numbers;
1|1
2|2
3|3
1000|4

I created a table called "numbers" with one field called "digit."  I then inserted three rows in the table with the values 1, 2, and 3 respectively.  If you've been following along, you now know that every SQLite table also has a rowid field, even if not expressly created in the table by the user.  The first select statement shows the autogenerated rowid along with the digits I inserted.

The final insert statement is different.  Here I assign the rowid, rather than let it be automatically populated by the SQLite engine.  And, as you can see in the final select statement, I succeed in setting a non-sequential rowid.

Putting it All Together

I've demonstrated a "hidden" way that tables in SQLite databases can be related.  It takes some knowledge of SQLite structure and the SQL query language to unveil this data, however.  If you are in the habit of relying on SQLite browsers and looking at tables without relating them, then you are really missing out on a wealth of data.

Again, let me illustrate using the myspace.messaging.database#database.  Let's look at one row in each of the tables I mentioned previously:

$ sqlite3 -header myspace.messaging.database#database.db 'select * from MessageMetaData limit 1;'
MessageId|RecipientId|AuthorId|Folder|Status|CreatedDate
1289081|544962655|41265701|0|2|1280870820000 

$ sqlite3 -header myspace.messaging.database#database.db 'select * from MessageData_content limit 1;'
c0Subject|c1Body
Hi|Hey, what's up? 

$ sqlite3 -header myspace.messaging.database#database.db 'select * from AuthorMetaData limit 1;'
AuthorId|AuthorImageUrl
-1930729470|http://some_url/img/some_image.png 

$ sqlite3 -header myspace.messaging.database#database.db 'select * from AuthorData_content limit 1;'
c0AuthorDisplayName|c1AuthorUserName
A User|auser

The only hint of relationship, besides table names, is the AuthorID field in MessageMetaData and AuthorMetaData.  But there is still no obvious way to tie the metadata to the content we are most interested in.  Your favorite GUI browser may make the display prettier, but it's just as impotent.

But, now that you have knowledge of the rowid, and have a link to a tutorial on SQLite statements, you're not too far from being able to do this:

sqlite3 -header myspace.messaging.database#database.db 'select messageid, datetime(createddate/1000, "unixepoch", "localtime") as Date, mm.AuthorID, c0AuthorDisplayName as "Author Display Name", c1AuthorUserName as "Author Username", c0subject as Subject, c1Body as Body from  MessageMetaData mm, MessageData_content mc, AuthorData_Content ac, AuthorMetaData am where mm.AuthorID = am.AuthorID and am.rowid = ac.rowid and mm.rowid = mc.rowid limit 2;'
MessageId|Date|AuthorId|Author Display Name|Author Username|Subject|Body
1289081|2010-08-03 14:27:00|41265701|A User|auser|Hi|Hey, what's up?

I ask you, which output would you rather examine and report on?

Addendum

That last query is really not so scary.  It's just long because we're grabbing seven fields from four tables, and converting a date stamp.  But, in reality, it's very straightforward.

Let's take a look:

select
     messageid,
     datetime(createddate/1000, "unixepoch", "localtime") as Date,
     mm.AuthorID,
     c0AuthorDisplayName as "Author Display Name",
     c1AuthorUserName as "Author Username",
     c0subject as Subject,
     c1Body as Body 
from 
     MessageMetaData mm,
     MessageData_content mc,
     AuthorMetaData am,
    AuthorData_Content ac 
where
     mm.AuthorID = am.AuthorID
     and am.rowid = ac.rowid
     and mm.rowid = mc.rowid;

The select clause simply picks the fields we want to display.  The datetime function converts the unixepoch time, which is recorded in milliseconds, to local time.  The 'as' statements are naming the columns something more user friendly and are not required.

The from statement simply declares what tables to query for the fields we are trying to display.  Each table is followed by an alias I chose to make it easier to reference field names common to more than one table.  For example, AuthorID is found in both the MessageMetaData and AuthorMetaData tables.  By giving MessageMetaData the alias of mm, I can now reference the MessageMetaData.AuthorID field as mm.AuthorID.

The where statement is a filter.  It keeps the tables 'aligned,' so to speak.  It ensures that only the correct author content and message content is returned for each row.  This post is already long in the tooth, so I won't go into detail describing how it works.  But, very succinctly, the MessageMetaData record is matched to an AuthorMetaData record by AuthorID.  Then the AuthorMetaData record is matched to its corresponding AuthorData_Content record by rowid.  Finally, the MessageMetaData record is matched to its corresponding MessageData_content record, also by rowid.

Addressing the iOS6 Address Book and SQLite Pitfalls

Now that I have a basic handle on the iOS6 sms.db, it's time to look at the iOS6 Address Book.  After all, I now know what's being said when I examine the sms.db, but I don't have a real good picture, other than the phone number, of who's sending the message.  The iOS AddressBook.sqlitedb is the place to look for data related to the phone number.

A brief look at AddressBook.sqlitedb

The iOS6 AddressBook.sqlitedb database has 29 tables:


ABAccount                        ABPersonFullTextSearch_segdir  
ABGroup                          ABPersonFullTextSearch_segments
ABGroupChanges                   ABPersonFullTextSearch_stat    
ABGroupMembers                   ABPersonLink                   
ABMultiValue                     ABPersonMultiValueDeletes      
ABMultiValueEntry                ABPersonSearchKey              
ABMultiValueEntryKey             ABPhoneLastFour                
ABMultiValueLabel                ABRecent                       
ABPerson                         ABStore                        
ABPersonBasicChanges             FirstSortSectionCount          
ABPersonChanges                  FirstSortSectionCountTotal     
ABPersonFullTextSearch           LastSortSectionCount           
ABPersonFullTextSearch_content   LastSortSectionCountTotal      
ABPersonFullTextSearch_docsize   _SqliteDatabaseProperties 

The abperson table seems to be the obvious target for the data we want.  Here's its schema:


CREATE TABLE ABPerson (ROWID INTEGER PRIMARY KEY AUTOINCREMENT, First TEXT, Last TEXT, Middle TEXT, FirstPhonetic TEXT, MiddlePhonetic TEXT, LastPhonetic TEXT, Organization TEXT, Department TEXT, Note TEXT, Kind INTEGER, Birthday TEXT, JobTitle TEXT, Nickname TEXT, Prefix TEXT, Suffix TEXT, FirstSort TEXT, LastSort TEXT, CreationDate INTEGER, ModificationDate INTEGER, CompositeNameFallback TEXT, ExternalIdentifier TEXT, ExternalModificationTag TEXT, ExternalUUID TEXT, StoreID INTEGER, DisplayName TEXT, ExternalRepresentation BLOB, FirstSortSection TEXT, LastSortSection TEXT, FirstSortLanguageIndex INTEGER DEFAULT 2147483647, LastSortLanguageIndex INTEGER DEFAULT 2147483647, PersonLink INTEGER DEFAULT -1, ImageURI TEXT, IsPreferredName INTEGER DEFAULT 1)

If you bothered to read through that, you'll notice there is nothing about phone numbers or email addresses.  We need to find the table containing the phone numbers, and somehow join that to this table that has the name data.

Take a look at a sample (fictitious) row:

                  ROWID = 1
                  First = Some
                   Last = Name
                 Middle = 
          FirstPhonetic = 
         MiddlePhonetic = 
           LastPhonetic = 
           Organization = SomeCompany
             Department = 
                   Note = 
                   Kind = 0
               Birthday = 
               JobTitle = 
               Nickname = 
                 Prefix = 
                 Suffix = 
              FirstSort = -'- 1'?7=W +'= 17I/ )'MM'=7CA +57/1
               LastSort = 1'?7=W -'- +'= 17I/ )'MM'=7CA +57/1
           CreationDate = 350880583
       ModificationDate = 369882045
  CompositeNameFallback = 
     ExternalIdentifier = 
ExternalModificationTag = 
           ExternalUUID = 68516E99-7C39-4C4D-8871-BDE114EDD6B4
                StoreID = 0
            DisplayName = 
 ExternalRepresentation = 
       FirstSortSection = -
        LastSortSection = 1
 FirstSortLanguageIndex = 0
  LastSortLanguageIndex = 0
             PersonLink = -1
               ImageURI = 
        IsPreferredName = 1

A little more digging and we find the ABMultivalue table contains phone numbers, email addresses, URLs to social networking sites, and potentially more.  There are only six fields (making it a little easier on the eyes), but there's a minor issue.  Reading the table flat (that is, not related to any other tables) we don't know whom the numbers and email addresses belong to, nor do we know what type of data it is.  For example, is it a home, work, mobile, or fax phone number? Take a look at its makeup and some sample rows and you'll see what I mean:

CREATE TABLE ABMultiValue (UID INTEGER PRIMARY KEY, record_id INTEGER, property INTEGER, identifier INTEGER, label INTEGER, value TEXT)

       UID = 643
 record_id = 1
  property = 4
identifier = 0
     label = 3
     value = someguy@somedomain.com

       UID = 1026
 record_id = 1
  property = 3
identifier = 0
     label = 1
     value = (###) ###-####

That's a little easier on the eyes.  You can see that the 'value' field at the end of the row contains the phone numbers and email addresses we're looking for.  The problem comes from the fact it's the only text field in the table.  All other values are integers.  So, how do we relate the names in the ABPerson table to the numbers in the ABMultivalue table?  I made it pretty obvious here, but ABperson.ROWID = ABMultivalue.record_id.

So, what about that other problem: what kind of data is that?  Here comes the ABMultiValueLabel table to the rescue!  Take a look at this:

value = iPhone

value = _$![MOBILE]!$_

value = _$![HOME]!$_

value = mobile1

value = mobile2

value = DAD

value = MOM

If that looks a bit strange, it's because the iOS user can create their own labels.  Some, like mobile1 and mobile2, are going to be default.  Others, like MOM and DAD, are custom.  But something else should catch your eye, too: there are no values in the ABMultiValueLabel table with which to match the label integer in the ABMultivalue table!  Now what?

I introduced the SQLite rowid field in an earlier article.  In sum, every SQLite table has a rowid field, which is a unique integer, whether or not it was specified when the table was created.  In the case of the ABMultiValueLabel table, the rowid relates to the label integer in the ABMultivalue table and informs us of the data type.  Observe:


$ sqlite3 -line AddressBook.sqlitedb 'select rowid, value from abmultivaluelabel where rowid = 1 or rowid = 3'

rowid = 1
value = iPhone

rowid = 3
value = _$![Home]!$_

Now that we know how these tables interrelate, we can compose a query to generate a simple contacts list.

A Simple Contact List

To quickly recap, the ABPerson, ABMultivalue, and ABMultivalueLabel tables are interrelated: ABPerson contains the contact names, ABMultivalue houses the phone numbers, email addresses, and potentially other values (e.g. URLs), and ABMultivalueLabel defines the type of each value.  We can create a list of contacts, sorted by contact identification number, thusly:

$ sqlite3 -header AddressBook.sqlitedb 'select p.rowid, p.first, p.last, p.organization, l.value as Type, m.value from abperson p, abmultivalue m, abmultivaluelabel l where p.rowid=m.record_id and l.rowid=m.label order by p.rowid;'

ROWID|First|Last|Organization|Type|value

1|Some|Guy|SomeCompany|_$![Home]!$_|someguy@somedomain.com
1|Some|Guy|SomeCompany|iPhone|(###) ###-####

(Note: Keep reading! Though it might not be apparent, this query above is flawed.)

Checking our work

While the query above works nicely, it does not return all the data available in the tables.  Let's take a look at some database stats and you'll see what I mean:

$ sqlite3 AddressBook.sqlitedb 'select count(*) from abperson;' 
542


$ sqlite3 AddressBook.sqlitedb 'select count(*) from abmultivalue;'
1588

$ sqlite3 AddressBook.sqlitedb 'select count(*) from abmultivaluelabel;'
14


The ABPerson table has 542 entries.  We'd expect at least 542 lines in our query, assuming one phone number, email or other value in the ABMultivalue table.  Considering there are 1588 ABMultivalue rows, we actually expect an average of three rows per contact.  But, how many rows are returned with our "Simple Contact List" query?

$ sqlite3 AddressBook.sqlitedb 'select count(*) from abperson p, abmultivalue m, abmultivaluelabel l where p.rowid=m.record_id and l.rowid=m.label;'
305

"Ruh-row, Raggy!"  What happened?  Maybe there are contacts in ABPerson not found in ABMultivalue?  

$ sqlite3 AddressBook.sqlitedb 'select count(*) from abperson where rowid not in (select record_id from abmultivalue);' 
0

No, all 542 rowids from ABPerson are found in the record_ids of ABMultivalue.  We certainly want all the ABPerson records represented in our output, and clearly we're not getting them with the query as written.  But how many of the ABMultivalue records relate to the ABPerson records?

$ sqlite3 AddressBook.sqlitedb 'select count(*) from abmultivalue where record_id in (select rowid from abperson);' 
1588

 All of them.  So, now we know the score: our query won't be correct until we have 1588 rows of data.  So, where's the breakdown?  To find out, we need to look carefully at our original query:

select              # here we are just selecting the columns we desire
    p.rowid, 
    p.first, 
    p.last, 
    p.organization, 
    l.value as Type, 
    m.value 
from                # here we list the source tables, and assign an alias
    abperson p, 
    abmultivalue m, 
    abmultivaluelabel l 
where               # here we filter the data, and this is our danger area
    p.rowid=m.record_id 
    and l.rowid=m.label 
order by            # here we are simply ordering the results
    p.rowid;

So, it would appear we need to concentrate on the where clause which has the effect of filtering our data.  We are doing two things here:
  1. p.rowid=m.record_id - ABPerson's rowid must match an ABMultivalue record_id, which we have already shown occurs in all cases
  2. l.rowid=m.label - There must be a matching rowid in ABMultivalueLabel for each ABMultivalue label.
It appears that the second statement in the where clause is our culprit.  Can we demonstrate this to be the case and verify our assessment of the original query?

$ sqlite3 AddressBook.sqlitedb 'select count(*) from abperson, abmultivalue where rowid=record_id;' 
1588

Replacing our count function with the original field names (substituting the label integer for the label name, because we don't refer to the ABMultivalueLabel table in this instance) doesn't change the count, because the select statement is where we define the fields to be displayed, not where we filter the results.  I'll demonstrate:

$ sqlite3 AddressBook.sqlitedb 'select p.rowid, p.first, p.last, p.organization, m.label as Type, m.value from abperson p, abmultivalue m where p.rowid = m.record_id;' | wc -l
1588

So, it was the final filter that resulted in the reduced data set.  That means there must be undefined labels in the ABMultivalue table (and in fact, 1283 of the records have NULL label values)!  So, can we just delete that filter from our original query and call it good?

$ sqlite3 AddressBook.sqlitedb 'select p.rowid, p.first, p.last, p.organization, l.value as Type, m.value from abperson p, abmultivalue m, abmultivaluelabel l where p.rowid=m.record_id order by p.rowid;' | wc -l
22233

What?!  Over 22,000 rows are returned?  How does that happen?  It's simple, but it's also easily misunderstood.  Without our last filter, our select statement says to show the ABMultivaluelabel value field, but nothing relates that table to the others.  There are 14 rows in that table, so we get all 14 of them for each of the 1588 records in the ABMultivalue table: 14 x 1588 = 22,232 rows, plus the header line counted by wc -l.  I'm not a math genius, but no amount of adding 542, 1588, and 14 gets anywhere near 22K; this is multiplication, not addition!

So, we need a way to display the label name if it exists in the ABMultivaluelabel table, and otherwise display the ABMultivalue label integer.  Even if we don't know what it means, we don't want to ignore an ABMultivalue entry, because we may come to understand its meaning through another phase of our investigation.  We can accomplish this with a pair of nested select statements in the primary select statement.

select 
    p.rowid, 
    p.first, p.last, 
    p.organization, 
    case 
        when m.label in (select rowid from abmultivaluelabel) 
        then (select value from abmultivaluelabel where m.label = rowid) 
        else m.label 
    end as Type, 
    m.value  
from 
    abperson p, 
    abmultivalue m 
where 
    p.rowid=m.record_id 
order by 
    p.rowid;

The new piece in the query above is a case statement.  Case is the if/then/else of SQLite.  What we are saying above is: if the ABMultivalue label is in the ABMultivalueLabel table, then print the English translation from the ABMultivalueLabel table.  Otherwise, just print the label integer.  The 'as Type' suffix just labels the column 'Type' in the output.

Let's see what this does to our results:

$ sqlite3 AddressBook.sqlitedb 'select p.rowid, p.first, p.last, p.organization, case when m.label in (select rowid from abmultivaluelabel) then (select value from abmultivaluelabel where m.label = rowid) else m.label end as Type, m.value  from abperson p, abmultivalue m where p.rowid=m.record_id order by p.rowid;' | wc -l
1588

Bingo!  We now have our expected results.  One more tiny addition to our query cleans up our output.  It turns out that a lot of the entries in the ABMultivalue table have NULL values and are useless for our purpose.  We can remove them with a filter in the where clause and get a final product:

$ sqlite3 AddressBook.sqlitedb 'select p.rowid, p.first, p.last, p.organization, case when m.label in (select rowid from abmultivaluelabel) then (select value from abmultivaluelabel where m.label = rowid) else m.label end as Type, m.value  from abperson p, abmultivalue m where p.rowid=m.record_id and m.value not null order by p.rowid;' | wc -l
725

So, we improved our results greatly over the initial 305 records that "seemed right".  

The Larger Lesson Learned

You likely tuned into this article because of the iOS6 AddressBook.sqlitedb topic.  And I hope I've helped you extract the results you sought.  But the bigger takeaway comes from the analysis after the initial query that resulted in 305 records.  And it is this: SQLite will do what you tell it, so long as you use proper syntax, no matter how stupid your query is.  In other words, you ask the question and it answers that question: the one you typed, not the one you were thinking.  Make sure you really understand the question before you rely on the answer!



Addendum: What About Street Addresses?

It's great that we've figured out the phone numbers, email addresses, etc. of our contacts, but what if we want to find their physical addresses?  The ABMultivalue table I examined did not have street addresses in its values.  Instead, I found street addresses in the ABMultivalueEntry and ABPersonFullTextSearch_content tables.  I would not preclude addresses from existing in ABMultivalue, however, since we have seen there can be custom labels and the value field can contain any text data.  But, I digress.

The ABPersonFullTextSearch_content has the following structure:

CREATE TABLE 'ABPersonFullTextSearch_content'(docid INTEGER PRIMARY KEY, 'c0First', 'c1Last', 'c2Middle', 'c3FirstPhonetic', 'c4MiddlePhonetic', 'c5LastPhonetic', 'c6Organization', 'c7Department', 'c8Note', 'c9Birthday', 'c10JobTitle', 'c11Nickname', 'c12Prefix', 'c13Suffix', 'c14DisplayName', 'c15Phone', 'c16Email', 'c17Address', 'c18SocialProfile', 'c19URL', 'c20RelatedNames', 'c21IM', 'c22Date')

At first glance, the structure of this table might cause you to think you can simply dump the data from this table for a very complete list of contacts.  But this table was intended for searching if you take its name at face value.  Looking at the data, this assertion is supported by the fact that phone numbers are merged into one field and then recorded in a variety of formats (i.e., international, with/without area code, with/without punctuation, etc).  Thus, it is difficult to read and there is no distinction in the type of data (i.e, home/work phone number, etc).

EDIT:
I have discovered how the ABMultivalueEntry relates to the ABMultivalue table and solved the mystery of the blank ABMultivalue "values" fields.

ABMultivalueEntry contains physical addresses, which are distinguished by a "key" integer.  The key is given its meaning, such as "Country," "Street," "Zip," "City," etc., by the ABMultivalueEntryKey table: the key in ABMultivalueEntry relates to the rowid in ABMultivalueEntryKey.  The address data itself is stored in the "value" field of the ABMultivalueEntry table, which in turn relates to the ABmultivalue table by matching its "parent_id" field to the UID field.  Confused?

Let me try it this way:  Contact names are in the ABperson table.  Contact phone numbers, email addresses, and URLs are located in the ABmultivalue table.  Physical addresses are located in the ABMultivalueEntry table.  The definitions of the values in the "*value*" tables are located in the "*label*" and "*key*" tables discussed above.
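To make that chain concrete, here is a sketch of the street address portion only, following the relationships just described: ABMultiValueEntry.parent_id to ABMultiValue.UID, ABMultiValue.record_id to ABPerson.ROWID, and ABMultiValueEntry.key to ABMultiValueEntryKey.rowid.  I am assuming the ABMultiValueEntryKey table stores its key names in a column called value, so verify the column names against your own database before relying on the output:

import sqlite3

conn = sqlite3.connect('AddressBook.sqlitedb')

# one row per address component (Street, City, ZIP, ...); column names are assumptions to verify
query = '''
    SELECT p.ROWID, p.First, p.Last,
           k.value AS Part,
           e.value AS Value
    FROM ABPerson p, ABMultiValue m, ABMultiValueEntry e, ABMultiValueEntryKey k
    WHERE m.record_id = p.ROWID
      AND e.parent_id = m.UID
      AND e.key = k.rowid
    ORDER BY p.ROWID;
'''

for row in conn.execute(query):
    print(row)

conn.close()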

What is the query to produce a complete contact list?  Still working on that one...




Christmas Comes Early: HFS/HFS+ Mounting

Note: This is my first attempt at using asciidoc to create a more legible document for the reader. My thanks go out to Dr. Michael Cohen at Google for pointing out the formatting tool.

The Ghost of HFS Past

It is no great revelation that Apple favors the HFS/HFS+ file systems. But, when I started my adventures in Linux-based forensics, my chief forensics tool, The Sleuthkit, did not support HFS file systems. The solution was to mount the file systems read-only and examine the allocated files with standard Linux tools and applications. Even when The Sleuthkit added HFS support, it was still useful to mount the partition, especially when it came to examining media files.

Note

Microsoft Windows

Windows users have been doing pretty much the same using tools like HFSExplorer.

The process in Linux was pretty straight forward:

  1. Image the device to be examined.

  2. Read the partition table to identify the byte-offset to the HFS partition in the image.

  3. Use the mount command with the byte-offset to mount the partition.

For example

# mmls macbookpro320.dd -a
GUID Partition Table (EFI)
Offset Sector: 0
Units are in 512-byte sectors

Slot Start End Length Description
04: 00 0000000040 0000409639 0000409600 EFI system partition
05: 01 0000409640 0624880263 0624470624 Customer

# mount -o ro,loop,offset=$((409640*512)) macbookpro320.dd /mnt/analysis

The mount command attaches file systems on storage devices to the GNU Linux file tree. Conversely, the umount (not ‘unmount’ as you might expect) command detaches the device file system from the GNU Linux file system. The mount command above worked with the macbookpro320.dd disk image because of the loop option.

Loop Device

In Unix-like operating systems, a loop device, vnd (vnode disk), or lofi (loopback file interface) is a pseudo-device that makes a file accessible as a block device.

http://en.wikipedia.org/wiki/Loop_device
— Wikipedia


The Ghost of HFS Present

But, somewhere around the 2.6.34 Linux kernel, the mount command stopped working with HFS loop devices in the manner demonstrated above. The mount command can still be used to mount a partition addressed by its block device file, but not one addressed by byte-offset into a raw device or image as in the illustration above.

By Partition Block Device File (verbose mode for illustration)

# blkid
/dev/sdd1: LABEL="EFI" UUID="2860-11F4" TYPE="vfat"
/dev/sdd2: UUID="f2477645-5489-3419-b477-d504574057e3" LABEL="Macintosh
HD" TYPE="hfsplus"

# mount -o ro -v /dev/sdd2 /mnt/analysis/
mount: you didn't specify a filesystem type for /dev/sdd2
I will try type hfsplus
/dev/sdd2 on /mnt/analysis type hfsplus (ro)

# umount -v /mnt/analysis/
/dev/sdd2 has been unmounted

As you can see, the mount command works as expected when the partition is addressed by the special block device file. But, when the partition is addressed by byte-offset from the raw device or disk image, it fails:

By Byte-Offset (block device shown)

# mmls -a /dev/sdd
GUID Partition Table (EFI)
Offset Sector: 0
Units are in 512-byte sectors

Slot Start End Length Description
04: 00 0000000040 0000409639 0000409600 EFI system partition
05: 01 0000409640 0624880263 0624470624 Customer

# mount -o ro,offset=$((409640*512)) /dev/sdd /mnt/analysis
mount: wrong fs type, bad option, bad superblock on /dev/loop2,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so
Note: This bug only applies to HFS-formatted partitions. Other common file systems (FAT, NTFS, EXT2/3/4) do not seem to be affected.

The Ghost of HFS Future

The work around for storage devices is obvious: point mount at the partition’s block device file. But in forensics, we work with disk images to avoid making changes to the original storage medium. How do we address HFS partitions in disk images if we can’t address them by offset?

Enter kpartx. The kpartx command reads partition tables and maps partitions to device files. It works on devices and disk images. This means we can map HFS partitions in a disk image to a special block device file and mount those partitions by addressing that block device file as if it were part of an attached device!

kpartx

# kpartx
usage : kpartx [-a|-d|-l] [-f] [-v] wholedisk
-a add partition devmappings
-r devmappings will be readonly
-d del partition devmappings
-u update partition devmappings
-l list partitions devmappings that would be added by -a
-p set device name-partition number delimiter
-g force GUID partition table (GPT)
-f force devmap create
-v verbose
-s sync mode. Don't return until the partitions are created

To demonstrate its use, we will use a raw (dd) image from a MacBook Pro. Other disk image formats will work too, so long as they can be exposed to GNU Linux as raw devices (think xmount).

Image Information

# img_stat macbookpro320.dd
IMAGE FILE INFORMATION
--------------------------------------------
Image Type: raw

Size in bytes: 320072933376

We can read the partitions in the disk image with kpartx.

kpartx: Detected Partitions

# kpartx -l macbookpro320.dd
loop1p1 : 0 409600 /dev/loop1 40
loop1p2 : 0 624470624 /dev/loop1 409640

# kpartx -lg macbookpro320.dd
loop1p1 : 0 409600 /dev/loop1 40
loop1p2 : 0 624470624 /dev/loop1 409640
Note

The GUID Partition Table

kpartx has an option to force a GUID partition table, which the MacBook uses, but as demonstrated, it wasn’t necessary in this case.

All that remains to do is to map the partitions so that we can mount them. We can use the -r option to create read-only devices.

kpartx: Mapping Partitions

# kpartx -av macbookpro320.dd
add map loop1p1 (254:0): 0 409600 linear /dev/loop1 40
add map loop1p2 (254:1): 0 624470624 linear /dev/loop1 409640

Now, mounting is as simple as pointing at the new block device files, which are found in the /dev/mapper directory.

Mounting special HFS block devices

# mount -o ro -v /dev/mapper/loop1p2 /mnt/analysis
mount: you didn't specify a filesystem type for /dev/mapper/loop1p2
I will try type hfsplus
/dev/mapper/loop1p2 on /mnt/analysis type hfsplus (ro)

# mount
...
/dev/mapper/loop1p2 on /mnt/analysis type hfsplus (ro,relatime,umask=22,
uid=0,gid=0,nls=utf8)

After you have completed your analysis, unmount the partition and remove the special block devices.

Cleaning up

# umount -v /mnt/analysis/
/dev/mapper/loop1p2 has been unmounted

# kpartx -dv macbookpro320.dd
del devmap : loop1p2
del devmap : loop1p1
loop deleted : /dev/loop1
Note

Accessing Unallocated File System Space

Unallocated space in HFS partitions is addressable through The Sleuthkit with the blkls tool.

Now, go forth and conquer!

Recovering Data from Deleted SQLite Records: Redux


Rising from the Ashes

I’ve received many, many inquiries about recovering deleted records from SQLite databases ever since I posted an article about my first attempt to recover deleted data. Well, the hypothesis of checking the difference between the original database and a vacuumed copy seemed sound at the time and did in fact yield dropped record data, but it also included data from allocated records. The main thing I learned was that I had much to learn about the SQLite database file format.

Since that time, I’ve run into more and more SQLite databases, and the issue of recovering dropped records has become paramount. I have learned how to do so, and I’ll share some of the secrets with you now. But first, you need to know a little about SQLite databases…

This article is not a treatise on the SQLite file format. The best resource for that is located at SQLite.org. I hope to put the salient points here so you can understand the complexity of the task of recovering dropped records from SQLite databases.

SQLite Main Database Header

The first 100 bytes of a SQLite database define and describe the database. The key value for record recovery is the page size, a 16-bit (two-byte) big-endian integer at byte offset 16. SQLite databases are divided into pages, usually matching the underlying file system block size. Each page has a single use, and the page type containing the records of interest to forensic examiners is the table b-tree leaf page, which I’ll refer to as the TLP. The TLP is distinguished from other page types by its first byte, \x0d or integer 13.

Thus, we can find the TLPs with the knowledge of the database page size we obtain from the database header and check the first byte of each page for \x0d. In python, that might look like:

Python 3: Finding table b-tree leaf pages
from struct import unpack

with open('some.db', 'rb') as f:
    data = f.read()

# the page size is a two-byte big-endian unsigned integer at byte offset 16
pageSize = unpack('>H', data[16:18])[0]

pageList = []

# a table b-tree leaf page begins with \x0d (integer 13)
for offset in range(0, len(data), pageSize):
    if data[offset] == 13:
        pageList.append(offset)

print(pageList)
Note
The code above prints the offset of TLPs. Make sure you are using Python 3 if you want to try this for yourself.

Table B-Tree Leaf Pages

The TLPs hold the records, and consequently, the dropped (deleted) records data when they occur. Each page has an 8-byte header, broken down as follows:

Table 1. Table b-tree leaf page header
Offset  Size  Value
0       1     Page byte \x0d (int 13)
1       2     Byte offset to first freeblock
3       2     Number of cells
5       2     Offset to first cell
7       1     Number of freebytes

The header introduces some terms that need explaining. A freeblock is unallocated space in the page below one or more allocated records. It is created by the dropping of a record from the table. It has a four-byte header: the first two bytes are a 16-bit big-endian integer pointing to the next freeblock (zero means it is the last freeblock), and the second two bytes are a 16-bit big-endian integer representing the size of the freeblock, including the header.

Cells are the structures that hold the records. They are made up of a payload length, key, and payload. The length and key, also known as the rowid, are variable-length integers. What are those? I’m glad you asked:

Variable-length Integers

A variable-length integer or "varint" is a static Huffman encoding of 64-bit twos-complement integers that uses less space for small positive values. A varint is between 1 and 9 bytes in length. The varint consists of either zero or more bytes which have the high-order bit set followed by a single byte with the high-order bit clear, or nine bytes, whichever is shorter. The lower seven bits of each of the first eight bytes and all 8 bits of the ninth byte are used to reconstruct the 64-bit twos-complement integer. Varints are big-endian: bits taken from the earlier byte of the varint are more significant than bits taken from the later bytes.

http://www.sqlite.org/fileformat2.html
— SQLite.org

I won’t go into varints any further, because I will not be discussing cell decoding in this post. Suffice it to say that with the payload length, we can define the payload, which itself is made up of a header and columns. The header is a list of varints, the first describing the header length and the remainder describing the column data and types. The page header contains the number of cells and the offset to the first cell on the page.

The last value in the header, freebytes, describes the number of fragmented bytes on the page. Fragmented bytes are byte groupings of three or less that cannot be reallocated to a new cell (which takes a minimum of four bytes).

Immediately following the page header is a cell pointer array. It is made up of 16-bit big-endian integers equal in number to the number of cells on the page. Thus, if there are 10 cells on the page, the array is 20 bytes long (10 two-byte entries).

Page Unallocated Space

There are three types of unallocated space in a TLP. Freeblocks and freebytes we’ve discussed, and the third is the space between the end of the cell array and the first cell on the page referred to in the SQLite documentation as simply "unallocated". Freeblocks and unallocated can contain recoverable record data, while freebytes are too small for interpretation. Thus, knowing the first freeblock (defined in the page header), the length of the cell array (interpreted from the number of cells defined in the page header) and the offset to the first cell (yep, you guessed it, defined in the page header), we can recover all the unallocated space in the page for analysis.

Python 3: Finding table b-tree leaf page unallocated space
for offset in pageList:
    page = data[offset: offset + pageSize]

    # 8-byte page header: type, first freeblock, cell count, cell content offset, freebytes
    pageByte, fbOffset, cellQty, cellOffset, freebytes = unpack('>BHHHB', page[:8])

    # unallocated space runs from the end of the cell pointer array
    # to the start of the cell content area
    start = 8 + cellQty * 2
    unalloc = page[start:cellOffset]
    print(offset, unalloc, sep=',')

    # walk the freeblock chain, if any
    while fbOffset != 0:
        nextFb, size = unpack('>HH', page[fbOffset: fbOffset + 4])
        freeblock = page[fbOffset: fbOffset + size]
        print(offset, freeblock, sep=',')
        fbOffset = nextFb

With the lines from the two code boxes, we have coaxed the unallocated data from the "some.db" SQLite database. We have printed the offset of each unallocated block and the contents (in python bytes format) to stdout. With just a little manipulation, we can turn this into a reusable program, and the content can be grepped for strings. At bare minimum, we now have a way to determine if there is deleted content in the database related to our investigation, e.g., we could grep the output of the Android mmssms.db for a phone number to see if there are deleted records. Searching against the whole database would not be as valuable, because there we cannot separate the allocated from the unallocated content!
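For instance, if the output of the code above were redirected to a file, a quick keyword filter might look like the following; the search term is, of course, case-specific and purely illustrative:

# suppose the offset,bytes output above was redirected to carved.txt
keyword = '5551212'   # hypothetical fragment of a phone number of interest

with open('carved.txt') as f:
    for line in f:
        if keyword in line:
            print(line.rstrip())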

Now, this obviously does not reconstruct the records for us, but recovering the unallocated data is a good start. In future posts I will describe how to reconstruct allocated records with an eye towards reconstructing unallocated records.


iPhone: Recovering from Recovery


I was attempting to brute force an iPhone 4 passcode for data recovery. The phone was in poor condition and had undergone modifications: the home button had been replaced as well as the back cover, maybe more. I could not reliably get the phone into recovery mode, possibly the result of a faulty home button, so I used libimobiledevice’s ideviceenterrecovery command.

It worked wonderfully. Maybe too wonderfully. I eventually achieved DFU mode (the home button was probably the culprit in making this normally simple process quite difficult), executed my exploit, and obtained the passcode and data. My goal was to unlock the phone and pass it off to another investigator. But, when I rebooted the phone after DFU mode I found it was in recovery again!

I tried a variety of things, from trying to reset with FirmwareUmbrella (formerly TinyUmbrella) to disassembling the phone and disconnecting the battery, but nothing worked. Then a colleague (thanks, Perry) suggested iRecovery.

What is libirecovery?

libirecovery is a cross-platform library which implements communication to iBoot/iBSS found on Apple’s iOS devices via USB. A command-line utility is also provided.

https://github.com/libimobiledevice/libirecovery
— libirecovery

libirecovery can be compiled in Linux. I found I had to install the libreadline-dev package in my Ubuntu install, but you may find you have to do more depending on the packages you already have installed. Building requires you to execute autogen.sh followed by make and then make install. I also had to run ldconfig to register the library, since this was not done automatically.

The command line utility is the irecovery tool. It is used as follows:

iRecovery - iDevice Recovery Utility
Usage: irecovery [args]
-i <ecid> Target specific device by its hexadecimal ECID
-v Start irecovery in verbose mode.
-c <cmd> Send command to client.
-f <file> Send file to client.
-k [payload] Send usb exploit to client.
-h Show this help.
-r Reset client.
-s Start interactive shell.
-e <script> Executes recovery shell script.

On first blush, it might seem that the solution to my problem was the command irecovery -r to reset the device. But that is not so. Instead, I needed to enter the shell, change an environment variable, and reboot.

iRecovery Shell
$ sudo irecovery -s
> setenv auto-boot true
> saveenv
> reboot
Important
Running the command as root was required or the program failed with a segmentation fault.

The device rebooted into the normal operating system and I was able to unlock it with the passcode I had recovered. If you find yourself in a recovery loop, I hope this post will help you, uh, recover from it!


Making Sense of the Senseless


SQLite to the Rescue

One of the tasks I’m asked to perform is to geolocate mobile phone calls from Call Detail Reports (CDR). These usually arrive from a carrier as spread sheets: one with details of calls to and from a particular number, and one or more cell tower listings. I’ve tried a variety of ways to process these over time such as BASH scripting and python coding. But by far the easiest and most flexible way to process these records is by importing them into a SQLite database.

The long term difficulty in processing CDRs is that they change over time. It seems that every time I get new records to process, the format has changed which breaks previous code. It takes much more effort to recode a script than it does to write a SQL query on the fly, and I’m certainly no SQL guru. SQLite has enough built in functions to handle nearly any problem you might encounter.

Let’s take some Sprint records I recently processed as an example. I was tasked with plotting the locations of the voice calls on a map. There were over 3,600 records for a 12-day period, with text message details mixed with voice call details. Only the call details contained references to the tower records, however. Call records included five-digit integers that represented the first and last cell towers the mobile phone used during the communication. Text messages contained only zeros in these columns.

The challenge was to retrieve the call records for mapping, ignoring the text messages that did not contain cell tower details. SQLite seemed the easiest way to accomplish this in light of the follow up requirement of looking up each cell tower integer in any one of four associated tower record spread sheets.

Creating the Database

The first step was to create a SQLite database. Fortunately, creating a database is a simple, straightforward process. I performed the work in the SQLite command line program. However, GUI tools like the excellent SQLite Manager can accomplish the same thing, and I recommend them if you are new to SQLite as they can be good teachers.

To create a new database, I simply provided the new database name when I opened the command line program. I called my database cdr.sqlite.

Creating a SQLite database
$ sqlite3 cdr.sqlite
SQLite version 3.7.17 2013-05-20 00:56:22
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite>

Next, I needed a table to hold the call records. I made this something easy to type, so I called it simply cdr. The columns I named for the columns found in the spreadsheet sent by Sprint.

SQLite Create Table Statement for the Sprint Call Detail Report
sqlite> CREATE TABLE "cdr"
   ...> (
   ...> "Calling_NBR" TEXT,
   ...> "Called_NBR" TEXT,
   ...> "Dialed_Digits" TEXT,
   ...> "Type" TEXT,
   ...> "Start_Date" TEXT,
   ...> "End_Date" TEXT,
   ...> "Duration" INTEGER,
   ...> "NEID" INTEGER,
   ...> "Repoll" TEXT,
   ...> "First_Cell" INTEGER,
   ...> "Last_Cell" INTEGER
   ...> );
sqlite>
Note
In the SQLite command line program, all SQL queries must end in a semi-colon or the interpreter assumes you are adding lines to the statement. If you forget to put the semi-colon at the end of your command, you can enter it on the next line.

To import the CDR data from the spreadsheet, simply export the data, without the header row, to a text file with comma-separated values (CSV). In this case, I called the file "call_records.csv". I had to tell SQLite how the data was delimited (SQLite uses pipes by default), and then import the CSV file into the "cdr" table.

Importing the CDR data into the cdr table
sqlite> .separator ","
sqlite> .import call_records.csv cdr
Note
The "dot" commands are special SQLite functions (use .help to view them all) and do not require a semi-colon.

To import the cell tower details, I followed the same process: I created a table I called "towers" using the column headers from the spreadsheet as the column names of the table. Then I exported the cell tower data from each spreadsheet to a CSV file and imported the CSVs into the towers table. While I won’t repeat the full process, I will display the table layout (schema) below, followed by a sketch of the import commands.

Towers table schema
sqlite> .schema towers
CREATE TABLE "towers"(
"Cell" INTEGER,
"Cascade" TEXT,
"Switch" TEXT,
"NEID" INTEGER,
"Repoll" INTEGER,
"Site_Name" TEXT,
"Address1" TEXT,
"Address2" TEXT,
"City" TEXT,
"County" TEXT,
"State" TEXT,
"Zip" TEXT,
"Latitude" TEXT,
"Longitude" TEXT,
"BTS_Manufacturer" TEXT,
"Sector" TEXT,
"Azimuth" TEXT,
"CDR_Status" TEXT
);
sqlite>
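
The towers import then followed the same pattern as the cdr import. A sketch, assuming the exported spreadsheets were saved as tower1.csv through tower4.csv (hypothetical names):

Importing the tower data into the towers table (sketch)
sqlite> .separator ","
sqlite> .import tower1.csv towers
sqlite> .import tower2.csv towers
sqlite> .import tower3.csv towers
sqlite> .import tower4.csv towers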

Looking Up Records

Now it was time to look up the call records in the tower tables to find the latitude and longitude of the tower used to initiate the call and place the mobile device in time and space. I expected it to be straightforward: take the 5-digit tower number from the first_cell field of the cdr table, match it to the cell field in the towers table, and return the map coordinates. Easy peasy, right? The SQL equivalent of show me the latitude and longitude of the tower where the CDR first_cell integer matches the Tower cell integer.

First attempt to match call tower numbers to tower coordinates
sqlite> select latitude, longitude from towers, cdr where first_cell = cell;
sqlite> # Ruh roh, raggy, no matches!
sqlite> select first_cell from cdr where first_cell != 0 limit 5;
40385
10962
10962
20962
30392
sqlite> select cell, latitude, longitude from towers;
<redacted>
385|34.046944|-118.448056
385|34.046944|-118.448056
385|34.046944|-118.448056
392|34.063806|-118.30366
392|34.063806|-118.30366
392|34.063806|-118.30366
962|37.657222|-122.094653
962|37.657222|-122.094653
962|37.657222|-122.094653
385|37.838333|-122.298611
385|37.838333|-122.298611
392|37.693|-122.0939
392|37.693|-122.0939
392|37.693|-122.0939
385|37.403633|-121.89436
385|37.403633|-121.89436
385|37.403633|-121.89436
<redacted>
sqlite>

Whoa, the cdr first_cell and towers cell values do not jibe! And, as we can see, there is more than one entry in the tower table for each cell designator. Take cell 385 for example: there are three distinct groupings of tower 385 with three different map coordinates for each group. It turns out that cell towers are grouped by the switch they are part of, recorded in the CDR and tower records as the NEID. The appropriate cell tower record can be further reconciled by the sector number, or side of the tower from which the call originated. The first_cell value is actually a concatenation of the sector and the tower number. How did I figure all this out? The answer came from reading the documentation (RTFM) that came with the records and some analysis of the spreadsheets.

I’ll demonstrate below the values that make a tower record unique and that must be considered when matching call records to tower details. I’ll focus on tower 385.

Values that make the tower records unique.
sqlite> .headers on
sqlite> .mode columns
sqlite> select cell, sector, neid, azimuth, latitude, longitude from towers
   ...> where cell = 385;
Cell        Sector      NEID        Azimuth     Latitude    Longitude
----------  ----------  ----------  ----------  ----------  -----------
385         1           65          60          34.046944   -118.448056
385         2           65          200         34.046944   -118.448056
385         3           65          290         34.046944   -118.448056
385         1           512         55          37.838333   -122.298611
385         2           512         155         37.838333   -122.298611
385         1           95          0           37.403633   -121.89436
385         2           95          110         37.403633   -121.89436
385         3           95          190         37.403633   -121.89436
sqlite>

Now it is easy to see that the three different groupings of tower 385 are a result of that tower designator being used in three different switches, or NEIDs. Further, each tower can be resolved to a sector, which corresponds to a unique azimuth, or direction the cell tower antenna points.

SQLite Substrings

The remaining problem in querying this data is the configuration of the first_cell value in the call details. Recall that it is the sector concatenated with the tower number. I needed a way to take the first digit from the integer and assign it to a sector value, and use the remaining four digits as the tower designator. Fortunately, SQLite has a built-in substring function to make this easy.

substr(X,Y,Z), substr(X,Y)

The substr(X,Y,Z) function returns a substring of input string X that begins with the Y-th character and which is Z characters long. If Z is omitted then substr(X,Y) returns all characters through the end of the string X beginning with the Y-th. The left-most character of X is number 1. If Y is negative then the first character of the substring is found by counting from the right rather than the left. If Z is negative then the abs(Z) characters preceding the Y-th character are returned. If X is a string then characters indices refer to actual UTF-8 characters. If X is a BLOB then the indices refer to bytes.

http://www.sqlite.org/lang_corefunc.html
— SQLite

From the SQLite documentation, we see that the substr function takes two or three arguments and returns a substring of the input string according to those arguments. To return the sector, I needed to take the first digit from the first_cell string in this manner: substr(first_cell, 1, 1). To return the tower identification, I needed to skip the first digit and return the rest of the string thusly: substr(first_cell, 2).

Note
I did not need to specify the third argument in the second substr() expression because I wanted the entirety of the string past the first digit.

Finally, I needed to include the NEID from the call detail records to ensure I’ve looked up the correct tower. Putting it all together, we can see how to create the values we need from the call records to find the matching tower details. I’ve added a second query to demonstrate, using the ltrim() function to strip the leading zeros from the cell column.

Using the SQLite substr() function
sqlite> select substr(first_cell,1,1) as sector,
   ...>        substr(first_cell,2) as cell, neid
   ...> from cdr where first_cell != 0 limit 5;
sector      cell        NEID
----------  ----------  ----------
4           0385        95
1           0962        169
1           0962        169
2           0962        169
3           0392        512
sqlite> select substr(first_cell,1,1) as sector,
   ...>        ltrim(substr(first_cell,2),0) as cell, neid
   ...> from cdr where first_cell != 0 limit 5;
sector      cell        NEID
----------  ----------  ----------
4           385         95
1           962         169
1           962         169
2           962         169
3           392         512
sqlite>

Putting it All Together

As usual, the explanation is more step-intensive than the actual work. The whole process can be done in one query, but I wanted to break it down so that it would be easier to recognize the elements of the query. To make it more legible, I’ll write it across several lines.

Matching the CDR call record to the correct tower details
sqlite> select start_date as Date, calling_nbr as Number, latitude, longitude
   ...> from cdr, towers
   ...> where substr(first_cell,1,1) = towers.sector and
   ...>       ltrim(substr(first_cell,2),0) = cell and
   ...>       cdr.neid = towers.neid limit 5;
Date                 Number      Latitude    Longitude
-------------------  ----------  ----------  -----------
2012-12-10 07:36:39  ##########  37.657222   -122.094653
2012-12-10 08:24:21  ##########  37.657222   -122.094653
2012-12-10 08:26:09  ##########  37.657222   -122.094653
2012-12-10 09:59:40  ##########  37.693      -122.0939
2012-12-10 10:00:26  ##########  37.705128   -122.047417
sqlite>

This can be converted to a CSV file suitable for mapping through a website like GPS Visualizer (gpsvisualizer.com) or a program like GPSBabel. First, I change the output mode to CSV and change the column names to comply with the mapping software’s requirements, printing a sample to ensure I have the format I am seeking. Then I output the data to a file for import into the mapping program.

Exporting the data for mapping
sqlite> .mode csv
sqlite> select start_date as name, calling_nbr as desc, latitude, longitude
   ...> from cdr, towers
   ...> where substr(first_cell,1,1) = towers.sector and
   ...>       ltrim(substr(first_cell,2),0) = cell and
   ...>       cdr.neid = towers.neid limit 5;
name,desc,Latitude,Longitude
"2012-12-10 07:36:39",##########,37.657222,-122.094653
"2012-12-10 08:24:21",##########,37.657222,-122.094653
"2012-12-10 08:26:09",##########,37.657222,-122.094653
"2012-12-10 09:59:40",##########,37.693,-122.0939
"2012-12-10 10:00:26",##########,37.705128,-122.047417
sqlite> .output call_map.csv
sqlite> select start_date as name, calling_nbr as desc, latitude, longitude
   ...> from cdr, towers
   ...> where substr(first_cell,1,1) = towers.sector and
   ...>       ltrim(substr(first_cell,2),0) = cell and
   ...>       cdr.neid = towers.neid;
sqlite>

So, in its simplest form, you can see this is not necessarily a difficult process. It can be distilled into three basic steps:

  1. Review the records and determine the relationships between them

  2. Import the data into a SQLite database

  3. Query the database for the output needed

Though there is commercial mapping software available, the software I’ve seen either lacks the flexibility to deal with differences in records or in output. Further, it usually requires configuration that can take as long as or longer than importing the data into SQLite and writing the specific query you need for your investigation. If you have the software and are happy with it, use it. If you find it is lacking the flexibility you need, consider doing the work by hand. You’ll be better for it!

Note
In the interest of full disclosure, the results demonstrated here are oversimplified. For the real casework, I interpreted the call direction with a case statement to select the called number or calling number as appropriate. The result was a Google Earth map with waypoints named for the date and time of the call. Clicking the waypoint showed call details, e.g., (To: ()-## for 37 secs, azimuth 270).
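
A sketch of the kind of case statement I mean follows. It assumes the Type column in the cdr table carries values like 'Outgoing' to mark call direction; verify the actual values in your own records before relying on it.

Interpreting call direction with a CASE statement (sketch)
sqlite> select start_date as name,
   ...>        case type
   ...>            when 'Outgoing' then 'To: ' || called_nbr || ' for ' || duration || ' secs, azimuth ' || azimuth
   ...>            else 'From: ' || calling_nbr || ' for ' || duration || ' secs, azimuth ' || azimuth
   ...>        end as "desc",
   ...>        latitude, longitude
   ...> from cdr, towers
   ...> where substr(first_cell,1,1) = towers.sector and
   ...>       ltrim(substr(first_cell,2),0) = cell and
   ...>       cdr.neid = towers.neid;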

Identifying Owners of Locked Android Devices


Locked Devices are not Always Secure

I was handed a device I’ve never seen before: a Verizon Ellipsis 7" tablet. The device was suspected to be stolen, but it was password locked with no SD card or SIM card installed. USB debugging and mass storage mode were disabled, too, which I checked by plugging the device into a computer while it was booted into the normal operating system. What to do now?

I’ve learned through much hands-on experience to put a device through a few checks before I give up hope. Is there a bootloader mode? How about recovery? I’ve been surprised to find full access to devices in recovery mode, left wide open by the phone’s distributor. More often I find limited access, and sometimes none.

With a little online research (the forensic community owes a debt of gratitude to the modder community), I found the way to put the Ellipsis into recovery mode: press and hold between the up and down volume buttons while powering on the device (pressing up and down at the same time did not work). I plugged the device into my PC again, ran adb devices, and observed that the Ellipsis was running the adb daemon in recovery mode! I dropped into the ADB shell and determined I was the shell user, which meant limited privileges.

Getting the lay of the land in Android
$ adb shell
shell@android:/ $ printenv
_=/system/bin/printenv
LD_LIBRARY_PATH=/vendor/lib:/system/lib
HOSTNAME=android
TERM=vt100
PATH=/sbin:/vendor/bin:/system/sbin:/system/bin:/system/xbin
LOOP_MOUNTPOINT=/mnt/obb
ANDROID_DATA=/data
ANDROID_ROOT=/system
SHELL=/system/bin/sh
MKSH=/system/bin/sh
USER=shell
ANDROID_PROPERTY_WORKSPACE=8,49664
EXTERNAL_STORAGE=/storage/sdcard0
RANDOM=17656
SECONDARY_STORAGE=/storage/sdcard1
HOME=/data
ANDROID_BOOTLOGO=1
PS1=$(precmd)$USER@$HOSTNAME:${PWD:-?} $
shell@android:/ $

The printenv command reveals some other interesting details about the device. For example, I know where the user data is mounted (HOME=/data), where the operating system files are located (ANDROID_ROOT=/system), and where the sdcards are mounted (EXTERNAL_STORAGE=/storage/sdcard0, SECONDARY_STORAGE=/storage/sdcard1). I know the system path, i.e., the location of executable files that can be called from anywhere in the system. I can also see what partitions are mounted:

Mountpoints of Verizon Ellipsis in Recovery Mode
shell@android:/ $ mount
rootfs / rootfs ro,relatime 0 0
tmpfs /dev tmpfs rw,nosuid,relatime,mode=755 0 0
devpts /dev/pts devpts rw,relatime,mode=600 0 0
proc /proc proc rw,relatime 0 0
sysfs /sys sysfs rw,relatime 0 0
none /acct cgroup rw,relatime,cpuacct 0 0
tmpfs /mnt/obb tmpfs rw,relatime,mode=755,gid=1000 0 0
emmc@android /system ext4 ro,noatime,noauto_da_alloc,commit=1,data=ordered 0 0
emmc@usrdata /data ext4 rw,nosuid,nodev,noatime,nodelalloc,noauto_da_alloc,commit=1,data=ordered 0 0
/emmc@cache /cache ext4 rw,nosuid,nodev,noatime,discard,noauto_da_alloc,data=ordered 0 0
/emmc@protect_f /protect_f ext4 rw,nosuid,nodev,noatime,nodelalloc,noauto_da_alloc,commit=1,data=ordered 0 0
/emmc@protect_s /protect_s ext4 rw,nosuid,nodev,noatime,nodelalloc,noauto_da_alloc,commit=1,data=ordered 0 0
/emmc@fat /storage/sdcard0 vfat rw,dirsync,nosuid,nodev,noexec,relatime,uid=1000,gid=1015,fmask=0702,dmask=0702,allow_utime=0020,codepage=cp437,iocharset=iso8859-1,shortname=mixed,utf8,errors=remount-ro 0 0
shell@android:/ $

I see that the /data partition is mounted read/write but, upon exploration, there is little I can see or retrieve from there because the shell user does not have sufficient rights. So where can I look to find information about the owner? Take a close look at that last entry:

Internal SDCard Mount Point
/emmc@fat /storage/sdcard0 vfat rw,dirsync,nosuid,nodev,noexec,relatime,uid=1000,gid=1015,fmask=0702,dmask=0702,allow_utime=0020,codepage=cp437,iocharset=iso8859-1,shortname=mixed,utf8,errors=remount-ro 0 0
shell@android:/ $ ls -dl storage/sdcard0
d---rwxr-x system sdcard_rw 1969-12-31 16:00 sdcard0

Members of the sdcard_rw group have read/write/execute privileges in the /storage/sdcard0 directory, and other users can read and execute there. With a little more exploration of the root directory, we see that /sdcard is a link to /storage/sdcard0, so we can shortcut our typing a bit.

What remains is to figure out who owns this device from the data I can read in the /sdcard mount point. One thing all Androids have in common is that users register them with Google and associate the device with a Gmail account. I performed a simple search:

Finding the Owner of a Device from SDCard Data
shell@android:/ $ ls -R sdcard/| grep "\@gmail.com"
...
sdcard//Android/data/com.google.android.apps.books/files/accounts/somebody@gmail.com/volumes/######/res2:
sdcard//Android/data/com.google.android.apps.books/files/accounts/somebody@gmail.com/volumes/######/segments:
...
Note
The email address and path has been altered above to protect privacy. It is offered as an example of what can be expected from such a search.

I found over 230 instances of that email address (modified above for privacy) in the file paths alone, without looking inside any files at all. In fact, I found two accounts. I was able to contact those persons and determine the device was in fact stolen. There are certainly other ways to find user information, and I did in fact find that some of the apps stored user names that corroborated the Gmail accounts I found in the file paths.

I’ve known investigators to hear a device description of "Locked with no USB debugging" and declare, "There is nothing that can be done." I hope this quick post demonstrates otherwise. While it is true that some devices are buttoned up pretty tight, I find that the vast majority provide at least some access. Maybe now you’ll be inspired to look a little more closely, too.



Finding Serial Numbers on Locked iPhones

Apple iDevices have their serial number engraved on the back, right? So why the article? Because it's not true of newer devices like the iPhone 5, 5s, and 5c. Also, original cases can be replaced and serial numbers obliterated through unprotected use or deliberate act. Now I have your attention again, I hope.

Getting the Message

I've written in the past about the libimobiledevice library and its utilities.  One utility that is quite handy for gathering device information is ideviceinfo.  It provides information such as the device description (color), device class (iPhone, iPod, iPad), device name, etc.  When the device is unlocked, you can retrieve the serial number as well.  Basically, you retrieve the contents of the Info.plist.

But ideviceinfo is not so informative with a locked device.  In fact, it won't show you any output unless you use the -s (simple) option.  While you can obtain some information, such as the description, class, name, UDID (unique identifier), and MAC address, you can't display the serial number.  But never fear, there is a way...

Linux has a system log that tracks system events, including the plugging and unplugging of devices. The system log can be dumped to the terminal with the dmesg command.  Run by itself, it dumps the entire log, which is quite a lot of information to sift through, though in truth what you want will be found at or near the end of the log.  You can shorten the output to the content you need with:

$ dmesg | tail

But an even niftier trick is to set up your system to display the log as it is created and watch the output:

$ tail -f /var/log/syslog

This will display the last 10 lines of the system log and then "follow" it until you cancel with ctrl-c. Now you can hotplug your iDevice and watch the data that the system log records about the device. Unfortunately, you will see that it displays the device UDID, not the serial number, in the "SerialNumber" field for a locked iDevice.

Recovering the Serial Number

The serial number is recoverable in Recovery Mode, however.  Pressing and holding the hardware power button brings up the software power-off slide button.  Power off the device, and then replug it into your Linux box while holding the hardware home button.  The device will boot into recovery mode.  Now check your syslog with either of the two methods discussed above.  Two serial numbers are displayed in the syslog after the product (iPhone, etc.) and manufacturer (Apple) are listed.  The first is the UDID, but the second includes several key:value pairs, one of which is the device serial number (key SRNM).
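
If you would rather not watch the log scroll by, one way to pull the relevant entry out of the log (assuming your distribution logs to /var/log/syslog as above) is to grep for the SRNM key:

$ grep -a 'SRNM' /var/log/syslog | tail -1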

When you are done collecting the device data revealed in the syslog, reboot it, if required, by pressing and holding the power button approximately 10 seconds until the recovery screen goes blank.  The device will then reboot into the operating system, probably feeling very ashamed of itself for revealing its secrets so readily.

Searching for Searches

In a recent examination of smart phone content, it became necessary to know the personal interests of the device's owner.  You can browse internet and app history, but it can be exhausting to review every URL for every clicked link and served page.  To get directly to the point, I decided to search for his browser/app search query history.  I was hoping to craft a regular expression (or several) that would assist in giving me a good idea of the person's interests.

I studied some top search engine results and reviewed some browser history and crafted the following GNU extended regular expression:

[?&](k|p|q|query)=[a-zA-Z0-9+_%-]+

This search, run against strings output of files, found search queries for Google, Yahoo!, Bing, Ask, Aol, Facebook, YouTube, Vimeo and some x-rated sites, as well as app content such as Twitter.  Search results appear (depending on what you feed GNU grep and how you configure it) similar to:

https://www.google.com/search?q=you+found+me
http://m.youtube.com/results?q=some%20video%20i%20like
https://m.facebook.com/search/?query=that%20guy%20

An added benefit to this expression is that it also hits on additional page results, Google Images page refreshes, etc.  With a little command line wiz-bangery, it's even possible to sort and count the results to get a histogram of searches:

strings History.plist | egrep -o '[?&](k|p|q|query)=[a-zA-Z0-9+_%-]+' | sed 's/.*=//' | sort | uniq -c | sort -nr

I'll explain the command above:
  1. strings History.plist # extract ascii strings from the iPhone Safari History.plist
  2. egrep -o '[?&](k|p|q|query)=[a-zA-Z0-9+_%-]+' # grep for the regular expression described above
  3. sed 's/.*=//' # strip off the query tag at the front of the user typed query
  4. sort # sort the results alphabetically
  5. uniq -c # count the matching lines
  6. sort -nr # reverse sort, placing the most frequent query terms first.
Results of the command look similar to the following:

   21    I+search+for+this+a+lot
   11    this%20one%20a%20little%less
    2    why+would+anyone+read+linux+sleuthing
    1    testing%20one%20two

The expression could be run against all logical files in a device and against unallocated space, if applicable.  I only demonstrate it using the History.plist because it's easy to illustrate.
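
As a rough sketch, assuming the device's file system (or a mounted image of it) is available read-only at /mnt/device (a hypothetical mount point), the same idea extends recursively; grep's -a flag treats binary files as text, so the strings step can be dropped:

$ grep -Earo '[?&](k|p|q|query)=[a-zA-Z0-9+_%-]+' /mnt/device/ | sed 's/.*=//' | sort | uniq -c | sort -nr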

I post this short article both because I want to remember this regular expression (the whole reason for my blog in the first place) and to solicit favorite search box/engine regular expressions you might have. Please share them in a comment if you get a chance.  Happy searching!

Xmount: When "Changing" the Evidence isn't so Bad


"Do no harm" is the modern translation of the Hippocratic Oath which is applied to physicians. But it has application in data forensics as well. It takes shape in the edicts that require write-blocking be used during the acquisition of data sources and analysis to be done on copies rather than original data. (I’ll leave the very valid discussion about triage through direct examination of data sources aside for another time. We’re talking general principles here.)

Can we ever change the evidence?

The short answer is, "No." We should never change the original data, period. But that doesn’t mean that we can’t render a copy of the data readable. After all, it is better to read the data with the programs intended to use the data… that way we know we are rendering the information as it was intended to be read. And if there is a way to repair a file or file system in a manner that doesn’t change the substantive content, should we not consider that option?

Enough vagaries. Let’s get down to brick and mortar to make this point. I’ve previously discussed using xmount to run the operating systems encapsulated in forensic images. In that situation, xmount uses a cache file to record and read changes to the file system that necessarily occur when the operating system and applications are running. Because the changes are written to the cache, the forensic image is unchanged.

Yesterday, I encountered another use for xmount when examining an image of an eMMC NAND chip from a Samsung Galaxy S3. I attempted to mount the 12GB userdata partition for analysis, but mounting failed. This may have happened to you in the past: you attempted to mount a file system the way you always do, but for unknown reasons, your command failed. What to do?

When Standard Procedure Fails

Let’s take my S3 image for example. My goal was to access the userdata partition and extract the files. The S3 is supposed to have ext4 partitions which I should be able to mount and examine in Linux.

BASH
$ mmls image
GUID Partition Table (EFI)
Offset Sector: 0
Units are in 512-byte sectors

Slot Start End Length Description
00: Meta 0000000000 0000000000 0000000001 Safety Table
01: ----- 0000000000 0000008191 0000008192 Unallocated
02: Meta 0000000001 0000000001 0000000001 GPT Header
03: Meta 0000000002 0000000033 0000000032 Partition Table
04: 00 0000008192 0000131071 0000122880 modem
05: 01 0000131072 0000131327 0000000256 sbl1
06: 02 0000131328 0000131839 0000000512 sbl2
07: 03 0000131840 0000132863 0000001024 sbl3
08: 04 0000132864 0000136959 0000004096 aboot
09: 05 0000136960 0000137983 0000001024 rpm
10: 06 0000137984 0000158463 0000020480 boot
11: 07 0000158464 0000159487 0000001024 tz
12: 08 0000159488 0000160511 0000001024 pad
13: 09 0000160512 0000180991 0000020480 param
14: 10 0000180992 0000208895 0000027904 efs
15: 11 0000208896 0000215039 0000006144 modemst1
16: 12 0000215040 0000221183 0000006144 modemst2
17: 13 0000221184 0003293183 0003072000 system
18: 14 0003293184 0028958719 0025665536 userdata
19: 15 0028958720 0028975103 0000016384 persist
20: 16 0028975104 0030695423 0001720320 cache
21: 17 0030695424 0030715903 0000020480 recovery
22: 18 0030715904 0030736383 0000020480 fota
23: 19 0030736384 0030748671 0000012288 backup
24: 20 0030748672 0030754815 0000006144 fsg
25: 21 0030754816 0030754831 0000000016 ssd
26: 22 0030754832 0030765071 0000010240 grow
27: ----- 0030765072 0030777343 0000012272 Unallocated

$ sudo mount -o ro,loop,offset=$((3293184*512)) $image /mnt
mount: wrong fs type, bad option, bad superblock on /dev/loop0,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so
$
Note

I’ve created a link named image pointing to the original raw device image to keep commands simple.

So, what happened here? I used the Sleuthkit mmls tool to read the partition table. I located the partition of interest - userdata - and tried to mount it read-only. I did not specify a file system type but instead let mount auto-magically determine it. I used the mount options of ro (read-only), loop (to create a loopback device), and provided the offset to the partition. Since the offset required by mount is in bytes, I used shell math to translate the sector offset provided in the mmls output to bytes. But, in the end, mount did not appear to recognize the partition.

What do we do in such situation? We could use the mount -v verbose flag to try to determine what’s wrong.

BASH
$ sudo mount -v -o ro,loop,offset=$((3293184*512)) image /mnt
mount: enabling autoclear loopdev flag
mount: going to use the loop device /dev/loop0
mount: you didn't specify a filesystem type for /dev/loop0
I will try type ext4
mount: wrong fs type, bad option, bad superblock on /dev/loop0,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so
$

In this case, verbose output is not much help other than showing that mount attempted to use ext4 as the file system type. And, though ext4 is what I expected for the partition, too, maybe it is wrong. Short of a hex editor, how can we check a partition type in a disk image?

The file command is a well-known tool for providing file types by reading the file magic (the file’s hexadecimal signature). But did you know it will tell you partition types as well?

BASH
$ img_cat image -s 3293184 | file -
/dev/stdin: Linux rev 1.0 ext4 filesystem data,
UUID=57f8f4bc-abf4-655f-bf67-946fc0f9f25b (needs journal recovery)
(extents) (large files)
$

The img_cat tool is another member of the Sleuthkit tool chest. It exports data from an image to stdout in blocks. Here we provide the starting sector offset to the userdata partition and pipe the output to the file command. The hyphen following the file command tells file to read from standard input rather than a named file.

What did we learn here? Well, though there is plenty of information, two values are of particular interest. First, the file system is in fact formatted ext4. Second, it appears that the file system journal is damaged. There is the smoking gun for our mounting problem.

So, we need a way to fix the journal so we can mount the partition, but we must not alter the original image. We have a few options here:

  * make a copy of the image
  * make a copy of just the partition

Initially, making copies doesn’t sound too bad. After all, we’re only talking about 16GB for the image or 12GB for the partition. But what if this were a 250GB image, or larger? That sounds less palatable. Further, either action could consume a lot of resources in the form of drive space, processing power and time.

Important Note

Reader Carlos (with credit to Hal Pomerantz) correctly points out that the dirty journal issue can be avoided in the mount command by passing the noload option. In fact, according to the mount man page, noload is a good option to invoke whenever mounting ext3/4 read-only to ensure no writes occur with dirty filesystems.

In our example, the command would be: sudo mount -o ro,noload,loop,offset=$((3293184*512)) image /mnt

This fact does not invalidate the rest of this discussion. Your file system may need more significant repairs, such as repairing a partition table, and these repairs can still be effected with the technique that follows. As always, pick the path that meets the needs of your investigation.

Xmount to the Rescue

What if there was a way to fix the journal without taking the time and resources mentioned above? Enter xmount. Just as xmount can create a cache file to capture changes in an OS running from an image, it can capture changes when repairing a file system. We can quickly mount and repair the file system and leave the original image none the worse for wear. And thanks to the blessings of the FUSE file system driver on which xmount is built, we only consume new disk space for the xmount cache, rather than for full copies of disk images or partitions.

BASH
$ md5sum image
15a9134d72a5590cf6589e9c9702a5ba image
$

We start with an MD5 baseline of the image file. We’ll use this to determine if xmount allows any changes to the image.

BASH
$ sudo xmount --in dd  --out dd -o ro,allow_other \
--cache image.cache image /mnt
$ ls /mnt
image.dd image.info
$ mount
...
xmount on /mnt type fuse.xmount (rw,nosuid,nodev,allow_other)
$

We use xmount to create a virtual image from our original image file. The --in option specifies the format of the original image. The input format can be Expert Witness Format (ewf), Advanced Forensic Format (AFF) or raw (dd). The --out option specifies the format of the virtual image and can be raw (dd) or any one of the following virtual machine formats: vdi, vhd, or vmdk. The fuse -o allow_other option gives access to the virtual file system to all users (not just root). The final option --cache specifies the file to use for disk caching (image.cache). Then, much like the mount command, we specify the input file (image) and the mount point (/mnt).

The result of our xmount command is a virtual disk file created in the /mnt folder that is accessible to normal users (forget the allow_other option and you’ll have issues listing the /mnt directory as a normal user). The name of the virtual disk image is the input file name appended with the format type. Thus, "image" became "image.dd". In the /mnt folder is also an .info file with image information.

The virtual image doesn’t consume real disk space. It is mounted read-write because we are going to fix the journal, but don’t fret, by passing the --cache image.cache option to xmount, we told xmount to capture the changes in the image.cache file. The image.cache file does not need to previously exist; xmount will create it for us.

The virtual disk image can be accessed just like the original image.

BASH
$ mmls /mnt/image.dd
GUID Partition Table (EFI)
Offset Sector: 0
Units are in 512-byte sectors

Slot Start End Length Description
00: Meta 0000000000 0000000000 0000000001 Safety Table
01: ----- 0000000000 0000008191 0000008192 Unallocated
02: Meta 0000000001 0000000001 0000000001 GPT Header
03: Meta 0000000002 0000000033 0000000032 Partition Table
04: 00 0000008192 0000131071 0000122880 modem
05: 01 0000131072 0000131327 0000000256 sbl1
06: 02 0000131328 0000131839 0000000512 sbl2
07: 03 0000131840 0000132863 0000001024 sbl3
08: 04 0000132864 0000136959 0000004096 aboot
09: 05 0000136960 0000137983 0000001024 rpm
10: 06 0000137984 0000158463 0000020480 boot
11: 07 0000158464 0000159487 0000001024 tz
12: 08 0000159488 0000160511 0000001024 pad
13: 09 0000160512 0000180991 0000020480 param
14: 10 0000180992 0000208895 0000027904 efs
15: 11 0000208896 0000215039 0000006144 modemst1
16: 12 0000215040 0000221183 0000006144 modemst2
17: 13 0000221184 0003293183 0003072000 system
18: 14 0003293184 0028958719 0025665536 userdata
19: 15 0028958720 0028975103 0000016384 persist
20: 16 0028975104 0030695423 0001720320 cache
21: 17 0030695424 0030715903 0000020480 recovery
22: 18 0030715904 0030736383 0000020480 fota
23: 19 0030736384 0030748671 0000012288 backup
24: 20 0030748672 0030754815 0000006144 fsg
25: 21 0030754816 0030754831 0000000016 ssd
26: 22 0030754832 0030765071 0000010240 grow
27: ----- 0030765072 0030777343 0000012272 Unallocated
$

We can automatically repair the journal by mounting the userdata partition read-write. Once it’s repaired and mounted, we can remount it read-only for analysis.

BASH
$ mkdir mnt
$ sudo mount -o loop,offset=$((3293184*512)) /mnt/image.dd mnt
$ mount
...
xmount on /mnt type fuse.xmount (rw,nosuid,nodev,allow_other)
$ sudo mount -o remount,ro mnt
$ mount
...
/mnt/image.dd on /home/user/mnt type ext4 (ro)
$ ls -S mnt/
data
dalvik-cache
smart_stay.dmc
anr
app
app-asec
app-private
audio
backup
BackupPlus
bluetooth
bms
clipboard
dontpanic
drm
fota
fota_test
local
log
lost+found
media
misc
property
...
$

First, I created a new directory in the current working directory called "mnt". Don’t confuse this with the /mnt directory where the virtual disk image is located. Like before, we used the mount command to create a loopback device and address the partition by offset. This time, we did not set the read-only flag, and we specified a new directory for the partition since we were using /mnt to host the virtual disk. This time, we succeeded in mounting the partition, and we immediately remounted it read-only to avoid making further changes.

Wrapping Up

In summary, we tried to mount a partition in our original disk image, but it failed. We determined the partition had a damaged journal, so we created a virtual disk image to effect repairs. Then we mounted the repaired partition and listed the root directory. But did we do no harm?

BASH
$ md5sum image /mnt/image.dd
15a9134d72a5590cf6589e9c9702a5ba image
2e16cbbeefc9e33bc754b47d2f8a4da0 /mnt/image.dd
$

The original image hash remains unchanged. The xmounted image shows it has been changed. So xmount has done its job, protecting the original image while allowing us to repair the partition in the virtual image!

Oh, and what about that cache file? How much real space did we use when we created and repaired the virtual image?

BASH
$ ls -lh image.cache
-rw-r--r-- 1 root root 40M Mar 21 15:51 image.cache
$

Yep, a whole 40MB was used to repair the journal and mount a 12GB partition. That’s not too shabby, and you didn’t have to wait for a long copy operation or halve your storage capacity!

Xmount Cache Caveats

A quick sidebar on the xmount cache file. The cache can be reused, meaning that the changes from the last session are brought forward to the next session. In plain terms, if we unmount the userdata partition and then the virtual image, but later remount the image while pointing to the cache file we previously created with the --cache option, the file system will remain repaired. If we want to start afresh, we would use the overwrite cache option, --owcache. Finally, we don’t need to specify a cache at all if changes are not necessary, and, in fact, this is the manner in which I usually employ xmount.
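
A sketch of both behaviors, using the same image and cache file as in the earlier command:

BASH
$ sudo xmount --in dd --out dd -o ro,allow_other --cache image.cache image /mnt    # reuse the prior session's changes
$ sudo xmount --in dd --out dd -o ro,allow_other --owcache image.cache image /mnt  # overwrite the cache and start afresh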


Identifying Android Device Owners

I work in a college town.  That means lots of unsecured electronics.  Lots of unsecured electronics means lots of thefts and 'misplaced'--"I'm not as think as you drunk I am!"--devices.

I've seen a trend in recovered stolen devices over the past few years: the bad guys are rapidly restoring devices to factory settings to prevent them from being tracked by the owner or law enforcement.  That leaves me with a problem, though: how do I determine the owner of a device that has been restored?  Allocated data that could show ownership is deleted upon a system restore.  Since I've discussed other devices in the past, today I'll focus on Androids.

Dispossessed Androids

I've had uneven success with Androids in the past.  This may be due in part to the fact that I've not always known what to look for.  But I received two more such devices this week and decided to apply myself, once again, to the problem of identifying the owners.  Since I became an Android owner myself in the past 18 months, I had a device with known data with which to experiment.

Android Recovery

Nearly all data that contains identifying information is stored in the 'data' partition.  When a device is restored or 'wiped' through the Android recovery system, personal data is removed.  This process is usually quite fast, which leads me to believe that 'wiping' user data is a simple delete in most cases.   There are custom recoveries where this might not be true, but a study of unallocated data in a wiped device reveals a rich data field. 

In Unix-like systems, physical storage devices are attached to the operating system through special files (drivers) called device nodes.  These nodes provide raw access to devices and their partitions.  Thus, if a device node, also referred to as a block device, is addressed, all content is accessible, allocated and unallocated alike.  Block devices can be thought of and addressed by software tools as files.  To access block devices, however, one must have root access to the operating system.  I will not be discussing the various ways to achieve root access to an Android device in this article, however.  I will continue on the assumption that the device has been rooted.

Tinkering under the hood

Access to a running Android device is done through the Android Debug Bridge (adb).  In a stock recovery or Android operating system, adb provides shell user access to the file system.  The shell user has limited access to the device and commands, but the root user has full access.  Root access, when not immediately granted through the adb shell command, is obtained by the su command.
shell@device:/ $
shell@device:/ $ su
root@device:/ # 
Block device files are found in the /dev/block directory. The file representing the entire NAND flash is the /dev/block/mmcblk0 file. Partitions are represented as /dev/block/mmcblk0p1, /dev/block/mmcblk0p2, etc. A partial directory listing on my device, for example, is:
/dev/block/mmcblk0
/dev/block/mmcblk0p1
/dev/block/mmcblk0p10
/dev/block/mmcblk0p11
/dev/block/mmcblk0p12
/dev/block/mmcblk0p13
/dev/block/mmcblk0p14
/dev/block/mmcblk0p15
/dev/block/mmcblk0p16
/dev/block/mmcblk0p17
/dev/block/mmcblk0p18
/dev/block/mmcblk0p19
We could address the entire memory storage device through mmcblk0, but it would be more efficient to address just the data partition.  But which of these is the data partition?  There are several ways to figure this out, and while not all of the following methods will work on every device, at least one should.
  1. If the data partition is mounted, such as would occur in a rooted and running operating system, simply issue the mount command:

    # mount | grep /data
    /dev/block/mmcblk0p25 on /data type ext4 (ro,relatime,barrier=1,data=ordered)

  2. Check the contents of the /etc/fstab file:

    # cat /etc/fstab
    /dev/block/mmcblk0p24 /system ext4 rw
    /dev/block/mmcblk0p25 /data ext4 rw
    /dev/block/mmcblk0p26 /cache ext4 rw
    /dev/block/mmcblk1p1 /sdcard vfat rw
    /dev/block/mmcblk0p28 /emmc vfat rw
    /dev/block/mmcblk1p2 /sd-ext  rw
    /dev/block/mmcblk0p21 /efs ext4 rw

  3. Look for the 'by-name' directory somewhere in the /dev/block/platform subtree:

    # ls /dev/block/platform/msm_sdcc.1/by-name/ -l | grep data  
    lrwxrwxrwx root root 2014-06-24 03:10 data -> /dev/block/mmcblk0p25

    Note that the 'by-name' data file is actually link to the /dev/block/mmcblk0p25.

Getting to the Point

Ok, we know how to identify and address the data partition, but for what do we search?  After some experimentation with my own device, it appears that a very profitable target is application license files. The com.android.vending domain contains application licensing information.  On my device, I found 16 binary files in the /data/data/com.android.vending/cache/main/ directory that appear to be application licenses from applications downloaded from the Google Play store.  While I could not find specific information about these files, a reading of the Android developer page for licensing applications suggests this is their purpose.  Importantly, all contained my username in the form of:
account="androiduser@gmail.com"
Crafting a search of the data partition of a restored device with this knowledge is fairly simple:
# strings mmcblk0p25 | egrep -o 'account="?.{1,25}@gmail.com"?'
Note: the strings and egrep commands are available through busybox which can be temporarily installed to the /dev/ folder (a temporary file system in RAM) if not already present in your environment using the adb push busybox /dev/ command.
Output of the search can be sorted and counted using a sort | uniq pipeline for clean results.
# strings -td mmcblk0.raw | \
  egrep -o 'account="?.{1,25}@gmail.com"?' | \
  sort | uniq -c | sort -n
1 account=user1@gmail.com
13 account=user2@gmail.com
970 account="user2@gmail.com"
2161 account="user1@gmail.com"
From the output, we can see there have been two user accounts.  Did they both exist on the system at the same time?  Has the device changed hands?  We don't know, but we have two email addresses for contacting people who might know!


Getting Attached: Apple Messaging Attachments


I sometimes get questions about showing attachments in Apple iDevice messaging databases. The questions, however, seem to come at a time when I don’t have any databases on hand to study the issue. Well, this week I stumbled on the chats.db during an exam of a MacBook Air. The database contains iMessage and SMS messages, and can be configured to sync with the user’s other iDevices (an iPhone, for example) through iCloud. So, I took a look at the database and determined a way to match the attachments with the messages.

The chats.db is found in the users directory in the Library/Messages folder.

Location of chats.db
Library/Messages/
Library/Messages/Attachments
Library/Messages/chat.db
Library/Messages/chat.db-shm
Library/Messages/chat.db-wal

As you can see, message attachments are located in the Attachments sub-folder. But how are they referenced in the chats.db, and how are they matched to the correct message? The database schema gives us the clues we need.

The chats.db table schema
CREATE TABLE _SqliteDatabaseProperties
(key TEXT,
value TEXT,
UNIQUE(key));

CREATE TABLE chat
(ROWID INTEGER PRIMARY KEY AUTOINCREMENT,
guid TEXT UNIQUE NOT NULL,
style INTEGER,
state INTEGER,
account_id TEXT,
properties BLOB,
chat_identifier TEXT,
service_name TEXT,
room_name TEXT,
account_login TEXT,
is_archived INTEGER DEFAULT 0,
last_addressed_handle TEXT,
display_name TEXT,
group_id TEXT);

CREATE TABLE attachment
(ROWID INTEGER PRIMARY KEY AUTOINCREMENT,
guid TEXT UNIQUE NOT NULL,
created_date INTEGER DEFAULT 0,
start_date INTEGER DEFAULT 0,
filename TEXT,
uti TEXT,
mime_type TEXT,
transfer_state INTEGER DEFAULT 0,
is_outgoing INTEGER DEFAULT 0,
user_info BLOB,
transfer_name TEXT,
total_bytes INTEGER DEFAULT 0);

CREATE TABLE handle
( ROWID INTEGER PRIMARY KEY AUTOINCREMENT UNIQUE,
id TEXT NOT NULL,
country TEXT,
service TEXT NOT NULL,
uncanonicalized_id TEXT,
UNIQUE(id,
service));

CREATE TABLE chat_handle_join
( chat_id INTEGER REFERENCES chat (ROWID) ON DELETE CASCADE,
handle_id INTEGER REFERENCES handle (ROWID) ON DELETE CASCADE,
UNIQUE(chat_id,
handle_id));

CREATE TABLE message
(ROWID INTEGER PRIMARY KEY AUTOINCREMENT,
guid TEXT UNIQUE NOT NULL,
text TEXT,
replace INTEGER DEFAULT 0,
service_center TEXT,
handle_id INTEGER DEFAULT 0,
subject TEXT,
country TEXT,
attributedBody BLOB,
version INTEGER DEFAULT 0,
type INTEGER DEFAULT 0,
service TEXT,
account TEXT,
account_guid TEXT,
error INTEGER DEFAULT 0,
date INTEGER,
date_read INTEGER,
date_delivered INTEGER,
is_delivered INTEGER DEFAULT 0,
is_finished INTEGER DEFAULT 0,
is_emote INTEGER DEFAULT 0,
is_from_me INTEGER DEFAULT 0,
is_empty INTEGER DEFAULT 0,
is_delayed INTEGER DEFAULT 0,
is_auto_reply INTEGER DEFAULT 0,
is_prepared INTEGER DEFAULT 0,
is_read INTEGER DEFAULT 0,
is_system_message INTEGER DEFAULT 0,
is_sent INTEGER DEFAULT 0,
has_dd_results INTEGER DEFAULT 0,
is_service_message INTEGER DEFAULT 0,
is_forward INTEGER DEFAULT 0,
was_downgraded INTEGER DEFAULT 0,
is_archive INTEGER DEFAULT 0,
cache_has_attachments INTEGER DEFAULT 0,
cache_roomnames TEXT,
was_data_detected INTEGER DEFAULT 0,
was_deduplicated INTEGER DEFAULT 0,
is_audio_message INTEGER DEFAULT 0,
is_played INTEGER DEFAULT 0,
date_played INTEGER,
item_type INTEGER DEFAULT 0,
other_handle INTEGER DEFAULT -1,
group_title TEXT,
group_action_type INTEGER DEFAULT 0,
share_status INTEGER,
share_direction INTEGER,
is_expirable INTEGER DEFAULT 0,
expire_state INTEGER DEFAULT 0,
message_action_type INTEGER DEFAULT 0,
message_source INTEGER DEFAULT 0);

CREATE TABLE chat_message_join
( chat_id INTEGER REFERENCES chat (ROWID) ON DELETE CASCADE,
message_id INTEGER REFERENCES message (ROWID) ON DELETE CASCADE,
PRIMARY KEY(chat_id,
message_id));

CREATE TABLE message_attachment_join
( message_id INTEGER REFERENCES message (ROWID) ON DELETE CASCADE,
attachment_id INTEGER REFERENCES attachment (ROWID) ON DELETE CASCADE,
UNIQUE(message_id,
attachment_id));

I’ll provide a summary of the database as I understand it. Messages are predictably stored in the message table. The message table lacks a reference to attachments, other than the fact that one exists: cache_has_attachments INTEGER DEFAULT 0. The default setting is zero, meaning no attachments are stored. A value of 1 indicates there is an attachment in the Attachments sub-folder. One other issue we find when examining the message table is that there is a reference to the remote party in the conversation (handle_id INTEGER DEFAULT 0), but not the exact address (email, account identifier, or phone number) that an investigator would desire. That information is stored in the handle table. It is up to us to figure out how to relate the tables together.

Can’t we all just get along?

The difficulty in examining SQLite databases is determining how they are intended to relate information. There is seldom anything in the database itself that explains its intended use. It can be similar to stumbling upon raw building materials and trying to figure out what is being built. Sometimes it’s easy, other times, not so much. But with the chats.db database, three table schema entries give us a clue as to the database design.

SQLite table join hints
CREATE TABLE chat_handle_join
( chat_id INTEGER REFERENCES chat (ROWID) ON DELETE CASCADE,
handle_id INTEGER REFERENCES handle (ROWID) ON DELETE CASCADE,
UNIQUE(chat_id,
handle_id));

CREATE TABLE chat_message_join
( chat_id INTEGER REFERENCES chat (ROWID) ON DELETE CASCADE,
message_id INTEGER REFERENCES message (ROWID) ON DELETE CASCADE,
PRIMARY KEY(chat_id,
message_id));

CREATE TABLE message_attachment_join
( message_id INTEGER REFERENCES message (ROWID) ON DELETE CASCADE,
attachment_id INTEGER REFERENCES attachment (ROWID) ON DELETE CASCADE,
UNIQUE(message_id,
attachment_id));

INFO: The message_attachment_join table shows us that the message_id column in the table refers to the message table rowid column. Likewise, the attachment_id refers to the attachment table rowid. Thus, the message_attachment_join table is used to match attachments to messages.

Each of the table names above ends in the word join. As used, the word join is just part of a table name, but it hints at a SQLite table operation called a JOIN. A join combines two tables into one, and in SQLite there are two basic joins: INNER and OUTER. Inner joins, which come in three variations, result in a combined table that includes only rows matching the join criteria. That is, the combined table only includes records built from rows in each table that have one or more matching column values. While these are the default type of JOIN in SQLite, we are interested in results that show all messages, not just those with attachments.

OUTER joins, by contrast, do not require the records from each table to have a matching column. This means we can have a combined table that shows all message rows, and, if properly joined to the attachment table, rows containing messages with attachments will show attachment details. Further, if we join the handle table to the message table, we have everything we might want for an investigation.

I will be using a LEFT OUTER JOIN, which is shortened in syntax to LEFT JOIN. The basic syntax is "SELECT column(s) FROM left_table LEFT JOIN right_table ON left_table.columnName = right_table.columnName". A LEFT JOIN returns all rows of the left_table regardless of matching rows in the right table. Where rows match in the right table, they are joined to the matching left table row.

Tip
It is easier to understand and troubleshoot SQL queries by reading them backwards: Predicate, then subject. For example, reading the query in the paragraph above as "FROM left_table LEFT JOIN right_table ON left_table.columnName = right_table.columnName SELECT column(s)" can lend clarity to the output.

Applying LEFT JOINs to the chats.db tables, we can create a "super table" combining the message, attachment, and handle tables.

SELECT *
FROM message AS m
LEFT JOIN message_attachment_join AS maj ON message_id = m.rowid
LEFT JOIN attachment AS a ON a.rowid = maj.attachment_id
LEFT JOIN handle AS h ON h.rowid = m.handle_id
Tip
The "expr1 AS expr2" statement sets expr2 as an alias for expr1, saving keystrokes and making the lines easier to read. Thus message_attachment_join.attachment_id becomes maj.attachment_id.

Entirely accurate, but probably containing more information than we need, the above query results in the following columns:

Table 1. Columns

ROWID, guid, text, replace, service_center, handle_id, subject, country,
attributedBody, version, type, service, account, account_guid, error, date,
date_read, date_delivered, is_delivered, is_finished, is_emote, is_from_me,
is_empty, is_delayed, is_auto_reply, is_prepared, is_read, is_system_message,
is_sent, has_dd_results, is_service_message, is_forward, was_downgraded,
is_archive, cache_has_attachments, cache_roomnames, was_data_detected,
was_deduplicated, is_audio_message, is_played, date_played, item_type,
other_handle, group_title, group_action_type, share_status, share_direction,
is_expirable, expire_state, message_action_type, message_source, message_id,
attachment_id, ROWID, guid, created_date, start_date, filename, uti,
mime_type, transfer_state, is_outgoing, user_info, transfer_name, total_bytes,
ROWID, id, country, service, uncanonicalized_id

Note
If you look carefully at the schema at the top of this article, and the column listing above, you will notice that the columns are those of all four tables combined and in the order they are referenced.

We can refine the output by identifying specific columns we wish to display from each row. We can use the DATETIME function to convert the Mac Absolute Time in the date column to local time (by first converting to Unix epoch by adding a few more than 978 million seconds) and interpret the is_from_me column from integer to text using a CASE statement.

SELECT
m.rowid,
DATETIME(date + 978307200, 'unixepoch', 'localtime') AS date,
id AS address,
m.service,
CASE is_from_me
    WHEN 0 THEN "Received"
    WHEN 1 THEN "Sent"
    ELSE is_from_me
END AS type,
text,
CASE cache_has_attachments
    WHEN 0 THEN Null
    WHEN 1 THEN filename
END AS attachment
FROM message AS m
LEFT JOIN message_attachment_join AS maj ON message_id = m.rowid
LEFT JOIN attachment AS a ON a.rowid = maj.attachment_id
LEFT JOIN handle AS h ON h.rowid = m.handle_id

With this query, we end up with an easy to read output containing interpreted values with the following columns:

Table 2. Columns

ROWID, date, address, service, type, text, attachment

Tip
Why include the message table ROWID? Row ids are generated automatically for each message added to the database. A break in the sequence will show a record has been deleted. Since it is possible to recover deleted records from SQLite databases, it is a convenient way to alert the investigator that more analysis is required. Further, in the case of multiple attachments, there will be one row for each attachment in a message. A repeating ROWID indicates two or more attachments are present for the message.

I hope this discussion of SQLite JOIN operations as they relate to the Apple iOS chats.db will help you in your examination of SQLite databases.


Finding Felons with the Find Command


Digital devices are commonplace. Digital device examiners are not. How does the digital Dutch boy prevent the digital device dam from breaking? By sticking his preview thumb into the leak.

The point of a forensic preview is to determine if the device you are examining has evidentiary value. If it does, the device goes into your normal work flow. If it does not, it gets set aside. The dam remains intact by relieving it of the pressure of non-evidentiary devices.

The point of this post is not to enter a discussion of the benefits and shortcomings of forensic previewing. I’m merely going to record a method I recently used to differentiate between the files created by the owner of a laptop computer and those generated by the thief who stole the computer. Hopefully, you will see something useful here to adapt to your investigation.

The Plot

Police officers recovered a laptop, which they believed to be stolen, from a home. One roommate said the device had arrived in the home a few days earlier, but did not know how it got there. The remaining members of the household claimed to know nothing about the computer at all.

I booted the device with a Linux boot disc designed for forensic examination. The disc allows storage devices to be examined without making changes. I was lucky enough to find a user account that had been established a few years earlier, and files in that account that allowed me to identify and contact the computer’s owner. The owner reported the device had been stolen from him two weeks earlier. The owner had password protected his account, but there was a guest account available for use.

Catching the Thief

I could have stopped there, but the job would have been only half-done. I knew who owned the computer, but I didn’t know who’d stolen it. Fingerprints were not an option, so I decided to look for data in the computer that might identify who had used the computer since it had been stolen. A quick look in the guest account showed me I was not going to be as lucky identifying the suspect as I had the victim: there were no user created documents.

What I needed to do was find the files modified by the suspect and inspect those files for identifying information. The suspect may not have purposely created files, but browsing the Internet, etc., creates cache and history files that point out a person as surely as a witness in a suspect lineup (that is to say, not with 100 percent certainty, but often reliably nonetheless).

File systems are very helpful in examinations of this nature: they keep dates and times that files are created, accessed and modified, just to name a few date attributes. Modern operating systems are very helpful, too, because they usually auto-sync the computer’s clock with NTP (Network Time Protocol) servers. Simply stated, modern operating systems keep accurate time automatically.

With this knowledge in mind, I was looking for guest account files (and, ultimately, all files) that were modified in the past two weeks. Files modified outside that range were changed by the owner and of no interest. Fortunately, the find command provides a solution:

GNU Find command, example 1
# This command returns all files modified less than 14 days ago
$ find path/to/search -daystart -mtime -14
Note
The -daystart option causes find to measure times from the start of the day rather than the last 24 hours.

The -mtime n option takes an integer argument n. This is where a little explanation is in order. Had I passed the integer "14", I would have returned only files modified exactly 14 days ago. Passing "-14" returns all files modified less than 14 days ago. Passing "+14" would cause find to return all files modified more than 14 days ago. It is possible to pass two -mtime options to create a narrow range, such as:

GNU Find command, example 2
# This command returns all files modified between 7 and 14 days ago
$ find path/to/search -mtime -14 -mtime +7

The command in the first example resulted in just over 1600 file names being returned. I saw that most of these were Google Chrome browser application data files. Both the "History" and "Login Data" SQLite databases contained data leading to the identity of the computer user since the date the laptop was stolen (a roommate) and the dates of the activity suggested the computer had been in that person’s possession since shortly after the theft.
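
As an aside, if you want to put dates on that browsing activity yourself, the Chrome "History" file is just a SQLite database, and its last_visit_time values are stored as microseconds since 1601-01-01 (the WebKit epoch). The following is a rough sketch of the conversion; it assumes you have copied the History file out of the profile and named it as shown.

Python 3 sketch, dating Chrome History entries
import sqlite3

# Chrome/WebKit timestamps are microseconds since 1601-01-01;
# subtracting 11644473600 seconds shifts them to the Unix epoch.
QUERY = """
SELECT DATETIME(last_visit_time / 1000000 - 11644473600, 'unixepoch', 'localtime') AS visited,
       url,
       title
FROM urls
ORDER BY last_visit_time
"""

conn = sqlite3.connect('History')   # assumed: a copy of the Chrome History file
for visited, url, title in conn.execute(QUERY):
    print(visited, url, title)
conn.close()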

Telling Time

The date command can really be your friend in figuring out dates and date ranges. It is easier to demonstrate than explain:

GNU Date command, example 1
$ date
Mon Feb 23 12:41:41 PST 2015
$ date -d 'now'
Mon Feb 23 12:41:50 PST 2015
Note
The two commands above do the same thing.
GNU Date command, example 2
$ date -d 'yesterday'
Sun Feb 22 12:43:42 PST 2015
$ date -d 'tomorrow'
Tue Feb 24 12:43:49 PST 2015
Note
The date command understands simple English. Used this way, it calculates based on 24-hour periods, not from the start of the day.
GNU Date command, example 3
$ date -d '1 day ago'
Sun Feb 22 12:48:57 PST 2015
$ date -d '1 year ago'
Sun Feb 23 12:49:14 PST 2014
$ date -d 'next week'
Mon Mar 2 12:49:53 PST 2015

Note: The info date command will show you many, many more useful invocations of the date command.

Determining Elapsed Days

You may recall that the find command takes an integer for its date range options, but none of the date commands I illustrated above yields an integer showing the number of days elapsed since (or remaining until) a given date. If there is an option for date to yield such information, I have not discovered it. However, a simple shell script can be created to allow us to use the "plain language" of the date command to help us determine the integers required by find.

count_days.sh
# This is a simple script that does not test user input for correctness
# usage: count_days.sh date1 date2

# collect dates from command line and convert to epoch
first_date=$(date -d "$1" +%s)
secnd_date=$(date -d "$2" +%s)

# calculate the difference between the dates, in seconds
difference=$((secnd_date - first_date))

# calculate and print the number of days (86400 seconds per day)
echo $((difference / 86400))
Note
This script can be made executable with chmod +x count_days.sh or simply executed by calling it with bash: bash count_days.sh

Now, we can figure out the number of days elapsed using the same plain-language conventions accepted by the date command. Be sure to enclose each date in quotation marks if the date string is more than one word.

count_days.sh
# How many days have elapsed since January 10
$ bash count_days.sh "jan 10" "now"
44

# How many days elapsed between two dates
$ bash count_days.sh "nov 27 2013" "Aug 5 2014"
250

# How many days will elapse between yesterday and 3 weeks from now
$ bash count_days.sh "yesterday" "3 weeks"
22

You get the idea. And I hope I’ve given you some ideas on how to use the find and date commands to your advantage in a preview or other forensic examination.


URLs : U R Loaded with Information


In my early days of forensics, I considered URLs in web histories as nothing more than addresses to websites, and strictly speaking, that’s true. But URLs often contain form information supplied by the user and other artifacts that can be relevant to an investigation, too. Most of us in the business know this already, at least as it concerns one commonly sought-after ingot: the web search term.

Consider the following URL:

https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=linuxsleuthing

Most examiners would key in on the domain google.com and the end of the url, q=linuxsleuthing, and conclude this was a Google search for the term "linuxsleuthing", and they’d be right. But is there anything else to be gleaned from the URL? Just what do all those strings and punctuation mean, anyway?

What’s in a URL

Let’s use the URL above as our discussion focus. I’ll break down each element, and I’ll mention at least one value of the element to the forensic investigator (you may find others). Finally, I’ll identify and demonstrate a Python library to quickly dissect a URL into its constituent parts.

Protocol

https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=linuxsleuthing

The URL starts with the protocol, the "language" the browser must speak to communicate with the resource. In the Python urllib module that I will introduce later, the protocol is referred to as the "scheme".

Examples:

  • http: - Internet surfing

  • https: - Secure Internet surfing

  • ftp: - File transfer operations

  • file: - Local file operations

  • mailto: - Email operations

The forensics value of a protocol is that it clues you into the nature of the activity occurring at that moment with the web browser.

Domain

https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=linuxsleuthing

The domain can be thought of as the place "where the resource lives." Technically, it can consist of three parts: the top-level domain (TLD), second-level domain, and the host name (or subdomain). If you are more interested in those terms, I’ll leave it to you to research. Suffice it to say that we think of it as the "name" of the website, and with good reason. The names exist in this form because they can be easily memorized and recognized by humans. You may also encounter the domain’s evil twin in a URL, the Internet Protocol (IP) address, which domain names represent.

The Python urllib module refers to the domain as the "netloc" and identifies it by the leading "//", which is the proper introduction according to RFC 1808.

The forensic value of a domain is that you know where the resource defined in the remainder of the URL can be found or was located in the past.

Port

https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=linuxsleuthing

The port is not listed in this url, nor is it often included in URLs intended for human consumption. However, if you see something like www.google.com:80, the ":80" indicates communication is occurring across port 80. You’ll often see port numbers for URLs to video servers, but port numbers are by no means limited to such uses. The Python urllib module incorporates the port in the "netloc" attribute.

The chief forensic value of a port is that it can clue you into the type of activity occurring on the domain (see the Wikipedia list of TCP and UDP port numbers, http://en.wikipedia.org/wiki/List_of_TCP_and_UDP_port_numbers), because many port numbers are well known and commonly used for specific tasks.

Path

https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=linuxsleuthing

In terms of a web server, the path indicates the path to the resource on the server. If the "file:" protocol is seen in the URL, then the path signifies the logical location of the file on the local machine. In fact, there will not be a domain, though the domain preamble is present, which is why you see three forward slashes for a file:

file:///path.

The Python urllib module also uses the name "path" to describe this hierarchical path on the server. Please understand that both hard paths and relative paths are possible. In addition, Python describes "params" for the last path element, which are introduced by a semicolon. This should not be confused with the parameters I describe in the next section.

The principal forensic value of the path is the same as the overriding principle of real estate: location, location, location.

Parameters

https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=linuxsleuthing

Parameters are information passed to the web server by the browser. They are also referred to as "query strings". Parameters can include environment information, web form data, window size, and anything else the web site is coded to pass on. Parameter strings are indicated by a leading "?" followed by key=value pairs. Multiple parameters are separated by "&". Python calls the parameters the "query."

Consider our sample URL. It can be seen to have four parameters:

  • sourceid=chrome-instant

  • ion=1

  • espv=2

  • ie=UTF-8

Parameters are really the meat and potatoes of URL analysis, in my opinion. It is here I find the most interesting details: the user name entered on the previous web page; in the case of mobile devices, the location of the device (lat/lon) when the Facebook post was made; the query on the search engine, etc.

Despite what I said in the preceding paragraph, note that the search term is not found in the query string in the case of our sample URL. The search was conducted through the Google Chrome browser address bar (sourceid=chrome-instant), which placed the term in the anchor instead. Thus, it is not safe to assume that all search engine search terms or web form data are to be found in the URL parameters.

To throw a little more mud on the matter, consider that the entry point of the search and the browser make a difference in the URL:

Search for linuxsleuthing from the Ubuntu start page, FireFox
https://www.google.com/search?q=linuxsleuthing&ie=UTF-8&sa=Search&channel=fe&client=browser-ubuntu&hl=en&gws_rd=ssl

Here, we see the same search, but different parameters:

  • q=linuxsleuthing

  • ie=UTF-8

  • sa=Search

  • channel=fe

  • client=browser-ubuntu

  • hl=en

  • gws_rd=ssl

Caution
Parameters will mean different things to different sites. There is no "one definition fits all" here, even if there is obvious commonality. It will take research and testing to know the particular meaning of any given parameter, even though it may appear obvious on its face.

Anchor

https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=linuxsleuthing

The anchor links to some location within the web page document itself. If you’ve ever clicked a link and found yourself halfway down a page, then you understand the purpose of the anchor. Somewhere in the html code of that page is a bookmark of sorts to which that anchor points. Python calls the anchor a "fragment."

In the case of our sample URL, the anchor is the search term I entered in the address bar of the Google Chrome browser.

The forensics value of an anchor is that you know what the user saw or should have seen when at that site. It might demonstrate a user interest or that they had knowledge of a fact, depending on your particular circumstances, of course.

Making Short Work of URL Parsing

Python includes a library for manipulating URLs named, appropriately enough, urllib. The Python library identifies the components of a URL a little more precisely than I described above, which was only intended as an introduction. By way of quick demonstration, we’ll let Python address our sample URL.

iPython Interactive Session, Demonstrating urllib
In [1]: import urllib.parse

In [2]: result = urllib.parse.urlparse('https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=linuxsleuthing')

In [3]: print(result)
ParseResult(scheme='https', netloc='www.google.com', path='/webhp', params='', query='sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8', fragment='q=linuxsleuthing')

In [4]: result.query
Out[4]: 'sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8'

In [5]: result.query.split('&')
Out[5]: ['sourceid=chrome-instant', 'ion=1', 'espv=2', 'ie=UTF-8']

In [6]: result.fragment
Out[6]: 'q=linuxsleuthing'
Note
The Python urllib calls the parameters I discussed a query and the anchor a fragment.
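
For what it’s worth, urllib will also split the query into key/value pairs for you by way of parse_qs, which saves the manual split('&') step. A small sketch using the FireFox search URL shown earlier:

Python 3 sketch, splitting a query string with parse_qs
from urllib.parse import urlparse, parse_qs

url = ('https://www.google.com/search?q=linuxsleuthing&ie=UTF-8&sa=Search'
       '&channel=fe&client=browser-ubuntu&hl=en&gws_rd=ssl')

result = urlparse(url)
# parse_qs returns a dictionary mapping each key to a list of values
for key, values in parse_qs(result.query).items():
    print(key, '=', ', '.join(values))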

If you have a little Python knowledge, then you can see how readily you could parse a large list of urls. If not, it is not much more difficult to parse a url using BASH.

Parsing URLs using BASH variable substitution

$ url="https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=linuxsleuthing"
$ anchor=${url##*\#}
$ parameters=${url##*\?}
$ parameters=${parameters//#$anchor/}
$ echo ${parameters//&/ }
sourceid=chrome-instant ion=1 espv=2 ie=UTF-8
$ echo $anchor
q=linuxsleuthing

Finding Parameters

If you want to narrow your search for URLs containing parameters and anchors, you need only grep your list for the "&" or "#" characters. If you are processing a history database such as the Google Chrome History SQLite database, you can export the relevant urls with the following query:

SQLite query for Google Chrome History
select * from urls where url like "%?%" or url like "%#%";

What’s All the Fuss?

So, why go to all this length to study a URL? I’ll give two simple illustrations:

In the first case, I had the computer of a person suspected of drug dealing. I found little relevant data on his computer doing basic analysis, including an analysis of search engine search terms. When I examined URL parameters, however, I found searches at website vendors that demonstrated the purchase of materials for growing marijuana.

In the second case, a stolen computer was recovered in close proximity to a suspect who claimed to have no knowledge of the device. The Google Chrome browser in the guest account had been used since the date of the theft, so analysis was in order. URL parameters showed a login to the suspect’s Apple account 12 hours after the theft. There was no useful data in the cache, only the URL history.

Finally, bear in mind that the URL history may be the only artifact you have of secure website activity. Browsers, by default, do not cache secure elements. Understanding the contents of a URL can clue you into activity for which you may find no other artifacts.

It is good to know what’s in a URL!



Android Pin/Password Cracking: Halloween isn't the Only Scary Thing in October


CCL Forensics did the mobile forensics world a great service when it released several python scripts for cracking Android gesture, pin, and password locks. I have mostly encountered gesture locks in my examinations, and I have successfully cracked them each time with CCL’s Tools. But recently, I had a chance to take a look at the pin/password cracking tool.

A colleague contacted me and said he’d been running the CCL BruteForceAndroidPin.py tool for more than two weeks but had not cracked a password. I was pretty naive when I thought, "That’s ridiculous, you must be doing something wrong," but I’m glad I had the thought nonetheless. I’m glad because the thought made me investigate. And the investigation led to some pretty surprising facts and an improvement in the brute force attack.


What’s In an Android Password?

First, let’s look at what an Android password can look like:

  • 4-16 characters long

  • 94 possible characters per space: uppercase and lowercase letters, digits, and punctuation (known as the keyspace)

  • Spaces are not allowed

  • Alphanumeric passwords MUST have at least one letter

  • The password is salted with a random integer and stored in a text file as a concatenated SHA-1 and MD5 value, while the salt is stored in a SQLite database

So, how many password possibilities are there in a 4-16 character length password where each place value can be any one of 94 characters? I mean, what exactly are we asking our CPUs to do here? The table below provides the answer.

Table 1. Android Password Possibilities

Length    Number of Possibilities
4         78,074,896
5         7,339,040,224
6         689,869,781,056
7         64,847,759,419,264
8         6,095,689,385,410,816
9         572,994,802,228,616,704
10        53,861,511,409,489,970,176
11        5,062,982,072,492,057,196,544
12        475,920,314,814,253,376,475,136
13        44,736,509,592,539,817,388,662,784
14        4,205,231,901,698,742,834,534,301,696
15        395,291,798,759,681,826,446,224,359,424
16        37,157,429,083,410,091,685,945,089,785,856

Note
The figures in each row in the table are independent of the other rows.

Yes, the chart above totals over 37.5 trillion-quadrillion possibilities! And totaling the column—which is the true effect of searching all 4-16 length password combinations, by the way—gives a whopping 3.7557e+31!

It can be difficult to understand what large numbers mean. So, let’s assume a worst case scenario: a password of 16 characters in length. Since we can’t know the password length from examining the hash values, we have to start our attack at a length of 4. Now, assuming we have a reasonably good processor, we can make about 200,000 password attempts a second with the CCL brute force script. That means it will take us about 1.87784856658094e+26 seconds to complete the task. There are 86,400 seconds in a day, and 365 days in a year (yes, I know you knew that one). So, that’s a mere 5,954,618,742,329,212,000 years! Let me try to say that in English: "That’s 5.9 quintillion years!"
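
If you care to check my math (or plug in your own attempt rate), the arithmetic is simple enough to script. A quick Python 3 sketch:

Python 3 sketch, estimating brute force time
# all 4-16 character passwords over a 94-character keyspace
total = sum(94**n for n in range(4, 17))

rate = 200000                       # guesses per second (adjust for your hardware)
seconds = total / rate
years = seconds / (86400 * 365)     # 86400 seconds per day, 365 days per year

print('{:.3e} seconds, or roughly {:.1e} years'.format(seconds, years))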

"Alright," you say, "I understand that I’ve no hope of cracking a password of 16 characters in length. But, how long of a password can I reasonably expect to crack with this script?" Not too many, I’m afraid. Let’s continue to assume 94 characters are possible per place value, and we can generate and test passwords at a rate of 200,000 per second.

Table 2. Time to Complete, 200K/sec

Length    Number of Possibilities    Time to complete
4         7.80749e+07                6 minutes
5         7.33904e+09                10 hours
6         6.8987e+11                 39 days
7         6.48478e+13                1 decades
8         6.09569e+15                96 decades
9         5.72995e+17                90 millennia
10        5.38615e+19                8,539 millennia
11        5.06298e+21                802,730 millennia
12        4.7592e+23                 75,456,670 millennia
13        4.47365e+25                7,092,927,066 millennia
14        4.20523e+27                666,735,144,231 millennia
15        3.95292e+29                62,673,103,557,788 millennia
16        3.71574e+31                5,891,271,734,432,093 millennia

Note
The figures in each row in the table are independent of the other rows.

The stats above don’t really offer much hope if the password has a length of seven or more. But, since this is simple math, it becomes readily apparent how to speed our results: reduce the keyspace, or speed the number of calculations per second. Better yet, do both!

Reducing the Keyspace

The CCL brute force tool, as written, includes 95 ASCII characters. One of these, the space character, is illegal in Android passwords, so we can edit the CHAR_LIST variable at the start of the code down to 94 characters. But if we consider what characters are visible on the default Android keyboard at the lock screen, we see that there are 26 lowercase letters and 5 characters of punctuation accessible with a single, regular key press. Human nature suggests that most passwords would be composed of these keys alone, and since words are easiest to remember, probably the 26 lowercase letters would suffice.

So, what happens if we reduce the CHAR_LIST variable to just lower case letters, thus reducing our keyspace to 26, but are still calculating at a rate of 200K/sec? Let’s take a look:

Table 3. Time to Complete, keyspace = 26, 200k/sec

Length    Number of Possibilities    Time to complete
4         456976                     2 seconds
5         1.18814e+07                59 seconds
6         3.08916e+08                25 minutes
7         8.03181e+09                11 hours
8         2.08827e+11                12 days
9         5.4295e+12                 314 days
10        1.41167e+14                2 decades
11        3.67034e+15                58 decades
12        9.5429e+16                 15 millennia
13        2.48115e+18                393 millennia
14        6.451e+19                  102,278 millennia
15        1.67726e+21                265,927 millennia
16        4.36087e+22                6,914,120 millennia

Note
The figures in each row in the table are independent of the other rows.

Hey, now we’re cooking! Unless the password is 9 or more characters, we have some hope of cracking it. And since most Android devices are smart phones, users are probably not creating ultra-long passwords because they want quick access to their devices.

This modification to the CCL script might be sufficient for most attacks. And, we can improve the script by making different keyspaces selectable through options, such as -l for lower case, -u for upper case, -d for digits, and -s for special characters (punctuation, etc.). In fact, I went to the effort to do just that, but before you get too excited and ask me to send the modified tool to you, please read on.

Obviously, a keyspace of only lowercase letters will fail to crack a password that includes any characters not included in that keyspace, and you won’t know definitively whether it does or does not for 6.9 billion years! We need to do more than just reduce the keyspace; we need to increase the calculation rate, too!


Increasing the Calculation Rate

I hope you realize from the previous section that an intelligent approach to the keyspace can significantly improve your search times… to a point. But if we are going to make real progress on password cracking, we are going to need to thank CCL Forensics mightily for their contribution, but politely move on to other tools. Python simply isn’t the best platform for this type of process.

On the other hand, OCL (OpenCL) is apparently an excellent language for this type of processing, and the specialized Graphics Processing Unit (GPU) is more adept at this kind of calculation than the more generally capable CPU. And it just so happens that the hashcat tool uses both.

CPU Version

Hashcat comes in several flavors. The original hashcat is coded to use the CPU for password cracking and, unlike the CCL tool, can use multiple cores. As an example of what the OCL programming language and multi-core processing can do for you, I achieved 25 million attempts/sec on my Core i5 processor with hashcat, compared to 200k/sec with the python script. Let’s see what that does for our processing times, coupled with the lowercase-letters keyspace we used previously:

Table 4. Time to Complete, keyspace=26, 25M/sec

Length    Number of Possibilities    Time to complete
4         456976                     0 seconds
5         1.18814e+07                0 seconds
6         3.08916e+08                12 seconds
7         8.03181e+09                5 minutes
8         2.08827e+11                2 hours
9         5.4295e+12                 2 days
10        1.41167e+14                65 days
11        3.67034e+15                4 years
12        9.5429e+16                 12 decades
13        2.48115e+18                3 millennia
14        6.451e+19                  81 millennia
15        1.67726e+21                2,127 millennia
16        4.36087e+22                55,312 millennia

Note
The figures in each row in the table are independent of the other rows.

Wow! Cracking passwords of 9 characters just became reasonable (recall that initially, a 6-character password took us about 40 days). With a little luck, we’ll be examining the device’s contents in a fairly short time. And we didn’t spend a dime to improve our systems; we just found a tool that is more adept at the task.

GPU Version

But, I mentioned GPUs, right? Hashcat has two other versions, called oclHashcat-plus and oclHashcat-lite, both of which are coded to run on Nvidia and ATI GPUs (currently, you must select the right tool for the processor type, but a future version is planned with multi-processor support). Because the GPU is better suited for this type of calculation, the GPU versions of hashcat are preferred.

As it turns out, I have an inexpensive but supported NVidia GeForce 9500 graphics card. I spent $50 on it about three years ago. This is not a high-end card by any stretch (1023MB, 1375Mhz). This is a good place to mention that while NVidia graphics cards are generally thought to perform better rendering 3D graphics, the ATI cards are better for password cracking.

I used the cudaHashcat-lite version of the tool. CUDA refers to NVidia’s parallel processing platform. I achieved a significant performance increase over my quad-core Intel processor, reaching 35 million calculations/sec. This had a significant effect on my cracking times:

Table 5. Time to Complete, keyspace=26, 35M/sec

Length    Number of Possibilities    Time to complete
4         456976                     0 seconds
5         1.18814e+07                0 seconds
6         3.08916e+08                8 seconds
7         8.03181e+09                3 minutes
8         2.08827e+11                1 hours
9         5.4295e+12                 1 days
10        1.41167e+14                46 days
11        3.67034e+15                3 years
12        9.5429e+16                 8 decades
13        2.48115e+18                2 millennia
14        6.451e+19                  58 millennia
15        1.67726e+21                1,519 millennia
16        4.36087e+22                39,509 millennia

Note
The figures in each row in the table are independent of the other rows.

We cut our 9-character processing time in half and put 10 characters within reach. That’s not bad for a low-end graphics card, and it was substantially less expensive than the quad-core processor I purchased at the same time.


Fine Tuning the Attack

Other than getting a better graphics cards or more of them (yes, hashcat can make parallel use of up to 128 GPUs!), you might think we’re done with this topic. But I had one more idea that I thought worth checking. The password.key file contains two abutting hash values, a 40 byte SHA-1 hash and a 32 byte MD5 hash. The CCL brute force attack is on the SHA-1 value, but what if we targeted the MD5 value? It can’t make that much difference, right?

Wrong! It makes a three-fold difference for the better! Utilizing the MD5 hash and settings.db salt value, I achieved 107 million calculations/sec with the same GPU as before! (Yes, I know that was three exclamations in a row, but darn it, it deserves three exclamations! Ok, that’s four…) The effect on our calculation times is dramatic:

Table 6. Time to Complete, keyspace=26, 107M/sec

Length    Number of Possibilities    Time to complete
4         456976                     0 seconds
5         1.18814e+07                0 seconds
6         3.08916e+08                2 seconds
7         8.03181e+09                1 minutes
8         2.08827e+11                32 minutes
9         5.4295e+12                 14 hours
10        1.41167e+14                15 days
11        3.67034e+15                1 years
12        9.5429e+16                 2 decades
13        2.48115e+18                73 decades
14        6.451e+19                  19 millennia
15        1.67726e+21                497 millennia
16        4.36087e+22                12,923 millennia

Note
The figures in each row in the table are independent of the other rows.

We’re cracking 8-character passwords in half an hour! What if we bought one of those fancy $400 ATI 4790 cards that they brag about on the hashcat forums, you know, the ones that calculate at a rate of 9 billion a second?

Table 7. Time to Complete, keyspace=26, 9B/sec

Length    Number of Possibilities    Time to complete
4         456976                     0 seconds
5         1.18814e+07                0 seconds
6         3.08916e+08                0 seconds
7         8.03181e+09                0 seconds
8         2.08827e+11                23 seconds
9         5.4295e+12                 10 minutes
10        1.41167e+14                4 hours
11        3.67034e+15                4 days
12        9.5429e+16                 122 days
13        2.48115e+18                8 years
14        6.451e+19                  22 decades
15        1.67726e+21                5 millennia
16        4.36087e+22                153 millennia

Note
The figures in each row in the table are independent of the other rows.

And what of those (apparently wealthy) guys who are stringing 8 of those cards together and getting speeds of 72 billion calculations/sec?

Table 8. Time to Complete, keyspace=26, 72B/sec

Length    Number of Possibilities    Time to complete
4         456976                     0 seconds
5         1.18814e+07                0 seconds
6         3.08916e+08                0 seconds
7         8.03181e+09                0 seconds
8         2.08827e+11                2 seconds
9         5.4295e+12                 1 minutes
10        1.41167e+14                32 minutes
11        3.67034e+15                14 hours
12        9.5429e+16                 15 days
13        2.48115e+18                1 years
14        6.451e+19                  2 decades
15        1.67726e+21                73 decades
16        4.36087e+22                19 millennia

Note
The figures in each row in the table are independent of the other rows.

And what if we’re a three-letter agency that could maximize hashcat with 128 GPUs? …Alright, alright, I’ll stop. But I actually posted those last two tables for a reason. Notice that 8 high-end cards only put one more password level within realistic reach. Most of us, if properly motivated and only modestly funded, could manage one of those cards and achieve impressive results. But do the math before you drop nearly $4000 on graphics cards, and consider waiting for the next version of hashcat to put your collection of older cards to work for you.


A HashCat Tip

You won’t find a lot of helpful documentation on using hashcat, unless you already understand cryptography and cryptographic principles, that is. In order to use the salt with hashcat, you need to convert it from an integer to hexadecimal notation. In Python 3, this can be accomplished as follows:

>>> import binascii, struct
>>> binascii.hexlify(struct.pack('>q', salt))  # where 'salt' is an integer

There is quite a bit more to know about hashcat, including password masks and keyspace definitions, but I’ll cover that in a future post. If you decide to give hashcat a whirl before I blog any further, make sure you test on known data: I had to roll back a version to get cudaHashcat-lite and cudaHashcat-plus to work properly.

Rotten Apples: Watch out for Worms!

Oh, Apple, you've done it to me again!...

With each iOS incarnation, key databases change structure.  This is no secret to anyone who examines data from iDevices.  The iOS4 sms.db differs greatly from the iOS5 sms.db, and both differ significantly from the new iOS6 sms.db.  This is expected, and no heartburn here at all.

But last month I was slapped in the face by Apple in an unexpected way: I found two different versions of the sms.db from the same version of iOS!  This is unexpected, and highlights why we must not take our tools for granted and assume our output in the current case is accurate because of a tool test we conducted in a previous case.

The Quandary


The exhibits:

  • iPhone #1: Product type: iPhone 4,1; Product Version 5.1.1
  • iPhone #2: Product type: iPhone 4,1; Product Version 5.1.1

So, for all intents and purposes, I was dealing with the same phone and operating system.

Take a look at the sms.db message table schemas:
iPhone #1 
CREATE TABLE message (ROWID INTEGER PRIMARY KEY AUTOINCREMENT, address TEXT, date INTEGER, text TEXT, flags INTEGER, replace INTEGER, svc_center TEXT, group_id INTEGER, association_id INTEGER, height INTEGER, UIFlags INTEGER, version INTEGER, subject TEXT, country TEXT, headers BLOB, recipients BLOB, read INTEGER, madrid_attributedBody BLOB, madrid_handle TEXT, madrid_version INTEGER, madrid_guid TEXT, madrid_type INTEGER, madrid_roomname TEXT, madrid_service TEXT, madrid_account TEXT, madrid_flags INTEGER, madrid_attachmentInfo BLOB, madrid_url TEXT, madrid_error INTEGER, is_madrid INTEGER, madrid_date_read INTEGER, madrid_date_delivered INTEGER, madrid_account_guid TEXT);
iPhone #2
CREATE TABLE message (ROWID INTEGER PRIMARY KEY AUTOINCREMENT, address TEXT, date INTEGER, text TEXT, flags INTEGER, replace INTEGER, svc_center TEXT, group_id INTEGER, association_id INTEGER, height INTEGER, UIFlags INTEGER, version INTEGER, subject TEXT, country TEXT, headers BLOB, recipients BLOB, read INTEGER, madrid_attributedBody BLOB, madrid_handle TEXT, madrid_version INTEGER, madrid_guid TEXT, madrid_type INTEGER, madrid_roomname TEXT, madrid_service TEXT, madrid_account TEXT, madrid_account_guid TEXT, madrid_flags INTEGER, madrid_attachmentInfo BLOB, madrid_url TEXT, madrid_error INTEGER, is_madrid INTEGER, madrid_date_read INTEGER, madrid_date_delivered INTEGER);

Do you see it?  I didn't initially, because I tried to automate extracting the text messages from iPhone #1 with a python program I had previously authored.  When it failed, I was very confused because I had just used the program on iPhone #2, a similar device, days earlier.  And frankly, it didn't dawn on me to immediately check the schema while seeking the error source, which is the purpose of this post: saving you some of my pain.

Finding the Worms


If you didn't spot the issue, don't worry, I'll help:

iPhone #1 
CREATE TABLE message (ROWID INTEGER PRIMARY KEY AUTOINCREMENT, address TEXT, date INTEGER, text TEXT, flags INTEGER, replace INTEGER, svc_center TEXT, group_id INTEGER, association_id INTEGER, height INTEGER, UIFlags INTEGER, version INTEGER, subject TEXT, country TEXT, headers BLOB, recipients BLOB, read INTEGER, madrid_attributedBody BLOB, madrid_handle TEXT, madrid_version INTEGER, madrid_guid TEXT, madrid_type INTEGER, madrid_roomname TEXT, madrid_service TEXT, madrid_account TEXT, madrid_flags INTEGER, madrid_attachmentInfo BLOB, madrid_url TEXT, madrid_error INTEGER, is_madrid INTEGER, madrid_date_read INTEGER, madrid_date_delivered INTEGER, madrid_account_guid TEXT);
iPhone #2
CREATE TABLE message (ROWID INTEGER PRIMARY KEY AUTOINCREMENT, address TEXT, date INTEGER, text TEXT, flags INTEGER, replace INTEGER, svc_center TEXT, group_id INTEGER, association_id INTEGER, height INTEGER, UIFlags INTEGER, version INTEGER, subject TEXT, country TEXT, headers BLOB, recipients BLOB, read INTEGER, madrid_attributedBody BLOB, madrid_handle TEXT, madrid_version INTEGER, madrid_guid TEXT, madrid_type INTEGER, madrid_roomname TEXT, madrid_service TEXT, madrid_account TEXT, madrid_account_guid TEXT, madrid_flags INTEGER, madrid_attachmentInfo BLOB, madrid_url TEXT, madrid_error INTEGER, is_madrid INTEGER, madrid_date_read INTEGER, madrid_date_delivered INTEGER);

Ok, you say, "I see the highlights, but they are the same content.  What's the big deal?" I concur, they do say the same thing... but the "madrid_account_guid" is in a different position in the database, or to be more linguistically correct, the field order is different in the two databases!  Does it matter?  Well, yes, and no...

Consider:
'SELECT ROWID, address, text, madrid_date_read FROM message;'
This query would work equally well in either phone's message table because it calls the fields by name.  Any application, forensic or otherwise, making specific queries would continue to operate completely oblivious to the database differences.  But a more generic query could lead to trouble.
'SELECT * FROM message;'
This query would output every field in the record, and any tool that tried to read data output positionally would get unexpected data in the last eight fields (in one case, anyway).  This is what happened in my program. "Well, stupid," you say, "don't code like that." Again, I concur, and I fixed my program by changing the manner in which I queried the database... but it turns out I'm not the only one coding this way.
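
One way to protect yourself from this kind of surprise is to compare the column order of the two databases before trusting any positional output. The following is a minimal sketch using Python's sqlite3 module; the file names are made up for illustration.

Python 3 sketch, comparing schema column order
import sqlite3

def column_order(db_path, table='message'):
    """Return the column names of a table in their on-disk (schema) order."""
    conn = sqlite3.connect(db_path)
    # PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk)
    cols = [row[1] for row in conn.execute('PRAGMA table_info({})'.format(table))]
    conn.close()
    return cols

cols1 = column_order('iphone1_sms.db')   # hypothetical file names
cols2 = column_order('iphone2_sms.db')

if cols1 != cols2:
    diffs = [i for i, (a, b) in enumerate(zip(cols1, cols2)) if a != b]
    print('Schemas differ at column positions:', diffs)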

What I failed to mention was why I was processing these phones.  iPhone #1 was part of a shooting investigation.  I retrieved it from a suspect vehicle and initially processed it by making an iTunes backup of the running device.  The second phone was brought to me after an up-to-date Cellebrite UFED failed to extract ANY messages from the iMessage service.  The 'madrid' fields relate to the iMessage service, and it is the madrid fields that are thrown out of order by the database schema change.  It seems that Cellebrite may have been thrown by the field order in the same way I was.  At least I'm in good company!

This also has implications in SQLite record carving.  In the raw data, fields are laid down in the order of the schema.  If a template for carving fields from dropped SQLite records has the wrong schema (and why would you expect one iOS 5.1.1 sms.db to differ from another), then you are getting incorrect and unreliable data.

I have been working quite hard on recovering dropped records from SQLite pages and have been quite successful.  Stay tuned for what I've learned on this front...

Cracking Android Passwords: The Need for Speed


Impossibly Large Numbers Revisited

In October 2012 I posted an article about cracking Android passwords. I spoke primarily about the difficulty in cracking the passwords based on the sheer number of possibilities (a whopping 37,556,971,331,618,802,349,234,821,094,576!).
Don’t believe me? Let’s do a little rehashing: the key space (range of possible ASCII characters) for each position in the password is 94 (upper and lower case letters, digits, and extended characters), or hexadecimal range \x21-\x7E. The password can be a minimum length of 4 and a maximum of 16 characters long.
A little Python 3 math
>>> total = 0
>>> for i in range(4, 17):
...     total = total + 94**i
...
>>> print(total)
37556971331618802349234821094576
>>> # python will even put in the commas!
>>> print('{:,}'.format(total))
37,556,971,331,618,802,349,234,821,094,576
And voilà, we have 37.6 trillion-quadrillion possibilities! (Just rolls off the tongue, doesn’t it?) I noted then that while the CCL Forensics python script was a great tool, python was not the best choice for password cracking because it’s relatively slow for that purpose. I introduced hashcat, a cross-platform password recovery application, as a better way to do business.

Hashcat-lite: Harness Feline Speed

Hashcat is coded with performance in mind and can use multi-core CPUs and GPUs (Nvidia and AMD) to perform the calculations. The CPU version, hashcat, is remarkably faster than the CCL python script, and the GPU version, oclHashcat-plus, leaves the CPU version in the dust!
Using hashcat for cracking Android passwords can be a bit confusing, however, and I hope to deobfuscate the process here.

Spicy Passwords

Android uses a salted password and stores the password as a hash in the /data/system/password.key file. Well, two hashes, actually. A SHA-1 hash is calculated followed by an MD5 hash, and the values are concatenated into a single 72 byte string.
The salt, in this case a randomly generated signed, 64-bit integer, randomizes the hash and makes precomputed dictionary and rainbow-table attacks ineffective. The integer is stored in the settings.db located in the /data/data/com.android.providers.settings/databases directory. The integer is converted to a hexadecimal string (the salt itself is 8 bytes in length) and appended to the password. The password + salt string is then hashed and stored.
The CrackStation website has an excellent treatise on salted password hashing if you are looking for a more in-depth explanation. The Android salted password format is not the only salted password hashing method in practice.

Creating Test Data

We can use python to create some salted hashes after the manner of Android. This is useful for testing hashcat or other tools you might use. After all, if you don’t first test, a failed crack attempt leaves you wondering if the tool failed or if you failed to use the tool properly. To create an Android-style password hash, we need a 4-16 character ASCII password and a salt.
First, let’s pick a password. We’ll keep it fairly short to allow it to be cracked in a reasonable amount of time: "secret". Keeping it lower case allows us to attack it with a 26 character key space—after all, we’re about cracking the password here, not generating secure passwords!
BASH
$ password="secret"
$
Next, we need to generate a random salt integer (we could just make something up here, but we’ll generate one randomly to keep the exercise more realistic). The maximum size of a 64-bit integer is 9,223,372,036,854,775,807. It is signed, meaning it can be positive or negative. Yes, mathematicians, that’s the definition of an integer: a positive or negative whole number including zero. But knowing it’s signed is important for the hexadecimal conversion in Python or other programming languages. To keep the exercise simple, however, we’ll stay in bash and generate a random number (we’re fudging a bit in generating the random number, but it works for our purposes).
BASH
$ password="secret"
$ salt=$(($RANDOM**4))
$ echo $salt
15606337825758241
$
Note
Extracting the salt from settings.db
Recall that on an Android device, the salt is stored in the /data/data/com.android.providers.settings/databases/settings.db in the "secure" table. The salt can be obtained as follows:
BASH
$ sqlite3 settings.db 'SELECT value FROM secure WHERE \
name = "lockscreen.password_salt";'
15606337825758241
$
On the BASH command line, we can convert the salt to an 8-byte hex string with the built-in function printf. The function formats and prints a string, in this case the salt, according to a format string. Below, we tell printf to convert the value held in the variable $salt to hexadecimal, padding it with leading zeros if necessary until the output string is 16 characters long.
BASH
$ password="secret"
$ salt=$(($RANDOM**4))
$ echo $salt
15606337825758241
$ hexsalt=$(printf '%016x' $salt)
$
Now, we generate a hash by concatenating the password and the salt into a single string and hashing it. We’ll use the MD5 algorithm because it is faster to crack than SHA-1 (recall the password.key file contains both hashes). We pass the -n option to echo to prevent it from appending a line feed to the output, as this would change our MD5 hash.
BASH
$ password="secret"
$ salt=$(($RANDOM**4))
$ echo $salt
15606337825758241
$ echo -n $password$salt | md5sum
b6b97079899c5f22d94f27027549cd7d -
Now we have the salted MD5 hash of the password secret using the salt 15606337825758241!
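If you would rather script the test data, the same salted hash can be produced in Python with hashlib. The snippet below is just a sketch that mirrors the echo | md5sum pipeline above.
Python 3 sketch, generating the test hash
import hashlib

password = 'secret'
salt = 15606337825758241

# mirrors: echo -n $password$salt | md5sum
md5 = hashlib.md5((password + str(salt)).encode('ascii')).hexdigest()
print(md5)   # should match the md5sum output above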
Note
Extracting the MD5 from password.key
We have been generating a salted hash for testing. You’ll need to extract the MD5 from the password.key when working with real data. The following command makes short work of it.
BASH
$ tail -c32 password.key
b6b97079899c5f22d94f27027549cd7d
$

Using Hashcat

I’ll be demonstrating the Nvidia version of hashcat. You’ll want to check the help for your version of hashcat, but you’ll find the following demonstration informative.
The basic command for hashcat follows this form:

hashcat [options] hash [mask]
The chief options we are interested in are the hash type (-m) and minimum/maximum password lengths (--pw-min/--pw-max). Reading the help (-h/--help) tells us that for salted MD5 passwords, we use the -m10 option. And since we are using the -m10 option, we need to append the salt to the hash using a colon (:) separator.
Our command and output look as follows:
BASH
$ ./cudaHashcat-lite64.bin -m10 --pw-min=4 --pw-max=16 \
b6b97079899c5f22d94f27027549cd7d:15606337825758241
cudaHashcat-lite v0.13 by atom starting...

Password lengths: 4 - 16
Watchdog: Temperature abort trigger set to 90c
Watchdog: Temperature retain trigger set to 80c
Device #1: GeForce 9500 GT, 1023MB, 1375Mhz, 4MCU


b6b97079899c5f22d94f27027549cd7d:15606337825758241:secret

Session.Name...: cudaHashcat-lite
Status.........: Cracked
Hash.Target....: b6b97079899c5f22d94f27027549cd7d:15606337825758241
Hash.Type......: md5($pass.$salt)
Time.Started...: Sat Jan 19 16:49:59 2013 (10 secs)
Time.Estimated.: Sat Jan 19 16:50:39 2013 (26 secs)
Plain.Mask.....: ?1?2?2?2?2?2
Plain.Text.....: ***yd3
Plain.Length...: 6
Progress.......: 1051066368/3748902912 (28.04%)
Speed.GPU.#1...: 102.8M/s
HWMon.GPU.#1...: -1% Util, 45c Temp, 100% Fan

Started: Sat Jan 19 16:49:59 2013
Stopped: Sat Jan 19 16:50:13 2013
$
Wait. Was that 14 seconds? Yes, it was!

Put on a Mask and Speed Your Results

Now, if the password gets very much longer, the exponential increase in the number of password possibilities gets quite large. Hashcat has one more trick up its sleeve (actually, there’s at least one more, but we’ll cover that another time). Hashcat makes use of masks that allow you to narrow the key space. Simply put, you can choose limited character sets to be used in the search, either from a predefined list or from lists of your own creation.
The predefined character sets are:
  • ?l = abcdefghijklmnopqrstuvwxyz
  • ?u = ABCDEFGHIJKLMNOPQRSTUVWXYZ
  • ?d = 0123456789
  • ?s = !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
  • ?a = ?l?u?d?s
  • ?h = 8 bit characters from 0xc0 - 0xff
  • ?D = 8 bit characters from German alphabet
  • ?F = 8 bit characters from French alphabet
  • ?R = 8 bit characters from Russian alphabet
To limit the password search to passwords containing only lowercase letters, for example, you would pass the command:
$ ./cudaHashcat-lite64.bin -m10 --pw-min=4 --pw-max=16 \
b6b97079899c5f22d94f27027549cd7d:15606337825758241 \
?l?l?l?l?l?l?l?l?l?l?l?l?l?l?l?l
I hope this gets you started using hashcat. It is a very effective tool, and it keeps on improving!

Android Messaging: Is Android Getting Religious?

"Cleanliness is next to Godliness," it is often said. And if you believe that, then you might think the Android operating system is seeking after the divine when it comes to its messaging service. Why do I say that? Because in my quest for a thorough understanding of SQLite databases, I discovered that the mmssms.db, Android’s built-in messaging database, has the auto-vacuum option enabled! And in Full-mode at that!

SQLite Vacuum

In SQLite, Vacuum is an operation that rebuilds the entire database. Frequent updates, deletions and insertions can leave the database file fragmented. Vacuum reduces the size of fragmented databases by copying the active records to a temporary file and then overwriting the original database file. During this process, it uses the rollback journal or write-ahead log as it would for any database transaction.
SQLite has two auto-vacuum modes, full and incremental. The auto-vacuum mode can only be set when the database is created. The setting is stored in the database header (the first 100 bytes of the database file), at file offset 52. If the 32-bit, big-endian integer at offset 52 is non-zero, it represents the address (page number) of the largest root b-tree page. For this discussion, the significance of the non-zero value is that database auto-vacuum is enabled.
The 32-bit, big-endian integer at file offset 64 indicates the auto-vacuum mode. A non-zero value means the database is set for incremental vacuum mode, while a zero value means full mode.
Figure 1. SQLite Database Header
Note
Don’t be fooled into shortcutting your analysis by jumping straight to offset 64 to check the value. A zero value at offset 64 coupled with a zero value at offset 52 means auto-vacuum is not enabled!
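
Checking these header values by hand gets old quickly, so here is a minimal Python 3 sketch that reads them straight from the file header. The file name mmssms.db is only an example; point it at whatever database you are examining.

Python 3 sketch, reading the auto-vacuum header fields
import struct

with open('mmssms.db', 'rb') as f:     # assumed file name
    header = f.read(100)               # the SQLite header is the first 100 bytes

largest_root = struct.unpack('>I', header[52:56])[0]   # offset 52, big-endian
incremental  = struct.unpack('>I', header[64:68])[0]   # offset 64, big-endian

if largest_root == 0:
    print('auto-vacuum is not enabled')
elif incremental:
    print('auto-vacuum: incremental mode')
else:
    print('auto-vacuum: full mode')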

SQLite Structure

Before we can really understand the SQLite vacuum operation, we have to first understand a little bit about how SQLite manages its data. SQLite organizes itself into pages. The page sizes usually match the underlying file system block size and can be determined definitively by the 16-bit, big-endian integer located at file offset 16. Each page has a single purpose and can be any one of the following types:
  • Lock-byte
  • Freelist
    • trunk
    • leaf
  • B-tree
    • table interior
    • table leaf
    • index interior
    • index leaf
  • Payload overflow
  • Pointer map
For this discussion, we need to know about the freelist pages and B-Tree table pages.

Freelist Pages

When data is deleted, or dropped, from the SQLite database, the database file does not get smaller (absent a vacuum operation). The database notes the location of the free space and reuses it as needed. The freelist contains the addresses, by page number, of full pages no longer being used to store data. The number of freelist pages in the database is stored in the database file header as a 32-bit, big-endian integer at file offset 36.
Freelist trunk pages store the addresses—by page number, not offset—to the next trunk page, if any, and to freelist leaf pages. The freelist leaf pages are the pages that once stored data, that is, they were once B-Tree pages.

B-Tree Pages

Table records and structures are stored in B-Tree pages. B-Tree pages have headers that describe the data in the page:
  • Byte offset to first freeblock (unallocated space between the records)
    • Free blocks are chained together, each one pointing to the next
  • Number of cells on the page
    • B-Tree table leaf pages contain cells with table data
  • Offset to the first cell on the page
  • Number of fragmented free bytes
    • May not exceed 60 bytes
A cell pointer array follows immediately after the B-Tree page header. The array is a list of offsets to the allocated cells on the page. Cells are self describing, using integers to describe things like the cell length, the unique record index number (ROWID), and the cell payload content (by means of a record header). Not all B-Tree pages contain the table data that is the usual subject of an examination, but those that do can be identified by the page header.
The take away here is that it is B-Tree pages that contain table data. B-Tree pages can contain both allocated and unallocated space, and become fragmented when one record is dropped from the midst of other records. All records may be deleted from a B-Tree table leaf page making it subject to becoming a Freelist page.
Note
A SQLite database may reorganize, or defragment a page so there are no freeblocks or byte fragments (groups of three or less bytes), packing all the allocated cells at the end of the page. This is an internal housekeeping function independent of the vacuum function.

A Tale of Two Modes

As I already stated, auto-vacuum comes in two flavors: Full and Incremental. So, what is the difference and how does it affect our examinations?

Auto-Vacuum: Full Mode

In full auto-vacuum mode, every transaction commit to the database causes the pages in the freelist to be moved to the end of the database, and the database is truncated to remove the pages. It is important to distinguish that only the freelist pages are removed, not the fragmented B-Tree pages. Also, Full auto-vacuum does not cause B-Tree page defragmentation to occur.

Auto-Vacuum: Incremental Mode

In incremental auto-vacuum mode, vacuuming does not occur with every commit. Instead, the database programmatically receives a command to remove N pages from the freelist. The pages are moved to the end of the database, and the database is truncated. The page references are removed from the freelist. If there are fewer pages in the list than required by the command, all the free pages are moved and truncated.

So What’s the Big Deal?

I started this discussion by noting that I had discovered that the Android mmssms.db was set to full auto-vacuum mode. This means that every commit to the database could cause dropped records in free pages to be moved to the end of the database and dropped off a cliff. Tools designed to recover dropped records from logical SQLite databases won’t recover the records because they are no longer part of the database! And logical file extraction tools won’t recover the deleted pages, either.
Think of it like this: A drug dealer is seen conducting a transaction and flees when approached by police. He momentarily escapes, and takes the opportunity to delete all his text messages should he be captured. Sure enough, the good guys find him. While he’s being pat down, one of his customers texts the internationally recognized "do you have any drugs?" abbreviation: "Wuz up?"
Wuz up? You just helped the drug dealer remove all dropped records from his messaging database, that’s WUZ UP!

Cashing in on the Google Chrome Cache

The Google Chrome cache represents a challenge to forensic investigators. If the extent of your examination has been to open the cache folder and view the files in a file browser, you are likely missing a lot of content.
For starters, files stored in the cache are renamed from their original names on the web server. Next, text elements (like HTML, JSON, etc.) are zlib compressed. Finally, files smaller than 16384 bytes (16k) are stored in block files, which are container files that hold many smaller files. The metadata about the cache files is stored in these container files, too, and it’s all mapped by a binary index file.
So, while it’s easy enough to point a file browser or image viewer at the cache directory and see some recognizable data structures, making sense of all that’s there can be more challenging. In the remainder of this discussion, I’ll attempt to give you more insight into the Google Chrome cache. This should be of interest to disk and mobile forensicators alike, as the structure is the same whether you are examining a desktop computer or a mobile device such as an Android phone or tablet.

Cache Structure

All the files in the Google Chrome cache are stored in a single folder called cache. The cache consists of at least five files: an index file and four data files known as block files. As I stated above, downloaded files are stored in one of the block files or directly in the cache directory, and the index keeps track of the transaction and storage location.
A cache can consist of only the five mentioned files (named index, data_0, data_1, data_2, and data_3) if all the data files in the cache are smaller than 16k. Larger files are stored outside the block files. Go ahead, check your cache if you don’t believe me… I’ll wait. In case you don’t have one handy, here’s a truncated file listing from a recent Android exam I conducted, sorted by size.
Output of ls -lSr
-rw-r--r-- 1 user user   16519 Feb 20 13:06 f_00008a
-rw-r--r-- 1 user user 16566 Feb 20 13:06 f_0000cf
-rw-r--r-- 1 user user 16604 Feb 20 13:06 f_0000cc
-rw-r--r-- 1 user user 16944 Feb 20 13:06 f_0000cd
...
-rw-r--r-- 1 user user 70659 Feb 20 13:06 f_0000f7
-rw-r--r-- 1 user user 73804 Feb 20 13:06 f_00008f
-rw-r--r-- 1 user user 74434 Feb 20 13:06 f_0000df
...
-rw-r--r-- 1 user user 81920 Feb 20 13:06 data_0
...
-rw-r--r-- 1 user user 262512 Feb 20 13:06 index
-rw-r--r-- 1 user user 1581056 Feb 20 13:06 data_1
-rw-r--r-- 1 user user 2105344 Feb 20 13:06 data_2
-rw-r--r-- 1 user user 4202496 Feb 20 13:06 data_3
Warning
When one or more of the five base files get corrupted or deleted, the entire set gets recreated. I experimented by deleting a data block file and restarting Chrome. On browser restart, the entire cache was deleted and new base files were created.
You might be wondering about the four data block-files. How are they distinguished? The answer is size; not the size of the block-files themselves, but the size of the internal data blocks. Each block-file is defined to hold data in blocks, much like a file system. And just as different file systems can be defined with different sized blocks, so the cache data block-files are defined with different block sizes, and data can be allocated to no more than four blocks at a time before being considered too large for that block-file.
Table 1. Default Data Block-file sizes

File      Block Size    Max Data Size
data_0    36b           rankings
data_1    256b          1k
data_2    1k            4k
data_3    4k            16k
Note
When a data block-file reaches maximum capacity (each file is only allowed to hold a defined number of objects) a new data block-file is created and pointed to in the previous data block-file header.

Cache addresses

All cached objects have an address. The address is a 32-bit integer that describes where the data is stored. Meta-data about the object is stored, too, and includes:
  • HTTP Headers
  • Request Data
  • Entry name (Key)
  • Other auxiliary information (e.g., rankings)
Examples of cache addresses:
0x00000000: not initialized
0x8000002A: external file f_0002A
0xA0010003: block-file number 1 (data_1), initial block number 3, 1 block in length.
Important
The addresses above are ordered as they read, but on disk you will find them in little-endian format; e.g., the external file address appears on disk as 0x2A000080.
Cache addresses are interpreted at the bit level. That means we have to convert the 32-bit integer into bits and evaluate it to understand the address. The first 4 bits are the header, which consists of the initialized bit followed by three file type bits.
Table 2. File Types

Binary    Integer    Interpretation
000       0          separate file on disk
001       1          rankings block-file
010       2          256b block-file
011       3          1k block-file
100       4          4k block-file
The remaining 28-bits are interpreted according to the file type:
Table 3. Separate File

Init    File Type    File #
1       000          1111111111111111111111111111

Table 4. Block File

Init    File Type    Reserved    Contiguous Blocks    Block File    Block #
1       001          00          11                   00000000      1111111111111111
Let’s take a look at the last two cache addresses above:

External File Address

0x8000002A, interpreted as a 32-bit integer, is 2147483690. In binary, it is 10000000000000000000000000101010. We interpret it as follows:
Table 5. Binary Interpretation

           Init    File Type    File #
Binary     1       000          0000000000000000000000101010
Integer    1       0            42 (0x2A)

Block File Address

0xA0010003, interpreted as a 32-bit integer, is 2684420099. In binary, it is 10100000000000010000000000000011. We interpret it as follows:
Table 6. Binary Interpretation

           Init    File Type    Reserved    Contiguous Blocks    Block File    Block #
Binary     1       010          00          00                   00000001      0000000000000011
Integer    1       2            0           0                    1             3
Note
The odd man out here is Contiguous Blocks, which weighs in at zero but appears to be interpreted as 1 block, based on Google documentation.
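The bit twiddling above is easy to script. The following Python 3 sketch interprets a cache address along the lines just described; treat it as an illustration of the layout (including the one-based counting of contiguous blocks noted above), not a replacement for the Chromium documentation.
Python 3 sketch, interpreting a cache address
def parse_cache_address(addr):
    """Interpret a 32-bit Google Chrome cache address."""
    if not addr & 0x80000000:                       # initialized bit
        return {'initialized': False}

    file_type = (addr >> 28) & 0x7                  # three file type bits
    if file_type == 0:                              # separate file on disk
        return {'initialized': True,
                'storage': 'external file f_{:06x}'.format(addr & 0x0FFFFFFF)}

    names = {1: 'rankings block-file', 2: '256b block-file',
             3: '1k block-file', 4: '4k block-file'}
    return {'initialized': True,
            'storage': names.get(file_type, 'unknown'),
            'contiguous_blocks': ((addr >> 24) & 0x3) + 1,   # zero appears to mean one block
            'block_file_number': (addr >> 16) & 0xFF,
            'block_number': addr & 0xFFFF}

print(parse_cache_address(0x8000002A))   # external file f_00002a
print(parse_cache_address(0xA0010003))   # data_1, block 3, 1 block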
There is a lot more to discuss here, but this is a good start. You now know that the cache index holds a map of cached browser data, and that the data_# files can contain cached web objects you may have been missing. I’ll cover more on following the cache map and extracting the file content of the data blocks in a future article (or two)!
