﻿<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/"><channel><title>BlogJava-上善若水-随笔分类-Hadoop</title><link>http://www.blogjava.net/DLevin/category/51369.html</link><description>In general the OO style is to use a lot of little objects with a lot of little methods that give us a lot of plug points for overriding and variation.
To do is to be -Nietzsche, To be is to do -Kant, Do be do be do -Sinatra</description><language>zh-cn</language><lastBuildDate>Fri, 04 Dec 2015 19:18:47 GMT</lastBuildDate><pubDate>Fri, 04 Dec 2015 19:18:47 GMT</pubDate><ttl>60</ttl><item><title>[Repost] HDFS Metadata Directories Explained - An overview of files and configurations under the HDFS meta-directory</title><link>http://www.blogjava.net/DLevin/archive/2015/01/25/422428.html</link><dc:creator>DLevin</dc:creator><author>DLevin</author><pubDate>Sun, 25 Jan 2015 15:28:00 GMT</pubDate><guid>http://www.blogjava.net/DLevin/archive/2015/01/25/422428.html</guid><wfw:comment>http://www.blogjava.net/DLevin/comments/422428.html</wfw:comment><comments>http://www.blogjava.net/DLevin/archive/2015/01/25/422428.html#Feedback</comments><slash:comments>1</slash:comments><wfw:commentRss>http://www.blogjava.net/DLevin/comments/commentRss/422428.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/DLevin/services/trackbacks/422428.html</trackback:ping><description><![CDATA[Reposted from: http://zh.hortonworks.com/blog/hdfs-metadata-directories-explained/<br /><p style="box-sizing: border-box; color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, 'Open Sans', 'Lucida Grande', sans-serif; font-size: 16px; line-height: 24px; background-color: #ffffff;">HDFS metadata represents the structure of HDFS directories and files in a tree. It also includes the various attributes of directories and files, such as ownership, permissions, quotas, and replication factor. In this blog post, I&#8217;ll describe how HDFS persists its metadata in Hadoop 2 by exploring the underlying local storage directories and files.
All examples shown are from testing a build of the soon-to-be-released Apache Hadoop 2.6.0.</p><p style="box-sizing: border-box; color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, 'Open Sans', 'Lucida Grande', sans-serif; font-size: 16px; line-height: 24px; background-color: #ffffff;"><strong style="box-sizing: border-box;">WARNING</strong>: Do not attempt to modify metadata directories or files. Unexpected modifications can cause HDFS downtime, or even permanent data loss. This information is provided for educational purposes only.</p><p style="box-sizing: border-box; color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, 'Open Sans', 'Lucida Grande', sans-serif; font-size: 16px; line-height: 24px; background-color: #ffffff;">Persistence of HDFS metadata broadly breaks down into 2 categories of files:</p><ul style="box-sizing: border-box; color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, 'Open Sans', 'Lucida Grande', sans-serif; font-size: 16px; line-height: 24px; background-color: #ffffff;"><li style="box-sizing: border-box;"><strong style="box-sizing: border-box;">fsimage</strong>&nbsp;&#8211; An fsimage file contains the complete state of the file system at a point in time. Every file system modification is assigned a unique, monotonically increasing transaction ID. 
An fsimage file represents the file system state after all modifications up to a specific transaction ID.</li><li style="box-sizing: border-box;"><strong style="box-sizing: border-box;">edits</strong>&nbsp;&#8211; An edits file is a log that lists each file system change (file creation, deletion or modification) that was made after the most recent fsimage.</li></ul><p style="box-sizing: border-box; color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, 'Open Sans', 'Lucida Grande', sans-serif; font-size: 16px; line-height: 24px; background-color: #ffffff;">Checkpointing is the process of merging the content of the most recent fsimage with all edits applied after that fsimage, in order to create a new fsimage. Checkpointing is triggered automatically by configuration policies or manually by HDFS administration commands.</p><h3>NameNode</h3><p style="box-sizing: border-box; margin-top: 0px; color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, 'Open Sans', 'Lucida Grande', sans-serif; font-size: 16px; line-height: 24px; background-color: #ffffff;">Here is an example of an HDFS metadata directory taken from a NameNode.
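The transaction-ID naming scheme just described is enough to reason about what must be loaded at startup. The following is a toy sketch (illustrative Python only, not actual NameNode code; the helper name is my own) that picks the newest fsimage and the finalized edit segments that would still need to be replayed on top of it:

```python
import re

# Illustrative only: file names follow the patterns described in this post,
# fsimage_<end transaction ID> and edits_<start transaction ID>-<end transaction ID>.
FSIMAGE_RE = re.compile(r"fsimage_(\d+)$")
EDITS_RE = re.compile(r"edits_(\d+)-(\d+)$")

def segments_to_replay(names):
    """Return (newest checkpoint txid, finalized segments to replay after it)."""
    images = [int(m.group(1)) for n in names if (m := FSIMAGE_RE.match(n))]
    latest = max(images)  # assumes at least one fsimage is present
    edits = sorted(
        (int(m.group(1)), int(m.group(2)))
        for n in names if (m := EDITS_RE.match(n))
    )
    # Only segments whose start txid lies beyond the checkpointed
    # transaction ID still need to be applied at startup.
    return latest, [(s, e) for s, e in edits if s > latest]

files = [
    "VERSION", "seen_txid", "in_use.lock",
    "fsimage_0000000000000000030", "fsimage_0000000000000000030.md5",
    "fsimage_0000000000000000031", "fsimage_0000000000000000031.md5",
    "edits_0000000000000000023-0000000000000000029",
    "edits_0000000000000000030-0000000000000000030",
    "edits_0000000000000000031-0000000000000000031",
    "edits_inprogress_0000000000000000032",  # in-progress segment, ignored here
]
print(segments_to_replay(files))  # prints (31, [])
```

Here the newest checkpoint already covers transaction 31, so no finalized segment remains to replay; if only the fsimage for transaction 30 were present, the segment for transaction 31 would be replayed instead.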
This shows the output of running the tree command on the metadata directory, which is configured by setting <em style="box-sizing: border-box;">dfs.namenode.name.dir</em>&nbsp;in&nbsp;<em style="box-sizing: border-box;">hdfs-site.xml</em>.</p><p style="box-sizing: border-box; color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, 'Open Sans', 'Lucida Grande', sans-serif; font-size: 16px; line-height: 24px; background-color: #ffffff;">data/dfs/name<br style="box-sizing: border-box;" />&#9500;&#9472;&#9472; current<br style="box-sizing: border-box;" />&#9474; &#9500;&#9472;&#9472; VERSION<br style="box-sizing: border-box;" />&#9474; &#9500;&#9472;&#9472; edits_0000000000000000001-0000000000000000007<br style="box-sizing: border-box;" />&#9474; &#9500;&#9472;&#9472; edits_0000000000000000008-0000000000000000015<br style="box-sizing: border-box;" />&#9474; &#9500;&#9472;&#9472; edits_0000000000000000016-0000000000000000022<br style="box-sizing: border-box;" />&#9474; &#9500;&#9472;&#9472; edits_0000000000000000023-0000000000000000029<br style="box-sizing: border-box;" />&#9474; &#9500;&#9472;&#9472; edits_0000000000000000030-0000000000000000030<br style="box-sizing: border-box;" />&#9474; &#9500;&#9472;&#9472; edits_0000000000000000031-0000000000000000031<br style="box-sizing: border-box;" />&#9474; &#9500;&#9472;&#9472; edits_inprogress_0000000000000000032<br style="box-sizing: border-box;" />&#9474; &#9500;&#9472;&#9472; fsimage_0000000000000000030<br style="box-sizing: border-box;" />&#9474; &#9500;&#9472;&#9472; fsimage_0000000000000000030.md5<br style="box-sizing: border-box;" />&#9474; &#9500;&#9472;&#9472; fsimage_0000000000000000031<br style="box-sizing: border-box;" />&#9474; &#9500;&#9472;&#9472; fsimage_0000000000000000031.md5<br style="box-sizing: border-box;" />&#9474; &#9492;&#9472;&#9472; seen_txid<br style="box-sizing: border-box;" />&#9492;&#9472;&#9472; in_use.lock</p><p style="box-sizing: border-box; color: #333333; font-family: 'Helvetica 
Neue', Helvetica, Arial, 'Open Sans', 'Lucida Grande', sans-serif; font-size: 16px; line-height: 24px; background-color: #ffffff;">In this example, the same directory has been used for both fsimage and edits. Alternatively, configuration options are available that allow separating fsimage and edits into different directories. Each file within this directory serves a specific purpose in the overall scheme of metadata persistence:</p><ul style="box-sizing: border-box; color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, 'Open Sans', 'Lucida Grande', sans-serif; font-size: 16px; line-height: 24px; background-color: #ffffff;"><li style="box-sizing: border-box;"><strong style="box-sizing: border-box;">VERSION</strong>&nbsp;&#8211; Text file that contains:</li><ul style="box-sizing: border-box; padding: 0px;"><li style="box-sizing: border-box;"><strong style="box-sizing: border-box;">layoutVersion</strong>&nbsp;&#8211; The version of the HDFS metadata format. When we add new features that require changing the metadata format, we change this number. An HDFS upgrade is required when the current HDFS software uses a layout version newer than what is currently tracked here.</li><li style="box-sizing: border-box;"><strong style="box-sizing: border-box;">namespaceID/clusterID/blockpoolID</strong>&nbsp;&#8211; These are unique identifiers of an HDFS cluster. The identifiers are used to prevent DataNodes from registering accidentally with an incorrect NameNode that is part of a different cluster. These identifiers are also particularly important in a federated deployment, where multiple NameNodes work independently. Each NameNode serves a unique portion of the namespace (namespaceID) and manages a unique set of blocks (blockpoolID). The clusterID ties the whole cluster together as a single logical unit.
It&#8217;s the same across all nodes in the cluster.</li><li style="box-sizing: border-box;"><strong style="box-sizing: border-box;">storageType</strong>&nbsp;&#8211; This is either NAME_NODE or JOURNAL_NODE. Metadata on a JournalNode in an HA deployment is discussed later.</li><li style="box-sizing: border-box;"><strong style="box-sizing: border-box;">cTime</strong>&nbsp;&#8211; Creation time of file system state. This field is updated during HDFS upgrades.</li></ul><li style="box-sizing: border-box;"><strong style="box-sizing: border-box;">edits_&lt;start transaction ID&gt;-&lt;end transaction ID&gt;</strong>&nbsp;&#8211; These are finalized (unmodifiable) edit log segments. Each of these files contains all of the edit log transactions in the range defined by the file name&#8217;s&nbsp;&lt;start transaction ID&gt;&nbsp;through&nbsp;&lt;end transaction ID&gt;. In an HA deployment, the standby can only read up through the finalized log segments. It will not be up-to-date with the current edit log in progress (described next). However, when an HA failover happens, the failover finalizes the current log segment so that it&#8217;s completely caught up before switching to active.</li><li style="box-sizing: border-box;"><strong style="box-sizing: border-box;">edits_inprogress_&lt;start transaction ID&gt;</strong>&nbsp;&#8211; This is the current edit log in progress. All transactions starting from&nbsp;&lt;start transaction ID&gt;&nbsp;are in this file, and all new incoming transactions will get appended to this file. HDFS pre-allocates space in this file in 1 MB chunks for efficiency, and then fills it with incoming transactions. You&#8217;ll probably see this file&#8217;s size as a multiple of 1 MB. When HDFS finalizes the log segment, it truncates the unused portion of the space that doesn&#8217;t contain any transactions, so the finalized file&#8217;s space will shrink down.</li><li style="box-sizing: border-box;"><strong style="box-sizing: border-box;">fsimage_&lt;end transaction ID&gt;</strong>&nbsp;&#8211; This contains the complete metadata image up through&nbsp;&lt;end transaction ID&gt;. Each fsimage file also has a corresponding .md5 file containing an MD5 checksum, which HDFS uses to guard against disk corruption.</li><li style="box-sizing: border-box;"><strong style="box-sizing: border-box;">seen_txid&nbsp;</strong>- This contains the last transaction ID of the last checkpoint (merge of edits into a fsimage) or edit log roll (finalization of current edits_inprogress and creation of a new one). Note that this is not the last transaction ID accepted by the NameNode. The file is not updated on every transaction, only on a checkpoint or an edit log roll. The purpose of this file is to try to identify if edits are missing during startup. It&#8217;s possible to configure the NameNode to use separate directories for fsimage and edits files. If the edits directory accidentally gets deleted, then all transactions since the last checkpoint would go away, and the NameNode would start up using just fsimage at an old state. To guard against this, NameNode startup also checks seen_txid to verify that it can load transactions at least up through that number.
It aborts startup if it can&#8217;t.</li><li style="box-sizing: border-box;"><strong style="box-sizing: border-box;">in_use.lock</strong>&nbsp;&#8211; This is a lock file held by the NameNode process, used to prevent multiple NameNode processes from starting up and concurrently modifying the directory.</li></ul><h3>JournalNode</h3><p style="box-sizing: border-box; margin-top: 0px; color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, 'Open Sans', 'Lucida Grande', sans-serif; font-size: 16px; line-height: 24px; background-color: #ffffff;">In an HA deployment, edits are logged to a separate set of daemons called JournalNodes. A JournalNode&#8217;s metadata directory is configured by setting&nbsp;<em style="box-sizing: border-box;">dfs.journalnode.edits.dir</em>. The JournalNode will contain a VERSION file, multiple edits_&lt;start transaction ID&gt;-&lt;end transaction ID&gt; files and an edits_inprogress_&lt;start transaction ID&gt;, just like the NameNode. The JournalNode will not have fsimage files or seen_txid. In addition, it contains several other files relevant to the HA implementation. These files help prevent a split-brain scenario, in which multiple NameNodes could think they are active and all try to write edits.</p><ul style="box-sizing: border-box; color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, 'Open Sans', 'Lucida Grande', sans-serif; font-size: 16px; line-height: 24px; background-color: #ffffff;"><li style="box-sizing: border-box;"><strong style="box-sizing: border-box;">committed-txid</strong>&nbsp;&#8211; Tracks the last transaction ID committed by a NameNode.</li><li style="box-sizing: border-box;"><strong style="box-sizing: border-box;">last-promised-epoch</strong>&nbsp;&#8211; This file contains the &#8220;epoch,&#8221; which is a monotonically increasing number.
When a new writer (a new NameNode) starts as active, it increments the epoch and presents it in calls to the JournalNode. This scheme is the NameNode&#8217;s way of claiming that it is active and that requests from another NameNode presenting a lower epoch must be ignored.</li><li style="box-sizing: border-box;"><strong style="box-sizing: border-box;">last-writer-epoch</strong>&nbsp;&#8211; Similar to the above, but this contains the epoch number associated with the writer who last actually wrote a transaction. (This was a bug fix for an edge case not handled by last-promised-epoch alone.)</li><li style="box-sizing: border-box;"><strong style="box-sizing: border-box;">paxos</strong>&nbsp;&#8211; Directory containing temporary files used in the implementation of the Paxos distributed consensus protocol. This directory will often appear empty.</li></ul><h3>DataNode</h3><p style="box-sizing: border-box; margin-top: 0px; color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, 'Open Sans', 'Lucida Grande', sans-serif; font-size: 16px; line-height: 24px; background-color: #ffffff;">Although DataNodes do not contain metadata about the directories and files stored in an HDFS cluster, they do contain a small amount of metadata about the DataNode itself and its relationship to a cluster.
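The epoch fencing described above can be modeled with a toy sketch (for intuition only; the real logic lives in Hadoop's quorum journal implementation, and the class and method names here are simplified assumptions, not Hadoop APIs):

```python
class ToyJournalNode:
    """Toy model of JournalNode epoch fencing; not Hadoop code."""

    def __init__(self):
        self.last_promised_epoch = 0  # highest epoch ever promised to a writer
        self.last_writer_epoch = 0    # epoch of the writer that last wrote
        self.edits = []

    def new_epoch(self, epoch):
        # A NameNode becoming active must propose an epoch higher than any
        # epoch this JournalNode has already promised.
        if epoch <= self.last_promised_epoch:
            raise IOError(f"epoch {epoch} <= promised {self.last_promised_epoch}")
        self.last_promised_epoch = epoch

    def journal(self, epoch, txn):
        # Writes from a fenced-out (stale) writer are rejected.
        if epoch < self.last_promised_epoch:
            raise IOError(f"stale writer with epoch {epoch}")
        self.last_writer_epoch = epoch
        self.edits.append(txn)

jn = ToyJournalNode()
jn.new_epoch(1)
jn.journal(1, "mkdir /a")
jn.new_epoch(2)            # failover: the new active NameNode fences the old one
jn.journal(2, "mkdir /b")
try:
    jn.journal(1, "mkdir /c")  # the previously active NameNode is now rejected
except IOError as err:
    print("rejected:", err)
```

The two counters correspond to last-promised-epoch and last-writer-epoch above: a promise alone bumps the first, while an actual write also bumps the second.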
This shows the output of running the tree command on the DataNode&#8217;s directory, configured by setting&nbsp;<em style="box-sizing: border-box;">dfs.datanode.data.dir</em>&nbsp;in&nbsp;<em style="box-sizing: border-box;">hdfs-site.xml</em>.</p><p style="box-sizing: border-box; color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, 'Open Sans', 'Lucida Grande', sans-serif; font-size: 16px; line-height: 24px; background-color: #ffffff;">data/dfs/data/<br style="box-sizing: border-box;" />&#9500;&#9472;&#9472; current<br style="box-sizing: border-box;" />&#9474; &#9500;&#9472;&#9472; BP-1079595417-192.168.2.45-1412613236271<br style="box-sizing: border-box;" />&#9474; &#9474; &#9500;&#9472;&#9472; current<br style="box-sizing: border-box;" />&#9474; &#9474; &#9474; &#9500;&#9472;&#9472; VERSION<br style="box-sizing: border-box;" />&#9474; &#9474; &#9474; &#9500;&#9472;&#9472; finalized<br style="box-sizing: border-box;" />&#9474; &#9474; &#9474; &#9474; &#9492;&#9472;&#9472; subdir0<br style="box-sizing: border-box;" />&#9474; &#9474; &#9474; &#9474; &nbsp;&nbsp;&#9492;&#9472;&#9472; subdir1<br style="box-sizing: border-box;" />&#9474; &#9474; &#9474; &#9474; &nbsp;&nbsp;&nbsp;&nbsp;&#9500;&#9472;&#9472; blk_1073741825<br style="box-sizing: border-box;" />&#9474; &#9474; &#9474; &#9474; &nbsp;&nbsp;&nbsp;&nbsp;&#9492;&#9472;&#9472; blk_1073741825_1001.meta<br style="box-sizing: border-box;" />&#9474; &#9474; &#9474; &#9500;&#9472;&#9472; lazyPersist<br style="box-sizing: border-box;" />&#9474; &#9474; &#9474; &#9492;&#9472;&#9472; rbw<br style="box-sizing: border-box;" />&#9474; &#9474; &#9500;&#9472;&#9472; dncp_block_verification.log.curr<br style="box-sizing: border-box;" />&#9474; &#9474; &#9500;&#9472;&#9472; dncp_block_verification.log.prev<br style="box-sizing: border-box;" />&#9474; &#9474; &#9492;&#9472;&#9472; tmp<br style="box-sizing: border-box;" />&#9474; &#9492;&#9472;&#9472; VERSION<br style="box-sizing: border-box;" />&#9492;&#9472;&#9472; in_use.lock</p><p style="box-sizing: border-box; color: 
#333333; font-family: 'Helvetica Neue', Helvetica, Arial, 'Open Sans', 'Lucida Grande', sans-serif; font-size: 16px; line-height: 24px; background-color: #ffffff;">The purpose of these files is:</p><ul style="box-sizing: border-box; color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, 'Open Sans', 'Lucida Grande', sans-serif; font-size: 16px; line-height: 24px; background-color: #ffffff;"><li style="box-sizing: border-box;"><strong style="box-sizing: border-box;">BP-random integer-NameNode-IP address-creation time</strong>&nbsp;&#8211; The naming convention on this directory is significant and constitutes a form of cluster metadata. The name is a block pool ID. &#8220;BP&#8221; stands for &#8220;block pool,&#8221; the abstraction that collects a set of blocks belonging to a single namespace. In the case of a federated deployment, there will be multiple &#8220;BP&#8221; sub-directories, one for each block pool. The remaining components form a unique ID: a random integer, followed by the IP address of the NameNode that created the block pool, followed by creation time.</li><li style="box-sizing: border-box;"><strong style="box-sizing: border-box;">VERSION</strong>&nbsp;&#8211; Much like the NameNode and JournalNode, this is a text file containing multiple properties, such as layoutVersion, clusterId and cTime, all discussed earlier. There is a VERSION file tracked for the entire DataNode as well as a separate VERSION file in each block pool sub-directory. 
In addition to the properties already discussed earlier, the DataNode&#8217;s VERSION files also contain:</li><ul style="box-sizing: border-box; padding: 0px;"><li style="box-sizing: border-box;"><strong style="box-sizing: border-box;">storageType</strong>&nbsp;&#8211; In this case, the storageType field is set to DATA_NODE.</li><li style="box-sizing: border-box;"><strong style="box-sizing: border-box;">blockpoolID</strong>&nbsp;&#8211; This repeats the block pool ID information encoded into the sub-directory name.</li></ul><li style="box-sizing: border-box;"><strong style="box-sizing: border-box;">finalized/rbw&nbsp;</strong>- Both finalized and rbw contain a directory structure for block storage. This holds numerous block files, which contain HDFS file data and the corresponding .meta files, which contain checksum information. &#8220;Rbw&#8221; stands for &#8220;replica being written&#8221;. This area contains blocks that are still being written to by an HDFS client. The finalized sub-directory contains blocks that are not being written to by a client and have been completed.</li><li style="box-sizing: border-box;"><strong style="box-sizing: border-box;">lazyPersist</strong>&nbsp;&#8211; HDFS is incorporating a new feature to support writing transient data to memory, followed by lazy persistence to disk in the background. If this feature is in use, then a lazyPersist sub-directory is present and used for lazy persistence of in-memory blocks to disk. We&#8217;ll cover this exciting new feature in greater detail in a future blog post.</li><li style="box-sizing: border-box;"><strong style="box-sizing: border-box;">dncp_block_verification.log</strong>&nbsp;&#8211; This file tracks the last time each block was verified by checking its contents against its checksum. The last verification time is significant for deciding how to prioritize subsequent verification work. 
The DataNode orders its background block verification work in ascending order of last verification time. This file is rolled periodically, so it&#8217;s typical to see a .curr file (current) and a .prev file (previous).</li><li style="box-sizing: border-box;"><strong style="box-sizing: border-box;">in_use.lock</strong>&nbsp;&#8211; This is a lock file held by the DataNode process, used to prevent multiple DataNode processes from starting up and concurrently modifying the directory.</li></ul><h3>Commands</h3><p style="box-sizing: border-box; margin-top: 0px; color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, 'Open Sans', 'Lucida Grande', sans-serif; font-size: 16px; line-height: 24px; background-color: #ffffff;">Various HDFS commands impact the metadata directories:</p><table border="1" cellspacing="0" cellpadding="0" style="box-sizing: border-box; border-collapse: collapse; border-spacing: 0px; line-height: 24px; font-size: 16px; font-family: 'Helvetica Neue', Helvetica, Arial, 'Open Sans', 'Lucida Grande', sans-serif; color: #333333; background-color: #ffffff;"><tbody style="box-sizing: border-box;"><tr style="box-sizing: border-box;"><td valign="top" width="450" style="box-sizing: border-box; vertical-align: top;"><strong style="box-sizing: border-box;">Commands</strong></td><td valign="top" width="450" style="box-sizing: border-box; vertical-align: top;"><strong style="box-sizing: border-box;">Description</strong></td></tr><tr style="box-sizing: border-box;"><td valign="top" width="450" style="box-sizing: border-box; vertical-align: top;"><strong style="box-sizing: border-box;">hdfs namenode</strong></td><td valign="top" width="450" style="box-sizing: border-box; vertical-align: top;">NameNode startup automatically saves a new checkpoint. As stated earlier, checkpointing is the process of merging any outstanding edit logs with the latest fsimage, saving the full state to a new fsimage file, and rolling edits.
Rolling edits means finalizing the current edits_inprogress and starting a new one.</td></tr><tr style="box-sizing: border-box;"><td valign="top" width="450" style="box-sizing: border-box; vertical-align: top;"><strong style="box-sizing: border-box;">hdfs dfsadmin -safemode enter</strong><br style="box-sizing: border-box;" /><strong style="box-sizing: border-box;">hdfs dfsadmin -saveNamespace</strong></td><td valign="top" width="450" style="box-sizing: border-box; vertical-align: top;">This saves a new checkpoint (much like restarting NameNode) while the NameNode process remains running. Note that the NameNode must be in safe mode, so all attempted write activity would fail while this is running.</td></tr><tr style="box-sizing: border-box;"><td valign="top" width="450" style="box-sizing: border-box; vertical-align: top;"><strong style="box-sizing: border-box;">hdfs dfsadmin -rollEdits</strong></td><td valign="top" width="450" style="box-sizing: border-box; vertical-align: top;">This manually rolls edits. Safe mode is not required. This can be useful if a standby NameNode is lagging behind the active and you want it to get caught up more quickly. (The standby NameNode can only read finalized edit log segments, not the current in progress edits file.)</td></tr><tr style="box-sizing: border-box;"><td valign="top" width="450" style="box-sizing: border-box; vertical-align: top;"><strong style="box-sizing: border-box;">hdfs dfsadmin -fetchImage</strong></td><td valign="top" width="450" style="box-sizing: border-box; vertical-align: top;">Downloads the latest fsimage from the NameNode. 
This can be helpful for a remote backup type of scenario.</td></tr></tbody></table><h3>Configuration Properties</h3><p style="box-sizing: border-box; margin-top: 0px; color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, 'Open Sans', 'Lucida Grande', sans-serif; font-size: 16px; line-height: 24px; background-color: #ffffff;">Several configuration properties in hdfs-site.xml control the behavior of HDFS metadata directories.</p><ul style="box-sizing: border-box; color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, 'Open Sans', 'Lucida Grande', sans-serif; font-size: 16px; line-height: 24px; background-color: #ffffff;"><li style="box-sizing: border-box;"><strong style="box-sizing: border-box;">dfs.namenode.name.dir</strong>&nbsp;&#8211; Determines where on the local filesystem the DFS name node should store the name table (fsimage). If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy.</li><li style="box-sizing: border-box;"><strong style="box-sizing: border-box;">dfs.namenode.edits.dir&nbsp;</strong>- Determines where on the local filesystem the DFS name node should store the transaction (edits) file. If this is a comma-delimited list of directories then the transaction file is replicated in all of the directories, for redundancy. 
Default value is same as dfs.namenode.name.dir.</li><li style="box-sizing: border-box;"><strong style="box-sizing: border-box;">dfs.namenode.checkpoint.period&nbsp;</strong>- The number of seconds between two periodic checkpoints.</li><li style="box-sizing: border-box;"><strong style="box-sizing: border-box;">dfs.namenode.checkpoint.txns</strong>&nbsp;&#8211; The standby will create a checkpoint of the namespace every &#8216;dfs.namenode.checkpoint.txns&#8217; transactions, regardless of whether &#8216;dfs.namenode.checkpoint.period&#8217; has expired.</li><li style="box-sizing: border-box;"><strong style="box-sizing: border-box;">dfs.namenode.checkpoint.check.period</strong>&nbsp;&#8211; How frequently to query for the number of uncheckpointed transactions.</li><li style="box-sizing: border-box;"><strong style="box-sizing: border-box;">dfs.namenode.num.checkpoints.retained&nbsp;</strong>- The number of image checkpoint files that will be retained in storage directories. All edit logs necessary to recover an up-to-date namespace from the oldest retained checkpoint will also be retained.</li><li style="box-sizing: border-box;"><strong style="box-sizing: border-box;">dfs.namenode.num.extra.edits.retained</strong>&nbsp;&#8211; The number of extra transactions which should be retained beyond what is minimally necessary for a NN restart. This can be useful for audit purposes or for an HA setup where a remote Standby Node may have been offline for some time and need to have a longer backlog of retained edits in order to start again.</li><li style="box-sizing: border-box;"><strong style="box-sizing: border-box;">dfs.namenode.edit.log.autoroll.multiplier.threshold</strong>&nbsp;&#8211; Determines when an active namenode will roll its own edit log. The actual threshold (in number of edits) is determined by multiplying this value by dfs.namenode.checkpoint.txns. 
This prevents extremely large edit files from accumulating on the active namenode, which can cause timeouts during namenode startup and pose an administrative hassle. This behavior is intended as a failsafe for when the standby fails to roll the edit log by the normal checkpoint threshold.</li><li style="box-sizing: border-box;"><strong style="box-sizing: border-box;">dfs.namenode.edit.log.autoroll.check.interval.ms</strong>&nbsp;&#8211; How often an active namenode will check if it needs to roll its edit log, in milliseconds.</li><li style="box-sizing: border-box;"><strong style="box-sizing: border-box;">dfs.datanode.data.dir</strong>&nbsp;&#8211; Determines where on the local filesystem a DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored. Heterogeneous storage allows specifying that each directory resides on a different type of storage: DISK, SSD, ARCHIVE or RAM_DISK.</li></ul><h3>Conclusion</h3><p style="box-sizing: border-box; margin-top: 0px; color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, 'Open Sans', 'Lucida Grande', sans-serif; font-size: 16px; line-height: 24px; background-color: #ffffff;">We briefly discussed how HDFS persists its metadata in Hadoop 2 by exploring the underlying local storage directories and files, the relevant configurations that drive specific behaviors, and the HDFS commands that print out the directory tree, initiate a checkpoint, and create an fsimage.</p><p style="box-sizing: border-box; color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, 'Open Sans', 'Lucida Grande', sans-serif; font-size: 16px; line-height: 24px; background-color: #ffffff;">In a future blog, we&#8217;ll explore lazy persistence, a scheme to persist in-memory data to disk, in more detail.</p><img src
="http://www.blogjava.net/DLevin/aggbug/422428.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/DLevin/" target="_blank">DLevin</a> 2015-01-25 23:28 <a href="http://www.blogjava.net/DLevin/archive/2015/01/25/422428.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Configuring Eclipse to run shell scripts directly via Cygwin</title><link>http://www.blogjava.net/DLevin/archive/2012/04/12/373875.html</link><dc:creator>DLevin</dc:creator><author>DLevin</author><pubDate>Wed, 11 Apr 2012 16:38:00 GMT</pubDate><guid>http://www.blogjava.net/DLevin/archive/2012/04/12/373875.html</guid><wfw:comment>http://www.blogjava.net/DLevin/comments/373875.html</wfw:comment><comments>http://www.blogjava.net/DLevin/archive/2012/04/12/373875.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/DLevin/comments/commentRss/373875.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/DLevin/services/trackbacks/373875.html</trackback:ping><description><![CDATA[I have been busy lately and haven't blogged in a while. Today, while reading up on Hadoop, it occurred to me that running Hadoop's shell scripts from Eclipse on Windows would be much more convenient. After an evening of searching I finally seem to have found a way. I'm recording and sharing it here in case anyone else has the same need, and for my own future reference.<br /><br />Naturally, running shell scripts on Windows requires installing Cygwin first; I won't cover the Cygwin installation here. How to configure Hadoop under Cygwin has apparently also been written up already, so I won't repeat that either; interested readers can refer to: <a href="http://blog.csdn.net/yanical/article/details/4474830">http://blog.csdn.net/yanical/article/details/4474830</a><br /><br />This post therefore focuses only on hooking Cygwin into Eclipse so that it can run shell scripts.<br />Eclipse supports embedding external tools, Cygwin included, through its External Tools mechanism. The steps:<br />1. eclipse-&gt;Run-&gt;External Tools-&gt;External Tools Configuration....<br />2. On the configuration page, the name can be anything you like, e.g. cygwin_hadoop. Location is the path of the external tool, in this case the bash that interprets the shell script. The simplest option is to point it directly at C:\cygwin\bin\bash.exe, which is enough for simple commands. Scripts that call other programs run into trouble, however; running a Hadoop shell script, for example, fails with a message that the dirname command cannot be found. One solution is to write your own bat script that puts all the required directories on PATH. For instance, I wrote the bat script below (you can add more directories to PATH if your scripts need other commands):<br />
<div style="border-bottom: #cccccc 1px solid; border-left: #cccccc 1px solid; padding-bottom: 4px; background-color: #eeeeee; padding-left: 4px; width: 98%; padding-right: 5px; font-size: 13px; word-break: break-all; border-top: #cccccc 1px solid; border-right: #cccccc 1px solid; padding-top: 4px"><span style="color: #000000">@echo&nbsp;off<br />rem&nbsp;the line above turns off command echoing<br /><br />PATH=C:\cygwin\bin\;%PATH%<br /><br />bash&nbsp;%1<br /><br />rem&nbsp;turn command echoing back on<br />echo&nbsp;on</span></div>Then point Location at this file.<br />3. Working Directory is the working directory; set it to the directory containing your scripts. My Hadoop scripts are under scripts, so I specified: ${workspace_loc:/hadoop/scripts}<br />4. For Arguments I specified the current file name: ${resource_name}; further arguments for the Hadoop script can be appended when actually running it.<br /><br />With that, everything is configured; just click Run to execute the script. Development should be much more convenient from now on.<br /><br />I also ran across another rather fun trick, recorded and shared here as well.<br />To run a shell script on Windows just by double-clicking the file, someone came up with the following commands (source: <a href="http://stackoverflow.com/questions/105075/how-can-i-associate-sh-files-with-cygwin">http://stackoverflow.com/questions/105075/how-can-i-associate-sh-files-with-cygwin</a>):<br />
<div style="border-bottom: #cccccc 1px solid; border-left: #cccccc 1px solid; padding-bottom: 4px; background-color: #eeeeee; padding-left: 4px; width: 98%; padding-right: 5px; font-size: 13px; word-break: break-all; border-top: #cccccc 1px solid; border-right: #cccccc 1px solid; padding-top: 4px"><span style="color: #000000">assoc&nbsp;.sh=bashscript<br /><br />ftype&nbsp;bashscript=C:\cygwin\bin\bash.exe&nbsp;--login&nbsp;-i&nbsp;-c&nbsp;'cd&nbsp;"$(dirname&nbsp;"$(cygpath&nbsp;-u&nbsp;"%1")")";&nbsp;bash&nbsp;"$(cygpath&nbsp;-u&nbsp;"%1")"'</span></div>That is, this makes bash the default program for *.sh files. On Windows 7 you need to open cmd as administrator and then run these two commands. Unfortunately I somehow couldn't get this to work, and I haven't dug into why, but I did try the following and it does run:<br />
<div style="border-bottom: #cccccc 1px solid; border-left: #cccccc 1px solid; padding-bottom: 4px; background-color: #eeeeee; padding-left: 4px; width: 98%; padding-right: 5px; font-size: 13px; word-break: break-all; border-top: #cccccc 1px solid; border-right: #cccccc 1px solid; padding-top: 4px"><span style="color: #000000">assoc&nbsp;.sh=bashscript<br />ftype&nbsp;bashscript=C:\cygwin\bin\bash.exe&nbsp;%1</span></div>Quite a fun trick. <img src ="http://www.blogjava.net/DLevin/aggbug/373875.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/DLevin/" target="_blank">DLevin</a> 2012-04-12 00:38 <a href="http://www.blogjava.net/DLevin/archive/2012/04/12/373875.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item></channel></rss>